Mathematics
version 1.2
2 Variational principle 43
2.1 Multivariate calculus 43
2.2 Functionals and the Euler-Lagrange equation 51
2.3 Hamilton’s principle and Noether’s theorem 58
2.4 Multivariate calculus of variations 65
2.5 The second variation 68
3 Optimization 73
3.1 Preliminaries and Lagrange multipliers 73
3.2 Solutions of linear programs 80
3.3 Non-cooperative games 88
3.4 Network problems 92
5 Analysis II 161
5.1 Sequence of functions 161
5.2 Series of functions 167
5.3 Normed space 169
5.4 Metric spaces 178
5.5 Integration 183
5.6 Differentiation from R^m to R^n 192
6 Methods 209
6.1 Fourier series 210
6.2 Sturm-Liouville Theory 215
6.3 Partial differential equations 222
6.4 Distributions 240
6.5 Green’s Functions for ODEs 243
6.6 Fourier transforms 247
6.7 PDEs and Method of Characteristics 258
6.8 Green’s Functions for PDEs 265
11 Geometry 473
11.1 Euclidean geometry 473
11.2 Spherical geometry 479
11.3 Triangulations and the Euler number 487
11.4 Hyperbolic geometry 489
11.5 Smooth embedded surfaces (in R^3) 502
11.6 Abstract smooth surfaces 513
12 Statistics 517
12.1 Estimation 518
12.2 Hypothesis testing 531
12.3 Linear models 546
12.4 Rules of thumb 562
14 Electromagnetism 617
14.1 Electrostatics 619
14.2 Magnetostatics 627
14.3 Electrodynamics 632
14.4 Electromagnetism and relativity 640
B List of symbols IX
Preface
This book is a compilation of Part IB course notes that I edited based on lectures,
supervisions, notes taken by Dexter Chua in the previous year, and notes taken by
various friends of mine. I edited and formatted these materials into one book so that
this is one coherent and complete set of notes for the whole part IB maths course. In
a sense, this is a fusion of different notes taken by me and various other people. That
being said, any errors or mistakes are mine. I believe (but make no guarantee) that
this book is a complete set of notes for part IB mathematics for the year 2016/2017.
However since courses and lecturers change year by year, as time passes this book may
resemble less and less the part IB mathematics course. Please use these notes at your
own risk.
I have highlighted titles of propositions and theorems, so hopefully they stand out. At
some places I also added in more materials, sometimes giving proof to propositions that
are only stated in lectures, and sometimes giving answers to a few selected questions
from example sheets. The book is made up of many little sections (or boxes). In
particular, a section labelled D stands for definition, C stands for commentary, L
stands for lemma, P stands for proposition, T stands for theorem and E stands for
example or explanation.
Since now all courses are in this one book, I have removed repetition of contents in
different courses, and just reference the relevant bits either explicitly or implicitly. The
book is put together so that in theory one could (and should) read from beginning to
end; this might cause some problems for readers who are learning the courses simulta-
neously. For example, in this book the Methods course comes after the Linear Algebra
course since the start of the Methods course requires some knowledge of inner products
which the lecturer gave a short introduction of; however I removed this introduction
since inner products are covered in detail at the end of the Linear Algebra course. This
obviously isn’t a problem if one is reading this book from beginning to end in order, but
those who are doing courses simultaneously should bear this in mind. In particular,
knowledge of Metric and Topological Spaces, Linear Algebra, Analysis II and Methods
is heavily used or implied in subsequent courses, so one should get familiar with their
contents as early as possible.
I originally edited this book of notes for personal use. However I learned that people
might find it useful, so I decided to share it (for free, of course). The pdf version of
this and other notes can be found at the following URL:
https://1drv.ms/f/s!AtFdZ6-agiAQky4Y2DSTwhZeT7ha
Any future updates to any of the notes will be put at the above link (unless I
somehow fail to maintain it). Any comments, suggestions, or reports of typos and
errors are welcome at lamkingming@hotmail.com, or just drop me a Facebook
message.
CHAPTER 1: METRIC AND TOPOLOGICAL SPACES
D. 1-2
• A metric space is a pair (X, dX) where X is a set (the space) and dX is a
function dX : X × X → R (the metric) such that for all x, y, z ∈ X,
1. dX(x, y) ≥ 0 (non-negativity)
2. dX(x, y) = 0 if and only if x = y
3. dX(x, y) = dX(y, x) (symmetry)
4. dX(x, z) ≤ dX(x, y) + dX(y, z) (triangle inequality)
Footnote 1: An alternative definition of f being continuous is that ∀x ∈ X, ∀ε > 0, ∃δ > 0 s.t.
∀y ∈ X, dX(x, y) < δ ⇒ dY(f(x), f(y)) < ε. As we will show later, the two definitions are
equivalent in metric spaces. However they are not equivalent in the more general topological space.
E. 1-3
The first condition in the definition of metric space is actually redundant because
2dX (x, y) = dX (x, y) + dX (y, x) ≥ dX (x, x) = 0.
C. 1-4
<Euclidean metric> Let X = Rⁿ and

    d(v, w) = |v − w| = (∑_{i=1}^n (v_i − w_i)²)^{1/2}.
This is the usual notion of distance we have in the Rn vector space. It is not
difficult to show that this is a metric (the fourth axiom follows from Cauchy-
Schwarz inequality).
<Discrete metric> Let X be a set, and

    dX(x, y) = 1 if x ≠ y,   dX(x, y) = 0 if x = y.
To show this is indeed a metric, we have to show it satisfies all the axioms. The
first three axioms are trivially satisfied. How about the fourth? We can prove this
by exhaustion. Since the distance function can only return 0 or 1, d(x, z) can be 0
or 1, while d(x, y) + d(y, z) can be 0, 1 or 2. For the fourth axiom to fail, we must
have RHS < LHS. This can only happen if the right hand side is 0. But for the
right hand side to be 0, we must have x = y = z. So the left hand side is also 0.
So the fourth axiom is always satisfied.
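The proof by exhaustion can be replayed mechanically; a small Python sketch (the helper names are mine, not from the lectures):

```python
from itertools import product

def d_discrete(x, y):
    """Discrete metric: distance 1 between distinct points, 0 otherwise."""
    return 0 if x == y else 1

# Replay the exhaustion argument over all 27 triples from a 3-point set:
# the RHS of the triangle inequality is 0 only when x = y = z, and then
# the LHS is 0 too, so the inequality can never fail.
points = ["x", "y", "z"]
for a, b, c in product(points, repeat=3):
    assert d_discrete(a, c) <= d_discrete(a, b) + d_discrete(b, c)
```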
<Manhattan metric> Let X = R², and define the metric as

    d(v, w) = |v1 − w1| + |v2 − w2|.

The first three axioms are again trivial. To prove the triangle inequality, we have

    d(u, w) = |u1 − w1| + |u2 − w2| ≤ |u1 − v1| + |v1 − w1| + |u2 − v2| + |v2 − w2| = d(u, v) + d(v, w)

using the triangle inequality for R. This metric represents the distance you have
to walk from one point to another if you are only allowed to move horizontally and
vertically (and not diagonally).
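A quick numerical spot-check of the triangle inequality for this metric (illustration only; names are mine):

```python
import random

def d_manhattan(v, w):
    """Manhattan (taxicab) metric on R^2: horizontal plus vertical moves."""
    return abs(v[0] - w[0]) + abs(v[1] - w[1])

# The triangle inequality is inherited coordinate-wise from |.| on R;
# spot-check it on random triples of points.
random.seed(0)
for _ in range(1000):
    u, v, w = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(3)]
    assert d_manhattan(u, w) <= d_manhattan(u, v) + d_manhattan(v, w) + 1e-12
```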
<British railway metric> Let X = R². We define

    d(x, y) = |x − y| if x = ky for some scalar k,   d(x, y) = |x| + |y| otherwise.
To explain the name of this metric, think of Britain with London as the origin.
Since the railway system is less than ideal, all trains go through London. For
example, if you want to go from Oxford to Cambridge, you first go from Oxford to
London, then London to Cambridge. So the distance traveled is the distance from
London to Oxford plus the distance from London to Cambridge. The exception
is when the two destinations lie along the same line, then you can directly take
the train from one to the other without going through London, and hence the “if
x = ky” clause.
The function d : R² × R² → R defined by

    d(u, v) = ‖u‖_2 + ‖v‖_2 if u ≠ v,   d(u, v) = 0 if u = v,

is also another possible metric (to get from A to B, always go via London).
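A sketch of the railway metric in code (integer coordinates are assumed so the colinearity test "x = ky" is exact; the station names are hypothetical, mine, and only for illustration):

```python
import math

def d_railway(x, y):
    """British railway metric on R^2 (integer coordinates assumed): travel
    directly if x and y lie on one line through the origin (London),
    otherwise go via London."""
    if x == y:
        return 0.0
    if x[0] * y[1] == x[1] * y[0]:  # x and y colinear with the origin
        return math.dist(x, y)
    return math.dist(x, (0, 0)) + math.dist(y, (0, 0))

# Two towns on different rays from London: the journey goes via London, 5 + 13.
assert d_railway((-3, 4), (5, 12)) == 5 + 13
# Two stops on the same line through London: take the direct train.
assert d_railway((1, 2), (2, 4)) == math.dist((1, 2), (2, 4))
```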
E. 1-5
• Sⁿ = {v ∈ R^{n+1} : |v| = 1}, the n-dimensional sphere, is a subspace of R^{n+1}.
• Let (vn) be a sequence in Rᵏ with the Euclidean metric. Write vn = (vn^1, · · · , vn^k),
and v = (v^1, · · · , v^k) ∈ Rᵏ. Then vn → v iff vn^i → v^i for all i.
• Let X have the discrete metric, and suppose xn → x. Pick ε = 1/2. Then there is
some N such that d(xn, x) < 1/2 whenever n > N. But if d(xn, x) < 1/2, we must
have d(xn, x) = 0. So xn = x. Hence if xn → x, then eventually all xn are equal to
x.
• Let X = R with the Euclidean metric. Let Y = R with the discrete metric. Then
f : X → Y that maps f (x) = x is not continuous. This is since 1/n → 0 in the
Euclidean metric, but not in the discrete metric. On the other hand, g : Y → X
by g(x) = x is continuous, since a sequence in Y that converges is eventually
constant.
P. 1-6
<Uniqueness of limits> If (X, d) is a metric space and (xn ) is a sequence in
X such that xn → x and xn → x0 , then x = x0 .
For any ε > 0, ∃N such that d(xn, x) < ε/2 if n > N. Similarly, there exists some
N′ such that d(xn, x′) < ε/2 if n > N′. Hence if n > max(N, N′), then
d(x, x′) ≤ d(x, xn) + d(xn, x′) < ε. Since ε > 0 was arbitrary, d(x, x′) = 0, and so x = x′.
T. 1-8
<Cauchy-Schwarz inequality> If ⟨ , ⟩ is an inner product on V, then for all v, w ∈ V,
⟨v, w⟩² ≤ ⟨v, v⟩⟨w, w⟩, ie. |⟨v, w⟩| ≤ ‖v‖ ‖w‖.
C. 1-9
Let V = Rⁿ. Possible norms on Rⁿ include:

    ‖v‖_1 = ∑_{i=1}^n |v_i|,   ‖v‖_2 = (∑_{i=1}^n v_i²)^{1/2},   ‖v‖_∞ = max{|v_i| : 1 ≤ i ≤ n}.

In general, for any 1 ≤ p ≤ ∞ we can define the p-norm ‖v‖_p = (∑_{i=1}^n |v_i|^p)^{1/p},
and ‖v‖_∞ is the limit as p → ∞.
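The claim that ‖v‖_p tends to ‖v‖_∞ as p → ∞ can be observed numerically (a sketch; the function name is mine):

```python
def p_norm(v, p):
    """The p-norm ||v||_p = (sum |v_i|^p)^(1/p)."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

v = [3.0, -1.0, 2.0]
sup = max(abs(x) for x in v)          # ||v||_inf = 3
for p in [1, 2, 4, 8, 16, 64, 256]:
    # ||v||_p always dominates ||v||_inf (tolerance for float rounding)
    assert p_norm(v, p) >= sup - 1e-9
assert abs(p_norm(v, 256) - sup) < 1e-2   # already very close to the sup norm
```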
L. 1-10
i. If ⟨ , ⟩ is an inner product on V, then ‖v‖ = ⟨v, v⟩^{1/2} is a norm.
ii. If ‖ ‖ is a norm on V, then d(v, w) = ‖v − w‖ defines a metric on V.

i. 1. ‖v‖ = ⟨v, v⟩^{1/2} ≥ 0
2. ‖v‖ = 0 ⇔ ⟨v, v⟩ = 0 ⇔ v = 0
3. ‖λv‖ = ⟨λv, λv⟩^{1/2} = (λ²⟨v, v⟩)^{1/2} = |λ| ‖v‖
4. ‖v + w‖² = ‖v‖² + 2⟨v, w⟩ + ‖w‖² ≤ (‖v‖ + ‖w‖)² by Cauchy-Schwarz [T.1-8].
ii. The first and third conditions are immediate from those of the norm, and
2. d(v, w) = 0 ⇔ ‖v − w‖ = 0 ⇔ v − w = 0 ⇔ v = w.
4. d(u, w) = ‖(u − v) + (v − w)‖ ≤ ‖u − v‖ + ‖v − w‖ = d(u, v) + d(v, w).
So inner products induce norms and hence metrics. For example the inner product
⟨v, w⟩ = ∑_{i=1}^n v_i w_i on Rⁿ induces the ‖·‖_2 norm, which induces the Euclidean
metric. Norms that are derived from inner products satisfy the parallelogram law
‖u + v‖² + ‖u − v‖² = 2‖u‖² + 2‖v‖² since we can just “expand” it out, but
this is not true in general for norms. Note also that although a norm naturally
induces a metric, not all metrics can be derived from a norm. For example the
discrete metric cannot be derived from a norm. Similarly, ‖·‖_1 and ‖·‖_∞ cannot
be derived from inner products: any norm of the form ⟨v, v⟩^{1/2} makes ‖v‖² =
∑_{i,j} v_i v_j ⟨e_i, e_j⟩ a quadratic polynomial in the coordinates, which ‖v‖_1²
and ‖v‖_∞² are not.
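The parallelogram law gives a mechanical test for whether a norm could come from an inner product; a short Python sketch (names are my own, not part of the lectures):

```python
def norm1(v):
    return sum(abs(x) for x in v)

def norm2(v):
    return sum(x * x for x in v) ** 0.5

def parallelogram_gap(norm, u, v):
    """||u+v||^2 + ||u-v||^2 - 2||u||^2 - 2||v||^2; zero when the law holds."""
    add = [a + b for a, b in zip(u, v)]
    sub = [a - b for a, b in zip(u, v)]
    return norm(add) ** 2 + norm(sub) ** 2 - 2 * norm(u) ** 2 - 2 * norm(v) ** 2

u, v = [1.0, 0.0], [0.0, 1.0]
assert abs(parallelogram_gap(norm2, u, v)) < 1e-9  # comes from an inner product
assert parallelogram_gap(norm1, u, v) == 4.0       # fails: no inner product behind it
```

The nonzero gap for ‖·‖_1 at u = (1, 0), v = (0, 1) is exactly a counterexample witnessing that ‖·‖_1 is not induced by an inner product.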
E. 1-11
<p-adic metric> Let p ∈ Z be a (fixed) prime number. For n ∈ Z we define
|n|_p = p^{−k}, where k is the highest power of p that divides n. If n = 0, we let
|n|_p = 0. For example, |20|_2 = |2² · 5|_2 = 2^{−2}. For q = n/m ∈ Q we define
|q|_p = |n|_p / |m|_p. Some properties satisfied by |·|_p are |a|_p |b|_p = |ab|_p and
|a + b|_p ≤ max{|a|_p, |b|_p} ≤ |a|_p + |b|_p.
|·|_p is sometimes called the p-adic norm, but it’s actually not a norm since it
doesn’t satisfy condition 3 (‖λv‖ = |λ| ‖v‖) of a norm. However it can still “induce”
a metric. Take X = Q; then d_p(a, b) = |a − b|_p is a metric. This works because in
the above lemma, to show that a norm induces a metric, we didn’t make use
of the full ‖λv‖ = |λ| ‖v‖; we only used ‖v‖ = ‖−v‖, which is true in this case.
Note with respect to d2 , we have 1, 2, 4, 8, 16, 32, · · · → 0, while 1, 2, 3, 4, · · · does
not converge. We can also use it to prove certain number-theoretical results, but
we will not go into details here.
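The p-adic absolute value is easy to compute for rationals; a Python sketch of the definitions above (the helper is my own):

```python
from fractions import Fraction

def padic_abs(q, p):
    """|q|_p for rational q: p^(-k), where k is the net power of p in q."""
    q = Fraction(q)
    if q == 0:
        return 0.0
    k, num, den = 0, q.numerator, q.denominator
    while num % p == 0:
        num //= p
        k += 1
    while den % p == 0:
        den //= p
        k -= 1
    return float(p) ** (-k)

assert padic_abs(20, 2) == 0.25            # |20|_2 = |2^2 * 5|_2 = 2^(-2)
assert padic_abs(2 ** 50, 2) < 1e-15       # 1, 2, 4, 8, ... -> 0 w.r.t. d_2
for a in range(1, 30):                     # ultrametric inequality on a sample
    for b in range(1, 30):
        assert padic_abs(a + b, 2) <= max(padic_abs(a, 2), padic_abs(b, 2))
```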
L. 1-12
Let f ∈ C[0, 1] satisfy f(x) ≥ 0 for all x ∈ [0, 1]. If f(x) is not constantly 0, then
∫_0^1 f(x) dx > 0.
Pick x₀ ∈ [0, 1] with f(x₀) = a > 0. Then since f is continuous, there is a δ such
that |f(x) − f(x₀)| < a/2 if |x − x₀| < δ. So f(x) > a/2 in this region. Take

    g(x) = a/2 if |x − x₀| < δ,   g(x) = 0 otherwise.

Then f(x) ≥ g(x) for all x ∈ [0, 1]. So ∫_0^1 f(x) dx ≥ ∫_0^1 g(x) dx = (a/2)(2δ) > 0.
C. 1-13
<Function space> C[0, 1], the set of all continuous functions on [0, 1], satisfies the
axioms of being a vector space (see IA Vectors and Matrices or [D.4-1]), so it forms
a vector space. A possible metric on C[0, 1] is the uniform metric

    d∞(f, g) = max_{x∈[0,1]} |f(x) − g(x)|.

The maximum always exists since continuous functions on [0, 1] are bounded
and attain their bounds. Possible inner products on C[0, 1] include ⟨f, g⟩ =
∫_0^1 f(x)g(x) dx. Possible norms on C[0, 1] include:

    ‖f‖_1 = ∫_0^1 |f(x)| dx,   ‖f‖_2 = (∫_0^1 f(x)² dx)^{1/2},   ‖f‖_∞ = max_{x∈[0,1]} |f(x)|.

The first two are known as the L¹ and L² norms. The last is called the uniform norm
or supremum norm, since it induces the uniform metric. In general we can also
define the Lᵖ norm: ‖f‖_p = (∫_0^1 |f(x)|^p dx)^{1/p}.
We can check that all the above are indeed norms. For example, to show that
the L² norm satisfies the 4th condition ‖v + w‖ ≤ ‖v‖ + ‖w‖ we can use the
Cauchy-Schwarz inequality for integrals, (∫ fg)² ≤ (∫ f²)(∫ g²), proved in the Part
IA course: (∫ (f + g)²)^{1/2} = (∫ f² + 2∫ fg + ∫ g²)^{1/2} ≤ (∫ f²)^{1/2} + (∫ g²)^{1/2}.
Another slightly tricky part is to show that kf k = 0 iff f = 0, which we obtain via
the lemma [L.1-12].
E. 1-14
Let F : C[0, 1] → R be defined by F(f) = f(1/2). Then this is continuous with
respect to the uniform metric on C[0, 1] and the usual metric on R: Let fn → f in
the uniform metric; we have to show that F(fn) → F(f), ie. fn(1/2) → f(1/2). This
is easy, since we have

    |fn(1/2) − f(1/2)| ≤ max_{x∈[0,1]} |fn(x) − f(x)| = d∞(fn, f) → 0.
E. 1-15
Let X = C[0, 1], and let d1(f, g) = ‖f − g‖_1 = ∫_0^1 |f(x) − g(x)| dx. Define the
sequence

    fn(x) = 1 − nx for x ∈ [0, 1/n],   fn(x) = 0 for x ≥ 1/n,

whose graph is a triangular spike of height 1 over [0, 1/n]. Now ‖fn‖_1 =
(1/2) · (1/n) · 1 = 1/(2n) → 0 as n → ∞. So fn → 0 in (X, d1), where 0(x) = 0.
On the other hand,

    ‖fn‖_∞ = max_{x∈[0,1]} |fn(x)| = 1.
So the function (C[0, 1], d1) → (C[0, 1], d∞) that maps f ↦ f is not continuous.
This is similar to the case that the identity function from the usual metric of R to
the discrete metric of R is not continuous.
Using the same example, we can show that the function G : (C[0, 1], d1 ) →
(R, usual) with G(f ) = f (0) is not continuous.
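The computations ‖fn‖_1 = 1/(2n) and ‖fn‖_∞ = 1 above can be spot-checked numerically; a Python sketch (a midpoint Riemann sum stands in for the exact integral; the names are mine):

```python
def f_n(n, x):
    """The triangular spike: 1 - nx on [0, 1/n], then 0."""
    return 1 - n * x if x <= 1 / n else 0.0

def l1_norm(f, steps=20000):
    """Midpoint Riemann-sum approximation of the integral of |f| over [0, 1]."""
    h = 1.0 / steps
    return sum(abs(f((i + 0.5) * h)) for i in range(steps)) * h

for n in [1, 2, 10, 100]:
    assert abs(l1_norm(lambda x: f_n(n, x)) - 1 / (2 * n)) < 1e-3  # ||fn||_1 = 1/(2n)
    assert f_n(n, 0.0) == 1                                        # ||fn||_inf stays 1
```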
D. 1-16
• Let (X, d) be a metric space. We say U ⊆ X is an open subset if for every x ∈ U ,
∃δ > 0 such that d(x, y) < δ ⇒ y ∈ U . We say C ⊆ X is a closed subset if X \ C
is open.
L. 1-17
The open ball Br (x) ⊆ X is an open subset, and the closed ball B̄r (x) ⊆ X is a
closed subset.
Footnote 2: Some authors drop the requirement xn ≠ x in the definition, but we are not going to.
E. 1-18
1. When X = R, Br (x) = (x − r, x + r) and B̄r (x) = [x − r, x + r].
2. When X = R²,
i. If d is the metric induced by ‖v‖_1 = |v1| + |v2|, then an open ball is a rotated square.
ii. If d is the metric induced by ‖v‖_2 = (v1² + v2²)^{1/2}, then an open ball is an actual disk.
iii. If d is the metric induced by ‖v‖_∞ = max{|v1|, |v2|}, then an open ball is a square.
E. 1-19
Note that openness is a property of the pair (A, X): whether A ⊆ X is open depends
on both A and X, not just A. For example, [0, 1/2) is not an open subset of R, but is
an open subset of [0, 1] (since it is B_{1/2}(0)), both with the Euclidean metric.
2. Q ⊆ R is neither open nor closed, since any open interval contains both
rational numbers and irrational numbers. So any open interval (open ball)
cannot be a subset of Q or R \ Q.
3. Let X = [−1, 1] \ {0} with the Euclidean metric. Let A = [−1, 0) ⊆ X. Then
A is open since it is equal to B₁(−1). A is also closed since it is equal to
B̄_{1/2}(−1/2).
L. 1-20
In a metric space, xn → x iff for all open neighbourhood U of x, ∃N such that
xn ∈ U for all n > N .
(Forward) Since U is open, there exists some δ > 0 such that Bδ (x) ⊆ U . Since
xn → x, ∃N such that d(xn , x) < δ for all n > N . This implies that xn ∈ Bδ (x)
for all n > N . So xn ∈ U for all n > N .
(Backward) Given any ε > 0, Bε (x) is open, so ∃N such that d(xn , x) < ε for all
n > N.
E. 1-21
• If A = (0, 1) ⊆ R, then 0 is a limit point of A, eg. take xn = 1/n.
.
P. 1-22
Let X be a metric space. C ⊆ X is a closed subset if and only if C contains all of
its limit points.
(Backward) Suppose that C is not closed. We have to find a limit point not in C.
Since C is not closed, A = X \ C is not open. So ∃x ∈ A such that Bδ(x) ⊄ A for all
δ > 0. This means that Bδ(x) ∩ C ≠ ∅ for all δ > 0. Pick xn ∈ B_{1/n}(x) ∩ C for each
n > 0. Then xn ∈ C, xn ≠ x and d(xn, x) ≤ 1/n → 0. So xn → x. So x is a limit
point of C which is not in C.
P. 1-23
Let (X, dx ) and (Y, dy ) be metric spaces, and f : X → Y . The following are
equivalent:
1. f is continuous
2. If xn → x, then f (xn ) → f (x)
3. For any closed subset C ⊆ Y , f −1 (C) is closed in X.
4. For any open subset U ⊆ Y , f −1 (U ) is open in X.
5. For any x ∈ X and ε > 0, ∃δ > 0 such that f (Bδ (x)) ⊆ Bε (f (x)). (Equiva-
lently, dX (x, z) < δ ⇒ dY (f (x), f (z)) < ε.)
1 ⇔ 2: By definition.
1. Observe that

    ‖f‖_1 = ∫_0^1 |f(t)| dt ≤ ∫_0^1 ‖f‖_∞ dt = ‖f‖_∞,
    ‖f‖_2² = ∫_0^1 |f(t)|² dt ≤ ∫_0^1 ‖f‖_∞² dt = ‖f‖_∞²,
    ‖fn − 0‖_2² = ∫_0^1 fn(t)² dt = ∫_0^{1/n} n^{2/3}(1 − nt)² dt = n^{2/3} [−(1 − nt)³/(3n)]_0^{1/n} = n^{−1/3}/3 → 0

as n → ∞, yet ‖fn − 0‖_∞ = n^{1/3} → ∞, so j_{1,∞} and j_{2,∞} are not continuous.
2. Observe that

    ‖f‖_1 = ∫_0^1 |f(t)| dt = ⟨|f|, 1⟩ ≤ ‖f‖_2 ‖1‖_2 = ‖f‖_2.
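The closed-form value ‖fn‖_2² = n^{−1/3}/3 for the spike fn(t) = n^{1/3}(1 − nt) on [0, 1/n] used in part 1 can be verified numerically; a Python sketch (a midpoint Riemann sum approximates the integral; the names are mine):

```python
def f_n(n, t):
    """Spike of height n^(1/3) on [0, 1/n]: f_n(t) = n^(1/3) (1 - nt)."""
    return n ** (1 / 3) * (1 - n * t) if t <= 1 / n else 0.0

def l2_norm_sq(f, steps=50000):
    """Midpoint Riemann-sum approximation of the integral of f^2 over [0, 1]."""
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) ** 2 for i in range(steps)) * h

for n in [1, 8, 64]:
    exact = n ** (-1 / 3) / 3                        # the closed form derived above
    assert abs(l2_norm_sq(lambda t: f_n(n, t)) - exact) < 1e-3
    assert f_n(n, 0.0) == n ** (1 / 3)               # sup norm n^(1/3) blows up
```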
L. 1-26
1. ∅ and X are open subsets of X.
2. If Vα ⊆ X is open for all α ∈ A, then U = ⋃_{α∈A} Vα is open in X.
3. If V₁, · · · , Vn ⊆ X are open, then so is V = ⋂_{i=1}^n Vi.

1. ∅ satisfies the definition of an open subset vacuously. X is open since for any
x, B₁(x) ⊆ X.
2. If x ∈ U, then x ∈ Vα for some α. Since Vα is open, there exists δ > 0 such
that Bδ(x) ⊆ Vα. So Bδ(x) ⊆ ⋃_{α∈A} Vα = U. So U is open.
3. If x ∈ V, then x ∈ Vi for all i = 1, · · · , n. So ∃δi > 0 with B_{δi}(x) ⊆ Vi. Take
δ = min{δ₁, · · · , δn}. Then Bδ(x) ⊆ Vi for all i. So Bδ(x) ⊆ V. So V is open.
Note that we can take infinite unions or finite intersections, but not infinite
intersections. For example, the intersection of all (−1/n, 1/n) is {0}, which is not open.
The elements of X are the points . We extend the notion of open set by calling
the elements of τ the open subsets of X.
• Let (X, d) be a metric space, then the topology induced by d is the set of all open
sets of X under d. A notion or property is said to be a topological notion or
topological property if it only depends on the topology, and not the metric.
• Let f : X → Y be a map of topological spaces. Then f is continuous if f −1 (U )
is open in X whenever U is open in Y .
• Let (X, τ) be a topological space and x ∈ X. We say that V is an open neighbourhood
of x if V ∈ τ and x ∈ V (same as the metric space definition). We say that N is
a neighbourhood of x if we can find U ∈ τ with x ∈ U ⊆ N .
E. 1-28
When the topologies are induced by metrics, the topological and metric notions
of continuous functions coincide, as we showed previously. The new definition of
continuity is more general in the sense that it also works for topological spaces not
induced by a metric.
Notions of limits and continuity (in a metric space) are in fact topological properties,
since they only depend on the topology (i.e. which sets are open) as shown
previously; that’s also why we can define continuity solely using topology.
C. 1-29
<Some common topology> Let X be any set.
1. τ = {∅, X} is the coarse topology or indiscrete topology on X.
2. τ = P(X) is the discrete topology on X, it is induced by the discrete metric.
3. τ = {A ⊆ X : X \ A is finite or A = ∅} is the cofinite topology on X.
4. Let X = R. Then τ = {(a, ∞) : a ∈ R} ∪ {∅, R} is the right order topology on R.
E. 1-30
Note that if X is finite, then the cofinite topology is the same as the discrete
topology.
If F is a finite set with more than one point, the indiscrete topology is not induced
by any metric, since under any metric every subset of F must be open, as there
exists a minimum distance between any two distinct points. So the induced topology
must be the discrete topology.
L. 1-31
Let τ1 and τ2 be two topologies on the same space X. Then τ1 ⊆ τ2 if and only
if, given x ∈ U ∈ τ1 , we can find V ∈ τ2 such that x ∈ V ⊆ U .
E. 1-32
Let X = Rⁿ. The metrics induced by ‖·‖_1, ‖·‖_2 and ‖·‖_∞ in fact all induce
the same topology, as B^∞_{r/n}(x) ⊆ B^1_r(x) ⊆ B^2_r(x) ⊆ B^∞_r(x).
However if X = C[0, 1], then d1(f, g) = ‖f − g‖_1 and d∞(f, g) = ‖f − g‖_∞ do not
induce the same topology, since (X, d1) → (X, d∞) by f ↦ f is not continuous.
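The ball inclusions quoted above follow from the coordinate inequalities ‖v‖_∞ ≤ ‖v‖_2 ≤ ‖v‖_1 ≤ n‖v‖_∞, which can be spot-checked numerically (a sketch; names are mine):

```python
import random

def norms(v):
    """Return (||v||_1, ||v||_2, ||v||_inf)."""
    n1 = sum(abs(x) for x in v)
    n2 = sum(x * x for x in v) ** 0.5
    ninf = max(abs(x) for x in v)
    return n1, n2, ninf

random.seed(1)
n = 5
for _ in range(1000):
    v = [random.uniform(-10, 10) for _ in range(n)]
    n1, n2, ninf = norms(v)
    # ||v||_inf <= ||v||_2 <= ||v||_1 <= n ||v||_inf gives the inclusions
    # B^inf_{r/n} ⊆ B^1_r ⊆ B^2_r ⊆ B^inf_r, so all three norms induce
    # the same topology on R^n.  (Tolerances absorb float rounding.)
    assert ninf <= n2 + 1e-9
    assert n2 <= n1 + 1e-9
    assert n1 <= n * ninf + 1e-9
```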
E. 1-33
• Any function f : X → Y is continuous if X has the discrete topology.
Note that by how neighbourhood is defined, the result holds if we replace neigh-
bourhoods with open neighbourhoods. So the result can be written as: U ∈ τ if
and only if ∀x ∈ U, ∃Vx ∈ τ s.t. x ∈ Vx ⊆ U .
Note that this result is analogous to how we define openness of a set in a metric
space.
L. 1-37
Let (X, τ ) and (Y, σ) be topological spaces. Then f : X → Y is continuous if
and only if, given x ∈ X and M a neighbourhood of f (x) in Y , we can find a
neighbourhood N of x with f (N ) ⊆ M .
exhibited by one but not by the other. Later in this course we will meet some
topological properties like being Hausdorff and compactness.
D. 1-43
• A sequence xn converge to x (xn → x) if for every open neighbourhood U of x,
∃N such that xn ∈ U for all n > N . (Topological definition)4
• A topological space X is Hausdorff if for all x1 , x2 ∈ X with x1 6= x2 , there exists
open neighbourhoods U1 of x1 and U2 of x2 such that U1 ∩ U2 = ∅.
E. 1-44
1. If X has the coarse topology, then any sequence xn converges to every x ∈ X,
since there is only one open neighbourhood of x.
2. If X has the cofinite topology and all xn s are distinct, then xn → x for every
x ∈ X, since every open set can only have finitely many xn not inside it.
Note that convergence only depends on the topology (not the metric, if exist),
as can be seen by [L.1-20] and our new topological definition of convergence. So
convergence is a topological property.
Also it should be pointed out that for functions between topological spaces, in
general, f (xn ) → f (x) whenever xn → x does not mean f is continuous, although
the converse is true. A function satisfying f (xn ) → f (x) whenever xn → x is called
sequentially continuous. We have already seen that this is equivalent to continuity
in metric spaces, but this is not true in general for topological spaces.
L. 1-45
If X is Hausdorff and xn is a sequence in X with xn → x and xn → x0 , then
x = x0 (ie. limits are unique in Hausdorff space).
C_A = {C ⊆ X : C is closed in X and A ⊆ C}
O_A = {U ∈ τ : U ⊆ A}
The closure of A in X is Cl(A) = Ā = ⋂_{C∈C_A} C.
The interior of A in X is Int(A) = ⋃_{U∈O_A} U.
• Define L(A) = {x ∈ X : ∃(xn ) ∈ A s.t. xn → x}.
• Let F be a closed subset of X. We say that A ⊆ X is a dense subset of F if
Ā = F .
E. 1-50
Since Ā is defined as an intersection, we should make sure we are not taking an
intersection of no sets. Since X is closed in X (its complement ∅ is open), CA 6= ∅.
So we can safely take the intersection.
Note that Int A = {x ∈ A : ∃ U ∈ τ with x ∈ U ⊆ A} and Cl A = {x ∈ X : ∀U ∈
τ with x ∈ U , we have A ∩ U 6= ∅}.
P. 1-51
1. Ā is the smallest closed subset of X which contains A.
2. Int(A) is the largest open subset of X contained in A.
1. Since Ā is an intersection of closed sets, it is closed in X. Also, if C ∈ C_A,
then A ⊆ C. So A ⊆ ⋂_{C∈C_A} C = Ā. Let K ⊆ X be a closed set containing A.
Then K ∈ C_A. So Ā = ⋂_{C∈C_A} C ⊆ K, so Ā ⊆ K.
2. From the definition, clearly Int A ⊆ A. Since the union of open sets is open,
Int A ∈ τ. Let M ⊆ X be an open set contained in A. Then M ∈ O_A, so M ⊆
Int(A) = ⋃_{U∈O_A} U.
P. 1-52
X \ Int(A) = Cl(X \ A) (Equivalently, (Int(A))ᶜ = Cl(Aᶜ)).
C ⊆ L(A) ⊆ Ā ⊆ C, where the last step is since Ā is the smallest closed set
containing A. So C = L(A) = Ā.
This is useful for finding the closure of subsets.
E. 1-57
• If (a, b) ⊆ R, then Cl((a, b)) = [a, b].
• If Q ⊆ R, then Q̄ = R. Also Cl(R \ Q) = R. So Q and R \ Q are both dense in R with
the usual topology. Note also Int(Q) = Int(R \ Q) = ∅.
• In Rⁿ with the Euclidean metric, Cl(Br(x)) = B̄r(x). In general, Cl(Br(x)) ⊆ B̄r(x),
since B̄r(x) is closed and Br(x) ⊆ B̄r(x), but these need not be equal.
For example, if X has the discrete metric, then B₁(x) = {x}. Then Cl(B₁(x)) = {x},
but B̄₁(x) = X.
E. 1-58
1. Let (X, τ ) be a topological space and (Y, d) a metric space. If f, g : X → Y are
continuous show that E = {x ∈ X : f (x) = g(x)} is closed. If now f (x) = g(x)
for all x ∈ A, where A is dense in X, show that f (x) = g(x) for all x ∈ X.
2. Consider the unit interval [0, 1] with the Euclidean metric and A = [0, 1] ∩ Q
with the inherited metric. Exhibit, with proof, a continuous map f : A → R
(where R has the standard metric) such that there does not exist a continuous
map f̃ : [0, 1] → R with f̃(x) = f(x) for all x ∈ A.
For 2, define f : A → R by f(x) = 0 if x² < 1/2 and f(x) = 1 if x² > 1/2; this covers
all of A, since no rational x has x² = 1/2. If y ∈ A and y² < 1/2, then by continuity
of x ↦ x² we can find a δ > 0 such that

    |y − x| < δ ⇒ x² < 1/2 ⇒ f(x) = f(y).

Similarly if y ∈ A and y² > 1/2 we can find a δ > 0 such that |y − x| < δ ⇒
f(x) = f(y). Thus f is continuous.
Suppose that f̃ : [0, 1] → R is such that f̃(x) = f(x) for all x ∈ A. Choose
pn, qn ∈ A such that pn² > 1/2 > qn² and |pn − 2^{−1/2}|, |qn − 2^{−1/2}| → 0. Then f̃
cannot be continuous, since we would need both f̃(pn) = 1 → f̃(2^{−1/2}) and
f̃(qn) = 0 → f̃(2^{−1/2}), which is impossible.
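The obstruction can be seen numerically; a hedged Python sketch (the particular f below is my reconstruction of the map in the exercise: 0 below the irrational cut 2^{−1/2}, 1 above it):

```python
from fractions import Fraction

def f(x):
    """On A = [0,1] ∩ Q, send x to 0 if x^2 < 1/2 and to 1 if x^2 > 1/2;
    x^2 = 1/2 never happens for rational x, so this is total on A."""
    return 0 if x * x < Fraction(1, 2) else 1

# Rational decimal truncations pinching 2^(-1/2) = 0.70710678... from both sides:
p_n = [Fraction(7072, 10**4), Fraction(70711, 10**5), Fraction(707107, 10**6)]
q_n = [Fraction(7071, 10**4), Fraction(70710, 10**5), Fraction(707106, 10**6)]
for p, q in zip(p_n, q_n):
    assert f(p) == 1 and f(q) == 0           # the values stay 1 and 0 apart ...
    assert abs(p - q) <= Fraction(1, 10**4)  # ... while the points come together,
# so any extension of f to [0, 1] must be discontinuous at 2^(-1/2).
```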
L. 1-59
Let X be a space and let H be a collection of some subsets of X. Then there
exists a unique topology τH such that (i)[[ τH ⊇ H ]], and (ii)[[ if τ is a topology
with τ ⊇ H, then τ ⊇ τH ]].
(Uniqueness) Suppose that σ and σ′ are topologies satisfying (i) and (ii). By (i) of
σ and (ii) of σ′, we have σ ⊇ σ′, and by (i) of σ′ and (ii) of σ, we have σ′ ⊇ σ.
Thus σ = σ′.
We call τH the smallest (or coarsest) topology containing H. However note that
there need not exist a largest topology contained in H. For example let X =
{1, 2, 3} and θ = {∅, {1}, {2}, X}. There does not exist a topology τ1 ⊆ θ such
that, if τ ⊆ θ is a topology, then τ ⊆ τ1 . Thus there does not exist a largest
topology contained in θ.
L. 1-60
Suppose that A is non-empty, the spaces (Xα , τα ) are topological spaces and we
have maps fα : X → Xα [α ∈ A]. Then there is a smallest topology τ on X for
which the maps fα are continuous.
D. 1-61
• Let (X, τ ) be a topological space and Y ⊆ X. The subspace topology τY on Y
induced by τ is given by τY = {Y ∩ U : U ∈ τ }.
P. 1-62
If (X, τ ) is a topological space and Y ⊆ X, then the subspace topology τY on Y is
the smallest topology on Y for which the inclusion map is continuous.5
1. ∅ = Y ∩ ∅ ∈ τY and Y = Y ∩ X ∈ τY.
If Vi = Y ∩ Ui ∈ τY for i = 1, · · · , n, then

    ⋂_{i=1}^n Vi = ⋂_{i=1}^n (Y ∩ Ui) = Y ∩ (⋂_{i=1}^n Ui) ∈ τY
E. 1-63
If (X, d) is a metric space and Y ⊆ X, then the metric topology on (Y, d) is the
subspace topology, since B_r^Y(y) = Y ∩ B_r^X(y).
P. 1-64
If Y ⊆ X has the subspace topology, then f : Z → Y is continuous iff ι◦f : Z → X
is continuous.
Footnote 5: Some use this as the definition of subspace topology.
Condition (ii) of a basis might seem strange. But as can be seen from the proof,
(ii) ensures that (finite) intersections of sets in τB are in τB, ie. (finite) intersections
of open sets are open.
D. 1-68
• Let (X, τ ) and (Y, σ) be topological spaces. The product topology λ on X × Y is
given by: U ∈ λ iff ∀(x, y) ∈ U , ∃Vx ∈ τ, ∃Wy ∈ σ s.t. (x, y) ∈ Vx × Wy ⊆ U .
P. 1-69
If (X, τ ) and (Y, σ) are topological spaces, then the product topology µ on X × Y
is the smallest topology on X × Y for which the projection maps πX and πY are
continuous.7
P. 1-70
If X × Y has the product topology, then f : Z 7→ X × Y is continuous iff πX ◦ f
and πY ◦ f are continuous.
Footnote 7: Some use this as the definition of product topology.
So f −1 (W ) is open.
E. 1-71
• If V ⊆ X and W ⊆ Y are open, then V × W ⊆ X × Y is open. (take VX = V ,
WY = W )
• Note that our definition of the product topology is rather similar to the definition
of open sets for metrics. We have a special class of subsets of the form V × W ,
and a subset U is open iff every point x ∈ U is contained in some V × W ⊆ U . In
some sense, these subsets “generate” the open sets.
Alternatively, if U ⊆ X × Y is open, then U = ⋃_{(x,y)∈U} Vx × Wy. So U ⊆ X × Y
is open if and only if it is a union of members of our special class of subsets. We
call this special class the basis.
E. 1-72
Suppose that (X, τ ) and (Y, σ) are topological spaces and we give X×Y the product
topology µ. Now fix x ∈ X and give E = {x} × Y the subspace topology µE . Show
that the map k : (Y, σ) → (E, µE ) given by k(y) = (x, y) is a homeomorphism.
E. 1-73
Let (X₁, d₁) and (X₂, d₂) be metric spaces. Let τ be the product topology on
X₁ × X₂ where Xj has the topology induced by dj [j = 1, 2]. For u = (u₁, u₂) and
v = (v₁, v₂), define ρk : (X₁ × X₂)² → R for k = 1, 2, 3 by

    ρ₁(u, v) = d₁(u₁, v₁) + d₂(u₂, v₂),
    ρ₂(u, v) = (d₁(u₁, v₁)² + d₂(u₂, v₂)²)^{1/2},
    ρ₃(u, v) = max{d₁(u₁, v₁), d₂(u₂, v₂)}.

Show that the ρi are metrics. Show that each of the ρi induces the product topology
τ on X₁ × X₂.
It is easy to show that ρi(u, v) = 0 iff u = v and ρi(u, v) = ρi(v, u). Also,
ρi(u, w) ≤ ρi(u, v) + ρi(v, w), by the triangle inequalities for d₁ and d₂ together
with the triangle inequality ‖a‖ + ‖b‖ ≥ ‖a + b‖ for the corresponding norm ‖ ‖
on R². So they are metrics. Suppose ρi induces the topology τi. We now use [L.1-31].
Given any U ∈ τ and any (x₁, x₂) ∈ U, there exist Vi open in Xi such that
(x₁, x₂) ∈ V₁ × V₂ ⊆ U. Since Vi is open in Xi and xi ∈ Vi, there exist ri > 0 such
that xi ∈ B^{di}_{ri}(xi) ⊆ Vi. Let r = min{r₁, r₂}; then (x₁, x₂) ∈ B^{d₁}_r(x₁) ×
B^{d₂}_r(x₂) ⊆ U. Now (x₁, x₂) ∈ B^{ρi}_r(x₁, x₂) ⊆ B^{d₁}_r(x₁) × B^{d₂}_r(x₂) ⊆ U
for all i = 1, 2, 3. So τi ⊇ τ for all i.
Given any Wi ∈ τi and any (x₁, x₂) ∈ Wi, there exists Ri > 0 such that (x₁, x₂) ∈
B^{ρi}_{Ri}(x₁, x₂) ⊆ Wi. Now (x₁, x₂) ∈ B^{d₁}_{Ri/2}(x₁) × B^{d₂}_{Ri/2}(x₂) ⊆
B^{ρi}_{Ri}(x₁, x₂) ⊆ Wi. So τi ⊆ τ for all i.
E. 1-74
• The product topology on R × R is the same as the topology induced by ‖ ‖_∞,
hence is also the same as the topology induced by ‖ ‖_2 or ‖ ‖_1 [E.1-33]. Similarly,
the product topology on Rⁿ = R^{n−1} × R is also the same as that induced by ‖ ‖_∞.
• (0, 1) × (0, 1) × · · · × (0, 1) = (0, 1)ⁿ ⊆ Rⁿ is the open n-dimensional cube in Rⁿ.
Since (0, 1) ≃ R, we have (0, 1)ⁿ ≃ Rⁿ ≃ Int(Dⁿ).
• [0, 1] × Sⁿ ≃ [1, 2] × Sⁿ ≃ {v ∈ R^{n+1} : 1 ≤ |v| ≤ 2}, where the last homeomorphism
is given by (t, w) ↦ tw with inverse v ↦ (|v|, v̂). This is a thickened sphere.
• Let A ⊆ {(r, z) : r > 0} ⊆ R², and let R(A) be the set obtained by rotating A around
the z axis. Then R(A) ≃ S¹ × A by (x, y, z) = (v, z) ↦ (v̂, (|v|, z)). In particular, if A
is a circle, then R(A) ≃ S¹ × S¹ = T², the two-dimensional torus.
D. 1-75
• If X is a set and ∼ is an equivalence relation on X, then the quotient X/∼ is the
set of equivalence classes. The projection q : X → X/∼ is defined as q(x) = [x],
the equivalence class containing x.
• If (X, τ ) is a topological space and ∼ an equivalence relation on X, the quotient topology
σ on X/∼ is given by σ = {U ⊆ X/∼ : q −1 (U ) ∈ τ }.
P. 1-76
Let (X, τ ) be a topological space and ∼ an equivalence relation on X. The quotient
topology σ is the largest topology on X/ ∼ for which q is continuous.
• Let X = [0, 1] × [0, 1] with ∼ given by (0, y) ∼ (1, y) and (x, 0) ∼ (x, 1); then
X/∼ ≃ S¹ × S¹ = T², by, say, (x, y) ↦ ((cos 2πx, sin 2πx), (cos 2πy, sin 2πy)).
L. 1-79
1. If (X, τ ) is a Hausdorff topological space and Y ⊆ X, then Y with the subspace
topology is also Hausdorff.
2. If (X, τ ) and (Y, σ) are Hausdorff topological spaces, then X × Y with the
product topology is also Hausdorff.
1.3 Connectivity
D. 1-82
A topological space X is disconnected if X can be written as A ∪ B, where A and
B are disjoint, non-empty open subsets of X. We say A and B disconnect X. A
space is connected if it is not disconnected.
If E is a subset of a topological space (X, τ), then E is called connected if the
subspace topology on E is connected. This is equivalent to the condition that we
cannot find open sets U and V such that U ∪ V ⊇ E, U ∩ V ∩ E = ∅, U ∩ E ≠ ∅ and
V ∩ E ≠ ∅. A disconnection of a subset is defined similarly.
E. 1-83
Intuitively, we would want to say that a space is “connected” if it is one-piece. For
example, R is connected, while R \ {0} is not. We will come up with two different
definitions of connectivity - normal connectivity and path connectivity, where the
latter implies the former, but not the other way round.
Note that being connected is a property of a space, not a subset. When we say “A
is a connected subset of X”, it means A is connected with the subspace topology
inherited from X.
Then f⁻¹(∅) = ∅, f⁻¹({0, 1}) = X, f⁻¹({0}) = A and f⁻¹({1}) = B are all open. So
f is continuous. Also, since A and B are non-empty, f is surjective.
(Backward) Given f : X → {0, 1} surjective and continuous, define A = f⁻¹({0}),
B = f⁻¹({1}). Then A and B disconnect X.
In fact we have: the topological space (X, τ) is connected iff every continuous
map f : (X, τ) → (A, ∆) is constant, where A is any set with at least two points
and ∆ its discrete topology.
Since Z and {0, 1} have the discrete topology when considered as subspaces of R
with the usual topology, we also have the following:
1. A topological space (X, τ ) is connected if and only if every continuous integer
valued function f : X → R (where R has its usual topology) is constant.
2. A topological space (X, τ ) is connected if and only if every continuous function
f : X → R (where R has its usual topology) which only takes the values 0 or
1 is constant.
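The characterisation above can be tested mechanically on finite spaces. The following Python sketch (not part of the original notes; the representation of a topology as an explicit list of open sets is our own convention) checks connectedness directly from the definition: a space is disconnected iff two disjoint non-empty open sets cover it.

```python
from itertools import combinations

def is_connected(X, opens):
    """Check connectedness of a finite topological space.

    X: the set of points; opens: the collection of open sets.
    The space is disconnected iff some pair of disjoint non-empty
    open sets covers X.
    """
    X = frozenset(X)
    opens = [frozenset(U) for U in opens]
    for A, B in combinations(opens, 2):
        if A and B and not (A & B) and (A | B) == X:
            return False
    return True

# {0, 1} with the discrete topology is disconnected ({0} and {1}
# disconnect it); with the indiscrete topology it is connected.
X = {0, 1}
discrete = [set(), {0}, {1}, {0, 1}]
indiscrete = [set(), {0, 1}]
```

For instance, `is_connected(X, discrete)` is `False` while `is_connected(X, indiscrete)` is `True`, matching the fact that a two-point discrete space admits a non-constant continuous map to {0, 1}.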
P. 1-86
[0, 1] is connected. (More generally [a, b] is connected)
(Proof 1) Note that Q ∩ [0, 1] is disconnected, since we can pick our favourite
irrational number a ∈ (0, 1), and then {x : x < a} and {x : x > a} disconnect it.
So we had better use something special about [0, 1]. The key property of R is that
every non-empty bounded A ⊆ R has a supremum.
Suppose A and B disconnect [0, 1]. Wlog assume 1 ∈ B. Since A is non-empty,
α = sup A exists. Then either
L. 1-90
Let E be a subset of a topological space (X, τ ). If E is connected so is Cl E.
D. 1-91
• Let X be a topological space, and x0, x1 ∈ X. Then a path from x0 to x1 is a
  continuous function γ : [0, 1] → X such that γ(0) = x0, γ(1) = x1.
E. 1-92
• (a, b), [a, b), (a, b], R are all path connected (using paths given by linear functions).
• Rⁿ \ {0} is path-connected for n > 1 (the paths are either line segments or bent line
  segments that go around the hole).
P. 1-93
If X is path connected, then X is connected.
(Proof 2) Suppose that (X, τ ) is path-connected and that U and V are open sets
with U ∩ V = ∅ and U ∪ V = X. Wlog U 6= ∅, choose x ∈ U . If y ∈ X,
we can find f : [0, 1] → X continuous with f (0) = x and f (1) = y. We have
U ∩ V ∩ f ([0, 1]) = ∅ and U ∪ V ⊇ f ([0, 1]). Now the continuous image of a
connected set is connected and [0, 1] is connected, so f ([0, 1]) is connected. Since
U ∩ f ([0, 1]) 6= ∅, V ∩ f ([0, 1]) = ∅. So U ⊇ f ([0, 1]), so y ∈ U . Thus U = X. So
U, V does not disconnect X, so X is connected.
L. 1-94
Suppose f : X → Y is a homeomorphism and A ⊆ X, then f |A : A → f (A) is a
homeomorphism.
1. For any x ∈ X, let γx : [0, 1] → X be γ(t) = x, the constant path. Then this
is a path from x to x. So x ∼ x.
2. If γ : [0, 1] → X is a path from x to y, then γ̄ : [0, 1] → X by t 7→ γ(1 − t) is a
path from y to x. So x ∼ y ⇒ y ∼ x.
3. If γ1 is a path from x to y and γ2 is a path from y to z, then the concatenation
   (traversing γ1 on [0, 1/2] and γ2 on [1/2, 1]) is a path from x to z. So x ∼ y, y ∼ z ⇒ x ∼ z.
P. 1-99
If Yα ⊆ X is connected for all α ∈ A and ⋂_{α∈A} Yα ≠ ∅, then Y = ⋃_{α∈A} Yα
is connected.
Let U and V be open sets such that
U ∪ V ⊇ ⋃_{α∈A} Yα  and  U ∩ V ∩ ⋃_{α∈A} Yα = ∅.
E. 1-101
If a space is disconnected, we could divide the space into different components,
each of which is (maximally) connected.
P. 1-102
C(x) is the largest connected subset of X containing x.
Since [a, b], [a, b), (a, b] and (a, b) are path connected, they are connected.
Suppose, conversely, that E is bounded and contains at least two points. Since E
is bounded, α = inf E and β = sup E exist, and α < β. If c ∈ (α, β), we can
find x, y ∈ E such that α ≤ x < c and c < y ≤ β. If c ∉ E, then U = (−∞, c)
and V = (c, ∞) are open with U ∩ V = ∅ and U ∪ V ⊇ E, but x ∈ U ∩ E, y ∈ V ∩ E, so
U ∩ E, V ∩ E ≠ ∅ and E is not connected. Thus, if E is connected, E ⊇ (α, β)
and E is one of [α, β], (α, β), (α, β] or [α, β).
The same kind of argument shows that the connected subsets of R are precisely
the sets of the form [a, b], [a, b), (a, b], (a, b), (−∞, b], (−∞, b), [a, ∞), (a, ∞) and
R (where a ≤ b).
Note that the condition that the sets be open is important, as can be seen from the
above example, where Y ∪ Z is connected but not path connected.
1.4 Compactness
Compactness is an important concept in topology. It can be viewed as a generalization
of being “closed and bounded” in R. Alternatively, it can also be viewed as a general-
ization of being finite. Compact sets tend to have a lot of really nice properties. For
example, if X is compact and f : X → R is continuous, then f is bounded and attains
its bound.
There are two different definitions of compactness - one based on open covers (which
we will come to shortly), and the other based on sequences. In metric spaces, these
two definitions are equal. However, in general topological spaces, these notions can
be different. The first is just known as “compactness” and the second is known as
“sequential compactness”.
The actual definition of compactness is rather weird and unintuitive, and is difficult to
comprehend at first. However, as we go through more proofs and examples, (hopefully)
you will be able to appreciate this definition.
D. 1-108
• Let (X, τ) be a topological space and Y ⊆ X. An open cover of Y is a subset
  V ⊆ τ such that ⋃_{V∈V} V ⊇ Y. We say V covers Y. If V′ ⊆ V and V′ covers Y,
  we call V′ a subcover.
∅, X ∈ τ. If Uα ∈ τ for α ∈ B, then
X \ ⋃_{α∈B} Uα = ⋂_{α∈B} (X \ Uα),   X \ ⋂_{α∈B} Uα = ⋃_{α∈B} (X \ Uα)
1. Observe that the open balls B(x, δ) form an open cover of X and so have a
   finite subcover.
2. For each n ≥ 1, choose a finite subset En such that X = ⋃_{e∈En} B(e, 1/n).
   Observe that E = ⋃_{n=1}^∞ En is a countable union of finite sets, so countable.
   If U is open and non-empty, then we can find a u ∈ U and a δ > 0 such
   that U ⊇ B(u, δ). Choose N > δ⁻¹. We can find an e ∈ EN ⊆ E with
   u ∈ B(e, 1/N), so e ∈ B(u, 1/N) ⊆ B(u, δ) ⊆ U. Thus Cl E = X and we are
   done.
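The construction in step 2 can be carried out concretely. The sketch below (our own illustration, not from the notes) builds a finite 1/n-net for a discretised [0, 1] greedily: keep adding points until everything is within 1/n of a chosen centre. Compactness is what guarantees the greedy process terminates with a finite net.

```python
def greedy_net(points, eps):
    """Greedily pick centres until every point lies within eps of one.

    For a (sequentially) compact space the result is a finite eps-net:
    the greedy centres are pairwise >= eps apart, so there can only be
    finitely many of them.
    """
    net = []
    for p in points:
        if all(abs(p - c) >= eps for c in net):
            net.append(p)
    return net

# Discretise [0, 1] and build 1/n-nets; the union over all n of such
# nets is a countable dense set, as in the proof above.
points = [k / 1000 for k in range(1001)]
for n in (1, 2, 5, 10):
    net = greedy_net(points, 1 / n)
    # every point is within 1/n of some centre
    assert all(any(abs(p - c) < 1 / n for c in net) for p in points)
```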
P. 1-113
[0, 1] with the usual topology is compact.
First we show that A is non-empty. Since V covers [0, 1], there is some
V0 ∈ V that contains 0. So {0} has the finite subcover {V0}, and 0 ∈ A.
Again, since this is not true for [0, 1] ∩ Q, we must use a special property of reals.
P. 1-114
A closed subset of a compact set is compact. (If X is compact and C is closed
subset of X, then C is also compact.)
P. 1-115
Let X be a Hausdorff space, then every compact subset is closed. (If C ⊆ X is
compact, then C is closed in X.)
P. 1-120
Let (X, τ ) be a compact topological space and ∼ an equivalence relation on X.
Then the quotient topology on X/ ∼ is compact
T. 1-121
<Maximum value theorem> If f : X → R is continuous and X is compact,
then ∃x ∈ X such that f (x) ≥ f (y) for all y ∈ X.
L. 1-122
If f : [0, 1] → R is continuous, then ∃x ∈ [0, 1] such that f (x) ≥ f (y) for all
y ∈ [0, 1].
[0, 1] is compact.
E. 1-123
Let R have the usual metric.
1. If K is a subset of R with the property that, whenever f : K → R is continuous,
f is bounded, show that K is closed and bounded.
2. If K is a subset of R with the property that, f : K → R attains its bounds
whenever f is continuous and bounded, then K is closed and bounded.
T. 1-124
If X and Y are compact, then so is X × Y (under the product topology).
Let Oα ∈ λ for α ∈ A with ⋃_{α∈A} Oα = X × Y. Then, given (x, y) ∈ X × Y, we can
find U_{x,y} ∈ τ, V_{x,y} ∈ σ and α(x, y) ∈ A such that (x, y) ∈ U_{x,y} × V_{x,y} ⊆ O_{α(x,y)}.
We have ⋃_{y∈Y} {x} × V_{x,y} ⊇ {(x, y) : y ∈ Y} for each x ∈ X, and so
⋃_{y∈Y} V_{x,y} = Y. By compactness of Y, we can find a positive integer n(x) and
y(x, j) ∈ Y [1 ≤ j ≤ n(x)] such that ⋃_{j=1}^{n(x)} V_{x,y(x,j)} = Y.
Now U_x = ⋂_{j=1}^{n(x)} U_{x,y(x,j)} is the finite intersection of open sets in X and
so open. Further x ∈ U_x and so ⋃_{x∈X} U_x = X. By compactness of X, we can find
x_1, x_2, . . . , x_m such that ⋃_{r=1}^m U_{x_r} = X. And the result follows since
⋃_{r=1}^m ⋃_{j=1}^{n(x_r)} O_{α(x_r, y(x_r,j))} ⊇ ⋃_{r=1}^m ⋃_{j=1}^{n(x_r)} U_{x_r, y(x_r,j)} × V_{x_r, y(x_r,j)}
⊇ ⋃_{r=1}^m ⋃_{j=1}^{n(x_r)} U_{x_r} × V_{x_r, y(x_r,j)} ⊇ ⋃_{r=1}^m U_{x_r} × Y ⊇ X × Y
L. 1-125
Let (X, τ ) and (Y, σ) be topological spaces with subsets E and F . Let the subspace
topology on E be τE and on F be σF . Let the product topology on X × Y derived
from τ and σ be λ and let the product topology on E × F derived from τE and
σF be µ. Then µ is the subspace topology on E × F derived from λ.
T. 1-126
Let (X, τ ) and (Y, σ) be topological spaces and let λ be the product topology. If
K is a compact subset of X and L is a compact subset of Y , then K × L is a
compact in λ.
E. 1-127
The unit cube [0, 1]n = [0, 1] × [0, 1] × · · · × [0, 1] is compact.
T. 1-128
<Heine-Borel theorem> C ⊆ Rn is compact iff C is closed and bounded.
P. 1-129
Suppose f : X → Y is a continuous bijection. If X is compact and Y is Hausdorff,
then f is a homeomorphism.
E. 1-130
1. Give an example of a Hausdorff space (X, τ ) and a compact Hausdorff space
(Y, σ) together with a continuous bijection f : X → Y which is not a homeo-
morphism.
2. Give an example of a compact Hausdorff space (X, τ ) and a compact space
(Y, σ) together with a continuous bijection f : X → Y which is not a homeo-
morphism.
Let τ1 be the indiscrete topology on [0, 1], and τ2 the usual (Euclidean) topology
on [0, 1] and τ3 the discrete topology on [0, 1]. Then ([0, 1], τ1 ) is compact (but not
Hausdorff), ([0, 1], τ2 ) is compact and Hausdorff, and ([0, 1], τ3 ) is Hausdorff (but
not compact).
The identity maps id : ([0, 1], τ2) → ([0, 1], τ1) and id : ([0, 1], τ3) → ([0, 1], τ2) are
continuous bijections but not homeomorphisms.
L. 1-131
Let τ1 and τ2 be topologies on the same space X.
1. If τ1 ⊇ τ2 and (X, τ1 ) is compact, then so is (X, τ2 ).
2. If τ1 ⊇ τ2 and (X, τ2 ) is Hausdorff, then so is (X, τ1 ).
3. If τ1 ⊇ τ2 , (X, τ1 ) is compact and (X, τ2 ) is Hausdorff, then τ1 = τ2 .
L. 1-137
Suppose that (X, d) is a sequentially compact metric space and that the collection
Uα with α ∈ A is an open cover of X. Then there exists a δ > 0 such that, given
any x ∈ X, there exists an α(x) ∈ A such that the open ball B(x, δ) ⊆ Uα(x) .
Suppose the first sentence is true and the second sentence false. Then, for each
n ≥ 1, we can find an xn such that the open ball B(xn, 1/n) ⊄ Uα for all α ∈ A.
By sequential compactness, we can find y ∈ X and n(j) → ∞ such that x_{n(j)} → y.
Since y ∈ X, we must have y ∈ Uβ for some β ∈ A. Since Uβ is open, we can find
an ε > 0 such that B(y, ε) ⊆ Uβ. Now choose J sufficiently large that n(J) > 2ε⁻¹
and d(x_{n(J)}, y) < ε/2. We now have, using the triangle inequality, that
B(x_{n(J)}, 1/n(J)) ⊆ B(y, ε) ⊆ Uβ, contradicting the choice of x_{n(J)}.
(Forward) If (x_{n_i}) → x, then for every ε > 0 we can find I such that i > I implies
x_{n_i} ∈ Bε(x), by definition of convergence. So (∗) holds.
(Backward) Suppose (∗) holds. We will construct a subsequence x_{n_i} → x inductively.
Take n_0 = 0. Suppose we have defined n_0, . . . , n_{i−1}. Now x_n ∈ B_{1/i}(x) for infinitely
many n; take n_i to be the smallest such n with n_i > n_{i−1}. Then d(x_{n_i}, x) < 1/i implies
that x_{n_i} → x.
T. 1-139
Let (X, d) be a metric space, then X is compact iff X is sequentially compact.
X = ⋃_{j=1}^M B(y_j, δ) ⊆ ⋃_{j=1}^M U_{α(y_j)} ⊆ X,
so X = ⋃_{j=1}^M U_{α(y_j)} and we have found a finite subcover. Thus X is compact.
E. 1-140
Prove [P.1-113]. ([a, b] with the usual topology is compact)
(Proof 1) By [P.1-116].
(Proof 2) If (X, d) is sequentially compact but not bounded, then for any x0 ,
we can find a sequence (xk ) such that d(xk , x0 ) > k for every k. But then (xk )
cannot have a convergent subsequence. Otherwise, if xkj → x, then d(xkj , x0 ) ≤
d(xkj , x) + d(x, x0 ) and is bounded, which is a contradiction.
E. 1-142
Let X = C[0, 1] with the topology induced by d∞ (the uniform norm). Let
f_n(x) = { nx for x ∈ [0, 1/n];  2 − nx for x ∈ [1/n, 2/n];  0 for x ∈ [2/n, 1] }
Then f_n(x) → 0 for all x ∈ [0, 1]. We now claim that (f_n) has no convergent
subsequence. Suppose f_{n_i} → f. Then f_{n_i}(x) → f(x) for all x ∈ [0, 1]. However,
we know that f_{n_i}(x) → 0 for all x ∈ [0, 1]. So f(x) = 0. However, d∞(f_{n_i}, 0) = 1.
So f_{n_i} ↛ 0. It follows that the closed unit ball B̄1(0) ⊆ X is not sequentially
compact. So it is not compact.
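The two claims about the tent functions — pointwise convergence to 0 but uniform distance 1 from 0 — are easy to check numerically. A quick Python sketch (our own illustration of the example above):

```python
def f(n, x):
    """Tent function f_n: rises to 1 on [0, 1/n], falls back to 0 on
    [1/n, 2/n], and is 0 on [2/n, 1]."""
    if x <= 1 / n:
        return n * x
    if x <= 2 / n:
        return 2 - n * x
    return 0.0

xs = [k / 10000 for k in range(10001)]
# Pointwise: at any fixed x > 0, f_n(x) = 0 once 2/n < x ...
assert all(abs(f(n, 0.3)) < 1e-9 for n in range(10, 100))
# ... but the sup over [0, 1] stays 1, so d_inf(f_n, 0) = 1 for all n.
sup = max(f(50, x) for x in xs)
```

Since the uniform distance from 0 never shrinks, no subsequence can converge uniformly to the pointwise limit 0, which is the contradiction used above.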
D. 1-143
• Let (X, d) be a metric space. A sequence (xn ) in X is Cauchy if for every ε > 0,
∃N such that d(xn , xm ) < ε for all n, m ≥ N .
• A metric space (X, d) is complete if every Cauchy sequence in X converges to a
  limit in X.
E. 1-144
• x_n = ∑_{k=1}^n 1/k is not Cauchy.
• Let X = (0, 1) ⊆ R with x_n = 1/n. Then this is Cauchy but does not converge.
• If xn → x ∈ X, then xn is Cauchy. The proof is the same as that in Analysis I.
• Let X = Q ⊆ R. Then the sequence (2, 2.7, 2.71, 2.718, · · · ) is Cauchy but does
not converge in Q.
• (0, 1) and Q are not complete.
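The first two bullet points can be seen numerically. The standard argument that the harmonic partial sums are not Cauchy is that |x_{2n} − x_n| = ∑_{k=n+1}^{2n} 1/k ≥ 1/2 for every n; the sketch below (our own check, not from the notes) computes these gaps:

```python
def harmonic(n):
    """Partial sum x_n = sum_{k=1}^n 1/k of the harmonic series."""
    return sum(1 / k for k in range(1, n + 1))

# |x_{2n} - x_n| >= 1/2 for every n, so (x_n) is not Cauchy:
# consecutive blocks of the tail never get small.
gaps = [harmonic(2 * n) - harmonic(n) for n in (10, 100, 1000)]

# By contrast x_n = 1/n is Cauchy: |x_n - x_m| <= 1/N for n, m >= N.
tail = max(abs(1 / n - 1 / m) for n in range(100, 200) for m in range(100, 200))
```

Here every entry of `gaps` exceeds 1/2, while `tail` is at most 1/100, matching the two bullet points.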
P. 1-145
If X is a compact metric space, then X is complete.
Since X is compact, it is sequentially compact, so a Cauchy sequence (x_n) has a
convergent subsequence x_{n_i} → x ∈ X. Given ε > 0, pick N such that d(x_n, x_m) < ε/2
for n, m ≥ N. Pick I such that n_I ≥ N and d(x_{n_i}, x) < ε/2 for all i ≥ I. Then for
n ≥ n_I, d(x_n, x) ≤ d(x_n, x_{n_I}) + d(x_{n_I}, x) < ε. So x_n → x.
Observe that R with the usual Euclidean metric is complete but not compact.
P. 1-146
Rn is complete.
If (x_n) ⊆ Rⁿ is Cauchy, then (x_n) ⊆ B̄_R(0) for some R, and B̄_R(0) is compact,
hence complete by the previous proposition. So the sequence converges.
E. 1-147
Note that completeness is not a topological property: R ≃ (0, 1) but R is complete
while (0, 1) is not. This is because Cauchyness depends on the metric, not the
topology; Cauchy sequences only make sense in a metric space. For example, R \ {0}
with the usual metric d1(x, y) = |x − y| is incomplete, since x_n = 1/n is Cauchy
but does not converge; however, it is complete under the metric d2(x, y) = |1/x − 1/y|.
Note that both metrics induce the same topology on R \ {0}, since x_n → x under
d1 iff x_n → x under d2: for all x ∈ R \ {0} we have |x_n − x| → 0 iff
|1/x_n − 1/x| → 0.
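This example is easy to probe numerically. The sketch below (our own illustration) measures the diameter of the tail of the sequence x_n = 1/n under both metrics: it shrinks under d1 (Cauchy) but blows up under d2, because d2(1/n, 1/m) = |n − m|.

```python
d1 = lambda x, y: abs(x - y)          # usual metric on R \ {0}
d2 = lambda x, y: abs(1 / x - 1 / y)  # the alternative metric

xs = [1 / n for n in range(1, 101)]
tail = xs[49:]  # the terms 1/50, ..., 1/100

# d1-diameter of the tail is tiny: the sequence is d1-Cauchy ...
tail_d1 = max(d1(a, b) for a in tail for b in tail)
# ... but the d2-diameter is |50 - 100| = 50: not d2-Cauchy.
tail_d2 = max(d2(a, b) for a in tail for b in tail)
```

So the same sequence is Cauchy in one metric and not in the other, even though the two metrics induce the same topology.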
E. 1-148
When searching for a counterexample we may start by looking at R and Rn with
the standard metrics and subspaces like Q, [a, b], (a, b) and [a, b). Then we might
look at the discrete and indiscrete topologies on a space. It is often worth looking
at possible topologies on spaces with a small number of points.
CHAPTER 2
Variational principle
(1 − t)f (x) + tf (y) ≥ f (z) + ((1 − t)x + ty − z) · ∇f (z) = f ((1 − t)x + ty)
By applying the first-order condition to the right hand side (with x and y swapped),
we know that the right hand side is ≥ 0. So we have the result.
(Backward) Let z(t) = (1 − t)x + ty. Now (z − x) · (∇f(z) − ∇f(x)) ≥ 0 implies
(y − x) · (∇f(z) − ∇f(x)) ≥ 0, since z − x = t(y − x) with t ≥ 0. Now note that
f(y) − f(x) = [f(z(t))]₀¹ = ∫₀¹ (d/dt) f(z(t)) dt = ∫₀¹ (y − x) · ∇f(z(t)) dt
⟹ f(y) − f(x) − (y − x) · ∇f(x) = ∫₀¹ (y − x) · (∇f(z(t)) − ∇f(x)) dt ≥ 0
For example, when n = 1, the equation states that (y − x)(f 0 (y) − f 0 (x)) ≥ 0,
which is the same as saying f 0 (y) ≥ f 0 (x) whenever y > x.
T. 2-5
<Second-order convexity condition> For a function f that is everywhere
twice differentiable, the function is convex iff the Hessian matrix Hij never has a
negative eigenvalue (equivalently positive semi-definite).
h_i H_{ij} h_j + O(|h|³) ≥ 0
Since we can take |h| as small as we wish, this implies the Hessian has no negative
eigenvalue for all x ∈ D(f).
(Backward) First take n = 1, so we have f″(x) ≥ 0 for all x in some convex domain. Then
0 ≤ (sign h) ∫_x^{x+h} f″(z) dz = (sign h)(f′(x + h) − f′(x)).
Now integrate between 0 and y − x (note that h > 0 when y > x and h < 0 when
y < x; also ∫₀^{y−x} = −∫_{y−x}^0); we obtain the first-order condition since
0 ≤ ∫₀^{y−x} (f′(x + h) − f′(x)) dh = f(y) − f(x) − (y − x)f′(x)
For general n. For any x, y, let n̂ = (y − x)/|y − x|. Define g(z) = f (x + zn̂),
then g is convex, so by above g(|y − x|) ≥ g(0) + |y − x|g 0 (0), so f (y) ≥ f (x) +
|y − x|n̂ · ∇f (x) = f (x) + (y − x) · ∇f (x).
If all eigenvalues are everywhere strictly positive then the function is strictly con-
vex, but this is only a sufficient condition for strict convexity, not a necessary one.
For example, the function f (x) = x4 is strictly convex despite the fact f 00 (x) is
zero at x = 0.
E. 2-6
Let f(x, y) = 1/(xy) for x, y > 0. Then the Hessian and its determinant and trace are
H = ( 2/(x³y)   1/(x²y²) ;  1/(x²y²)   2/(xy³) ),
det H = 3/(x⁴y⁴) > 0,   tr H = 2/(x³y) + 2/(xy³) > 0.
Since the product of the eigenvalues is the determinant and their sum is the trace,
the Hessian never has a negative eigenvalue. So f is convex.
Note that to conclude that f is convex, we only used the fact that xy is positive;
but we cannot relax the domain condition from x, y > 0 to xy > 0, because the
domain of a convex function has to be a convex set.
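For a 2×2 symmetric matrix the eigenvalues follow directly from the trace and determinant, so the convexity check above can be verified in a few lines of Python (our own sanity check of the example):

```python
def hessian(x, y):
    """Analytic Hessian of f(x, y) = 1/(x*y)."""
    return [[2 / (x**3 * y), 1 / (x**2 * y**2)],
            [1 / (x**2 * y**2), 2 / (x * y**3)]]

def eigenvalues_2x2(H):
    """Eigenvalues of a symmetric 2x2 matrix: tr/2 +- sqrt(tr^2/4 - det)."""
    tr = H[0][0] + H[1][1]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    disc = (tr**2 / 4 - det) ** 0.5
    return tr / 2 - disc, tr / 2 + disc

# On x, y > 0 both eigenvalues are positive, confirming convexity there.
for x, y in [(1.0, 1.0), (2.0, 0.5), (0.3, 3.0)]:
    lo, hi = eigenvalues_2x2(hessian(x, y))
    assert lo > 0 and hi > 0
```

At (x, y) = (1, 1) the Hessian is [[2, 1], [1, 2]] with eigenvalues 1 and 3, consistent with det H = 3 and tr H = 4 there.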
P. 2-7
A stationary point of a convex function is a global minimum. There can be more
than one global minimum (eg a constant function), but there is at most one if the
function is strictly convex.
Given x0 such that ∇f (x0 ) = 0, the first-order condition implies that for any y,
f (y) ≥ f (x0 ) + (y − x0 ) · ∇f (x0 ) = f (x0 ).
If f is strictly convex and x ≠ y are both global minima, then f(x) = f(y), and so
for t ∈ (0, 1), f((1 − t)x + ty) < (1 − t)f(x) + tf(y) = f(x), contradicting that f(x)
is a global minimum. So there is at most one global minimum.
P. 2-8
Let A ⊆ Rn be an open convex set. Then f : A → R is a convex function iff for
any a ∈ A, ∃m ∈ Rn such that f (x) ≥ f (a) + m · (x − a) for all x ∈ A.
(Forward) First we prove the case n = 1. Pick any a ∈ A. Note that if f (x) is
convex then so is h(x) = f (x + µ) − λ. So wlog let a = 0 and f (0) = 0. Now let
m = inf {f (x)/x : x ∈ A, x > 0}.
Suppose m = −∞. Pick y < 0 such that y ∈ A. Let α = f(y)/y. We can find x > 0
such that f(x)/x < α − 1. We have
0 = f(0) ≤ (−y/(x − y)) f(x) + (x/(x − y)) f(y) < (−y/(x − y))(α − 1)x + (x/(x − y)) αy < 0,
a contradiction.
Now f ∗ ((1−t)p+tq) is bounded by the sum of two finite terms, which is finite. So
(1 − t)p + tq is also in the domain of f ∗ (domain is convex). So f ∗ is convex.
E. 2-11
When f is once-differentiable, the maximum (a supremum that is attained) of
p · x − f(x), if it exists, is found by solving the equation p = ∇f(x) for x in terms
of p (note that ∇(p · x) = p). The Legendre transform of f is then
f*(p) = p · x(p) − f(x(p)).
E. 2-12
[Figure: the graph of y = f(x) together with a supporting line of slope p; the
y-intercept of that line is −f*(p).]
E. 2-13
1. Let f(x) = ½ax² for a > 0. Then p = ax at the maximum of px − f(x). So
   f*(p) = px − f(x) = p · (p/a) − ½ a (p/a)² = p²/(2a),   p ∈ R.
2. Let f(v) = −√(1 − v²) for |v| < 1 (the lead-in to this item is lost in the source;
   this f is the one consistent with the computation). Then
   p = f′(v) = v/√(1 − v²)  ⟹  v = p/√(1 + p²)
   f*(p) = pv − f(v) = p²/√(1 + p²) + 1/√(1 + p²) = √(1 + p²).
3. Let f(x) = cx for c > 0. This is convex but not strictly convex. Then px − f(x) =
   (p − c)x. This has no maximum unless p = c. So the domain of f* is the one-point
   set {c}, and f*(c) = 0. So a line goes to a point.
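The closed form f*(p) = p²/(2a) in item 1 can be checked against a brute-force Legendre transform over a grid. A short Python sketch (our own verification, not part of the notes):

```python
def legendre(f, p, xs):
    """Numerical Legendre transform f*(p) = sup_x (p*x - f(x)) over a grid."""
    return max(p * x - f(x) for x in xs)

a = 2.0
f = lambda x: 0.5 * a * x * x
xs = [k / 1000 - 5 for k in range(10001)]  # grid on [-5, 5], step 0.001

# Compare the grid supremum with the closed form f*(p) = p^2 / (2a).
for p in (-1.0, 0.0, 0.5, 2.0):
    assert abs(legendre(f, p, xs) - p * p / (2 * a)) < 1e-4
```

The supremum is attained near x = p/a, well inside the grid for the chosen values of p, so the grid maximum agrees with the analytic answer to within the grid resolution.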
T. 2-14
If f is a convex differentiable function with f ∗ , then f ∗∗ = f .
(Proof 1, for nice functions) Suppose we have f*(p) = p · x(p) − f(x(p)), where
x(p) satisfies p = ∇f(x(p)) and x(p) is differentiable. Differentiating with
respect to p, we have
Since f is convex, for fixed x ∈ Rn , there exist m ∈ Rn such that f (z) ≥ f (x) +
m · (z − x) for all z ∈ Rn . Note that m ∈ D(f ∗ ) since m · x − f (x) ≥ supz∈Rn {z ·
m − f (z)}. Combining these inequalities gives
f does not need to be strictly convex. For example, in our last example above with
the straight line, f*(p) = 0 for p = c. So f**(x) = (xp − f*(p))|_{p=c} = cx = f(x).
However, convexity is required. If f ∗∗ = f is true, then f must be convex, since it
is a Legendre transform. Hence f ∗∗ = f cannot be true for non-convex functions.
E. 2-15
<Application to thermodynamics> The first law of thermodynamics is
dE = T dS − P dV.
which states that a small change in the energy of a system in thermal equilibrium
at temperature T and pressure P is the sum of a heat energy term (T dS), due to
a change in the entropy S, and a mechanical work energy term (−P dV ), due to
a change in the volume V . The formula also shows that the total energy E is a
function of the two “extensive” variables (S, V ), so called because these variables
scale with the size of the system. This is in contrast to the “intensive” variables
(T, P ), which can be defined as (comparing with the chain rule)
T = ∂E/∂S,   P = −∂E/∂V   (∗)
For a process occurring at fixed entropy the first law becomes dE + P dV = 0,
which tells us that work done by the system will lead to a corresponding reduction
of its energy E. However, many processes of interest occur at fixed temperature,
not at fixed entropy, and in such cases it is more useful to consider (T, V ) as the
independent variables. We can arrange for this by taking the Legendre transform
2.1. MULTIVARIATE CALCULUS 49
of E(S, V) with respect to S (the volume variable V just goes along for the ride
here). We will call this new function −F(T, V), so
−F(T, V) = sup_S [T S − E(S, V)].
Strictly speaking, we do not yet know that the new independent variable T is the
temperature appearing in the first law. However, the maximum of the RHS w.r.t.
variations of S occurs when T = ∂E/∂S, and this is indeed the temperature as in (∗).
Solving T = ∂E/∂S gives S = S(T, V). So F(T, V) = E(S(T, V), V) − T S(T, V). It
then follows that
dF = (∂F/∂T) dT + (∂F/∂V) dV   and   dF = −S dT − T dS + dE = −S dT − P dV
⟹ S = −∂F/∂T,   P = −∂F/∂V
We now have an alternative version of the first law: dF = −SdT − P dV . For a
process at fixed T , this reduces to dF + P dV = 0, which tells us that work done by
the system at fixed T implies a corresponding reduction in F , which is therefore
the energy that is available to do work at fixed temperature; this is less than the
total energy E because F = E −T S and both T and S are positive. This “available
energy”, as it is sometimes called, is more usually called the Helmholtz free energy,
or just “free energy”. It is also possible to take the Legendre transform of E(S, V )
with respect to the volume V . This gives another “thermodynamic potential”
H(S, P), known as enthalpy:
H(S, P) = E(S, V(S, P)) + P V(S, P),
where V(S, P) is the solution of P = −∂E/∂V. We have
dH = (∂H/∂S) dS + (∂H/∂P) dP   and   dH = V dP + P dV + dE = V dP + T dS
⟹ T = ∂H/∂S,   V = ∂H/∂P
and it gives us another alternative version of the first law: dH = T dS + V dP .
Enthalpy is useful for chemistry because chemical reactions often take place at
fixed (atmospheric) pressure P. At fixed P we have dH = T dS, which tells us that
a transfer of heat to a substance raises its enthalpy by a corresponding amount.
Finally, if we take the Legendre transform with respect to both S and V , we get
the Gibbs free energy.
0 = df = ∇f · dx
We still need ∇f · dx = 0, but now dx is not arbitrary; we only consider the dx parallel
to the path. That is to say, ∇f has to be entirely perpendicular to the path. Since we
know that the normal to the path is ∇p, our condition becomes ∇f = λ∇p for some
λ. Of course, we still have the constraint p(x, y) = 0. So what we have to
solve is
∇f = λ∇p,   p = 0
for the three variables x, y, λ. We can change this into a single problem of unconstrained
extremization. We ask for the stationary points of the function φ(x, y, λ) (called the
Lagrangian ) given by
φ(x, y, λ) = f (x, y) − λp(x, y)
Just like the ordinary unconstrained maximization, we still have to determine by other
means which stationary point, if any, is the one we need. However, this is usually easy
to sort out, and a bonus is that the value of the Lagrange multiplier often has some
significance that aids understanding of the problem.
The method of Lagrange multipliers is easily extended to find the stationary points
of f : Rn → R subject to m < n constraints pk (x) = 0 (k = 1, · · · , m). In this
case we need m Lagrange multipliers, one for each constraint, and we extremise the
function
m
X
φ(x; λ1 , · · · , λm ) = f (x) − λk pk (x)
k=1
with respect to the n + m variables on which it depends. Similarly to the above, the
interpretation is that each pk(x) = 0 defines a surface in Rⁿ whose normal at x is
given by ∇pk(x). For x to be a solution of this constrained extremization, ∇f(x) must
be in the span of the ∇pk(x) (k = 1, · · · , m), ie. ∇f(x) = ∑_k λk ∇pk(x).
E. 2-16
Find the radius of the smallest circle centered on the origin that intersects y = x² − 1.
• λ = −1. So the second equation gives y = −1/2 and the third gives x = ±1/√2.
  Hence R = √3/2 is the minimum.
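The stationary points of the Lagrangian φ(x, y, λ) = x² + y² − λ(y − x² + 1) can be enumerated and compared directly. The following Python sketch (our own check of the example) works through the two branches of the condition 2x + 2λx = 0:

```python
# Stationarity of phi(x, y, l) = x^2 + y^2 - l*(y - x^2 + 1) gives
#   2x + 2*l*x = 0,  2y - l = 0,  and the constraint  y = x^2 - 1.
candidates = [(0.0, -1.0)]          # branch x = 0: y = -1 on the parabola
l = -1.0                            # branch l = -1
y = l / 2                           # from 2y - l = 0: y = -1/2
x = (y + 1) ** 0.5                  # from the constraint y = x^2 - 1
candidates += [(x, y), (-x, y)]

# every candidate lies on the parabola
assert all(abs(cy - (cx * cx - 1)) < 1e-9 for cx, cy in candidates)

radii = sorted((cx * cx + cy * cy) ** 0.5 for cx, cy in candidates)
# smallest radius sqrt(3)/2 ~ 0.866 at x = +-1/sqrt(2); the x = 0
# branch gives the larger stationary value R = 1.
```

This confirms that the λ = −1 branch, not the x = 0 branch, gives the minimum radius.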
E. 2-17
For x ∈ Rn , find the minimum of the quadratic form f (x) = xi Aij xj on the surface
|x|2 = 1.
E. 2-18
Find the probability distribution {p1, · · · , pn} satisfying ∑_i pi = 1 that maximizes
the information entropy S = −∑_{i=1}^n pi ln pi.
We look for stationary points of φ(p, λ) = −∑_{i=1}^n pi ln pi − λ(∑_{i=1}^n pi − 1).
These satisfy
∂φ/∂pi = −ln pi − (1 + λ) = 0.
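The stationarity condition forces every pi to equal the same constant e^{−(1+λ)}, so the constraint gives the uniform distribution pi = 1/n with S = ln n. A quick numerical check (our own, not from the notes) that the uniform distribution beats a perturbed one:

```python
from math import log

def entropy(p):
    """Information entropy S = -sum p_i ln p_i (0 ln 0 taken as 0)."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

n = 4
uniform = [1 / n] * n
# The stationarity condition -ln p_i - (1 + lambda) = 0 makes all p_i
# equal; any perturbation preserving sum p_i = 1 should lower S.
perturbed = [0.25 + 0.1, 0.25 - 0.1, 0.25, 0.25]
```

Here `entropy(uniform)` equals ln 4 and exceeds `entropy(perturbed)`, as expected for the entropy maximiser.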
E. 2-20
• We can have functionals of functions: F[x] ∈ R where x : R → R. We can also have
  functionals of many functions of many variables: F[x] ∈ R where x : Rⁿ → Rᵏ.
• Given a medium with refractive index n(x), the time taken by a path x(t) from
  x0 to x1 is given by the functional T[x] = ∫_{x0}^{x1} n(x) dt. We might want to ask
  questions like: what path minimises the time taken? In which case we need to find
  the “stationary point” of the functional.
Our objective is to find a stationary point of the functional F [x]. To do so, suppose we
vary x(t) by a small amount δx(t). The corresponding change δF [x] of F [x] is
δF[x] = F[x + δx] − F[x] = ∫_α^β [f(x + δx, ẋ + δẋ, t) − f(x, ẋ, t)] dt
      = ∫_α^β [(∂f/∂x) δx + (∂f/∂ẋ) δẋ] dt + o(δ²)   (Taylor expand)
      = ∫_α^β [∂f/∂x − (d/dt)(∂f/∂ẋ)] δx dt + [(∂f/∂ẋ) δx]_α^β   (integration by parts)
Usually we have boundary conditions so that the boundary term [(∂f/∂ẋ) δx]_α^β vanishes.
There are three ways:
1. Fixed-end boundary conditions: we specify the values of x(α) and x(β), so δx(α) =
   δx(β) = 0.
2. Free-end (or “natural”) boundary conditions: these are such that ∂f/∂ẋ = 0 at the
   integration endpoints. Usually, this can be the case if we set ẋ(α) = ẋ(β) = 0.
3. Mixed boundary conditions: fixed at one end and free at the other.
Now we write
δF[x] = ∫_α^β (δF[x]/δx(t)) δx dt   where   δF[x]/δx = ∂f/∂x − (d/dt)(∂f/∂ẋ)
We call δF[x]/δx the functional derivative of F[x] with respect to x(t). If we want
to find a stationary point of F, then we need δF[x]/δx = 0. So
<Euler-Lagrange equation>   ∂f/∂x − (d/dt)(∂f/∂ẋ) = 0,   for α ≤ t ≤ β.
E. 2-21
<Geodesics of a plane> What is the curve C of minimal length between two
points A, B in the Euclidean plane? The length is L = ∫_C dℓ where dℓ =
√(dx² + dy²). There are two ways we can do this:
1. We restrict to curves for which x (or y) is a good parameter, ie. y can be made
   a function of x. Then dℓ = √(1 + (y′)²) dx, so
   L[y] = ∫_α^β √(1 + (y′)²) dx.
We have, again, ∂f/∂x = ∂f/∂y = 0. So
(d/dt)(∂f/∂ẋ) = (d/dt)(∂f/∂ẏ) = 0   ⟹   ẋ/√(ẋ² + ẏ²) = c,   ẏ/√(ẋ² + ẏ²) = s
where c and s are constants. While we have two constants, they are not
independent. We must have c² + s² = 1. So we let c = cos θ, s = sin θ.
Then the two conditions are both equivalent to (ẋ sin θ)² = (ẏ cos θ)². Hence
ẋ sin θ = ±ẏ cos θ. We can choose a θ such that we have a positive sign. So
y cos θ = x sin θ + A for a constant A. This is a straight line with slope tan θ.
We call this the first integral. First integrals are important in several ways. Firstly,
they simplify the problem a lot: we only have to solve a first-order differential equation,
instead of a second-order one. Not needing to differentiate ∂f/∂ẋ also prevents a lot of
mess arising from the product and quotient rules.
Also, in physics, if we have a first integral, then we get ∂f/∂ẋ = constant. This
corresponds to a conserved quantity of the system. When formulating physics problems as
variational problems, the conservation of energy and momentum will arise as constants
of integration from first integrals.
There is also a more complicated first integral appearing when f does not explicitly
depend on t (ie. ∂f/∂t = 0). Consider the total derivative df/dt; by the chain rule, we
have
df/dt = ∂f/∂t + (∂f/∂x)(dx/dt) + (∂f/∂ẋ)(dẋ/dt) = ∂f/∂t + ẋ ∂f/∂x + ẍ ∂f/∂ẋ.
On the other hand, the Euler-Lagrange equation says that ∂f/∂x = (d/dt)(∂f/∂ẋ). Substituting
this into our equation for the total derivative gives
df/dt = ∂f/∂t + ẋ (d/dt)(∂f/∂ẋ) + ẍ ∂f/∂ẋ = ∂f/∂t + (d/dt)(ẋ ∂f/∂ẋ)   ⟹   (d/dt)(f − ẋ ∂f/∂ẋ) = ∂f/∂t.
So if ∂f/∂t = 0, then we have the first integral
f − ẋ ∂f/∂ẋ = constant.
E. 2-22
Find the path a light ray travels in the vertical xz-plane inside a medium with
refractive index n(z) = √(a − bz) for positive constants a, b. (The velocity of light
in the medium is v = c/n.)
Fermat’s principle states that the path the light takes from A to B is one that minimizes
the time taken (ie. minimises T = ∫_A^B dℓ/v). This is equivalent to minimizing
the optical path length P = cT = ∫_A^B n dℓ. We specify our path by the function
z(x). Then the path element is given by dℓ = √(dx² + dz²) = √(1 + z′(x)²) dx, so
P[z] = ∫_{xA}^{xB} n(z) √(1 + (z′)²) dx.
For a particle near the surface of the Earth, under the influence of gravity, U =
mgz. So we have
A[z] = ∫_A^B √(2mE − 2m²gz) √(1 + (z′)²) dx,
which is of exactly the same form as the optics problem we just solved. So the
result is again a parabola.
E. 2-23
<Brachistochrone> A bead slides on a frictionless wire in a vertical plane.
What shape of the wire minimises the time for the bead to fall from rest at point
A to a lower, and horizontally displaced, point B?
The conservation of energy implies that ½mv² = mgy. So v = √(2gy). We want to
minimize T = ∫ dℓ/v. So
T = ∫ √(dx² + dy²)/√(2gy) = (1/√(2g)) ∫ √((1 + (y′)²)/y) dx
Since there is no explicit dependence on x, we have the first integral
f − y′ ∂f/∂y′ = 1/√(y(1 + (y′)²)) = constant
The Brachistochrone problem was one of the earliest problems in the calculus of
variations. The name comes from the Greek words brákhistos (“shortest”) and
khrónos (“time”).
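The solution of the first integral above is a cycloid x = r(t − sin t), y = r(1 − cos t). Along it, y(1 + (y′)²) = r[(1 − cos t)² + sin² t]/(1 − cos t) = 2r, so the first integral is exactly constant. The Python sketch below (our own verification) evaluates the first integral at several parameter values:

```python
from math import cos, sin, sqrt

r = 1.0  # cycloid parameter; curve: x = r*(t - sin t), y = r*(1 - cos t)

def first_integral(t):
    """f - y' * df/dy' = 1 / sqrt(y * (1 + y'^2)) along the cycloid."""
    y = r * (1 - cos(t))
    dydx = sin(t) / (1 - cos(t))   # chain rule: (dy/dt) / (dx/dt)
    return 1 / sqrt(y * (1 + dydx ** 2))

# Along the cycloid, y*(1 + y'^2) = 2r identically, so the first
# integral equals the constant 1/sqrt(2r) at every parameter value.
values = [first_integral(t) for t in (0.5, 1.0, 2.0, 3.0)]
```

All entries of `values` agree with 1/√(2r) to machine precision, confirming that the cycloid satisfies the first integral.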
stationary points of F[x] subject to the constraint P[x] = c, for some constant c, we
may extremize, without constraint, the functional
Φ[x, λ] = F[x] − λ(P[x] − c)
with respect to the function x(t) and the variable λ. Assuming that the boundary
term in the variation is zero, this yields the equations
δF/δx(t) − λ δP/δx(t) = 0,   P[x] = c.
−(ρy 0 )0 + σy − λwy = 0.
Λ[y] = F[y]/G[y]
instead. It turns out that this Λ has some significance. To minimize Λ, we cannot
apply the Euler-Lagrange equations, since Λ is not of the form of an integral.
However, we can try to vary it directly:
δΛ = (F + δF)(G + δG)⁻¹ − F/G ≈ (1/G) δF − (F/G²) δG = (1/G)(δF − Λ δG).
When Λ is minimized, we have
δΛ = 0  ⟺  δF/δy = Λ δG/δy  ⟺  Ly = Λwy.
with respect to the functions θ(t) and φ(t). If θ happened to be a good parameter
for the curve, we can use it instead of the extra variable t.
Alternatively, we can impose the condition g(x(t)) = 0 with a Lagrange multiplier.
Then our problem would be to find stationary values of
Φ[x, λ] = ∫₀¹ [ |ẋ| − λ(t) g(x(t)) ] dt
E. 2-29
<Time-independent potential> Lagrangian mechanics applies even when V is
time-dependent. However, if V is independent of time, then so is L (ie. ∂L/∂t = 0).
Then we can obtain a first integral. The chain rule gives
dL/dt = ∂L/∂t + ẋ · ∂L/∂x + ẍ · ∂L/∂ẋ
      = ∂L/∂t + ẋ · [∂L/∂x − (d/dt)(∂L/∂ẋ)] + ẋ · (d/dt)(∂L/∂ẋ) + ẍ · ∂L/∂ẋ
where the bracket vanishes by the Euler-Lagrange equation. Hence
(d/dt)(L − ẋ · ∂L/∂ẋ) = ∂L/∂t = 0   ⟹   ẋ · ∂L/∂ẋ − L = E
for some constant E. For example, for one particle, the constant of motion is the
total energy E = T + V :
1
E = m|ẋ|2 − m|ẋ|2 + V = T + V = total energy.
2
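This conservation law is easy to verify numerically. The sketch below (m, k, the integrator and the step size are assumed choices) integrates m ẍ = −V′(x) for the harmonic potential V(x) = ½kx² and checks that E = ẋ·∂L/∂ẋ − L stays constant:

```python
import math

# Integrate m ẍ = -V'(x) for V(x) = ½ k x² with RK4 and verify that
# E = ½ m ẋ² + V(x) is (numerically) a constant of the motion.
m, k = 1.0, 4.0  # assumed mass and spring constant

def accel(x):
    return -k * x / m

def rk4_step(x, v, dt):
    k1x, k1v = v, accel(x)
    k2x, k2v = v + 0.5 * dt * k1v, accel(x + 0.5 * dt * k1x)
    k3x, k3v = v + 0.5 * dt * k2v, accel(x + 0.5 * dt * k2x)
    k4x, k4v = v + dt * k3v, accel(x + dt * k3x)
    return (x + dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6,
            v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6)

def energy(x, v):
    return 0.5 * m * v * v + 0.5 * k * x * x

x, v = 1.0, 0.0
E0 = energy(x, v)
for _ in range(2000):
    x, v = rk4_step(x, v, 0.005)
drift = abs(energy(x, v) - E0)  # should be tiny up to integrator error
```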
E. 2-30
<Central force fields> Consider a central force field F = −∇V, where V = V(r) is independent of time. We use spherical polar coordinates (r, θ, φ). We'll use the fact that motion is planar (a consequence of angular momentum conservation). So wlog θ = π/2. In this case z = 0 and (x, y) = r(cos φ, sin φ), so the Lagrangian is

    L = ½mṙ² + ½mr²φ̇² − V(r).

The Euler-Lagrange equations give

    mr̈ − mrφ̇² + V′(r) = 0   and   d/dt (mr²φ̇) = 0.

The second equation says mr²φ̇ is constant; write it as mh. Substituting into the first equation,

    mr̈ − mh²/r³ + V′(r) = 0  ⟹  mr̈ = −V′eff(r)   where   Veff(r) = V(r) + mh²/(2r²).
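The effective potential can be probed numerically. For the Kepler case V(r) = −k/r (an assumed example, with arbitrary m, k, h), Veff is minimised at r = mh²/k, the circular-orbit radius, which a finite-difference check confirms:

```python
import math

# Effective potential V_eff(r) = V(r) + m h²/(2r²) for V(r) = -k/r.
# Its minimum (the circular orbit) should sit at r = m h² / k.
m, k, h = 1.0, 1.0, 0.8  # assumed values

def v_eff(r):
    return -k / r + m * h * h / (2.0 * r * r)

r_star = m * h * h / k  # predicted circular-orbit radius

eps = 1e-6
deriv = (v_eff(r_star + eps) - v_eff(r_star - eps)) / (2 * eps)  # ≈ V_eff'(r*)
```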
Since the Legendre transform is self-inverse and the Lagrangian is convex with respect to ẋ, the Lagrangian is the Legendre transform of the Hamiltonian with respect to p:

    L(x, ẋ, t) = p(ẋ)·ẋ − H(x, p(ẋ), t)

This is the phase-space form of the action. The Euler-Lagrange equations for it are known as Hamilton's equations:

    ẋ = ∂H/∂p,    ṗ = −∂H/∂x

This is the same as the original action because variation with respect to p turns the integrand back into L(x, ẋ, t), so variation with respect to x then gives the stationary points of the original action.

Using the Hamiltonian, the Euler-Lagrange equations put x and p on a much more equal footing, and the equations are more symmetric. Solving Hamilton's equations yields a trajectory in phase space parametrized by positions and momenta, which are said to be canonically conjugate to each other.
E. 2-31
So what does the Hamiltonian look like? Consider the case of a single particle. The Lagrangian is given by

    L = ½m|ẋ|² − V(x, t)  ⟹  p = ∂L/∂ẋ = mẋ
      ⟹  H(x, p) = p·(p/m) − ( ½m|p/m|² − V(x, t) ) = |p|²/(2m) + V.

So p happens to coincide with the usual definition of the momentum. However, the conjugate momentum is often something more interesting when we use generalized coordinates. For example, in polar coordinates, the conjugate momentum of the angle is the angular momentum. Also, H is the total energy, but expressed in terms of x and p, not x and ẋ.
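The self-inverse property of the Legendre transform can be illustrated numerically. The sketch below (m, V0 and the grid are assumed illustrative values) discretizes the supremum H(p) = sup_v (pv − L(v)) and applies the transform twice, recovering both H = p²/(2m) + V0 and the original L:

```python
# Numerical Legendre transform on a grid: for L(v) = ½ m v² - V0,
# H(p) = sup_v (p v - L(v)) = p²/(2m) + V0, and transforming H again gives L.
m, V0 = 2.0, 0.3  # assumed mass and (constant) potential value

def L(v):
    return 0.5 * m * v * v - V0

grid = [(i - 250) * 0.02 for i in range(501)]  # sample points in [-5, 5]

def legendre(f, s):
    return max(s * u - f(u) for u in grid)

p = 1.2
H_p = legendre(L, p)                               # expect p²/(2m) + V0 = 0.66
L_back = legendre(lambda q: legendre(L, q), 0.6)   # expect L(0.6) = 0.06
```

Both maxima are attained at grid points (v = p/m = 0.6 and p = 1.2 respectively), so the discretization is essentially exact here.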
D. 2-32
Given F[x] = ∫_α^β f(x, ẋ, t) dt, suppose we change variables by the transformation t ↦ t*(t) and x ↦ x*(t*). Then we have a new independent variable and a new function, and

    F*[x*] = ∫_{α*}^{β*} f(x*, ẋ*, t*) dt*

with α* = t*(α) and β* = t*(β). If F*[x*] = F[x] for all x, α and β, then we say the transformation * is a symmetry. Symmetries may be discrete or continuous. The transformations of a continuous symmetry depend on a parameter ε ∈ R such that t*(t) = t and x*(t*) = x(t) when ε = 0.
E. 2-33
A transformation could be a translation of time or space, a rotation, or something fancier. The exact symmetries of F depend on the form of f. For example, if f only depends on the magnitudes of x, ẋ and t, then rotation of space will be a symmetry.
E. 2-34
1. Consider the transformation t ↦ t and x ↦ x + ε for some ε. Then

       F*[x*] = ∫_α^β f(x + ε, ẋ, t) dt

   which equals F[x] for all ε iff ∂f/∂x = 0. Hence this is a symmetry (space translation invariance) iff ∂f/∂x = 0, and we already know that in this case the Euler-Lagrange equation gives the conserved quantity ∂f/∂ẋ.

2. Similarly, t ↦ t + ε, x*(t*) = x(t) is a symmetry (time translation invariance) iff ∂f/∂t = 0. We already know that if ∂f/∂t = 0, then we obtain a first integral with conserved quantity f − ẋ ∂f/∂ẋ.

We see that for each simple continuous symmetry we have above, we can obtain a first integral, which then gives a constant of motion. Noether's theorem is a powerful generalization of this.
T. 2-35
<Noether’s theorem> Every continuous symmetry of an action I[x] is asso-
ciated with a corresponding first integral, and hence a constant of the motion
(conserved quantity).
    I*[x*] = I[x] + ε[ ξL + h·∂L/∂ẋ ]_α^β + ε ∫_α^β h·( ∂L/∂x − d/dt ∂L/∂ẋ ) dt

Recalling that h = ζ − ẋξ, the boundary term is [εQ]_α^β, so

    I*[x*] − I[x] = [εQ]_α^β + ε ∫_α^β h·( ∂L/∂x − d/dt ∂L/∂ẋ ) dt

        where   Q = ξ( L − ẋ·∂L/∂ẋ ) + ζ·∂L/∂ẋ    (∗)

    ⟹  I*[x*] − I[x] = ε ∫_α^β ( Q̇ + h·( ∂L/∂x − d/dt ∂L/∂ẋ ) ) dt

Note that ε is a constant. The LHS of the last line is 0 for a symmetry transformation for all α and β, so

    Q̇ = −h·( ∂L/∂x − d/dt ∂L/∂ẋ ).

The RHS of this expression is 0 when the Euler-Lagrange equation is satisfied, in which case Q̇ = 0. This is the first integral promised by the theorem, and Q is the constant of motion associated to the symmetry of the action. (For a symmetry, Q is constant as a consequence of the Euler-Lagrange equations.)

Note that continuity is essential. For example, if f is quadratic in x and ẋ, then x ↦ −x is a symmetry. But since it is not continuous, there is no associated conserved quantity.

The proof given generalises to functionals of several functions, and also (in modified form) to functionals of several independent variables (which was the original context).
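A Noether charge can be watched along a numerically integrated trajectory. The sketch below (assumed parameters and a hand-rolled midpoint integrator, not from the notes) takes L = ½m(ẋ² + ẏ²) − V(r) with V central; rotations are then a continuous symmetry with ξ = 0 and ζ = (−y, x), so the charge from (∗) is Q = m(xẏ − yẋ), the angular momentum:

```python
import math

# Noether check: for a central potential V(r) = -1/r, the rotation symmetry
# gives the conserved charge Q = m (x ẏ - y ẋ), the angular momentum.
m = 1.0  # assumed mass

def acc(x, y):
    r = math.hypot(x, y)
    return -x / r ** 3, -y / r ** 3  # force -∇V for V(r) = -1/r

def Q(x, y, vx, vy):
    return m * (x * vy - y * vx)

x, y, vx, vy = 1.0, 0.0, 0.0, 0.9
q0 = Q(x, y, vx, vy)
dt = 1e-4
for _ in range(20000):  # midpoint (RK2) integration of the motion
    ax, ay = acc(x, y)
    xm, ym = x + 0.5 * dt * vx, y + 0.5 * dt * vy
    vxm, vym = vx + 0.5 * dt * ax, vy + 0.5 * dt * ay
    axm, aym = acc(xm, ym)
    x, y = x + dt * vxm, y + dt * vym
    vx, vy = vx + dt * axm, vy + dt * aym
drift = abs(Q(x, y, vx, vy) - q0)  # conserved up to integrator error
```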
L. 2-36
Given I[x] = ∫_α^β L(x, ẋ, t) dt, the transformation t*(t) = t + εξ(t), x*(t*) = x(t) + εζ(t) is a symmetry if

    ξ ∂L/∂t + ξ̇ ( L − ẋ·∂L/∂ẋ ) + ζ·∂L/∂x + ζ̇·∂L/∂ẋ = 0
E. 2-39
<Application to Hamiltonian mechanics> We can apply this to Hamiltonian mechanics. The motion of a particle is the stationary point of

    S[x, p] = ∫ ( p·ẋ − H(x, p) ) dt,   where   H = |p|²/(2m) + V(x, t).

1. Space translation invariance: Suppose the potential is position-independent. Since the action then depends only on ẋ (or p) and not on x itself, it is invariant under the translation x ↦ x + ε, p ↦ p. For general ε that can vary with time, we have

       δS = S*[x*, p*] − S[x, p] = ∫ ( ( p·(ẋ + ε̇) − H(p) ) − ( p·ẋ − H(p) ) ) dt = ∫ p·ε̇ dt.
    rz″ + z′ + (z′)³ = 0.
Suppose we pull the line between x = 0 and x = a with some tension T. Then we set it into motion such that the amplitude is given by y(x; t). Then the kinetic energy is

    T = ½ ∫_0^a ρv² dx = (ρ/2) ∫_0^a ẏ² dx.

The potential energy is the tension times the length. So

    V = T ∫ dℓ = T ∫_0^a √(1 + (y′)²) dx ≈ Ta + ∫_0^a ½T(y′)² dx.

Note that y′ is the derivative with respect to x while ẏ is the derivative with respect to time. The Ta term can be seen as the ground-state energy: it is the energy initially stored if there is no oscillation. Since this constant term doesn't affect where the stationary points lie, we will ignore it. Then the action is given by

    S[y] = ∫∫_0^a ( ½ρẏ² − ½T(y′)² ) dx dt

and the Euler-Lagrange equation gives

    ÿ − v²y″ = 0,   where v² = T/ρ.

This is the wave equation in two dimensions. Note that this is a linear PDE, which is a simplification resulting from our assuming the oscillation is small. The general solution to the wave equation is y(x, t) = f(x − vt) + g(x + vt) for arbitrary functions f and g: a superposition of a right-moving and a left-moving wave.
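A quick check of d'Alembert's general solution y = f(x − vt) + g(x + vt) (here with arbitrary assumed profiles f and g): both second derivatives are approximated by central differences and the wave-equation residual should vanish up to discretization error:

```python
import math

# Verify ÿ - v² y″ = 0 for y(x,t) = f(x - vt) + g(x + vt) using central
# second differences in t and x.
v = 2.0
f = lambda s: math.exp(-s * s)  # assumed smooth profile
g = lambda s: math.sin(s)       # assumed smooth profile

def y(x, t):
    return f(x - v * t) + g(x + v * t)

def second(F, s, h=1e-3):  # central second-difference approximation
    return (F(s + h) - 2.0 * F(s) + F(s - h)) / (h * h)

x0, t0 = 0.3, 0.7
ytt = second(lambda t: y(x0, t), t0)
yxx = second(lambda x: y(x, t0), x0)
residual = abs(ytt - v * v * yxx)  # ≈ 0
```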
E. 2-42
<Maxwell’s equations> It is possible to obtain Maxwell’s equations from an
action principle, where we define a Lagrangian for the electromagnetic field. Note
that this is the Lagrangian for the field itself, and there is a separate Lagrangian
for particles moving in a field.
Let ρ represent the electric charge density and J the electric current density. We have the potentials: φ is the electric scalar potential and A is the magnetic vector potential. And we have the fields: E = −∇φ − Ȧ is the electric field, and B = ∇ × A is the magnetic field.
2. A sufficient (but not necessary) condition is ρ(t) > 0 and σ(t) ≥ 0, because in this case δ²F[x₀, ξ] > 0 for all allowed ξ (as before, ξ̇ cannot vanish everywhere).

The intuition behind the Legendre condition is as follows: suppose that ρ(t) is negative in some interval I ⊆ [α, β]. Then we can find a ξ(t) that makes δ²F[x₀, ξ] negative. We simply have to make ξ zero outside I, and small but crazily oscillating inside I. Then inside I, ξ̇² will be very large while ξ² is kept tiny. So we can make δ²F[x₀, ξ] arbitrarily negative.
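This oscillation argument is easy to see numerically. The sketch below (assumed constant coefficients ρ = −1, σ = 1 on [0, 1]) evaluates δ²F = ∫(ρξ̇² + σξ²) dt for ξₙ(t) = sin(nπt), which vanishes at both endpoints; faster oscillation makes the value more and more negative:

```python
import math

# δ²F = ∫₀¹ (ρ ξ̇² + σ ξ²) dt with ρ = -1, σ = 1 and ξ_n(t) = sin(nπt).
# Exactly, δ²F = (1 - n²π²)/2, so it goes to -∞ as the oscillation speeds up.
def second_variation(n, steps=2000):
    total = 0.0
    dt = 1.0 / steps
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoint rule
        xi = math.sin(n * math.pi * t)
        xidot = n * math.pi * math.cos(n * math.pi * t)
        total += (-1.0 * xidot ** 2 + 1.0 * xi ** 2) * dt
    return total

vals = [second_variation(n) for n in (1, 2, 5)]
```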
E. 2-43
• In Geodesics of a plane [E.2-21] we showed that a straight line is a stationary point of the curve-length functional, but we didn't show it is in fact the shortest distance. Recall that f = √(1 + (y′)²). Then

      ∂f/∂y = 0,   ∂f/∂y′ = y′/√(1 + (y′)²),   ∂²f/∂y′² = 1/(1 + (y′)²)^(3/2),

  so ρ = 1/(1 + (y′)²)^(3/2) > 0 and σ = 0, giving δ²F[y, ξ] ∝ ∫ ξ̇²/(1 + (y′)²)^(3/2) dx ≥ 0. This is zero for constant ξ, but the only constant permitted by the boundary conditions is zero. So δ²F[y, ξ] is positive for non-zero allowed ξ, and this is true for any allowed y, so a straight line really does minimise the distance between two points.
  where x is always positive. The cycloid (at least locally) minimizes the time T because

      ρ(t) = 1/√( x(1 + ẋ²)³ ) > 0   and   σ(t) = 1/( 2x²√(x(1 + ẋ²)) ) > 0.

• Consider F[y] = ∫_{−1}^{1} x√(1 + (y′)²) dx. In this case

      ∂²f/∂y′² = x(1 + (y′)²)^(−3/2)
Jacobi condition
Legendre tried to prove that ρ > 0 is a sufficient condition for δ 2 F > 0. This is
known as the strong Legendre condition . However, he failed, since it is not a sufficient
condition. Yet, it turns out that he was close.
Thinking ρ > 0 is sufficient isn't as crazy as it first sounds. If ρ > 0 and σ < 0, we would want to create a negative δ²F[x₀, ξ] by choosing ξ to be large but slowly varying. Then we would have a very negative σ(t)ξ² and only a small positive ρ(t)ξ̇². But ξ has to be 0 at the end points α and β, so if β − α is small, then ξ̇ cannot be small.
Now we show that with an extra condition on top of this, we get a sufficient condition. Assume ρ(t) > 0 for α < t < β (the strong Legendre condition) and assume boundary conditions ξ(α) = ξ(β) = 0. First of all, notice that for any smooth function w(t), we have

    0 = w(β)ξ(β)² − w(α)ξ(α)² = ∫_α^β (wξ²)′ dt = ∫_α^β ( 2wξξ̇ + ẇξ² ) dt.

Adding this zero to the second variation gives

    δ²F[x₀, ξ] = ∫_α^β ( ρξ̇² + 2wξξ̇ + (σ + ẇ)ξ² ) dt = ∫_α^β ρ ( ξ̇ + (w/ρ)ξ )² dt,

provided w² = ρ(σ + ẇ). This is non-negative, with equality only when ξ̇ = −(w/ρ)ξ, ie. ξ = C exp(−∫_α^t (w/ρ) ds). But 0 = ξ(α) = Ce⁰, so C = 0. Hence equality holds only for ξ = 0, which is not an allowed ξ. So if we can find a solution to w² = ρ(σ + ẇ), we know that δ²F > 0.

The equation w² = ρ(σ + ẇ) is non-linear in w. We can convert it into a linear equation by defining w in terms of a new function u via w = −ρu̇/u. Then it becomes

    ρ (u̇/u)² = σ + ẇ = σ − (ρu̇)′/u + ρ (u̇/u)²  ⟹  −(ρu̇)′ + σu = 0.

This is called the Jacobi accessory equation. If we can find a solution u(t) of it such that u(t) ≠ 0 for all t ∈ [α, β], then we have δ²F > 0 for all allowed ξ. A suitable solution will always exist for sufficiently small β − α, but may not exist if β − α is too large.
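The last remark can be illustrated for the simplest case ρ = 1, σ = −1 (assumed values). The accessory equation is then ü = −u, whose solutions u(t) = A cos(t − φ) have zeros spaced π apart, so a non-vanishing solution on [0, T] exists iff T < π. The sketch below checks this by scanning phases φ:

```python
import math

# Jacobi accessory equation for ρ = 1, σ = -1: ü = -u, u(t) = A cos(t - φ).
# A solution with u ≠ 0 on all of [0, T] exists iff T < π.
def has_nonvanishing_solution(T, phases=2000, samples=500):
    pts = [i * T / samples for i in range(samples + 1)]
    for k in range(phases):
        phi = 2 * math.pi * k / phases
        # u is non-vanishing on [0, T] iff cos(t - φ) avoids 0 there
        if all(abs(math.cos(t - phi)) > 5e-3 for t in pts):
            return True
    return False

ok_short = has_nonvanishing_solution(3.0)  # 3.0 < π: a solution exists
ok_long = has_nonvanishing_solution(3.3)   # 3.3 > π: every solution vanishes
```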
E. 2-44
<Geodesics on unit sphere> For any curve C on the sphere, we have

    L = ∫_C √( dθ² + sin²θ dφ² )   or   L[θ] = ∫_{φ₁}^{φ₂} √( (θ′)² + sin²θ ) dφ

if φ is a good parameter. Assuming this, we have f(θ, θ′) = √((θ′)² + sin²θ). So

    ∂f/∂θ = sin θ cos θ / √( (θ′)² + sin²θ ),    ∂f/∂θ′ = θ′ / √( (θ′)² + sin²θ ).

Since ∂f/∂φ = 0, we have the first integral

    const = f − θ′ ∂f/∂θ′ = sin²θ / √( (θ′)² + sin²θ )  ⟹  c sin²θ = √( (θ′)² + sin²θ ).

Evaluated on the equatorial great circle θ = π/2, θ′ = 0 (a stationary curve), the second derivatives are

    ∂²f/∂(θ′)² = 1,    ∂²f/∂θ² = −1,    ∂²f/∂θ∂θ′ = 0.
¹ Here almost everything we work with will be a vector; for convenience we will not bold them.
² Here the meaning of ≤ is component-wise, ie. a ≤ b means ai ≤ bi for all i. This will be the case for the rest of the chapter.
E. 3-3
Minimise −(x1 + x2) subject to x1 + 2x2 ≤ 6, x1 − x2 ≤ 3 and x1, x2 ≥ 0.
E. 3-4
Recall that the stationary point of a convex function (on a convex set) is the global
minimum. It is easy to see that in the case of linear programming, the feasible
set is convex and the objective function is both convex and concave. However the
above result cannot generally be used to solve constrained optimization problems,
because the gradient might not be zero anywhere on the feasible set. Instead we use Lagrange multipliers.
T. 3-5
<Lagrangian sufficiency> Consider (∗) of [D.3-1]. Let L(x, λ) = f(x) − λᵀ(h(x) − b) for λ ∈ Rm (its Lagrangian). If x* ∈ X and λ* ∈ Rm are such that L(x*, λ*) = inf_{x∈X} L(x, λ*) and h(x*) = b, then x* is an optimal solution for (∗).
We have

    min_{x∈X(b)} f(x) = min_{x∈X(b)} ( f(x) − λ*ᵀ(h(x) − b) )    [since h(x) − b = 0 for all x ∈ X(b)]
                      ≥ inf_{x∈X} ( f(x) − λ*ᵀ(h(x) − b) ) = L(x*, λ*) = f(x*),

and since x* ∈ X(b), the minimum is attained at x*.

This result says that if x* minimizes L for some fixed λ*, and x* satisfies the constraints, then x* minimizes f.

This result is powerful and useful because any solution found this way is definitely an optimal solution (not just a stationary point). However, it is not a necessary condition for optimality. For example, consider f(x, y) = −1/(1 + x² + y²) subject to x = 1. The optimal solution is clearly f(1, 0) = −1/2. But L((1, 0), λ) ≠ inf_{(x,y)∈R²} ( −1/(1 + x² + y²) − λ(x − 1) ) for any λ ∈ R.
E. 3-6
Minimise x1 − x2 − 2x3 subject to x1 + x2 + x3 = 5 and x1² + x2² = 4.
The Lagrangian is

    L(x, λ) = x1 − x2 − 2x3 − λ1(x1 + x2 + x3 − 5) − λ2(x1² + x2² − 4)
            = ( (1 − λ1)x1 − λ2x1² ) + ( (−1 − λ1)x2 − λ2x2² ) + (−2 − λ1)x3 + 5λ1 + 4λ2
We want to pick a λ∗ and x∗ such that L(x∗ , λ∗ ) = inf x∈X L(x, λ∗ ). Then in
particular, for our λ∗ , L(x, λ∗ ) must have a finite minimum.
We note that (−2 − λ1 )x3 does not have a finite minimum unless λ1 = −2, since
x3 can take any value. Also, the terms in x1 and x2 do not have a finite mini-
mum unless λ2 < 0. With these in mind, we find a minimum by setting all first
derivatives to be 0:
    ∂L/∂x1 = 1 − λ1 − 2λ2x1 = 3 − 2λ2x1
    ∂L/∂x2 = −1 − λ1 − 2λ2x2 = 1 − 2λ2x2

Since these must both be 0, we must have x1 = 3/(2λ2) and x2 = 1/(2λ2). To show that this is indeed a minimum, we look at the Hessian matrix:

    H(L) = ( −2λ2    0
              0    −2λ2 )

which is positive definite when λ2 < 0, so it's a global minimum.

Let Y = {λ ∈ R² : λ1 = −2, λ2 < 0} be our helpful values of λ. For every λ ∈ Y, L(x, λ) has a minimum at x(λ) = (3/(2λ2), 1/(2λ2), x3)ᵀ (for any x3). Now all we have to do is find λ and x such that x(λ) satisfies the functional constraints. The second constraint gives

    x1² + x2² = 9/(4λ2²) + 1/(4λ2²) = 4  ⟺  λ2 = −√(5/8).

The first constraint gives x3 = 5 − x1 − x2. So [T.3-5] implies that the following is an optimal solution:

    (x1, x2, x3) = ( −3√(2/5), −√(2/5), 5 + 4√(2/5) )
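The optimum found by Lagrange multipliers above can be double-checked by brute force. In the sketch below (the parametrisation is an assumed device used only for checking), the feasible set x1² + x2² = 4, x1 + x2 + x3 = 5 is swept with x1 = 2cos θ, x2 = 2sin θ, x3 = 5 − x1 − x2:

```python
import math

# Brute-force check of the minimiser of x1 - x2 - 2x3 on the feasible set
# {x1² + x2² = 4, x1 + x2 + x3 = 5}: scan the circle parametrically.
s = math.sqrt(2.0 / 5.0)
x_star = (-3 * s, -s, 5 + 4 * s)                  # Lagrange-multiplier answer
f_star = x_star[0] - x_star[1] - 2 * x_star[2]    # = -10 - 2√10

def f(th):
    x1, x2 = 2 * math.cos(th), 2 * math.sin(th)
    x3 = 5 - x1 - x2
    return x1 - x2 - 2 * x3

best = min(f(2 * math.pi * k / 100000) for k in range(100000))
```

No feasible point beats the multiplier solution, as Lagrangian sufficiency [T.3-5] guarantees.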
C. 3-7
In general to minimize f (x) subject to h(x) ≤ b, x ∈ X, we can proceed as follows:
1. Introduce slack variables to obtain the equivalent problem, to minimize f (x)
subject to h(x) + z = b, x ∈ X, z ≥ 0.
2. Compute the Lagrangian L(x, z, λ) = f(x) − λᵀ(h(x) + z − b).
3. Find Y = {λ : inf x∈X,z≥0 L(x, z, λ) > −∞}.
4. For each λ ∈ Y , minimize L(x, z, λ). That is, find x∗ (λ) ∈ X, z ∗ (λ) ≥ 0 such
that L(x∗ (λ), z ∗ (λ), λ) = inf x∈X,z≥0 L(x, z, λ).
5. Find λ∗ ∈ Y such that h(x∗ (λ∗ )) + z ∗ (λ∗ ) = b.
Then by [T.3-5], x∗ (λ∗ ) is optimal for the constrained problem.
It is worth pointing out we have a property known as complementary slackness .
If we introduce a slack variable z, changing the value of zj does not affect our
objective function, and we are allowed to pick any non-negative z. For each j
we must have (z*(λ))j λj = 0, because by definition z*(λ)j minimizes −zj λj, so if zj λj ≠ 0, we could tweak the value of zj to make −zj λj smaller. This property makes our life easier since our search space is smaller. Note this also means (h(x*(λ*)) − b)j λ*j = 0 for each j.
E. 3-8
Consider the following problem: maximize x1 − 3x2 subject to x1² + x2² + z1 = 4, x1 + x2 + z2 = 2 and z1, z2 ≥ 0, where z1, z2 are slack variables.
We have φ(b) = f(x*(b)) − λ*(b)ᵀ(h(x*(b)) − b), so (with the summation convention)

    ∂φ(b)/∂bi = ( ∂f/∂xj(x*(b)) − λ*(b)ᵀ ∂h/∂xj(x*(b)) ) ∂x*j(b)/∂bi
                − ( ∂λ*(b)ᵀ/∂bi ) (h(x*(b)) − b) + λ*(b)·∂b/∂bi
              = [ ∂/∂xj ( f(x) − λ*(b)ᵀ(h(x) − b) ) ]|_{x=x*(b)} ∂x*j(b)/∂bi + λ*i(b) = λ*i(b),

where the middle term vanishes because h(x*(b)) = b, and the remaining bracket vanishes because x*(b) minimizes the Lagrangian.
This result also holds when the functional constraints are inequalities: if the i-th constraint does not hold with equality, then λ*i = 0 by complementary slackness, and therefore also ∂φ/∂bi = 0.
E. 3-10
The Lagrange multipliers are also known as shadow prices, due to an economic interpretation of the problem.

Consider a firm that produces n different goods from m different raw materials. The vector b ∈ Rm describes the amount of each raw material available to the firm, and the vector x ∈ Rn describes the quantity produced of each good. The function h : Rn → Rm gives the amounts of raw materials required to produce particular quantities of the goods, and f : Rn → R gives the profit derived from producing particular quantities of the goods. The goal of the problem is thus to maximize the firm's profit given the amounts of raw materials available to it.

The shadow price of raw material i is then the price the firm would be willing to pay per additional unit of this raw material, which of course should equal the additional profit derived from it, i.e. ∂φ/∂bi(b) = λ*i(b).
    g(λ) = inf_{x∈X} L(x, λ) = L(x*(λ), λ) = 10/(4λ2) + 4λ2 − 10

and the dual problem is:

    maximise 10/(4λ2) + 4λ2 − 10 subject to λ2 < 0.

The maximum is attained for λ2 = −√(5/8), and the primal and dual have the same optimal value, namely −2(√10 + 5). Note that it is not actually necessary to solve the dual to see that λ2 = −√(5/8) is an optimizer: it suffices that the value of the dual function at this point equals the value of the objective function of the primal at some point in the feasible set of the primal.
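The claimed maximiser of the dual can be confirmed by a simple grid search (a sketch; the grid bounds are assumed):

```python
import math

# Maximise the dual function g(λ₂) = 10/(4λ₂) + 4λ₂ - 10 over λ₂ < 0.
# Expected maximiser: λ₂ = -√(5/8); expected value: -2(√10 + 5).
def g(lam2):
    return 10.0 / (4.0 * lam2) + 4.0 * lam2 - 10.0

grid = [-2.0 + k * 1e-5 for k in range(1, 200000)]  # λ₂ in (-2, 0)
lam_best = max(grid, key=g)
```

The value g(λ₂*) matches the primal optimum of [E.3-6], illustrating strong duality.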
D. 3-14
In Rn, given any fixed m ∈ Rn and c ∈ R, the set {x ∈ Rn : x·m = c} is a hyperplane (“planes in higher dimensions”). In an n-dimensional space, a hyperplane has n − 1 dimensions.

Consider the hyperplane given by α : Rm → R with α(x) = a + m·x for some a ∈ R and m ∈ Rm. We say α is a supporting hyperplane to a function φ : Rm → R at b ∈ Rm if φ(b) = α(b) and φ(c) ≥ α(c) for all c ∈ Rm.
E. 3-15
Note that α being a supporting hyperplane to φ at b means that α(c) = φ(b) + m ·
(c − b) for some m such that φ(c) ≥ φ(b) + m · (c − b) for all c ∈ Rm .
L. 3-16
Take the set up of [D.3-11]. Let βλ = sup{β : β + λT (c − b) ≤ φ(c) for all c ∈ Rm }
and φ(c) = inf x∈X(c) f (x), then g(λ) = βλ .
By weak duality, g(λ) ≤ φ(b). So φ(b) = g(λ) and strong duality holds.
(Forward) Assume strong duality, then ∃λ such that for any c ∈ Rm we have
T. 3-18
<Supporting hyperplane theorem> If φ : Rm → R is convex and b ∈ Rm lies
in the interior of the set of points where φ is finite, then there exists a (non-vertical)
supporting hyperplane to φ at b.
By [P.2-8].
T. 3-19
Consider φ(c) = inf x∈X {f (x) : h(x) ≤ c}. If X, f, h are convex, then so is φ(c).
Consider any b1 , b2 ∈ Rm such that φ(b1 ) and φ(b2 ) are defined (ie. {f (x) :
h(x) ≤ c} with c = b1 , b2 non-empty, we allowed φ to take the value −∞). Let
δ ∈ [0, 1] and write b = δb1 + (1 − δ)b2 . Choose x1 ∈ X(b1 ), x2 ∈ X(b2 ), and let
x = δx1 + (1 − δ)x2. By convexity of X, x ∈ X. By convexity of h,

    h(x) ≤ δh(x1) + (1 − δ)h(x2) ≤ δb1 + (1 − δ)b2 = b,

so x ∈ X(b). By convexity of f,

    φ(b) ≤ f(x) ≤ δf(x1) + (1 − δ)f(x2).

This holds for any x1 ∈ X(b1) and x2 ∈ X(b2). So by taking the infimum of the right hand side, we have φ(b) ≤ δφ(b1) + (1 − δ)φ(b2). So φ is convex.
h(x) = b is equivalent to h(x) ≤ b and −h(x) ≤ −b. So the result holds for
problems with equality constraints if both h and −h are convex, ie. if h(x) is
linear.
In the case with equality constraints, convexity of X, f and h does not suffice for convexity of φ. For example, consider minimising f(x) = x² subject to h(x) = x³ = b for some b > 0; then φ(b) = b^(2/3), which is not convex. Also L(x, λ) = x² − λ(x³ − b), so inf_x L(x, λ) > −∞ iff λ = 0. So the dual has optimal value 0, which is strictly less than φ(b) if b > 0. So strong duality is not satisfied.
This result shows that almost all convex optimisation problems (inf_{x∈X}{f(x) : h(x) ≤ b} with X, f, h convex, provided b lies in the interior of the set of points where φ is finite) satisfy strong duality.
T. 3-20
If a linear program is feasible and bounded, then it satisfies strong duality.
φ(c) = inf_{x∈X(c)} f(x) is convex by the above. It can be shown, via a result known as Slater's condition, that if such a linear problem is feasible, then b ∈ Rm is an interior point of the set where φ is finite, so φ has a supporting hyperplane at b, and hence satisfies strong duality.
maximize cᵀx subject to Ax ≤ b, x ≥ 0.
This already allows us to solve linear programs, since we can just try all corners and
see which has the smallest value. However, this can be made more efficient, especially
when we have a large number of dimensions and hence corners.
Here we will assume that the rows of A are linearly independent, and that any set of m columns of A is linearly independent. Otherwise, we can just throw away the redundant rows or columns since they are “repeated”. Note that under these assumptions, setting any subset of n − m variables of x to zero uniquely determines the values of the remaining m.
D. 3-21
An extreme point x ∈ S of a convex set S is a point that cannot be written as a
convex combination of two distinct points in S, ie. if y, z ∈ S and δ ∈ (0, 1) satisfy
x = δy + (1 − δ)z, then x = y = z.
E. 3-22
Consider the linear program: maximize f(x) = x1 + x2 subject to

    x1 + 2x2 + z1 = 6,   x1 − x2 + z2 = 3,   x1, x2, z1, z2 ≥ 0.

Since we have 2 constraints, a basic solution has at most 2 non-zero entries, and so at least 2 zero entries. Since setting any subset of 2 variables to zero uniquely determines the values of the remaining variables, setting a different pair of variables to 0 at a time gives us the 6 possible basic solutions listed below. Among the 6, E and F are not feasible solutions since they have negative entries. So the basic feasible solutions are A, B, C, D.

         x1   x2   z1   z2   f(x)
    A     0    0    6    3    0
    B     0    3    0    6    3
    C     4    1    0    0    5
    D     3    0    3    0    3
    E     6    0    0   −3    6
    F     0   −3   12    0   −3

(Figure: the feasible region in the (x1, x2)-plane bounded by x1 + 2x2 = 6 and x1 − x2 = 3, with the basic solutions A–F marked; A, B, C, D are its vertices.)

So the extreme points are exactly the basic feasible solutions. In fact this is true in general.
T. 3-23
A vector x is a basic feasible solution (BFS) of Ax = b if and only if it is an
extreme point of the set X(b) = {x0 : Ax0 = b, x0 ≥ 0}.
We assume that every basic solution is non-degenerate, as well as the assumptions stated at the beginning of the section.
(Forward) Consider a BFS x and suppose that x = δy + (1 − δ)z for y, z ∈ X(b)
and δ ∈ (0, 1). Since y ≥ 0 and z ≥ 0, x = δy + (1 − δ)z implies that yi = zi = 0
whenever xi = 0. So y and z are basic solutions with the same basis, ie. both have
exactly m non-zero entries, which occur in the same rows. Moreover, Ay = b = Az
and thus A(y − z) = 0. This says that a linear combination of the m columns of
A equals 0, but by assumption any set of m columns of A is linearly independent,
so y = z. So x is an extreme point of X(b).
(Backward) Consider a feasible solution x ∈ X(b) that is not a BFS. Let i1 , · · · , ir
be the rows of x that are non-zero, then r > m. This means that the columns
ai1 , · · · , air where ai = (A1i , · · · , Ami )T , have to be linearly dependent, so there
exist yi1 , · · · , yir not all equals 0 such that yi1 ai1 + · · · + yir air = 0. Extend y
to a vector in Rn by setting yi = 0 if i 6∈ {i1 , · · · , ir }, we have Ay = 0 and thus
A(x ± εy) = b for every ε ∈ R. By choosing ε > 0 small enough, x ± εy ≥ 0 and
so x ± εy ∈ X(b). Now x = 21 (x + εy) + 12 (x − εy), so x is not an extreme point of
X(b).
T. 3-24
If an LP is feasible and bounded, then there exists an optimal solution that is a basic feasible solution.
Let x be an optimal solution. If x has at most m non-zero entries, it is a basic feasible solution, and we are done. Now suppose x has r > m non-zero entries. Since it is then not an extreme point, there are y ≠ z ∈ X(b) and δ ∈ (0, 1) such that x = δy + (1 − δ)z. We will show that there exists an optimal solution with strictly fewer than r non-zero entries; the result then follows by induction.
By optimality of x, we have cT x ≥ cT y and cT x ≥ cT z. Since cT x = δcT y + (1 −
δ)cT z, we must have that cT x = cT y = cT z, ie. y and z are also optimal. Since
y ≥ 0 and z ≥ 0, x = δy + (1 − δ)z implies that yi = zi = 0 whenever xi = 0. So y
and z have at most r non-zero entries, which must occur in rows where x is also
non-zero.
If y or z has strictly fewer than r non-zero entries, then we are done. Otherwise,
for any δ̂ ∈ R (not necessarily in (0, 1)), let xδ̂ = δ̂y + (1 − δ̂)z = z + δ̂(y − z).
82 CHAPTER 3. OPTIMIZATION
Observe that xδ̂ is optimal for every δ̂ ∈ R. Moreover, y − z 6= 0, and all non-zero
entries of y − z occur in rows where x is non-zero as well. We can thus choose
δ̂ ∈ R such that xδ̂ ≥ 0 and xδ̂ has strictly fewer than r non-zero entries.
Intuitively, this is what we do when we “slide along the line” if c is orthogonal to
one of the boundary lines.
This result in fact holds more generally for the maximum of a convex function f over a compact (ie. closed and bounded) convex set X. In that case, we can write any point x ∈ X as a convex combination x = Σᵢ₌₁ᵏ δᵢxᵢ of extreme points xᵢ ∈ X, where δᵢ ∈ [0, ∞) with Σᵢ₌₁ᵏ δᵢ = 1. Then, by convexity of f,

    f(x) ≤ Σᵢ₌₁ᵏ δᵢ f(xᵢ) ≤ maxᵢ f(xᵢ).

So any point in the interior cannot be better than the extreme points.
C. 3-25
<Linear Programming Duality> Consider an LP in general form, min{cᵀx : Ax ≥ b, x ≥ 0}. With slack variables this is min{cᵀx : Ax − z = b, x, z ≥ 0}. We have X = {(x, z) : x, z ≥ 0} ⊆ Rm+n. The Lagrangian is

    L(x, z, λ) = cᵀx − λᵀ(Ax − z − b) = (cᵀ − λᵀA)x + λᵀz + λᵀb,

which has a finite minimum over X iff cᵀ − λᵀA ≥ 0 and λ ≥ 0, in which case the minimum is λᵀb. So the dual is max{λᵀb : Aᵀλ ≤ c, λ ≥ 0}. Similarly, the dual of the standard form min{cᵀx : Ax = b, x ≥ 0} is max{λᵀb : Aᵀλ ≤ c}.
T. 3-26
The dual of the dual of a linear program is the primal.
It suffices to show this for the linear program in general form. The dual problem
is: minimize −bT λ subject to −AT λ ≥ −c and λ ≥ 0. This problem has the same
form as the primal, with −b taking the role of c, −c taking the role of b, and −AT
taking the role of A. So doing it again, we get back to the original problem.
E. 3-27
Let the primal problem be: maximize 3x1 + 2x2 subject to

    2x1 + x2 + z1 = 4,   2x1 + 3x2 + z2 = 6,   x1, x2, z1, z2 ≥ 0,

so that the dual problem is: minimize 4λ1 + 6λ2 subject to

    2λ1 + 2λ2 − μ1 = 3,   λ1 + 3λ2 − μ2 = 2,   λ1, λ2, μ1, μ2 ≥ 0.

We can compute all basic solutions of the primal and the dual by setting n − m = 2 variables to be zero in turn. Given a particular basic solution of the primal, the corresponding solution of the dual can be found by using the complementary slackness conditions: λ1z1 = λ2z2 = 0 and μ1x1 = μ2x2 = 0.

         x1   x2   z1   z2   f(x)  |  λ1    λ2    μ1    μ2   g(λ)
    A     0    0    4    6    0    |   0     0    −3    −2    0
    B     2    0    0    2    6    |  3/2    0     0   −1/2   6
    C     3    0   −2    0    9    |   0    3/2    0    5/2   9
    D    3/2   1    0    0   13/2  |  5/4   1/4    0     0   13/2
    E     0    2    2    0    4    |   0    2/3  −5/3    0    4
    F     0    4    0   −6    8    |   2     0     1     0    8

(Figure: the primal feasible region in the (x1, x2)-plane, with vertices A, B, D, E and the line 2x1 + 3x2 = 6; and the dual feasible region in the (λ1, λ2)-plane, bounded by 2λ1 + 2λ2 = 3 and λ1 + 3λ2 = 2, with the corresponding points marked.)

We see that D is the only solution for which both the primal and dual solutions are feasible. So we know it is optimal without even having to calculate f(x). It turns out this is always the case.
T. 3-28
Let x and λ be feasible for the primal and the dual of the linear program. Then
x and λ are optimal if and only if they satisfy complementary slackness, ie. (cT −
λT A)x = 0 and λT (Ax − b) = 0.
The proof below is for the general form of LP, but it should be clear that it also holds for the standard form.

(Forward) If x and λ are optimal, then by strong duality cᵀx = λᵀb, and

    λᵀb ≤ λᵀAx ≤ cᵀx

since Ax ≥ b, λ ≥ 0, Aᵀλ ≤ c and x ≥ 0. The first and last terms are the same, so the inequalities hold with equality. Therefore λᵀb = cᵀx − λᵀ(Ax − b) = (cᵀ − λᵀA)x + λᵀb. So (cᵀ − λᵀA)x = 0. Also, λᵀ(Ax − b) = 0.

(Backward) Conversely, if (cᵀ − λᵀA)x = 0 and λᵀ(Ax − b) = 0, then cᵀx = λᵀAx = λᵀb, so x and λ are optimal by weak duality.
    f(x) = cBᵀ AB⁻¹ b + ( cNᵀ − cBᵀ AB⁻¹ AN ) xN.
The algorithm

The simplex method is a systematic way of doing the above procedure. We make a simplex tableau (an (m + 1) × (n + 1) table) of the form

      a_ij | a_i0
      -----+-----
      a_0j | a_00

where a_ij = (AB⁻¹A)_ij, a_i0 = (AB⁻¹b)_i, a_0j = (cᵀ − cBᵀAB⁻¹A)_j and a_00 = −cBᵀAB⁻¹b.

1. Start from an initial basic feasible solution, with basis B.
2. Check whether a_0j ≤ 0 for every j. If so, the current solution is optimal, so stop.

   Note that (cᵀ − cBᵀAB⁻¹A)_j = c_j − (cB)_p (AB⁻¹)_pq A_qj. If j ∈ B, writing B(j) for the reduced index of j, so that A_ij = (AB)_iB(j) etc., we have

       (cᵀ − cBᵀAB⁻¹A)_j = (cB)_B(j) − (cB)_p (AB⁻¹AB)_pB(j) = (cB)_B(j) − (cB)_B(j) = 0.
3. If not, choose a pivot column j such that a_0j > 0. If a_ij ≤ 0 for all i, the problem is unbounded, and we stop. Otherwise choose a pivot row i ∈ {i′ : a_i′j > 0} that minimizes a_i′0/a_i′j. If multiple rows minimize a_i′0/a_i′j, then the problem is degenerate, and things might go wrong.
   We chose j such that (cNᵀ − cBᵀAB⁻¹AN)_N(j) > 0. Note that the constraint b = Ax is equivalent to

       AB⁻¹b = AB⁻¹Ax = AB⁻¹AB xB + AB⁻¹AN xN = I xB + AB⁻¹AN xN.

   If a_ij ≤ 0 (ie. (AB⁻¹AN)_iN(j) ≤ 0) for all i, then as we increase (xN)_N(j), every component of AB⁻¹AN xN becomes smaller and smaller, which can be offset by an appropriate increase in xB so as to maintain AB⁻¹b = I xB + AB⁻¹AN xN. This we can do forever, while improving the objective, hence the problem is unbounded.

   As we increase (xN)_N(j), for every i′ such that a_i′j = (AB⁻¹AN)_i′N(j) > 0, the value of (I xB + AB⁻¹AN xN)_i′ would increase by a_i′j Δ(xN)_N(j), so to maintain AB⁻¹b = I xB + AB⁻¹AN xN we need to decrease (xB)_i′ by the amount a_i′j Δ(xN)_N(j). Since the BFS we started with has (xB)_i = a_i0 and xN = 0, the i that minimises a_i0/a_ij corresponds to the (xB)_i that hits 0 first. We have then found a better BFS with a different basis.
4. We update the tableau by multiplying row i by 1/a_ij, and adding a (−a_kj/a_ij) multiple of (the original) row i to each row k ≠ i, including k = 0. Now return to step 2 and repeat.

   This operation changes the basis B to our new basis, so that our tableau is now expressed in the new basis (all the B in the table become B′, corresponding to the new basis). Row operations are allowed because they are equivalent to adding one constraint to another or multiplying a constraint by a scalar. After this step, a_kj = 0 for all k ≠ i and a_ij = 1, ie. j is now in the new basis, replacing the original i-th basis variable. So this operation is equivalent to changing basis.

Note that in the tableau the a_i0 column is the xB of the current BFS x, and a_00 is −f(x).
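The tableau procedure above can be sketched in code. This is a minimal illustration, not a production solver: it assumes the problem is "maximize cᵀx subject to Ax ≤ b, x ≥ 0" with b ≥ 0 (so the slack variables give a starting BFS) and assumes non-degeneracy:

```python
# Minimal dense-tableau simplex for: maximize cᵀx s.t. Ax ≤ b, x ≥ 0,
# assuming b ≥ 0 and non-degeneracy. Returns the optimal value.
def simplex(A, b, c):
    m, n = len(A), len(c)
    # rows 0..m-1 hold [a_ij | a_i0]; row m holds [a_0j | a_00], a_00 = -f(x)
    T = [A[i][:] + [1.0 if j == i else 0.0 for j in range(m)] + [b[i]]
         for i in range(m)]
    T.append(c[:] + [0.0] * (m + 1))
    while True:
        j = max(range(n + m), key=lambda col: T[m][col])  # pivot column
        if T[m][j] <= 1e-9:
            return -T[m][-1]  # all a_0j <= 0: current BFS is optimal
        rows = [i for i in range(m) if T[i][j] > 1e-9]
        if not rows:
            raise ValueError("problem is unbounded")
        i = min(rows, key=lambda r: T[r][-1] / T[r][j])  # pivot row
        piv = T[i][j]
        T[i] = [val / piv for val in T[i]]
        for r in range(m + 1):
            if r != i and T[r][j] != 0.0:
                T[r] = [x - T[r][j] * y for x, y in zip(T[r], T[i])]

# [E.3-29]: maximise x1 + x2 with x1 + 2x2 <= 6, x1 - x2 <= 3
value = simplex([[1.0, 2.0], [1.0, -1.0]], [6.0, 3.0], [1.0, 1.0])
```

Running it on the example of [E.3-29] reproduces the optimal value 5 found by hand below (possibly via a different pivot order).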
E. 3-29
Consider the following problem: maximize x1 + x2 subject to

    x1 + 2x2 + z1 = 6,   x1 − x2 + z2 = 3,   x1, x2, z1, z2 ≥ 0.

Note that (x1, x2, z1, z2) = (0, 0, 6, 3) is a BFS with AB⁻¹ = I. We make the simplex tableau, which now has the simple form that just contains the coefficients of the constraints and objective function:

                    x1   x2   z1   z2
    Constraint 1     1    2    1    0    6
    Constraint 2     1   −1    0    1    3
    Objective        1    1    0    0    0

It's pretty clear that our basic feasible solution is not optimal, since our objective function is 0. This is because something in the last row is positive, and we can increase the objective by, say, increasing x2 (pivot column 2). The pivot row is row 1, which means z1 would be the first to hit 0 as we increase x2. We multiply the first row by 1/2, then add 1 times the first row to the second row and −1 times the first row to the third row. We have

                    x1    x2   z1    z2
    Constraint 1   1/2     1   1/2    0     3
    Constraint 2   3/2     0   1/2    1     6
    Objective      1/2     0  −1/2    0    −3

Now we have changed basis to (x2, z2). Our new and better BFS is (x1, x2, z1, z2) = (0, 3, 0, 6). Doing this one more time (pivoting on the x1 column with row 2), we have

                    x1   x2   z1    z2
    Constraint 1     0    1   1/3  −1/3    1
    Constraint 2     1    0   1/3   2/3    4
    Objective        0    0  −2/3  −1/3   −5

Now we have changed basis to (x2, x1). Our new and better BFS is (x1, x2, z1, z2) = (4, 1, 0, 0), with an objective value of 5. This is the optimum since a_0j ≤ 0 for all j.
E. 3-30
<Two-phase simplex method> Sometimes there isn’t a obvious BFS, we
would need to use the two-phase simplex method to find our first BFS. This is best
illustrate with an example. Consider the problem:
minimize 6x1 + 3x2 subject to x1 , x 2 ≥ 0 and
x1 + x2 ≥ 1, 2x1 − x2 ≥ 1, 3x2 ≤ 2,
This is a minimization problem. To avoid being confused, we maximize −6x1 −3x2
instead. We add slack variables to obtain: maximize −6x1 − 3x2 subject to
x1 + x2 − z1 = 1, 2x1 − x2 − z2 = 1, 3x2 + z3 = 2, x1 , x2 , z1 , z2 , z3 ≥ 0
We don’t have an obvious BFS since (x1 , x2 , z1 , z2 , z3 ) = (0, 0, −1, −1, 2) is not
feasible. So we add more variables (called the artificial variables) to places where
it “doesn’t work” previously, and we solve to minimise the sum of the new artificial
variables. So
3.2. SOLUTIONS OF LINEAR PROGRAMS 87
x1 + x2 − z1 + y1 = 1, 2x1 − x2 − z2 + y2 = 1, 3x2 + z3 = 2
This new problem has the obvious BFS (x1 , x2 , z1 , z2 , z3 , y1 , y2 ) = (0, 0, 0, 0, 2, 1, 1),
so we can solve this problem by the simplex method. If the original problem is
feasible, the optimal solution to the new problem must have y1 + y2 = 0 (ie.
y1 = 0 and y2 = 0), so the optimal solution for the new problem is a BFS to the
original problem. So we can solve the original problem by first solving this new
problem (phrase I), and then solve the original problem (phrase II). We write out
the coefficients in a table:
                    x1    x2    z1    z2    z3    y1    y2
Constraint 3         0     3     0     0     1     0     0      2
Constraint 1         1     1    -1     0     0     1     0      1
Constraint 2         2    -1     0    -1     0     0     1      1
Original objective  -6    -3     0     0     0     0     0      0
New objective        0     0     0     0     0    -1    -1      0

To express the new objective in terms of the non-basic variables, we add the rows
of constraints 1 and 2 to the new objective row, obtaining

                    x1    x2    z1    z2    z3    y1    y2
Constraint 3         0     3     0     0     1     0     0      2
Constraint 1         1     1    -1     0     0     1     0      1
Constraint 2         2    -1     0    -1     0     0     1      1
Original objective  -6    -3     0     0     0     0     0      0
New objective        3     0    -1    -1     0     0     0      2
In addition to the new objective, we also write our original objective in the tableau,
so that we can conveniently use it in the second phase (we can continue to use
this table when we go on to solve the original problem). Our pivot column is
x1 , and our pivot row is the third row. Doing the pivot, we have:
                    x1    x2    z1    z2    z3    y1    y2
Constraint 3         0     3     0     0     1     0     0      2
Constraint 1         0    3/2   -1    1/2    0     1   -1/2    1/2
Constraint 2         1   -1/2    0   -1/2    0     0    1/2    1/2
Original objective   0    -6     0    -3     0     0     3      3
New objective        0    3/2   -1    1/2    0     0   -3/2    1/2
There are two possible pivot columns. We pick z2 and use the second row as the
pivot row. We have
x1 x2 z1 z2 z3 y1 y2
0 3 0 0 1 0 0 2
0 3 -2 1 0 2 -1 1
1 1 -1 0 0 1 0 1
0 3 -6 0 0 6 0 6
0 0 0 0 0 -1 -1 0
We see that y1 and y2 are no longer in the basis, and hence take value 0. So phase
I is complete. We drop all the phase I stuff from our table; what remains is our
phase II tableau (the tableau of the original problem):
x1 x2 z1 z2 z3
0 3 0 0 1 2
0 3 -2 1 0 1
1 1 -1 0 0 1
0 3 -6 0 0 6
We pivot on the x2 column, with the second row as the pivot row, and obtain

     x1    x2    z1    z2    z3
      0     0     2    -1     1      1
      0     1   -2/3   1/3    0     1/3
      1     0   -1/3  -1/3    0     2/3
      0     0    -4    -1     0      5
Since the entries in the last row are all non-positive, this is an optimal solution.
So (x1 , x2 , z1 , z2 , z3 ) = (2/3, 1/3, 0, 0, 1) is an optimal solution, and
our optimal value is 5.
Note that we previously said that the bottom right entry is the negative of the
optimal value, not the optimal value itself! This is correct, since in the tableau,
we are maximizing −6x1 − 3x2 , whose maximum value is −5. So the minimum
value of 6x1 + 3x2 is 5.
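For an LP this small we can also sanity-check the answer without any simplex machinery: every optimum of an LP that is bounded below is attained at a vertex of the feasible region, and the vertices are intersections of pairs of constraint boundaries. A brute-force sketch (vertex enumeration is only sensible for tiny problems like this one):

```python
from fractions import Fraction
from itertools import combinations

# Boundary lines a*x1 + b*x2 = c of the constraints:
# x1 + x2 >= 1, 2*x1 - x2 >= 1, 3*x2 <= 2, x1 >= 0, x2 >= 0.
lines = [(1, 1, 1), (2, -1, 1), (0, 3, 2), (1, 0, 0), (0, 1, 0)]

def feasible(x1, x2):
    return (x1 + x2 >= 1 and 2*x1 - x2 >= 1 and 3*x2 <= 2
            and x1 >= 0 and x2 >= 0)

best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(lines, 2):
    det = a1*b2 - a2*b1
    if det == 0:
        continue  # parallel boundaries: no intersection point
    # Cramer's rule, kept exact with Fractions
    x1 = Fraction(c1*b2 - c2*b1, det)
    x2 = Fraction(a1*c2 - a2*c1, det)
    if feasible(x1, x2):
        val = 6*x1 + 3*x2
        if best is None or val < best[0]:
            best = (val, x1, x2)

assert best == (5, Fraction(2, 3), Fraction(1, 3))
```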
It is worth noting that the problem we have just solved is the dual of the LP in
[E.3-29], with the additional constraint 3x2 ≤ 2. Ignoring the column and row
corresponding to z3 , the slack variable for this new constraint, the final tableau is
essentially the negative of the transpose of the final tableau we obtained in [E.3-29]
(not quite, because there are slack variables and the table displays A_B^{-1}A rather
than A). This makes sense because the additional constraint is not tight in the
optimal solution, as we can see from the fact that z3 ≠ 0.
We mostly focus on games with two players, but note that most concepts extend in a
straightforward way to games with more than two players.
Here we see that regardless of what the other person does, it is always strictly
better to testify, so T is a dominant strategy. The strategy profile (T, T ) is a
dominant strategy equilibrium, but it is Pareto dominated by (S, S). The source
of the dilemma is that the outcome resulting from (T, T ) is strictly worse for both
players than the outcome resulting from (S, S).
L. 3-34
The maximin strategy/security level are the optimal solution/value of the LP
maximize v subject to

x ≥ 0,      Σ_{i=1}^m xi = 1,      Σ_{i=1}^m xi pij ≥ v  for all j = 1, · · · , n
The security level of the row player is max_{x∈X} min_{y∈Y} p(x, y). It is easy to see
that it is the same to maximize the minimum payoff over all pure strategies of the
other player, ie. max_{x∈X} min_{j∈{1,··· ,n}} Σ_{i=1}^m xi pij. We can formulate this as
the LP given.
E. 3-35
<Chicken> The game of Chicken is as follows: two people drive their cars
towards each other at high speed, and each can decide to chicken out (C) or continue
driving (D). If they collide (ie. both don't chicken out), they both die. If one chickens
out and the other doesn't, the one who chickened out is branded a coward.
This can be represented by the table:

          C         D
  C    (2, 2)    (1, 3)
  D    (3, 1)    (0, 0)

Here there is no dominating strategy, so we need a different way of deciding what
to do. Instead, we can use the maximin strategy. This strategy minimizes the
worst possible loss.
The unique maximin strategy in this game is to chicken out, for a security level of 1.
This isn't an equilibrium, since if one player employs this maximin strategy,
it would be better for the other not to chicken out.
In this game, there are two pure equilibria, (C, D) and (D, C), and there is a
mixed equilibrium in which the players pick the options with equal probability.
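For a game with only two rows, the maximin computation can be done exactly by examining the lower envelope of the column payoffs: with x = (t, 1 − t), each column's expected payoff is linear in t, so the envelope's maximum lies at t = 0, t = 1, or a crossing point of two columns. A sketch using the Chicken payoffs above:

```python
from fractions import Fraction

# Row player's payoffs in Chicken: rows C, D; columns C, D.
P = [[2, 1],
     [3, 0]]

def security(P):
    """Maximin value and strategy of a 2-row matrix game, where the row
    player plays x = (t, 1-t).  Returns (security level, best t)."""
    def col(j, t):
        return t * P[0][j] + (1 - t) * P[1][j]
    n = len(P[0])
    candidates = {Fraction(0), Fraction(1)}
    for j in range(n):
        for k in range(j + 1, n):
            # t where column j and column k give equal payoff
            a = (P[0][j] - P[1][j]) - (P[0][k] - P[1][k])
            if a != 0:
                t = Fraction(P[1][k] - P[1][j], a)
                if 0 <= t <= 1:
                    candidates.add(t)
    return max((min(col(j, t) for j in range(n)), t) for t in candidates)

v, t = security(P)
assert v == 1 and t == 1   # security level 1, achieved by pure C
```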
T. 3-36
(Nash, 1961) Every bimatrix game has an equilibrium.
Recall the LP for max min p(x, y).[L.3-34] Adding slack variable z ∈ Rn with z ≥ 0,
we obtain the Lagrangian
L(v, x, z, w, y) = v + Σ_{j=1}^n yj ( Σ_{i=1}^m xi pij − zj − v ) − w ( Σ_{i=1}^m xi − 1 )

             = ( 1 − Σ_{j=1}^n yj ) v + Σ_{i=1}^m ( Σ_{j=1}^n pij yj − w ) xi − Σ_{j=1}^n yj zj + w,

where w ∈ R and y ∈ R^n are Lagrange multipliers. This has a finite maximum over
all v ∈ R, x ≥ 0 and z ≥ 0 iff Σ yj = 1, Σ_j pij yj ≤ w for all i, and y ≥ 0. The dual is

minimize w subject to

y ≥ 0,      Σ_{j=1}^n yj = 1,      Σ_{j=1}^n pij yj ≤ w  for all i
This corresponds to the column player choosing a strategy (yi ) such that the
expected payoff of the row player is bounded above by w.
The optimum value of the dual is miny∈Y maxx∈X p(x, y). So the result follows
from strong duality.
We call v = max_{x∈X} min_{y∈Y} p(x, y) = min_{y∈Y} max_{x∈X} p(x, y) the value of the
matrix game with payoff matrix P .
This result is equivalent to max_{x∈X} min_{y∈Y} p(x, y) = − max_{y∈Y} min_{x∈X} (−p(x, y)).
Then for a zero-sum game, we see that the left hand side is the worst payoff the
row player can get if he employs the minimax strategy, while the right hand side
is the worst payoff the column player can get if he uses his minimax strategy.
Combining with the next theorem, this theorem then says that if both players
employ the minimax strategy, then this is an equilibrium. So in a zero-sum game,
maximin strategies are optimal.
T. 3-39
(x, y) ∈ X × Y is an equilibrium of the matrix game with payoff matrix P iff
max_{x′∈X} p(x′, y) = min_{y′∈Y} max_{x′∈X} p(x′, y′)
                    ≥ min_{y′∈Y} p(x, y′) = − max_{y′∈Y} (−p(x, y′)) = −(−p(x, y)) = p(x, y)
This problem is a linear program. In theory, we can write it into the general form
Ax = b with regional constraints m̲_k ≤ x_k ≤ m̄_k, where A is the matrix given by

a_ik =   1   if the kth edge starts at vertex i
        -1   if the kth edge ends at vertex i
         0   otherwise
Note that instead of representing an edge by a pair of indices i, j, in this case we
represent each edge with just one index k. However, this method is not very
efficient, so we will look for better methods.
Note that Σ_{i∈V} bi = 0 is required for feasibility, which makes sense (total supply is
equal to the total consumption), and that a problem satisfying this condition can be
transformed into an equivalent circulation problem where bi = 0 for all i by introduc-
ing an additional vertex, and new edges from each sink to the new vertex and from
the new vertex to each of the sources, and let these new edges have upper and lower
bounds equal to the flow that should enter the sources or leave the sinks. Note also
we can assume wlog that the network G is connected. Otherwise the problem can be
decomposed into several smaller problems that can be solved independently.
An uncapacitated problem is the case where m̲ij = 0 and m̄ij = ∞ for all (i, j) ∈ E.
Clearly, an uncapacitated flow problem is either unbounded (which can happen if some
cij are negative), or is bounded and hence has an equivalent problem with finite
capacities (as we can add a bound greater than what the optimal solution wants).
The Lagrangian of the minimum-cost circulation problem is

L(x, λ) = Σ_{(i,j)∈E} cij xij − Σ_{i∈V} λi ( Σ_{j:(i,j)∈E} xij − Σ_{j:(j,i)∈E} xji ) = Σ_{(i,j)∈E} ( cij − λi + λj ) xij
T. 3-41
If x ∈ Rn×n is a feasible flow for a circulation problem, and λ ∈ Rn is such that
xij = m̲ij whenever cij − λi + λj > 0 and xij = m̄ij whenever cij − λi + λj < 0,
then x is optimal.
For (i, j) ∈ E, let c̄ij = cij − λi + λj. Then, for every feasible flow x′,

Σ_{(i,j)∈E} c̄ij x′ij = Σ_{(i,j)∈E} cij x′ij − Σ_{i∈V} λi ( Σ_{j:(i,j)∈E} x′ij − Σ_{j:(j,i)∈E} x′ji ) = Σ_{(i,j)∈E} cij x′ij ,

since each bracketed term is 0 (x′ is a circulation). Moreover, since m̲ij ≤ x′ij ≤ m̄ij,

Σ_{(i,j)∈E} c̄ij x′ij ≥ Σ_{c̄ij<0} c̄ij m̄ij + Σ_{c̄ij>0} c̄ij m̲ij = Σ_{(i,j)∈E} c̄ij xij = Σ_{(i,j)∈E} cij xij .

So every feasible x′ has cost at least that of x.
Note that this result is simply Lagrange sufficiency. Note that for an x and λ that
satisfy the conditions stated, we must have L(x, λ) = inf_{x′∈X} L(x′, λ), since the
conditions imply that L cannot be decreased any further.
The Lagrange multiplier λi is also referred to as a node number, or as a potential
associated with vertex i ∈ V . Since only the difference between pairs of Lagrange
multipliers appears in the optimality conditions, we can set wlog λ1 = 0.
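Checking the optimality conditions of [T.3-41] for a given flow and given node numbers is purely mechanical. Below is a sketch on a hypothetical 3-cycle; the edge data and multipliers are made up for illustration.

```python
# Hypothetical circulation problem: edge (i, j) -> (cost, lower, upper, flow).
edges = {
    (1, 2): (1, 0, 2, 2),
    (2, 3): (1, 0, 2, 2),
    (3, 1): (-3, 0, 2, 2),
}
lam = {1: 0, 2: -1, 3: -2}  # node numbers (Lagrange multipliers), λ1 = 0

def certifies_optimal(edges, lam):
    """Check the conditions of [T.3-41]: with reduced cost
    cbar_ij = c_ij - λi + λj, the flow must sit at the lower bound
    whenever cbar_ij > 0 and at the upper bound whenever cbar_ij < 0."""
    for (i, j), (c, lo, hi, x) in edges.items():
        cbar = c - lam[i] + lam[j]
        if cbar > 0 and x != lo:
            return False
        if cbar < 0 and x != hi:
            return False
    return True

assert certifies_optimal(edges, lam)
```

Here the only nonzero reduced cost is on (3, 1), where it is negative, and that edge is indeed saturated, so the flow is certified optimal.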
Σ_{j=1}^m xij = si  for i = 1, · · · , n      and      Σ_{i=1}^n xij = dj  for j = 1, · · · , m
Consider a minimum-cost flow problem on network (V, E). Wlog assume that
m̲ij = 0 for all (i, j) ∈ E, because otherwise we can set m̲ij to 0, m̄ij to m̄ij − m̲ij,
bi to bi − m̲ij, bj to bj + m̲ij, and xij to xij − m̲ij.
Moreover, we can assume that all capacities are finite: if some edge has infinite
capacity but non-negative cost, then setting the capacity to a large enough number,
for example Σ_{i∈V} |bi|, does not affect the optimal solutions. This is since cost is
non-negative, and the optimal solution will not want shipping loops. So we will
have at most Σ |bi| shipments.
We now construct a transportation problem as follows: Replace every vertex
i ∈ V with a consumer with demand ( Σ_{k:(i,k)∈E} m̄ik ) − bi. Replace every edge
(i, j) ∈ E with a supplier with supply m̄ij; this supplier has an edge to consumer
i with cost c(ij,i) = 0 and an edge to consumer j with cost c(ij,j) = cij .
[Diagram: the supplier ij, with supply m̄ij, has an edge of cost 0 to consumer i
(demand Σ_{k:(i,k)∈E} m̄ik − bi) and an edge of cost cij to consumer j (demand
Σ_{k:(j,k)∈E} m̄jk − bj).]
The idea is that if the capacity of the edge (i, j) is, say, 5, in the original network,
and we want to transport 3 along this edge, then in the new network, we send 3
units from ij to j, and 2 units to i.
For any flow x in the original network, the corresponding flow on (ij, j) is xij and
the flow on (ij, i) is m̄ij − xij. The total flow into i is then

Σ_{k:(i,k)∈E} ( m̄ik − xik ) + Σ_{k:(k,i)∈E} xki .

This satisfies the constraints of the new network iff

Σ_{k:(i,k)∈E} ( m̄ik − xik ) + Σ_{k:(k,i)∈E} xki = Σ_{k:(i,k)∈E} m̄ik − bi ,

which is exactly the constraint for the node i in the original minimum-cost flow
problem. So the two problems are equivalent.
So we can solve a bounded minimum cost-flow problem by solving the equivalent
transportation problem, which is usually easier.
C. 3-43
<Transportation Algorithm> For the transportation problem, it is convenient
to have two sets of Lagrange multipliers, one for the supplier constraints and one
for the consumer constraint. Then the Lagrangian of the transportation problem
can be written as

L(x, λ, µ) = Σ_{i=1}^n Σ_{j=1}^m cij xij + Σ_{i=1}^n λi ( si − Σ_{j=1}^m xij ) − Σ_{j=1}^m µj ( dj − Σ_{i=1}^n xij )

           = Σ_{i=1}^n Σ_{j=1}^m ( cij − λi + µj ) xij + Σ_{i=1}^n λi si − Σ_{j=1}^m µj dj .
Note that we use different signs for the Lagrange multipliers for the suppliers and
the consumers, so that our ultimate optimality condition will look nicer.
Since x ≥ 0, the Lagrangian has a finite minimum iff cij − λi + µj ≥ 0 for all
i, j. So this is our dual feasibility condition. At an optimum, complementary
slackness entails that (cij − λi + µj )xij = 0 for all i, j. In fact if we have λ, µ
and x that satisfies these conditions, then L(x, λ, µ) = inf x0 ∈X L(x0 , λ, µ), so by
Lagrange sufficiency x is optimal. To solve the transportation problem we could
use a method similar to the simplex method. In this case, we make a tableau as
follows:
        µ1               µ2               µ3               µ4
λ1   λ1−µ1            λ1−µ2            λ1−µ3            λ1−µ4
     x11  c11         x12  c12         x13  c13         x14  c14      s1
λ2   λ2−µ1            λ2−µ2            λ2−µ3            λ2−µ4
     x21  c21         x22  c22         x23  c23         x24  c24      s2
λ3   λ3−µ1            λ3−µ2            λ3−µ3            λ3−µ4
     x31  c31         x32  c32         x33  c33         x34  c34      s3
       d1               d2               d3               d4
We have a row for each supplier and a column for each consumer. We assume
there are 3 suppliers and 4 consumers, but of course the table can be altered for
any number of suppliers or consumers. We proceed as follows:
1. Find an initial BFS, and let T be the edges of the corresponding spanning tree.
Although it looks like we have n + m constraints, we effectively only have
n + m − 1 constraints, since for example d1 = Σ_{i=1}^n si − Σ_{j=2}^m dj, so the
constraint Σ_{i=1}^n xi1 = d1 can be derived from the other n + m − 1 constraints.
So a BFS has at most n + m − 1 non-zero entries. If we have a BFS, we can always
reduce it so that it does not contain cycles. That is because if we have a cycle,
we can increase/decrease the flow on an edge of the cycle; the flows of the other
edges of the cycle must change correspondingly to maintain feasibility, and
eventually the flow of one of the edges will be reduced to 0, in which case we
don't have the cycle anymore. Assuming there is no degeneracy, the resulting
graph would be a tree. In general, degeneracies occur when a subset of the
consumers can be satisfied exactly by a subset of the suppliers, hence the graph
can be disconnected. Assuming no degeneracy, the graph would be a spanning
tree with n + m − 1 edges. Note that it must be spanning, otherwise some
suppliers/consumers are not supplying/demanding.
E. 3-44
Suppose we have three suppliers with supplies 8, 10 and 9; and four consumers
with demands 6, 5, 8, 8.
It is easy to create an initial feasible solution - we just start from the first supplier
and the first consumer, we supply as much as we can until one side runs out of
supply/demand. If it is the supplier that runs out, we take in the next supplier
to continue the job. If it is the consumer that has no more demand, we go on to
supply to the next consumer. We first fill our tableau with our feasible solution.
      6  5      2  3         4         6       8
         2      3  7      7  4         1      10
         5         6      1  2      8  4       9
      6         5         8         8

(In each cell, the left entry, when present, is the flow xij and the right entry is
the cost cij; supplies are on the right, demands along the bottom.)
[Diagram: the bipartite graph of this BFS, with suppliers s1 = 8, s2 = 10, s3 = 9
on one side and consumers d1 = 6, d2 = 5, d3 = 8, d4 = 8 on the other, forms a
spanning tree.]
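The greedy construction of the initial feasible solution described above (often called the north-west corner rule) can be sketched as:

```python
def north_west_corner(supply, demand):
    """Greedy initial BFS for the transportation problem: starting from
    the first supplier and the first consumer, ship as much as possible,
    then move to the next supplier or consumer when one is exhausted.
    (Assumes total supply equals total demand and no degeneracy.)"""
    s, d = list(supply), list(demand)
    alloc = {}
    i = j = 0
    while i < len(s) and j < len(d):
        q = min(s[i], d[j])
        alloc[(i, j)] = q
        s[i] -= q
        d[j] -= q
        if s[i] == 0:
            i += 1      # supplier exhausted: take in the next supplier
        else:
            j += 1      # consumer satisfied: move on to the next consumer
    return alloc

alloc = north_west_corner([8, 10, 9], [6, 5, 8, 8])
assert alloc == {(0, 0): 6, (0, 1): 2, (1, 1): 3, (1, 2): 7,
                 (2, 2): 1, (2, 3): 8}
```

This reproduces exactly the six basic entries of the tableau above, which is one fewer than the number of vertices, as expected for a spanning tree.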
We see that our basic feasible solution corresponds to a spanning tree. In general,
if we have n suppliers and m consumers, then we have n + m vertices, and hence
n + m − 1 edges (assuming no degeneracy). To set λ, µ so that cij − λi + µj = 0 for
all these edges, we have n + m − 1 constraints here, so we can arbitrarily choose one
Lagrange multiplier, and the other Lagrange multipliers will follow. We choose
λ1 = 0. Now we must have µ1 = −5 etc.; we fill in the values of the other Lagrange
multipliers as follows, and obtain

        -5        -3         0        -2
  0   6  5      2  3         4         6
  4      2      3  7      7  4         1
  2      5         6      1  2      8  4
We can fill in the values of λi − µj :

        -5        -3         0        -2
  0   6  5      2  3      0  4      2  6
  4   9  2      3  7      7  4      6  1
  2   7  5      5  6      1  2      8  4

We didn't bother filling in the value of λi − µj for (i, j) ∈ T , since we know
cij − λi + µj = 0 there. If λi − µj ≤ cij is satisfied everywhere, we have optimality.
However this is not satisfied everywhere in this case; for example 9 = λ2 − µ1 >
c21 = 2. We add an edge from the second supplier to the first consumer. Then we
have created a cycle. We keep increasing the flow on the new edge. This causes the
values on other edges to change by flow conservation. So we keep doing this until
some other edge reaches zero. If we increase the flow by, say, δ, we have
      6−δ  5     2+δ  3          4          6       8
       δ   2     3−δ  7      7   4          1      10
           5          6      1   2      8   4       9
       6         5           8          8

The maximum value of δ we can take is 3. So we end up with

       3  5      5  3         4         6       8
       3  2         7      7  4         1      10
          5         6      1  2      8  4       9
       6         5         8         8
We re-compute the Lagrange multipliers to obtain

        -5        -3        -7        -9
  0   3  5      5  3      7  4      9  6
 -3   3  2      0  7      7  4      6  1
 -5   0  5     -2  6      1  2      8  4

There is still a violation: 6 = λ2 − µ4 > c24 = 1. So we add the edge from the
second supplier to the fourth consumer and increase its flow as before; this time
the maximum increase is δ = 7. Re-computing the Lagrange multipliers and the
values of λi − µj then gives

        -5        -3        -2        -4
  0   3  5      5  3      2  4      4  6
 -3   3  2      0  7     -1  4      7  1
  0   5  5      3  6      8  2      1  4

No more violations, so this is the optimal solution.
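We can certify this final answer mechanically: dual feasibility (cij − λi + µj ≥ 0 everywhere) together with complementary slackness proves optimality, as derived in [C.3-43]. A sketch using the numbers from the tables above:

```python
# Final flows, costs and multipliers from the example above.
c = [[5, 3, 4, 6],
     [2, 7, 4, 1],
     [5, 6, 2, 4]]
x = [[3, 5, 0, 0],
     [3, 0, 0, 7],
     [0, 0, 8, 1]]
lam = [0, -3, 0]
mu = [-5, -3, -2, -4]
supply, demand = [8, 10, 9], [6, 5, 8, 8]

# Primal feasibility: row sums match supplies, column sums match demands.
assert [sum(row) for row in x] == supply
assert [sum(col) for col in zip(*x)] == demand

# Dual feasibility and complementary slackness:
# cij - λi + µj >= 0 everywhere, with equality wherever xij > 0.
for i in range(3):
    for j in range(4):
        red = c[i][j] - lam[i] + mu[j]
        assert red >= 0
        assert red * x[i][j] == 0

# Hence x is optimal; its total cost is
assert sum(c[i][j] * x[i][j] for i in range(3) for j in range(4)) == 63
```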
Suppose we have a network (V, E) with a single source 1 and a single sink n (but
how much stuff comes out of the source or into the sink is not fixed). There are no
costs in transportation, but each edge (i, j) has a capacity m̄ij = Cij. We assume
for convenience that m̲ij = 0 for all (i, j) ∈ E. We want to transport as much stuff
from 1 to n as possible. We can write the problem as
maximize δ subject to

Σ_{j:(i,j)∈E} xij − Σ_{j:(j,i)∈E} xji =   δ    if i = 1
                                         −δ    if i = n      for each i,
                                          0    otherwise

together with the capacity constraints 0 ≤ xij ≤ Cij .
In fact we can turn this into a minimum-cost flow problem. We add an edge from
n to 1 with cost −1 and infinite capacity, and let flow be conserved in this network
(a circulation problem). Then the minimal cost flow will maximize the flow on
(n, 1), and hence maximize the flow from 1 to n through the network. However,
we will see that there is an easier way to solve the problem.
L. 3-46
If x is a feasible flow vector that sends δ units from 1 to n, then for any cut S ⊆ V
with 1 ∈ S and n ∈ V \ S, we have δ = fx (S, V \ S) − fx (V \ S, S) ≤ C(S).
δ = Σ_{i∈S} ( Σ_{j:(i,j)∈E} xij − Σ_{j:(j,i)∈E} xji ) = fx(S, V ) − fx(V, S)
  = fx(S, V \ S) − fx(V \ S, S) ≤ fx(S, V \ S) ≤ C(S),

since the flow between pairs of vertices within S cancels, fx(V \ S, S) ≥ 0, and
each edge flow is bounded by its capacity.
This says that the flow δ from 1 to n is bounded above by any capacity of any
cut S with 1 ∈ S and n ∈ V \ S, which is obviously true. In fact by the below
theorem, this bound is tight, ie. there is always a cut S such that δ = C(S).
T. 3-47
<Max-flow min-cut theorem> Let δ be an optimal solution for the maximum
flow problem, then δ = min{C(S) : S ⊆ V, 1 ∈ S, n ∈ V \ S}.
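One standard "easier way" is the augmenting-path method of Ford and Fulkerson; the variant sketched below (Edmonds–Karp) always augments along a shortest residual path. The network and capacities here are made up for illustration.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly find a shortest augmenting path in the
    residual network (by BFS) and push the bottleneck capacity along it."""
    res, nodes = {}, set()
    for (u, v), c in cap.items():
        res[(u, v)] = res.get((u, v), 0) + c
        res.setdefault((v, u), 0)   # reverse residual edge
        nodes |= {u, v}
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in nodes:
                if v not in parent and res.get((u, v), 0) > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow             # no augmenting path: done
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        delta = min(res[e] for e in path)   # bottleneck capacity
        for (u, v) in path:
            res[(u, v)] -= delta
            res[(v, u)] += delta
        flow += delta

# a hypothetical network from source 's' to sink 't'
cap = {('s', 'a'): 4, ('s', 'b'): 2, ('a', 'b'): 1,
       ('a', 't'): 3, ('b', 't'): 3}
assert max_flow(cap, 's', 't') == 6   # equals the cut capacity C({s}) = 6
```

Consistent with the max-flow min-cut theorem, the returned value matches the capacity of the cut S = {s}.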
E. 3-49
Consider the diagram with capacities as labelled.
[Figure: a small network from vertex 1 to vertex n with labelled edge capacities.]
Linear Algebra
Let ui + wi ∈ U + W , λ, µ ∈ F. Then
D. 4-7
Let V be a vector space over F and S ⊆ V . The span of S is defined as

⟨S⟩ = span S = { Σ_{i=1}^n λi si : λi ∈ F, si ∈ S, n ≥ 0 }
Note that any subset of S of order 2 has the same span as S. Also S is linearly
dependent since

1 · (1, 0, 0) + 2 · (0, 1, 1) + (−1) · (1, 2, 2) = 0.

S also does not span V since (0, 0, 1) ∉ ⟨S⟩.
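Linear dependence and span membership can be checked mechanically with exact Gaussian elimination: v ∈ ⟨S⟩ iff appending v to S does not increase the rank. A sketch for the set S above:

```python
from fractions import Fraction

def rank(rows):
    """Row rank via Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

S = [(1, 0, 0), (0, 1, 1), (1, 2, 2)]
assert rank(S) == 2                      # S is linearly dependent
assert rank(S + [(0, 0, 1)]) == 3        # so (0, 0, 1) is not in <S>
```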
• Let X be a set and x ∈ X. Define the function δx : X → F by

δx(y) = 1  if y = x,      δx(y) = 0  if y ≠ x.
Suppose that we have already found T′r ⊆ T of order 0 ≤ r < n such that
Tr = (T \ T′r) ∪ {e1 , · · · , er } spans V . Note that the case r = 0 is true, since we
can take T′r = ∅; and the case r = n is the theorem which we want to achieve.
Suppose we have these. Since Tr spans V , we can write

e_{r+1} = Σ_{i=1}^k λi ti ,      λi ∈ F, ti ∈ Tr .

We know that the ei are linearly independent, so not all ti 's are ei 's. So there is
some j such that tj ∈ (T \ T′r) (with λj ≠ 0). We can write this as

tj = (1/λj) e_{r+1} + Σ_{i≠j} (−λi/λj) ti .

We let T′_{r+1} = T′r ∪ {tj }, of order r + 1, and

T_{r+1} = (T \ T′_{r+1}) ∪ {e1 , · · · , e_{r+1} } = (Tr \ {tj }) ∪ {e_{r+1} }
Let {u1 , · · · , um } be a basis for U and extend this to a basis {u1 , · · · , um , vm+1 , · · · , vn }
for V . We want to show that S = {vm+1 + U, · · · , vn + U } is a basis for V /U .
Suppose v + U ∈ V /U . Since we can write v = Σ λi ui + Σ µi vi ,

v + U = Σ µi (vi + U ) + Σ λi (ui + U ) = Σ µi (vi + U ).

(Recall, in the more familiar Part IA notation, (ab)H = (aH)(bH). Also from
[E.4-6] we know λ(aH) = (λa)H.) So S spans V /U .
To show that they are linearly independent, suppose that Σ λi (vi + U ) = 0 + U = U .
This requires Σ λi vi ∈ U . Then we can write this as a linear combination
of the ui 's, ie. Σ λi vi = Σ µj uj for some µj . Since {u1 , · · · , um , vm+1 , · · · , vn }
is a basis for V , we must have λi = µj = 0 for all i, j. So {vi + U } is linearly
independent.
We can view this as a linear algebra version of Lagrange's theorem. This combined
with the first isomorphism theorem for vector spaces gives the rank-nullity theorem.
This is because if A is a linear map on V with kernel U , then the first isomorphism
theorem says V /U ≅ Im A, so dim V = dim U + dim Im A = n(A) + r(A).
Note that this result also implies [L.4-15] since if U is proper, then dim(V /U ) > 0,
so dim U = dim V − dim(V /U ) < dim V .
D. 4-18
• Suppose V is a vector space over F and U, W subspaces of V . We say that V is
the (internal) direct sum of U and W if U + W = V and U ∩ W = 0. We write
V = U ⊕ W.
Equivalently, this requires that every v ∈ V can be written uniquely as u + w with
u ∈ U, w ∈ W . We say that U and W are complementary subspaces of V .
• If U1 , · · · , Un ⊆ V are subspaces of V , then V is the (internal) direct sum

V = U1 ⊕ · · · ⊕ Un = ⊕_{i=1}^n Ui

if every v ∈ V can be written uniquely as v = Σ ui with ui ∈ Ui . This can be
extended to an infinite sum with the same definition, but the sum v = Σ ui still
has to be finite.
• If U, W are vector spaces over F, the (external) direct sum is
U ⊕ W = {(u, w) : u ∈ U, w ∈ W },
with pointwise operations. This can be made into an infinite sum if we require
that all but finitely many of the ui have to be zero.
E. 4-19
The difference between internal and external direct sum is that the first is decom-
posing V into smaller spaces, while the second is building a bigger space based on
two spaces.
Note, however, that the external direct sum U ⊕ W is the internal direct sum
of U and W viewed as subspaces of U ⊕ W , ie. as the internal direct sum of
{(u, 0) : u ∈ U } and {(0, w) : w ∈ W }. So these two are indeed compatible
notions, and this is why we give them the same name and notation.
E. 4-20
Let V = R², and U = ⟨(0, 1)⟩. Then ⟨(1, 1)⟩ and ⟨(1, 0)⟩ are both complementary
subspaces to U in V .
D. 4-21
• Let U, V be vector spaces over F. Then α : U → V is a linear map if
1. α(u1 + u2 ) = α(u1 ) + α(u2 ) for all ui ∈ U .
2. α(λu) = λα(u) for all λ ∈ F, u ∈ U .
We write L(U, V ) for the set of linear maps U → V .
• We say a linear map α : U → V is an isomorphism if there is some β : V → U
(also linear) such that α ∘ β = idV and β ∘ α = idU . If there exists an isomorphism
U → V , we say U and V are isomorphic, and write U ≅ V .
• Let α : U → V be a linear map. Then the image of α is Im α = {α(u) : u ∈ U }.
The kernel of α is ker α = {u ∈ U : α(u) = 0}.
E. 4-22
• Note that we can combine the two requirements to the single requirement that
α(λu1 + µu2 ) = λα(u1 ) + µα(u2 ).
• It is easy to see that if α is linear, then it is a group homomorphism (if we view
vector spaces as groups). In particular, α(0) = 0.
• If we want to stress the field F, we say that α is F-linear. For example, complex
conjugation is a map C → C that is R-linear but not C-linear (since (iz)* =
−i z* ≠ i z*).
E. 4-23
• Let A be an n × m matrix with coefficients in F. We will write A ∈ Mn,m (F).
Then α : Fm → Fn defined by v ↦ Av is linear:

α(λu + µv)i = Σ_{j=1}^m Aij (λu + µv)j = λ Σ_{j=1}^m Aij uj + µ Σ_{j=1}^m Aij vj = λα(u)i + µα(v)i .
for some p, q ∈ C ∞ (R, R). Then if y(t) ∈ Im β, then there is a solution (in
C ∞ (R, R)) to the differential equation f 00 (t) + p(t)f 0 (t) + q(t)f (t) = y(t). Simi-
larly, ker β contains the solutions to the homogeneous equation f 00 (t) + p(t)f 0 (t) +
q(t)f (t) = 0.
P. 4-26
Let α : U → V be an F-linear map.
i. If α is injective and S ⊆ U is linearly independent, then α(S) is linearly
independent in V .
ii. If α is surjective and S ⊆ U spans U , then α(S) spans V .
iii. If α is an isomorphism and S ⊆ U is a basis, then α(S) is a basis for V .
In particular, if U and V are finite-dimensional vector spaces over F and
α : U → V is an isomorphism, then dim U = dim V .
Next, to prove surjectivity, suppose that (v1 , · · · , vn ) is an ordered basis for V . Let
α((x1 , · · · , xn )) = Σ xi vi ; we just need to show that α is an isomorphism Fn → V ,
since if it is, then by construction Φ(α) = (v1 , · · · , vn ). It is easy to check that α
is well-defined and linear. We also know that α is injective since (v1 , · · · , vn ) is
linearly independent: if Σ xi vi = Σ yi vi , then xi = yi . Also, α is surjective
since (v1 , · · · , vn ) spans V . So α is an isomorphism.
This result shows that if V is any F-vector space of dimension n < ∞, then there
must be an isomorphism Fn → V , so V is isomorphic to Fn . So in fact any two
F-vector spaces of dimension n < ∞ must be isomorphic. Choosing a basis for an
n-dimensional vector space V corresponds to choosing an identification of V with
Fn .
P. 4-28
Suppose U, V are vector spaces over F and S = {e1 , · · · , en } is a basis for U . Then
every function f : S → V extends uniquely to a linear map U → V .
This illustrates that to define a linear map, it suffices to define its values on a basis.
In fact this result can also be extended to the infinite-dimensional case. It is not
hard to see that the only subsets of U that satisfy the conclusions of the proposition
are bases: spanning ensures uniqueness, and linear independence ensures existence
(well-definedness).
P. 4-29
Let Matn,m (F) be the set of n × m matrices over F. Suppose U and V are finite-
dimensional vector spaces over F with bases (e1 , · · · , em ) and (f1 , · · · , fn ) respec-
tively.
1. For any A = (aij ) ∈ Matn,m (F), by the above proposition there is a unique
linear map α such that α(ei ) = Σ_j aji fj .
We can interpret this as follows: the ith column of A tells us how to write α(ei )
in terms of the fj .
We can also draw a fancy diagram to display this result. Given bases e1 , · · · , em ,
by [P.4-27] we get an isomorphism s(ei ) : U → Fm . Similarly, we get an
isomorphism s(fi ) : V → Fn . Since a matrix is a linear map A : Fm → Fn , given
a matrix A, we can produce a linear map α : U → V via the composition

α = s(fi )⁻¹ ∘ A ∘ s(ei ) :   U → Fm → Fn → V.

We can put this into a commutative square, with s(ei ) and s(fi ) as the vertical
maps, A : Fm → Fn along the top and α : U → V along the bottom. Then the
corollary tells us that every A gives rise to an α, and every α corresponds to an
A that fits into this diagram.
D. 4-30
• We call the matrix corresponding to a linear map α ∈ L(U, V ) under [P.4-29]
the matrix representing α with respect to the bases (e1 , · · · , em ) and (f1 , · · · , fn ).
P. 4-31
Suppose U, V, W are finite-dimensional vector spaces over F with bases R =
(u1 , · · · , ur ), S = (v1 , · · · , vs ) and T = (w1 , · · · , wt ) respectively. If α : U → V and
β : V → W are linear maps represented by A and B respectively (with respect to
R, S and T ), then βα is linear and represented by BA with respect to R and T .

βα(ui ) = β( Σ_k Aki vk ) = Σ_k Aki β(vk ) = Σ_k Aki Σ_j Bjk wj = Σ_j ( Σ_k Bjk Aki ) wj = Σ_j (BA)ji wj

(In diagram form: Fr → Fs → Ft along the top via A then B, with U → V → W
along the bottom via α then β.)
T. 4-32
<First isomorphism theorem> Let α : U → V be a linear map. Then
ker α and Im α are subspaces of U and V respectively. Moreover, α induces an
isomorphism ᾱ : U/ ker α → Im α with ᾱ(u + ker α) = α(u).
Note that if we view a vector space as an abelian group, then this is the first
isomorphism theorem of Part IA Groups, but with a little twist: here we don't
just have a group homomorphism, we also need to care about conditions on scalar
multiplication.
T. 4-33
<Rank-nullity theorem> If α : U → V is a linear map and U is finite-
dimensional, then r(α) + n(α) = dim U .
Most of the work is hidden in [P.4-17]. In fact, conversely, the rank-nullity theorem
also implies [P.4-17]. Below is a direct proof of the rank-nullity theorem that doesn't
involve quotient spaces, which is also the Part IA proof.
P. 4-34
If α : U → V is a linear map between finite-dimensional vector spaces over F, then
there are bases (e1 , · · · , em ) for U and (f1 , · · · , fn ) for V such that α is represented
by the n × m block matrix ( Ir 0 ; 0 0 ), where r = r(α) and Ir is the r × r identity matrix.
In particular, r(α) + n(α) = dim U = m.
P. 4-37
Suppose α : U → V is a linear map between vector spaces over F both of dimension
n < ∞. Then
(i)[[ α is injective ]] ⇐⇒ (ii)[[ α is surjective ]] ⇐⇒ (iii)[[ α is an isomorphism ]]
It is clear that, (iii) implies (i) and (ii), and (i) and (ii) together implies (iii). So
it suffices to show that (i) and (ii) are equivalent.
Note that α is injective iff n(α) = 0, and α is surjective iff r(α) = dim V = n. By
the rank-nullity theorem, n(α) + r(α) = n. So the result follows.
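The rank-nullity theorem can also be checked computationally: row-reduce the matrix of α, read off the rank from the pivot columns, and build one kernel vector per free column. A sketch, where the matrix A is made up for illustration:

```python
from fractions import Fraction

def rref(rows):
    """Reduced row echelon form over the rationals; returns (R, pivot columns)."""
    M = [[Fraction(v) for v in row] for row in rows]
    pivots, r = [], 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        M[r] = [v / M[r][col] for v in M[r]]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        pivots.append(col)
        r += 1
    return M, pivots

# alpha : F^4 -> F^3 given by a made-up 3x4 matrix (third row = first + second)
A = [[1, 2, 0, 1],
     [0, 1, 1, 0],
     [1, 3, 1, 1]]
R, pivots = rref(A)
free = [j for j in range(4) if j not in pivots]

# one kernel basis vector per free column
kernel = []
for f_col in free:
    v = [Fraction(0)] * 4
    v[f_col] = Fraction(1)
    for row, p in enumerate(pivots):
        v[p] = -R[row][f_col]
    kernel.append(v)

# each kernel vector really maps to zero, and r(alpha) + n(alpha) = dim U = 4
for v in kernel:
    assert all(sum(A[i][j] * v[j] for j in range(4)) == 0 for i in range(3))
assert len(pivots) + len(kernel) == 4
```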
L. 4-38
Let A ∈ Mn,n (F) = Mn (F) be a square matrix. Then
(i)[[ ∃B ∈ Mn (F) s.t. BA = In ]] ⇐⇒ (ii)[[ ∃C ∈ Mn (F) s.t. AC = In ]].
If these hold, then B = C, and we call A invertible or non-singular , and write
A−1 = B = C.
(i)⇔[[ there exists linear map β s.t. βα = ι ]]⇔[[ α is injective ]]⇔[[ α is an isomor-
phism ]]⇔[[ α has an inverse α−1 ]]⇔[[ α is isomorphism ]]⇔[[ α is surjective ]]⇔[[ there
exists linear map γ s.t. αγ = ι ]]⇔(ii)
Note [[ α is injective ]]⇒[[ there exists linear map β s.t. βα = ι ]] is actually because
[[ α is injective ]]⇒[[ α is an isomorphism ]]⇒[[ there exists linear map β s.t. βα = ι ]].
Similarly for [[ α is surjective ]]⇒[[ there exists linear map γ s.t. αγ = ι ]].
T. 4-39
Suppose that (e1 , · · · , em ) and (u1 , · · · , um ) are basis for a finite-dimensional
vector space U over F, and (f1 , · · · , fn ) and (v1 , · · · , vn ) are basis of a finite-
dimensional vector space V over F. Let α : U → V be a linear map represented
by a matrix A with respect to (ei ) and (fi ) and by B with respect Pm to (ui ) and
(vi ). PThen B = Q−1 AP where P and Q are given by ui = k=1 Pki ek and
vi = n k=1 Qki fk .
Note that one can view P as the matrix representing the identity map iU from U
with basis (ui ) to U with basis (ei ), and similarly for Q. So both are invertible.
n
X XX X
α(ui ) = Bji vj = Bji Q`j f` = (QB)`i f`
j=1 j ` `
m
! m
X X X X
α(ui ) = α Pki ek = Pki A`k f` = (AP )`i f`
k=1 k=1 ` `
α
U V
The diagram on the right shows the the linear map α : U → V
represented by A in basis {ei } for U and basis {fi } for V . (ei ) (fi )
ιU α ιV
Then if we want a matrix representing the U U V V
map U → V with respect to bases (ui )
(ui ) (ei ) (fi ) (vi )
and (vi ), we can write it as the composi-
tion B = Q−1 AP . P A Q
Fm Fm Fn Fn
D. 4-40
• We say A, B ∈ Matn,m (F) are equivalent if there are (invertible) matrices P ∈
GLm (F) and Q ∈ GLn (F) such that B = Q−1 AP .
E. 4-41
Since GLk (F) = {A ∈ Matk (F) : A is invertible} is a group, for each k ≥ 1,
equivalence of matrices is indeed an equivalence relation. The equivalence classes
are orbits under the action of GLm (F) × GLn (F), given by

GLm (F) × GLn (F) × Matn,m (F) → Matn,m (F)  with  (P, Q, A) ↦ QAP⁻¹.

Two matrices are equivalent if and only if they represent the same linear map with
respect to different bases. Hence by [P.4-34]: if A ∈ Matn,m (F), then there exist
invertible matrices P ∈ GLm (F), Q ∈ GLn (F) so that Q⁻¹AP = ( Ir 0 ; 0 0 ) for
some 0 ≤ r ≤ min(m, n). This also tells us that there are min(m, n) + 1 orbits of
the action, one for each r ∈ {0, 1, · · · , min(m, n)}.
E. 4-42
Note that if α : Fm → Fn is the linear map represented by A (with respect to the
standard basis), then r(A) = r(α), ie. the column rank is the rank. Moreover,
since the rank of a map is independent of the basis, equivalent matrices have the
same column rank.
T. 4-43
r(A) = r(Aᵀ) for any A ∈ Matn,m (F). (The row rank equals the column rank.)
We know that there are some invertible P, Q such that Q⁻¹AP = ( Ir 0 ; 0 0 ),
where r = r(A). We can transpose this whole equation to obtain (Q⁻¹AP )ᵀ =
PᵀAᵀ(Qᵀ)⁻¹ = ( Ir 0 ; 0 0 ). So r(Aᵀ) = r.
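This is easy to verify computationally with exact elimination; the matrix below is made up for illustration (its second row is twice the first, and its third row is the first minus the fourth):

```python
from fractions import Fraction

def rank(rows):
    """Row rank via Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            if M[i][col] != 0:
                f = M[i][col] / M[r][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

A = [[1, 2, 3],
     [2, 4, 6],
     [1, 0, 1],
     [0, 2, 2]]
At = [list(col) for col in zip(*A)]   # transpose
assert rank(A) == rank(At) == 2       # row rank equals column rank
```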
D. 4-44
A matrix in GLn (F) is called an elementary matrix if it differs from the identity
matrix by one single elementary row operation (i.e. switching two rows, multiply-
ing a row by a non-zero scalar, or adding a multiple of one row to another).
E. 4-45
Elementary matrices in GLn (F) consist of matrices of the following three types:
• S^n_ij : the matrix obtained by swapping row i and row j of the identity matrix.
• T^n_i(λ) : a diagonal matrix, with diagonal entries 1 everywhere except in the
ith position, where it is λ (with λ ≠ 0).
• E^n_ij(λ) : the identity matrix but with a λ in the (i, j) position instead of 0.
................................................................................
Observe that if A is an m × n matrix, then
• A S^n_ij is the matrix A with the ith and jth columns swapped.
• A T^n_i(λ) is the matrix A with the ith column multiplied by λ.
• A E^n_ij(λ) is the matrix A with λ times its ith column added to its jth column.
Multiplying on the left by an m × m elementary matrix instead of on the right
results in the same operations being performed on the rows instead of the columns.
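These column operations are easy to verify directly by multiplying out small matrices; the 2 × 3 matrix A below is made up for illustration.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def S(n, i, j):                 # swap rows i and j of I_n
    M = identity(n)
    M[i], M[j] = M[j], M[i]
    return M

def T(n, i, lam):               # scale the ith diagonal entry by lam
    M = identity(n)
    M[i][i] = lam
    return M

def E(n, i, j, lam):            # put lam in position (i, j)
    M = identity(n)
    M[i][j] = lam
    return M

A = [[1, 2, 3],
     [4, 5, 6]]                 # 2x3, so right-multipliers must be 3x3

assert matmul(A, S(3, 0, 2)) == [[3, 2, 1], [6, 5, 4]]       # swap cols 0, 2
assert matmul(A, T(3, 1, 10)) == [[1, 20, 3], [4, 50, 6]]    # scale col 1
# E(3, 0, 2, 10): add 10 times column 0 to column 2
assert matmul(A, E(3, 0, 2, 10)) == [[1, 2, 13], [4, 5, 46]]
```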
P. 4-46
If A ∈ Matn,m (F), then there exists invertible matrices P ∈ GLm (F) and Q ∈
GLn (F) so that Q−1 AP = ( I0r 00 ) for some 0 ≤ r ≤ min(m, n).
4.3 Duality
D. 4-47
• Let V be a vector space over F. The dual space of V is defined as V ∗ =
L(V, F) = {θ : V → F : θ linear}. Elements of V ∗ are called linear functionals or
linear forms .
L. 4-49
If V is a finite-dimensional vector space over F with basis (e1 , · · · , en ), then there
is a basis (ε1 , · · · , εn ) for V ∗ such that εi (ej ) = δij . In particular dim V = dim V ∗ .
Since linear maps are characterized by their values on a basis, there are unique
ε1 , · · · , εn ∈ V ∗ such that εi (ej ) = δij . Now we show that (ε1 , · · · , εn ) is a basis.
Given any θ ∈ V∗, we can write θ uniquely as a combination of ε1, · · · , εn because
θ = Σ_{i=1}^n λi εi ⇐⇒ θ(ej) = Σ_{i=1}^n λi εi(ej) for all j ⇐⇒ λj = θ(ej).
1
It might seem like the definitions are not consistent, and W⁰ should be a subset of V∗∗ and not V. We will later show that there is a canonical isomorphism between V∗∗ and V, and this will all make sense.
Write Q = P⁻¹ so that ej = Σ_{k=1}^n Qkj fk. Now εi = Σ_{ℓ=1}^n (P^T)ℓi ηℓ because
( Σ_{ℓ=1}^n (P^T)ℓi ηℓ )(ej) = Σ_{ℓ=1}^n Piℓ ηℓ( Σ_{k=1}^n Qkj fk ) = Σ_{k,ℓ} Piℓ δℓk Qkj = Σ_ℓ Piℓ Qℓj = (PQ)ij = δij.
E. 4-51
Consider R3 with standard basis (e1 , e2 , e3 ) and (R3 )∗ with dual basis (ε1 , ε2 , ε3 ).
If U = ⟨e1 + 2e2 + e3⟩ and W = ⟨ε1 − ε3, 2ε1 − ε2⟩, then U⁰ = W and W⁰ = U.
We see that the dimension of U and U 0 add up to three, which is the dimension
of R3 . This is typical.
P. 4-52
Let V be finite-dimensional vector space over F and U a subspace. Then dim U +
dim U 0 = dim V .
(Proof 1) Let (e1, · · · , ek) be a basis for U and extend it to a basis (e1, · · · , en) for V. Consider the dual basis for V∗, say (ε1, · · · , εn). We will prove the result by showing that U⁰ = ⟨εk+1, · · · , εn⟩. If j > k, then εj(ei) = 0 for all i ≤ k, so εk+1, · · · , εn ∈ U⁰. On the other hand, suppose θ ∈ U⁰. Then we can write θ = Σ_{j=1}^n λj εj. But then 0 = θ(ei) = λi for i ≤ k, so θ ∈ ⟨εk+1, · · · , εn⟩. Hence dim U⁰ = n − k = dim V − dim U.
(Proof 2) Consider the restriction map V∗ → U∗ given by θ ↦ θ|U. This is obviously linear. Since every linear map U → F can be extended to V → F, this is a surjection. Moreover, its kernel is U⁰. So by the rank-nullity theorem, dim V = dim V∗ = dim U⁰ + dim U∗ = dim U⁰ + dim U.
(Proof 3) Consider the map α : U⁰ → (V/U)∗ such that α(θ)(v + U) = θ(v). This is well-defined since [[ v + U = u + U ]] ⇒ [[ v − u ∈ U ]] ⇒ [[ θ(v) = θ(u) ]]. This is a linear map since for all u + U ∈ V/U we have α(λθ + µφ)(u + U) = (λθ + µφ)(u) = λθ(u) + µφ(u) = λα(θ)(u + U) + µα(φ)(u + U). It is injective since [[ α(θ) = 0 ]] ⇒ [[ θ(v) = 0 for all v ]] ⇒ [[ θ = 0 ]]. It is also surjective since given any σ ∈ (V/U)∗, the θ ∈ U⁰ defined by θ(v) = σ(v + U) satisfies α(θ) = σ. Hence α is an isomorphism and U⁰ ≅ (V/U)∗, so dim U⁰ = dim(V/U)∗ = dim(V/U) = dim V − dim U by [P.4-17].
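Identifying (Fⁿ)∗ with row vectors, the annihilator of U = span(columns of M) is the left null space {θ : θM = 0}, so its dimension is n − r(M), in agreement with the proposition. A small numerical check (numpy assumed; the matrix M is an arbitrary example):

```python
import numpy as np

# Columns of M span a subspace U of R^3; functionals vanishing on U are the
# row vectors theta with theta @ M = 0, i.e. the left null space of M.
M = np.array([[1., 0.],
              [2., 1.],
              [1., 1.]])
n = M.shape[0]
dim_U = int(np.linalg.matrix_rank(M))
dim_U0 = n - int(np.linalg.matrix_rank(M.T))  # nullity of M^T
```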
D. 4-53
Let V, W be vector spaces over F and α : V → W a linear map. The dual map
to α, written α∗ : W ∗ → V ∗ is given by θ 7→ θ ◦ α.
E. 4-54
Note that since the composite of linear maps is linear, α∗ (θ) ∈ V ∗ for all θ ∈ W ∗ .
P. 4-55
Let α ∈ L(V, W) be a linear map; then α∗ ∈ L(W∗, V∗), i.e. α∗ is a linear map.
α∗ (λθ1 + µθ2 )(v) = (λθ1 + µθ2 )(αv) = λθ1 (α(v)) + µθ2 (α(v))
= (λα∗ (θ1 ) + µα∗ (θ2 ))(v).
P. 4-56
Let V, W be finite-dimensional vector spaces over F and α : V → W be a linear
map. If α is represented by A with respect to basis (e1 , · · · , en ) and (f1 , · · · , fm )
for V and W , then α∗ is represented by AT with respect to the corresponding dual
bases.
α∗(ηi)(ej) = ηi(α(ej)) = ηi( Σ_{k=1}^m Akj fk ) = Aij = Σ_{k=1}^n (A^T)ki εk(ej). Since this is true for all j, we "take away" the (ej), so α∗(ηi) = Σ_{k=1}^n (A^T)ki εk.
Note that if α : U → V and β : V → W, θ ∈ W∗, then (βα)∗(θ) = θ ◦ β ◦ α = α∗(θ ◦ β) = (α∗ ◦ β∗)(θ), so (βα)∗ = α∗ ◦ β∗.
1. Note that α∗∗ ◦ ev and ev ◦α are both linear maps V → W ∗∗ . Then α∗∗ ◦ ev =
ev ◦α since for any v ∈ V and θ ∈ W ∗ we have
α∗∗(ev(v))(θ) = ev(v)(α∗(θ)) = α∗(θ)(v) = θ(α(v)) = ev(α(v))(θ)
For the second result, first note that ev(U1) ∩ ev(U2) = U1⁰⁰ ∩ U2⁰⁰ = (U1⁰ + U2⁰)⁰. Now as ev is an isomorphism and using 2., we have ev(U1 ∩ U2) = ev(U1) ∩ ev(U2) = (U1⁰ + U2⁰)⁰, i.e. (U1 ∩ U2)⁰⁰ = (U1⁰ + U2⁰)⁰; since taking annihilators is injective on subspaces, (U1 ∩ U2)⁰ = U1⁰ + U2⁰.
Note that if we think of ev(v) and v as the same thing and abuse the notation
and write ev(v) = v, then the result α∗∗ ◦ ev = ev ◦α is simply α∗∗ = α and
U 00 = ev(U ) is just U 00 = U . So again we can think of them as “the same”.
Another way to get the result α∗∗ = α is by considering bases: Let (e1, · · · , en) be a basis for V and (f1, · · · , fm) be a basis for W, and let (ε1, · · · , εn) and (η1, · · · , ηm) be the corresponding dual bases. We know that if α is represented by A, then α∗ is represented by A^T, so α∗∗ is represented by (A^T)^T = A, the same matrix as α (under the identification of the double dual bases with the original ones).
E. 4-61
• The map V × V ∗ → F defined by (v, θ) 7→ θ(v) = ev(v)(θ) is a bilinear form.
C. 4-62
<Matrix representation> Let (e1, · · · , en) be a basis for V and (f1, · · · , fm) be a basis for W, and ψ : V × W → F a bilinear form. Define the matrix Aij = ψ(ei, fj). For any v ∈ V and w ∈ W, write v = Σᵢ vi ei and w = Σⱼ wj fj; then by linearity, we get
ψ(v, w) = ψ( Σᵢ vi ei, w ) = Σᵢ vi ψ(ei, w) = Σᵢ vi ψ( ei, Σⱼ wj fj ) = Σ_{i,j} vi wj ψ(ei, fj) = v_e^T A w_f,
where v_e and w_f denote the coordinate column vectors of v and w.
P. 4-63
Suppose (e1, · · · , en) and (v1, · · · , vn) are bases for V such that vi = Σ_{k=1}^n Pki ek for all i = 1, · · · , n; and (f1, · · · , fm) and (w1, · · · , wm) are bases for W such that wj = Σ_{ℓ=1}^m Qℓj fℓ for all j = 1, · · · , m. If ψ : V × W → F is a bilinear form represented by A with respect to (e1, · · · , en) and (f1, · · · , fm), and by B with respect to the bases (v1, · · · , vn) and (w1, · · · , wm), then B = P^T AQ.
Bij = ψ(vi, wj) = ψ( Σ_k Pki ek, Σ_ℓ Qℓj fℓ ) = Σ_{k,ℓ} Pki Qℓj ψ(ek, fℓ) = Σ_{k,ℓ} (P^T)ik Akℓ Qℓj = (P^T AQ)ij.
The difference between this and the transformation laws of matrices representing
linear maps is that this time we are taking transposes, not inverses. Note that
while the transformation laws for bilinear forms and linear maps are different, we
still get that two matrices are representing the same bilinear form with respect
to different bases if and only if they are equivalent, since if B = P T AQ, then
B = ((P −1 )T )−1 AQ.
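The transformation law B = P^T AQ can be checked directly: the coordinate vectors of the new basis vectors are the columns of P and Q, so ψ(vi, wj) computed in the old basis must equal Bij. A sketch (numpy assumed; random matrices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # psi w.r.t. (e_i), (f_j): dim V = 3, dim W = 4
P = rng.standard_normal((3, 3))   # columns = e-coordinates of the new basis v_i
Q = rng.standard_normal((4, 4))   # columns = f-coordinates of the new basis w_j

B = P.T @ A @ Q                   # claimed matrix of psi in the new bases

# psi(v_i, w_j) evaluated directly via the old-basis formula psi(v, w) = v^T A w
i, j = 1, 2
direct = P[:, i] @ A @ Q[:, j]
```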
D. 4-64
Let ψ : V × W → F be a bilinear form,
• The kernel of ψL is called the left kernel of ψ, while the kernel of ψR is the
right kernel of ψ.
• ψ is non-degenerate if the left and right kernels are both trivial. We say ψ is
degenerate otherwise.
E. 4-65
If we are given a bilinear form ψ : V × W → F, we immediately get two linear maps ψL : V → W∗ and ψR : W → V∗, given by ψL(v)(w) = ψ(v, w) = ψR(w)(v). Note that ψL is indeed linear since for any fixed w ∈ W, ψL(λu + µv)(w) = ψ(λu + µv, w) = λψ(u, w) + µψ(v, w) = λψL(u)(w) + µψL(v)(w), hence ψL(λu + µv) = λψL(u) + µψL(v). Similarly ψR is linear.
Also note that the rank of ψ is well-defined since r(P T AQ) = r(A) for invertible
P and Q.
E. 4-66
• If ψ : V × V ∗ → F, is defined by (v, θ) 7→ θ(v), then ψL : V → V ∗∗ is the
evaluation map. On the other hand, ψR : V ∗ → V ∗ is the identity map.
• For bilinear form ψ : V × W → F, v ∈ V is in the left kernel if ψ(v, w) = 0 for
all w ∈ W . More generally, for T ⊆ V , we can define T ⊥ = {w ∈ W : ψ(t, w) =
0 for all t ∈ T } and similarly for U ⊆ W we define ⊥ U = {v ∈ V : ψ(v, u) =
0 for all u ∈ U }. Then V ⊥ = ker ψR and ⊥ W = ker ψL .
• Note that for ψ : V × W → F we have ψL(v)(w) = ψ(v, w) = ψR(w)(v), from which we can easily show that ψR = ψL∗ ◦ ev and ψL = ψR∗ ◦ ev.
L. 4-67
Let (e1, · · · , en), (f1, · · · , fm) be bases of V, W respectively and (ε1, · · · , εn), (η1, · · · , ηm) their dual bases on V∗, W∗. If A represents ψ with respect to (e1, · · · , en) and (f1, · · · , fm), then
• A also represents ψR with respect to (f1, · · · , fm) and (ε1, · · · , εn);
• A^T represents ψL with respect to (e1, · · · , en) and (η1, · · · , ηm).
ψL(ei)(fj) = ψ(ei, fj) = Aij = Σ_ℓ Aiℓ ηℓ(fj), so ψL(ei) = Σ_ℓ (A^T)ℓi ηℓ and hence A^T represents ψL. We also have ψR(fj)(ei) = Aij, so ψR(fj) = Σ_k Akj εk.
Note that this says that the rank of ψ is the same as the rank of ψL and ψR .
L. 4-68
Let V and W be finite-dimensional vector spaces over F with bases (e1 , · · · , en )
and (f1 , · · · , fm ) respectively, and let ψ : V ×W → F be a bilinear form represented
by A with respect to these bases. Then ψ is non-degenerate if and only if A is (square and) invertible. In particular, V and W have the same dimension if ψ is non-degenerate.
Since ψR and ψL are represented by A and AT (in some order), they both have
trivial kernel iff n(A) = n(AT ) = 0 iff dim W = r(AT ) = r(A) = dim V with A
having full rank, ie. the corresponding linear map is bijective.
E. 4-69
The map F² × F² → F defined by ( (a, c)^T, (b, d)^T ) ↦ ad − bc is a bilinear form. This, obviously, corresponds to the determinant of a 2 × 2 matrix. We have ψ(v, w) = −ψ(w, v) for all v, w ∈ F².
E. 4-71
• If n = 2, then S2 = {id, (1 2)}, so det A = A11 A22 − A12 A21 .
L. 4-72
1. det A = det A^T.
2. If A is an upper triangular matrix (ie. Aij = 0 for all i > j), then det A = Π_{i=1}^n Aii.
1. Let τ = σ⁻¹. Note that ε(τ) = ε(σ) and Π_{i=1}^n A_{σ(i)i} = Π_{j=1}^n A_{jτ(j)} (substitute j = σ(i)). So
det A^T = Σ_{τ∈Sn} ε(τ) Π_{j=1}^n (A^T)_{τ(j)j} = Σ_{τ∈Sn} ε(τ) Π_{j=1}^n A_{jτ(j)} = Σ_{σ∈Sn} ε(σ) Π_{i=1}^n A_{σ(i)i} = det A.
2. If A is upper triangular, then A_{σ(i)i} = 0 unless σ(i) ≤ i. A permutation with σ(i) ≤ i for all i must be the identity (by induction from i = 1), so the only surviving term is Π_{i=1}^n Aii.
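The Leibniz formula det A = Σ_σ ε(σ) Π_i A_{σ(i)i} used in this proof can be implemented directly (pure Python; only practical for small n). The sketch below checks both parts of the lemma on an upper triangular matrix:

```python
from itertools import permutations

def sign(perm):
    """epsilon(sigma): sign of a permutation (as a tuple), via inversion count."""
    s = 1
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[i] > perm[j]:
                s = -s
    return s

def prod(xs):
    out = 1
    for x in xs:
        out *= x
    return out

def det(A):
    """Leibniz formula: det A = sum over sigma of eps(sigma) * prod_i A[sigma(i)][i]."""
    n = len(A)
    return sum(sign(p) * prod(A[p[i]][i] for i in range(n))
               for p in permutations(range(n)))

A = [[2, 1, 7],
     [0, 3, 5],
     [0, 0, 4]]                      # upper triangular
At = [list(row) for row in zip(*A)]  # its transpose
```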
L. 4-73
Let the n vectors A^{(i)} ∈ Fⁿ (1 ≤ i ≤ n) be the columns of the matrix A = (A^{(1)} A^{(2)} · · · A^{(n)}) ∈ Matn(F). Then det A, viewed as a function of the columns, is a volume form.
It is multilinear since each term ε(σ) Π_{i=1}^n A_{σ(i)i} contains exactly one entry from each column. To show it is alternating, suppose now there are some k, ℓ distinct such that A^{(k)} = A^{(ℓ)} (ie. Aik = Aiℓ for all i). Let τ = σ(k ℓ); then Π_{i=1}^n A_{σ(i)i} = Π_{i=1}^n A_{τ(i)i} (since A_{τ(k)k} = A_{σ(ℓ)ℓ}, A_{τ(ℓ)ℓ} = A_{σ(k)k} and A_{τ(i)i} = A_{σ(i)i} otherwise). So det A = 0 because
det A = Σ_{σ∈Sn} ε(σ) Π_{i=1}^n A_{σ(i)i} = Σ_{τ∈Sn} ε(τ(k ℓ)) Π_{i=1}^n A_{τ(i)i} = −Σ_{τ∈Sn} ε(τ) Π_{i=1}^n A_{τ(i)i} = −det A.
Alternatively, Sn is the union of the cosets An and An(k ℓ), and Σ_{σ∈An} Π_{i=1}^n A_{σ(i)i} = Σ_{τ(k ℓ)∈An} Π_{i=1}^n A_{τ(i)i} = Σ_{τ∈An(k ℓ)} Π_{i=1}^n A_{τ(i)i}. But det A = LHS − RHS = 0.
L. 4-74
Let d be a volume form on Fn . Then swapping two entries changes the sign, ie.
d(v1 , · · · , vi , · · · , vj , · · · , vn ) = −d(v1 , · · · , vj , · · · , vi , · · · , vn ).
0 = d(v1 , · · · , vi + vj , · · · , vi + vj , · · · , vn )
= d(v1 , · · · , vi , · · · , vi , · · · , vn ) + d(v1 , · · · , vi , · · · , vj , · · · , vn )
+ d(v1 , · · · , vj , · · · , vi , · · · , vn ) + d(v1 , · · · , vj , · · · , vj , · · · , vn )
= d(v1 , · · · , vi , · · · , vj , · · · , vn ) + d(v1 , · · · , vj , · · · , vi , · · · , vn ).
T. 4-75
Let d be any volume form on Fⁿ, let {e1, · · · , en} be the standard basis of Fⁿ, and let A = (A^{(1)} · · · A^{(n)}) ∈ Matn(F). Then
1. d(A(1) , · · · , A(n) ) = (det A)d(e1 , · · · , en );
2. d(Av1 , · · · , Avn ) = (det A)d(v1 , · · · , vn ) for any v1 , · · · , vn ∈ Fn .
1. We can compute
d(A^{(1)}, · · · , A^{(n)}) = d( Σ_{i=1}^n Ai1 ei, A^{(2)}, · · · , A^{(n)} ) = Σ_{i=1}^n Ai1 d(ei, A^{(2)}, · · · , A^{(n)})
= Σ_{i,j=1}^n Ai1 Aj2 d(ei, ej, A^{(3)}, · · · , A^{(n)}) = · · · = Σ_{i1,··· ,in} d(e_{i1}, · · · , e_{in}) Π_{j=1}^n A_{ij j}.
We know that lots of these are zero, since if ik = ij for some k ≠ j, then the term is zero. So we are just summing over distinct tuples, ie. when there is some σ such that ij = σ(j). So we get
d(A^{(1)}, · · · , A^{(n)}) = Σ_{σ∈Sn} d(e_{σ(1)}, · · · , e_{σ(n)}) Π_{j=1}^n A_{σ(j)j} = Σ_{σ∈Sn} ε(σ) d(e1, · · · , en) Π_{j=1}^n A_{σ(j)j} = (det A) d(e1, · · · , en).
2. We can rewrite the first part as d(Ae1, · · · , Aen) = (det A)d(e1, · · · , en). Let B be the linear map such that Bei = vi for all i. Define dA(u1, · · · , un) = d(Au1, · · · , Aun); then dA is a volume form since it is multilinear (as ui ↦ Aui is linear) and alternating (as ui = uj implies Aui = Auj). Now using part 1 (applied to both dA and d) we have
d(Av1, · · · , Avn) = dA(Be1, · · · , Ben) = (det B) dA(e1, · · · , en) = (det B)(det A) d(e1, · · · , en) = (det A) d(Be1, · · · , Ben) = (det A) d(v1, · · · , vn).
................................................................................
The identity d(Av1, · · · , Avn) = (det A)d(v1, · · · , vn) says that det A is the volume rescaling factor of an arbitrary parallelepiped, and this is true for any volume form d.
T. 4-76
Let A, B ∈ Matn (F). Then det(AB) = det(A) det(B).
We have proved that (i) ⇒ (ii) above, and the rank-nullity theorem implies (iii)
⇒ (i). So we just need to prove (ii) ⇒ (iii). Suppose r(A) < n. By rank-nullity
theorem, n(A) > 0. So there is some non-zero column vector x = (λ1 , · · · , λn )
such that Ax = 0. Say λk ≠ 0. We define B to be the matrix that agrees with the identity matrix except that its kth column is x, i.e.
    ( 1        λ1          )
    (   ⋱      ⋮           )
    (     1    λ_{k−1}     )
B = (          λk          )
    (          λ_{k+1}  1  )
    (          ⋮         ⋱ )
    (          λn         1)
with 0 in the blank space. Then AB has the kth column Ax, which is identically zero. So det(AB) = 0, but det B = λk ≠ 0. Hence (det A)(det B) = det(AB) = 0 gives det A = 0.
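Multiplicativity of the determinant is easy to check numerically (numpy assumed; random matrices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
lhs = np.linalg.det(A @ B)               # det(AB)
rhs = np.linalg.det(A) * np.linalg.det(B)  # det(A) det(B)
```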
D. 4-79
• Write Âij for the matrix obtained from A by deleting the ith row and jth column.
• Let A ∈ Matn (F). The adjugate matrix of A, written adj A, is the n × n matrix
such that (adj A)ij = (−1)i+j det Âji .
L. 4-80
Let A ∈ Matn(F). Then for any fixed j ∈ {1, 2, · · · , n}, we can expand det A as
det A = Σ_{i=1}^n (−1)^{i+j} Aij det Âij = Σ_{i=1}^n (−1)^{i+j} Aji det Âji.
We just have to prove the first equality; the second equality then follows from det A = det A^T. Let A^{(1)}, · · · , A^{(n)} be the columns of A, so det A = d(A^{(1)}, · · · , A^{(n)}) where d is the volume form induced by the determinant. Since A^{(j)} = Σ_{i=1}^n Aij ei, we can write det A as
det A = d( A^{(1)}, · · · , A^{(j−1)}, Σ_{i=1}^n Aij ei, A^{(j+1)}, · · · , A^{(n)} ) = Σ_{i=1}^n Aij d(A^{(1)}, · · · , A^{(j−1)}, ei, A^{(j+1)}, · · · , A^{(n)}).
The volume form on the last line is the determinant of a matrix B′ which is the matrix A with the jth column replaced by ei. We can make n − j column transpositions and n − i row transpositions (i.e. column transpositions on its transpose) on B′ to obtain the matrix (in block form)
B = ( Âij 0 ; stuff 1 ).
Each transposition changes the sign of the determinant, and by the block form det B = det Âij, so det B′ = (−1)^{(n−j)+(n−i)} det B = (−1)^{i+j} det Âij, which gives the stated expansion.
Note that instead of using volume forms, we could actually prove this directly from the definition, as done in part IA.
T. 4-81
If A ∈ Matn(F), then A(adj A) = (det A)In = (adj A)A. In particular, if det A ≠ 0, then A⁻¹ = (1/det A) adj A.
((adj A)A)jk = Σ_{i=1}^n (adj A)ji Aik = Σ_{i=1}^n (−1)^{i+j} Aik det Âij. (∗)
When j = k, by [L.4-80] this is exactly the column expansion of det A. When j ≠ k, (∗) is the expansion along the jth column of the determinant of the matrix obtained from A by replacing its jth column with its kth column; that matrix has two equal columns, so its determinant is 0. Hence (adj A)A = (det A)In, and the other identity follows by applying this to A^T.
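The adjugate identity can be checked by computing adj A entrywise from the definition (adj A)ij = (−1)^{i+j} det Âji (numpy assumed; the 3 × 3 matrix is an arbitrary invertible example):

```python
import numpy as np

def adjugate(A):
    """(adj A)_ij = (-1)^(i+j) det(A-hat_ji), deleting row j and column i of A."""
    n = A.shape[0]
    adj = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, j, axis=0), i, axis=1)  # A-hat_ji
            adj[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return adj

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 4.]])
product = A @ adjugate(A)                 # should equal (det A) I_3
inverse = adjugate(A) / np.linalg.det(A)  # A^{-1} = (1/det A) adj A
```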
L. 4-82
Let A, B be square matrices, of sizes k × k and ℓ × ℓ say. Then for any C, we have
det ( A C ; 0 B ) = (det A)(det B).
Write X = ( A C ; 0 B ) ∈ Mat_{k+ℓ}(F), so that
det X = Σ_{σ∈S_{k+ℓ}} ε(σ) Π_{i=1}^{k+ℓ} X_{iσ(i)}.
Since Xij = 0 when i > k and j ≤ k, a term vanishes unless σ maps {k+1, · · · , k+ℓ} into itself, and hence also {1, · · · , k} into itself. So the surviving σ are exactly those of the form σ = σ1σ2 with σ1 ∈ Sk permuting {1, · · · , k} and σ2 permuting {k+1, · · · , k+ℓ}, and then ε(σ) = ε(σ1)ε(σ2). Hence
det X = Σ_{σ=σ1σ2} ε(σ1)ε(σ2) Π_{i=1}^k X_{iσ1(i)} Π_{j=1}^ℓ X_{(k+j)σ2(k+j)}
= ( Σ_{σ1∈Sk} ε(σ1) Π_{i=1}^k A_{iσ1(i)} )( Σ_{σ2∈Sℓ} ε(σ2) Π_{j=1}^ℓ B_{jσ2(j)} ) = (det A)(det B).
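The block-triangular determinant formula can be verified numerically (numpy assumed; the block sizes k = 2, ℓ = 3 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((2, 3))
X = np.block([[A, C],
              [np.zeros((3, 2)), B]])   # block upper triangular
lhs = np.linalg.det(X)
rhs = np.linalg.det(A) * np.linalg.det(B)
```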
4.6 Endomorphisms
D. 4-83
• If V is a (finite-dimensional) vector space over F. An endomorphism of V is a
linear map α : V → V . We write End(V ) for the F-vector space of all such linear
maps, and ι for the identity map V → V .
• Properties of endomorphisms that do not depend on the basis we pick are known as invariants .
E. 4-84
Endomorphisms are linear maps from a vector space V to itself. Unlike when we
work with arbitrary linear maps where we are free to choose any basis for the
domain, and any basis for the co-domain, when working with endomorphisms, we
require ourselves to use the same basis for the domain and co-domain, and there
is much more we can study assuming this.
One major objective is to classify all matrices up to similarity, where two ma-
trices are similar if they represent the same endomorphism under different bases.
GLn (F), the group of invertible n × n matrices, acts on Matn (F) by conjugation:
(P, A) 7→ P · A = P AP −1 .
We are conjugating it this way so that the associativity axiom Q·(P ·A) = (Q·P )·A
holds (otherwise we get a right action instead of a left action). Then A and B are
similar iff they are in the same orbit. Since orbits always partition the set, this is
an equivalence relation. Our main goal is to classify the orbits, ie. find a “nice”
representative for each orbit.
L. 4-85
Suppose (e1 , · · · , en ) and (f1 , · · · , fn ) are bases for V and α ∈ End(V ). If A repre-
sents α with respect to (e1 , · · · , en ) and B represents α with respect to (f1 , · · · , fn ),
then B = P⁻¹AP where P is given by fi = Σ_{j=1}^n Pji ej.
A special case of [T.4-39] where we always use the same base for the domain and
co-domain.
L. 4-86
1. If A ∈ Matm,n (F) and B ∈ Matn,m (F), then tr AB = tr BA.
2. If A, B ∈ Matn (F) are similar, then tr A = tr B.
3. If A, B ∈ Matn (F) are similar, then det A = det B.
1. tr AB = Σ_{i=1}^m (AB)ii = Σ_{i=1}^m Σ_{j=1}^n Aij Bji = Σ_{j=1}^n Σ_{i=1}^m Bji Aij = tr BA.
2. If B = P⁻¹AP, then by part 1, tr B = tr(P⁻¹(AP)) = tr((AP)P⁻¹) = tr A.
3. If B = P⁻¹AP, then det B = det(P⁻¹) det A det P = det A.
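Both the cyclic property of the trace (for non-square A, B) and the invariance of trace and determinant under similarity can be checked numerically (numpy assumed; random matrices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 5))   # A is m x n
B = rng.standard_normal((5, 2))   # B is n x m
t1 = np.trace(A @ B)              # trace of an m x m product
t2 = np.trace(B @ A)              # trace of an n x n product

M = rng.standard_normal((3, 3))
P = rng.standard_normal((3, 3))   # invertible with probability 1
N = np.linalg.inv(P) @ M @ P      # N is similar to M
```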
This allows us to define the trace and determinant of an endomorphism since they
are invariant (i.e. independent on basis used). In fact we can define determinant
even without reference to a basis, by defining more general volume forms and
define the determinant as a scaling factor. The trace is slightly more tricky to
define without reference to a basis, but in fact it is the directional derivative of
the determinant at the origin.
D. 4-87
Let V be a finite dimensional vector space and α ∈ End(V ).
To prove the result we need to show that if Σ_{i=1}^k xi = Σ_{i=1}^k yi with xi, yi ∈ E(λi), then xi = yi for all i. We are going to find some clever map that tells us what xi and yi are. Consider βj ∈ End(V) defined by βj = Π_{r≠j}(α − λr ι). Then
βj( Σ_{i=1}^k xi ) = Σ_{i=1}^k Π_{r≠j}(α − λr ι)(xi) = Σ_{i=1}^k Π_{r≠j}(λi − λr) xi = Π_{r≠j}(λj − λr) xj,
since every term with i ≠ j contains the vanishing factor (λi − λi). Similarly, we obtain βj( Σ_{i=1}^k yi ) = Π_{r≠j}(λj − λr) yj. Since we know that Σ xi = Σ yi, we must have Π_{r≠j}(λj − λr) xj = Π_{r≠j}(λj − λr) yj. Since Π_{r≠j}(λj − λr) ≠ 0 (the λi are distinct), we must have xj = yj for all j.
The proof shows that any set of non-zero eigenvectors with distinct eigenvalues is
linearly independent.
T. 4-91
Let α ∈ End(V ) and λ1 , · · · , λk be distinct eigenvalues of α. Write Ei for E(λi ).
Then the following are equivalent:
i. α is diagonalizable.
ii. V has a basis of eigenvectors for α.
iii. V = ⊕_{i=1}^k Ei.
2
Note that some authors in addition call the zero vector 0 an eigenvector.
3
You might be used to the definition χα(t) = det(α − tι) instead. These two definitions are obviously equivalent up to a sign, but this definition has the advantage that χα(t) is always monic, ie. the leading coefficient is 1. However, when doing computations in reality, we often use det(α − tι) instead, since it is easier to negate tι than α.
iv. dim V = Σ_{i=1}^k dim Ei.
i ⇔ ii: Suppose (e1, · · · , en) is a basis for V; then α(ei) = Σⱼ Aji ej where A represents α under the basis. Then A is diagonal iff each ei is an eigenvector.
D. 4-92
• A polynomial over F is an object of the form f (t) = am tm + am−1 tm−1 + · · · +
a1 t + a0 with m ≥ 0, a0 , · · · , am ∈ F. We write F[t] for the set of polynomials over
F.4
• Let f ∈ F[t] be as above. Then the degree of f, written deg f, is the largest m such that am ≠ 0. In particular, deg 0 = −∞.
E. 4-93
Note that deg fg = deg f + deg g and deg(f + g) ≤ max{deg f, deg g}. This also illustrates why it makes sense to single out the 0 polynomial (with degree −∞) from the other constant polynomials (with degree 0).
L. 4-94
<Polynomial division> If f, g ∈ F[t] with g ≠ 0, then there exist q, r ∈ F[t] with deg r < deg g such that f = qg + r.
We prove that given f, g ∈ F[t] with g ≠ 0 and deg f ≥ deg g, there exist q, r ∈ F[t] with deg r < deg f such that f = qg + r. Then we can just repeatedly apply this result, reducing the degree of the remainder each time, to get the stated result.
4
Note that we don’t identify a polynomial f with the corresponding function it represents. For
example, if F = Z/pZ, then tp and t are different polynomials, even though they define the same
function (by Fermat’s little theorem/Lagrange’s theorem). Two polynomials are equal if and only
if they have the same coefficients. However, we will later see that if F is R or C, then polynomials
are equal if and only if they represent the same function, and this distinction is not as important.
L. 4-95
1. If λ is a root of f ∈ F[t], then there is a polynomial g such that f(t) = (t − λ)g(t).
2. Any non-zero f ∈ F[t] can be written as f(t) = g(t) Π_{i=1}^k (t − λi)^{ai} where λ1, · · · , λk are all distinct, ai ≥ 1, and g is a polynomial with no roots in F.
1. By polynomial division, we have f(t) = (t − λ)g(t) + r(t) for some g(t), r(t) ∈ F[t] with deg r < deg(t − λ) = 1. So r has to be constant, ie. r(t) = a0 for some a0 ∈ F. But 0 = f(λ) = (λ − λ)g(λ) + a0 = a0. So r(t) = a0 = 0.
2. We induct on the degree. The statement is true for any degree 0 polynomial. Suppose it is true for all polynomials of degree at most k, and let f be any polynomial of degree k + 1. If f has no roots, then we are done (take g = f). If f has a root λ, then by part 1, f(t) = (t − λ)g(t), and g has degree k. By the induction hypothesis, g and hence f can be written in the desired form.
L. 4-96
A non-zero polynomial f ∈ F[t] has at most deg f roots, counted with or without
multiplicity.
Using the above lemma, f(t) = g(t) Π_{i=1}^k (t − λi)^{ai} where g is a polynomial with no roots. Π_{i=1}^k (t − λi)^{ai} is a polynomial with k distinct roots (Σ_{i=1}^k ai roots counted with multiplicity) and degree Σ_{i=1}^k ai. Since deg f ≥ Σ_{i=1}^k ai ≥ k, the result follows for both counts.
L. 4-97
1. Let f, g ∈ F[t] have degree less than n. If there are λ1 , · · · , λn distinct such
that f (λi ) = g(λi ) for all i, then f = g (in the polynomial sense).
2. If F is infinite, then f = g if and only if they agree on all points.
1. Consider f − g. This has degree less than n, and (f − g)(λi) = 0 for i = 1, · · · , n. Since a non-zero polynomial has at most its degree many roots, and deg(f − g) < n, we must have f − g = 0 and so f = g.
2. The forward direction is obviously true. If f and g agrees on all points, pick n
such that n > deg f and n > deg g, then by the first part f = g.
T. 4-98
<Fundamental theorem of algebra> Every non-constant polynomial over C
has a root in C.
We will not prove this here; a proof is given in [T.10-50]. Because of this result we say C is an algebraically closed field .
In fact it follows from this result that every polynomial over C of degree n > 0 has precisely n roots, counted with multiplicity, since if we write f(t) = g(t) Π (t − λi)^{ai} and g has no roots, then g is constant. So the number of roots is Σ ai = deg f, counted with multiplicity.
It also follows that every polynomial over R factors into linear polynomials and quadratic polynomials with no real roots, since complex roots of real polynomials come in complex conjugate pairs.
T. 4-99
<Diagonalizability theorem> Suppose α ∈ End(V ). Then α is diagonalizable
if and only if there exists non-zero p(t) ∈ F[t] that can be expressed as a product
of distinct linear factors such that p(α) = 0.
L. 4-102
Let α ∈ End(V ), and p ∈ F[t]. Then p(α) = 0 if and only if Mα (t) is a factor of
p(t). In particular, Mα is unique.
For any such p, we can write p(t) = q(t)Mα(t) + r(t) for some r of degree less than deg Mα. Then p(α) = q(α)Mα(α) + r(α) = r(α) since Mα(α) = 0, so r(α) = 0 iff p(α) = 0. But deg r < deg Mα, so by the minimality of Mα, we must have r(α) = 0 iff r = 0. So p(α) = 0 iff Mα(t) | p(t).
So if M1 and M2 are both minimal polynomials for α, then M1 | M2 and M2 | M1 .
So M2 is just a scalar multiple of M1 . But since M1 and M2 are monic, they must
be equal.
E. 4-103
Let V = F², and consider the matrices A = ( 1 0 ; 0 1 ) and B = ( 1 1 ; 0 1 ). Consider the
polynomial p(t) = (t − 1)2 . We can compute p(A) = p(B) = 0. So MA (t) and
MB (t) are factors of (t−1)2 . There aren’t many factors of (t−1)2 . So the minimal
polynomials are either (t − 1) or (t − 1)2 . Since A − I = 0 and B − I 6= 0, the
minimal polynomial of A is t − 1 and the minimal polynomial of B is (t − 1)2 .
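The computation in this example is short enough to verify directly (numpy assumed): p(t) = (t − 1)² kills both matrices, but B − I ≠ 0 while A − I = 0, which distinguishes the two minimal polynomials.

```python
import numpy as np

I = np.eye(2)
A = np.array([[1., 0.],
              [0., 1.]])
B = np.array([[1., 1.],
              [0., 1.]])

pA = (A - I) @ (A - I)   # p(t) = (t - 1)^2 evaluated at A
pB = (B - I) @ (B - I)   # ... and at B
```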
T. 4-104
<Diagonalizability theorem v2> Let α ∈ End(V ). Then α is diagonalizable
if and only if Mα (t) is a product of distinct linear factors.
(Forward) If there exists a basis (e1 , · · · , en ) for V such that α and β are repre-
sented by A and B respectively, with both diagonal, then by direct computation,
AB = BA. But AB represents αβ and BA represents βα. So αβ = βα.
(Backward) Suppose αβ = βα. The idea is to consider each eigenspace of α
individually, and
Lthen diagonalize β in each of the eigenspaces. Since α is diago-
nalizable, V = ki=1 Eα (λi ) where λi are the different eigenvalues of α. Write Ei
for Eα (λi ).
We now show that β(Ei ) ⊆ Ei . Let v ∈ Ei , then α(β(v)) = β(α(v)) = β(λi v) =
λi β(v). So β(v) is an eigenvector of α with eigenvalue λi , hence β(v) ∈ Ei .
Thus we can view β|Ei ∈ End(Ei). Note that Mβ(β|Ei) = Mβ(β)|Ei = 0. Since Mβ(t) is a product of distinct linear factors (as β is diagonalizable), it follows that β|Ei is diagonalizable for each Ei, and we can choose a basis Bi of Ei consisting of eigenvectors of β|Ei, which must also be eigenvectors of β.
Then since V is a direct sum of the Ei's, we know that B = ∪_{i=1}^k Bi is a basis for V consisting of eigenvectors for both α and β.
This result is important in quantum mechanics. This means that if two operators
do not commute, then they do not have a common eigenbasis. Hence we have the
uncertainty principle.
D. 4-106
An endomorphism α ∈ End(V ) is triangulable if there is a basis for V such that
α is represented by an upper triangular matrix (ie. Aij = 0 for all i > j).
L. 4-107
An endomorphism α is triangulable if and only if χα (t) can be written as a prod-
uct of linear factors, not necessarily distinct. In particular, if F = C (or any
algebraically closed field), then every endomorphism is triangulable.
Since dim W < dim V, by the induction hypothesis there is a basis vr+1, · · · , vn for W such that β is represented by an upper triangular matrix C. For j = 1, · · · , n − r, we have α(vj+r) = u + Σ_{k=1}^{n−r} Ckj vk+r for some u ∈ U. So α is represented by the upper triangular matrix (in block form)
( λIr stuff ; 0 C ).
5
Note that β is not α|W in general, since α does not necessarily map W to W (as can be seen from the "stuff" in the matrix above). However, we can say that (α − β)(w) ∈ U for all w ∈ W. This can be much more elegantly expressed in terms of quotient spaces.
E. 4-108
Consider the real rotation matrix
( cos θ sin θ ; −sin θ cos θ ).
This is not similar to a real upper triangular matrix (if θ is not an integer multiple
of π). This is since the eigenvalues are e±iθ and are not real. On the other
hand, as a complex matrix, it is triangulable, and in fact diagonalizable since the
eigenvalues are distinct. For this reason, in the rest of the section, we are mostly
going to work in C. We can now prove the Cayley-Hamilton theorem.
T. 4-109
<Cayley-Hamilton theorem> Let V be a finite-dimensional vector space with
dimension n, and α ∈ End(V ) with characteristic polynomial χα . Then χα (α) = 0,
ie. Mα (t) | χα (t). In particular, deg Mα ≤ n.
Hence we have the desired result for V that is over C. Now suppose V is over a
field F, which is not C but a subfield of C (for example R). Say α is represented
by B ∈ Matn (F) over some basis, we can view B as an element of Matn (C) to see
that χB (B) = 0. But χα (α) = χB (α) is represented by χB (B) and so is 0.
(Proof 2) Let α be represented by A, and B = tIn − A. Then B adj B = (det B)In = χα(t)In. But we know that adj B is a matrix with entries in F[t] of degree at most n − 1. So we can write adj B = B_{n−1}t^{n−1} + B_{n−2}t^{n−2} + · · · + B0 with Bi ∈ Matn(F). We can also write χα(t) = an tⁿ + a_{n−1}t^{n−1} + · · · + a0. Then from B adj B = χα(t)In we get
(tIn − A)(B_{n−1}t^{n−1} + · · · + B0) = (an tⁿ + · · · + a0)In.
Both sides are equal as polynomials, so the coefficients on both sides must be equal. Equating coefficients of t^k gives ak In = B_{k−1} − ABk (where we set B_{−1} = Bn = 0). Multiplying on the left by A^k and summing over k, the right-hand side telescopes to 0, so χα(A) = Σ_k ak A^k = 0.
................................................................................
It is tempting to prove this result by substituting t = α into det(tι − α) and getting det(α − α) = 0, but this is meaningless, since what the statement χα(t) = det(tι − α) tells us to do is to expand the determinant of the matrix
tIn − A = ( t − a11   −a12   · · ·   −a1n ; −a21   t − a22   · · ·   −a2n ; ⋮ ; −an1   −an2   · · ·   t − ann )
to obtain a polynomial, and it makes no sense to plug a linear map in for t inside the matrix entries.
This is exactly what we showed in the proof — after multiplying out the first k
elements of the product (counting from the right), the image is contained in the
span of the first n − k basis vectors.
L. 4-110
Let α ∈ End(V ), λ ∈ F. Then
(i)[[ λ is an eigenvalue of α ]]⇔(ii)[[ λ is a root of χα (t) ]]⇔(iii)[[ λ is a root of
Mα (t) ]].
i.e. (Jm(λ))i,i = λ, (Jm(λ))i,i+1 = 1 and all other entries 0. These matrices are called Jordan blocks . We say a matrix A ∈ Matn(C) is in Jordan normal form if it is a block diagonal matrix
A = diag( Jn1(λ1), Jn2(λ2), · · · , Jnk(λk) ),
where k ≥ 1, n1, · · · , nk ∈ N such that n = Σ ni, and λ1, · · · , λk ∈ C are not necessarily distinct.
E. 4-113
• For the n × n Jordan block A = Jn(λ) (λ on the diagonal, 1 on the superdiagonal), λ is an eigenvalue, and aλ = n = cλ and gλ = 1.
L. 4-114
If λ is an eigenvalue of α, then 1.[[ 1 ≤ gλ ≤ aλ ]] and 2.[[ 1 ≤ cλ ≤ aλ ]].
E. 4-117
Every 2 × 2 matrix in Jordan normal form is one of the three types:
If k ≥ n, then we have (Jn (λ) − λI)k = 0. Hence n((Jm (λ) − λIm )r ) = min{r, m}.
And so for A = Jn (λ), we have χA (t) = MA (t) = (t − λ)n . So λ is the only
eigenvalue of A. Just to be clear writing the algebraic multiplicity for A of λ as
4.6. ENDOMORPHISMS 143
aJn (λ) (instead of just aλ ) etc. we have (∗)[[ aJn (λ) = n ]] and (†)[[ cJn (λ) = n ]]. We
also know that n(A − λI) = n − r(A − λI) = 1. So (‡)[[ gJn (λ) = 1 ]].
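The nullity formula n((Jm(λ) − λIm)^r) = min{r, m} can be checked by building a Jordan block explicitly (numpy assumed; the choices m = 5, λ = 2 are arbitrary):

```python
import numpy as np

def jordan_block(lam, m):
    """m x m Jordan block: lam on the diagonal, 1 on the superdiagonal."""
    return lam * np.eye(m) + np.diag(np.ones(m - 1), 1)

m, lam = 5, 2.0
N = jordan_block(lam, m) - lam * np.eye(m)   # the nilpotent part
# nullity of N^r for r = 1, ..., m + 1
nullities = [int(m - np.linalg.matrix_rank(np.linalg.matrix_power(N, r)))
             for r in range(1, m + 2)]
```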
Recall that a general Jordan normal form is a block diagonal matrix of Jordan
blocks. We have just studied individual Jordan blocks. Next, we want to look at
some properties of block diagonal matrices in general. If A = diag(A1, A2, · · · , Ak) is a block diagonal matrix, then
χA(t) = Π_{i=1}^k χAi(t). (∗)
The above tells us that if A is in Jordan normal form, we get the following:
(‡): gλ is the number of Jordan blocks in A with eigenvalue λ.
(∗): aλ is the sum of sizes of the Jordan blocks of A with eigenvalue λ.
(†): cλ is the size of the largest Jordan block with eigenvalue λ.
T. 4-119
Let α ∈ End(V ), and A in Jordan normal form representing α. Then the number
of Jordan blocks Jn (λ) in A with n ≥ r is n((α − λι)r ) − n((α − λι)r−1 ).
Write A = diag( Jn1(λ1), Jn2(λ2), · · · , Jnk(λk) ).
n((α − λι)^r) − n((α − λι)^{r−1}) is independent of the basis. So with this result, for any λ ∈ {λi : i} we can figure out how many Jordan blocks have size exactly r by doing the right subtraction (the count for size at least r minus the count for size at least r + 1). And this is true for all Jordan normal forms that represent α. Hence this tells us that Jordan normal forms are unique up to permutation of the blocks.
We can interpret this result as follows: if r ≤ m, when we take an additional power of Jm(λ) − λIm, we get from ( 0 Im−r ; 0 0 ) to ( 0 Im−r−1 ; 0 0 ) (in block form). So we kill off one more column in the matrix, and the nullity increases by one. This happens until (Jm(λ) − λIm)^r = 0, at which point increasing the power no longer affects the matrix. So when we look at the difference in nullity, we are counting the number of blocks that are affected by the increase in power, which is the number of blocks of size at least r.
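The counting recipe above can be sketched numerically: build a block diagonal matrix from Jordan blocks and recover, for one eigenvalue, how many blocks have size at least r (numpy assumed; the block sizes 3 and 1 for λ = 2 and 2 for λ = 5 are arbitrary choices):

```python
import numpy as np

def jordan_block(lam, m):
    """m x m Jordan block with lam on the diagonal and 1 on the superdiagonal."""
    return lam * np.eye(m) + np.diag(np.ones(m - 1), 1)

def block_diag(*blocks):
    """Assemble square blocks into one block diagonal matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

# Jordan normal form with blocks J_3(2), J_1(2), J_2(5).
A = block_diag(jordan_block(2.0, 3), jordan_block(2.0, 1), jordan_block(5.0, 2))
n = A.shape[0]

def nullity_power(lam, r):
    """n((A - lam I)^r)."""
    Mr = np.linalg.matrix_power(A - lam * np.eye(n), r)
    return int(n - np.linalg.matrix_rank(Mr))

# number of Jordan blocks J_m(2) with m >= r is n_r - n_{r-1}
at_least = [nullity_power(2.0, r) - nullity_power(2.0, r - 1) for r in (1, 2, 3, 4)]
```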
We have now proved uniqueness, but existence is not yet clear. To show this, we
will reduce it to the case where there is exactly one eigenvalue. This reduction
is easy if the matrix is diagonalizable, because we can decompose the matrix into
each eigenspace and then work in the corresponding eigenspace. In general, we
need to work with “generalized eigenspaces”.
T. 4-120
Let
Qk V be a finite-dimensional vector space C such that α ∈ End(V ). Write Mα (t) =
ci
i=1 (t − λi ) with λ1 , · · · , λk ∈ C distinct. Then V = V1 ⊕ · · · ⊕ Vk where
Vi = ker((α − λi ι)ci ).
Let pj(t) = Π_{i≠j}(t − λi)^{ci}, so that Mα(t) = (t − λj)^{cj} pj(t). Since p1, · · · , pk have no common factor, there exist q1, · · · , qk ∈ C[t] with Σⱼ qj pj = 1. We now define the endomorphism πj = qj(α)pj(α) for j = 1, · · · , k. We have Σ πj = ι. Below we will prove that Im πj ⊆ Vj ⊆ ker( Σ_{i≠j} πi ) ⊆ Im πj and hence Im πj = Vj for all j.
• Since Mα(α) = 0 and Mα(t) = (t − λj)^{cj} pj(t), we have (α − λj ι)^{cj} πj = 0. So Im πj ⊆ Vj.
• Since (α − λj ι)^{cj} is a factor of πi for each i ≠ j (and the factors of the form (α − λι) commute), we have Vj ⊆ ker( Σ_{i≠j} πi ).
• v ∈ ker( Σ_{i≠j} πi ) ⇒ v = ( Σᵢ πi )v = πj(v), so ker( Σ_{i≠j} πi ) ⊆ Im πj.
Also note that πi πj = 0 for i ≠ j, since the product contains Mα(α) as a factor. So πi = ιπi = ( Σⱼ πj )πi = πi². Given any v ∈ V, we have v = ι(v) = Σ_{j=1}^k πj(v) ∈ Σⱼ Im πj. On the other hand if v = Σⱼ πj(uj), then applying πi to both sides gives πi(v) = πi²(ui) = πi(ui). Hence there is a unique way of writing v as a sum of things in the Im πj. So V = ⊕ Im πj = ⊕ Vj.
We call Vi = ker((α − λi ι)ci ) the generalized eigenspace .
This allows us to decompose V into a block diagonal matrix, and then each block
will only have one eigenvalue. Note that if c1 = · · · = ck = 1, then we recover the
diagonalizability theorem.
Note that we didn’t really use the fact that the vector space is over C, except to get
that the minimal polynomial is a product of linear factors. In fact, for arbitrary
vector spaces, if the minimal polynomial of a matrix is a product of linear factors,
then it can be put into Jordan normal form. The converse is also true — if it
can be put into Jordan normal form, then the minimal polynomial is a product
of linear factors, since the minimal polynomial can be read off directly from the
Jordan normal form as a product of powers of linear factors (t − λ).
        ( 2  −2  0 )
A − I = ( 1  −1  0 )   =⇒   EA (1) = ⟨(0, 0, 1)⟩.
        ( 1   0  0 )
We see A − I has rank 2 and hence nullity 1, and the eigenspace is its kernel.
We need to pick our v2 in the kernel of (A − I)² but not in the kernel of A − I
(which is the eigenspace E1 we have computed above). So we have v2 = (1, 1, 0),
v1 = (A − I)v2 = (0, 0, 1) and v3 = (2, 1, 2) (an eigenvector of eigenvalue 2).
Hence we have
    ( 0 1 2 )                  ( 1 1 0 )
P = ( 0 1 1 )   and   P⁻¹AP =  ( 0 1 0 ) .
    ( 1 0 2 )                  ( 0 0 2 )
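Assuming the matrix of this example is A = [[3, −2, 0], [1, 0, 0], [1, 0, 1]] (read off from A − I above), the change of basis can be verified with a quick numpy check (a sketch, not part of the notes):

```python
import numpy as np

# The matrix A of the example and the basis (v1, v2, v3) found above, as columns of P
A = np.array([[3., -2., 0.],
              [1.,  0., 0.],
              [1.,  0., 1.]])
P = np.array([[0., 1., 2.],
              [0., 1., 1.],
              [1., 0., 2.]])

J = np.linalg.inv(P) @ A @ P
print(np.round(J, 10))  # ≈ [[1, 1, 0], [0, 1, 0], [0, 0, 2]]
```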
Note that quadratic forms are not linear maps (they are quadratic).
L. 4-124
Let V be a finite-dimensional vector space over F with basis (e1 , · · · , en ), and
φ : V × V → F is a bilinear form represented by the matrix M with respect to the
basis, ie. Mij = φ(ei , ej ). Then φ is symmetric if and only if M is symmetric.
L. 4-125
Let V be a finite-dimensional vector space, and φ : V × V → F a bilinear form.
Let (e1 , · · · , en ) and (f1 , · · · , fn ) be bases of V such that fi = ∑_{k=1}^n Pki ek . If A
represents φ with respect to (ei ) and B represents φ with respect to (fi ), then
B = Pᵀ AP .
T. 4-127
Let φ be a symmetric bilinear form on a finite-dimensional vector space V over a
field F of characteristic not 2. Then there exists a basis (e1 , · · · , en ) for V with
respect to which φ is represented by a diagonal matrix.
We induct over n = dim V . The cases n = 0 and n = 1 are trivial, since every
matrix of that size is diagonal.
Suppose we have proven the result for all spaces of dimension less than n. First
consider the case where φ(v, v) = 0 for all v ∈ V . We want to show that we must
have φ = 0. This follows from the polarization identity, since this φ induces the
zero quadratic form, and we know that there is a unique symmetric bilinear form
that induces the zero quadratic form. Since we know that the zero bilinear form,
which is symmetric, induces the zero quadratic form, we must have φ = 0. Then φ
will be represented by the zero matrix with respect to any basis, which is trivially
diagonal.
If not, pick e1 ∈ V such that φ(e1 , e1 ) ≠ 0. Let U = ker φ(e1 , · ) = {v ∈ V : φ(e1 , v) = 0}.
Since φ(e1 , · ) ∈ V ∗ \{0}, we know that dim U = n−1 by the rank-nullity theorem.
Consider φ|U ×U : U × U → F, a symmetric bilinear form. By the induction
hypothesis, we can find a basis e2 , · · · , en for U such that φ|U ×U is represented by
a diagonal matrix with respect to this basis. Now by construction, φ(ei , ej ) = 0
for all 1 ≤ i 6= j ≤ n and (e1 , · · · , en ) is a basis for V .
This tells us that classifying symmetric bilinear forms is easier than classifying
endomorphisms: for endomorphisms, even over C, we cannot always diagonalize;
but symmetric bilinear forms can be diagonalized over arbitrary fields (of
characteristic not 2).
E. 4-128
Let q be a quadratic form on R3 given by
Find a basis f1 , f2 , f3 for R³ such that q is of the form q(af1 + bf2 + cf3 ) = λa² +
μb² + νc².
T. 4-130
Let φ be a symmetric bilinear form on a finite-dimensional complex vector space V .
Then there exists a basis (v1 , · · · , vn ) for V such that φ is represented by ( Ir 0 ; 0 0 )
with respect to this basis, where r = r(φ) is the rank of φ.
We’ve already shown that there exists a basis (e1 , · · · , en ) such that φ(ei , ej ) =
λi δij for some λi . By reordering the ei , we can assume that λ1 , · · · , λr ≠ 0 and
λr+1 , · · · , λn = 0. For each 1 ≤ i ≤ r, there exists some μi such that μi² = λi . For
r + 1 ≤ i ≤ n, we let μi = 1 (or anything non-zero). We define vi = ei /μi ; then
                 φ(ei , ej )    ( 1   i = j ≤ r
φ(vi , vj ) =   ───────────  =  (
                   μi μj        ( 0   i ≠ j or i = j > r.
Note that it follows that for the corresponding quadratic form q, we have
q( ∑_{i=1}^n ai vi ) = ∑_{i=1}^r ai².
T. 4-131
Let φ be a symmetric bilinear form on a finite-dimensional real vector space V .
Then there exist non-negative integers p, q and a basis (v1 , · · · , vn ) for V with
respect to which φ is represented by the matrix A = diag(Ip , −Iq , 0).
We’ve already shown that there exists a basis (e1 , · · · , en ) such that φ(ei , ej ) =
λi δij for some λ1 , · · · , λn ∈ R. By reordering, we may assume
λi > 0 for 1 ≤ i ≤ p,   λi < 0 for p + 1 ≤ i ≤ r,   λi = 0 for i > r,
and we define μi by
μi = √λi for 1 ≤ i ≤ p,   μi = √(−λi ) for p + 1 ≤ i ≤ r,   μi = 1 for i > r.
Now defining vi = (1/μi ) ei , we find that φ is indeed represented by A.
Note that we have seen these things in special relativity, where the Minkowski
inner product is given by the symmetric bilinear form represented by the matrix
diag(−1, 1, 1, 1) in units where c = 1.
T. 4-133
<Sylvester’s law of inertia> Let φ be a symmetric bilinear form on a finite-
dimensional real vector space V . Then there exists unique non-negative integers
p, q such that with respect to some basis φ is represented by
    ( Ip   0   0 )
A = (  0  −Iq   0 ) .
    (  0    0   0 )
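Since a real symmetric matrix can be orthogonally diagonalized (which is in particular a congruence Pᵀ AP with P orthogonal), the signature (p, q) can be read off from the signs of the eigenvalues. A small numerical sketch with an example matrix of our own choosing:

```python
import numpy as np

# Symmetric matrix representing a bilinear form on R^3 (an arbitrary example)
A = np.array([[1.,  2.,  0.],
              [2.,  1.,  0.],
              [0.,  0., -3.]])

eigvals = np.linalg.eigvalsh(A)    # real eigenvalues, since A is symmetric
p = int(np.sum(eigvals >  1e-12))  # number of positive eigenvalues
q = int(np.sum(eigvals < -1e-12))  # number of negative eigenvalues
print(p, q)  # 1 2
```

By Sylvester's law of inertia, any other basis diagonalizing this form gives the same pair (p, q).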
E. 4-135
Note that alternatively, to define a sesquilinear form, we can define a new complex
vector space V̄ structure on V by taking the same abelian group (ie. the same
underlying set and addition), but with the scalar multiplication of V̄ satisfying
α∗v = ᾱv where ∗ is the scalar multiplication of V̄ and “the normal” multiplication
is the scalar multiplication of V . Then a sesquilinear form on V × W is a bilinear
form on V̄ ×W (since φ(λ∗v1 +µ∗v2 , w) = λ̄φ(v1 , w)+ µ̄φ(v2 , w)). Alternatively,
this is a linear map W → V̄ ∗ .
As usual, the matrix representing the sesquilinear form determines the whole
sesquilinear form. This follows from the analogous fact for the bilinear form on
V̄ × W → C. Let v = ∑ λi vi and w = ∑ μj wj . Then we have
φ(v, w) = ∑_{i,j} λ̄i μj φ(vi , wj ) = λ† Aμ.
We now want the right definition of symmetric sesquilinear form. We cannot just
require φ(v, w) = φ(w, v), since φ is linear in the second variable and conjugate
linear on the first variable. So in particular, if φ(v, w) 6= 0, we have φ(iv, w) 6=
φ(v, iw), so requiring φ(v, w) = φ(w, v) just means that φ(v, w) = 0 for all v, w.
So instead we have the Hermitian forms.
Note that if φ is Hermitian, then φ(v, v) equals its own conjugate, so φ(v, v) ∈ R for any v ∈ V . So it
makes sense to ask if it is positive or negative. Moreover, for any complex number
λ, we have φ(λv, λv) = |λ|2 φ(v, v). So multiplying by a scalar does not change
the sign. So it makes sense to talk about positive (semi-)definite and negative
(semi-)definite Hermitian forms.
L. 4-136
Let φ : V × V → C be a sesquilinear form on a finite-dimensional vector space over
C, and (e1 , · · · , en ) a basis for V . Then φ is Hermitian if and only if the matrix
A representing φ is Hermitian (ie. A = A† ).
L. 4-138
<Polarization identity> A Hermitian form φ on V is determined by the func-
tion ψ : V → R defined by v 7→ φ(v, v).
Almost the same as for symmetric forms over R, see [T.4-127], [T.4-131] and
[T.4-133]. For the first part of the proof (cf. [T.4-127]), like before if φ(v, v) = 0 for
all v, then by the polarization identity φ = 0; otherwise we can pick e1 with
φ(e1 , e1 ) ≠ 0 and continue like before. For the middle part of the proof (cf. [T.4-131]),
note that it is like [T.4-131] but not [T.4-130]: we cannot simply get ( Ir 0 ; 0 0 ), because
here we have φ(λe, λe) = |λ|² φ(e, e), so when φ(e, e) is negative we cannot “normalise”
it to 1. For the last part of the proof (cf. [T.4-133]), we should note that positive
(semi-)definiteness and negative (semi-)definiteness work for Hermitian forms.
T. 4-142
Let V be an inner product space, then for any v, w ∈ V ,
1. <Cauchy-Schwarz inequality> |hv, wi| ≤ kvkkwk
2. <Triangle inequality> kv + wk ≤ kvk + kwk
L. 4-143
<Parseval’s identity> Let V be a finite-dimensional inner product space with
an orthonormal basis u1 , · · · , un . For any v, w ∈ V , ⟨v, w⟩ = ∑_{i=1}^n \overline{⟨ui , v⟩} ⟨ui , w⟩.
In particular, ‖v‖² = ∑_{i=1}^n |⟨ui , v⟩|².

⟨v, w⟩ = ⟨ ∑_{i=1}^n ⟨ui , v⟩ui , ∑_{j=1}^n ⟨uj , w⟩uj ⟩ = ∑_{i,j=1}^n \overline{⟨ui , v⟩} ⟨uj , w⟩ ⟨ui , uj ⟩
       = ∑_{i,j=1}^n \overline{⟨ui , v⟩} ⟨uj , w⟩ δij = ∑_{i=1}^n \overline{⟨ui , v⟩} ⟨ui , w⟩.
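Parseval's identity can be checked numerically: take an orthonormal basis of R⁴ (eg. from a QR factorization) and compare ⟨v, w⟩ with the sum of products of coefficients. This is a sketch of our own (real case, so no conjugates appear):

```python
import numpy as np

rng = np.random.default_rng(0)
# Orthonormal basis u_1, ..., u_4 of R^4: the columns of Q in a QR factorization
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
u = Q.T  # rows are the basis vectors

v = rng.standard_normal(4)
w = rng.standard_normal(4)

lhs = v @ w                                 # <v, w>
rhs = sum((ui @ v) * (ui @ w) for ui in u)  # sum_i <u_i, v><u_i, w>
print(np.isclose(lhs, rhs))  # True
```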
T. 4-144
<Gram-Schmidt process> Let V be an inner product space and e1 , e2 , · · · a
linearly independent set. Then we can construct an orthonormal set v1 , v2 , · · ·
with the property that h{v1 , · · · , vk }i = h{e1 , · · · , ek }i for every k.
We proceed inductively: having constructed orthonormal v1 , · · · , vk with
⟨{v1 , · · · , vk }⟩ = ⟨{e1 , · · · , ek }⟩, define u_{k+1} = e_{k+1} − ∑_{i=1}^k ⟨vi , e_{k+1} ⟩vi .
Then for each j ≤ k,
⟨vj , u_{k+1} ⟩ = ⟨vj , e_{k+1} ⟩ − ∑_{i=1}^k ⟨vi , e_{k+1} ⟩δij = ⟨vj , e_{k+1} ⟩ − ⟨vj , e_{k+1} ⟩ = 0.
Since e_{k+1} ∉ ⟨{v1 , · · · , vk }⟩, we have u_{k+1} ≠ 0, so we can take v_{k+1} = u_{k+1} /‖u_{k+1} ‖.
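The inductive construction above can be written directly as code. A minimal sketch for vectors in Rⁿ (plain classical Gram-Schmidt, with no re-orthogonalization; the example vectors are our own):

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a linearly independent list of vectors in R^n."""
    basis = []
    for e in vectors:
        u = e - sum((v @ e) * v for v in basis)  # subtract projections <v_i, e> v_i
        basis.append(u / np.linalg.norm(u))      # normalize
    return basis

es = [np.array([1., 1., 0.]), np.array([1., 0., 1.]), np.array([0., 1., 1.])]
vs = gram_schmidt(es)
G = np.array([[vi @ vj for vj in vs] for vi in vs])  # Gram matrix of the output
print(np.allclose(G, np.eye(3)))  # True: the v_i are orthonormal
```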
Now let w = ∑_{i=1}^k ⟨wi , v⟩wi . Clearly, we have w ∈ W . We just need to show
v − w ∈ W ⊥ . For each j, we can compute
⟨wj , v − w⟩ = ⟨wj , v⟩ − ∑_{i=1}^k ⟨wi , v⟩⟨wj , wi ⟩ = ⟨wj , v⟩ − ∑_{i=1}^k ⟨wi , v⟩δij = 0.
Hence for any λi , we have ⟨∑ λj wj , v − w⟩ = 0. So we have v − w ∈ W ⊥ .
Notice that unlike general vector space complements, orthogonal complements are
unique.
E. 4-148
Note that the external direct sum is equivalent to the internal direct sum of
{(v1 , 0) : v1 ∈ V1 } and {(0, v2 ) : v2 ∈ V2 }.
P. 4-149
Let V be a finite-dimensional inner product space and W ≤ V . Let (e1 , · · · , ek )
be an orthonormal basis of W . Let π be the orthogonal projection of V onto W .
Then
1. π is given by the formula π(v) = ∑_{i=1}^k ⟨ei , v⟩ei .
2. For all v ∈ V and w ∈ W , we have ‖v − π(v)‖ ≤ ‖v − w‖,
with equality if and only if ‖π(v) − w‖ = 0, ie. π(v) = w. (Note that v − π(v) ∈
ker π = W ⊥ while π(v) − w ∈ Im π = W , hence ⟨v − π(v), π(v) − w⟩ = 0, so
‖v − w‖² = ‖v − π(v)‖² + ‖π(v) − w‖².)
Note that 2. says that π(v) is the point on W that is closest to v.
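Both the formula π(v) = ∑⟨ei , v⟩ei and the closest-point property can be checked numerically. In this sketch (our own illustration), W is an arbitrary 2-dimensional subspace of R⁴:

```python
import numpy as np

rng = np.random.default_rng(1)
# Orthonormal basis (e1, e2) of a 2-dimensional subspace W of R^4
E, _ = np.linalg.qr(rng.standard_normal((4, 2)))
e = E.T  # rows are the basis vectors

v = rng.standard_normal(4)
pi_v = sum((ei @ v) * ei for ei in e)  # pi(v) = sum_i <e_i, v> e_i

# pi(v) is at least as close to v as any other point of W
for _ in range(100):
    w = E @ rng.standard_normal(2)  # a random point of W
    assert np.linalg.norm(v - pi_v) <= np.linalg.norm(v - w) + 1e-12
print("closest-point property verified")
```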
L. 4-150
Let V and W be finite-dimensional inner product spaces and α : V → W a linear
map. There exists a unique linear map α∗ : W → V such that hαv, wi = hv, α∗ wi
for all v ∈ V , w ∈ W .
L. 4-153
Let V be a real finite-dimensional space and α ∈ End(V ).
1. α is orthogonal if and only if α−1 = α∗ .
2. α is orthogonal if and only if α is represented by an orthogonal matrix (ie. a
matrix A such that AT = A−1 ) with respect to any orthonormal basis.
1. (Backward) If α⁻¹ = α∗ , then ⟨αv, αv⟩ = ⟨v, α∗ αv⟩ = ⟨v, α⁻¹ αv⟩ = ⟨v, v⟩.
(Forward) If α is orthogonal and (v1 , · · · , vn ) is an orthonormal basis for V ,
then for 1 ≤ i, j ≤ n, we have δij = ⟨vi , vj ⟩ = ⟨αvi , αvj ⟩ = ⟨vi , α∗ αvj ⟩. So
we know α∗ α(vj ) = ∑_{i=1}^n ⟨vi , α∗ αvj ⟩vi = vj . So by linearity of α∗ α, we know
α∗ α = idV . So α∗ = α⁻¹ .
2. Let (e1 , · · · , en ) be any orthonormal basis for V . Let A represent α with respect
to this basis; then Aᵀ represents α∗ . So α is orthogonal iff α⁻¹ = α∗ iff
A⁻¹ = Aᵀ .
If α∗ = α⁻¹ , then α is invertible. It is clear from the definition that O(V ) is closed
under multiplication (since if Aᵀ = A⁻¹ and Bᵀ = B⁻¹ then (AB)ᵀ = (AB)⁻¹ )
and inverses (since if Aᵀ = A⁻¹ then (A⁻¹ )ᵀ = (A⁻¹ )⁻¹ ). So O(V ) is indeed a
group.
The same result and the same proof extend to complex finite-dimensional inner
product spaces, with orthogonal replaced by unitary (and Aᵀ by A† ).
P. 4-154
Let V be a finite-dimensional real inner product space and (e1 , · · · , en ) an or-
thonormal basis of V . Then there is a bijection O(V ) → {orthonormal bases of V }
defined by α ↦ (α(e1 ), · · · , α(en )).
Again the same result and the same proof extend to complex spaces, with the
orthogonal group replaced by the unitary group.
L. 4-155
Let V be a finite-dimensional inner product space, and α ∈ End(V ) self-adjoint. Then
1. α has a real eigenvalue, and all eigenvalues of α are real.
2. Eigenvectors of α with distinct eigenvalues are orthogonal.
We can also prove the real case of 1. without reducing to the complex case. We
know every irreducible factor of Mα (t) in R[t] must have degree 1 or 2, since the
roots are either real or come in complex conjugate pairs. Suppose f (t) were an
irreducible factor of degree 2. Then (Mα /f )(α) 6= 0 since it has degree less than
the minimal polynomial. So ∃v ∈ V such that (Mα /f )(α)(v) 6= 0. So it must be
that f (α)(v) = 0. Let U = h{v, α(v)}i. Then this is an α-invariant subspace of
V (ie. α(U ) ⊆ U ) since f has degree 2 and f (α)(v) = 0 (thus α2 (v) is a linear
combination of v and α(v)).
This result says that a Hermitian matrix has real eigenvalues and that eigenvectors
corresponding to distinct eigenvalues are orthogonal, which we saw in Part IA.
T. 4-156
Let V be a finite-dimensional inner product space, and α ∈ End(V ) self-adjoint.
Then V has an orthonormal basis of eigenvectors of α. In particular V is the
orthogonal (internal) direct sum of the eigenspaces of α.
By the previous lemma, α has a real eigenvalue, say λ. Then we can find an
eigenvector v ∈ V \ {0} such that αv = λv. Let U = ⟨v⟩⊥ . Then we can write
V = ⟨v⟩ ⊥ U . We now want to prove that α sends U into U . Suppose u ∈ U . Then
⟨v, α(u)⟩ = ⟨α(v), u⟩ = λ⟨v, u⟩ = 0 (using that λ is real), so α(u) ∈ U . Hence
α|U ∈ End(U ) is also self-adjoint, so by induction on dim V , U has an orthonormal
basis of eigenvectors of α|U ; adding v/‖v‖ to it gives an orthonormal basis of
eigenvectors of V .
P. 4-157
Let A ∈ Matn (R) (resp. Matn (C)) be symmetric (resp. Hermitian). Then there
exists an orthogonal (resp. unitary) matrix P such that P T AP = P −1 AP (resp.
P † AP ) is diagonal with real entries.
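Numerically, this is what np.linalg.eigh computes: for a real symmetric A it returns real eigenvalues and an orthogonal matrix of eigenvectors, so Pᵀ AP is diagonal. A quick sketch with an arbitrary symmetric matrix of our own:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T  # symmetric

lam, P = np.linalg.eigh(A)  # real eigenvalues and orthonormal eigenvectors
assert np.allclose(P.T @ P, np.eye(4))         # P is orthogonal
assert np.allclose(P.T @ A @ P, np.diag(lam))  # P^T A P is diagonal
print("orthogonally diagonalized")
```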
P. 4-158
Let V be a finite-dimensional real inner product space and ψ : V × V → R a
symmetric bilinear form. Then there exists an orthonormal basis (v1 , · · · , vn ) for
V with respect to which ψ is represented by a diagonal matrix.
The analogous statement holds for unitary endomorphisms: if V is a finite-dimensional
complex inner product space and α ∈ End(V ) is unitary, then V has an orthonormal
basis of eigenvectors of α. Since C is algebraically closed, α has an eigenvalue λ with
eigenvector v ≠ 0 (and λ ≠ 0 since α is invertible). Let W = ⟨v⟩⊥ . For any w ∈ W
we have ⟨α(w), v⟩ = ⟨w, α∗ (v)⟩ = ⟨w, α⁻¹ (v)⟩ = λ⁻¹ ⟨w, v⟩ = 0.
So α(w) ∈ W and hence α|W ∈ End(W ). Also, α|W is unitary since α is. So
by induction on dim V , W has an orthonormal basis of α eigenvectors. If we
add v/kvk to this basis, we get an orthonormal basis of V itself comprised of α
eigenvectors.
If αv = λv, then |λ|2 hv, vi = hαv, αvi = hv, vi, hence |λ| = 1.
This theorem and the analogous one for self-adjoint endomorphisms have a com-
mon generalization, at least for complex inner product spaces. The key fact that
leads to the existence of an orthonormal basis of eigenvectors is that α and α∗
commute. This is clearly a necessary condition, since if α is diagonalizable, then
α∗ is diagonal in the same basis (since it is just the transpose (and conjugate)),
and hence they commute. It turns out this is also a sufficient condition.
However, we cannot generalize this in the real orthogonal case. For example, the matrix
(  cos θ   sin θ )
( −sin θ   cos θ )   ∈ O(R²)
cannot be diagonalized (if θ ∉ πZ).
CHAPTER 5
Analysis II
D. 5-1
Alternatively, we can say ∀ε > 0, ∃N s.t. ∀n > N, supx∈E d(fn (x), f (x)) < ε.
For M = R with the usual metric (norm) this is kfn − f k∞ → 0 as a real
sequence, where kgk∞ = supx∈E |g(x)|.
? Just to be clear, when we write fn → 0, the 0 refers to the function that sends
everything to 0; this can be understood as the “0” (additive identity) of the
vector space of functions (eg. C[a, b]).
• is uniformly Cauchy if ∀ε > 0, ∃N s.t.∀m, n > N, supx∈E d(fn (x), fm (x)) < ε.
E. 5-2
Hence we want to find a middle ground between the two cases — a notion of conver-
gence that is sufficiently strong to preserve most interesting properties, without
being too trivial. To do so, we can examine what went wrong in the examples
above. In the last example, even though our sequence fn does indeed tend point-
wise to f , different points converge at different rates to f . For example, at x = 1,
we already have f1 (1) = f (1) = 1. However, at x = (100!)−1 , f99 (x) = 0 while
f (x) = 1. No matter how large n is, we can still find some x where fn (x) differs
a lot from f (x). In other words, if we are given pointwise convergence, there is no
guarantee that for very large n, fn will “look like” f , since there might be some
points for which fn has not started to move towards f . Hence, what we need is
for fn to converge to f at the same pace; this is known as uniform convergence.
E. 5-3
• It should be clear from definition that if fn → f uniformly, then fn → f pointwise.
But the converse is false:
fn : [−1, 1] → R is defined by fn (x) = x^{1/(2n+1)} . If the uniform limit existed,
then it must be given by the pointwise limit
              (  1    0 < x ≤ 1
fn (x) → f (x) = (  0    x = 0
              ( −1   −1 ≤ x < 0.
But sup_{x∈[−1,1]} |fn (x) − f (x)| does not tend to 0: for x > 0 close to 0 we have
fn (x) close to 0 while f (x) = 1. So the convergence is not uniform.
• Let fn : [0, 1] → R be given by fn (x) = (1 − x)x^n ; then fn → 0 uniformly.
We split the function into two parts which we can “control”. Given ε > 0,
∃N s.t. (1 − ε)^N ≤ ε. Now for all n ≥ N we have sup_{x∈[0,1]} |fn (x) − 0| ≤ ε
since
|fn (x)| = (1 − x)x^n ≤ ( 1 · (1 − ε)^n ≤ ε   for x ∈ [0, 1 − ε]
                        ( ε · 1^n = ε        for x ∈ [1 − ε, 1].
Alternatively we could do this by finding the maximum of fn through differentia-
tion.
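The two examples can be compared numerically by estimating sup |fn − f| (a sketch of our own; note a finite grid only gives a lower bound for a supremum, so the second family is handled by evaluating at a well-chosen point):

```python
import numpy as np

x = np.linspace(0., 1., 10001)

# f_n(x) = (1 - x) x^n converges to 0 uniformly: sup|f_n| -> 0
sup_good = [float(np.max((1 - x) * x**n)) for n in (1, 10, 100, 1000)]
print(sup_good)  # decreasing towards 0

# f_n(x) = x^(1/(2n+1)): at x_n = exp(-(2n+1)) we get f_n(x_n) = 1/e while the
# pointwise limit there is 1, so sup|f_n - f| >= 1 - 1/e for every n
gap = [1 - np.exp(-(2*n + 1))**(1 / (2*n + 1)) for n in (1, 10, 100)]
print(gap)  # each about 0.632: the convergence is not uniform
```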
T. 5-5
Let E be a set and M be a metric space. (fn : E → M ) converges uniformly
implies (fn : E → M ) is uniformly Cauchy. Also, the converse is true if M is
complete.
Clearly the same result holds for pointwise convergence and pointwise Cauchy
sequence.
If we are given a concrete sequence of functions, then the usual way to show it
converges uniformly is to compute the pointwise limit (since the uniform limit, if
it exists, must be the same as the pointwise limit) and then prove that
the convergence is uniform. Since R is complete, if we have a sequence of
real-valued functions, it is often much easier to show that it is uniformly convergent by
showing that it is uniformly Cauchy.
T. 5-6
1. <Uniform limit theorem> Let E and M be metric spaces with x ∈ E. If
fn : E → M are continuous at x for all n and fn → f uniformly, then f is also
continuous at x.
2. If fn , f : [a, b] → R are Riemann integrable for all n and fn → f uniformly,
Rb Rb
then a fn (t) dt → a f (t) dt.
1. Let ε > 0. Choose N such that ∀n ≥ N , supy∈E d(fn (y), f (y)) < ε. Since fN is
continuous at x, there is some δ such that d(x, y) < δ ⇒ d(fN (x), fN (y)) < ε.
Then for each y such that d(x, y) < δ, we have
d(f (x), f (y)) ≤ d(f (x), fN (x)) + d(fN (x), fN (y)) + d(fN (y), f (y)) < 3ε.
2. | ∫_a^b fn (t) dt − ∫_a^b f (t) dt | = | ∫_a^b (fn (t) − f (t)) dt | ≤ ∫_a^b |fn (t) − f (t)| dt
      ≤ sup_{t∈[a,b]} |fn (t) − f (t)| · (b − a) → 0 as n → ∞.
Note also the first result implies if fn are continuous everywhere, then f is con-
tinuous everywhere. A slightly different version of the first result is: Let E be a
topological space and M a metric space. If fn : E → M is continuous for all n and
fn → f uniformly, then f is also continuous. To prove this note that by [L.1-37]
we just need to show that given any x ∈ E and ε > 0, we can find a neighbour-
hood U of x in E such that f (U ) ⊆ Bε (f (x)). Again we could pick N such that
supy∈E d(fN (y), f (y)) < ε/3, now since fN is continuous, we can find a neighbour-
hood U of x in E such that fN (U ) ⊆ Bε/3 (fN (x)). So now for all u ∈ U we have
d(f (u), f (x)) ≤ d(f (u), fN (u)) + d(fN (u), fN (x)) + d(fN (x), f (x)) < ε. The first
result can be concisely phrased as “the uniform limit of continuous functions is
continuous”.
We will prove later that if fn is integrable and fn → f uniformly, then f is
integrable.
These results show that uniform convergence tends to preserve properties of func-
tions. However, the relationship between uniform convergence and differentiability
is more subtle. The uniform limit of differentiable functions need not be differ-
entiable. Even if it were, the limit of the derivative is not necessarily the same
as the derivative of the limit, even if we just want pointwise convergence of the
derivative.
• Let fn , f : [−1, 1] → R be defined by fn (x) = |x|1+1/n and f (x) = |x|. Then
fn → f uniformly. Each fn is differentiable — this is obvious at x 6= 0, and at
x = 0, the derivative is
fn′ (0) = lim_{x→0} (fn (x) − fn (0))/x = lim_{x→0} sgn(x)|x|^{1/n} = 0.
• Let fn : R → R be given by fn (x) = (1/√n) sin(nx). Then
sup_{x∈R} |fn (x)| ≤ 1/√n → 0.
So fn → f = 0 uniformly on R. However, the derivative is fn′ (x) = √n cos(nx),
which does not converge to f ′ = 0, eg. at x = 0.
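This example is easy to see numerically: sup |fn| shrinks like 1/√n while fn′(0) = √n cos(0) = √n blows up (a sketch of our own):

```python
import numpy as np

x = np.linspace(-10, 10, 100001)
for n in (1, 100, 10000):
    fn = np.sin(n * x) / np.sqrt(n)
    sup_fn = float(np.max(np.abs(fn)))  # bounded by 1/sqrt(n), tends to 0
    deriv_at_0 = np.sqrt(n)             # f_n'(0) = sqrt(n) cos(0) = sqrt(n)
    print(n, sup_fn, deriv_at_0)
```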
T. 5-7
Let fn : [a, b] → R be a sequence of functions differentiable on [a, b] (at the end
points a, b, this means that the one-sided derivatives exist). If
1. For some c ∈ [a, b], fn (c) converges
2. The sequence of derivatives (fn0 ) converges uniformly on [a, b]
then (fn ) converges uniformly on [a, b], and if f = limn fn , then f is differentiable
with derivative f 0 (x) = limn fn0 (x).
To show that (fn ) converges uniformly on [a, b], we want to find an N such that
n, m > N implies sup |fn − fm | < ε. Fix x ∈ [a, b]. We apply the mean value
theorem to fn − fm to get
(fn (x) − fm (x)) − (fn (c) − fm (c)) = (x − c)(fn′ (t) − fm′ (t))
for some t between x and c. Taking the supremum and rearranging terms, we obtain
sup_{x∈[a,b]} |fn (x) − fm (x)| ≤ |fn (c) − fm (c)| + (b − a) sup_{t∈[a,b]} |fn′ (t) − fm′ (t)|.
So given any ε, since (fn′ ) converges uniformly and (fn (c)) converges, they are
uniformly Cauchy and Cauchy respectively; so there is some N such that for any
n, m ≥ N , sup_{t∈[a,b]} |fn′ (t) − fm′ (t)| < ε and |fn (c) − fm (c)| < ε.
Hence for all n, m ≥ N, supx∈[a,b] |fn (x) − fm (x)| < (1 + b − a)ε. So (fn ) converges
uniformly on [a, b]. Let f = lim fn .
Now we have to check differentiability. Let fn′ → h (ie. write h = lim fn′ ). For any
fixed y ∈ [a, b], define the two functions
gn (x) = ( (fn (x) − fn (y))/(x − y)   x ≠ y        g(x) = ( (f (x) − f (y))/(x − y)   x ≠ y
         ( fn′ (y)                     x = y,               ( h(y)                      x = y.
By definition, fn is differentiable at y iff gn is continuous at y, and also f is
differentiable with derivative h at y iff g is continuous at y. However, we know
that gn → g pointwise on [a, b], and we know that gn are all continuous. So if we
can show that gn → g uniformly, then g is continuous and hence the final result.
For x ≠ y, the mean value theorem applied to fn − fm gives
gn (x) − gm (x) = ((fn (x) − fm (x)) − (fn (y) − fm (y)))/(x − y) = fn′ (t) − fm′ (t)
for some t between x and y. This also holds for x = y, since gn (y) − gm (y) = fn′ (y) − fm′ (y)
by definition.
Let ε > 0. Since (fn′ ) converges uniformly, there is some N such that for all n, m > N ,
we have |gn (x) − gm (x)| ≤ sup_{[a,b]} |fn′ − fm′ | < ε. So for all n, m ≥ N , sup_{[a,b]} |gn − gm | <
ε, ie. (gn ) converges uniformly.
Note that we do not assume that fn0 are continuous or even Riemann integrable.
If they are, then the proof is much easier! If we assume fn0 are continuous, then
by the fundamental theorem of calculus, we have
fn (x) = fn (c) + ∫_c^x fn′ (t) dt.   (∗)
Since fn′ → h uniformly, we can pass to the limit under the integral sign in (∗) to
get f (x) = f (c) + ∫_c^x h(t) dt, where h is continuous (as a uniform limit of
continuous functions). Then the fundamental theorem of calculus says that f is
differentiable and f ′ (x) = h(x) = lim fn′ (x). So done.
................................................................................
The result can be generalised for the sequence of differentiable functions fn : Ω →
C, where Ω ⊆ C is a bounded convex set.
Let F : Ω → C be a holomorphic function, and let c and x be distinct points in
Ω. Applying the mean value theorem to F1 , F2 : [0, 1] → R defined by F1 (t) =
Re(F (c + t(x − c))/(x − c)) and F2 (t) = Im(F (c + t(x − c))/(x − c)), we see that
there exist points u, v on the line segment from c to x such that
Re(F ′ (u)) = Re( (F (x) − F (c))/(x − c) ),   Im(F ′ (v)) = Im( (F (x) − F (c))/(x − c) ).
In particular, |F (x) − F (c)| ≤ |x − c|(|F ′ (u)| + |F ′ (v)|) ≤ 2|x − c| sup_{t∈Ω} |F ′ (t)|.
Applying this to F = fn − fm , and letting α be a bound on |x − c| for x, c ∈ Ω (which
exists since Ω is bounded), the first estimate of the proof becomes
sup_{x∈Ω} |fn (x) − fm (x)| ≤ |fn (c) − fm (c)| + 2α sup_{t∈Ω} |fn′ (t) − fm′ (t)|.
Similarly, |gn (x) − gm (x)| ≤ |(fn′ − fm′ )(t)| + |(fn′ − fm′ )(w)|
for some t, w on the line segment from x to y. So sup_{x∈Ω} |gn (x) − gm (x)| ≤
2 sup_{t∈Ω} |fn′ (t) − fm′ (t)|. So our original proof still holds.
P. 5-8
1. Let fn , gn : E → C, be sequences, and fn → f , gn → g uniformly on E. Then
for any a, b ∈ C, afn + bgn → af + bg uniformly.
2. Let fn → f uniformly, and let g : E → C be bounded. Then gfn : E → C
converges uniformly to gf .
1. sup |(afn + bgn ) − (af + bg)| ≤ |a| sup |fn − f | + |b| sup |gn − g| → 0 as n → ∞.
2. Say |g(x)| < M for all x ∈ E. Then |(gfn )(x) − (gf )(x)| ≤ M |fn (x) − f (x)|.
So supE |gfn − gf | ≤ M supE |fn − f | → 0.
Note that 2 is false without assuming boundedness. An easy example is to take
fn (x) = 1/n for x ∈ R, and g(x) = x. Then fn → 0 uniformly, but g(x)fn (x) = x/n does
not converge uniformly to 0.
For each N , we can make this difference large enough by picking a really large n,
and then making x close enough to 1. So the supremum of it does not tends to 0
as n, m → ∞.
T. 5-12
<Weierstrass M-test> Let gn : E → M , where M is a complete normed space.
Suppose there is some sequence (Mn ) such that ‖gn (x)‖ ≤ Mn for all n and all
x ∈ E, and ∑ Mn converges. Then ∑ gn converges uniformly on E.
Let fn = ∑_{j=1}^n gj be the partial sums. For n > m and any x ∈ E,
‖fn (x) − fm (x)‖ = ‖ ∑_{j=m+1}^n gj (x) ‖ ≤ ∑_{j=m+1}^n Mj .
Taking the supremum, we have sup_x ‖fn (x) − fm (x)‖ ≤ ∑_{j=m+1}^n Mj → 0 as n, m → ∞.
So done by [T.5-5].
Note that this holds when M = C or R.
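For instance, ∑ sin(nx)/n² converges uniformly on R by the M-test with Mn = 1/n², since ∑ 1/n² < ∞. A numerical sketch (our own) of the uniform Cauchy estimate between two partial sums:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 2001)

def partial_sum(N):
    # partial sum of sum_n sin(nx)/n^2 up to N, evaluated on the grid
    return sum(np.sin(n * x) / n**2 for n in range(1, N + 1))

f100, f200 = partial_sum(100), partial_sum(200)
observed = float(np.max(np.abs(f200 - f100)))
tail_bound = sum(1 / n**2 for n in range(101, 201))  # sum of the M_n in between
print(observed <= tail_bound)  # True
```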
L. 5-13
Let V, W be normed spaces and f : V → W . Let U = ∪_{α∈A} Uα , where the Uα are
open subsets of V (for all α ∈ A). If f |Uα : Uα → W is continuous for all α ∈ A,
then f |U : U → W is continuous.
We say that the sum converges locally absolutely uniformly inside the circle of conver-
gence, ie. for every point y ∈ BR (a), there is some open disc around y on which
the sum converges absolutely uniformly. However, the convergence need not be
uniform on the whole disc of convergence. For the geometric series ∑ x^j and
0 < x < 1, the partial sums satisfy
∑_{j=m}^n x^j = x^m ∑_{j=0}^{n−m} x^j ≥ x^n (1 − x^{n−m+1})/(1 − x).
This is not uniformly small, since we can make it large by picking x close to 1
(and n − m large).
T. 5-15
(Term-wise differentiation of power series) If ∑ cn (x − a)^n is a power series with
radius of convergence R > 0, then
1. The “derived series” ∑_{n=1}^∞ n cn (x − a)^{n−1} has radius of convergence R.
2. The function defined by f (x) = ∑ cn (x − a)^n , x ∈ BR (a) = {y ∈ C : |y − a| <
R}, is differentiable with derivative f ′ (x) = ∑ n cn (x − a)^{n−1} within the (open)
circle of convergence.
Let R1 be the radius of convergence of the derived series. Suppose R1 < R; then
there are r1 , r such that R1 < r1 < r < R, where ∑ n|cn |r1^{n−1} diverges while
∑ |cn |r^n converges. But this cannot be true, since n|cn |r1^{n−1} ≤ |cn |r^n for
sufficiently large n (as n r1^{n−1}/r^n → 0). So R1 ≥ R. Conversely, |cn |r^n ≤
r · n|cn |r^{n−1} for n ≥ 1, so the original series converges absolutely wherever the
derived series does, giving R1 ≤ R. So we must have R1 = R.
Let fn (x) = ∑_{j=0}^n cj (x − a)^j , so that fn′ (x) = ∑_{j=1}^n j cj (x − a)^{j−1} . We want to
use [T.5-7] (in the complex form above). This requires that fn converges at a point,
and that fn′ converges uniformly. The first is obviously true, and we know that fn′
converges uniformly on the closed disc B̄r (a) for any r < R (by part 1 and the
Weierstrass M-test). So for each x0 ∈ BR (a), there is some open disc containing x0
on which fn′ is uniformly convergent. So on this disc, we know that f (x) =
lim_{n→∞} fn (x) is differentiable with f ′ (x) = lim_{n→∞} fn′ (x) = ∑_{j=1}^∞ j cj (x − a)^{j−1} .
In particular, f ′ (x0 ) = ∑_{j=1}^∞ j cj (x0 − a)^{j−1} . Since this is true for all x0 , the
result follows.
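A sanity check with the geometric series (our own example): f(x) = ∑ xⁿ = 1/(1 − x) has f′(x) = ∑ n xⁿ⁻¹ = 1/(1 − x)², and inside the circle of convergence the partial sums of the derived series do converge to f′:

```python
import numpy as np

x = 0.5   # a point inside the radius of convergence R = 1
N = 200   # number of terms of the derived series
f_prime_series = sum(n * x**(n - 1) for n in range(1, N + 1))
f_prime_exact = 1 / (1 - x)**2  # derivative of 1/(1-x)
print(abs(f_prime_series - f_prime_exact) < 1e-12)  # True
```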
D. 5-16
Write RN to be the set of all infinite real sequences (xk ).
C. 5-17
<Space of sequences> We extend our notions on Rn (a finite-dimensional vector
space) to the infinite-dimensional space RN. RN is a vector space with termwise
addition and scalar multiplication.
• Define ℓ1 = {(xk ) ∈ RN : ∑ |xk | < ∞}. This is a linear subspace of RN. We can
define the norm on it by ‖(xk )‖1 = ‖(xk )‖ℓ1 = ∑ |xk |.
• Similarly, we can have the subspace ℓ2 = {(xk ) ∈ RN : ∑ xk² < ∞} and a norm
on it defined by ‖(xk )‖2 = ‖(xk )‖ℓ2 = (∑ xk²)^{1/2} .
We can also write this as k(xk )k`2 = limn→∞ k(x1 , · · · , xn )k2 . So the triangle
inequality for the Euclidean norm implies the triangle inequality for `2 .
• In general, for p ≥ 1, we can define ℓp = {(xk ) ∈ RN : ∑ |xk |^p < ∞} with the
norm ‖(xk )‖p = ‖(xk )‖ℓp = (∑ |xk |^p )^{1/p} .
• Finally, we have `∞ , where `∞ = {(xk ) ∈ RN : sup |xk | < ∞}, with the norm
k(xk )k∞ = k(xk )k`∞ = sup |xk |.
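A quick sketch (our own) of the different ℓp norms on a truncated sequence, eg. xk = 1/k, which lies in ℓ² and ℓ∞ but not in ℓ¹:

```python
import numpy as np

k = np.arange(1, 100001)
x = 1.0 / k  # the sequence x_k = 1/k, truncated at 10^5 terms

norm1 = float(np.sum(np.abs(x)))        # partial sums diverge like log(n)
norm2 = float(np.sqrt(np.sum(x**2)))    # converges: sum 1/k^2 = pi^2/6
norm_inf = float(np.max(np.abs(x)))     # sup |x_k| = 1

print(norm1, norm2, norm_inf)
```

The growing first value illustrates that (1/k) is not in ℓ¹, while the second is close to √(π²/6).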
E. 5-18
When we define the `p norm and space, we first have the norm defined as a sum,
and then `p to be the set of all sequences for which the sum converges. However,
note in [C.1-13] when we define the Lp space, we restrict ourselves to C([0, 1]), and
then define the norm. Can we just define, say, L1 to be the set of all functions such
that ∫₀¹ |f | dx exists? No, because then the “norm” would no longer be a norm: if
we take the function f (x) = 1 when x = 0.5 and 0 otherwise, then f is integrable
with integral 0, but is not identically zero (ie. the requirement ‖v‖ = 0 ⇔ v = 0
is violated). So we cannot expand our vector space to be too large. To define Lp
properly, we need some more sophisticated notions such as Lebesgue integrability
and other fancy stuff, which we are not doing here.
D. 5-19
Let V be a (real) vector space. Two norms k · k and k · k0 on V are called
Lipschitz equivalent if there are real constants 0 < a < b such that akxk ≤
kxk0 ≤ bkxk for all x ∈ V .
E. 5-20
Note that Lipschitz equivalence is an equivalence relation on the set of all
norms on V . Looking at [L.1-31] and [E.1-32], we see that norms that are Lipschitz
equivalent induce the same topology, hence the “topological” properties of the
space do not depend on which norm we choose, and the norms will agree on which
sequences are convergent and which functions are continuous.
We can also see that the requirement a‖x‖ ≤ ‖x‖′ ≤ b‖x‖ for Lipschitz equivalence
is equivalent to B_{1/b} (0) ⊆ B′1 (0) ⊆ B_{1/a} (0), where B′ is the ball with respect to
‖ · ‖′ , while B is the ball with respect to ‖ · ‖. (Recall Br (a) = {x ∈ V : ‖x − a‖ <
r}.)
................................................................................
Later we will show that any two norms on a finite-dimensional vector space are
Lipschitz equivalent. Here we look at infinite dimensional cases.
Let V = C([0, 1]) with the norms ‖f ‖1 = ∫₀¹ |f | dx and ‖f ‖∞ = sup_{[0,1]} |f |.
We clearly have the bound ‖f ‖1 ≤ ‖f ‖∞ . However, there is no constant b such
that ‖f ‖∞ ≤ b‖f ‖1 for all f . This is easy to show by constructing a sequence of
spike functions fn of width 2/n and height 1 (say, a triangle supported on [0, 2/n]).
Then ‖fn ‖∞ = 1 but ‖fn ‖1 = 1/n → 0.
Similarly, consider the space ℓ2 = {(xn ) : ∑ xn² < ∞} under the regular ℓ2 norm
and the ℓ∞ norm. We have ‖(xk )‖∞ ≤ ‖(xk )‖ℓ2 , but there is no b such that
‖(xk )‖ℓ2 ≤ b‖(xk )‖∞ . For example, consider the sequence x(n) = (1, 1, · · · , 1, 0, 0, · · · ),
where the first n terms are 1: then ‖x(n) ‖∞ = 1 while ‖x(n) ‖ℓ2 = √n.
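Both failures can be seen numerically (a sketch of our own, approximating ‖·‖1 by a Riemann sum on a grid):

```python
import numpy as np

x = np.linspace(0., 1., 1000001)
dx = x[1] - x[0]
for n in (10, 100, 1000):
    # triangular spike of height 1 peaking at x = 1/n, supported on [0, 2/n]
    fn = np.clip(1 - np.abs(n * x - 1), 0, None)
    print(n, float(np.max(fn)), float(np.sum(fn) * dx))  # ~1 and ~1/n

for n in (10, 100, 1000):
    xs = np.ones(n)  # (1, ..., 1, 0, ...) with n ones
    print(n, float(np.max(np.abs(xs))), float(np.sqrt(np.sum(xs**2))))  # 1 and sqrt(n)
```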
So far in all our examples, out of the two inequalities, one holds and one does not.
Actually it is possible for both inequalities to not hold.
P. 5-21
Suppose k · k and k · k0 are two norms on the vector space V . The followings are
equivalent:
1. ∃C > 0 such that kvk0 ≤ Ckvk for all v ∈ V .
2. The map id : (V, k · k) → (V, k · k0 ) defined by id(v) = v is continuous.
3. τ 0 ⊆ τ where τ and τ 0 are the topology induced by k · k and k · k0 respectively.
In particular if k · k and k · k0 are Lipschitz equivalent, then τ 0 = τ and convergence
and Cauchy-ness (of a sequence), boundedness (of a set), continuity (of a function),
completeness (of the space) etc. on k · k and k · k0 agrees.
L. 5-22
Let (V, ‖ · ‖) be a normed space, (xk ) and (yk ) sequences in V with xk → x and
yk → y, and a a scalar. Then 1. the limit of a convergent sequence is unique,
2. axk → ax, and 3. xk + yk → x + y.
1. kx − yk ≤ kx − xk k + kxk − yk → 0. So kx − yk = 0. So x = y.
2. kaxk − axk = |a|kxk − xk → 0.
3. k(xk + yk ) − (x + y)k ≤ kxk − xk + kyk − yk → 0.
P. 5-23
Let (V, k · k) be a normed vector space, then k · k : (V, k · k) → R is continuous.
This follows from the triangle inequality: |‖x‖ − ‖y‖| ≤ ‖x − y‖, so xk → x
implies ‖xk ‖ → ‖x‖.
P. 5-24
Let (x(k) ) be a sequence in Rn with the Euclidean norm, and x ∈ Rn . Then
x(k) → x if and only if xj(k) → xj for each j ∈ {1, · · · , n}; ie. convergence in Rn
is equivalent to componentwise convergence.
Fix ε > 0. Suppose x(k) → x. Then there is some N such that for any k ≥ N
we have
‖x(k) − x‖2² = ∑_{j=1}^n (xj(k) − xj )² < ε².
Hence |xj(k) − xj | < ε for all k ≥ N and each j. Conversely, if for each fixed j
there is some Nj such that k ≥ Nj implies |xj(k) − xj | < ε/√n, then for
k ≥ max{Nj : j = 1, · · · , n},
‖x(k) − x‖2 = ( ∑_{j=1}^n (xj(k) − xj )² )^{1/2} < ε.
Note that this results also says that f = (f1 , · · · , fm ) : Rn → Rm is continuous iff
each of fi is continuous. This is because f is continuous iff f (xn ) → f (x) whenever
xn → x.
E. 5-26
Another space we would like to understand is the space of continuous functions.
It should be clear that uniform convergence (supx |fn − f | → 0) is the same as
convergence under the uniform norm (as kfn −f k = supx |fn −f |), hence the name.
However, there is no norm such that convergence under the norm is equivalent to
pointwise convergence, ie. pointwise convergence is not normable. In fact, it is
not even metrizable. However, we will not prove this.
T. 5-27
<Bolzano-Weierstrass theorem in Rn > Any bounded sequence in Rn (with,
say, the Euclidean norm) has a convergent subsequence.
(Proof 1) We induct on n; the case n = 1 is the usual Bolzano-Weierstrass theorem
in R. For n > 1, write x(k) = (y(k) , xn(k) ) with y(k) ∈ Rn−1 . Since
‖y(k) ‖² + |xn(k) |² = ‖x(k) ‖²,
it follows that both (y(k) ) and (xn(k) ) are bounded. So by the induction hypothesis,
there is a subsequence (kj ) of (k) and some y ∈ Rn−1 such that y(kj) → y. Also,
by Bolzano-Weierstrass in R, there is a further subsequence (xn(kjℓ) ) of (xn(kj) ) that
converges to, say, yn ∈ R. Then we know that x(kjℓ) → (y, yn ).
(Proof 2) By [T.1-128] and [T.1-139].
All finite dimensional vector spaces are isomorphic to Rn as vector spaces for some
n, and we will later show that all norms on finite dimensional spaces are equiv-
alent. This means every finite-dimensional normed space satisfies the Bolzano-
Weierstrass property. It turns out the converse is also true: if a normed vector
space satisfies the Bolzano-Weierstrass property, then it must be finite-dimensional.
E. 5-28
Note that the above is generally not true for infinite-dimensional normed spaces;
finite-dimensionality is important for both of the above results.
• Consider (ℓ∞ , ‖ · ‖∞ ). Let e(k) be the sequence with ej(k) = δjk , ie. with 1 in the kth
component and 0 in all other components. Then ej(k) → 0 as k → ∞ for each fixed j,
and hence e(k) converges componentwise to the zero element 0 = (0, 0, · · · ). However,
e(k) does not converge to the zero element, since ‖e(k) − 0‖∞ = 1 for all k. Also, the
sequence is bounded but does not have a convergent subsequence for the same reason,
since ‖e(k) − e(j) ‖∞ = 1 for all k ≠ j.
• Let C([0, 1]) have the ‖ · ‖L2 norm. Consider fn (x) = sin(2nπx). We know that
‖fn ‖²L2 = ∫₀¹ |fn |² = 1/2.
So the sequence is bounded. However, it doesn’t have a convergent subsequence. If it did,
say fnj → f in L2 , then we must have ‖fnj − fnj+1 ‖2 → 0. However, by direct
calculation, we know that
‖fnj − fnj+1 ‖²2 = ∫₀¹ (sin(2nj πx) − sin(2nj+1 πx))² dx = 1.
In fact the same argument shows also that the sequence (sin 2nπx) has no
subsequence that converges pointwise on [0, 1]: we need the result that if (fj )
is a sequence in C([0, 1]) that is uniformly bounded with fj → f pointwise,
then fj converges to f under the L2 norm. However, we will not be able to
prove this (in a nice way) without Lebesgue integration from IID Probability
and Measure.
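The value ‖f_n − f_m‖²_{L²} = 1 for n ≠ m can be confirmed numerically. A quick sketch (standard-library Python only; the midpoint-rule step count is an arbitrary choice):

```python
import math

def l2_dist_sq(n, m, steps=20000):
    """Midpoint-rule approximation of ∫₀¹ (sin(2πn x) − sin(2πm x))² dx."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += (math.sin(2 * math.pi * n * x) - math.sin(2 * math.pi * m * x)) ** 2 * h
    return total

# For n ≠ m the cross term integrates to 0 and each sin² integrates to 1/2,
# so the squared distance is 1/2 + 1/2 = 1: no subsequence can be L²-Cauchy.
print(l2_dist_sq(1, 2))
print(l2_dist_sq(3, 7))
```

Since the integrand is a trigonometric polynomial over a full period, the equally spaced midpoint rule here is accurate far beyond what the argument needs.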
L. 5-29
1. Any convergent sequence is Cauchy.
2. A Cauchy sequence is bounded.
3. If a Cauchy sequence has a subsequence converging to an element x, then the
whole sequence converges to x.
Clearly these results apply to normed spaces as well as metric spaces.
T. 5-30
Rn (with the Euclidean norm, say) is complete.
(Proof 1) If (x^(k)) is Cauchy in R^n, then (x_j^(k))_k is a Cauchy sequence of real numbers for each j ∈ {1, ··· , n}. By the completeness of the reals, we know that x_j^(k) → x_j ∈ R for some x_j. So x^(k) → x = (x₁, ··· , x_n), since convergence in R^n is equivalent to componentwise convergence.
(Proof 2) By [T.1-146].
E. 5-31
• Let V = {(x_n) ∈ R^N : x_j = 0 for all but finitely many j}. Take the supremum norm ‖·‖∞ on V . V is a subspace of ℓ∞ (and is sometimes denoted ℓ₀). Then (V, ‖·‖∞) is not complete: we define x^(k) = (1, 1/2, 1/3, ··· , 1/k, 0, 0, ···) for k = 1, 2, 3, ···. Then this is Cauchy, since

    ‖x^(k) − x^(ℓ)‖ = 1/(min{ℓ, k} + 1) → 0,

but it is not convergent in V . If it actually converged to some x, then x_j^(k) → x_j. So we must have x_j = 1/j, but this sequence is not in V .
• We claim that C([0, 1]) is not complete in L¹ (ie. with the ‖·‖₁ norm). Consider f_n ∈ C([0, 1]), the piecewise-linear function whose graph consists of the straight lines joining (0, 0), (1/2, 0), (1/2 + 1/n, 1) and (1, 1). Then (f_n) is Cauchy, but it doesn't converge:
Suppose f_n → f ∈ C([0, 1]) in L¹. Then f_n → f in L¹ on [0, 1/2]; but f_n is the zero constant function on [0, 1/2], and the only continuous function it can converge to in L¹ is the zero constant function, so f|_{[0,1/2]} = 0. Also, for any N we must have f_n → f in L¹ on [1/2 + 1/N, 1]; but for n large enough, f_n|_{[1/2+1/N,1]} is the constant function 1, so f|_{[1/2+1/N,1]} = 1. Since N is arbitrary, f|_{(1/2,1]} = 1. But then f is not continuous at 1/2, so f ∉ C([0, 1]), contradiction.
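The Cauchy property of these f_n can also be seen numerically: the L¹ distance between two of the ramps is just the small area between them. A sketch (the grid resolution is an arbitrary choice):

```python
def f(n, x):
    """Piecewise-linear f_n: 0 on [0, 1/2], linear up to 1 on [1/2, 1/2 + 1/n], then 1."""
    if x <= 0.5:
        return 0.0
    if x >= 0.5 + 1.0 / n:
        return 1.0
    return (x - 0.5) * n

def l1_dist(n, m, steps=100000):
    """Midpoint-rule approximation of ‖f_n − f_m‖₁ on [0, 1]."""
    h = 1.0 / steps
    return sum(abs(f(n, (i + 0.5) * h) - f(m, (i + 0.5) * h)) * h for i in range(steps))

# The area between the ramps is at most 1/(2·min(n, m)), so the sequence is L¹-Cauchy,
# even though the pointwise limit is a discontinuous step function.
print(l1_dist(10, 20))     # ≈ 0.025
print(l1_dist(100, 200))
```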
E. 5-32
1. Show that (`1 , k · k1 ) is complete.
2. Show that (C([a, b]), k · k∞ ) is complete.
1. Let (x^(p)) be a Cauchy sequence in ℓ¹, where x^(p) = (x_n^(p))_{n≥1}. Given ε > 0, there exists N such that

    ‖x^(p) − x^(q)‖₁ = ∑_{n=1}^∞ |x_n^(p) − x_n^(q)| < ε  for all p, q ≥ N.  (∗)

Firstly we find the possible limit. By (∗), |x_n^(p) − x_n^(q)| ≤ ∑_{n'=1}^∞ |x_{n'}^(p) − x_{n'}^(q)| < ε, so for each n we see that x_n^(1), x_n^(2), ··· is a Cauchy sequence in R, so it converges to some x_n, say.
Next we want to show that x^(p) → (x_n) in (ℓ¹, ‖·‖₁). Note that for all p, q ≥ N we have ∑_{n=1}^M |x_n^(p) − x_n^(q)| ≤ ∑_{n=1}^∞ |x_n^(p) − x_n^(q)| < ε for any M . Taking the limit q → ∞ we have ∑_{n=1}^M |x_n^(p) − x_n| ≤ ε for any M ; now taking the limit M → ∞ we have ‖x^(p) − (x_n)‖₁ = ∑_{n=1}^∞ |x_n^(p) − x_n| ≤ ε. In particular (x_n) = ((x_n) − x^(N)) + x^(N) ∈ ℓ¹. So we have shown: ∀ε > 0, ∃N s.t. ∀p ≥ N , ‖x^(p) − (x_n)‖₁ ≤ ε, that is x^(p) → (x_n).
E. 5-33
The spaces `1 , `2 , `∞ are all complete with respect to the standard norms. C([0, 1])
is complete with respect to k · k∞ but not with the L1 or L2 norms.
The incompleteness of L1 tells us that C([0, 1]) is not large enough to to be com-
plete under the L1 or L2 norm. In fact, the space of Riemann integrable functions,
say R([0, 1]), is the natural space for the L1 norm, and of course contains C([0, 1]).
As we have previously
R1 mentioned, this time R([0, 1]) is too large for k · k to be
a norm, since 0 |f | dx = 0 does not imply f = 0. This is a problem we can
solve. We just have to take the equivalence R 1 classes of Riemann integrable func-
tions, where f and g are equivalent if 0 |f − g| dx = 0. But still, L1 is not
complete on R([0, 1])/∼. This is a serious problem in the Riemann integral. This
eventually lead to the Lebesgue integral, which generalizes the Riemann integral,
and gives a complete normed space.
Note
R1 that when we quotient our R([0, 1]) by the equivalence relation f ∼ g if
0
|f − g| dx = 0, we are not losing too much information about our functions. We
know that for the integral to be non-zero, f − g cannot be non-zero at a point of
continuity. Hence they agree on all points of continuities. By Lebesgue’s theorem,
the set of points of discontinuity has Lebesgue measure zero. So they disagree on
at most a set of Lebesgue measure zero.
T. 5-34
Let (V, k · k) be a normed vector space and K ⊆ V .
1. If K is compact, then K is closed and bounded.
2. If V = Rn (with, say, the Euclidean norm), then the converse of 1 is also true.
That is, K is compact iff it is closed and bounded. (Heine-Borel theorem)
(Proof 2) By [T.1-128].
E. 5-35
Let (V, k · k), (V 0 , k · k0 ) be normed spaces, and let E ⊆ V be a subset, and
f : E → V 0 a mapping. Let y ∈ E. By [P.1-23] f : E → V 0 is continuous at y if
for all ε > 0, there is δ > 0 such that ∀x ∈ E, kx−ykV < δ ⇒ kf (x)−f (y)kV 0 < ε.
Note that x ∈ E and kx − yk < δ is equivalent to saying x ∈ Bδ (y) ∩ E. Sim-
ilarly, kf (x) − f (y)k < ε is equivalent to f (x) ∈ Bε (f (y)). In other words,
x ∈ f −1 (Bε (f (y))). So we can rewrite this statement as there is some δ > 0 such
that E ∩ Bδ (y) ⊆ f −1 (Bε (f (y))).
We are going to show again that the above definition of continuity of f at y ∈ E is
equivalent to: for any sequence yk → y in E, we have f (yk ) → f (y).
(Forward) Suppose f is continuous at y ∈ E, and that yk → y. Given ε >
0, by continuity, there is some δ > 0 such that Bδ (y) ∩ E ⊆ f −1 (Bε (f (y))).
For sufficiently large k, yk ∈ Bδ (y) ∩ E. So f (yk ) ∈ Bε (f (y)), or equivalently,
kf (yk ) − f (y)kV 0 < ε.
(Backward) If f is not continuous at y, then there is some ε > 0 such that for any k, we have B_{1/k}(y) ∩ E ⊄ f⁻¹(B_ε(f(y))). Choose y_k ∈ (B_{1/k}(y) ∩ E) \ f⁻¹(B_ε(f(y))). Then y_k → y, y_k ∈ E, but ‖f(y_k) − f(y)‖ ≥ ε, contrary to the hypothesis.
T. 5-36
Let (V, k · k) and (V 0 , k · k0 ) be normed spaces, and K a compact subset of V ,
and f : V → V 0 a continuous function. Then
1. f (K) is compact in V 0
2. If V 0 = R, then f |K attains its supremum and infimum, ie. ∃y1 , y2 ∈ K such
that f (y1 ) = sup f (K) and f (y2 ) = inf f (K).
2. For any x = ∑_{j=1}^n x_j v_j we have

    ‖x‖ = ‖∑_{j=1}^n x_j v_j‖ ≤ ∑_{j=1}^n |x_j| ‖v_j‖ ≤ ‖x‖₂ (∑_{j=1}^n ‖v_j‖²)^{1/2} = b ‖x‖₂,

using the triangle inequality and then Cauchy-Schwarz, where b = (∑_{j=1}^n ‖v_j‖²)^{1/2}.
T. 5-38
If V is a finite dimensional (real) vector space, then any two norms on it are
Lipschitz equivalent.
In fact the same proof (along with the lemma) extends to complex vector spaces. Also note that the key to the proof is the compactness of the unit sphere of (V, ‖·‖). On the other hand, compactness of the unit sphere also characterizes finite-dimensionality: if the unit sphere of a normed space is compact, then the space must be finite-dimensional.
P. 5-39
Let (V, k · k) be a finite-dimensional normed space, then
1. The Bolzano-Weierstrass theorem holds, ie. any bounded sequence in V has a convergent subsequence.
2. A subset of it is compact iff it is closed and bounded.
3. It is complete.
2. Since these results hold for the Euclidean norm ‖·‖₂, it follows by Lipschitz equivalence that they hold for arbitrary finite-dimensional normed spaces. Note that closedness is the same in any equivalent norm, since convergence is, and a set is closed iff it contains all its limit points.
3. This is true since if a space is complete in one norm, then it is complete in any
Lipschitz equivalent norm, and we know that Rn under the Euclidean norm is
complete.
Note that the finite-dimensional condition is important, for example B̄1 (0) wrt
k · k∞ in C[0, 1] is closed and bounded but not (sequentially) compact, eg. take
the sequence fn = [function with straight line joining (0, 1), (1/n, 0), (1, 0)].
    g(x, y) = min{1, d(x, y)},    h(x, y) = d(x, y) / (1 + d(x, y))

The axioms are easily shown to be satisfied, apart from the triangle inequality. So let's check the triangle inequality for h. We'll use the general fact that for numbers a, c ≥ 0 and b, d > 0 we have

    a/b ≤ c/d  ⟺  a/(a + b) ≤ c/(c + d).

Based on this fact, we can start with d(x, y) ≤ d(x, z) + d(z, y), then we obtain
D. 5-41
Metrics d, d0 on a set X are said to be Lipschitz equivalent if there are (positive)
constants A, B such that Ad(x, y) ≤ d0 (x, y) ≤ Bd(x, y) for all x, y ∈ X.
E. 5-42
Clearly, any Lipschitz equivalent norms give Lipschitz equivalent metrics. Any
metric coming from a norm in Rn is thus Lipschitz equivalent to the Euclidean
metric. Two norms induce the same topology if and only if they are equivalent. In
some sense, Lipschitz equivalent norms are indistinguishable.
Lipschitz equivalent metrics induce the same topology. The converse, however, is
not true in general. For example, let X = R, d(x, y) = |x − y| and d0 (x, y) =
min{1, |x − y|}. It is easy to check that these are not Lipschitz equivalent, but
they induce the same collection of open subsets.
E. 5-43
We can create an easy example of an incomplete metric on Rn . We start by
defining h : Rn → Rn by
x
h(x) = ,
1 + kxk
where ‖·‖ is the Euclidean norm. We can check that this is injective: if h(x) = h(y), taking norms gives ‖x‖/(1 + ‖x‖) = ‖y‖/(1 + ‖y‖), so ‖x‖ = ‖y‖. But then h(x) = h(y) reads x/(1 + ‖x‖) = y/(1 + ‖x‖), so x = y.
Now we define d(x, y) = kh(x) − h(y)k. It is an easy check that this is a metric
on R^n. R^n under this metric is incomplete: consider the sequence x_k = (k − 1)e₁, where e₁ = (1, 0, 0, ··· , 0) is the usual basis vector. Then (x_k) is Cauchy in (R^n, d). To show this, first note that h(x_k) = (1 − 1/k)e₁. Hence we have

    d(x_n, x_m) = ‖h(x_n) − h(x_m)‖ = |1/n − 1/m| → 0.

So it is Cauchy. To show it does not converge in (R^n, d), suppose d(x_k, x) → 0 for some x. Then since d(x_k, x) = ‖h(x_k) − h(x)‖ ≥ ‖h(x_k)‖ − ‖h(x)‖, we must have

    ‖h(x)‖ = lim_{k→∞} ‖h(x_k)‖ = 1.

But ‖h(x)‖ = ‖x‖/(1 + ‖x‖) < 1 for every x ∈ R^n, contradiction.
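A small numerical sketch of this example (pure Python; the tested points are the x_k = (k − 1)e₁ from above, in R²):

```python
import math

def h(x):
    """h(x) = x / (1 + ‖x‖) maps R^n into the open unit ball."""
    norm = math.sqrt(sum(c * c for c in x))
    return tuple(c / (1.0 + norm) for c in x)

def d(x, y):
    """The pulled-back metric d(x, y) = ‖h(x) − h(y)‖."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h(x), h(y))))

# x_k = (k − 1)·e1 has h(x_k) = (1 − 1/k)·e1, so d(x_n, x_m) = |1/n − 1/m| → 0,
# yet ‖h(x)‖ < 1 for every x, so the sequence has no limit in (R^n, d).
print(d((9.0, 0.0), (99.0, 0.0)))    # x_10 vs x_100: |1/10 − 1/100| = 0.09
```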
E. 5-46
It is easy to check that being totally bounded implies being bounded.
From metric and topological space we know that all compact metric spaces are
complete and bounded. The converse is not true. For example, recall if we have
an infinite-dimensional normed vector space, then the closed unit sphere can be
complete and bounded, but not compact. Alternatively, we can take X = R with
the metric d(x, y) = min{1, |x − y|}. This is clearly bounded (by 1), and it is easy
to check that this is complete. However, this is not compact since the sequence
xk = k has no convergent subsequence.
However, we can strengthen the condition of boundedness to total boundedness,
and get the equivalence between “completeness and total boundedness” and com-
pactness.
T. 5-47
Let (X, d) be a metric space. X is (sequentially) compact if and only if X is
complete and totally bounded.
(Backward) Let (y_i) be a sequence in X. For every j ∈ N, there exists a finite set of points E_j such that every point of X is within 1/j of one of them. (Write B_r(x) = B(x, r).) Now since E₁ is finite, there is some x₁ ∈ E₁ such that there are infinitely many y_i's in B(x₁, 1). Pick the first y_i in B(x₁, 1) and call it y_{i₁}. Now there is some x₂ ∈ E₂ such that there are infinitely many y_i's in B(x₁, 1) ∩ B(x₂, 1/2). Pick the one with smallest value of i > i₁, and call this y_{i₂}. Continue inductively.
This procedure gives a sequence (x_j) with x_j ∈ E_j and a subsequence (y_{i_k}) of (y_i) with

    y_{i_n} ∈ ⋂_{j=1}^n B(x_j, 1/j).

It is easy to see that (y_{i_n}) is Cauchy, since if m > n, then d(y_{i_m}, y_{i_n}) < 2/n. By completeness of X, this subsequence converges.
(Forward) Suppose X is not totally bounded. Then there exists ε > 0 such that there is no finite set of points x₁, ··· , x_N with X = ⋃_{i=1}^N B_ε(x_i).
∀ε > 0, ∃δ > 0 s.t. ∀x, y ∈ X, d(x, y) < δ ⇒ d(f (x), f (y)) < ε.
Continuity does not imply uniform continuity, an example is given in the next
theorem. To show that uniform continuity does not imply Lipschitz, take X =
X 0 = R. We define the metrics as d(x, y) = min{1, |x − y|}, and d0 (x, y) = |x − y|.
Now consider the function f : (X, d) → (X 0 , d0 ) defined by f (x) = x. We can then
check that this is uniformly continuous but not Lipschitz.
Note that the statement that metrics d and d0 are Lipschitz equivalent is equivalent
to saying the two identity maps i : (X, d) → (X, d0 ) and i0 : (X, d0 ) → (X, d) are
Lipschitz, hence the name.
The metric itself is also a Lipschitz map for any metric space. That is, view the metric as a function d : X × X → R, with the metric on X × X defined as d̃((x₁, y₁), (x₂, y₂)) = d(x₁, x₂) + d(y₁, y₂). Then by the triangle inequality, d(x₁, y₁) ≤ d(x₁, x₂) + d(x₂, y₂) + d(y₂, y₁). Moving the middle term to the left gives d(x₁, y₁) − d(x₂, y₂) ≤ d̃((x₁, y₁), (x₂, y₂)). Swapping the roles of (x₁, y₁) and (x₂, y₂), we can put in the absolute value to obtain |d(x₁, y₁) − d(x₂, y₂)| ≤ d̃((x₁, y₁), (x₂, y₂)).
Note that we have again used the property that λ < 1. This implies d(xm , xn ) → 0
as m, n → ∞. So this sequence is Cauchy.
By the completeness of X, there exists some x ∈ X such that xn → x. Since
f is a contraction, it is continuous, so f (xn ) → f (x). However, by definition
f (xn ) = xn+1 , taking the limit on both sides, we get f (x) = x. So x is a fixed
point.
Now suppose that f^(m) is a contraction for some m. By the first part, there is a unique x ∈ X such that f^(m)(x) = x. But then f^(m)(f(x)) = f(f^(m)(x)) = f(x). So f(x) is also a fixed point of f^(m). By uniqueness of fixed points, we must have f(x) = x. Since any fixed point of f is clearly a fixed point of f^(m) as well, it follows that x is the unique fixed point of f.
Based on the proof of the theorem, we have the following error estimate in the contraction mapping theorem: for x₀ ∈ X and x_n = f(x_{n−1}), we showed that for m > n we have d(x_m, x_n) ≤ λⁿ/(1 − λ) · d(x₁, x₀). If x_n → x, taking the limit of this bound as m → ∞ shows that for any n,

    d(x, x_n) ≤ λⁿ/(1 − λ) · d(x₁, x₀).
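The error estimate can be watched in action on a concrete contraction. A sketch, using f(x) = cos(x)/2 (an arbitrary example with Lipschitz constant λ = 1/2, since |f′(x)| = |sin x|/2 ≤ 1/2):

```python
import math

f = lambda x: math.cos(x) / 2   # contraction on R with constant λ = 1/2
lam = 0.5
xs = [0.0]
for _ in range(50):             # iterate x_{n+1} = f(x_n)
    xs.append(f(xs[-1]))
x_star = xs[-1]                 # effectively the fixed point after 50 iterations

# A-priori bound d(x, x_n) ≤ λ^n/(1 − λ)·d(x_1, x_0), here with n = 10:
bound = lam**10 / (1 - lam) * abs(xs[1] - xs[0])
print(abs(x_star - xs[10]) <= bound)
```

The bound is crude (it uses only λ and the first step), but it is computable before the fixed point is known, which is what makes it useful in practice.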
Note that the theorem is false if we drop the completeness assumption. For example, f : (0, 1) → (0, 1) defined by f(x) = x/2 is clearly a contraction with no fixed point. The theorem is also false if we drop the assumption λ < 1. In fact, it is not enough to assume d(f(x), f(y)) < d(x, y) for all x ≠ y.
We can see finding fixed points as the process of solving equations. One important
application we will have is to use this to solve differential equations. After we do
some integration we will look at the Picard-Lindelöf existence theorem, which gives conditions for the existence of solutions (at least locally) to the ODE
    df/dt = F(t, f(t))  subject to  f(t₀) = x₀
where t0 ∈ R, x0 ∈ Rn , f : R → Rn and F : R × Rn → Rn . We can imagine f as
the position vector of a particle moving in Rn , passing through x0 at time t0 . We
then ask if there is a trajectory f (t) such that the velocity of the particle at any
time t is given by F(t, f (t)).
5.5 Integration
T. 5-52
If f : [a, b] → [A, B] is integrable and g : [A, B] → R is continuous, then g ◦ f :
[a, b] → R is integrable.
    U(P, g ∘ f) − L(P, g ∘ f) ≤ ε(b − a) + 2 sup_{[A,B]} |g| · (ε′/δ).

Note that g must be bounded by the maximum value theorem. Now let ε′ = εδ; then we have shown that given any ε > 0 there exists a partition such that

    U(P, g ∘ f) − L(P, g ∘ f) < ((b − a) + 2 sup_{[A,B]} |g|) ε.
T. 5-53
If fn : [a, b] → R is integrable (and bounded) for all n, and (fn ) converges uniformly
to f : [a, b] → R, then f is bounded and integrable.
So given ε > 0, we choose n such that 2(b − a)c_n < ε/2, and then choose P such that U(P, f_n) − L(P, f_n) < ε/2. Then for this partition, U(P, f) − L(P, f) < ε.
P. 5-54
Suppose f_n : [a, b] → R is integrable for each n and f_n → f uniformly. Suppose c ∈ [a, b]; let F_n(x) = ∫_c^x f_n(y) dy and F(x) = ∫_c^x f(y) dy. Then F_n → F uniformly.
By the last theorem f is integrable (on [a, b] and hence also between c and x for any x ∈ [a, b]), so F(x) exists.

    |F_n(x) − F(x)| = |∫_c^x (f_n(y) − f(y)) dy| ≤ |∫_c^x |f_n(y) − f(y)| dy|
                    ≤ |∫_c^x ‖f_n − f‖∞ dy| ≤ (b − a)‖f_n − f‖∞
Note that in general Fn → F uniformly does not hold if we replace [a, b] with R,
but it does hold for [a, b].
E. 5-55
Let f(x) = ∑_{n=0}^∞ c_n(x − a)ⁿ be a real power series with radius of convergence R. Then for any x ∈ (a − R, a + R) the following exist and equal

    ∫_a^x f(y) dy = ∑_{n=0}^∞ c_n/(n + 1) · (x − a)^{n+1},    f′(x) = ∑_{n=1}^∞ n c_n (x − a)^{n−1}.

1. The series for f converges uniformly on [a − r, a + r] for any 0 < r < R [T.5-14]. Also its partial sums f_N(x) = ∑_{n=0}^N c_n(x − a)ⁿ are integrable with ∫_a^x f_N(y) dy = ∑_{n=0}^N c_n/(n + 1) · (x − a)^{n+1}. So it follows from the last two results that f is integrable between a and x for x ∈ [a − r, a + r] for any 0 < r < R, with ∫_a^x f(y) dy = ∑_{n=0}^∞ c_n/(n + 1) · (x − a)^{n+1}. Since every x ∈ (a − R, a + R) lies in some such [a − r, a + r], the formula holds for all x ∈ (a − R, a + R).
This says we can integrate or differentiate term by term inside the radius of con-
vergence.
D. 5-56
Let f : [a, b] → Rn be a vector-valued function, where f (x) = (f1 (x), f2 (x), · · · , fn (x)).
Then we say f is Riemann integrable iff fj : [a, b] → R is Riemann integrable for
all j. We define the integral as
    ∫_a^b f(x) dx = (∫_a^b f₁(x) dx, ··· , ∫_a^b f_n(x) dx) ∈ R^n.
E. 5-57
It is easy to see that most basic properties of integrals of real functions extend to
the vector-valued case.
E. 5-58
Let ‖·‖ be the ‖·‖₁ or ‖·‖₂ norm. If f : [a, b] → R^n is integrable, then ‖f(x)‖ is integrable (on [a, b]), and ‖∫_a^b f(x) dx‖ ≤ ∫_a^b ‖f(x)‖ dx.
Write v = ∫_a^b f(x) dx. For the ‖·‖₁ norm,

    ‖v‖₁ = ∑_{j=1}^n |∫_a^b f_j(x) dx| ≤ ∑_{j=1}^n ∫_a^b |f_j(x)| dx = ∫_a^b ∑_{j=1}^n |f_j(x)| dx = ∫_a^b ‖f(x)‖₁ dx.

For the ‖·‖₂ norm,

    ‖v‖₂² = ∑_{j=1}^n v_j² = ∑_{j=1}^n v_j ∫_a^b f_j(x) dx = ∫_a^b ∑_{j=1}^n v_j f_j(x) dx = ∫_a^b v · f(x) dx
          ≤ ∫_a^b ‖v‖₂ ‖f(x)‖₂ dx = ‖v‖₂ ∫_a^b ‖f(x)‖₂ dx,

using the Cauchy-Schwarz inequality; dividing by ‖v‖₂ (the case v = 0 being trivial) gives the result.
For integrability of ‖f(x)‖: we have |‖f(x)‖ − ‖f(y)‖| ≤ ‖f(x) − f(y)‖ ≤ C‖f(x) − f(y)‖₁, where we have used the triangle inequality and the fact that ‖x‖ ≤ C‖x‖₁ (all norms on R^n being equivalent). So for a dissection D_m of [a, b] into intervals I_k,

    U_{D_m}(‖f‖) − L_{D_m}(‖f‖) = ∑_{k=1}^m |I_k| (sup_{I_k} ‖f(x)‖ − inf_{I_k} ‖f(x)‖)
    ≤ C ∑_{i=1}^n ∑_{k=1}^m |I_k| (sup_{I_k} f_i − inf_{I_k} f_i) = C ∑_{i=1}^n (U_{D_m}(f_i) − L_{D_m}(f_i)) → 0.

Alternatively, the inequality ‖∫f‖ ≤ ∫‖f‖ also follows by approximation, since ‖∑_i |I_i| f(x_i)‖ ≤ ∑_i |I_i| ‖f(x_i)‖ for any Riemann sum.
T. 5-60
<Picard-Lindelöf existence theorem> Let x0 ∈ Rn , R > 0, a < b, t0 ∈ [a, b].
Suppose F : [a, b] × B̄_R(x₀) → R^n is a continuous function satisfying

    ‖F(t, x) − F(t, y)‖₂ ≤ κ‖x − y‖₂

for some fixed κ > 0, for all t ∈ [a, b] and x, y ∈ B̄_R(x₀) = {x ∈ R^n : ‖x − x₀‖₂ ≤ R}. In other words, F(t, · ) is Lipschitz on B̄_R(x₀) with the same Lipschitz constant for every t. Then
i. There exists an ε > 0 and a unique differentiable function f : [t₀ − ε, t₀ + ε] ∩ [a, b] → R^n such that

    df/dt = F(t, f(t))  and  f(t₀) = x₀  (∗)

ii. If sup_{[a,b]×B̄_R(x₀)} ‖F‖₂ ≤ R/(b − a), then there exists a unique differentiable function f : [a, b] → R^n that satisfies (∗).
First we show that (ii) implies (i). We know that sup_{[a,b]×B̄_R(x₀)} ‖F‖₂ is finite, since F is a continuous function on a compact domain. So we can pick ε > 0 such that 2ε ≤ R / sup_{[a,b]×B̄_R(x₀)} ‖F‖₂. Then writing [t₀ − ε, t₀ + ε] ∩ [a, b] = [a₁, b₁], we have

    sup_{[a₁,b₁]×B̄_R(x₀)} ‖F‖₂ ≤ sup_{[a,b]×B̄_R(x₀)} ‖F‖₂ ≤ R/(2ε) ≤ R/(b₁ − a₁).

So (ii) applied on [a₁, b₁] implies there is a unique solution on [t₀ − ε, t₀ + ε] ∩ [a, b]. Hence it suffices to prove (ii).
To apply the contraction mapping theorem, we need to convert this into a fixed
point problem. The key is to reformulate the problem as an integral equation. We
know that a differentiable f : [a, b] → Rn satisfies the differential equation (∗) if
and only if f : [a, b] → B̄R (x0 ) is continuous and satisfies
    f(t) = x₀ + ∫_{t₀}^t F(s, f(s)) ds
by the fundamental theorem of calculus. Note that f being continuous means that F(s, f(s)) is continuous, and so f(t) = x₀ + ∫_{t₀}^t F(s, f(s)) ds is differentiable by the fundamental theorem of calculus. This is very helpful, since we can work over the much larger vector space of continuous functions, and it would be easier to find a solution.
We let X = C([a, b], B̄_R(x₀)) be the set of all continuous f : [a, b] → B̄_R(x₀). We equip X with the supremum metric, so that for all g, h ∈ X,

    ‖g − h‖ = sup_{t∈[a,b]} ‖g(t) − h(t)‖₂.

We see that X is a closed subset of the complete metric space C([a, b], R^n) (again taken with the supremum metric). So X is complete. For every g ∈ X, we define
(Intuitively, the condition in (ii) says the solution cannot escape B̄_R(x₀): since F represents the gradient of how f changes with t, if the maximum gradient times (b − a) is at most R, our f would not escape B̄_R(x₀) for any t ∈ [a, b], hence we have a nice solution.)
a function T g : [a, b] → R^n by (T g)(t) = x₀ + ∫_{t₀}^t F(s, g(s)) ds. Our differential equation is thus

    f = T f.

So we first want to show that T actually maps X → X, ie. T g ∈ X whenever g ∈ X, and then prove that it (or a power of it) is a contraction map. If g ∈ X, then

    ‖T g(t) − x₀‖₂ = ‖∫_{t₀}^t F(s, g(s)) ds‖₂ ≤ |∫_{t₀}^t ‖F(s, g(s))‖₂ ds|
                   ≤ |b − a| sup_{[a,b]×B̄_R(x₀)} ‖F‖₂ ≤ R.
Hence we know that T g(t) ∈ B̄_R(x₀), so T g ∈ X. It turns out T itself need not be a contraction. Instead, what we have is that for g₁, g₂ ∈ X,

    ‖T g₁ − T g₂‖ = sup_{t∈[a,b]} ‖∫_{t₀}^t (F(s, g₁(s)) − F(s, g₂(s))) ds‖₂
                  ≤ sup_{t∈[a,b]} |∫_{t₀}^t ‖F(s, g₁(s)) − F(s, g₂(s))‖₂ ds|
                  ≤ κ(b − a)‖g₁ − g₂‖

by the Lipschitz condition on F. If we indeed have κ(b − a) < 1 (†), then the contraction mapping theorem gives an f ∈ X such that T f = f, ie. f = x₀ + ∫_{t₀}^t F(s, f(s)) ds. However, we do not necessarily have (†). There are many ways we can solve this problem. Here, we solve it by finding an m such that T^(m) = T ∘ T ∘ ··· ∘ T : X → X is a contraction map. We will now show that this map satisfies the bound

    ‖T^(m) g₁ − T^(m) g₂‖ ≤ ((b − a)^m κ^m / m!) ‖g₁ − g₂‖.  (‡)
The key is the m!, since this grows much faster than any exponential. Given this
bound, we know that for sufficiently large m, we have ((b − a)m κm )/m! < 1, ie.
T (m) is a contraction. So by the contraction mapping theorem, the result holds.
To prove this, we prove instead the pointwise bound: for any t ∈ [a, b], we have

    ‖T^(m) g₁(t) − T^(m) g₂(t)‖₂ ≤ (κ^m |t − t₀|^m / m!) sup_{s∈[t₀,t]} ‖g₁(s) − g₂(s)‖₂.

From this, taking the supremum over t ∈ [a, b], we obtain the bound (‡).
To prove this pointwise bound, we induct on m. We wlog assume t > t₀. We know that for every m,

    ‖T^(m) g₁(t) − T^(m) g₂(t)‖₂ = ‖∫_{t₀}^t (F(s, T^(m−1) g₁(s)) − F(s, T^(m−1) g₂(s))) ds‖₂
    ≤ κ ∫_{t₀}^t ‖T^(m−1) g₁(s) − T^(m−1) g₂(s)‖₂ ds.  (a)
Note in the final part of the proof to get the factor of m!, we had to actually
perform the integral integrating (s − t0 )m−1 , instead of just bounding (s − t0 )m−1
by (t − t₀)^{m−1}. In general, this is a good strategy if we want tight bounds. Instead of bounding |∫_a^b f(x) dx| ≤ (b − a) sup |f(x)|, we write f(x) = g(x)h(x), where h(x) is something easily integrable. Then we can have the bound |∫_a^b f(x) dx| ≤ sup |g(x)| ∫_a^b |h(x)| dx.
Note also that any differentiable f satisfying the differential equation is auto-
matically continuously differentiable, since the derivative is F(t, f (t)), which is
continuous.
Even the n = 1 case of this result is an important, non-trivial special case. Even in one dimension, explicit solutions may be very difficult to find, if not impossible. For example, df/dt = f² + sin f + e^f would be almost impossible to solve explicitly. However, the theorem tells us there is a solution, at least locally.
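The integral reformulation behind the proof also gives a practical algorithm: iterate (Tf)(t) = x₀ + ∫_{t₀}^t F(s, f(s)) ds numerically. A sketch for df/dt = f, f(0) = 1, whose exact solution is e^t (the grid size and iteration count are arbitrary choices; trapezoid quadrature stands in for the exact integral):

```python
import math

def picard_step(F, x0, ts, f_vals):
    """One Picard iterate: (Tf)(t) = x0 + ∫_{t0}^t F(s, f(s)) ds, trapezoid rule on grid ts."""
    out = [x0]
    for i in range(1, len(ts)):
        h = ts[i] - ts[i - 1]
        out.append(out[-1] + h * (F(ts[i - 1], f_vals[i - 1]) + F(ts[i], f_vals[i])) / 2)
    return out

F = lambda t, x: x                    # df/dt = f with f(0) = 1
ts = [i / 1000 for i in range(1001)]  # grid on [0, 1], t0 = 0
f_vals = [1.0] * len(ts)              # start from the constant function f0 ≡ 1
for _ in range(20):                   # apply T twenty times
    f_vals = picard_step(F, 1.0, ts, f_vals)
err = abs(f_vals[-1] - math.e)
print(err)                            # small: the iterates converge to e^t
```

Each iterate here is the next Taylor polynomial of exp, mirroring the m! factor in the bound (‡): the iteration converges much faster than a geometric contraction would suggest.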
The requirements of the theorem are indeed necessary:
• We first look at the ε in (i). Without the additional requirement in (ii), there might not exist a solution globally on [a, b]. For example, consider the n = 1 case, where we want to solve df/dt = f² with boundary condition f(0) = 1. Our F(t, f) = f² is a nice, uniformly Lipschitz function on any [0, b] × B̄_R(1) = [0, b] × [1 − R, 1 + R]. However, there is no global solution: if we assume f ≠ 0, then for all t ∈ [0, b] the equation is equivalent to d/dt(t + f⁻¹) = 0. So t + f⁻¹ must be constant, and the initial condition tells us this constant is 1. So f(t) = 1/(1 − t), and the solution on [0, 1) is 1/(1 − t). Any solution on [0, b] must agree with this on [0, 1). So if b ≥ 1, there is no solution on [0, b].
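The blow-up of df/dt = f², f(0) = 1 at t = 1 is easy to observe numerically. A sketch using forward Euler (the step size is an arbitrary choice):

```python
# Forward-Euler integration of df/dt = f² with f(0) = 1.  The exact solution
# 1/(1 − t) blows up at t = 1, so no solution can exist on [0, b] for b ≥ 1.
h = 1e-5
t, f = 0.0, 1.0
while t < 0.9:
    f += h * f * f
    t += h
blowup_ratio = f * (1 - t)    # ≈ 1 if f tracks the exact solution 1/(1 − t)
print(f, blowup_ratio)
```

Pushing the stopping time closer to 1 makes the numerical solution grow without bound, matching the local-only existence the theorem guarantees.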
• The Lipschitz condition is also necessary to guarantee uniqueness. Without this condition, existence of a solution is still guaranteed (but by another theorem, the Cauchy-Peano theorem); however, we could have many different solutions. For example, consider the differential equation

    df/dt = √|f|  with  f(0) = 0.

Here F(t, x) = √|x| is not Lipschitz near x = 0. It is easy to see that f = 0 and f(t) = t²/4 are both solutions. In fact, for any α ∈ [0, b], the function

    f_α(t) = { 0            for 0 ≤ t ≤ α
             { (t − α)²/4   for α ≤ t ≤ b

is also a solution.
More generally, any system of ODEs can be reduced to an equation of the form G(x, ẋ) = 0, where G : R^n × R^n → R^m. For example, we can reduce the DE of our theorem, ḟ = F(t, f(t)), to

    d/dt (T, f) = (1, F(T, f)),  which is of the form ẋ = F̃(x(t)) where x = (T, f).
First we need a few facts about these functions. Clearly p_{n,k}(x) ≥ 0 for all x ∈ [0, 1]. Also, by the binomial theorem, ∑_{k=0}^n C(n,k) x^k y^{n−k} = (x + y)^n. Putting y = 1 − x, we get ∑_{k=0}^n p_{n,k}(x) = 1. Differentiating the binomial theorem partially with respect to x and putting y = 1 − x gives

    ∑_{k=0}^n k C(n,k) x^{k−1} (1 − x)^{n−k} = n  ⟹  ∑_{k=0}^n k p_{n,k}(x) = nx.
Pn
Similarly but differentiating once more gives k=0 k(k − 1)pn,k (x) = n(n − 1)x2 .
Adding these two results gives
n
X
k2 pn,k (x) = n2 x2 + nx(1 − x) =⇒
k=0
n
X
(nx − k)2 pn,k (x) = n2 x2 − 2nx · nx + n2 x2 + nx(1 − x) = nx(1 − x). (∗)
k=0
Given any ε > 0, we can pick δ such that |f(x) − f(y)| < ε whenever |x − y| < δ, since f is uniformly continuous. Since ∑_k p_{n,k}(x) = 1, we have f(x) = ∑_k p_{n,k}(x) f(x). Now for each fixed x, we can write

    |p_n(x) − f(x)| = |∑_{k=0}^n (f(k/n) − f(x)) p_{n,k}(x)| ≤ ∑_{k=0}^n |f(k/n) − f(x)| p_{n,k}(x)
    = ∑_{k:|x−k/n|<δ} |f(k/n) − f(x)| p_{n,k}(x) + ∑_{k:|x−k/n|≥δ} |f(k/n) − f(x)| p_{n,k}(x)
    ≤ ε ∑_{k=0}^n p_{n,k}(x) + 2 sup_{[0,1]} |f| ∑_{k:|x−k/n|≥δ} p_{n,k}(x)
    ≤ ε + 2 (sup_{[0,1]} |f|) (1/δ²) ∑_{k:|x−k/n|≥δ} (x − k/n)² p_{n,k}(x)
    ≤ ε + 2 (sup_{[0,1]} |f|) (1/δ²) ∑_{k=0}^n (x − k/n)² p_{n,k}(x)
    = ε + (2 sup |f|)/(δ² n²) · nx(1 − x) ≤ ε + (2 sup |f|)/(δ² n),

using (∗) in the last line. Hence given any ε and δ, we can pick n sufficiently large that |p_n(x) − f(x)| < 2ε. This n is picked independently of x. So done.
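The Bernstein polynomials p_n(x) = ∑_k f(k/n) p_{n,k}(x) from this proof can be evaluated directly. A sketch (the test function |x − 0.3| and the grid and degrees are arbitrary choices):

```python
import math

def bernstein(f, n, x):
    """p_n(x) = Σ_k f(k/n)·C(n,k)·x^k·(1 − x)^{n−k}."""
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.3)            # continuous but not smooth on [0, 1]
grid = [i / 100 for i in range(101)]
errs = [max(abs(bernstein(f, n, x) - f(x)) for x in grid) for n in (10, 50, 250)]
print(errs)                           # sup-norm error decreases as n grows
```

The convergence is slow near the kink at x = 0.3 (roughly like 1/√n, consistent with the nx(1 − x)/n² variance bound in the proof), but it is uniform, which is the content of the theorem.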
D. 5-62
A subset A ⊆ R is said to have Lebesgue measure zero if for any ε > 0, there exists a countable (possibly finite) collection of open intervals I_j such that A ⊆ ⋃_{j=1}^∞ I_j and ∑_{j=1}^∞ |I_j| < ε, where |I_j| is the length of the interval I_j.
E. 5-63
Lebesgue measure zero sets are “small”. It turns out that if a bounded function's set of discontinuities has measure zero, then it is integrable.
• The empty set has measure zero. Any finite set has measure zero.
• Any countable set has measure zero: if A = {a₀, a₁, ···}, take

    I_j = (a_j − ε/2^{j+2}, a_j + ε/2^{j+2}).

Then A is contained in the union, and the sum of the lengths is ∑_{j=0}^∞ ε/2^{j+1} = ε.
• A countable union of sets of measure zero has measure zero, using a similar
proof strategy as above.
• Any (non-trivial) interval does not have measure zero.
• The Cantor set, despite being uncountable, has measure zero. The Cantor set is constructed as follows: start with C₀ = [0, 1]. Remove the middle third (1/3, 2/3) to obtain C₁ = [0, 1/3] ∪ [2/3, 1]. Remove the middle third of each segment to obtain C₂ = [0, 1/9] ∪ [2/9, 3/9] ∪ [6/9, 7/9] ∪ [8/9, 1]. Continue iteratively by removing the middle thirds of each part. Then the set C = ⋂_{n=0}^∞ C_n is the Cantor set. Since each C_n consists of 2ⁿ disjoint closed intervals of length 1/3ⁿ, the total length of the segments of C_n is (2/3)ⁿ → 0. So we can cover C by a union of intervals of arbitrarily small total length. Hence the Cantor set has measure zero. It is slightly trickier to show that C is uncountable.
T. 5-64
<Lebesgue’s theorem on the Riemann integral> Let f : [a, b] → R be a
bounded function, and let Df be the set of points of discontinuities of f . Then f
is Riemann integrable if and only if Df has Lebesgue measure zero.
Using this result, a lot of our theorems follow easily. Apart from the easy ones, like the sum and product of integrable functions being integrable, we can also easily show that the composition of a continuous function with an integrable function is integrable, since composing with a continuous function will not introduce new discontinuities.
Similarly, we can show that the uniform limit of integrable functions is integrable, since the set of points of discontinuity of the uniform limit is contained in the (countable) union of the sets of discontinuities of the functions in the sequence.
equivalently

    lim_{h→0} ‖f(a + h) − f(a) − Ah‖ / ‖h‖ = 0.

We call A the derivative of f at a. We write the derivative A as Df(a) or Df|_a.
• Write L(R^n ; R^m) for the space of linear maps A : R^n → R^m. More generally, write L(V ; W) for the space of linear maps A : V → W.
• The directional derivative of f at a ∈ U in the direction of u ∈ R^n is

    D_u f(a) = lim_{t→0} (f(a + tu) − f(a)) / t

whenever this limit exists. By definition, we have D_u f(a) = d/dt f(a + tu)|_{t=0}.
E. 5-66
As in the case of R in IA Analysis I, for limits we do not impose any requirements
on F when x = a. In particular, we don’t assume that a ∈ E.
Usual laws of limits like [[ f (x) → a ]] ∧ [[ g(x) → b ]] ⇒ [[ λf (x) + µg(x) → λa + µb ]]
follows from the fact that [[ f̃ , g̃ continuous ]]⇒[[ λf̃ + µg̃ continuous ]].
Note that officially, α(h) = o(h) as a whole is a piece of notation, and does not
represent equality.
Note for differentiability we require the domain U of f to be open, so that for each
a ∈ U , there is a small ball around a on which f is defined. We could relax this
condition and consider “one-sided” derivatives instead, but we will not look into
these here.
We can interpret the definition of differentiability as saying that we can find a “good” linear approximation (technically, it is affine) to the function f near a. Equivalently, we can approximate f near a by a hyperplane, as h ↦ f(a) + Ah defines a hyperplane through f(a).
P. 5-67
Derivatives are unique.
Suppose A and B are both derivatives of f at a. Then ‖(B − A)h‖/‖h‖ → 0 as h → 0. Fixing u ∈ R^n and taking h = tu with t → 0, we get ‖(B − A)u‖ = 0, since B − A is linear and so ‖(B − A)(tu)‖ = |t| ‖(B − A)u‖. This says that (B − A)u = 0 for all u ∈ R^n. So B = A.
P. 5-68
Let X and Y be normed spaces with X finite dimensional, then all linear maps
X → Y are continuous.
Suppose f : X → Y is linear. Choose a basis (e₁, e₂, ··· , e_n) of X. Then f(x) = ∑_{i=1}^n x_i f(e_i). Let M = max_i ‖f(e_i)‖. Since any two norms on a finite-dimensional space are equivalent, ∃C > 0 such that ∑_{i=1}^n |x_i| ≤ C‖x‖ for all x ∈ X. Now by the triangle inequality,

    ‖f(x)‖ = ‖∑_{i=1}^n x_i f(e_i)‖ ≤ ∑_{i=1}^n |x_i| ‖f(e_i)‖ ≤ (∑_{i=1}^n |x_i|) M ≤ CM‖x‖.

So f is Lipschitz, hence continuous.
1. By definition, if f is differentiable at a, then f(a + h) − f(a) = Df(a)h + o(h) → 0 as h → 0, since Df(a)h = ∑_{i=1}^n h_i Df(a)e_i → 0.
    f(x, y) = { x³/y  if y ≠ 0
              { 0     if y = 0

    (f(0 + tu) − f(0)) / t = { t u₁³/u₂  if u₂ ≠ 0
                             { 0         if u₂ = 0
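This example is easy to probe numerically: every directional derivative at the origin is 0, yet f takes the value 1 arbitrarily close to the origin (along the curve y = x³), so f is not even continuous there. A sketch:

```python
def f(x, y):
    """f(x, y) = x³/y for y ≠ 0, and 0 for y = 0."""
    return x**3 / y if y != 0 else 0.0

def directional(u1, u2, t=1e-6):
    """Difference quotient (f(tu) − f(0))/t, which equals t·u1³/u2 for u2 ≠ 0."""
    return (f(t * u1, t * u2) - f(0.0, 0.0)) / t

print(directional(1.0, 2.0))     # ≈ 0: every directional derivative at 0 vanishes
print(directional(1.0, 0.0))     # 0 exactly
print(f(1e-3, (1e-3) ** 3))      # = 1 at a point arbitrarily close to the origin
```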
Note that in each term, we are just moving along the coordinate axes. Since the partial derivatives exist, the mean value theorem of single-variable calculus applies to g(t) = f(a + h^{(j−1)} + t e_j) on the interval t ∈ [0, h_j], so for each j there exists θ_j ∈ (0, 1) such that

    f(a + h) − f(a) = ∑_{j=1}^n h_j D_j f(a + h^{(j−1)} + θ_j h_j e_j)
    = ∑_{j=1}^n h_j D_j f(a) + ∑_{j=1}^n h_j (D_j f(a + h^{(j−1)} + θ_j h_j e_j) − D_j f(a))
This is a very useful result. For example, we can now immediately conclude that the function

    (x, y, z) ↦ (x²y + e^{6z}, 3x + 4 sin 14x, xyz e^z)

is differentiable everywhere, since it has continuous partial derivatives. This is much better than messing with the definition itself.
D. 5-71
Let V, W be finite dimensional normed vector spaces over R. The operator norm
on L = L(V ; W ) is defined by kAk = supx∈V :kxk=1 kAxk.
C. 5-72
So far, we have only looked at derivatives at a single point. We haven’t dis-
cussed much about the derivative at, say, a neighbourhood or the whole space.
We might want to ask if the derivative is continuous or bounded. However, this
is not straightforward, since the derivative is a linear map, and we need to define
these notions for functions whose values are linear maps. In particular, we want
to understand the map Df : Br (a) → L(Rn ; Rm ) given by x 7→ Df (x). To do so,
we need a metric on the space L(Rn ; Rm ). In fact, we will use a norm.
Let L = L(R^n ; R^m). This is a vector space over R, with addition and scalar multiplication defined pointwise. In fact, L is a subspace of the space C(R^n , R^m) of continuous functions R^n → R^m. Since L is finite-dimensional (it is isomorphic to the space of real m × n matrices as vector spaces, and hence has dimension mn), it really doesn't matter which norm we pick as they are all Lipschitz equivalent, but a convenient choice is the sup norm, or the operator norm.
P. 5-73
1. ‖·‖ is indeed a norm on L.
2. $\|A\| = \sup_{x \in V\setminus\{0\}} \dfrac{\|Ax\|}{\|x\|}$.
3. ‖Ax‖ ≤ ‖A‖‖x‖ for all x ∈ V.
4. Let A ∈ L(V; W) and B ∈ L(W; U). Then ‖BA‖ ≤ ‖B‖‖A‖, where BA = B ∘ A ∈ L(V; U).
3. Immediate from 2.
4. $\|BA\| = \sup_{x \in V\setminus\{0\}} \dfrac{\|BAx\|}{\|x\|} \le \sup_{x \in V\setminus\{0\}} \dfrac{\|B\|\|Ax\|}{\|x\|} = \|B\|\|A\|$.
From 2 and 3 we see that A is Lipschitz with Lipschitz constant K iff ‖A‖ ≤ K, since both say that ‖A(u) − A(v)‖ = ‖A(u − v)‖ ≤ K‖u − v‖.
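These properties can be illustrated numerically. A minimal sketch, assuming Euclidean norms on domain and codomain (so the operator norm is the largest singular value) and using numpy with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # a map in L(R^4; R^3)
B = rng.standard_normal((2, 3))   # a map in L(R^3; R^2)

# With the Euclidean norm on both sides, the operator norm of a matrix is
# its largest singular value.
opnorm = lambda M: np.linalg.norm(M, 2)

x = rng.standard_normal(4)
lhs3 = np.linalg.norm(A @ x)             # ||Ax||
rhs3 = opnorm(A) * np.linalg.norm(x)     # ||A|| ||x||    (property 3)
lhs4 = opnorm(B @ A)                     # ||BA||
rhs4 = opnorm(B) * opnorm(A)             # ||B|| ||A||    (property 4)
print(lhs3 <= rhs3 + 1e-12, lhs4 <= rhs4 + 1e-12)  # True True
```

Both inequalities hold for any choice of A, B and x; the random sample merely illustrates them.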
P. 5-74
Let V be a finite dimensional normed vector space over R. Let U = {A ∈ L(V; V) : A invertible}. Then U is open and the map A ↦ A⁻¹ from U to U is continuous.
L(V; V) is a finite dimensional normed space, hence it is complete. Suppose h ∈ L(V; V) with ‖h‖ < 1; then
$$\Big\|\sum_{n=k}^{m} h^n\Big\| \le \sum_{n=k}^{m} \|h\|^n \to 0 \quad\text{as } m > k \to \infty.$$
So the partial sums of $\sum_{n=0}^{\infty} h^n$ are Cauchy and hence converge. Write $h_0 = \sum_{n=0}^{\infty} h^n$. Note that $(I - h)\big(\sum_{n=0}^{m} h^n\big) = I - h^{m+1}$, where I is the identity map V → V. Taking the limit m → ∞, we have (I − h)h₀ = I, hence (I − h) is invertible with inverse (I − h)⁻¹ = h₀.
Suppose A ∈ U. If ‖h‖ < 1/‖A⁻¹‖, then ‖hA⁻¹‖ < 1, so A + h = (I + hA⁻¹)A is invertible. Hence U is open. Moreover, $(A + h)^{-1} = A^{-1}\sum_{n=0}^{\infty}(-hA^{-1})^n$, so $\|(A + h)^{-1} - A^{-1}\| = \big\|A^{-1}\sum_{n=1}^{\infty}(-hA^{-1})^n\big\| \to 0$ as h → 0, since for any m,
$$\Big\|A^{-1}\sum_{n=1}^{m}(-hA^{-1})^n\Big\| \le \sum_{n=1}^{m}\|A^{-1}\|^{n+1}\|h\|^n = \frac{\|A^{-1}\|^2\|h\|\,(1 - \|A^{-1}\|^m\|h\|^m)}{1 - \|A^{-1}\|\|h\|} \le \frac{\|A^{-1}\|^2\|h\|}{1 - \|A^{-1}\|\|h\|} \to 0 \text{ as } h \to 0.$$
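The geometric-series argument above can be checked numerically: for a matrix h with operator norm below 1, the partial sums of $\sum h^n$ converge to (I − h)⁻¹. A small sketch (numpy and the sample matrix are illustrative assumptions):

```python
import numpy as np

# h must satisfy ||h|| < 1 in the operator norm (largest singular value).
h = np.array([[0.2, 0.1],
              [0.0, 0.3]])
assert np.linalg.norm(h, 2) < 1

# Partial sums I + h + h^2 + ... of the Neumann series.
term = np.eye(2)
partial = np.zeros((2, 2))
for _ in range(60):
    partial += term
    term = term @ h

exact = np.linalg.inv(np.eye(2) - h)
err = np.max(np.abs(partial - exact))
print(err)  # essentially zero: the series has converged to (I - h)^{-1}
```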
P. 5-75
1. If A ∈ L(R, Rᵐ), then A can be written as Ax = xa for some a ∈ Rᵐ. Moreover, ‖A‖ = ‖a‖, where the second norm is the Euclidean norm on Rᵐ.
2. If A ∈ L(Rⁿ, R), then Ax = x · a for some fixed a ∈ Rⁿ. Again, ‖A‖ = ‖a‖.
T. 5-76
<Chain rule> Let U ⊆ Rn be open and f : U → Rm differentiable at a ∈ U .
Also, V ⊆ Rm is open with f (U ) ⊆ V and g : V → Rp differentiable at f (a). Then
g ◦f : U → Rp is differentiable at a, with derivative D(g ◦f )(a) = Dg(f (a)) Df (a).
Write A = Df(a) and B = Dg(f(a)). Then
$$(g \circ f)(a + h) = g\big(f(a) + \underbrace{Ah + o(h)}_{k}\big) = g(f(a)) + B(Ah + o(h)) + o(Ah + o(h))$$
$$|g'(t)| = \Big|\sum_{i=1}^{m} v_i f_i'(t)\Big| \le \|v\|\Big(\sum_{i=1}^{m} f_i'(t)^2\Big)^{1/2} = \|v\|\,\|Df(t)\| \le M\|v\|.$$
The mean value theorem says that g(b) − g(a) = g′(t)(b − a) for some t ∈ (a, b). By definition of g, we get v · (f(b) − f(a)) = g′(t)(b − a). If f(b) = f(a), then there is nothing to prove. Otherwise, by definition of v, divide by ‖f(b) − f(a)‖ and we are done.
T. 5-78
<Mean value inequality> If U ⊆ Rn is an open convex set and f : U → Rm
is differentiable on U with kDf (x)k ≤ M for all x ∈ U , then kf (b1 ) − f (b2 )k ≤
M kb1 − b2 k for any b1 , b2 ∈ U .
We will reduce this to the previous theorem. Fix b₁, b₂ ∈ U. Note that tb₁ + (1 − t)b₂ ∈ U for all t ∈ [0, 1] by convexity. Consider g : [0, 1] → Rᵐ defined by g(t) = f(tb₁ + (1 − t)b₂). By the chain rule, g is differentiable and g′(t) = Dg(t) = (Df(tb₁ + (1 − t)b₂))(b₁ − b₂). Therefore ‖g′(t)‖ ≤ ‖Df(tb₁ + (1 − t)b₂)‖‖b₁ − b₂‖ ≤ M‖b₁ − b₂‖. Now we can apply the previous theorem, and get ‖f(b₁) − f(b₂)‖ = ‖g(1) − g(0)‖ ≤ M‖b₁ − b₂‖.
Note that this result says that if f : U → Rᵐ has Df(x) = 0 for all x ∈ U, then f is constant. Just apply the result with M = 0.
200 CHAPTER 5. ANALYSIS II
T. 5-79
Let U ⊆ Rⁿ be open and path-connected. Then for any differentiable f : U → Rᵐ, if Df(x) = 0 for all x ∈ U, then f is constant on U.
We are going to use the fact that f is locally constant. Wlog assume m = 1, since we just need to show that each fᵢ is constant, where f = (f₁, · · · , f_m). Fix any a, b ∈ U. Let γ : [0, 1] → U be a (continuous) path from a to b. For any s ∈ (0, 1), there exists some ε > 0 such that B_ε(γ(s)) ⊆ U, since U is open. By continuity of γ, there is a δ such that (s − δ, s + δ) ⊆ [0, 1] with γ((s − δ, s + δ)) ⊆ B_ε(γ(s)) ⊆ U. We know f is constant on B_ε(γ(s)) by the previous result, so g(t) = f ∘ γ(t) is constant on (s − δ, s + δ). So g is differentiable at s with derivative 0. This is true for all s ∈ (0, 1). So the map g : [0, 1] → R has zero derivative on (0, 1); also, it is continuous on [0, 1] since it is a composition of continuous maps. So g is constant, and hence g(0) = g(1), ie. f(a) = f(b).
Note that if γ is differentiable, then this is much easier, since we can show g 0 = 0
by the chain rule g 0 (t) = Df (γ(t))γ 0 (t).
D. 5-80
• Let U ⊆ Rn be open. We say f : U → Rm is C 1 on U if f is differentiable at each
x ∈ U and Df : U → L(Rn , Rm ) is continuous.
• We write C 1 (U ) or C 1 (U ; Rm ) for the set of all C 1 maps from U to Rm .
• Let U, U 0 ⊆ Rn are open, then a map g : U → U 0 is a diffeomorphism (or
C 1 -diffeomorphism) if it is C 1 and has a C 1 inverse.4
E. 5-81
Suppose U ⊆ Rⁿ is open and f : U → Rⁿ is C¹; then ∃ε > 0 such that ẋ = f(x) with x(t₀) = a ∈ U has a unique solution x : [t₀ − ε, t₀ + ε] → Rⁿ.
$$\|Ax\|^2 = \sum_{i=1}^{m}\Big(\sum_{j=1}^{n} a_{ij}x_j\Big)^2 \le \sum_{i=1}^{m}\Big(\sum_{j=1}^{n} a_{ij}^2\Big)\Big(\sum_{j=1}^{n} x_j^2\Big) = \|x\|^2\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2,$$
where x = (x₁, · · · , xₙ) and we have used Cauchy–Schwarz. Dividing by ‖x‖², we know $\|A\| \le \sqrt{\sum_i\sum_j a_{ij}^2}$. Applying this to A = Df(x) − Df(y), we get
$$\|Df(x) - Df(y)\| \le \sqrt{\sum_i \sum_j \big(D_j f_i(x) - D_j f_i(y)\big)^2}.$$
The right-hand side is just the Euclidean norm if we treat the matrix as a vector written in a funny way. So by the equivalence of norms on finite-dimensional vector spaces, there is some C such that $\|A\| \le C\sqrt{\sum_i\sum_j a_{ij}^2}$, and then the result follows.
$$\|x_1 - x_2\| - \|f(x_1) - f(x_2)\| \le \|(x_1 - f(x_1)) - (x_2 - f(x_2))\| = \|h(x_1) - h(x_2)\| \le \tfrac{1}{2}\|x_1 - x_2\|.$$
Hence ‖x₁ − x₂‖ ≤ 2‖f(x₁) − f(x₂)‖. Applying this to x₁ = g(y₁) and x₂ = g(y₂), and noting that f(g(y_j)) = y_j, we have ‖g(y₁) − g(y₂)‖ ≤ 2‖y₁ − y₂‖.
(Part iii) Note that if g is differentiable, then its derivative must be given by
Dg(y) = Df (g(y))−1 since by definition f (g(y)) = y and hence the chain rule
gives Df (g(y)) · Dg(y) = I. Also, we immediately know Dg is continuous, since
it is the composition of continuous functions. So we only need to check that
Df (g(y))−1 is indeed the derivative of g.
First we check that Df(x) is indeed invertible for every x ∈ B̄_r(a). Suppose Df(x)v = 0; then
$$\|v\| = \|Df(x)v - v\| \le \|Df(x) - I\|\,\|v\| \le \tfrac{1}{2}\|v\|.$$
So we must have ‖v‖ = 0, ie. v = 0. So ker Df(x) = {0}, hence Df(g(y))⁻¹ exists.
Now let x ∈ V be fixed, and y = f (x). Let k be small (so that y + k ∈ W ) and
h = g(y + k) − g(y). In other words f (x + h) − f (x) = k. We have
C. 5-87
<2nd derivatives as bilinear map>
By linear algebra, in general, a linear map φ : Rˡ → L(Rⁿ; Rᵐ) induces a bilinear map Φ : Rˡ × Rⁿ → Rᵐ by Φ(u, v) = φ(u)(v) ∈ Rᵐ. In particular, we know
P. 5-88
Suppose D(Df)(a) exists; then for any u, v,
1. D²f(a)(u, v) = D_u(D_v f)(a)
2. Write $u = \sum_{j=1}^{n} u_j e_j$ and $v = \sum_{j=1}^{n} v_j e_j$, where {eᵢ} is the standard basis; then
$$D^2 f(a)(u, v) = \sum_{i,j=1}^{n} D_{ij}f(a)\,u_i v_j = \sum_{i,j=1}^{n}\sum_{k=1}^{m} D_{ij}f_k(a)\,u_i v_j\, e_k$$
We have been very careful to keep the right order of the partial derivatives. However, in most cases we care about, it doesn't matter, as will be illustrated below.
T. 5-89
Let U ⊆ Rⁿ be open, f : U → Rᵐ, and a ∈ U with B_ρ(a) ⊆ U. Let i, j ∈ {1, · · · , n} be fixed and suppose that D_i D_j f(x) and D_j D_i f(x) exist for all x ∈ B_ρ(a) and are continuous at a; then D_i D_j f(a) = D_j D_i f(a).
Now apply the mean value theorem to the function s ↦ D_i f(a + θte_i + ste_j); there is some η ∈ (0, 1) such that $g_{ij}(t) = t^2 D_j D_i f(a + \theta t e_i + \eta t e_j)$.
We can do the same for $g_{ji}$, and find some θ̃, η̃ such that $g_{ij}(t) = t^2 D_i D_j f(a + \tilde\theta t e_i + \tilde\eta t e_j)$. From the definition of $g_{ij}$ we see that $g_{ij} = g_{ji}$, so dividing by t² and letting t → 0, continuity of the mixed partials at a gives D_i D_j f(a) = D_j D_i f(a).
This follows from the fact that continuity of second partials implies differentiability, and the symmetry of mixed partials.
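The symmetry of mixed partials can be illustrated with finite differences. The function below is an arbitrary smooth (hence C²) example, and the step size is a numerical compromise, both assumptions of this sketch:

```python
import math

f = lambda x, y: math.exp(x) * math.sin(x * y)
h = 1e-4
a, b = 0.3, 0.7

# Nested central differences approximating D1 D2 f and D2 D1 f.
D1 = lambda g: (lambda x, y: (g(x + h, y) - g(x - h, y)) / (2 * h))
D2 = lambda g: (lambda x, y: (g(x, y + h) - g(x, y - h)) / (2 * h))

d12 = D1(D2(f))(a, b)
d21 = D2(D1(f))(a, b)
# Exact mixed partial: d/dx (x e^x cos(xy)) = e^x ((1 + x) cos(xy) - xy sin(xy))
exact = math.exp(a) * ((1 + a) * math.cos(a * b) - a * b * math.sin(a * b))
print(d12, d21, exact)  # all three agree to several decimal places
```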
This result, along with related results, generalises to higher derivatives. So if f : U → Rᵐ is C^k in a neighbourhood of a ∈ U, then f is k times differentiable with
$$D^k f(a)(v^{(1)}, v^{(2)}, \cdots, v^{(k)}) = \sum_{i_1, i_2, \cdots, i_k = 1}^{n} D_{i_1 i_2 \cdots i_k} f(a)\, v^{(1)}_{i_1} v^{(2)}_{i_2} \cdots v^{(k)}_{i_k}$$
T. 5-91
<Taylor’s theorem> Let U ⊆ Rⁿ be open and a ∈ U. Suppose f : U → R is k times differentiable and that h ∈ Rⁿ is such that the line segment from a to a + h is contained in U; then
$$f(a + h) = \sum_{i=0}^{k-1}\frac{1}{i!}D^i f(a)h^i + \frac{1}{k!}D^k f(a + sh)h^k \quad\text{for some } s \in [0, 1]$$
Consider the function g(t) = f (a + th). Then the assumptions tell us g is k times
differentiable. By the 1D Taylor’s theorem with Lagrange form remainder, we
know
$$g(1) = \sum_{i=0}^{k-1}\frac{1}{i!}g^{(i)}(0) + \frac{1}{k!}g^{(k)}(s) \quad\text{for some } s \in [0, 1].$$
where $E(h) = \frac{1}{k!}\big(D^k f(a + sh)h^k - D^k f(a)h^k\big)$.
$\|B\| = \sup_{v_1, \cdots, v_k \ne 0} \|B(v_1, \cdots, v_k)\| \big/ \big(\prod_{i=1}^{k}\|v_i\|\big)$ is a norm on the space of multilinear maps.
Methods
In physics many important differential equations are linear; that is, if φ₁, φ₂ are solutions, then so is λ₁φ₁ + λ₂φ₂ (for any constants λᵢ). Where did the linearity of the equations of physics come from? The real world is not linear in general. However, often we are not looking for a completely accurate and precise description of the universe. When we have low energy/speed/whatever, we can often quite accurately approximate reality by a linear equation. Whatever the complicated equations governing the dynamics of the underlying theory, if we just look to first order in the small perturbations then we'll find a linear equation essentially by definition.
Laplace's equation:
$$\frac{\partial^2\phi}{\partial x^2} + \frac{\partial^2\phi}{\partial y^2} = 0$$
Heat equation:
$$\frac{\partial\phi}{\partial t} = \kappa\left(\frac{\partial^2\phi}{\partial x^2} + \frac{\partial^2\phi}{\partial y^2}\right)$$
When dealing with functions and differential equations, we will often think of the space
of functions as a vector space. In many cases, it will be useful to find a “basis” for our
space of functions. Under different situations, we would want to use a different basis
for our space. A familiar example would be the Taylor series, where we are thinking
of {xn : n ∈ N} as the basis of our space, and trying to approximate an arbitrary
function as a sum of the basis elements. When writing the function f as a sum like
this, it is of course important to consider whether the sum converges, and when it
does, whether it actually converges back to f . Note that the set of solutions to a linear
differential equation would form a vector space since if φ1 , φ2 are solutions, then so are
λ 1 φ1 + λ 2 φ2 .
We will often want to restrict our functions to take particular values on some boundary,
known as boundary conditions. Often, we want the boundary conditions to preserve
linearity. We call these nice boundary conditions homogeneous conditions.
D. 6-1
A boundary condition is homogeneous if whenever f and g satisfy the boundary
conditions, then so does λf + µg for any λ, µ ∈ C (or R).
E. 6-2
Let Σ = [a, b]. We could require that f (a)+7f 0 (b) = 0, or maybe f (a)+3f 00 (a) = 0.
These are examples of homogeneous boundary conditions. On the other hand, the
requirement f (a) = 1 is not homogeneous.
210 CHAPTER 6. METHODS
We can normalize these to get $\{\frac{1}{\sqrt{2\pi}}e^{in\theta} : n \in \mathbb{Z}\}$, an orthonormal set of complex valued periodic functions. Fourier's idea was to try to use this as a "basis" for any periodic function. Fourier (not entirely correctly) claimed that any f : S¹ → C can be expanded in this basis as a Fourier series given by
$$f(\theta) = \sum_{n\in\mathbb{Z}}\hat f_n e^{in\theta}, \quad\text{where}\quad \hat f_n = \frac{1}{2\pi}\langle e^{in\theta}, f\rangle = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-in\theta}f(\theta)\,d\theta$$
It turns out that in many cases the Fourier series doesn't converge, that is, the partial Fourier sums do not converge to f. As we have seen in Analysis, there are many ways of defining convergence of functions. Of course, with different definitions of convergence, we can get different answers to whether it converges. It can be shown that if f is continuously differentiable on S¹, then Sₙf converges to f uniformly (and hence also pointwise).
E. 6-5
<Real Fourier series> In the special case where f is a real valued function, we
can re-formulate the Fourier series in terms of sin and cos.
$$(\hat f_n)^* = \left(\frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-in\theta}f(\theta)\,d\theta\right)^* = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{in\theta}f(\theta)\,d\theta = \hat f_{-n}.$$
It turns out that this series converges (pointwise) to the sawtooth for all θ 6=
(2m + 1)π, ie. everywhere that the sawtooth is continuous.
Let’s look explicitly at the case where θ = π. Each term (n and −n part together)
of the partial Fourier series is zero. So the Fourier series converges to 0 here. This
is the average value of limε→0+ f (π + ε) and limε→0+ f (π − ε). This is typical. At
an isolated discontinuity, the Fourier series is the average of the limiting values of
the original function as we approach from either side.
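This behaviour can be seen numerically. Pairing the +n and −n terms of the sawtooth series, with $\hat f_n = i(-1)^n/n$, gives the real partial sums $S_N f(\theta) = \sum_{k=1}^{N} 2(-1)^{k+1}\sin(k\theta)/k$; a quick check (the truncation point is an arbitrary choice):

```python
import math

# Partial sums of the sawtooth series: pairing f_k e^{ik theta} with its
# conjugate term gives the real term 2(-1)^{k+1} sin(k theta)/k.
def S(N, theta):
    return sum(2 * (-1) ** (k + 1) * math.sin(k * theta) / k
               for k in range(1, N + 1))

print(S(5000, 1.0))       # close to f(1) = 1, a point of continuity
print(S(5000, math.pi))   # ~0, the average of the one-sided limits -pi and +pi
```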
C. 6-7
<Integration and differentiation of Fourier series>
Suppose f : S¹ → C is such that Sₙf converges pointwise to f. Then we can define a new sequence SₙF by the integrals
$$S_n F \equiv \int_{-\pi}^{\theta} S_n f(\phi)\,d\phi = (\theta + \pi)\hat f_0 + \sum_{k=-n}^{-1}\hat f_k\,\frac{e^{ik\theta} - (-1)^k}{ik} + \sum_{k=1}^{n}\hat f_k\,\frac{e^{ik\theta} - (-1)^k}{ik}$$
This new series is guaranteed to converge, since the original series did by assumption and integration has suppressed each coefficient by a further factor of k. In fact, even if the original function had jump discontinuities, so that at some discrete points the Fourier series converged to the average value of f on either side of the discontinuity, integration produces a continuous function for us, and the new series will in fact converge to $F(\theta) = \int_{-\pi}^{\theta} f(\phi)\,d\phi$ everywhere.
By contrast, if we differentiate a Fourier series term by term then we multiply each coefficient by ik, and this makes convergence worse, perhaps fatally.
Integration is a smoothing operator. The indefinite integral of the step function
Note that by assumption $f^{(m)}$, and hence $|f^{(m)}|$, is integrable. Also $f^{(n)}$ is continuous (hence $f^{(n)}(\pi) = f^{(n)}(-\pi)$) for all n ≤ m − 1. So repeatedly applying integration by parts, we have
$$\hat f_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{-ik\theta}f(\theta)\,d\theta = \left[-\frac{1}{2\pi ik}e^{-ik\theta}f(\theta)\right]_{-\pi}^{\pi} + \frac{1}{2\pi ik}\int_{-\pi}^{\pi} e^{-ik\theta}f'(\theta)\,d\theta$$
$$= \frac{1}{2\pi ik}\int_{-\pi}^{\pi}e^{-ik\theta}f'(\theta)\,d\theta = \cdots = \frac{1}{(ik)^m}\,\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-ik\theta}f^{(m)}(\theta)\,d\theta.$$
$$\implies |\hat f_k| \le \frac{1}{|k|^m}\,\frac{1}{2\pi}\int_{-\pi}^{\pi}|f^{(m)}(\theta)|\,d\theta$$
For example if f (m) is bounded and continuous except at finitely many points,
then it is integrable.
This result makes sense. If we have a rather smooth function, then we would
expect the first few Fourier terms (with low frequency) to account for most of
the variation of f . Hence the coefficients decay really quickly. However, if the
function is jiggly and bumps around all the time, we would expect to need some
higher frequency terms to account for the minute variation. Hence the terms would
not decay away that quickly. So in general, if we can differentiate it more times,
then the terms should decay quicker. Conversely the smoothness of the function
can be inferred from the decay speed of the Fourier coefficients fˆk as k → ∞.
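The decay rates can be observed numerically. Below, the coefficient integral is approximated by a Riemann sum for the (discontinuous) sawtooth and the (continuous) triangle wave |θ|, whose coefficients decay like 1/k and 1/k² respectively; the specific functions and the odd values of k are illustrative choices:

```python
import cmath, math

def fhat(f, k, M=20000):
    """Riemann-sum approximation of (1/2pi) * integral of e^{-ik theta} f(theta)."""
    h = 2 * math.pi / M
    s = sum(cmath.exp(-1j * k * (-math.pi + (j + 0.5) * h))
            * f(-math.pi + (j + 0.5) * h) for j in range(M))
    return s * h / (2 * math.pi)

saw = lambda t: t        # jump discontinuity at the endpoints
tri = lambda t: abs(t)   # continuous, with corners

for k in (3, 9, 27):
    print(k, abs(fhat(saw, k)), abs(fhat(tri, k)))
# sawtooth: ~1/k (1/3, 1/9, 1/27); triangle: ~2/(pi k^2) for odd k
```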
T. 6-9
<Fejér’s theorem> If f : S¹ → C is continuous, then $\sigma_n(f) = \frac{1}{n+1}\sum_{m=0}^{n} S_m f$ converges uniformly to f as n → ∞.
iii. It follows from (ii) that Fₙ(x) → 0 uniformly outside an arbitrarily small region (−δ, δ) around x = 0; this is because for any δ ≤ |x| ≤ π,
$$F_n(x) \le \frac{1}{n+1}\,\frac{1}{\sin^2(x/2)} \le \frac{1}{n+1}\,\frac{1}{\sin^2(\delta/2)} \to 0 \quad\text{as } n \to \infty$$
(i), (ii) and (iii) basically tell us that as n → ∞, Fₙ behaves like the Dirac delta function we saw in Part IA; hence for large n,
$$\sigma_n(f(\theta)) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(\phi)F_n(\theta - \phi)\,d\phi \approx f(\theta)\,\frac{1}{2\pi}\int_{-\pi}^{\pi} F_n(\theta - \phi)\,d\phi = f(\theta).$$
We are going to formalise this.
Let δ = 1/n^{1/4}; then $F_n(x) \le \frac{1}{n+1}\frac{1}{\sin^2(\delta/2)} = B_n \to 0$ as n → ∞ for all x ∈ [−π, π] \ (−δ, δ). Also, f is continuous on a compact domain, so |f(x)| ≤ C for all x. Furthermore, both f and Fₙ are periodic with period 2π, and Fₙ is even, so
$$\sigma_n(f(\theta)) = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(\phi)F_n(\theta - \phi)\,d\phi = \frac{1}{2\pi}\int_{\theta-\pi}^{\theta+\pi} f(\phi)F_n(\theta - \phi)\,d\phi = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(\theta + x)F_n(x)\,dx$$
Now we have
$$\sup_\theta|\sigma_n(f(\theta)) - f(\theta)| = \sup_\theta\left|\frac{1}{2\pi}\int_{-\pi}^{\pi}(f(\theta + x) - f(\theta))F_n(x)\,dx\right| \le \sup_\theta\frac{1}{2\pi}\int_{-\pi}^{\pi}|f(\theta + x) - f(\theta)|F_n(x)\,dx$$
$$\le 2CB_n + \sup_\theta\frac{1}{2\pi}\int_{-\delta}^{\delta}|f(\theta + x) - f(\theta)|F_n(x)\,dx \le 2CB_n + \sup_\theta\frac{\sup_{x\in(-\delta,\delta)}|f(\theta + x) - f(\theta)|}{2\pi}\int_{-\delta}^{\delta}F_n(x)\,dx$$
$$\le 2CB_n + \sup_\theta\frac{\sup_{x\in(-\delta,\delta)}|f(\theta + x) - f(\theta)|}{2\pi}\int_{-\pi}^{\pi}F_n(x)\,dx = 2CB_n + \sup_\theta\sup_{x\in(-\delta,\delta)}|f(\theta + x) - f(\theta)| \to 0$$
as n → ∞, where the last term tends to 0 because f is uniformly continuous (being continuous on a compact domain) and δ → 0.
Note that this result is saying that if we are given the coefficients fˆn and the
information that f is continuous, then we can recover the function f from fˆn .
In this chapter we’ll not pay too much attention about rigour (this result and proof
is actually non-examinable), we’ll henceforth mostly gloss over these subtle issues
of convergence.
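A numerical sketch of Fejér's theorem for a concrete continuous function: the triangle wave |θ|, using its standard cosine expansion $|\theta| = \pi/2 - \frac{4}{\pi}\sum_{k\text{ odd}}\cos(k\theta)/k^2$ (stated here as an assumption of the sketch). The Fejér means stay uniformly close to f:

```python
import math

def partial(N, theta):
    """Partial sum S_N of the cosine series of |theta| on [-pi, pi]."""
    s = math.pi / 2
    for k in range(1, N + 1, 2):
        s -= (4 / math.pi) * math.cos(k * theta) / k ** 2
    return s

def fejer(n, theta):
    """sigma_n(f) = average of the partial sums S_0, ..., S_n."""
    return sum(partial(m, theta) for m in range(n + 1)) / (n + 1)

grid = [-math.pi + j * math.pi / 200 for j in range(401)]
sup_err = max(abs(fejer(100, t) - abs(t)) for t in grid)
print(sup_err)  # small, and it shrinks further as n grows
```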
T. 6-10
Suppose f : S¹ → C is continuous.
1. If $\sum_{n=-\infty}^{\infty}|\hat f_n|$ converges, then Sₙf converges uniformly to f.
2. If $\sum_{k\in\mathbb{Z}}|k||\hat f_k|$ converges, then f is differentiable and the partial sums $(S_n f)' = \sum_{k=-n}^{n} ik\hat f_k e^{ik\theta}$ converge uniformly to f′(θ) as n → ∞.
$$|\langle e^{in\theta}, f\rangle - \langle e^{in\theta}, g\rangle| = |\langle e^{in\theta}, f - g\rangle| = \left|\int_{-\pi}^{\pi} e^{-in\theta}(f(\theta) - g(\theta))\,d\theta\right| \le \int_{-\pi}^{\pi}|f(\theta) - g(\theta)|\,d\theta \le 2\pi\varepsilon.$$
So $\hat g_n \overset{\text{def}}{=} \frac{1}{2\pi}\langle e^{in\theta}, g\rangle = \frac{1}{2\pi}\langle e^{in\theta}, \lim_{m\to\infty} S_m f\rangle = \lim_{m\to\infty}\frac{1}{2\pi}\langle e^{in\theta}, S_m f\rangle = \hat f_n$. So Sₙf = Sₙg for all n, and hence also σₙf = σₙg for all n, where σ is as defined in Fejér's theorem. By Fejér's theorem, σₙf converges uniformly to both f and g, but the uniform limit must be unique, hence f = g.
2. Since $\sum_{k\in\mathbb{Z}}|k||\hat f_k|$ converges, by [T.5-12] (Sₙf)′ converges uniformly to some function. Since $\sum_{k\in\mathbb{Z}}|k||\hat f_k|$ converges, $\sum_{k\in\mathbb{Z}}|\hat f_k|$ also converges by the comparison test, so by 1, Sₙf converges uniformly to f. Now by [T.5-7], f is differentiable and (Sₙf)′ converges uniformly to f′(θ).
T. 6-11
If f : S 1 → C is C 2 (i.e. two times differentiable with continuous second deriva-
tive), then Sn f converges uniformly to f .
T. 6-12
<Parseval’s identity> $\langle f, f\rangle = \int_{-\pi}^{\pi}|f(\theta)|^2\,d\theta = 2\pi\sum_{n\in\mathbb{Z}}|\hat f_n|^2$
We can provide a more rigorous proof here under some restricted conditions, although the result actually holds under more relaxed conditions. We have
$$\langle S_n f, S_n f\rangle = \int_{-\pi}^{\pi}\Big(\sum_{j=-n}^{n}\hat f_j e^{ij\theta}\Big)^{*}\Big(\sum_{k=-n}^{n}\hat f_k e^{ik\theta}\Big)\,d\theta = \sum_{j,k=-n}^{n}\hat f_j^*\hat f_k\int_{-\pi}^{\pi}e^{i(k-j)\theta}\,d\theta = 2\pi\sum_{j,k=-n}^{n}\hat f_j^*\hat f_k\delta_{kj} = 2\pi\sum_{k=-n}^{n}|\hat f_k|^2$$
$$\langle S_n f, f\rangle = \int_{-\pi}^{\pi}\Big(\sum_{k=-n}^{n}\hat f_k^* e^{-ik\theta}\Big)f(\theta)\,d\theta = \sum_{k=-n}^{n}\hat f_k^*\int_{-\pi}^{\pi}e^{-ik\theta}f(\theta)\,d\theta = 2\pi\sum_{k=-n}^{n}|\hat f_k|^2$$
For example, if f is such that $\sup_{n\in\mathbb{N},\,\theta\in(-\pi,\pi)}|S_n f(\theta) - f(\theta)| < \infty$ and Sₙf converges uniformly to f except on finitely many arbitrarily small intervals, then $\lim_{n\to\infty}\langle S_n f - f, S_n f - f\rangle = \lim_{n\to\infty}\int_{-\pi}^{\pi}|S_n f(\theta) - f(\theta)|^2\,d\theta = 0$.
E. 6-13
From Parseval’s identity we P can obtain some interesting results. Consider the
∞ 1
Riemann ζ-function ζ(s) = n=1 ns . We can show that for any m, ζ(2m) =
2m
π q for some q ∈ Q. This may not be obvious at first sight. Last time, we
computed that the sawtooth function f (θ) = θ has Fourier coefficients fˆ0 = 0 and
n
fˆn = i(−1)
n
for n 6= 0. Applying Parseval’s theorem for the sawtooth function we
get
Z π ∞
2π 3 X X 1
= θ2 dθ = hf, f i = 2π |fˆn |2 = 4π 2
.
3 −π n∈Z n=1
n
π2
So ∞ 1
P
n=1 n2 = ζ(2) = 6 . We have just done it for the case where m = 1. But if
we integrate the sawtooth function repeatedly, then we can get the general result
for all m.
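A quick numerical sanity check of this Parseval computation (the truncation point is an arbitrary choice):

```python
import math

# Parseval for the sawtooth: 2 pi^3 / 3 = 4 pi * sum_{n>=1} 1/n^2.
lhs = 2 * math.pi ** 3 / 3
tail = sum(1 / n ** 2 for n in range(1, 200001))
partial_sum = 4 * math.pi * tail
print(lhs, partial_sum)            # agree to about 4 decimal places
print(tail, math.pi ** 2 / 6)      # zeta(2) vs pi^2/6
```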
It is an easy check that this is in fact linear (i.e. L(λy + µz) = λL(y) + µL(z)). We say L has order p if the highest derivative that appears is $\frac{d^p}{dx^p}$. In most applications, we will be interested in the case p = 2.
C. 6-14
<Sturm-Liouville operators> Consider the 2nd order linear differential operator
$$Ly = P(x)\frac{d^2y}{dx^2} + R(x)\frac{dy}{dx} - Q(x)y = P\left(\frac{d^2y}{dx^2} + \frac{R}{P}\frac{dy}{dx} - \frac{Q}{P}y\right)$$
$$= P\left(e^{-\int\frac{R}{P}dx}\frac{d}{dx}\left(e^{\int\frac{R}{P}dx}\frac{dy}{dx}\right) - \frac{Q}{P}y\right) = Pp^{-1}\left(\frac{d}{dx}\left(p\frac{dy}{dx}\right) - \frac{Q}{P}p\,y\right),$$
where $p = \exp\left(\int\frac{R}{P}\,dx\right)$. We further define $q = \frac{Q}{P}p$. We also drop a factor of Pp⁻¹. Then we are left with
$$L = \frac{d}{dx}\left(p(x)\frac{d}{dx}\right) - q(x).$$
This is the Sturm-Liouville form of the operator. Now let's compute ⟨f, Lg⟩. Assuming that p, q are real, we integrate by parts numerous times to obtain
$$\langle f, Lg\rangle = \int_a^b f^*\left(\frac{d}{dx}\left(p\frac{dg}{dx}\right) - qg\right)dx = [f^* p g']_a^b - \int_a^b\left(\frac{df^*}{dx}\,p\,\frac{dg}{dx} + f^* qg\right)dx$$
$$= [f^* p g' - f'^* p g]_a^b + \int_a^b\left(\frac{d}{dx}\left(p\frac{df^*}{dx}\right) - qf^*\right)g\,dx = [(f^* g' - f'^* g)p]_a^b + \langle Lf, g\rangle.$$
So (assuming that p, q are real) L is self-adjoint (i.e. ⟨f, Lg⟩ = ⟨Lf, g⟩) with respect to this inner product when we restrict ourselves to the set of functions (satisfying some appropriate boundary conditions) such that the boundary terms vanish. Examples of such boundary conditions are:
1. All our functions satisfy a₁h′(a) + a₂h(a) = 0 and b₁h′(b) + b₂h(b) = 0, where a₁,₂ and b₁,₂ are fixed real constants (and h is f or g).² The boundary term above vanishes at each boundary separately for any functions f and g that satisfy this condition. These are called separated boundary conditions .
2. If the function p obeys p(a) = p(b), then we restrict all our functions to be periodic, so that h(a) = h(b) and h′(a) = h′(b); this ensures that the contributions at each boundary cancel. This is called the periodic boundary condition .
3. Finally, it may sometimes be that p(a) = p(b) = 0, though in this case the endpoints of the interval [a, b] are singular points of the differential equation.
Note that it is important that we have a second-order differential operator. If it is first-order, then we would have a negative sign, since we integrated by parts once, so it is not possible for L to be self-adjoint.
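Self-adjointness has a direct finite-dimensional analogue: discretising L = d/dx(p d/dx) − q with homogeneous Dirichlet conditions (a special case of separated conditions) yields a symmetric matrix. The choices of p, q, the grid, and the use of numpy below are illustrative assumptions:

```python
import numpy as np

N = 50
h = 1.0 / (N + 1)
x = np.linspace(h, 1 - h, N)                  # interior grid points on [0, 1]
p = lambda t: 1 + t ** 2                      # arbitrary smooth p > 0
q = lambda t: np.cos(t)                       # arbitrary q (> 0 on [0, 1])

ph = p(np.linspace(h / 2, 1 - h / 2, N + 1))  # p sampled on the half-grid

# Standard conservative discretisation of (p y')' - q y with y(0) = y(1) = 0.
L = np.zeros((N, N))
for i in range(N):
    L[i, i] = -(ph[i] + ph[i + 1]) / h ** 2 - q(x[i])
    if i > 0:
        L[i, i - 1] = ph[i] / h ** 2
    if i < N - 1:
        L[i, i + 1] = ph[i + 1] / h ** 2

print(np.max(np.abs(L - L.T)))            # 0.0: the discrete operator is symmetric
print(np.all(np.linalg.eigvalsh(L) < 0))  # True: real, negative eigenvalues here
```

The real, ordered spectrum of the symmetric matrix mirrors the real, ordered eigenvalues of the continuous self-adjoint problem.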
² Note that a₁ and a₂ cannot both be zero, otherwise there is no boundary condition. Similarly for b₁ and b₂.
6.2. STURM-LIOUVILLE THEORY 217
D. 6-15
• A weight function w is real, non-negative function that has only finitely many
zeroes on the domain.
Rb
• An inner product with weight w is defined by hf, giw = a f ∗ (x)g(x)w(x) dx.
λi hyi , yi iw = λi hyi , wyi i = hyi , Lyi i = hLyi , yi i = hλi wyi , yi i = λ∗i hyi , yi iw .
T. 6-18
For a Sturm-Liouville problem with periodic or separated boundary conditions,³ the eigenvalues are real and can be ordered to form a countably infinite (not necessarily strictly) increasing sequence λ₁, λ₂, · · · with λₙ → ∞ as n → ∞, and such that the corresponding eigenfunctions y₁, y₂, · · · upon normalisation form a complete orthonormal basis for the function space on which the operator is defined (including satisfying the boundary condition). That is, any such function f : [a, b] → C in the function space can be expanded as $f(x) = \sum_{n=1}^{\infty}\hat f_n y_n(x)$, where $\hat f_n = \langle y_n, f\rangle_w = \int y_n^*(x)f(x)w(x)\,dx$.
We will not prove this. A more detailed version of this result is that for a regular Sturm-Liouville problem (in particular, one with separated boundary conditions), each eigenvalue has only one independent eigenfunction, so the sequence λ₁, λ₂, · · · is actually strictly increasing. On the other hand, if we have periodic boundary conditions, it may happen that there are two linearly independent eigenfunctions corresponding to the same eigenvalue. Since the Sturm-Liouville problem is second order, we know that it is impossible to have more than two linearly independent solutions, irrespective of the boundary conditions. If there are two independent eigenfunctions corresponding to the same eigenvalue, we can always make them mutually orthogonal by the Gram-Schmidt process. And the totality of all these eigenfunctions forms a complete orthonormal basis, as stated in the result.
The significant feature here is that the function f(x) is expanded as a discrete sum, just as we saw for Fourier series. This is remarkable, because the definition of the yₙ (that they be normalised eigenfunctions of L) involves no hint of discreteness. In fact the discreteness arises because the domain [a, b] is compact, and because of the boundary conditions.
E. 6-19
Let $L = \frac{d^2}{dx^2}$. Here we have p = 1, q = 0. If we ask for functions to satisfy the periodic boundary condition on [a, b], then L is self-adjoint. Now let [a, b] = [−L, L] and w = 1. Eigenfunctions obey $\frac{d^2 y_n}{dx^2} = \lambda_n y_n(x)$. So our eigenfunctions are $y_n(x) = e^{in\pi x/L}$ with eigenvalues $\lambda_n = -n^2\pi^2/L^2$ for n ∈ Z. This is just the Fourier series! Note that each eigenvalue λₙ = λ₋ₙ has two independent eigenfunctions; also note that this is related to the result that if y is an eigenfunction, then so is y* with the same eigenvalue.
If, however, instead of the periodic boundary condition we require our functions to satisfy f(−L) = f(L) = 0, which is a separated boundary condition, then the problem is a regular Sturm-Liouville problem, and we would find just the sinusoidal Fourier series, which has non-degenerate eigenvalues (i.e. each eigenvalue λₙ has only one independent eigenfunction, namely yₙ(x) = sin(nπx/L)).
E. 6-20
<Hermite polynomials> We want to study the DE $\frac{1}{2}H'' - xH' = -\lambda H$ with H : R → C for arbitrary λ. We want to put this in Sturm-Liouville form. We have the integrating factor $p(x) = \exp\left(-\int_0^x 2t\,dt\right) = e^{-x^2}$. We can rewrite the DE as
$$\frac{d}{dx}\left(e^{-x^2}\frac{dH}{dx}\right) = -2\lambda e^{-x^2}H(x).$$
³ and some other conditions, including p, q and w being “nice”.
Here $L = \frac{d}{dx}\left(e^{-x^2}\frac{d}{dx}\right)$ and we are solving LH = −2λwH, where we have weight function $w(x) = e^{-x^2}$. We now ask that H(x) grows at most polynomially as |x| → ∞. In particular, we want $e^{-x^2}H(x)^2 \to 0$. This ensures that our Sturm-Liouville operator is self-adjoint, since it ensures that the boundary terms from integration by parts vanish at the infinite boundary, and that the integral $\langle f, Lg\rangle = \int_{-\infty}^{\infty} f^*\frac{d}{dx}\left(e^{-x^2}\frac{dg}{dx}\right)dx = -\int_{-\infty}^{\infty}\frac{df^*}{dx}\,e^{-x^2}\,\frac{dg}{dx}\,dx$ remains finite. The eigenfunctions turn out to be
$$H_n(x) = (-1)^n e^{x^2}\frac{d^n}{dx^n}e^{-x^2}.$$
These are known as the Hermite polynomials . Note that these are indeed polynomials. When we differentiate the $e^{-x^2}$ term many times, we get a lot of things from the product rule, but they will always keep an $e^{-x^2}$, which will ultimately cancel with $e^{x^2}$. The Hermite polynomials are orthogonal with respect to our weight function.
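The Rodrigues-type formula above implies the recurrence H_{n+1} = 2xH_n − H_n′ (which follows by differentiating the formula once), and this gives a quick way to generate the polynomials and check their orthogonality against the weight $e^{-x^2}$. A pure-Python sketch, with polynomials stored as coefficient lists (lowest degree first); the quadrature parameters are arbitrary choices:

```python
import math

def hermite(n):
    """H_n via H_{n+1} = 2x H_n - H_n', starting from H_0 = 1."""
    H = [1.0]
    for _ in range(n):
        dH = [k * c for k, c in enumerate(H)][1:]    # H'
        xH = [0.0] + [2.0 * c for c in H]            # 2x H
        dH += [0.0] * (len(xH) - len(dH))
        H = [a - b for a, b in zip(xH, dH)]
    return H

evalp = lambda c, x: sum(a * x ** k for k, a in enumerate(c))

def inner(m, n, M=4000, X=8.0):
    """Riemann sum for the integral of H_m H_n e^{-x^2} over [-X, X]."""
    Hm, Hn = hermite(m), hermite(n)
    h = 2 * X / M
    return sum(evalp(Hm, t) * evalp(Hn, t) * math.exp(-t * t) * h
               for t in (-X + (j + 0.5) * h for j in range(M)))

print(hermite(2))   # [-2.0, 0.0, 4.0], i.e. H_2 = 4x^2 - 2
print(inner(2, 3))  # ~0: distinct Hermite polynomials are orthogonal in <.,.>_w
```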
T. 6-21
<Parseval’s identity II> $\langle f, f\rangle_w = \sum_{n\in\mathbb{N}}|\hat f_n|^2$ for the inner product with weight w.
Again a not very rigorous proof: Let {y₁, y₂, · · · } be a complete set of functions that are orthonormal with respect to the weight function w. We can expand f in this basis; then
$$\langle f, f\rangle_w = \int_a^b f^*(x)f(x)w(x)\,dx = \int_a^b\Big(\sum_{n\in\mathbb{N}}\hat f_n y_n(x)\Big)^*\Big(\sum_{m\in\mathbb{N}}\hat f_m y_m(x)\Big)w(x)\,dx = \sum_{n,m\in\mathbb{N}}\hat f_n^*\hat f_m\langle y_n, y_m\rangle_w = \sum_{n\in\mathbb{N}}|\hat f_n|^2.$$
C. 6-22
<Least squares approximation> So far, we have expanded functions in terms of infinite series. However, in real life, when we ask a computer to do this for us, it is incapable of storing and calculating “infinite” terms. So it's important to know how accurately we can represent a function by expanding it in just a limited, incomplete set of eigenfunctions.
Suppose we approximate some function f : Ω → C by a finite set of eigenfunctions {y₁, · · · , yₙ}. Suppose we write the approximation g as $g(x) = \sum_{k=1}^{n} c_k y_k(x)$. The objective here is to figure out what values of the coefficients cₖ are “the best”, ie. make g represent f as closely as possible. One notion of “as closely as possible” is that we want to minimize ⟨f − g, f − g⟩_w. To minimize this norm, first we want ∇⟨f − g, f − g⟩_w = 0, where we are treating ⟨f − g, f − g⟩_w as a function of the variables Re(c₁), · · · , Re(cₙ), Im(c₁), · · · , Im(cₙ). The requirement ∇⟨f − g, f − g⟩_w = 0 is that for all j we have
∇hf − g, f − giw = 0 is that for all j we have
∂ ∂
0= hf − g, f − giw = (hf, f iw + hg, giw − hf, giw − hg, f iw )
∂ Re(cj ) ∂ Re(cj )
n n n
!
∂ X X ∗
X
= 2
|ck | − ˆ
fk ck − ck fk = 2 Re(cj ) − fˆj∗ − fˆj
∗ ˆ
∂ Re(cj ) i=1
k=1 k=1
= 2 Re(cj ) − 2 Re(fˆj ).
$$\text{and}\qquad 0 = \frac{\partial}{\partial\operatorname{Im}(c_j)}\langle f - g, f - g\rangle_w = \cdots = 2\operatorname{Im}(c_j) - i\hat f_j^* + i\hat f_j = 2\operatorname{Im}(c_j) - 2\operatorname{Im}(\hat f_j),$$
where the ⟨f, f⟩_w term vanishes since it does not depend on our variables, and we expanded the other inner products in a similar manner as in Parseval's identity. These conditions are satisfied iff $c_j = \hat f_j$ for all j. Now to check that this is indeed a minimum, we can look at the second derivatives. We have
$$\frac{\partial^2}{\partial\operatorname{Re}(c_i)\partial\operatorname{Re}(c_j)}\langle f - g, f - g\rangle_w = \frac{\partial^2}{\partial\operatorname{Im}(c_i)\partial\operatorname{Im}(c_j)}\langle f - g, f - g\rangle_w = 2\delta_{ij},$$
$$\text{and}\qquad \frac{\partial^2}{\partial\operatorname{Im}(c_i)\partial\operatorname{Re}(c_j)}\langle f - g, f - g\rangle_w = 0.$$
Hence the Hessian matrix is 2I, which is obviously positive-definite, so this is indeed a minimum. Thus we know that ⟨f − g, f − g⟩_w is minimized over all g(x) when $c_k = \hat f_k = \langle y_k, f\rangle_w$. These are exactly the coefficients in our infinite expansion. Hence if we truncate our infinite series at an arbitrary point, it is still the best approximation we can get using only the retained eigenfunctions.
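This can be confirmed numerically: solving the discretised least-squares problem directly recovers exactly the generalised Fourier coefficients ⟨yₖ, f⟩_w. Here the basis is the orthonormal sine family on [0, π] with weight w = 1, and f is an arbitrary smooth test function (numpy assumed):

```python
import numpy as np

M, n = 2000, 5
x = (np.arange(M) + 0.5) * np.pi / M       # midpoint grid on [0, pi]
dx = np.pi / M
Y = np.sqrt(2 / np.pi) * np.sin(np.outer(np.arange(1, n + 1), x))  # rows y_k
f = x * (np.pi - x)

# Generalised Fourier coefficients <y_k, f> (w = 1), by quadrature.
fourier = Y @ f * dx

# Coefficients from directly minimising ||f - sum_k c_k y_k||^2 on the grid.
c_ls, *_ = np.linalg.lstsq(Y.T, f, rcond=None)

print(np.max(np.abs(c_ls - fourier)))  # essentially zero: minimiser is c_k = <y_k, f>
```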
P. 6-23
Let L be a linear differential operator. Given f , a necessary condition for the
inhomogeneous equation Lg = f to have a solution g is that f is orthogonal to
ker(L∗ ), where L∗ is the adjoint of L.
E. 6-24
Let Ω ⊆ R² be a compact domain with boundary ∂Ω. We want to seek a solution φ : Ω → R of the Poisson equation ∇²φ = f with Neumann boundary condition n · ∇φ|_{∂Ω} = 0.
$$\langle\psi, \nabla^2\phi\rangle = \int_\Omega\psi\nabla^2\phi\,dV = \int_{\partial\Omega}\psi\nabla\phi\cdot dS - \int_\Omega\nabla\psi\cdot\nabla\phi\,dV = -\int_\Omega\nabla\psi\cdot\nabla\phi\,dV$$
By the boundary condition and the divergence theorem, a solution can only exist if
$$0 = \int_{\partial\Omega}\nabla\phi\cdot dS = \int_\Omega\nabla^2\phi\,dV = \int_\Omega f(x)\,dV.$$
C. 6-25
<Inhomogeneous equations and Green’s functions>
This is analogous to solving the inhomogeneous matrix equation Mu = f for a self-adjoint matrix M. In the context of Sturm-Liouville differential operators, we seek to solve the inhomogeneous differential equation Lg = f(x), where f(x) is a forcing term. We can write this as Lg = w(x)F(x). Let {y₁, y₂, · · · } be the orthonormal basis consisting of eigenfunctions of L, so that Lyₙ = λₙwyₙ. We expand g and F as $g(x) = \sum_{n\in\mathbb{N}}\hat g_n y_n(x)$ and $F(x) = \sum_{n\in\mathbb{N}}\hat F_n y_n(x)$. By linearity,
$$w(x)\sum_{n\in\mathbb{N}}\hat F_n y_n(x) = w(x)F(x) = Lg = \sum_{n\in\mathbb{N}}\hat g_n Ly_n(x) = \sum_{n\in\mathbb{N}}\hat g_n\lambda_n w(x)y_n(x).$$
Taking the (regular) inner product with yₘ(x) (and noting orthogonality of eigenfunctions), we obtain $\hat F_m = \hat g_m\lambda_m$. This tells us that $\hat g_m = \hat F_m/\lambda_m$. So provided all λₙ are non-zero, we have found the (particular) solution
$$g(x) = \sum_{n\in\mathbb{N}}\frac{\hat F_n}{\lambda_n}y_n(x).$$
Note that there are no non-trivial complementary solutions, as otherwise we would have a 0 eigenvalue.
It is often helpful to rewrite this into another form, using the fact that $\hat F_n = \langle y_n, F\rangle_w$ and f = wF. Not caring too much about rigour, we have
$$g(x) = \sum_{n\in\mathbb{N}}\frac{\langle y_n, F\rangle_w}{\lambda_n}y_n(x) = \int_a^b\sum_{n\in\mathbb{N}}\frac{y_n^*(t)F(t)}{\lambda_n}y_n(x)w(t)\,dt = \int_a^b G(x, t)f(t)\,dt,$$
$$\text{where}\qquad G(x, t) = \sum_{n\in\mathbb{N}}\frac{1}{\lambda_n}y_n^*(t)y_n(x).$$
We call G the Green’s function . The Green's function is a function of two variables (x, t) ∈ [a, b] × [a, b]. Note that G depends on the λₙ and yₙ only; it depends on the differential operator L both through its eigenfunctions and through the boundary conditions we chose to ensure L is self-adjoint, but it doesn't depend on the forcing term f. Thus if we know the Green's function, we can use it to construct a particular solution g of Lg = f for an arbitrary forcing term f.
In this way, the Green's function provides a formal inverse to the differential operator L, in analogy to the finite dimensional case where, for a non-singular matrix M, M⁻¹ is its inverse, so that u = M⁻¹f provides a solution to Mu = f. Recall that for a matrix, the inverse exists if the determinant is non-zero, which is true if the eigenvalues are all non-zero, equivalently ker(M) = 0. Similarly, here a necessary condition for the Green's function to exist is that all the eigenvalues are non-zero, that is ker(L) = 0.
What happens if we have a 0 eigenvalue, say λₘ = 0? Well, $\hat F_m = \hat g_m\lambda_m$ tells us that there will be no solution if $\hat F_m \ne 0$. If $\hat F_m = 0$, then we can have a solution, but we can add to it an arbitrary amount of $\hat g_m$, ie. an arbitrary multiple of yₘ. This makes sense, since yₘ is in ker(L).
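A numerical sketch of this construction for L = d²/dx² with Dirichlet conditions on [0, π], so that yₙ = √(2/π) sin(nx), λₙ = −n², w = 1: truncating the eigenfunction sum for G and applying it to a forcing term reproduces the known solution. The forcing f(x) = sin 2x, with exact solution g = −sin(2x)/4, is an illustrative choice (numpy assumed):

```python
import numpy as np

M, N = 400, 200
x = (np.arange(M) + 0.5) * np.pi / M   # quadrature grid on [0, pi]
dx = np.pi / M

# G(x, t) = sum_n y_n(t) y_n(x) / lambda_n, truncated at N terms.
G = np.zeros((M, M))
for n in range(1, N + 1):
    yn = np.sqrt(2 / np.pi) * np.sin(n * x)
    G += np.outer(yn, yn) / (-n ** 2)

# Solve g'' = f for f(x) = sin(2x); the exact solution is g = -sin(2x)/4.
f = np.sin(2 * x)
g = G @ f * dx
err = np.max(np.abs(g - (-np.sin(2 * x) / 4)))
print(err)  # tiny: G acts as the inverse of L on this forcing term
```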
$$(1 - x^2)\sum_{n=0}^{\infty}a_n n(n-1)x^{n-2} - 2\sum_{n=0}^{\infty}a_n n x^n + \lambda\sum_{n=0}^{\infty}a_n x^n = 0.$$
For this to hold for all x ∈ (−1, 1), the equation must hold for each coefficient of x separately. So $a_n(\lambda - n(n+1)) + a_{n+2}(n+2)(n+1) = 0$. This requires that
$$a_{n+2} = \frac{n(n+1) - \lambda}{(n+2)(n+1)}\,a_n.$$
This equation relates $a_{n+2}$ to $a_n$. So we can choose a₀ and a₁ freely. So we get two linearly independent solutions Θ₀(x) and Θ₁(x), where they satisfy Θ₀(−x) = Θ₀(x) and Θ₁(−x) = −Θ₁(x). In particular, we can expand our recurrence formula to obtain
$$\Theta_0(x) = a_0\left(1 - \frac{\lambda}{2}x^2 - \frac{(6-\lambda)\lambda}{4!}x^4 + \cdots\right)$$
$$\Theta_1(x) = a_1\left(x + \frac{(2-\lambda)}{3!}x^3 + \frac{(12-\lambda)(2-\lambda)}{5!}x^5 + \cdots\right).$$
We now consider the boundary conditions. We know that Θ(x) must be regular (i.e. finite) at x = ±1. However, we have lim_{n→∞} a_{n+2}/a_n = 1. This is fine inside (−1, 1), where the power series still converges. At x = ±1, however, the ratio test is inconclusive and does not guarantee convergence; in fact, more sophisticated convergence tests show that the infinite series diverges at the boundary!
To avoid this problem, we need to choose λ such that the series truncates: if we set λ = ℓ(ℓ + 1) for some ℓ ∈ N₀, then our power series truncates, and Θ(x) is finite at x = ±1. Note that in this case, the finiteness boundary condition fixes the possible values of the eigenvalues. This is how quantization occurs in quantum mechanics; here, it is why angular momentum is quantized.
6.3. PARTIAL DIFFERENTIAL EQUATIONS 223
The resulting polynomial solutions P_ℓ(x) are called Legendre polynomials. For example, we have P_0(x) = 1, P_1(x) = x, P_2(x) = ½(3x² − 1) and P_3(x) = ½(5x³ − 3x), where the overall normalization is chosen to fix P_ℓ(1) = 1. It turns out that
\[P_\ell(x) = \frac{1}{2^\ell \ell!}\frac{d^\ell}{dx^\ell}(x^2 - 1)^\ell.\]
The constants in front are just to ensure normalization. We now check that this indeed gives the desired normalization:
\[P_\ell(1) = \frac{1}{2^\ell \ell!}\left.\frac{d^\ell}{dx^\ell}\Big((x-1)^\ell(x+1)^\ell\Big)\right|_{x=1} = \frac{1}{2^\ell \ell!}\Big[\ell!\,(x+1)^\ell + (x-1)(\text{stuff})\Big]_{x=1} = 1,\]
since every term of the Leibniz expansion other than the first keeps a factor of (x − 1) and so vanishes at x = 1. (Since Q_{ℓ,k} has degree k, Q′_{ℓ,k} has degree k − 1, so the right bunch of stuff has degree k + 1. Done.) Now we can show orthogonality. We have
\[\langle P_\ell, P_{\ell'}\rangle = \int_{-1}^{1} P_\ell(x)P_{\ell'}(x)\,dx = \frac{1}{2^\ell \ell!}\int_{-1}^{1} \frac{d^\ell}{dx^\ell}(x^2-1)^\ell\, P_{\ell'}(x)\,dx\]
\[= \frac{1}{2^\ell \ell!}\left[\frac{d^{\ell-1}}{dx^{\ell-1}}(x^2-1)^\ell\, P_{\ell'}(x)\right]_{-1}^{1} - \frac{1}{2^\ell \ell!}\int_{-1}^{1}\frac{d^{\ell-1}}{dx^{\ell-1}}(x^2-1)^\ell\,\frac{dP_{\ell'}}{dx}\,dx\]
\[= -\frac{1}{2^\ell \ell!}\int_{-1}^{1}\frac{d^{\ell-1}}{dx^{\ell-1}}(x^2-1)^\ell\,\frac{dP_{\ell'}}{dx}\,dx = \cdots = \frac{(-1)^k}{2^\ell \ell!}\int_{-1}^{1}\frac{d^{\ell-k}}{dx^{\ell-k}}(x^2-1)^\ell\,\frac{d^k P_{\ell'}}{dx^k}\,dx\]
Note that the boundary term disappears since the (ℓ − 1)th derivative of (x² − 1)^ℓ still has a factor of x² − 1. So integration by parts allows us to transfer derivatives from (x² − 1)^ℓ to P_{ℓ'}. Now if ℓ ≠ ℓ', we can wlog assume ℓ' < ℓ. We can integrate by parts ℓ' + 1 times until we get the (ℓ' + 1)th derivative of P_{ℓ'}, which is zero. In fact, we can show that ⟨P_ℓ, P_{ℓ'}⟩ = \frac{2}{2\ell+1}\delta_{\ell\ell'}, hence the P_ℓ(x) are orthogonal polynomials.
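The Rodrigues formula, the normalization P_ℓ(1) = 1 and the orthogonality relation can all be checked symbolically. Here is a minimal sketch using sympy, purely as an illustration:

```python
import sympy as sp

x = sp.symbols('x')

def legendre_rodrigues(l):
    # Rodrigues' formula: P_l(x) = 1/(2^l l!) d^l/dx^l (x^2 - 1)^l
    return sp.expand(sp.diff((x**2 - 1)**l, x, l) / (2**l * sp.factorial(l)))

for l in range(4):
    Pl = legendre_rodrigues(l)
    assert Pl.subs(x, 1) == 1                        # normalization P_l(1) = 1
    for m in range(l + 1):
        inner = sp.integrate(Pl * legendre_rodrigues(m), (x, -1, 1))
        expected = sp.Rational(2, 2*l + 1) if l == m else 0
        assert sp.simplify(inner - expected) == 0    # <P_l, P_m> = 2/(2l+1) d_lm
```

The integrals are of polynomials, so sympy evaluates them exactly; the check confirms both orthogonality and the normalization 2/(2ℓ + 1) on the diagonal.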
(Roots of P_ℓ(x)): By the fundamental theorem of algebra, P_ℓ(x) has ℓ roots. In fact, they are always real and lie in (−1, 1). To see this, suppose only m < ℓ roots lie in (−1, 1). Then let Q_m(x) = ∏_{r=1}^{m}(x − x_r), where {x_1, x_2, ···, x_m} are these m roots. Consider the polynomial P_ℓ(x)Q_m(x). If we factorize this, we get ∏_{r=m+1}^{ℓ}(x − x_r) ∏_{r=1}^{m}(x − x_r)². The first factors have roots outside (−1, 1), and hence do not change sign in (−1, 1); the latter factors are always non-negative. Hence for some appropriate sign, we have ±∫_{−1}^{1} P_ℓ(x)Q_m(x) dx > 0. However, we can expand Q_m(x) = ∑_{r=1}^{m} q_r P_r(x) in a basis of Legendre polynomials, but ⟨P_ℓ, P_r⟩ = 0 for all r < ℓ, so the integral is 0. This is a contradiction.
Bessel functions
Consider Bessel's equation
\[x^2\frac{d^2R}{dx^2} + x\frac{dR}{dx} + (x^2 - n^2)R = 0.\]
Note that this is actually a whole family of differential equations, one for each n. Here we assume n ∈ N₀. Since Bessel's equations are second order, there are two independent solutions for each n, namely J_n(x) and Y_n(x). These are called Bessel functions of order n of the first (J_n) or second (Y_n) kind. We will not study these functions in detail, but just state some useful properties.
The J_n are all regular at the origin; in particular, as x → 0, J_n(x) ∼ xⁿ. They look like decaying sine waves, but their zeros are not regularly spaced. On the other hand, the Y_n are similar but singular at the origin: as x → 0, Y_0(x) ∼ ln x, while Y_n(x) ∼ x^{−n} for n ≥ 1.
The Bessel functions satisfy the orthogonality condition
\[\int_0^a J_n\left(\frac{k_{ni}r}{a}\right)J_n\left(\frac{k_{nj}r}{a}\right) r\,dr = \frac{a^2}{2}\,\delta_{ij}\,\big(J_n'(k_{ni})\big)^2,\]
where k_{ni} is the ith positive root of J_n. Note that we have a weight function r here. Also, this is the orthogonality relation for different roots of Bessel functions of the same order; it does not relate Bessel functions of different orders.
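This orthogonality relation is easy to verify numerically. The following sketch uses scipy, with an arbitrary choice of order n = 2 and interval length a = 1, purely as an illustration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import jn_zeros, jv, jvp

n, a = 2, 1.0                 # order and interval length (arbitrary choices)
k = jn_zeros(n, 3)            # first three positive roots k_{n1}, k_{n2}, k_{n3}

def inner(i, j):
    # weighted inner product: integral of J_n(k_i r/a) J_n(k_j r/a) r dr on [0, a]
    val, _ = quad(lambda r: jv(n, k[i]*r/a) * jv(n, k[j]*r/a) * r, 0.0, a)
    return val

assert abs(inner(0, 0) - 0.5 * a**2 * jvp(n, k[0])**2) < 1e-10   # diagonal
assert abs(inner(0, 1)) < 1e-10                                  # off-diagonal
```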
We note that the second term vanishes, since either we have Φ = 0 on the boundary (Dirichlet boundary conditions) or ∇Φ · n = 0 on the boundary⁴ (Neumann boundary conditions). So we have 0 = −∫_Ω (∇Φ)·(∇Φ) dV. However, this can only hold if ∇Φ = 0 throughout. Hence Φ is constant throughout Ω, i.e. φ₁ = φ₂ + c for some constant c. For Dirichlet boundary conditions, Φ = 0 at the boundary, so we must have Φ = 0 throughout, i.e. c = 0 and φ₁ = φ₂.
⁴ ∂Ω is the boundary of Ω.
E. 6-27
<Laplace's equation on a disk> Solve ∇²φ = 0 on a disk.
For (x, y) ∈ R², let z = x + iy and z̄ = x − iy; then Laplace's equation becomes
\[0 = \frac{\partial^2\phi}{\partial z\,\partial\bar z} \implies \phi(z,\bar z) = \psi(z) + \chi(\bar z)\quad\text{for some }\psi,\chi,\]
where ψ(z) is holomorphic (i.e. ∂ψ/∂z̄ = 0) and χ(z̄) is antiholomorphic (i.e. ∂χ/∂z = 0). Suppose that we wish to solve Laplace's equation inside the unit disc, obeying the Dirichlet boundary condition φ(z, z̄)|_{∂Ω} = f(θ), where the boundary ∂Ω is the unit circle. Since the domain of f is the unit circle S¹, we can Fourier-expand it (assuming that f is "nice" enough):
\[f(\theta) = \sum_{n\in\mathbb{Z}}\hat f_n e^{in\theta} = \hat f_0 + \sum_{n=1}^{\infty}\hat f_n e^{in\theta} + \sum_{n=1}^{\infty}\hat f_{-n}e^{-in\theta}.\]
However, we know that z = re^{iθ} and z̄ = re^{−iθ}, so on the boundary we have z = e^{iθ} and z̄ = e^{−iθ}. Hence we can write f(θ) = f̂₀ + ∑_{n=1}^{∞}(f̂_n zⁿ + f̂_{−n} z̄ⁿ). This is defined for |z| = 1. Now we can extend this and define φ on the unit disk by
\[\phi(z,\bar z) = \hat f_0 + \sum_{n=1}^{\infty}\hat f_n z^n + \sum_{n=1}^{\infty}\hat f_{-n}\bar z^n.\]
It is clear that this obeys the boundary conditions by construction. Also, φ(z, z̄) is of the form ψ(z) + χ(z̄), a sum of a holomorphic and an antiholomorphic function (the constant f̂₀, being both holomorphic and antiholomorphic, may be included with either). Note that φ(z, z̄) certainly converges for |z| ≤ 1 if the series for f(θ) converged on ∂Ω. So it is a solution to Laplace's equation on the unit disk. Since the solution is unique, we're done!
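We can sanity-check this construction numerically for a simple boundary function. Taking f(θ) = cos 3θ (a hypothetical choice), the only non-zero coefficients are f̂₃ = f̂₋₃ = ½, and the series gives φ = Re z³ = r³ cos 3θ:

```python
import numpy as np

theta = np.linspace(0.0, 2*np.pi, 8, endpoint=False)
r = 0.7                                   # a point strictly inside the disk
z = r * np.exp(1j * theta)

# phi = f_0 + sum f_n z^n + sum f_{-n} zbar^n, with f_3 = f_{-3} = 1/2:
phi = 0.5 * z**3 + 0.5 * np.conj(z)**3

assert np.allclose(phi.imag, 0.0)                      # phi is real
assert np.allclose(phi.real, r**3 * np.cos(3*theta))   # harmonic extension
```

On the boundary r = 1 the same expression reduces to the boundary data cos 3θ, as required.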
C. 6-28
<Separation of variables> Unfortunately, the use of complex variables is very special to the case Ω ⊆ R². In higher dimensions, we proceed differently. The method of separation of variables proceeds as follows:
1. Write ψ as a product of functions that depend only on one variable each.
Hence reduce Laplace’s PDE to a system of ODEs that depend on a number
of constants (here λ and µ).
2. Solve the system of ODEs. Since Laplace’s equation was a second order linear
equation, these ODEs will always be of Sturm-Liouville type; the constants will
appear as eigenvalues of the Sturm-Liouville equation and the equations will
be solved by the eigenfunctions of the Sturm-Liouville operator.
3. Use the homogeneous boundary conditions to impose restrictions on the possi-
ble values of the eigenvalues. The solution for a fixed permissible choice of the
eigenvalues is known as a normal mode of the system. By linearity, the general
solution is a linear combination of these normal modes.
The boundary conditions say that we want our function to be f when z = 0 and to vanish at the other boundaries. The first step is to look for a solution of ∇²ψ(x, y, z) = 0 of the form ψ(x, y, z) = X(x)Y(y)Z(z). Then we have
\[0 = \nabla^2\psi = YZX'' + XZY'' + XYZ'' = XYZ\left(\frac{X''}{X} + \frac{Y''}{Y} + \frac{Z''}{Z}\right).\]
As long as ψ ≠ 0, we can divide through by ψ and obtain
\[\frac{X''}{X} + \frac{Y''}{Y} + \frac{Z''}{Z} = 0.\]
The key point is that each term depends on only one of the variables (x, y, z). If we vary, say, x while keeping the others unchanged, then Y″/Y + Z″/Z does not change; for the total sum to be 0, X″/X must be constant. So each term is separately constant, and we can write
\[X'' = -\lambda X,\qquad Y'' = -\mu Y,\qquad Z'' = (\lambda+\mu)Z.\]
The signs before λ and µ are there just to make our life easier later on. We can solve these to obtain
\[X = a\sin\sqrt{\lambda}\,x + b\cos\sqrt{\lambda}\,x,\]
\[Y = c\sin\sqrt{\mu}\,y + d\cos\sqrt{\mu}\,y,\]
\[Z = g\exp\big(\sqrt{\lambda+\mu}\,z\big) + h\exp\big(-\sqrt{\lambda+\mu}\,z\big).\]
We now impose the homogeneous boundary conditions, i.e. that ψ vanishes at the walls and at infinity: At x = 0, we need ψ(0, y, z) = 0, so b = 0. At x = a, we need ψ(a, y, z) = 0, so λ = (nπ/a)². At y = 0, we need ψ(x, 0, z) = 0, so d = 0. At y = b, we need ψ(x, b, z) = 0, so µ = (mπ/b)². As z → ∞, ψ(x, y, z) → 0, so g = 0. Therefore for each n, m ∈ N, we have solutions
\[\psi(x,y,z) = A_{n,m}\sin\frac{n\pi x}{a}\sin\frac{m\pi y}{b}\exp(-s_{n,m}z)\quad\text{where}\quad s_{n,m}^2 = \left(\frac{n^2}{a^2} + \frac{m^2}{b^2}\right)\pi^2\]
and A_{n,m} is an arbitrary constant. This obeys the homogeneous boundary conditions for any n, m ∈ N, but not the inhomogeneous condition at z = 0. By linearity, the general solution obeying the homogeneous boundary conditions is
\[\psi(x,y,z) = \sum_{n,m=1}^{\infty} A_{n,m}\sin\frac{n\pi x}{a}\sin\frac{m\pi y}{b}\exp(-s_{n,m}z).\qquad(\dagger)\]
The objective is thus to find the coefficients A_{n,m}. We can use the orthogonality relation
\[\int_0^a \sin\frac{k\pi x}{a}\sin\frac{n\pi x}{a}\,dx = \frac{a}{2}\delta_{k,n}.\]
So we multiply (†) by sin(kπx/a) and integrate w.r.t. x:
\[\sum_{n,m=1}^{\infty} A_{n,m}\sin\frac{m\pi y}{b}\int_0^a \sin\frac{k\pi x}{a}\sin\frac{n\pi x}{a}\,dx = \int_0^a \sin\frac{k\pi x}{a}\,f(x,y)\,dx.\]
where A_{m,n} is given by (∗). Since we have shown that there is a unique solution to Laplace's equation obeying Dirichlet boundary conditions, we're done. Note that if we had imposed a boundary condition at finite z, say 0 ≤ z ≤ c, then both sets of exponentials exp(±√(λ+µ) z) would have contributed. Similarly, if ψ does not vanish at the other boundaries, then the cos terms would also contribute. In general, to actually find ψ we have to do the horrible integral for A_{m,n}, and this is not always easy.
In particular, suppose f(x, y) = 1. Then
\[A_{m,n} = \frac{4}{ab}\int_0^a \sin\frac{n\pi x}{a}\,dx\int_0^b \sin\frac{m\pi y}{b}\,dy = \begin{cases}\dfrac{16}{\pi^2 mn} & n, m\text{ both odd}\\[1ex] 0 & \text{otherwise}\end{cases}\]
Hence we have
\[\psi(x,y,z) = \frac{16}{\pi^2}\sum_{n,m\text{ odd}}\frac{1}{nm}\sin\frac{n\pi x}{a}\sin\frac{m\pi y}{b}\exp(-s_{n,m}z).\]
n,m odd
Note that in this example, we obtained a Fourier sine series because of the homogeneous Dirichlet boundary conditions on x and y. If we had instead imposed Neumann boundary conditions ∂ψ/∂x = 0 at x = 0, a and ∂ψ/∂y = 0 at y = 0, b, then we would have found Fourier cosine series.
E. 6-30
<Laplace's equation in spherical polar coordinates> Find the axisymmetric solutions of ∇²ψ = 0 on Ω = {(x, y, z) ∈ R³ : √(x² + y² + z²) ≤ a}.
Since our domain is spherical, it makes sense to use a coordinate system with spherical symmetry. We use spherical coordinates (r, θ, φ), where x = r sin θ cos φ, y = r sin θ sin φ and z = r cos θ. The Laplacian is
\[\nabla^2 = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial}{\partial r}\right) + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial}{\partial\theta}\right) + \frac{1}{r^2\sin^2\theta}\frac{\partial^2}{\partial\phi^2}.\]
E. 6-31
<Multipole expansions for Laplace's equation> One can quickly check that
\[\phi(\mathbf{r}) = \frac{1}{|\mathbf{r} - \mathbf{r}_0|}\]
solves Laplace's equation away from r = r₀. Taking r₀ = k̂, the unit vector in the z direction, we can expand 1/|r − k̂| as ∑_{ℓ=0}^{∞} c_ℓ r^ℓ P_ℓ(cos θ) by our previous result, since 1/|r − k̂| is finite at the origin. To find c_ℓ, we can employ a little trick: since P_ℓ(1) = 1, at θ = 0 we have
\[\sum_{\ell=0}^{\infty} c_\ell r^\ell = \frac{1}{1-r} = \sum_{\ell=0}^{\infty} r^\ell.\]
So all the coefficients must be 1. This is valid for r < 1. More generally, we have
\[\frac{1}{|\mathbf{r} - \mathbf{r}'|} = \frac{1}{r'}\sum_{\ell=0}^{\infty}\left(\frac{r}{r'}\right)^{\ell} P_\ell(\hat{\mathbf{r}}\cdot\hat{\mathbf{r}}') = \frac{1}{r'} + \frac{r}{r'^2}\,\hat{\mathbf{r}}\cdot\hat{\mathbf{r}}' + \cdots.\]
This is called the multipole expansion, and is valid when r < r′. The first term 1/r′ is known as the monopole term, and the second term is known as the dipole term, in analogy with charges in electromagnetism: the monopole term is what we get due to a single charge, and the dipole term is the result of the interaction of two charges.
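The expansion 1/|r − r′| = (1/r′)∑(r/r′)^ℓ P_ℓ(r̂·r̂′) can be verified numerically against the law of cosines. A quick sketch for one arbitrary configuration:

```python
import numpy as np
from scipy.special import eval_legendre

r, rp, gamma = 0.5, 2.0, 0.8   # |r|, |r'| and the angle between them (arbitrary)

# Left-hand side via the law of cosines: |r - r'|^2 = r^2 + r'^2 - 2 r r' cos(gamma)
lhs = 1.0 / np.sqrt(r**2 + rp**2 - 2*r*rp*np.cos(gamma))

# Right-hand side: multipole series, converging since r < r'
rhs = sum((r/rp)**l * eval_legendre(l, np.cos(gamma)) for l in range(60)) / rp

assert abs(lhs - rhs) < 1e-12
```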
E. 6-32
<Laplace's equation in cylindrical coordinates> Solve Laplace's equation in cylindrical coordinates on Ω = {(r, θ, z) ∈ R³ : r ≤ a, z ≥ 0}. Let f be a given function. Give the solution that is regular inside Ω and obeys the boundary conditions. In cylindrical coordinates,
\[\nabla^2\phi = \frac{1}{r}\frac{\partial}{\partial r}\left(r\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2}\frac{\partial^2\phi}{\partial\theta^2} + \frac{\partial^2\phi}{\partial z^2} = 0.\]
Separating variables leads to
\[\frac{r}{R}(rR')' + \frac{\Theta''}{\Theta} + \mu r^2 = 0,\]
and with Θ″ = −n²Θ the radial part becomes
\[\frac{d}{dr}\left(r\frac{dR}{dr}\right) - \frac{n^2}{r}R = -\mu r R.\]
Here we have p(r) = r, q(r) = −n²/r and w(r) = r. Introducing x = r√µ, we can rewrite this as
<Bessel's equation>
\[x^2\frac{d^2R}{dx^2} + x\frac{dR}{dx} + (x^2 - n^2)R = 0.\]
Note that the n here is not the eigenvalue we are trying to find in this equation; it is already fixed by the Θ equation. The actual eigenvalue we are finding is µ, not n.
The boundary conditions require the solution to decay as z → ∞, so we have Z(z) = c_µ e^{−√µ z}. Now we can write our separable solution as
\[\phi(r,\theta,z) = (a_n\sin n\theta + b_n\cos n\theta)\,e^{-\sqrt{\mu}\,z}\big(c_{\mu,n}J_n(r\sqrt{\mu}) + d_{\mu,n}Y_n(r\sqrt{\mu})\big).\]
Often it is hard to evaluate these integrals explicitly, but we can ask a computer to do so numerically.
value of t and t′ doesn't matter since the total amount of heat is conserved, that is property 1). We have
\[\int_{\mathbb{R}^n}\phi_2(\mathbf{x},t)\,d^n x = A\int_{\mathbb{R}^n}\phi(\lambda\mathbf{x},\lambda^2 t)\,d^n x = A\lambda^{-n}\int_{\mathbb{R}^n}\phi(\mathbf{y},\lambda^2 t)\,d^n y,\]
where we substituted y = λx. So we need A = λⁿ.
Whenever φ(x, t) solves ∂φ/∂t = κ∇²φ, then so does λⁿφ(λx, λ²t), with the same amount of total heat. So we try to find solutions of the form
\[\phi(\mathbf{x},t) = \frac{1}{(\kappa t)^{n/2}}F\left(\frac{\mathbf{x}}{\sqrt{\kappa t}}\right) = \frac{1}{(\kappa t)^{n/2}}F(\boldsymbol{\eta}),\]
where η = x/√(κt) is called a similarity variable. Note that such a φ satisfies φ(x, t) = λⁿφ(λx, λ²t); in other words, we are finding solutions that satisfy this relation for any λ.
It turns out our solution corresponds to the solution of the heat equation with initial condition Cδ(x) (a Dirac delta function) at t = 0. Intuitively, it is reasonable to say that there will be some heat at the origin, so we expect F(0) ≠ 0, which means lim_{t→0} φ(0, t) = ∞; also, we want the total heat ∫_{Rⁿ} φ(x, t) dV to be finite, so we expect F(y) → 0 as ‖y‖ → ∞, which means lim_{t→0} φ(x, t) = 0 for any x ≠ 0.
In 1 + 1 dimensions, we can look for a solution of the form φ(x, t) = (κt)^{−1/2}F(η) to ∂φ/∂t = κ∂²φ/∂x², with boundary condition F′(0) = 0. We have
\[\frac{\partial\phi}{\partial t} = \frac{\partial}{\partial t}\left(\frac{1}{\sqrt{\kappa t}}F(\eta)\right) = \frac{-1}{2\sqrt{\kappa t^3}}F(\eta) + \frac{1}{\sqrt{\kappa t}}\frac{d\eta}{dt}F'(\eta) = \frac{-1}{2\sqrt{\kappa t^3}}(F + \eta F')\]
\[\kappa\frac{\partial^2\phi}{\partial x^2} = \kappa\frac{\partial^2}{\partial x^2}\left(\frac{1}{\sqrt{\kappa t}}F(\eta)\right) = \frac{\kappa}{\sqrt{\kappa t}}\frac{\partial}{\partial x}\left(\frac{\partial\eta}{\partial x}F'\right) = \frac{1}{\sqrt{\kappa t^3}}F''.\]
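Equating the two expressions gives F″ = −½(F + ηF′) = −½(ηF)′, so F′ + ½ηF is constant; with F′(0) = 0 we get F ∝ e^{−η²/4}, the Gaussian heat kernel. A symbolic sanity check of this similarity profile (a sketch using sympy):

```python
import sympy as sp

x, t, kappa = sp.symbols('x t kappa', positive=True)

# Similarity solution phi = (kappa t)^(-1/2) F(eta) with F(eta) = exp(-eta^2/4):
eta = x / sp.sqrt(kappa*t)
phi = sp.exp(-eta**2/4) / sp.sqrt(kappa*t)

# Check the heat equation phi_t = kappa * phi_xx:
assert sp.simplify(sp.diff(phi, t) - kappa*sp.diff(phi, x, 2)) == 0
```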
Eigenfunction solution
We can also solve the heat equation using eigenfunctions. Suppose φ(x, t) obeys the heat equation ∂φ/∂t = ∇²φ for t > 0, where for convenience we let κ = 1. Given the initial state φ(x, 0) at time t = 0, we can expand it as φ(x, 0) = ∑_I c_I y_I(x) in a complete set {y_I(x)} of eigenfunctions for ∇². Let λ_I be the eigenvalue of y_I, i.e. ∇²y_I = −λ_I y_I. Then the solution for all time is
\[\phi(\mathbf{x},t) = \sum_I c_I e^{-\lambda_I t}y_I(\mathbf{x}),\]
as can be verified by direct substitution. We can write φ(x, t) = ∑_I c_I(t)y_I(x), where c_I(t) = c_I e^{−λ_I t}.
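As a concrete sketch (on the hypothetical domain [0, π] with Dirichlet conditions, where y_n = sin nx and λ_n = n²), we can check a decaying-mode solution against the heat equation by finite differences:

```python
import numpy as np

x = np.linspace(0.0, np.pi, 201)
t, dt = 0.1, 1e-6

def phi(t):
    # initial state sin(x) + 0.5 sin(3x); each mode decays as exp(-n^2 t)
    return np.exp(-1*t)*np.sin(x) + 0.5*np.exp(-9*t)*np.sin(3*x)

phi_t = (phi(t + dt) - phi(t)) / dt                 # numerical time derivative
phi_xx = -np.exp(-1*t)*np.sin(x) - 4.5*np.exp(-9*t)*np.sin(3*x)   # exact Laplacian

assert np.allclose(phi_t, phi_xx, atol=1e-4)
```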
In some cases the eigenvalues λ must be positive. Suppose we have an eigenfunction y(x) of the Laplacian ∇² on Ω, so that ∇²y = −λy. Using the identity ∇·(Φ∇Ψ) = (∇Φ)·(∇Ψ) + Φ∇²Ψ, we have
\[-\lambda\int_\Omega |y|^2\,dV = \int_\Omega y^*(\mathbf{x})\nabla^2 y(\mathbf{x})\,dV = \int_{\partial\Omega} y^*\,\mathbf{n}\cdot\nabla y\,dS - \int_\Omega|\nabla y|^2\,dV.\]
Einstein came up with an example where the heat equation arises naturally from seemingly reversible laws, namely Brownian motion. The idea is that a particle randomly jumps around in space, where the movement is independent of time and position.
Let the probability that a dust particle moves through a step y over a time Δt be p(y, Δt). For any fixed Δt, we must have ∫_{−∞}^{∞} p(y, Δt) dy = 1. Here we assume that p(y, Δt) is independent of time, and of the location of the dust particle. We also assume p(y, Δt) is strongly peaked around y = 0, and that p(y, Δt) = p(−y, Δt). Now let P(x, t) be the probability that the dust particle is located at x at time t. Then at time t + Δt, we have
\[P(x, t+\Delta t) = \int_{-\infty}^{\infty} P(x-y,t)\,p(y,\Delta t)\,dy.\]
Taylor expanding,
\[P(x-y,t)\approx P(x,t) - y\frac{\partial P}{\partial x}(x,t) + \frac{y^2}{2}\frac{\partial^2 P}{\partial x^2}(x,t)\]
\[\implies P(x,t+\Delta t)\approx P(x,t) - \langle y\rangle\frac{\partial P}{\partial x}(x,t) + \frac{1}{2}\langle y^2\rangle\frac{\partial^2 P}{\partial x^2}(x,t) + \cdots,\]
where ⟨y^r⟩ = ∫_{−∞}^{∞} y^r p(y, Δt) dy.
Since p is an even function, we expect ⟨y^r⟩ to vanish when r is odd. Also, since p is strongly peaked at 0, we expect the higher order terms to be small. So we can write
\[P(x,t+\Delta t) - P(x,t) = \frac{1}{2}\langle y^2\rangle\frac{\partial^2 P}{\partial x^2}.\]
Suppose that as we take the limit Δt → 0, we have ⟨y²⟩/(2Δt) → κ for some κ. Then this becomes the heat equation
\[\frac{\partial P}{\partial t} = \kappa\frac{\partial^2 P}{\partial x^2}.\]
P. 6-33
Suppose φ : Ω × [0, ∞) → R satisfies the heat equation ∂φ/∂t = κ∇²φ, and obeys
• the initial condition φ(x, 0) = f(x) for all x ∈ Ω, and
• the boundary condition φ(x, t)|_{∂Ω} = g(x, t) for all t ∈ [0, ∞).
Then φ(x, t) is unique.
So we know that E decreases with time but is always non-negative. But at time
t = 0, E = Φ = 0. So E = 0 always, so Φ = 0 and hence φ1 = φ2 .
E. 6-34
<Heat conduction in a uniform medium> Suppose we are on Earth, and the Sun heats the surface through sunlight, maintaining the soil at some fixed temperature. However, this temperature varies with time as we move through the day-night cycle and the seasons of the year.
We let φ(x, t) be the temperature of the soil as a function of the depth x, defined on [0, ∞) × [0, ∞). Then it obeys the heat equation ∂φ/∂t = K ∂²φ/∂x² with conditions
1. φ(0, t) = φ₀ + A cos(2πt/t_D) + B cos(2πt/t_Y),
2. φ(x, t) → const as x → ∞.
We try separation of variables. Suppose φ(x, t) = T(t)X(x); then we get the equations T′ = λT and X″ = (λ/K)X. From the boundary condition, we know that our solutions will be oscillatory in time. So we let λ be imaginary, and set λ = iω. So we have
\[\phi(x,t) = e^{i\omega t}\left(a_\omega e^{-\sqrt{i\omega/K}\,x} + b_\omega e^{\sqrt{i\omega/K}\,x}\right).\]
Note that we have
\[\sqrt{\frac{i\omega}{K}} = \begin{cases}(1+i)\sqrt{\dfrac{|\omega|}{2K}} & \omega > 0\\[1ex] (i-1)\sqrt{\dfrac{|\omega|}{2K}} & \omega < 0.\end{cases}\]
We first look for any solution satisfying the boundary conditions φ_S(−L, t) = 0 and φ_S(L, t) = 1. For example, we can look for a time-independent solution φ_S(x, t) = φ_S(x). Then we need d²φ_S/dx² = 0, so we get
\[\phi_S(x) = \frac{x+L}{2L}.\]
By linearity, ψ(x, t) = φ(x, t) − φ_S(x) obeys the heat equation with the conditions ψ(−L, t) = ψ(L, t) = 0, which are homogeneous! Our initial condition now becomes ψ(x, 0) = Θ(x) − (x + L)/(2L). We now perform separation of variables, ψ(x, t) = X(x)T(t), obtaining the equations T′ = −κλT and X″ = −λX. Then we have
\[\psi(x,t) = \big(a\sin(\sqrt{\lambda}x) + b\cos(\sqrt{\lambda}x)\big)e^{-\kappa\lambda t}.\]
Since the initial condition is odd, we can eliminate all cos terms. Our boundary conditions also require λ = n²π²/L², where n = 1, 2, ···. So we have
\[\phi(x,t) = \phi_S(x) + \sum_{n=1}^{\infty}a_n\sin\frac{n\pi x}{L}\exp\left(-\frac{\kappa n^2\pi^2}{L^2}t\right),\]
We know that the Earth isn't just a cold piece of rock: there are still volcanoes, so many terms still contribute to the sum nowadays. This is rather difficult to work with, so we instead look at the temperature gradient
\[\frac{\partial\phi}{\partial r} = \frac{\phi_0}{r}\sum_{n\in\mathbb{Z}}(-1)^{n+1}\cos\frac{n\pi r}{R}\exp\left(\frac{-\kappa n^2\pi^2 t}{R^2}\right) + (\text{sin terms}).\]
[Figure: a string profile φ(x, t), with two nearby points A and B marked on it.]
Consider two points A, B separated by a small distance δx. Let T_A and T_B be the tensions of the string at A and B respectively, and θ_A, θ_B the angles they make with the horizontal. Consider the segment of the string between A and B. It has no sideways (x) motion, so there is no net horizontal force:
\[T_A\cos\theta_A = T_B\cos\theta_B = T.\qquad(*)\]
If the string has mass per unit length µ, then in the vertical direction, Newton's second law gives
\[\mu\,\delta x\,\frac{\partial^2\phi}{\partial t^2} = T_B\sin\theta_B - T_A\sin\theta_A.\]
The coefficients A_n are fixed by the initial profile φ(x, 0) of the string, while the coefficients B_n are fixed by the initial string velocity ∂φ/∂t(x, 0). Note that we need two sets of initial conditions, since the wave equation is second order in time.
From IA Differential Equations, we've learnt that the solution of (1/c²)∂²φ/∂t² = ∂²φ/∂x² can be written as f(x − ct) + g(x + ct). However, this method does not extend to higher dimensions, while the method of separation of variables does.
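The d'Alembert form can be verified symbolically for arbitrary f and g; a short sympy sketch:

```python
import sympy as sp

x, t, c = sp.symbols('x t c')
f, g = sp.Function('f'), sp.Function('g')

# Any phi = f(x - c t) + g(x + c t) solves phi_tt = c^2 phi_xx:
phi = f(x - c*t) + g(x + c*t)
assert sp.simplify(sp.diff(phi, t, 2) - c**2*sp.diff(phi, x, 2)) == 0
```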
What can we see here? Our solution is essentially an (infinite) sum of independent harmonic oscillators, one for each n. The period of the fundamental mode (n = 1) is 2π/ω = 2π·L/(πc) = 2L/c. Thus, averaging over a period, the average kinetic energy is
\[\bar K = \frac{c}{2L}\int_0^{2L/c}K(t)\,dt = \bar V = \frac{c}{2L}\int_0^{2L/c}V(t)\,dt = \frac{E}{2}.\]
Hence we have an equipartition of the energy between the kinetic and potential energy.
P. 6-37
Suppose φ : Ω × [0, ∞) → R obeys the wave equation ∂²φ/∂t² = c²∇²φ inside Ω × (0, ∞), and is fixed at the boundary. Then E is constant.
Suppose φ₁ and φ₂ are two such solutions. Then ψ = φ₁ − φ₂ obeys the wave equation ∂²ψ/∂t² = c²∇²ψ with ψ|_{∂Ω×[0,∞)} = ψ|_{Ω×{0}} = ∂ψ/∂t|_{Ω×{0}} = 0. Consider the energy (per µ)
\[E_\psi(t) = \frac{1}{2}\int_\Omega\left(\left(\frac{\partial\psi}{\partial t}\right)^2 + c^2\,\nabla\psi\cdot\nabla\psi\right)dV.\]
Then since ψ obeys the wave equation with fixed boundary conditions, we know E_ψ is constant. Initially, at t = 0, we know that ψ = ∂ψ/∂t = 0, so E_ψ(0) = 0. At time t, we have
\[E_\psi = \frac{1}{2}\int_\Omega\left(\left(\frac{\partial\psi}{\partial t}\right)^2 + c^2(\nabla\psi)\cdot(\nabla\psi)\right)dV = 0.\]
Hence we must have ∂ψ/∂t = 0 and ∇ψ = 0. So ψ is constant; since it is 0 at the beginning, it is always 0.
E. 6-39
<Vibrations of a circular membrane> Consider Ω = {(x, y) ∈ R² : x² + y² ≤ 1}, and let φ(r, θ, t) solve
\[\frac{1}{c^2}\frac{\partial^2\phi}{\partial t^2} = \nabla^2\phi = \frac{1}{r}\frac{\partial}{\partial r}\left(r\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2}\frac{\partial^2\phi}{\partial\theta^2},\]
with the boundary condition φ|_{∂Ω} = 0. We can imagine this as a drum, where the membrane can freely oscillate with the boundary fixed. Separating variables with φ(r, θ, t) = T(t)R(r)Θ(θ), we get T″ = −c²λT, Θ″ = −µΘ and a radial equation. Then as before, T and Θ are both sine and cosine waves. Since we are in polar coordinates, we need φ(t, r, θ + 2π) = φ(t, r, θ), so we must have µ = m² for some m ∈ N. Then the radial equation becomes r(rR′)′ + (r²λ − m²)R = 0, which is Bessel's equation of order m in the variable √λ r. So we have R(r) = a_m J_m(√λ r) + b_m Y_m(√λ r).
Since we want regularity at r = 0, we need b_m = 0 for all m. To satisfy the boundary condition φ|_{∂Ω} = 0, we must choose √λ = k_{mi}, where k_{mi} is the ith root of J_m. Hence the general solution is
\[\phi(t,r,\theta) = \sum_{i=1}^{\infty}\big(A_{0i}\sin(k_{0i}ct) + B_{0i}\cos(k_{0i}ct)\big)J_0(k_{0i}r)\]
\[+ \sum_{m=1}^{\infty}\sum_{i=1}^{\infty}\big(A_{mi}\cos(m\theta) + B_{mi}\sin(m\theta)\big)\sin(k_{mi}ct)J_m(k_{mi}r)\]
\[+ \sum_{m=1}^{\infty}\sum_{i=1}^{\infty}\big(C_{mi}\cos(m\theta) + D_{mi}\sin(m\theta)\big)\cos(k_{mi}ct)J_m(k_{mi}r).\]
We can differentiate this w.r.t. t, set t = 0, and multiply by J_0(k_{0j}r)r to obtain
\[\sum_{i=1}^{\infty}k_{0i}c\,A_{0i}\int_0^1 J_0(k_{0i}r)J_0(k_{0j}r)\,r\,dr = \int_0^1 g(r)J_0(k_{0j}r)\,r\,dr.\]
Note that the frequencies come from the roots of the Bessel functions, and are not evenly spaced. This is different from, say, string instruments, where the frequencies are evenly spaced. So drums sound different from strings.
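We can see the uneven spacing directly from the roots of J₀ (a quick scipy sketch):

```python
import numpy as np
from scipy.special import jn_zeros

roots = jn_zeros(0, 4)         # k_{01}, ..., k_{04}: first four roots of J_0
ratios = roots[1:] / roots[0]  # for a string these would be the integers 2, 3, 4

# The overtone ratios are not integers, unlike a vibrating string:
assert not np.allclose(ratios, np.round(ratios), atol=1e-3)
```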
E. 6-40
So far, we have used separation of variables to solve our differential equations. It
worked in our examples, but there are a few issues with it. Of course, we have the
problem of whether it converges or not, but there is a more fundamental problem.
To perform separation of variables, we need to pick a good coordinate system,
such that the boundary conditions come in a nice form. However, in real life, our
domain might have a weird shape, and we cannot easily find good coordinates for
it.
Mark Kac asked the interesting question "can you hear the shape of a drum?": suppose we know all the frequencies of the modes of oscillation on some domain Ω. Can we tell what Ω is like? The answer is no, and one can explicitly construct two distinct drums that sound the same. However, if we require Ω to be convex with real analytic boundary, then yes! For example, we can recover the area: let N(λ₀) be the number of eigenfrequencies less than λ₀. Then we can show that
\[4\pi^2\lim_{\lambda_0\to\infty}\frac{N(\lambda_0)}{\lambda_0} = \operatorname{Area}(\Omega).\]
6.4 Distributions
When performing separation of variables, we first find some particular solutions of the form, say, X(x)Y(y)Z(z). We know that these solve, say, the wave equation. However, what we do next is take an infinite sum of these functions. First of all, how can we be sure that this converges at all? Even if it did, how do we know that the sum satisfies the differential equation? As we have seen with Fourier series, an infinite sum of continuous functions can be discontinuous; if it is not even continuous, how can we say it is a solution of a differential equation? Hence, at first, people were rather skeptical of this method.
Quite remarkably, the most fruitful way forward has turned out not to be to restrict ourselves to functions differentiable enough to ease these concerns, but rather to generalize the very notion of a function itself, with the aim of finding the right class of objects for which the method always makes sense. Generalized functions were introduced into mathematics by Sobolev and Schwartz. They are designed to fulfil an apparently contradictory pair of requirements: they are sufficiently well behaved that they are infinitely differentiable, and thus have a chance of satisfying partial differential equations; yet at the same time they can be arbitrarily singular: neither smooth, nor differentiable, nor continuous, nor even finite if interpreted naively as 'ordinary functions'. These generalized functions are called distributions, a name inspired by the distribution of singular sources represented by the Dirac delta distribution.
D. 6-41
• For Ω ⊆ Rⁿ, write D(Ω) for the set of functions Ω → R which are smooth (i.e. infinitely differentiable) and have compact support (i.e. are identically zero outside some compact set). Alternatively, D(Ω) is the set of functions Rⁿ → R which are smooth and have compact support in Ω (i.e. are zero outside some compact set K ⊆ Ω).
• Distributions are a class of linear functionals that map a space of test functions (conventional, well-behaved functions) to the real numbers. Below we will take D(Ω) as the space of test functions; distributions are elements of the dual space (D(Ω))*, which here we will write as D′(Ω).
As in a usual vector space, given distributions T₁, T₂ and constants λ, µ, the distribution λT₁ + µT₂ is given by (λT₁ + µT₂)[φ] = λT₁[φ] + µT₂[φ]. In addition, given a smooth function ψ ∈ C^∞(Ω) and a distribution T, we define the distribution ψT by (ψT)[φ] = T[ψφ].⁶
• Given an ordinary function f : Ω → R that is locally integrable (i.e. integrable over any compact region), define the distribution T_f by T_f[φ] = ⟨f, φ⟩ = ∫_Ω f(x)φ(x) dV. Sometimes T_f[φ] is simply written as f[φ].
Note that this is a linear map since integration is linear (and multiplication is commutative and distributes over addition). Also, this integral is guaranteed to be well-defined even when Ω is non-compact (say, the whole of Rⁿ), since φ has compact support and f is locally integrable. Note also that, unlike the test functions, f itself need not have compact support. When the context is clear, we may write T_f simply as f.
• By analogy with the above, we often abuse notation and write δ[φ] = ∫_Ω δ(x)φ(x) dV, pretending δ(x) is an actual function (more precisely, pretending that there is an ordinary function δ(x) that gives rise to the Dirac delta distribution), as we did in part IA. Of course, this cannot really be the case: if it were, we would need δ(x) = 0 whenever x ≠ 0, since δ[φ] = φ(0) depends only on what happens at 0; but then this integral would just give 0 if δ(0) ∈ R. Some people like to think of δ as a function that is zero everywhere except at the origin, where it is "infinitely large". Formally, the Dirac delta should be thought of as a distribution.
• Although distributions can be arbitrarily singular and insane, we can nonetheless define all their derivatives, via T′[φ] = −T[φ′]. This is motivated by the case of regular functions, where we would want T_f′ = T_{f′}: in one dimension (i.e. Ω ⊆ R), integrating by parts gives
\[T_{f'}[\varphi] = \int_\Omega f'(x)\varphi(x)\,dx = -\int_\Omega f(x)\varphi'(x)\,dx = -T_f[\varphi'],\]
the boundary term vanishing since φ has compact support.
⁶ In general there is no way to multiply two distributions together.
where the step functions provide the jumps in the sawtooth. Then differentiating this term by term gives
\[2\sum_{n=1}^{\infty}(-1)^{n+1}\cos(nx) = 1 - 2\pi\sum_{n\in\mathbb{Z}}\delta\big(x - (2n+1)\pi\big),\]
the δ functions recording the downward jumps of 2π at the odd multiples of π.
• The previous two are special cases of the following: suppose f(x) is a continuously differentiable function with isolated simple zeros at x_i. Then near any of its zeros x_i, we have f(x) ≈ (x − x_i) f′(x_i). Then
\[\int_{-\infty}^{\infty}\delta(f(x))\varphi(x)\,dx = \sum_{i=1}^{n}\int_{-\infty}^{\infty}\delta\big((x-x_i)f'(x_i)\big)\varphi(x)\,dx = \sum_{i=1}^{n}\frac{1}{|f'(x_i)|}\varphi(x_i).\]
E. 6-45
• Generalized functions can occur as limits of sequences of normal functions. For example, the family of functions
\[G_n(x) = \frac{n}{\sqrt{\pi}}e^{-n^2x^2}\]
are smooth for any finite n, and G_n[φ] → δ[φ] for any φ. It thus makes sense to define δ′[φ] = −δ[φ′] = −φ′(0), as this is the limit lim_{n→∞} ∫_Ω G_n′(x)φ(x) dx. It is often convenient to think of δ(x) as lim_{n→∞} G_n(x), and δ′(x) = lim_{n→∞} G_n′(x), etc., despite the fact that these limits do not exist as functions.
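This limiting behaviour is easy to see numerically: integrating G_n against a smooth function approaches evaluating it at 0. A sketch (cos is not compactly supported, but the Gaussian decay makes the truncated integral accurate):

```python
import numpy as np
from scipy.integrate import quad

def G(n, x):
    # G_n(x) = (n/sqrt(pi)) exp(-n^2 x^2), a smooth approximation to delta
    return n/np.sqrt(np.pi) * np.exp(-(n*x)**2)

phi = np.cos          # smooth test function, with phi(0) = 1
vals = [quad(lambda x: G(n, x)*phi(x), -1.0, 1.0)[0] for n in (2, 20)]

assert abs(vals[1] - 1.0) < abs(vals[0] - 1.0)   # larger n: closer to phi(0)
assert abs(vals[1] - 1.0) < 1e-3
```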
• We can also expand the δ-function in a basis of eigenfunctions. Suppose we live in the interval [−L, L], and write a Fourier expansion δ(x) = ∑_{n∈Z} δ̂_n e^{inπx/L} with
\[\hat\delta_n = \frac{1}{2L}\int_{-L}^{L}e^{-in\pi x/L}\delta(x)\,dx = \frac{1}{2L}.\]
So we have
\[\delta(x) = \frac{1}{2L}\sum_{n\in\mathbb{Z}}e^{in\pi x/L}.\]
This does make sense as a distribution. Consider the partial sum S_N δ(x). Then
\[\lim_{N\to\infty}\int_{-L}^{L}S_N\delta(x)\varphi(x)\,dx = \lim_{N\to\infty}\int_{-L}^{L}\frac{1}{2L}\sum_{n=-N}^{N}e^{in\pi x/L}\varphi(x)\,dx\]
\[= \lim_{N\to\infty}\sum_{n=-N}^{N}\frac{1}{2L}\int_{-L}^{L}e^{in\pi x/L}\varphi(x)\,dx = \lim_{N\to\infty}\sum_{n=-N}^{N}\hat\varphi_{-n} = \lim_{N\to\infty}\sum_{n=-N}^{N}\hat\varphi_n e^{in\pi\cdot 0/L} = \varphi(0),\]
since the Fourier series of the smooth function φ(x) converges for all x ∈ [−L, L].
• We can equally well expand δ(x) in terms of any other set of orthonormal eigenfunctions. Let {y_n(x)} be a complete set of eigenfunctions on [a, b] that are orthogonal with respect to a weight function w(x). Then we can write δ(x − ξ) = ∑_n c_n y_n(x) with
\[c_n = \int_a^b y_n^*(x)\delta(x-\xi)w(x)\,dx = y_n^*(\xi)w(\xi).\]
So
\[\delta(x-\xi) = w(\xi)\sum_n y_n^*(\xi)y_n(x) = w(x)\sum_n y_n^*(\xi)y_n(x),\]
the two forms agreeing since δ(x − ξ) is supported at x = ξ.
C. 6-46
<Green's functions> One of the main uses of the δ function is the Green's function. Suppose we wish to solve the 2nd order ordinary differential equation Ly = f on [a, b] (where a, b may be ±∞ respectively), where f(x) is a bounded forcing term and L is a differential operator
\[L = \alpha(x)\frac{\partial^2}{\partial x^2} + \beta(x)\frac{\partial}{\partial x} + \gamma(x),\]
where α, β, γ are continuous with α non-zero except perhaps at a finite number of isolated points of [a, b]. We now define a Green's function G(x, ξ) of L to be any solution (it might not be unique) of the problem LG = δ(x − ξ).
Given G(x, ξ), if we define y(x) = ∫_a^b G(x, ξ)f(ξ) dξ, then
\[Ly = \int_a^b LG(x,\xi)\,f(\xi)\,d\xi = \int_a^b \delta(x-\xi)f(\xi)\,d\xi = f(x).\]
E. 6-47
Use a Green's function to solve Ly = f on [a, b] with boundary conditions y(a) = y(b) = 0.
It would be enough to find the Green's function G(x, ξ) obeying the homogeneous boundary conditions G(a, ξ) = G(b, ξ) = 0, which would give us the (unique) solution y(x) = ∫_a^b G(x, ξ)f(ξ) dξ satisfying the required boundary conditions.
Note that LG(x, ξ) = 0 whenever x ≠ ξ. Thus, for both x < ξ and x > ξ, we can express G in terms of solutions of the homogeneous equation Ly = 0. Suppose that {y₁(x), y₂(x)} is a basis of linearly independent solutions of Ly = 0 on [a, b]. We define this basis by requiring that y₁(a) = 0 and y₂(b) = 0; that is, each of y₁ and y₂ obeys one of the homogeneous boundary conditions. Such y₁ and y₂ are unique up to multiplication by constants. Therefore we must have
\[G(x,\xi) = \begin{cases}A(\xi)y_1(x) & a \le x < \xi\\ B(\xi)y_2(x) & \xi < x \le b.\end{cases}\]
So we have a whole family of solutions. To fix the coefficients, we must decide how to join these solutions together across x = ξ.
If G(x, ξ) were discontinuous at x = ξ, then ∂ₓG|_{x=ξ} would involve a δ function, while ∂ₓ²G|_{x=ξ} would involve the derivative of the δ function. This is no good, since nothing in LG = δ(x − ξ) can balance a δ′. So G(x, ξ) must be everywhere continuous. Continuity at x = ξ, together with the jump condition [∂ₓG]_{x=ξ} = 1/α(ξ) obtained by integrating LG = δ(x − ξ) across x = ξ, then requires
\[A(\xi) = \frac{y_2(\xi)}{\alpha(\xi)W(\xi)},\qquad B(\xi) = \frac{y_1(\xi)}{\alpha(\xi)W(\xi)},\]
where W = y₁y₂′ − y₂y₁′ is the Wronskian.
\[\mu\frac{\partial^2 y}{\partial t^2} = T\frac{\partial^2 y}{\partial x^2} + \mu g,\]
as can be seen by slightly altering the derivation in [C.6.3.4]. We look for the steady state solution ẏ = 0 (i.e. the shape of the string at rest) obeying y(0, t) = y(L, t) = 0. In this case the above equation reduces to
\[\frac{\partial^2 y}{\partial x^2} = -\frac{\mu(x)}{T}g.\]
We look for a Green's function obeying ∂²G/∂x² = −δ(x − ξ). This can be interpreted as the contribution of a pointlike mass T/g located at x = ξ, or in other words the solution of ∂²y/∂x² = −µ(x)g/T with µ(x) = δ(x − ξ)T/g (i.e. under a point mass). The homogeneous equation y″ = 0 gives y = Ax + B(x − L). So we get
\[G(x,\xi) = \begin{cases}A(\xi)x & 0 \le x < \xi\\ B(\xi)(x-L) & \xi < x \le L.\end{cases}\]
Continuity and the jump condition at x = ξ fix A(ξ) = (ξ − L)/L and B(ξ) = ξ/L, so
\[G(x,\xi) = \frac{\xi - L}{L}\,x\,\Theta(\xi - x) + \frac{\xi}{L}(x - L)\,\Theta(x - \xi).\]
[Figure: the graph of G(x, ξ) on [0, L], a piecewise linear profile with a kink at x = ξ.] Notice that ξ − L is always negative, so the first term has a negative slope, while ξ is always positive, so the second term has a positive slope.
We can model the string with arbitrary mass per unit length µ(x) as having many pointlike particles of mass m_i = ∫_{x_i}^{x_i+Δx} µ(x) dx at small separations Δx along the string. So we get the solution
\[y(x) = \sum_{i=1}^{L/\Delta x}G(x,x_i)\frac{m_i g}{T}\;\to\;\int_0^L G(x,\xi)\frac{\mu(\xi)g}{T}\,d\xi\quad\text{in the limit }\Delta x\to 0.\]
This is what Green's functions are supposed to do. In general, we can think of the Green's function as the solution for a point source; since an arbitrary forcing term is a weighted sum of point sources, we can reconstruct the solution for it as the correspondingly weighted sum of Green's functions.
E. 6-50
Use Green’s function to solve Ly = f (t) subject to y(t0 ) = y 0 (t0 ) = 0.
Now instead of having two boundaries, we have one boundary and restrict both
the value and the derivative at this boundary. Note that this boundary condition
is still homogeneous. Similar to before, if we can find a Green's function G(t, τ) satisfying G(t₀, τ) = G′(t₀, τ) = 0, then the solution y constructed from G(t, τ) would obey y(t₀) = y′(t₀) = 0.
As before, let y1 (t), y2 (t) be any basis of solutions to Ly = 0. The Green’s function
obeys L(G) = δ(t − τ). We can write our Green's function as
$$G(t,\tau) = \begin{cases} A(\tau)y_1(t) + B(\tau)y_2(t) & t_0 \le t < \tau \\ C(\tau)y_1(t) + D(\tau)y_2(t) & t > \tau. \end{cases}$$
Imposing G(t₀, τ) = G′(t₀, τ) = 0 on the region t < τ forces A(τ) = B(τ) = 0, since y₁ and y₂ are linearly independent. So when τ > t, the Green's function G(t, τ) = 0. Thus the solution y(t) depends on the forcing term f(τ) only for times τ < t. This expresses causality!
E. 6-51
Suppose we have ÿ + y = f (t) with y(0) = ẏ(0) = 0. Then we have
$$G(t,\tau) = \Theta(t-\tau)\bigl(C(\tau)\cos(t-\tau) + D(\tau)\sin(t-\tau)\bigr)$$
for some C(τ), D(τ). The continuity and jump conditions give D(τ) = 1, C(τ) = 0. So we get G(t, τ) = Θ(t − τ) sin(t − τ), and hence
$$y(t) = \int_0^t \sin(t-\tau)\,f(\tau)\,d\tau.$$
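A quick numerical sketch (Python; not part of the notes) of this formula, using the arbitrary choice f ≡ 1, for which the exact solution of ÿ + y = f with vanishing initial data is 1 − cos t:

```python
import numpy as np

# Causal Green's function solution y(t) = ∫_0^t sin(t − τ) f(τ) dτ,
# checked against the exact solution 1 − cos t for f ≡ 1.

def y_green(t, f, n=2000):
    tau = np.linspace(0.0, t, n)
    w = np.sin(t - tau) * f(tau)
    return np.sum((w[1:] + w[:-1]) / 2 * np.diff(tau))   # trapezoid rule

ts = np.linspace(0.1, 10.0, 25)
approx = np.array([y_green(t, np.ones_like) for t in ts])
exact = 1 - np.cos(ts)
print(np.max(np.abs(approx - exact)))
```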
E. 6-52
<Eigenfunction expansion of Green’s function>
If L is a Sturm–Liouville operator, then we can expand $G(x,\xi) = \sum_{n\in\mathbb N}\hat G_n(\xi)\,y_n(x)$, where {yₙ} is a basis of w-orthonormal eigenfunctions of L with eigenvalues λₙ. We have
$$\delta(x-\xi) = LG(x,\xi) = \sum_{n\in\mathbb N}\hat G_n(\xi)\,Ly_n(x) = w(x)\sum_{n\in\mathbb N}\hat G_n(\xi)\,\lambda_n y_n(x).$$
Multiplying by $y_m^*(x)$ and integrating,
$$\sum_{n\in\mathbb N}\hat G_n(\xi)\lambda_n\int_a^b y_m^*(x)y_n(x)w(x)\,dx = \int_a^b y_m^*(x)\,\delta(x-\xi)\,dx = y_m^*(\xi)$$
$$\implies \hat G_m(\xi) = \frac{y_m^*(\xi)}{\lambda_m} \quad\text{and so}\quad G(x,\xi) = \sum_{n\in\mathbb N}\frac{y_n^*(\xi)}{\lambda_n}\,y_n(x).$$
• The Fourier transform of an absolutely integrable function f : R → C is
$$\tilde f(k) = \mathcal F[f(x)](k) = \int_{-\infty}^{\infty} e^{-ikx} f(x)\,dx.^7$$
• The convolution of functions f, g : R → C is
$$f * g(x) = \int_{-\infty}^{\infty} f(x-y)\,g(y)\,dy.$$
⁷ Some authors use a different definition $\tilde f(k) = \int_{-\infty}^{\infty} e^{-2\pi i kx} f(x)\,dx$.
E. 6-54
Note that for any k, we have
$$|\tilde f(k)| = \left|\int_{-\infty}^{\infty} e^{-ikx} f(x)\,dx\right| \le \int_{-\infty}^{\infty}|e^{-ikx} f(x)|\,dx = \int_{-\infty}^{\infty}|f(x)|\,dx.$$
Since we have assumed that our function is absolutely integrable, this is finite,
and the definition makes sense. Note also that
$$f*g(x) = \int_{-\infty}^{\infty} f(x-y)g(y)\,dy = \int_{-\infty}^{\infty} f(y)g(x-y)\,dy = g*f(x).$$
C. 6-55
<Properties of Fourier transform>
1. Linearity: If f, g : R → C are absolutely integrable and c1 , c2 are constants,
then F[c1 f (x) + c2 g(x)] = c1 F[f (x)] + c2 F[g(x)]. So F is a linear operator.
2. Translation:
$$\mathcal F[f(x-a)] = \int_{\mathbb R} e^{-ikx} f(x-a)\,dx = \int_{\mathbb R} e^{-ik(y+a)} f(y)\,dy = e^{-ika}\int_{\mathbb R} e^{-iky} f(y)\,dy = e^{-ika}\,\mathcal F[f(x)]$$
4. Scaling:
$$\mathcal F[f(cx)] = \int_{-\infty}^{\infty} e^{-ikx} f(cx)\,dx = \int_{-\infty}^{\infty} e^{-iky/c} f(y)\,\frac{dy}{|c|} = \frac{1}{|c|}\,\tilde f\!\left(\frac{k}{c}\right).$$
6. The most useful property of the Fourier transform is that it “turns differenti-
ation into multiplication”. Integrating by parts, we have
$$\mathcal F[f'(x)] = \int_{-\infty}^{\infty} e^{-ikx}\frac{df}{dx}\,dx = -\int_{-\infty}^{\infty}\frac{d}{dx}(e^{-ikx})\,f(x)\,dx = ik\int_{-\infty}^{\infty} e^{-ikx} f(x)\,dx = ik\,\mathcal F[f(x)]$$
Note that we don’t have any boundary terms since for the function to be
absolutely integrable, it has to decay to zero as we go to infinity. Conversely,
$$\mathcal F[xf(x)] = \int_{-\infty}^{\infty} e^{-ikx}\,x f(x)\,dx = i\frac{d}{dk}\int_{-\infty}^{\infty} e^{-ikx} f(x)\,dx = i\tilde f'(k).$$
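These properties can be checked by direct quadrature. A sketch (Python; not part of the notes), using the Gaussian f(x) = exp(−x²/2), whose transform √(2π)exp(−k²/2) is standard; the grid and the test values a, k are arbitrary choices:

```python
import numpy as np

# Quadrature check of the translation and differentiation properties on the
# Gaussian f(x) = exp(−x²/2), whose transform is f̃(k) = √(2π) exp(−k²/2).

x = np.linspace(-20, 20, 40001)
dx = x[1] - x[0]

def ft(values, k):                       # f̃(k) = ∫ e^{−ikx} f(x) dx
    return np.sum(np.exp(-1j * k * x) * values) * dx

a, k = 1.3, 0.7                          # arbitrary shift and frequency
ftilde = np.sqrt(2 * np.pi) * np.exp(-k**2 / 2)
f_shift = np.exp(-(x - a)**2 / 2)        # f(x − a)
fprime = -x * np.exp(-x**2 / 2)          # f'(x)

err_translation = abs(ft(f_shift, k) - np.exp(-1j * k * a) * ftilde)
err_derivative = abs(ft(fprime, k) - 1j * k * ftilde)
print(err_translation, err_derivative)
```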
T. 6-56
<Fourier inversion theorem> $f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{ikx}\tilde f(k)\,dk$.⁸
We will only give a non-rigorous proof: Recall that in the periodic case where f(x) = f(x + L), we have the Fourier series
$$f(x) = \sum_{n\in\mathbb Z}\hat f_n\,e^{2\pi i n x/L} \quad\text{where}\quad \hat f_n = \frac{1}{L}\int_{-L/2}^{L/2} e^{-2\pi i n u/L} f(u)\,du.$$
Taking the limit L → ∞, the allowed frequencies kₙ = 2πn/L become a continuum with spacing ∆k = 2π/L, and the sum goes over to the integral above.
This result says that we can express our original function f(x) in terms of its Fourier transform f̃(k), so we can write f(x) = F⁻¹[f̃(k)].
Nevertheless, note that the inverse Fourier transform looks very similar to the Fourier transform itself. We have $\mathcal F^{-1}[f(x)] = \frac{1}{2\pi}\mathcal F[f(-x)]$ and the duality property
$$\tilde f(k) = \mathcal F[f(x)] \iff f(-x) = \frac{1}{2\pi}\mathcal F[\tilde f(k)].$$
These are useful because they mean we can use our knowledge of Fourier transforms to compute inverse Fourier transforms. Note that this does not occur in the case of
the Fourier series. In the Fourier series, we obtain the coefficients by evaluating
an integral, and restore the original function by taking a discrete sum. These
operations are not symmetric.
⁸ This in fact requires f to be well behaved, satisfying certain conditions. Also, if we use the definition $\tilde f(k) = \int_{-\infty}^{\infty} e^{-2\pi i kx} f(x)\,dx$, then this result becomes $f(x) = \int_{-\infty}^{\infty} e^{2\pi i kx}\tilde f(k)\,dk$.
E. 6-57
<Fourier transform on differential equation> Suppose we have a differential
equation
p
X dr
L(∂)y = f where L(∂) = cr r
r=0
dx
is a differential operator of pth order with constant coefficients. Taking the Fourier
transform of both sides of the equation, we find F[L(∂)y] = F[f(x)] = f̃(k). The interesting part is the left hand side: since the Fourier transform turns differentiation into multiplication, we have
$$\mathcal F[L(\partial)y] = \sum_{r=0}^{p} c_r(ik)^r\,\tilde y(k) = L(ik)\,\tilde y(k).$$
Here L(ik) is a polynomial in ik. Thus taking the Fourier transform has changed
our ordinary differential equation into the algebraic equation L(ik)ỹ(k) = f˜(k).
Since L(ik) is just multiplication by a polynomial, we can immediately get
$$\tilde y(k) = \frac{\tilde f(k)}{L(ik)}.$$
For example, for the equation −y′′ + A²y = f we have L(ik) = k² + A², so
$$\tilde y(k) = \frac{\tilde f(k)}{k^2 + A^2}.$$
Consider $h(x) = \frac{1}{2\mu}e^{-\mu|x|}$ with µ > 0. Then
$$\tilde h(k) = \frac{1}{2\mu}\int_{-\infty}^{\infty} e^{-ikx} e^{-\mu|x|}\,dx = \frac{1}{\mu}\,\mathrm{Re}\int_0^\infty e^{-(\mu+ik)x}\,dx = \frac{1}{\mu}\,\mathrm{Re}\,\frac{1}{ik+\mu} = \frac{1}{\mu^2+k^2}.$$
Therefore ỹ(k) = h̃(k)f̃(k) with µ = A; since a product of Fourier transforms is the transform of a convolution, we get
$$y(x) = \frac{1}{2A}\int_{-\infty}^{\infty} e^{-A|x-u|}\,f(u)\,du.$$
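This convolution formula can be tested numerically (Python; not part of the notes) by manufacturing f from a known solution; the choice y(x) = exp(−x²) below is an arbitrary assumption for illustration:

```python
import numpy as np

# Check y(x) = (1/2A) ∫ e^{−A|x−u|} f(u) du solves −y'' + A²y = f.
# Manufacture f from y(x) = exp(−x²): then f(x) = (2 − 4x² + A²) exp(−x²).

A = 1.5
u = np.linspace(-15, 15, 60001)
du = u[1] - u[0]
f = (2 - 4 * u**2 + A**2) * np.exp(-u**2)

xs = np.array([-2.0, -0.5, 0.0, 1.0, 2.5])
y_num = np.array([np.sum(np.exp(-A * np.abs(x - u)) * f) * du / (2 * A)
                  for x in xs])
y_exact = np.exp(-xs**2)
print(np.max(np.abs(y_num - y_exact)))
```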
E. 6-58
For φ : Rⁿ → C, suppose we have the equation ∇²φ − m²φ = ρ(x). We define the n-dimensional Fourier transform by
$$\mathcal F[\phi(\mathbf x)](\mathbf k) = \tilde\phi(\mathbf k) = \int_{\mathbb R^n} e^{-i\mathbf k\cdot\mathbf x}\,\phi(\mathbf x)\,dV.$$
Since each ∂/∂xⱼ becomes multiplication by ikⱼ, transforming the equation gives (−|k|² − m²)φ̃(k) = ρ̃(k), ie.
$$\tilde\phi(\mathbf k) = -\frac{\tilde\rho(\mathbf k)}{|\mathbf k|^2 + m^2},$$
and the inverse transform carries a factor of 1/(2π)ⁿ. Note that we have (2π)ⁿ instead of 2π since we get a factor of 2π for each dimension (and the negative sign was just brought down from the original expression). Using F⁻¹[f̃(k)g̃(k)] = f ∗ g(x), we have
$$\phi(\mathbf x) = \mathcal F^{-1}[\tilde\phi(\mathbf k)] = \rho * \mathcal F^{-1}\!\left[\frac{-1}{|\mathbf k|^2 + m^2}\right].$$
T. 6-59
<Parseval’s theorem> Suppose f, g : R → C are sufficiently well-behaved that
f˜ and g̃ exist and that F −1 [f˜] = f and F −1 [g̃] = g. Then
$$\langle f, g\rangle \overset{\text{def}}{=} \int_{\mathbb R} f^*(x)\,g(x)\,dx = \frac{1}{2\pi}\langle\tilde f, \tilde g\rangle.$$
In particular, ‖f‖² = (1/2π)‖f̃‖².
$$\langle f, g\rangle = \int_{-\infty}^{\infty} f^*(x)\,\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{ikx}\tilde g(k)\,dk\,dx = \frac{1}{2\pi}\int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} f^*(x)e^{ikx}\,dx\right)\tilde g(k)\,dk$$
$$= \frac{1}{2\pi}\int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} f(x)e^{-ikx}\,dx\right)^{\!*}\tilde g(k)\,dk = \frac{1}{2\pi}\int_{-\infty}^{\infty}\tilde f^*(k)\,\tilde g(k)\,dk = \frac{1}{2\pi}\langle\tilde f, \tilde g\rangle.$$
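A quick quadrature check of Parseval's theorem (Python; not part of the notes), again on the Gaussian, for which both sides should equal √π:

```python
import numpy as np

# Parseval check on f(x) = exp(−x²/2): ∫|f|² dx = √π, and with
# f̃(k) = √(2π) exp(−k²/2) we get (1/2π) ∫|f̃|² dk = √π as well.

grid = np.linspace(-20, 20, 20001)       # used for both x and k
d = grid[1] - grid[0]
f = np.exp(-grid**2 / 2)
ftilde = np.sqrt(2 * np.pi) * np.exp(-grid**2 / 2)

lhs = np.sum(np.abs(f)**2) * d
rhs = np.sum(np.abs(ftilde)**2) * d / (2 * np.pi)
print(lhs, rhs, np.sqrt(np.pi))
```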
E. 6-60
Suppose f(x) is defined by
$$f(x) = \begin{cases} 1 & |x| < 1 \\ 0 & |x| \ge 1. \end{cases}$$
This function looks rather innocent. Sure it has discontinuities at ±1, but these are
not too absurd, and we are just integrating. This is certainly absolutely integrable.
We can easily compute the Fourier transform as
$$\tilde f(k) = \int_{-\infty}^{\infty} e^{-ikx} f(x)\,dx = \int_{-1}^{1} e^{-ikx}\,dx = \frac{2\sin k}{k}.$$
The inversion theorem then claims that
$$f(x) = \mathcal F^{-1}\!\left[\frac{2\sin k}{k}\right] = \frac{1}{\pi}\int_{-\infty}^{\infty} e^{ikx}\,\frac{\sin k}{k}\,dk.$$
This is hard to do. So let’s first see if this function is absolutely integrable. We
have
$$\int_{-\infty}^{\infty}\left|e^{ikx}\,\frac{\sin k}{k}\right|dk = \int_{-\infty}^{\infty}\left|\frac{\sin k}{k}\right|dk = 2\int_0^{\infty}\frac{|\sin k|}{k}\,dk \ge 2\sum_{n=0}^{N}\int_{(n+1/4)\pi}^{(n+3/4)\pi}\frac{|\sin k|}{k}\,dk.$$
The idea here is instead of looking at the integral over the whole real line, we just pick out segments of it. In these small segments, we know that |sin k| ≥ 1/√2. So we can bound this by
$$2\sum_{n=0}^{N}\frac{1}{\sqrt 2}\int_{(n+1/4)\pi}^{(n+3/4)\pi}\frac{dk}{k} \ge \sqrt 2\sum_{n=0}^{N}\int_{(n+1/4)\pi}^{(n+3/4)\pi}\frac{dk}{(n+1)\pi} = \frac{\sqrt 2}{2}\sum_{n=0}^{N}\frac{1}{n+1},$$
which diverges as N → ∞. So sin(k)/k is not absolutely integrable, even though it arose as the Fourier transform of an absolutely integrable function.
We see that highly localized functions in x-space have very spread-out behaviour
in k-space, and vice versa.
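The divergence estimated above is easy to see numerically (Python; not part of the notes): the partial integrals grow like a logarithm.

```python
import numpy as np

# Partial integrals ∫_0^{Nπ} |sin k|/k dk grow like (2/π) ln N, so sin(k)/k
# is not absolutely integrable, matching the estimate above.

def partial_integral(N, pts_per_pi=500):
    k = np.linspace(1e-9, N * np.pi, N * pts_per_pi)
    w = np.abs(np.sin(k)) / k
    return np.sum((w[1:] + w[:-1]) / 2 * np.diff(k))   # trapezoid rule

for N in (10, 100, 1000):
    print(N, partial_integral(N))        # grows without bound
```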
E. 6-62
More formally, we define the Fourier transform of distributions as follows: Recall that given an ordinary function g, we have the distribution $T_g[\phi] = g[\phi] = \langle g, \phi\rangle = \int_\Omega g(x)\phi(x)\,dx$. Parseval's theorem tells us that ⟨g, φ⟩ = (1/2π)⟨F[g], F[φ]⟩, equivalently ⟨F[g], χ⟩ = 2π⟨g, F⁻¹[χ]⟩. In light of this, for any distribution g (not just those derived from ordinary functions), we define the distribution Fg to be
$$(\mathcal Fg)[\chi] = 2\pi\,g[\mathcal F^{-1}[\chi]].$$
Using this, we have
$$(\mathcal F\delta)[\phi] = 2\pi\,\delta[\mathcal F^{-1}[\phi]] = 2\pi\,\mathcal F^{-1}[\phi](0) = \int_{-\infty}^{\infty} e^{ik\cdot 0}\phi(k)\,dk = \langle 1, \phi\rangle = 1[\phi],$$
in agreement with what we got before. Conversely, we have $(\mathcal F1)[\phi] = 2\pi\,1[\mathcal F^{-1}[\phi]] = 2\pi\langle 1, \mathcal F^{-1}[\phi(x)]\rangle = \langle 1, \mathcal F[\phi(-x)]\rangle = (\mathcal F\delta)[\mathcal F[\phi(-x)]] = 2\pi\delta[\phi(-x)] = 2\pi\phi(0)$. Hence (F⁻¹1)[φ] = δ[φ]. So in the world of distributions we have $\int_{-\infty}^{\infty} e^{-ikx}\,dx = 2\pi\delta(k)$.
Consider the step function Θ(x) = I[x > 0], where I is the indicator function. Define Θ_ε(x) = Θ(x)e^{−εx}. Note that Θ(x) = lim_{ε→0⁺} Θ_ε(x). Now
$$\mathcal F[\Theta_\varepsilon] = \int_{-\infty}^{\infty} e^{-ikx}\Theta_\varepsilon(x)\,dx = \int_0^{\infty} e^{-(\varepsilon+ik)x}\,dx = \frac{1}{\varepsilon+ik}.$$
The presence of ε is important to ensure convergence of the integral. However, F[Θ] is in fact not just 1/(ik). To understand what F[Θ] is, we let it act on a test function: for any δ > 0 we have
$$(\mathcal F\Theta)[\phi] = \lim_{\varepsilon\to 0^+}\int_{-\infty}^{\infty}\frac{\phi(k)}{\varepsilon+ik}\,dk = \int_{|k|>\delta}\frac{\phi(k)}{ik}\,dk + \lim_{\varepsilon\to 0^+}\int_{-\delta}^{\delta}\left(\frac{\phi(k)-\phi(0)}{\varepsilon+ik} + \frac{\phi(0)}{\varepsilon+ik}\right)dk,$$
and one can check the last term contributes πφ(0) in the limit. This is sometimes written as F[Θ](k) = p.v.(ik)⁻¹ + πδ(k), where the letters p.v. stand for the (Cauchy) principal value and mean that we should exclude the point k = 0 from any integral containing this term. What happens at k = 0 is instead governed by the δ-function.
E. 6-63
<Linear systems and response functions> Suppose we have an amplifier
that modifies an input signal I(t) to produce an output O(t). Typically, amplifiers
work by modifying the amplitudes and phases of specific frequencies in the output.
By Fourier’s inversion theorem, we know
$$I(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{i\omega t}\tilde I(\omega)\,d\omega.$$
This Ĩ(ω) is the resolution of I(t). We specify what the amplifier does by the transfer function R̃(ω), such that the output is given by
$$O(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{i\omega t}\,\tilde R(\omega)\tilde I(\omega)\,d\omega.$$
Since this R̃(ω)Ĩ(ω) is a product, on computing O(t) = F⁻¹[R̃(ω)Ĩ(ω)] we obtain a convolution
$$O(t) = \int_{-\infty}^{\infty} R(t-u)\,I(u)\,du, \quad\text{where}\quad R(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{i\omega t}\tilde R(\omega)\,d\omega$$
is the response function. By plugging it directly into the equation above, we see that R(t) is the output O(t) of the system when the input has Ĩ(ω) = 1 – in other words, when the input signal is I(t) = δ(t). Note that causality implies that the
amplifier cannot “respond” before any input has been given. So we must have
R(t) = 0 for all t < 0. Assume that we only start providing input at t = 0. Then
$$O(t) = \int_{-\infty}^{\infty} R(t-u)I(u)\,du = \int_0^t R(t-u)I(u)\,du.$$
This is exactly the same form of solution as we found for initial value problems
with the response function R(t) playing the role of the Green’s function.
................................................................................
<General form of transfer function> To model the situation, suppose the
amplifier’s operation is described by the ordinary differential equation
$$I(t) = L_m O(t), \quad\text{where}\quad L_m = \sum_{i=0}^{m} a_i\frac{d^i}{dt^i}.$$
Using the fact $\mathcal F\!\left[\frac{dO}{dt}\right] = i\omega\tilde O(\omega)$, we have $\tilde I(\omega) = \sum_{j=0}^{m} a_j(i\omega)^j\tilde O(\omega)$. So we get
$$\tilde O(\omega) = \frac{\tilde I(\omega)}{a_0 + i\omega a_1 + \cdots + (i\omega)^m a_m} \implies \tilde R(\omega) = \frac{1}{a_0 + i\omega a_1 + \cdots + (i\omega)^m a_m}.$$
The denominator is a degree-m polynomial in iω, so it factorises over its roots c_j (with multiplicities k_j), giving
$$\tilde R(\omega) = \frac{1}{a_m}\prod_{j=1}^{J}\frac{1}{(i\omega - c_j)^{k_j}} = \sum_{j=1}^{J}\sum_{r=1}^{k_j}\frac{\Gamma_{rj}}{(i\omega - c_j)^r}$$
for some constants Γ_{rj} ∈ C, where we obtain the last equality by repeated use of partial fractions. By linearity of the (inverse) Fourier transform, we can find O(t) if we know the inverse Fourier transform of all functions of the form 1/(iω − α)^p.
Consider the function
$$h_p(t) = \begin{cases} \dfrac{t^p}{p!}\,e^{\alpha t} & t > 0 \\ 0 & \text{otherwise.} \end{cases}$$
One can check that, for Re α < 0, F[h_p](ω) = 1/(iω − α)^{p+1}. So the response function is a linear combination of these functions h_p(t) (if any of the roots c_j have non-negative real part, then it turns out the system is unstable, and is better analysed by the Laplace transform). We see that the response function does indeed vanish for all t < 0. In fact, each term (except h₀) increases from zero at t = 0 to rise to some maximum before eventually decaying exponentially.
E. 6-64
<Discrete Fourier transform> So far, we have done Fourier analysis over
some abelian groups. For example, we’ve done it over S 1 , which is an abelian
group under multiplication, and R, which is an abelian group under addition. We
will next look at Fourier analysis over another abelian group, Zm , known as the
discrete Fourier transform. Recall that the Fourier transform is defined as
$$\tilde f(k) = \int_{\mathbb R} e^{-ikx} f(x)\,dx.$$
To find the Fourier transform, we have to know how to perform the integral. If we
cannot do the integral, then we cannot do the Fourier transform. However, it is
usually difficult to perform integrals for even slightly more complicated functions.
A more serious problem is that we usually don’t even have a closed form of the
function f for us to perform the integral. In real life, f might be some radio signal,
and we are just given some data of what value f takes at different points. There
is no hope that we can do the integral exactly.
Instead, suppose f is negligible outside an interval, say [−R, S], and that we sample it at N evenly spaced points x_j = −R + j(R + S)/N for j = 0, …, N − 1. We can then approximate
$$\tilde f(k) \approx \frac{R+S}{N}\sum_{j=0}^{N-1} e^{-ikx_j} f(x_j).$$
This is just the Riemann sum. Similarly, our computer can only store the result f̃(k) for some finite list of k. Let's choose these to be at k = k_m = 2πm/(R + S). Then after some cancellation,
$$\tilde f(k_m) \approx \frac{R+S}{N}\,e^{ik_m R}\sum_{j=0}^{N-1} f(x_j)\,e^{-\frac{2\pi i}{N}jm} = (R+S)\,e^{\frac{2\pi i m R}{R+S}}\left(\frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\,\omega^{-jm}\right),$$
where $\omega = e^{\frac{2\pi i}{N}}$ is an N-th root of unity. Let $F(m) = \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\,\omega^{-jm}$. Of
course, we’ve thrown away lots of information about our original function f (x),
since we made approximations all over the place. For example, we have already lost all knowledge of structures varying more rapidly than our sampling interval (R + S)/N. Also, F(m + N) = F(m), since ω^N = 1. So we have "forced" some periodicity
into our function F , while f˜(k) itself was not necessarily periodic.
For the usual Fourier transform, we were able to re-construct our original function
from the f˜(k), but here we clearly cannot. However, if we know the F (m) for
all m = 0, 1, · · · , N − 1, then we can reconstruct the exact values of f (xj ) for
all j by just solving linear equations. To make this more precise, we want to
put what we’re doing into the linear algebra framework we’ve been using. Let
G = {1, ω, ω², ⋯, ω^{N−1}}. For our purposes below, we can just treat this as a discrete set, and the ωⁱ are just meaningless symbols that happen to visually look like the powers of our N-th roots of unity.
Consider a function g : G → C defined by g(ωʲ) = f(x_j). This is nothing but a new notation for f(x_j). Then using this new notation, we have
$$F(m) = \frac{1}{N}\sum_{j=0}^{N-1}\omega^{-jm}\,g(\omega^j).$$
Define an inner product on functions G → C by $\langle u, v\rangle = \frac{1}{N}\sum_{j=0}^{N-1} u^*(\omega^j)\,v(\omega^j)$, and let $e_n(\omega^j) = \omega^{jn}$. For n ≠ m, we have
$$\langle e_n, e_m\rangle = \frac{1}{N}\sum_{j=0}^{N-1} e_n^*(\omega^j)\,e_m(\omega^j) = \frac{1}{N}\sum_{j=0}^{N-1}\omega^{j(m-n)} = \frac{1}{N}\,\frac{1-\omega^{(m-n)N}}{1-\omega^{m-n}} = 0.$$
If we forget about our f's and just look at the g, what we have effectively done is take the Fourier transform of functions taking values on G = {1, ω, ⋯, ω^{N−1}} ≅ Z_N. This can be seen from $F(m) = \frac{1}{N}\sum_{j=0}^{N-1}\omega^{-jm} g(\omega^j)$. This is exactly analogous to what we did for the Fourier transform, except that everything is now discrete, and we don't have to worry about convergence since these are finite sums.
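As a sketch (Python; not part of the notes), the F(m) above agrees with NumPy's FFT up to the 1/N normalisation, since `np.fft.fft` computes the same sum without that factor:

```python
import numpy as np

# F(m) = (1/N) Σ_j g_j ω^{−jm} with ω = e^{2πi/N}. NumPy's fft computes the
# same sum without the 1/N factor, so F should equal np.fft.fft(g) / N.

rng = np.random.default_rng(0)
N = 16
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)

omega = np.exp(2j * np.pi / N)
j = np.arange(N)
F = np.array([np.sum(g * omega**(-j * m)) / N for m in range(N)])

diff = np.max(np.abs(F - np.fft.fft(g) / N))
print(diff)
```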
E. 6-65
<Fast Fourier transform> What we've said so far is we've defined
$$F(m) = \frac{1}{N}\sum_{j=0}^{N-1}\omega^{-mj}\,g(\omega^j).$$
To compute this directly, even given the values of ω^{−mj} for all j, m, this takes N − 1 additions and N + 1 multiplications. This is 2N operations for a single value of m. To know F(m) for all m, this takes 2N² operations.
This is a problem. Historically, during the cold war, people feared that the world would one day descend into a huge nuclear war. Countries decided to come up with a treaty to ensure nuclear tests were no longer performed. However, it is
difficult to monitor whether other countries are doing nuclear tests. Underground
nuclear tests are hard to detect.
They then came up with the idea to put some seismometers all around the world,
and record vibrations in the ground. To distinguish normal crust movements
from nuclear tests, they wanted to take the Fourier transform and analyze the
frequencies of the vibrations. However, they had a large amount of data, and the
value of N is on the order of magnitude of a few million. So 2N 2 will be a huge
number that the computers at that time were not able to handle. So they develop a
trick, known as the fast Fourier transform, to perform Fourier transforms quickly.
This is nothing new mathematically, but entirely a computational trick.
Now suppose N = 2M. We can write
$$F(m) = \frac{1}{2M}\sum_{j=0}^{2M-1}\omega^{-jm}g(\omega^j) = \frac{1}{2}\left(\frac{1}{M}\sum_{k=0}^{M-1}\bigl(\omega^{-2km}g(\omega^{2k}) + \omega^{-(2k+1)m}g(\omega^{2k+1})\bigr)\right).$$
We now let η be an M th root of unity, and define G(η k ) = g(ω 2k ) and H(η k ) =
g(ω 2k+1 ). We then have
$$F(m) = \frac{1}{2}\left(\frac{1}{M}\sum_{k=0}^{M-1}\eta^{-km}G(\eta^k) + \frac{\omega^{-m}}{M}\sum_{k=0}^{M-1}\eta^{-km}H(\eta^k)\right) = \frac{1}{2}\bigl[\tilde G(m) + \omega^{-m}\tilde H(m)\bigr].$$
Suppose we are given the values of G̃(m), H̃(m) and ω^{−m} for all m ∈ {0, ⋯, N − 1}. Then we can compute F(m) using 3 operations per value of m, so 3N = 6M operations for all m.
We can compute ω^{−m} for all m using at most 2N operations, and suppose it takes P_M operations to compute the transform G̃(m) (or equivalently H̃(m)) for all m. Then the number of operations needed to compute F(m) is P_{2M} = 2P_M + 6M + 2M. Now let N = 2ⁿ. Then by iterating this, we can find P_N ≤ 4N log₂ N ≪ N². So by breaking the Fourier transform apart, we are able to greatly reduce the number of operations needed to compute the Fourier transform.
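The recursion above can be sketched directly (Python; not part of the notes). Here `fft_rec` returns the un-normalised transform Σⱼ gⱼω^{−jm}; divide by N to obtain F(m) as defined in the notes:

```python
import numpy as np

# Radix-2 Cooley–Tukey sketch of F(m) = (1/2)[G̃(m) + ω^{−m} H̃(m)], where
# G̃, H̃ are the length-M transforms of the even- and odd-indexed samples.

def fft_rec(g):
    N = len(g)                            # N must be a power of two
    if N == 1:
        return np.asarray(g, dtype=complex)
    G = fft_rec(g[0::2])                  # transform of even samples
    H = fft_rec(g[1::2])                  # transform of odd samples
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddles ω^{−m}
    return np.concatenate([G + w * H, G - w * H])

rng = np.random.default_rng(1)
g = rng.standard_normal(32)
err = np.max(np.abs(fft_rec(g) - np.fft.fft(g)))
print(err)
```

The two halves of the output use G̃(m) ± ω^{−m}H̃(m), which is exactly the split derived above, since ω^{−(m+M)} = −ω^{−m} and G̃, H̃ are M-periodic.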
E. 6-68
Intuitively, “the solution depends continuously on the auxiliary data” means “small
change” in the Cauchy data leads to a “small change” in the solution. To under-
stand this, we need to make it clear what we mean by “small change”. To do this
properly, we need to impose some topology on our space of functions, which involves technicalities we will not go into. Instead, we can look at a simple example.
Suppose we have the heat equation ∂t φ = κ∇2 φ. We know that whatever starting
condition we have, the solution quickly smooths out with time. Any spikiness
of the initial conditions get exponentially suppressed. Hence this is a well-posed
problem — changing the initial condition slightly will result in a similar solution.
However, if we take the heat equation but run it backwards in time, we get a non-well-posed problem: if we provide a tiny change in the "ending condition", then as we go back in time, this perturbation grows exponentially, and the result could
vary wildly.
Another example is as follows: consider Laplace's equation ∇²φ = 0 on the upper half plane (x, y) ∈ R × R≥0, subject to the boundary conditions φ(x, 0) = 0 and ∂y φ(x, 0) = g(x). If we take g(x) = 0, then φ(x, y) = 0 is the unique solution, obviously. However, if we take g(x) = sin(Ax)/A, then we get the unique solution
$$\phi(x,y) = \frac{\sin(Ax)\sinh(Ay)}{A^2}.$$
So far so good. However, now consider the limit as A → ∞. Then g(x) = sin(Ax)/A → 0 for all x ∈ R. However, at the special points x = π/(2A), we get
$$\phi\!\left(\frac{\pi}{2A}, y\right) = \frac{\sinh(Ay)}{A^2} \sim \frac{e^{Ay}}{2A^2},$$
which is unbounded. So as we take the limit as our boundary conditions g(x) → 0,
we get an unbounded solution.
The condition that the solution depends continuously on the auxiliary data is important in physics, since we can neither set up our apparatus nor measure our results with infinite precision. Equations that usefully model the physics had better obey this condition, or else our approximations could give very wrong answers.
Given a vector field V(x, y) on R², we can picture it as the velocity at which water flows at each point. We can then obtain a curve by starting at a point and flowing along with the water. These integral curves are determined by a system of 1st order ODEs,
$$\left(\frac{dx}{ds}, \frac{dy}{ds}\right) = \mathbf V(x(s), y(s)),$$
and hence always exist, at least locally.
Note that we have a 1st order ODE, so our solution has a free variable and is not unique. For sufficiently regular vector fields, we can fill the whole space with different integral curves. We can parametrize which curve we are on by a parameter t. More precisely, we pick a curve B = (x(t), y(t)) that is transverse (ie. nowhere parallel) to our family of curves at any point, and we can label the members of our family by the value of t at which they intersect B. We can thus label our family of curves by (x(s, t), y(s, t)), so that for each t we have the integral curve C_t. If the Jacobian
$$J = \frac{\partial(x,y)}{\partial(s,t)} = \frac{\partial x}{\partial s}\frac{\partial y}{\partial t} - \frac{\partial x}{\partial t}\frac{\partial y}{\partial s} \ne 0,$$
then we can invert this to find (s, t) as a function of (x, y), ie. at any point, we
know which curve we are on, and how far along the curve we are. This means
the set of integral curves fills the whole space and we now have a new coordinate
system (s, t) for our points in R2 . It turns out by picking the right vector field
V, we can make differential equations much easier to solve in this new coordinate
system.
C. 6-71
<The method of characteristics> Suppose φ : R2 → R obeys
$$a(x,y)\frac{\partial\phi}{\partial x} + b(x,y)\frac{\partial\phi}{\partial y} = f(x,y)$$
with boundary condition φ|_B = h(t) for some function h(t) along a curve B ⊆ R² parametrised by t. Our differential equation is equivalent to V · ∇φ = f where V(x, y) = (a, b). Along any particular integral curve (x(s), y(s)) of V, we have
$$\frac{d}{ds}\phi(x(s), y(s)) = \frac{dx}{ds}\frac{\partial\phi}{\partial x} + \frac{dy}{ds}\frac{\partial\phi}{\partial y} = \mathbf V\cdot\nabla\phi = f(x(s), y(s)),$$
so along each characteristic the PDE reduces to an ODE in s with initial value h(t). If we manage to find the solution φ(x(s, t), y(s, t)) in terms of s, t, and we can invert the variables so that (s, t) is a function of (x, y), then we have found the solution φ(x, y). In particular, if f = 0, then φ does not vary with s, ie. ∂φ/∂s = 0, and the solution is simply φ(x(s, t), y(s, t)) = h(t). Now if we can invert the variables so that t = t(x, y) is a function of (x, y), then we have φ(x, y) = h(t(x, y)).
Note the following features of the above construction:
• If any characteristic curve intersects the initial curve B more than once then
the problem is over-determined. In this case we might have a contradiction
and so no solution. For example, in the case of a homogeneous equation, i.e.
when f = 0, the solution will be constant along the same characteristic curve,
so in order to have a solution, our Cauchy data φ|_B = h(t) must satisfy h(t₁) = h(t₂) whenever the points of B with parameters t₁, t₂ lie on the same characteristic curve.
• If B does not intersect all characteristic curves, we will not get a unique solution, as the solution is not fixed along those characteristics.
• If the initial curve is transverse to all characteristics and intersects them once
only, then the problem is well-posed for any h(t) and has a unique solution
φ(x, y) (at least in a neighbourhood of B).
Note that the initial data cannot be propagated from one characteristic to an-
other. In particular, if h(t) is discontinuous, then these discontinuities will
propagate along the corresponding characteristic curve.
E. 6-72
Suppose φ obeys
$$\frac{\partial\phi(x,y)}{\partial x} = 0 \quad\text{with}\quad \phi(0, y) = f(y).$$
The solution is obviously φ(x, y) = f(y). However, we can try to use the method of characteristics to practice our skills. Our vector field and integral curves are given by
$$\mathbf V = \begin{pmatrix}1\\0\end{pmatrix} \implies \frac{dx}{ds} = 1, \quad \frac{dy}{ds} = 0.$$
So we have x(s) = s + c and y(s) = d. We have boundary conditions along the curve (x_B(t), y_B(t)) = (0, t). We want our integral curves to intersect this at s = 0, ie. (x_B(t), y_B(t)) = (x(0), y(0)), so our integral curves are (x(s, t), y(s, t)) = (s, t).
Now we have to solve
$$\frac{\partial\phi(x(s,t), y(s,t))}{\partial s} = 0 \quad\text{with}\quad \phi(x(0,t), y(0,t)) = f(t).$$
So we know that φ(x, y) = f (y).
E. 6-73
Consider the equation
$$e^x\frac{\partial\phi}{\partial x} + \frac{\partial\phi}{\partial y} = 0 \quad\text{with}\quad \phi(x, 0) = \cosh x.$$
The vector field is V = (eˣ, 1). So the integral curves obey dx/ds = eˣ and dy/ds = 1. Thus our integral curves (x(s, t), y(s, t)) obey e^{−x(s)} = −s + c(t) and y(s) = s + d(t). So
$$(x(s,t), y(s,t)) = \bigl(-\ln(-s + c(t)),\; s + d(t)\bigr).$$
262 CHAPTER 6. METHODS
We want (t, 0) = (x(0, t), y(0, t)) = (− ln(c(t)), d(t)), so c(t) = e^{−t} and d(t) = 0. So (x, y) = (− ln(−s + e^{−t}), s). Inverting, t = − ln(y + e^{−x}), and thus φ(x, y) = cosh(t) = cosh(− ln(y + e^{−x})).
E. 6-74
Suppose φ : R² → R solves the inhomogeneous partial differential equation ∂ₓφ + 2∂_yφ = yeˣ with φ(x, x) = sin x. We can still use the method of characteristics.
We have V = (1, 2). So the characteristic curves (x(s, t), y(s, t)) are obtained from
$$\frac{\partial x}{\partial s} = 1, \quad \frac{\partial y}{\partial s} = 2 \quad\text{with}\quad (x(0,t), y(0,t)) = (t, t).$$
So (x, y) = (s + t, 2s + t). We can invert this to obtain (s, t) = (y − x, 2x − y). The partial differential equation now becomes ∂φ/∂s = yeˣ = (2s + t)e^{s+t}, so
$$\phi(x(s,t), y(s,t)) = 2(s-1)e^{s+t} + te^{s+t} + h(t),$$
and the boundary condition φ = sin t at s = 0 gives sin t = e^t(t − 2) + h(t). Substituting h(t) back,
$$\phi(x(s,t), y(s,t)) = (2-t)e^t(1 - e^s) + \sin t + 2se^{s+t}$$
$$\implies \phi(x,y) = (2 - 2x + y)e^{2x-y} + \sin(2x-y) + (y-2)e^x.$$
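The solution just obtained can be verified symbolically (Python/sympy sketch; not part of the notes):

```python
import sympy as sp

# Verify φ(x, y) = (2 − 2x + y)e^{2x−y} + sin(2x − y) + (y − 2)eˣ solves
# ∂φ/∂x + 2 ∂φ/∂y = y eˣ with φ(x, x) = sin x.

x, y = sp.symbols('x y')
phi = (2 - 2*x + y)*sp.exp(2*x - y) + sp.sin(2*x - y) + (y - 2)*sp.exp(x)

pde_residual = sp.simplify(sp.diff(phi, x) + 2*sp.diff(phi, y) - y*sp.exp(x))
bc_residual = sp.simplify(phi.subs(y, x) - sp.sin(x))
print(pde_residual, bc_residual)         # both 0
```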
Consider the general 2nd order linear differential operator on Rⁿ,
$$L = \sum_{i,j=1}^{n} a_{ij}(x)\frac{\partial^2}{\partial x_i\partial x_j} + \sum_{i=1}^{n} b_i(x)\frac{\partial}{\partial x_i} + c(x),$$
where a_{ij}(x), b_i(x), c(x) ∈ R and a_{ij} = a_{ji} (wlog). We define the symbol of L as
$$\sigma(k, x) = \underbrace{\sum_{i,j=1}^{n} a_{ij}(x)\,k_ik_j}_{=\sigma^p(k,x)} + \sum_{i=1}^{n} b_i(x)\,k_i + c(x),$$
where we just replace the derivatives by the variable k. The principal part of the symbol is the quadratic form σᵖ(k, x) = kᵀA(x)k, where A(x) is the real symmetric matrix with entries a_{ij}(x). We say the differential operator L at a point x is
• elliptic if all eigenvalues of A(x) have the same sign;9
• hyperbolic if all but one eigenvalues of A(x) have the same sign;
• ultra-hyperbolic if there is more than one eigenvalue of A(x) of each sign;
• parabolic if A(x) is degenerate, ie. has a zero eigenvalue.
⁹ Recall that the eigenvalues of a real, symmetric matrix are always real.
This is a relation that is trivially true, since we just add and subtract the same thing. Note, however, that the first term points along n, while the latter term is orthogonal to n. Write ∇⊥f = ∇f − n(n · ∇f), so that ∇f = n(n · ∇f) + ∇⊥f. Then we can compute
$$(\nabla f)^T A(\nabla f) = \bigl(\mathbf n(\mathbf n\cdot\nabla f) + \nabla_\perp f\bigr)^T A\,\bigl(\mathbf n(\mathbf n\cdot\nabla f) + \nabla_\perp f\bigr) = (\nabla_\perp f)^T A(\nabla_\perp f).$$
E. 6-80
Consider the wave equation
$$\frac{\partial^2\phi}{\partial t^2} - c^2\frac{\partial^2\phi}{\partial x^2} = 0$$
on R^{1,1}. Then the equation is hyperbolic everywhere, and the characteristic curves are x ± ct = const. Let's look for a solution to the wave equation that obeys φ(x, 0) = f(x) and ∂ₜφ(x, 0) = g(x). Now put u = x − ct and v = x + ct. Then the wave equation becomes ∂²φ/∂u∂v = 0. So the general solution to this is φ = G(u) + H(v) = G(x − ct) + H(x + ct).
The initial conditions mean f(x) = G(x) + H(x) and g(x) = −cG′(x) + cH′(x). Solving these, we find
$$\phi(x,t) = \frac{1}{2}\bigl[f(x-ct) + f(x+ct)\bigr] + \frac{1}{2c}\int_{x-ct}^{x+ct} g(y)\,dy.$$
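D'Alembert's formula can be checked symbolically (Python/sympy sketch; not part of the notes); the data f(x) = sin x, g(x) = cos x below are arbitrary smooth choices:

```python
import sympy as sp

# Verify d'Alembert's formula solves the wave equation with φ(x,0) = f(x)
# and ∂_tφ(x,0) = g(x), for the sample data f = sin, g = cos.

x, t, u = sp.symbols('x t u')
c = sp.Symbol('c', positive=True)
f, g = sp.sin, sp.cos

phi = sp.Rational(1, 2)*(f(x - c*t) + f(x + c*t)) \
    + sp.integrate(g(u), (u, x - c*t, x + c*t))/(2*c)

wave_residual = sp.simplify(sp.diff(phi, t, 2) - c**2*sp.diff(phi, x, 2))
ic_pos = sp.simplify(phi.subs(t, 0) - f(x))
ic_vel = sp.simplify(sp.diff(phi, t).subs(t, 0) - g(x))
print(wave_residual, ic_pos, ic_vel)     # all 0
```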
$$\mathcal F^{-1}[e^{-Dk^2t}] = \frac{1}{\sqrt{4\pi Dt}}\exp\left(-\frac{x^2}{4Dt}\right)$$
We shall call this S1 (x, t), known as the fundamental solution of the heat equation,
where the subscript 1 tells us we are in 1 + 1 dimensions. We then get
$$\phi(x,t) = \int_{-\infty}^{\infty} f(y)\,S_1(x-y, t)\,dy.$$
Suppose our initial data is f(x) = φ₀δ(x), ie. we start with a really cold room with a huge spike in temperature at the middle of the room. Then we get
$$\phi(x,t) = \frac{\phi_0}{\sqrt{4\pi Dt}}\exp\left(-\frac{x^2}{4Dt}\right).$$
What this shows, as we’ve seen before, is that if we start with a delta function, then
as time evolves, we get a Gaussian that gets shorter and broader.
Now note that if we start with a delta function, at t = 0 everything outside the origin is zero. However, after any infinitesimally small time t, φ becomes non-zero everywhere,
instantly. Unlike the wave equation, information travels instantly to all of space, ie.
heat propagates arbitrarily fast according to this equation (of course in reality, it
doesn’t). Mathematically, this is a consequence of the fact that the heat equation is
parabolic, and so has only one family of characteristic surfaces (in this case, they are
the surfaces t =const). Physically, we see that the heat equation is not compatible
with Special Relativity; this is because it is really just a macroscopic approximation
to the underlying statistical mechanics of microscopic particle motion.
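The claimed properties of the fundamental solution can be verified symbolically (Python/sympy sketch; not part of the notes):

```python
import sympy as sp

# Verify S1(x, t) = exp(−x²/(4Dt))/√(4πDt) satisfies ∂t S1 = D ∂²x S1,
# and that it stays normalised: ∫ S1 dx = 1 for all t > 0.

x, t, D = sp.symbols('x t D', positive=True)
S1 = sp.exp(-x**2/(4*D*t)) / sp.sqrt(4*sp.pi*D*t)

heat_residual = sp.simplify(sp.diff(S1, t) - D*sp.diff(S1, x, 2))
total_mass = sp.simplify(sp.integrate(S1, (x, -sp.oo, sp.oo)))
print(heat_residual, total_mass)         # 0 and 1
```

The unit total mass for every t > 0 is what makes the Gaussian "shorter and broader" as time evolves while conserving the total heat.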
where e^{−iky} is just the Fourier transform of δ(x − y). This is equal to
$$\tilde G(k, t; y, \tau) = \begin{cases} 0 & t < \tau \\ e^{-iky}e^{-Dk^2(t-\tau)} & t > \tau \end{cases} \;=\; \Theta(t-\tau)\,e^{-iky}e^{-Dk^2(t-\tau)}.$$
This integral is just the inverse Fourier transform of the Gaussian with a phase shift.
So we end up with
$$G(x, t; y, \tau) = \frac{\Theta(t-\tau)}{\sqrt{4\pi D(t-\tau)}}\exp\left(-\frac{(x-y)^2}{4D(t-\tau)}\right) = \Theta(t-\tau)\,S_1(x-y;\, t-\tau).$$
It is interesting that the solution to the forced equation involves the same function S₁(x, t) as the homogeneous equation with inhomogeneous boundary conditions; this is an instance of Duhamel's principle.
In general, S_n(x, t) solves
$$\frac{\partial S_n}{\partial t} - D\nabla^2 S_n = 0 \quad\text{with}\quad S_n(\mathbf x, 0) = \delta^{(n)}(\mathbf x - \mathbf y),$$
and we can find
$$S_n(\mathbf x, t) = \frac{1}{(4\pi Dt)^{n/2}}\exp\left(-\frac{|\mathbf x - \mathbf y|^2}{4Dt}\right).$$
Then in general, given an initial condition φ|_{t=0} = f(x), the solution is
$$\phi(\mathbf x, t) = \int f(\mathbf y)\,S_n(\mathbf x - \mathbf y, t)\,d^n y.$$
$$\mathcal F^{-1}[\sin k\alpha] = \mathcal F^{-1}\left[\frac{e^{ik\alpha} - e^{-ik\alpha}}{2i}\right] = \frac{1}{2i}\bigl(\delta(x-\alpha) - \delta(x+\alpha)\bigr).$$
Hence our Green's function is
$$G(\mathbf x, t; \mathbf y, \tau) = -\frac{\Theta(t-\tau)}{4\pi c|\mathbf x - \mathbf y|}\bigl(\delta(|\mathbf x - \mathbf y| - c(t-\tau)) - \delta(|\mathbf x - \mathbf y| + c(t-\tau))\bigr).$$
Now we look at our delta functions. The step function is non-zero only if t > τ, in which case |x − y| + c(t − τ) is always positive, so δ(|x − y| + c(t − τ)) does not contribute. On the other hand, δ(|x − y| − c(t − τ)) is non-zero only if t > τ, and there Θ(t − τ) = 1. So we can write our Green's function as
$$G(\mathbf x, t; \mathbf y, \tau) = -\frac{1}{4\pi c}\,\frac{1}{|\mathbf x - \mathbf y|}\,\delta(|\mathbf x - \mathbf y| - c(t-\tau)).$$
As always, given our Green's function, the general solution to the forced equation ∂²φ/∂t² − c²∇²φ = F(x, t) is
$$\phi(\mathbf x, t) = -\int_0^{\infty}\!\!\int_{\mathbb R^3}\frac{F(\mathbf y, \tau)}{4\pi c|\mathbf x - \mathbf y|}\,\delta(|\mathbf x - \mathbf y| - c(t-\tau))\,d^3y\,d\tau.$$
We can use the delta function to do one of the integrals. It is up to us which integral
we do, but we pick the time integral to do. Then we get
$$\phi(\mathbf x, t) = -\frac{1}{4\pi c^2}\int_{\mathbb R^3}\frac{F(\mathbf y, t_{\text{ret}})}{|\mathbf x - \mathbf y|}\,d^3y, \quad\text{where}\quad t_{\text{ret}} = t - \frac{|\mathbf x - \mathbf y|}{c}.$$
This shows that the effect of the forcing term at some point y ∈ R3 affects the solution
φ at some other point x not instantaneously, but only after time |x − y|/c has elapsed.
This is just as we saw for characteristics. This, again, tells us that information travels
at speed c.
Also, we see the effect of the forcing term gets weaker and weaker as we move further
away from the source. This dispersion depends on the number of dimensions of the
space. As we spread out in a three-dimensional space, the “energy” from the forcing
term has to be distributed over a larger sphere, and the effect diminishes. On the
contrary, in one-dimensional space, there is no spreading out to do, and we don’t have
this reduction. In fact, in one dimension, we get
$$\phi(x,t) = \int_0^t\!\!\int_{\mathbb R}\frac{\Theta(c(t-\tau) - |x-y|)}{2c}\,F(y,\tau)\,dy\,d\tau.$$
We see there is now no suppression factor at the bottom, as expected.
By the divergence theorem,
$$\int_{\partial\Omega}\phi\,\mathbf n\cdot\nabla\psi\,dS = \int_\Omega\nabla\cdot(\phi\nabla\psi)\,dV = \int_\Omega\bigl(\phi\nabla^2\psi + (\nabla\phi)\cdot(\nabla\psi)\bigr)\,dV.$$
The free-space Green's function G_n is radially symmetric about y; integrating ∇²G_n = δ^{(n)}(x − y) over a ball of radius r = |x − y| and applying the divergence theorem gives
$$\frac{dG_n}{dr} = \frac{1}{A_n r^{n-1}} \implies G_n(\mathbf x, \mathbf y) = \begin{cases} \dfrac{1}{2}|\mathbf x - \mathbf y| + c_1 & n = 1 \\[4pt] \dfrac{1}{2\pi}\ln|\mathbf x - \mathbf y| + c_2 & n = 2 \\[4pt] -\dfrac{1}{A_n(n-2)|\mathbf x - \mathbf y|^{n-2}} + c_n & n \ge 3, \end{cases}$$
where A_n is the surface area of the unit sphere S^{n−1} ⊆ Rⁿ.
Now fix y and take
$$\Omega = B_r - B_\varepsilon = \{\mathbf x \in \mathbb R^n : \varepsilon \le |\mathbf x - \mathbf y| \le r\}.$$
In other words, we remove a small region of radius ε centered on y from the domain. In this choice of Ω, it is completely safe to use Green's identity, since our Green's
function is certainly regular everywhere in this Ω. Using Green’s second identity and
noting that ∇2 Gn = 0 everywhere except at x = y, we get
$$-\int_\Omega G_n\nabla^2\phi\,dV = \int_\Omega\bigl(\phi\nabla^2 G_n - G_n\nabla^2\phi\bigr)\,dV$$
$$= \int_{S_r^{n-1}}\bigl(\phi(\mathbf n\cdot\nabla G_n) - G_n(\mathbf n\cdot\nabla\phi)\bigr)\,dS + \int_{S_\varepsilon^{n-1}}\bigl(\phi(\mathbf n\cdot\nabla G_n) - G_n(\mathbf n\cdot\nabla\phi)\bigr)\,dS.$$
Note that on the inner boundary, we have n = −r̂. Also, writing A_n^{(ε)} = A_nε^{n−1} for the area of S_ε^{n−1}, we have
$$G_n\big|_{S_\varepsilon^{n-1}} = -\frac{1}{A_n(n-2)\varepsilon^{n-2}} = -\frac{\varepsilon}{(n-2)A_n^{(\varepsilon)}}, \qquad \mathbf n\cdot\nabla G_n\big|_{S_\varepsilon^{n-1}} = -\frac{dG_n}{dr}\bigg|_{S_\varepsilon^{n-1}} = -\frac{1}{A_n\varepsilon^{n-1}} = -\frac{1}{A_n^{(\varepsilon)}}.$$
So the inner boundary terms are
$$\int_{S_\varepsilon^{n-1}}\bigl(\phi(\mathbf n\cdot\nabla G_n) - G_n(\mathbf n\cdot\nabla\phi)\bigr)\,dS = -\frac{1}{A_n^{(\varepsilon)}}\int_{S_\varepsilon^{n-1}}\phi\,dS + \frac{\varepsilon}{(n-2)A_n^{(\varepsilon)}}\int_{S_\varepsilon^{n-1}}(\mathbf n\cdot\nabla\phi)\,dS.$$
This in fact tends to −φ(y) as ε → 0. To see this, note that the final integral is bounded, by the assumption that φ is everywhere smooth (so the value of ∇φ is bounded): as we take the limit ε → 0, the final term vanishes. For the first term,
$$-\frac{1}{A_n^{(\varepsilon)}}\int_{S_\varepsilon^{n-1}}\phi\,dS = -\bigl(\text{average of }\phi\text{ on }S_\varepsilon^{n-1}\bigr) \to -\phi(\mathbf y) \quad\text{as }\varepsilon\to 0.$$
Now suppose ∇²φ = −F. Then this gives Green's third identity:
$$\phi(\mathbf y) = \int_{\partial\Omega}\bigl(\phi(\mathbf n\cdot\nabla G_n) - G_n(\mathbf n\cdot\nabla\phi)\bigr)\,dS - \int_\Omega G_n(\mathbf x, \mathbf y)\,F(\mathbf x)\,dV,$$
where the integrals are taken over the x variable. This is a remarkable formula! It describes the solution throughout our domain in terms of the solution on the boundary, the forcing term, and the known function G_n. Also notice that (unlike the previous
cases) the Green’s function is here providing an expression for the solution with inho-
mogeneous boundary conditions. If the boundary values of φ and n · ∇φ vanish as we
take r → ∞, then we have
$$\phi(\mathbf y) = -\int_{\mathbb R^n} G_n(\mathbf x, \mathbf y)\,F(\mathbf x)\,dV.$$
But we know there is a unique solution to Laplace’s equation on every bounded do-
main once we specify the boundary value φ|∂Ω (Dirichlet boundary condition), or a
unique-up-to-constant solution if we specify the boundary value of n·∇φ|∂Ω (Neumann condition). However, to get φ(y) using Green's identity, we need to know both φ and n · ∇φ on the boundary. This is too much. Green's third identity is a valid relation
obeyed by solutions to Poisson’s equation, but it is not constructive. We cannot specify
φ and n · ∇φ freely.
The resolution is to add to Gn a function H with ∇²H = 0 throughout Ω, so that G = Gn + H is still a Green’s function, and to choose H so that G vanishes on ∂Ω. So as long as we can find such an H, we can express the value of φ(y) just in terms of the values of φ on the boundary. Similarly, if we are given a Neumann condition, ie. the value of n · ∇φ on the boundary, we have to find an H that kills off n · ∇G on the boundary, and get a similar result.
In general, finding a harmonic ∇2 H = 0 with H|∂Ω = −Gn |∂Ω is a difficult problem.
However, the method of images allows us to solve this in some special cases with lots
of symmetry. The key concept is to match the boundary conditions by extending the domain beyond the region of interest, and placing a ‘mirror’ or ‘image’ source or forcing term in the unphysical region.
E. 6-82
Let Ω = {(x, y, z) ∈ R3 : z ≥ 0}. Find a solution to ∇2 φ = −F in Ω with φ → 0
rapidly as |x| → ∞ with boundary condition φ(x, y, 0) = g(x, y).
Let x0^R be the point (x0, y0, −z0), the reflection of x0 = (x0, y0, z0) in the boundary plane z = 0. Since the point x0^R is outside our domain, G3(x, x0^R) obeys ∇²G3(x, x0^R) = 0 for all x ∈ Ω, and also G3(x, x0^R)|_{z=0} = G3(x, x0)|_{z=0}. Hence we take G(x, x0) = G3(x, x0) − G3(x, x0^R). The outward pointing normal to Ω at z = 0 is n = −ẑ. Hence we have
n·∇G|_{z=0} = −(1/4π)[(z − z0)/|x − x0|³ − (z + z0)/|x − x0^R|³]|_{z=0} = z0/(2π[(x − x0)² + (y − y0)² + z0²]^{3/2}).
What have we actually done here? The Green’s function G3(x, x0) in some sense represents a “charge” at x0. We can imagine that the term G3(x, x0^R) represents the contribution to our solution from a point charge of opposite sign located at x0^R. Then by symmetry, G is zero at z = 0.
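As a sanity check of the image construction, here is a minimal Python sketch (the source point x0 and the test points are arbitrary choices, not from the text) verifying numerically that G = G3(x, x0) − G3(x, x0^R) vanishes on z = 0 and that its normal derivative there matches the closed form above.

```python
import math

def G3(x, x0):
    # free-space Green's function of the 3D Laplacian: G3 = -1/(4*pi*|x - x0|)
    return -1.0 / (4.0 * math.pi * math.dist(x, x0))

def G(x, x0):
    # Dirichlet Green's function for the half-space z >= 0, by the method of images
    x0R = (x0[0], x0[1], -x0[2])      # image point, reflected in z = 0
    return G3(x, x0) - G3(x, x0R)

def n_grad_G(x, y, x0, h=1e-6):
    # n . grad G on the boundary z = 0, with outward normal n = -z hat
    return -(G((x, y, h), x0) - G((x, y, -h), x0)) / (2.0 * h)

def n_grad_G_closed(x, y, x0):
    # the closed form z0 / (2 pi [(x - x0)^2 + (y - y0)^2 + z0^2]^{3/2})
    return x0[2] / (2.0 * math.pi * ((x - x0[0])**2 + (y - x0[1])**2 + x0[2]**2)**1.5)

x0 = (0.3, -0.2, 1.5)                 # an arbitrary source point with z0 > 0
```

The finite-difference derivative agrees with the closed form to high accuracy, which is exactly the cancellation-by-symmetry argument in numerical form.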
E. 6-83
Suppose a chimney produces smoke such that the density φ(x, t) of smoke obeys

∂φ/∂t − D∇²φ = F(x, t).

The left side is just the heat equation, modelling the diffusion of smoke, while the forcing term on the right describes the production of smoke by the chimney. If this were a problem for x ∈ R³, then the solution is

φ(x, t) = ∫₀^t ∫_{R³} F(y, τ)S3(x − y, t − τ) d³y dτ,  where  S3(x − y, t − τ) = (1/[4πD(t − τ)]^{3/2}) exp(−|x − y|²/(4D(t − τ))).
This is true only if the smoke can diffuse in all of R³. However, this is not true in our current circumstances, since smoke does not diffuse into the ground. To account for this, we should find a Green’s function that obeys n·∇G|_{z=0} = 0, which says that there is no smoke diffusing into the ground. This is achieved by picking

G(x, t; y, τ) = Θ(t − τ)[S3(x − y, t − τ) + S3(x − y^R, t − τ)].

We can directly check that this obeys (∂/∂t − D∇²)G = δ(t − τ)δ³(x − y) when x ∈ Ω, and also n·∇G|_{z=0} = 0. Hence the smoke density is given by

φ(x, t) = ∫₀^t ∫_Ω F(y, τ)[S3(x − y, t − τ) + S3(x − y^R, t − τ)] d³y dτ.
We can think of the second term as the contribution from a “mirror chimney”.
Without a mirror chimney, we will have smoke flowing into the ground. With a
mirror chimney, we will have equal amounts of mirror smoke flowing up from the
ground, so there is no net flow. Of course, there are no mirror chimneys in reality.
These are just artifacts we use to find the solution we want.
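We can check the no-flux property of the mirror construction numerically (a sketch; D = 1 and the source location are arbitrary choices): the z-derivative of S3(x − y, t) + S3(x − y^R, t) vanishes on the plane z = 0, so no smoke flows into the ground.

```python
import math

def S3(dx, dy, dz, t, D=1.0):
    # free-space heat kernel in three dimensions
    r2 = dx*dx + dy*dy + dz*dz
    return math.exp(-r2 / (4.0*D*t)) / (4.0*math.pi*D*t)**1.5

def G_mirror(x, y, t, D=1.0):
    # source at y = (y1, y2, y3) plus an image source at (y1, y2, -y3)
    direct = S3(x[0]-y[0], x[1]-y[1], x[2]-y[2], t, D)
    image  = S3(x[0]-y[0], x[1]-y[1], x[2]+y[2], t, D)
    return direct + image

def dGdz_at_boundary(x1, x2, y, t, h=1e-6):
    # partial derivative in z at z = 0, by a central difference
    return (G_mirror((x1, x2, h), y, t) - G_mirror((x1, x2, -h), y, t)) / (2.0*h)

y = (0.5, -1.0, 2.0)   # an arbitrary "chimney" location with y3 > 0
```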
E. 6-84
Suppose we want to solve the wave equation ∂²φ/∂t² = c² ∂²φ/∂x² in the region x > 0, with initial conditions φ(x, 0) = b(x), ∂φ/∂t(x, 0) = 0, and with no flow across the boundary, ie. ∂φ/∂x(0, t) = 0.
On R^{1,1} d’Alembert’s solution gives φ(x, t) = (1/2)(b(x − ct) + b(x + ct)). This is not what we want, since eventually we will have a wave moving past the x = 0 line. To compensate for this, we introduce a mirror wave moving in the opposite direction, such that as they pass through each other at x = 0, there is no net flow across the boundary.
More precisely, we include a mirror initial condition φ(x, 0) = b(x) + b(−x), where
we set b(x) = 0 when x < 0. In the region x > 0 we are interested in, only the
b(x) term will contribute. In the x < 0 region, only b(−x) will contribute. Then
the general solution is
φ(x, t) = (1/2)[b(x + ct) + b(x − ct) + b(−x − ct) + b(−x + ct)].
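The claimed properties of this image solution are easy to check numerically (a sketch; the bump profile b and the speed c = 1 are arbitrary choices): φ(x, 0) = b(x) in the physical region, and ∂φ/∂x vanishes at x = 0 for all times.

```python
import math

c = 1.0

def b(x):
    # a smooth bump centred at x = 2, set to 0 for x < 0 as required
    return math.exp(-10.0*(x - 2.0)**2) if x > 0 else 0.0

def phi(x, t):
    # d'Alembert solution with the mirror initial condition b(x) + b(-x)
    return 0.5*(b(x + c*t) + b(x - c*t) + b(-x - c*t) + b(-x + c*t))

def dphi_dx(x, t, h=1e-6):
    return (phi(x + h, t) - phi(x - h, t)) / (2.0*h)
```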
CHAPTER 7
Quantum mechanics
Quantum mechanics (QM) is a radical generalization of classical physics. Profound
new features of quantum mechanics include
1. Quantisation — Quantities such as energy are often restricted to a discrete set of
values, called quanta.
2. Wave-particle duality — Classical concepts of particles and waves become merged
in quantum mechanics. They are different aspects of a single entity. For example, electrons have properties of both particles and waves.
3. Probability and uncertainty — Predictions in quantum mechanics involve prob-
ability in a fundamental way. This probability does not arise from our lack of
knowledge of the system, but is a genuine uncertainty in reality.
Quantum mechanics also involves a new fundamental constant — the Planck constant h, or its reduced form ~ = h/(2π). The dimension of this is [~] = ML²T⁻¹, that of energy × time (equivalently, of angular momentum).
We can think of this constant as representing the “strength” of quantum effects. De-
spite having these new profound features, we expect to recover classical physics when
we take the limit ~ → 0.
Historically, there are a few experiments that led to the development of quantum
mechanics:
Light quanta
In quantum mechanics, light (or electromagnetic waves) consists of quanta called pho-
tons. We can think of them as waves that come in discrete “packets” that behave like
particles.
In particular, photons behave like particles with energy E = hν = ~ω, where ν is the
frequency and ω = 2πν is the angular frequency. However, we usually don’t care about
ν and just call ω the frequency. Similarly, the momentum is given by p = h/λ = ~k,
where λ is the wavelength and k = 2π/λ is the wave number.
For electromagnetic waves with speed c = ω/k = νλ, the above is consistent with the
fact that photons are massless particles, since we have E = cp, as entailed by special relativity.
The physical reality of photons was clarified by Einstein in explaining the photo-electric effect. When we shine some light (or electromagnetic radiation γ) of frequency ω onto certain metals, this can cause an emission of electrons (e). We can measure the maximum kinetic energy K of these electrons. Experiments show that
1. K depends only (linearly) on the frequency but not the intensity.
2. For ω < ω0 (for some critical value ω0 ), no electrons are emitted at all, regardless
of the intensity of the incident light.
3. For a given frequency of incident radiation, the rate at which electrons are ejected
is directly proportional to the intensity of the incident light.
This is hard to understand classically, but is exactly as expected if each electron emitted
is due to the impact with a single photon (of energy E = ~ω). If W is the energy
required to liberate an electron, then we would expect K = ~ω−W by the conservation
of energy. We will have no emission at all if ω < ω0 = W/~, even if we increase the
number of such photons hitting the metal (i.e. increase the intensity).
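In code, the relation K = ~ω − W and its threshold behave as follows (a sketch; the work function used is a made-up value, not that of any particular metal):

```python
hbar = 1.054571817e-34   # J s
eV = 1.602176634e-19     # J per electron-volt

def max_kinetic_energy(omega, W):
    # K = hbar*omega - W above threshold; below omega0 = W/hbar no electrons are emitted
    K = hbar*omega - W
    return K if K > 0 else None

W = 2.0 * eV             # hypothetical work function
omega0 = W / hbar        # threshold frequency
```

Increasing the intensity corresponds to more photons, not more energetic ones, so the threshold ω0 is unaffected by intensity.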
Bohr model of the atom
In the Bohr model of the hydrogen atom, the electron orbits the proton in a circular orbit, with its angular momentum quantized as

L = mrv = n~ for n = 1, 2, · · · .

Assuming this, together with the requirement for centripetal force mv²/r = (e²/(4πε0))(1/r²), we can solve for r and v completely for each n and obtain

rn = (4πε0/(me²)) ~²n²,  vn = (e²/(4πε0)) (1/(~n)),  En = −(1/2)m(e²/(4πε0~))² (1/n²).
When the electron makes a transition between energy levels n and m > n, we see emission or absorption of a photon γ of frequency ω given by

E = ~ω = Em − En = (1/2)m(e²/(4πε0~))² (1/n² − 1/m²).
This model explains a vast amount of experimental data. It also gives an estimate of the size of a hydrogen atom:

r1 = 4πε0~²/(me²) ≈ 5.29 × 10⁻¹¹ m,

known as the Bohr radius. While the model fits experiments very well, it does not provide a good explanation for why the radius/angular momentum should be quantized, and we shall seek the answer in quantum mechanics.
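Plugging standard SI values of the constants into the formulas above recovers the familiar numbers (a sketch; only the quoted Bohr-model formulas are used):

```python
import math

hbar = 1.054571817e-34    # J s
m_e  = 9.1093837015e-31   # kg, electron mass
e    = 1.602176634e-19    # C
eps0 = 8.8541878128e-12   # F / m
eV   = 1.602176634e-19    # J

def r_n(n):
    # r_n = (4 pi eps0 / (m e^2)) hbar^2 n^2
    return 4*math.pi*eps0*hbar**2*n**2 / (m_e*e**2)

def E_n(n):
    # E_n = -(1/2) m (e^2 / (4 pi eps0 hbar))^2 / n^2
    return -0.5*m_e*(e**2/(4*math.pi*eps0*hbar))**2 / n**2
```

The n = 2 → 1 transition energy comes out at about 10.2 eV, the Lyman-alpha line.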
Matter waves
The double-slit experiment on electrons shows that electrons really behave like waves; in particular, it shows that electrons exhibit interference like waves. We have a sinusoidal wave/electron beam incident on some barrier with narrow openings, as shown:

[Figure: incident waves/electrons passing through two slits, with path difference δ between the two routes to a point on the screen, and the resulting density of electrons]
At different points, depending on the difference δ in path length, we may have con-
structive interference (large amplitude) or destructive interference (no amplitude). In
particular, constructive interference occurs if δ = nλ, and destructive if δ = (n + 12 )λ.
Not only does this experiment allow us to verify that something is a wave; we can also figure out its wavelength λ by experiment.
Practically, the actual experiment for electrons is slightly more complicated. Since
the wavelength of an electron is rather small, to obtain the diffraction pattern, we
cannot just poke holes in sheets. Instead, we need to use crystals as our diffraction
grating.
7.1 Wavefunctions and the Schrödinger equation
The physical content of the wavefunction is as follows: if ψ is appropriately normalized so that ∫_{−∞}^{∞} |ψ(x)|² dx = 1, then when we measure the position of a particle, we get a result x with probability density function |ψ(x)|², ie. the probability that the position is found in [x, x + δx] (for small δx) is given by |ψ(x)|² δx. Alternatively, the probability
of finding it in an interval [a, b] is given by
P(particle position in [a, b]) = ∫_a^b |ψ(x)|² dx.
The normalization condition ensures that the probability of finding the particle somewhere is 1. It is possible that in some cases the configuration space of the particle is restricted. For example, we might require −ℓ/2 ≤ x ≤ ℓ/2 with some boundary conditions at the edges. Then the normalization condition would not be an integral over (−∞, ∞), but over [−ℓ/2, ℓ/2].
If we do not care about normalization, then for any (non-zero) λ ∈ C, ψ(x) and λψ(x)
represent the same quantum state (since they give the same probabilities). In practice,
we usually refer to either of these as “the state”. We can thus think of the states as
equivalence classes of wavefunctions under the equivalence relation ψ ∼ φ if φ = λψ for
some non-zero λ. What we do require, then, is not that the wavefunction is normalized, but normalizable, ie. ∫_{−∞}^{∞} |ψ(x)|² dx < ∞. We will encounter wavefunctions that are
not normalizable. Mathematically, these are useful things to have, but we have to be
more careful when interpreting these things physically.
A characteristic property of quantum mechanics is that if ψ1 (x) and ψ2 (x) are wave-
functions for a particle, then ψ1 (x) + ψ2 (x) is also a possible particle state (ignoring
normalization), provided the result is non-zero. This is the principle of superposition,
and arises from the fact that the equations of quantum mechanics are linear.
E. 7-1
Consider the wavefunction

ψ(x) = B[exp(−(x − c)²/(2α)) + exp(−x²/(2β))].

We choose B so that ψ is a normalized wavefunction for a single particle. [Figure: the resulting distribution |ψ(x)|² against x, with one peak near x = 0 and another near x = c]
7.1.2 Operators
We know that the square of the wavefunction gives the probability distribution of the
position of the particle. How about other information such as the momentum and
energy? It turns out that all the information about the particle is contained in the
wavefunction (which is why we call it the “state” of the particle).
We call each property of the particle which we can measure an observable . Each
observable has a corresponding operator , which acts on wavefunctions ψ(x). For
example, the position is represented by the operator x̂ = x. This means that (x̂ψ)(x) = xψ(x). Here are a few operators:
• Position: x̂ = x
• Momentum: p̂ = −i~ ∂/∂x, so that p̂ψ = −i~ψ′
• Energy (the Hamiltonian): H = p̂²/(2m) + V(x̂) = −(~²/2m) ∂²/∂x² + V(x)
E. 7-3
Consider the Gaussian distribution ψ(x) = C exp(−x²/(2α)). We get p̂ψ(x) = −i~ψ′(x) ≠ pψ(x) for any number p. So this is not an eigenfunction of the momentum. However, for the harmonic oscillator with potential V(x) = (1/2)Kx², this ψ(x) is an eigenfunction of the Hamiltonian operator, provided we picked the right α. We have

Hψ = −(~²/2m)ψ′′ + (1/2)Kx²ψ = Eψ

for some constant E when α² = ~²/(Km). The energy is in fact E = (~/2)√(K/m).
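We can confirm this eigenfunction claim numerically (a sketch in units ~ = m = K = 1, so α = 1 and E = 1/2; the second derivative is taken by central differences):

```python
import math

hbar = m = K = 1.0
alpha = hbar / math.sqrt(K*m)        # alpha^2 = hbar^2/(K m)
E = 0.5*hbar*math.sqrt(K/m)          # expected eigenvalue (hbar/2) sqrt(K/m)

def psi(x):
    return math.exp(-x*x/(2.0*alpha))

def H_psi(x, h=1e-4):
    # H psi = -(hbar^2/2m) psi'' + (1/2) K x^2 psi, with psi'' by central differences
    d2 = (psi(x + h) - 2.0*psi(x) + psi(x - h)) / (h*h)
    return -hbar**2/(2.0*m)*d2 + 0.5*K*x*x*psi(x)
```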
The wavefunction specifies the state, however the state can change with time. For a
time-dependent wavefunction Ψ(x, t), its evolution with time is described by
<Time-dependent Schrödinger equation>  i~ ∂Ψ/∂t = HΨ.
The classical dynamics (time evolution) of, say a particle, is determined by its potential
through F (x) = −V 0 (x). In quantum mechanics, the time evolution of a state is deter-
mined by the Hamiltonian through the time-dependent Schrödinger equation. For a particle in a potential V(x), the time-dependent Schrödinger equation reads

i~ ∂Ψ/∂t = −(~²/2m) ∂²Ψ/∂x² + V(x)Ψ.
Note that it is linear. So the sums and multiples of solutions are also solutions. It is
also first-order in time. So if we know the wavefunction Ψ(x, t0 ) at a particular time
t0 , then this determines the whole function Ψ(x, t).
This is similar to classical dynamics, where knowing the potential V (and hence the
Hamiltonian H) completely specifies how the system evolves with time. However, this
is in some ways different from classical dynamics. Newton’s second law is second-order
in time, while this is first-order in time. This is significant since if our equation is
first-order in time, then the current state of the wavefunction completely specifies the
evolution of the wavefunction in time.
Yet, this difference is just an illusion. The wavefunction is the state of the particle,
and not just the “position”. Instead, we can think of it as capturing the position and
momentum. Indeed, if we write the equations of classical dynamics in terms of position
and momentum, it will be first order in time.
D. 7-4
A stationary state is a state of the form Ψ(x, t) = ψ(x)e−iEt/~ where ψ(x) is an
eigenfunction of the Hamiltonian with eigenvalue E. This term is also sometimes
applied to ψ instead.
E. 7-5
Note that a stationary state satisfies the time-dependent Schrödinger equation; also we have |Ψ(x, t)|² = |ψ(x)|², which is independent of time. The stationary state is the unique solution to the time-dependent Schrödinger equation with Ψ(x, 0) = ψ(x), and it satisfies HΨ = EΨ. Note that a measurement of energy for a stationary state would give the definite result E at any time.
P. 7-6
Let Ψ(x, t) be a time-dependent wavefunction. The probability density P(x, t) = |Ψ(x, t)|² obeys a conservation equation

∂P/∂t = −∂j/∂x  where  j(x, t) = −(i~/2m)(Ψ* ∂Ψ/∂x − (∂Ψ*/∂x)Ψ).
This is straightforward from the Schrödinger equation and its complex conjugate.
Assuming V is real, we have
∂P/∂t = Ψ*(∂Ψ/∂t) + (∂Ψ*/∂t)Ψ = (i~/2m)(Ψ*Ψ′′ − Ψ′′*Ψ) = −∂j/∂x,
since the two V terms cancel each other out.
j(x, t) is called the probability current. Note that Ψ*Ψ′ is the complex conjugate of Ψ′*Ψ, so Ψ*Ψ′ − Ψ′*Ψ is imaginary. So multiplying by i ensures that j(x, t) is real, which is a good thing since P is also real.
The important thing here is not the specific form of j, but that ∂P/∂t can be written as the space derivative of some quantity. This implies that the probability that we find the particle in [a, b] at time t changes with time as

(d/dt) ∫_a^b |Ψ(x, t)|² dx = −∫_a^b (∂j/∂x)(x, t) dx = j(a, t) − j(b, t).
We can think of the final term as the probability current getting in and out of the
interval at the boundary.
In particular, consider a normalizable state such that Ψ, Ψ′, j → 0 as x → ±∞ for fixed t. Taking a → −∞ and b → +∞, we have

(d/dt) ∫_{−∞}^{∞} |Ψ(x, t)|² dx = 0.
What does this tell us? This tells us that if Ψ(x, 0) is normalized, Ψ(x, t) is
normalized for all t. Hence we know that for each fixed t, |Ψ(x, t)|2 is a probability
distribution. So what this really says is that the probability interpretation is
consistent with the time evolution.
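This conservation of total probability can be watched in an explicit simulation. The sketch below (units ~ = m = 1; the grid, time step and initial Gaussian are arbitrary choices) evolves a free particle with the Crank–Nicolson scheme, whose update is unitary, so the discrete norm stays at 1 while the packet spreads (for this initial Gaussian the variance should grow from 0.5 to about (1 + t²)/2 = 2.5 at t = 2).

```python
import cmath

# Grid and time step (arbitrary choices); hbar = m = 1.
N, L, dt = 400, 40.0, 0.05
dx = L / N
xs = [-L/2 + i*dx for i in range(N)]

def thomas(a, b, c, d):
    # solve a tridiagonal linear system (a: sub-, b: main, c: super-diagonal)
    n = len(d)
    cp = [0j]*n
    dp = [0j]*n
    cp[0] = c[0]/b[0]
    dp[0] = d[0]/b[0]
    for i in range(1, n):
        denom = b[i] - a[i]*cp[i-1]
        cp[i] = c[i]/denom
        dp[i] = (d[i] - a[i]*dp[i-1])/denom
    x = [0j]*n
    x[-1] = dp[-1]
    for i in range(n-2, -1, -1):
        x[i] = dp[i] - cp[i]*x[i+1]
    return x

# Discrete H = -(1/2) d^2/dx^2 with Dirichlet walls at the grid ends:
off = -1.0/(2*dx*dx)        # off-diagonal entries of H
dia = 1.0/(dx*dx)           # diagonal entries of H
lam = 0.5j*dt
A_sub = [lam*off]*N
A_dia = [1 + lam*dia]*N
A_sup = [lam*off]*N

def cn_step(psi):
    # Crank-Nicolson: (1 + i dt H/2) psi_new = (1 - i dt H/2) psi_old
    rhs = []
    for i in range(N):
        hpsi = dia*psi[i]
        if i > 0:
            hpsi += off*psi[i-1]
        if i < N-1:
            hpsi += off*psi[i+1]
        rhs.append(psi[i] - lam*hpsi)
    return thomas(A_sub, A_dia, A_sup, rhs)

def norm2(psi):
    return sum(abs(z)**2 for z in psi)*dx

def variance(psi):
    return sum(x*x*abs(z)**2 for x, z in zip(xs, psi))*dx / norm2(psi)

psi = [cmath.exp(-x*x/2) for x in xs]
scale = norm2(psi)**0.5
psi = [z/scale for z in psi]
var0 = variance(psi)
for _ in range(40):          # evolve to t = 2
    psi = cn_step(psi)
```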
7.2 Energy eigenstates in one dimension
We now look for normalizable solutions of the time-independent Schrödinger equation Hψ = Eψ; in other words, we want to find the allowed energy eigenvalues. This is a hard problem in general. We will consider simple cases involving simple V(x).
7.2.1 Parity
Consider the case with potential such that V (x) = V (−x). By changing variables
x → −x, we see that ψ(x) is an eigenfunction of H with energy E if and only if ψ(−x)
is an eigenfunction of H with energy E. There are two possibilities:
1. If ψ(x) and ψ(−x) represent the same quantum state, this can only happen if
ψ(−x) = ηψ(x) for some constant η. Since this is true for all x, we can do this twice
and get ψ(x) = ηψ(−x) = η 2 ψ(x). So we get that η = ±1 and ψ(−x) = ±ψ(x).
We call η the parity, and say ψ has even/odd parity if η is +1/ − 1 respectively.
For example, in our particle in a box, our states ψn have parity (−1)n+1 .
2. If ψ(x) and ψ(−x) represent different quantum states, then we can still take linear
combinations ψ± (x) = α(ψ(x) ± ψ(−x)) and these are also eigenstates with energy
eigenvalue E, where α is for normalization. Then by construction, ψ± (−x) =
±ψ± (x) and have parity η = ±1.
Hence, if we are given a potential with reflective symmetry V (−x) = V (x), then we
can restrict our attention and just look for solutions with definite parity.
[Figure: three examples of symmetric potentials V(x) = V(−x)]
<Infinite potential well> Consider the infinite potential well, with V(x) = 0 for |x| < a and V(x) = ∞ for |x| ≥ a. We require ψ = 0 for |x| > a and ψ continuous at x = ±a. Within |x| < a, the Schrödinger equation is

−(~²/2m)ψ′′ = Eψ  ⟹  ψ′′ + k²ψ = 0  where  E = ~²k²/(2m).
Here, instead of working with the complex exponentials, we use sin and cos since
we know well when these vanish. The general solution is thus ψ = A cos kx +
B sin kx. Our boundary conditions require that ψ vanishes at x = ±a. So we need
A cos ka ± B sin ka = 0. In other words, we require A cos ka = B sin ka = 0. Since
sin ka and cos ka cannot be simultaneously 0, either A = 0 or B = 0. So the two
possibilities are
1. B = 0 and ka = nπ/2 with n = 1, 3, · · ·
2. A = 0 and ka = nπ/2 with n = 2, 4, · · ·
Hence the allowed energy levels are

En = (~²π²/(8ma²)) n²  for n = 1, 2, · · ·

and the normalized (∫_{−a}^{a} |ψn(x)|² dx = 1) wavefunctions are

ψn(x) = (1/a)^{1/2} cos(nπx/(2a)) for n odd,  ψn(x) = (1/a)^{1/2} sin(nπx/(2a)) for n even.
[Figure: the first four eigenfunctions ψ1, ψ2, ψ3, ψ4 inside the well (−a, a)]
This was a rather simple and nice example. We have an infinite well, and the
particle is well-contained inside the box. The solutions just look like standing
waves on a string with two fixed end points.
Note that ψn (−x) = (−1)n+1 ψn (x). We will see that this is a general feature of
energy eigenfunctions of a symmetric potential. This is known as parity.
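Both the normalization and the parity claim are easy to verify numerically (a sketch with a = 1; the trapezoidal rule is very accurate here because the integrands are smooth with vanishing end-point derivatives):

```python
import math

a = 1.0

def psi(n, x):
    # eigenfunctions of the infinite well on (-a, a): cos for odd n, sin for even n
    if n % 2 == 1:
        return math.cos(n*math.pi*x/(2*a)) / math.sqrt(a)
    return math.sin(n*math.pi*x/(2*a)) / math.sqrt(a)

def inner(n, m, N=2000):
    # trapezoidal approximation of the inner product on [-a, a]
    h = 2*a/N
    s = 0.5*(psi(n, -a)*psi(m, -a) + psi(n, a)*psi(m, a))
    s += sum(psi(n, -a + k*h)*psi(m, -a + k*h) for k in range(1, N))
    return s*h
```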
E. 7-8
<Potential well> We will consider a potential

V(x) = −U for |x| < a,  V(x) = 0 for |x| ≥ a,

for some constant U > 0. [Figure: the square well of depth U between −a and a]
Classically, this is not very interesting. If the energy E < 0, then the particle is
contained in the well. Otherwise it is free to move around. However, in quantum
mechanics, this is much more interesting.
We want to seek energy levels for a particle of mass m, defined by the Schrödinger equation Hψ = −(~²/2m)ψ′′ + V(x)ψ = Eψ. For energies in the range −U < E < 0 we set

U + E = ~²k²/(2m) > 0,  E = −~²κ²/(2m),

where k, κ > 0 are new real constants. Note that these coefficients are not independent, since U is given and fixed. So they must satisfy k² + κ² = 2mU/~². Using these constants, the Schrödinger equation becomes

ψ′′ + k²ψ = 0 for |x| < a,  ψ′′ − κ²ψ = 0 for |x| > a.
For the even parity solutions we take ψ = A cos kx inside the well and ψ = Ce^{−κ|x|} outside; matching ψ and ψ′ at x = ±a gives κ = k tan ka. Writing ξ = ka and η = κa, the eigenvalue condition becomes η = ξ tan ξ, while the constraint k² + κ² = 2mU/~² becomes ξ² + η² = 2ma²U/~².

[Figure: the branches of η = ξ tan ξ between ξ = π/2, 3π/2, 5π/2, 7π/2, ..., intersected by circles ξ² + η² = 2ma²U/~²]

The first equation gives a family of branches; the other equation is the equation of a circle. Depending on the size of the constant 2ma²U/~², there will be a different number of points of intersection. So there will be a different number of solutions depending on the value of 2ma²U/~². In particular, if

(n − 1)π < (2mUa²/~²)^{1/2} < nπ,
then we have exactly n even parity solutions (for n ≥ 1). We can do exactly the
same thing for odd parity eigenstates. For E > 0 or E < −U , we will end up
finding non-normalizable solutions.
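We can count these intersections numerically (a sketch; R stands for (2ma²U/~²)^{1/2}, and one bisection is run on each branch of ξ tan ξ):

```python
import math

def even_parity_roots(R):
    # one root of xi*tan(xi) = sqrt(R^2 - xi^2) on each branch (n*pi, n*pi + pi/2) with n*pi < R
    roots = []
    n = 0
    while n*math.pi < R:
        lo = n*math.pi + 1e-9
        hi = min(n*math.pi + math.pi/2 - 1e-9, R - 1e-9)
        g = lambda x: x*math.tan(x) - math.sqrt(max(R*R - x*x, 0.0))
        if lo < hi and g(lo) < 0.0 < g(hi):
            for _ in range(100):
                mid = 0.5*(lo + hi)
                if g(mid) < 0.0:
                    lo = mid
                else:
                    hi = mid
            roots.append(0.5*(lo + hi))
        n += 1
    return roots
```

The counts agree with the rule above: for (n − 1)π < R < nπ there are exactly n even-parity solutions.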
We can compare the solutions we have now with what we would expect classically.
Classically, any value of E in the range −U < E < 0 is allowed, and the motion is
deeply uninteresting. The particle just goes back and forth inside the well, and is
strictly confined in −a ≤ x ≤ a.
Quantum mechanically, there is just a discrete, finite set of allowed energies. What is more surprising is that while ψ decays exponentially outside the well, it is non-zero! So although we call these states bound in the potential, there is in fact a non-zero probability of finding the particle outside the well.
<Harmonic oscillator> Consider the potential

V(x) = (1/2)mω²x².

This is a harmonic oscillator of mass m. Classically, this has a motion of x = A cos ω(t − t0). [Figure: the quadratic potential V(x)]
This is a really important example. First of all, we can solve it, which is a good thing.
More importantly, any smooth potential can be approximated by a harmonic oscillator
near an equilibrium x0 , since
V(x) = V(x0) + (1/2)V′′(x0)(x − x0)² + · · · .
Systems with many degrees of freedom, like crystals, can also be treated as collections of independent oscillators by considering the normal modes. If we apply this to the electro-
magnetic field, we get photons! So it is very important to understand the quantum
mechanical oscillator.
We are going to seek all normalizable solutions to the time-independent Schrödinger equation Hψ = −(~²/2m)ψ′′ + (1/2)mω²x²ψ = Eψ. To simplify constants, we define y = (mω/~)^{1/2} x and ℰ = 2E/(~ω), both of which are dimensionless. Then we are left with

−d²ψ/dy² + y²ψ = ℰψ.
We can consider the behaviour for y² ≫ ℰ. For large y, the y²ψ term will be large, and so we want the ψ′′ term to offset it. We might want to try the Gaussian e^{−y²/2}, and when we differentiate it twice, we would have brought down a factor of y². So we can wlog set ψ = f(y)e^{−y²/2}; then the Schrödinger equation gives

<Hermite’s equation>  d²f/dy² − 2y df/dy + (ℰ − 1)f = 0.
Trying a power series solution f(y) = Σ_{r≥0} a_r y^r, we obtain

Σ_{r≥0} [(r + 2)(r + 1)a_{r+2} + (ℰ − 1 − 2r)a_r] y^r = 0  ⟹  a_{r+2} = ((2r + 1 − ℰ)/((r + 2)(r + 1))) a_r,  r ≥ 0.
We can choose a0 and a1 independently, and can get two linearly independent solutions.
Each solution involves either all even or all odd powers.
However, we have a problem. We want normalizable solutions. So we want to make
sure our function does not explode at large y. Note that it is okay if f(y) is quite large, since our ψ is suppressed by the e^{−y²/2} factor, but f cannot grow too big.
We look at these two solutions individually. To examine the behaviour of f(y) when y is large, observe that unless the coefficients vanish, we get a_p/a_{p−2} ∼ 2/p. This is bad: f(y) then grows like e^{y²} = Σ_{p≥0} y^{2p}/p!, and ψ cannot be normalized.
Hence, we get normalizable ψ if and only if the series for f terminates to give a polynomial. This occurs iff ℰ = 2n + 1 for some n. Note that for each n, only one of the two independent solutions is normalizable. So for each ℰ, we get exactly one solution. For n even, we have

a_{r+2} = ((2r − 2n)/((r + 2)(r + 1))) a_r  for r even,

and a_r = 0 for r odd. And we have the other way round when n is odd. The solutions are thus f(y) = hn(y), where hn is a polynomial of degree n with hn(−y) = (−1)^n hn(y). For example, we have

h0(y) = a0,  h1(y) = a1 y,  h2(y) = a0(1 − 2y²),  h3(y) = a1(y − (2/3)y³).
These are known as the Hermite polynomials. We have now solved our harmonic oscillator. With the constants restored, the possible energy eigenvalues are

En = ~ω(n + 1/2)  for n = 0, 1, 2, · · · .

The wavefunctions are

ψn(x) = hn((mω/~)^{1/2} x) exp(−(mω/2~) x²),

where normalization fixes a0 and a1.
Harmonic oscillators are everywhere. It turns out quantised electromagnetic fields
correspond to sums of quantised harmonic oscillators, with En − E0 = n~ω. This is
equivalent to saying the nth state contains n photons, each of energy ~ω.
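The recurrence above is easy to implement, and we can check directly that each resulting hn satisfies Hermite’s equation f′′ − 2yf′ + 2nf = 0, ie. that ℰ = 2n + 1. A sketch (the starting values a0 = a1 = 1 are an arbitrary normalization):

```python
def hermite_coeffs(n):
    # coefficients a_r of h_n(y) from a_{r+2} = (2r - 2n)/((r+2)(r+1)) a_r,
    # starting from a_0 = 1 (n even) or a_1 = 1 (n odd)
    a = [0.0]*(n + 1)
    a[n % 2] = 1.0
    for r in range(n % 2, n - 1, 2):
        a[r+2] = (2*r - 2*n)/((r + 2)*(r + 1)) * a[r]
    return a

def polyval(c, y):
    return sum(cr*y**r for r, cr in enumerate(c))

def polyder(c):
    return [r*c[r] for r in range(1, len(c))]

def hermite_residual(n, y):
    # residual of f'' - 2y f' + 2n f = 0 at the point y
    c = hermite_coeffs(n)
    f = polyval(c, y)
    f1 = polyval(polyder(c), y)
    f2 = polyval(polyder(polyder(c)), y)
    return f2 - 2*y*f1 + 2*n*f
```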
7.3 Expectation and uncertainty
• Suppose φ(x) = ψ(x)e^{ikx}; then |φ(x)|² = |ψ(x)|² and ⟨x̂⟩φ = ⟨x̂⟩ψ. However ⟨p̂⟩φ = ⟨p̂⟩ψ + ~k. So an additional factor of e^{ikx} changes the momentum by ~k.
• (∆x)²ψ and (∆p)²ψ are like variances; we’ll show that they are real and positive.
• If Q is Hermitian, then ⟨ψ, Qψ⟩ = ⟨Qψ, ψ⟩ = ⟨ψ, Qψ⟩*, so ⟨ψ, Qψ⟩ is real, ie. ⟨Q⟩ψ is real.
• The commutator is a measure of the lack of commutativity of two operators. The commutator of position and momentum is [x̂, p̂] = x̂p̂ − p̂x̂ = i~. This comes from the product rule: (x̂p̂ − p̂x̂)ψ = −i~xψ′ + i~(xψ)′ = i~ψ.
  Note that if α and β are any real constants, then the operators X = x̂ − α, P = p̂ − β also obey [X, P] = i~.
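A quick numerical illustration of [x̂, p̂] = i~ acting on a sample wavefunction (a sketch in units ~ = 1; derivatives by central differences, and the Gaussian test function is an arbitrary choice):

```python
import math

hbar = 1.0

def p_hat(f, x, h=1e-5):
    # (p f)(x) = -i hbar f'(x), by a central difference
    return -1j*hbar*(f(x + h) - f(x - h))/(2*h)

def commutator_xp(f, x):
    # ((x p - p x) f)(x)
    xf = lambda t: t*f(t)
    return x*p_hat(f, x) - p_hat(xf, x)

def gaussian(x):
    return math.exp(-x*x/2.0)
```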
P. 7-11
The operators x̂, p̂ and H (for real potentials) are all Hermitian.
To show that p̂ is Hermitian, note that −i~[φ*ψ]_{−∞}^{∞} = 0 since φ, ψ are normalizable. So using integration by parts

⟨φ, p̂ψ⟩ = ∫_{−∞}^{∞} φ*(−i~ψ′) dx = ∫_{−∞}^{∞} (−i~φ′)*ψ dx = ⟨p̂φ, ψ⟩.

To show that H = −(~²/2m) d²/dx² + V(x) is Hermitian, we want to show ⟨φ, Hψ⟩ = ⟨Hφ, ψ⟩. To show this, it suffices to consider the kinetic and potential terms separately. For the kinetic energy, we want ⟨φ, ψ′′⟩ = ⟨φ′′, ψ⟩, which is true since we can integrate by parts twice to obtain ⟨φ, ψ′′⟩ = −⟨φ′, ψ′⟩ = ⟨φ′′, ψ⟩. For the potential term, we have ⟨φ, V(x̂)ψ⟩ = ⟨φ, V(x)ψ⟩ = ⟨V(x)φ, ψ⟩ = ⟨V(x̂)φ, ψ⟩. So H is Hermitian.
Thus we know that ⟨x̂⟩ψ, ⟨p̂⟩ψ, ⟨H⟩ψ are all real. Furthermore, observe that X = x̂ − α and P = p̂ − β are (similarly) Hermitian for any real α, β. Hence ⟨ψ, X²ψ⟩ = ⟨Xψ, Xψ⟩ = ‖Xψ‖² ≥ 0, and the same for P. If we choose α = ⟨x̂⟩ψ and β = ⟨p̂⟩ψ, these expressions say that (∆x)²ψ and (∆p)²ψ are indeed real and positive.
P. 7-12
<Cauchy-Schwarz inequality> ‖ψ‖‖φ‖ ≥ |⟨ψ, φ⟩| for any normalizable ψ, φ.
Same as [T.4-142].
T. 7-13
<Ehrenfest’s theorem>  (d/dt)⟨x̂⟩Ψ = (1/m)⟨p̂⟩Ψ,  (d/dt)⟨p̂⟩Ψ = −⟨V′(x̂)⟩Ψ.
Using the time-dependent Schrödinger equation Ψ̇ = (1/i~)HΨ and the Hermiticity of H,

(d/dt)⟨x̂⟩Ψ = ⟨Ψ̇, x̂Ψ⟩ + ⟨Ψ, x̂Ψ̇⟩ = −(1/i~)⟨Ψ, H(x̂Ψ)⟩ + (1/i~)⟨Ψ, x̂(HΨ)⟩ = (1/i~)⟨Ψ, (x̂H − Hx̂)Ψ⟩,

(x̂H − Hx̂)Ψ = −(~²/2m)(xΨ′′ − (xΨ)′′) + (xVΨ − VxΨ) = (~²/m)Ψ′ = (i~/m)p̂Ψ.

Hence we have the first result. The second part is similar. We have

(d/dt)⟨p̂⟩Ψ = ⟨Ψ̇, p̂Ψ⟩ + ⟨Ψ, p̂Ψ̇⟩ = ⟨(1/i~)HΨ, p̂Ψ⟩ + ⟨Ψ, p̂(1/i~)HΨ⟩
= −(1/i~)⟨Ψ, H(p̂Ψ)⟩ + (1/i~)⟨Ψ, p̂(HΨ)⟩ = (1/i~)⟨Ψ, (p̂H − Hp̂)Ψ⟩,

(p̂H − Hp̂)Ψ = −i~(−~²/2m)((Ψ′′)′ − (Ψ′)′′) − i~((V(x)Ψ)′ − V(x)Ψ′) = −i~V′(x)Ψ.
E. 7-14
When we proved Ehrenfest’s theorem, the last step was to calculate the commutators [x̂, H] and [p̂, H]. Commutator relations are important in quantum mechanics. When we first defined the momentum operator p̂ as −i~ ∂/∂x, you might have wondered where this
definition came from.
T. 7-15
<Heisenberg’s uncertainty principle> If ψ is any normalized state (at any
fixed time), then (∆x)ψ (∆p)ψ ≥ ~/2.
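The Gaussian saturates this bound, and we can see it numerically (a sketch; units ~ = 1 with an arbitrary width α = 2, and quadrature by the trapezoidal rule on a wide interval). Since ⟨x̂⟩ = ⟨p̂⟩ = 0 by symmetry, (∆x)² = ⟨x̂²⟩ and (∆p)² = ~² ∫ |ψ′|² dx.

```python
import math

hbar = 1.0
alpha = 2.0

def psi(x):
    # normalized Gaussian (1/(alpha*pi))^{1/4} e^{-x^2/(2 alpha)}
    return (1.0/(alpha*math.pi))**0.25 * math.exp(-x*x/(2.0*alpha))

def dpsi(x, h=1e-5):
    return (psi(x + h) - psi(x - h))/(2.0*h)

def trapz(f, a, b, N=4000):
    h = (b - a)/N
    return h*(0.5*f(a) + 0.5*f(b) + sum(f(a + k*h) for k in range(1, N)))

L = 12.0
norm = trapz(lambda x: psi(x)**2, -L, L)
dx2  = trapz(lambda x: x*x*psi(x)**2, -L, L)            # (Delta x)^2
dp2  = hbar**2 * trapz(lambda x: dpsi(x)**2, -L, L)     # (Delta p)^2
```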
7.4 Wavepackets and scatterings
7.4.1 Wavepacket
When we solve Schrödinger’s equation, what we get is a “wave” that represents the
probability of finding our thing at each position. However, in real life, we don’t think
of particles as being randomly distributed over different places. Instead, particles are
localized to some small regions of space.
These would be represented by wavefunctions in which most of the distribution is concentrated in some small region. These are known as wavepackets; the term has a rather loose definition, and can refer to anything that is localized in space. A Gaussian wavepacket is given by

Ψ0(x, t) = (α/π)^{1/4} (1/γ(t)^{1/2}) exp(−x²/(2γ(t))).
This is a solution of the time-dependent Schrödinger equation with V = 0 for γ(t) = α + (i~/m)t. Note that Ψ0(x, 0) is the normalized Gaussian ψ(x) = (1/(απ))^{1/4} e^{−x²/(2α)}. Gaussian wavepackets are particularly nice wavefunctions. For example, we can show that for a Gaussian wavepacket, (∆x)ψ(∆p)ψ = ~/2 exactly: the uncertainty is minimized. Substituting γ(t) = α + (i~/m)t into our equation, we find that the probability density is

P0(x, t) = |Ψ0(x, t)|² = (α/π)^{1/2} (1/|γ(t)|) e^{−αx²/|γ(t)|²},

which is peaked at x = 0. This corresponds to a particle at rest at the origin, spreading out with time.
A related solution to the time-dependent Schrödinger equation with V = 0 is a moving
particle:
Ψu(x, t) = Ψ0(x − ut, t) exp(i(mu/~)x) exp(−i(mu²/(2~))t).
The probability density of this is Pu (x, t) = |Ψu (x, t)|2 = P0 (x − ut, t). So this corre-
sponds to a particle moving with velocity u. Furthermore, we get hp̂iΨu = mu. This
corresponds with the classical momentum, mass × velocity.
We see that wavepackets do indeed behave like particles, in the sense that we can set
them moving and the quantum momentum of these objects do indeed behave like the
classical momentum.
In the limit α → ∞, our particle becomes more and more spread out in space. The uncertainty in the position becomes larger and larger, while the momentum becomes more and more definite. Then the wavefunction above resembles something like Ψ(x, t) = Ce^{ikx}e^{−iEt/~}, which is a momentum eigenstate with ~k = mu and energy E = (1/2)mu² = ~²k²/(2m). Note, however, that this is not normalizable.
E. 7-18
Consider a free particle (zero potential); then the time-dependent Schrödinger equation is

−(~/2m) ∂²Ψ/∂x² = i ∂Ψ/∂t.  (∗)

Suppose we are given Ψ(x, 0) = ψ(x) at time zero, and we wish to find Ψ(x, t). Taking the Fourier transform of (∗) in the x variable, we obtain

(~k²/2m) Ψ̃(k, t) = i (∂Ψ̃/∂t)(k, t)  ⟹  Ψ̃(k, t) = A(k)e^{−i(~k²/2m)t},

where A(k) = Ψ̃(k, 0) = ψ̃(k) is fixed by the initial data, and inverting the Fourier transform gives Ψ(x, t).
We can use this method to obtain the Gaussian wavepacket in the above example! Note that the essence of this method is that we “eliminated” the ∂²/∂x² term. In fact, equivalently, we can achieve this by expanding ψ(x) in terms of momentum eigenstates.
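The sketch below carries this out with a discrete Fourier transform (a plain O(N²) DFT in units ~ = m = 1; the grid and the initial Gaussian are arbitrary choices): each mode is multiplied by e^{−ik²t/2}, and the resulting |Ψ|² has the spread |γ(t)|²/(2α) of the Gaussian wavepacket quoted above.

```python
import cmath, math

N, L = 256, 40.0
alpha, t = 1.0, 2.0
dx = L/N
xs = [-L/2 + j*dx for j in range(N)]
psi0 = [(1.0/(alpha*math.pi))**0.25 * cmath.exp(-x*x/(2*alpha)) for x in xs]

def dft(a, sign):
    # plain O(N^2) discrete Fourier transform (sign = -1 forward, +1 inverse without 1/N)
    n = len(a)
    return [sum(a[m]*cmath.exp(sign*2j*math.pi*j*m/n) for m in range(n))
            for j in range(n)]

# wavenumbers of the DFT modes, mapped into (-pi/dx, pi/dx]
ks = [2*math.pi*(j if j < N//2 else j - N)/L for j in range(N)]

F = dft(psi0, -1)
F = [F[j]*cmath.exp(-0.5j*ks[j]**2*t) for j in range(N)]   # evolve each mode
psi_t = [v/N for v in dft(F, +1)]

norm = sum(abs(z)**2 for z in psi_t)*dx
var = sum(x*x*abs(z)**2 for x, z in zip(xs, psi_t))*dx/norm
```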
7.4.2 Scattering
Consider the time-dependent Schrödinger equation with a potential barrier. We would like to send a wavepacket Ψ moving with velocity u towards the barrier and see what happens. Classically, we would expect the particle to either pass through the barrier or get reflected. However, in quantum mechanics, we would expect it to “partly” pass through and “partly” get reflected: the wavepacket splits into a reflected part AΨref and a transmitted part BΨtr. Here Ψ, Ψref and Ψtr are normalized wavefunctions, and Pref = |A|² and Ptr = |B|² are the probabilities of reflection and transmission respectively.
This is generally hard to solve. Scattering problems are much simpler to solve for
momentum eigenstates of the form eikx . However, these are not normalizable wave-
functions, and despite being mathematically convenient, we are not allowed to use
them directly, since they do not make sense physically. In some sense, they represent
particles that are “infinitely spread out” and can appear anywhere in the universe with
equal probability, which doesn’t really make sense.
There are two ways we can get around this problem. We know that we can construct normalized momentum eigenstates for a single particle confined in a box −ℓ/2 ≤ x ≤ ℓ/2, namely ψ(x) = e^{ikx}/√ℓ, where the periodic boundary conditions require ψ(x + ℓ) = ψ(x), ie. k = 2πn/ℓ for some integer n. After calculations have been done, the box can be removed by taking the limit ℓ → ∞.
Identical results are obtained more conveniently by allowing Ψ(x, t) to represent beams
of infinitely many particles, with |Ψ(x, t)|2 being the density of the number of particles
(per unit length) at x, t. When we do this, instead of having one particle and watching
it evolve, we constantly send in particles so that the system does not appear to change
with time. This allows us to find steady states, which mathematically corresponds to
finding solutions to the Schrödinger equation that do not change with time. To de-
termine, say, the probability of reflection, roughly speaking, we look at the proportion
of particles moving left compared to the proportion of particles moving right in this
steady state. In principle, this interpretation is obtained by considering a constant
stream of wavepackets and using some limiting/averaging procedure, but we usually
don’t care about these formalities.
For these particle beams, Ψ(x, t) is bounded, but no longer normalizable. Recall that
for a single particle, the probability current was defined as
j(x, t) = −(i~/2m)(Ψ*Ψ′ − ΨΨ′*).
If we have a particle beam instead of a particle, and |Ψ|² is the particle density instead of the probability distribution, j now represents the flux of particles at x, t, ie. the
number of particles passing the point x in unit time.
Our momentum eigenstates are ψ(x) = Ce^{ikx}, which are solutions to the time-independent Schrödinger equation with V = 0 and E = ~²k²/(2m). Applying the momentum op-
erator, we find that p = ~k is the momentum of each particle in the beam, and
|ψ(x)|2 = |C|2 is the density of particles in the beam. We can also evaluate the current
to be j = ~k|C|2 /m. This makes sense — ~k/m = p/m is the velocity of the particles,
and |C|2 is how many particles we have. So this still roughly corresponds to what we
used to have classically.
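As a quick sanity check, the current of a plane-wave beam can be evaluated numerically from its definition and compared with ℏk|C|²/m. This is only an illustrative sketch: the units ℏ = m = 1 and the values of k and C are arbitrary choices, and the derivative is taken by finite differences.

```python
import numpy as np

# Check (units hbar = m = 1, illustrative k and C) that for psi = C e^{ikx}
# the current j = -(i hbar / 2m)(psi* psi' - psi psi'*) equals hbar k |C|^2 / m.
hbar, m = 1.0, 1.0
k, C = 2.5, 0.7
x = np.linspace(0.0, 10.0, 2001)
psi = C * np.exp(1j * k * x)
dpsi = np.gradient(psi, x)                # finite-difference derivative of psi
j = ((-1j * hbar / (2 * m)) * (np.conj(psi) * dpsi - psi * np.conj(dpsi))).real
print(j[1000], hbar * k * abs(C)**2 / m)  # both close to 1.225
```

The computed current is constant along the beam, as it must be for a steady state.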
In scattering problems, we will seek the transmitted and reflected flux jtr, jref in terms
of the incident flux jinc, and the probabilities for transmission and reflection are then
given by

Ptr = |jtr|/|jinc|,  Pref = |jref|/|jinc|.
E. 7-19
<Potential step> Consider the time-independent Schrödinger equation for a step
potential

V(x) = 0 for x ≤ 0,  V(x) = U for x > 0,

where U > 0 is a constant. The Schrödinger equation is −(ℏ²/2m)ψ″ + V(x)ψ = Eψ.
We require ψ and ψ′ to be continuous at x = 0. We can consider two different
cases:
1. 0 < E < U: We apply the standard method, introducing constants k, κ > 0
such that E = ℏ²k²/(2m) and U − E = ℏ²κ²/(2m). Then the Schrödinger equation
becomes

ψ″ + k²ψ = 0 for x < 0,  ψ″ − κ²ψ = 0 for x > 0,

with solution

ψ = Ie^{ikx} + Re^{−ikx} for x < 0,  ψ = Ce^{−κx} for x > 0.

We only have ψ = Ce^{−κx} for x > 0 since ψ has to be bounded. Since ψ and
ψ′ are continuous at x = 0, we have the equations

I + R = C,  ik(I − R) = −κC  ⟹  R = ((k − iκ)/(k + iκ)) I,  C = (2k/(k + iκ)) I.

2. E > U: Now introduce k, κ > 0 with E = ℏ²k²/(2m) and E − U = ℏ²κ²/(2m), so
that ψ = Ie^{ikx} + Re^{−ikx} for x < 0 and ψ = Te^{iκx} for x > 0. Note that it
is in principle possible to get an e^{−iκx} term on x > 0, but this
would correspond to sending in a particle from the right. We, by choice, assume
there is no such term. We now match ψ and ψ′ at x = 0. Then we get the
equations

I + R = T,  ik(I − R) = iκT  ⟹  R = ((k − κ)/(k + κ)) I,  T = (2k/(k + κ)) I.
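A small numerical check of the E > U results is easy: the matching condition I + R = T must hold, and the fluxes must balance, k|I|² = k|R|² + κ|T|². The units ℏ = 2m = 1 (so E = k², E − U = κ²) and the values of E, U are assumptions for illustration, with I = 1.

```python
import numpy as np

# Check matching and flux conservation for the step with E > U
# (units hbar = 2m = 1 and I = 1, chosen for illustration).
E, U = 3.0, 1.0
k = np.sqrt(E)              # E = k^2 in these units
kappa = np.sqrt(E - U)      # E - U = kappa^2
R = (k - kappa) / (k + kappa)
T = 2 * k / (k + kappa)
print(np.isclose(1 + R, T))                      # matching at x = 0: I + R = T
print(np.isclose(k, k * R**2 + kappa * T**2))    # j_inc = j_ref + j_tr
```

Note the transmitted flux carries a factor κ, not k, because the particles move with a different speed on the right of the step.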
Consider now a potential barrier of height U on 0 < x < a. We consider a stationary
state with energy E with 0 < E < U. We set the constants E = ℏ²k²/(2m) and
U − E = ℏ²κ²/(2m). Then the Schrödinger equations become

ψ″ + k²ψ = 0 for x < 0,  ψ″ − κ²ψ = 0 for 0 < x < a,  ψ″ + k²ψ = 0 for x > a,

with solution

ψ = Ie^{ikx} + Re^{−ikx} for x < 0,  ψ = Ae^{κx} + Be^{−κx} for 0 < x < a,  ψ = Te^{ikx} for x > a.
We can use these to find the transmission probability, and it turns out to be

Ptr = |jtr|/|jinc| = |T|²/|I|² = ( 1 + U² sinh²(κa) / (4E(U − E)) )^{−1}.

This demonstrates quantum tunneling. There is a non-zero probability that the
particles can pass through the potential barrier even though they classically do not
have enough energy. In particular, for κa ≫ 1, the probability of tunneling decays
as e^{−2κa}. This is important, since it allows certain reactions with a high potential
barrier to occur in practice even if the reactants do not classically have enough
energy to overcome it.
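The formula above is easy to evaluate. The sketch below uses units ℏ = 2m = 1 (so k² = E and κ² = U − E) and arbitrary illustrative values of E and U; it also shows the e^{−2κa} decay for wide barriers.

```python
import math

# Barrier transmission probability
# P_tr = (1 + U^2 sinh^2(kappa a) / (4 E (U - E)))^(-1),
# in units hbar = 2m = 1 (an illustrative choice), so kappa^2 = U - E.
def p_transmit(E, U, a):
    kappa = math.sqrt(U - E)
    return 1.0 / (1.0 + U**2 * math.sinh(kappa * a)**2 / (4 * E * (U - E)))

E, U = 1.0, 4.0
kappa = math.sqrt(U - E)
for a in (1.0, 2.0, 4.0):
    print(a, p_transmit(E, U, a), math.exp(-2 * kappa * a))  # decay ~ e^{-2 kappa a}
```

Doubling the barrier width squares (roughly) the exponential suppression, which is why tunneling rates are so sensitive to barrier geometry.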
This is an overdetermined system, since we have too many boundary conditions (we
have two conditions requiring no exponential growing on either side). Solutions exist
only when we are lucky, and only for certain values of E. So bound state energy levels
are quantized. We may find several bound states or none.
This is no longer overdetermined since we have more free constants (we only used up
one by requiring no e^{−ikx} term as x → ∞). The solution for any E > 0 (imposing a
condition on one complex constant) gives

j ∼ jinc + jref = |I|² ℏk/m − |R|² ℏk/m  as x → −∞,
j ∼ jtr = |T|² ℏk/m  as x → +∞,

Pref = |jref|/|jinc| = |Aref|²,  Ptr = |jtr|/|jinc| = |Atr|²,

where Aref(k) = R/I and Atr(k) = T/I are the reflection and transmission amplitudes.
In quantum mechanics, “amplitude” generally refers to things that give probabilities
when squared.
2. We will not prove this. Note that the finite dimensional case is proved in
[T.4-156]; however, in this chapter we mostly care about the case when V is
infinite dimensional, like function spaces. Note also the relation of this result
to [T.6-18].
Note that using this result, given a wavefunction ψ(x) (say at time 0), we can
expand it in energy eigenstates Hχn = Enχn to get ψ = Σn αnχn, and
just as for stationary states, its subsequent time evolution
according to the time-dependent Schrödinger equation is simply

Ψ(x, t) = Σn αnχn(x)e^{−iEnt/ℏ}.
C. 7-23
<Postulates for quantum mechanics>
1. <States> The state of a quantum system at a given time corresponds to a non-
zero element of a complex complete inner product space V.⁶ Two elements of
V that are a (non-zero) multiple of each other are physically equivalent.
2. <Observables> Each observable (i.e. measurable quantity) Q has a corre-
sponding Hermitian (self-adjoint) operator Q̂.
3. <Measurement> If Q̂ has a discrete spectrum and ψ ∈ V is a normalised
state, then we have the following: a measurement of Q in the system represented
by ψ yields one of the eigenvalues of Q̂. The probability of obtaining the
eigenvalue λn is Pn = |αn|², where αn = ⟨ψλn, ψ⟩/⟨ψλn, ψλn⟩ is called the
amplitude and ψλn is the projection of ψ onto the eigenspace corresponding
to λn. The measurement is instantaneous and forces the system into the state
ψλn. That is, our ψ turns into ψλn after the measurement.
Note that with appropriate conditions, [P.7-22] says that the set of eigen-
states forms an orthonormal basis for V, so for each eigenvalue λn, we can
pick a normalised eigenvector χn from the eigenspace of λn so that the
normalised ψ can be written as ψ = Σn αnχn with αn = ⟨χn, ψ⟩. Note
that by construction all the χn have different eigenvalues. The measurement
then forces the system into the state χn. That is, our ψ turns into χn after
the measurement.
Also note that we assume above that Q̂ has a discrete spectrum. If that’s
not the case (e.g. for the position operator x̂ = x), then there is a different
way to find the probabilities of the measurements. Note that we can
view x̂ as either having no eigenvalues/eigenvectors, or as having the Dirac delta
functions δ(x − λ) as its eigenvectors, so that every λ ∈ ℝ is an eigenvalue.
4. <Dynamics> The time evolution of a state Ψ(x, t) of a quantum system obeys
the time-dependent Schrödinger equation i~Ψ̇ = HΨ where H is a Hermitian
operator, the Hamiltonian ; this holds at all times except at the instant a
measurement is made.
⁶ V has nothing to do with the potential. More precisely, V should be a Hilbert space, and it is
infinite dimensional in general.
298 CHAPTER 7. QUANTUM MECHANICS
E. 7-24
So far, we have worked with the vector space of functions (that are sufficiently
smooth), and used the integral as the inner product. However, in fact, we can
work with arbitrary complex vector spaces and arbitrary inner products.
Postulate 3 says that our ψ turns into χn after the measurement, which is rather
weird. However, it makes sense because if we measure the state of the system,
and get, say, 3, then if we measure it immediately afterwards, we would expect to
get the result 3 again with certainty, instead of it being randomly distributed like
the original state. So after a measurement, the system must be forced into the
corresponding eigenstate.
Note that these axioms are consistent in the following sense. If ψ is normalized,
then 1 = ⟨ψ, ψ⟩ = Σn Pn because

⟨ψ, ψ⟩ = ⟨Σm αmχm, Σn αnχn⟩ = Σm,n αm∗αn⟨χm, χn⟩ = Σm,n αm∗αn δmn = Σn |αn|².

1. Write ψ = Σn αnχn; then Q̂ψ = Σn αnλnχn. So we have

⟨ψ, Q̂ψ⟩ = ⟨Σm αmχm, Σn αnλnχn⟩ = Σn αn∗αnλn = Σn λnPn.

2. (Q̂ − ⟨Q⟩ψ)²χn = λn²χn − 2λn⟨Q⟩ψχn + ⟨Q⟩ψ²χn = (λn − ⟨Q⟩ψ)²χn, hence done
by the first part.
From this, we see that ψ is an eigenstate of Q̂ with eigenvalue λ if and only if
hQiψ = λ and (∆Q)ψ = 0.
E. 7-26
Consider the harmonic oscillator as in [E.7.2.3], with the operator Q̂ = H, eigen-
functions χn = ψn and eigenvalues λn = En = ℏω(n + 1/2). Suppose we have
prepared our system in the state

ψ = (1/√6)(ψ0 + 2ψ1 − iψ4).

Then the coefficients are α0 = 1/√6, α1 = 2/√6, α4 = −i/√6. This is normalized
since ∥ψ∥² = |α0|² + |α1|² + |α4|² = 1. Measuring the energy gives answers with
the following probabilities:

Energy          Probability
E0 = (1/2)ℏω    P0 = 1/6
E1 = (3/2)ℏω    P1 = 2/3
E4 = (9/2)ℏω    P4 = 1/6

If a measurement gives E1, then ψ1 is the new state immediately after the
measurement.
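The bookkeeping of this example (amplitudes, probabilities, expected energy) can be written out in a few lines. This is a minimal sketch; the energies are in units of ℏω.

```python
import numpy as np

# Measurement probabilities for psi = (psi_0 + 2 psi_1 - i psi_4)/sqrt(6):
# P_n = |alpha_n|^2, and <H> = sum_n (n + 1/2) P_n in units of hbar*omega.
alpha = {0: 1 / np.sqrt(6), 1: 2 / np.sqrt(6), 4: -1j / np.sqrt(6)}
P = {n: abs(a)**2 for n, a in alpha.items()}
norm = sum(P.values())
expected_E = sum((n + 0.5) * p for n, p in P.items())
print(P)           # {0: 1/6, 1: 2/3, 4: 1/6} up to rounding
print(norm)        # 1.0
print(expected_E)  # (1/6)(1/2) + (2/3)(3/2) + (1/6)(9/2) = 11/6
```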
7.5. POSTULATES FOR QUANTUM MECHANICS 299
Note that the measurement postulate tells us that after measurement, the system is
then forced into the eigenstate. So when we said that we interpret the expectation
value as the “average result for many measurements”, we do not mean measuring
a single system many many times. Instead, we prepare a lot of copies of the system
in state ψ, and measure each of them once.
E. 7-27
Consider the normalised energy eigenstates Hψn = Enψn with ⟨ψm, ψn⟩ = δmn.
Then we have certain simple solutions of the (time-dependent) Schrödinger equa-
tion of the form Ψn = ψne^{−iEnt/ℏ}. In general, given an initial state Ψ(0) =
Σn αnψn, since the Schrödinger equation is linear, we can get the following solu-
tion for all time:

Ψ(t) = Σn αne^{−iEnt/ℏ}ψn.

For example, consider the harmonic oscillator with initial state Ψ(0) = (1/√6)(ψ0 +
2ψ1 − iψ4). Then Ψ(t) is given by

Ψ(t) = (1/√6)(ψ0e^{−iωt/2} + 2ψ1e^{−3iωt/2} − iψ4e^{−9iωt/2}).
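One can verify numerically that this time evolution is just a phase rotation of each coefficient, so the norm and all measurement probabilities are constant in time. The units ℏ = ω = 1 are an assumption for illustration.

```python
import numpy as np

# Time evolution of the coefficients in E.7-27: alpha_n(t) = alpha_n e^{-i E_n t},
# with E_n = n + 1/2 in units hbar = omega = 1 (chosen for illustration).
alpha0 = np.array([1, 2, 0, 0, -1j]) / np.sqrt(6)   # coefficients of psi_0..psi_4
E = np.arange(5) + 0.5

def coeffs(t):
    return alpha0 * np.exp(-1j * E * t)

for t in (0.0, 1.0, 5.0):
    a = coeffs(t)
    print(t, np.sum(np.abs(a)**2))   # norm is conserved: always 1
```

Because only the phases change, a later energy measurement gives the same probabilities as at t = 0.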
T. 7-28
<Ehrenfest’s theorem (General form)> If Q is any operator with no explicit
time dependence, then (recall that [Q, H] = QH − HQ)

iℏ d/dt ⟨Q⟩Ψ = ⟨[Q, H]⟩Ψ.

If Q has no explicit time dependence, then

iℏ d/dt ⟨Ψ, QΨ⟩ = ⟨−iℏΨ̇, QΨ⟩ + ⟨Ψ, Q iℏΨ̇⟩ = ⟨−HΨ, QΨ⟩ + ⟨Ψ, QHΨ⟩
= ⟨Ψ, (QH − HQ)Ψ⟩ = ⟨Ψ, [Q, H]Ψ⟩.

If Q has explicit time dependence, then we have an extra term on the right,
and have iℏ d/dt ⟨Q⟩Ψ = ⟨[Q, H]⟩Ψ + iℏ⟨Q̇⟩Ψ. These general versions correspond to
the classical equations of motion in the Hamiltonian formalism.
C. 7-29
<Discrete and continuous spectra> In stating the measurement postulates,
we have assumed that our spectrum of eigenvalues of Q̂ was discrete and eigenstates
normalisable, and we got nice results about measurements. However, this is not
always the case.
One way to get around this problem is to put the system in a “box” of length ℓ
with suitable boundary conditions on ψ(x) (so that we are forced to have only certain
discrete eigenvalues with normalisable eigenvectors). We can then take ℓ → ∞ at
the end of the calculation.
Alternatively, when we have a continuous spectrum, we can proceed analogously
to the discrete case. Not caring too much about rigour below, suppose we have
Qχξ = λξχξ for a continuous label ξ instead of the discrete label n, and we have
eigenstates χξ with orthonormality conditions ⟨χξ, χη⟩ = δ(ξ − η), where we replaced
our old δmn with δ(ξ − η), the Dirac delta function. To expand ψ in eigenstates, the
discrete sum becomes an integral, so we have ψ = ∫ αξχξ dξ where αξ = ⟨χξ, ψ⟩. In
the discrete case, |αn|² is the probability mass function. The obvious generalization
here would be to let |αξ|² be our probability density function. More precisely,
∫ₐᵇ |αξ|² dξ is the probability that the result corresponds to λξ with a ≤ ξ ≤ b.
E. 7-30
• Consider p̂ = −iℏ d/dx. We have p̂e^{ikx} = ℏke^{ikx}. So ℏk is an eigenvalue for every k ∈ ℝ,
so the spectrum of p̂ is continuous; also the eigenvectors e^{ikx} are not normalisable
on (−∞, ∞).

Consider ψ(x) with periodic boundary conditions ψ(x + ℓ) = ψ(x). So we can
restrict to −ℓ/2 ≤ x ≤ ℓ/2. The eigenstates of p̂ = −iℏ d/dx are now χn(x) =
e^{iknx}/√ℓ where kn = 2πn/ℓ, with eigenvalues λn = ℏkn, discrete and normalised.
The states are orthonormal on −ℓ/2 ≤ x ≤ ℓ/2. We expand ψ(x) in terms of
the eigenstates to get a complex Fourier series ψ(x) = Σn αnχn(x), where the
amplitudes are given by αn = ⟨χn, ψ⟩. Taking the limit ℓ → ∞, the Fourier series
becomes a Fourier integral.
• Consider the particle in one dimension with position as our operator. The eigen-
states of x̂ are χξ(x) = δ(x − ξ) with corresponding eigenvalue λξ = ξ. This comes
from x̂χξ(x) = xδ(x − ξ) = ξδ(x − ξ) = ξχξ(x), since δ(x − ξ) is non-zero only when
x = ξ. With these eigenstates, we can expand

ψ(x) = ∫ αξχξ(x) dξ = ∫ αξδ(x − ξ) dξ = αx.

So our coefficients are given by αξ = ψ(ξ). So ∫ₐᵇ |ψ(ξ)|² dξ is indeed the probabil-
ity of measuring the particle to be in [a, b]. So we recover our original interpretation
of the wavefunction.
D. 7-31
For any observable Q, the number of linearly independent eigenstates with eigen-
value λ is the degeneracy of the eigenvalue. In other words, the degeneracy
is the dimension of the eigenspace Vλ = {ψ : Qψ = λψ}. An eigenvalue is
non-degenerate if the degeneracy is exactly 1, and is degenerate if the degen-
eracy is more than 1. We say two states are degenerate if they have the same
eigenvalue.
T. 7-32
<Uncertainty principle (General form)> Let A and B be observables. If ψ
is any normalized state (at any fixed time), then (∆A)ψ(∆B)ψ ≥ (1/2)|⟨[A, B]⟩ψ|.
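The inequality can be checked numerically on small matrices. The sketch below uses the spin components A = σx/2, B = σy/2 (in units ℏ = 1) on random normalized two-component states; the specific operators and random states are illustrative choices.

```python
import numpy as np

# Check (Delta A)(Delta B) >= |<[A, B]>|/2 for A = sigma_x/2, B = sigma_y/2
# (units hbar = 1) on random normalized states.
sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2

def expval(op, psi):
    return (np.conj(psi) @ op @ psi).real

def uncertainty(op, psi):
    var = expval(op @ op, psi) - expval(op, psi)**2
    return np.sqrt(max(var, 0.0))   # guard against tiny negative rounding

comm = sx @ sy - sy @ sx            # equals i * sigma_z / 2
rng = np.random.default_rng(0)
for _ in range(100):
    psi = rng.normal(size=2) + 1j * rng.normal(size=2)
    psi /= np.linalg.norm(psi)
    lhs = uncertainty(sx, psi) * uncertainty(sy, psi)
    rhs = 0.5 * abs(np.conj(psi) @ comm @ psi)
    assert lhs >= rhs - 1e-12
print("uncertainty bound holds on all sampled states")
```

For the spin-up state the bound is saturated: both sides equal 1/4.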
P. 7-33
Let A and B be operators resulting from observables. A and B are simultaneously
diagonalizable (i.e. there exists a basis of joint eigenstates) if and only if A and B
commute (ie. [A, B] = 0).
(Forward) Suppose there exists a basis of joint eigenstates {χn}. So we have Aχn =
λnχn and Bχn = µnχn. So now [A, B]χn = (AB − BA)χn = λnµnχn − µnλnχn =
0 for all n. Since {χn} is a basis for V, we must have [A, B] = 0.
(Backward) Suppose [A, B] = 0. For any eigenvalue λ of A, consider the eigenspace
Vλ = {ψ : Aψ = λψ}. Now if ψ ∈ Vλ, then A(Bψ) = BAψ = λ(Bψ), so Bψ ∈ Vλ.
So B|Vλ ∈ End(Vλ). Now V is the direct sum of the eigenspaces Vλ over all possible
eigenvalues λ, since V is spanned by eigenstates of A. So given any basis for each
of the eigenspaces, their union is a basis for V. Since B is hermitian, B|Vλ is also
hermitian. It follows that Vλ has a basis of eigenstates of B, and all its elements
are eigenstates of A too (with eigenvalue λ) by definition. Since this holds for
every Vλ, this provides a basis of joint eigenstates for V.
The proof is similar to [T.4-105], although now we are in an infinite-dimensional
space, so extra care is needed.
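A finite-dimensional illustration of [P.7-33] is easy to run: build two commuting Hermitian matrices and exhibit a common eigenbasis. The matrices A and B below are ad hoc examples constructed to commute (they are simultaneously diagonal in the same random orthogonal frame), which is an assumption of the sketch, not a general recipe for arbitrary input.

```python
import numpy as np

# Two commuting symmetric matrices share an eigenbasis.  A and B are built
# diagonal in the same (random) orthogonal frame U, so [A, B] = 0 by construction.
rng = np.random.default_rng(1)
U = np.linalg.qr(rng.normal(size=(3, 3)))[0]      # random orthogonal matrix
A = U @ np.diag([1.0, 1.0, 2.0]) @ U.T            # degenerate eigenvalue 1
B = U @ np.diag([5.0, 7.0, 7.0]) @ U.T
assert np.allclose(A @ B, B @ A)

# Diagonalizing A + t*B for a generic t picks out a joint eigenbasis:
vals, V = np.linalg.eigh(A + 0.1234 * B)
MA = V.T @ A @ V
MB = V.T @ B @ V
assert np.allclose(MA, np.diag(np.diag(MA)), atol=1e-8)   # A diagonal in basis V
assert np.allclose(MB, np.diag(np.diag(MB)), atol=1e-8)   # B diagonal in basis V
print("joint eigenbasis found")
```

Note how B resolves the degeneracy of A: within A's two-dimensional λ = 1 eigenspace, B distinguishes the eigenvectors, exactly the mechanism described for labelling degenerate states.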
E. 7-34
<Commuting observables> In one dimension, the energy bound states are
always non-degenerate. However, in three dimensions, energy bound states may
be degenerate. If λ is degenerate, then there is a large freedom in choosing an
orthonormal basis for Vλ . Physically, we cannot distinguish degenerate states
by measuring Q alone. So we would like to measure something else as well to
distinguish these eigenstates. When can we do this? We have previously seen
that we cannot simultaneously measure position and momentum. It turns out the
criterion for whether we can do this is simple.
Recall that after performing a measurement, the state is forced into the corre-
sponding eigenstate. Hence, to simultaneously measure two observables A and
B, we need the state to be simultaneously an eigenstate of A and B. In other
words, simultaneous measurement is possible if and only if there is a basis for V
consisting of simultaneous or joint eigenstates χn with
Aχn = λnχn, Bχn = µnχn.
We will attempt to find another operator A that commutes with H; then we can
find a common eigenstate, say with eigenvalue µ with respect to A. Then if VλH ≠ VµA,
we would be able to tell apart eigenstates in VλH by grouping them into those that
are in VλH ∩ VµA and those that are not. So we are able to further classify the
underlying states. When dealing with the hydrogen atom later, we will use the
angular momentum to separate the degenerate energy eigenstates.
A complete set of commuting observables (CSCO) is a set of commuting opera-
tors whose eigenvalues completely specify the state of a system.
We see that position and momentum in different directions don’t come into conflict.
We can have a definite position in one direction and a definite momentum in another
direction. The uncertainty principle only kicks in when we are in the same direction.
In particular, we have

(∆xi)(∆pj) ≥ (ℏ/2)δij.
Similar to what we did in classical mechanics, we assume our particles are structureless,
that is, particles for which all observables can be written in terms of position and mo-
mentum. In reality, many particles are not structureless, and (at least) possess an
additional quantity known as “spin”. We will not study this and will only mention
it briefly near the end.
The Hamiltonian for a structureless particle in a potential V is

H ≡ p̂²/(2m) + V(x̂) = −(ℏ²/2m)∇² + V(x).

The time-dependent Schrödinger equation is iℏ ∂Ψ/∂t = HΨ. The probability current and
the conservation equation which it obeys are

j ≡ −(iℏ/2m)(Ψ∗∇Ψ − Ψ∇Ψ∗),  ∂/∂t |Ψ(x, t)|² = −∇ · j.
7.6. QUANTUM MECHANICS IN THREE DIMENSIONS 303
The conservation equation implies that for any fixed volume V with boundary ∂V,

d/dt ∫V |Ψ(x, t)|² d³x = − ∫V ∇ · j d³x = − ∫∂V j · dS.

So if |Ψ(x, t)| → 0 sufficiently rapidly as |x| → ∞, then the boundary term disappears
and d/dt ∫ |Ψ(x, t)|² d³x = 0. This is the conservation of probability (or normaliza-
tion).
[A, BC] = ABC − BCA = ABC − BAC + BAC − BCA = [A, B]C + B[A, C].
The second version follows from [A, B] = −[B, A].
P. 7-39
1. [Li , Lj ] = i~εijk Lk .
2. [L2 , Li ] = 0.
3. [Li , x̂j ] = i~εijk x̂k and [Li , p̂j ] = i~εijk p̂k
1. We have

LiLj = εiar x̂ap̂r εjbs x̂bp̂s = εiarεjbs(x̂ap̂rx̂bp̂s) = εiarεjbs(x̂a(x̂bp̂r + [p̂r, x̂b])p̂s)
= εiarεjbs(x̂ax̂bp̂rp̂s − iℏδbr x̂ap̂s).

Similarly, LjLi = εiarεjbs(x̂bx̂ap̂sp̂r − iℏδas x̂bp̂r). Subtracting and simplifying,
the commutator works out to be [Li, Lj] = iℏεijkLk.
2. [L², Li] = Lj[Lj, Li] + [Lj, Li]Lj = iℏεjik(LjLk + LkLj) = 0,
where we get 0 since we are contracting an antisymmetric ε tensor with the
symmetric tensor LkLj + LjLk.
3. We will use the Leibniz property again:

[Li, x̂j] = εiab[x̂ap̂b, x̂j] = εiab([x̂a, x̂j]p̂b + x̂a[p̂b, x̂j]) = εiabx̂a(−iℏδbj) = iℏεijax̂a,
[Li, p̂j] = εiab[x̂ap̂b, p̂j] = εiab([x̂a, p̂j]p̂b + x̂a[p̂b, p̂j]) = εiab(iℏδajp̂b) = iℏεijbp̂b.
Recall that in classical dynamics, an important result is that angular momen-
tum is conserved in all directions. However, we can’t expect the full analogue in quantum
mechanics, since the angular momentum operators do not commute (result (1)),
so we cannot simultaneously measure all of them. This is why we introduce L². It captures the
total angular momentum, and it commutes with the angular momentum operators:
[L², Li] = 0 for all i. So (1) implies we cannot simultaneously measure, say, L1
and L2; the best that can be done is to measure L² and L3.
Note these operators involve only θ and ϕ. Furthermore, the expression for L² is
something we have all seen before: we have

∇² = (1/r) ∂²/∂r² (r · ) − (1/(ℏ²r²)) L².
where Pℓm is the associated Legendre function. For the simplest case m = 0, we have
Yℓ0 = const · Pℓ(cos θ), where Pℓ is the Legendre polynomial. The details are not impor-
tant; the important thing to take away is that there are solutions Yℓm(θ, ϕ).
H = p̂²/(2µ) + V = −(ℏ²/2µ)∇² + V(r) = −(ℏ²/2µ)(1/r) ∂²/∂r² (r · ) + (1/(2µr²))L² + V(r).
The first thing we want to check is that [Li , H] = [L2 , H] = 0. This implies we can
use the eigenvalues of H, L2 and L3 to label our solutions to the equation. We check
this using Cartesian coordinates. The kinetic term is
[Li , p̂2 ] = [Li , p̂j p̂j ] = [Li , p̂j ]p̂j + p̂j [Li , p̂j ] = i~εijk (p̂k p̂j + p̂j p̂k ) = 0
since we are contracting an antisymmetric tensor with a symmetric term. We can also
compute the commutator with the potential term:

[Li, V(r)] = −iℏεijk xj ∂V/∂xk = −iℏεijk xj V′(r)(xk/r) = 0,

using the fact that ∂r/∂xi = xi/r. Now that H, L² and L3 are a commuting set of
observables, we have the joint eigenstates ψ(x) = R(r)Yℓm(θ, ϕ).
The numbers ` and m are known as the angular momentum quantum numbers . Note
that ` = 0 is the special case where we have a spherically symmetric solution. Finally,
we solve the Schrödinger equation Hψ = Eψ to obtain
−(ℏ²/2µ)(1/r) d²/dr² (rR) + (ℏ²/(2µr²)) ℓ(ℓ + 1)R + VR = ER.

We often call R(r) the radial part of the wavefunction, defined on r ≥ 0. Of-
ten, it is convenient to work with χ(r) = rR(r), which is sometimes called the
radial wavefunction; it obeys the

<Radial Schrödinger equation> −(ℏ²/2µ)χ″ + (ℏ²ℓ(ℓ + 1)/(2µr²))χ + Vχ = Eχ.

Hence, ψ is normalizable if and only if ∫₀^∞ |R(r)|²r² dr = ∫₀^∞ |χ(r)|² dr < ∞.
E. 7-40
<Three-dimensional well> Suppose we have a spherically symmetric potential
given by

V(r) = −U for r < a,  V(r) = 0 for r ≥ a,

where U > 0 is a constant. We now look for bound state solutions to the
Schrödinger equation with −U < E < 0, with total angular momentum quan-
tum number ℓ. Our radial wavefunction χ obeys

χ″ − (ℓ(ℓ + 1)/r²)χ + k²χ = 0  with  U + E = ℏ²k²/(2µ)  for r < a,
χ″ − (ℓ(ℓ + 1)/r²)χ − κ²χ = 0  with  E = −ℏ²κ²/(2µ)  for r ≥ a.

(For ℓ = 1, say, the full wavefunction is

ψ(r) = R(r)Y1m(θ, ϕ) = (χ(r)/r)Y1m(θ, ϕ),

where m can take the values m = 0, ±1.) Solutions for general ℓ involve spherical
Bessel functions, which we’ll not look at here.
V(r) = −(e²/4πε0)(1/r).

This potential is due to a proton stationary at r = 0. We follow the results from the last
section, and set the mass to µ = me, the electron mass. The joint energy eigenstates of
H, L² and L3 are of the form ψ(x) = R(r)Yℓm(θ, ϕ).
When we simplify this mess, we see this holds if and only if (ℓ + 1)κ = λ. Hence, for
any integer n = ℓ + 1 = 1, 2, 3, · · · , there are bound states with energies

En = −(ℏ²λ²/2me)(1/n²) = −(1/2) me (e²/(4πε0ℏ))² (1/n²).

These are the energy levels of the Bohr model, but now derived from the Schrödinger
equation rather than within the framework of the Bohr model. However, there is a
slight difference. In our model, the total angular momentum eigenvalue is

ℏ²ℓ(ℓ + 1) = ℏ²n(n − 1),

which is not what the Bohr model predicted.
which is not what the Bohr model predicted. Nevertheless, this is not the full solution.
For each energy level, this only gives one possible angular momentum, but we know
that there can be many possible angular momentums for each energy level. So there
is more work to be done.
7.7. THE HYDROGEN ATOM 309
So either σ = ` or σ = −(` + 1). We discard the σ = −(` + 1) solution since this would
make f and hence R singular at r = 0. So we have σ = `. Given this, the coefficients
are then determined by

ap = (2(κ(p + ℓ) − λ) / (p(p + 2ℓ + 1))) a_{p−1},  p ≥ 1.
Similar to the harmonic oscillator, we now observe that, unless the series terminates,
we have ap/ap−1 ∼ 2κ/p as p → ∞, which matches the behaviour of rᵅe^{2κr} (for some
α). So R(r) is normalizable only if the series terminates. Hence the possible values of
κ are κn = λ/n for some integer n ≥ ℓ + 1. So the resulting energy levels are exactly those we
found before:

En = −(ℏ²/2me)κn² = −(ℏ²λ²/2me)(1/n²) = −(1/2) me (e²/(4πε0ℏ))² (1/n²)  for n ∈ ℕ.
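Putting numbers into this formula recovers the familiar hydrogen spectrum. The sketch below uses CODATA-style SI values for the constants (the particular rounded values are an assumption for illustration); E1 comes out near −13.6 eV.

```python
import math

# E_n = -(1/2) m_e (e^2 / (4 pi eps0 hbar))^2 / n^2, in SI units, printed in eV.
m_e  = 9.1093837e-31      # electron mass, kg
e    = 1.60217663e-19     # elementary charge, C
eps0 = 8.8541878e-12      # vacuum permittivity, F/m
hbar = 1.05457182e-34     # reduced Planck constant, J s

def E_n(n):
    return -0.5 * m_e * (e**2 / (4 * math.pi * eps0 * hbar))**2 / n**2

for n in (1, 2, 3):
    print(n, E_n(n) / e)   # in eV: about -13.6, -3.4, -1.51
```

The 1/n² scaling means E2/E1 = 1/4 exactly, independent of the constants.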
This n is called the principal quantum number. For any given n, the possible angular
momentum quantum numbers are ℓ = 0, 1, 2, · · · , n − 1 with m = 0, ±1, ±2, · · · , ±ℓ.
The simultaneous eigenstates are then

ψnℓm(x) = Rnℓ(r)Yℓm(θ, ϕ)  with  Rnℓ(r) = rℓ gnℓ(r)e^{−λr/n},

where gnℓ(r) are (proportional to) the associated Laguerre polynomials.
In general, the “shape” of the probability distribution for any electron state depends on
r and θ, ϕ mostly through Yℓm. For ℓ = 0, we have a spherically symmetric solution

ψn00(x) = gn0(r)e^{−λr/n}.
This is very different from the Bohr model, which says the energy levels depend only
on the angular momentum and nothing else. Here we can have many different angular
momenta for each energy level, and can even have no angular momentum at all.
The degeneracy of each energy level En is

Σ_{ℓ=0}^{n−1} Σ_{m=−ℓ}^{ℓ} 1 = Σ_{ℓ=0}^{n−1} (2ℓ + 1) = n².
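The counting above can be spelled out directly: for each n, ℓ runs over 0, · · · , n − 1 and m over −ℓ, · · · , ℓ, giving n² states.

```python
# Degeneracy of hydrogen level n: sum over l = 0..n-1 of (2l + 1) states.
def degeneracy(n):
    return sum(2 * l + 1 for l in range(n))

for n in range(1, 6):
    print(n, degeneracy(n))   # 1, 4, 9, 16, 25
```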
In fact the degeneracy of energy eigenstates reflects the symmetries in the Coulomb
potential. Moreover, the fact that we have n2 degenerate states implies that there is
a hidden symmetry, in addition to the obvious SO(3) rotational symmetry, since just
SO(3) itself should give rise to much fewer degenerate states.
So we have solved the hydrogen atom. However, this is only after we made a lot of
simplifying assumptions. It is worth revisiting these assumptions and seeing whether
they are actually significant.
1. We assumed that the proton is stationary at the origin and the electron moves
around it. We also took the mass to be µ = me. More accurately, we can consider
the motion relative to the centre of mass of the system, and we should take the
mass to be the reduced mass

µ = me mp/(me + mp),

just as in classical mechanics. However, the proton is so much heavier that the
reduced mass is very close to the electron mass. Hence, what
we’ve got is actually a good approximation. In principle, we can take this into
account, and this will change the energy levels very slightly.
3. We have always assumed that particles are structureless, namely that we can com-
pletely specify the properties of a particle by its position and momentum. However,
it turns out electrons (and protons and neutrons) have an additional internal degree
of freedom called spin. In particular, an electron has two spin states, called up and
down. Spin is a form of angular momentum, but with ℓ = 1/2 and m = ±1/2. It is
not due to orbital motion; orbital motion gives integer values of ℓ for well-behaved
wavefunctions. However, we call it angular momentum since angular momentum
is conserved only if we take spin into account as well.
Also, for each set of quantum numbers n, ℓ, m, since there are two possible spin states,
the total degeneracy of level En is then 2n². This agrees with what we know from
chemistry.
where ψi is any solution for the hydrogen atom, scaled appropriately by e2 7→ Ze2 to
accommodate for the larger charge of the nucleus. The energy is then
E = E1 + E2 + · · · + EZ .
We can next add in the electron-electron interaction terms, and find a more accurate
equation for ψ using something called perturbation theory. However, there is an ad-
ditional constraint on this. The Fermi-Dirac statistics, or Pauli exclusion principle,
states that no two identical fermions can be in the same state. In other words, if we attempt
to construct a multi-electron atom, we cannot put everything into the ground state.
We are forced to put some electrons in higher energy states. This is how chemical
reactivity arises, which depends on the occupancy of energy levels.
• For n = 1, we have 2n2 = 2 electron states. This is full for Z = 2, and this is
helium.
• For n = 2, we have 2n2 = 8 electron states. Hence the first two energy levels are
full for Z = 10, and this is neon.
These are rather stable elements, since to give them an additional electron, we must
put it in a higher energy level, which costs a lot of energy. We also expect reactive
atoms when the number of electrons is one more or less than the full energy levels.
These include hydrogen (Z = 1), lithium (Z = 3), fluorine (Z = 9) and sodium
(Z = 11).
This is a recognizable sketch of the periodic table. However, for n = 3 and above, this
model does not hold well. At these energy levels, electron-electron interactions become
important, and the world is not so simple.
CHAPTER 8
Markov chains
P. 8-4
If X is a homogeneous Markov chain, then
1. λ is a distribution, ie. λi ≥ 0 for all i and Σi λi = 1.
2. P is a stochastic matrix, ie. pi,j ≥ 0 for all i, j and Σj pi,j = 1 for all i.
Note that a stochastic matrix only requires the row sums to be 1; the column sums
need not be.
T. 8-5
Let λ be a distribution (on S) and P a stochastic matrix. Then X = (X0, X1, · · · )
is a (homogeneous) Markov chain with initial distribution λ and transition matrix
P iff for all n and i0, · · · , in we have

P(X0 = i0, X1 = i1, · · · , Xn = in) = λi0 pi0,i1 · · · pin−1,in.

From this, one deduces the (extended) Markov property: for events H depending only
on the past X0, X1, · · · , Xn−1 and F depending only on the future Xn+1, Xn+2, · · · ,
we have P(F | Xn = i, H) = P(F | Xn = i).
We only give a proof for when F depends on only the finite future, so H is given
in terms of X0, X1, · · · , Xn−1 and F is given in terms of Xn+1, Xn+2, · · · , XN
for some N > n. Then

P(F | Xn = i, H) = P(F, Xn = i, H) / P(Xn = i, H)
= (Σ_{<n} Σ_{>n} λi0 pi0,i1 · · · pin−1,i pi,in+1 · · · piN−1,iN) / (Σ_{<n} λi0 pi0,i1 · · · pin−1,i)
= Σ_{>n} pi,in+1 · · · piN−1,iN = P(F | Xn = i),

where Σ_{<n} sums over all sequences (i0, i1, · · · , in−1) corresponding to the event
H, and Σ_{>n} over all sequences (in+1, in+2, · · · , iN) corresponding to the event F.
8.1. MARKOV CHAINS 315
D. 8-7
Let X be a homogeneous Markov chain. The n-step transition probability from
i to j is pi,j(n) = P(Xn = j | X0 = i). We can write these as a matrix P(n) =
(pi,j(n))i,j∈S.
T. 8-8
<Chapman-Kolmogorov equation> pi,j(m + n) = Σk∈S pi,k(m)pk,j(n). That
is, P(m + n) = P(m)P(n). In particular, P(n) = P(1)P(n − 1) = · · · = P(1)ⁿ = Pⁿ.

pi,j(m + n) = P(Xm+n = j | X0 = i)
= Σk P(Xm+n = j | Xm = k, X0 = i)P(Xm = k | X0 = i)
= Σk P(Xm+n = j | Xm = k)P(Xm = k | X0 = i)
= Σk pi,k(m)pk,j(n).
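The matrix identity P(m + n) = P(m)P(n) is immediate to check numerically. The 3-state stochastic matrix below is an arbitrary illustrative choice.

```python
import numpy as np

# Check Chapman-Kolmogorov: P(m+n) = P(m) P(n), i.e. P(n) = P^n,
# on an arbitrary 3-state stochastic matrix.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
assert np.allclose(P.sum(axis=1), 1)          # stochastic: rows sum to 1

m, n = 3, 4
lhs = np.linalg.matrix_power(P, m + n)        # P(m+n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
assert np.allclose(lhs, rhs)
assert np.allclose(lhs.sum(axis=1), 1)        # P(m+n) is again stochastic
print("Chapman-Kolmogorov verified")
```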
We can now attempt to compute p1,2(n). We know that it must be of the form

p1,2(n) = Aκ1ⁿ + Bκ2ⁿ = A + B(1 − α − β)ⁿ,

where A and B are constants coming from U and U⁻¹. However, we know that
p1,2(0) = 0 and p1,2(1) = α. So we obtain A + B = 0 and A + B(1 − α − β) = α.
Solving these, we obtain

p1,2(n) = (α/(α + β))(1 − (1 − α − β)ⁿ) = 1 − p1,1(n),
since the row sums equal 1, as P(n) must also be stochastic. How about p2,1 and p2,2?
Well, we don’t need additional work: we can obtain these simply by interchanging
α and β. So we obtain p2,1(n) = (β/(α + β))(1 − (1 − α − β)ⁿ) and p2,2(n) = 1 − p2,1(n).
In particular, for 0 < α + β < 2, as n → ∞ both rows of P(n) tend to
(β/(α + β), α/(α + β)). We see that the two rows are the same. This means that
as time goes on, where we end up does not depend on where we started. We will
later (near the end of the course) see that this is generally true for most Markov chains.
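The closed form for the two-state chain can be verified against direct matrix powers; the specific values of α and β below are an illustrative assumption.

```python
import numpy as np

# Verify p_{1,2}(n) = (a/(a+b)) (1 - (1-a-b)^n) for P = [[1-a, a], [b, 1-b]],
# and watch both rows converge to (b/(a+b), a/(a+b)).
a, b = 0.3, 0.5
P = np.array([[1 - a, a], [b, 1 - b]])
for n in range(20):
    exact = (a / (a + b)) * (1 - (1 - a - b)**n)
    assert np.isclose(np.linalg.matrix_power(P, n)[0, 1], exact)

print(np.linalg.matrix_power(P, 50))   # rows both close to [0.625, 0.375]
```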
D. 8-12
• The equivalence classes of ↔ are communicating classes .
E. 8-13
Note that intuitively a subset is closed if we cannot escape from it. Note also that
different communicating classes are not completely isolated. Within a communi-
cating class A, of course we can move between any two vertices. However, it is also
possible that we can escape from class A to a different class B. It is just that
after going to B, we cannot return to class A. From B, we might be able to get to
another class C. We can jump around all the time, but (if there are finitely many
communicating classes) eventually we have to stop when we have visited every
class we can reach. Then we are bound to stay in that class, i.e. the class is closed. Since we
are eventually going to be stuck in that class anyway, often we can just consider
this final communicating class and ignore the others. So wlog we can assume that
the chain only has one communicating class, i.e. it is irreducible.
P. 8-14
A subset C is closed iff j ∈ C whenever i ∈ C and i → j.
Suppose C is closed, i ∈ C and i → j. Then there is some route i, i1, · · · , im−1, j
such that pi,i1, pi1,i2, · · · , pim−1,j > 0.
Since pi,i1 > 0, we have i1 ∈ C as C is closed. Since pi1,i2 > 0, we have i2 ∈ C.
By induction, we get that j ∈ C.
E. 8-15
Consider S = {1, 2, 3, 4, 5, 6} with transition matrix

P =
( 1/2  1/2   0    0    0    0 )
(  0    0    1    0    0    0 )
( 1/3   0    0   1/3  1/3   0 )
(  0    0    0   1/2  1/2   0 )
(  0    0    0    0    0    1 )
(  0    0    0    0    1    0 )

We see that the communicating classes are {1, 2, 3}, {4}, {5, 6}, where {5, 6} is
closed.
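Communicating classes can be computed mechanically: i ↔ j iff each state can reach the other, so the classes are the mutually-reachable groups of the directed graph with an edge i → j whenever p_{i,j} > 0. A small sketch on the matrix of this example:

```python
import numpy as np

# Communicating classes of E.8-15 via the reachability relation.
P = np.array([
    [1/2, 1/2, 0,   0,   0,   0],
    [0,   0,   1,   0,   0,   0],
    [1/3, 0,   0,   1/3, 1/3, 0],
    [0,   0,   0,   1/2, 1/2, 0],
    [0,   0,   0,   0,   0,   1],
    [0,   0,   0,   0,   1,   0],
])
n = len(P)
# reach[i, j]: can we get from i to j in some number of steps (including 0)?
reach = (P > 0) | np.eye(n, dtype=bool)
for _ in range(n):                       # transitive closure by repeated squaring
    reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)

classes = {frozenset(j + 1 for j in range(n) if reach[i, j] and reach[j, i])
           for i in range(n)}
print([sorted(c) for c in sorted(classes, key=min)])   # [[1, 2, 3], [4], [5, 6]]

# A class is closed iff no one-step transition leaves it.
closed = [c for c in classes
          if all(P[i - 1, j] == 0 or (j + 1) in c for i in c for j in range(n))]
print([sorted(c) for c in closed])                     # [[5, 6]]
```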
D. 8-16
For convenience, we write Pi(A) = P(A | X0 = i) and Ei(Z) = E(Z | X0 = i).
⁴ An equivalent definition is that C is closed if j ∈ C whenever i → j and i ∈ C.
The left hand side is almost the generating function Pi,j(s), except that we are
missing the n = 0 term, which is pi,j(0) = δi,j. The right hand side is the “convo-
lution” of the power series Pj,j(s) and Fi,j(s), which we can write as the product
Pj,j(s)Fi,j(s). So Pi,j(s) − δi,j = Pj,j(s)Fi,j(s).
⁵ Here we require n ≥ 1, otherwise Ti would always be 0. Also Tj does not necessarily exist,
since {n ≥ 1 : Xn = j} might be empty.
8.2. CLASSIFICATION OF CHAINS AND STATES 319
L. 8-19
<Abel’s lemma> Let a1, a2, · · · be positive real numbers such that f(x) =
Σn anxⁿ converges for |x| < 1. Then limx→1− f(x) = Σn an (possibly infinite).

Write sn = a1 + · · · + an, and suppose first that sn → a < ∞. Given any
ε > 0, we can choose N such that ∀n > N, |sn − a| < ε/2. Since
(1 − x)Σ_{n=0}^{N}(sn − a)xⁿ is a continuous function that takes value 0 when x = 1,
∃δ > 0 s.t. ∀x ∈ (1 − δ, 1), |(1 − x)Σ_{n=0}^{N}(sn − a)xⁿ| < ε/2. Now, using
f(x) = (1 − x)Σn snxⁿ, for all x ∈ (1 − δ, 1) we have

|f(x) − a| < ε/2 + (1 − x)Σ_{n=N+1}^{∞} |sn − a|xⁿ ≤ ε/2 + (ε/2)(1 − x)Σ_{n=N+1}^{∞} xⁿ ≤ ε.

Hence limx→1− f(x) = Σn an. Now suppose Σn an does not converge; since its terms
are positive, sN = Σ_{n=0}^{N} an → ∞ as N → ∞. Given R ∈ ℝ, we can pick N such
that ∀n ≥ N, sn > 2R. Also ∃δ > 0 s.t. ∀x ∈ (1 − δ, 1), xᴺ > 1/2. Now

(1 − x)Σ_{n=0}^{∞} snxⁿ ≥ (1 − x)Σ_{n=N}^{∞} snxⁿ > 2R(1 − x)Σ_{n=N}^{∞} xⁿ = 2Rxᴺ ≥ R.

So limx→1− f(x) = ∞ = Σn an.
T. 8-20
1. i is recurrent iff Σn pi,i(n) = ∞.
2. If j is transient, then Σn pi,j(n) < ∞ for all states i.

Since |s| < 1, we have Fi,i(s) < 1. So we are not dividing by zero. Pi,i(s)
converges for |s| < 1, since it is bounded by the geometric series Σn sⁿ. So by
Abel’s lemma, lims→1 Pi,i(s) = Pi,i(1) = Σn pi,i(n). Similarly, we also have

lims→1 1/(1 − Fi,i(s)) = 1/(1 − lims→1 Fi,i(s)) = 1/(1 − Σn fi,i(n)).

Hence we have Σn pi,i(n) = 1/(1 − Σn fi,i(n)). Since Σn fi,i(n) is the probabil-
ity of ever returning, the probability of ever returning is 1 iff Σn pi,i(n) = ∞.

2. By part 1, Pj,j(1) = Σn pj,j(n) < ∞, so Σn pi,j(n) = Pi,j(1) = δi,j +
Fi,j(1)Pj,j(1) < ∞.
T. 8-21
Let C be a communicating class. Then
1. Either every state in C is recurrent, or every state is transient.
2. If C contains a recurrent state, then C is closed.
2. If C is not closed, then there is a non-zero probability that we leave the class
and never get back. So the states are not recurrent.
E. 8-22
There is a profound difference between a finite state space and an infinite state
space. A finite state space can be represented by a finite matrix, and we are all
very familiar with finite matrices. We can use everything we know about finite
matrices. However, infinite matrices are weirder.

For example, any finite transition matrix P has an eigenvalue of 1. This is since
the row sums of a transition matrix are always 1. So if we multiply P by e =
(1, 1, · · · , 1), then we get e again. However, this is not true for infinite matrices,
since we don't usually allow arbitrary infinite vectors. To avoid getting
infinitely large numbers when multiplying vectors and matrices, we usually restrict
our focus to vectors x such that Σ xi² is finite. In this case the vector e is not
allowed, and the transition matrix need not have eigenvalue 1.
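A quick numerical illustration of the finite case (the 3-state matrix here is an arbitrary example of ours, not one from the text): multiplying any finite stochastic matrix by e = (1, 1, · · · , 1) returns e.

```python
# For a finite stochastic matrix P (rows sum to 1), Pe = e, so 1 is an eigenvalue.
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.2, 0.7],
    [0.0, 0.3, 0.7],
]
e = [1.0, 1.0, 1.0]
Pe = [sum(row[j] * e[j] for j in range(3)) for row in P]
assert all(abs(x - 1.0) < 1e-12 for x in Pe)   # Pe = e up to rounding
```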
Another thing about a finite state space is that probability “cannot escape”. Each
step of a Markov chain gives a probability distribution on the state space, and
we can imagine the progression of the chain as a flow of probabilities around the
state space. If we have a finite state space, then all the probability flow must be
contained within our finite state space. However, if we have an infinite state space,
then probabilities can just drift away to infinity.
T. 8-23
In a finite state space,
1. There exists at least one recurrent state.
2. If the chain is irreducible, every state is recurrent.
T. 8-25
<Pólya’s theorem> The symmetric random walk on Zd is recurrent for d = 1, 2
and transient for d ≥ 3.
We will start with the case d = 1. We want to show that Σ p0,0(n) = ∞; then
we know the origin is recurrent. It is impossible to get back to the origin in an
odd number of steps, so we can instead consider Σ p0,0(2n). To return to the
origin after 2n steps, we need to have made n steps to the left, and n steps to the
right, in any order. So we have

  p0,0(2n) = P(n steps left, n steps right) = C(2n, n) (1/2)^{2n},

where C(2n, n) = (2n)!/(n!)² is the binomial coefficient. Using Stirling's formula
n! ≃ √(2πn) (n/e)^n, we get p0,0(2n) ∼ 1/√(πn), which dominates a constant multiple
of the harmonic series. So we have Σ p0,0(2n) = ∞.
In the d = 2 case, suppose after 2n steps, I have taken r steps right, ℓ steps left,
u steps up and d steps down. We must have r + ℓ + u + d = 2n, and we need
r = ℓ, u = d to return to the origin. Let r = ℓ = m and u = d = n − m, then

  p0,0(2n) = (1/4)^{2n} Σ_{m=0}^n (2n)!/((m!)²((n − m)!)²)
           = (1/4)^{2n} C(2n, n) Σ_{m=0}^n (n!/(m!(n − m)!))²
           = (1/4)^{2n} C(2n, n) Σ_{m=0}^n C(n, m) C(n, n − m)
           = (1/4)^{2n} C(2n, n)² = [C(2n, n)(1/2)^{2n}]² ∼ 1/(πn),

using the Vandermonde identity Σ_m C(n, m)C(n, n − m) = C(2n, n) and Stirling as
before. So the sum diverges. So this is recurrent. Note that the two-dimensional
probability turns out to be the square of the one-dimensional probability. This is not
a coincidence, and we will explain this after the proof. However, this does not
extend to higher dimensions.
This time, there is no neat combinatorial formula. Since we want to show this is
summable, we can try to bound it from above. To return to the origin after 2n steps
in d = 3 we need i pairs of ±x steps, j pairs of ±y steps and k pairs of ±z steps,
with i + j + k = n. So we have

  p0,0(2n) = (1/6)^{2n} C(2n, n) Σ_{i+j+k=n} (n!/(i!j!k!))²
           = (1/2)^{2n} C(2n, n) Σ_{i+j+k=n} (n!/(3^n i!j!k!))²
           ≤ (1/2)^{2n} C(2n, n) [max_{i+j+k=n} n!/(3^n i!j!k!)] Σ_{i+j+k=n} n!/(3^n i!j!k!).

Now we will use the identity Σ_{i+j+k=n} n!/(3^n i!j!k!) = 1, which can be obtained as
follows: suppose we have three urns, and throw n balls into them. Then the probability
of getting i balls in the first, j in the second and k in the third is exactly n!/(3^n i!j!k!).
Summing over all possible combinations of i, j and k gives the total probability
of getting any configuration, which is 1. To find the maximum of n!/(3^n i!j!k!),
we could replace the factorials by the gamma function and use Lagrange multipliers.
However, we will just argue that the maximum is achieved when i, j and
k are as close to each other as possible. So we get

  p0,0(2n) ≤ (1/2)^{2n} C(2n, n) n!/(3^n (⌊n/3⌋!)³) ∼ C n^{−3/2}

for some constant C, using Stirling's formula. So Σ p0,0(2n) < ∞ and the
chain is transient. We can prove this similarly for higher dimensions.
Intuitively, it makes sense that we get recurrence only in low dimensions,
since with more dimensions it is easier to get lost.
Let's get back to why the two-dimensional probability is the square of the one-dimensional
probability. This square might remind us of independence. However,
it is obviously not true that horizontal movement and vertical movement are
independent: if we go sideways in one step, then we cannot move vertically. But
we can "fix" this: we write Xn = (An, Bn) and record our coordinates in a pair of
axes rotated by 45° (and then stretched). The new coordinates are

  Un = An − Bn,   Vn = An + Bn.

Each step of the walk changes both Un and Vn by ±1, and the four sign combinations
(±1, ±1) occur with probability 1/4 each. So Un and Vn are independent simple
random walks on Z, and

  P(X2n = 0) = P(U2n = 0, V2n = 0) = P(U2n = 0)P(V2n = 0),

which is exactly the square of the one-dimensional return probability.
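The return probabilities in the proof can be evaluated exactly with rational arithmetic. A sketch (the function names are ours): it checks the square relation in d = 2 and the decay of the d = 3 probabilities.

```python
# Exact return probabilities p00(2n) for the symmetric walk in d = 1, 2, 3.
from math import comb, factorial
from fractions import Fraction as F

def p00_1d(n):
    # C(2n, n) / 2^(2n)
    return F(comb(2 * n, n), 4 ** n)

def p00_2d(n):
    # C(2n, n)^2 / 4^(2n) -- the square of the 1d probability
    return F(comb(2 * n, n) ** 2, 16 ** n)

def p00_3d(n):
    # (1/6)^(2n) * sum over i + j + k = n of (2n)! / (i! j! k!)^2
    total = sum(
        F(factorial(2 * n),
          (factorial(i) * factorial(j) * factorial(n - i - j)) ** 2)
        for i in range(n + 1) for j in range(n + 1 - i)
    )
    return total / 36 ** n
```

For instance p00_3d(1) = 1/6, and the d = 3 values decay roughly like n^{−3/2}, so their sum converges.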
D. 8-26
The hitting time of A ⊆ S is the random variable H^A = min{n ≥ 0 : Xn ∈ A}.⁶
Also we write ki^A = Ei(H^A) and hi^A = Pi(H^A < ∞) = Pi(ever reach A). If A is
closed, then hi^A is called an absorption probability.
E. 8-27
Note that we have to be careful finding Ei(H^A) = ki^A. If there is a chance that we
never hit A, then H^A could be infinite, and Ei(H^A) = ∞. This occurs if hi^A < 1.
So often we are only interested in the case where hi^A = 1. Note however that
hi^A = 1 does not imply that ki^A < ∞; it is merely a necessary condition.
T. 8-28
The vector (hi^A : i ∈ S) satisfies

  hi^A = 1 if i ∈ A,   hi^A = Σ_{j∈S} pi,j hj^A if i ∉ A,

and is minimal in that for any non-negative solution (xi : i ∈ S) to these equations,
we have hi^A ≤ xi for all i.

By definition, hi^A = 1 if i ∈ A. Otherwise, we have

  hi^A = Pi(H^A < ∞) = Σ_{j∈S} Pi(H^A < ∞ | X1 = j) pi,j = Σ_{j∈S} hj^A pi,j.

So hi^A is indeed a solution to the equations. To show that hi^A is the minimal
solution, suppose x = (xi : i ∈ S) is a non-negative solution, ie.

  xi = 1 if i ∈ A,   xi = Σ_{j∈S} pi,j xj if i ∉ A.

If i ∈ A then hi^A = 1 = xi. For i ∉ A, split the sum and substitute the equations
into themselves:

  xi = Σ_{j∈A} pi,j + Σ_{j∉A} pi,j xj
     ≥ Pi(H^A = 1) + Σ_{j∉A, k∈A} pi,j pj,k = Pi(H^A = 1) + Pi(H^A = 2) = Pi(H^A ≤ 2).

By induction, xi ≥ Pi(H^A ≤ n) for all n, and letting n → ∞ gives xi ≥ Pi(H^A < ∞) = hi^A.
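The minimality also suggests a computation: iterating the equations starting from x = 0 off A produces exactly Pi(H^A ≤ n), which increases to the minimal solution. A sketch on a chain where the answer is known (the example chain is ours, not from the text): the symmetric walk on {0, · · · , N} with both endpoints absorbing and A = {0}, for which hi^A = 1 − i/N.

```python
# Hitting probabilities as the minimal solution, by value iteration from 0.
N = 10
A = {0}

def step(x):
    y = x[:]
    for i in range(N + 1):
        if i in A:
            y[i] = 1.0
        elif i == N:                       # N is absorbing but not in A
            y[i] = x[N]
        else:                              # symmetric walk in the interior
            y[i] = 0.5 * x[i - 1] + 0.5 * x[i + 1]
    return y

h = [1.0 if i in A else 0.0 for i in range(N + 1)]
for _ in range(20000):                     # plenty of iterations to converge here
    h = step(h)
```

After n iterations, h[i] equals Pi(H^A ≤ n); in the limit it matches 1 − i/N.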
T. 8-29
(ki^A : i ∈ S) is the minimal non-negative solution to

  ki^A = 0 if i ∈ A,   ki^A = 1 + Σ_j pi,j kj^A if i ∉ A.

Note that we have the extra "1 +" since when we move from i to j, one step has
already passed. Arguing as in the previous theorem, ki^A solves these equations,
and for any non-negative solution (yi : i ∈ S) we know by induction that
yi ≥ Pi(H^A ≥ 1) + · · · + Pi(H^A ≥ n) for all n. Letting n → ∞,
yi ≥ Σ_{m≥1} Pi(H^A ≥ m) = Σ_{m≥1} m Pi(H^A = m) = ki^A.
E. 8-30
<Gambler’s ruin> This time, we will consider a random walk on N. In each
step, we either move to the right with probability p, or to the left with probability
q = 1 − p. What is the probability of ever hitting 0 from a given initial point? In
other words, we want to find hi = hi^{0}. Conditioning on the first step, these satisfy

  p hi+1 − hi + q hi−1 = 0,  i ≥ 1,

with the boundary condition h0 = 1. If p ≠ q, ie. p ≠ 1/2, then the general solution
has the form hi = A + B(q/p)^i for i ≥ 0.

If p < q, then for large i, (q/p)^i is very large and blows up. However, since hi is
a probability, it can never blow up. So we must have B = 0. So hi is constant.
Since h0 = 1, we have hi = 1 for all i. So we always get to 0.

If p > q, since h0 = 1, we have A + B = 1. So hi = (q/p)^i + A(1 − (q/p)^i). This is
in fact a solution for all A. So we want to find the smallest solution. As i → ∞,
we get hi → A. Since hi ≥ 0, we know that A ≥ 0. Subject to this constraint, the
minimum is attained when A = 0 (since (q/p)^i and 1 − (q/p)^i are both positive).
So we have hi = (q/p)^i.

If p = q, then by similar arguments, hi = 1 for all i.
There is another way to solve this. We can give ourselves a ceiling: we also
stop when we hit some level k > 0, setting hk = 0. We now have two boundary conditions
and can find a unique solution. Then we take the limit as k → ∞. We have seen
similar methods in IA Probability.
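The ceiling idea also gives a simple Monte Carlo check of hi = (q/p)^i (the parameters and the cutoff k = 60 below are arbitrary choices of ours; reaching the ceiling is treated as escape, and the bias vanishes as the ceiling grows):

```python
# Monte Carlo estimate of the ruin probability h_i for p > q.
import random

random.seed(1)
p, i0, ceiling, trials = 0.6, 3, 60, 20000

hits = 0
for _ in range(trials):
    x = i0
    while 0 < x < ceiling:                 # stop at 0 or at the ceiling
        x += 1 if random.random() < p else -1
    hits += (x == 0)

estimate = hits / trials
exact = ((1 - p) / p) ** i0                # (q/p)^i, here (2/3)^3
```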
E. 8-31
<Birth-death chain> Let (pi : i ≥ 0) be an arbitrary sequence such that
pi ∈ (0, 1), and write qi = 1 − pi . Let N be our state space and define the
transition probabilities to be pi,i+1 = pi and pi,i−1 = qi for i ≥ 1 and p0,1 = p0
and p0,0 = q0. What is hi^{0}?

We write hi = hi^{0}. Conditioning on the first step as before, h0 = 1 and

  pi hi+1 − hi + qi hi−1 = 0,  i ≥ 1.  (∗)

This is no longer a constant-coefficient difference equation, since the coefficients
depend on the index i. To solve this, we rewrite it as
pi hi+1 − hi + qi hi−1 = pi(hi+1 − hi) − qi(hi − hi−1). We let ui = hi−1 − hi,⁷ then
our equation becomes

  ui+1 = (qi/pi) ui  ⟹  ui+1 = (qi/pi)(qi−1/pi−1) · · · (q1/p1) u1.

Write γi = (qi/pi)(qi−1/pi−1) · · · (q1/p1), with γ0 = 1, so that ui+1 = γi u1 and
hi = h0 − (u1 + · · · + ui) = 1 − u1 Σ_{k=0}^{i−1} γk.

This solves (∗) for any value of u1; but our theorem tells us hi is the minimal
solution, which pins down u1. Note that S = Σ_{i=0}^∞ γi either diverges or converges.
If S = ∞, then we must have u1 = 0 and so hi = 1 for all i. This is since hi cannot
blow up, as 0 ≤ hi ≤ 1. If S is finite, then u1 can be non-zero. We know that the γi
are all positive. So to minimize hi, we need to maximize u1. Since 0 ≤ hi, the maximum
possible value of u1 is such that 0 = 1 − u1 S. In other words, u1 = 1/S. So we have

  hi = (Σ_{k=i}^∞ γk) / (Σ_{k=0}^∞ γk).
This is a more general version of the random walk, in contrast to the random walk
where the pi sequence is constant. It is also a general model for population
growth, where the change in population depends on what the current population
is. Here each "step" does not correspond to some unit of time, since births and
deaths occur rather randomly; instead, we just make a "step" whenever some
birth or death occurs, regardless of when it occurs. If we have no
people left, then it is impossible for us to reproduce and get more population, so
we might want to have p0,0 = 1. In this case 0 is absorbing in that {0} is closed.
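The closed-form answer is easy to evaluate numerically by truncating the series S = Σ γi (assumed convergent; the truncation length below is an arbitrary choice of ours). Taking a constant pi sequence recovers the random-walk answer (q/p)^i as a sanity check:

```python
# Hitting probability of 0 for a birth-death chain: h_i = (sum_{k>=i} gamma_k) / S.
from fractions import Fraction as F

def h(i, p_seq, terms=200):
    gammas = [F(1)]                        # gamma_0 = 1
    for k in range(1, terms):
        pk = p_seq(k)                      # gamma_k = gamma_{k-1} * q_k / p_k
        gammas.append(gammas[-1] * (1 - pk) / pk)
    return sum(gammas[i:]) / sum(gammas)

p_const = lambda k: F(2, 3)                # constant p = 2/3: h_i ~ (q/p)^i = (1/2)^i
```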
D. 8-32
A random variable T (which is a function Ω → N ∪ {∞}) is a stopping time for
the Markov chain X = (Xn ) if for n ≥ 0, the event {T = n} is given in terms of
X0 , · · · , Xn .
7
Letting ui = hi − hi−1 might seem more natural, but this definition makes ui positive
E. 8-33
For example, suppose we are in a casino and gambling. We let Xn be the amount
of money we have at time n. Then we can set our stopping time as “the time
when we have $10 left”. This is a stopping time, in the sense that we can use this
as a guide to when to stop — it is certainly possible to set yourself a guide that
you should leave the casino when you have $10 left. However, it does not make
sense to say “I will leave if the next game will make me bankrupt”, since there is
no way to tell if the next game will make you bankrupt (it certainly will not if you
win the game!). Hence this is not a stopping time.
The hitting time H^A is a stopping time, since {H^A = n} = {Xi ∉ A for i <
n} ∩ {Xn ∈ A}. We also know that H^A + 1 is a stopping time, since {H^A + 1 = n}
depends only on Xi for i ≤ n − 1. However, H^A − 1 is not a stopping time, since
{H^A − 1 = n} depends on Xn+1.
T. 8-34
<Strong Markov property> Let X be a Markov chain with transition matrix
P , and let T be a stopping time for X. Given T < ∞ and XT = i, the chain
Y = (Yk : k ≥ 0) given by Yk = XT +k is a Markov chain with transition matrix
P and initial distribution XT +0 = i, and this Markov chain is independent of
X0 , · · · , XT .
(Here Bm denotes {T = m}, B = {T < ∞}, A = {XT = i}, and Ak = {XT+k = ik} for
arbitrary states i1, · · · , in; the calculation applies the Markov property at each
fixed time m.) So

  P((∩_{k=1}^n Ak) ∩ A ∩ Bm) = P(A ∩ Bm) pi,i1 pi1,i2 · · · pin−1,in;

now summing over m ∈ N0 we get P(∩_{k=1}^n Ak | A ∩ B) = pi,i1 pi1,i2 · · · pin−1,in.
Hence Y is a Markov chain with transition matrix P and initial distribution
XT+0 = i. Now let H be an event given in terms of X0, X1, · · · , XT−1. The event
H ∩ Bm is given in terms of X0, X1, · · · , Xm−1, so by the Markov property at time m
we have

  pi,i1 pi1,i2 · · · pin−1,in P(A ∩ Bm ∩ H) = P(A ∩ Bm ∩ H ∩ (∩_{k=1}^n Ak)),

since pi,i1 pi1,i2 · · · pin−1,in = P(∩_{k=1}^n Ak | A ∩ Bm). Summing over m ∈ N0 we get

  pi,i1 pi1,i2 · · · pin−1,in P(A ∩ B ∩ H) = P(A ∩ B ∩ H ∩ (∩_{k=1}^n Ak))

  ⟹  P(∩_{k=1}^n Ak | A ∩ B) P(H | A ∩ B) = P(H ∩ (∩_{k=1}^n Ak) | A ∩ B),

so Y is independent of X0, · · · , XT−1 given T < ∞ and XT = i.
The "Markov property" we saw at the start of the chapter is the weak Markov
property. In probability, we often have "strong" and "weak" versions of things.
For example, we have the strong and weak law of large numbers. The difference
is that the weak versions are expressed in terms of probabilities, while the strong
versions are expressed in terms of random variables.
Initially, when people first started developing probability theory, they just talked
about probability distributions like the Poisson distribution or the normal distri-
bution. However, later it turned out it is often nicer to talk about random variables
instead. After messing with random variables, we can just take expectations or
evaluate probabilities to get the corresponding statement about probability distri-
butions. Hence usually the “strong” versions imply the “weak” version, but not
the other way round.
E. 8-35
<Gambler’s ruin> Again, this is the Markov chain taking values on the non-
negative integers, moving to the right with probability p and left with probability
q = 1 − p. 0 is an absorbing state, since we have no money left to bet if we are
broke. Instead of computing the probability of hitting zero, we want to find the
time it takes to get to 0, ie.
H = inf{n ≥ 0 : Xn = 0}.
Here we let the infimum of the empty set be +∞, ie. if we never hit zero, we say
it takes infinite time. What is the distribution of H? We define the generating
function
  Gi(s) = Ei(s^H) = Σ_{n=0}^∞ s^n Pi(H = n),  |s| < 1.

Conditioning on the first step, we have

  G1(s) = p E1(s^H | X1 = 2) + q E1(s^H | X1 = 0).

How can we simplify this? The second term is easy, since if X1 = 0, then we must
have H = 1. So E1(s^H | X1 = 0) = s. The first term is more tricky. We are now
at 2. To get to 0, we need to pass through 1. So the time needed to get to 0 is the
time to get from 2 to 1 (say H′), plus the time to get from 1 to 0 (say H′′). We
know that H′ and H′′ have the same distribution as H, and by the strong Markov
property, they are independent. So

  G1(s) = p E1(s^{H′+H′′+1}) + qs = ps G1(s)² + qs  ⟹  G1(s) = (1 ± √(1 − 4pqs²)) / (2ps).

Since G1(0) = 0 (as H ≥ 1), we must take the minus sign: G1(s) = (1 − √(1 − 4pqs²))/(2ps).
Using this, we can also find μ = E1(H). Firstly, if p > q, then it is possible that
H = ∞, so μ = ∞. If p ≤ q, we can find μ by differentiating G1(s) and evaluating
at s = 1. Doing this directly would result in horrible and messy algebra, which we
want to avoid. Instead, we differentiate G1 = psG1² + qs and obtain

  G1′ = pG1² + 2psG1G1′ + q  ⟹  G1′(s) = (pG1(s)² + q) / (1 − 2psG1(s)).

Hence

  μ = lims→1 G1′(s) = ∞ if p = 1/2,   μ = 1/(q − p) if p < 1/2.
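The value μ = 1/(q − p) is easy to check by direct simulation of the walk started at 1 (p = 0.3 here, so μ = 1/0.4 = 2.5; the parameters are illustrative choices of ours):

```python
# Estimate E_1(H), the mean time to hit 0 from 1, for p < 1/2.
import random

random.seed(2)
p, trials = 0.3, 40000
total = 0
for _ in range(trials):
    x, t = 1, 0
    while x > 0:                           # negative drift, so 0 is hit a.s.
        x += 1 if random.random() < p else -1
        t += 1
    total += t

mean_time = total / trials                 # should be close to 1/(q - p) = 2.5
```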
D. 8-36
• Let Ti be the returning time to a state i as defined before. The mean recurrence time
of i is

  μi = Ei(Ti) = ∞ if i is transient,   μi = Σ_{n=1}^∞ n fi,i(n) if i is recurrent.
T. 8-38
Suppose X0 = i. Let Vi = |{n ≥ 1 : Xn = i}| and Fi,i = Pi(Ti < ∞). Then
Pi(Vi = r) = Fi,i^r (1 − Fi,i) (ie. Vi has a genuine geometric distribution when
Fi,i < 1). In particular Pi(Vi = ∞) = 1 if i is recurrent and Pi(Vi < ∞) = 1 if i is
transient.

Let Ti^r be the time at which the r-th visit back to i takes place, with Ti^r = ∞ if
Vi < r. Since the Ti^r are increasing in r, we have {Vi ≥ r} = {Ti^r < ∞}, and by
the strong Markov property Pi(Vi ≥ r + 1 | Vi ≥ r) = Fi,i.
By iteration we have Pi(Vi ≥ r) = Fi,i^r, hence Pi(Vi = r) = Fi,i^r (1 − Fi,i). So if
Fi,i = 1 (ie. i is recurrent), then Pi(Vi = r) = 0 for all r. So Pi(Vi = ∞) = 1.
Otherwise if Fi,i < 1 (ie. i is transient), Pi(Vi = r) is a genuine geometric
distribution, and we get Pi(Vi < ∞) = 1.

Intuitively we have Pi(Vi = r) = Fi,i^r (1 − Fi,i) since we have to return r times,
each with probability Fi,i, and then never return. Note that this result says that
if a state is recurrent, then we return to it infinitely many times (almost surely).
T. 8-39
If i ↔ j are communicating, then
1. di = dj .
2. i is recurrent iff j is recurrent.
3. i is positive recurrent iff j is positive recurrent.
4. i is ergodic iff j is ergodic.
1. Assume i ↔ j. Then there are m, n ≥ 1 with pi,j(m), pj,i(n) > 0. By the
Chapman-Kolmogorov equation, we know that

  pi,i(m + r + n) ≥ pi,j(m) pj,j(r) pj,i(n) = α pj,j(r),

where α = pi,j(m)pj,i(n) > 0. Now let Dj = {r ≥ 1 : pj,j(r) > 0}. Then by
definition, dj = hcf Dj. Since pi,i(m + n) ≥ α > 0 and pi,i(m + n + r) > 0 for every
r ∈ Dj, we see that di divides m + n and m + n + r, hence di divides every r ∈ Dj,
so di | dj. By symmetry dj | di, and so di = dj.

This result says that recurrence, period, positive recurrence and ergodicity are class
properties: if two states are in the same communicating class, then they are
either both recurrent, or both transient; both have the same period; etc.
P. 8-40
If the chain is irreducible and j ∈ S is recurrent, then P(Xn = j for some n ≥
1) = 1 regardless of the distribution of X0 .8
Let fi,j = Pi(Xn = j for some n ≥ 1). Since j → i, there exists a least integer
m ≥ 1 with pj,i(m) > 0. Since m is least, we know that pj,i(m) = Pj(Xm =
i, Xr ≠ j for r < m). Then

  pj,i(m)(1 − fi,j) ≤ 1 − fj,j.

This is since the left hand side is the probability that we first go from j to i in m
steps, and then never go from i to j again; while the right is just the probability
of never returning to j starting from j; and we know that it is easier to just not
get back to j than to go to i in exactly m steps and never return to j. From
this inequality we see that if fj,j = 1, then fi,j = 1, since pj,i(m) > 0. Now let
λk = P(X0 = k) be our initial distribution. Then

  P(Xn = j for some n ≥ 1) = Σ_k λk Pk(Xn = j for some n ≥ 1) = Σ_k λk fk,j = Σ_k λk = 1.
The main focus of this section is to study the existence and properties of invariant
distributions, and we will provide sufficient conditions for convergence to occur in
the next.
P. 8-43
For an irreducible recurrent chain with X0 = k ∈ S, define ρi = Ek(Wi), where
Wi = Σ_{m≥1} 1(Xm = i, Tk ≥ m) is the number of visits to i before returning to k. Then

  (i) ρk = 1   (ii) Σ_{i∈S} ρi = μk   (iii) ρ = ρP   (iv) 0 < ρi < ∞ for all i ∈ S.
Note that we swapped the sum and expectation of a potentially infinite sum.
However, there is a theorem (monotone convergence) that tells us this is okay
whenever the summands are non-negative.

iii. We have

  ρj = Ek(Wj) = Ek Σ_{m≥1} 1(Xm = j, Tk ≥ m) = Σ_{m≥1} Pk(Xm = j, Tk ≥ m)
     = Σ_{m≥1} Σ_{i∈S} Pk(Xm = j | Xm−1 = i, Tk ≥ m) Pk(Xm−1 = i, Tk ≥ m)
     = Σ_{i∈S} pi,j Σ_{r≥0} Pk(Xr = i, Tk ≥ r + 1),

using the Markov property (the event {Tk ≥ m} depends only on X0, · · · , Xm−1) and
re-indexing r = m − 1. We would like the inner sum to equal ρi = Σ_{r≥1} Pk(Xr = i, Tk ≥ r),
but the indices do not quite match, so we look at the different possible
cases. First, if i = k, then the r = 0 term is 1 since Tk ≥ 1 is always true by
definition and X0 = k, also by construction. On the other hand, the other
terms are all zero since it is impossible for the return time to be greater than or
equal to r + 1 if we are at k at time r. So the sum is 1, which is ρk.

In the case where i ≠ k, first note that when r = 0 we know that X0 = k ≠ i.
So the term is zero. For r ≥ 1, we know that if Xr = i and Tk ≥ r, then we
must also have Tk ≥ r + 1, since it is impossible for the return time to k to be
exactly r if we are not at k at time r. So Pk(Xr = i, Tk ≥ r + 1) = Pk(Xr =
i, Tk ≥ r). So indeed Σ_{r≥0} Pk(Xr = i, Tk ≥ r + 1) = ρi. Hence ρj = Σ_{i∈S} pi,j ρi,
ie. ρ = ρP.
iv. To show that 0 < ρi < ∞, first fix our i, and note that ρk = 1. We know that
ρ = ρP = ρP^n for n ≥ 1. So by expanding the matrix product, we know that
for any m, n, we have ρi ≥ ρk pk,i(n) and ρk ≥ ρi pi,k(m). By irreducibility,
we now choose m, n such that pi,k(m), pk,i(n) > 0. So the result follows since
ρk = 1 and

  ρk pk,i(n) ≤ ρi ≤ ρk / pi,k(m).
This result says that for an irreducible recurrent Markov chain, there exists a ρ
(where ρi is the mean number of visits to i before returning to k) such that ρP = ρ.
This is not quite an invariant distribution, since a distribution requires Σi ρi = 1,
and we don't know if we can actually "normalise" ρ. In particular, since Σi ρi = μk,
if μk = ∞, then we cannot normalise ρ.
L. 8-44
Suppose λi ≥ 0 and Σ_{i=1}^∞ λi converges; also |αi(n)| < M for some M for all i, n,
and αi(n) → 0 as n → ∞ for each i. Then Σ_{i=1}^∞ λi αi(n) → 0 as n → ∞.
T. 8-45
For an irreducible Markov chain:
1. If some state is positive recurrent, then there exists an invariant distribution.
2. If there is an invariant distribution π, then every state is positive recurrent, and
πi = 1/µi for i ∈ S, where µi is the mean recurrence time of i. In particular,
π is unique.
2. Let π be an invariant distribution. We first show that all entries are non-zero.
For all n, we have π = πP^n. Hence for all i, j ∈ S and n ∈ N, we have

  πi ≥ πj pj,i(n).  (∗)

Since Σi πi = 1, there is some k such that πk > 0. By (∗)
with j = k, we know that πi ≥ πk pk,i(n) > 0 for some n, by irreducibility. So
πi > 0 for all i.

Now we show that all states are positive recurrent. So we need to rule out
the cases of transience and null recurrence. Assume all states are transient.
Then pj,i(n) → 0 for all i, j ∈ S as n → ∞ [T.8-20]. However, we know that
πi = Σj πj pj,i(n). By our previous lemma, Σj πj pj,i(n) → 0 as n → ∞, since
our state space is countable, the pj,i(n) are bounded by 1, and each tends to 0.
This is a contradiction, since πi is non-zero. Hence all states are recurrent.

To rule out the case of null recurrence, we prove that πi μi = 1; this would
imply that μi is finite, since πi > 0. By definition μi = Ei(Ti), and we have the
general formula E(N) = Σn P(N ≥ n). So we get πi μi = Σ_{n=1}^∞ πi Pi(Ti ≥ n).
Note that Pi is a probability conditional on starting at i. So to work with the
8.3. LONG-RUN BEHAVIOUR 333
The idea of the proof is to show that for any i, j, k ∈ S, we have pi,k (n)−pj,k (n) →
0 as n → ∞. Then we can argue that no matter where we start, we will tend to
the same distribution, and hence any distribution tends to the same distribution
as π, since π doesn’t change.
Suppose we have two such independent Markov chains, one started at i and the other
started at j. Define the pair Z = (X, Y) of the two chains, with X = (Xn) and
Y = (Yn) each having the state space S and transition matrix P. Now Z = (Zn),
where Zn = (Xn, Yn), is a Markov chain on state space S². This has transition
probabilities pij,kℓ = pi,k pj,ℓ by independence of the chains.
First, it can be shown that Z is irreducible. We have pij,kℓ(n) = pi,k(n)pj,ℓ(n),
and we want this to be strictly positive for some n. We know that there is m such
that pi,k(m) > 0, and some r such that pj,ℓ(r) > 0. However, what we need is an
n that makes them simultaneously positive. Such an n exists because the chains
are aperiodic: a standard number-theoretic fact then gives pi,k(n) > 0 and
pj,ℓ(n) > 0 for all sufficiently large n.
Next we show that Z is positive recurrent. We know that X and Y are positive
recurrent. By our previous theorem, there is a unique invariant distribution π for
P. It is then easy to check that Z has invariant distribution π̂(i,j) = πi πj, so Z is
positive recurrent. Now fix a state b ∈ S and let T = min{n ≥ 1 : Zn = (b, b)} be
the coupling time; since Z is irreducible and recurrent, T is finite with probability 1.
Writing Pij for probability conditional on Z0 = (i, j), the strong Markov property at
T gives Pij(Xn = k, T ≤ n) = Pij(Yn = k, T ≤ n): after time T the two chains are
interchangeable. Hence

  pi,k(n) = Pi(Xn = k) = Pij(Xn = k) = Pij(Xn = k, T ≤ n) + Pij(Xn = k, T > n),

and similarly for pj,k(n) with Yn in place of Xn. Hence |pi,k(n) − pj,k(n)| ≤ Pij(T > n).
As n → ∞, we know that Pij(T > n) → 0, since Z is recurrent. So |pi,k(n) − pj,k(n)| → 0.
Now by the invariance of π, we have π = πP^n for all n. So we can write
πk = Σj πj pj,k(n). Hence we have

  |πk − pi,k(n)| = |Σj πj (pj,k(n) − pi,k(n))| ≤ Σj πj |pj,k(n) − pi,k(n)|.

We know that each individual |pj,k(n) − pi,k(n)| tends to zero. So by our lemma
we know πk − pi,k(n) → 0.
This proof is done by "coupling". The idea of coupling is that here we have two
sets of probabilities, and we want to prove relations between them. The first step
is to move our attention to random variables, by considering random variables that
give rise to these probability distributions. In other words, we look at the Markov
chains themselves instead of the probabilities. In general, random variables are
nicer to work with, since they are functions, not discrete, unrelated numbers.
What happens when we have a null recurrent case? We would still be able to
prove that pi,k(n) − pj,k(n) → 0, since T is finite by recurrence. However,
we do not have a π to make the last step.
Here is an elementary example which highlights the necessity of aperiodicity in
the convergence theorem. Let X be a Markov chain with state space S = {1, 2}
and transition matrix

  P = ( 0 1 ; 1 0 ).

Thus, X alternates deterministically between the two states. It is immediate that
P^{2m} = I and P^{2m+1} = P for all m, and in particular, the limit limn→∞ pi,j(n)
doesn't exist for any i, j ∈ S. The proof of the theorem fails since the paired chain
Z is not irreducible: for example, if Z0 = (1, 2), then Zn ≠ (1, 1) for all n.
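Both phenomena are visible numerically: for an irreducible aperiodic chain the rows of P^n all converge to π, while for the periodic matrix above the powers alternate forever. A small sketch (the aperiodic 2-state matrix is our own example, with π = (2/7, 5/7)):

```python
# Convergence of P^n for an aperiodic chain, and failure for a periodic one.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def power(P, n):
    R = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        R = matmul(R, P)
    return R

P = [[0.5, 0.5], [0.2, 0.8]]               # aperiodic; pi = (2/7, 5/7)
Pn = power(P, 60)                          # both rows are now ~ (2/7, 5/7)

Q = [[0.0, 1.0], [1.0, 0.0]]               # period 2: powers alternate I, Q, I, ...
```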
E. 8-48
<Coupling game> A pack of playing cards is shuffled, and the cards dealt (face
up) one by one. A friend is asked to select some card, secretly, from amongst the
first six. If the face value of this card is m (aces count 1 and court cards count 10
etc.), the next m − 1 cards are allowed to pass, and your friend is asked to note
the face value of the mth card. Continuing according to this rule, there arrives a
last card in this sequence, with face value of say X, and with fewer than X cards
remaining. We call X your friend's 'score'. If you follow the same rules as your
friend, starting for simplicity at the first card, you obtain thereby a score Y.
There is a high probability that X = Y.
Why is this the case? Suppose your friend picks the m1 th card, m2 th card, and
so on, and you pick the n1 (= 1)th, n2 th and so on. If mi = nj for some i, j, the
two of you are ‘stuck together’ forever after. When this occurs first, we say that
coupling has occurred. Prior to coupling, each time you read the value of a card,
there is a positive probability that you will arrive at the next stage on exactly
the same card as the other person. If the pack of cards were infinitely large, then
coupling would take place sooner or later. It turns out that there is a reasonable
chance that coupling takes place before the last card of a regular pack of 52 cards
has been dealt.
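This game (a version of the Kruskal count) is straightforward to simulate. The sketch below checks whether the two players land on the same final card, which is exactly the coupling event described above and in particular forces X = Y; with court cards counted as 10 the agreement probability comes out comfortably above 1/2:

```python
# Simulate the card-coupling game.
import random

random.seed(3)
deck_values = [min(v, 10) for v in range(1, 14)] * 4    # 52 cards, courts count 10

def final_position(deck, start):
    i = start
    while i + deck[i] < len(deck):         # jump forward by the face value
        i += deck[i]
    return i                               # fewer than deck[i] cards remain here

trials, agree = 4000, 0
for _ in range(trials):
    deck = deck_values[:]
    random.shuffle(deck)
    friend = final_position(deck, random.randrange(6))  # secret start in first six
    you = final_position(deck, 0)                       # you start at the first card
    agree += (friend == you)
```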
P. 8-49
Let X be an irreducible, recurrent Markov chain. Then every state is null recurrent
if and only if there exists a state i such that pi,i (n) → 0 as n → ∞.
We will only prove the backward direction. Suppose X is positive recurrent and,
in addition, aperiodic; then pi,i(n) → πi = 1/μi > 0. Therefore there is no state
such that pi,i(n) → 0. In the periodic case, consider the chain Yn = Xnd where
d is the period of the chain; then Y is aperiodic and positive recurrent, so by the
same argument pi,i(nd) converges to a positive limit, hence pi,i(n) ↛ 0.
E. 8-50
In [T.8-25] we have that for the 1- and 2-dimensional symmetric random walks
p0,0(2n) → 0 as n → ∞, and we know p0,0(2n + 1) = 0 for all n. Hence p0,0(n) → 0
as n → ∞. Therefore the 1- and 2-dimensional symmetric random walks are null
recurrent.
E. 8-51
Intuitively, for an irreducible and positive recurrent chain, πi = 1/μi is the proportion
of time we spend in state i. To prove this, we let Vi(n) = |{m ≤ n : Xm = i}|.
We want to show Vi(n)/n → πi as n → ∞. Note that technically this is not a
well-formed question, since we don't exactly know what convergence of random
variables should mean here.
The idea is to look at the average time between successive visits. We assume
X0 = i. We let Tm be the time of the m-th return to i; in particular, T0 = 0. Let
Um = Tm − Tm−1. These are iid by the strong Markov property, and have
mean μi by definition of μi. Hence, by the law of large numbers,

  (1/m) Tm = (1/m) Σ_{r=1}^m Ur → E[U1] = μi.  (∗)
Note that Vi(n) < m if and only if Tm > n. Using (∗), we know that
T_{An/μi} / (An/μi) → μi (taking integer parts throughout). Multiplying both sides by
A/μi gives

  T_{An/μi} / n → A.

So if A < 1, the event {T_{An/μi} > n} (which is the event {Vi(n)/n < A/μi}) occurs
with probability tending to 0 as n → ∞; if A > 1, it occurs with probability tending
to 1.
So in some sense Vi(n)/n → 1/μi = πi. It should be clear that even if we didn't
assume X0 = i, this still holds, since we will reach i for the first time at some
finite time T′, and after that the behaviour is the same as in our calculation, so
Vi(n)/(n − T′) → 1/μi. Hence

  Vi(n)/n = [Vi(n)/(n − T′)] · [(n − T′)/n] → 1/μi as n → ∞.
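This long-run proportion is easy to observe in simulation (the two-state chain below is our own example; its invariant distribution is π = (2/7, 5/7)):

```python
# Empirical occupation fraction V_i(n)/n versus pi_i for a two-state chain.
import random

random.seed(4)
P = [[0.5, 0.5], [0.2, 0.8]]               # pi = (2/7, 5/7)
n, x, visits0 = 200000, 0, 0
for _ in range(n):
    x = 0 if random.random() < P[x][0] else 1
    visits0 += (x == 0)

fraction = visits0 / n                     # should be close to 2/7
```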
We now show π is invariant for P̂: we have

  Σi πi p̂i,j = Σi πj pj,i = πj,

since P is a stochastic matrix and Σi pj,i = 1. Note that our formula for p̂i,j gives
πi p̂i,j = πj pj,i.
8.4. TIME REVERSAL 337
So Y is a Markov chain.
Most of the results here should not be surprising, apart from the fact that Y is
Markov. Since Y is just X reversed, the transition matrix of Y is just the transpose
of the transition matrix of X, with some factors to get the normalization right.
Also, it is not surprising that π is invariant for P̂ , since each Xi , and hence Yi has
distribution π by assumption.
D. 8-53
• An irreducible Markov chain X = (X0, · · · , XN) in its invariant distribution π is
said to be reversible if its reversal has the same transition probabilities as X does,
that is, it satisfies the detailed balance equations πi pi,j = πj pj,i for all i, j ∈ S.
• In general, if λ is a distribution that satisfies λi pi,j = λj pj,i for all i, j ∈ S, we say
(P, λ) is in detailed balance.
E. 8-54
Time reversibility is a very useful concept in the theory of random networks. There
is a valuable analogy using the language of flows. Let X be a Markov chain with
state space S and invariant distribution π. To this chain there corresponds the
following directed network (or graph). The vertices of the network are the states
of the chain, and an arrow is placed from vertex i to vertex j if pi,j > 0. At the
start, one unit of a notional material is distributed about the vertices such that
proportion πi of the material is placed initially at vertex i. At each epoch of time
and for each vertex i, a proportion pi,j of the material at i is transported to each
vertex j.
The amount of material at vertex i after one epoch is Σj πj pj,i, which equals πi
since π = πP . That is to say, the deterministic flow of probability is in equilibrium:
there is ‘global balance’ in the sense that the total quantity leaving each vertex
is balanced by an equal quantity arriving there. There may or may not be 'local
balance', in the sense that, for every i, j ∈ S, the amount flowing from i to j equals
the amount flowing from j to i. Local balance occurs if and only if πi pi,j = πj pj,i
for all i, j ∈ S, which is to say that P and π are in detailed balance.
P. 8-55
Let P be the transition matrix of an irreducible Markov chain X. Suppose (P, λ)
is in detailed balance. Then λ is the unique invariant distribution and the chain
is reversible (when X0 has distribution λ).
As an example, consider a random walk on a finite connected graph, where from
the current vertex we jump to one of its neighbours chosen uniformly at random.
The transition probabilities are

  pi,j = 1/di if j is a neighbour of i,   pi,j = 0 otherwise,

where di is the number of neighbours of i, commonly known as the degree of i.
By connectivity, the Markov chain is irreducible. Since it is finite, it is recurrent,
and in fact positive recurrent. This process is a rather "physical" process, and we
would expect it to be reversible. So we try to solve the detailed balance equation
λi pi,j = λj pj,i.

If j is not a neighbour of i, then both sides are zero, and it is trivially balanced.
Otherwise, the equation becomes λi/di = λj/dj. The solution is obvious: take
λi = di. In fact we can multiply by any constant c, and λi = c di works for any c. So we
pick our c such that this is a distribution, ie. 1 = Σi λi = c Σi di.

We now note that since each edge adds 1 to the degree of each of the two vertices at
its ends, Σi di is just twice the number of edges. So the equation gives 1 = 2c|E|.
Hence we get c = 1/(2|E|), and our invariant distribution is λi = di/(2|E|).
Let's look at a specific scenario. Suppose we have a knight on the chessboard. In
each step, the allowed moves are:
• Move two steps horizontally, then one step vertically;
• Move two steps vertically, then one step horizontally.
(From a square in the middle of the board there are eight such moves.)
At each epoch of time, our erratic knight follows a legal move chosen uniformly
from the set of possible moves. Hence we have a Markov chain derived from the
chessboard. What is his invariant distribution? We can compute the number of
possible moves from each position. By symmetry it suffices to list one quarter of
the board (corner square at the top right):

  4 4 3 2
  6 6 4 3
  8 8 6 4
  8 8 6 4

The sum of degrees over the whole board is Σi di = 336. So the invariant
distribution at, say, the corner is πcorner = 2/336.
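The degree table, the total 336, and the corner value can all be checked directly:

```python
# Knight's-move degrees on the 8x8 board, the total 336, and pi at a corner.
MOVES = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]

def degree(r, c):
    # number of knight moves from square (r, c) that stay on the board
    return sum(0 <= r + dr < 8 and 0 <= c + dc < 8 for dr, dc in MOVES)

total = sum(degree(r, c) for r in range(8) for c in range(8))
pi_corner = degree(0, 0) / total           # = 2/336
```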
CHAPTER 9
Groups, Rings and Modules
9.1 Groups
For the basics, see Part IA Groups. We will repeat here only a minimal amount of the material covered there.
E. 9-1
We have the following familiar examples of groups
1. (Z, +, 0), (Q, +, 0), (R, +, 0), (C, +, 0) where we write 0 to emphasise it’s the
identity.
2. The symmetric group Sn is the collection of all permutations of {1, 2, · · · , n}.
The alternating group An ≤ Sn .
3. The dihedral group D2n is the symmetries of a regular n-gon. The cyclic group
Cn ≤ D2n .
4. The group GLn (R) is the group of invertible n × n real matrices, which also is
the group of invertible R-linear maps from the vector space Rn to itself. The
special linear group SLn (R) ≤ GLn (R), the subgroup of matrices of determi-
nant 1.
5. The Klein-four group C2 × C2 .
6. The quaternions Q8 = {±1, ±i, ±j, ±k} with ij = k, ji = −k, i² = j² = k² = −1, (−1)² = 1.
T. 9-2
<First isomorphism theorem> Let φ : G → H be a homomorphism. Then ker(φ) C G and G/ker(φ) ≅ Im(φ).
(C/(2πiZ), +) ≅ (C \ {0}, ×).
T. 9-4
<Second isomorphism theorem> Let H ≤ G and K C G. Then HK = {hk :
h ∈ H, k ∈ K} is a subgroup of G, and H ∩ K C H. Moreover,
HK/K ≅ H/(H ∩ K).
since the first term is in H, while the second term is k′k⁻¹ ∈ K conjugated by h, which has to be in K by normality. Now HK is non-empty as it contains e, hence it is a subgroup.
Suppose x ∈ H ∩ K and h ∈ H. Consider h−1 xh. Since x ∈ K, the normality
of K implies h−1 xh ∈ K. Also, since x, h ∈ H, closure implies h−1 xh ∈ H. So
h−1 xh ∈ H ∩ K. Therefore H ∩ K C H.
Now we prove the isomorphism part. To do so, we apply the first isomorphism
theorem. Define φ : H → G/K by h 7→ hK. This is easily seen to be a homo-
morphism. The image is all K cosets represented by something in H, ie. Im(φ) =
HK/K. Then the kernel of φ is ker(φ) = {h : hK = eK} = {h : h ∈ K} = H ∩ K.
So the first isomorphism theorem says H/(H ∩ K) ≅ HK/K.
Notice we did more work than we really had to. We could have started by writing
down φ and checked it is a homomorphism. Then since H ∩ K is its kernel, it has
to be a normal subgroup.
C. 9-5
<Subgroup correspondence> For K C G, there is a bijection between subgroups of G/K and subgroups of G containing K, sending a subgroup L with K ≤ L ≤ G to L/K, and a subgroup X ≤ G/K to {g ∈ G : gK ∈ X}.
Under the same bijection, normal subgroups of G/K correspond to normal subgroups of G containing K.
T. 9-6
<Third isomorphism theorem> Let K ≤ L ≤ G be normal subgroups of G. Then

    (G/K)/(L/K) ≅ G/L.
φ(g⁻¹)(φ(g)(x)) = g⁻¹ ∗ (g ∗ x) = (g⁻¹ · g) ∗ x = e ∗ x = x.
Similarly, φ(g1)(φ(g2)(x)) = (g1 · g2) ∗ x for all x ∈ X, so φ(g1) ◦ φ(g2) = φ(g1 · g2). Also, φ(e)(x) = e ∗ x = x.
2. Suppose φ : G → Sym X is a homomorphism, define a function G × X → X
by g ∗ x = φ(g)(x). This is a group action since
• g1 ∗(g2 ∗x) = φ(g1 )(φ(g2 )(x)) = (φ(g1 )◦φ(g2 ))(x) = φ(g1 ·g2 )(x) = (g1 ·g2 )∗x.
• e ∗ x = φ(e)(x) = idX (x) = x.
These two operations are clearly inverses to each other. So group actions of G on X are the same as homomorphisms G → Sym(X).
We have thus shown that a permutation representation is the same as a group
action. Also a good thing about thinking of group actions as homomorphisms is
that we can use all we know about homomorphisms on it.
D. 9-13
• For an action of G on X given by a homomorphism φ : G → Sym(X), we write G^X = Im(φ) and G_X = ker(φ).
• Given a subgroup H ≤ G, write G/H for the set of left cosets of H.
• If G acts on a set X, the orbit of x ∈ X is G · x = {g ∗ x ∈ X : g ∈ G}. The stabilizer of x ∈ X is Gx = {g ∈ G : g ∗ x = x}.
E. 9-14
• The first isomorphism theorem immediately gives G_X C G and G/G_X ≅ G^X. In particular, if G_X = {e} is trivial, then G ≅ G^X ≤ Sym(X).
• Let G be the group of symmetries of a cube. Let X be the
set of diagonals of the cube. Then G acts on X, and so we
get φ : G → Sym(X).
Kernel: an element of the kernel fixes every diagonal setwise, so it either fixes all vertices or flips the two endpoints of each diagonal. Hence G_X = ker(φ) = {id, the antipodal symmetry sending each vertex to its opposite vertex} ≅ C2.
T. 9-17
Let G be a finite group, and H ≤ G a subgroup of index n. Then there is a normal
subgroup K C G with K ≤ H such that G/K is isomorphic to a subgroup of Sn .
Hence |G/K| | n! and |G/K| ≥ n.
intersection is not the trivial group. We use the second isomorphism theorem. If Im(φ) ∩ An = {e}, then

    Im(φ) ≅ Im(φ)/(Im(φ) ∩ An) ≅ Im(φ)An/An ≤ Sn/An ≅ C2.
So G ∼= Im(φ) is a subgroup of C2 , ie. either {e} or C2 itself. Neither of these are
non-abelian. So this cannot be the case. So we must have Im(φ) ∩ An = Im(φ),
ie. Im(φ) ≤ An .
The last part follows from the fact that S1, S2, S3, S4 have no non-abelian simple subgroups, which can be checked by listing out all their subgroups.
T. 9-19
<Orbit-stabilizer theorem> Let G act on X. Then for any x ∈ X, there is a
bijection between G · x and G/Gx , given by g · x ↔ g · Gx . In particular, if G is
finite, then |G| = |Gx ||G · x|.
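As an illustration (my own sketch, not from the notes), one can verify |G| = |Gx||G · x| exhaustively for G = S4 acting on itself by conjugation, where the orbit of an element is its conjugacy class and the stabilizer is its centralizer:

```python
# Sketch (not from the notes): exhaustively verify |G| = |G_x| * |G.x| for
# G = S4 acting on itself by conjugation.
from itertools import permutations

G = list(permutations(range(4)))    # S4 as tuples (images of 0,1,2,3)

def compose(a, b):
    # (a o b)(i) = a(b(i))
    return tuple(a[b[i]] for i in range(4))

def inverse(a):
    inv = [0] * 4
    for i, ai in enumerate(a):
        inv[ai] = i
    return tuple(inv)

def conj(g, x):
    # g * x = g x g^-1, the conjugation action
    return compose(compose(g, x), inverse(g))

for x in G:
    orbit = {conj(g, x) for g in G}             # conjugacy class of x
    stab = [g for g in G if conj(g, x) == x]    # centralizer of x
    assert len(G) == len(orbit) * len(stab)
print("orbit-stabilizer verified for all 24 elements of S4")
```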
• The automorphism group of G is Aut(G) = {f : G → G : f is a group isomorphism}. This is a group under composition, with the identity map as the identity.
• The conjugacy class of g ∈ G is cclG (g) = {hgh−1 : h ∈ G}, ie. the orbit of
g ∈ G under the conjugation action.
• The centralizer of g ∈ G is CG (g) = {h ∈ G : hgh−1 = g}, ie. the stabilizer
of g under the conjugation action. This is alternatively the set of all h ∈ G that
commutes with g.
• The center of a group G is

    Z(G) = {h ∈ G : hgh⁻¹ = g for all g ∈ G} = ⋂_{g∈G} CG(g) = ker(φ),

where φ is the permutation representation of the conjugation action. It is the set of elements of the group that commute with everything else.
• Let H ≤ G. The normalizer of H in G is NG (H) = {g ∈ G : g −1 Hg = H}.
E. 9-21
We have seen that every group acts on itself by multiplying on the left. A group
G can also act on itself by conjugation g ∗ g1 = gg1 g −1 .
Let φ : G → Sym(G) be the associated permutation representation. We know, by
definition, that φ(g) is a bijection from G to G as sets. However, here G is not
an arbitrary set, but is a group. A natural question to ask is whether φ(g) is a
homomorphism or not. Indeed, we have

    φ(g)(g1 · g2) = g(g1 g2)g⁻¹ = (g g1 g⁻¹)(g g2 g⁻¹) = φ(g)(g1) · φ(g)(g2),

so each φ(g) is an isomorphism from G to itself.
Thus, for any group G, there are many isomorphisms from G to itself, one for
every g ∈ G, and can be obtained from a group action of G on itself. We can, of
course, take the collection of all isomorphisms of G, and form a new group out of
it, the automorphism group. It is a subgroup of Sym(G), and the homomorphism
φ : G → Sym(G) by conjugation lands in Aut(G).
This is pretty fun — we can use this to cook up some more groups, by taking a
group and looking at its automorphism group. We can also take a group, take
its automorphism group, and then take its automorphism group again, and do it
again, and see if this process stabilizes, or becomes periodic, or something!
P. 9-22
Let G be a finite group. Then | ccl(x)| = |G : CG (x)| = |G|/|CG (x)|.
Since they have the same cycle type, there is some σ ∈ Sn such that (a b c) = σ(1 2 3)σ⁻¹. If σ is even, ie. σ ∈ An, then (1 2 3) ∈ σ⁻¹Hσ = H by the normality of H, and we are trivially done.
If σ is odd, replace it by σ̄ = σ · (4 5). Here is where we use the fact that n ≥ 5
(we will use it again later). Then we have
σ̄(1 2 3)σ̄ −1 = σ(4 5)(1 2 3)(4 5)σ −1 = σ(1 2 3)σ −1 = (a b c),
since (1 2 3) and (4 5) commute. Now σ̄ is even, so (1 2 3) ∈ H as above.
(Part iii) We separate this into many cases
1. Suppose H contains an element which can be written in disjoint cycle notation
σ = (1 2 3 · · · r)τ,
for r ≥ 4. We now let δ = (1 2 3) ∈ An . Then by normality of H, we know
δ −1 σδ ∈ H. Then σ −1 δ −1 σδ ∈ H. Also, we notice that τ does not contain
1, 2, 3. So it commutes with δ, and also trivially with (1 2 3 · · · r). We can
expand σ⁻¹δ⁻¹σδ to obtain a 3-cycle in H:

    σ⁻¹δ⁻¹σδ = (r · · · 2 1)(1 3 2)(1 2 3 · · · r)(1 2 3) = (2 3 r).
The same argument goes through if σ = (a1 a2 · · · ar )τ for any a1 , · · · , ar .
2. Suppose H contains an element consisting of at least two 3-cycles in disjoint
cycle notation, say σ = (1 2 3)(4 5 6)τ . We now let δ = (1 2 4), and again
calculate
σ −1 δ −1 σδ = (1 3 2)(4 6 5)(1 4 2)(1 2 3)(4 5 6)(1 2 4) = (1 2 4 3 6).
This is a 5-cycle, which is necessarily in H. By the previous case, we get a
3-cycle in H too, and hence H = An .
3. Suppose H contains σ = (1 2 3)τ, with τ a product of disjoint 2-cycles (if τ contained anything longer, σ would fit in one of the previous two cases). Then σ² = (1 2 3)²τ² = (1 3 2) is a 3-cycle in H, since τ² = e.
4. Suppose H contains σ = (1 2)(3 4)τ , where τ is a product of 2-cycles. We first
let δ = (1 2 3) and calculate
u = σ⁻¹δ⁻¹σδ = (1 2)(3 4)(1 3 2)(1 2)(3 4)(1 2 3) = (1 4)(2 3),
which is again in H. We landed in the same case, but instead of two transpositions times a mess, we just have two transpositions, which is nicer. Let
v = (1 5 2)u(1 2 5) = (1 3)(4 5) ∈ H.
Note that we used n ≥ 5 again. We have yet again landed in the same case.
Notice however, that these are not the same transpositions. We multiply
uv = (1 4)(2 3)(1 3)(4 5) = (1 2 3 4 5) ∈ H.
This is then covered by the first case, and we are done.
Recall that we proved A5 is simple in IA Groups by brute force: we listed all its conjugacy classes and saw that they cannot be put together to make a normal subgroup. This obviously cannot be easily generalized to higher values of n. Hence we need to prove this with a different approach.
D. 9-25
A finite group G is a p-group if |G| = p^n for some prime number p and n ≥ 1.
T. 9-26
If G is a finite p-group, then Z(G) = {x ∈ G : xg = gx for all g ∈ G} is non-trivial.
Let G act on itself by conjugation. The orbits of this action (ie. the conjugacy classes) have order dividing |G| = p^n. So each is either a singleton, or its size is divisible by p. Since the conjugacy classes partition G, the total size of the conjugacy classes is |G|. In particular,

    |G| = (number of conjugacy classes of size 1) + Σ (sizes of conjugacy classes of size > 1).

Both |G| and the sum are divisible by p, so the number of singleton classes is divisible by p. But {x} is a singleton class exactly when x ∈ Z(G), and {e} is one such class. So |Z(G)| is a positive multiple of p, and in particular Z(G) is non-trivial.
Let gZ(G) be a generator of the cyclic group G/Z(G). Hence every coset of Z(G) is of the form g^r Z(G). So every element x ∈ G must be of the form g^r z for some z ∈ Z(G) and r ∈ Z. To show G is abelian, let x̄ = g^r̄ z̄ be another element, with z̄ ∈ Z(G), r̄ ∈ Z. Note that z and z̄ are in the center, and hence commute with every element. So we have

    x x̄ = g^r z g^r̄ z̄ = g^r g^r̄ z z̄ = g^r̄ g^r z̄ z = g^r̄ z̄ g^r z = x̄ x.

So G is abelian.
Since Z(G) ≤ G, its order must be 1, p or p². Since it is not trivial, it can only be p or p². If it has order p², then it is the whole group, and the group is abelian. Otherwise, G/Z(G) has order p²/p = p. But then it must be cyclic, and thus G must be abelian.
T. 9-29
Let G be a group of order p^a, where p is a prime number. Then it has a subgroup of order p^b for any 0 ≤ b ≤ a.
Since this is a strictly smaller group, we can by induction suppose G/⟨x⟩ has a subgroup of any allowed order. In particular, it has a subgroup L of order p^{b−1}. By the subgroup correspondence, there is some K ≤ G such that L = K/⟨x⟩ and ⟨x⟩ C K. But then K has order p^b.
This means there is a subgroup of every conceivable order. This is not true for general groups. For example, A5 has no subgroup of order 30, or else that would be a normal subgroup (being of index 2), contradicting simplicity.
T. 9-30
<Classification of finite abelian groups> Let G be a finite abelian group.
Then there exists some d1 , · · · , dr such that
G ≅ Cd1 × Cd2 × · · · × Cdr .
Moreover, we can pick di such that di+1 | di for each i, and this expression is
unique.
We will prove this later in [T.9-145] as it turns out the best way to prove this is
not to think of it as a group, but as a Z-module.
E. 9-31
So it turns out finite abelian groups are very easy to classify. We can just write
down a list of all finite abelian groups. For example the abelian groups of order 8
are C8 , C4 × C2 , C2 × C2 × C2 .
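Such a list can be generated mechanically: in the prime-power form of the classification, the abelian groups of order n correspond to choices of a partition of each prime exponent of n. A sketch (my own, not from the notes; the helper names are mine):

```python
# Sketch (not from the notes; helper names are my own): enumerate the abelian
# groups of order n in prime-power form, one per choice of a partition of each
# prime exponent of n.
from itertools import product

def partitions(n, maximum=None):
    # all partitions of n as non-increasing lists
    if n == 0:
        return [[]]
    maximum = maximum or n
    result = []
    for k in range(min(n, maximum), 0, -1):
        for rest in partitions(n - k, k):
            result.append([k] + rest)
    return result

def factorise(n):
    factors, p = {}, 2
    while n > 1:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    return factors

def abelian_groups(n):
    # each group is a list of cyclic factor orders, read as C_{p^k} factors
    choices = [[[p ** k for k in part] for part in partitions(e)]
               for p, e in factorise(n).items()]
    return [sum(choice, []) for choice in product(*choices)]

print(abelian_groups(8))   # [[8], [4, 2], [2, 2, 2]]
```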
L. 9-32
If n and m are coprime, then Cmn ≅ Cm × Cn.
This is a grown-up version of the Chinese remainder theorem; indeed, this is what the Chinese remainder theorem really says.
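The underlying Chinese remainder theorem is easy to check computationally. A sketch (my own, with made-up moduli):

```python
# Sketch (not from the notes): for coprime m, n the map x -> (x mod m, x mod n)
# is a bijective homomorphism Z/mn -> Z/m x Z/n; this is the Chinese remainder
# theorem behind C_mn ~ C_m x C_n. The moduli below are a made-up example.
from math import gcd

m, n = 4, 9
assert gcd(m, n) == 1

images = {(x % m, x % n) for x in range(m * n)}
assert len(images) == m * n        # injective on m*n elements, hence bijective

for x in range(m * n):             # addition is respected componentwise
    for y in range(m * n):
        s = (x + y) % (m * n)
        assert (s % m, s % n) == ((x + y) % m, (x + y) % n)
print("Z/36 is isomorphic to Z/4 x Z/9")
```

Note that the check fails for non-coprime moduli: the map Z/24 → Z/4 × Z/6 hits only lcm(4, 6) = 12 of the 24 pairs.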
P. 9-33
For any finite abelian group G, we have G ≅ Cd1 × Cd2 × · · · × Cdr , where each di is some prime power.
From the classification theorem, iteratively apply the previous lemma to break
each component up into products of prime powers.
This is a somewhat more useful form of the decomposition of finite abelian groups.
T. 9-34
<Sylow theorems> Let G be a finite group of order p^a m, with p prime and p ∤ m. Then
1. The set of Sylow p-subgroups of G, given by Sylp(G) = {P ≤ G : |P| = p^a}, is non-empty. In other words, G has a subgroup of order p^a.
2. All elements of Sylp(G) are conjugate in G.
3. The number of Sylow p-subgroups np = |Sylp(G)| satisfies np ≡ 1 (mod p) and np | |G| (in fact np | m, since p is not a factor of np).
Now note that the largest power of p dividing p^a m − j is the largest power of p dividing j. Similarly, the largest power of p dividing p^a − j is also the largest power of p dividing j. So we have the same power of p on top and bottom for each factor in the product, and they cancel. So |Ω| is not divisible by p.
This proof is not straightforward. We first needed the clever idea of letting G
act on Ω. But then if we are given this set, the obvious thing to do would be
to find something in Ω that is also a group. This is not what we do. Instead,
we find an orbit whose stabilizer is a Sylow p-subgroup.
We let Q act on the set of cosets G/P via q ∗ gP = qgP. The orbits of this action have size dividing |Q|, so each is either 1 or divisible by p. But they can't all be divisible by p, since |G/P| is coprime to p. So at least one of them has size 1, say {gP}. In other words, for every q ∈ Q, we have qgP = gP. This means g⁻¹qg ∈ P. This holds for every element q ∈ Q. So we have found a g such that g⁻¹Qg ≤ P.
3. Finally, we need to show that np ≡ 1 (mod p) and np | |G|. The second part is easier: by result 2, the action of G on Sylp(G) by conjugation has one orbit. By the orbit-stabilizer theorem, the size of the orbit, which is |Sylp(G)| = np, divides |G|. This proves the second part.
For the first part, let P ∈ Sylp (G). Consider the action by conjugation of P
on Sylp (G). Again by the orbit-stabilizer theorem, the orbits each have size 1
or size divisible by p. But we know there is one orbit of size 1, namely {P }
itself. To show np = |Sylp(G)| ≡ 1 (mod p), it is enough to show there are no other orbits of size 1.
Suppose {Q} is another orbit of size 1, ie. p⁻¹Qp = Q for all p ∈ P.
In other words, P ≤ NG (Q). Now NG (Q) is itself a group, and we can look at
its Sylow p-subgroups. We know Q ≤ NG(Q) ≤ G. So p^a | |NG(Q)| and |NG(Q)| | p^a m. So p^a is the biggest power of p that divides |NG(Q)|. So Q is a Sylow p-subgroup of NG(Q).
L. 9-35
A Sylow p-subgroup is normal in G iff it is the only Sylow p-subgroup in G (i.e.
np = | Sylp (G)| = 1)
(Forward) It it’s normal, then we can’t conjugate it to any other subgroup, hence
it’s the only Sylow p-subgroup.
(Backward) Let P be the unique Sylow p-subgroup, and let g ∈ G, and consider
g −1 P g. Since this is isomorphic to P , we must have |g −1 P g| = pa , ie. it is also
a Sylow p-subgroup. Since there is only one, we must have P = g −1 P g. So P is
normal.
E. 9-36
• Suppose |G| = 1000. Then G is not simple. To show this, we factorize: |G| = 2³ · 5³. We pick our favorite prime to be p = 5. We know n5 ≡ 1 (mod 5) and n5 | 2³ = 8. The only number satisfying both is n5 = 1. So the Sylow 5-subgroup is normal, and hence G is not simple.
• Consider the conjugation action of G on the set of Sylow p-subgroups Sylp(G). Its single orbit is all of Sylp(G), so the stabiliser (ie. the normaliser) of any Sylow p-subgroup has index np.
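For a concrete check of the numerical constraints (a sketch of mine, not from the notes), one can brute-force the Sylow subgroups of S4, where |S4| = 24 = 2³ · 3:

```python
# Sketch (not from the notes): brute-force the Sylow subgroups of S4, where
# |S4| = 24 = 2^3 * 3, and check n_p = 1 (mod p) and n_p | m.
from itertools import permutations, product

G = list(permutations(range(4)))

def compose(a, b):
    return tuple(a[b[i]] for i in range(4))

def closure(gens):
    # subgroup generated by gens (inverses come for free in a finite group)
    elems = {tuple(range(4))} | set(gens)
    while True:
        new = {compose(a, b) for a, b in product(elems, repeat=2)} - elems
        if not new:
            return frozenset(elems)
        elems |= new

# every Sylow subgroup of S4 (isomorphic to D8 or C3) needs at most two
# generators, so taking closures of all pairs finds them all
subgroups = {closure([a, b]) for a in G for b in G}
for p, p_a, m in [(2, 8, 3), (3, 3, 8)]:      # |G| = p^a * m
    n_p = sum(1 for H in subgroups if len(H) == p_a)
    assert n_p % p == 1 and m % n_p == 0
    print(f"n_{p} = {n_p}")
```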
P. 9-37
Let G be a non-abelian simple group. Then |G| | np !/2 for every prime p such that
p | |G|.
If this is surjective, then ker(sgn ◦ φ) C G has index 2 and is not the whole of G, so G is not simple (the case G ≅ C2 is ruled out since C2 is abelian). So sgn ◦ φ cannot be surjective, and its kernel must be the whole of G. In other words, sgn(φ(g)) = +1 for all g ∈ G. Hence G ≅ Im(φ) ≤ A_{np}. So we get |G| | np!/2.
Note that this result in fact follows directly from [P.9-18] applied to the normaliser of a Sylow p-subgroup; this proof simply writes out that argument directly.
E. 9-38
No group of order 132 is simple.
L. 9-39
Suppose |G| = p^n a with p and a coprime. If H C G with |H| = p^n b, then |Sylp(G)| = |Sylp(H)|.
(az + b)/(cz + d) = z   for all z ∈ Z5 ∪ {∞},

then z = 0 tells us b = 0, z = ∞ tells us c = 0, and z = 1 tells us a = d. So in PSL2(Z5) the matrix (a b; c d) is the identity (1 0; 0 1). Moreover, we claim that Im φ ≤ A6. Consider ψ = sgn ◦ φ; we need to show ψ((a b; c d)) = 1 always. The possible element orders of PSL2(Z5) are 2, 3, 4, 5, 6, 12, 15, 20, 30. Firstly, elements of odd order in PSL2(Z5) must be sent to 1. Secondly, elements of order 2 and 4 must be sent to 1; this is because
H = { (λ 0; 0 λ⁻¹), (0 λ; λ⁻¹ 0) : λ ∈ Z5* }
9.2 Rings I
In a ring, we are allowed to add, subtract, multiply but not divide. Our canonical
example of a ring would be Z. In this course, we are only going to consider rings in
which multiplication is commutative, since these rings behave like “number systems”,
where we can study number theory. We will study properties of arbitrary rings.
D. 9-41
• A ring is a quintuple (R, +, · , 0R , 1R ) where 0R , 1R ∈ R, and +, · : R × R → R
are binary operations such that
1. (R, +, 0R ) is an abelian group.
2. The operation · : R × R → R is associative, a · (b · c) = (a · b) · c, and has identity 1R, ie. 1R · r = r · 1R = r.
3. Multiplication distributes over addition, that is, r1 · (r2 + r3) = (r1 · r2) + (r1 · r3) and (r1 + r2) · r3 = (r1 · r3) + (r2 · r3).
• The familiar number systems are all rings: we have Z ≤ Q ≤ R ≤ C, under the usual 0, 1, +, ·. The set Z[i] = {a + ib : a, b ∈ Z} ≤ C is the Gaussian integers, which is a ring. We also have the ring Q[√2] = {a + b√2 ∈ R : a, b ∈ Q} ≤ R. We will use the square brackets notation quite frequently. It should be clear what it should mean, and we will define it properly later.
• In general, elements in a ring do not have multiplicative inverses. This is not a bad thing; this is what makes rings interesting. For example, the division algorithm would be rather contentless if everything in Z had an inverse. Fortunately, Z has only two invertible elements, 1 and −1. We call these units.
• Note that the notion of units depends on R, not just on u. For example, 2 ∈ Z is
not a unit, but 2 ∈ Q is a unit (since 21 is an inverse). We will later show that 0R
cannot be a unit unless in a very degenerate case.
• Z is not a field, but Q, R, C are all fields. Similarly, Z[i] is not a field, while Q[√2] is.
• Let R be a ring. Then 0R + 0R = 0R, since this is true in the group (R, +, 0R). Then for any r ∈ R, we get r · (0R + 0R) = r · 0R. Multiplication distributes over addition, so r · 0R + r · 0R = r · 0R. Adding −(r · 0R) to both sides gives r · 0R = 0R. This is true for any element r ∈ R. From this, it follows that if R ≠ {0}, then 1R ≠ 0R: if they were equal, then taking r ≠ 0R would give r = r · 1R = r · 0R = 0R, a contradiction.
Note, however, that {0} forms a ring (with the only possible operations and iden-
tities), the zero ring, albeit a boring one. However, this is often a counterexample
to many things.
D. 9-43
• Let R, S be rings. Then the product R × S is a ring via componentwise operations, (r, s) + (r′, s′) = (r + r′, s + s′) and (r, s) · (r′, s′) = (r · r′, s · s′), with zero (0R, 0S) and one (1R, 1S). It can be checked that R × S is indeed a ring.
• A polynomial f over a ring R is an expression

    f = a0 + a1X + a2X² + · · · + anX^n,

with coefficients ai ∈ R; equivalently, f = Σ aiX^i with only finitely many ai non-zero. We write R[X] for the ring of polynomials over R, with the usual addition and multiplication. We identify the ring R with the constant polynomials, ie. polynomials with ai = 0 for i > 0. In particular, 0R ∈ R and 1R ∈ R are the zero and one of R[X].
• We write R[[X]] for the ring of power series on the ring R, ie. f = a0 + a1 X +
a2 X 2 + · · · where each ai ∈ R. This has addition and multiplication the same as
for polynomials, but without upper limits.
• The Laurent polynomials over the ring R form the set R[X, X⁻¹], whose elements are of the form f = Σ_{i∈Z} aiX^i, where ai ∈ R and only finitely many ai are non-zero. The operations are the obvious ones.
E. 9-44
• For polynomial f , we identify f and f + 0R · X n+1 as the same thing.
Note that a polynomial is just a sequence of numbers, interpreted as the coefficients
of some formal symbols. While it does indeed induce a function in the obvious way,
we shall not identify the polynomial with the function given by it, since different
polynomials can give rise to the same function.
For example, in Z/2Z[X], f = X 2 + X is not the zero polynomial, since its
coefficients are not zero. However, f (0) = 0 and f (1) = 0. As a function, this is
identically zero. So f 6= 0 as a polynomial but f = 0 as a function.
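This distinction is easy to verify directly; a small sketch (my own, not from the notes):

```python
# Sketch (not from the notes): over Z/2Z, f = X^2 + X has non-zero coefficients
# but vanishes at every point of Z/2Z.
f = [0, 1, 1]    # coefficients a0, a1, a2 of X^2 + X

def evaluate(coeffs, x, modulus=2):
    return sum(a * x ** i for i, a in enumerate(coeffs)) % modulus

assert any(a != 0 for a in f)                     # f != 0 as a polynomial
assert all(evaluate(f, x) == 0 for x in (0, 1))   # f = 0 as a function
print("non-zero polynomial, identically zero function")
```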
• A power series is emphatically not a function. We don't ask whether the sum converges, because it is not a sum.
• R[X] is in fact a ring. Is 1 − X ∈ R[X] a unit? For every non-zero g = a0 + · · · + anX^n (with an ≠ 0), the product (1 − X)g has a non-zero coefficient of X^{n+1}, namely −an, so (1 − X)g ≠ 1. Hence 1 − X is not a unit in R[X]. However, in the power series ring R[[X]] it is, since

    (1 − X)(1 + X + X² + X³ + · · ·) = 1.
• We can also think of Laurent series, but we have to be careful. We allow infinitely
many positive coefficients, but only finitely many negative ones. Or else, in the
formula for multiplication, we will have an infinite sum, which is undefined.
• Let X be a set and R a ring. Then the set of all functions on X, ie. functions f : X → R, is a ring with pointwise operations, (f + g)(x) = f(x) + g(x) and (f · g)(x) = f(x) · g(x). Here zero is the constant function 0 and one is the constant function 1.
Usually, we don’t want to consider all functions X → R. Instead, we look at some
subrings of this. For example, we can consider the ring of all continuous functions
R → R. This contains, for example, the polynomial functions, which is just R[X]
(since in R, polynomials are functions).
D. 9-45
• Let R, S be rings. A function φ : R → S is a ring homomorphism if it preserves
everything we can think of, that is
1. φ(r1 + r2 ) = φ(r1 ) + φ(r2 ),
2. φ(0R ) = 0S ,
3. φ(r1 · r2 ) = φ(r1 ) · φ(r2 ),
4. φ(1R ) = 1S .
If a homomorphism φ : R → S is a bijection, we call it an isomorphism .
• The kernel of a homomorphism φ : R → S is ker(φ) = {r ∈ R : φ(r) = 0S }, and
its image is Im(φ) = {s ∈ S : s = φ(r) for some r ∈ R}.
• A subset I ⊆ R is an ideal , written I C R, if
1. It is an additive subgroup of (R, +, 0R ), ie. it is closed under addition and
additive inverses. (additive closure)
2. If a ∈ I and b ∈ R, then a · b ∈ I. (strong closure)
We say I is a proper ideal if I 6= R.
• Let R be a ring. For an element a ∈ R, we write (a) = aR = {a · r : r ∈ R} C R. This is the ideal generated by a. An ideal I is a principal ideal if I = (a) for some a ∈ R. For a1, a2, · · · , ak ∈ R, the ideal generated by a1, · · · , ak is

    (a1, a2, · · · , ak) = {a1r1 + · · · + akrk : r1, · · · , rk ∈ R}.
E. 9-46
In the group scenario, we had groups, subgroups and normal subgroups, which
are special subgroups. Here, we have a special kind of subsets of a ring that act
like normal subgroups, known as ideals.
Note that the multiplicative closure is stronger than what we require for subrings
— for subrings, it has to be closed under multiplication by its own elements; for
ideals, it has to be closed under multiplication by everything in the world. This
is similar to how normal subgroups not only have to be closed under internal
multiplication, but also conjugation by external elements.
Principal ideals are rather nice ideals, since they are easy to describe, and often
have some nice properties.
Note that it is easier to come up with ideals than normal subgroups — we can
just pick up random elements, and then take the ideal generated by them.
L. 9-47
A homomorphism φ : R → S is injective if and only if ker φ = {0R }.
E. 9-49
Suppose I C R is an ideal, and 1R ∈ I. Then for any r ∈ R, the axioms entail
1R · r ∈ I. But 1R · r = r. So if 1R ∈ I, then I = R.
In other words, every proper ideal does not contain 1. In particular, every proper
ideal is not a subring, since a subring must contain 1. We are starting to diverge
from groups. In groups, a normal subgroup is a subgroup, but here an ideal is not
a subring.
We can generalize the above a bit. Suppose I C R and u ∈ I is a unit, ie. there is
some v ∈ R such that uv = 1R . Then by strong closure, 1R = u · v ∈ I. So I = R.
Hence proper ideals are not allowed to contain any unit at all, not just 1R .
E. 9-50
Consider the ring Z of integers. Then every ideal of Z is of the form
To show these are all the ideals, let I C Z. If I = {0}, then I = 0Z. Otherwise,
let n ∈ N be the smallest positive element of T . We want to show in fact I = nZ.
Certainly nZ ⊆ I by strong closure.
Now let m ∈ I. By the Euclidean algorithm, we can write m = q · n + r with
0 ≤ r < n. Now n, m ∈ I. So by strong closure, m, qn ∈ I. So r = m − q · n ∈ I.
As n is the smallest positive element of I, and r < n, we must have r = 0. So
m = q · n ∈ nZ. So I ⊆ nZ. So I = nZ.
So what we have just shown for Z is that all its ideals are principal. Not all rings are like this; such rings are special. The key to proving this was that we can perform the Euclidean algorithm on Z. Thus, for any ring R in which we can “do the Euclidean algorithm”, every ideal is of the form aR = {a · r : r ∈ R} for some a ∈ R. We will make this notion precise later.
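The same principality phenomenon can be observed computationally: the ideal (a, b) C Z coincides with (gcd(a, b)). A sketch (my own, checked on a bounded window of integers):

```python
# Sketch (not from the notes): within a window of Z, the combinations
# a*x + b*y are exactly the multiples of gcd(a, b), ie. (a, b) = (gcd(a, b)).
from math import gcd

a, b = 12, 18                      # made-up example; gcd(12, 18) = 6
N = 100                            # window of integers to inspect
combos = {a * x + b * y for x in range(-N, N + 1) for y in range(-N, N + 1)}
g = gcd(a, b)

window = {m for m in combos if -N <= m <= N}
assert window == {k * g for k in range(-(N // g), N // g + 1)}
print(f"(a, b) = ({g}) as ideals of Z")
```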
E. 9-51
Consider {f ∈ R[X] : the constant coefficient of f is 0}. This is an ideal, as we
can check manually (alternatively, it is the kernel of the “evaluate at 0” homomor-
phism). It turns out this is a principal ideal, in fact it is (X).
D. 9-52
Let I C R. The quotient ring R/I consists of the (additive) cosets r + I, with zero and one given by 0R + I and 1R + I, and operations

    (r1 + I) + (r2 + I) = (r1 + r2) + I,   (r1 + I) · (r2 + I) = r1r2 + I.
P. 9-53
The quotient ring is a ring, and the function R → R/I with r 7→ r + I is a ring
homomorphism.
To see that multiplication is well-defined, suppose r1′ + I = r1 + I and r2′ + I = r2 + I. Write r1′ = r1 + a1 and r2′ = r2 + a2 with a1, a2 ∈ I. Then r1′r2′ = r1r2 + r1a2 + a1r2 + a1a2. By the strong closure property, the last three terms are in I. So r1′r2′ + I = r1r2 + I.
It is easy to check that 0R + I and 1R + I are indeed the zero and one, and the
function given is clearly a homomorphism.
This is true, because we defined ideals to be those things that can be quotiented
by. So we just have to check we made the right definition. Just as we could have
come up with the definition of a normal subgroup by requiring operations on the
cosets to be well-defined, we could have come up with the definition of an ideal by
requiring the multiplication of cosets is well-defined, and we will end up with the
strong closure property.
E. 9-54
• We have the ideals nZ C Z, and so the quotient rings Z/nZ. The elements are of the form m + nZ, so are just 0 + nZ, 1 + nZ, · · · , (n − 1) + nZ: the familiar integers modulo n.
• Consider the ideal (X) C C[X]. Elements of C[X]/(X) are cosets

    a0 + a1X + a2X² + · · · + anX^n + (X).

But everything but the first term is in (X). So every such coset equals a0 + (X). This representation is unique, since a0 + (X) = b0 + (X) ⇒ a0 − b0 is divisible by X ⇒ a0 − b0 = 0 ⇒ a0 = b0. So in fact C[X]/(X) ≅ C, with the isomorphism a0 + (X) ↔ a0.
T. 9-55
<Euclidean algorithm for polynomials> Let F be a field and f, g ∈ F[X] with g ≠ 0. Then there are some r, q ∈ F[X] such that f = gq + r with deg r < deg g.
Let deg(f) = n and write f = Σ_{i=0}^{n} aiX^i with an ≠ 0. Similarly, write deg(g) = m and g = Σ_{i=0}^{m} biX^i with bm ≠ 0. If n < m, we let q = 0 and r = f, and we are done.
Otherwise, suppose n ≥ m, and proceed by induction on n. We let

    f1 = f − an bm⁻¹ X^{n−m} g.

This is possible since bm ≠ 0 and F is a field. By construction, the coefficients of X^n cancel out, so deg(f1) < n.
If n = m, then deg(f1) < n = m = deg(g). So we can write f = (an bm⁻¹ X^{n−m})g + f1 with deg(f1) < deg(g), and we are done. Otherwise, if n > m, then as deg(f1) < n, by induction we can find r1, q1 such that f1 = gq1 + r1 and deg(r1) < deg(g) = m. Then

    f = an bm⁻¹ X^{n−m} g + q1g + r1 = (an bm⁻¹ X^{n−m} + q1)g + r1.
This is like the usual Euclidean algorithm, except that instead of the absolute
value, we use the degree to measure how “big” the polynomial is.
Now that we have a Euclidean algorithm for polynomials, we should be able to show that every ideal of F[X] is generated by one polynomial. We will not prove it specifically here, but will later show that, in general, in every ring where the Euclidean algorithm is possible, all ideals are principal.
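The division step in the proof translates directly into code. A sketch (my own, not from the notes), working over Q with exact arithmetic via Python's Fraction:

```python
# Sketch (not from the notes): polynomial division over the field Q, mirroring
# the proof: repeatedly subtract (a_n / b_m) X^(n-m) * g to kill the leading
# term. Polynomials are coefficient lists [a0, a1, ..., an].
from fractions import Fraction

def polydiv(f, g):
    # returns (q, r) with f = g*q + r and deg r < deg g
    f = [Fraction(c) for c in f]
    g = [Fraction(c) for c in g]
    q = [Fraction(0)] * max(len(f) - len(g) + 1, 1)
    while len(f) >= len(g) and any(f):
        shift = len(f) - len(g)
        coeff = f[-1] / g[-1]          # a_n * b_m^{-1}, needs a field
        q[shift] = coeff
        for i, c in enumerate(g):      # f1 = f - coeff * X^shift * g
            f[shift + i] -= coeff * c
        while len(f) > 1 and f[-1] == 0:
            f.pop()                    # drop the cancelled leading terms
    return q, f

# example: X^3 + 2X + 1 divided by X^2 + 1 gives q = X, r = X + 1
q, r = polydiv([1, 2, 0, 1], [1, 0, 1])
print(q, r)
```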
E. 9-56
Consider R[X] and the principal ideal (X² + 1) C R[X]. We let R = R[X]/(X² + 1). Elements of R are cosets

    f + (X² + 1),   f = a0 + a1X + a2X² + · · · + anX^n.

Since X² ≡ −1 modulo the ideal, every coset has a representative of the form a + bX with a, b ∈ R. Define φ : R → C by a + bX + (X² + 1) ↦ a + bi. This is clearly additive, and it is multiplicative since

    φ((a + bX + (X² + 1))(c + dX + (X² + 1))) = φ((ac − bd) + (ad + bc)X + (X² + 1)) = (ac − bd) + (ad + bc)i = (a + bi)(c + di) = φ(a + bX + (X² + 1)) φ(c + dX + (X² + 1)).

One checks φ is a bijection, so R[X]/(X² + 1) ≅ C.
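The coset multiplication rule can be checked against complex multiplication numerically; a sketch of mine:

```python
# Sketch (not from the notes): coset multiplication in R[X]/(X^2 + 1), reduced
# via X^2 = -1, agrees with complex multiplication under a + bX -> a + bi.
import random

def mult_mod(ab, cd):
    # (a + bX)(c + dX) = ac + (ad + bc)X + bd*X^2 = (ac - bd) + (ad + bc)X
    (a, b), (c, d) = ab, cd
    return (a * c - b * d, a * d + b * c)

random.seed(0)
for _ in range(100):
    a, b, c, d = (random.uniform(-5, 5) for _ in range(4))
    u, v = mult_mod((a, b), (c, d))
    assert abs(complex(u, v) - complex(a, b) * complex(c, d)) < 1e-9
print("multiplication in R[X]/(X^2+1) matches C")
```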
Write f = Σ_{k=0}^{n} akX^k. We try to write f = (X − w)(bn−1X^{n−1} + bn−2X^{n−2} + · · · + b0) + r with r ∈ R. Equating the coefficients from the highest power to the lowest, we find that

    f = (X − w) α(X) + β,   where α(X) = Σ_{k=0}^{n−1} ( Σ_{j=k+1}^{n} w^{j−(k+1)} aj ) X^k   and   β = a0 + Σ_{j=1}^{n} w^j aj.
T. 9-58
<First isomorphism theorem> Let φ : R → S be a ring homomorphism. Then ker(φ) C R and R/ker(φ) ≅ Im(φ) ≤ S.
It is important to note that quotienting in groups and in rings has different purposes. In groups, we take quotients so that we have a simpler group to work with. In rings, we often take quotients to get more interesting rings. For example, R[X] is quite boring, but R[X]/(X² + 1) ≅ C is more interesting. Thus quotienting occasionally lets us get interesting rings from boring ones.
T. 9-62
<Third isomorphism theorem> Let I C R and J C R with I ⊆ J. Then J/I C R/I and

    (R/I)/(J/I) ≅ R/J.

Consider the map φ : R/I → R/J given by r + I ↦ r + J; this is a well-defined surjective ring homomorphism, with kernel

    ker(φ) = {r + I : r + J = 0 + J} = {r + I : r ∈ J} = J/I.
So the result follows from the first isomorphism theorem.
E. 9-63
Note that for any ring R, there is a unique ring homomorphism ι : Z → R, given by

    ι(n) = 1R + 1R + · · · + 1R (n times) for n ≥ 0,   ι(n) = −(1R + 1R + · · · + 1R) (|n| times) for n < 0.
We need to show that the product of two non-zero elements is non-zero. Let f, g ∈ R[X] be non-zero, say

    f = a0 + a1X + · · · + anX^n ∈ R[X],
    g = b0 + b1X + · · · + bmX^m ∈ R[X],

with an, bm ≠ 0. Then the coefficient of X^{n+m} in fg is an bm, which is non-zero since R is an integral domain. So fg ≠ 0, and R[X] is an integral domain.
So, for instance, Z[X] is an integral domain. We can also iterate this: if R is an integral domain, so is R[X1, · · · , Xn].
T. 9-68
Every integral domain has a field of fractions.
The construction is exactly how we construct the rationals from the integers – as
equivalence classes of pairs of integers. We let S = {(a, b) ∈ R × R : b ≠ 0}. We think of (a, b) ∈ S as a/b. We define the equivalence relation ∼ on S by (a, b) ∼ (c, d) iff ad = bc.
To show this is indeed a equivalence relation: symmetry and reflexivity are obvious.
To show transitivity, suppose (a, b) ∼ (c, d) and (c, d) ∼ (e, f ), ie. ad = bc and
cf = de. We multiply the first equation by f and the second by b, to obtain
adf = bcf and bcf = bed. Rearranging, we get d(af − be) = 0. Since d is in the
denominator, d 6= 0. Since R is an integral domain, we must have af − be = 0, ie.
af = be. So (a, b) ∼ (e, f ). This is where being an integral domain is important.
Now let F = S/∼ be the set of equivalence classes. We now want to check this is indeed the field of fractions. We first want to show it is a field. We write a/b = [(a, b)] ∈ F, and define the operations by

    a/b + c/d = (ad + bc)/(bd),   (a/b) · (c/d) = (ac)/(bd).

We can check that these are well-defined, and make (F, +, ·, 0/1, 1/1) into a ring. Moreover, if a/b ≠ 0/1, then a ≠ 0, so b/a ∈ F and

    (b/a) · (a/b) = (ba)/(ba) = 1.

So a/b has a multiplicative inverse. So F is a field.
Recall that a subring of any field is an integral domain. This says the converse: every integral domain is a subring of some field. For example, Q is the field of fractions of Z. The field of fractions of C[X] is the field of all rational functions p(X)/q(X), where p, q ∈ C[X] and q ≠ 0.
This gives us a very useful tool. Since this gives us a field from an integral domain,
this allows us to use field techniques to study integral domains. Moreover, we can
use this to construct new interesting fields from integral domains.
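Python's `fractions.Fraction` realizes exactly this construction for R = Z: it stores a reduced representative of each equivalence class and implements the operations above. A quick check (my own sketch):

```python
# Sketch (not from the notes): Fraction realizes the field of fractions of Z,
# with (a, b) ~ (c, d) iff ad = bc and the operations defined on representatives.
from fractions import Fraction

assert Fraction(2, 4) == Fraction(1, 2)      # (2,4) ~ (1,2) since 2*2 = 4*1

a, b, c, d = 3, 4, 5, 6                      # made-up representatives
assert Fraction(a, b) + Fraction(c, d) == Fraction(a * d + b * c, b * d)
assert Fraction(a, b) * Fraction(c, d) == Fraction(a * c, b * d)
assert Fraction(a, b) * Fraction(b, a) == 1  # non-zero elements are invertible
print("Q realized as the field of fractions of Z")
```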
D. 9-69
A proper ideal I of a ring R (ie. with I ≠ R) is said to be
• prime if whenever a, b ∈ R with a · b ∈ I, then a ∈ I or b ∈ I;
• maximal if there is no proper ideal J with I ⊊ J.
E. 9-70
A non-zero ideal nZ C Z is prime if and only if n is a prime. To show this, first
suppose n = p is a prime, and a · b ∈ pZ, then p | a · b. So p | a or p | b, ie.
a ∈ pZ or b ∈ pZ. For the other direction, suppose n = pq is a composite number
(p, q 6= 1). Then n ∈ nZ but p 6∈ nZ and q 6∈ nZ, since 0 < p, q < n.
So instead of talking about prime numbers, we can talk about prime ideals instead.
L. 9-71
A (non-zero) ring R is a field if and only if its only ideals are {0} and R.
Note that we don’t need elements to define the ideals {0} and R. {0} can be
defined as the ideal that all other ideals contain, and R is the ideal that contains
all other ideals. Alternatively, we can reword this as “R is a field if and only if
it has only two ideals” to avoid mentioning explicit ideals. This is another reason
why fields are special. They have the simplest possible ideal structure.
L. 9-72
1. An ideal I ◁ R is maximal if and only if R/I is a field.
2. An ideal I ◁ R is prime if and only if R/I is an integral domain.
1. R/I is a field if and only if {0} and R/I are the only ideals of R/I. By the
ideal correspondence, this is equivalent to saying that I and R are the only ideals
of R which contain I, ie. I is maximal.
P. 9-73
Every maximal ideal is a prime ideal.
The converse is not true. For example, {0} ⊆ Z is prime but not maximal. Also,
(X) ◁ Z[X, Y ] is prime but not maximal (since Z[X, Y ]/(X) ≅ Z[Y ], which is an
integral domain but not a field).
L. 9-74
Let R be an integral domain. Then its characteristic is either 0 or a prime number.
Consider the unique ring homomorphism φ : Z → R, and let ker(φ) = nZ. Then n is the characteristic
of R by definition. By the first isomorphism theorem, Z/nZ ≅ Im(φ) ≤ R. So
Z/nZ is an integral domain. So nZ ◁ Z is a prime ideal. So n = 0 or a prime number.
9.3 Rings II
D. 9-75
Let R be an integral domain.
• Elements a, b ∈ R are associates if a = b · c for some unit c.
• An element r ∈ R is irreducible if it is not zero or a unit, and whenever r = a · b, either a or b is a unit.
• An element r ∈ R is prime if it is not zero or a unit, and whenever r | a · b, either r | a or r | b.
E. 9-76
In the integers, two numbers are associates if and only if they differ by a sign, but in
more interesting rings, more interesting things can happen. When considering
division in rings, we often consider two associates to be “the same”. For example,
in Z, we can factorize 6 as 6 = 2 · 3 = (−2) · (−3) but this does not violate unique
factorization, since 2 and −2 are associates (and so are 3 and −3), and we consider
these two factorizations to be “the same”.
For integers, being irreducible is the same as being a prime number. However,
“prime” means something different in general rings.
It is important to note that all these properties depend on the ring, not the element
itself. For example 2 ∈ Z is a prime, but 2 ∈ Q is not (since it is a unit). Similarly,
the polynomial 2X ∈ Q[X] is irreducible (since 2 is a unit), but 2X ∈ Z[X] is not
irreducible (since 2X = 2 · X and neither 2 nor X is a unit in Z[X]).
L. 9-77
Let R be an integral domain. A principal ideal (r) is a prime ideal iff r = 0 or r is
prime.
L. 9-78
Let R be an integral domain. If r ∈ R is prime, then r is irreducible.
Let r ∈ R be prime, and suppose r = ab. Since r | r = ab, and r is prime, we must
have r | a or r | b. wlog, r | a. So a = rc for some c ∈ R. So r = ab = rcb. Since
we are in an integral domain, we must have 1 = cb. So b is a unit.
The converse is in general not true, although it’s true in Z.
E. 9-79
Consider
  R = Z[√−5] = {a + b√−5 : a, b ∈ Z} ≤ C.
By definition, it is a subring of a field. So it is an integral domain. What are the
units of the ring? There is a nice trick we can use, when things are lying inside C.
Consider the function
  N : R → Z≥0 given by N (a + b√−5) = a² + 5b².
a = b · q + r,  where r = b · c.
We know r = a − bq ∈ Z[i], and φ(r) = N (bc) = N (b)N (c) < N (b) = φ(b).
This is not just true for the Gaussian integers. All we really needed was that
R ≤ C, and for any x ∈ C, there is some point in R that is less than 1 away
from x. If we draw some more pictures, we will see this is not true for Z[√−5].
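The geometric observation can be made into an explicit division algorithm for Z[i]: round the exact quotient to the nearest Gaussian integer, which lies at distance less than 1 from it. A sketch using Python's built-in complex numbers (not from the notes):

```python
def norm(z):
    """N(x + iy) = x^2 + y^2."""
    return z.real ** 2 + z.imag ** 2

def gaussian_divmod(a, b):
    """Euclidean division in Z[i]: returns (q, r) with a = b*q + r and N(r) < N(b)."""
    exact = a / b                                      # exact quotient in C
    q = complex(round(exact.real), round(exact.imag))  # nearest Gaussian integer
    return q, a - b * q

a, b = complex(27, 23), complex(8, 1)
q, r = gaussian_divmod(a, b)
assert a == b * q + r and norm(r) < norm(b)
```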
E. 9-82
Z is a principal ideal domain.
P. 9-83
If R is a Euclidean domain, then R is a principal ideal domain.
Then we continue the proof as above. Hence what we did in the middle is to do
something similar to showing p and a are “coprime”.
Note that this result is also true for general unique factorization domains, which we
can prove directly via unique factorization.
L. 9-87
Let R be a principal ideal domain. Let I1 ⊆ I2 ⊆ I3 ⊆ · · · be a chain of ideals.
Then there is some N ∈ N such that In = In+1 for all n ≥ N . (Every principal
ideal domain is Noetherian)
The obvious thing to do when we have an infinite chain of ideals is to take the
union of them. Let I = ⋃_{n≥1} In , which is again an ideal (as the In form a chain). Since R is a principal
ideal domain, I = (a) for some a ∈ R. We know a ∈ I = ⋃_{n≥1} In . So a ∈ IN
for some N . Then we have (a) ⊆ IN ⊆ I = (a). So we must have IN = I. So
In = IN = I for all n ≥ N .
Notice it is not important that I is generated by one element. If, for some reason,
we know I is generated by finitely many elements, then the same argument works.
So if every ideal is finitely generated, then the ring must be Noetherian. It turns
out this is an if-and-only-if — if you are Noetherian, then every ideal is finitely
generated. We will prove this later.
P. 9-88
If R is a principal ideal domain, then R is a unique factorization domain.
By assumption, the process does not end, and then we have the following chain
of ideals: (r) ⊆ (r1 ) ⊆ (r2 ) ⊆ · · · ⊆ (rn ) ⊆ · · · . But then we have an ascending
chain of ideals. By the ascending chain condition, these are all eventually equal,
ie. there is some n such that (rn ) = (rn+1 ) = (rn+2 ) = · · · . In particular, since
(rn ) = (rn+1 ), and rn = rn+1 sn+1 , then sn+1 is a unit. But this is a contradiction,
since sn+1 is not a unit. So r must be a product of irreducibles.
D. 9-89
• d is a greatest common divisor (gcd) of a1 , a2 , · · · , an if d | ai for all i, and if any
other d′ satisfies d′ | ai for all i, then d′ | d.
• m is a least common multiple (lcm) of a1 , a2 , · · · , an if ai | m for all i, and if any
other m′ satisfies ai | m′ for all i, then m | m′ .
• Let R be a UFD and f = a0 + a1 X + · · · + an X^n ∈ R[X]. The content c(f ) of
f is c(f ) = gcd(a0 , a1 , · · · , an ) ∈ R. A polynomial is called primitive if c(f ) is a
unit, ie. the ai are coprime.
E. 9-90
Note that the gcd or lcm of a set of numbers, if it exists (which it might not), is not
unique – it is only well-defined up to a unit. And since the gcd is only defined up
to a unit, so is the content. This is why, in the definition of primitive, we ask for
c(f ) to be a unit: we cannot ask for c(f ) to be exactly 1, since the gcd is only
well-defined up to a unit.
L. 9-91
In a unique factorization domain, the gcd and lcm always exists, and are unique
up to associates.
We construct the greatest common divisor using the good old way of prime fac-
torization. We let p1 , p2 , · · · , pm be a list of all irreducible factors of the ai , such
that no two of these are associates of each other. For each i we can write
  ai = ui ∏_{j=1}^{m} pj^{nij} ,  where nij ∈ N and the ui are units.
We let mj = mini {nij } and let d = ∏_{j=1}^{m} pj^{mj} . As, by definition, mj ≤ nij for all
i, we know d | ai for all i.
Finally, if d′ | ai for all i, then we can write d′ = v ∏_{j=1}^{m} pj^{tj} for some unit v. Then we must have tj ≤ nij
for all i, j. Hence tj ≤ mj for all j. So d′ | d.
Uniqueness is immediate since any two greatest common divisors have to divide
each other. The argument for lcm is similar.
This result tells us that, for R a UFD, c(f ) always exists.
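In R = Z the construction in the proof can be carried out mechanically. A small illustration (the helper names are my own):

```python
from collections import Counter

def factor(n):
    """Prime factorization of n >= 2 by trial division (illustration only)."""
    f, p = Counter(), 2
    while n > 1:
        while n % p == 0:
            f[p] += 1
            n //= p
        p += 1
    return f

def gcd_by_factorization(nums):
    """d = prod_j p_j^(m_j) with m_j = min_i n_ij, as in the proof."""
    facts = [factor(n) for n in nums]
    d = 1
    for p in set().union(*facts):
        d *= p ** min(f[p] for f in facts)  # Counter reports 0 for absent primes
    return d

assert gcd_by_factorization([24, 36, 60]) == 12
```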
E. 9-92
<Factorisations of polynomials over a field> Since polynomial rings are a
bit more special than general integral domains, we can say a bit more about them.
Recall that for F a field, we know F [X] is a Euclidean domain, hence a principal
ideal domain, hence a unique factorization domain. Therefore we know
1. If I ◁ F [X], then I = (f ) for some f ∈ F [X].
2. If f ∈ F [X], then f is irreducible if and only if f is prime.
3. Let f be irreducible, and suppose (f ) ⊆ J ⊆ F [X]. Then J = (g) for some g.
Since (f ) ⊆ (g), we must have f = gh for some h. But f is irreducible. So
either g or h is a unit. If g is a unit, then (g) = F [X]. If h is a unit, then
(f ) = (g). So (f ) is a maximal ideal. Note that this argument is valid for any
PID, not just polynomial rings.
374 CHAPTER 9. GROUPS, RINGS AND MODULES
We can write f = c(f )f1 and g = c(g)g1 , with f1 and g1 primitive. Now f g =
c(f )c(g)f1 g1 . Since f1 g1 is primitive, c(f )c(g) is a gcd of the coefficients of f g,
and so is c(f g), by definition. So they are associates.
Again, we cannot say they are equal, since content is only well-defined up to a
unit.
L. 9-95
<Gauss’ lemma> Let R be a UFD, and f ∈ R[X] be a primitive polynomial.
Then f is reducible in R[X] if and only if f is reducible in F [X], where F is the
field of fractions of R.
(Forward) Let f = gh be a product in R[X] with g, h not units. As f is primitive,
so are g and h. So both have degree > 0. So g, h are not units in F [X]. So f is
reducible in F [X].
(Backward) Let f = gh in F [X], with g, h not units. So g and h have degree
> 0, since F is a field. So we can clear denominators by finding a, b ∈ R such
that (ag), (bh) ∈ R[X] (eg. let a be the product of denominators of coefficients
of g). Then we get abf = (ag)(bh) and this is a factorization in R[X]. Here we
have to be careful — (ag) is one thing that lives in R[X], and is not necessarily
a product in R[X], since g might not be in R[X]. So we should just treat it as a
single symbol. We now write (ag) = c(ag)g1 and (bh) = c(bh)h1 where g1 , h1 are
primitive. So we have
  c(ag)c(bh) = u⁻¹ · c(abf ) = u⁻¹ · ab · c(f ) = u⁻¹ · ab for some unit u,
by the previous result (using that c(f ) is a unit, as f is primitive, and absorbing it into u). But also we have abf = (ag)(bh) = c(ag)c(bh)g1 h1 = u⁻¹ ab g1 h1 . So
cancelling ab gives f = u⁻¹ g1 h1 ∈ R[X]. Since g1 and h1 have the same positive degrees as g and h, neither is a unit. So f is reducible in R[X].
If this looks fancy and magical, you can try to do this explicitly in the case where
R = Z and F = Q. Then you will probably get enlightened.
E. 9-96
Consider X³ + X + 1 ∈ Z[X]. This has content 1 so is primitive. We show it is
not reducible in Z[X], and hence not reducible in Q[X].
Suppose f = X³ + X + 1 is reducible in Q[X]. Then by Gauss' lemma, it is reducible in Z[X].
So we can write X³ + X + 1 = gh for some polynomials g, h ∈ Z[X], with g, h
not units. But if g and h are not units, then they cannot be constant, since the
coefficients of X³ + X + 1 are all 1 or 0. So they have degree at least 1. Since
the degrees add up to 3, we wlog suppose g has degree 1 and h has degree 2. So
suppose
  g = b0 + b1 X,  h = c0 + c1 X + c2 X².
Multiplying out and equating coefficients, we get b0 c0 = 1 and c2 b1 = 1. So b0
and b1 must be ±1. So g is either 1 + X, 1 − X, −1 + X or −1 − X, and hence f has
±1 as a root. But this is a contradiction, since neither 1 nor −1 is a root of X³ + X + 1.
So f is not reducible in Q[X]. In particular, f has no root in Q.
We see the advantage of using Gauss’ lemma — if we worked in Q instead, we
could have gotten to the step b0 c0 = 1, and then we can do nothing, since b0 and
c0 can be many things if we live in Q.
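The case analysis above comes down to checking finitely many candidate roots; a minimal sketch of that check (not from the notes):

```python
def poly_eval(coeffs, x):
    """Evaluate a0 + a1*X + ... + an*X^n, given as the list [a0, a1, ..., an]."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

f = [1, 1, 0, 1]  # X^3 + X + 1

# A degree-1 factor b0 + b1*X with b0, b1 = ±1 would force a root at 1 or -1:
assert all(poly_eval(f, x) != 0 for x in (1, -1))
```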
P. 9-97
Let R be a UFD, and F be its field of fractions. Let g ∈ R[X] be primitive. Write
J = (g) ◁ R[X] and I = (g) ◁ F [X], then J = I ∩ R[X]. In other words, if f ∈ R[X]
and we can write it as f = gh, with h ∈ F [X], then we also have f = gh0 with
h0 ∈ R[X].
The strategy is the same as for Gauss' lemma – we clear denominators in the equation
f = gh, and then use contents to get back down into R[X]. We certainly have
J ⊆ I ∩ R[X]. Now let f ∈ I ∩ R[X]. So we can write f = gh with h ∈ F [X].
We can choose b ∈ R such that bh ∈ R[X]. Then bf = g(bh) ∈ R[X]. Write
(bh) = c(bh)h1 with h1 ∈ R[X] primitive. Thus bf = c(bh)gh1 .
Since g is primitive, so is gh1 and so c(bh) = uc(bf ) for u a unit. Moreover bf is a
product in R[X], so c(bf ) = vc(b)c(f ) = vbc(f ) for some unit v. So now we have
bf = uvbc(f )gh1 . Cancelling b gives f = g(uvc(f )h1 ). So f ∈ J.
T. 9-98
If R is a UFD, then R[X] is a UFD. In particular, R[X1 , · · · , Xn ] is also a UFD.
We know R[X] has a notion of degree. So we will combine this with the fact that
R is a UFD. Let f ∈ R[X]. We can write f = c(f )f1 , with f1 primitive. Firstly,
then f is irreducible in R[X], and hence also in F [X], where F is the field of
fractions of R.
are all divisible by p. Also, since p ∤ rj and p ∤ s0 , we know p ∤ rj s0 , using the fact
that p is prime. So p ∤ aj . So we must have j = n.
We also know that j ≤ k ≤ n. So we must have j = k = n. So deg g = n. Hence
deg h = n − deg g = 0. So h is a constant. But we also know f is primitive. So h must be
a unit. So this is not a proper factorization.
It is important that we work in R[X] all the time, until the end where we apply
Gauss' lemma. Otherwise, we cannot possibly apply Eisenstein's criterion, since
there are no primes in F .
E. 9-100
• Consider the polynomial Xⁿ − p ∈ Z[X] for p a prime. Apply Eisenstein's criterion
with p, and observe all the conditions hold. This is certainly primitive, since it
is monic. So Xⁿ − p is irreducible in Z[X], hence in Q[X]. In particular, Xⁿ − p
has no rational roots, ie. ⁿ√p is irrational (for n > 1).
• Consider the polynomial f = X^{p−1} + X^{p−2} + · · · + X² + X + 1 ∈ Z[X] where p is
a prime number. If we look at this, we notice Eisenstein's criterion does not apply directly.
What should we do? We observe that
  f = (X^p − 1)/(X − 1).
So we substitute X = Y + 1 and consider
  f̂ (Y ) = f (Y + 1) = ((Y + 1)^p − 1)/Y = Y^{p−1} + (p choose 1) Y^{p−2} + · · · + (p choose p−1).
When we look at it hard enough, we notice Eisenstein's criterion can now be applied –
we know p | (p choose i) for 1 ≤ i ≤ p − 1, but p² ∤ (p choose p−1) = p. So f̂ is irreducible in Z[Y ],
and hence f is irreducible in Z[X] (any factorization of f would give a factorization of f̂ ), hence in Q[X].
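Both examples reduce to checking the divisibility conditions of Eisenstein's criterion on a coefficient list; a sketch (the precise statement of the criterion is assumed from earlier in the notes):

```python
def eisenstein_applies(coeffs, p):
    """coeffs = [a0, ..., an]: check p ∤ an, p | ai for i < n, and p^2 ∤ a0."""
    a0, an = coeffs[0], coeffs[-1]
    return (an % p != 0
            and all(a % p == 0 for a in coeffs[:-1])
            and a0 % (p * p) != 0)

# X^n - p has coefficient list [-p, 0, ..., 0, 1]:
for p in (2, 3, 5, 7):
    for n in (2, 3, 5):
        assert eisenstein_applies([-p] + [0] * (n - 1) + [1], p)
```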
P. 9-103
A prime number p ∈ Z is prime in Z[i] if and only if p ≠ a² + b² for all a, b ∈ Z \ {0}.
(Backward) Now suppose p is not prime in Z[i], say p = uv with u, v not units. Taking norms, we get
p² = N (u)N (v). Since u and v are not units, N (u), N (v) ≠ 1. So N (u) = N (v) = p. Writing
u = a + ib, this says a² + b² = p.
L. 9-104
Let p be a prime number. Let Fp = Z/pZ be the field with p elements. Let
Fp× = Fp \ {0} be the group of invertible elements under multiplication. Then
Fp× ≅ Cp−1 .
This is a funny proof, since we have not found any element that has order p − 1.
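The lemma can at least be verified computationally for small p by exhibiting an element of order p − 1; a sketch (not from the notes):

```python
def order(g, p):
    """Multiplicative order of g in F_p^x, for g not divisible by the prime p."""
    x, k = g % p, 1
    while x != 1:
        x = x * g % p
        k += 1
    return k

# F_p^x is cyclic: some element has order p - 1
for p in (3, 5, 7, 11, 13):
    assert any(order(g, p) == p - 1 for g in range(2, p))
```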
P. 9-105
The primes in Z[i] are, up to associates,
1. Prime numbers p ∈ Z ≤ Z[i] such that p ≡ 3 (mod 4).
2. Gaussian integers z ∈ Z[i] with N (z) = z z̄ = p for some prime p such that
p = 2 or p ≡ 1 (mod 4).
We first show these are indeed primes in Z[i]. If p ≡ 3 (mod 4), then p ≠ a² + b²,
since a square number mod 4 is always 0 or 1. So those in 1. are primes in Z[i]. If
N (z) = p, and z = uv, then N (u)N (v) = p. So N (u) is 1 or N (v) is 1. So u or v
is a unit. Note that we did not use the condition that p ≢ 3 (mod 4). This is not
needed, since N (z) is always a sum of squares, and hence N (z) cannot be a prime
that is 3 mod 4.
Now we show that they are all the primes in Z[i]. Suppose z ∈ Z[i] is irreducible,
hence prime. Then z̄ is also irreducible. So N (z) = z z̄ is a factorization of N (z)
into irreducibles. Let p ∈ Z be an ordinary prime number dividing N (z), which
exists since N (z) ≠ 1.
Now if p ≡ 3 (mod 4), then p itself is prime in Z[i] by the first part of the proof.
So p | N (z) = z z̄. So p | z or p | z̄. Note that if p | z̄, then p | z by taking complex
conjugates. So we get p | z. Since p and z are both irreducible, they must
be equal up to associates.
T. 9-106
A positive integer n can be written as n = x² + y² with x, y ∈ Z if and only if every prime factor p of n with p ≡ 3 (mod 4) occurs to an even power in n.
Note that we have already proved this in the case when n is a prime.
(Forward) If n = x² + y², then we have n = (x + iy)(x − iy) = N (x + iy). Let
z = x + iy. So we can write z = α1 · · · αq as a product of irreducibles in Z[i]. By
the previous proposition, each αi is either αi = p (a genuine prime number with
p ≡ 3 (mod 4)), or satisfies N (αi ) = p for a prime number p which is either 2 or ≡ 1 (mod 4).
We now take the norm to obtain
  n = N (z) = N (α1 ) · · · N (αq ).
Each factor is either p² for a prime p ≡ 3 (mod 4), or a prime that is 2 or ≡ 1 (mod 4). So every prime ≡ 3 (mod 4) divides n to an even power.
(Backward) Write n = p1^{n1} · · · pk^{nk} . For each pi , either pi ≡ 3 (mod 4) and ni is even, in which case pi^{ni} = (pi²)^{ni/2} = N (pi^{ni/2} );
or pi = 2 or pi ≡ 1 (mod 4), in which case, the above proof shows that pi = N (αi )
for some αi ∈ Z[i], so pi^{ni} = N (αi^{ni} ). Since the norm is multiplicative, we can write n as
the norm of some z ∈ Z[i]. So n = N (z) = N (x + iy) = x² + y² as required.
E. 9-107
We know 65 = 5 × 13. Since 5, 13 ≡ 1 (mod 4), it is a sum of squares. Moreover,
the proof tells us how to find 65 as the sum of squares. We have to factor 5 and
13 in Z[i]. We have 5 = (2 + i)(2 − i) and 13 = (2 + 3i)(2 − 3i). So we know
  65 = N (2 + i)N (2 + 3i) = N ((2 + i)(2 + 3i)) = N (1 + 8i) = 1² + 8².
But there is a choice here. We had to pick which factor is α and which is ᾱ. So
we can also write
  65 = N ((2 + i)(2 − 3i)) = N (7 − 4i) = 7² + 4².
So not only are we able to write 65 as a sum of squares, but this also gives us
genuinely different ways of writing 65 as a sum of squares.
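The two factorizations can be multiplied out directly; a quick check, representing a + bi as a pair (not from the notes):

```python
def mul(z, w):
    """Multiply Gaussian integers given as pairs (a, b) ~ a + bi."""
    (a, b), (c, d) = z, w
    return (a * c - b * d, a * d + b * c)

def norm(a, b):
    return a * a + b * b  # N(a + bi) = a^2 + b^2

z1 = mul((2, 1), (2, 3))   # (2 + i)(2 + 3i)
z2 = mul((2, 1), (2, -3))  # (2 + i)(2 - 3i)
assert z1 == (1, 8) and norm(*z1) == 65   # 65 = 1^2 + 8^2
assert z2 == (7, -4) and norm(*z2) == 65  # 65 = 7^2 + 4^2
```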
D. 9-108
An α ∈ C is called an algebraic integer if it is a root of a monic polynomial in
Z[X], ie. there is a monic f ∈ Z[X] such that f (α) = 0.
For α an algebraic integer, we write Z[α] ≤ C for the smallest subring of C con-
taining α.
E. 9-109
We generalize the idea of Gaussian integers to algebraic integers. We can im-
mediately check that this is a sensible definition — not all complex numbers are
algebraic integers, since there are only countably many polynomials with integer
coefficients, hence only countably many algebraic integers, but there are uncount-
ably many complex numbers.
Z[α] can also be defined for arbitrary complex numbers, but it is less interesting.
We can also construct Z[α] by taking it as the image of the map φ : Z[X] → C
given by g ↦ g(α). So we can also write Z[α] = Z[X]/I where I = ker φ. Note
that I is non-zero, since f ∈ I by the definition of an algebraic integer.
P. 9-110
If α ∈ C is an algebraic integer, then the ideal I = ker(φ : Z[X] → C, f 7→ f (α))
is principal, and equal to (fα ) for some irreducible monic fα .
If fα = pq is a factorization into polynomials of smaller degree, then p(α)q(α) = fα (α) = 0 in C, which is an integral domain. So we must have, say, p(α) = 0, ie. p ∈ I. But
then deg p < deg fα , contradicting the minimality of deg fα .
This is a non-trivial theorem, since Z[X] is not a principal ideal domain. So there
is no immediate guarantee that I is generated by one polynomial. From this result
we see that the irreducible monic polynomial fα is the minimal polynomial of the
algebraic integer α over Z.
E. 9-111
1. We know α = i is an algebraic integer with fα = X² + 1.
2. Also, α = √2 is an algebraic integer with fα = X² − 2.
3. More interestingly, α = ½(1 + √−3) is an algebraic integer with fα = X² − X + 1.
4. The polynomial X⁵ − X + d ∈ Z[X] with d ∈ Z≥1 has precisely one real root
α, which is an algebraic integer. In fact there is a theorem that says this
α cannot be constructed from integers via +, −, ×, ÷, ⁿ√· . There is another
theorem which says that degree 5 polynomials are the smallest degree for which
this can happen (the proof involves writing down formulas analogous to the
quadratic formula for degree 3 and 4 polynomials).
L. 9-112
If α ∈ Q is an algebraic integer, then α ∈ Z.
It turns out the collection of all algebraic integers forms a subring of C. This is not
at all obvious – given monic f, g ∈ Z[X] such that f (α) = 0 and g(β) = 0, there is no
easy way to find a monic h such that h(α + β) = 0. We will prove this much
later on in the course.
D. 9-113
An ideal I of a ring R is finitely generated if it can be written as I = (r1 , · · · , rn ) for some
r1 , · · · , rn ∈ R.
E. 9-114
Recall that a ring is Noetherian if for any chain of ideals I1 ⊆ I2 ⊆ I3 ⊆ · · · , there is
some N such that IN = IN +1 = IN +2 = · · · .
• Every finite ring is Noetherian. This is since there are only finitely many pos-
sible ideals.
• Every field is Noetherian. This is since there are only two possible ideals.
• Most rings we love and know are indeed Noetherian. However, we can explic-
itly construct some non-Noetherian rings. The ring Z[X1 , X2 , X3 , · · · ] of polynomials in countably many variables is not
Noetherian: it has the strictly increasing chain of ideals (X1 ) ⊊ (X1 , X2 ) ⊊
(X1 , X2 , X3 ) ⊊ · · · .
P. 9-115
A ring is Noetherian if and only if every ideal is finitely generated.
T. 9-116
<Hilbert basis theorem> If R is Noetherian, then R[X] is Noetherian.
Let I ◁ R[X] be an ideal. For each n ∈ N, let In ⊆ R be the set of leading coefficients of the degree-n polynomials in I, together with 0.
Then it is easy to see, using the strong closure property, that each In is an
ideal of R. Moreover, they form a chain, since if f ∈ I, then Xf ∈ I, by strong
closure. So In ⊆ In+1 for all n.
By the ascending chain condition of R, we know there is some N such that IN =
IN +1 = · · · . Now for each 0 ≤ n ≤ N , since R is Noetherian, we can write
  In = (r1^{(n)} , r2^{(n)} , · · · , r_{k(n)}^{(n)} ).
Now for each ri^{(n)} , we choose some fi^{(n)} ∈ I with fi^{(n)} = ri^{(n)} X^n + · · · . We now
claim the polynomials fi^{(n)} for 0 ≤ n ≤ N and 1 ≤ i ≤ k(n) generate I. Suppose
not. We pick g ∈ I of minimal degree not generated by the fi^{(n)} . There are two
possible cases:
• If deg g = n ≤ N , say g = rX^n + · · · , we know r ∈ In . So we can write
r = Σ_i λi ri^{(n)} for some λi ∈ R. Then we know Σ_i λi fi^{(n)} = rX^n + · · · ∈ I. But
if g is not in the span of the fi^{(j)} , then neither is g − Σ_i λi fi^{(n)} . But this has a
lower degree than g. This is a contradiction.
• In the case deg g = n > N , since In = IN , we have the same proof. We write
g = rX^n + · · · , but we know r ∈ In = IN . So we know r = Σ_i λi ri^{(N)} . So
  X^{n−N} Σ_i λi fi^{(N)} = rX^n + · · · ∈ I.
Hence g − X^{n−N} Σ_i λi fi^{(N)} has smaller degree than g, but is not in the span of
the fi^{(j)} . Contradiction.
Since Z is Noetherian, we know Z[X] also is. Hence so is Z[X, Y ] etc.
Before the Hilbert basis theorem, there were many mathematicians studying some-
thing known as invariant theory. The idea is that we have some interesting objects,
and we want to look at their symmetries. Often, there are infinitely many possible
such symmetries, and one interesting question to ask is whether there is a finite
set of symmetries that generate all possible symmetries. It turns out the collection
of such symmetries often forms an ideal of some funny ring. So Hilbert came along
and proved the Hilbert basis theorem, and showed once and for all that those rings
are Noetherian, and hence the symmetries are finitely generated.
As an aside, let E ⊆ F [X1 , X2 , · · · , Xn ] be any set of polynomials. We view this as
a set of equations f = 0 for each f ∈ E. The claim is that to solve the potentially
infinite set of equations E, we actually only have to solve finitely many equations.
Consider the ideal (E) C F [X1 , · · · , Xn ]. By the Hilbert basis theorem, there is
a finite list f1 , · · · , fk such that (f1 , · · · , fk ) = (E). We want to show that we
only have to solve fi (x) = 0 for these fi . Given (α1 , · · · , αn ) ∈ F^n , consider the
evaluation homomorphism φ : F [X1 , · · · , Xn ] → F given by f ↦ f (α1 , · · · , αn ). Then (α1 , · · · , αn ) solves every equation in E if and only if E ⊆ ker(φ), if and only if (E) = (f1 , · · · , fk ) ⊆ ker(φ), if and only if fi (α1 , · · · , αn ) = 0 for all i.
9.4 Modules I
Recall that to define a vector space, we first pick some base field F. We then defined
a vector space to be an abelian group V with an action of F on V (ie. scalar multi-
plication) that is compatible with the multiplicative and additive structure of F. In
the definition, we did not at all mention division in F. So in fact we can make the
same definition, but allow F to be a ring instead of a field. We call these modules.
Unfortunately, most results we prove about vector spaces do use the fact that F is a
field, so many linear algebra results do not apply to modules, and modules have much
richer structures.
D. 9-118
• Let R be a commutative ring. We say a quadruple (M, +, 0M , · ) is an R- module
if
1. (M, +, 0M ) is an abelian group
2. The operation · : R × M → M satisfies
i. (r1 + r2 ) · m = (r1 · m) + (r2 · m);
ii. r · (m1 + m2 ) = (r · m1 ) + (r · m2 );
iii. r1 · (r2 · m) = (r1 · r2 ) · m; and
iv. 1R · m = m for all m ∈ M .
E. 9-119
Note that there are two different additions going on – addition in the ring and
addition in the module, and similarly two notions of multiplication. However, it
is easy to distinguish them since they operate on different things. If needed, we
can make them explicit by writing, say, +R and +M .
We can imagine modules as rings acting on abelian groups, just as groups can act
on sets. Hence we might say “R acts on M ” to mean M is an R-module.
• Let F be a field. An F-module is precisely the same as a vector space over F
(the axioms are the same).
• For any ring R, we have the R-module Rn = R×R×· · ·×R via r·(r1 , · · · , rn ) =
(rr1 , · · · , rrn ) using the ring multiplication. This is the same as the definition
of the vector space Fn for fields F.
• Let I ◁ R be an ideal. Then it is an R-module via r ·M a = r ·R a and r1 +M r2 =
r1 +R r2 . Also, R/I is an R-module via r ·M (a + I) = (r ·R a) + I.
• A Z-module is precisely the same as an abelian group. For A an abelian group,
we have
  Z × A → A with (n, a) ↦ a + · · · + a  (n times),
where if n is negative that means adding −a to itself |n| times, and adding
something to itself 0 times is just 0. This definition is essentially forced upon
us, since by the axioms of a module, we must have (1, a) ↦ a. Then we must
send, say, (2, a) = (1 + 1, a) ↦ a + a.
• Let F be a field and V a vector space over F, and α : V → V be a linear map.
Then V is an F[X]-module via F[X] × V → V with (f, v) ↦ f (α)(v). This is
a module. Note that we cannot just say that V is an F[X]-module. We have
to specify the α as well. Picking a different α will give a different F[X]-module
structure.
• Let φ : R → S be a homomorphism of rings. Then any S-module M may be
considered as an R-module via R × M → M with (r, m) ↦ φ(r) ·M m.
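The forced Z-action in the abelian-group example above can be written out directly; a sketch with the group passed in as plain functions (the names are my own):

```python
def z_action(n, a, add, zero, neg):
    """The unique Z-module structure on an abelian group (A, add, zero, neg):
    n . a = a + ... + a (n times), with negative n handled via neg."""
    if n < 0:
        return z_action(-n, neg(a), add, zero, neg)
    result = zero
    for _ in range(n):
        result = add(result, a)
    return result

# Z/4Z as an abelian group:
add = lambda x, y: (x + y) % 4
neg = lambda x: (-x) % 4
assert z_action(3, 3, add, 0, neg) == 1   # 3 + 3 + 3 = 9 ≡ 1 (mod 4)
assert z_action(-2, 1, add, 0, neg) == 2  # (-1) + (-1) ≡ 2 (mod 4)
```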
D. 9-120
• Let M be an R-module. A subset N ⊆ M is an R-submodule if it is a subgroup
of (M, +, 0M ), and rn ∈ N whenever n ∈ N and r ∈ R. We write N ≤ M .
E. 9-121
• We know R itself is an R-module. Then a subset of R is a submodule if and only
if it is an ideal.
• Note that modules are different from rings and groups. In groups, we had sub-
groups, and we have some really nice ones called normal subgroups. We are only
allowed to quotient by normal subgroups. In rings, we have subrings and ideals,
which are unrelated objects, and we only quotient by ideals. In modules, we only
have submodules, and we can quotient by arbitrary submodules.
• If F is a field and V, W are F-modules (ie. vector spaces over F), then an F-module
homomorphism is precisely an F-linear map.
T. 9-122
<Isomorphism theorems>
1. Let f : M → N be an R-module homomorphism. Then ker f = {m ∈ M :
f (m) = 0} ≤ M is an R-submodule of M , and Im f = {f (m) : m ∈ M } ≤ N
is an R-submodule of N . Moreover,
  M/ ker f ≅ Im f.
The proof is almost exactly the same as for rings and groups.
It is easy to check this is well-defined and is indeed a module.
C. 9-123
<Submodule correspondence> Similar to groups and rings, given N ≤ M ,
we have a one-to-one correspondence between the submodules of M/N and the submodules of M that contain N .
D. 9-124
• Let M be an R-module, and m ∈ M . The annihilator of m is Ann(m) = {r ∈ R :
r · m = 0}. For any set S ⊆ M , we define
  Ann(S) = {r ∈ R : r · m = 0 for all m ∈ S} = ⋂_{m∈S} Ann(m).
So the mi generate M .
P. 9-127
Let N ≤ M and M be finitely-generated. Then M/N is also finitely generated.
E. 9-128
• It is very tempting to believe that if a module is finitely generated, then its sub-
modules are also finitely generated. But in fact a submodule of a finitely-generated
module need not be finitely generated.
• For a complex number α, the ring Z[α] (ie. the smallest subring of C containing
α) is finitely generated as a Z-module if and only if α is an algebraic integer,
which one can prove. This allows us to prove that algebraic integers are closed
under addition and multiplication, since it is easier to argue about whether Z[α]
is finitely generated.
D. 9-129
• Let M1 , M2 , · · · , Mk be R-modules. The direct sum is the R-module M1 ⊕
M2 ⊕ · · · ⊕ Mk which is the set M1 × M2 × · · · × Mk , with addition given by
(m1 , · · · , mk ) + (m01 , · · · , m0k ) = (m1 + m01 , · · · , mk + m0k ) and the R action is
given by r · (m1 , · · · , mk ) = (rm1 , · · · , rmk ).
• Let m1 , · · · , mk ∈ M . Then {m1 , · · · , mk } is linearly independent if Σ_{i=1}^{k} ri mi =
0 implies r1 = r2 = · · · = rk = 0.
• A subset S ⊆ M generates M freely if
1. S generates M
2. any set function ψ : S → N to an R-module N extends to an R-module homomorphism θ : M → N .
An R-module freely generated by some S ⊆ M is called free, and S is called a basis.
E. 9-130
• We’ve been using one example of the direct sum already, namely
Rn = R ⊕ R ⊕ · · · ⊕ R .
| {z }
n times
Recall we said modules are like vector spaces. So we can try to define things like
basis and linear independence. However, we will fail massively, since we really
can’t prove much about them. Still, we can define them.
We will soon prove that if R is a field, then every module is free. However, if
R is not a field, then there are non-free modules. For example, the Z-module
Z/2Z is not freely generated. Suppose Z/2Z were generated freely by some S ⊆
Z/2Z. Then this can only possibly be S = {1}. Then this implies there is a
homomorphism θ : Z/2Z → Z sending 1 to 1. But it does not send 0 = 1 + 1 to
1 + 1, since homomorphisms send 0 to 0. So Z/2Z is not freely generated.
• Being finitely presented means we can describe the module completely on a finite amount
of paper. More precisely, if {m1 , · · · , mk } generate M and {n1 , n2 , · · · , nl }
generate ker(φ), then each ni = (ri1 , · · · , rik ) corresponds to the relation
ri1 m1 + ri2 m2 + · · · + rik mk = 0 in M .
P. 9-131
For a subset S = {m1 , · · · , mk } ⊆ M , the following are equivalent:
1. S generates M freely.
2. S generates M and the set S is independent.
3. Every element of M is uniquely expressible as r1 m1 + r2 m2 + · · · + rk mk for
some ri ∈ R.
The fact that (2) and (3) are equivalent is something we would expect from what
we know from linear algebra, and in fact the proof is the same. So we only show
that (1) and (2) are equivalent.
((1) ⇒ (2)) Suppose r1 m1 + r2 m2 + · · · + rk mk = 0. Define ψ : S → R by ψ(m1 ) = 1 and ψ(mi ) = 0 for i ≠ 1, and extend it to a homomorphism θ : M → R. Then
  0 = θ(0) = θ(r1 m1 + r2 m2 + · · · + rk mk ) = r1 θ(m1 ) + r2 θ(m2 ) + · · · + rk θ(mk ) = r1 .
Similarly, ri = 0 for every i. So S is independent.
P. 9-132
Let R be a non-zero ring. If Rⁿ ≅ Rᵐ as R-modules, then n = m.
We know this is true if R is a field. We now want to reduce this to the case where R
is a field. If R is an integral domain, then we can produce a field by taking the
field of fractions, and this might be a good starting point. However, we want to
do this for general rings. So we need some more magic.
Firstly, we will show that if I ◁ R is an ideal and M is an R-module, then M/IM
is an R/I-module in a natural way, where IM = {a1 m1 + · · · + al ml : ai ∈ I, mi ∈ M } ≤ M is the submodule of finite sums.
We can take the quotient module M/IM , which is an R-module again. The R-action descends: M/IM is an R/I-module via
  (r + I) · (m + IM ) = r · m + IM.
Now pick I to be a maximal ideal of R, so that R/I is a field. If Rⁿ ≅ Rᵐ as R-modules, then
  Rⁿ/IRⁿ ≅ Rᵐ/IRᵐ
as R/I-modules. But staring at it long enough, we figure that Rⁿ/IRⁿ ≅ (R/I)ⁿ
and similarly for m. Since R/I is a field, the result follows by linear algebra.
The point of this proposition is not the result itself (which is not too interesting),
but the general constructions used behind the proof.
9.5 Modules II
In this section we will prove the classification of finite abelian groups and the existence of Jordan
normal forms. We will mostly work with R a Euclidean domain, and we write φ :
R \ {0} → Z≥0 for its Euclidean function.
D. 9-136
• Elementary row operations on an m×n matrix A with entries in R are operations
of the form
1. Add a multiple of one row to another.
2. Swap two rows.
3. Multiply a row by a unit.
We also have elementary column operations defined in a similar fashion.
• Two matrices are equivalent if we can get from one to the other via a sequence
of such elementary row and column operations.
• A k × k minor of a matrix A is the determinant of a k × k sub-matrix of A (ie.
a matrix formed by removing all but k rows and all but k columns).
• For a matrix A, the kth Fitting ideal Fitk (A) ◁ R is the ideal generated by the
set of all k × k minors of A.
E. 9-137
Elementary row and column operations on an m × n matrix A with entries in R
can be achieved by matrix multiplication as follows:
1. Add c ∈ R times the ith row to the jth row: this may be done by multiplying
on the left by the matrix, which is the m×m identity matrix but with c instead
of 0 at the j, i entry.
2. Swap the ith and jth rows: this may be done by multiplying on the left by the
matrix, which is the m × m identity matrix but swapping the i and j rows.
3. Multiplying the ith row by a unit c ∈ R: this may be done by multiplying on
the left by the matrix which is the m × m identity matrix but with c instead
of 1 at the i, i entry. Notice that if R is a field, then we can multiply any row
by any non-zero number, since they are all units.
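The first of these operations is easy to check by hand with small matrices; a sketch (not from the notes):

```python
def elementary_add(m, i, j, c):
    """The m x m identity matrix with c at position (j, i), for i != j.
    Left-multiplication by it adds c times row i to row j."""
    E = [[int(r == s) for s in range(m)] for r in range(m)]
    E[j][i] = c
    return E

def matmul(A, B):
    return [[sum(A[r][k] * B[k][s] for k in range(len(B)))
             for s in range(len(B[0]))] for r in range(len(A))]

A = [[1, 2], [3, 4]]
# add 10 times row 0 to row 1:
assert matmul(elementary_add(2, 0, 1, 10), A) == [[1, 2], [13, 24]]
```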
T. 9-138
<Smith normal form> An m × n matrix over a Euclidean domain R is equivalent to a diagonal matrix with diagonal entries d1 , d2 , · · · , dr , 0, · · · , 0, where the di are non-zero and d1 | d2 | · · · | dr .
Throughout the process, we will keep calling our matrix A, even though it keeps
changing in each step, so that we don’t have to invent hundreds of names for these
matrices.
1. If A = 0, then done! So suppose A ≠ 0. So some entry is not zero, say,
Aij ≠ 0. Swapping the ith and first rows, then the jth and first columns, we
arrange that A11 ≠ 0.
We now try to reduce A11 as much as possible. We do the following:
2. If there is an A1j not divisible by A11 , then we can use the Euclidean algo-
rithm to write A1j = qA11 + r. By assumption, r ≠ 0. So φ(r) < φ(A11 )
(where φ is the Euclidean function). So we subtract q copies of the first col-
umn from the jth column. Then in position (1, j), we now have r. We swap
the first and jth column such that r is in position (1, 1), and we have strictly
reduced the value of φ at the first entry.
If there is an Ai1 not divisible by A11 , we do the same thing, and this again
reduces φ(A11 ). We keep performing these until no move is possible. Since
the value of φ(A11 ) strictly decreases every move, we stop after finitely many
applications. Then we know that A11 divides all A1j and Ai1. Now we can
subtract appropriate multiples of the first column from the others so that
A1j = 0 for j ≠ 1. We do the same thing with rows so that the first column
is cleared. Then we have a matrix of the form
$$A = \begin{pmatrix} d & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & C & \\ 0 & & & \end{pmatrix}.$$
We would like to say “do the same thing with C”, but then this would get us a
regular diagonal matrix, not necessarily in Smith normal form. So we need some
preparation.
3. Suppose there is an entry of C not divisible by d, say Aij with i, j > 1. We
write Aij = qd + r with r ≠ 0 and φ(r) < φ(d). We add column 1 to
column j, and subtract q times row 1 from row i. Now we get r in the (i, j)th
entry, and we want to send it back to the (1, 1) position. We swap row i
with row 1, then column j with column 1, so that r is in the (1, 1)th entry, and
φ(r) < φ(d).
Now we have messed up the first row and column. So we go back and do (1)
again until the first row and columns are cleared. Then we get
$$A = \begin{pmatrix} d' & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & C' & \\ 0 & & & \end{pmatrix}, \quad \text{where } \varphi(d') \le \varphi(r) < \varphi(d).$$
We keep on repeating this process. As this strictly decreases the value of
φ(A11 ), we can only repeat this finitely many times. When we stop, we will
end up with a matrix
$$A = \begin{pmatrix} d & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & C & \\ 0 & & & \end{pmatrix},$$
and d divides every entry of C.
4. Now we apply the entire process again to C. Notice that the allowed operations
do not change the fact that d divides every entry of C.
So applying this recursively, we obtain a diagonal matrix with the claimed
divisibility property.
Note that if we didn’t have to care about the divisibility property, we can just do
(1) and (2), and we can get a diagonal matrix. The magic to get to the Smith
normal form is (3).
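The three-step procedure above can be sketched in code for R = Z, where the Euclidean function is the absolute value. This is a minimal illustrative implementation (the function name and the pick-the-smallest-pivot strategy are our own choices, not from the notes); it combines steps (1) and (2) by always moving a non-zero entry of least absolute value into the pivot position, and implements step (3) by adding an offending row to the pivot row.

```python
def smith_normal_form(A):
    """Smith normal form over Z by elementary row/column operations:
    returns diag(d1, ..., dr, 0, ..., 0) with d1 | d2 | ... | dr."""
    A = [row[:] for row in A]  # work on a copy
    m, n = len(A), len(A[0])
    for t in range(min(m, n)):
        while True:
            # Steps 1-2: move a nonzero entry of least |value| to position (t, t).
            entries = [(abs(A[i][j]), i, j) for i in range(t, m)
                       for j in range(t, n) if A[i][j] != 0]
            if not entries:
                return A  # the remaining block is zero: done
            _, i, j = min(entries)
            A[t], A[i] = A[i], A[t]
            for row in A:
                row[t], row[j] = row[j], row[t]
            d = A[t][t]
            # reduce the pivot column and row modulo the pivot
            dirty = False
            for i in range(t + 1, m):
                q = A[i][t] // d
                for j in range(t, n):
                    A[i][j] -= q * A[t][j]
                dirty = dirty or A[i][t] != 0  # nonzero remainder: pivot shrinks next pass
            for j in range(t + 1, n):
                q = A[t][j] // d
                for i in range(t, m):
                    A[i][j] -= q * A[i][t]
                dirty = dirty or A[t][j] != 0
            if dirty:
                continue
            # Step 3: the pivot must divide every remaining entry; if it doesn't,
            # add the offending row to row t and repeat (the "magic" step).
            bad = next((i for i in range(t + 1, m)
                        for j in range(t + 1, n) if A[i][j] % d != 0), None)
            if bad is None:
                break
            for j in range(t, n):
                A[t][j] += A[bad][j]
        if A[t][t] < 0:  # normalise by the unit -1
            A[t][t] = -A[t][t]
    return A
```

For instance, smith_normal_form([[2, 0], [0, 3]]) returns [[1, 0], [0, 6]], in agreement with the Fitting-ideal computation in [E.9-142].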
The dk obtained in the Smith normal form are called the invariant factors of A.
It would be nice if we could prove that the di are indeed invariant. It is not clear
from the algorithm that we will always end up with the same di . Indeed, we can
multiply a whole row by −1 and get different invariant factors. However, it turns
out that these are unique up to multiplication by units. To study the uniqueness
of the invariant factors of a matrix A, we relate them to other invariants, which
involves minors. Any given matrix has many minors, since we get to decide which
rows and columns to throw away. The idea is to consider the ideal generated
by all the k × k minors of the matrix. A key property (which we will show next) is that equivalent
matrices have the same Fitting ideal, even if they might have very different minors.
Note that the divisibility criterion is similar to the classification of finitely-generated
abelian groups. In fact, we will derive that as a consequence of the Smith normal
form.
E. 9-139
We exhibit the algorithm of producing the Smith normal
3 7 4
form with an algorithm in Z. We start with the matrix on 1 −1 2
the right. 3 5 1
We want to move the 1 to the top-left corner. So we swap 1 −1 2
the first and second rows to obtain 3 7 4
3 5 1
It suffices to show that changing A by a row or column operation does not change
the Fitting ideal. Since taking the transpose does not change the determinant, ie.
Fitk (A) = Fitk (AT ), it suffices to consider the row operations.
The most difficult one is taking linear combinations. Let B be the result of adding
c times the ith row to the jth row, and fix C a k × k sub-matrix of A. Suppose the
corresponding matrix wrt B is C 0 . We then want to show that det C 0 ∈ Fitk (A).
If the jth row is outside of C, then the minor det C is unchanged. If both the ith
and jth rows are in C, then the submatrix C changes by a row operation, which
does not affect the determinant. These are the boring cases.
Suppose the jth row is in C and the ith row is not. Write the entries of the ith row of A lying in the columns of C as (f1, · · · , fk). Then C is changed to C′, whose jth row is the old jth row of C plus c(f1, · · · , fk). Since the determinant is linear in each row, det C′ = det C + c det D,
where D is the matrix obtained by replacing the jth row of C with (f1, · · · , fk).
The point is that det C is definitely a minor of A, and det D is still a minor of A,
just another one. Since ideals are closed under addition and multiplications, we
know det(C 0 ) ∈ Fitk (A).
The other operations are much simpler. They just follow by standard properties
of the effect of swapping rows or multiplying rows on determinants. So after any
row operation, the resultant submatrix C 0 satisfies det(C 0 ) ∈ Fitk (A). Since this
is true for all minors, we must have Fitk (B) ⊆ Fitk (A). But row operations are
invertible. So we must have Fitk (A) ⊆ Fitk (B) as well. So they must be equal.
P. 9-141
If A has Smith normal form B = diag(d1 , d2 , · · · , dr , 0, · · · , 0) then Fitk (A) =
(d1 d2 · · · dk ) (where dk = 0 if k > r). And in fact dk is unique up to associates.
The fact Fitk (B) = (d1 d2 · · · dk ) is clear once we notice that the only possible
contributing minors are from the diagonal submatrices, and the minor from the
top left square submatrix divides all other diagonal ones. The uniqueness of di
follows since we can find dk by dividing the generator of Fitk (A) by the generator
of Fitk−1 (A).
E. 9-142
Consider the matrix over Z: A = $\begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$. This is diagonal, but not in Smith normal
form. We could apply the algorithm, but that would be messy. Instead we
notice that Fit1(A) = (2, 3) = (1). So we know d1 = ±1. We then look at the
second Fitting ideal: Fit2(A) = (6), hence d1d2 = ±6, so d2 = ±6.
So the Smith normal form is $\begin{pmatrix} 1 & 0 \\ 0 & 6 \end{pmatrix}$. That was much easier.
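The Fitting-ideal computations in this example are easy to mechanise over Z, where every ideal is principal and Fitk(A) is generated by the gcd of the k × k minors. A small illustrative sketch (the function names are ours):

```python
from itertools import combinations
from math import gcd

def det(M):
    # determinant by cofactor expansion along the first row (fine for small matrices)
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def fit(A, k):
    """Generator of Fit_k(A) over Z: the gcd of all k x k minors (0 if all vanish)."""
    g = 0
    for rows in combinations(range(len(A)), k):
        for cols in combinations(range(len(A[0])), k):
            g = gcd(g, det([[A[i][j] for j in cols] for i in rows]))
    return g

def invariant_factors(A):
    # d_k = generator of Fit_k divided by generator of Fit_{k-1}, as in [P.9-141]
    ds, prev = [], 1
    for k in range(1, min(len(A), len(A[0])) + 1):
        g = fit(A, k)
        if g == 0:
            break
        ds.append(g // prev)
        prev = g
    return ds
```

By [P.9-141], the successive quotients of these generators recover the invariant factors, e.g. invariant_factors([[2, 0], [0, 3]]) gives [1, 6].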
L. 9-143
Let R be a principal ideal domain. Then any submodule of Rm is generated by at
most m elements.
Let N ≤ Rm be a submodule. Consider the ideal
Recall that we got to the Smith normal form by row and column operations.
Performing row operations is just changing the basis of Rm , while each column
operation changes the generators of N . So what this tells us is that there is
a new basis v1 , · · · , vm of Rm such that N is generated by d1 v1 , · · · , dr vr . By
definition of Smith normal form, the divisibility condition holds.
T. 9-145
<Classification of finitely-generated modules over a Euclidean domain>
Let R be a Euclidean domain, and M be a finitely generated R-module. Then
$$M \cong \frac{R}{(d_1)} \oplus \frac{R}{(d_2)} \oplus \cdots \oplus \frac{R}{(d_r)} \oplus R \oplus R \oplus \cdots \oplus R$$
for some di with d1 | d2 | · · · | dr. Indeed, writing M as a quotient of Rm by a relation matrix put into Smith normal form, we get
$$M \cong \frac{R^m}{((d_1, 0, \cdots, 0), (0, d_2, 0, \cdots, 0), \cdots, (0, \cdots, 0, d_r, 0, \cdots, 0))} \cong \frac{R}{(d_1)} \oplus \frac{R}{(d_2)} \oplus \cdots \oplus \frac{R}{(d_r)} \oplus \underbrace{R \oplus \cdots \oplus R}_{m - r \text{ copies of } R}.$$
So all finitely-generated modules are of this simple form, so we can prove things
about them assuming they look like this.
This result is particularly useful in the case where R = Z, where R-modules are
abelian groups. In this case we get: any finitely-generated abelian group is
isomorphic to Cd1 × · · · × Cdr × C∞ × · · · × C∞, where C∞ ≅ Z is the infinite cyclic
group and d1 | d2 | · · · | dr. Note that if the group is finite, then there cannot be
any C∞ factors, so it is just a product of finite cyclic groups. That is, if A is a
finite abelian group, then A ≅ Cd1 × · · · × Cdr with d1 | d2 | · · · | dr.
E. 9-146
Let A be the abelian group generated by a, b, c with relations 2a + 3b + c = 0,
a + 2b = 0, and 5a + 6b + 7c = 0. In other words, we have
$$A = \frac{\mathbb{Z}^3}{((2, 3, 1), (1, 2, 0), (5, 6, 7))}.$$
We would like to get a better description of A. It is not even obvious whether this module
is the zero module or not. To work out a good description, we consider the matrix
$$X = \begin{pmatrix} 2 & 1 & 5 \\ 3 & 2 & 6 \\ 1 & 0 & 7 \end{pmatrix}.$$
To figure out the Smith normal form, we find the Fitting ideals. We have Fit1(X) =
(1, · · · ) = (1). So d1 = 1. We have to work out the second Fitting ideal. In
principle, we have to check all the minors, but we immediately notice $\left|\begin{smallmatrix} 2 & 1 \\ 3 & 2 \end{smallmatrix}\right| = 1$.
So Fit2(X) = (1), and d2 = 1. Finally, we find
$$\operatorname{Fit}_3(X) = \left( \det \begin{pmatrix} 2 & 1 & 5 \\ 3 & 2 & 6 \\ 1 & 0 & 7 \end{pmatrix} \right) = (3) \implies d_3 = 3.$$
So we know
$$A \cong \frac{\mathbb{Z}}{(1)} \oplus \frac{\mathbb{Z}}{(1)} \oplus \frac{\mathbb{Z}}{(3)} \cong \frac{\mathbb{Z}}{(3)} \cong C_3.$$
If you don’t feel like computing determinants, doing row and column reduction is
often as quick and straightforward.
L. 9-147
<Chinese remainder theorem> Let R be a Euclidean domain, and a, b ∈ R
be such that gcd(a, b) = 1. Then
$$\frac{R}{(ab)} \cong \frac{R}{(a)} \times \frac{R}{(b)} \quad \text{as } R\text{-modules.}$$
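For R = Z this is the classical Chinese remainder theorem, and the isomorphism is x ↦ (x mod a, x mod b). A quick finite check that this map is bijective exactly when gcd(a, b) = 1 (an illustrative sketch, not a proof):

```python
from math import gcd

def crt_map_is_bijective(a, b):
    # x -> (x mod a, x mod b) on Z/(ab); bijective iff it hits all a*b pairs
    return len({(x % a, x % b) for x in range(a * b)}) == a * b

assert gcd(2, 3) == 1 and crt_map_is_bijective(2, 3)      # Z/6 = Z/2 x Z/3
assert gcd(4, 9) == 1 and crt_map_is_bijective(4, 9)
assert gcd(2, 4) == 2 and not crt_map_is_bijective(2, 4)  # fails without coprimality
```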
T. 9-148
<Prime decomposition theorem> Let R be a Euclidean domain, and M be
a finitely-generated R-module. Then
$$M \cong N_1 \oplus N_2 \oplus \cdots \oplus N_t,$$
where each Ni is either R or R/(p^n) for some prime p ∈ R and some n ≥ 1.
We already know
$$M \cong \frac{R}{(d_1)} \oplus \cdots \oplus \frac{R}{(d_r)} \oplus R \oplus \cdots \oplus R.$$
So it suffices to show that each R/(di) can be written in that form. We let
$$d_i = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$$
with the pj distinct primes; then by the Chinese remainder theorem,
$$\frac{R}{(d_i)} \cong \frac{R}{(p_1^{n_1})} \oplus \cdots \oplus \frac{R}{(p_k^{n_k})}.$$
Recall that we were also able to decompose a finite abelian group into products of the
form Cpk, where p is a prime, and this was just the Chinese remainder theorem.
The same is true in general.
C. 9-149
<F-vector space as F[X]-module> We next want to consider the Jordan normal
form. This is less straightforward, since considering V directly as an F-module
would not be too helpful (that would just be plain linear algebra). Instead,
we use the following trick: for a field F, the polynomial ring F[X] is a Euclidean
domain, so the results we had apply. If V is a vector space over F, and α : V → V
is a linear map, then we can make V into an F[X]-module Vα via p(X) · v = p(α)(v), ie. X acts as α.
L. 9-150
If V is a finite-dimensional vector space, then Vα is a finitely-generated F[X]-
module.
E. 9-151
1. Suppose Vα ≅ F[X]/(X^r) as F[X]-modules; then in particular Vα ≅ F[X]/(X^r)
as F-modules (since being a map of F-modules has fewer requirements
than being a map of F[X]-modules). Under the F-basis 1, X, · · · , X^{r−1},
multiplication by X sends X^i to X^{i+1} (and X^{r−1} to 0), so α is represented
by the r × r matrix with 1s just below the diagonal and 0s elsewhere.
This is a Jordan block. The Jordan blocks we defined in Linear Algebra are the
other way round, with zeroes below the diagonal; a simple change of
basis (reversing the order of the basis vectors, ie. conjugating by the antidiagonal
permutation matrix) gives us the "right" form.
2. Suppose Vα ≅ F[X]/((X − λ)^r). Under the F-basis 1, (X − λ), · · · , (X − λ)^{r−1},
multiplication by X sends (X − λ)^i to λ(X − λ)^i + (X − λ)^{i+1} (and (X − λ)^r ≡ 0),
so α is represented by the Jordan block with λ on the diagonal and 1s just below it.
3. Suppose Vα ≅ F[X]/(f) for some monic polynomial
$$f = a_0 + a_1 X + \cdots + a_{r-1} X^{r-1} + X^r.$$
Under the basis 1, X, · · · , X^{r−1}, α is represented by the matrix with 1s just
below the diagonal, last column (−a0, · · · , −a_{r−1})^T and 0s elsewhere. We call
this the companion matrix c(f) of the monic polynomial f.
These are the different shapes that can occur. Since we have already classified
all finitely-generated F[X]-modules, this allows us to put matrices in a rather
nice form.
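The companion-matrix construction of case 3 (and, taking f = X^r, the Jordan block of case 1) can be written down mechanically. In this sketch (names ours), coefficients are listed from a0 upwards, and we also verify the defining property that c(f) satisfies its own polynomial, ie. f(c(f)) = 0:

```python
def companion(coeffs):
    """Companion matrix c(f) of the monic polynomial
    f = a_0 + a_1 X + ... + a_{r-1} X^{r-1} + X^r, with coeffs = [a_0, ..., a_{r-1}].
    Columns record the action of multiplication by X on the basis 1, X, ..., X^{r-1}."""
    r = len(coeffs)
    C = [[0] * r for _ in range(r)]
    for j in range(r - 1):
        C[j + 1][j] = 1             # X . X^j = X^{j+1} while j < r - 1
    for i in range(r):
        C[i][r - 1] = -coeffs[i]    # X . X^{r-1} = -(a_0 + a_1 X + ... + a_{r-1} X^{r-1})
    return C

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def poly_at(coeffs, A):
    """Evaluate f(A) for monic f, by Horner: repeatedly R = A R + a I."""
    r = len(A)
    R = [[int(i == j) for j in range(r)] for i in range(r)]
    for a in reversed(coeffs):
        R = matmul(A, R)
        for i in range(r):
            R[i][i] += a
    return R

C = companion([2, 1])                            # f = X^2 + X + 2
assert C == [[0, -2], [1, -1]]
assert poly_at([2, 1], C) == [[0, 0], [0, 0]]    # c(f) satisfies f
assert companion([0, 0, 0]) == [[0, 0, 0], [1, 0, 0], [0, 1, 0]]  # f = X^3: Jordan block
```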
T. 9-152
<Rational canonical form> Let α : V → V be a linear endomorphism of
a finite-dimensional vector space over F, and Vα be the associated F[X]-module.
Then
$$V_\alpha \cong \frac{F[X]}{(f_1)} \oplus \frac{F[X]}{(f_2)} \oplus \cdots \oplus \frac{F[X]}{(f_s)},$$
with f1 | f2 | · · · | fs. Thus there is a basis for V in which the matrix for α is
the block diagonal matrix diag(c(f1), c(f2), · · · , c(fs)), where c(f)
denotes the companion matrix of f.
Apply the prime decomposition theorem to Vα . Then all primes are of the form
X − λ. We then use 2 of [E.9-151] to get the form of the matrix.
The blocks Jm(λ) are called the Jordan λ-blocks. It turns out that the Jordan
blocks are unique up to reordering, as we proved in Linear Algebra.
so it has nullity 1.
Combining these two we see that nullity of X · − : Vα → Vα is equal to the number
of Jordan blocks with eigenvalue 0. In linear algebra language this says that the
linear map α has nullity equal to the number of Jordan blocks (in its Jordan
normal form) with eigenvalue 0.
Similarly X² · − : C[X]/((X − λ)^q) → C[X]/((X − λ)^q) is an isomorphism for
λ ≠ 0, whereas for λ = 0 it has matrix
$$\begin{pmatrix} 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 0 & \cdots & 0 & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & \vdots & \ddots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 \end{pmatrix}$$
with 1s on the second subdiagonal, which has nullity 2 (for q ≥ 2). Similarly, we can extract the numbers of Jordan blocks of any
size and any eigenvalue. This is what we do in [P.4-119].
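This block-counting recipe can be checked numerically: for a nilpotent matrix assembled from Jordan blocks written as in these notes (1s below the diagonal), the nullity of N counts the blocks, and the nullity of N² exceeds it by the number of blocks of size at least 2. An illustrative sketch using exact rational arithmetic (names ours):

```python
from fractions import Fraction

def rank(M):
    # Gaussian elimination over Q (exact), returning the rank
    M = [[Fraction(x) for x in row] for row in M]
    m, n, r = len(M), len(M[0]), 0
    for c in range(n):
        piv = next((i for i in range(r, m) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(m):
            if i != r and M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# nilpotent matrix with Jordan blocks of sizes 3 and 1 (eigenvalue 0)
N = [[0, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
n = len(N)
nullity_1 = n - rank(N)              # number of Jordan blocks: 2
nullity_2 = n - rank(matmul(N, N))   # nullity_1 + number of blocks of size >= 2: 3
```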
T. 9-156
<Cayley-Hamilton theorem> Let M be a finitely-generated R-module, where
R is some commutative ring, and let α : M → M be an R-module homomorphism.
Let A be a matrix representation of α under some choice of generators e1, · · · , en, and let χα(X) = det(XI − A) be its characteristic polynomial. Then χα(α) = 0.
We write C for the matrix with entries cij = Xδij − aji ∈ F[X]. We now use
the fact that adj(C)C = det(C)I, which we proved in [T.4-81] (the proof did
not assume that the underlying ring is a field). Expanding this out, we get the
following equation in F[X]:
$$\chi_\alpha(X) I = \det(XI - A) I = (\operatorname{adj}(XI - A))(XI - A).$$
Writing this in components and summing against the generators ek, we have
$$\sum_k \chi_\alpha(X)\delta_{ik} e_k = \sum_{j,k} (\operatorname{adj}(XI - A)_{ij})(X\delta_{jk} - a_{kj}) e_k,$$
and for each j the inner sum $\sum_k (X\delta_{jk} - a_{kj})e_k = X e_j - \sum_k a_{kj} e_k = \alpha(e_j) - \alpha(e_j) = 0$
by our choice of aij . But the left hand side is just χα (X)ei . So χα (X) acts trivially
on all of the generators ei . So it in fact acts trivially. So χα (α) is the zero map
(since acting by X is the same as acting by α, by construction).
So we can also use the idea of viewing V as an F[X]-module to prove the Cayley-Hamilton theorem. In fact, we don't need F to be a field.
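The theorem is easy to spot-check on a concrete integer matrix. The sketch below computes the coefficients of χA by the Faddeev-LeVerrier recursion (a standard algorithm chosen here for convenience; it is not the method of the proof above) and then evaluates χA(A) by Horner's scheme:

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def charpoly(A):
    """Coefficients [c1, ..., cn] of chi(x) = x^n + c1 x^{n-1} + ... + cn
    for a square matrix A, via the Faddeev-LeVerrier recursion."""
    n = len(A)
    M = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]  # M_1 = I
    coeffs = []
    for k in range(1, n + 1):
        AM = matmul(A, M)
        ck = -sum(AM[i][i] for i in range(n)) / k   # c_k = -tr(A M_k) / k
        coeffs.append(ck)
        M = [[AM[i][j] + (ck if i == j else 0) for j in range(n)] for i in range(n)]
    return coeffs

def charpoly_at(A):
    """Evaluate chi_A(A) by Horner; Cayley-Hamilton says this is the zero matrix."""
    n = len(A)
    R = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for c in charpoly(A):
        R = matmul(A, R)
        for i in range(n):
            R[i][i] += c
    return R

A = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]
assert charpoly(A) == [-4, 5, -5]   # chi_A(x) = x^3 - 4x^2 + 5x - 5
assert charpoly_at(A) == [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```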
L. 9-157
Let α, β : V → V be two linear maps. Then Vα ≅ Vβ as F[X]-modules if and only if
α and β are conjugate as linear maps, ie. there is some γ : V → V such that
α = γ^{-1}βγ.
E. 9-158
Let’s classify conjugacy classes in GL2 (F), ie. we need to classify F[X] modules of
the form
F[X] F[X] F[X]
⊕ ⊕ ··· ⊕
(d1 ) (d2 ) (dr )
which are two-dimensional as F-modules. As we must have deg(d1 d2 · · · dr ) = 2,
we either have a quadratic thing or two linear things, ie. either
1. r = 1 and deg(d1 ) = 2. In this first case, the module is F[X]/(d1 ) where, say,
d1 = X 2 + a1 X + a2 .
2. r = 2 and deg(d1 ) = deg(d2 ) = 1. In this case, since we have d1 | d2 , and they
are both monic linear, we must have d1 = d2 = X − λ for some λ. In this case,
we get
$$\frac{F[X]}{(X - \lambda)} \oplus \frac{F[X]}{(X - \lambda)}.$$
We use the basis 1, X in case 1; the linear maps corresponding to cases 1 and 2 respectively
have matrices
$$\begin{pmatrix} 0 & -a_2 \\ 1 & -a_1 \end{pmatrix}, \qquad \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix}.$$
Do these cases overlap? Suppose the two of them are conjugate; then they have
the same determinant and the same trace. So we know −a1 = 2λ and a2 = λ². So in
fact our polynomial is
$$X^2 + a_1 X + a_2 = X^2 - 2\lambda X + \lambda^2 = (X - \lambda)^2.$$
things to try. However, we can be a bit more clever. We first count how
many irreducibles we are expecting, and then find that many of them.
There are 9 monic quadratic polynomials in total, since a1, a2 ∈ Z/3. The reducibles are (X − λ)² or (X − λ)(X − µ) with λ ≠ µ. There are three of each kind,
so we have 6 reducible polynomials, and hence 3 irreducible ones.
We can then check that X² + 1, X² + X + 2 and X² + 2X + 2 are the irreducible
polynomials. So every matrix in GL2(Z/3) is conjugate to one of
$$\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \quad \begin{pmatrix} 0 & -2 \\ 1 & -1 \end{pmatrix}, \quad \begin{pmatrix} 0 & -2 \\ 1 & -2 \end{pmatrix}, \quad \begin{pmatrix} \lambda & 0 \\ 0 & \mu \end{pmatrix}, \quad \begin{pmatrix} \lambda & 0 \\ 1 & \lambda \end{pmatrix}$$
with λ, µ ∈ (Z/3)∗.
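This classification predicts exactly 8 conjugacy classes in GL2(Z/3): three irreducible companion matrices, the diagonal matrices diag(λ, µ) for λ, µ ∈ {1, 2} (three up to order), and two Jordan blocks. Since the group has only (9 − 1)(9 − 3) = 48 elements, we can confirm the count by brute force (an illustrative sketch):

```python
from itertools import product

p = 3

def mul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2)) % p
                       for j in range(2)) for i in range(2))

def inv(A):
    (a, b), (c, d) = A
    det = (a * d - b * c) % p
    det_inv = pow(det, p - 2, p)  # Fermat: det^{p-2} = det^{-1} mod p
    return tuple(tuple(x * det_inv % p for x in row)
                 for row in ((d, -b % p), (-c % p, a)))

# all invertible 2x2 matrices over Z/3
G = [((a, b), (c, d)) for a, b, c, d in product(range(p), repeat=4)
     if (a * d - b * c) % p != 0]

seen, classes = set(), 0
for M in G:
    if M not in seen:
        classes += 1
        seen |= {mul(mul(g, M), inv(g)) for g in G}  # the conjugacy orbit of M
```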
L. 9-160
Let M be a module over a ring R, and let N be a submodule of M. If M/N is
free, then M ≅ N ⊕ M/N.
E. 9-161
Let R = Z[X]/(X² + 5) ≅ Z[√−5] ⊆ C; then (1 + X)(1 − X) = 1 − X² = 1 + 5 =
6 = 2 × 3. Now 1 ± X, 2, 3 are irreducible, so R is not a UFD. Let
ker φ = {(a, b) ∈ I1 ⊕ I2 : a + b = 0} = I1 ∩ I2
406 CHAPTER 10. COMPLEX ANALYSIS AND METHODS
Conceptually, we can visualize this using the Riemann sphere, a sphere
resting on the complex plane with its "South Pole" S at z = 0. For any point z ∈ C,
we draw a line through the "North Pole" N of the sphere to z, and note where
this line intersects the sphere; this specifies an equivalent point P on the sphere. Then
∞ corresponds to the North Pole of the sphere itself. So the extended complex
plane is mapped bijectively to the sphere.
This is a useful way to visualize things, but is not as useful when we actually want
to do computations. To investigate properties of ∞, we use the substitution ζ = 1/z,
ie. a Möbius map taking the point ∞ to 0. A function f(z) is said to have a
particular property at ∞ if f(1/ζ) has that same property at ζ = 0.
P. 10-4
Let f be defined on an open set U ⊆ C. Write f(x + iy) = u(x, y) + iv(x, y), where
u, v are functions R² → R. Then f is complex differentiable at w = c + id ∈ U
if and only if u, v are differentiable at (c, d) and the Cauchy-Riemann equations
ux = vy and uy = −vx hold at (c, d). Moreover, when they hold we have
f′(w) = ux(c, d) + ivx(c, d) = vy(c, d) − iuy(c, d).
Writing f′(w) = p + iq, complex differentiability of f at w means exactly that
$$\lim_{z \to w} \frac{f(z) - f(w) - (p + iq)(z - w)}{|z - w|} = 0. \qquad (\dagger)$$
Breaking into real and imaginary parts, (†) holds if and only if
$$\lim_{(x,y) \to (c,d)} \frac{u(x, y) - u(c, d) - (p(x - c) - q(y - d))}{\sqrt{(x - c)^2 + (y - d)^2}} = 0$$
and
$$\lim_{(x,y) \to (c,d)} \frac{v(x, y) - v(c, d) - (q(x - c) + p(y - d))}{\sqrt{(x - c)^2 + (y - d)^2}} = 0.$$
Comparing this to the definition of the differentiability of a real-valued function, we
see this holds exactly if u and v are differentiable at (c, d) with Du(c, d) = (p, −q)
and Dv(c, d) = (q, p).
Note that if we just have ux = vy and uy = −vx at (c, d) ∈ U, we cannot conclude
that f is complex differentiable at (c, d). These conditions only say that the partial
derivatives exist; this does not imply that u and v are differentiable,
as required by the proposition. However, if the partial derivatives exist and are
continuous, then by Analysis II they are differentiable.
E. 10-5
• The usual rules of differentiation (sum rule, product rule, chain rule, derivative of
inverse) all hold for complex differentiable functions, with the same proofs as in the
real case.
• A polynomial p : C → C is entire. This can be checked directly from definition, or
using the product rule.
• A rational function p(z)/q(z) : U → C, where U ⊆ C \ {z : q(z) = 0}, is holomorphic on U. Here p, q are polynomials.
• f(z) = |z| is not complex differentiable at any point of C. Indeed, we can write
this as f = u + iv, where u(x, y) = √(x² + y²) and v(x, y) = 0. If (x, y) ≠ (0, 0),
then
$$u_x = \frac{x}{\sqrt{x^2 + y^2}}, \qquad u_y = \frac{y}{\sqrt{x^2 + y^2}}.$$
If we are not at the origin, then clearly both cannot vanish, but the partials of
v both vanish. Hence the Cauchy-Riemann equations do not hold and f is not
differentiable outside the origin. At the origin, we can compute directly that
$$\frac{f(h) - f(0)}{h} = \frac{|h|}{h}.$$
This is +1 for h ∈ R+ and −1 for h ∈ R−. So the limit as h → 0 does not
exist.
• For f (z) = |z|2 = x2 + y 2 , the Cauchy-Riemann equations are satisfied only at the
origin. So f is only differentiable at z = 0. However, it is not analytic since there
is no neighbourhood of 0 throughout which f is differentiable.
• Let f (z) = Re z. This has u = x, v = 0. But f is nowhere analytic as
$$\frac{\partial u}{\partial x} = 1 \neq 0 = \frac{\partial v}{\partial y}.$$
1. Any compact subset of BR(a) is contained in some B̄r(a) with r < R.
So it suffices to show that the series converges uniformly on B̄r(a) for each r < R.
The proof is the same as that in [T.5-14].
2. Without loss of generality, take a = 0. We first prove that the derivative
series has radius of convergence R, so that we can freely manipulate it.
Certainly, we have |ncn| ≥ |cn|, so by comparison with the series for f, the
radius of convergence of $\sum n c_n z^{n-1}$ is at most R. Given |z| < R,
pick ρ with |z| < ρ < R; then
$$\frac{|n c_n z^{n-1}|}{|c_n \rho^{n-1}|} = n \left|\frac{z}{\rho}\right|^{n-1} \to 0,$$
so comparison with $\sum |c_n| \rho^{n-1}$ shows that the derivative series converges at z.
If f vanishes on B(a, ε), then all its derivatives at a vanish, and hence the coefficients all vanish, as cn = f^{(n)}(a)/n!. So f is identically zero.
This is another circle of Apollonius. Note that the proof fails if either cz1 + d = 0
or cz2 + d = 0, but then (∗) trivially represents a circle.
P. 10-11
Given six points α, β, γ, α′, β′, γ′ ∈ C∗, we can find a Möbius map which sends
α ↦ α′, β ↦ β′ and γ ↦ γ′.
$$f_2(z) = \frac{\beta' - \gamma'}{\beta' - \alpha'} \cdot \frac{z - \alpha'}{z - \gamma'}.$$
Therefore, we can find a Möbius map taking any given circline to any
other, which is convenient.
D. 10-12
• We say that a point p ∈ C is a branch point of a multivalued function if the
function cannot be given a continuous single-valued definition in a (punctured)
neighbourhood B(p, ε) \ {p} of p for any ε > 0. The function is said to have a
branch point singularity there.
E. 10-13
When we attempt to define an inverse for a non-injective function, there are
many choices of where to send each point, so we end up with a multi-valued function.
For example, for the exponential function e^z, we want to define the inverse log z =
log r + iθ, where r = |z| and θ = arg(z). There are infinitely many possible values
of log z, one for every choice of θ (differing by multiples of 2π). However, when we
write down an expression we often want it to be single-valued, well-defined,
and continuous.
10.1. BASIC NOTIONS 411
Consider the three curves shown in the diagram. Going round
C1, we could choose θ always in the range (0, π/2), and then log z would be continuous and single-valued. On C2, we could choose θ ∈ (π/2, 3π/2) and log z would again be continuous and single-valued. However, this doesn't work for C3: since it
encircles the origin, there is no such choice. Whatever
we do, log z cannot be made continuous and single-valued around C3. It must
either "jump" somewhere, or the value has to increase by 2πi every time we go
round the circle, ie. the function is multi-valued. This is true for any curve going
around the origin, so the origin is a branch point here.
1. log(z − a) has a branch point at z = a.
2. log((z − 1)/(z + 1)) = log(z − 1) − log(z + 1) has two branch points, at ±1.
not true on, say, the unit circle. For practical (and applied) purposes, instead of
restricting θ ∈ (−π, π) (ie. on U) we might also give the function a value on θ = π,
so that θ ∈ (−π, π] and we have log defined on C∗. However, we should still imagine
an imaginary barrier at the negative real axis which we can't cross,
and where we have a discontinuity of 2πi. The branch is then single-valued and
continuous on any curve C that does not cross the cut.
We have picked an arbitrary branch cut and branch. We can pick other branch
cuts or branches. Even with the same branch cut, we can still have a different
branch — we can instead require θ to fall in (π, 3π]. Of course, we can also pick
other branch cuts, eg. the non-negative imaginary axis. Any cut that stops curves
wrapping around the branch point will do.
2. Specify the location of the branch cut and give the value of the required branch
at a single point not on the cut. The values everywhere else are then defined
uniquely by continuity. For example, we have log z with a branch cut along
R≤0 and log 1 = 0. Of course, we could have defined log 1 = 2πi as well, and
this would correspond to picking arg z ∈ (π, 3π].
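Python's cmath.log happens to implement exactly this convention: the principal branch, with a cut along the non-positive real axis and arg z ∈ (−π, π], so points on the cut take the value from above. A quick illustration of the 2πi jump across the cut:

```python
import cmath
import math

# value on the cut itself: theta = pi is included, so log(-1) = i*pi
assert abs(cmath.log(-1) - 1j * math.pi) < 1e-12

# approaching the cut from above and below differs by (about) 2*pi*i
eps = 1e-12
above = cmath.log(complex(-1.0, eps))
below = cmath.log(complex(-1.0, -eps))
assert abs((above - below) - 2j * math.pi) < 1e-9
```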
E. 10-14
<Powers> Having defined the logarithm, we define general power functions.
Let α ∈ C and log : U → C be a branch of the logarithm. Then we can define
z α = eα log z on U . This is again only defined when log is.
Consider the function f(z) = √(z(z − 1)). This has two branch points, z = 0
and z = 1, since we cannot define a square root consistently near either, as the square root is
defined via the logarithm. Note we can define a continuous branch of f on either
C \ ((−∞, 0) ∪ (1, ∞)) or C \ (0, 1). Why is the second case possible? Note that
$$f(z) = e^{\frac{1}{2}(\log(z) + \log(z - 1))}.$$
If we move around a path encircling the finite slit (0, 1), the value of each of
log(z) and log(z − 1) jumps by 2πi, so the total change in the exponent is
½(2πi + 2πi) = 2πi; since e^{2πi} = 1, the expression for f(z) is uniquely defined. While these two ways of
cutting slits look rather different, if we consider this to be on the Riemann sphere,
then these two cuts look similar: one passes through the point ∞, and the other doesn't.
E. 10-15
<Riemann surfaces> The introduction of these slits/cuts is practical and helpful for many of our
problems. However, theoretically, this is not the best way to think about multi-valued functions.
Instead of this brutal way of introducing a cut and forbidding crossing, Riemann imagined the different
branches as separate copies of C, all stacked on top of each other, with each copy joined to the next at
the branch cut. This structure is a Riemann surface.
The idea is that traditionally, we are not allowed to cross branch cuts; here, when
we cross a branch cut, we move to a different copy of C, corresponding
to a different branch of our function. We will not investigate this further here;
the Part II course Riemann Surfaces will study this in detail.
L. 10-16
Let D be a domain. Suppose f : D → C is holomorphic with 0 derivative every-
where, then f is constant on D.
L. 10-17
If f is holomorphic at w ∈ C and f′(w) ≠ 0, then f is locally invertible at w, and
its local inverse g is holomorphic at f(w) with g′(f(w)) = 1/f′(w).
Write f = u + iv; by the Cauchy-Riemann equations, det(Df) = uxvy − uyvx = ux² + uy². Using the formula for the complex
derivative in terms of the partials, this shows that if f′(w) ≠ 0, then det(Df|w) ≠
0. Hence, by the inverse function theorem (viewing f as a function R² → R²), f
is locally invertible at w (technically, we need f to be continuously differentiable,
instead of just differentiable, but we will later show that f in fact must be infinitely
differentiable and hence have continuous derivatives). Moreover, by the same
proof as in real analysis, the local inverse is holomorphic. More precisely, say
f|U : U → f(U) (with w ∈ U and U open) has local inverse g : f(U) → U; wlog
f′(z) ≠ 0 on U. Fix any z ∈ f(U) and let k = g(z + h) − g(z), so that
f(g(z) + k) − f(g(z)) = h. Then
$$\frac{g(z + h) - g(z)}{h} = \frac{k}{f(g(z) + k) - f(g(z))} \to \frac{1}{f'(g(z))} \quad \text{as } h \to 0,$$
since k → 0 as h → 0 by continuity.
P. 10-18
1. A branch of logarithm λ : U → C is holomorphic with λ0 (z) = 1/z.
2. Let log : U → C (where U = {z ∈ C : z ∉ R≤0}) be the principal branch of
the logarithm. Then for |z| < 1,
$$\log(1 + z) = \sum_{n \ge 1} (-1)^{n-1} \frac{z^n}{n} = z - \frac{z^2}{2} + \frac{z^3}{3} - \cdots.$$
using the chain rule and the fact that f 0 (w) 6= 0. So angles are preserved.
E. 10-21
<Examples of conformal maps/equivalence>
• Any Möbius map A(z) = (az + b)/(cz + d) (with ad − bc ≠ 0) defines a conformal equivalence
C ∪ {∞} → C ∪ {∞} in the obvious sense. A′(z) ≠ 0 follows from the chain rule
and the invertibility of A(z). In particular, the Möbius group of the disk D,
$$\text{Möb}(D) = \{f \in \text{Möbius group} : f(D) = D\} = \left\{ \lambda \frac{z - a}{\bar{a}z - 1} \in \text{Möb} : |a| < 1, |\lambda| = 1 \right\},$$
is a group of conformal equivalences of the disk. One can prove that the Möbius
group of the disk is indeed of this form, and that in fact these are all conformal
equivalences of the disk.
• The map z ↦ z^n for n ≥ 2 is holomorphic everywhere and conformal except at
z = 0. This gives a conformal equivalence
$$\left\{ z \in \mathbb{C}^* : 0 < \arg(z) < \frac{\pi}{n} \right\} \leftrightarrow \mathbb{H},$$
the upper half-plane.
We need to halve the angle. We saw that z ↦ z² doubles the angle, so we might
try z^{1/2}, for which we need to choose a branch (of log). The branch cut must not
lie in U, since z^{1/2} is not analytic on the branch cut; in particular, the principal
branch does not work. So we choose a cut along the negative imaginary axis, and
the function is defined by re^{iθ} ↦ √r e^{iθ/2}, where θ ∈ (−π/2, 3π/2]. This produces the
wedge {z′ : π/4 < arg z′ < 3π/4}. This isn't exactly the wedge we want, so we
rotate it through −π/2: the final map is f(z) = −iz^{1/2}.
• The exponential function
$$e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + \cdots$$
defines a function C → C∗. In fact it is a conformal mapping: e^z takes rectangles
conformally to sectors of annuli.
With an appropriate choice of branch, log z does the reverse. In particular, the
map sends the strip {z : Re(z) ∈ [a, b]} to the annulus {e^a ≤ |w| ≤ e^b}. One
is simply connected, but the other is not; this is not a problem, since e^z is not
bijective on the strip (and hence not a conformal equivalence).
Note that this strip-to-annulus map cannot be achieved by a Möbius
map, since both boundary lines of the strip pass through the point ∞, while
the two boundary circles of the annulus do not intersect.
• Note that z ∈ H if and only if z is closer to i than to −i, in other words |z − i| <
|z + i|, equivalently |(z − i)/(z + i)| < 1. So z ↦ (z − i)/(z + i) defines a conformal equivalence H → D,
the unit disk. We know this is conformal since it is a Möbius
map. Another way to see that this map sends H → D is to note that it
sends −1, 0, 1 to i, −1, −i. Since Möbius maps send circles/lines to circles/lines,
the real line is mapped to the unit circle, ie. ∂H ↦ ∂D. Moreover,
the map sends i ∈ H to 0 ∈ D, hence by continuity it must map H to D.
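A numerical sanity check of this map (illustrative; the name cayley is ours, since this map is often called the Cayley transform):

```python
import random

def cayley(z):
    # the Mobius map z -> (z - i)/(z + i), sending the upper half-plane H to D
    return (z - 1j) / (z + 1j)

random.seed(0)
for _ in range(1000):
    z = complex(random.uniform(-10, 10), random.uniform(0.01, 10))  # Im z > 0
    assert abs(cayley(z)) < 1   # interior of H lands inside the unit disk
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    assert abs(abs(cayley(x)) - 1) < 1e-12  # the real axis lands on the unit circle
```

The checkpoints from the text also hold: cayley maps −1, 0, 1 to i, −1, −i, and i to 0.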
Similarly, the map f(z) = (z − 1)/(z + 1) maps D to the left half-plane. In fact
these maps can be deployed more generally on quadrants: f permutes the 8
regions of the complex plane cut out by the unit circle and the axes, sending
1 ↦ 2 ↦ 3 ↦ 4 ↦ 1 and 5 ↦ 6 ↦ 7 ↦ 8 ↦ 5 (in the labelling of the accompanying diagram).
In particular, this agrees with what we had above: it sends the
complete disk to the left-hand half-plane.
E. 10-22
Consider the map
$$z \mapsto w = \frac{1}{2}\left(z + \frac{1}{z}\right), \qquad z \in \mathbb{C}^* = \mathbb{C} \setminus \{0\},$$
with derivative
$$f'(z) = \frac{1}{2}\left(1 - \frac{1}{z^2}\right) = \frac{z^2 - 1}{2z^2}.$$
So f is conformal except at ±1. Recall that Möbius maps send lines and circles to
lines and circles; this map does something different. We write z = re^{iθ}. Then if we
write z ↦ w = u + iv, we have
$$u = \frac{1}{2}\left(r + \frac{1}{r}\right)\cos\theta, \qquad v = \frac{1}{2}\left(r - \frac{1}{r}\right)\sin\theta.$$
Fixing the radius, we see that a circle of radius ρ is mapped to the ellipse
$$\frac{u^2}{\frac{1}{4}\left(\rho + \frac{1}{\rho}\right)^2} + \frac{v^2}{\frac{1}{4}\left(\rho - \frac{1}{\rho}\right)^2} = 1.$$
Fixing the argument, we see that the half-line arg(z) = µ is mapped to the hyperbola
$$\frac{u^2}{\cos^2\mu} - \frac{v^2}{\sin^2\mu} = 1.$$
We can do something more interesting. Consider an off-centred circle, chosen to
pass through the points −1 and −i, and look at its image under f.
Note that we have a singularity at f (−1) = −1. This is exactly the point where
f is not conformal, and is no longer required to preserve angles. This is a crude
model of an aerofoil, and the transformation is known as the Joukowsky transform.
In applied mathematics, this is used to model fluid flow over a wing in terms of
the analytically simpler flow across a circular section.
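The circle-to-ellipse claim is easy to verify numerically from the formulas for u and v above (an illustrative sketch):

```python
import cmath
import math

def joukowsky(z):
    # the Joukowsky transform w = (z + 1/z) / 2
    return 0.5 * (z + 1 / z)

rho = 2.0
a = 0.5 * (rho + 1 / rho)  # semi-axis of the image ellipse along u
b = 0.5 * (rho - 1 / rho)  # semi-axis along v
for k in range(100):
    theta = 2 * math.pi * k / 100
    w = joukowsky(rho * cmath.exp(1j * theta))
    u, v = w.real, w.imag
    # the circle of radius rho maps onto the ellipse u^2/a^2 + v^2/b^2 = 1
    assert abs(u ** 2 / a ** 2 + v ** 2 / b ** 2 - 1) < 1e-9
```

The critical points ±1 are fixed: f(1) = 1 and f(−1) = −1, which is where the cusp of the aerofoil image appears.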
E. 10-23
Often, there is no simple way to describe regions in space. However, if the region
is bounded by circular arcs, there is a trick that can be useful.
Suppose we have a circular arc between α and β. Along this arc,
µ = θ − φ = arg(z − α) − arg(z − β) is constant, by elementary geometry
(angles subtended by a chord in the same segment are equal). Thus, for each
fixed µ, the equation arg(z − α) − arg(z − β) = µ determines an arc through
the points α and β.
To obtain a region bounded by two arcs, we find the values µ− and µ+ that describe
the boundary arcs. A point lies between the two arcs if and only if its µ is in
between µ− and µ+, ie. the region is
$$\left\{ z : \arg\frac{z - \alpha}{z - \beta} \in [\mu_-, \mu_+] \right\}.$$
This says the point has to lie on some arc between those given by µ− and µ+. For
example, the region bounded by the segment [−1, 1] and the circular arc through −1, i, 1 is
$$U = \left\{ z : \arg\frac{z - 1}{z + 1} \in \left[\frac{\pi}{2}, \pi\right] \right\}.$$
close to −1 as you wish. Squaring doubles the angle and gives the lower half-plane, and multiplying by −1 gives the upper half-plane; in symbols,
z ↦ (z − 1)/(z + 1), then z ↦ z², then z ↦ −z.
In practice, complicated conformal maps are usually built up from individual building blocks, each a simple conformal map like the ones above. This makes use of the fact
that a composition of conformal maps is conformal, by the chain rule. We know H is
conformally equivalent to D, hence U is conformally equivalent to D. In fact, there
is a really powerful theorem telling us that most things are conformally equivalent to
the unit disk.
D. 10-24
• A simple closed curve is the image of a continuous injective map S¹ → C.
• Two functions u, v : R2 → R satisfying the Cauchy-Riemann equations are called
harmonic conjugates .
E. 10-25
• It should be clear (though not trivial to prove) that a simple closed curve separates
C into a bounded part and an unbounded part.
• If we know one of the harmonic conjugates, then we can find the other up to a
constant. For example, if u(x, y) = x² − y², then v must satisfy
$$\frac{\partial v}{\partial y} = \frac{\partial u}{\partial x} = 2x.$$
So we must have v = 2xy + g(x) for some function g(x). The other Cauchy-Riemann equation gives
$$-2y = \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x} = -2y - g'(x).$$
This tells us g′(x) = 0. So g must be a genuine constant, say α. The corresponding
analytic function whose real part is u is therefore
$$f(z) = x^2 - y^2 + 2ixy + i\alpha = (x + iy)^2 + i\alpha = z^2 + i\alpha.$$
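This computation can be sanity-checked numerically: central finite differences confirm the Cauchy-Riemann equations for u = x² − y², v = 2xy, and u + iv agrees with z² (taking α = 0). An illustrative sketch:

```python
def u(x, y):
    return x * x - y * y   # harmonic: the real part of z^2

def v(x, y):
    return 2 * x * y       # its harmonic conjugate (up to a constant)

h = 1e-6

def dx(f, x, y):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def dy(f, x, y):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

for (x, y) in [(0.3, -1.2), (2.0, 0.5), (-1.0, 1.0)]:
    # Cauchy-Riemann: u_x = v_y and u_y = -v_x
    assert abs(dx(u, x, y) - dy(v, x, y)) < 1e-6
    assert abs(dy(u, x, y) + dx(v, x, y)) < 1e-6
    # u + iv is z^2 (with alpha = 0)
    z = complex(x, y)
    assert abs(complex(u(x, y), v(x, y)) - z * z) < 1e-12
```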
10.2. CONFORMAL MAPS 419
• Recall that a domain U ⊆ C is simply connected if every continuous map from the
circle f : S¹ → U can be extended to a continuous map from the disk F : D² → U
such that F|∂D² = f. Alternatively, any loop can be continuously shrunk to a
point. For example, the unit disk is simply connected, but the annulus defined by
1 < |z| < 2 is not, since the inclusion of the circle |z| = 1.5 cannot be extended to a map from a
disk.
T. 10-26
<Riemann mapping theorem> Let U ⊆ C be the bounded domain enclosed
by a simple closed curve, or more generally any simply connected domain not equal
to all of C. Then U is conformally equivalent to D = {z : |z| < 1} ⊆ C.
We will not prove this. This in particular tells us that any two simply connected
domains other than C itself are conformally equivalent. If we believe that the unit disk is relatively
simple, then since all simply connected regions are conformally equivalent to the
disk, all simply connected domains are boring. We will later encounter domains
with holes to make things interesting.
L. 10-27
A map is a conformal equivalence iff it is a bijective holomorphic map.
We will only prove the forward direction. If a map is conformal, then the inverse
mapping theorem tells us there is a local conformal inverse. And if the function is
also bijective, these patch together to give a global conformal inverse.
P. 10-28
The real and imaginary parts of any holomorphic function are harmonic.
In light of this result, conformal maps have another use: to solve Laplace’s equa-
tion. The idea is that if we are required to solve the 2D Laplace’s equation on a
funny domain U subject to some boundary conditions, we can try to find a con-
formal equivalence (bijective holomorphic map) f between U and some other nice
domain V . We can then solve Laplace’s equation on V subject to the boundary
conditions carried forward by f , which is hopefully easier. And then we bring the
solution back to U . More concretely, the following algorithm can be used to solve
Laplace’s Equation ∇2 φ(x, y) = 0 on a tricky domain U ⊆ R2 with given Dirich-
let boundary conditions on ∂U . We now pretend R2 is actually C, and identify
subsets of R2 with subsets of C in the obvious manner.
To prove this works, we can take ∇² of this expression, write f = u + iv, use
the Cauchy-Riemann equations, and expand out to see it gives 0. Alternatively, it
can be shown (via the simply-connected version of Cauchy's theorem, which we will prove at the
end of the chapter) that any harmonic function on a simply-connected domain has a
harmonic conjugate, unique up to an additive constant. Then since Φ is harmonic,
it is the real part of some holomorphic function F (z) = Φ(x, y) + iΨ(x, y), where
z = x + iy. Now F (f (z)) is holomorphic, as it is a composition of holomorphic
functions. So its real part Φ(Re f, Im f ) is harmonic.
E. 10-29
Find a bounded solution of ∇2 φ = 0 on the first quadrant of R2 subject to φ(x, 0) =
0 and φ(0, y) = 1 for all x, y > 0.
We choose f(z) = log z, which maps U to the strip V = {z : 0 < Im z < π/2}.
Recall that we said log maps (sections of) an annulus to a rectangle. This is indeed
the case here — U is an annulus with zero inner radius and infinite outer radius;
V is an infinitely long rectangle. We must now solve ∇²Φ = 0 in V subject
to Φ(x, 0) = 0 and Φ(x, π/2) = 1 for all x ∈ R. We have these boundary
conditions since f(z) takes the positive real axis of ∂U to the line Im z = 0, and the
positive imaginary axis to Im z = π/2. By inspection, the solution is Φ(x, y) = (2/π)y.
Hence,
$$\phi(x, y) = \Phi(\operatorname{Re} \log z, \operatorname{Im} \log z) = \frac{2}{\pi}\operatorname{Im} \log z = \frac{2}{\pi}\tan^{-1}\left(\frac{y}{x}\right).$$
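The solution can be checked numerically. A small sketch, with arbitrarily chosen test points: the candidate φ should have (approximately) vanishing Laplacian at interior points of the first quadrant and reproduce the boundary values.

```python
import math

# Candidate solution phi(x, y) = (2/pi) * arctan(y/x) on the first quadrant
def phi(x, y):
    return (2 / math.pi) * math.atan2(y, x)

# Five-point finite-difference Laplacian at an arbitrary interior point
h = 1e-4
x0, y0 = 1.3, 0.8
lap = (phi(x0 + h, y0) + phi(x0 - h, y0)
       + phi(x0, y0 + h) + phi(x0, y0 - h) - 4 * phi(x0, y0)) / h ** 2

bc_real = phi(5.0, 0.0)   # boundary condition on the positive real axis: 0
bc_imag = phi(0.0, 5.0)   # boundary condition on the positive imaginary axis: 1
```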
L. 10-31
Suppose f : [a, b] → C is continuous (and hence integrable). Then
$$\left|\int_a^b f(t)\,\mathrm{d}t\right| \le (b - a) \sup_t |f(t)|,$$
with equality if and only if there are constants M and θ such that |f(t)| = M and
arg f(t) = θ for all t, ie. f is constant.
D. 10-32
• A path (or curve) in C is a continuous function γ : [a, b] → C, where a, b ∈ R. A
path γ : [a, b] → C is
simple if γ(t1 ) = γ(t2 ) implies t1 = t2 or {t1 , t2 } = {a, b}.
closed if γ(a) = γ(b).
• A contour is a simple closed path which is piecewise C 1 , ie. piecewise continuously
differentiable.
? Given a path γ : [a, b] → C, sometimes we write −γ to mean the path γ traversed
in the opposite direction, that is −γ : [a, b] → C is given by (−γ)(t) = γ(a + b − t).
Given two paths γ1 : [a, b] → C and γ2 : [α, β] → C with γ1 (b) = γ2 (α), we
sometimes write γ1 + γ2 to denote the path formed by joining the two paths at
γ1 (b) = γ2 (α). That is, γ1 + γ2 : [a, b + (β − α)] → C is given by
$$(\gamma_1 + \gamma_2)(t) = \begin{cases} \gamma_1(t) & t \in [a, b] \\ \gamma_2(t - b + \alpha) & t \in [b, b + (\beta - \alpha)]. \end{cases}$$
? If we only specify the image of a contour but not its orientation, the convention
is that when we integrate over it the direction of traversal is anticlockwise. This is
also the direction that keeps the interior of the contour on the left. This direction
of traversal is sometimes called positive .
• If γ : [a, b] → U ⊆ C is C 1 -smooth and f : U → C is continuous, then we define
the integral of f along γ as
$$\int_\gamma f(z)\,\mathrm{d}z = \int_a^b f(\gamma(t))\,\gamma'(t)\,\mathrm{d}t.$$
E. 10-33
• For general paths, we just require continuity, and do not impose any conditions
about, say, differentiability. Unfortunately, the world is full of weird paths. There
are even paths that fill up the whole of the unit square. So we might want to look
at some nicer paths, called simple paths. These are paths that either do not
intersect themselves, or intersect themselves only at the end points.
• For example, a contour can look something like the diagram on the
right. Most of the time, we are just interested in integration along
contours. However, it is also important to understand integration
along just simple C 1 smooth paths, since we might want to break
our contour up into different segments.
• Our definition of integration along paths has the following elementary properties:
1. The definition is insensitive to reparametrization. Let φ : [a′, b′] → [a, b] be C 1
such that φ(a′) = a, φ(b′) = b. If γ is a C 1 path and δ = γ ◦ φ, then
$$\int_\gamma f(z)\,\mathrm{d}z = \int_\delta f(z)\,\mathrm{d}z.$$
These together tell us the integral depends only on the path itself, not how we
look at the path or how we cut up the path into pieces. We also have the following
easy properties:
3. If −γ is γ with reversed orientation, then
$$\int_{-\gamma} f(z)\,\mathrm{d}z = -\int_\gamma f(z)\,\mathrm{d}z.$$
4. If we define
$$\mathrm{length}(\gamma) = \int_a^b |\gamma'(t)|\,\mathrm{d}t, \qquad\text{then}\qquad \left|\int_\gamma f(z)\,\mathrm{d}z\right| \le \mathrm{length}(\gamma) \sup_t |f(\gamma(t))|.$$
The integral can also be seen as the limit of Riemann sums along the path:
$$\int_\gamma f(z)\,\mathrm{d}z = \lim_{\Delta \to 0} \sum_{n=0}^{N-1} f(z_n)\,\delta z_n,$$
where z0 , z1 , . . . , zN are successive points along γ and δzn = z_{n+1} − z_n .
E. 10-34
• Take U = C∗ , and let f (z) = z n for n ∈ Z. We pick φ : [0, 2π] → U that sends
θ ↦ e^{iθ}. Then
$$\int_\phi f(z)\,\mathrm{d}z = \begin{cases} 2\pi i & n = -1 \\ 0 & \text{otherwise.} \end{cases}$$
If n = −1, then the integrand is constantly 1, and hence gives 2πi. Otherwise, the
integrand is a non-trivial exponential which is made of trigonometric functions,
and when integrated over 2π gives zero.
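This standard computation is easy to reproduce numerically. A sketch (the midpoint-rule integrator and the sample count are our own choices): integrating zⁿ around the unit circle gives 2πi for n = −1 and 0 for other integers n.

```python
import cmath

def circle_integral(f, n_samples=20000):
    # Midpoint-rule approximation of the integral of f(z) dz over the
    # unit circle, parametrised by z = e^(it), dz = i e^(it) dt.
    dt = 2 * cmath.pi / n_samples
    total = 0j
    for k in range(n_samples):
        t = (k + 0.5) * dt
        z = cmath.exp(1j * t)
        total += f(z) * 1j * z * dt
    return total

I_m1 = circle_integral(lambda z: z ** -1)   # n = -1: expect 2*pi*i
I_2 = circle_integral(lambda z: z ** 2)     # n = 2: expect 0
I_0 = circle_integral(lambda z: 1 + 0j)     # n = 0: expect 0
```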
• Take γ to be the contour given in two parts by
$$\gamma_1 : [-R, R] \to \mathbb{C},\ t \mapsto t \qquad\text{and}\qquad \gamma_2 : [0, 1] \to \mathbb{C},\ t \mapsto Re^{i\pi t},$$
ie. the segment from −R to R along the real axis, followed by the anticlockwise
semicircle of radius R in the upper half-plane. Consider the function f (z) = z 2 . Then the integral is
$$\int_\gamma f(z)\,\mathrm{d}z = \int_{-R}^R t^2\,\mathrm{d}t + \int_0^1 R^2 e^{2\pi i t}\, i\pi R e^{i\pi t}\,\mathrm{d}t = \frac{2}{3}R^3 + R^3 i\pi \int_0^1 e^{3\pi i t}\,\mathrm{d}t$$
$$= \frac{2}{3}R^3 + R^3 i\pi \left[\frac{e^{3\pi i t}}{3\pi i}\right]_0^1 = 0.$$
We worked this out explicitly, but we have just wasted our time, since this is just
an instance of the fundamental theorem of calculus!
T. 10-35
<Fundamental theorem of calculus> Let f : U → C be continuous with
antiderivative F . If γ : [a, b] → U is piecewise C 1 -smooth, then
$$\int_\gamma f(z)\,\mathrm{d}z = F(\gamma(b)) - F(\gamma(a)).$$
We have
$$\int_\gamma f(z)\,\mathrm{d}z = \int_a^b f(\gamma(t))\,\gamma'(t)\,\mathrm{d}t = \int_a^b (F \circ \gamma)'(t)\,\mathrm{d}t.$$
Then the result follows from the usual fundamental theorem of calculus, applied
to the real and imaginary parts separately.
In particular, the integral depends only on the end points, and not the path itself.
Moreover, if γ is closed, then the integral vanishes.
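Path-independence can be observed directly. A sketch (the parametrisations and step counts are arbitrary choices): we integrate f(z) = z² from −1 to 1 along a straight segment and along a semicircle, and compare both with F(1) − F(−1) = 2/3.

```python
import cmath

def path_integral(f, gamma, n=4000):
    # Integral of f(z) dz along gamma : [0, 1] -> C, midpoint rule,
    # with gamma'(t) approximated by a central difference.
    h = 1.0 / n
    total = 0j
    for k in range(n):
        t = (k + 0.5) * h
        dgamma = (gamma(t + h / 2) - gamma(t - h / 2)) / h
        total += f(gamma(t)) * dgamma * h
    return total

f = lambda z: z ** 2
F = lambda z: z ** 3 / 3                             # an antiderivative of f

line = lambda t: -1 + 2 * t                          # straight path from -1 to 1
arc = lambda t: cmath.exp(1j * cmath.pi * (1 - t))   # semicircle from -1 to 1

I_line = path_integral(f, line)
I_arc = path_integral(f, arc)
expected = F(1) - F(-1)                              # = 2/3
```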
E. 10-36
This allows us to understand the first example we had. We had the function
f (z) = z n integrated along the path φ(t) = e^{it} (for 0 ≤ t ≤ 2π). If n ≠ −1, then
$$f = \frac{\mathrm{d}}{\mathrm{d}z}\left(\frac{z^{n+1}}{n+1}\right),$$
so f has an antiderivative and the integral around the closed path vanishes. If
n = −1, the natural candidate antiderivative would be a branch of log, but there
isn't a continuous branch of log on any set U containing the unit circle.
D. 10-37
• A star-shaped domain or star domain is a domain U such that there is some
a0 ∈ U such that the line segment [a0 , w] ⊆ U for all w ∈ U .
So the two integrals agree. Now we need to check that F is complex differentiable.
Since U is open, we can pick ε > 0 such that B(w; ε) ⊆ U . Let δh be the radial
path in B(w, ε) from w to w + h, with |h| < ε. Now note that γw ∗ δh is a path
from a0 to w + h. So
$$F(w + h) = \int_{\gamma_w * \delta_h} f(z)\,\mathrm{d}z = F(w) + \int_{\delta_h} f(z)\,\mathrm{d}z = F(w) + hf(w) + \int_{\delta_h} (f(z) - f(w))\,\mathrm{d}z.$$
Thus, we know
$$\left|\frac{F(w + h) - F(w)}{h} - f(w)\right| = \frac{1}{|h|}\left|\int_{\delta_h} (f(z) - f(w))\,\mathrm{d}z\right| \le \frac{1}{|h|}\,\mathrm{length}(\delta_h) \sup_{\delta_h} |f(z) - f(w)| = \sup_{\delta_h} |f(z) - f(w)|.$$
Since f is continuous at w, this tends to 0 as h → 0. So F ′(w) = f (w).
This is more-or-less the same proof we gave in IA Vector Calculus that a real
function is a gradient if and only if the integral about any closed path vanishes.
L. 10-40
Let M be a metric space. Suppose Un ⊆ M is non-empty and compact for all n and is such that
U1 ⊇ U2 ⊇ U3 ⊇ · · · . Then ∩n Un is non-empty.
Since the diameters of the triangles are shrinking each time, we can pick an n such
that T n ⊆ B(z0 , ε). Now note that since 1 and z both have anti-derivatives on
T n , we have
$$\int_{\partial T^n} 1\,\mathrm{d}z = 0 = \int_{\partial T^n} z\,\mathrm{d}z.$$
Therefore, noting that f (z0 ) and f ′(z0 ) are just constants, we have
$$\left|\int_{\partial T^n} f(z)\,\mathrm{d}z\right| = \left|\int_{\partial T^n} \big(f(z) - f(z_0) - (z - z_0)f'(z_0)\big)\,\mathrm{d}z\right| \le \int_{\partial T^n} \big|f(z) - f(z_0) - (z - z_0)f'(z_0)\big|\,|\mathrm{d}z|,$$
where the last inequality comes from the fact that z0 ∈ T n , and the distance
between any two points in the triangle cannot be greater than the perimeter of
the triangle. Substituting our formulas for these in, we have
$$\frac{\eta}{4^n} \le \frac{1}{4^n}\,\ell^2 \varepsilon \implies \eta \le \ell^2 \varepsilon.$$
Since ` is fixed and ε was arbitrary, it follows that we must have η = 0.
10.3. CONTOUR INTEGRATION 427
P. 10-42
<Star-shaped Cauchy’s theorem> If U is a star-shaped domain, and f : U →
C is holomorphic, then for any closed piecewise C 1 path γ in U , we must have
$$\int_\gamma f(z)\,\mathrm{d}z = 0.$$
If f is holomorphic, then Cauchy's theorem for triangles says the integral over any triangle
vanishes. If U is star-shaped, part 2 of [P.10-39] says f has an antiderivative. Then
the fundamental theorem of calculus tells us the integral around any closed path
vanishes.
Is this the best we can do? Can we formulate this for an arbitrary domain, and
not just star-shaped ones? It is obviously not true if the domain is not simply
connected, eg. for f (z) = 1/z defined on C \ {0}. However, it turns out this holds
as long as the domain is simply connected, as we will show in a later part of the
course. However, this is not surprising given the Riemann mapping theorem, since
any simply connected domain is conformally equivalent to the unit disk, which is
star-shaped (and in fact convex).
Why is this? In the proof, it was sufficient to focus on
showing ∫∂T f (z) dz = 0 for a triangle T ⊆ U . Consider the simple case where we only have a single point
of non-holomorphicity a ∈ T . The idea is again to subdivide like on the diagram. We call the center triangle
T ′. Along all other triangles in our subdivision, we get ∫ f (z) dz = 0, as these
triangles lie in a region where f is holomorphic. So
$$\int_{\partial T} f(z)\,\mathrm{d}z = \int_{\partial T'} f(z)\,\mathrm{d}z.$$
From here, it’s straightforward to conclude the general case with many points of
non-holomorphicity — we can divide the triangle in a way such that each small
triangle contains one bad point.
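The contrast between a holomorphic integrand and one with a pole inside the contour can be seen numerically. A sketch (the triangle vertices are an arbitrary choice enclosing 0): the integral of the entire function e^z around the closed triangle vanishes, while that of 1/z gives 2πi.

```python
import cmath

def segment_integral(f, a, b, n=20000):
    # Midpoint rule for the integral of f(z) dz along the segment from a to b
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

def polygon_integral(f, vertices):
    # Integrate f around the closed polygon with the given vertices
    m = len(vertices)
    return sum(segment_integral(f, vertices[i], vertices[(i + 1) % m])
               for i in range(m))

tri = [2 + 1j, -1 + 2j, -1 - 2j]   # an anticlockwise triangle enclosing 0

I_entire = polygon_integral(cmath.exp, tri)        # exp is entire: expect 0
I_pole = polygon_integral(lambda z: 1 / z, tri)    # pole at 0: expect 2*pi*i
```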
D. 10-43
Let U ⊆ C be open and φ, ψ piecewise C 1 -smooth closed paths in U . We say ψ
is an elementary deformation of φ if φ and ψ can each be expressed as a join of
paths φ1 , φ2 , · · · , φn and ψ1 , ψ2 , · · · , ψn such that for all i both φi , ψi ⊆ Ci for
some convex set Ci ⊆ U .
E. 10-44
The idea of elementary deformation is that for each of the sections φi and ψi , since
they lie in a convex region, we can continuously deform ψi to φi , so that in fact
the two curves φ and ψ are “the same”.
Suppose f : U → C is holomorphic. Wlog suppose each of φi and ψi is
parametrised by [0, 1]. Let ℓi be the straight line from φi (0) = φi−1 (1) to ψi (0) =
ψi−1 (1) (where φ0 means φn and same for ψ). Note that φi + ℓi+1 − ψi − ℓi (where
ℓn+1 means ℓ1 ) is a closed curve that lies in the convex (hence star-shaped) region
Ci , so by [P.10-42], ∫_{φi +ℓi+1 −ψi −ℓi} f (z) dz = 0 for all i. Now since ℓn+1 = ℓ1 , we
have
$$\int_\phi f(z)\,\mathrm{d}z - \int_\psi f(z)\,\mathrm{d}z = \sum_{i=1}^n\left(\int_{\phi_i} f(z)\,\mathrm{d}z - \int_{\psi_i} f(z)\,\mathrm{d}z\right) = \sum_{i=1}^n \int_{\phi_i - \psi_i} f(z)\,\mathrm{d}z$$
$$= \sum_{i=1}^n\left(\int_{\phi_i - \psi_i} f(z)\,\mathrm{d}z + \int_{\ell_{i+1}} f(z)\,\mathrm{d}z - \int_{\ell_i} f(z)\,\mathrm{d}z\right) = \sum_{i=1}^n \int_{\phi_i + \ell_{i+1} - \psi_i - \ell_i} f(z)\,\mathrm{d}z = 0.$$
Hence ∫φ f (z) dz = ∫ψ f (z) dz when ψ and φ are elementary deformations of each
other.
L. 10-45
Let M be an open subset of a metric space. If U ⊆ M is compact, then ∃δ > 0
such that B(u, δ) ⊆ M for all u ∈ U .
where we have fixed z ∈ B(z0 ; r) as in the statement of the theorem. Now note that
g is holomorphic as a function of w ∈ B(z0 , r + δ), except perhaps at w = z. But
We now rewrite
$$\frac{1}{w - z} = \frac{1}{w - z_0} \cdot \frac{1}{1 - \left(\frac{z - z_0}{w - z_0}\right)} = \sum_{n=0}^{\infty} \frac{(z - z_0)^n}{(w - z_0)^{n+1}}.$$
Note that this sum converges uniformly on ∂ B̄(z0 ; r), since |(z − z0 )/(w − z0 )| < 1 for w on
this circle. By uniform convergence, we can exchange summation and integration.
So
$$\int_{\partial \bar{B}(z_0;r)} \frac{f(w)}{w - z}\,\mathrm{d}w = \sum_{n=0}^{\infty} \int_{\partial \bar{B}(z_0,r)} f(z)\,\frac{(z - z_0)^n}{(w - z_0)^{n+1}}\,\mathrm{d}w.$$
We note that f (z)(z − z0 )n is just a constant, and that we have previously proven
$$\int_{\partial \bar{B}(z_0;r)} (w - z_0)^k\,\mathrm{d}w = \begin{cases} 2\pi i & k = -1 \\ 0 & k \ne -1. \end{cases}$$
So the right hand side is just 2πif (z).
(Proof 2) Given ε > 0, we pick δ > 0 such that B̄(z, δ) ⊆ B(z0 , r), and such that
whenever |w − z| < δ, then |f (w) − f (z)| < ε. This is possible since f is uniformly
continuous on the neighbourhood of z. We now cut our region apart:
• In fact the Cauchy integral formula is like the mean-value property for harmonic
functions:
$$f(z) = \frac{1}{2\pi i}\int_{\partial \bar{B}(z;r)} \frac{f(w)}{w - z}\,\mathrm{d}w = \frac{1}{2\pi}\int_0^{2\pi} f(z + re^{i\theta})\,\mathrm{d}\theta.$$
• So this result says that, if we know the value of f on the boundary of a disc
and that it is holomorphic on a domain containing the disc, then we know the
value of f at all points within the disc. While this seems magical, it is less
surprising if we look at it in another way. We can write f = u + iv, where u and
v are harmonic functions, ie. they satisfy Laplace’s equation. Then if we know
the values of u and v on the boundary of a disc, then what we essentially have
is Laplace’s equation with Dirichlet boundary conditions! Then the fact that
this tells us everything about f within the boundary is just the statement that
Laplace’s equation with Dirichlet boundary conditions has a unique solution!
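The mean-value reading of the formula is easy to test. A sketch (the choice f = exp, the centre and the radius are arbitrary): the average of f over a circle should equal its value at the centre.

```python
import cmath

def circle_mean(f, centre, r, n=2000):
    # Average of f over the circle |w - centre| = r (midpoint rule in theta)
    total = 0j
    for k in range(n):
        theta = 2 * cmath.pi * (k + 0.5) / n
        total += f(centre + r * cmath.exp(1j * theta))
    return total / n

z0 = 0.3 + 0.4j
mean = circle_mean(cmath.exp, z0, 1.5)
err = abs(mean - cmath.exp(z0))   # mean-value property: expect ~0
```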
• We can have a slightly more general form of this result: In fact we have
$$f(z) = \frac{1}{2\pi i}\int_\gamma \frac{f(w)}{w - z}\,\mathrm{d}w$$
for any closed piecewise C 1 curve γ that can be obtained from ∂ B̄(z0 ; r) via a
sequence of elementary deformations in U \ {z}.
T. 10-47
<Taylor’s theorem> Let f : B(a, r) → C be holomorphic. Then f has a convergent
power series representation f (z) = Σ_{n=0}^∞ cn (z − a)n on B(a, r). Moreover,
$$c_n = \frac{f^{(n)}(a)}{n!} = \frac{1}{2\pi i}\int_{\partial B(a,\rho)} \frac{f(z)}{(z - a)^{n+1}}\,\mathrm{d}z \qquad \text{for any } 0 < \rho < r.$$
This series is uniformly convergent everywhere on the ρ disk, including its bound-
ary. By uniform convergence, we can exchange integration and summation to
get
$$f(w) = \sum_{n=0}^{\infty}\left(\frac{1}{2\pi i}\int_{\partial B(a,\rho)} \frac{f(z)}{(z - a)^{n+1}}\,\mathrm{d}z\right)(w - a)^n = \sum_{n=0}^{\infty} c_n (w - a)^n.$$
Since cn does not depend on w, this is a genuine power series representation, and
this is valid on any disk B(a, ρ) ⊆ B(a, r). Then the formula for cn in terms of
the derivative comes for free since that’s the formula for the derivative of a power
series.
This tells us every holomorphic function behaves like a power series. The statement
of the theorem implies any holomorphic function has to be infinitely differentiable!
Also, we do not get weird things like e^{−1/x²} on R that have a trivial Taylor series
expansion but are themselves non-trivial. Similarly, we know that there are no “bump
functions” on C that are non-zero only on a compact set (since power series don't
behave like that). Of course, we already knew that from Liouville's theorem.
Note that the formula for f (n) (a) is what we would expect: differentiating Cauchy's
integral formula with respect to a, we get
$$f(a) = \frac{1}{2\pi i}\int_{\partial \bar{B}(a;\rho)} \frac{f(z)}{z - a}\,\mathrm{d}z \implies f'(a) = \frac{1}{2\pi i}\int_{\partial \bar{B}(a;\rho)} \frac{f(z)}{(z - a)^2}\,\mathrm{d}z.$$
We have just taken the differentiation inside the integral sign. This works since
the integrand, both before and after, is a continuous function of both z and a. We
can do this any number of times to get
$$f^{(n)}(a) = \frac{n!}{2\pi i}\int_{\partial \bar{B}(a;\rho)} \frac{f(z)}{(z - a)^{n+1}}\,\mathrm{d}z.$$
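This formula can be verified numerically. A sketch (the choice f = exp, the point a and the contour radius are ours): since every derivative of exp at a equals e^a, the contour integral should reproduce that value.

```python
import cmath, math

def nth_derivative(f, a, n, r=1.0, m=4000):
    # f^(n)(a) = n!/(2*pi*i) * integral of f(z)/(z-a)^(n+1) dz
    # over the circle |z - a| = r, approximated by a midpoint rule.
    total = 0j
    for k in range(m):
        t = 2 * cmath.pi * (k + 0.5) / m
        z = a + r * cmath.exp(1j * t)
        dz = 1j * (z - a) * (2 * cmath.pi / m)
        total += f(z) / (z - a) ** (n + 1) * dz
    return math.factorial(n) * total / (2j * cmath.pi)

a = 0.25
d3 = nth_derivative(cmath.exp, a, 3)   # every derivative of exp at a is e^a
err = abs(d3 - cmath.exp(a))
```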
P. 10-48
If f : U → C is holomorphic on a disc, then f is infinitely differentiable on the
disc.
Complex power series are infinitely differentiable (and f had better be infinitely
differentiable for us to write down the formula for cn in terms of f (n) ).
This justifies our claim from the very beginning that Re(f ) and Im(f ) are harmonic
functions if f is holomorphic.
T. 10-49
<Liouville’s theorem> Let f : C → C be an entire function. If f is bounded,
then f is constant.
Note that we get the bound on the denominator since |w| = R implies |w − zi | > R/2
by our choice of R. Letting R → ∞, we know we must have f (z1 ) = f (z2 ). So f
is constant.
(Proof 2) Suppose that |f (z)| ≤ M for all z ∈ C, and consider an arbitrary point
z0 ∈ C. By our formula for the derivative, we have
$$|f'(z_0)| = \left|\frac{1}{2\pi i}\int_{\partial B(z_0;r)} \frac{f(z)}{(z - z_0)^2}\,\mathrm{d}z\right| \le \frac{1}{2\pi} \cdot 2\pi r \cdot \frac{M}{r^2} \to 0 \quad \text{as } r \to \infty$$
since the first equality is valid for any r > 0. Hence f ′(z0 ) = 0 for all z0 ∈ C. So
f is constant.
This, for example, means there are no interesting holomorphic periodic functions like
sin and cos that are bounded everywhere.
T. 10-50
<Fundamental theorem of algebra> Every non-constant complex polynomial
has a root in C.
There are many ways we can prove the fundamental theorem of algebra.
However, none of them belong wholly to algebra. They all involve some analysis
or topology. This is not surprising since the construction of R, and hence C,
is intrinsically analytic — we get from N to Z by requiring additive
inverses; Z to Q by requiring multiplicative inverses; R to C by requiring a root
of x2 + 1 = 0. These are all algebraic. However, to get from Q to R, we are
requiring something about convergence in Q. This is not algebraic. It requires a
particular choice of metric on Q. If we pick a different metric, then we get a different
completion, as you may have seen in IB Metric and Topological Spaces. Hence the
construction of R is actually analytic, and not purely algebraic.
P. 10-51
1. If f : U → C is a complex-valued function, then f = u + iv is holomorphic at p ∈ U
if and only if u, v satisfy the Cauchy-Riemann equations and ux , uy , vx , vy
are continuous in a neighbourhood of p.
2. <Morera’s theorem> If f : U → C is continuous on a domain U and ∫γ f (z) dz = 0
for every closed piecewise C 1 curve γ in U , then f is holomorphic on U .
We have previously shown that the condition implies that f has an antiderivative
F : U → C, ie. F is a holomorphic function such that F 0 = f . But F is infinitely
differentiable. So f must be holomorphic.
So we have here a (partial) converse to Cauchy’s theorem. Recall that Cauchy’s
theorem required U to be sufficiently nice, eg. being star-shaped or just simply-
connected. However, Morera’s theorem does not. It just requires that U is a
domain. This is since holomorphicity is a local property, while vanishing on closed
curves is a global result. Cauchy’s theorem gets us from a local property to a
global property, and hence we need to assume more about what the “globe” looks
like. On the other hand, passing from a global property to a local one does not.
Hence we have this asymmetry.
L. 10-53
Let U be open. A sequence of functions fn : U → C is locally uniformly convergent
on U iff it is uniformly convergent on all compact subsets of U .
We know that f ′(z) = limn fn′(z), since we can express f ′(a) in terms of the integral
of f (z)/(z − a)2 , as in Taylor's theorem, and exchange the limit and the integral. To show
that this is locally uniform we need more work. Pick any a ∈ U , then we can find
Br (a) ⊆ U such that fn → f uniformly inside it. Then for any w ∈ B(a, r/2) we
have
$$|f_n'(w) - f'(w)| = \left|\frac{1}{2\pi i}\int_{\partial \bar{B}(w;r/2)} \frac{f_n(z) - f(z)}{(z - w)^2}\,\mathrm{d}z\right| \le \mathrm{length}(\partial \bar{B}(w; r/2))\,\frac{\sup_{z \in B_r(a)} |f_n(z) - f(z)|}{2\pi (r/2)^2}$$
$$= \frac{2}{r} \sup_{z \in B_r(a)} |f_n(z) - f(z)| \to 0 \quad \text{as } n \to \infty$$
D. 10-55
• Let f : B(a, r) → C be holomorphic. Then we can write f (z) = Σ_{n=0}^∞ cn (z − a)n
as a convergent power series. Then either all cn = 0, in which case f = 0 on
B(a, r), or there is a least N such that cN ≠ 0 (N is the smallest n such that
f (n) (a) ≠ 0). If N > 0, then we say f has a zero (root) of order N . A zero of
order one is called a simple zero .
E. 10-56
If f has a zero of order N at a, then we can write f (z) = (z − a)N g(z) on B(a, r),
where g(a) = cN 6= 0. Often, it is not the actual order that is too important.
Instead, it is the ability to factor f in this way.
• sinh z has zeros where ½(e^z − e^{−z}) = 0, ie. e^{2z} = 1, ie. z = nπi, where n ∈ Z.
The zeros are all simple, since cosh(nπi) = cos(nπ) ≠ 0.
Since sinh z has a simple zero at z = πi, we know sinh3 z has a zero of order
3 there. This is since the first term of the Taylor series of sinh z about z = πi
has order 1, and hence the first term of the Taylor series of sinh3 z has order 3.
We can also find the Taylor series about πi by writing ζ = z − πi:
$$\sinh^3 z = (\sinh(\zeta + \pi i))^3 = (-\sinh \zeta)^3 = -\left(\zeta + \frac{\zeta^3}{3!} + \cdots\right)^3 = -\zeta^3 - \frac{1}{2}\zeta^5 - \cdots$$
$$= -(z - \pi i)^3 - \frac{1}{2}(z - \pi i)^5 - \cdots.$$
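The first two terms of this expansion can be confirmed numerically. A sketch (the small offset ζ = 0.01 is an arbitrary choice): near z = πi, sinh³z should agree with −ζ³ − ½ζ⁵ up to a relative error of order ζ⁴.

```python
import cmath

a = 1j * cmath.pi        # sinh has a simple zero here, so sinh^3 has order 3
zeta = 0.01              # small offset from the zero

exact = cmath.sinh(a + zeta) ** 3
approx = -zeta ** 3 - 0.5 * zeta ** 5   # first two terms of the Taylor series

# The next term in the series is O(zeta^7), so the relative error
# compared with the leading O(zeta^3) term should be O(zeta^4).
rel_err = abs(exact - approx) / abs(exact)
```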
L. 10-57
<Principle of isolated zeroes> Let f : B(a, r) → C be holomorphic and not
identically zero. Then there exists some 0 < ρ < r such that f (z) ≠ 0 in the
punctured neighbourhood B(a, ρ) \ {a}.
Consider the function h(z) = f (z) − g(z). Then the hypothesis says h(z) has a
non-isolated zero at w, ie. there is no punctured neighbourhood of w on which
h is nowhere zero. By the previous lemma, this means there is some ρ > 0 such that
h = 0 on B(w, ρ) ⊆ U . Now let
So we have
$$|\zeta(z) - P_n| = \left|\sum_{m \notin S_n} m^{-z}\right| \le \sum_{m = p_n + 1}^{\infty} |m^{-z}| \le \sum_{m = p_n + 1}^{\infty} m^{-\operatorname{Re} z} \to 0 \quad \text{as } n \to \infty.$$
If there were only finitely many primes, then Sn = N for some n, and so we
would have ζ(z) = Pn = Π_{i=1}^n (1 − p_i^{−z})^{−1}, which is well-defined at z = 1 since
this is a finite product. Hence, the fact that ζ blows up at z = 1 implies that
there are infinitely many primes.
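The Euler product underlying this argument can be illustrated numerically. A sketch (the prime cutoff and the truncation of the sum are arbitrary): at z = 2, the product of (1 − p^{−z})^{−1} over small primes is close to ζ(2) = π²/6.

```python
import math

def primes_up_to(n):
    # Simple sieve of Eratosthenes
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [p for p in range(2, n + 1) if sieve[p]]

z = 2.0
euler_product = 1.0
for p in primes_up_to(1000):
    euler_product *= 1.0 / (1.0 - p ** (-z))

zeta_2 = math.pi ** 2 / 6
partial_sum = sum(m ** (-z) for m in range(1, 1_000_000))

gap_product = abs(euler_product - zeta_2)  # only primes > 1000 are missing
gap_sum = abs(partial_sum - zeta_2)        # truncation error of the sum
```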
L. 10-60
Let f : U → C be holomorphic on an open set U ⊆ C. If |f | is constant, then f
must also be constant.
P. 10-61
1. <Local maximum principle> Let U be a domain and f : U → C be
holomorphic. If for some z ∈ U we have |f (w)| ≤ |f (z)| for all w ∈ B(z; r) ⊆ U ,
then f is constant. In other words, a non-constant holomorphic function cannot
achieve an interior local maximum.
2. <Global maximum principle> Let U be a bounded domain and Ū its closure. If
f : Ū → C is a continuous function that is holomorphic on U , then |f | achieves
its maximum on the boundary ∂U = Ū \ U .
In the case (z − z0 )f (z) → 0 as z → z0 we also have the same conclusion that the
LHS tends to 0. So in fact h is also differentiable at z0 , and h(z0 ) = h′(z0 ) = 0.
So near z0 , h has a Taylor series h(z) = Σ_{n≥0} an (z − z0 )n . We know a0 = a1 = 0.
Now define g(z) by
$$g(z) = \sum_{n \ge 0} a_{n+2}(z - z_0)^n,$$
defined on some ball B(z0 , ρ), where the Taylor series for h is defined, (and equal
to f elsewhere). By construction, on the punctured ball B(z0 , ρ) \ {z0 }, we get
g(z) = f (z). Moreover, g(z) → a2 as z → z0 . So f (z) → a2 as z → z0 . Since g is
a power series, it is holomorphic at z0 (and hence on U ).
Singularities of holomorphic functions are places where the function is not defined.
There are many ways a function can be ill-defined. For example, if we write
$$f(z) = \frac{1 - z}{1 - z},$$
then on the face of it, this function is not defined at z = 1. However, elsewhere,
f is just the constant function 1, and we might as well define f (1) = 1. Then we
get a holomorphic function. These are rather silly singularities, and are singular
solely because we were not bothered to define f there. Some singularities however
are genuinely singular. For example, the function
$$f(z) = \frac{1}{1 - z}$$
is actually singular at z = 1, since f is unbounded near the point. It turns out
these are the only possibilities. This result tells us the only way for a function to
fail to be holomorphic at an isolated point is for it to blow up near the point. It
cannot fail by being discontinuous in some other weird way.
However, we are not yet done with our classification. There are many ways in
which things can blow up. We can further classify these into two cases — the
case where |f (z)| → ∞ as z → z0 , and the case where |f (z)| does not converge as
z → z0 . It happens that the first case is almost just as boring as the removable
ones.
P. 10-63
Let U be a domain, z0 ∈ U and f : U \ {z0 } → C be holomorphic. Suppose
|f (z)| → ∞ as z → z0 . Then there is a unique k ∈ Z≥1 and a unique holomorphic
function g : U → C such that g(z0 ) ≠ 0, and
$$f(z) = \frac{g(z)}{(z - z_0)^k}.$$
We shall construct g near z0 in some small neighbourhood, and then apply analytic
continuation to the whole of U . The idea is that since f (z) blows up nicely as
z → z0 , we know 1/f (z) behaves sensibly near z0 . We pick some δ > 0 such
that |f (z)| ≥ 1 for all z ∈ B(z0 ; δ) \ {z0 }. In particular, f (z) is non-zero on
B(z0 ; δ) \ {z0 }. So we can define
$$h(z) = \begin{cases} \dfrac{1}{f(z)} & z \in B(z_0; \delta) \setminus \{z_0\} \\ 0 & z = z_0. \end{cases}$$
g was initially defined on B(z0 ; ε) → C, but now this expression certainly makes
sense on all of U . So g admits an analytic continuation from B(z0 ; ε) to U .
D. 10-64
• Let U be a domain and f a complex-valued function defined on some subset of U .
We say a point z0 ∈ U is an isolated singularity of f if there exist some open disc
(ball) Bε (z0 ) such that f is defined and holomorphic on Bε (z0 ) \ {z0 } but not on
Bε (z0 ). Such a singularity is called
removable singularity if f is bounded near z0 , or equivalently if there exists a
holomorphic function g : Bε (z0 ) → C such that f = g on Bε (z0 ) \ {z0 }.
pole if |f (z)| → ∞ as z → z0 , or equivalently if on Bε (z0 ) \ {z0 } one can write
$$f(z) = \frac{g(z)}{(z - z_0)^k} \qquad \text{where } g : B_\varepsilon(z_0) \to \mathbb{C} \text{ is holomorphic with } g(z_0) \ne 0.$$
essential singularity if the singularity is neither removable nor a pole.
This is then a “continuous” function. So the singularity is just a point that gets
mapped to the point ∞. The point infinity is not a special point in the Riemann
sphere. Similarly, poles are also not really singularities from the viewpoint of the
Riemann sphere. It's just that we are looking at it in the wrong way. Indeed, if we
change coordinates on the Riemann sphere so that we label each point w ∈ CP1
by w′ = 1/w instead, then f just maps z0 to 0 under the new coordinate system.
In particular, at the point z0 , we find that f is holomorphic and has an innocent
zero of order k.
Since poles are not bad, we might as well allow them, so we talk about meromorphic
functions. Note that the requirement that S is discrete is so that each pole in S
is actually an isolated singularity.
A rational function P (z)/Q(z), where P, Q are polynomials, is holomorphic on C \ {z :
Q(z) = 0}, and meromorphic on C. More is true — it is in fact holomorphic as a
function CP1 → CP1 !
T. 10-67
<Casorati-Weierstrass theorem> Let U be a domain, z ∈ U , and suppose
f : U \ {z} → C has an essential singularity at z. Then for all w ∈ C, there
is a sequence zn → z such that f (zn ) → w. In other words, on any punctured
neighbourhood B(z; ε) \ {z}, the image of f is dense in C.
So essential singularities are very bad. In fact it’s actually worse than that. The
theorem only tells us the image is dense, but not that we will hit every point. It
is in fact not true that every point will get hit. For example, e^{1/z} can never be zero.
However, Picard’s theorem says that this is the worst we can get.
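For e^{1/z} the density is completely explicit, since e^{1/z} = w can be solved exactly. A sketch (the target value w is an arbitrary nonzero number): the solutions z_k = 1/(log w + 2πik) tend to 0 while e^{1/z_k} remains equal to w.

```python
import cmath

w = 2.0 - 3.0j   # an arbitrary nonzero target value
# Solutions of e^(1/z) = w: 1/z = log(w) + 2*pi*i*k, so
# z_k = 1/(log(w) + 2*pi*i*k) -> 0 as k -> infinity.
zs = [1 / (cmath.log(w) + 2j * cmath.pi * k) for k in range(1, 30)]

closest = abs(zs[-1])                                # z_29 is very near 0
worst = max(abs(cmath.exp(1 / z) - w) for z in zs)   # all values equal w
```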
T. 10-68
<Picard’s theorem> If f has an isolated essential singularity at z0 , then there
is some b ∈ C such that on each punctured neighbourhood B(z0 ; ε) \ {z0 }, the
image of f contains C \ {b}.
T. 10-69
<Laurent series> Let 0 ≤ r < R < ∞, and let A = {z ∈ C : r < |z − a| < R}.
If f : A → C is holomorphic, then f has a (unique) convergent series expansion
$$f(z) = \sum_{n=-\infty}^{\infty} c_n (z - a)^n \qquad \text{where} \qquad c_n = \frac{1}{2\pi i}\int_{\partial \bar{B}(a,\rho)} \frac{f(z)}{(z - a)^{n+1}}\,\mathrm{d}z$$
with ρ ∈ (r, R). Moreover, the series converges uniformly on compact subsets of
A.
As in the first proof of the Cauchy integral formula, we make the following expansions:
for the first integral, we have |w − a| < |z − a|. So
$$\frac{1}{z - w} = \frac{1}{z - a} \cdot \frac{1}{1 - \left(\frac{w - a}{z - a}\right)} = \sum_{n=0}^{\infty} \frac{(w - a)^n}{(z - a)^{n+1}},$$
which is uniformly convergent for z ∈ ∂B(a, ρ′′). For the second integral, we have
|w − a| > |z − a|. So
$$\frac{-1}{z - w} = \frac{1}{w - a} \cdot \frac{1}{1 - \left(\frac{z - a}{w - a}\right)} = \sum_{m=1}^{\infty} \frac{(z - a)^{m-1}}{(w - a)^m},$$
for the integrals c̃n . However, some of the coefficients are integrals around the ρ′′
circle, while the others are around the ρ′ circle. This is not a problem. For any
r < ρ < R, these circles are elementary deformations of |z − a| = ρ inside the
annulus A. So
$$\int_{\partial B(a,\rho)} \frac{f(z)}{(z - a)^{n+1}}\,\mathrm{d}z$$
is independent of the choice of ρ ∈ (r, R).
$$c_k = \frac{1}{2\pi i}\int_{\partial B(a,\rho)} \frac{f(z)}{(z - a)^{k+1}}\,\mathrm{d}z = \frac{1}{2\pi i}\int_{\partial B(a,\rho)}\left(\sum_{n \in \mathbb{Z}} b_n (z - a)^{n-k-1}\right)\mathrm{d}z = \frac{1}{2\pi i}\sum_{n \in \mathbb{Z}}\int_{\partial B(a,\rho)} b_n (z - a)^{n-k-1}\,\mathrm{d}z = b_k.$$
• We know sin z = z − z³/3! + z⁵/5! − · · · defines a holomorphic function, with a radius
of convergence of ∞. Now consider cosec z = 1/ sin z, which is holomorphic
except for z = kπ, with k ∈ Z. So cosec z has a Laurent series near z = 0.
Using
$$\sin z = z\left(1 - \frac{z^2}{6} + O(z^4)\right) \qquad \text{we get} \qquad \operatorname{cosec} z = \frac{1}{z}\left(1 + \frac{z^2}{6} + O(z^4)\right).$$
From this, we can read off that the Laurent series has cn = 0 for all n ≤ −2,
c−1 = 1, c1 = 1/6. If we want, we can go further, but we already see that cosec
has a simple pole at z = 0. By periodicity, cosec has a simple pole at all other
singularities.
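These coefficients can be recovered numerically from the integral formula of [T.10-69]. A sketch (the radius r = 1 < π and the sample count are our choices): we approximate c_n = (1/2πi)∮_{|z|=r} cosec(z)/z^{n+1} dz by a midpoint rule.

```python
import cmath

def laurent_coeff(f, n, r=1.0, m=6000):
    # c_n = 1/(2*pi*i) * integral of f(z)/z^(n+1) dz over |z| = r,
    # approximated by a midpoint rule on the circle.
    total = 0j
    for k in range(m):
        t = 2 * cmath.pi * (k + 0.5) / m
        z = r * cmath.exp(1j * t)
        dz = 1j * z * (2 * cmath.pi / m)
        total += f(z) / z ** (n + 1) * dz
    return total / (2j * cmath.pi)

cosec = lambda z: 1 / cmath.sin(z)

c_m2 = laurent_coeff(cosec, -2)   # expect 0
c_m1 = laurent_coeff(cosec, -1)   # expect 1 (simple pole, residue 1)
c_1 = laurent_coeff(cosec, 1)     # expect 1/6
```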
• Consider instead
$$\sin\frac{1}{z} = \frac{1}{z} - \frac{1}{3!\,z^3} + \frac{1}{5!\,z^5} - \cdots.$$
We see this is holomorphic on C∗ , with cn 6= 0 for infinitely many n < 0. So
this has an isolated essential singularity.
• Consider cosec(1/z). This has singularities at z = 1/(kπ) for k ∈ N = {1, 2, 3, · · · }, which accumulate at z = 0, so the singularity at z = 0 is not isolated.
• Now consider
$$f(z) = \frac{e^z}{z^2 - 1}.$$
This has a singularity at z0 = 1 but is holomorphic in an annulus 0 < |z−z0 | < 2
(the 2 comes from the other singularity at z = −1). How do we find its Laurent
series? This is a standard trick that turns out to be useful — we write everything
in terms of ζ = z − z0 . Then
$$f(z) = \frac{e^\zeta e^{z_0}}{\zeta(\zeta + 2)} = \frac{e^{z_0} e^\zeta}{2\zeta}\left(1 + \frac{\zeta}{2}\right)^{-1} = \frac{e^{z_0}}{2\zeta}\left(1 + \zeta + \frac{\zeta^2}{2!} + \cdots\right)\left(1 - \frac{\zeta}{2} + \cdots\right)$$
$$= \frac{e^{z_0}}{2\zeta}\left(1 + \frac{\zeta}{2} + \cdots\right) = \frac{e^{z_0}}{2}\,\frac{1}{z - z_0} + \frac{e^{z_0}}{4} + \cdots.$$
• f (z) = z −1/2 has no Laurent series about 0. The reason is that the required
branch cut of z −1/2 would pass through any annulus about z = 0. So we cannot
find an annulus on which f is holomorphic.
• Consider z 2 /((z − 1)3 (z − i)2 ). This has a double pole at z = i and a triple pole
at z = 1. To show formally that, for instance, there is a double pole at z = i,
notice first that z 2 /(z − 1)3 is analytic at z = i. So it has a Taylor series, say,
b0 + b1 (z − i) + b2 (z − i)2 + · · ·
$$g(z) = (z - z_0)^N G(z)$$
for some G with G(z0 ) ≠ 0. Then 1/G(z) has a Taylor series about z0 , and then
the result follows.
E. 10-71
<Series summation> We claim that the following two functions are equal and
holomorphic on C \ Z:
$$f(z) = \sum_{n=-\infty}^{\infty} \frac{1}{(z - n)^2}, \qquad g(z) = \frac{\pi^2}{\sin^2(\pi z)}.$$
Our strategy is as follows — we first show that f (z) converges and is holomorphic,
which is not hard, given the Weierstrass M -test and Morera’s theorem. To show
that indeed we have f (z) = g(z), we first show that they have equal principal
part, so that f (z) − g(z) is entire. We then show it is zero by proving f − g is
bounded, hence constant, and that f (z)−g(z) → 0 as z → ∞ (in some appropriate
direction).
For any fixed w ∈ C \ Z, we can compare the series with Σ 1/n² and apply the Weierstrass
M -test. We pick r > 0 such that |w − n| > 2r for all n ∈ Z. Then for all
z ∈ B(w; r), we have |z − n| ≥ max{r, |n − |w| − r|}. Hence
$$\frac{1}{|z - n|^2} \le \min\left\{\frac{1}{r^2}, \frac{1}{(n - |w| - r)^2}\right\} = M_n.$$
By comparison to Σ 1/n², we know Σn Mn converges. So by the Weierstrass M -
test, we know our series converges uniformly on B(w, r). We see that f |B(w,r) is a
uniform limit of the holomorphic functions Σ_{n=−N}^{N} 1/(z − n)², and hence holomorphic
on B(w, r). Since w was arbitrary, we know f is holomorphic on C \ Z. Note that
on B(w, r). Since w was arbitrary, we know f is holomorphic on C \ Z. Note that
we do not say the sum converges uniformly on C \ Z. It’s just that for any point
w ∈ C \ Z, there is a small neighbourhood of w on which the sum is uniformly
convergent, and this is sufficient to apply [P.10-54].
For the second part, note that f is periodic, since f (z + 1) = f (z). Also, at 0,
f has a double pole, since f (z) = 1/z² + (holomorphic stuff) near z = 0. So f has a
double pole at each k ∈ Z. Note that 1/sin²(πz) also has a double pole at each
k ∈ Z.
Now, consider the principal parts of our functions — at k ∈ Z, f (z) has principal
part 1/(z − k)². Looking at our previous Laurent series for cosec(z), we see that
$$g(z) = \left(\frac{\pi}{\sin \pi z}\right)^2 \implies \lim_{z \to 0} z^2 g(z) = 1.$$
So g(z) must have the same principal part at 0 and hence at k for all k ∈ Z. Thus
h(z) = f (z) − g(z) is holomorphic on C \ Z. However, since its principal part
vanishes at the integers, it has at worst a removable singularity. Removing the
singularity, we know h(z) is entire.
Now we will show that h(z) = 0. We first show it is bounded. We know f
and g are both periodic with period 1. So it suffices to focus attention on the strip
−1/2 ≤ x = Re(z) ≤ 1/2. To show h is bounded on this strip, it suffices, by continuity,
to show that h(x + iy) → 0 as y → ±∞. To do so, we show that f and
g both vanish as y → ∞. We set z = x + iy, with |x| ≤ 1/2. Then we have
$$|g(z)| \le \frac{4\pi^2}{(e^{\pi y} - e^{-\pi y})^2} \to 0 \quad \text{as } y \to \infty,$$
$$|f(z)| \le \sum_{n \in \mathbb{Z}} \frac{1}{|x + iy - n|^2} \le \frac{1}{y^2} + 2\sum_{n=1}^{\infty} \frac{1}{(n - \frac{1}{2})^2 + y^2} \to 0 \quad \text{as } y \to \infty.$$
446 CHAPTER 10. COMPLEX ANALYSIS AND METHODS
Recall that the principal branch of log, and hence of the argument Im(log), takes
values in (−π, π) and is defined on C \ R≤0 . If γ(t) always lay in, say, the
right-hand half plane, we would have no problem defining θ consistently, since we
can just let θ(t) = arg(γ(t)) for arg the principal branch.
There is nothing special about the right-hand half plane. Similarly, if γ lies in
$$\left\{ z : \operatorname{Re}\left(\frac{z}{e^{i\alpha}}\right) > 0 \right\}$$
for a fixed α, we can define
$$\theta(t) = \alpha + \arg\left(\frac{\gamma(t)}{e^{i\alpha}}\right).$$
So we define θj : [aj−1 , aj ] → R such that γ(t) = $e^{i\theta_j(t)}$ for t ∈ [aj−1 , aj ], and
1 ≤ j ≤ n − 1. On each interval [aj−1 , aj ], this gives a continuous argument function.
We cannot immediately extend this to the whole of [a, b], since it is entirely possible
that θj (aj ) ≠ θj+1 (aj ). However, we do know that both are values of the
argument of γ(aj ). So they must differ by an integer multiple of 2π, say 2nπ.
Then we can just replace θj+1 by θj+1 − 2nπ, which is an equally valid argument
function, and then the two functions will agree at aj . Hence, for j > 1, we can
successively re-define θj such that the resulting map θ is continuous.
⁷ Of course, at each point t, we can find r and θ such that the above holds. The key point of the
lemma is that we can do so continuously.
10.4. RESIDUE CALCULUS 447
Let $h(t) = \int_0^t \gamma'(s)/\gamma(s)\, ds$; then $h'(t) = \gamma'(t)/\gamma(t)$. Now
$$0 = \frac{d}{dt}\left(\gamma e^{-h}\right) = \gamma' e^{-h} - \gamma h' e^{-h} = e^{-h}(\gamma' - \gamma h'),$$
where by uniform convergence we can swap the integral and sum, and then all the
other terms vanish since they have an anti-derivative. Indeed, by definition of the
Laurent coefficients, $\int_{\partial \bar B(a,\rho)} f(z)\, dz = 2\pi i c_{-1}$. So we can alternatively write the
residue as
$$\operatorname{Res}(f, a) = \frac{1}{2\pi i} \int_{\partial \bar B(a,\rho)} f(z)\, dz.$$
This gives us a formulation of the residue without reference to the Laurent se-
ries. Deforming paths if necessary, it is not too far-fetched to imagine that for
any simple curve γ around the singularity a, we have $\int_\gamma f(z)\, dz = 2\pi i \operatorname{Res}(f, a)$.
Moreover, if the path actually encircles two singularities a and b, then deforming
the path, we would expect to have $\int_\gamma f(z)\, dz = 2\pi i(\operatorname{Res}(f, a) + \operatorname{Res}(f, b)),$
and this generalizes to multiple singularities in the obvious way. If this were true,
then it would be very helpful, since this turns integration into addition, which is
(hopefully) much easier!
Indeed, we will soon prove that this result holds. However, we first get rid of
the technical restriction that we only work with simple (ie. non-self-intersecting)
curves. This restriction is not actually needed. We are not really worried about the
curve intersecting itself. The reason why we’ve always talked about simple closed
curves is that we want to avoid the curve going around the same point many times.
There is a simple workaround to this problem — we consider arbitrary curves, and
then count how many times we are looping around the point. If we are looping
around it twice, then we count its contribution twice!
So we define what it means for a curve to loop around a point n times, called the
winding number. There are many ways we can define the winding number. The
definition we pick is based on the following observation — suppose, for convenience,
that the point in question is the origin. As we move along a simple closed curve
around 0, our argument will change. If we keep track of our argument continuously,
then we will find that when we return to the starting point, the argument will have
increased by 2π. If we have a curve that winds around the point twice, then our
argument will increase by 4π. What we do is exactly this — given a path, find
a continuous function that gives the “argument” of the path, and then define the
winding number to be the difference between the argument at the start and end
points, divided by 2π.
Note that we always have I(γ, w) ∈ Z, since θ(b) and θ(a) are arguments of
the same number. More importantly, I(γ, w) is well-defined — suppose γ(t) =
r(t)eiθ1 (t) = r(t)eiθ2 (t) for continuous functions θ1 , θ2 : [a, b] → R. Then θ1 − θ2 :
[a, b] → R is continuous, but takes values in the discrete set 2πZ. So it must in
fact be constant, and thus θ1 (b) − θ1 (a) = θ2 (b) − θ2 (a).
L. 10-75
Suppose γ : [a, b] → C is a piecewise C 1 -smooth closed path, and w ∉ image(γ).
Then
$$I(\gamma, w) = \frac{1}{2\pi i} \int_\gamma \frac{1}{z - w}\, dz.$$
In some books, this integral expression is taken as the definition of the winding
number. While this is elegant in complex analysis, it is not clear a priori that this
is an integer, and it only works for piecewise C 1 -smooth closed curves, not arbitrary
continuous closed curves.
On the other hand, what is evident from this expression is that I(γ, w) is contin-
uous as a function of w ∈ C \ image(γ), since it is even holomorphic as a function
of w. Since I(γ; w) is integer valued, I(γ; w) must be locally constant on path
components of C \ image(γ).
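Since the winding number now has an integral formula, it can be computed numerically. A small sketch of my own (the helper `winding_number` is hypothetical, not from the notes), evaluating $(1/2\pi i)\oint dz/(z-w)$ with the midpoint rule for a circle traversed twice:

```python
import cmath

def winding_number(gamma, w, n=20000):
    # Approximate (1/(2*pi*i)) * integral of dz/(z - w) along gamma : [0,1] -> C
    total = 0j
    for k in range(n):
        z0 = gamma(k / n)
        z1 = gamma((k + 1) / n)
        zm = gamma((k + 0.5) / n)     # midpoint rule
        total += (z1 - z0) / (zm - w)
    return total / (2j * cmath.pi)

# Unit circle traversed twice as t runs over [0, 1]
double_loop = lambda t: cmath.exp(4j * cmath.pi * t)

print(round(winding_number(double_loop, 0).real))   # winds twice around 0
print(round(winding_number(double_loop, 3).real))   # 3 is outside, so 0
```

This illustrates both integer-valuedness and the vanishing of the winding number in the unbounded component.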
We can quickly verify that this is a sensible definition, in that the winding number
around a point “outside” the curve is zero. More precisely, since image(γ) is
compact (so contained in some B̄r (0)), all points of sufficiently large modulus in C
belong to one component of C \ image(γ). This is indeed the only path component
of C \ image(γ) that is unbounded.
To find the winding number about a point in this unbounded component, note
that I(γ; w) is constant on this component, and so we can consider arbitrarily
large w. By the integral formula,
$$|I(\gamma, w)| \le \frac{1}{2\pi} \operatorname{length}(\gamma) \max_{z \in \gamma} \frac{1}{|w - z|} \to 0 \quad \text{as } w \to \infty.$$
So it does vanish outside the curve. Alternatively, one could show this by noting
that if w is outside of B̄r (0), we can draw a line through w such that B̄r (0) is
entirely on one side of the line; hence we can define a branch of log so that 1/(z − w)
has an anti-derivative, and so the integral vanishes. Of course, inside the other path
components, we can still have some interesting values of the winding number.
D. 10-76
Let U ⊆ C be a domain, and let φ : [a, b] → U and ψ : [a, b] → U be piecewise
C 1 -smooth closed paths. A homotopy from φ to ψ is a continuous map F :
[0, 1] × [a, b] → U such that F (0, t) = φ(t) and F (1, t) = ψ(t), and moreover for all
s ∈ [0, 1] the map t ↦ F (s, t) viewed as a map [a, b] → U is closed and piecewise
C 1 -smooth.
E. 10-77
The idea now is to define a more general and natural notion of deforming a curve,
known as “homotopy”. We will then show that each homotopy can be given by
a sequence of elementary deformations. So homotopies also preserve integrals
of holomorphic functions. We can imagine this as a process of “continuously
deforming” the path φ to ψ, with a path F (s, · ) at each point in time s ∈ [0, 1].
P. 10-78
Let φ, ψ : [a, b] → U be homotopic (piecewise C 1 ) closed paths in a domain U .
1. There exists a sequence of paths φ = φ0 , φ1 , · · · , φN = ψ such that each φj is
piecewise C 1 closed and φi+1 is obtained from φi by elementary deformation.
2. If f : U → C is holomorphic, then $\int_\phi f(z)\, dz = \int_\psi f(z)\, dz$.
This is in fact equivalent to our definition of simple connectedness, that every con-
tinuous map S¹ → U can be extended to a continuous map D² → U , which is
equivalent to saying that any loop can be continuously shrunk to a point. This
result is basically that, except that we now only work with piecewise C¹ paths
(instead of general continuous paths). We can show their equivalence by approx-
imating any continuous curve with a piecewise C¹-smooth one, but we shall not
do that here.
In fact U being simply connected is also equivalent to the following:
1. I(γ, w) = 0 for any closed curve γ in U and any w ∉ U .
2. The complement of U in the extended complex plane C∞ is connected.
P. 10-80
<Cauchy’s theorem> Let U be a simply connected domain, and let f : U → C
be holomorphic. If γ is any piecewise C¹-smooth closed curve in U , then
$$\int_\gamma f(z)\, dz = 0.$$
γ is homotopic to the constant path, and the integral along a constant path is
zero.
We will sometimes refer to this theorem as “simply-connected Cauchy”. This
theorem in fact also follows from Green’s theorem (that is if we can prove Green’s
theorem rigorously). Recall Green’s theorem: Let ∂S be a positively oriented,
piecewise smooth, simple closed curve in a plane, and let S be the region bounded
by ∂S. If P and Q are functions of (x, y) defined on an open region containing S
and have continuous partial derivatives there, then
$$\oint_{\partial S} (P\, dx + Q\, dy) = \iint_S \left(\frac{\partial Q}{\partial x} - \frac{\partial P}{\partial y}\right) dx\, dy.$$
Let u, v be the real and imaginary parts of f . Then
$$\oint_\gamma f(z)\, dz = \oint_\gamma (u + iv)(dx + i\, dy) = \oint_\gamma (u\, dx - v\, dy) + i \oint_\gamma (v\, dx + u\, dy)$$
$$= \iint_S \left(-\frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}\right) dx\, dy + i \iint_S \left(\frac{\partial u}{\partial x} - \frac{\partial v}{\partial y}\right) dx\, dy.$$
But both integrands vanish by the Cauchy–Riemann equations, since f is differ-
entiable throughout S. So the result follows. This proof relies on u and v having
continuous partial derivatives. We know this is true since a holomorphic func-
tion f is infinitely differentiable; however, our proof that holomorphic functions are
infinitely differentiable utilizes Cauchy’s theorem! So we would still have to do
most of the work we have done before, even if we assumed Green’s theorem.
One useful consequence of Cauchy’s theorem is that we can freely deform contours
along regions where f is holomorphic without changing the value of the integral.
More precisely, if γ1 and γ2 are contours from a to b, and f is holomorphic
on the contours and between the contours, then $\int_{\gamma_1} f(z)\, dz = \int_{\gamma_2} f(z)\, dz$. In
particular, if f is a holomorphic function defined on a simply connected domain,
then $\int_a^b f(z)\, dz$ does not depend on the chosen contour. This result of path
independence is very much related to viewing $\int f(z)\, dz$ as a path integral in R².
At each $z_i$, f has a Laurent expansion $f(z) = \sum_{n \in \mathbb{Z}} c_n^{(i)} (z - z_i)^n$ valid in some
punctured neighbourhood of $z_i$.
For each j, we use uniform convergence of the series $\sum_{n \le -1} c_n^{(j)} (z - z_j)^n$ on com-
pact subsets of U \ {zj }, and hence on γ, to write
$$\int_\gamma g_j(z)\, dz = \sum_{n \le -1} c_n^{(j)} \int_\gamma (z - z_j)^n\, dz = c_{-1}^{(j)} \int_\gamma \frac{1}{z - z_j}\, dz.$$
The last equality holds since for n ≠ −1, the function (z − zj )ⁿ has an antiderivative,
and hence the integral around γ vanishes. But $c_{-1}^{(j)}$ is by definition the residue of
f at zj , and the integral is just the integral definition of the winding number (up
to a factor of 2πi). So we get
$$\int_\gamma f(z)\, dz = 2\pi i \sum_{j=1}^{k} \operatorname{Res}(f; z_j)\, I(\gamma, z_j).$$
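As a quick numerical illustration of this formula (my own sketch, not from the notes), take $f(z) = e^z/(z(z-2))$, which has residues $-1/2$ at $0$ and $e^2/2$ at $2$, and integrate over circles of radius 1 and 3 about the origin:

```python
import cmath

def circle_integral(f, radius, n=4000):
    # Midpoint-rule approximation of the contour integral of f over |z| = radius
    total = 0j
    step = 2 * cmath.pi / n
    for k in range(n):
        t = (k + 0.5) * step
        z = radius * cmath.exp(1j * t)
        total += f(z) * 1j * z * step     # dz = i z dt on the circle
    return total

f = lambda z: cmath.exp(z) / (z * (z - 2))
small = circle_integral(f, 1.0)   # encloses only the pole at 0
big = circle_integral(f, 3.0)     # encloses both poles

print(abs(small - 2j * cmath.pi * (-0.5)))
print(abs(big - 2j * cmath.pi * (-0.5 + cmath.exp(2) / 2)))
```

Both winding numbers are +1 here, so the integrals are just 2πi times the sums of enclosed residues.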
The Cauchy integral formula and simply-connected Cauchy are special cases of
this. This is in some sense a mix of all the results we’ve previously had. Simply-
connected Cauchy tells us the integral of a holomorphic f around a closed curve
depends only on its homotopy class, ie. we can deform curves by homotopy and
this preserves the integral. This means the value of the integral really only depends
on the “holes” enclosed by the curve.
We also had the Cauchy integral formula. This says if f : B(a, r) → C is holomor-
phic, w ∈ B(a, ρ) and ρ < r, then
$$f(w) = \frac{1}{2\pi i} \int_{\partial \bar B(a,\rho)} \frac{f(z)}{z - w}\, dz.$$
Note that f (w) also happens to be the residue of the function f (z)/(z − w). So this
really says if g has a simple pole at a inside the region bounded by a simple closed
curve γ, then
$$\frac{1}{2\pi i} \int_\gamma g(z)\, dz = \operatorname{Res}(g, a).$$
Cauchy’s residue theorem says the result holds for any type of singularity,
and any number of singularities.
We can picture a simple case of our proof as follows: Consider a simple curve
encircling singularities zi .
(Diagram: a simple curve γ encircling singularities z1 , z2 , z3 , together with the
modified curve γ̂ consisting of small circles joined to the outer curve by cross cuts.)
Consider the simple curve γ̂, consisting of small clockwise circles γ1 , · · · , γn around
each singularity; cross cuts, which cancel in pairs in the limit as they approach each
other; and the large outer curve (which is the same as γ in the limit). Note that
γ̂ encircles no singularities. So $\oint_{\hat\gamma} f(z)\, dz = 0$ by Cauchy’s theorem. So in the
limit when the cross cuts cancel, we have
$$\oint_\gamma f(z)\, dz + \sum_{k=1}^{n} \oint_{\gamma_k} f(z)\, dz = \oint_{\hat\gamma} f(z)\, dz = 0.$$
But from what we did in the previous section, we know $\oint_{\gamma_k} f(z)\, dz = -2\pi i \operatorname{Res}(f, z_k)$
(since γk encircles only one singularity, and we get a negative sign since γk is a
clockwise contour). So $\oint_\gamma f(z)\, dz = \sum_{k=1}^{n} 2\pi i \operatorname{Res}(f, z_k)$.
L. 10-82
Let f : U \ {a} → C be holomorphic with a pole at a, i.e. f is meromorphic on U .
1. If the pole is simple, then Res(f, a) = limz→a (z − a)f (z).
2. If there exist g, h holomorphic on some B(a, ε) with g(a) ≠ 0 and h with
a simple zero at a, such that
$$f(z) = \frac{g(z)}{h(z)}, \quad \text{then} \quad \operatorname{Res}(f, a) = \frac{g(a)}{h'(a)}.$$
• Consider h(z) = (z⁸ − w⁸)⁻¹, for any complex constant w. We know this has 8
simple poles at z = $we^{n\pi i/4}$ for n = 0, · · · , 7. What is the residue at z = w? We
can try to compute this directly by
$$\operatorname{Res}(h, w) = \lim_{z \to w} \frac{z - w}{(z - w)(z - we^{i\pi/4}) \cdots (z - we^{7\pi i/4})}$$
$$= \frac{1}{(w - we^{i\pi/4}) \cdots (w - we^{7\pi i/4})} = \frac{1}{w^7 (1 - e^{i\pi/4}) \cdots (1 - e^{7i\pi/4})}.$$
Now we are quite stuck. We don’t know what to do with this. We can think really
hard about complex numbers and figure out what it should be, but this is difficult.
What we should do is to apply L’Hôpital’s rule and obtain
$$\operatorname{Res}(h, w) = \lim_{z \to w} \frac{z - w}{z^8 - w^8} = \lim_{z \to w} \frac{1}{8z^7} = \frac{1}{8w^7}.$$
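A one-line numerical check of this residue (my own sketch, not from the notes): approximate Res(h, w) = lim_{z→w}(z − w)h(z) by taking z very close to w.

```python
w = 1.3 + 0.4j
h = lambda z: 1 / (z ** 8 - w ** 8)

z = w + 1e-7
numeric = (z - w) * h(z)          # approximates Res(h, w)
exact = 1 / (8 * w ** 7)          # the L'Hopital answer
print(abs(numeric - exact) / abs(exact))
```

The relative error is of order |z − w|, consistent with the simple-pole formula.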
• Consider the function (sinh πz)⁻¹. This has a simple pole at z = ni for all integers
n (because the zeros of sinh z are at nπi and are simple). Again, we could compute
this by finding the Laurent expansion. However, it turns out it is easier to use our
magic formula together with L’Hôpital’s rule. We have
$$\lim_{z \to ni} \frac{z - ni}{\sinh \pi z} = \lim_{z \to ni} \frac{1}{\pi \cosh \pi z} = \frac{1}{\pi \cosh n\pi i} = \frac{1}{\pi \cos n\pi} = \frac{(-1)^n}{\pi}.$$
• Consider the function (sinh³ z)⁻¹. This time, we find the residue by looking at
the Laurent series. We first look at sinh³ z. This has a zero of order 3 at z = πi.
Its Taylor series is sinh³ z = −(z − πi)³ − ½(z − πi)⁵ + · · · . So
$$\frac{1}{\sinh^3 z} = -(z - \pi i)^{-3} \left(1 + \frac{1}{2}(z - \pi i)^2 + \cdots\right)^{-1}$$
$$= -(z - \pi i)^{-3} \left(1 - \frac{1}{2}(z - \pi i)^2 + \cdots\right) = -(z - \pi i)^{-3} + \frac{1}{2}(z - \pi i)^{-1} + \cdots.$$
Therefore the residue is ½.
E. 10-84
Compute the integral $\int_0^\infty \frac{1}{1 + x^4}\, dx$.
Consider $f(z) = 1/(1 + z^4)$, integrated along the closed contour $\gamma_R$ consisting of the
segment [−R, R] of the real axis together with the semicircular arc of radius R in
the upper half-plane; the integral splits into these two pieces.
The first term is something we care about, while the second is something we
despise. So we might want to get rid of it. We notice the integrand of the second
integral is O(R⁻⁴). Since we are integrating it over something of length O(R), the
whole thing tends to 0 as R → ∞. We also know the left hand side is just
$$\int_{\gamma_R} \frac{1}{1 + z^4}\, dz = 2\pi i \left(\operatorname{Res}(f, e^{i\pi/4}) + \operatorname{Res}(f, e^{3i\pi/4})\right).$$
So we just have to compute the residues. But our function is of the form given by
part 2 of the lemma above. So we know
$$\operatorname{Res}(f, e^{i\pi/4}) = \left.\frac{1}{4z^3}\right|_{z = e^{i\pi/4}} = \frac{1}{4} e^{-3\pi i/4},$$
and similarly the residue at $e^{3i\pi/4}$ is $\frac{1}{4}e^{-9\pi i/4}$. On the other hand, as R → ∞, the
first integral on the right is $\int_{-\infty}^{\infty} (1 + x^4)^{-1}\, dx$, which is, by evenness, twice
what we want. So
$$2\int_0^\infty \frac{1}{1 + x^4}\, dx = \int_{-\infty}^{\infty} \frac{1}{1 + x^4}\, dx = -\frac{2\pi i}{4}\left(e^{i\pi/4} + e^{3\pi i/4}\right) = \frac{\pi}{\sqrt 2}.$$
Hence our integral is $\int_0^\infty (1 + x^4)^{-1}\, dx = \frac{\pi}{2\sqrt 2}$.
When computing contour integrals, there are two things we have to decide. First,
we need to pick a nice contour to integrate along. Secondly, as we will see in the
next example, we have to decide what function to integrate.
E. 10-85
Compute $\int_{\mathbb{R}} \frac{\cos(x)}{1 + x + x^2}\, dx$.
We consider the function
$$f(z) = \frac{e^{iz}}{1 + z + z^2},$$
integrated along a semicircular contour in the upper half-plane, where the relevant
pole is at ω = $e^{2\pi i/3}$, the root of 1 + z + z² in the upper half-plane.
Now, again by the previous lemma, we get Res(f ; ω) = $e^{i\omega}/(2\omega + 1)$. On the
semicircle, we have
$$\left|\int_0^\pi f(Re^{i\theta}) R e^{i\theta}\, d\theta\right| \le \int_0^\pi \frac{R e^{-R\sin\theta}}{|R^2 e^{2i\theta} + Re^{i\theta} + 1|}\, d\theta \to 0 \quad \text{as } R \to \infty.$$
E. 10-86
Compute $\int_0^{\pi/2} \frac{1}{1 + \sin^2(t)}\, dt$.
We use the expression of sin in terms of the exponential function, namely
$\sin(t) = \frac{1}{2i}(e^{it} - e^{-it})$. So if we are on the unit circle, and z = $e^{it}$, then
$\sin(t) = \frac{1}{2i}(z - z^{-1})$. Moreover, we can check $\frac{dz}{dt} = ie^{it}$. So $dt = \frac{dz}{iz}$. Hence we get
$$\int_0^{\pi/2} \frac{dt}{1 + \sin^2(t)} = \frac{1}{4}\int_0^{2\pi} \frac{dt}{1 + \sin^2(t)} = \frac{1}{4}\int_{|z|=1} \frac{1}{1 - (z - z^{-1})^2/4} \frac{dz}{iz} = \int_{|z|=1} \frac{iz}{z^4 - 6z^2 + 1}\, dz.$$
In general, rational functions of
trigonometric functions can be integrated around |z| = 1 in this way, using the
fact that
$$\sin(kt) = \frac{e^{ikt} - e^{-ikt}}{2i} = \frac{z^k - z^{-k}}{2i}, \qquad \cos(kt) = \frac{e^{ikt} + e^{-ikt}}{2} = \frac{z^k + z^{-k}}{2}.$$
L. 10-87
Let f : B(a, r) \ {a} → C be holomorphic, and suppose f has a simple pole at a.
We let γε : [α, β] → C be given by t ↦ a + ε$e^{it}$. Then
$$\lim_{\varepsilon \to 0} \int_{\gamma_\varepsilon} f(z)\, dz = (\beta - \alpha)\, i \operatorname{Res}(f, a).$$
We can write $f(z) = \frac{c}{z - a} + g(z)$ near a, where c = Res(f ; a), and g : B(a, δ) → C
is holomorphic. We take ε < δ. Then
$$\left|\int_{\gamma_\varepsilon} g(z)\, dz\right| \le (\beta - \alpha) \cdot \varepsilon \sup_{z \in \gamma_\varepsilon} |g(z)| \to 0 \quad \text{as } \varepsilon \to 0,$$
while $\int_{\gamma_\varepsilon} \frac{c}{z - a}\, dz = \int_\alpha^\beta \frac{c}{\varepsilon e^{it}}\, i\varepsilon e^{it}\, dt = (\beta - \alpha)\, i c$.
L. 10-88
<Jordan’s lemma> Let f be holomorphic on a neighbourhood of infinity in C
(i.e. on {|z| > r} for some r > 0), and suppose that zf (z) is bounded in this region.
Let γR (t) = R$e^{it}$ for t ∈ [0, π] (which is not closed). Then for α > 0, we have
$$\int_{\gamma_R} f(z) e^{i\alpha z}\, dz \to 0 \quad \text{as } R \to \infty.$$
By assumption, we have |f (z)| ≤ M/|z| for large |z| and some constant M > 0.
We also have $|e^{i\alpha z}| = e^{-R\alpha\sin t}$ on γR . To avoid messing with sin t, we note that
on (0, π/2], the function $\frac{\sin\theta}{\theta}$ is decreasing. Then by considering the end points, we
find sin(t) ≥ 2t/π for t ∈ [0, π/2]. This gives us the bound
$$|e^{i\alpha z}| = e^{-R\alpha\sin t} \le \begin{cases} e^{-R\alpha \cdot 2t/\pi} & 0 \le t \le \frac{\pi}{2} \\ e^{-R\alpha \cdot 2t'/\pi} & 0 \le t' = \pi - t \le \frac{\pi}{2}. \end{cases}$$
So we get
$$\left|\int_0^{\pi/2} e^{iR\alpha e^{it}} f(Re^{it}) Re^{it}\, dt\right| \le \int_0^{\pi/2} e^{-2\alpha R t/\pi} \cdot M\, dt = \frac{\pi M}{2\alpha R}\left(1 - e^{-\alpha R}\right) \to 0$$
as R → ∞. The estimate for $\int_{\pi/2}^{\pi} f(z) e^{i\alpha z}\, dz$ is analogous.
Note that the condition that zf (z) is bounded near infinity is saying that its
Laurent series only has terms of negative power, which is equivalent to saying that
f (z) → 0 as |z| → ∞.
This lemma allows us to consider integrals on expanding semicircles. In previous
cases, we had f (z) = O(R⁻²), and then we can bound the integral simply as
O(R⁻¹) → 0. In this case, we only require f (z) = O(R⁻¹). The drawback is that
the bare integral $\int_{\gamma_R} f(z)\, dz$ need not work — it is possible that this does not
vanish. However, if we have the extra help from $e^{i\alpha z}$, then we do get that the
integral vanishes.
E. 10-89
Show that $\int_0^\infty \frac{\sin x}{x}\, dx = \frac{\pi}{2}$.
Considering the ε-semicircle γε , and using the first lemma, we get a contribution
of −iπ, where the sign comes from the orientation. Rearranging, and using the
fact that the function is even, we get the desired result.
E. 10-90
Evaluate $\int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\, dx$ where a ∈ (−1, 1) is a real constant.
Note that the function f (z) = $e^{az}/\cosh z$ has simple poles where z = (n + ½)iπ
for n ∈ Z. So if we did as we have done above, then we would run into infinitely
many singularities, which is not fun.
Instead, we note that cosh(x + iπ) = − cosh x and consider a rectangular contour
with horizontal sides γ0 from −R to R and γ1 from R + iπ to −R + iπ, joined by
vertical sides γ±vert. We now enclose only one singularity, namely ρ = iπ/2, where
$$\operatorname{Res}(f, \rho) = \frac{e^{a\rho}}{\cosh'(\rho)} = \frac{e^{a\pi i/2}}{\sinh(i\pi/2)} = -ie^{a\pi i/2}.$$
The integral along the right-hand vertical side vanishes as R → ∞ since a < 1, and
we can do a similar bound for γ−vert, where we use the fact that a > −1. Thus,
letting R → ∞, we get
$$\int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\, dx + \int_{+\infty}^{-\infty} \frac{e^{a\pi i} e^{ax}}{\cosh(x + i\pi)}\, dx = 2\pi i\left(-ie^{a\pi i/2}\right).$$
Since cosh(x + iπ) = − cosh x, the second integral equals $e^{a\pi i} \int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\, dx$. Hence
$$\int_{-\infty}^{\infty} \frac{e^{ax}}{\cosh x}\, dx = \frac{2\pi e^{a\pi i/2}}{1 + e^{a\pi i}} = \pi \sec\frac{\pi a}{2}.$$
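A numerical check of this result (my own sketch, not from the notes), with a = 0.3; the integrand decays like $e^{-(1-|a|)|x|}$, so truncating at |x| = 40 is safe:

```python
import math

a = 0.3
L, n = 40.0, 80_000
h = 2 * L / n
# Midpoint rule for the integral of e^{ax}/cosh(x) over [-L, L]
approx = h * sum(
    math.exp(a * (-L + (k + 0.5) * h)) / math.cosh(-L + (k + 0.5) * h)
    for k in range(n)
)
exact = math.pi / math.cos(math.pi * a / 2)
print(abs(approx - exact))
```
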
E. 10-91
Show that $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$.
Consider $f(z) = \frac{\pi\cot(\pi z)}{z^2}$. At each non-zero n ∈ Z, f has a simple pole with residue
$$\operatorname{Res}(f; n) = \frac{\pi\cos(\pi n)}{n^2 \cdot \pi\cos(\pi n)} = \frac{1}{n^2}.$$
Note that the reason why we have those funny π’s all around the place is so that
we can get this nice expression for the residue. At z = 0, we get
$$\cot(z) = \left(1 - \frac{z^2}{2} + O(z^4)\right)\left(z - \frac{z^3}{3!} + O(z^5)\right)^{-1} = \frac{1}{z} - \frac{z}{3} + O(z^2).$$
So we get
$$\frac{\pi\cot(\pi z)}{z^2} = \frac{1}{z^3} - \frac{\pi^2}{3z} + \cdots$$
So the residue at 0 is −π²/3. Now we consider the square contour γN with vertices
$\pm(N + \frac{1}{2}) \pm (N + \frac{1}{2})i$. Since we don’t want the contour itself to
pass through singularities, we make the square pass through $\pm(N + \frac{1}{2})$. Then the
residue theorem says
$$\int_{\gamma_N} f(z)\, dz = 2\pi i \left( 2\sum_{n=1}^{N} \frac{1}{n^2} - \frac{\pi^2}{3} \right).$$
10.4. RESIDUE CALCULUS 459
We can thus get the desired series if we can show that $\int_{\gamma_N} f(z)\, dz \to 0$ as N → ∞.
We first note that
$$\left|\int_{\gamma_N} f(z)\, dz\right| \le \sup_{\gamma_N} \left|\frac{\pi\cot \pi z}{z^2}\right| \cdot 4(2N + 1) \le \frac{4(2N+1)\pi}{(N + \frac{1}{2})^2} \sup_{\gamma_N} |\cot \pi z| = \sup_{\gamma_N} |\cot \pi z| \cdot O(N^{-1}).$$
One can check that |cot πz| is bounded on γN uniformly in N, so the integral indeed
tends to 0.
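The resulting value of the sum can be sanity-checked numerically (my own sketch, not from the notes); since the tail $\sum_{n>N} 1/n^2 \approx 1/N$, adding 1/N as a correction gives high accuracy:

```python
import math

N = 1_000_000
partial = sum(1 / n ** 2 for n in range(1, N + 1))
approx = partial + 1 / N          # tail correction: sum_{n>N} 1/n^2 ~ 1/N
print(abs(approx - math.pi ** 2 / 6))
```
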
Consider now $\int_0^\infty \frac{\log x}{1 + x^2}\, dx$, with $f(z) = \frac{\log z}{1 + z^2}$ integrated along a large
semicircle in the upper half-plane with a small semicircular detour around 0. On
the large semicircular arc of radius R, the integrand satisfies
$$|f(z)||dz| = O\left(R \cdot \frac{\log R}{R^2}\right) = O\left(\frac{\log R}{R}\right) \to 0 \quad \text{as } R \to \infty.$$
On the small semicircular arc of radius ε, the integrand satisfies
$$|f(z)||dz| = O(\varepsilon \log \varepsilon) \to 0 \quad \text{as } \varepsilon \to 0.$$
Hence, as ε → 0 and R → ∞, we are left with the integral along the real axis.
Along the negative real axis, we have log z = log |z| + iπ. So the residue theorem
says
$$\int_0^\infty \frac{\log x}{1 + x^2}\, dx + \int_\infty^0 \frac{\log|z| + i\pi}{1 + x^2}\, (-dx) = 2\pi i \operatorname{Res}(f; i).$$
We can compute the residue as
$$\operatorname{Res}(f, i) = \frac{\log i}{2i} = \frac{\frac{1}{2}i\pi}{2i} = \frac{\pi}{4}.$$
So we find
$$2\int_0^\infty \frac{\log x}{1 + x^2}\, dx + i\pi \int_0^\infty \frac{1}{1 + x^2}\, dx = \frac{i\pi^2}{2}.$$
Taking the real part of this, we obtain 0 as the answer to the original integral.
In this case, we had a branch cut, and we managed to avoid it by going around
our magic contour. Sometimes, it is helpful to run our integral along the branch
cut.
E. 10-93
Compute $\int_0^\infty \frac{\sqrt x}{x^2 + ax + b}\, dx$ where a, b ∈ R.
To define √z, we need to pick a branch cut. We pick
it to lie along the positive real line, and consider the
keyhole contour. As usual this has a small circle
of radius ε around the origin, and a large circle of
radius R. Note that these both avoid the branch
cut. Again, on the R circle, we have
$$|f(z)||dz| = O\left(\frac{1}{\sqrt R}\right) \to 0 \quad \text{as } R \to \infty.$$
On the ε-circle, we have $|f(z)||dz| = O(\varepsilon^{3/2}) \to 0$ as ε → 0. Viewing $\sqrt z = e^{\frac{1}{2}\log z}$,
on the two pieces of the contour along R≥0 , log z differs by 2πi. So √z changes
sign. This cancels with the sign change arising from going in the wrong direction.
Therefore the residue theorem says
$$2\pi i \sum \text{residues inside contour} = 2\int_0^\infty \frac{\sqrt x}{x^2 + ax + b}\, dx.$$
What the residues are depends on what the quadratic actually is, but we will not
go into details.
E. 10-94
Compute $I = \int_0^\infty \frac{x^\alpha}{1 + \sqrt 2\, x + x^2}\, dx$ where −1 < α < 1.
Using a keyhole contour γ around the branch cut along R≥0 (with the branch of $z^\alpha$
fixed by 0 ≤ arg z < 2π), in the limit we get
$$\oint_\gamma \frac{z^\alpha}{1 + \sqrt 2\, z + z^2}\, dz \to (1 - e^{2\alpha\pi i})\, I.$$
All that remains is to compute the residues. We write the integrand as
$$\frac{z^\alpha}{(z - e^{3\pi i/4})(z - e^{5\pi i/4})}.$$
So the poles are at $z_0 = e^{3\pi i/4}$ and $z_1 = e^{5\pi i/4}$. The residues are $e^{3\alpha\pi i/4}/(\sqrt 2\, i)$
and $e^{5\alpha\pi i/4}/(-\sqrt 2\, i)$ respectively. Hence we know
$$(1 - e^{2\alpha\pi i})\, I = 2\pi i \left( \frac{e^{3\alpha\pi i/4}}{\sqrt 2\, i} + \frac{e^{5\alpha\pi i/4}}{-\sqrt 2\, i} \right).$$
In other words, we get $e^{\alpha\pi i}(e^{-\alpha\pi i} - e^{\alpha\pi i})\, I = \sqrt 2\, \pi\, e^{\alpha\pi i}(e^{-\alpha\pi i/4} - e^{\alpha\pi i/4})$. Thus
we have
$$I = \sqrt 2\, \pi\, \frac{\sin(\alpha\pi/4)}{\sin(\alpha\pi)}.$$
Note that we labeled the poles as $e^{3\pi i/4}$ and $e^{5\pi i/4}$. The second point is the
same point as $e^{-3\pi i/4}$, but it would be wrong to label it like that. We have decided
at the beginning to pick the branch such that 0 ≤ θ < 2π, and −3π/4 is not in
that range. If we wrote it as $e^{-3\pi i/4}$ instead, we might have got the residue as
$e^{-3\alpha\pi i/4}/(-\sqrt 2\, i)$, which is not the same as $e^{5\alpha\pi i/4}/(-\sqrt 2\, i)$.
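A numerical check of the final formula (my own sketch, not from the notes), with α = 1/2; substituting x = e^u makes both tails of the integral decay exponentially:

```python
import math

alpha = 0.5
L, n = 60.0, 120_000
h = 2 * L / n
total = 0.0
for k in range(n):
    u = -L + (k + 0.5) * h
    x = math.exp(u)
    # dx = x du, so the integrand becomes x^{alpha+1}/(1 + sqrt(2) x + x^2)
    total += x ** (alpha + 1) / (1 + math.sqrt(2) * x + x ** 2)
approx = h * total
exact = math.sqrt(2) * math.pi * math.sin(alpha * math.pi / 4) / math.sin(alpha * math.pi)
print(abs(approx - exact))
```
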
L. 10-95
Suppose the holomorphic f has a zero (or pole) of order k > 0 at z = a; then
f ′(z)/f (z) has a simple pole at z = a with residue k (respectively −k for a pole).
For a zero we can write f (z) = (z − a)ᵏ g(z) with g holomorphic and g(a) ≠ 0, so
$$\frac{f'(z)}{f(z)} = \frac{k}{z - a} + \frac{g'(z)}{g(z)},$$
whence the result. For a pole we have f (z) = (z − a)⁻ᵏ g(z) and proceed in the
same way.
T. 10-96
<Argument principle> Let U be a simply connected domain, and let f be
meromorphic on U . Suppose in fact f has finitely many zeroes z1 , · · · , zk and
finitely many poles w1 , · · · , w` . Let γ be a piecewise-C 1 closed curve in U such
that zi , wj ∉ image(γ) for all i, j. Then
$$I(f \circ \gamma, 0) = \frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)}\, dz = \sum_{i=1}^{k} \operatorname{ord}(f; z_i)\, I(\gamma, z_i) - \sum_{j=1}^{\ell} \operatorname{ord}(f; w_j)\, I(\gamma, w_j).$$
We have
$$I(f \circ \gamma, 0) = \frac{1}{2\pi i}\int_{f \circ \gamma} \frac{dw}{w} = \frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)}\, dz.$$
Let S = {z1 , · · · , zk , w1 , · · · , wℓ }. By the residue theorem, we have
$$\frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z)}\, dz = \sum_{z \in S} \operatorname{Res}\left(\frac{f'}{f}, z\right) I(\gamma, z).$$
Note that outside these zeroes and poles, the function f 0 (z)/f (z) is holomorphic.
By the above lemma Res(f 0 /f, zj ) = ord(f ; zj ) and Res(f 0 /f, wj ) = − ord(f ; wj ).
Note that if c is a constant, then
$$I((f - c)\circ\gamma, 0) = \frac{1}{2\pi i}\int_\gamma \frac{f'(z)}{f(z) - c}\, dz = \frac{1}{2\pi i}\int_{f\circ\gamma} \frac{dw}{w - c} = I(f\circ\gamma, c).$$
This “shifting property” will be useful later.
Recall we said that if f : B(a; r) → C is holomorphic, and f (a) = 0, then f has
a zero of order k if, locally f (z) = (z − a)k g(z) with g holomorphic and g(a) 6= 0.
Analogously, if f : B(a, r) \ {a} → C is holomorphic, and f has at worst a pole
at a, we can again write f (z) = (z − a)k g(z) where now k ∈ Z may be negative.
Since we like numbers to be positive, we say the order of the zero/pole is |k|. It
turns out we can use integrals to help count poles and zeroes. In particular, if γ is
a simple closed curve, then the winding numbers of γ about points zi , wj lying
in the region bounded by γ are all +1 (with the right choice of orientation). Then
this result says that, in the region,
$$\text{number of zeroes} - \text{number of poles} = \frac{1}{2\pi}(\text{change in argument of } f \text{ along } \gamma),$$
where the zeroes and poles are counted with multiplicity.
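The zero count can be evaluated numerically from the integral form of the argument principle (my own sketch; the helper `count_zeros` is hypothetical, not from the notes). For P(z) = z³ − 1, all three roots have modulus 1:

```python
import cmath, math

def count_zeros(P, dP, radius, n=20000):
    # (1/(2*pi*i)) * contour integral of P'(z)/P(z) over the circle |z| = radius
    total = 0j
    step = 2 * math.pi / n
    for k in range(n):
        t = (k + 0.5) * step
        z = radius * cmath.exp(1j * t)
        total += dP(z) / P(z) * 1j * z * step
    return round((total / (2j * math.pi)).real)

P = lambda z: z ** 3 - 1
dP = lambda z: 3 * z ** 2
print(count_zeros(P, dP, 2.0))    # all three cube roots of unity enclosed
print(count_zeros(P, dP, 0.5))    # none enclosed
```

Since P has no poles, the count is just the number of zeroes, with multiplicity.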
This might be the right place to put the following remark — all the time, we have
assumed that a simple closed curve “bounds a region”, and then we talk about
which poles or zeroes are bounded by the curve. While this seems obvious, it is
not. This is given by the Jordan curve theorem, which is actually hard. Instead of
resorting to this theorem, we can instead define what it means to bound a region
in a more convenient way. One can say that for a domain U , a closed curve γ ⊆ U
bounds a domain D ⊆ U if
$$I(\gamma, z) = \begin{cases} +1 & z \in D \\ 0 & z \notin D. \end{cases}$$
If |f | > |g| on γ, then f and f + g cannot have zeroes on the curve γ. We let
$$h(z) = \frac{f(z) + g(z)}{f(z)} = 1 + \frac{g(z)}{f(z)}.$$
This is a natural thing to consider, since zeroes of f + g are zeroes of h, while poles
of h are zeroes of f . Note that by assumption, for all z ∈ γ, we have
$$|h(z) - 1| = \frac{|g(z)|}{|f(z)|} < 1.$$
Therefore h ◦ γ is a closed curve in the half-plane {z : Re(z) > 0}. So I(h ◦ γ; 0) = 0.
Then by the argument principle, h must have the same number of zeroes as poles
in D, when counted with multiplicity (note that the winding numbers are all +1).
Thus, as the zeroes of h are the zeroes of f + g, and the poles of h are the poles
of f , the result follows.
E. 10-98
A typical application of Rouché’s theorem is to determine the approximate location
of the zeroes of (say) a polynomial. Consider the function z⁴ + 6z + 3; we claim
this has three roots (with multiplicity) in {1 < |z| < 2}. To show this, note that
on |z| = 2, we have
$$|z|^4 = 16 > 6|z| + 3 \ge |6z + 3|.$$
So if we let f (z) = z⁴ and g(z) = 6z + 3, then f and f + g have the same number
of roots in {|z| < 2}. Hence all four roots lie inside {|z| < 2}.
On the other hand, on |z| = 1, we have |6z| = 6 > |z⁴ + 3|. So 6z and z⁴ + 6z + 3
have the same number of roots in {|z| < 1}. So there is exactly one root in there,
and the remaining three must lie in {1 < |z| < 2} (the bounds above show that
|z| cannot be exactly 1 or 2).
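We can confirm this root count numerically with the winding-number integral from the argument principle (my own sketch; the helper `count_zeros` is hypothetical, not from the notes):

```python
import cmath, math

def count_zeros(P, dP, radius, n=40000):
    # Number of zeroes of P inside |z| < radius, via (1/(2*pi*i)) * integral of P'/P
    total = 0j
    step = 2 * math.pi / n
    for k in range(n):
        t = (k + 0.5) * step
        z = radius * cmath.exp(1j * t)
        total += dP(z) / P(z) * 1j * z * step
    return round((total / (2j * math.pi)).real)

P = lambda z: z ** 4 + 6 * z + 3
dP = lambda z: 4 * z ** 3 + 6
inside_1 = count_zeros(P, dP, 1.0)
inside_2 = count_zeros(P, dP, 2.0)
print(inside_1, inside_2 - inside_1)   # one root in |z| < 1, three in the annulus
```
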
E. 10-99
Let $P(x) = x^n + a_{n-1}x^{n-1} + \cdots + a_1 x + a_0 \in \mathbb{Z}[x]$ and suppose $a_0 \neq 0$. We claim
that if $|a_{n-1}| > 1 + |a_{n-2}| + \cdots + |a_1| + |a_0|$, then P is irreducible over Z (and
hence irreducible over Q, by Gauss’ lemma).
To show this, we let $f(z) = a_{n-1}z^{n-1}$ and $g(z) = z^n + a_{n-2}z^{n-2} + \cdots + a_1 z + a_0$.
Then our hypothesis tells us |f | > |g| on |z| = 1. So f and P = f + g both have
n − 1 roots in the open unit disc {|z| < 1}.
Now if we could factor P (z) = Q(z)R(z) with Q, R ∈ Z[x] non-constant, then at
least one of Q, R must have all its roots inside the unit disc. Say all roots of Q are
inside the unit disc. But we assumed a0 ≠ 0, so 0 is not a root of P , hence not a
root of Q. But the product of the roots of Q is, up to sign, the constant coefficient
of Q — a non-zero integer of modulus strictly less than 1. This is a contradiction.
The argument principle and Rouché’s theorem tell us how many roots we have got.
However, we do not know if they are distinct or not. This information is given to
us via the local degree theorem, which we do next.
D. 10-100
Let f : B(a, r) → C be holomorphic and non-constant. Then the local degree of
f at a, written deg(f, a), is the order of the zero of f (z) − f (a) at a.
E. 10-101
If we take the Taylor expansion of f about a, then the local degree is the degree
of the first non-zero term after the constant term.
L. 10-102
The local degree is given by deg(f, a) = I(f ◦ γ, f (a)) where γ(t) = a + r$e^{it}$ with
0 ≤ t ≤ 2π, for r > 0 sufficiently small.
Note that by the identity theorem, we know that f (z) − f (a) has an isolated zero
at a (since f is non-constant). So for sufficiently small r, the function f (z) − f (a)
does not vanish on B̄(a, r) \ {a}. If we use this r, then f ◦ γ never hits f (a),
and the winding number is well-defined. The result then follows directly from the
argument principle.
T. 10-103
<Local degree theorem> Let f : B(a, R) → C be holomorphic and non-
constant. Then ∃δ > 0 s.t. ∀r ∈ (0, δ], ∃ε > 0 s.t. ∀w ∈ B(f (a), ε) \ {f (a)}, the
equation f (z) = w has exactly deg(f, a) distinct solutions in B(a, r).
We pick δ > 0 such that f (z) − f (a) and f ′(z) don’t vanish on B̄(a, δ) \ {a}.
Then in particular the same applies to r ∈ (0, δ]. We let γ(t) = a + r$e^{it}$. Then
f (a) ∉ image(f ◦ γ). So there is some ε > 0 such that B(f (a), ε) ∩ image(f ◦ γ) = ∅
(since C \ image(f ◦ γ) is open).
We now let w ∈ B(f (a), ε). Then the number of zeros of f (z) − w in B(a, r) is
just I(f ◦ γ, w), by the argument principle. This is just equal to I(f ◦ γ, f (a)) =
deg(f, a), by the invariance of I(Γ, ∗) as we move ∗ on path components of C \ Γ.
Now if w 6= f (a), since f 0 (z) 6= 0 on B(a, r) \ {a}, all roots of f (z) − w must be
simple. So there are exactly deg(f, a) distinct zeros.
P. 10-104
<Open mapping theorem> Let U be a domain and f : U → C holomorphic
and non-constant; then f is an open map, ie. for all open V ⊆ U , we get that
f (V ) is open.
This is an immediate consequence of the local degree theorem. Firstly note that
by the identity theorem, f is not locally constant anywhere. For every a ∈ U we
pick r > 0 sufficiently small so that r ∈ (0, δ] and B(a, r) ⊆ V . Then by the local
degree theorem ∃ε > 0 such that B(f (a), ε) ⊆ f (B(a, r)) ⊆ f (V ). Hence f (V ) is
open.
L. 10-105
Suppose U ⊆ C is a simply connected domain and 0 ∉ U ; then there exists a branch
of logarithm on U .
Pick a ∈ U . Since exp is surjective onto C*, we can pick b such that $e^b = a$. Given
any x ∈ U , let γx be any piecewise C¹ path from a to x. Define F : U → C by
$F(x) = b + \int_{\gamma_x} \frac{1}{z}\, dz$. By [P.10-80], if γ is a closed piecewise C¹ path in U , then
$\int_\gamma \frac{1}{z}\, dz = 0$, so F is well-defined. Hence by 1 of [P.10-39], F is holomorphic with
derivative 1/z. Now $\frac{d}{dz}\left(z e^{-F(z)}\right) = e^{-F(z)}(1 - zF'(z)) = 0$ by the chain rule,
hence $e^{F(z)} = Az$ for some constant A.[L.10-16] Now $e^{F(a)} = e^b = a$, hence A = 1.
Therefore F is a continuous log on U .
P. 10-106
Let U ⊆ C be a simply connected domain, and U ≠ C. Then there is a non-
constant holomorphic function U → B(0, 1).
is injective, and that h(z1 ) = ±h(z2 ) implies φ(z1 ) = φ(z2 ). So we deduce that
B(−y, r) ∩ h(U ) = ∅. Now define
$$f : z \mapsto \frac{r}{2(h(z) + y)}.$$
It is common for the terms e^{−ikx} and e^{ikx} to be swapped around in these definitions. They might even be swapped around by the same author in the same paper — for some reason, if we have a function of two variables, then it is traditional to transform one variable with e^{−ikx} and the other with e^{ikx}, just to confuse people. More rarely, factors of 2π or √(2π) are rearranged. Traditionally, if f is a function of position x, then the transform variable is called k; while if f is a function of time t, then it is called ω.
In fact, a more precise version of the inverse transform is
(1/2)(f(x⁺) + f(x⁻)) = (1/2π) PV ∫_{−∞}^{∞} f̃(k) e^{ikx} dk.
The left-hand side indicates that at a discontinuity, the inverse Fourier transform gives
the average value. The right-hand side shows that only the Cauchy principal value
466 CHAPTER 10. COMPLEX ANALYSIS AND METHODS
of the integral (denoted PV ∫, P ∫ or ⨍) is required, ie. the limit
lim_{R→∞} ∫_{−R}^{R} f̃(k)e^{ikx} dk,   rather than   lim_{R→∞, S→−∞} ∫_{S}^{R} f̃(k)e^{ikx} dk.
Several functions have PV integrals, but not normal ones. For example,
PV ∫_{−∞}^{∞} x/(1 + x²) dx = 0,
since the integrand is odd, but ∫_{−∞}^{∞} x/(1 + x²) dx diverges at both −∞ and ∞, so the ordinary improper integral does not exist. So for the inverse Fourier transform, we only have to care about the Cauchy principal value. This is convenient, because that is how we compute contour integrals all the time!
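This can be seen numerically (a minimal sketch, not from the notes): the symmetric sums ∫_{−R}^{R} x/(1 + x²) dx vanish for every R, even though each one-sided tail grows like log R:

```python
def pv_integral(R, n=200000):
    # symmetric midpoint-rule sum of x/(1+x^2) over [-R, R];
    # the sample points come in ± pairs, so the odd integrand cancels
    h = 2 * R / n
    total = 0.0
    for k in range(n):
        x = -R + (k + 0.5) * h
        total += x / (1 + x * x) * h
    return total

print(pv_integral(10.0))    # ≈ 0
print(pv_integral(1000.0))  # ≈ 0: the principal value exists and is 0
```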
E. 10-107
Consider f(x) = e^{−x²/2}. Then
f̃(k) = ∫_{−∞}^{∞} e^{−x²/2} e^{−ikx} dx = e^{−k²/2} ∫_{−∞}^{∞} e^{−(x+ik)²/2} dx = e^{−k²/2} ∫_{−∞+ik}^{∞+ik} e^{−z²/2} dz.
Since e^{−z²/2} is entire and decays rapidly as |Re z| → ∞, we may shift the contour back down to the real axis, giving
f̃(k) = e^{−k²/2} ∫_{−∞}^{∞} e^{−x²/2} dx = √(2π) e^{−k²/2}.
So the Gaussian transforms to another Gaussian.
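This is easy to confirm numerically (a sketch with our own helper names): with the convention f̃(k) = ∫ f(x)e^{−ikx} dx used here, the transform of e^{−x²/2} should be √(2π) e^{−k²/2}. Truncating the integral to [−L, L] costs only an O(e^{−L²/2}) error:

```python
import math, cmath

def fourier(f, k, L=10.0, n=40000):
    # f~(k) = integral of f(x) e^{-ikx} over [-L, L], midpoint rule;
    # the tails |x| > L are negligible for a Gaussian
    h = 2 * L / n
    s = 0
    for j in range(n):
        x = -L + (j + 0.5) * h
        s += f(x) * cmath.exp(-1j * k * x) * h
    return s

f = lambda x: math.exp(-x * x / 2)
for k in (0.0, 1.0, 2.5):
    exact = math.sqrt(2 * math.pi) * math.exp(-k * k / 2)
    print(k, abs(fourier(f, k) - exact))   # all ≈ 0
```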
E. 10-108
When inverting Fourier transforms, we generally use a semicircular contour (in the upper half-plane if x > 0, lower if x < 0), and apply Jordan's lemma. Consider the real function
f(x) = 0 for x < 0,   f(x) = e^{−ax} for x > 0,
where a > 0 is a real constant. The Fourier transform of f is
f̃(k) = ∫_{−∞}^{∞} f(x)e^{−ikx} dx = ∫_{0}^{∞} e^{−ax−ikx} dx = −(1/(a + ik)) [e^{−ax−ikx}]_{0}^{∞} = 1/(a + ik).
For x > 0, we close in the upper half-plane and pick up the residue of f̃(k)e^{ikx} at the simple pole k = ia, which gives e^{−ax}. For x < 0, we have to close in the lower half-plane (to apply Jordan's lemma). Since there are no singularities there, we get (1/2π) ∫_{−∞}^{∞} f̃(k)e^{ikx} dk = 0. Combining these results, we obtain
(1/2π) ∫_{−∞}^{∞} f̃(k)e^{ikx} dk = 0 for x < 0,   and = e^{−ax} for x > 0.
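The forward transform above can also be checked directly (a minimal numerical sketch; helper names are ours). Truncating ∫_0^∞ e^{−(a+ik)x} dx at L only costs O(e^{−aL}):

```python
import cmath

a = 1.5   # any a > 0 works

def f_tilde(k, L=30.0, n=60000):
    # integral of e^{-(a+ik)x} over [0, L] by the midpoint rule;
    # e^{-aL} is negligible at L = 30
    h = L / n
    return sum(cmath.exp(-(a + 1j * k) * (j + 0.5) * h) * h for j in range(n))

for k in (0.0, 1.0, -2.0):
    print(k, abs(f_tilde(k) - 1 / (a + 1j * k)))   # ≈ 0: f~(k) = 1/(a + ik)
```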
This exists for functions that grow no more than exponentially fast. There is no standard notation for the Laplace transform. We sometimes write f̂ = L(f) or f̂(p) = L(f(t)). The variable p is also not standard; sometimes s is used instead.
E. 10-110
Many functions (eg. t and e^t) which do not have Fourier transforms do have Laplace transforms. Note that f̂(p) = f̃(−ip), where f̃ is the Fourier transform, provided that both transforms exist.
1. L(1) = ∫_{0}^{∞} e^{−pt} dt = 1/p.
2. Integrating by parts, we find L(t) = 1/p².
3. L(e^{λt}) = ∫_{0}^{∞} e^{(λ−p)t} dt = 1/(p − λ).
4. L(sin t) = L((1/2i)(e^{it} − e^{−it})) = (1/2i)(1/(p − i) − 1/(p + i)) = 1/(p² + 1).
Note that the integral only converges if Re p is sufficiently large. For example, in (3), we require Re p > Re λ. However, once we have calculated f̂ in this domain, we can consider it to exist everywhere in the complex p-plane, except at singularities (such as at p = λ in this example). That is, we use analytic continuation.
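The table above is easy to verify numerically (a sketch; the integrator below is our own helper, truncating ∫_0^∞ at T where e^{−pT} is negligible):

```python
import math

def laplace(f, p, T=40.0, n=80000):
    # f^(p) = integral of f(t) e^{-pt} over [0, T], midpoint rule
    h = T / n
    return sum(f((j + 0.5) * h) * math.exp(-p * (j + 0.5) * h) * h for j in range(n))

p = 2.0
print(abs(laplace(lambda t: 1.0, p) - 1 / p))                   # L(1) = 1/p
print(abs(laplace(lambda t: t, p) - 1 / p**2))                  # L(t) = 1/p^2
print(abs(laplace(lambda t: math.exp(0.5 * t), p) - 1 / 1.5))   # L(e^{t/2}) = 1/(p - 1/2)
print(abs(laplace(math.sin, p) - 1 / (p**2 + 1)))               # L(sin t) = 1/(p^2 + 1)
```

All four differences come out at roundoff level, matching entries 1–4.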
C. 10-111
<Properties of the Laplace transform> We will establish seven elementary properties of the Laplace transform. The first 4 properties are easily proved by direct substitution.
1. Linearity: L(αf + βg) = αL(f) + βL(g).
2. Translation: L(f(t − t₀)H(t − t₀)) = e^{−pt₀} f̂(p).
3. Scaling: L(f(λt)) = (1/λ) f̂(p/λ), where we require λ > 0 so that f(λt) vanishes for t < 0.
4. Shifting: L(e^{p₀t} f(t)) = f̂(p − p₀).
5. Transform of a derivative: L(f′(t)) = p f̂(p) − f(0). To see this, note
∫_{0}^{∞} f′(t)e^{−pt} dt = [f(t)e^{−pt}]_{0}^{∞} + p ∫_{0}^{∞} f(t)e^{−pt} dt = p f̂(p) − f(0).
Iterating gives L(f″(t)) = p² f̂(p) − p f(0) − f′(0), and so on. This is the key fact for solving ODEs using Laplace transforms.
6. Derivative of a transform: f̂′(p) = L(−t f(t)). To see this, note that f̂(p) = ∫_{0}^{∞} f(t)e^{−pt} dt; differentiating with respect to p, we have f̂′(p) = −∫_{0}^{∞} t f(t)e^{−pt} dt.
Of course, the point of this is not that we know what the derivative of f̂ is. It is that we know how to find the Laplace transform of t f(t)! For example, this lets us find the Laplace transform of t² with ease. In general, f̂^{(n)}(p) = L((−t)ⁿ f(t)).
7. Asymptotic limits:
p f̂(p) → f(0) as p → ∞,   and   p f̂(p) → f(∞) as p → 0,
where the second case requires f to have a limit at ∞. To see this, use property 5 to write
p f̂(p) = f(0) + ∫_{0}^{∞} f′(t)e^{−pt} dt.
E. 10-112
L(t sin t) = −(d/dp) L(sin t) = −(d/dp) (1/(p² + 1)) = 2p/(p² + 1)².
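As a numerical check of property 6 in this example (a sketch with our own integrator, truncating the integral at T):

```python
import math

def laplace(f, p, T=40.0, n=80000):
    # f^(p) = integral of f(t) e^{-pt} over [0, T], midpoint rule
    h = T / n
    return sum(f((j + 0.5) * h) * math.exp(-p * (j + 0.5) * h) * h for j in range(n))

p = 2.0
lhs = laplace(lambda t: t * math.sin(t), p)
rhs = 2 * p / (p * p + 1) ** 2   # the claimed L(t sin t); equals 4/25 at p = 2
print(abs(lhs - rhs))            # ≈ 0
```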
P. 10-113
<Inverse Laplace transform> The inverse Laplace transform is given by
f(t) = (1/2πi) ∫_{c−i∞}^{c+i∞} f̂(p)e^{pt} dp,
where c is a real constant such that the Bromwich inversion contour γ from c − i∞ to c + i∞ lies to the right of all the singularities of f̂.
To see this, set g(t) = f(t)e^{−ct} (recall f(t) = 0 for t < 0); its Fourier transform is
g̃(ω) = ∫_{−∞}^{∞} f(t)e^{−ct} e^{−iωt} dt = f̂(c + iω).
Applying the Fourier inversion formula and substituting p = c + iω (so dω = dp/i) gives
f(t)e^{−ct} = (1/2πi) ∫_{c−i∞}^{c+i∞} f̂(p)e^{(p−c)t} dp.
Multiplying both sides by e^{ct}, we get the result we were looking for (the requirement that c lies to the right of all singularities is to fix the "constant of integration" so that f(t) = 0 for all t < 0, as we will soon see).
P. 10-114
In the case that f̂(p) has only a finite number of isolated singularities p_k for k = 1, · · · , n, and f̂(p) → 0 as |p| → ∞, then
f(t) = (Σ_{k=1}^{n} Res_{p=p_k} (f̂(p)e^{pt})) H(t).
If f̂ decays less rapidly at infinity, but still tends to zero there, the same result holds, but we need to use a slight modification of Jordan's lemma. So in either case, writing γ_R for the closing arc, the integral
∫_{γ_R} f̂(p)e^{pt} dp → 0 as R → ∞.
Thus, writing γ₀ for the closed contour consisting of the truncated Bromwich line together with γ_R, we know ∫_{γ₀} → ∫_γ. When t < 0 we close to the right; by Cauchy's theorem (the closed contour encloses no singularities) we then know f(t) = 0 for t < 0. This is in agreement with the requirement that functions with Laplace transforms vanish for t < 0. Here we see why γ must lie to the right of all singularities. If not, then the contour would encircle some singularities, and then the integral would no longer be zero.
When t > 0, we close the contour to the left. This time, our closed contour does enclose some singularities. Since there are only finitely many singularities, we enclose all singularities for sufficiently large R. Once again, ∫_{γ_R} → 0 as R → ∞. Thus, by the residue theorem, we know
∫_{γ} f̂(p)e^{pt} dp = lim_{R→∞} ∫_{γ₀} f̂(p)e^{pt} dp = 2πi Σ_{k=1}^{n} Res_{p=p_k} (f̂(p)e^{pt}).
So the inversion formula gives f(t) = Σ_{k=1}^{n} Res_{p=p_k} (f̂(p)e^{pt}).
(Figure: the Bromwich line from c − iR to c + iR, closed to the left by the arc γ_R; crosses mark the enclosed singularities.)
E. 10-115
• We know
f̂(p) = 1/(p − 1)
has a pole at p = 1. So we must use c > 1. We have f̂(p) → 0 as |p| → ∞. So Jordan's lemma applies as above. Hence f(t) = 0 for t < 0, and for t > 0, we have
f(t) = Res_{p=1} (e^{pt}/(p − 1)) = e^t.
• For f̂(p) = e^{−p}/p, we cannot use the standard result about residues, since f̂(p) does not vanish as |p| → ∞. But we can use the original Bromwich inversion formula to get
f(t) = (1/2πi) ∫_γ (e^{−p}/p) e^{pt} dp = (1/2πi) ∫_γ (1/p) e^{pt₀} dp,
where t₀ = t − 1. Now we can close to the right when t₀ < 0, and to the left when t₀ > 0, picking up the residue from the pole at p = 0. Then we get
f(t) = 0 for t₀ < 0 and f(t) = 1 for t₀ > 0, ie. f(t) = 0 for t < 1 and f(t) = 1 for t > 1, which is H(t − 1).
E. 10-116
<Solve differential equations by Laplace transform> The Laplace trans-
form converts ODEs to algebraic equations, and PDEs to ODEs. We will illustrate
this by examples.
• Consider the differential equation tÿ − tẏ + y = 2 with y(0) = 2 and ẏ(0) = −1. Note that
L(tẏ) = −(d/dp) L(ẏ) = −(d/dp)(pŷ − y(0)) = −pŷ′ − ŷ,
and L(tÿ) is handled similarly. Substituting these in turns the equation into a first-order ODE for ŷ, whose solution is
ŷ = 2/p + A/p²,
with A a constant of integration. Inverting term by term gives y = 2 + At, and the unused initial condition ẏ(0) = −1 fixes A = −1, so y = 2 − t.
E. 10-117
Recall that the convolution of two functions f and g is defined as
(f ∗ g)(t) = ∫_{−∞}^{∞} f(t − t′) g(t′) dt′.
Recall from Methods that the Fourier transforms turn convolutions into products,
and vice versa. We will now prove an analogous result for Laplace transforms.
T. 10-118
<Convolution theorem> The Laplace transform of a convolution is given by
473
474 CHAPTER 11. GEOMETRY
Nevertheless, we will still view the linear isometries as “special” isometries, since
they are more convenient to work with, despite not being special fundamentally.
• For any matrix A and x, y ∈ Rⁿ, we get
⟨Ax, Ay⟩ = (Ax)ᵀ(Ay) = xᵀAᵀAy = ⟨x, AᵀAy⟩.
So A is orthogonal if and only if ⟨Ax, Ay⟩ = ⟨x, y⟩ for all x, y ∈ Rⁿ.
• Note that the inner product can be expressed in terms of the norm by
⟨x, y⟩ = (1/2)(‖x + y‖² − ‖x‖² − ‖y‖²).
So if A preserves norms, then it preserves the inner product, and the converse is obviously true. So A is orthogonal if and only if ‖Ax‖ = ‖x‖ for all x ∈ Rⁿ. Hence matrices are orthogonal if and only if they are isometries.
• More generally, let f (x) = Ax + b. Then d(f (x), f (y)) = kA(x − y)k. So any
f of this form is an isometry if and only if A is orthogonal. This is not too
surprising. What might not be expected is that all isometries are of this form.
T. 11-3
Every isometry f : Rⁿ → Rⁿ is of the form f(x) = Ax + b for A orthogonal and b ∈ Rⁿ.
This shows us two things. Firstly, we see that R_H is indeed an isometry. Secondly, we see that R_H fixes exactly the points of H.
— any isometry S ∈ Isom(Rn ) that fixes the points in some affine hyperplane H is
either the identity or RH . To show this, we first want to translate the plane such
that it becomes a vector subspace. Then we can use our linear algebra magic. For
any a ∈ Rn , we can define the translation by a as Ta (x) = x + a. This is clearly
an isometry.
We pick an arbitrary a ∈ H, and let R = T−a STa ∈ Isom(Rn ). Then R fixes
exactly H 0 = T−a H. Since 0 ∈ H 0 , H 0 is a vector subspace. In particular, if
H = {x : x · u = c}, then by putting c = a · u, we find H 0 = {x : x · u = 0}. To
understand R, we already know it fixes everything in H 0 . So we want to see what
it does to u. Note that since R is an isometry and fixes the origin, it is in fact an
orthogonal map. Hence for any x ∈ H 0 , we get
If the points P and Q are represented by vectors p and q, we consider the perpendicular bisector of the line segment PQ, which is a hyperplane H with equation
x · (p − q) = (1/2)(p + q) · (p − q) = (1/2)(‖p‖² − ‖q‖²).
D. 11-8
• An orientation of a vector space is an equivalence class of bases — let v₁, · · · , vₙ and v₁′, · · · , vₙ′ be two bases and A be the change of basis matrix. We say the two bases are equivalent iff det A > 0. This is an equivalence relation on the bases, and the equivalence classes are the orientations.
E. 11-9
We now want to look at O(3). First focus on the case where A ∈ SO(3), ie.
det A = 1. Then we can compute
This is the most general orientation-preserving isometry of R3 that fixes the origin.
How about the orientation-reversing ones? Suppose det A = −1. Then det(−A) = 1. So in some orthonormal basis, we can express
−A = (1, 0, 0; 0, cos θ, −sin θ; 0, sin θ, cos θ)   ⟹   A = (−1, 0, 0; 0, cos ϕ, −sin ϕ; 0, sin ϕ, cos ϕ),
where ϕ = θ + π and the matrices are written row by row.
D. 11-10
A curve Γ in Rⁿ is a continuous map Γ : [a, b] → Rⁿ. Given such a curve, we can define a dissection D : a = t₀ < t₁ < · · · < t_N = b of [a, b], and set Pᵢ = Γ(tᵢ). We define
S_D = Σᵢ ‖PᵢP_{i+1}‖.
E. 11-11
Here we can think of the curve as the trajectory of a particle moving through time. Our main objective of this section is to define the length of a curve. We might want to define the length as ∫_{a}^{b} ‖Γ′(t)‖ dt, as is familiar from, say, IA Vector Calculus. However, we can't do this, since our definition of a curve does not require Γ to be differentiable. It is merely required to be continuous. Hence we have to define the length in a more roundabout way.
Similar to the definition of the Riemann integral, we consider dissections. Notice that if we add more points to the dissection, then S_D will necessarily increase, by the triangle inequality. So it makes sense to define the length as the supremum ℓ = sup_D S_D. (Figure: the inscribed polygon P₀P₁P₂ · · · P_N approximating the curve.) Alternatively, if we let mesh(D) = maxᵢ(tᵢ − t_{i−1}), then if ℓ exists, we have
ℓ = lim_{mesh(D)→0} S_D.
Note also that by definition, we can write ℓ = inf{ℓ̃ : ℓ̃ ≥ S_D for all D}. The definition by itself isn't too helpful, since there is no nice and easy way to check if the supremum exists. However, differentiability allows us to compute this easily in the expected way.
P. 11-12
If Γ is continuously differentiable (ie. C¹), then the length of Γ is given by
length(Γ) = ∫_{a}^{b} ‖Γ′(t)‖ dt.
To simplify notation, we assume n = 3; however, the proof works in all dimensions. We write Γ(t) = (f₁(t), f₂(t), f₃(t)). For every s ≠ t ∈ [a, b], the mean value theorem tells us, for each i = 1, 2, 3,
(fᵢ(t) − fᵢ(s))/(t − s) = fᵢ′(ξᵢ) for some ξᵢ ∈ (s, t).
Now note that the fᵢ′ are continuous on a closed, bounded interval, and hence uniformly continuous. For all ε > 0, there is some δ > 0 such that whenever |t − s| < δ, we have |fᵢ′(ξᵢ) − fᵢ′(ξ)| < ε/3 for all ξ ∈ (s, t). And thus for any ξ ∈ (s, t), we have
‖(Γ(t) − Γ(s))/(t − s) − Γ′(ξ)‖² = ‖(f₁′(ξ₁) − f₁′(ξ), f₂′(ξ₂) − f₂′(ξ), f₃′(ξ₃) − f₃′(ξ))‖² ≤ ε²/9 + ε²/9 + ε²/9 < ε².
In other words, ‖Γ(t) − Γ(s) − (t − s)Γ′(ξ)‖ ≤ ε(t − s). We relabel t = tᵢ, s = t_{i−1} and ξ = (s + t)/2. Using the triangle inequality, we have
(tᵢ − t_{i−1}) ‖Γ′((tᵢ + t_{i−1})/2)‖ − ε(tᵢ − t_{i−1}) < ‖Γ(tᵢ) − Γ(t_{i−1})‖ < (tᵢ − t_{i−1}) ‖Γ′((tᵢ + t_{i−1})/2)‖ + ε(tᵢ − t_{i−1}).
11.2. SPHERICAL GEOMETRY 479
Summing over i, we get
Σᵢ (tᵢ − t_{i−1}) ‖Γ′((tᵢ + t_{i−1})/2)‖ − ε(b − a) < S_D < Σᵢ (tᵢ − t_{i−1}) ‖Γ′((tᵢ + t_{i−1})/2)‖ + ε(b − a),
which is valid whenever mesh(D) < δ. Since Γ′ is continuous, and hence integrable, we know
Σᵢ (tᵢ − t_{i−1}) ‖Γ′((tᵢ + t_{i−1})/2)‖ → ∫_{a}^{b} ‖Γ′(t)‖ dt as mesh(D) → 0,
and so length(Γ) = lim_{mesh(D)→0} S_D = ∫_{a}^{b} ‖Γ′(t)‖ dt.
This proof is just a careful check that the definition of the integral coincides with
the definition of length.
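The convergence of the chord sums S_D to ∫ ‖Γ′(t)‖ dt can be observed numerically (a sketch, not from the notes; names are ours). For the helix Γ(t) = (cos t, sin t, t) on [0, 2π] we have ‖Γ′(t)‖ = √2, so the length is 2√2 π:

```python
import math

def gamma(t):
    # the helix: |Γ'(t)| = √2 for all t
    return (math.cos(t), math.sin(t), t)

def S_D(N):
    # chord-length sum S_D over the uniform dissection into N intervals
    pts = [gamma(2 * math.pi * i / N) for i in range(N + 1)]
    return sum(math.dist(pts[i], pts[i + 1]) for i in range(N))

exact = 2 * math.sqrt(2) * math.pi
for N in (10, 100, 1000):
    print(N, exact - S_D(N))   # positive and shrinking: S_D increases to the length
```

Note that every S_D is below the true length (chords are shorter than arcs), matching ℓ = sup_D S_D.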
D. 11-13
• A great circle (in S 2 ) is S 2 ∩ (a plane through O). We also call these (spherical)
lines .
• Given P, Q ∈ S², the distance d(P, Q) is the length of the shorter of the two (spherical) line segments (ie. arcs) PQ along the respective great circle. When P and Q are antipodal, there are infinitely many line segments between them, all of the same length, and the distance is π.
E. 11-14
When we live on the sphere, we can no longer use regular lines in R³, since these do not lie fully on the sphere. Instead, we have a concept of a spherical line, also known as a great circle. We will also call these geodesics, which is a much more general term defined on any surface; the geodesics happen to be the great circles on S².
In R³, we know that any three points that are not collinear determine a unique plane through them. Hence given any two non-antipodal points P, Q ∈ S², there exists a unique spherical line through P and Q.
Note that by the definition of the radian, d(P, Q) is the angle between the vectors OP and OQ, which is also cos⁻¹(P · Q), where P and Q denote the position vectors OP and OQ.
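This formula makes the spherical metric easy to experiment with (a sketch with our own helper names); for instance, the triangle inequality proved below can be spot-checked on random points:

```python
import math, random

def sph_dist(P, Q):
    # d(P, Q) = arccos(P · Q) for unit vectors P, Q
    dot = sum(p * q for p, q in zip(P, Q))
    return math.acos(max(-1.0, min(1.0, dot)))   # clamp against rounding

def rand_pt():
    # a uniformly random point of S^2 (normalised Gaussian vector)
    v = [random.gauss(0, 1) for _ in range(3)]
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

random.seed(0)
for _ in range(1000):
    P, Q, R = rand_pt(), rand_pt(), rand_pt()
    assert sph_dist(P, R) <= sph_dist(P, Q) + sph_dist(Q, R) + 1e-12
print("d(P,R) <= d(P,Q) + d(Q,R) on 1000 random triples")
```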
C. 11-15
<Triangles on a sphere> One main object of study is spherical triangles — they are defined just like Euclidean triangles, with AB, AC, BC line segments on S² of length < π. The restriction on the lengths is merely for convenience. (Figure: triangle ABC with sides a, b, c opposite the vertices and angles α, β, γ at A, B, C.) We will take advantage of the fact that the sphere sits in R³. We set
T. 11-16
1. <Spherical cosine rule> sin a sin b cos γ = cos c − cos a cos b.
2. <Spherical sine rule> sin a / sin α = sin b / sin β = sin c / sin γ.
Rearranging gives c2 = a2 +b2 −2ab cos γ +O(k(a, b, c)k3 ). The sine rule transforms
similarly as well. This is what we would expect, since making a, b, c small is
equivalent to zooming into the surface of the sphere, and it looks more and more
like flat space.
P. 11-17
<Triangle inequality> For any P, Q, R ∈ S², we have d(P, Q) + d(Q, R) ≥ d(P, R), with equality if and only if Q lies in the line segment PR of shortest length.
Form a spherical triangle with the three points, so that P, Q, R are our A, C, B respectively. If γ = π, it then follows that Q is in the line segment given by PR, so c = a + b. If γ ≠ π, then cos γ > −1, and the spherical cosine rule gives cos c > cos a cos b − sin a sin b = cos(a + b). Since cos is decreasing on [0, π], we know c < a + b. The only case left to check is d(P, R) = π, since we do not allow our triangles to have side length π. But in this case P and R are antipodal points, any Q lies in a line through PR, and equality holds.
Thus, we find that (S 2 , d) is a metric space.
On Rn , straight lines are curves that minimize distance. Since we are calling
spherical lines lines, we would expect them to minimize distance as well. This is
in fact true.
P. 11-18
The group of isometries of S², Isom(S²), is isomorphic to O(3, R).
From this we see that g is an isometry of R3 which fixes the origin. Therefore
Isom(S 2 ) is naturally identified with the group O(3, R).
We define a reflection of S 2 in a spherical line/great circle H ∩ S 2 where H is
a plane through the origin, to be the restriction to S 2 of the isometry RH of
R3 , the reflection of R3 in the hyperplane H. It therefore follows from results in
the Euclidean case that any element of Isom(S²) is the composite of at most three reflections of S². Note in passing that the exact same result holds for the isometries of the Euclidean plane R². It follows that Isom(S²) has an index two subgroup corresponding to SO(3) ⊆ O(3); these isometries are just the rotations of S², and are the composite of two reflections. Since any element of O(3) is of the form ±A with A ∈ SO(3), it follows that the group O(3) is isomorphic to SO(3) × C₂.
P. 11-19
Given a curve Γ on S 2 ⊆ R3 from P to Q, we have ` = length(Γ) ≥ d(P, Q).
Moreover, if ` = d(P, Q), then the image of Γ is a spherical line segment P Q.
Let Γ : [0, 1] → S² and ℓ = length(Γ). Then for any dissection D of [0, 1], say 0 = t₀ < · · · < t_N = 1, write Pᵢ = Γ(tᵢ). We define
S̃_D := Σᵢ d(P_{i−1}, Pᵢ) > S_D = Σᵢ |P_{i−1}Pᵢ|,
where the length in the right-hand expression is the distance in Euclidean 3-space.
where the length in the right hand expression is the distance in Euclidean 3-space.
Now suppose ℓ < d(P, Q). Then there is some ε > 0 such that ℓ(1 + ε) < d(P, Q). Recall from basic trigonometry that if θ > 0, then sin θ < θ, and (sin θ)/θ → 1 as θ → 0. Thus for small θ we have θ ≤ (1 + ε) sin θ. What we really want is the double of this: 2θ ≤ (1 + ε) 2 sin θ. This is useful since, for two points of S² subtending an angle 2θ at the centre, 2θ is the spherical distance and 2 sin θ is the Euclidean chord length. (Figure: a chord of length 2 sin θ subtending an arc of length 2θ.) This means for P, Q sufficiently close, we have d(P, Q) ≤ (1 + ε)|PQ|. From Analysis II, we know Γ is uniformly continuous on [0, 1]. So we can choose D such that
d(P_{i−1}, Pᵢ) ≤ (1 + ε)|P_{i−1}Pᵢ| for all i.
Then S̃_D ≤ (1 + ε)S_D ≤ (1 + ε)ℓ < d(P, Q), since S_D ≤ ℓ. However, by the triangle inequality S̃_D ≥ d(P, Q). This is a contradiction. Hence we must have ℓ ≥ d(P, Q).
Suppose now ` = d(P, Q) for some Γ : [0, 1] → S, ` = length(Γ). Then for every
t ∈ [0, 1], we have
d(P, Q) = ` = length Γ|[0,t] + length Γ|[t,1] ≥ d(P, Γ(t)) + d(Γ(t), Q) ≥ d(P, Q).
Hence we must have equality all along the way, that is d(P, Q) = d(P, Γ(t)) +
d(Γ(t), Q) for all Γ(t). However, this is possible only if Γ(t) lies on the shorter
spherical line segment P Q, as we have previously proved.
So if Γ is a curve of minimal length from P to Q in S 2 , then Γ is a spherical
line segment. Further, from the proof of this proposition, we know length Γ|[0,t] =
d(P, Γ(t)) for all t. So the parametrisation of Γ is monotonic. Such a Γ is called a
minimizing geodesic.
Finally, we get to an important theorem whose proof involves complicated pictures: the Gauss-Bonnet theorem. The Gauss-Bonnet theorem is in fact a much more general theorem; here we specialize to the case of the sphere. Later, when doing hyperbolic geometry, we will prove the hyperbolic version of the Gauss-Bonnet theorem. Near the end of the course, when we have developed sufficient machinery, we will be able to state the Gauss-Bonnet theorem in its full glory. However, we will not be able to prove the general version.
P. 11-20
<Gauss-Bonnet theorem for S 2 > If ∆ is a spherical triangle with angles
α, β, γ, then area(∆) = (α + β + γ) − π.
This follows directly from cutting the polygon up into the constituent triangles.
This is very different from Euclidean space. On R2 , we always have α + β + γ = π.
Not only is this false on S 2 , but by measuring the difference, we can tell the area
of the triangle. In fact, we can identify triangles up to congruence just by knowing
the three angles.
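The theorem is easy to check on the octant triangle with vertices e₁, e₂, e₃, which has all three angles π/2 and hence area 3(π/2) − π = π/2, one eighth of the sphere. A minimal numerical sketch (helper names are ours), computing each angle as the angle between the tangent directions of the two arcs at a vertex:

```python
import math

def angle_at(A, B, C):
    # spherical angle of triangle ABC at vertex A: the angle between the
    # initial tangents of the arcs AB and AC
    def tangent(P, Q):
        d = sum(p * q for p, q in zip(P, Q))
        t = [q - d * p for p, q in zip(P, Q)]   # component of Q orthogonal to P
        n = math.sqrt(sum(x * x for x in t))
        return [x / n for x in t]
    u, v = tangent(A, B), tangent(A, C)
    return math.acos(max(-1.0, min(1.0, sum(x * y for x, y in zip(u, v)))))

A, B, C = (1, 0, 0), (0, 1, 0), (0, 0, 1)   # one octant of the sphere
angles = [angle_at(A, B, C), angle_at(B, C, A), angle_at(C, A, B)]
area = sum(angles) - math.pi
print(area, 4 * math.pi / 8)   # both π/2
```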
Instead of taking the whole GL(2, C) and quotienting out multiples of the identity, we can instead start with SL(2, C). Again, A₁, A₂ ∈ SL(2, C) define the same map if and only if A₁ = λA₂ for some λ. What are the possible values of λ? By definition of the special linear group, we must have 1 = det(λA) = λ² det A = λ². So λ = ±1. So each Möbius map is represented by two matrices, A and −A, and we get
G ≅ PSL(2, C) = SL(2, C)/{±1}.
Now let’s think about the sphere. On S 2 , the rotations SO(3) act as isometries. Recall
the full isometry group of S 2 is O(3). We would like to show that rotations of S 2 corre-
spond to Möbius transformations coming from the subgroup SU(2) ≤ GL(2, C).
C. 11-21
<Stereographic projection> We first find a way to identify S 2 with C∞ . We
use coordinates ζ ∈ C∞ . We define the stereographic projection π : S 2 → C∞ by
π(P ) = (line P N ) ∩ {z = 0} which is well defined except where P = N , in which
case we define π(N ) = ∞.
(Figure: stereographic projection; the line through the north pole N and the point P meets the plane {z = 0} at π(P).)
L. 11-22
If π′ : S² → C∞ denotes the stereographic projection from the South Pole, then π′(P) = 1/\overline{π(P)}.
Let P = (x, y, z). We know π(x, y, z) = (x + iy)/(1 − z). We have
π′(x, y, z) = (x + iy)/(1 + z),
since we have just flipped the z axis around. So we have
π(P) \overline{π′(P)} = ((x + iy)(x − iy))/((1 − z)(1 + z)) = (x² + y²)/(1 − z²) = 1,
since x² + y² + z² = 1 on S². Hence π′(P) = 1/\overline{π(P)}.
1. Consider r(ẑ, θ), the rotation about the z axis by θ. This corresponds to the Möbius map ζ ↦ e^{iθ}ζ, which is given by the unitary matrix
(e^{iθ/2}, 0; 0, e^{−iθ/2}).
2. The rotation r(ŷ, π/2) corresponds to the Möbius map ζ ↦ (ζ − 1)/(ζ + 1). Indeed, writing ζ = π(x, y, z) = (x + iy)/(1 − z), we compute
(ζ − 1)/(ζ + 1) = (x + iy − 1 + z)/(x + iy + 1 − z) = (x − 1 + z + iy)/(x + 1 − (z − iy)) = ((z + iy)(x − 1 + z + iy))/((x + 1)(z + iy) − (z² + y²)) = ((z + iy)(x − 1 + z + iy))/((x + 1)(z + iy) + (x² − 1)) = ((z + iy)(x − 1 + z + iy))/((x + 1)(z + iy + x − 1)) = (z + iy)/(x + 1),
which is π(z, y, −x), the projection of the rotated point.
3. We claim that SO(3) is generated by r(ŷ, π2 ) and r(ẑ, θ) for 0 ≤ θ < 2π. To
show this, we first observe that r(x̂, ϕ) = r(ŷ, π2 )r(ẑ, ϕ)r(ŷ, − π2 ). Note that
we read the composition from right to left. You can convince yourself this is
true by taking a physical sphere and try rotating. To prove it formally, we can
just multiply the matrices out.
Next, observe that for v ∈ S 2 ⊆ R3 , there are some angles ϕ, ψ such that
g = r(ẑ, ψ)r(x̂, ϕ) maps v to x̂. We can do so by first picking r(x̂, ϕ) to rotate
v into the (x, y)-plane. Then we rotate about the z-axis to send it to x̂. Then
for any θ, we have r(v, θ) = g −1 r(x̂, θ)g, and our claim follows by composition.
4. Thus, via the stereographic projection, every rotation of S 2 corresponds to
products of Möbius transformations of C∞ with matrices in SU(2).
The key of the proof is step 3. Apart from enabling us to perform the proof, it
exemplifies a useful technique in geometry — we know how to rotate arbitrary
things in the z axis. When we want to rotate things about the x axis instead,
we first rotate the sphere to move the x axis to where the z axis used to be, do
those rotations, and then rotate it back. In general, we can use some isometries
or rotations to move what we want to do to a convenient location.
T. 11-24
The group of rotations SO(3) acting on S 2 corresponds precisely with the subgroup
PSU(2) = SU(2)/{±1} of Möbius transformations acting on C∞ .
In particular, there is a surjective 2-to-1 homomorphism SU(2) → PSU(2) ≅ SO(3).
If we treat these groups as topological spaces, this map does something funny. Suppose we start with a (non-closed) path from I to −I in SU(2). Applying the map, we get a closed loop from I to I in SO(3). Hence, in SO(3), loops behave slightly weirdly: if we go around this loop in SO(3), we don't really get back to the same place; instead, we have actually moved from I to −I in SU(2). It takes two full loops to actually get back to I. In physics, this corresponds to the idea of spinors.
We can also understand this topologically as follows: since an element of SU(2) is determined by two complex numbers a, b ∈ C with |a|² + |b|² = 1, we can view SU(2) as the three-sphere S³ ⊆ C² ≅ R⁴. A nice property of S³ is that it is simply connected, in that any loop in S³ can be shrunk to a point. On the other hand, SO(3) is not simply connected. We have just constructed a loop in SO(3) by mapping the path from I to −I in SU(2). We cannot deform this loop until it just sits at a single point, since if we lift it back up to SU(2), it still has to move from I to −I.
11.3. TRIANGULATIONS AND THE EULER NUMBER 487
The neat thing is that in some sense, S³ ≅ SU(2) is just "two copies" of SO(3). By duplicating SO(3), we have produced SU(2), a simply connected space. Thus we say SU(2) is a universal cover of SO(3). We've just been waffling about spaces and loops, and throwing around terms we haven't defined properly. These vague notions will be made precise in Algebraic Topology.
(x1 , y1 ) ∼ (x2 , y2 ) ⇔ x1 − x2 , y1 − y2 ∈ Z.
Now given any point in the torus represented by P + Z², we can find a square Q such that P ∈ Q̊. Then f : Q̊ → T restricted to a small enough open disk about P is an isometry. Thus we say d is a locally Euclidean metric.
One can also think of the torus T as the surface of a doughnut,
“embedded” in Euclidean space R3 . Given this, it is natural
to define the distance between two points to be the length of
the shortest curve between them on the torus. However, this
distance function is not the same as what we have here. So
it is misleading to think of our locally Euclidean torus as a
“doughnut”.
With one more example in our toolkit, we can start doing what we really want to
do. The idea of a triangulation is to cut a space X up into many smaller triangles,
since we like triangles. Note that a spherical triangle is in fact a topological
triangle, using the radial projection to the plane R2 from the center of the sphere.
These notions are useful only if the space X is “two dimensional” — there is no
way we can triangulate, say R3 , or a line. We can generalize triangulation to allow
higher dimensional “triangles”, namely topological tetrahedrons, and in general,
n-simplices, and make an analogous definition of triangulation. However, we will
not bother ourselves with this.
T. 11-27
The Euler number e is independent of the choice of triangulation.
This is one important fact about triangulations from algebraic topology, which we
will state without proof. So the Euler number e = e(X) is a property of the space
X itself, not a particular triangulation.
E. 11-28
1. Consider the following triangulation of the sphere as shown
on the diagram. This has 8 faces, 12 edges and 6 vertices.
So e = 2.
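This triangulation is the octahedral one, and its count can be reproduced combinatorially (a sketch; the encoding of faces by sign patterns is ours):

```python
from itertools import product

# Triangulate S^2 by the three coordinate planes: vertices are ±e_i,
# there is one triangular face per octant, and an edge between every
# non-antipodal pair of vertices.
vertices = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
faces = list(product((1, -1), repeat=3))           # 8 octants, by sign pattern
edges = {frozenset((u, v)) for u in vertices for v in vertices
         if u != v and u != tuple(-c for c in v)}  # 12 non-antipodal pairs

V, E, F = len(vertices), len(edges), len(faces)
print(V, E, F, V - E + F)   # 6 12 8 2
```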
In both cases, we did not cut up our space with funny, squiggly lines. Instead, we used "straight" lines (curves that minimise the distance between two points). These triangles are known as geodesic triangles. In particular, we used spherical triangles in S² and Euclidean triangles in Q̊. Triangulations made of geodesic triangles are rather nice, and we will prove a result about them. In general, however, a topological triangle can look like any simple closed curve (even a circle); it is just that there are three distinguished points on it that we consider to be vertices.
11.4. HYPERBOLIC GEOMETRY 489
P. 11-29
Every geodesic triangulation of S² has e = 2, and every geodesic triangulation of T has e = 0.
1. For the sphere, each face is a geodesic triangle, and by Gauss-Bonnet its angles sum to τᵢ = π + area of the face. Summing over all faces, the areas add up to 4π and the angles around each vertex add up to 2π, so 2πV = πF + 4π, ie. 2V = F + 4. Since each face has three edges and each edge lies in two faces, 3F = 2E. So F − E + V = F − (3/2)F + (F + 4)/2 = 2.
2. For the torus, we have τᵢ = π for every face in Q̊. Summing over all faces gives 2πV = πF, so F = 2V, and again 3F = 2E. Hence 2(V − E + F) = 2V − 2E + 2F = F − 2E + 2F = 3F − 2E = 0, as required.
Of course, we know this is true for any triangulation, but it is difficult to prove
that without algebraic topology.
Note that in the definition of triangulation, we decomposed X into topological
triangles. We can also use decompositions by topological polygons, but they are
slightly more complicated, since we have to worry about convexity. However, apart
from this, everything works out well. In particular, the previous proposition also
holds, and we have Euler’s formula for S 2 : V − E + F = 2 for any polygonal
decomposition of S 2 .
E. 11-31
Suppose we have a holomorphic (analytic) function of a complex variable f : U ⊆ C → C. Say f′(z) = a + ib and w = h₁ + ih₂; then we have
f′(z)w = ah₁ − bh₂ + i(ah₂ + bh₁).
If we identify R² = C, then f : U ⊆ R² → R² has a derivative df_z : R² → R² given by the matrix
(a, −b; b, a).
D. 11-32
We use coordinates (u, v) ∈ R². We let V ⊆ R² be open. Then a Riemannian metric on V is defined by giving C^∞ functions E, F, G : V → R such that
(E(P), F(P); F(P), G(P))
is a positive definite matrix for all P ∈ V. Alternatively, this is a smooth function that gives a 2 × 2 symmetric positive definite matrix, ie. an inner product ⟨ · , · ⟩_P, for each point in V.
E. 11-33
Note that if we pick E = G = 1 and F = 0, then this is just the standard Euclidean inner product. Note also that by definition, if e₁, e₂ are the standard basis, then
⟨e₁, e₁⟩_P = E(P),   ⟨e₁, e₂⟩_P = F(P),   ⟨e₂, e₂⟩_P = G(P).
The basic idea of a Riemannian metric is not too unfamiliar. Presumably, we
have all seen maps of the Earth, where we try to draw the spherical Earth on
a piece of paper, ie. a subset of R2 . However, this does not behave like R2 .
You cannot measure distances on Earth by placing a ruler on the map, since
distances are distorted. Instead, you have to find the coordinates of the points (eg.
the longitude and latitude), and then plug them into some complicated formula.
Similarly, straight lines on the map are not really straight (spherical) lines on Earth. We really should not think of the Earth as a subset of R². All we have done is to "force" the Earth to live in R² to get a convenient way of depicting it, as well as a convenient system of labelling points (in many map projections, the x and y axes are the longitude and latitude).
This is the idea of a Riemannian metric. To describe some complicated surface, we take a subset U of R², and define a new way of measuring distances, angles and areas on U. All this information is packed into an entity known as the Riemannian metric. As mentioned, we should not imagine V as a subset of R². Instead, we should think of it as an abstract two-dimensional surface, with some coordinate system given by a subset of R². However, this coordinate system is just a convenient way of labelling points. It does not represent any notion of distance. For example, (0, 1) need not be closer to (0, 2) than to (7, 0). These are just abstract labels.
With this in mind, V does not have any intrinsic notion of distances, angles and areas. However, we do want these notions. We can certainly write down things like the difference of two points, or even compute the derivative of a function. However, the numbers we get are not meaningful, since we can easily use a different coordinate system (eg. by scaling the axes) and get a different number. They have to be interpreted with the Riemannian metric. This tells us how to measure these things, via an inner product "that varies with space". This variation in space is not an oddity arising from us not being able to make up our minds. It arises because we have "forced" our space to lie in R². Inside V, going from (0, 1) to (0, 2) might be very different from going from (5, 5) to (6, 5), since coordinates don't mean anything. Hence our inner product needs to measure "going from (0, 1) to (0, 2)" differently from "going from (5, 5) to (6, 5)", and must vary with space.
We’ll soon come to defining how this inner product gives rise to the notion of
distance and similar stuff. Before that, we want to understand what we can put
into the inner product h · , · iP . Obviously these would be vectors in R2 , but where
do these vectors come from? What are they supposed to represent? The answer
is “directions” (more formally, tangent vectors). For example, he1 , e1 iP will tell
us how far we actually are going if we move in the direction of e1 from P . Note
that we say “move in the direction of e1 ”, not “move by e1 ”. We really should
read
p this as “if we move by he1 for some small h, then the distance covered is
h he1 , e1 iP ”. This statement is to be interpreted along the same lines as “if we
vary x by some small h, then the value of f will vary by f 0 (x)h”. Notice how
the inner product allows us to translate a length in R² (namely ‖he₁‖_eucl = h)
into the actual length in V . What we needed for this is just the norm induced by
the inner product. Since what we have is the whole inner product, we in fact can
define more interesting things such as areas and angles. We will formalize these
ideas very soon, after getting some more notation out of the way.
Often, instead of specifying the three functions separately, we write the metric as
E du2 + 2F du dv + G dv 2 .
This notation has some mathematical meaning. We can view the coordinates as
smooth functions u : V → R, v : V → R. Since they are smooth, they have
derivatives. They are linear maps
du_P : R² → R, (h₁, h₂) ↦ h₁   and   dv_P : R² → R, (h₁, h₂) ↦ h₂.
These formulae are valid for all P ∈ V . So we just write du and dv instead.
Since they are maps R2 → R, we can view them as vectors in the dual space,
du, dv ∈ (R2 )∗ . Moreover, they form a basis for the dual space. In particular,
they are the dual basis to the standard basis e1 , e2 of R2 . Then we can consider
du², du dv and dv² as bilinear forms on R². For example, du²((h₁, h₂), (k₁, k₂)) = h₁k₁.
E. 11-35
One can show that the distance as defined is indeed a metric (assuming V is path-
connected). In the area formula, what we are integrating is the square root of the
determinant of the metric; this determinant (EG − F²) is also known as the Gram determinant.
To understand the definition of isometry: consider a point P ∈ V , at which we can
go in, say, the two directions x and y. In Ṽ , this corresponds to going in
the two directions dϕ_P (x) and dϕ_P (y) from the point ϕ(P ). We want an isometry
to preserve the inner product, hence we want ⟨x, y⟩_P = ⟨dϕ_P (x), dϕ_P (y)⟩~_{ϕ(P)}.
where the second equality is the change of variable formula from vector calculus,
and the third equality is obtained by taking the determinant of (∗).
D. 11-38
• The (Poincaré) disk model for the hyperbolic plane is given by the unit disk
D ⊆ C ≅ R², where D = {ζ ∈ C : |ζ| < 1}, and a Riemannian metric on this disk
given by

4(du² + dv²)/(1 − u² − v²)² = 4|dζ|²/(1 − |ζ|²)²   where ζ = u + iv.   (∗)
• The upper half-plane is H = {z ∈ C : =(z) > 0}. The upper half-plane model
of the hyperbolic plane is the upper half-plane H with the Riemannian metric
(dx² + dy²)/y².
E. 11-39
Hyperbolic geometry is another possible type of geometry after Euclidean and
Spherical geometry. We provide two models of the hyperbolic plane. Each model
has its own strengths, and often proving something is significantly easier in one
model than the other.
Note that the Poincaré disk model is similar to our previous metric for the sphere,
but we have 1 − u2 − v 2 instead of 1 + u2 + v 2 . To interpret the term |dζ|2 ,
we can either formally set |dζ|2 = du2 + dv 2 , or interpret it as the derivative
dζ = du + idv : C → C.
We see that (∗) is a scaling of the standard Euclidean metric by a factor depending
on the polar radius r = |ζ|. The distances are scaled by 2/(1 − r²), while the
areas are scaled by 4/(1 − r2 )2 . Note, however, that the angles in the hyperbolic
disk are the same as that in R2 . This is in general true for metrics that are just
scaled versions of the Euclidean metric.
What is the appropriate Riemannian metric to put on the upper half plane? We
know D bijects to H via the Möbius transformation
ϕ : ζ ∈ D ↦ i(1 + ζ)/(1 − ζ) ∈ H,

so that z = i(1 + ζ)/(1 − ζ) and ζ = (z − i)/(z + i).
We can compute
dζ/dz = 1/(z + i) − (z − i)/(z + i)² = 2i/(z + i)².
This is what we would get if we started with a Euclidean metric. If we start with the
hyperbolic metric on D, we get an additional scaling factor. We can do some
computations to get
1 − |ζ|² = 1 − |z − i|²/|z + i|²,   and hence   1/(1 − |ζ|²) = |z + i|²/(|z + i|² − |z − i|²) = |z + i|²/(4 Im z).
Putting all these together, the metric on D given by 4|dζ|²/(1 − |ζ|²)² corresponds to

4 · (4/|z + i|⁴) · (|z + i|²/(4 Im z))² · |dz|² = |dz|²/(Im z)² = (dx² + dy²)/y²,
which is the upper half-plane model as expected. The lengths on H are scaled
(from the Euclidean one) by 1/y, while the areas are scaled by 1/y 2 . Again, the
angles are the same.
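This change-of-variable computation is easy to check numerically. The following sketch (our own illustration, not from the notes; the function names are ours) pulls the half-plane metric |dz|²/y² back along ϕ and compares it with the disk metric:

```python
def disk_metric_scale(zeta):
    # conformal factor of the disk metric 4|dζ|^2 / (1 - |ζ|^2)^2
    return 4 / (1 - abs(zeta) ** 2) ** 2

def phi(zeta):
    # the Möbius map D → H from the notes: ζ ↦ i(1 + ζ)/(1 − ζ)
    return 1j * (1 + zeta) / (1 - zeta)

def phi_prime(zeta):
    # dφ/dζ = 2i/(1 − ζ)^2
    return 2j / (1 - zeta) ** 2

def pulled_back_scale(zeta):
    # pull-back of |dz|^2 / y^2 along φ: |φ'(ζ)|^2 / Im(φ(ζ))^2
    z = phi(zeta)
    return abs(phi_prime(zeta)) ** 2 / z.imag ** 2

for zeta in [0.3 + 0.4j, -0.5j, 0.1 - 0.7j]:
    assert abs(disk_metric_scale(zeta) - pulled_back_scale(zeta)) < 1e-9
print("disk and half-plane metrics correspond under phi")
```
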
Note that we did not have to go through so much mess in order to define the
sphere. This is since we can easily “embed” the surface of the sphere in R3 .
However, there is no easy surface in R3 that gives us the hyperbolic plane. As we
don’t have an actual prototype, we need to rely on the more abstract data of a
Riemannian metric in order to work with hyperbolic geometry.
C. 11-40
<Möbius maps that fix H> Consider the following group of Möbius maps:
PSL(2, R) = { z ↦ (az + b)/(cz + d) : a, b, c, d ∈ R and ad − bc = 1 }.
Note that the coefficients have to be real, not complex. It is easy to check that
these are precisely the Möbius maps that send R ∪ {∞} to R ∪ {∞} and send H
to H. A Möbius map with real coefficients may always be represented by a real
matrix with determinant ±1; the condition that the determinant is positive is just
saying that the upper half-plane is sent to itself (and not the lower half plane).
We will show that these maps are in fact isometries of H.
P. 11-41
The elements of PSL(2, R) are isometries of H.
One can show that PSL(2, R) is generated by maps of the three forms z ↦ z + a
(with a ∈ R), z ↦ bz (with b > 0), and z ↦ −1/z. So it suffices to show each of
these preserves the metric |dz|²/y², where z = x + iy. The first two are
straightforward to see, by plugging them into the formula and noticing that the
metric does not change. We now look at the last one, given by z ↦ −1/z. Its
derivative is d(−1/z)/dz = 1/z², so

d(−1/z) = dz/z²   and so   |d(−1/z)|² = |dz|²/|z|⁴.
We also have

Im(−1/z) = −Im(z̄)/|z|² = Im z/|z|².

So the metric transforms to (|dz|²/|z|⁴) / (Im z/|z|²)² = |dz|²/(Im z)², ie. it is preserved.
It suffices to show that for each hyperbolic line `, there is some g ∈ PSL(2, R) such
that g(`) = L+ . This is clear when ` is a vertical half-line, since we can just apply
a horizontal translation. If it is a semicircle, suppose it has end-points s < t ∈ R.
Then consider

g(z) = (z − t)/(z − s).
This has determinant −s+t > 0. So g ∈ PSL(2, R) (after scaling g). Then g(t) = 0
and g(s) = ∞. Then we must have g(`) = L+ , since g(`) is a hyperbolic line, and
the only hyperbolic lines passing through ∞ are the vertical half-lines.
Note that we can achieve g(s) = 0 and g(t) = ∞ by composing with − z1 . Also, for
any P ∈ ` not on the endpoints, we can construct a g such that g(P ) = i ∈ L+ , by
composing with z 7→ az. So the isometries act transitively on pairs (`, P ), where
` is a hyperbolic line and P ∈ `.
Note that PSL(2, R) preserves hyperbolic distances. Similar to Euclidean space
and the sphere, we show these lines minimize distance.
P. 11-45
1. If γ : [0, 1] → H is a piecewise C 1 -smooth curve with γ(0) = z1 and γ(1) = z2 ,
then length(γ) ≥ ρ(z1 , z2 ), with equality iff γ is a monotonic parametrisation
of [z1 , z2 ] ⊆ `, where ` is the hyperbolic line through z1 and z2 .
2. <Triangle inequality> Given three points z1 , z2 , z3 ∈ H, we have ρ(z1 , z3 ) ≤
ρ(z1 , z2 ) + ρ(z2 , z3 ) with equality if and only if z2 lies between z1 and z3 (on the hyperbolic line segment joining them).
L. 11-47
Let G be the set of isometries of the hyperbolic disk. Then
1. Rotations z 7→ eiθ z (for θ ∈ R) are elements of G.
2. If a ∈ D, then g(z) = (z − a)/(1 − āz) is in G.
1. This is clearly an isometry, since it is a linear map preserving |z| and |dz|,
and hence also the metric

4|dz|²/(1 − |z|²)².
2. First, we need to check this indeed maps D to itself. To do this, we first make
sure it sends {|z| = 1} to itself. If |z| = 1, then 1 − āz = z(z̄ − ā), so
|1 − āz| = |z̄ − ā| = |z − a|, and hence |g(z)| = 1.
By the lemma above, we can rotate the hyperbolic disk so that reiθ is rotated to
r. So ρ(0, reiθ ) = ρ(0, r). We can evaluate this by performing the integral
ρ(0, r) = ∫₀ʳ 2/(1 − t²) dt = 2 tanh⁻¹ r.
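As a sanity check, one can integrate the radial metric numerically and compare with the closed form 2 tanh⁻¹ r (an illustrative sketch of our own):

```python
import math

def hyperbolic_radial_distance(r, n=100000):
    # midpoint-rule approximation of the integral of 2/(1 - t^2) over [0, r]
    h = r / n
    return sum(2 / (1 - (h * (i + 0.5)) ** 2) for i in range(n)) * h

for r in [0.1, 0.5, 0.9]:
    assert abs(hyperbolic_radial_distance(r) - 2 * math.atanh(r)) < 1e-6
print("rho(0, r) = 2 atanh(r) confirmed numerically")
```
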
For the general case, we apply the Möbius transformation
g(z) = (z − z₁)/(1 − z̄₁z).
Then we have

g(z₁) = 0   and   g(z₂) = (z₂ − z₁)/(1 − z̄₁z₂) = |(z₁ − z₂)/(1 − z̄₁z₂)| e^{iθ} for some θ ∈ R,

so ρ(z₁, z₂) = ρ(g(z₁), g(z₂)) = 2 tanh⁻¹ |(z₁ − z₂)/(1 − z̄₁z₂)|.
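Since the maps used above are isometries, the closed formula must be invariant under rotations and under the maps g(z) = (z − a)/(1 − āz). A quick numeric check (our own illustration):

```python
import cmath, math

def rho(z1, z2):
    # hyperbolic distance in the disk model, via the closed formula above
    return 2 * math.atanh(abs((z1 - z2) / (1 - z1.conjugate() * z2)))

def g(a, z):
    # disk automorphism z ↦ (z - a)/(1 - conj(a) z)
    return (z - a) / (1 - a.conjugate() * z)

z1, z2, a = 0.3 + 0.2j, -0.1 + 0.5j, 0.4 - 0.3j
assert abs(rho(z1, z2) - rho(g(a, z1), g(a, z2))) < 1e-9
w = cmath.exp(0.7j)   # a rotation
assert abs(rho(z1, z2) - rho(w * z1, w * z2)) < 1e-9
print("distance formula is isometry-invariant")
```
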
Again, we exploited the idea of performing the calculation in an easy case, and
then using isometries to move everything else to the easy case. In general, when we
have a “distinguished” point in the hyperbolic plane, it is often more convenient
to use the disk model, moving the distinguished point to 0 by an isometry.
P. 11-49
For every point P and hyperbolic line `, with P 6∈ `, there is a unique hyperbolic
line `0 with P ∈ `0 such that `0 meets ` orthogonally, say at Q. Moreover ρ(P, Q) ≤
ρ(P, Q̃) for all Q̃ ∈ `, ie. Q minimises the distance P to `.
This is a familiar fact from Euclidean geometry. To prove this, we again apply the
trick of letting P = 0. Wlog, assume P = 0 ∈ D. Note that a hyperbolic line in
D (that is not a diameter) is a Euclidean circle. So it has a center, say C. Since
any line through P is a diameter, there is clearly only one line that intersects `
perpendicularly.
[Diagram: the disk D with hyperbolic line ` (a Euclidean circle with centre C), and the perpendicular line `′ through P meeting ` at Q.]
It is also clear that P Q minimizes the Euclidean distance between P and `. While
this is not the same as the hyperbolic distance, since hyperbolic lines through P
are diameters, having a larger hyperbolic distance is equivalent to having a higher
Euclidean distance. So this indeed minimizes the distance.
L. 11-50
<Hyperbolic reflection> Suppose g is an isometry of the hyperbolic half-plane
H and g fixes every point in L+ = {iy : y ∈ R+ }. Then g is either the identity or
g(z) = −z̄, ie. it is a reflection in the vertical axis L+ .
Let P ∈ H \ L+ , and let `′ be the hyperbolic line through P meeting L+
orthogonally, say at Q. Now note that g(`′) is also a line meeting L+ perpendicularly at Q, since g fixes
L+ and preserves angles. So we must have g(`′) = `′. Then in particular g(P ) ∈ `′.
So we must have g(P ) = P or g(P ) = P ′, where P ′ is the image of P under reflection
in L+ .
Now it suffices to prove that if g(P ) = P for any one such P , then g must be
the identity (while if g(P ) = P ′ for all P , then g is given by g(z) = −z̄).
So suppose g(P ) = P ; wlog P ∈ H + , where H + = {z ∈ H : Re z > 0}, and let
A ∈ H + . Now if g(A) ≠ A, then g(A) = A′ , the reflection of A in L+ . Letting B
be the point where the hyperbolic segment [A′ , P ] crosses L+ , we get

ρ(A′ , P ) = ρ(A′ , B) + ρ(B, P ) = ρ(A, B) + ρ(B, P ) > ρ(A, P ),

which contradicts ρ(A′ , P ) = ρ(g(A), g(P )) = ρ(A, P ). So g fixes H + pointwise,
and similarly the other half-plane, ie. g is the identity.
This time to study reflections, we work in the upper half-plane model, since we
have a favorite line L+ . Note that we have proved a similar result in Euclidean
geometry, and the spherical version is in the example sheets.
D. 11-51
• The map R : H → H with z 7→ −z̄ is the (hyperbolic) reflection in L+ . More
generally, given any hyperbolic line `, let T be an isometry that sends ` to L+ .
Then the (hyperbolic) reflection in ` is R` = T −1 RT .
• A hyperbolic triangle ABC is the region determined by three hyperbolic line
segments AB, BC and CA, including extreme cases where some vertices A, B, C
are allowed to be “at infinity”. More precisely, in the half-plane model, we allow
them to lie in R ∪ {∞}; in the disk model we allow them to lie on the unit circle
|z| = 1.
E. 11-52
We already know how to reflect in L+ . So to reflect in another line `, we move our
plane such that ` becomes L+ , do the reflection, and move back. By the previous
proposition, R` is the unique isometry other than the identity that fixes ` pointwise.
For a hyperbolic triangle, we see that if A is “at infinity”, then the angle at A must
be zero. Recall for a region R ⊆ H, we can compute the area of R as
area(R) = ∬_R dx dy/y².
T. 11-53
<Gauss-Bonnet theorem for hyperbolic triangles> If ∆ with vertices
A, B, C is a hyperbolic triangle with angles α, β, γ ≥ 0 (note that zero angle
is possible), then area(∆) = π − (α + β + γ).
First do the case where γ = 0, so C is “at infinity”. Recall that we like to use the
disk model if we have a distinguished point in the hyperbolic plane. If we have a
distinguished point at infinity, it is often advantageous to use the upper half plane
model, since ∞ is a distinguished point at infinity.
So we use the upper half-plane model, and wlog C = ∞ (apply an element of
PSL(2, R) if necessary). Then AC and BC are vertical half-lines, and AB is an
arc of a semicircle. We use the transformation z ↦ z + a (with a ∈ R) to centre
the semicircle at 0, and then apply z ↦ bz (with b > 0) to give it radius 1. Thus
wlog AB ⊆ {x² + y² = 1}. Now we have

area(∆) = ∫_{cos(π−α)}^{cos β} ∫_{√(1−x²)}^{∞} (1/y²) dy dx = ∫_{cos(π−α)}^{cos β} dx/√(1 − x²)
        = [−cos⁻¹(x)]_{cos(π−α)}^{cos β} = π − α − β,

proving the theorem in the case γ = 0.
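The double integral above is easy to verify numerically for sample angles (an illustrative check of our own, assuming 0 < α, β and α + β < π):

```python
import math

def hyperbolic_triangle_area(alpha, beta, n=200000):
    # triangle with one vertex at infinity: integrate dx/sqrt(1 - x^2)
    # between x = cos(pi - alpha) and x = cos(beta) (midpoint rule)
    a, b = math.cos(math.pi - alpha), math.cos(beta)
    h = (b - a) / n
    return sum(h / math.sqrt(1 - (a + (i + 0.5) * h) ** 2) for i in range(n))

alpha, beta = 0.6, 0.8
assert abs(hyperbolic_triangle_area(alpha, beta) - (math.pi - alpha - beta)) < 1e-4
print("area equals pi - alpha - beta")
```
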
D. 11-55
• We use the disk model of the hyperbolic plane. Two hyperbolic lines are parallel
iff they meet only at the boundary of the disk (at |z| = 1).
• Two hyperbolic lines are ultraparallel if they don’t meet anywhere in {|z| ≤ 1}.
E. 11-56
Recall that in S 2 , any two lines meet (in two points). In the Euclidean plane R2 ,
any two lines meet (in one point) iff they are not parallel. In the Euclidean plane,
we have the parallel axiom: given a line ` and P 6∈ `, there exists a unique line
`0 containing P with ` ∩ `0 = ∅. This fails in both S 2 and the hyperbolic plane
— but for very different reasons! In S 2 , there are no such parallel lines. In the
hyperbolic plane, there are many parallel lines. There is a deeper reason for
why this is the case, which we will come to at the very end of the course.
C. 11-57
<Hyperboloid model> Recall we said there is no way to view the hyperbolic
plane as a subset of R3 , and hence we need to mess with Riemannian metrics.
However, it turns out we can indeed embed the hyperbolic plane in R3 , if we give
R3 a different metric! The Lorentzian inner product on R³ is the one with matrix
diag(1, 1, −1), ie. ⟨x, y⟩ = x₁y₁ + x₂y₂ − x₃y₃.
Recall from IB Linear Algebra that we can always pick a basis where a non-
degenerate symmetric bilinear form has diagonal made of 1 and −1. If we further
identify A and −A as the “same” symmetric bilinear form, then the above matrix
is the only other possibility left.
Thus, we obtain the quadratic form given by q(x) = ⟨x, x⟩ = x² + y² − z². We now
define the 2-sheeted hyperboloid as S = {x ∈ R³ : q(x) = −1}. This is given explicitly
by the formula x² + y² = z² − 1. We don’t actually need two sheets. So we define
S⁺ = S ∩ {z > 0}. We let π : S⁺ → D ⊆ C = R² be the stereographic projection
from (0, 0, −1), given by

π(x, y, z) = (x + iy)/(1 + z) = u + iv.
We have thus recovered the Poincaré disk model of the hyperbolic plane.
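One can check numerically that points of S⁺ satisfy q(x) = −1 and project into the unit disk; here we parametrize S⁺ by (sinh t cos θ, sinh t sin θ, cosh t), a standard parametrization not spelled out in the notes:

```python
import math

def hyperboloid_point(t, theta):
    # a point of S+ = {x^2 + y^2 - z^2 = -1, z > 0}
    return (math.sinh(t) * math.cos(theta),
            math.sinh(t) * math.sin(theta),
            math.cosh(t))

def project(p):
    # stereographic projection from (0, 0, -1): (x + iy)/(1 + z)
    x, y, z = p
    return complex(x, y) / (1 + z)

for t, theta in [(0.0, 0.0), (0.5, 1.0), (2.0, 3.0)]:
    x, y, z = hyperboloid_point(t, theta)
    assert abs(x * x + y * y - z * z + 1) < 1e-9   # q(x) = -1
    assert abs(project((x, y, z))) < 1             # lands inside the unit disk
print("S+ projects into the Poincare disk")
```
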
P. 11-60
If σ : V → U and σ̃ : Ṽ → U are two C ∞ parametrisations of a surface, then the
homeomorphism ϕ = σ −1 ◦ σ̃ : Ṽ → V is in fact a C ∞ diffeomorphism.
By definition, the derivative of σ has rank two at each point. Wlog, we assume the first
two rows (of the 3 × 2 Jacobian) are linearly independent. So det( xu xv ; yu yv ) ≠ 0 at (u0 , v0 ) ∈ V . We define a
new function F (u, v) = (x(u, v), y(u, v)) ie. composing σ with the projection map
onto the first two coordinates. Now the inverse function theorem applies. So F has
a local C ∞ inverse, ie. there are two open neighbourhoods N, N 0 with (u0 , v0 ) ∈ N
and F (u0 , v0 ) ∈ N 0 ⊆ R2 such that FN : N → N 0 is a diffeomorphism. Writing
π : σ(N ) → N 0 for the projection π(x, y, z) = (x, y) we can put these things in a
commutative diagram:
N --σ--> σ(N ) --π--> N ′ ,   with F = π ∘ σ.
We now let Ñ = σ̃⁻¹(σ(N )) and F̃ = π ∘ σ̃ , which is yet again smooth. Then we
have the larger commutative diagram

Ñ --σ̃--> σ(N ) <--σ-- N ,   π : σ(N ) → N ′ ,   F = π ∘ σ,   F̃ = π ∘ σ̃ .

On Ñ we then have ϕ = σ⁻¹ ∘ σ̃ = F⁻¹ ∘ F̃ , a composition of C∞ maps, so ϕ is C∞ ;
applying the same argument to ϕ⁻¹ shows ϕ is a C∞ diffeomorphism.
We know σ̃(ũ, ṽ) = σ(ϕ1 (ũ, ṽ), ϕ2 (ũ, ṽ)). We can then compute the partial deriva-
tives as σ̃ũ = ϕ1,ũ σu + ϕ2,ũ σv and σ̃ṽ = ϕ1,ṽ σu + ϕ2,ṽ σv . Here the transformation
is related by the Jacobian matrix
ϕ1,ũ ϕ1,ṽ
= J(ϕ).
ϕ2,ũ ϕ2,ṽ
This is invertible since ϕ is a diffeomorphism. So (σ̃ũ , σ̃ṽ ) and (σu , σv ) are different
bases of the same two-dimensional vector space.
Note that we have σ̃ũ × σ̃ṽ = det(J(ϕ))σu × σv .
D. 11-62
• Let S ⊆ R3 be an embedded surface. The map θ = σ −1 : U ⊆ S → V ⊆ R2 is a
chart . A collection of charts which covers S is called an atlas .
• If S ⊆ R3 is an embedded surface, then at each point we have inner product given
by restricting the standard inner product to the tangent space at the point. This
family of inner products together are called the first fundamental form .
• Given a smooth curve Γ : [a, b] → S ⊆ R³, the length and the energy of Γ are

length(Γ) = ∫_a^b ‖Γ′(t)‖ dt,   energy(Γ) = ∫_a^b ‖Γ′(t)‖² dt.
With respect to the standard basis e₁, e₂ ∈ R², we can write the first fundamental
form as

E du² + 2F du dv + G dv²,   where E = ⟨σu , σu ⟩ = ⟨e₁, e₁⟩_P ,  F = ⟨σu , σv ⟩ = ⟨e₁, e₂⟩_P ,  G = ⟨σv , σv ⟩ = ⟨e₂, e₂⟩_P .
Thus, this induces a Riemannian metric on V . This is also called the first funda-
mental form corresponding to σ. This is what we do in practical examples.
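For instance, one can compute E, F, G symbolically for a parametrization of the unit sphere (a worked example of our own, using sympy; the parametrization is our choice):

```python
import sympy as sp

u, v = sp.symbols('u v', real=True)
# a parametrization of the unit sphere (our illustrative choice)
sigma = sp.Matrix([sp.cos(v) * sp.cos(u),
                   sp.cos(v) * sp.sin(u),
                   sp.sin(v)])
su, sv = sigma.diff(u), sigma.diff(v)

E = sp.simplify(su.dot(su))
F = sp.simplify(su.dot(sv))
G = sp.simplify(sv.dot(sv))
print(E, F, G)   # cos(v)**2 0 1
```

So the first fundamental form here is cos²v du² + dv².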
We can think of the energy as something like the average kinetic energy of a particle
along the path. How do length and energy interact with parametrization? For the sake
of simplicity, we assume Γ([a, b]) ⊆ U for some parametrization σ : V → U . Then
we define the new curve

γ = σ⁻¹ ∘ Γ : [a, b] → V.

This curve has two components in V , say γ = (γ₁, γ₂). Then Γ′(t) = γ̇₁σu + γ̇₂σv ,
and so ‖Γ′(t)‖² = E γ̇₁² + 2F γ̇₁γ̇₂ + Gγ̇₂². So we get

length Γ = ∫_a^b (E γ̇₁² + 2F γ̇₁γ̇₂ + Gγ̇₂²)^{1/2} dt.
Similarly, the energy is given by
energy Γ = ∫_a^b (E γ̇₁² + 2F γ̇₁γ̇₂ + Gγ̇₂²) dt.
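As an illustration (our own numeric sketch, not in the notes), we can compare the coordinate formula for length with a direct computation in R³, using the unit-sphere parametrization σ(u, v) = (cos v cos u, cos v sin u, sin v), for which E = cos²v, F = 0, G = 1:

```python
import math

# unit sphere σ(u, v) = (cos v cos u, cos v sin u, sin v), for which
# E = cos(v)^2, F = 0, G = 1 (our illustrative parametrization)
def sigma(u, v):
    return (math.cos(v) * math.cos(u), math.cos(v) * math.sin(u), math.sin(v))

def speed_sq(u, v, du, dv):
    # E du^2 + 2F du dv + G dv^2
    return math.cos(v) ** 2 * du ** 2 + dv ** 2

gamma = lambda t: (t, t / 2)   # a coordinate curve, t in [0, 1]

n = 20000
h = 1.0 / n
length_ff = sum(math.sqrt(speed_sq(*gamma((i + 0.5) * h), 1.0, 0.5)) * h
                for i in range(n))
length_direct = sum(math.dist(sigma(*gamma(i * h)), sigma(*gamma((i + 1) * h)))
                    for i in range(n))
assert abs(length_ff - length_direct) < 1e-5
print("coordinate length formula agrees with the R^3 computation")
```
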
⟨a, b⟩~_P = ⟨dσ̃_P (a), dσ̃_P (b)⟩_{R³} = ⟨dσ_{ϕ(P)} ∘ dϕ_P (a), dσ_{ϕ(P)} ∘ dϕ_P (b)⟩_{R³}
P. 11-65
The area of T is independent of the choice of parametrization.
We consider the energy integrand I(u, v, u̇, v̇) = E u̇² + 2F u̇v̇ + Gv̇² as a function
of four variables u, u̇, v, v̇, which are not necessarily related to one another.
From the calculus of variations, we know γ is stationary if and only if
d/dt (∂I/∂u̇) = ∂I/∂u,   d/dt (∂I/∂v̇) = ∂I/∂v.
The first equation gives us
d/dt (2(E u̇ + F v̇)) = Eu u̇² + 2Fu u̇v̇ + Gu v̇²,
which is exactly the geodesic ODE. Similarly, the second equation gives the other
geodesic ODE.
So a curve from a to b is a geodesic if it is a stationary point of the energy among all
curves from a to b. Since the definition of a geodesic involves derivatives only,
which is a local property, we can generalize the definition to arbitrary embedded
surfaces, as done above.
P. 11-68
If a curve Γ minimizes the energy among all curves from P = Γ(a) to Q = Γ(b),
then Γ is a geodesic.
Here minimizing a quantity locally means for every t ∈ [a, b], there is some ε > 0
such that Γ|[t−ε,t+ε] minimizes the quantity over all curves from Γ(t−ε) to Γ(t+ε).
We will not prove this. Local minimization is the best we can hope for, since the
definition of a geodesic involves differentiation, and derivatives are local properties.
The geodesic ODEs imply kΓ0 (t)k is constant. In the special case of the hyperbolic
plane, we can check this directly.
E. 11-71
A natural question to ask is: if we pick a point P and a tangent direction a,
can we find a geodesic through P whose tangent vector at P is a? In the geodesic
equations, if we expand out the derivative, we can write the equation as
( E F ; F G ) (γ̈₁, γ̈₂)ᵀ = something.
Since the Riemannian metric is positive definite, we can invert the matrix and get
an equation of the form
(γ̈₁, γ̈₂)ᵀ = H(γ₁, γ₂, γ̇₁, γ̇₂)
for some function H. From the general theory of ODE’s in Analysis II, subject to
some sensible conditions, given any P = (u0 , v0 ) ∈ V and a = (p0 , q0 ) ∈ R2 , there
is a unique geodesic curve γ(t) defined for |t| < ε with γ(0) = P and γ̇(0) = a. In
other words, we can choose a point, and a direction, and then there is a geodesic
going that way. Note that we need the restriction that γ is defined only for |t| < ε
since we might run off to the boundary in finite time. So we need not be able to
define it for all t ∈ R.
How is this result useful? We can use the uniqueness part to find geodesics. We
can try to find some family of curves C that are length-minimizing. To prove that
we have found all of them, we can show that given any point P ∈ V and direction
a, there is some curve in C through P with direction a.
Consider the sphere S 2 . Recall that arcs of great circles are length-minimizing,
at least locally. So these are indeed geodesics. Are these all? We know for any
P ∈ S 2 and any tangent direction, there exists a unique great circle through P
in this direction. So there cannot be any other geodesics on S 2 , by uniqueness.
Similarly, we find that hyperbolic lines are precisely the geodesics on the hyperbolic
plane.
E. 11-72
<Geodesic polar coordinates> We have defined these geodesics as solutions
of certain ODEs. It is possible to show that the solutions of these ODEs depend
C ∞ -smoothly on the initial conditions. We shall use this to construct the geodesic
polar coordinates around each point P ∈ S in a surface. The idea is that to
specify a point near P , we can just say “go in direction θ, and then move along
the corresponding geodesic for time r”.
We can make this (slightly) more precise, and provide a quick sketch of how we
can do this formally. We let ψ : U → V be some chart with P ∈ U ⊆ S. We wlog
ψ(P ) = 0 ∈ V ⊆ R2 . We denote by θ the polar angle (coordinate), defined on
V \ {0}. Then for any given θ, there is a unique geodesic γ θ : (−ε, ε) → V such
that γ θ (0) = 0, and γ̇ θ (0) is the unit vector in the θ direction (it makes an angle
of θ with the x-axis). We define σ(r, θ) = γ θ (r) whenever this is defined. It is
possible to check that σ is C ∞ -smooth. While we would like to say that σ gives us
a parametrization, this is not exactly true, since we cannot define θ continuously.
Instead, for each θ₀, we define a region W_{θ₀} on which θ is continuous, together
with the maps W_{θ₀} --σ--> V₀ --ψ⁻¹--> U₀ ⊆ S.
This is why we like geodesic polar coordinates. Using these, we can put the
Riemannian metric into a very simple form. Of course, this is just a sketch of
what really happens, and there are many holes to fill in.
C. 11-74
<Surfaces of revolution> So far, we do not have many examples of surfaces.
We now describe a nice way of obtaining surfaces — we obtain a surface S by
rotating a plane curve η around a line `. Wlog the coordinates are chosen so that `
is the z-axis, and η lies in the (x, z)-plane. More precisely, we let η : (a, b) → R³,
and write η(u) = (f (u), 0, g(u)). Note that it is possible that a = −∞ and/or
b = ∞. We require ‖η′(u)‖ = 1 for all u, that is, parametrization by arclength.
We also require f (u) > 0 for all u, or else things won’t make sense. Finally,
we require that η is a homeomorphism to its image. In particular this requires η
to be injective (non-self intersecting) and continuous. Then S is the image of the
following map:

σ(u, v) = (f (u) cos v, f (u) sin v, g(u))   for a < u < b and 0 ≤ v < 2π.

Restricting v to an open interval of length 2π (eg. 0 < v < 2π or −π < v < π), the
map given by the same formula is a homeomorphism onto its image, as one can
check. It is evidently smooth, since f and g both are. To show this is a
parametrization, we need to show that the partial derivatives are linearly indepen-
dent. We can compute the partial derivatives and show that they are non-zero. We
have σu = (f 0 cos v, f 0 sin v, g 0 ) and σv = (−f sin v, f cos v, 0). We then compute
the cross product as σu × σv = (−f g 0 cos v, −f g 0 sin v, f f 0 ). So we have
‖σu × σv ‖² = f²(g′² + f′²) = f² ≠ 0.
E = ‖σu ‖² = f′² + g′² = 1,   F = σu · σv = 0,   G = ‖σv ‖² = f².
So its first fundamental form is also of the simple form, like the geodesic polar
coordinates. Putting these explicit expressions into the geodesic formula, we find
that the geodesic equations are
ü = f (df/du) v̇²,   (d/dt)(f² v̇) = 0.
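The conserved quantity f²v̇ (Clairaut's relation) gives a convenient test for a numerical geodesic solver. Below is a minimal RK4 sketch of our own, for an illustrative profile f(u) = 2 + cos u:

```python
import math

# surface of revolution with illustrative profile f(u) = 2 + cos u;
# geodesic equations: u'' = f f' v'^2 and (f^2 v')' = 0, ie. v'' = -2 (f'/f) u' v'
f  = lambda u: 2 + math.cos(u)
fp = lambda u: -math.sin(u)

def rhs(s):
    u, v, du, dv = s
    return (du, dv, f(u) * fp(u) * dv ** 2, -2 * fp(u) / f(u) * du * dv)

def rk4_step(s, h):
    k1 = rhs(s)
    k2 = rhs(tuple(x + h / 2 * k for x, k in zip(s, k1)))
    k3 = rhs(tuple(x + h / 2 * k for x, k in zip(s, k2)))
    k4 = rhs(tuple(x + h * k for x, k in zip(s, k3)))
    return tuple(x + h / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(s, k1, k2, k3, k4))

s = (0.3, 0.0, 0.4, 0.2)            # initial (u, v, u', v')
clairaut0 = f(s[0]) ** 2 * s[3]     # the conserved quantity f^2 v'
for _ in range(2000):
    s = rk4_step(s, 0.005)
assert abs(f(s[0]) ** 2 * s[3] - clairaut0) < 1e-6
print("f^2 v' is conserved along the numerical geodesic")
```
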
P. 11-75
Consider a surface of revolution. We assume ‖γ̇‖ = 1, ie. u̇² + f²(u)v̇² = 1. Then
1. Every unit-speed meridian is a geodesic.
2. A (unit-speed) parallel {u = u₀} is a geodesic if and only if (df/du)(u₀) = 0,
ie. u₀ is a critical point of f .
D. 11-76
• We let η : [0, `] → R² be a curve parametrized with unit speed, ie. ‖η′‖ = 1. The
curvature κ at the point η(s) is determined by

η′′ = κn,
where n is the unit normal, chosen so that κ is non-negative.
• The second fundamental form on V with σ : V → U ⊆ S for surface S with unit
normal N is
L du² + 2M du dv + N dv²,   where L = σuu · N, M = σuv · N, N = σvv · N.
For a regular curve γ (not necessarily unit speed), Taylor expansion gives

(γ(t + ∆t) − γ(t)) · n = ½ κ‖γ̇‖²(∆t)² + · · ·   (∗)

where κ is the curvature; this is the distance (the double arrow in the omitted
diagram) from γ(t + ∆t) to the tangent line at γ(t). We can also compute

‖γ(t + ∆t) − γ(t)‖² = ‖γ̇‖²(∆t)² + · · ·   (†)

So we find that ½κ is the ratio of the leading (quadratic) terms of (∗) and (†), and
is independent of the choice of parametrization.
We now try to apply this thinking to embedded surfaces. We let σ : V → U ⊆ S be
a parametrization of a surface S (with V ⊆ R2 open). We apply Taylor’s theorem
to σ to get
(σ(u + ∆u, v + ∆v) − σ(u, v)) · N = ½ (L(∆u)² + 2M ∆u∆v + N (∆v)²) + · · · .

The Gaussian curvature K is then defined as the ratio of the determinants of the
two fundamental forms: K = (LN − M²)/(EG − F²).
E. 11-78
It can be deduced, similarly to the case of curves, that K is independent of parametrization.
Note also that K > 0 means the second fundamental form is definite (ie. either
positive definite or negative definite). If K < 0, then the second fundamental form
is indefinite. If K = 0, then the second fundamental form is semi-definite (but not
definite).
Consider the unit sphere S 2 ⊆ R3 . This has K > 0 at each point. We can compute
this directly, or we can, for the moment, pretend that M = 0. Then by symmetry,
N = L. So K > 0.
On the other hand, we can imagine a Pringle crisp (also known as a hyperbolic
paraboloid), and this has K < 0. More examples are left on the third example
sheet. For example we will see that the embedded torus in R3 has points at which
K > 0, some where K < 0, and others where K = 0.
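The torus claim can be verified symbolically. The following sketch (our own, using sympy, with illustrative radii R = 2, r = 1) computes K = (LN − M²)/(EG − F²) directly from the two fundamental forms:

```python
import math
import sympy as sp

u, v = sp.symbols('u v', real=True)
R, r = 2, 1  # illustrative radii with R > r > 0 (our choice)
# embedded torus: rotate a circle of radius r about the z-axis
sigma = sp.Matrix([(R + r * sp.cos(u)) * sp.cos(v),
                   (R + r * sp.cos(u)) * sp.sin(v),
                   r * sp.sin(u)])
su, sv = sigma.diff(u), sigma.diff(v)
N = su.cross(sv) / su.cross(sv).norm()

E, F, G = su.dot(su), su.dot(sv), sv.dot(sv)
L = sigma.diff(u, 2).dot(N)
M = sigma.diff(u).diff(v).dot(N)
Nc = sigma.diff(v, 2).dot(N)               # the "N" coefficient of the second form
K = (L * Nc - M ** 2) / (E * G - F ** 2)   # ratio of the two determinants

kf = sp.lambdify((u, v), K)
# outer equator u = 0: K > 0; inner equator u = pi: K < 0; top circle u = pi/2: K = 0
assert kf(0.0, 0.3) > 0
assert kf(math.pi, 0.3) < 0
assert abs(kf(math.pi / 2, 0.3)) < 1e-9
print("torus curvature signs verified")
```
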
P. 11-79
We let N = σu × σv /kσu × σv k be our unit normal for a surface patch. Then at
each point, we have
Nu = aσu + bσv and Nv = cσu + dσv ,   where   −( L M ; M N ) = ( a b ; c d )( E F ; F G ).

In particular K = ad − bc.

To see this, dot Nu and Nv with σu and σv : differentiating N · σu = N · σv = 0 gives

−L = aE + bF ,   −M = aF + bG,   −M = cE + dF ,   −N = cF + dG.
T. 11-80
Suppose for a parametrization σ : V → U ⊆ S ⊆ R³, the first fundamental form
is given by du² + G(u, v) dv² for some G ∈ C∞(V ). Then the Gaussian curvature
is given by K = −(√G)uu /√G. In particular, we do not need to compute the
second fundamental form of the surface.
We set e = σu and f = σv /√G. Then e and f are unit and orthogonal. We also
let N = e × f be a third unit vector orthogonal to e and f so that they form a
basis of R³. Using the notation of the previous proposition, we have Nu × Nv =
(ad − bc) σu × σv = K σu × σv , and (σu × σv ) · N = ‖σu × σv ‖ = √(EG − F²) = √G.
Thus we know

K√G = (Nu × Nv ) · N = (Nu × Nv ) · (e × f ) = (Nu · e)(Nv · f ) − (Nu · f )(Nv · e).
eu = αf + λ₁N ,   fu = −α̃e + µ₁N ,   and similarly   ev = βf + λ₂N ,   fv = −β̃e + µ₂N .
Our objective now is to find the coefficients µi , λi , and then K√G = λ₁µ₂ − λ₂µ₁.
Since we know e · f = 0, differentiating gives
eu · f + e · fu = 0 and ev · f + e · fv = 0   ⟹   α̃ = α and β̃ = β.
But we have
α = eu · f = σuu · (σv /√G) = ((σu · σv )u − ½(σu · σu )v ) /√G = 0,
since σu · σv = 0 and σu · σu = 1. So α vanishes. Also, we have
β = ev · f = σuv · (σv /√G) = (½Gu )/√G = (√G)u .
λ₁µ₂ − λ₂µ₁ = eu · fv − ev · fu = (e · fv )u − (e · fu )v = −β̃u + α̃v = −(√G)uu .
So we have K√G = −(√G)uu as required.
So if we have nice coordinates on S, then we get a nice formula for the Gaussian
curvature K. Observe that, for this σ, K depends only on the first fundamental form,
not on the second fundamental form. When Gauss discovered this, he was so
impressed that he called it the Theorema Egregium :
If S1 and S2 have locally isometric charts, then K is locally the same.
We know that this is valid under the assumption of the theorem, ie. the existence
of a parametrization σ of the surface S such that the first fundamental form is
du² + G(u, v) dv². Suitable σ include, for each point P ∈ S, the geodesic polars
(ρ, θ). However, P itself is not in the chart, ie. P ∉ U , and there is no guarantee
that there will be some geodesic polar chart that covers P . To solve this problem, we
notice that K is a C∞ function on S, and in particular continuous. So we can
determine the curvature at P as K(P ) = lim_{Q→P} K(Q).
Note also that every surface of revolution has such a suitable parametrization, as
we have previously explicitly seen.
Note that on the right, we are computing the Riemannian metric on Vi , while on
the left we are computing it on Vj .
E. 11-82
It is clear that every embedded surface is an abstract surface, by forgetting that
it is embedded in R3 . The three classical geometries are all abstract surfaces.
1. The Euclidean space R2 with dx2 + dy 2 is an abstract surface.
2. The sphere S² ⊆ R³, being an embedded surface, is an abstract surface with
metric

4(dx² + dy²)/(1 + x² + y²)².
3. The hyperbolic plane is an abstract surface: the disk D with metric

4(dx² + dy²)/(1 − x² − y²)²,

and this is isometric to the upper half plane H with metric

(dx² + dy²)/y².
Note that in the first and last example, it was sufficient to use just one chart to
cover every point of the surface, but not for the sphere. Also, in the case of the
hyperbolic plane, we can have many different charts, and they are compatible.
Finally, we notice that we really need the notion of abstract surface for the hyper-
bolic plane, since it cannot be realized as an embedded surface in R3 . The proof
is not obvious at all, and is a theorem of Hilbert.
E. 11-83
One important thing we can do is to study the curvature of surfaces. Given a
P ∈ S, the Riemannian metric (on a chart) around P determines a “reparametrization”
by geodesics, similar to embedded surfaces. We have local geodesic polar
coordinates where the metric takes the form dρ² + G(ρ, θ) dθ².
Note that ρ is not really the radius in spherical coordinates, but just one of
the angle coordinates. We then have the metric dρ² + sin²ρ dθ². Then we get
√G = sin ρ and K = 1.
3. For the hyperbolic plane, we use the disk model D, and we first express our
original metric in polar coordinates of the Euclidean plane to get
(2/(1 − r²))² (dr² + r² dθ²).
This is not geodesic polar coordinates, since r is given by the Euclidean dis-
tance, not hyperbolic distance. We will need to put
ρ = 2 tanh⁻¹ r,   dρ = 2 dr/(1 − r²).
Then we have r = tanh(ρ/2), which gives

4r²/(1 − r²)² = sinh² ρ,

so the metric becomes dρ² + sinh²ρ dθ².
So we finally get √G = sinh ρ with K = −1.
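These three curvature computations can be checked symbolically with a few lines of sympy (an illustrative sketch of our own):

```python
import sympy as sp

rho = sp.symbols('rho', positive=True)

def curvature(sqrtG):
    # K = -(sqrtG)_rho_rho / sqrtG for a metric d(rho)^2 + G(rho, theta) d(theta)^2
    return sp.simplify(-sp.diff(sqrtG, rho, 2) / sqrtG)

print(curvature(rho))           # Euclidean plane (sqrtG = rho): 0
print(curvature(sp.sin(rho)))   # sphere: 1
print(curvature(sp.sinh(rho)))  # hyperbolic plane: -1
```
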
We see that the three classic geometries are characterized by having constant 0, 1
and −1 curvatures.
We will next state the Gauss-Bonnet theorem. Recall the definition of triangu-
lations which makes sense for (compact) abstract surfaces S. Recall the Euler
number e(S) = F − E + V is independent of triangulations, so we know that this
is invariant under homeomorphisms.
T. 11-84
<Gauss-Bonnet theorem>
1. Let ABC ⊆ S be a triangle with angles α, β, γ. If the sides of the triangle
ABC ⊆ S are geodesic segments, then

∬_{ABC} K dA = (α + β + γ) − π,   where dA = √(EG − F²) du dv.

2. If S is a compact surface, then ∬_S K dA = 2π e(S).
This is a genuine generalization of what we previously had for the sphere and
hyperbolic plane, as one can easily see. We will not prove this theorem, but we
will make some remarks. Note that we can deduce the second part from the first
part. The basic idea is to take a triangulation of S, and then use things like each
edge belongs to two triangles and each triangle has three edges.
Using the Gauss-Bonnet theorem, we can define the curvature K(P ) for a point
P ∈ S alternatively by considering triangles containing P , and then taking the
limit
K(P ) = lim_{area→0} ((α + β + γ) − π)/area.
Finally, we note how this relates to the problem of the parallel postulate we have
mentioned previously. The parallel postulate, in some form, states that given a line
and a point not on it, there is a unique line through the point and parallel to the
line. This holds in Euclidean geometry, but not hyperbolic and spherical geometry.
It is a fact that this is equivalent to the axiom that the angles of a triangle sum
to π. Thus, the Gauss-Bonnet theorem tells us the parallel postulate is captured
by the fact that the curvature of the Euclidean plane is zero everywhere.
CHAPTER 12
Statistics
Statistics is a set of principles and procedures for gaining and processing quantita-
tive evidence in order to help us make judgements and decisions. Here we will focus
on formal statistical inference: we assume that we have some data generated from
some unknown probability model, and we aim to use the data to learn about certain
properties of the underlying probability model.
In particular, we perform parametric inference. We assume that we have a random
variable X that follows a particular known family of distributions (eg. the Poisson
distribution), but we do not know the parameters of the distribution. We then attempt
to estimate the parameters from the data given. If we repeat the experiment (or
observation) many times, we obtain X1 , X2 , · · · , Xn iid with the same distribution
as X. This simple random sample X = (X1 , X2 , · · · , Xn ) is the data we have.
We will use the observed value X = x to make inferences about the parameter.
Knowledge of part IA probability assumed. We’ll introduce two distributions not seen
explicitly in part IA probability:
• A random variable X has a negative binomial distribution (NegBin) with param-
eters k and p (k ∈ N, p ∈ (0, 1)), if

    P(X = x) = C(x−1, k−1) (1 − p)^{x−k} p^k   for x = k, k + 1, · · ·

and zero otherwise. It has E[X] = k/p and var(X) = k(1 − p)/p². This is a discrete
distribution. It is the distribution of the number of trials up to and including the kth
success, in a sequence of independent Bernoulli trials each with success probability
p. The negative binomial distribution with k = 1 is the geometric distribution with
parameter p. And if X1 , · · · , Xk are iid geometric distributions with parameter p,
then Σ_{i=1}^k Xi ∼ NegBin(k, p).
A related random variable is Y = X − k, the number of failures before the kth success
in a sequence of independent Bernoulli trials each with success probability p. It is also
sometimes called the negative binomial distribution: be careful!
• If Z1 , · · · , Zk are iid N (0, 1) random variables, then the random variable X =
Σ_{i=1}^k Zi² has a chi-squared distribution on k degrees of freedom; we write X ∼ χ²_k.
This is a continuous distribution. Since E[Zi²] = 1 and E[Zi⁴] = 3, we find that
E[X] = k and var(X) = 2k. Further, the moment generating function of Zi² is

    M_{Zi²}(t) = E[e^{Zi² t}] = ∫_{−∞}^{∞} e^{z²t} (1/√(2π)) e^{−z²/2} dz = (1 − 2t)^{−1/2}   for t < 1/2,
so that the mgf of X = Σ_{i=1}^k Zi² is M_X(t) = (M_{Zi²}(t))^k = (1 − 2t)^{−k/2} for t < 1/2.
We recognise this as the mgf of a Gamma(k/2, 1/2), so that X has pdf

    f_X(x) = (1/Γ(k/2)) (1/2)^{k/2} x^{k/2−1} e^{−x/2},   x > 0.
We will denote the upper 100α% point of χ²_k by χ²_k(α), so that if X ∼ χ²_k then
P(X > χ²_k(α)) = α. The above connection between the gamma and χ² distributions
means that sometimes we can use χ²-tables to find percentage points for gamma
distributions. χ² satisfies the following additive property (which can be proved via
mgf's): if X ∼ χ²_m and Y ∼ χ²_n are independent, then X + Y ∼ χ²_{m+n}. It is also
worth noting that if X ∼ Gamma(n, λ), then 2λX ∼ χ²_{2n}; this can be proved via
mgf's or the density transformation formula.
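As a quick numerical sanity check of these χ² facts (our own illustration, not part of the notes; the seed and sample size are arbitrary choices), we can simulate sums of squared standard normals and compare with E[X] = k and var(X) = 2k:

```python
# Sanity check (ours): X = Z1^2 + ... + Zk^2 should have E[X] = k, var(X) = 2k.
import random

random.seed(0)
k = 3
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
           for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

assert abs(mean - k) < 0.05    # E[X] = k, up to Monte Carlo error
assert abs(var - 2 * k) < 0.3  # var(X) = 2k, up to Monte Carlo error
```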
12.1 Estimation
The goal of estimation is as follows: we are given iid X1 , · · · , Xn , and we know that
their probability density/mass function is fX (x; θ) for some unknown θ. We know fX
but not θ. For example, we might know that they follow a Poisson distribution, but
we do not know what the mean is. The objective is to estimate the value of θ.
D. 12-1
• A statistic is a function T of the data x = (x1 , · · · , xn ). We obtain estimates of
θ using a statistic, so we can write our estimate as θ̂ = T (x). Then the random
variable T (X) is called an estimator of θ. The distribution of T = T (X) is the
sampling distribution of the statistic.
? We adopt the convention where capital X denotes a random variable and x is an
observed value. So T (X) is a random variable and T (x) is a particular value we
obtain after experiments.
• Let θ̂ = T (X) be an estimator of θ. The bias of θ̂ is the difference between its
expected value and true value: bias(θ̂) = Eθ (θ̂) − θ.1 An estimator is unbiased if
it has no bias, ie. Eθ (θ̂) = θ.
• The mean squared error of an estimator θ̂ is Eθ [(θ̂ − θ)2 ].
The root mean squared error is the square root of the above.
E. 12-2
• Let X1 , · · · , Xn be iid N (µ, 1). A possible estimator for µ is T (X) = (1/n) Σ Xi .
Then for any particular observed sample x, our estimate is T (x) = (1/n) Σ xi .
What is the sampling distribution of T ? Recall from IA Probability that, in general,
if the Xi ∼ N (µi , σi²) are independent, then Σ Xi ∼ N (Σ µi , Σ σi²), which is
something we can prove by considering moment-generating functions.
So we have T (X) ∼ N (µ, 1/n). Note that by the Central Limit Theorem, even
if Xi were not normal, we still have approximately T (X) ∼ N (µ, 1/n) for large
values of n, but here we get exactly the normal distribution even for small values
of n.
1 Note that the subscript θ does not represent the random variable, but the thing we want to
estimate. This is inconsistent with the use for, say, the probability mass function.
• The estimator (1/n) Σ Xi we had above is a rather sensible estimator. Of course, we
can also have silly estimators such as T (X) = X1 , or even T (X) = 0.32 always.
One way to decide if an estimator is silly is to look at its bias. To find out Eθ (T ),
we can either find the distribution of T and find its expected value, or evaluate
T as a function of X directly, and find its expected value. In the above example,
Eµ (T ) = µ. So T is unbiased for µ.
• Given an estimator, we want to know how good the estimator is. We have the
concept of the bias. However, this is generally not a good measure of how good
the estimator is. For example, if we do 1000 random trials X1 , · · · , X1000 , we can
pick our estimator as T (X) = X1 . This is an unbiased estimator, but is really
bad because we have just wasted the data from the other 999 trials. On the other
hand, T′(X) = 0.01 + (1/1000) Σ Xi is biased (with a bias of 0.01), but is in general
much more trustworthy than T . In fact, at the end of the section, we will construct
cases where the only possible unbiased estimator is a completely silly estimator to
use.
We can express the mean squared error in terms of the variance and bias:

    Eθ[(θ̂ − θ)²] = Eθ[(θ̂ − Eθ(θ̂) + Eθ(θ̂) − θ)²]
                 = Eθ[(θ̂ − Eθ(θ̂))²] + [Eθ(θ̂) − θ]² + 2[Eθ(θ̂) − θ] Eθ[θ̂ − Eθ(θ̂)]
                 = var(θ̂) + bias²(θ̂),

since the cross term vanishes: Eθ[θ̂ − Eθ(θ̂)] = 0.
If we are aiming for a low mean squared error, sometimes it could be preferable to
have a biased estimator with a lower variance. This is known as the “bias-variance
trade-off”.
For example, suppose X ∼ binomial(n, θ). The standard estimator is TU =
X/n, which is unbiased. TU has variance

    varθ(TU) = varθ(X)/n² = θ(1 − θ)/n.
Hence the mean squared error of the usual estimator is given by
mse(TU ) = varθ (TU ) + bias2 (TU ) = θ(1 − θ)/n.
Consider an alternative estimator

    TB = (X + 1)/(n + 2) = w·(X/n) + (1 − w)·(1/2),   where w = n/(n + 2).

This can be interpreted as a weighted average (by the sample size) of the
sample mean and 1/2. We have

    Eθ(TB) − θ = (nθ + 1)/(n + 2) − θ = (1 − w)(1/2 − θ),

so TB is biased. The variance is given by

    varθ(TB) = varθ(X)/(n + 2)² = w² θ(1 − θ)/n.

Hence the mean squared error is

    mse(TB) = varθ(TB) + bias²(TB) = w² θ(1 − θ)/n + (1 − w)² (1/2 − θ)².
We can plot the mean squared error of each estimator for possible values of θ.
Here we plot the case where n = 10.
[Figure: mse plotted against θ ∈ [0, 1] for n = 10; the unbiased estimator's curve lies above the biased estimator's except near θ = 0 and θ = 1.]
This biased estimator has smaller MSE unless θ has extreme values.
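The two mse curves can be reproduced directly from the closed forms above (a small sketch; the function names are our own):

```python
# The mse curves for the binomial example with n = 10 and w = n/(n+2).
n = 10
w = n / (n + 2)

def mse_unbiased(theta):
    return theta * (1 - theta) / n

def mse_biased(theta):
    return w ** 2 * theta * (1 - theta) / n + (1 - w) ** 2 * (0.5 - theta) ** 2

assert mse_biased(0.5) < mse_unbiased(0.5)   # biased wins near theta = 1/2
assert mse_unbiased(0.0) < mse_biased(0.0)   # unbiased wins at the extremes
assert mse_unbiased(1.0) < mse_biased(1.0)
```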
We see that sometimes biased estimators could give better mean squared er-
rors. In some cases, not only could unbiased estimators be worse — they could
be completely nonsense. Suppose X ∼ Poisson(λ), and we want to estimate
θ = [P (X = 0)]2 = e−2λ . Then any unbiased estimator T (X) must satisfy
Eθ (T (X)) = θ, or equivalently,
    Eλ(T(X)) = Σ_{x=0}^{∞} T(x) e^{−λ} λ^x/x! = e^{−2λ}.
The only function T that can satisfy this equation is T(X) = (−1)^X. Thus the
unbiased estimator would estimate e^{−2λ} to be 1 if X is even, and −1 if X is odd.
This is clearly nonsense.
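We can confirm numerically that E_λ[(−1)^X] = e^{−2λ} (our own check, accumulating each term of the series iteratively so no large factorials appear):

```python
# Check: sum over x of (-1)^x e^{-lam} lam^x / x! equals e^{-2 lam}.
import math

for lam in (0.3, 1.0, 2.5):
    term = math.exp(-lam)        # x = 0 term: e^{-lam}
    total = term
    for x in range(1, 150):
        term *= -lam / x         # next term: previous * (-lam)/x
        total += term
    assert abs(total - math.exp(-2 * lam)) < 1e-12
```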
D. 12-3
• A statistic T is sufficient for θ if the conditional distribution of X given T does
not depend on θ.
• A sufficient statistic T (X) is minimal if it is a function of every other sufficient
statistic, ie. if T′(X) is also sufficient, then T′(X) = T′(Y) ⇒ T (X) = T (Y).
E. 12-4
Often, we do experiments just to find out the value of θ. For example, we might
want to estimate what proportion of the population supports some political can-
didate. We are seldom interested in the data points themselves, and just want to
learn about the big picture. This leads us to the concept of a sufficient statis-
tic. This is a statistic T (X) that contains all information we have about θ in the
sample.
In general the distribution of X depends on θ; that is why we can use the data x to
estimate θ. However, for a sufficient statistic the conditional distribution of X given
any particular T = t does not depend on θ. So if we know T , then knowledge of
the exact form of our data x does not give us more information about θ. In
other words, for any fixed t, all x ∈ {x : T (x) = t} are “the same” to us.
Note that sufficient statistics are not unique. If T is sufficient for θ, then so is any
injective function of T . Note that X is always sufficient for θ as well, but it is not
of much use. How can we decide if a sufficient statistic is “good”?
Given any statistic T , we can partition the sample space X^n into sets {x ∈ X^n :
T (x) = t} for each t. Then after an experiment, instead of recording the actual
value of x, we can simply record the partition x falls into. If there are fewer partitions
than possible values of x, then effectively there is less information we have to store.
If T is sufficient, then this data reduction does not lose any information about θ.
The “best” sufficient statistic would be one in which we achieve the maximum
possible reduction. This is known as the minimal sufficient statistic.
E. 12-5
Let X1 , · · · , Xn be iid Bernoulli(θ), so that P(Xi = 1) = 1 − P(Xi = 0) = θ for
some 0 < θ < 1. Suppose T (X) = Σ xi , the total number of ones. Then

    fX(x | θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1−xi} = θ^{Σxi} (1 − θ)^{n−Σxi}.

This depends on the data only through T . Suppose we are now given that T (X) =
t. Then what is the distribution of X? We have

    fX|T=t(x) = Pθ(X = x, T = t)/Pθ(T = t) = Pθ(X = x)/Pθ(T = t)
              = θ^{Σxi}(1 − θ)^{n−Σxi} / [C(n, t) θ^t (1 − θ)^{n−t}] = C(n, t)^{−1},

where the second equality holds because if X = x, then T must be equal to
t. So T is sufficient.
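The computation above can be verified exactly by brute force for small n: the conditional probability of every arrangement given T = t equals 1/C(n, t), whatever θ is (our own check; the helper name cond_prob is ours):

```python
# Brute-force check of Bernoulli sufficiency: P(X = x | T = t) = 1/C(n, t).
from itertools import product
from math import comb

n, t = 5, 2

def cond_prob(x, theta):
    """P_theta(X = x | T = t) for a binary vector x with sum(x) == t."""
    p_x = theta ** sum(x) * (1 - theta) ** (n - sum(x))
    p_t = comb(n, t) * theta ** t * (1 - theta) ** (n - t)
    return p_x / p_t

for x in product((0, 1), repeat=n):
    if sum(x) != t:
        continue
    for theta in (0.2, 0.5, 0.9):  # the answer must not depend on theta
        assert abs(cond_prob(x, theta) - 1 / comb(n, t)) < 1e-12
```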
T. 12-6
<Factorization criterion> T is sufficient for θ iff fX (x | θ) = g(T (x), θ)h(x)
for some functions g and h.
(Backward) We first prove the discrete case. Suppose fX(x | θ) = g(T (x), θ)h(x).
If T (x) = t, then

    fX|T=t(x) = Pθ(X = x, T(X) = t)/Pθ(T = t) = g(T(x), θ)h(x) / Σ_{y:T(y)=t} g(T(y), θ)h(y)
              = g(t, θ)h(x) / [g(t, θ) Σ_{y:T(y)=t} h(y)] = h(x) / Σ_{y:T(y)=t} h(y),

which does not depend on θ. So T is sufficient.
(Forward) Suppose T is sufficient, and write Pθ(X = x) = Pθ(X = x | T = T(x)) Pθ(T = T(x)).
The first factor does not depend on θ by assumption; call it h(x). Let the second
factor be g(t, θ), and so we have the required factorisation.
E. 12-7
• Continuing the previous example, fX(x | θ) = θ^{Σxi}(1 − θ)^{n−Σxi}. Take g(t, θ) =
θ^t(1 − θ)^{n−t} and h(x) = 1 to see that T (X) = Σ Xi is sufficient for θ.
• Let X1 , · · · , Xn be iid U [0, θ]. Write 1[A] for the indicator function of an arbitrary
set A. We have

    fX(x | θ) = ∏_{i=1}^{n} (1/θ) 1[0 ≤ xi ≤ θ] = (1/θⁿ) 1[max_i xi ≤ θ] 1[min_i xi ≥ 0].

Take g(t, θ) = θ^{−n} 1[t ≤ θ] and h(x) = 1[min_i xi ≥ 0] to see that T (X) = max_i Xi
is sufficient for θ.
T. 12-8
Suppose T = T (X) is a statistic that satisfies
    fX(x; θ)/fX(y; θ) does not depend on θ  ⇐⇒  T(x) = T(y).
Then T is minimal sufficient for θ.
First we show T is sufficient. We will use the factorization criterion to do so.
Firstly, for each possible t, pick a favorite xt such that T (xt) = t. Now let
x ∈ X^n. Then T (x) = T (x_{T(x)}). By the hypothesis, fX(x; θ)/fX(x_{T(x)}; θ) does
not depend on θ. Let this be h(x). Let g(t, θ) = fX(xt; θ). Then

    fX(x; θ) = fX(xt; θ) · fX(x; θ)/fX(xt; θ) = g(t, θ)h(x).
So T is sufficient for θ. To show that this is minimal, suppose that S(X) is also
sufficient. By the factorization criterion, there exist functions gS and hS such that
fX (x; θ) = gS (S(x), θ)hS (x). Now suppose that S(x) = S(y). Then
    fX(x; θ)/fX(y; θ) = [gS(S(x), θ)hS(x)] / [gS(S(y), θ)hS(y)] = hS(x)/hS(y).

This is a constant function of θ, so by the hypothesis T(x) = T(y). Hence
S(x) = S(y) ⇒ T(x) = T(y), ie. T is a function of S, and T is minimal.
For example, let X1 , · · · , Xn be iid N (µ, σ²). Then

    fX(x; θ)/fX(y; θ) = exp( −(1/(2σ²))(Σ xi² − Σ yi²) + (µ/σ²)(Σ xi − Σ yi) ).

This is a constant function of (µ, σ²) iff Σ xi² = Σ yi² and Σ xi = Σ yi.
So T (X) = (Σ Xi², Σ Xi) is minimal sufficient for (µ, σ²). Since a bijective
function of a minimal sufficient statistic is again minimal sufficient, (x̄, Sxx) is
also minimal sufficient for (µ, σ²).
Note that above we have a vector T sufficient for a vector θ. The dimensions do
not have to be the same. For example, one can check that for N (µ, µ²),
T (X) = (Σ Xi², Σ Xi) is minimal sufficient for µ.
T. 12-10
<Rao-Blackwell Theorem> Let T be a sufficient statistic for θ and let θ̃ be
an estimator for θ with E(θ̃2 ) < ∞ for all θ. Let θ̂(x) = E[θ̃(X) | T (X) = T (x)].
Then E[(θ̂ − θ)2 ] ≤ E[(θ̃ − θ)2 ] for all θ. The inequality is strict unless θ̃ is a
function of T .
Therefore θ̂ = ((n + 1)/n) max_i Xi is our new estimator. In case this is not clear,
X1 | X1 < t has distribution U [0, t], since P(X1 ≤ s | X1 < t) = s/t for 0 ≤ s ≤ t.
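A quick Monte Carlo sketch (ours) of Rao-Blackwellisation in the U[0, θ] setting: starting from the crude unbiased estimator 2X1 (one standard choice, not necessarily the one the notes use) and conditioning on T = max_i Xi gives ((n+1)/n) max_i Xi, which should have strictly smaller mse:

```python
# Rao-Blackwell for X_i iid U(0, theta): compare mse of 2*X1 with that of
# the Rao-Blackwellised estimator ((n+1)/n) * max X_i.
import random

random.seed(3)
theta, n, trials = 1.0, 5, 100_000
se_crude = se_rb = 0.0
for _ in range(trials):
    xs = [random.uniform(0.0, theta) for _ in range(n)]
    se_crude += (2 * xs[0] - theta) ** 2            # squared error of 2*X1
    se_rb += ((n + 1) / n * max(xs) - theta) ** 2   # squared error after RB
mse_crude = se_crude / trials
mse_rb = se_rb / trials

assert mse_rb < mse_crude   # as the Rao-Blackwell theorem guarantees
```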
D. 12-12
Let X1 , · · · , Xn be random variables with joint pdf/pmf fX(x | θ). We observe
X = x. For any given x, the likelihood of θ is like(θ) = fX(x | θ), regarded as a
function of θ. The maximum likelihood estimator (mle) of θ is an estimator that
picks the value of θ that maximizes like(θ).
E. 12-13
There are many different estimators we can pick, and we have just come up with
some criteria to determine whether an estimator is “good”. However, these do
not give us a systematic way of coming up with an estimator to actually use. In
practice, we often use the maximum likelihood estimator. We can imagine that,
given the data, the mle picks the distribution that is “most likely” to have produced
that data.
Often there is no closed form for the mle, and we have to find θ̂ numerically.
When we can find the mle explicitly, in practice, we often equivalently maximize
the log-likelihood instead of the likelihood since it’s often easier. In particular, if
X1 , · · · , Xn are iid, each with pdf/pmf fX (x | θ), then
    like(θ) = ∏_{i=1}^{n} fX(xi | θ),    log like(θ) = Σ_{i=1}^{n} log fX(xi | θ).
How does the mle relate to sufficient statistics? Suppose that T is sufficient for
θ. Then the likelihood is g(T (x), θ)h(x), which depends on θ through T (x). To
maximise this as a function of θ, we only need to maximize g. So the mle θ̂ is a
function of the sufficient statistic T .
Note that if φ = h(θ) with h injective, then the mle of φ is given by h(θ̂). This is
called the invariance property of mle’s. For example, if the mle of the standard
deviation σ is σ̂, then the mle of the variance σ 2 is σ̂ 2 . This is rather useful in
practice, since we can use this to simplify a lot of computations.
It can be shown, under regularity conditions, that √n(θ̂ − θ) is asymptotically
multivariate normal with mean 0 and ‘smallest attainable variance’ (see Part II
Principles of Statistics).
E. 12-14
• Let X1 , · · · , Xn be iid Bernoulli(p). Then
    l(p) = log like(p) = (Σ xi) log p + (n − Σ xi) log(1 − p)

    =⇒  dl/dp = (Σ xi)/p − (n − Σ xi)/(1 − p).

This is zero when p = Σ xi/n. So p̂ = Σ xi/n is the maximum likelihood estimator
(and is unbiased).
• Let X1 , · · · , Xn be iid N (µ, σ²), with both µ and σ² unknown. Then
    l(µ, σ²) = log like(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ (xi − µ)².

This is maximized when ∂l/∂µ = ∂l/∂σ² = 0. We have

    ∂l/∂µ = (1/σ²) Σ (xi − µ),    ∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ (xi − µ)².

So the solution, hence the maximum likelihood estimator, is (µ̂, σ̂²) = (x̄, Sxx/n), where
x̄ = (1/n) Σ xi and Sxx = Σ (xi − x̄)². We shall see later that Sxx/σ² = nσ̂²/σ² ∼
χ²_{n−1}, the chi-squared distribution. This has E[χ²_{n−1}] = n − 1 and so E(σ̂²) = (n −
1)σ²/n, ie. σ̂² is biased. However it is asymptotically unbiased since E(σ̂²) − σ² →
0 as n → ∞.
• Suppose the American army discovers some German tanks that are sequentially
numbered, ie. the first tank is numbered 1, the second is numbered 2, etc. Then
if θ tanks are produced, then the probability distribution of the tank number is
U (0, θ). Suppose we have discovered n tanks whose numbers are x1 , x2 , · · · , xn ,
and we want to estimate θ, the total number of tanks produced. We want to find
the maximum likelihood estimator.
    like(θ) = (1/θⁿ) 1[max_i xi ≤ θ] 1[min_i xi ≥ 0].

This is decreasing in θ for θ ≥ max_i xi (and zero for θ < max_i xi), so the mle is
θ̂ = max_i Xi. Under the true θ, its cdf is P(θ̂ ≤ t) = P(all Xi ≤ t) = (t/θ)ⁿ for
0 ≤ t ≤ θ. Differentiating with respect to t, we find the pdf f_{θ̂}(t) = ntⁿ⁻¹/θⁿ. Hence

    E(θ̂) = ∫₀^θ t · ntⁿ⁻¹/θⁿ dt = nθ/(n + 1).
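The formula E(θ̂) = nθ/(n + 1) for θ̂ = max_i Xi can be checked by simulation (our own sketch; the parameter values and seed are arbitrary):

```python
# Monte Carlo check of E[max X_i] = n*theta/(n+1) for X_i iid U(0, theta).
import random

random.seed(1)
theta, n, trials = 100.0, 5, 200_000
total = 0.0
for _ in range(trials):
    total += max(random.uniform(0.0, theta) for _ in range(n))
avg = total / trials

assert abs(avg - n * theta / (n + 1)) < 0.5   # E = 500/6 ~ 83.3
assert abs((n + 1) / n * avg - theta) < 0.6   # the debiased version targets theta
```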
The maximum likelihood estimate is 5 (by trial and error). If we plot like(k) against k, the
plot is fairly flat, so k = 5 is not much more likely than any other k ≥ 3.
D. 12-15
A 100γ% (0 < γ < 1) confidence interval for θ is a random interval (A(X), B(X))
such that P(A(X) < θ < B(X)) = γ, no matter what the true value of θ may be.
E. 12-16
In fact it is also possible to have confidence intervals for vector parameters. Notice
that it is the endpoints of the interval that are random quantities, while θ is a fixed
constant we want to find out. We can interpret this in terms of repeat sampling.
If we calculate (A(x), B(x)) for a large number of samples x, then approximately
100γ% of them will cover the true value of θ.
It is important to know that having observed some data x and calculated 95%
confidence interval, we cannot say that θ has 95% chance of being within the
interval. Apart from the standard objection that θ is a fixed value and either is or
is not in the interval, and hence we cannot assign probabilities to this event, we
will later construct an example where even though we have got a 50% confidence
interval, we are 100% sure that θ lies in that interval.
Note that if (A(x), B(x)) is a 100γ% confidence interval for θ, and T (θ) is a
monotone increasing function of θ, then (T (A(x)), T (B(x))) is a 100γ% confidence
interval for T (θ). And if T is monotone decreasing, then (T (B(x)), T (A(x))) is a
100γ% confidence interval for T (θ).
E. 12-17
Suppose X1 , · · · , Xn are iid N (θ, 1). Find a 95% confidence interval for θ.
We know X̄ ∼ N(θ, 1/n), so that √n(X̄ − θ) ∼ N(0, 1). Let z1, z2 be such that
Φ(z2) − Φ(z1) = 0.95, where Φ is the standard normal (cumulative) distribution
function. We have P[z1 < √n(X̄ − θ) < z2] = 0.95, which can be rearranged to
give

    P( X̄ − z2/√n < θ < X̄ − z1/√n ) = 0.95,

so we obtain the following 95% confidence interval:

    ( X̄ − z2/√n, X̄ − z1/√n ).
There are many possible choices for z1 and z2. Since the N(0, 1) density is symmetric,
the shortest such interval is obtained by z2 = Φ^{−1}(0.975) = 1.96 = −z1. We
can also choose other values such as z1 = −∞, z2 = 1.645, but we usually choose
symmetric end points.
This example illustrates a common procedure for finding confidence intervals:
• Find a quantity R(X, θ) such that the Pθ-distribution of R(X, θ) does not
depend on θ. This is called a pivot . In our example, R(X, θ) = √n(X̄ − θ).
• Write down a probability statement of the form Pθ (c1 < R(X, θ) < c2 ) = γ.
• Rearrange the inequalities inside P(. . .) to find the interval.
Usually c1 , c2 are percentage points from a known standardised distribution, often
equitailed. For example, we pick 2.5% and 97.5% points for a 95% confidence
interval. We could also use, say 0% and 95%, but this generally results in a wider
interval.
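The three-step pivot recipe can be illustrated by simulation for the N(θ, 1) example (our own sketch): the interval X̄ ± 1.96/√n should cover the true θ in about 95% of repeated samples.

```python
# Coverage check for the pivotal interval xbar +- 1.96/sqrt(n), X_i iid N(theta, 1).
import random

random.seed(2)
theta, n, trials = 3.0, 20, 20_000
half = 1.96 / n ** 0.5
covered = 0
for _ in range(trials):
    xbar = sum(random.gauss(theta, 1.0) for _ in range(n)) / n
    if xbar - half < theta < xbar + half:
        covered += 1
coverage = covered / trials

assert 0.94 < coverage < 0.96   # close to the nominal 95%
```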
E. 12-18
Suppose X1 , · · · , X50 are iid N (0, σ 2 ). Find a 99% confidence interval for σ.
E. 12-19
Suppose X1 , · · · , Xn are iid Bernoulli(p). Find an approximate confidence interval
for p.
The mle of p is p̂ = Σ Xi/n. By the Central Limit theorem, p̂ is approximately
N(p, p(1 − p)/n) for large n. So √n(p̂ − p)/√(p(1 − p)) is approximately N(0, 1)
for large n. So we have

    P( p̂ − z_{(1−γ)/2} √(p(1 − p)/n) < p < p̂ + z_{(1−γ)/2} √(p(1 − p)/n) ) ≈ γ.

But p is unknown! So we approximate it by p̂ to get a confidence interval for p
when n is large:

    P( p̂ − z_{(1−γ)/2} √(p̂(1 − p̂)/n) < p < p̂ + z_{(1−γ)/2} √(p̂(1 − p̂)/n) ) ≈ γ.

Note that we have made a lot of approximations here, but it would be difficult to
do better than this.
E. 12-20
Suppose an opinion poll says 20% of the people are going to vote UKIP, based on
a random sample of 1, 000 people. What might the true proportion be?
If we don’t want to make that many approximations, we can note p that p(1 −
pp) ≤
1/4 for all 0 ≤ p ≤ 1. So a conservative 95% interval is p̂±1.96 √
1/4n ≈ p̂± 1/n.
So whatever proportion is reported, it will be ‘accurate’ to ±1/ n.
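The poll numbers can be worked out explicitly (our own arithmetic, following the formulas above, with p̂ = 0.2 and n = 1000):

```python
# Plug-in vs conservative half-widths for the opinion-poll example.
import math

p_hat, n = 0.2, 1000
half_plugin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # ~0.0248
half_conservative = 1 / math.sqrt(n)                     # ~0.0316, uses p(1-p) <= 1/4

assert abs(half_plugin - 0.0248) < 1e-3
assert abs(half_conservative - 0.0316) < 1e-3
assert half_conservative > half_plugin   # the conservative interval is wider
```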
E. 12-21
Suppose X1 , X2 are iid from U (θ −1/2, θ +1/2). What is a sensible 50% confidence
interval for θ?
    Pθ( min(X1, X2) ≤ θ ≤ max(X1, X2) ) = 1/2.
By Bayes' theorem, the posterior distribution of θ given the data is π(θ | x) ∝
fX(x | θ)π(θ), where the constant of proportionality is chosen to make the total
mass of the posterior distribution equal to one. Usually, we use this form, instead
of attempting to calculate fX(x). It should be clear that the data enters through
the likelihood, so if we have a sufficient statistic, the inference is automatically
a function of the sufficient statistic.
E. 12-24
Suppose I have 3 coins in my pocket. One is 3 : 1 in favour of tails, one is a fair
coin, and one is 3 : 1 in favour of heads. I randomly select one coin and flip it
once, observing a head. What is the probability that I have chosen coin 3?
Let X = 1 denote the event that I observe a head, and X = 0 a tail. Let θ denote
the probability of a head, so θ is either 0.25, 0.5 or 0.75. Our prior distribution
is π(θ = 0.25) = π(θ = 0.5) = π(θ = 0.75) = 1/3. The probability mass function is
fX(x | θ) = θ^x(1 − θ)^{1−x}. So we have the following result:

    π(θ = 0.75 | x = 1) = (0.75 × 1/3)/((0.25 + 0.5 + 0.75) × 1/3) = 0.5.
So if we observe a head, then there is now a 50% chance that we have picked the
third coin.
E. 12-25
Suppose we are interested in the true mortality risk θ in a hospital H which is about
to try a new operation. On average in the country, around 10% of the people die,
but mortality rates in different hospitals vary from around 3% to around 20%.
Hospital H has no deaths in their first 10 operations. What should we believe
about θ?
Let Xi = 1 if the ith patient in H dies. Then

    fX(x | θ) = θ^{Σxi}(1 − θ)^{n−Σxi}.
Suppose a priori that θ ∼ Beta(a, b) for some a > 0, b > 0, so that
π(θ) ∝ θ^{a−1}(1 − θ)^{b−1}. Then the posterior is

    π(θ | x) ∝ fX(x | θ)π(θ) ∝ θ^{Σxi+a−1}(1 − θ)^{n−Σxi+b−1}.

We recognize this as Beta(Σxi + a, n − Σxi + b). So

    π(θ | x) = θ^{Σxi+a−1}(1 − θ)^{n−Σxi+b−1} / B(Σxi + a, n − Σxi + b).
In practice, we need to find a Beta prior distribution that matches our information
from other hospitals. It turns out that the Beta(a = 3, b = 27) prior distribution has
mean 0.1 and P(0.03 < θ < 0.20) = 0.9. Then we observe data Σxi = 0, n = 10.
So the posterior is Beta(Σxi + a, n − Σxi + b) = Beta(3, 37). This has a mean
of 3/40 = 0.075.
This leads to a different conclusion than a frequentist analysis. Since nobody
has died so far, the mle is 0, which does not seem plausible. Using a Bayesian
approach, we have a higher mean than 0 because we take into account the data
from other hospitals. For this problem, a beta prior leads to a beta posterior. We
say that the beta family is a conjugate family of prior distributions for Bernoulli
samples.
Suppose that a = b = 1, so that π(θ) = 1 for 0 < θ < 1 — the uniform distribution.
Then the posterior is Beta(Σxi + 1, n − Σxi + 1).
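The hospital example can be written as a few lines of code (the function name is ours): a Beta(a, b) prior on a Bernoulli probability updates to Beta(a + Σxi, b + n − Σxi).

```python
# Conjugate Beta-Bernoulli update, applied to the hospital mortality example.
def beta_binomial_update(a, b, successes, n):
    """Posterior Beta parameters after `successes` out of n Bernoulli trials."""
    return a + successes, b + n - successes

# Beta(3, 27) prior, 0 deaths in 10 operations -> Beta(3, 37), mean 3/40 = 0.075.
a_post, b_post = beta_binomial_update(3, 27, successes=0, n=10)
posterior_mean = a_post / (a_post + b_post)

assert (a_post, b_post) == (3, 37)
assert abs(posterior_mean - 0.075) < 1e-12
```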
• Given a loss function L(θ, a), the Bayes estimator θ̂ is the estimator that
minimises the expected posterior loss h(a) = ∫ L(θ, a)π(θ | x) dθ.
C. 12-27
Common loss functions are quadratic loss L(θ, a) = (θ − a)² and absolute error loss
L(θ, a) = |θ − a|, but we can have others.
• For quadratic loss, h(a) = ∫ (a − θ)²π(θ | x) dθ, and so h′(a) = 0 if

    ∫ (a − θ)π(θ | x) dθ = 0,  ie.  a ∫ π(θ | x) dθ = ∫ θπ(θ | x) dθ.

Since ∫ π(θ | x) dθ = 1, the Bayes estimator is θ̂ = ∫ θπ(θ | x) dθ, the
posterior mean .
• For absolute error loss,

    h(a) = ∫ |θ − a|π(θ | x) dθ = ∫_{−∞}^{a} (a − θ)π(θ | x) dθ + ∫_{a}^{∞} (θ − a)π(θ | x) dθ
         = a ∫_{−∞}^{a} π(θ | x) dθ − ∫_{−∞}^{a} θπ(θ | x) dθ + ∫_{a}^{∞} θπ(θ | x) dθ − a ∫_{a}^{∞} π(θ | x) dθ.

Now h′(a) = 0 if ∫_{−∞}^{a} π(θ | x) dθ = ∫_{a}^{∞} π(θ | x) dθ. This occurs when each side
is 1/2. So θ̂ is the posterior median .
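As a numerical check (ours), the posterior mean does minimise the expected quadratic loss h(a); here we discretise the Beta(3, 37) posterior from the hospital example with a crude midpoint rule:

```python
# Grid search: the minimiser of h(a) = E[(a - theta)^2 | x] is the posterior mean.
import math

a_param, b_param = 3.0, 37.0
norm = math.gamma(a_param) * math.gamma(b_param) / math.gamma(a_param + b_param)

def pdf(t):
    """Beta(3, 37) density."""
    return t ** (a_param - 1) * (1 - t) ** (b_param - 1) / norm

N = 800
grid = [(i + 0.5) / N for i in range(N)]        # midpoint rule on (0, 1)
weights = [pdf(t) / N for t in grid]

def expected_quadratic_loss(a):
    return sum(w * (a - t) ** 2 for t, w in zip(grid, weights))

best = min(grid, key=expected_quadratic_loss)
posterior_mean = a_param / (a_param + b_param)  # = 0.075

assert abs(best - posterior_mean) < 1e-3
```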
E. 12-28
Suppose that X1 , · · · , Xn are iid N (µ, 1), and that a priori µ ∼ N (0, τ^{−2}) for some
τ. So τ² measures the precision (certainty) of our prior knowledge. The posterior is
given by

    π(µ | x) ∝ fX(x | µ)π(µ) ∝ exp( −(1/2) Σ (xi − µ)² ) exp( −µ²τ²/2 )
             ∝ exp( −(1/2)(n + τ²)( µ − Σxi/(n + τ²) )² ),

so that µ | x ∼ N( Σxi/(n + τ²), (n + τ²)^{−1} ).
drugs are actually useless. Alternatively, it is more serious to deem an innocent person
guilty than to say a guilty person is innocent.
In general, let X1 , · · · , Xn be iid, each taking values in X , each with unknown pdf/pmf
f . We have two hypotheses, H0 and H1 , about f . On the basis of data X = x, we
make a choice between the two hypotheses.
D. 12-30
A simple hypothesis H specifies f completely (eg. H0 : θ = 1/2). Otherwise, H is
a composite hypothesis .
E. 12-31
• A coin has P(Heads) = θ, and is thrown independently n times. We could have
H0 : θ = 1/2 versus H1 : θ = 3/4.
• Suppose X1 , · · · , Xn are iid discrete random variables. We could have H0 : the
distribution is Poisson with unknown mean, and H1 : the distribution is not Pois-
son.
• General parametric cases: Let X1 , · · · , Xn be iid with density f (x | θ). f is known
while θ is unknown. Then our hypotheses are H0 : θ ∈ Θ0 and H1 : θ ∈ Θ1 , with
Θ0 ∩ Θ1 = ∅.
• We could have H0 : f = f0 and H1 : f = f1, where f0 and f1 are densities that
are completely specified but do not come from the same parametric family.
D. 12-32
• For testing a null hypothesis H0 against an alternative hypothesis H1 , a test
procedure has to partition X n into two disjoint exhaustive regions C and C̄, such
that if x ∈ C, then H0 is rejected, and if x ∈ C̄, then H0 is not rejected. C is
called the critical region , its complement is called the acceptance region .2
• When performing a test, we may either arrive at a correct conclusion, or make one
of the two types of error:
1. Type I error : reject H0 when H0 is true.
2. Type II error : not rejecting H0 when H0 is false.
• The p-value is the probability of obtaining a result equal to or “more extreme”
than what was actually observed, when the null hypothesis H0 is true.
• When H0 and H1 are both simple, let

    Lx(H) = fX(x | θ = θ∗),

where θ∗ is the value of θ specified by the simple hypothesis H, and define the
likelihood ratio

    Λx(H0; H1) = Lx(H1)/Lx(H0).

A likelihood ratio test (LR test) is one where the critical region C is of the form
C = {x : Λx(H0; H1) > k} for some k.
2
Note that when we say “acceptance”, we really mean “non-rejection”! The name is purely for
historical reasons.
E. 12-33
Here the hypotheses are not treated symmetrically; H0 has precedence over H1 and
a Type I error is treated as more serious than a Type II error. The null hypothesis
is a conservative hypothesis, ie one of “no change,” “no bias,” “no association,”
and is only rejected if we have clear evidence against it. H1 represents the kind of
departure from H0 that is of interest to us.
Ideally we would like the probabilities of both types of error to be 0 (or at least very
small); however, typically it is not possible to find a test that makes both of them
arbitrarily small. Usually there is a trade-off. Nonetheless we would like to pick the
best possible test.
L. 12-34
<Neyman-Pearson lemma> Suppose H0 : f = f0 , H1 : f = f1 , where f0 and
f1 are continuous densities that are nonzero on the same regions. Then among all
tests of size less than or equal to α, the test with the largest power is the likelihood
ratio test of size α.
The likelihood ratio test of size α has critical region C = {x : f1(x) > kf0(x)},
where k is chosen so that α = P(X ∈ C | f0). Its type II error probability is

    β = P(X ∉ C | f1) = ∫_{C̄} f1(x) dx.
Let C ∗ be the critical region of any other test with size less than or equal to α.
Let α∗ = P(X ∈ C ∗ | f0 ) and β ∗ = P(X 6∈ C ∗ | f1 ). We want to show β ≤ β ∗ . We
know α∗ ≤ α, ie
    ∫_{C∗} f0(x) dx ≤ ∫_{C} f0(x) dx.
Also, on C we have f1(x) > kf0(x), while on C̄ we have f1(x) ≤ kf0(x). So

    ∫_{C̄∗∩C} f1(x) dx ≥ k ∫_{C̄∗∩C} f0(x) dx   and   ∫_{C̄∩C∗} f1(x) dx ≤ k ∫_{C̄∩C∗} f0(x) dx.
Hence

    β − β∗ = ∫_{C̄} f1(x) dx − ∫_{C̄∗} f1(x) dx
           = ∫_{C̄∩C∗} f1(x) dx + ∫_{C̄∩C̄∗} f1(x) dx − ∫_{C̄∗∩C} f1(x) dx − ∫_{C̄∗∩C̄} f1(x) dx
           = ∫_{C̄∩C∗} f1(x) dx − ∫_{C̄∗∩C} f1(x) dx
           ≤ k ∫_{C̄∩C∗} f0(x) dx − k ∫_{C̄∗∩C} f0(x) dx
           = k( ∫_{C̄∩C∗} f0(x) dx + ∫_{C∩C∗} f0(x) dx ) − k( ∫_{C̄∗∩C} f0(x) dx + ∫_{C∩C∗} f0(x) dx )
           = k(α∗ − α) ≤ 0.
Here we assumed f0 and f1 are continuous densities. However, this assumption
is needed only to ensure that a likelihood ratio test of exactly size α exists. Even
with non-continuous distributions, the likelihood ratio test is still a good idea. In
fact, for a discrete distribution, as long as a likelihood ratio test of exactly size α
exists, the same result holds.
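A one-observation illustration of the lemma (our own, not from the notes): with f0 = N(0, 1) and f1 = N(1, 1), the LR test rejects for large x (size-0.05 cutoff 1.645), while a two-sided test of the same size (cutoff 1.96) has strictly smaller power.

```python
# Compare power of the one-sided LR test against a two-sided test of equal size.
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

alpha = 0.05
size_lr = 1 - Phi(1.645)        # P0(X > 1.645)
size_two = 2 * (1 - Phi(1.96))  # P0(|X| > 1.96)
assert abs(size_lr - alpha) < 1e-3
assert abs(size_two - alpha) < 1e-3

power_lr = 1 - Phi(1.645 - 1)                      # P1(X > 1.645), X ~ N(1, 1)
power_two = (1 - Phi(1.96 - 1)) + Phi(-1.96 - 1)   # P1(|X| > 1.96)
assert power_lr > power_two   # the LR test is more powerful, as the lemma says
```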
E. 12-35
Suppose X1 , · · · , Xn are iid N (µ, σ02 ), where σ02 is known. We want to find the
best size α test of H0 : µ = µ0 against H1 : µ = µ1 , where µ0 and µ1 are known
fixed values with µ1 > µ0 . Then
    Λx(H0; H1) = [ (2πσ0²)^{−n/2} exp( −(1/(2σ0²)) Σ(xi − µ1)² ) ] / [ (2πσ0²)^{−n/2} exp( −(1/(2σ0²)) Σ(xi − µ0)² ) ]
               = exp( ((µ1 − µ0)/σ0²) n x̄ + n(µ0² − µ1²)/(2σ0²) ).
This is an increasing function of x̄, so for any k, Λx > k ⇔ x̄ > c for some c. Hence
we reject H0 if x̄ > c, where c is chosen such that P(X̄ > c | H0) = α. Under H0,
X̄ ∼ N(µ0, σ0²/n), so Z = √n(X̄ − µ0)/σ0 ∼ N(0, 1). Since x̄ > c ⇔ z > c′ for
some c′, the size α test rejects H0 if

    z = √n(x̄ − µ0)/σ0 > zα.
For example, suppose µ0 = 5, µ1 = 6, σ0 = 1, α = 0.05, n = 4 and x =
(5.1, 5.5, 4.9, 5.3). So x̄ = 5.2. From tables, z0.05 = 1.645. We have z = 0.4 and
this is less than 1.645. So x is not in the rejection region. We do not reject H0
at the 5% level and say that the data are consistent with H0 . Note that this does
not mean that we accept H0 . While we don’t have sufficient reason to believe it
is false, we also don’t have sufficient reason to believe it is true. This is called a
z-test .
In this example, LR tests reject H0 if z > k for some constant k. The size of such
a test is α = P(Z > k | H0) = 1 − Φ(k), which is decreasing as k increases. Our
observed value z will be in the rejection region iff z > k ⇔ α > p∗ = P(Z > z | H0).
The quantity p∗ is called the p-value of our observed data x. For the example
above, z = 0.4 and so p∗ = 1 − Φ(0.4) = 0.3446.
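The worked z-test can be transcribed directly into code (our own transcription of the numbers above):

```python
# z-test: mu0 = 5, sigma0 = 1, n = 4, x = (5.1, 5.5, 4.9, 5.3).
import math

x = (5.1, 5.5, 4.9, 5.3)
mu0, sigma0 = 5.0, 1.0
n = len(x)
xbar = sum(x) / n
z = math.sqrt(n) * (xbar - mu0) / sigma0

def Phi(t):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

p_value = 1 - Phi(z)   # the p-value p*

assert abs(xbar - 5.2) < 1e-9
assert abs(z - 0.4) < 1e-9
assert abs(p_value - 0.3446) < 1e-3
assert z < 1.645   # so H0 is not rejected at the 5% level
```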
In general, the p-value is sometimes called the “observed significance level” of x:
the probability under H0 of seeing data that is “more extreme” than our observed
data x.
To show this is UMP, we know that W (µ0 ) = α (by plugging in). W (µ) is an
increasing function of µ. So supµ≤µ0 W (µ) = α. So the first condition is satisfied.
For the second condition, observe that for any µ1 > µ0, the Neyman-Pearson size
α test of H0′ vs H1′ has critical region C. Let C∗ and W∗ belong to any other test
of H0 vs H1 of size ≤ α. Then C ∗ can be regarded as a test of H00 vs H10 of size
≤ α, and the Neyman-Pearson lemma says that W ∗ (µ1 ) ≤ W (µ1 ). This holds for
all µ1 > µ0 . So the condition is satisfied and it is UMP.
E. 12-39
So far we have considered disjoint hypotheses Θ0 , Θ1 . Sometimes it is easier to
take Θ1 = Θ rather than Θ \ Θ0 . Then
    Λx(H0; H1) = Lx(H1)/Lx(H0) = sup_{θ∈Θ1} f(x | θ) / sup_{θ∈Θ0} f(x | θ) ≥ 1,
Here’s an example testing a given mean with known variance (z-test). Suppose
that X1 , · · · , Xn are iid N (µ, σ02 ), with σ02 known, and we wish to test H0 : µ = µ0
against H1 : µ 6= µ0 (for given constant µ0 ). Here Θ0 = {µ0 } and Θ = R.
For the denominator, we have supθ∈Θ0 f (x | θ) = f (x | µ0 ). For the numerator
supµ∈Θ f (x | µ) = f (x | µ̂), where µ̂ is the mle. We know that µ̂ = x̄. Hence
    Λx(H0; H1) = [ (2πσ0²)^{−n/2} exp( −(1/(2σ0²)) Σ(xi − x̄)² ) ] / [ (2πσ0²)^{−n/2} exp( −(1/(2σ0²)) Σ(xi − µ0)² ) ].
Then H0 is rejected if Λx is large. To make our lives easier, we can use the
logarithm instead:

    2 log Λx(H0; H1) = (1/σ0²)[ Σ(xi − µ0)² − Σ(xi − x̄)² ] = (n/σ0²)(x̄ − µ0)².
So we can reject H0 if |√n(x̄ − µ0)/σ0| > c for some c. We know that under H0,
Z = √n(X̄ − µ0)/σ0 ∼ N(0, 1). So the size α generalised likelihood ratio test
rejects H0 if

    |√n(x̄ − µ0)/σ0| > z_{α/2}.
In fact a symmetric 100(1 − α)% confidence interval for µ is x̄ ± zα/2 σ0/√n. Therefore
we reject H0 iff µ0 is not in this confidence interval. Later we'll explore the
connection between confidence intervals and hypothesis tests further. Alternatively,
since n(X̄ − µ0)²/σ0² ∼ χ1² under H0, we reject H0 if

n(x̄ − µ0)²/σ0² > χ1²(α).

One can check that z²α/2 = χ1²(α), so the two tests agree. Note that this is a
two-tailed test: we reject H0 both for high and low values of x̄.
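The z-test above can be sketched numerically. The sample below is made up for illustration; the point is that 2 log Λ is exactly the square of the usual z statistic:

```python
import math

# Hypothetical data: n observations with known variance sigma0^2, testing
# H0: mu = mu0 against H1: mu != mu0 via the generalised likelihood ratio.
x = [5.2, 4.8, 5.6, 5.1, 4.9, 5.4, 5.3, 4.7]   # made-up sample
mu0, sigma0 = 5.0, 0.5
n = len(x)
xbar = sum(x) / n

# 2 log Lambda = n (xbar - mu0)^2 / sigma0^2, the square of the z statistic.
z = math.sqrt(n) * (xbar - mu0) / sigma0
two_log_lambda = n * (xbar - mu0) ** 2 / sigma0 ** 2
assert math.isclose(two_log_lambda, z ** 2)

# Size-0.05 test: reject H0 iff |z| > z_{alpha/2} = 1.96, equivalently
# 2 log Lambda > chi^2_1(0.05) = 1.96^2.
reject = abs(z) > 1.96
```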
12.2. HYPOTHESIS TESTING 537
D. 12-40
Suppose we have a vector parameter Θ = {θ : θ = (θ1 , · · · , θk )}, we say Θ has k free
parameters and write |Θ| = k. If H0 : θ ∈ Θ0 imposes p independent restrictions
on Θ, then we say Θ0 has k − p free parameters and we write |Θ0 | = k − p.
E. 12-41
We consider the "size" or "dimension" of our hypotheses: suppose that H0 imposes
p independent restrictions on Θ = {θ : θ = (θ1 , · · · , θk )}, so for example
• H0 : θij = aj for j = 1, · · · , p; or
• H0 : Aθ = b (with A a p × k matrix and b a p × 1 matrix (vector) given); or
• H0 : θi = fi (ϕ) for i = 1, · · · , k for some ϕ = (ϕ1 , · · · , ϕk−p ) freely chosen.
Then Θ has k free parameters and Θ0 has k − p free parameters. And |Θ0 | = k − p
and |Θ| = k.
T. 12-42
<Generalized likelihood ratio theorem> Suppose Θ0 ⊆ Θ1 and |Θ1| − |Θ0| =
p. Let X = (X1, · · · , Xn) with all Xi iid. Then if H0 is true, as n → ∞,

2 log Λx(H0; H1) ∼ χp² approximately.

If H0 is not true, then 2 log Λ tends to be larger. We reject H0 if 2 log Λ > c, where
c = χp²(α) for a test of approximately size α.
We will not prove this. In the z-test example above, |Θ1| − |Θ0| = 1, and we saw
that under H0, 2 log Λ ∼ χ1² exactly for all n in that particular case, rather than
just approximately. This theorem allows us to use likelihood ratio tests even when
we cannot find the exact relevant null distribution.
Here |Θ1 | − |Θ0 | = k − 1. So we reject H0 if 2 log Λ > χ2k−1 (α) for an approximate
size α test.
For H0 (no effect of month of birth), let p̃i be the proportion of births in month
i in, say, year 1993/94; this is not simply proportional to the number of days
in each month (or, even worse, 1/12), as there is for example an excess of September
births (the "Christmas effect"). It turns out that

2 log Λ = 2 Σ ni log( ni/(np̃i) ) = 44.9.

P(χ11² > 44.86) = 3 × 10⁻⁹, which is our p-value. Since this is certainly less than
0.001, we can reject H0 at the 0.1% level (ie. with test size α = 0.001), or can
say the result is "significant at the 0.1% level".
The traditional levels for comparison are α = 0.05, 0.01, 0.001, roughly correspond-
ing to "evidence", "strong evidence" and "very strong evidence".
C. 12-44
<Pearson's Chi-squared test> Like the above example, a similar situation
has H0 : pi = pi(θ) for some parameter θ and H1 unrestricted (except the obvious
positivity and summing to 1). Now |Θ0| is the number of independent parameters
to be estimated under H0. Under H0, we find the mle θ̂ by maximizing Σ ni log pi(θ),
and then

2 log Λ = 2 log [ p̂1^n1 · · · p̂k^nk / ( p1(θ̂)^n1 · · · pk(θ̂)^nk ) ] = 2 Σ ni log( ni/(npi(θ̂)) ).   (?)

Expanding the logarithm to second order gives the Pearson chi-squared approximation
2 log Λ ≈ Σ (oi − ei)²/ei, where oi = ni and ei = npi(θ̂).
E. 12-45
Mendel crossed 556 smooth yellow male peas with wrinkled green peas. From the
progeny, let
1. N1 be the number of smooth yellow peas,
2. N2 be the number of smooth green peas,
3. N3 be the number of wrinkled yellow peas,
4. N4 be the number of wrinkled green peas.
We wish to test the goodness of fit of the model

H0 : (p1, p2, p3, p4) = (9/16, 3/16, 3/16, 1/16).
Suppose we observe (n1, n2, n3, n4) = (315, 108, 102, 31). We find (e1, e2, e3, e4) =
(312.75, 104.25, 104.25, 34.75). The actual 2 log Λ = 0.618 and the approximation
Σ (oi − ei)²/ei = 0.604. Here |Θ0| = 0 and |Θ1| = 4 − 1 = 3. So we refer the
test statistics to χ3².
Since χ23 (0.05) = 7.815, we see that neither value is significant at 5%. So there
is no evidence against Mendel’s theory. In fact, the p-value is approximately
P(χ23 > 0.6) ≈ 0.96. This is a really good fit!
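Both statistics in the Mendel example can be recomputed directly from the observed counts; a minimal sketch:

```python
import math

# Goodness-of-fit statistics for Mendel's pea data.
o = [315, 108, 102, 31]              # observed progeny counts
p0 = [9/16, 3/16, 3/16, 1/16]        # H0 probabilities
n = sum(o)
e = [n * p for p in p0]              # expected counts: 312.75, 104.25, ...

two_log_lambda = 2 * sum(oi * math.log(oi / ei) for oi, ei in zip(o, e))
pearson = sum((oi - ei) ** 2 / ei for oi, ei in zip(o, e))
# Both are far below chi^2_3(0.05) = 7.815, so H0 is not rejected.
```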
E. 12-46
In a genetics problem, each individual has one of the three possible genotypes,
with probabilities p1, p2, p3. Suppose we wish to test H0 : pi = pi(θ), where (under
the usual Hardy–Weinberg model) p1(θ) = θ², p2(θ) = 2θ(1 − θ), p3(θ) = (1 − θ)².
We find that θ̂ = (2n1 + n2)/2n. Also, |Θ0| = 1 and |Θ1| = 2. After conducting an
experiment, we can substitute pi(θ̂) into (?) in [C.12-44], or find the corresponding
Pearson's chi-squared statistic, and refer to χ1².
D. 12-47
A contingency table is a table in which observations or individuals are classified
according to one or more criteria.
E. 12-48
Suppose N is a set of people. Classifying them according to some criterion gives a
partition N = a1 ∪ a2 ∪ · · · ∪ ar. Classifying them according to some other criterion
gives another partition N = b1 ∪ b2 ∪ · · · ∪ bs. The r × s table (nij) with entries
nij = |ai ∩ bj| is a two-way contingency table. The entry nij tells us how many
people are in both of the categories ai and bj:

        b1    · · ·   bs
a1      n11   · · ·   n1s
..      ..            ..
ar      nr1   · · ·   nrs

If the two classifications are independent we expect, for all i, j,

|ai ∩ bj| / |N| = (|ai| / |N|) · (|bj| / |N|).
540 CHAPTER 12. STATISTICS
C. 12-49
<Testing independence in contingency tables> Consider a two-way contingency
table with r rows and c columns. For i = 1, · · · , r and j = 1, · · · , c,
let pij be the probability that an individual selected from the population under
consideration is classified in row i and column j (ie. in the (i, j) cell of the table).
Let pi+ = Σj pij = P(in row i) and p+j = Σi pij = P(in column j). Then we must have
p++ = Σi Σj pij = 1. Suppose a random sample of n individuals is taken, and let nij be
the number of these classified in the (i, j) cell of the table. Let ni+ = Σj nij and
n+j = Σi nij, so n++ = n. We have (N11, · · · , Nrc) ∼ multinomial(n; p11, · · · , prc).
We may be interested in testing the null hypothesis that the two classifications are
independent. So we test
H0 : pij = pi+ p+j for all i, j, ie. independence of columns and rows
H1 : pij are unrestricted (except the obvious p++ = 1, pij ≥ 0).
Under H1 , the mles are p̂ij = nij /n. Under H0 , the mles are p̂i+ = ni+ /n and
p̂+j = n+j /n. Write oij = nij and eij = np̂i+ p̂+j = ni+ n+j /n. Then
2 log Λ = 2 Σi=1..r Σj=1..c oij log( oij/eij ) ≈ Σi=1..r Σj=1..c (oij − eij)²/eij,
using the same approximating steps as for Pearson's chi-squared test. We have |Θ1| =
rc − 1, because under H1 the pij's sum to one. Also, |Θ0| = (r − 1) + (c − 1), because
p1+, · · · , pr+ must satisfy Σi pi+ = 1 and p+1, · · · , p+c must satisfy Σj p+j = 1.
So
|Θ1 | − |Θ0 | = rc − 1 − (r − 1) − (c − 1) = (r − 1)(c − 1).
E. 12-50
500 people with recent car changes were asked about their previous and new cars.
The results are as follows:

                            New car
                    Large   Medium   Small   Total
Previous   Large      56       52      42     150
car        Medium     50       83      67     200
           Small      18       51      81     150
           Total     124      186     190     500
This is a two-way contingency table: Each person is classified according to the
previous car size and new car size. We wish to test H0 : the new and previous car
sizes are independent. The expected values under H0 are

                            New car
                    Large   Medium   Small   Total
Previous   Large     37.2     55.8    57.0    150
car        Medium    49.6     74.4    76.0    200
           Small     37.2     55.8    57.0    150
           Total      124      186     190    500
12.2. HYPOTHESIS TESTING 541
Note the margins are the same. It is quite clear that they do not match well, but
we can find the p-value to be sure. We have

Σi Σj (oij − eij)²/eij = 36.20,
and the degrees of freedom is (3 − 1)(3 − 1) = 4. From the tables, χ24 (0.05) = 9.488
and χ24 (0.01) = 13.28. So our observed value of 36.20 is significant at the 1% level,
ie. there is strong evidence against H0 . So we conclude that the new and present
car sizes are not independent.
It may be informative to look at the contributions of each cell to Pearson's chi-
squared:

                            New car
                    Large   Medium   Small
Previous   Large     9.50     0.26    3.95
car        Medium    0.00     0.99    1.07
           Small     9.91     0.41   10.11
It seems that more owners of large cars than expected under H0 bought another
large car, and more owners of small cars than expected under H0 bought another
small car. Fewer than expected changed from a small to a large car.
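The whole computation for this table fits in a few lines; a sketch reproducing the expected counts and the Pearson statistic:

```python
# Pearson chi-squared test of independence for the car-change table.
obs = [[56, 52, 42],
       [50, 83, 67],
       [18, 51, 81]]
row = [sum(r) for r in obs]           # 150, 200, 150
col = [sum(c) for c in zip(*obs)]     # 124, 186, 190
n = sum(row)                          # 500

stat = 0.0
for i in range(3):
    for j in range(3):
        e = row[i] * col[j] / n       # expected count under independence
        stat += (obs[i][j] - e) ** 2 / e
df = (3 - 1) * (3 - 1)                # compare stat with chi^2_4 quantiles
```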
C. 12-51
<Tests of homogeneity> We want to test whether two or more multino-
mial distributions are equal. In general, we have independent observations from
r multinomial distributions, each of which has c categories, ie. we observe an
r × c table (nij ), for i = 1, · · · , r and j = 1, · · · , c, where (Ni1 , · · · , Nic ) ∼
multinomial(ni+ , pi1 , · · · , pic ) independently for each i = 1, · · · , r. So the rows
of the table are the results for the different multinomial distributions. We want to
test

H0 : pij = pj for all i (ie. all r distributions are the same)
H1 : the pij arbitrary (each row summing to one).

Using H1,

like(pij) = Πi=1..r [ ni+! / (ni1! · · · nic!) ] pi1^ni1 · · · pic^nic
⟹ log like = constant + Σi=1..r Σj=1..c nij log pij,
which is the same as what we had last time, when the row totals are unrestricted!
We have |Θ1| = r(c − 1) and |Θ0| = c − 1. So the degrees of freedom is r(c − 1) −
(c − 1) = (r − 1)(c − 1).
E. 12-52
150 patients were randomly allocated to three groups of 50 patients each. Two
groups were given a new drug at different dosage levels, and the third group
received a placebo. The responses were as shown in the table below.
Improved No difference Worse Total
Placebo 18 17 15 50
Half dose 20 10 20 50
Full dose 25 13 12 50
Total 63 40 47 150
Here the row totals are fixed in advance, in contrast to our last section, where the
row totals are random variables. For the above, we may be interested in testing
H0 : the probability of “improved” is the same for each of the three treatment
groups, and so are the probabilities of “no difference” and “worse”, ie. H0 says
that we have homogeneity down the rows. The expected counts under H0 are
Improved No difference Worse Total
Placebo 21 13.3 15.7 50
Half dose 21 13.3 15.7 50
Full dose 21 13.3 15.7 50
Total 63 40 47 150
We find 2 log Λ = 5.129, and we refer this to χ4². Clearly this is not significant:
the mean of χ4² is 4, and a value of 5.129 is something we would expect to happen
solely by chance.
We can calculate the p-value: from tables, χ4²(0.05) = 9.488, so our observed value
is not significant at 5%, and the data are consistent with H0. We conclude that
there is no evidence for a difference between the drug at the given doses and the
placebo. For interest, Σ (oij − eij)²/eij = 5.173, giving the same conclusion.
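For the homogeneity test both statistics can be computed directly, since under H0 the expected count in every column j is 50 × n+j/n (each row total being fixed at 50):

```python
import math

# 2 log Lambda and Pearson statistic for the drug-trial table.
obs = [[18, 17, 15],
       [20, 10, 20],
       [25, 13, 12]]
col = [sum(c) for c in zip(*obs)]     # 63, 40, 47
n = sum(map(sum, obs))                # 150

G = 0.0     # 2 log Lambda
X2 = 0.0    # Pearson chi-squared
for i in range(3):
    for j in range(3):
        e = 50 * col[j] / n           # each row total is 50
        G += 2 * obs[i][j] * math.log(obs[i][j] / e)
        X2 += (obs[i][j] - e) ** 2 / e
# Both are referred to chi^2 with (3-1)(3-1) = 4 degrees of freedom.
```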
T. 12-53
Suppose X1 , · · · , Xn have joint pdf fX (x | θ) for θ ∈ Θ
1. Suppose that for every θ0 ∈ Θ there is a size α test of H0 : θ = θ0 . Denote
the acceptance region by A(θ0 ). Then the set I(X) = {θ : X ∈ A(θ)} is a
100(1 − α)% confidence set for θ.
2. Suppose I(X) is a 100(1 − α)% confidence set for θ. Then A(θ0 ) = {X : θ0 ∈
I(X)} is an acceptance region for a size α test of H0 : θ = θ0 .
tT Σt = tT cov(X)t = var(tT X) ≥ 0.
544 CHAPTER 12. STATISTICS
Then (1)[[ Xi ∼ Nni(µi, Σii) ]] and (2)[[ X1 and X2 are independent iff Σ12 = 0 ]].
1. Note that MX1(t1) = MX( (t1; 0) ) = exp( t1ᵀµ1 + ½ t1ᵀΣ11 t1 ), and similarly for
component 2.
2. Note that by symmetry of Σ, Σ12 = 0 if and only if Σ21 = 0. Recall MX(t) =
exp( tᵀµ + ½ tᵀΣt ) for each t ∈ Rⁿ. We write t = (t1; t2). Then

MX(t) = exp( t1ᵀµ1 + t2ᵀµ2 + ½ t1ᵀΣ11 t1 + ½ t2ᵀΣ22 t2 + ½ t1ᵀΣ12 t2 + ½ t2ᵀΣ21 t1 ).

From (1), we know that MXi(ti) = exp( tiᵀµi + ½ tiᵀΣii ti ). Therefore MX(t) =
MX1(t1)MX2(t2) for all t if and only if Σ12 = 0.
P. 12-60
When Σ is positive definite, X ∼ Nn(µ, Σ) has pdf

fX(x; µ, Σ) = (2π)^(−n/2) |Σ|^(−1/2) exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ).

Note that Σ is always positive semi-definite. The condition just forbids the case
|Σ| = 0, since this would lead to dividing by zero.
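As a sanity check on this density, when Σ is diagonal the joint density factorises into a product of univariate normal densities (consistent with the independence result above). A small sketch for n = 2, with explicit 2 × 2 inverse and determinant:

```python
import math

def mvn_pdf_2d(x, mu, Sigma):
    """Density of N_2(mu, Sigma) for positive definite 2x2 Sigma."""
    a, b = Sigma[0]
    c, d = Sigma[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# With Sigma12 = 0 the joint density is the product of the marginals.
f = mvn_pdf_2d([1.0, -0.5], [0.0, 1.0], [[2.0, 0.0], [0.0, 3.0]])
assert math.isclose(f, norm_pdf(1.0, 0.0, 2.0) * norm_pdf(-0.5, 1.0, 3.0))
```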
12.2. HYPOTHESIS TESTING 545
T. 12-61
<Joint distribution of X̄ and SXX> Suppose X1, · · · , Xn are iid N(µ, σ²).
Write X̄ = (1/n) Σ Xi and SXX = Σ (Xi − X̄)². Then
1. X̄ ∼ N (µ, σ 2 /n)
2. SXX /σ 2 ∼ χ2n−1 .
3. X̄ and SXX are independent.
We can write the joint density as X ∼ Nn(µ, σ²I), where µ = (µ, µ, · · · , µ). Let
A be an n × n orthogonal matrix with first row all 1/√n (the other rows are
not important, as long as A is orthogonal). One possible such matrix has first row
(1/√n, · · · , 1/√n), and for k = 2, · · · , n, the k-th row has its first k − 1 entries
equal to 1/√(k(k − 1)), its k-th entry equal to −(k − 1)/√(k(k − 1)), and zeros
thereafter; so the second row is (1/√2, −1/√2, 0, · · · , 0), the third row is
(1/√6, 1/√6, −2/√6, 0, · · · , 0), and the last row is
(1/√(n(n − 1)), · · · , 1/√(n(n − 1)), −(n − 1)/√(n(n − 1))).

Now define Y = AX. Then Y ∼ Nn(Aµ, Aσ²IAᵀ) = Nn(Aµ, σ²I). We have
Aµ = (√n µ, 0, · · · , 0)ᵀ. So Y1 ∼ N(√n µ, σ²) and Yi ∼ N(0, σ²) for i = 2, · · · , n.
Also, Y1, · · · , Yn are independent, since the covariance matrix has every non-
diagonal term equal to 0. But from the definition of A, we have

Y1 = (1/√n) Σi=1..n Xi = √n X̄.

So √n X̄ ∼ N(√n µ, σ²), or X̄ ∼ N(µ, σ²/n). Also, since A is orthogonal,
Σ Yi² = YᵀY = XᵀAᵀAX = XᵀX = Σ Xi², and hence

SXX = Σ Xi² − nX̄² = Σi=1..n Yi² − Y1² = Σi=2..n Yi² ∼ σ²χ²n−1,

which is a function of Y2, · · · , Yn only, hence independent of Y1 = √n X̄.
D. 12-62
• Suppose that Z ∼ N(0, 1) and Y ∼ χk² are independent; then T = Z/√(Y/k) is said
to have a t-distribution on k degrees of freedom, and we write T ∼ tk. We write
tk(α) for the upper 100α% point of the tk distribution, so that P(T > tk(α)) = α.
• Suppose U and V are independent with U ∼ χm² and V ∼ χn². Then X = (U/m)/(V/n)
is said to have an F-distribution on m and n degrees of freedom and we write
X ∼ Fm,n. We write Fm,n(α) for the upper 100α% point of the Fm,n-distribution,
so that if X ∼ Fm,n, then P(X > Fm,n(α)) = α.
546 CHAPTER 12. STATISTICS
E. 12-63
• Since U and V have means m and n respectively, U/m and V/n are both approximately
1 for large m and n. So X is often approximately 1.
• Note that from the definition it's clear that if X ∼ Fm,n, then 1/X ∼ Fn,m.
Suppose that we have the upper 5% points for all Fn,m. Using this information, it
is easy to find the lower 5% point for Fm,n, since we know that P(Fm,n < 1/x) =
P(Fn,m > x).
• Note that it is immediate from definitions of tn and F1,n that if Y ∼ tn , then
Y 2 ∼ F1,n , ie. it is a ratio of independent χ21 and χ2n variables.
P. 12-64
Let T ∼ tk.
1. The density of T is

fT(t) = [ Γ((k + 1)/2) / ( Γ(k/2) √(πk) ) ] ( 1 + t²/k )^(−(k+1)/2).
depend on the age and sex of the driver, and where they live (explanatory variables)?
As the name suggests, we assume the relationship is linear. In general we do not assume
normality (that the variables are normally distributed) in our calculations here, but
we will consider it in places.
Suppose we have p covariates xj, and we have n observations Yi. We assume n > p, or
else we can pick the parameters to fit our data exactly. We assume our n observations
(responses) are modelled as

Yi = β1 xi1 + · · · + βp xip + εi   for i = 1, · · · , n,   (∗)

where
• β1 , · · · , βp are unknown, fixed parameters we wish to work out (with n > p)
• xi1 , · · · , xip are the values of the p covariates for the ith response (which are all
known).
• ε1 , · · · , εn are independent (or possibly just uncorrelated) random variables with
mean 0 and variance σ 2 . We assume homoscedasticity here, that is all these εi have
the same variance. The case that is not homoscedasticity is called heteroscedasticity .
We think of the βj xij terms to be the causal effects of xij and εi to be a random
fluctuation (error term). Then we clearly have
1. E(Yi) = β1 xi1 + · · · + βp xip.
2. var(Yi ) = var(εi ) = σ 2 .
3. Y1 , · · · , Yn are independent.
Note that (∗) is linear in the parameters β1, · · · , βp. Obviously the real world can be
much more complicated, but this is much easier to work with. In terms of matrices,

Y = (Y1, · · · , Yn)ᵀ,  X = (xij) (an n × p matrix),  β = (β1, · · · , βp)ᵀ,  ε = (ε1, · · · , εn)ᵀ.
For each individual i, we let Yi be the time to run 2 miles, and xi be the maximum
volume of oxygen uptake, i = 1, · · · , 24. We might want to fit a straight line to it.
So a possible model is Yi = a + bxi + εi, where the εi are independent random variables
with variance σ², and a and b are constants. In matrix form, Y = (Y1, · · · , Y24)ᵀ,
X is the 24 × 2 matrix with i-th row (1, xi), β = (a, b)ᵀ and ε = (ε1, · · · , ε24)ᵀ.
Then Y = Xβ + ε.
D. 12-67
In a linear model Y = Xβ + ε, the least squares estimator β̂ of β minimizes

S(β) = ‖Y − Xβ‖² = (Y − Xβ)ᵀ(Y − Xβ) = Σi=1..n ( Yi − Σj xij βj )².

Setting the partial derivatives to zero, −2 Σi xik (Yi − Σj xij β̂j) = 0 for each k,
that is Σi xik xij β̂j = Σi xik Yi for all k (with implicit summation over j).
Putting this back in matrix form gives the normal equations XᵀXβ̂ = XᵀY, so
β̂ = (XᵀX)⁻¹XᵀY when X has full rank. We could also have derived this by completing
the square of (Y − Xβ)ᵀ(Y − Xβ), but that would be more complicated.
We assumed that X is of full rank p, so ‖Xt‖ ≠ 0 for all non-zero t. Hence
tᵀ(XᵀX)t = ‖Xt‖² > 0 for all non-zero t, so XᵀX is positive definite and in
particular invertible.
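The normal equations can be solved explicitly in the two-parameter case. A sketch fitting a straight line Yi = a + bxi + εi to made-up data, inverting the 2 × 2 matrix XᵀX by hand:

```python
# Normal equations X^T X beta = X^T Y for the straight-line model
# Y_i = a + b x_i + eps_i (made-up data with a little noise).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

# Entries of X^T X and X^T Y for the design matrix with rows (1, x_i).
sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))

det = n * sxx - sx * sx               # nonzero since X has full rank
a_hat = (sxx * sy - sx * sxy) / det   # intercept estimate
b_hat = (n * sxy - sx * sy) / det     # slope estimate
```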
T. 12-70
<Gauss Markov theorem> In a full rank linear model, let β̂ be the least
squares estimator of β and let β ∗ be any other unbiased estimator for β which is
linear in the Yi ’s. Then var(tT β̂) ≤ var(tT β ∗ ) for all t ∈ Rp .
l(β, σ²) = −(n/2) log 2π − (n/2) log σ² − (1/2σ²) S(β)
⟹ σ̂² = (1/n) S(β̂) = (1/n)(Y − Xβ̂)ᵀ(Y − Xβ̂) = RSS/n.
This isn’t coincidence! Historically, when Gauss devised the normal distribution,
he designed it so that the least squares estimator is the same as the maximum
likelihood estimator. Note also that the linear model under normal assumptions
is a special case of the linear model we just had, so all previous results hold.
L. 12-74
1. If Z ∼ Nn (0, σ 2 I) and A is n × n, symmetric, idempotent with rank r, then
ZT AZ ∼ σ 2 χ2r .
2. For a symmetric idempotent matrix A, rank(A) = tr(A).
Λ = QT AQ = diag(λ1 , · · · , λn ) = diag(1, · · · , 1, 0, · · · , 0)
1. We have β̂ = (X T X)−1 X T Y. Call this CY for later use. Then β̂ has a normal
distribution with mean (X T X)−1 X T (Xβ) = β and covariance
(X T X)−1 X T (σ 2 I)[(X T X)−1 X T ]T = σ 2 (X T X)−1 .
So β̂ ∼ Np (β, σ 2 (X T X)−1 ).
2. Our previous lemma says that ZT AZ ∼ σ 2 χ2r . So we want to pick our Z and
A so that ZT AZ = RSS, and the degrees of freedom of A being r = n − p. Let
Z = Y − Xβ and A = (In − P ), where P = X(X T X)−1 X T .
We first check that the conditions of the lemma hold: Since Y ∼ Nn (Xβ, σ 2 I),
Z = Y − Xβ ∼ Nn (0, σ 2 I). Since P is idempotent, In − P also is. We also
have rank(In − P ) = tr(In − P ) = n − p. Therefore the conditions of the lemma
hold.
To get the final useful result, we want to show that the RSS is indeed ZT AZ.
We simplify the expressions of RSS and ZT AZ and show that they are equal:
ZT AZ = (Y − Xβ)T (In − P )(Y − Xβ) = YT (In − P )Y.
Noting the fact that (In − P )X = 0. Since R = Y − Ŷ = (In − P )Y, we have
RSS = RT R = YT (In − P )Y using the symmetry and idempotence of In − P .
Hence RSS = ZT AZ ∼ σ 2 χ2n−p . Therefore
σ̂² = RSS/n ∼ (σ²/n) χ²n−p.
D. 12-76
• σ̃² = RSS/(n − p) (an unbiased estimator of σ²) is called the residual standard
error squared, on n − p degrees of freedom.
• In the linear normal model, β̂ ∼ Np(β, σ²(XᵀX)⁻¹). So β̂j ∼ N(βj, σ²(XᵀX)⁻¹jj).
The standard error of β̂j is defined to be

SE(β̂j) = √( σ̃² (XᵀX)⁻¹jj ),  where σ̃² = RSS/(n − p).
By writing it in this somewhat weird form, we now recognize both the numerator
and denominator. The numerator is a standard normal N(0, 1), and the denominator
is an independent √( χ²n−p /(n − p) ), as we have previously shown. But a standard
normal divided by the square root of an independent χ² over its degrees of freedom
is, by definition, the t-distribution. So

(β̂j − βj) / SE(β̂j) ∼ tn−p.

So a 100(1 − α)% confidence interval for βj has end points β̂j ± SE(β̂j) tn−p(α/2).
In particular, if we want to test H0 : βj = 0, we use the fact that under H0,
β̂j / SE(β̂j) ∼ tn−p.
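The t-test of H0 : b = 0 in simple linear regression can be sketched end to end. The data below are made up; the steps are the ones just described (fit, RSS, σ̃², standard error, t statistic):

```python
import math

# t test of H0: b = 0 in Y_i = a0 + b(x_i - xbar) + eps_i (made-up data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 2.1, 2.8, 4.3, 4.9, 6.0]
n, p = len(xs), 2
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
SxY = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b_hat = SxY / Sxx                               # least squares slope
fitted = [ybar + b_hat * (x - xbar) for x in xs]
RSS = sum((y - f) ** 2 for y, f in zip(ys, fitted))
sigma_tilde2 = RSS / (n - p)                    # unbiased estimate of sigma^2
SE_b = math.sqrt(sigma_tilde2 / Sxx)            # sqrt(sigma~^2 (X^T X)^-1_22)
t = b_hat / SE_b                                # compare with t_{n-p}(alpha/2)
```

Here t4(0.025) = 2.776, so H0 is rejected iff |t| exceeds that.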
E. 12-78
<Wafer example> Suppose we want to measure the resistivity of silicon wafers.
We have five instruments, and five wafers were measured by each instrument (so
we have 25 wafers in total). We assume that the silicon wafers are all the same, and
want to see whether the instruments are consistent with each other. The results
include, for the first instrument:

                      Wafer
                 1      2      3      4      5
Instrument  1  130.5  112.4  118.9  125.7  134.0
We model Yi,j = µi + εi,j, where the εij are independent random variables such
that E[εij] = 0 and var(εij) = σ², and the µi's are unknown constants. This can
be written in matrix form Y = Xβ + ε, with Y = (Y1,1, · · · , Y1,5, Y2,1, · · · , Y5,5)ᵀ,
β = (µ1, · · · , µ5)ᵀ, ε = (ε1,1, · · · , ε5,5)ᵀ, and X the 25 × 5 indicator matrix
whose rows 1–5 have a 1 in column 1, rows 6–10 a 1 in column 2, and so on, with
all other entries 0.
Then Y = Xβ + ε. We have

XᵀX = diag(5, · · · , 5) = 5I  ⟹  (XᵀX)⁻¹ = (1/5) I.
So we have

µ̂ = (XᵀX)⁻¹XᵀY = (Ȳ1, · · · , Ȳ5)ᵀ.
The residual sum of squares is

RSS = Σi=1..5 Σj=1..5 (Yi,j − µ̂i)² = Σi=1..5 Σj=1..5 (Yi,j − Ȳi)² = 2170.

This RSS has n − p = 25 − 5 = 20 degrees of freedom. So σ̃ = √( RSS/(n − p) ) =
√(2170/20) = 10.4.
With normality
Suppose in our simple linear regression Yi = a0 + b(xi − x̄) + εi, where x̄ = Σ xi/n,
we have the εi iid N(0, σ²) for i = 1, · · · , n. Then

â0 = Ȳ ∼ N( a0, σ²/n ),   b̂ = SxY/Sxx ∼ N( b, σ²/Sxx ),
Ŷi = â0 + b̂(xi − x̄),   RSS = Σ (Yi − Ŷi)² ∼ σ²χ²n−2,

and (â0, b̂) and σ̂² = RSS/n are independent, as we have previously shown. Note
that σ̂² is obtained by dividing RSS by n, and is the maximum likelihood estimator.
On the other hand, σ̃² is obtained by dividing RSS by n − p, and is an unbiased
estimator.
E. 12-80
Using the oxygen/time example, we have RSS = Σi (yi − ŷi)² = Σi (yi − â0 −
b̂(xi − x̄))² = 67968. So the residual standard error squared is

σ̃² = RSS/(n − p) = 67968/(24 − 2) = 3089 = 55.6²

on 22 degrees of freedom. So the standard error of b̂ is

SE(b̂) = √( σ̃² (XᵀX)⁻¹₂₂ ) = √( 3089/Sxx ) = 55.6/28.0 = 1.99.
12.3. LINEAR MODELS 555
So a 95% confidence interval for b has end points b̂ ± SE(b̂) t22(0.025) =
−12.9 ± 1.99 × 2.07 = (−17.0, −8.8), using the fact that t22(0.025) = 2.07. Note
that this interval does not contain 0. So if we want to carry out a size 0.05 test
of H0 : b = 0 (they are uncorrelated) vs H1 : b ≠ 0 (they are correlated), the test
statistic would be b̂/SE(b̂) = −12.9/1.99 = −6.48. Then we reject H0 because this
is less than −t22(0.025) = −2.07.
x∗ᵀ(β̂ − β) / (σ̃τ) ∼ tn−p,  where τ² = x∗ᵀ(XᵀX)⁻¹x∗.

Then a confidence interval for the expected response x∗ᵀβ has end points

x∗ᵀβ̂ ± σ̃τ tn−p(α/2).
We have a confidence interval for x∗ᵀβ; what about one for Y∗ = x∗ᵀβ + ε∗? The
predicted response at x∗ is Y∗ = x∗ᵀβ + ε∗, where ε∗ ∼ N(0, σ²), and Y∗ is independent
of Y1, · · · , Yn. Here we have more uncertainties in our prediction: β and ε∗. A
100(1 − α)% prediction interval for Y ∗ is an interval I(Y) such that P(Y ∗ ∈ I(Y)) =
1 − α, where the probability is over the joint distribution of Y ∗ , Y1 , · · · , Yn . So I is a
random function of the past data Y that outputs an interval.
First of all, as above, the predicted expected response is Ŷ∗ = x∗ᵀβ̂. This is an
unbiased predictor, since Ŷ∗ − Y∗ = x∗ᵀ(β̂ − β) − ε∗, and hence E[Ŷ∗ − Y∗] =
x∗ᵀ(β − β) − 0 = 0. To find the variance, we use the fact that x∗ᵀ(β̂ − β) and ε∗ are
independent, and the variance of the sum of independent variables is the sum of the
variances. So

var(Ŷ∗ − Y∗) = var( x∗ᵀ(β̂ − β) ) + var(ε∗) = σ²τ² + σ².

We can see this as the uncertainty in the regression line, σ²τ², plus the wobble about
the regression line, σ². So Ŷ∗ − Y∗ ∼ N(0, σ²(τ² + 1)). We therefore find that

(Ŷ∗ − Y∗) / ( σ̃ √(τ² + 1) ) ∼ tn−p.
So the interval with end points

x∗ᵀβ̂ ± σ̃ √(τ² + 1) tn−p(α/2)
is a 100(1 − α)% prediction interval for Y ∗ . We don’t call this a confidence interval
— confidence intervals are about finding parameters of the distribution, while the
prediction interval is about our predictions.
E. 12-81
Previous example continued: Suppose we wish to estimate the time to run 2 miles
for a man with an oxygen take-up measurement of 50. Here x∗ᵀ = (1, 50 − x̄),
where x̄ = 48.6. The estimated expected response at x∗ is x∗ᵀβ̂ = 808.5, and

τ² = x∗ᵀ(XᵀX)⁻¹x∗ = 1/n + (x∗₂)²/Sxx = 1/24 + 1.4²/783.5 = 0.044 = 0.21².

So a 95% confidence interval for E[Y | x∗] is

x∗ᵀβ̂ ± σ̃τ tn−p(α/2) = 808.5 ± 55.6 × 0.21 × 2.07 = (783.6, 832.2).
Note that this is the confidence interval for the predicted expected value, NOT
for the actual observed value. A 95% prediction interval for Y∗ at
x∗ᵀ = (1, 50 − x̄) is

x∗ᵀβ̂ ± σ̃ √(τ² + 1) tn−p(α/2) = 808.5 ± 55.6 × 1.02 × 2.07 = (691.1, 925.8).

Note that this is much wider than the interval for the expected response! This is
since there are three sources of uncertainty: we don't know what σ is, what b is,
and the random ε fluctuation.
E. 12-82
Wafer example continued: Suppose we wish to estimate the expected resistivity of
a new wafer in the first instrument. Here x∗ᵀ = (1, 0, · · · , 0) (recall that x is an
indicator vector indicating which instrument is used). The estimated response at
x∗ is x∗ᵀµ̂ = µ̂1 = ȳ1 = 124.3. We find τ² = x∗ᵀ(XᵀX)⁻¹x∗ = 1/5. So a 95%
confidence interval for E[Y1∗] is

x∗ᵀµ̂ ± σ̃τ tn−p(α/2) = 124.3 ± (10.4/√5) × 2.09 = (114.6, 134.0).

Note that we are using an estimate of σ obtained from all five instruments. If we
had only used the data from the first instrument, σ would be estimated as

σ̃1 = √( Σj=1..5 (y1,j − ȳ1)² / (5 − 1) ) = 8.74.
Let Xi = AiZ, i = 1, 2, and

W = ( W1 ; W2 ) = ( A1 ; A2 ) Z,  so that  W ∼ N2n( 0, σ² ( A1A1ᵀ  A1A2ᵀ ; A2A1ᵀ  A2A2ᵀ ) ).
Hence we reject H0 if F > Fp−p0 ,n−p (α). RSS0 − RSS is the reduction in the sum of
squares due to fitting β1 in addition to β0 .
The ratio (RSS0 −RSS)/RSS0 is sometimes known as the proportion of variance explained
by β1 , and denoted R2 .
E. 12-84
<Simple linear regression> We assume that Yi = a0 + b(xi − x̄) + εi, where x̄ =
Σ xi/n and the εi are iid N(0, σ²). Suppose we want to test the hypothesis H0 : b = 0,
ie. no linear relationship. We have previously seen how to construct a confidence
interval, and so we could simply see if it included 0. Alternatively, under H0, the
model is Yi ∼ N(a0, σ²), and so â0 = Ȳ, and the fitted values are Ŷi = Ȳ. The
observed RSS0 is therefore

RSS0 = Σi (yi − ȳ)² = Syy.

Checking whether |t| > tn−2(α/2) is precisely the same as checking whether t² =
F > F1,n−2(α), since an F1,n−2 variable is the square of a tn−2 variable. Hence the
same conclusion is reached, regardless of whether we use the t-distribution or the
F statistic derived from an analysis of variance table.
E. 12-85
<One way analysis of variance with equal numbers in each group>
Recall that in our wafer example, we made measurements in groups, and want to
know if there is a difference between groups. In general, suppose J measurements
are taken in each of I groups, and that Yij = µi + εij where εij are independent
N (0, σ 2 ) random variables, and the µi are unknown constants. Fitting this model
gives

RSS = Σi=1..I Σj=1..J (Yij − µ̂i)² = Σi=1..I Σj=1..J (Yij − Ȳi.)².

The "Total" row of the analysis of variance table has n − 1 degrees of freedom and
sum of squares Σi Σj (yij − ȳ..)².
12.3.5 Examples
Suppose we have two independent samples X1, · · · , Xm iid N(µX, σ²), and Y1, · · · , Yn
iid N(µY, σ²), with σ² unknown. We wish to test H0 : µX = µY = µ against H1 : µX ≠
µY. Using the generalised likelihood ratio test, Lx,y(H0) = supµ,σ² fX(x | µ, σ²)fY(y |
µ, σ²). Under H0 the mle's are

µ̂ = (mx̄ + nȳ)/(m + n),
σ̂0² = (1/(m + n)) [ Σ(xi − µ̂)² + Σ(yi − µ̂)² ] = (1/(m + n)) [ Sxx + Syy + (mn/(m + n))(x̄ − ȳ)² ].
So

Lx,y(H0) = (2πσ̂0²)^(−(m+n)/2) exp( −(1/2σ̂0²) [ Σ(xi − µ̂)² + Σ(yi − µ̂)² ] )
         = (2πσ̂0²)^(−(m+n)/2) e^(−(m+n)/2).

Similarly

Lx,y(H1) = supµX,µY,σ² fX(x | µX, σ²)fY(y | µY, σ²) = (2πσ̂1²)^(−(m+n)/2) e^(−(m+n)/2),

achieved by µ̂X = x̄, µ̂Y = ȳ and σ̂1² = (Sxx + Syy)/(m + n). Hence

Λx,y(H0; H1) = ( σ̂0²/σ̂1² )^((m+n)/2) = ( 1 + mn(x̄ − ȳ)² / ((m + n)(Sxx + Syy)) )^((m+n)/2).
This is large exactly when the t statistic

|t| = |x̄ − ȳ| / √( (Sxx + Syy)/(m + n − 2) · (1/m + 1/n) )

is large. Under H0,

(X̄ − Ȳ) / ( σ √(1/m + 1/n) ) ∼ N(0, 1).
A size α test is to reject H0 if |t| > tn+m−2(α/2). The analysis of variance table
has "Total" row with m + n − 1 degrees of freedom. Seeing if F > F1,m+n−2(α) is
exactly the same as checking if |t| > tn+m−2(α/2).
Suppose now we are not observing iid samples; instead we have X1, · · · , Xn
all different but independent, and they correspond to Y1, · · · , Yn respectively. More
precisely we have Xi ∼ N(µX + γi, σ²) and Yi ∼ N(µY + γi, σ²) for i = 1, · · · , n,
all independent, where the parameters γi are such that Σi γi = 0. So observations
are made in pairs Xi, Yi, each i corresponding to one pair, and each pair is slightly
different.
Working through the generalised likelihood ratio test, or expressing in matrix form,
leads to the intuitive conclusion that we should work with the differences Di = Xi − Yi
(i = 1, · · · , n), so that Di ∼ N(µX − µY, φ²) where φ² = 2σ². Thus D̄ ∼ N(µX −
µY, φ²/n) and we test H0 : µX − µY = 0 by the t statistic

t = D̄ / (φ̃/√n)  where  φ̃² = SDD/(n − 1) = Σi (Di − D̄)² / (n − 1),

and t ∼ tn−1 under H0.
E. 12-86
Seeds of a particular variety of plant were randomly assigned either to a nutrition-
ally rich environment (the treatment) or to the standard conditions (the control).
After a predetermined period, all plants were harvested, dried and weighed, with
weights as shown below in grams.
Control 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
Treatment 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69
Is there a difference between the mean weights due to the environmental condi-
tions?
Control observations are realisations of X1, · · · , X10 iid N(µX, σ²), and for the
treatment we have Y1, · · · , Y10 iid N(µY, σ²). We test H0 : µX = µY vs H1 : µX ≠
µY. Here m = n = 10, x̄ = 5.032, Sxx = 3.060, ȳ = 4.661 and Syy = 5.669, so
σ̃² = (Sxx + Syy)/(m + n − 2) = 0.485. Then

|t| = |x̄ − ȳ| / √( σ̃² (1/m + 1/n) ) = 1.19.
From tables t18 (0.025) = 2.101, so we do not reject H0 . We conclude that there is
no evidence for a difference between the mean weights due to the environmental
conditions.
E. 12-87
Suppose we have 10 different species of plants. We sample a pair of seeds from
each specie of plants, and then one seed from each pair assigned to a nutritionally
rich environment (the treatment) and the other to the standard conditions (the
control).
Pair 1 2 3 4 5 6 7 8 9 10
Control 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14
Treatment 4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69
Difference -0.64 1.41 0.77 2.52 -1.37 0.78 -0.86 -0.36 1.01 0.45
Does the treatment have any effect? Here we work with the differences:

t = d̄ / (φ̃/√n) = 0.37 / (1.18/√10) = 0.99.

This can be compared to t9(0.025) = 2.262 to show that we cannot reject H0 :
E[D] = 0, ie. that there is no effect of the treatment. Alternatively, we see that
the observed p-value is the probability of getting such an extreme result under H0,
ie.

P(|t9| > |t| | H0) = 2P(t9 > |t|) = 2 × 0.17 = 0.34.
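The paired statistic can be recomputed from the raw data in the table:

```python
import math

# Paired t statistic for the matched seed-pair data.
control   = [4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14]
treatment = [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69]
d = [c - t for c, t in zip(control, treatment)]

n = len(d)
dbar = sum(d) / n                                 # mean difference, ~0.37
SDD = sum((di - dbar) ** 2 for di in d)
phi_tilde = math.sqrt(SDD / (n - 1))              # ~1.18
t_stat = dbar / (phi_tilde / math.sqrt(n))        # compare with t_{n-1}(alpha/2)
```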
Let p be the chance of it occurring at each opportunity. Assume these are in-
dependent Bernoulli trials, so essentially we have X ∼ Binom(n, p); we have ob-
served X = 0, and want a one-sided 95% confidence interval for p. Base this on
the set of values that cannot be rejected at the 5% level in a one-sided test,
ie. the 95% interval is (0, p0) where the one-sided p-value for p0 is 0.05, so
0.05 = P(X = 0 | p0) = (1 − p0)ⁿ. Hence, since log(0.05) = −2.9957, we have

p0 = 1 − e^(log(0.05)/n) ≈ −log(0.05)/n ≈ 3/n.

For example, suppose we have given a drug to 100 people and none of them
have had a serious adverse reaction. Then we can be 95% confident that the
chance the next person has a serious reaction is less than 3%. The exact p0 is
1 − e^(log(0.05)/100) = 0.0295.
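This "rule of three" is trivial to check numerically:

```python
# Exact vs approximate one-sided 95% upper bound for p after observing
# 0 events in n independent Bernoulli trials.
def rule_of_three(n):
    exact = 1 - 0.05 ** (1 / n)   # solves (1 - p0)^n = 0.05
    approx = 3 / n                # since -log(0.05) = 2.9957... ~ 3
    return exact, approx

exact, approx = rule_of_three(100)   # exact ~ 0.0295, approx = 0.03
```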
12.4. RULES OF THUMB 563
E. 12-89
After n observations, if the number
√ of events differs from that expected under a
null hypothesis H0 by more than n, reject H0 .
Assume for simplicity that the confidence intervals are based on assuming Ȳ1 ∼
N(µ1, s1²), Ȳ2 ∼ N(µ2, s2²), where s1 and s2 are known standard errors. Suppose
wlog that ȳ1 > ȳ2. Then since Ȳ1 − Ȳ2 ∼ N(µ1 − µ2, s1² + s2²), we can reject
H0 at α = 0.05 if ȳ1 − ȳ2 > 1.96 √(s1² + s2²). The two CIs will not overlap if
ȳ1 − 1.96s1 > ȳ2 + 1.96s2, ie. ȳ1 − ȳ2 > 1.96(s1 + s2). But since s1 + s2 > √(s1² + s2²)
for positive s1, s2, we have the first part of this 'rule of thumb'. Non-overlapping
CIs is a more stringent criterion: we cannot conclude 'not significantly different'
just because CIs overlap.
So if 95% CIs just touch, what is the p-value? Suppose s1 = s2 = s. Then the CIs
just touch if |ȳ1 − ȳ2| = 1.96 × 2s = 3.92s. So the p-value is

P(|Ȳ1 − Ȳ2| > 3.92s) = P( |Ȳ1 − Ȳ2|/(√2 s) > 3.92/√2 ) = P(|Z| > 2.77)
                     = 2P(Z > 2.77) = 0.0055,
where Z ∼ N(0, 1). And if 'just not touching' 100(1 − α)% CIs were to be
equivalent to 'just rejecting H0', then we would need to set α so that the critical
difference between ȳ1 and ȳ2 was exactly the width of each of the CIs, and so
1.96 √2 s = 2s Φ⁻¹(1 − α/2), where √2 s is the standard deviation of Ȳ1 − Ȳ2 with
s1 = s2 = s. This means α = 2Φ(−1.96/√2) = 0.16. So in these specific circum-
stances, we would need to use 84% intervals in order to make non-overlapping CIs
the same as rejecting H0 at the 5% level.
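Both normal-tail computations above can be reproduced with `math.erfc`, using P(|Z| > z) = erfc(z/√2) and Φ(−x) = erfc(x/√2)/2:

```python
import math

def normal_two_sided_p(z):
    """P(|Z| > z) for standard normal Z."""
    return math.erfc(z / math.sqrt(2))

# Touching 95% CIs with equal standard errors s: |ybar1 - ybar2| = 3.92 s,
# so the z statistic is 3.92 / sqrt(2) ~ 2.77.
p = normal_two_sided_p(3.92 / math.sqrt(2))        # ~0.0056

# Significance level at which 'just touching' CIs matches 'just rejecting H0':
# alpha = 2 * Phi(-1.96/sqrt(2)) = erfc((1.96/sqrt(2)) / sqrt(2)).
alpha = math.erfc((1.96 / math.sqrt(2)) / math.sqrt(2))   # ~0.17, i.e. ~84% CIs
```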
CHAPTER 13
Numerical Analysis
Numerical analysis is the study of algorithms. There are many problems we would
like algorithms to solve. In general, there are two things we are concerned with —
accuracy and speed. We want our programs to be able to solve problem quickly and
accurately. Sometimes we are also concerned about stability, the sensitivity of the
solution of a given problem to small changes in the data or the given parameters of
the problem.
E. 13-2
It is easy to show that dim(Pn [x]) = n + 1. Writing down a polynomial of degree n
involves only n+1 numbers. They are easy to evaluate, integrate and differentiate.
So it would be nice if we could approximate things with polynomials.
There are many situations where the interpolation problem may come up. For
example, we may be given n + 1 actual data points, and we want to fit a polynomial
through the points. Alternatively, we might have a complicated function f, and
want to approximate it with a polynomial p such that p and f agree at least at
n + 1 points.
The naive way of looking at this is that we try a polynomial p(x) = a_n x^n + · · · + a_0 and require

    f_i = p(x_i) = a_n x_i^n + a_{n−1} x_i^{n−1} + · · · + a_0 for each i,

which is a system of n + 1 linear equations in the coefficients a_0, · · · , a_n. In general, such
a system is not guaranteed to have a solution, and if the solution exists, it is not
guaranteed to be unique. That was not helpful. So our first goal is to show that
in the case of polynomial interpolation, the solution exists and is unique.
Note that the Lagrange cardinal polynomials ℓ_k(x) = ∏_{i≠k} (x − x_i)/(x_k − x_i) have degree
exactly n. The significance of these polynomials is that we have ℓ_k(x_i) = 0 for i ≠ k, and
ℓ_k(x_k) = 1. In other words, we have ℓ_k(x_j) = δ_{jk}. This is obvious from the definition.
With these cardinal polynomials, we can immediately write down a solution to the
interpolation problem.
T. 13-3
The interpolation problem has exactly one solution.
We define p ∈ P_n[x] by p(x) = ∑_{k=0}^n f_k ℓ_k(x). Evaluating at x_j gives
p(x_j) = ∑_{k=0}^n f_k ℓ_k(x_j) = ∑_{k=0}^n f_k δ_{jk} = f_j. So we get existence.
For uniqueness, suppose p, q ∈ P_n[x] both interpolate the data. Then p − q ∈ P_n[x]
vanishes at the n + 1 distinct points x_i; since a non-zero polynomial of degree at
most n has at most n roots, p = q.

In terms of the nodal polynomial ω(x) = ∏_{i=0}^n (x − x_i), we can also write

    p(x) = ∑_{k=0}^n f_k ℓ_k(x) = ∑_{k=0}^n (f_k/ω′(x_k)) · ω(x)/(x − x_k).
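To make the existence argument concrete, the Lagrange form can be evaluated directly. A minimal Python sketch (the function name is my own, not from the text):

```python
def lagrange_eval(xs, fs, x):
    """Evaluate the interpolating polynomial in Lagrange form at x."""
    total = 0.0
    for k, (xk, fk) in enumerate(zip(xs, fs)):
        # Cardinal polynomial l_k: equal to 1 at x_k and 0 at every other node.
        lk = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                lk *= (x - xi) / (xk - xi)
        total += fk * lk
    return total
```

Interpolating the data of f(x) = x² at the nodes 0, 1, 2 reproduces x² everywhere, since the interpolant of degree ≤ 2 is unique.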
While this works, the Lagrange forms are not ideal for numerical evaluation, both
because of speed of calculation and because of the accumulation of rounding error.
Moreover, if we one day decide we should add one more interpolation point, we
would have to recompute all the cardinal polynomials, and that is not fun. Ideally,
we would like some way to reuse our previous computations when we have new
interpolation points; this leads us to the Newton form.
D. 13-4
• Write f [xj , · · · , xk ] as the leading coefficient of the unique q ∈ Pk−j [x] such that
q(xi ) = fi for i = j, · · · , k. This is called the Newton divided difference of degree
(or order) k.
• Write C^n[a, b] for the set of all n-times differentiable functions [a, b] → R with
continuous nth derivative.
• Suppose that our data values are derived from a particular function, i.e. fi = f (xi )
for i = 0, · · · , n for some smooth f . Write en (x) = f (x) − pn (x), i.e. the error of
the approximation.
C. 13-5
<The Newton formula> The idea of Newton’s formula is as follows — for
k = 0, · · · , n, we write pk ∈ Pk [x] for the polynomial that satisfies
pk (xi ) = fi for i = 0, · · · , k.
This is the unique degree-k polynomial that satisfies the first k + 1 conditions,
whose existence (and uniqueness) is guaranteed by the previous section. Then we
can write
p(x) = pn (x) = p0 (x) + (p1 (x) − p0 (x)) + · · · + (pn (x) − pn−1 (x)).
Hence we are done if we have an efficient way of finding the differences pk − pk−1 .
We know that p_k and p_{k−1} agree on x_0, · · · , x_{k−1}. So p_k − p_{k−1} vanishes at
those points, and we must have

    p_k(x) − p_{k−1}(x) = A_k ∏_{i=0}^{k−1} (x − x_i)

for some constant A_k to be determined. Summing these differences, the Newton
formula reads

    p(x) = A_0 + ∑_{k=1}^n A_k ∏_{i=0}^{k−1} (x − x_i).
This formula has the advantage that it is built up gradually from the interpolation
points one-by-one. If we truncate the sum at the kth term, we have obtained the
polynomial that interpolates the data at the first k + 1 points x_0, · · · , x_k. Conversely,
if we have a new data point, we just need to add a new term, instead of re-computing
everything.
All that remains is to find the coefficients A_k. For k = 0, we know A_0 is the unique
constant polynomial that interpolates the point at x_0, ie. A_0 = f_0. For the others,
we note that in the formula for p_k − p_{k−1}, A_k is the coefficient of x^k. But
p_{k−1}(x) has no degree-k term. So A_k must be the leading coefficient of
p_k, that is A_k = f[x_0, · · · , x_k].
So we have reduced our problem to finding the leading coefficients of p_k. The
algorithm to obtain these is known as Newton divided differences. While we do not
have an explicit formula for what these coefficients f[x_0, · · · , x_k] are, it turns out
that if we consider the larger set of coefficients f[x_j, · · · , x_k], we can come up with
a recurrence relation for them.
T. 13-6
<Recurrence relation for Newton divided differences> For 0 ≤ j < k ≤ n,
we have

    f[x_j, · · · , x_k] = (f[x_{j+1}, · · · , x_k] − f[x_j, · · · , x_{k−1}])/(x_k − x_j).
Let q_0, q_1 ∈ P_{k−j−1}[x] and q_2 ∈ P_{k−j}[x] be the interpolating polynomials
satisfying

    q_0(x_i) = f_i for i = j, · · · , k − 1
    q_1(x_i) = f_i for i = j + 1, · · · , k
    q_2(x_i) = f_i for i = j, · · · , k.

We claim that

    q_2(x) = ((x − x_j)/(x_k − x_j)) q_1(x) + ((x_k − x)/(x_k − x_j)) q_0(x).

We can check directly that the expression on the right correctly interpolates the
points x_i for i = j, · · · , k. By uniqueness, the two expressions agree. Since
f[x_j, · · · , x_k], f[x_{j+1}, · · · , x_k] and f[x_j, · · · , x_{k−1}] are the leading coefficients of
q_2, q_1, q_0 respectively, comparing the coefficients of x^{k−j} on both sides gives the result.
Using this result, the Newton divided difference table can be constructed column by
column: from the first n columns, we can find the (n + 1)th column using the recurrence
relation above. The values of A_k can then be found on the top diagonal, and this
is all we really need. However, to compute this diagonal, we will need to compute
everything in the table. The whole table can be evaluated in O(n²) operations.
In practice, we often need not find the actual interpolating polynomial. If we just
want to evaluate p(x̂) at some new point x̂ using the divided difference table, we can
simply use

    p(x̂) = A_0 + ∑_{k=1}^n A_k ∏_{i=0}^{k−1} (x̂ − x_i).

This nested product is evaluated efficiently by Horner's scheme.
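The table computation and the nested Horner-style evaluation can be sketched as follows (function names are my own; the coefficient array is overwritten in place, so the whole table costs O(n²) operations):

```python
def newton_coefficients(xs, fs):
    """Top diagonal of the divided-difference table: a[k] = f[x_0, ..., x_k]."""
    a = list(fs)
    n = len(xs)
    for j in range(1, n):
        # After pass j, a[i] holds f[x_{i-j}, ..., x_i] for every i >= j.
        for i in range(n - 1, j - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xs[i] - xs[i - j])
    return a

def newton_eval(xs, a, x):
    """Evaluate p(x) = A_0 + sum_k A_k prod_{i<k}(x - x_i) by nested multiplication."""
    result = a[-1]
    for k in range(len(a) - 2, -1, -1):
        result = result * (x - xs[k]) + a[k]
    return result
```

Adding one more data point only appends one entry to the coefficient list, which is exactly the advantage of the Newton form discussed above.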
L. 13-7
If g ∈ C^m[a, b] is zero at m + ℓ distinct points, then g^{(m)} has at least ℓ distinct
zeros in [a, b].
T. 13-8
If f ∈ C^n[a, b], then there exists some ξ ∈ (a, b) such that

    f[x_0, · · · , x_n] = (1/n!) f^{(n)}(ξ).

Consider e = f − p_n ∈ C^n[a, b]. This has at least n + 1 distinct zeros in [a, b].
So by the lemma, e^{(n)} = f^{(n)} − p_n^{(n)} must vanish at some ξ ∈ (a, b). But
p_n^{(n)} = n! f[x_0, · · · , x_n] identically. So the result follows.
A method of estimating a derivative, say f^{(n)}(ξ) where ξ is given, is to let the
distinct points {x_i}_{i=0}^n be suitably close to ξ, and to make the approximation
f^{(n)}(ξ) ≈ n! f[x_0, x_1, ..., x_n]. However, a drawback is that, although one achieves
good accuracy in theory by picking such close interpolation points, if f is smooth
and the precision of the arithmetic is finite, significant loss of accuracy may
occur due to cancellation of the leading digits of the function values.
T. 13-9
Assume {x_i}_{i=0}^n ⊆ [a, b] and f ∈ C[a, b]. Let x̄ ∈ [a, b] be a non-interpolation
point. Then

    e_n(x̄) ≡ f(x̄) − p_n(x̄) = f[x_0, x_1, · · · , x_n, x̄] ω(x̄), where ω(x) = ∏_{i=0}^n (x − x_i).

Let p_{n+1} ∈ P_{n+1}[x] be the polynomial interpolating f at the n + 2 points
{x_0, · · · , x_n, x̄}. Then the Newton formula gives

    p_{n+1}(x) = p_n(x) + f[x_0, x_1, · · · , x_n, x̄] ω(x)

for all x ∈ R. In particular, putting x = x̄, we have p_{n+1}(x̄) = f(x̄), and we get
the result.
Note that we forbid the case where x̄ is an interpolation point, since it is not clear
what the expression f [x0 , x1 , · · · , xn , x̄] means. However, if x̄ is an interpolation
point, then both e_n(x̄) and ω(x̄) are zero, so there isn't much to say. This result
says that the error e = f − p_n of the approximation is "like the next term in the
Newton formula".
T. 13-10
Given f ∈ C n+1 [a, b] and distinct interpolation points {xi }n i=0 ⊆ [a, b], let pn ∈
Pn [x] be the unique solution of the polynomial interpolation problem for data
values {f (xi )}n
i=0 . Then for each x ∈ [a, b], we can find ξx ∈ (a, b) such that
1
en (x) ≡ f (x) − pn (x) = f (n+1) (ξx )ω(x) (∗)
(n + 1)!
If x is an interpolation point, both sides of (∗) vanish and there is nothing to prove,
so assume it is not. Define

    φ(t) = (f(t) − p_n(t)) ω(x) − (f(x) − p_n(x)) ω(t).

Next, note that φ(x_j) = 0 for j = 0, 1, · · · , n, and φ(x) = 0. Hence φ has at least
n + 2 distinct zeros in [a, b]. Moreover, φ ∈ C^{n+1}[a, b]. We deduce that φ′ has at
least n + 1 distinct zeros in (a, b), that φ″ vanishes at at least n points in (a, b), etc.
We conclude that φ^{(s)} vanishes at at least n + 2 − s distinct points of (a, b) for
s = 0, 1, · · · , n + 1. Letting s = n + 1, we have φ^{(n+1)}(ξ_x) = 0 for some ξ_x ∈ (a, b),
and hence

    0 = φ^{(n+1)}(ξ_x) = f^{(n+1)}(ξ_x) ω(x) − (f(x) − p_n(x)) (n + 1)!,

since p_n^{(n+1)} ≡ 0 and ω^{(n+1)} ≡ (n + 1)!. Rearranging gives (∗).
L. 13-12
<3-term recurrence relation> The Chebyshev polynomials satisfy the recur-
rence relation T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x) with initial conditions T_0(x) = 1,
T_1(x) = x.
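The recurrence gives a cheap way to evaluate T_n without any trigonometry; a small sketch:

```python
import math

def chebyshev_T(n, x):
    """T_n(x) via the three-term recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    if n == 0:
        return 1.0
    t_prev, t_curr = 1.0, x  # T_0 and T_1
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr
```

The result can be checked against the defining relation T_n(cos θ) = cos(nθ).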
T. 13-14
For f ∈ C^{n+1}[−1, 1], the Chebyshev choice of interpolation points (the zeros of
T_{n+1}) gives

    ‖f − p_n‖_∞ ≤ (1/2^n) (1/(n + 1)!) ‖f^{(n+1)}‖_∞.

For a general interval [a, b], this becomes

    ‖f − p_n‖_∞ ≤ ((b − a)^{n+1}/2^{2n+1}) (1/(n + 1)!) ‖f^{(n+1)}‖_∞,

where the transformed zeros of T_{n+1} are used as interpolation points.
E. 13-15
Suppose f has as many continuous derivatives as we want. Then as we increase
n, what happens to the error bounds? The coefficients involve dividing by an
exponential and a factorial. Hence as long as the higher derivatives of f don’t
blow up too badly, in general, the error will tend to zero as n → ∞, which makes
sense.
2. We can allow [a, b] to be infinite, eg. [0, ∞) or even (−∞, ∞), but we have to
be more careful. We first define

    ⟨f, g⟩ = ∫_a^b w(x) f(x) g(x) dx

as before, but we now need more conditions. We require ∫_a^b w(x) x^n dx to
exist for all n ≥ 0, since we want to allow polynomials in our vector space. For
example, w(x) = e^{−x} on [0, ∞) works, or w(x) = e^{−x²} on (−∞, ∞). These
are scalar products for P_n[x] for all n ≥ 0, but we cannot extend this definition
to all smooth functions, since they might blow up too fast at infinity. We will
not go into the technical details, since we are only interested in polynomials,
and knowing it works for polynomials suffices.
3. We can also have a discrete inner product, defined by

    ⟨f, g⟩ = ∑_{j=1}^m w_j f(ξ_j) g(ξ_j)

with {ξ_j}_{j=1}^m distinct points and {w_j}_{j=1}^m positive weights. Now we have
to restrict ourselves a lot. This is a scalar product for V = P_{m−1}[x], but not for
higher degrees, since a scalar product should satisfy ⟨f, f⟩ > 0 for f ≠ 0. In
particular, we cannot extend this to all smooth functions.
T. 13-18
Given a vector space V of functions containing ⋃_n P_n[x] and an inner product
⟨ · , · ⟩, there exists a unique monic orthogonal polynomial p_n of each degree n ≥ 0.
In addition, {p_k}_{k=0}^n form a basis for P_n[x].

This is a big induction proof over both parts of the theorem. We induct over n. For
the base case, we pick p_0(x) = 1, which is the only monic polynomial of degree zero.
Suppose we already have {p_k}_{k=0}^n satisfying the induction hypothesis.
1. Now pick any monic q_{n+1} ∈ P_{n+1}[x], eg. x^{n+1}. We now construct p_{n+1} from
q_{n+1} by the Gram-Schmidt process. We define

    p_{n+1} = q_{n+1} − ∑_{k=0}^n (⟨q_{n+1}, p_k⟩/⟨p_k, p_k⟩) p_k.

This is again monic since q_{n+1} is, and we have ⟨p_{n+1}, p_m⟩ = 0 for all m ≤ n,
and hence ⟨p_{n+1}, p⟩ = 0 for all p ∈ P_n[x] = ⟨{p_0, · · · , p_n}⟩.
2. To obtain uniqueness, assume p_{n+1}, p̂_{n+1} ∈ P_{n+1}[x] are both monic or-
thogonal polynomials. Then r = p_{n+1} − p̂_{n+1} ∈ P_n[x]. Now r is orthogonal to
everything in P_n[x], in particular to itself, so ⟨r, r⟩ = 0 and hence r = 0, ie.
p_{n+1} = p̂_{n+1}. This completes the induction.
T. 13-19
<Three-term recurrence> Suppose the inner product satisfies ⟨xf, g⟩ = ⟨f, xg⟩.
Then the monic orthogonal polynomials satisfy

    p_{n+1}(x) = (x − α_n) p_n(x) − β_n p_{n−1}(x),

where α_n = ⟨x p_n, p_n⟩/⟨p_n, p_n⟩ and β_n = ⟨p_n, p_n⟩/⟨p_{n−1}, p_{n−1}⟩, with p_0 = 1
and p_1(x) = x − α_0.

By inspection, the p_1 given is monic and satisfies ⟨p_1, p_0⟩ = 0. Using q_{n+1} = x p_n
in the Gram-Schmidt process gives

    p_{n+1} = x p_n − ∑_{k=0}^n (⟨x p_n, p_k⟩/⟨p_k, p_k⟩) p_k = x p_n − ∑_{k=0}^n (⟨p_n, x p_k⟩/⟨p_k, p_k⟩) p_k.

We notice that ⟨p_n, x p_k⟩ vanishes whenever x p_k has degree less than n, ie. for
k ≤ n − 2. So we are left with

    p_{n+1} = x p_n − (⟨x p_n, p_n⟩/⟨p_n, p_n⟩) p_n − (⟨p_n, x p_{n−1}⟩/⟨p_{n−1}, p_{n−1}⟩) p_{n−1}
            = (x − α_n) p_n − (⟨p_n, x p_{n−1}⟩/⟨p_{n−1}, p_{n−1}⟩) p_{n−1}.

Now x p_{n−1} is a monic polynomial of degree n, so we can write x p_{n−1} = p_n + q
with q ∈ P_{n−1}[x]. Thus ⟨p_n, x p_{n−1}⟩ = ⟨p_n, p_n + q⟩ = ⟨p_n, p_n⟩. Hence the
coefficient of p_{n−1} is indeed the β_n we defined.
Note that x being self-adjoint, ie. ⟨xf, g⟩ = ⟨f, xg⟩, is not necessarily true for
arbitrary inner products, but for most sensible inner products we will meet in this
course this is true. In particular, it is clearly true for inner products of the form
⟨f, g⟩ = ∫ w(x) f(x) g(x) dx.
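For the concrete inner product ⟨f, g⟩ = ∫_{−1}^{1} f g dx (weight w = 1), the recurrence can be carried out exactly on coefficient lists. A sketch using exact rational arithmetic (the helper names are my own):

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply two polynomials given as ascending coefficient lists."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def inner(p, q):
    """<f, g> = integral of f*g over [-1, 1]; only even powers contribute."""
    prod = poly_mul(p, q)
    return sum(c * Fraction(2, k + 1) for k, c in enumerate(prod) if k % 2 == 0)

def monic_orthogonal(n):
    """First n+1 monic orthogonal polynomials via the three-term recurrence."""
    ps = [[Fraction(1)]]
    if n >= 1:
        ps.append([Fraction(0), Fraction(1)])  # p_1 = x, since alpha_0 = 0 by symmetry
    for k in range(1, n):
        pk, pk1 = ps[k], ps[k - 1]
        x_pk = [Fraction(0)] + pk  # multiply p_k by x
        alpha = inner(x_pk, pk) / inner(pk, pk)
        beta = inner(pk, pk) / inner(pk1, pk1)
        # p_{k+1} = (x - alpha) p_k - beta p_{k-1}
        new = list(x_pk)
        for i, c in enumerate(pk):
            new[i] -= alpha * c
        for i, c in enumerate(pk1):
            new[i] -= beta * c
        ps.append(new)
    return ps
```

Running this reproduces the monic Legendre-type polynomials p_2 = x² − 1/3 and p_3 = x³ − (3/5)x.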
E. 13-20
Legendre polynomials, Chebyshev polynomials, Laguerre polynomials and Hermite
polynomials are all examples of orthogonal polynomials that can be generated by
this recurrence relation. Chebyshev is based on the scalar product defined by

    ⟨f, g⟩ = ∫_{−1}^{1} (1/√(1 − x²)) f(x) g(x) dx.

Note that the weight function blows up mildly at the endpoints, but this is fine since
it is still integrable. This links up with T_n(x) = cos(nθ) for x = cos θ via the usual
trigonometric substitution. We have

    ⟨T_n, T_m⟩ = ∫_0^π (1/√(1 − cos²θ)) cos(nθ) cos(mθ) sin θ dθ
              = ∫_0^π cos(nθ) cos(mθ) dθ = 0 if m ≠ n.

The other orthogonal polynomials all come from scalar products of the form
⟨f, g⟩ = ∫_a^b w(x) f(x) g(x) dx, as described in the table below:
C. 13-21
<Least-squares polynomial approximation> If we want to approximate a
function with a polynomial, polynomial interpolation might not be the best idea,
since all we do is make sure the polynomial agrees with f at certain points, but
it might not be a good approximation elsewhere. Instead, we want to choose a
polynomial p in Pn [x] that “minimizes the error”.
What exactly do we mean by minimizing the error? The error is defined as the
function f − p. So given an appropriate inner product on the vector space of
continuous functions, we want to minimize ‖f − p‖² = ⟨f − p, f − p⟩. This is
usually of the form

    ⟨f − p, f − p⟩ = ∫_a^b w(x) [f(x) − p(x)]² dx

for some weight function w.
T. 13-22
Let f be a given function and {p_k}_{k=0}^n orthogonal polynomials with respect to
⟨ · , · ⟩. Then the p ∈ P_n[x] that minimises ‖f − p‖² is given by

    p = ∑_{k=0}^n c_k p_k, where c_k = ⟨f, p_k⟩/‖p_k‖²,

and the corresponding least-squares error is ‖f − p‖² = ‖f‖² − ∑_{k=0}^n ⟨f, p_k⟩²/‖p_k‖².
We consider a general polynomial p = ∑_{k=0}^n c_k p_k ∈ P_n[x]. We substitute this in
to obtain

    ⟨f − p, f − p⟩ = ⟨f, f⟩ − 2 ∑_{k=0}^n c_k ⟨f, p_k⟩ + ∑_{k=0}^n c_k² ‖p_k‖².
Note that there are no cross terms between the different coefficients. We minimize
this quadratic by setting the partial derivatives to zero:

    0 = ∂/∂c_k ⟨f − p, f − p⟩ = −2⟨f, p_k⟩ + 2 c_k ‖p_k‖².

To check this is indeed a minimum, note that the Hessian matrix is simply
2 diag(‖p_0‖², · · · , ‖p_n‖²), which is positive definite. So this is really a minimum.
So we get the formula for the c_k's as claimed, and substituting the formula for c_k
back in gives the error formula.
Note the following: for each k,

    ⟨f − p, p_k⟩ = ⟨f, p_k⟩ − ⟨p, p_k⟩ = ⟨f, p_k⟩ − (⟨f, p_k⟩/‖p_k‖²)⟨p_k, p_k⟩ = 0,

so the residual f − p is orthogonal to P_n[x], and hence by Pythagoras

    ‖f − p‖² + ‖p‖² = ‖f‖².
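A numerical sketch of the least-squares projection c_k = ⟨f, p_k⟩/‖p_k‖², assuming w = 1 on [−1, 1] and the monic orthogonal basis 1, x, x² − 1/3 (the helper names and the midpoint-rule integration are my own choices, not from the text):

```python
def inner_num(f, g, a=-1.0, b=1.0, m=2000):
    """<f, g> = integral of f*g on [a, b] with w = 1, via the composite midpoint rule."""
    h = (b - a) / m
    return h * sum(f(a + (j + 0.5) * h) * g(a + (j + 0.5) * h) for j in range(m))

def best_l2_approx(f, basis):
    """p = sum_k c_k p_k with c_k = <f, p_k> / ||p_k||^2 minimises ||f - p||."""
    cs = [inner_num(f, p) / inner_num(p, p) for p in basis]
    return lambda x: sum(c * p(x) for c, p in zip(cs, basis))

# Monic orthogonal basis for w = 1 on [-1, 1] (the assumption stated above).
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x - 1.0 / 3.0]
```

For f(x) = x² the projection recovers f itself, and for f(x) = |x| it gives the classical best quadratic approximation (15x² + 3)/16, which agrees with f at no interpolation point at all — the least-squares criterion is genuinely different from interpolation.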
D. 13-23
• Given a vector space V of functions, a linear functional is an element of the dual
space of V .
? Here we’ll assume V is a real vector space, so a linear functional is a linear
mapping L : V → R.
E. 13-24
We usually don’t put so much emphasis on the actual vector space V . Instead, we
provide a formula for L, and take V to be the vector space of functions for which
the formula makes sense.
1. We can choose some fixed ξ ∈ R, and define a linear functional by L(f ) = f (ξ).
2. Alternatively, for fixed η ∈ R we can define our functional by L(f ) = f 0 (η).
In this case, we need to pick a vector space in which this makes sense, eg. the
space of continuously differentiable functions.
Rb
3. We can define L(f ) = a f (x) dx. The set of continuous (or even just in-
tegrable) functions defined on [a, b] will be a sensible domain for this linear
functional.
4. Any linear combination of these linear functionals is also a linear functional.
For example, we can pick some fixed α, β ∈ R, and define

    L(f) = f(β) − f(α) − ((β − α)/2)[f′(β) + f′(α)].
How can we choose the coefficients a_i and the points x_i so that an approximation of
the form L(f) ≈ ∑_{i=0}^N a_i f(x_i) is "good"? Note that most of our functionals can be
easily evaluated exactly when f is a polynomial. So we might approximate our function
f by a polynomial, and then do it exactly for polynomials. More precisely, we let
{x_i}_{i=0}^N ⊆ [a, b] be arbitrary points. Then using the Lagrange cardinal polynomials
ℓ_i, we have f(x) ≈ ∑_{i=0}^N f(x_i) ℓ_i(x). Then using linearity, we can approximate

    L(f) ≈ L(∑_{i=0}^N f(x_i) ℓ_i) = ∑_{i=0}^N L(ℓ_i) f(x_i).

So we can pick a_i = L(ℓ_i). Similar to polynomial interpolation, this formula is exact for
f ∈ P_N[x]. But we could do better. If we can freely choose {a_i}_{i=0}^N and {x_i}_{i=0}^N,
then since we now have 2N + 2 free parameters, we might expect to find an approximation
that is exact for f ∈ P_{2N+1}[x]. This is not always possible, but there are cases when
we can. The most famous example is Gaussian quadrature.
T. 13-25
In the above scenario, if in addition
1. N is even;
2. {x_i}_{i=0}^N are symmetrically placed in [a, b], ie. x_{N/2} = (a + b)/2 and
x_k + x_{N−k} = a + b for k = 0, · · · , N/2 − 1;
3. w is even with respect to [a, b], ie. w((a + b)/2 + t) is an even function of t for
t ∈ [−(b − a)/2, (b − a)/2],
then the approximation ∫_a^b f(x) w(x) dx ≈ ∑_{i=0}^N a_i f(x_i) is exact when f ∈ P_{N+1}[x].
Since {x_i}_{i=0}^N are symmetrically placed in [a, b], the nodal polynomial ω(x) =
∏_{i=0}^N (x − x_i) is an odd function with respect to [a, b] (it has the odd number N + 1
of factors); thus ω′ is an even function with respect to [a, b]. Writing our Lagrange
cardinal polynomials in terms of ω, we have

    a_i = ∫_a^b w(x) ω(x)/(ω′(x_i)(x − x_i)) dx,   i = 0, · · · , N.

Hence for k = 0, · · · , N/2 − 1,

    a_k − a_{N−k} = (1/d_k) ∫_a^b w(x) ω(x) (x_k − x_{N−k})/((x − x_k)(x − x_{N−k})) dx,

where d_k = ω′(x_k) = ω′(x_{N−k}); this must be zero because the integrand is odd with
respect to [a, b], so a_k = a_{N−k}. The main part of the theorem is now simple and
relies on the following decomposition: given any f ∈ P_{N+1}[x], there are a unique
c ∈ R and a unique q ∈ P_N[x] such that f(x) = c(x − (a + b)/2)^{N+1} + q(x), where
c is the leading coefficient of f. Now

    ∫_a^b w(x) f(x) dx = ∫_a^b w(x) q(x) dx = ∑_{i=0}^N a_i q(x_i) = ∑_{i=0}^N a_i f(x_i),

where the first equality holds because w(x)(x − (a + b)/2)^{N+1} is odd with respect to
[a, b], the second because the rule with a_i = L(ℓ_i) is exact on P_N[x], and the last
because the symmetric weights a_i = a_{N−i} cancel against the antisymmetric values
c(x_i − (a + b)/2)^{N+1}.
E. 13-26
In practice, the restriction that the weight function w must be even usually occurs
because it is constant.
a+b
• Mid-point rule: This has w(x) = 1, N = 0 and x0 = 2
. The approximation
Rb
is a f (x)dx ≈ (b − a)f ( a+b
2
), this is exact for P1 [x].
• Simpson rule: This has w(x) = 1, N = 2 and x0 = a, x1 = a+b 2
, x2 = b.
Rb
The approximation is a f (x)dx ≈ b−a
6
(f (a) + 4f ( a+b
2
) + f (b)), this is exact for
P3 [x].
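Both rules are one-liners; a sketch, with the exactness claims as checks:

```python
def midpoint_rule(f, a, b):
    """Mid-point rule: exact for polynomials of degree <= 1."""
    return (b - a) * f((a + b) / 2.0)

def simpson_rule(f, a, b):
    """Simpson's rule: exact for polynomials of degree <= 3."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))
```

Note the "free" extra degree: Simpson's rule uses three points, so exactness on P_2[x] is automatic, but the symmetric placement buys exactness on P_3[x] as the theorem predicts.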
Gaussian quadrature
The objective of Gaussian quadrature is to approximate integrals of the form

    L(f) = ∫_a^b w(x) f(x) dx,

where w(x) is a positive weight function that determines a scalar product. In particular,
⟨f, g⟩ = ∫_a^b w(x) f(x) g(x) dx is a scalar product for P_ν[x]. We will show that we can
find weights {b_k}_{k=1}^ν and nodes {c_k}_{k=1}^ν ⊆ [a, b] such that the approximation

    ∫_a^b w(x) f(x) dx ≈ ∑_{k=1}^ν b_k f(c_k)

is exact for f ∈ P_{2ν−1}[x]. Such a quadrature with ν nodes that is exact on P_{2ν−1}[x]
is called Gaussian quadrature. We will show that the nodes {c_k}_{k=1}^ν are in fact the
zeros of the orthogonal polynomial p_ν with respect to the scalar product. We start by
showing that this is the best we can achieve.
P. 13-27
There is no choice of ν weights and nodes such that the approximation of
∫_a^b w(x) f(x) dx is exact for all f ∈ P_{2ν}[x].

Define q(x) = ∏_{k=1}^ν (x − c_k) ∈ P_ν[x]. Then we know ∫_a^b w(x) q²(x) dx > 0.
However, ∑_{k=1}^ν b_k q²(c_k) = 0. So the approximation cannot be exact for
q² ∈ P_{2ν}[x].
T. 13-28
<Ordinary quadrature> For any distinct {c_k}_{k=1}^ν ⊆ [a, b], let {ℓ_k}_{k=1}^ν be the
Lagrange cardinal polynomials with respect to {c_k}_{k=1}^ν. Then the approximation

    L(f) = ∫_a^b w(x) f(x) dx ≈ ∑_{k=1}^ν b_k f(c_k), where b_k = ∫_a^b w(x) ℓ_k(x) dx,

is exact for f ∈ P_{ν−1}[x]. For example, evenly spaced nodes give a Newton-Cotes
rule. But those are quite inaccurate. It turns out a clever choice of {c_k} does much
better — take them to be the zeros of the orthogonal polynomials. However, to do
this, we must make sure the roots indeed lie in [a, b]. This is what we will prove
now — given any inner product, the roots of the orthogonal polynomials must lie
in [a, b].
T. 13-29
For ν ≥ 1, the zeros of the orthogonal polynomial p_ν are real, distinct and lie in
(a, b).

First we show there is at least one sign change. Notice that p_0 = 1. Thus for ν ≥ 1,
by orthogonality, we know

    ∫_a^b w(x) p_ν(x) p_0(x) dx = ∫_a^b w(x) p_ν(x) dx = 0.

So there is at least one sign change in (a, b). We have already got the result we
need for ν = 1, since we only need one zero in (a, b). Now for ν > 1, suppose
{ξ_j}_{j=1}^m are the places where the sign of p_ν changes in (a, b) (which is a subset of
the roots of p_ν). We define

    q(x) = ∏_{j=1}^m (x − ξ_j) ∈ P_m[x].
Since this changes sign at the same place as pν , we know qpν maintains the same
sign in (a, b). Now if we had m < ν, then orthogonality gives
Z b
hq, pν i = w(x)q(x)pν (x) dx = 0,
a
which is impossible, since qpν does not change sign. Hence we must have m = ν.
T. 13-30
In the ordinary quadrature, if we pick {c_k}_{k=1}^ν to be the roots of p_ν(x), then we
get exactness for f ∈ P_{2ν−1}[x]. In addition, the weights {b_k}_{k=1}^ν are all positive.
Let f ∈ P_{2ν−1}[x]. Then by polynomial division, we get f = q p_ν + r, where q, r are
polynomials of degree at most ν − 1. Since the c_k are roots of p_ν, we have
f(c_k) = r(c_k). We apply orthogonality to get

    ∫_a^b w(x) f(x) dx = ∫_a^b w(x)(q(x) p_ν(x) + r(x)) dx = ∫_a^b w(x) r(x) dx.

But r has degree at most ν − 1, and the formula is exact for polynomials in P_{ν−1}[x].
Hence we know

    ∫_a^b w(x) f(x) dx = ∫_a^b w(x) r(x) dx = ∑_{k=1}^ν b_k r(c_k) = ∑_{k=1}^ν b_k f(c_k).
To show the weights are positive, we pick a special f. Consider f = ℓ_k² ∈ P_{2ν−2}[x],
for ℓ_k the Lagrange cardinal polynomials for {c_k}_{k=1}^ν. Since the quadrature is exact
for these, we get

    0 < ∫_a^b w(x) ℓ_k²(x) dx = ∑_{j=1}^ν b_j ℓ_k²(c_j) = ∑_{j=1}^ν b_j δ_{jk} = b_k.
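For w = 1 on [−1, 1] and ν = 2, the nodes are the roots ±1/√3 of the monic orthogonal polynomial p_2(x) = x² − 1/3, and both weights turn out to equal 1. A sketch of the resulting rule, exact on P_3[x] as the theorem promises:

```python
import math

def gauss_legendre_2(f):
    """Two-point Gaussian quadrature for w = 1 on [-1, 1].

    Nodes: the zeros +-1/sqrt(3) of p_2(x) = x^2 - 1/3.
    Weights: b_1 = b_2 = 1 (positive, as the theorem guarantees).
    """
    c = 1.0 / math.sqrt(3.0)
    return f(-c) + f(c)
```

Two nodes, yet exact on all cubics — twice the degree an ordinary two-point rule could reach.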
In this subsection our vector space is C^k[a, b], for a finite interval and some k ≥ 1, and
our linear functional is L(f) = f^{(k)}(ξ) for some fixed ξ ∈ [a, b]. For N ≥ k, we seek
distinct {x_i}_{i=0}^N ⊆ [a, b] and {a_i}_{i=0}^N ⊆ R so that

    f^{(k)}(ξ) ≈ ∑_{i=0}^N a_i f(x_i)

is a "good" approximation. Just as before, for any choice of distinct {x_i}_{i=0}^N we can
take a_i = ℓ_i^{(k)}(ξ) for i = 0, · · · , N, where {ℓ_i}_{i=0}^N are the Lagrange cardinal
polynomials with respect to {x_i}_{i=0}^N. Then our approximation is exact when f ∈ P_N[x].

Note that our approximation takes a particularly simple form when N = k, because
then ∑_{i=0}^k a_i f(x_i) = p^{(k)}(ξ) = k! f[x_0, · · · , x_k], where p ∈ P_k[x] is the
interpolating polynomial for f with respect to {x_i}_{i=0}^k. The last equality holds since
f[x_0, · · · , x_k] is by definition the leading coefficient of p.
Like before, we can also slightly improve the above result in special cases. However
there is no analogue of Gaussian quadrature.
T. 13-31
In the above scenario (without assuming N = k), if in addition
1. k is even and ξ = (a + b)/2;
2. N is even and {x_i}_{i=0}^N are symmetrically placed in [a, b], ie. x_{N/2} = (a + b)/2
and x_i + x_{N−i} = a + b for i = 0, · · · , N/2 − 1,
then the coefficients satisfy a_i = a_{N−i} for i = 0, · · · , N/2 − 1 and the approxima-
tion is exact for f ∈ P_{N+1}[x].

The proof has the same pattern as [T.13-25]. Similarly, there is an analogous result
for k and N both odd.
E. 13-32
• For N = 1 (and k = 1), f′((a + b)/2) ≈ (f(b) − f(a))/(b − a) is exact for P_2[x].
• For N = 2 (and k = 2), f″((a + b)/2) ≈ (f(b) − 2f((a + b)/2) + f(a))/((b − a)²/4)
is exact for P_3[x].
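Both formulas are direct to code; a sketch (the function names are my own):

```python
def first_deriv_mid(f, a, b):
    """Estimate f'((a+b)/2) from endpoint values; exact on P_2."""
    return (f(b) - f(a)) / (b - a)

def second_deriv_mid(f, a, b):
    """Estimate f''((a+b)/2) from three symmetric values; exact on P_3."""
    m = (a + b) / 2.0
    return (f(b) - 2.0 * f(m) + f(a)) / ((b - a) ** 2 / 4.0)
```

The symmetric placement is what buys the extra degree: for f(x) = x³ the centred second-difference is exact even though it uses only three points.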
Suppose now we approximate a linear functional L by L(f) ≈ ∑_{i=0}^n a_i L_i(f),
where the L_i are some simpler linear functionals. If the error e_L(f) =
L(f) − ∑_{i=0}^n a_i L_i(f) vanishes whenever f ∈ P_k[x] (ie. the approximation is exact
for f ∈ P_k[x]), we say the error annihilates for polynomials of degree less than k.
If the error annihilates for P_k[x], we will show that we can find a bound for the error
of L on (k + 1)-times continuously differentiable functions (ie. f ∈ C^{k+1}[a, b]) of the
form

    |e_L(f)| ≤ c_L ‖f^{(k+1)}‖_∞ for some constant c_L.

Moreover, we want to make c_L as small as possible. Ideally we want it to be sharp,
that is

    ∀ε > 0, ∃f_ε ∈ C^{k+1}[a, b] such that |e_L(f_ε)| ≥ (c_L − ε)‖f_ε^{(k+1)}‖_∞.
This doesn’t say anything about whether cL can actually be achieved. This depends
on the particular form of the question.
Note that so far, everything we’ve done works if the interval is infinite, as long as the
weight function vanishes sufficiently quickly as we go far away. However, for this little
bit, we will need to require [a, b] to be finite, since we want to make sure we can take
the supremum of our functions.
T. 13-33
<Peano kernel theorem> Let f ∈ C^{k+1}[a, b] and write (x − θ)_+^k = (x − θ)^k I[θ ≤
x]. Suppose the linear functional λ annihilates polynomials in P_k[x] and that we
can exchange the order of λ and the integral in λ(∫_a^b (x − θ)_+^k f^{(k+1)}(θ) dθ). Then

    λ(f) = (1/k!) ∫_a^b K(θ) f^{(k+1)}(θ) dθ, where K(θ) = λ((x − θ)_+^k).
Hence we can find the constant c_L for different choices of the norm. When
computing c_L, don't forget the factor of 1/k!. By fiddling with functions a bit,
we can show these bounds are indeed sharp.
• If K(θ) does not change sign on [a, b], we can say a bit more. First, we note
that the bound

    |λ(f)| ≤ (1/k!) |∫_a^b K(θ) dθ| ‖f^{(k+1)}‖_∞

can be achieved by x^{k+1}, since this has constant (k + 1)th derivative. Also, we
can use the integral mean value theorem to get the bound

    λ(f) = ((1/k!) ∫_a^b K(θ) dθ) f^{(k+1)}(ξ),
where ξ ∈ (a, b) depends on f. To see this, note that (when K ≥ 0, say) λ(f) is
bounded below and above by ((1/k!) ∫_a^b K(θ) dθ) inf_{[a,b]} f^{(k+1)} and
((1/k!) ∫_a^b K(θ) dθ) sup_{[a,b]} f^{(k+1)}, hence we can find such a ξ by the
intermediate value theorem.
• Finally, note that Peano’s kernel theorem says if eL (f ) = 0 for all f ∈ Pk [x],
then
Z b
1
eL (f ) = K(θ)f (k+1) (θ) dθ for all f ∈ C k+1 [a, b].
k! a
But for any other fixed j = 0, · · · , k−1, we also have eL (f ) = 0 for all f ∈ Pj [x].
So we also know
1 b
Z
eL (f ) = Kj (θ)f (j+1) (θ) dθ for all f ∈ C j+1 [a, b]
j! a
Note that we have a different kernel. In general, this might not be a good idea,
since we are throwing information away. Yet, this can be helpful if we get some
less smooth functions that don’t have k + 1 derivatives.
E. 13-34
• Let L(f) = f(β). We decide to be silly and approximate L(f) by

    L(f) ≈ f(α) + ((β − α)/2)(f′(β) + f′(α)) where α ≠ β.

The error is given by

    e_L(f) = f(β) − f(α) − ((β − α)/2)(f′(β) + f′(α)),

and this vanishes for f ∈ P_2[x]. We wlog assume α < β. Then computing
K(θ) = e_L((x − θ)²_+) on each region, we get

    K(θ) = 0 for a ≤ θ ≤ α,   K(θ) = (α − θ)(β − θ) for α ≤ θ ≤ β,   K(θ) = 0 for β ≤ θ ≤ b.

Hence we know

    e_L(f) = (1/2) ∫_α^β (α − θ)(β − θ) f′′′(θ) dθ for all f ∈ C³[a, b].
Note that in this particular case, our function K(θ) does not change sign on [a, b].
We have K(θ) ≤ 0 on [a, b], and ∫_a^b K(θ) dθ = −(β − α)³/6. Hence we have the
bound

    |e_L(f)| ≤ (1/12)(β − α)³ ‖f′′′‖_∞,

and this bound is achieved for x³. We also have e_L(f) = −(1/12)(β − α)³ f′′′(ξ) for
some f-dependent value of some ξ ∈ (a, b).
• Consider the approximation f′(0) ≈ −(3/2) f(0) + 2 f(1) − (1/2) f(2). The error of
this approximation is the linear functional e_L(f) = f′(0) + (3/2) f(0) − 2 f(1) + (1/2) f(2),
and (as may be verified by trying f(x) = 1, x, x²) e_L(f) = 0 for f ∈ P_2[x]. Hence the
Peano kernel theorem tells us that, for f ∈ C³[0, 2],

    e_L(f) = (1/2) ∫_0^2 K(θ) f′′′(θ) dθ

where

    K(θ) ≡ e_L((x − θ)²_+) = 2(0 − θ)_+ + (3/2)(0 − θ)²_+ − 2(1 − θ)²_+ + (1/2)(2 − θ)²_+
         = −2(1 − θ)² + (1/2)(2 − θ)² = 2θ − (3/2)θ² for 0 ≤ θ ≤ 1,
           and (1/2)(2 − θ)² for 1 ≤ θ ≤ 2.

Note that K ≥ 0, so

    ∫_0^2 K(θ) dθ = ∫_0^1 (2θ − (3/2)θ²) dθ + ∫_1^2 (1/2)(2 − θ)² dθ = 1/2 + 1/6 = 2/3.
Furthermore since eL (p) = 0 for p ∈ Pk [x], (∗) remains true if xk+1 is replaced by
any monic q ∈ Pk+1 [x], hence we can choose such a q for which the evaluation of
eL (q) is straightforward.
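The kernel computation in this example can be cross-checked numerically; a sketch (the helper returns both the rule's value and the error functional — for f = x³, with f′′′ ≡ 6, the error should equal (1/2)(2/3)·6 = 2):

```python
def deriv_rule(f, fprime):
    """The one-sided rule f'(0) ≈ -3/2 f(0) + 2 f(1) - 1/2 f(2) and its error e_L(f)."""
    approx = -1.5 * f(0.0) + 2.0 * f(1.0) - 0.5 * f(2.0)
    error = fprime(0.0) - approx
    return approx, error
```

The error vanishes on P_2[x] and equals 2 for x³, exactly as the Peano kernel integral predicts.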
It doesn’t really matter what norm we pick. It will just change the λ. The importance
is the existence of a λ. A special case is when λ = 0, ie. f does not depend on x. In this
case, this is just an integration problem, and is usually easy. This is a convenient test
case — if our numerical approximation does not even work for these easy problems,
then it’s pretty useless. An extra assumption we will often make is that f can be
expanded in a Taylor series to as many degrees as we want, since this is convenient for
our analysis.
What exactly does a numerical solution to the ODE consist of? We first choose a small
time step h > 0, let tn = nh and then construct approximations
yn ≈ y(tn ), n = 1, 2, · · · .
In particular, t_n − t_{n−1} = h is constant. In practice, the step size t_n − t_{n−1} is
often not fixed but allowed to vary in each step. However, this makes the analysis much
more complicated, and we will mostly not consider varying time steps here.
h smaller, then we will (probably) make better approximations. However, this is more
computationally demanding. So we want to study the behaviour of numerical methods
in order to figure out what h we should pick.
D. 13-35
• A numerical method is said to be
explicit if the value for the next step can be obtained by a direct computation
in terms of known quantities (eg. values of previous steps), or in other words
the value for the next step is given explicitly by known quantities.
implicit if the next step is defined by a system of equations which we need
to solve to get the next step, or in other words the value for the next step is
given implicitly by known quantities.
• A numerical method is
one-step if y_{n+1} depends only on t_n and y_n, that is y_{n+1} = φ_h(t_n, y_n) for
some function φ_h : R × R^N → R^N.¹
multi-step if y_{n+1} also depends on y_k for some k < n.
• Euler's method employs the formula y_{n+1} = y_n + h f(t_n, y_n).
• For θ ∈ [0, 1], the θ-method is

    y_{n+1} = y_n + h[θ f(t_n, y_n) + (1 − θ) f(t_{n+1}, y_{n+1})].

The θ-method with θ = 1 is Euler's method, the one with θ = 0 is called the
backward Euler method, and the one with θ = 1/2 is called the trapezoidal rule.
• For each h > 0, a method produces a sequence of discrete values y_n for n = 0, 1, · · · , [T /h],
where [T /h] is the integer part of T /h. We say a method converges if, as h → 0
and nh → t (hence n → ∞), we get y_n → y(t), where y is the true solution to the
differential equation. Moreover, we require the convergence to be uniform in t.
• For a general (multi-step) numerical method y_{n+1} = φ(t_n, y_0, y_1, · · · , y_n), the
local truncation error is

    η_{n+1} = y(t_{n+1}) − φ(t_n, y(t_0), y(t_1), · · · , y(t_n)).

The order of a numerical method is the largest p ≥ 1 such that η_{n+1} = O(h^{p+1}).
E. 13-36
There are many ways we can classify numerical methods. One important classi-
fication is one-step versus multi-step methods. In one-step methods, the value of
yn+1 depends only on the previous iteration tn and yn . In multi-step methods,
we are allowed to look back further in time and use further results.
We want to show that Euler’s method “converges” to the real solution. First of
all, we need to make precise the notion of “convergence”. The Lipschitz condition
means there is a unique solution to the differential equation. So we would want
the numerical solution to be able to approximate the actual solution to arbitrary
accuracy as long as we take a small enough h. Hence our definition of converge.
The local truncation error is the error we will make at the (n + 1)th step if we had
accurate values for the first n steps. In the definition of the order of a method, we
have the +1 in p + 1 because when we sum all the local errors to get the global
error we drop a power of h (at least this is true for one-step methods).

¹ Sometimes a one-step method would be given in the form y_{n+1} = φ_h(t_n, y_n, y_{n+1}) for some
function φ_h : R × R^N × R^N → R^N, in which case by solving the equation we can still find y_{n+1}
in terms of t_n and y_n, ie. it is still of the form stated in the definition. This is an example of an
implicit method, since y_{n+1} is defined in terms of itself and we have to solve equations to get y_{n+1}.
When θ 6= 1, the θ-method is an implicit method. In general, we can’t just write
down the value of yn+1 given the value of yn . Instead, we have to treat the formula
as N (in general) non-linear equations, and solve them to find yn+1 !
In the past, people did not like to use this, because they didn’t have computers, or
computers were too slow. It is tedious to have to solve these equations in every step
of the method. Nowadays, these are becoming more and more popular because
it is getting easier to solve equations, and θ-methods have some huge theoretical
advantages, but we will not go into it.
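For the linear test problem y′ = λy, the implicit equation of the backward Euler method (θ = 0) can be solved in closed form, so a sketch costs one line per step (the function name is my own):

```python
def backward_euler_linear(lam, y0, h, n_steps):
    """Backward Euler for y' = lam * y.

    Each step requires solving the implicit equation y_{n+1} = y_n + h*lam*y_{n+1};
    for this linear problem it has the closed-form solution y_{n+1} = y_n / (1 - h*lam).
    """
    y = y0
    for _ in range(n_steps):
        y = y / (1.0 - h * lam)
    return y
```

For a general nonlinear f, that one-line solve becomes a (possibly expensive) nonlinear system at every step, which is exactly the cost discussed above.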
We now look at the error of the θ-method. We have

    η = y(t_{n+1}) − y(t_n) − h[θ y′(t_n) + (1 − θ) y′(t_{n+1})].

Taylor expanding about t_n shows that η = O(h²) in general, and η = O(h³) when
θ = 1/2; so the trapezoidal rule has order 2, while the other θ-methods have order 1.

T. 13-37
1. Euler's method converges. 2. Moreover, if f satisfies the Lipschitz condition with
constant λ > 0 and y″ is bounded on [0, T] so that the local truncation error satisfies
‖R_n‖ ≤ ch², then

    ‖e_n‖ ≤ ch (e^{λT} − 1)/λ for all 0 ≤ n ≤ [T /h], where e_n = y_n − y(t_n).
Note that the bound in 2 is uniform, so 1 follows from 2. We only need to prove
2. There are two parts to proving this. We first look at the local truncation error.
This is the error we would get at each step assuming we got the previous steps
right. More precisely, we write
y(tn+1 ) = y(tn ) + hf (tn , y(tn )) + Rn ,
and R_n is the local truncation error. For Euler's method, it is easy to get R_n,
since f(t_n, y(t_n)) = y′(t_n) by definition. So this is just the Taylor series expansion
of y. We can write R_n as the integral remainder of the Taylor series,

    R_n = ∫_{t_n}^{t_{n+1}} (t_{n+1} − θ) y″(θ) dθ.
    ‖e_{n+1}‖_∞ ≤ ‖y_n − y(t_n)‖_∞ + h‖f(t_n, y_n) − f(t_n, y(t_n))‖_∞ + ‖R_n‖_∞
              ≤ ‖e_n‖_∞ + hλ‖e_n‖_∞ + ch² = (1 + λh)‖e_n‖_∞ + ch².
This is valid for all n ≥ 0. We also know ke0 k = 0. Doing some algebra, we get
    ‖e_n‖_∞ ≤ ch² Σ_{j=0}^{n−1} (1 + hλ)^j ≤ (ch/λ)((1 + hλ)^n − 1).
Finally, we have 1 + hλ ≤ e^{λh}, since 1 + λh consists of the first two terms of the
Taylor series of e^{λh}, and the remaining terms are positive. So (1 + hλ)^n ≤ e^{λhn} ≤ e^{λT},
and we obtain the bound

    ‖e_n‖_∞ ≤ ch (e^{λT} − 1)/λ.
Then this tends to 0 as we take h → 0. So the method converges.
This works as long as λ ≠ 0. However, λ = 0 is the easy case, since then the problem
is just integration. We can either check this case directly, or use the fact that
(1/λ)(e^{λT} − 1) → T as λ → 0.
The same proof strategy works for most numerical methods, but the algebra will
be much messier.
This result tells us that the Euler method has order 1. This is one less than the
power of the local truncation error: when we pass to the global error, we drop a
power of h, and only have e_n ∼ h, as expected.
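The order-1 behaviour is easy to confirm numerically: halving h should roughly halve the global error. A quick sketch (the test problem y′ = y is an illustrative choice, not from the notes):

```python
import math

def euler(f, t0, y0, T, n):
    """Euler's method for y' = f(t, y): n steps of size h = (T - t0)/n."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)
        t += h
    return y

# y' = y, y(0) = 1 on [0, 1]; exact solution y(1) = e.
errors = [abs(euler(lambda t, y: y, 0.0, 1.0, 1.0, n) - math.e)
          for n in (100, 200, 400)]
# For an order-1 method, consecutive errors should have ratio about 2.
ratios = [errors[i] / errors[i + 1] for i in range(2)]
```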
This formula is used to find the value of y_{n+s} given the others. For the method
we define the two polynomials ρ(w) = Σ_{ℓ=0}^{s} ρ_ℓ w^ℓ and σ(w) = Σ_{ℓ=0}^{s} σ_ℓ w^ℓ.
• We say ρ(w) satisfies the root condition if all its zeros are bounded by 1 in size,
i.e. all roots w satisfy |w| ≤ 1; moreover, any zero with |w| = 1 must be simple.
• The 2-step Adams-Bashforth (AB2) method has

    y_{n+2} = y_{n+1} + (h/2)(3f(t_{n+1}, y_{n+1}) − f(t_n, y_n)).
E. 13-39
The idea behind multi-step methods is that we might be able to make our methods
more efficient/accurate by making use of previous values of y_n instead of just the
most recent one.
Note that in the s-step numerical method we get the same method if we multiply
all the constants ρ_ℓ, σ_ℓ by a non-zero constant. By convention, we normalize this
by setting ρ_s = 1. Then we can alternatively write this as

    y_{n+s} = h Σ_{ℓ=0}^{s} σ_ℓ f(t_{n+ℓ}, y_{n+ℓ}) − Σ_{ℓ=0}^{s−1} ρ_ℓ y_{n+ℓ}.
Alternatively, this condition is equivalent to p being the largest number such that
ρ(e^x) − xσ(e^x) = O(x^{p+1}) as x → 0.
We now expand y and y′ about t_n, and obtain

    (Σ_{ℓ=0}^{s} ρ_ℓ) y(t_n) + Σ_{k=1}^{∞} (h^k/k!) (Σ_{ℓ=0}^{s} ρ_ℓ ℓ^k − k Σ_{ℓ=0}^{s} σ_ℓ ℓ^{k−1}) y^{(k)}(t_n).
This is O(h^{p+1}) under the given conditions; hence the first result. To see the
equivalence of the two conditions, we expand ρ(e^x) − xσ(e^x),

    ρ(e^x) − xσ(e^x) = Σ_{ℓ=0}^{s} ρ_ℓ e^{ℓx} − x Σ_{ℓ=0}^{s} σ_ℓ e^{ℓx}.
We now expand the e^{ℓx} in Taylor series about x = 0. This comes out as

    Σ_{ℓ=0}^{s} ρ_ℓ + Σ_{k=1}^{∞} (1/k!) (Σ_{ℓ=0}^{s} ρ_ℓ ℓ^k − k Σ_{ℓ=0}^{s} σ_ℓ ℓ^{k−1}) x^k.
Note that Σ_{ℓ=0}^{s} ρ_ℓ = 0, which is the condition required for the method to even
have an order at all, can be expressed as ρ(1) = 0.
E. 13-41
In the two-step Adams-Bashforth method, we see that the conditions hold for
p = 2 but not p = 3. So the order is 2. Alternatively,

    ρ(w) = w² − w,    σ(w) = (3/2)w − 1/2.
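The order criterion can be probed numerically: for AB2, ρ(e^x) − xσ(e^x) should behave like a constant times x³ for small x (order 2, not 3), and the constant turns out to be 5/12, the same error constant that reappears later in Milne's device. A small illustrative sketch, not from the notes:

```python
import math

def residual(x):
    """rho(e^x) - x*sigma(e^x) for AB2, where rho(w) = w^2 - w and
    sigma(w) = (3/2)w - 1/2; order p means this is O(x^(p+1))."""
    w = math.exp(x)
    return (w * w - w) - x * (1.5 * w - 0.5)

# Leading behaviour ~ (5/12) x^3, consistent with order 2.
c = residual(1e-3) / 1e-3 ** 3
```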
T. 13-42
<The Dahlquist equivalence theorem> A multi-step method is convergent
if and only if its order p is at least 1 and the root condition holds.
The proof is quite difficult, so we will not give it here. We can think of this as
saying that large roots of ρ are bad for convergence — their modulus cannot exceed 1,
and we cannot have repeated roots of modulus 1.
Intuitively, if a method Σ_{ℓ=0}^{s} ρ_ℓ y_{n+ℓ} = h Σ_{ℓ=0}^{s} σ_ℓ f(t_{n+ℓ}, y_{n+ℓ}) has order p, then
Σ_{ℓ=0}^{s} ρ_ℓ y(t_{n+ℓ}) = h Σ_{ℓ=0}^{s} σ_ℓ f(t_{n+ℓ}, y(t_{n+ℓ})) + O(h^{p+1}); the difference of the two
gives Σ_{ℓ=0}^{s} ρ_ℓ e_{n+ℓ}(h) = h Σ_{ℓ=0}^{s} σ_ℓ (f(t_{n+ℓ}, y_{n+ℓ}) − f(t_{n+ℓ}, y(t_{n+ℓ}))) + O(h^{p+1}),
where e_n(h) is the error at step n with step size h. In the limiting case h = 0 we
have Σ_{ℓ=0}^{s} ρ_ℓ e_{n+ℓ}(0) = 0. This is a difference equation with solutions of the form
(e_n)_i(0) = r^n, where r is a root of the characteristic polynomial ρ(w) = Σ_{ℓ=0}^{s} ρ_ℓ w^ℓ.
By continuity, for small h the behaviour is similar to the case h = 0,
so we sort of see that roots of modulus greater than 1 are "bad".
E. 13-43
Consider the two-step Adams-Bashforth method. We have seen that it has order p =
2 ≥ 1. So we need to check the root condition. We have ρ(w) = w² − w = w(w − 1),
whose roots are 0 and 1, so the root condition is satisfied.
C. 13-44
<Constructing multi-step methods> Let’s now come up with a sensible
strategy for constructing convergent s-step methods:
1. Choose ρ(w) = Σ_{ℓ=0}^{s} ρ_ℓ w^ℓ so that ρ(1) = 0 and the root condition holds.
D. 13-45
• An Adams method is a multi-step numerical method with ρ(w) = ws−1 (w − 1).
An Adams-Bashforth method is an explicit Adams method.
An Adams-Moulton method is an implicit Adams method.
• A backward differentiation method has σ(w) = σ_s w^s for some σ_s ≠ 0, i.e.

    Σ_{ℓ=0}^{s} ρ_ℓ y_{n+ℓ} = hσ_s f(t_{n+s}, y_{n+s}).
Expanding ρ(w)/log w about w = 1 gives

    ρ(w)/log w = w(w − 1)/log w = 1 + (3/2)(w − 1) + (5/12)(w − 1)² + O(|w − 1|³).

These aren't our coefficients of σ, since what we need to do is to rearrange the
first three terms to be expressed in terms of w. So we have

    ρ(w)/log w = −1/12 + (2/3)w + (5/12)w² + O(|w − 1|³).
L. 13-47
An s-step backward differentiation method of order s is obtained by choosing

    ρ(w) = σ_s Σ_{ℓ=1}^{s} (1/ℓ) w^{s−ℓ}(w − 1)^ℓ,

with σ_s chosen such that ρ_s = 1, namely σ_s = (Σ_{ℓ=1}^{s} 1/ℓ)^{−1}.
Multiplying both sides by σ_s w^s gives the desired result.
For this method to be convergent, we need to make sure it satisfies the root
condition. It turns out the root condition is satisfied only for s ≤ 6. This is not
obvious at first sight, but we can certainly verify it manually.
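The formula in the lemma can be checked with exact rational arithmetic: for s = 1 it should reproduce backward Euler, ρ(w) = w − 1, and for s = 2 the BDF2 polynomial ρ(w) = w² − (4/3)w + 1/3. A sketch (function names are assumptions):

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, lowest degree first."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def bdf_rho(s):
    """rho(w) = sigma_s * sum_{l=1}^{s} (1/l) w^(s-l) (w-1)^l, with
    sigma_s = (sum_{l=1}^{s} 1/l)^(-1); coefficients lowest degree first."""
    sigma_s = 1 / sum(Fraction(1, l) for l in range(1, s + 1))
    rho = [Fraction(0)] * (s + 1)
    for l in range(1, s + 1):
        term = [Fraction(0)] * (s - l) + [Fraction(1)]      # w^(s-l)
        for _ in range(l):
            term = poly_mul(term, [Fraction(-1), Fraction(1)])  # times (w-1)
        for k, c in enumerate(term):
            rho[k] += sigma_s * c / l
    return rho
```

Note that ρ(1) = 0 holds automatically, since every summand contains a factor (w − 1).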
The Runge-Kutta methods are complicated and tedious to analyse. They were
ignored for a long time, until more powerful computers came along and made
them much more practical to use. Nowadays they are used quite a lot since they have
many nice properties. A Runge-Kutta method is a one-step method, so once we
get order p ≥ 1, we have convergence (we don't need to worry about things like the
root condition!).
There are a lot of parameters we have to choose: we need to pick ν, the weights b_ℓ,
the nodes c_ℓ, and the matrix A = (a_{ℓj}).
In fact the optimal choice of parameters makes the method achieve order 2ν. The
conditions we need for a decent order are in general very complicated. However, we
can quickly obtain some necessary conditions. Consider the case where f is a
constant; then every k_ℓ is that constant. So we must have Σ_{ℓ=1}^{ν} b_ℓ = 1. It turns
out we also need c_ℓ = Σ_{j=1}^{ν} a_{ℓj} for ℓ = 1, · · · , ν. While these are necessary conditions,
they are not sufficient.
Note that in general we have an implicit method, since the {k_ℓ}_{ℓ=1}^{ν} have to be solved for,
as they are defined in terms of one another. However, for certain choices of parameters, we
can make this an explicit method. This makes it easier to compute, but we lose
some accuracy and flexibility. Note also that unlike all the other methods we've seen
so far, the parameters appear inside f, and they appear non-linearly inside the functions.
This makes the method much more complicated and difficult to analyse using Taylor
series; the algebra rapidly becomes unmanageable.
This table allows for a general implicit method. Historically, explicit methods came
first, since they are much easier to compute. In that case, the matrix A is strictly lower
triangular, i.e. a_{ℓj} = 0 whenever ℓ ≤ j.
E. 13-48
The most famous explicit Runge-Kutta method is the 4-stage 4th-order one, often
called the classical Runge-Kutta method. The formula can be given explicitly by

    y_{n+1} = y_n + (h/6)(k_1 + 2k_2 + 2k_3 + k_4),

where

    k_1 = f(x_n, y_n),
    k_2 = f(x_n + h/2, y_n + (h/2)k_1),
    k_3 = f(x_n + h/2, y_n + (h/2)k_2),
    k_4 = f(x_n + h, y_n + hk_3).

We see that this is an explicit method; we don't need to solve any equations.
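These formulas transcribe directly into code. A sketch for a scalar ODE (names and the test problem are illustrative assumptions):

```python
import math

def rk4_step(f, t, y, h):
    """One step of the classical 4th-order Runge-Kutta method."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4(f, t0, y0, T, n):
    """Integrate y' = f(t, y) on [t0, T] with n RK4 steps."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = rk4_step(f, t, y, h)
        t += h
    return y

# y' = y on [0, 1]: halving h should divide the error by about 2^4 = 16.
err = [abs(rk4(lambda t, y: y, 0.0, 1.0, 1.0, n) - math.e) for n in (10, 20)]
```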
E. 13-49
The Runge-Kutta methods can be motivated by Gaussian quadrature. Since ẏ = f,
integrating over a single time step gives

    y(t_{n+1}) = y(t_n) + ∫_{t_n}^{t_{n+1}} f(t, y(t)) dt.

We can use the quadrature formula on the integral (transformed from [0, 1] to
[t_n, t_{n+1}]) to generate the "one-step method"

    y_{n+1} = y_n + h Σ_{ℓ=1}^{ν} b_ℓ f(t_n + c_ℓ h, y(t_n + c_ℓ h)).

But we certainly don't know the exact values of {y(t_n + c_ℓ h)}_{ℓ=1}^{ν}, so in some way
we have to approximate these values, which is where the k_ℓ come in.
594 CHAPTER 13. NUMERICAL ANALYSIS
E. 13-50
Choosing the parameters for the Runge-Kutta method to maximize order is hard.
Consider the simplest case, the 2-stage explicit method. The general formula is

    y_{n+1} = y_n + h(b_1 k_1 + b_2 k_2),

where

    k_1 = f(t_n, y_n) = y′(t_n),
    k_2 = f(t_n + c_2 h, y(t_n) + c_2 h y′(t_n))
        = y′(t_n) + c_2 h (∂f/∂t (t_n, y(t_n)) + ∇f(t_n, y(t_n)) · y′(t_n)) + O(h²)
        = y′(t_n) + c_2 h y″(t_n) + O(h²),

since ∂f/∂t + ∇f · y′ = y″. Now we see why Runge-Kutta methods are hard to analyse:
the coefficients appear non-linearly in this expression. It is still solvable in this case,
in the obvious way, but for higher-stage methods this becomes much more complicated.
In this case, we have a 1-parameter family of order 2 methods, satisfying

    b_1 + b_2 = 1,    b_2 c_2 = 1/2.
It is easy to check using the simple equation y′ = λy that it is not possible to
get a higher order with two explicit stages (the optimal 2-stage order 4 Runge-Kutta
method is implicit). So as long as our choice of b_1 and b_2 satisfies these equations,
we get a decent order 2 method. For example, for b_1 = b_2 = 1/2 and c_2 = 1 we have

    <Heun's method>    y_{n+1} = y_n + (h/2)(f(t_n, y_n) + f(t_n + h, y_n + hf(t_n, y_n))).
This formula also makes sense in that it can be derived from the Taylor expansion of y:

    y(t_{n+1}) = y(t_n) + hy′(t_n) + (1/2)h²y″(t_n) + O(h³)
        = y(t_n) + hf(t_n, y(t_n)) + (1/2)h²((f(t_n + h, y_n + hf(t_n, y_n)) − f(t_n, y_n))/h + O(h)) + O(h³)
        = y(t_n) + (h/2)(f(t_n, y_n) + f(t_n + h, y_n + hf(t_n, y_n))) + O(h³).
For the backward Euler method applied to y′ = λy we have y_{n+1} = y_n + hλy_{n+1},
so y_n = (1 − hλ)^{−n}. Then we get D = {z ∈ C : |1 − z| > 1}, the exterior of the
closed unit disc centred at 1, which contains the whole left half-plane. (The notes
show this as a shaded region in a figure.)
Again consider y′(t) = λy, now with the trapezoidal rule. Then we can find

    y_n = ((1 + hλ/2)/(1 − hλ/2))^n  ⟹  D = {z ∈ C : |(2 + z)/(2 − z)| < 1} = C⁻.

Note that |(2 + z)/(2 − z)| < 1 says that z has to be closer to −2 than to 2, which is
exactly the left half-plane, so D = C⁻.
So we see that the backward Euler method and the trapezoidal rule are A-stable, but
the Euler method is not.
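These amplification functions are easy to probe numerically: for Euler r(z) = 1 + z, for backward Euler r(z) = 1/(1 − z), and for the trapezoidal rule r(z) = (1 + z/2)/(1 − z/2). A sketch (the sample points in C⁻ are arbitrary choices):

```python
def r_euler(z): return 1 + z
def r_backward_euler(z): return 1 / (1 - z)
def r_trapezoidal(z): return (1 + z / 2) / (1 - z / 2)

# Sample points with Re z < 0; an A-stable method needs |r(z)| < 1 on all of C^-.
samples = [complex(-0.1, 3.0), complex(-1.0, 0.0), complex(-5.0, 2.0)]

inside_euler = [abs(r_euler(z)) < 1 for z in samples]
inside_be    = [abs(r_backward_euler(z)) < 1 for z in samples]
inside_tr    = [abs(r_trapezoidal(z)) < 1 for z in samples]
```

Only backward Euler and the trapezoidal rule damp all three sample points; Euler fails for λh with large imaginary part, consistent with its stability domain being just the disc |1 + z| < 1.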
• Suppose a method is A-stable. If y(t) → 0 as t → ∞, then Re(λ) < 0, so
Re(hλ) < 0, so hλ ∈ D, hence yn → 0 as n → ∞. This says that everything that
is meant to converge to 0 will converge to 0 under the numerical method if it’s
A-stable. (Note that A-stability does not mean that any step size will do! We still
need to choose h small enough to ensure the required accuracy.)
Usually, when applying a numerical method to the problem y′ = λy, we get
y_n = (r(hλ))^n, where r is some rational function. So D = {z ∈ C : |r(z)| < 1}. In
particular, whether y_n → 0 or not should depend only on hλ, not on h and λ separately.
A-stability is a very strong requirement; it is hard to achieve. In particular,
Dahlquist proved that no multi-step method of order p ≥ 3 is A-stable (the p = 2
barrier is attained by the trapezoidal rule). Moreover, no explicit Runge-Kutta
method can be A-stable (although implicit ones can).
C. 13-53
<A-stability and the maximum principle> Suppose when applying a nu-
merical method to the problem y 0 = λy, we get yn = (r(hλ))n where r is some
function. We want to know whether C− ⊆ D = {z ∈ C : |r(z)| < 1}, i.e. whether
the method is A-stable. Suppose we know r is holomorphic/analytic in C− . The
Maximum principle from complex analysis states that if a function g is analytic
and non-constant in an open set Ω ⊆ C, then |g| has no maximum in Ω. Since
|r| needs to have a maximum in the closure of C− , the maximum must occur on
the boundary. So to show |r| ≤ 1 on the region C− , we only need to show the
inequality holds on the boundary which is the imaginary axis and “infinity”.
E. 13-54
Consider the following 2-stage 3rd-order implicit Runge-Kutta method applied to
y′ = λy:

    hk_1 = hλ(y_n + (1/4)hk_1 − (1/4)hk_2),
    hk_2 = hλ(y_n + (1/4)hk_1 + (5/12)hk_2).

This is a linear system for hk_1 and hk_2, whose solution is

    hk_1 = (1 − (2/3)hλ) hλy_n / (1 − (2/3)hλ + (1/6)(hλ)²),
    hk_2 = hλy_n / (1 − (2/3)hλ + (1/6)(hλ)²),

and therefore

    y_{n+1} = y_n + (1/4)hk_1 + (3/4)hk_2 = ((1 + (1/3)hλ) / (1 − (2/3)hλ + (1/6)(hλ)²)) y_n.
Multiplying numerator and denominator by 6, we can write y_{n+1} = r(hλ)y_n, where

    r(z) = (6 + 2z)/(6 − 4z + z²).
We first check that it is analytic in C⁻. It certainly has some poles, but they are
at 2 ± √2 i, which lie in the right half-plane. So r is analytic in C⁻.
Next, what happens on the boundary of the left half-plane? Firstly, as |z| → ∞,
we find r(z) → 0, since there is a z² in the denominator. The next part is checking
when z is on the imaginary axis, say z = it with t ∈ R. Then some
messy algebra shows that |r(it)| ≤ 1 for all t ∈ R. Therefore, by the maximum principle, we
must have |r(z)| ≤ 1 for all z ∈ C⁻.
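The "messy algebra" is that |6 − 4it + (it)²|² − |6 + 2it|² = (6 − t²)² + 16t² − (36 + 4t²) = t⁴ ≥ 0, so |r(it)| ≤ 1. A quick numerical spot-check of both claims (sample points are arbitrary):

```python
def r(z):
    """Amplification function of the 2-stage 3rd-order implicit RK method."""
    return (6 + 2 * z) / (6 - 4 * z + z * z)

# On the imaginary axis, |denominator|^2 - |numerator|^2 should equal t^4.
boundary_ok = all(abs(r(complex(0.0, t))) <= 1 + 1e-12
                  for t in (0.0, 0.5, 2.0, 50.0))
z = complex(0.0, 2.0)
gap = abs(6 - 4 * z + z * z) ** 2 - abs(6 + 2 * z) ** 2   # should be 2^4 = 16
```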
    y_{n+1} = y_n + (h/2)(3f(t_n, y_n) − f(t_{n−1}, y_{n−1})).
This is an order 2 method with local truncation error η_{n+1} = (5/12)h³y‴(t_n) + O(h⁴).
The trapezoidal rule is an order 2 implicit method:
    y_{n+1} = y_n + (h/2)(f(t_n, y_n) + f(t_{n+1}, y_{n+1})).
² TR is a far better method than AB: it is A-stable, hence its global behaviour is superior.
Employing AB to estimate the local error adds very little to the overall cost of TR, since AB is an
explicit method.
This has a local truncation error of η_{n+1} = −(1/12)h³y‴(t_n) + O(h⁴). The key to Milne's
device is the coefficients of h³y‴(t_n), namely

    c_AB = 5/12,    c_TR = −1/12.
These are called the error constants of the methods (when they exist). Since these are two
different methods, we get different y_{n+1}'s. We distinguish these by superscripts, and
have

    y(t_{n+1}) − y^{AB}_{n+1} ≃ c_AB h³y‴(t_n),    y(t_{n+1}) − y^{TR}_{n+1} ≃ c_TR h³y‴(t_n).

We can now eliminate y‴(t_n) to obtain

    y(t_{n+1}) − y^{TR}_{n+1} ≃ (−c_TR/(c_AB − c_TR)) (y^{AB}_{n+1} − y^{TR}_{n+1}).
In this case, the constant we have is 1/6. So we can estimate the local truncation error
for the trapezoidal rule without knowing the value of y‴. We can then use this to
adjust h accordingly.
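A one-step numerical illustration of Milne's device on y′ = −y (a sketch; the starting values are taken to be exact, and the test problem is an illustrative assumption): compare the estimate (1/6)(y^{AB} − y^{TR}) with the true local error of the trapezoidal step.

```python
import math

h = 0.01
f = lambda t, y: -y                   # test problem y' = -y, exact y(t) = e^{-t}
y_prev, y_curr = math.exp(h), 1.0     # exact values at t = -h and t = 0

# Predictor: 2-step Adams-Bashforth.
y_ab = y_curr + h / 2 * (3 * f(0.0, y_curr) - f(-h, y_prev))
# Corrector: trapezoidal rule (implicit, but linear here, so solvable directly).
y_tr = (y_curr + h / 2 * f(0.0, y_curr)) / (1 + h / 2)

milne_estimate = (y_ab - y_tr) / 6    # -c_TR/(c_AB - c_TR) = 1/6
true_error = math.exp(-h) - y_tr      # actual local error of the TR step
```

The estimate agrees with the true error to within about 1% here, without ever evaluating y‴.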
In general, for error control for multistep methods we employ a predictor-corrector pair:
that is, we use two multistep methods of the same order, one explicit (the predictor)
and the other implicit (the corrector). The predictor is employed not just to estimate
the local error for the corrector using Milne's device, but also to provide a good initial
guess for the solution of the implicit corrector equations. Depending on whether an
error tolerance has been achieved, we (or the program) amend the step size h.
Since we are using a multi-step method, one might wonder: if we change the step size
half way through a simulation, how are the unknown previous approximations
(e.g. y_{n−1}, y_{n−2} etc.) obtained? We cannot use the originals, since they correspond to
a different step size. The answer is that they are obtained by suitable polynomial inter-
polation from the other approximations calculated with different step lengths.
The strategy used for error control of multistep methods (i.e. Milne's device with
predictor-corrector pairs) cannot be applied to RK methods. This is because the
nonlinear nature of RK methods means that the leading term in the local truncation
error is no longer simply an error constant multiplying a derivative of the exact solution.
We replace the idea of predictor-corrector pairs by embedded Runge-Kutta methods,
where a lower-order RK method is hidden inside a higher-order RK method. We will
not go into too much detail, but in short, the embedded RK approach requires two
(typically explicit) RK methods: one having ν stages and order p, while the other
has ν + ℓ stages (with ℓ > 1) and order p + 1. The key restriction is that the first
ν stages of both methods must be identical. This restriction ensures that the cost
of implementing the higher-order method is marginal, once we have computed the
lower-order approximation.
The Zadunaisky device is a general technique for obtaining error estimates for nu-
merical approximations of initial-value ODEs. Suppose we have used an arbitrary
numerical method of order p, and that we have stored the previously computed values
y_n, y_{n−1}, ..., y_{n−p} (the time steps between them need not be equal). We construct the
degree-p interpolating polynomial (with vector coefficients) d, such that d(t_{n−i}) = y_{n−i}
for i = 0, 1, · · · , p, and consider the initial-value ODE

    z′(t) = f(t, z(t)) + (d′(t) − f(t, d(t)))  for t ∈ [t_n, t_{n+1}],  with z(t_n) = y_n.
Note that
1. Since d(t) − y(t) = O(h^{p+1}) and y′(t) = f(t, y(t)), the term d′(t) − f(t, d(t)) is usually
small. Therefore, the new ODE for z is a small perturbation of the original ODE.
2. The exact solution of this new ODE is z(t) = d(t).
So, having applied our numerical method to the original ODE to produce y_{n+1}, we
apply exactly the same numerical method and implementation details to the new ODE
to produce z_{n+1}. We then evaluate the error in z_{n+1}, namely z_{n+1} − d(t_{n+1}), and use
it as an estimate of the error in y_{n+1}.
There are various ways to solve this equation for y_{n+1}. We will give the two commonest
types of iterative nonlinear solver:
1. Functional iteration: As the name suggests, this method is iterative, so we use
superscripts to denote the iterates. In this case, we use the formula

    y^{(k+1)}_{n+1} = y_n + hf(t_{n+1}, y^{(k)}_{n+1}).

Usually, we start with y^{(0)}_{n+1} = y_n, or we use some simpler explicit method to obtain
our first guess of y^{(0)}_{n+1}. The question, of course, is whether y^{(k)}_{n+1} converges to
y_{n+1} as k → ∞. Fortunately, it converges to a locally unique solution if λh is
sufficiently small, where λ is the Lipschitz constant of f. For backward Euler,
we require λh < 1. This relies on the contraction mapping theorem.
Unlike Newton's method, functional iteration requires neither the solution of N × N
linear systems nor the computation of Jacobian matrices. Hence it has a much
lower computational cost than Newton's method (which we will see next). However,
functional iteration needs λh < 1, and this restriction can (especially for stiff
equations) lead to very small time-steps and a large amount of computation time.
2. Newton's method (Newton-Raphson method): y_{n+1} is the root of F(x) ≡ x −
(y_n + hf(t_{n+1}, x)), so we use the scheme y^{(k+1)}_{n+1} = y^{(k)}_{n+1} − (∇F(y^{(k)}_{n+1}))^{−1} F(y^{(k)}_{n+1}).
That is,

    y^{(k+1)}_{n+1} = y^{(k)}_{n+1} − z^{(k)}    with    (I − hJ^{(k)}) z^{(k)} = y^{(k)}_{n+1} − (y_n + hf(t_{n+1}, y^{(k)}_{n+1})),

where J^{(k)} is the Jacobian matrix J^{(k)} = ∇f(t_{n+1}, y^{(k)}_{n+1}) ∈ R^{N×N}. This requires
us to first solve for z^{(k)} in the second equation, but this is a linear system, for which we
have some efficient methods.
There are several variants to Newton's method. This is the full Newton's method,
where we re-compute the Jacobian in every iteration. It is also possible to just
use the same Jacobian J^{(0)} over and over again. There are some speed gains in
solving the equation, but then we will need more iterations before we can get our
y_{n+1}. The only role the Jacobian matrix plays is to ensure convergence: its precise
value makes no difference to lim_{k→∞} y^{[k]}_{n+1}. Therefore we might replace it with a
finite-difference approximation and/or evaluate it once every several steps.
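Both solvers can be sketched for a scalar backward Euler step, where the equation to solve is x = y_n + hf(x); the function names and test problem y′ = −y³ are illustrative assumptions, not from the notes.

```python
def backward_euler_step(f, fprime, y, h, solver="newton", iters=50):
    """Solve x = y + h*f(x) for one backward Euler step (scalar sketch).
    'functional' assumes h*|f'| < 1 so the fixed-point iteration contracts;
    'newton' iterates x <- x - F(x)/F'(x) with F(x) = x - y - h*f(x)."""
    x = y  # initial guess
    for _ in range(iters):
        if solver == "functional":
            x = y + h * f(x)                 # fixed-point iteration
        else:
            F = x - y - h * f(x)             # Newton's method on F(x) = 0
            x = x - F / (1 - h * fprime(x))
    return x
```

For a full N-dimensional system, the Newton branch would instead solve the linear system (I − hJ)z = F at each iterate, which is where the cost difference between the two solvers comes from.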
³ The lecturer of this course requires L to be unit lower triangular.
13.5. NUMERICAL LINEAR ALGEBRA 601
13.5.1 LU factorization
Since triangular matrices are easy to work with, it will be useful if we can factorise an
arbitrary matrix into a product of triangular matrices. If we can find an LU factorization
A = LU of A, where L and U are lower and upper triangular respectively, then we can
solve Ax = b in two steps — we first find y such that Ly = b, then find x such that Ux = y.
Then Ax = LUx = Ly = b.
Unfortunately, even if A is non-singular, it may not have an LU factorization. For
example, (0 1; 1 1) is non-singular with determinant −1, but we can manually check that
there is no LU factorization of it. On the other hand, while we don't really like
singular matrices, singular matrices may still have LU factorizations. For example,
(0 0; 0 1) = (1 0; 0 1)(0 0; 0 1) is trivially an LU factorization of a singular matrix.
If our A is invertible, then both L and U must have non-zero determinant, hence none
of their diagonal entries are 0. So replacing l_i u_i^T with (l_i/L_{ii})(u_i^T L_{ii}) we can assume
L is a unit lower triangular matrix. This extra condition will ensure that (as we will
see) our LU factorisation of A is unique. In the case that A is not invertible, there
exist matrices (e.g. (0 0; 1 1)) for which an LU factorisation exists, yet no LU factorisation
exists with L unit. But we will not care too much about non-invertible matrices, so here
we will impose the extra condition that L is unit.
For each i, we know l_i and u_i have their first i − 1 entries zero. So the first i − 1
rows and columns of l_i u_i^T are zero. In particular, the first row and column only have
contributions from l_1 u_1^T, the second row/column only has contributions from l_1 u_1^T and
l_2 u_2^T, etc. We can find the LU factorisation as follows:
1. Obtain l_1 and u_1 from the first row and column of A. Since the first entry of l_1
is 1, u_1^T is exactly the first row of A. We can then obtain l_1 by taking the first
column of A and dividing by U_{11} = A_{11}.
2. Obtain l_2 and u_2 from the second row and column of A − l_1 u_1^T similarly.
3. · · ·
4. Obtain l_n and u_n from the nth row and column of A − Σ_{i=1}^{n−1} l_i u_i^T.
We can turn this into an algorithm. We define the intermediate matrices, starting with
A^{(0)} = A. For k = 1, · · · , n, we let

    U_{kj} = A^{(k−1)}_{kj},    j = k, · · · , n,
    L_{ik} = A^{(k−1)}_{ik} / A^{(k−1)}_{kk},    i = k, · · · , n,
    A^{(k)}_{ij} = A^{(k−1)}_{ij} − L_{ik} U_{kj},    i, j > k.

We only compute the entries i, j > k in the last part because we know A^{(k)}_{ij} would be 0
for any other i, j. At the end of the algorithm, when k = n, we end up with a zero matrix,
and then U and L are completely filled.
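The algorithm above, together with the two-substitution solve, can be sketched with plain Python lists (no pivoting, so it assumes the pivots A^{(k−1)}_{kk} stay non-zero; names are assumptions):

```python
def lu(A):
    """LU factorization with L unit lower triangular; no pivoting."""
    n = len(A)
    U = [row[:] for row in A]                                   # becomes U
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]         # multiplier, assumes U[k][k] != 0
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]    # row operation, zeroes U[i][k]
    return L, U

def lu_solve(L, U, b):
    """Solve LUx = b by forward then backward substitution (O(n^2))."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                           # forward: Ly = b
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):                 # backward: Ux = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x
```

The O(n³) cost sits entirely in `lu`; each extra right-hand side costs only an O(n²) call to `lu_solve`.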
We can now see when this will break down. A sufficient condition for A = LU to exist
is that A^{(k−1)}_{kk} ≠ 0 for all k. Since A^{(k−1)}_{kk} = U_{kk}, this sufficient condition ensures U,
and hence A, is non-singular. Conversely, if A is non-singular and an LU factorization
exists, then the algorithm always works, since we must have A^{(k−1)}_{kk} = U_{kk} ≠ 0. Moreover,
the LU factorization must be given by this algorithm. So the LU factorization of an
invertible matrix into LU with L unit is unique. This is not necessarily true if A
is singular. For example, (0 0; 0 1) = (1 0; a 1)(0 0; 0 1) for any real number a. The problem with
this sufficient condition is that most of these coefficients do not appear in the matrix
A. They are constructed during the algorithm, and we don't easily know what they are
in terms of the coefficients of A. We will later come up with an equivalent condition
on our original A that is easier to check.
Note that as long as this method does not break down, we need O(n³) operations
to perform the factorization. Recall we only need O(n²) operations to solve the
equation after factorizing. So the bulk of the work in solving Ax = b is the
LU factorization. In fact, LU factorisation is a good way to
1. find det A: we use the formula det A = det L det U = det U = Π_{k=1}^{n} U_{kk}, since L is
unit triangular.
2. find the inverse of A if it is non-singular: solving Ax_j = e_j gives the
jth column of A^{−1}. Note that we are solving the system with the same A for each j,
so we only have to perform the LU factorization once, and then solve n different
equations. In total we need O(n³) operations.
At the kth step of the LU algorithm, the operation A^{(k)} = A^{(k−1)} − l_k u_k^T (done for
entries i, j > k) has the property that the ith row of A^{(k)} is the ith row of A^{(k−1)}
minus L_{ik} times u_k^T (the kth row of A^{(k−1)}), i.e.

    [ith row of A^{(k)}] = [ith row of A^{(k−1)}] − L_{ik} × [kth row of A^{(k−1)}],

where the multipliers L_{ik} = A^{(k−1)}_{ik} / A^{(k−1)}_{kk} are chosen so that, at the outcome, the (i, k)
entry of A^{(k)} is zero. This action is performed over i = k + 1, · · · , n, so that the last n − k
entries of the kth column are 0. This construction is analogous to Gaussian elimination
for solving Ax = b. Indeed, in the algorithm, we don't need to define a new A^{(k)}
at every step; we can just overwrite the existing matrix A. In this case we also
managed to conveniently store U gradually in A (we store it in the rows that are
meant to be becoming 0). The algorithm becomes: for k = 1, · · · , n,

    L_{ik} = A_{ik}/A_{kk},    i = k, · · · , n,
    A_{ij} = A_{ij} − L_{ik} A_{kj},    i, j > k.

The resulting A is our U. We see that in this process, A gradually becomes U through
row operations, just as in Gaussian elimination. We note in passing that, if one wishes,
one can also make the program store L gradually in A, by making use
of the columns that are becoming zero (and not storing the unit diagonal entries of L,
since we know they are 1).
The difference between the LU and the Gaussian approach is that in the LU approach
we also store L; this allows us to not care about b (for solving Ax = b) until the
factorization is complete. To solve Ax = b for many different b using the LU algorithm,
O(n³) operations are only required for the single initial factorisation (which we only
need to do once); the solution for each new b then only requires O(n²) operations
(for the back- and forward substitutions). Whereas to solve Ax = b using Gaussian
elimination, we would need to perform the row operations not just on the matrix
A but at the same time on the particular b, so for each new b we need to
perform the whole elimination again, which requires O(n³) computational operations
each time.
P A = LU,
where P is a permutation matrix, which when acting on A just permutes the rows
of A. So we want to factor A up to a permutation of rows. Note that P is invertible,
so A = P^{−1}LU. If we manage to do this, we can, as before, easily solve Ax = b,
since we know how to solve PAx = Pb. Now we will extend the previous algorithm
to allow permutations of rows, and we shall show that this factorization is possible for
all matrices.
Suppose our breakdown occurs at k = 1, i.e. A^{(0)}_{11} = A_{11} = 0. We find a permutation
matrix P_1 and let it act via P_1 A^{(0)}. The idea is to look down the first column of A, and
find a row starting with a non-zero element, say row p. Then we use P_1 to interchange rows
1 and p so that P_1 A^{(0)} has a non-zero top-most entry. For simplicity, we assume we
always need a P_1; if A^{(0)}_{11} is non-zero in the first place, we just take P_1 to be the
identity. After that, we can carry on. We construct l_1 and u_1 from P_1 A^{(0)} as before,
and set A^{(1)} = P_1 A^{(0)} − l_1 u_1^T.
But what happens if the first column of A is completely zero? Then no interchange
will make the (1, 1) entry non-zero. However, in this case, we don't actually have to
do anything. We can immediately find our l_1 and u_1: set l_1 = e_1 (or anything)
and let u_1^T be the first row of A^{(0)}. Then this already works. Note however that this
corresponds to A (and hence U) being singular, and we are not too interested in
these.
The later steps are exactly analogous. Suppose we have A^{(k−1)}_{kk} = 0. Again we find
a P_k such that P_k A^{(k−1)} has a non-zero (k, k) entry. We then construct l_k and u_k
from P_k A^{(k−1)} and set A^{(k)} = P_k A^{(k−1)} − l_k u_k^T. Again, if the kth column of A^{(k−1)} is
completely zero, we set l_k = e_k and u_k^T to be the kth row of A^{(k−1)}.
However, as we do this, the permutation matrices appear all over the place inside
the algorithm. It is not immediately clear that we do get a factorization of the form
PA = LU. Fortunately, keeping track of the interchanges, we do have an LU factor-
ization

    P A = L̃U,

where U is what we got from the algorithm, P = P_{n−1} · · · P_2 P_1, while L̃ is the unit
lower triangular matrix with columns l̃_k = P_{n−1} · · · P_{k+1} l_k. Note that in particular,
l̃_{n−1} = l_{n−1} and l̃_n = l_n.
One problem we have not considered is the problem of inexact arithmetic. While
these formulas are correct mathematically, when we actually implement things, we do
them on computers with finite precision. As we go through the algorithm, errors will
accumulate, and the error might be amplified to a significant amount by the time we
reach the end. We want an algorithm that is insensitive to errors. In order to
work safely in inexact arithmetic, every time we permute the rows we will choose the
element of largest modulus in the kth column and put it in the (k, k) position, not
just an arbitrary non-zero one, as this minimizes the error when dividing. We perform
this permutation at every step, even if the original element in the (k, k) position is already
non-zero, i.e. our aim is now to permute the largest element to the (k, k) position
rather than to get rid of a zero element.
So far we have allowed the permutation of rows. We can in fact allow the permutation of
columns as well, so that we have the factorisation PAQ = LU, where P and Q are
permutation matrices that reorder the rows and columns of A respectively. This
is called full pivoting. Now at every step, we don't just move the largest-modulus
element of the kth column to the (k, k) position; we move the largest-modulus
element of the whole remaining matrix there. So this further minimises the effect of
inexact arithmetic. In practice, however, the extra computational effort required for
full pivoting is not regarded as worthwhile, and partial pivoting remains the standard
choice.
• A leading zero of a string of numbers or a vector is any zero entry that comes before
the first non-zero entry.
E. 13-58
A band matrix of band width 0 is a diagonal matrix; a band matrix
of band width 1 is a tridiagonal matrix.
T. 13-59
A sufficient condition for both the existence and uniqueness of an LU factorization
A = LU of an n × n matrix A with L unit is that det(A_k) ≠ 0 for k = 1, · · · , n − 1.
Even with symmetric matrices, some form of pivoting is generally necessary, both
to avoid breakdown and to maintain accuracy when using inexact arithmetic.
Clearly, permuting the rows of A will destroy symmetry unless we simultaneously
permute the corresponding columns, i.e. A → PAP^T, where P is a permutation
matrix. One would like to prove that, for any symmetric A, a symmetric factori-
sation of the form PAP^T = LDL^T, where L is unit lower triangular and D is
diagonal, exists. This however is not true, even if A is restricted to be nonsingular.
Fortunately, the next best result is true: for any symmetric A, a symmetric
factorisation of the form PAP^T = LTL^T, where L is unit lower triangular and T
is both symmetric and tridiagonal, exists.
T. 13-64
Let A ∈ R^{n×n} be a positive-definite matrix. Then det(A_k) ≠ 0 for all k = 1, · · · , n.
(Forward) Since A is positive definite and symmetric, by the previous two theorems
A = LDL^T, where L is unit lower triangular and D is diagonal. Now we have to
show D_{kk} > 0. We define y_k such that L^T y_k = e_k, which exists since L is
invertible. Then clearly y_k ≠ 0, and we have

    D_{kk} = e_k^T D e_k = y_k^T LDL^T y_k = y_k^T A y_k > 0.
In fact if A is both symmetric and positive definite, pivoting is no longer required
either theoretically or practically.
This result gives a practical check for whether a symmetric A is positive definite:
we perform the LDLᵀ factorization and then check whether the diagonal of D has
positive entries. The decomposition of a symmetric positive-definite matrix A
into A = L D Lᵀ, with L unit lower triangular and D a positive-definite diagonal
matrix, is called Cholesky factorization. There is another way of stating this
factorization. Let D^{1/2} be the “square root” of D, obtained by taking the
positive square root of each diagonal entry of D. Then we have
    A = L D Lᵀ = L D^{1/2} D^{1/2} Lᵀ = (L D^{1/2})(L D^{1/2})ᵀ = G Gᵀ,
where G is lower triangular with Gkk > 0.
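The positive-definiteness check described above can be sketched in code. The following is a minimal illustration, not a production routine: it assumes the factorization runs to completion without pivoting (guaranteed for positive-definite input), and the matrix A is an arbitrary example.

```python
import numpy as np

def ldl_no_pivot(A):
    """LDL^T factorization without pivoting: A = L D L^T with L unit
    lower triangular and D diagonal. A sketch; assumes no zero pivot
    is met, which is guaranteed when A is positive definite."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        d[k] = A[k, k] - L[k, :k] ** 2 @ d[:k]
        for i in range(k + 1, n):
            L[i, k] = (A[i, k] - (L[i, :k] * L[k, :k]) @ d[:k]) / d[k]
    return L, d

def is_positive_definite(A):
    # A symmetric A is positive definite iff all D_kk > 0.
    _, d = ldl_no_pivot(A)
    return bool(np.all(np.isfinite(d)) and np.all(d > 0))

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 3.0, 1.0],
              [2.0, 1.0, 3.0]])
L, d = ldl_no_pivot(A)
G = L * np.sqrt(d)   # Cholesky factor: A = G G^T with positive diagonal
```

Here G = L D^{1/2} reproduces the factorization A = GGᵀ stated above.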
608 CHAPTER 13. NUMERICAL ANALYSIS
T. 13-66
Let A = LU be the LU factorization of the non-singular matrix A with L unit.
Then
1. all leading zeros in the rows of A to the left of the principal diagonal are
inherited by L,
2. all leading zeros in the columns of A above the principal diagonal are inherited
by U .
A matrix A is called a sparse matrix if nearly all of its elements are zero. It is
often required to solve very large systems Ax = b where A is sparse. The efficient
solution of such systems should exploit the sparsity. In particular, we wish the
matrices L and U to inherit as much as possible of the sparsity of A, so that
the cost of performing the forward and backward substitutions with L and U is
comparable with the cost of forming the product Ax: i.e. the cost of computation
should be determined by the number of nonzero entries, rather than by n. Hence
this and the next result are useful.
This result also suggests that, for the factorization of a sparse matrix A, one might
try to reorder its rows and columns beforehand so that many of the zero elements
become leading zeros in rows and columns. Thus we are using interchanges to
reduce the fill-in of L and U, rather than to prevent breakdown of the factorisation.
P. 13-67
If a band matrix A has band width r and an LU factorization A = LU , then L
and U are both band matrices of width r.
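This band-preservation property is easy to observe numerically. Below is a sketch: the Doolittle-style LU code assumes no pivoting is needed, and the tridiagonal test matrix is an illustrative choice.

```python
import numpy as np

def lu_no_pivot(A):
    """LU factorization A = L U without pivoting, with L unit lower
    triangular. A sketch; assumes every pivot is non-zero."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), A.copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

def band_width(M, tol=1e-12):
    """Smallest r such that M[i, j] = 0 whenever |i - j| > r."""
    n = M.shape[0]
    return max((abs(i - j) for i in range(n) for j in range(n)
                if abs(M[i, j]) > tol), default=0)

# Tridiagonal (band width 1) example: L and U inherit the band.
A = np.diag([4.0] * 5) + np.diag([1.0] * 4, 1) + np.diag([1.0] * 4, -1)
L, U = lu_no_pivot(A)
```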
T. 13-68
A vector x* ∈ R^n minimizes ‖Ax − b‖² if and only if Aᵀ(Ax* − b) = 0.
(Forward) Suppose x* minimizes
    f(x) = ⟨Ax − b, Ax − b⟩ = xᵀAᵀAx − 2xᵀAᵀb + bᵀb.
Then the partial derivatives of f evaluated at x* must vanish. We have
∇f(x) = 2Aᵀ(Ax − b), so a necessary condition is Aᵀ(Ax* − b) = 0.
(Backward) Suppose Aᵀ(Ax* − b) = 0. For any x ∈ R^n write x = x* + y; then
    ‖Ax − b‖² = ‖A(x* + y) − b‖² = ‖Ax* − b‖² + 2yᵀAᵀ(Ax* − b) + ‖Ay‖²
              = ‖Ax* − b‖² + ‖Ay‖² ≥ ‖Ax* − b‖².
So x* minimizes the Euclidean norm.
This result makes sense geometrically: let U be the space spanned by the columns of
A. Then Ax is a point in U, so ‖Ax − b‖ is the distance between b and a point
in U. So if x* minimises ‖Ax − b‖, then Ax* − b must be orthogonal to the space
U; hence Aᵀ(Ax* − b) = 0.
P. 13-69
If A ∈ Rm×n is a full-rank matrix, then there is a unique solution to the least
squares problem.
We know all minimizers are solutions to (AᵀA)x = Aᵀb. The matrix A being full
rank means y ≠ 0 ∈ R^n implies Ay ≠ 0 ∈ R^m. Now
    xᵀAᵀAx = (Ax)ᵀ(Ax) = ‖Ax‖² > 0 for all x ≠ 0.
Hence AᵀA ∈ R^{n×n} is positive definite (and in particular non-singular), so we
can invert AᵀA and find a unique solution x.
Now to find the x∗ minimizing kAx−bk2 , we just need to solve the normal equations
AT Ax = AT b. If A has full rank, then the Gram matrix AT A is non-singular,
and there is a unique solution. If not, then the general theory of linear equations
tells us there are either infinitely many solutions or none. But for this
particular form of equations, it turns out there is always a solution.
However, solving the normal equations is sometimes not the best way of finding x*,
for reasons of accuracy or practicality. For example, A may have useful sparsity
properties which are lost when forming AᵀA. The “squaring” process is inherently
dangerous: AᵀA can be a much more ill-conditioned matrix than A, and with inexact
arithmetic A can have full rank while AᵀA is singular to the computer (eg. due to
rounding errors when numbers get large, (10^8)² = 10^16). Instead, a better
approach to the problem makes use of QR factorization.
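The danger of squaring can be seen in a small experiment. The matrix below is a Läuchli-type example chosen purely for illustration: A has full rank, but 1 + ε² rounds to 1 in double precision, so the computed Gram matrix AᵀA is exactly singular, while the QR route is unaffected.

```python
import numpy as np

eps = 1e-8
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])
b = np.array([2.0, eps, eps])   # exact least-squares solution is x = (1, 1)

# Normal equations: A^T A = [[1 + eps^2, 1], [1, 1 + eps^2]] rounds to
# the singular all-ones matrix, since 1 + 1e-16 == 1 in double precision.
gram = A.T @ A

# QR route: solve the triangular system R~ x = Q~^T b instead.
Q, R = np.linalg.qr(A)          # skinny QR: Q is 3x2, R is 2x2
x_qr = np.linalg.solve(R, Q.T @ b)
```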
D. 13-70
• A QR factorization of an m × n matrix A (m ≥ n) is a factorization of the form
A = QR where Q ∈ R^{m×m} is an orthogonal matrix and R ∈ R^{m×n} is an upper
triangular matrix. (Here R is not a square matrix, but we keep the same definition
of upper triangular: R_ij = 0 whenever i > j.) Since the last m − n rows of R are
0, we can remove these useless rows in R and the corresponding useless columns in
Q, so that we obtain the “skinny” (sometimes called “thin” or “reduced”) version
of the QR factorization, A = Q̃R̃ where Q̃ ∈ R^{m×n} and R̃ ∈ R^{n×n}.
• We say the matrix R is in standard form if the number of leading zeros in each
row increases strictly monotonically: i.e. if R_{i,j_i} is the first non-zero entry
in the ith row, then j_1, ..., j_p is a strictly increasing sequence, where p is
the last row with a non-zero element. Completely zero rows of R are allowed, but
they must all be at the bottom of R.
• In R^m, where m > 2, we define the Givens rotation on 3 parameters
1 ≤ p < q ≤ m and θ ∈ [−π, π] to be the matrix Ω_θ^[p,q] ∈ R^{m×m} that agrees
with the identity matrix everywhere except in the pth and qth rows and columns,
where its entries are
    (Ω_θ^[p,q])_pp = cos θ,  (Ω_θ^[p,q])_pq = sin θ,
    (Ω_θ^[p,q])_qp = −sin θ,  (Ω_θ^[p,q])_qq = cos θ.
• For u ≠ 0 ∈ R^m, we define the Householder reflection by
    H_u = I − 2 (u uᵀ)/(uᵀu) ∈ R^{m×m}.
E. 13-71
• Every A ∈ R^{m×n} has a QR factorization, as we will soon show, but it is not
unique (eg. we can multiply both Q and R by −1, or indeed change the sign of a
column of Q together with the corresponding row of R). In the “skinny” QR
factorization A = Q̃R̃, we will see that if A has full rank, then the skinny
QR is unique up to sign, ie. unique if we require R̃_kk > 0 for k = 1, ..., n.
• Let A = QR be a QR factorisation. If we denote the columns of A and Q by
{a_j}_{j=1}^n and {q_j}_{j=1}^m respectively, then a_j = Σ_{i=1}^{j} q_i R_ij for
j = 1, 2, ..., n. In other words, the jth column of A is a linear combination
of the first j columns of Q (remember that the columns of Q form an orthonormal
set in R^m).
• A key property of orthogonal matrices is that ‖Qx‖ = ‖x‖ for all x ∈ R^m. Once
we have the QR factorization of A, we can multiply ‖Ax − b‖ by Qᵀ = Q^{−1} and
get the equivalent problem of minimizing ‖Rx − Qᵀb‖. We will not go into details,
but it should be clear that this is not too hard to solve. The key to why QR
factorization is useful here is that Q preserves distance while R is easy to
deal with.
• If R is in standard form, then the matrix equation Rx = b is easy to solve. For
example, consider

    [ R11 R12 R13 R14 ] [x1]   [b1]
    [ 0   0   R23 R24 ] [x2] = [b2]
    [ 0   0   0   0   ] [x3]   [ 0]
                        [x4]

where R11, R23 ≠ 0. This has infinitely many solutions. In particular we see that
x2 and x4 can be freely chosen; then x3 is determined by R23 x3 + R24 x4 = b2,
and after that x1 is determined by R11 x1 + R12 x2 + R13 x3 + R14 x4 = b1.
13.6. LINEAR LEAST SQUARES 611
In general, for R in standard form, once the variables other than x_{j_1}, ..., x_{j_p}
are freely chosen, the pivot variables x_{j_i} for i = 1, ..., p are determined.
We shall demonstrate three standard algorithms for QR factorization:
1. Gram-Schmidt factorization: The Gram-Schmidt process is used to orthogo-
nalise the columns of A.
2. Givens rotations: Simple rotation (hence orthogonal) matrices are used to
gradually transform A element-by-element into “upper triangular” form.
3. Householder reflections: Simple reflection (hence orthogonal) matrices are
used to gradually transform A column-by-column into “upper triangular”
form.
First, the Gram–Schmidt algorithm.
1. For the first column, set R11 = ‖a1‖ and q1 = a1/‖a1‖. Note that the only
non-unique possibility here is the sign — we can let R11 = −‖a1‖ and
q1 = −a1/‖a1‖ instead, but if we require R11 > 0, then this is fixed. In the
degenerate case a1 = 0 we just set R11 = 0, and then pick any q1 ∈ R^m with
‖q1‖ = 1.
2. For columns 1 < k ≤ n: for i = 1, ..., k − 1, set R_ik = ⟨q_i, a_k⟩ and compute
    d_k = a_k − Σ_{i=1}^{k−1} q_i ⟨q_i, a_k⟩.
If d_k ≠ 0, then we set
    q_k = d_k/‖d_k‖  and  R_kk = ‖d_k‖.
In the case where d_k = 0, we again set R_kk = 0, and pick q_k to be any unit
vector orthogonal to q_1, ..., q_{k−1}.
When using this algorithm we see that if A ∈ Rm×n (m ≥ n) has full rank (so no
degenerate cases), the only lack of uniqueness in the solution is the choice of sign for
each column of Q and corresponding row of R. Thus its skinny QR factorisation is
unique provided we impose the restrictions Rii > 0 for i = 1, · · · , n.
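The classical Gram–Schmidt construction above translates directly into code. Below is a sketch for the full-rank case (the degenerate d_k = 0 branch is omitted, and the test matrix is an arbitrary illustration):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Skinny QR of a full-rank A (m >= n) by classical Gram-Schmidt.
    A sketch: not robust to rank deficiency or heavy rounding."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        d = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]   # R_ik = <q_i, a_k>
            d -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(d)       # requiring R_kk > 0 fixes the sign
        Q[:, k] = d / R[k, k]
    return Q, R

A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Q, R = gram_schmidt_qr(A)
```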
When A has full rank, we can also link its unique skinny QR factorisation with
the unique Cholesky factorisation of the Gram matrix AᵀA (recall AᵀA ∈ R^{n×n}
is symmetric positive definite and so has AᵀA = GGᵀ, where G ∈ R^{n×n} is a lower
triangular matrix with positive diagonal elements). Then AG^{−T} ∈ R^{m×n} satisfies
(AG^{−T})ᵀ(AG^{−T}) = G^{−1}AᵀAG^{−T} = I, and so the columns of AG^{−T} form an
orthonormal set in R^m. Hence A = (AG^{−T})Gᵀ is our unique skinny QR factorisation,
with AG^{−T} playing the role of Q̃ and Gᵀ playing the role of R̃.
Second algorithm
When A doesn’t have full rank (say it has rank p < n), we’ll encounter cases like
a_1 = 0 or d_k = 0. Instead of defining a random vector q_k orthonormal to
q_1, ..., q_{k−1}, we might want to postpone the use of q_k to the next step, so
that q_k takes on the role that q_{k+1} would have had. If we do this, we find
that at the end the p orthonormal vectors q_1, ..., q_p are enough to create all
the columns a_1, ..., a_n of A. So our factorisation A = QR looks like
(a_1 ⋯ a_n) = (q_1 ⋯ q_p ∗ ⋯ ∗)R, where R is in standard form with its last
n − p rows zero. We can remove the last n − p rows and columns from R and Q
respectively to get the skinny version.
More precisely, we have the following algorithm (which also works for full rank)
producing the skinny version. Set j = 0, k = 0, where we use j to keep track of
the number of columns of A and R that have already been considered, and k to keep
track of the number of columns of Q that have been formed (k ≤ j). On termination
of the algorithm, we will have k = p, the rank of A.
1. Increase j by 1.
   If k = 0, set d_j = a_j.
   If k ≥ 1, set R_ij = ⟨q_i, a_j⟩ for i = 1, 2, ..., k and compute
   d_j = a_j − Σ_{i=1}^{k} R_ij q_i.
2. If d_j ≠ 0, set q_{k+1} = d_j/‖d_j‖, R_{k+1,j} = ‖d_j‖ and put R_ij = 0 for
   all i with k + 2 ≤ i ≤ j (if j ≥ k + 2), and then increase k by 1.
   If d_j = 0, put R_ij = 0 for all i with k + 1 ≤ i ≤ j.
3. Terminate if j = n, otherwise go to Step 1.
As constructed by the above algorithm, the first non-zero element of each row of R
is positive. This gives us conditions that make the skinny QR factorisation unique:
if A ∈ R^{m×n} has rank p ≤ n, then the skinny QR factorisation is unique if R is
in standard form and the first non-zero element in each row of R is greater than
zero.
In practice, a slightly different algorithm (modified Gram-Schmidt process) is used,
which is (much) superior with inexact arithmetic. The modified Gram-Schmidt process
is in fact the same algorithm, but performed in a different order in order to minimize
errors. However, this is often not an ideal algorithm for large matrices, since there are
many divisions and normalizations involved in computing the qi , and the accumulation
of errors will cause the resulting matrix Q to lose orthogonality.
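The reordering is small but consequential: in the sketch below, each accepted q_k is immediately subtracted from all remaining columns, so later inner products are taken against already partially orthogonalized vectors. The graded test matrix is an illustrative choice to make the conditioning visible; this assumes full rank.

```python
import numpy as np

def modified_gram_schmidt_qr(A):
    """Modified Gram-Schmidt: mathematically the same factorization as
    the classical process, but each q_k is subtracted from the remaining
    columns as soon as it is formed. A sketch assuming full rank."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = A.copy()
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ Q[:, j]
            Q[:, j] -= R[k, j] * Q[:, k]   # orthogonalize the rest now
    return Q, R

# Columns scaled over 9 orders of magnitude: a mildly ill-conditioned test.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10)) @ np.diag(10.0 ** -np.arange(10))
Q, R = modified_gram_schmidt_qr(A)
loss = np.linalg.norm(Q.T @ Q - np.eye(10))   # loss of orthogonality
```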
Of course, by choosing a slightly different θ, we can instead make the result zero
in the first component and √(α² + β²) in the second.
Note that for y ∈ R^m, the effect of Ω_θ^[p,q] is to alter only the p and q
components of y. In general, for B ∈ R^{m×n}, Ω_θ^[p,q] B alters only the p and q
rows of B. Moreover, just as in the R² case, given a particular z ∈ R^m, we can
choose θ such that the qth component (Ω_θ^[p,q] z)_q = 0.
Hence, A ∈ R^{m×n} can be transformed into “upper triangular” form by applying
s = mn − ½n(n + 1) Givens rotations, since we need to introduce s many zeros.
Then Q_s ⋯ Q_1 A = R. We’ll illustrate this with an example of a matrix A ∈ R^{4×3},
applying the Givens rotations in the following order:
    Ω_{θ6}^[3,4] Ω_{θ5}^[2,4] Ω_{θ4}^[2,3] Ω_{θ3}^[1,4] Ω_{θ2}^[1,3] Ω_{θ1}^[1,2] A = R.
It follows that A = Q_1ᵀ ⋯ Q_sᵀ R.
However, we don’t really need to compute Q explicitly if we just want to solve the
least squares problem, since for that we need to multiply by Qᵀ, not Q, and Qᵀ is
exactly Q_s ⋯ Q_1. Note finally that we can perform each rotation so as to leave
the diagonal entries of A non-negative; so if A has full rank, at the end we can
simply remove the redundant rows/columns to get the unique skinny version (as the
diagonal entries are then positive).
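A sketch of the whole Givens procedure follows. For clarity each rotation is formed as a dense m × m matrix; a practical implementation would update only the two affected rows. The test matrix is arbitrary.

```python
import numpy as np

def givens_qr(A):
    """QR by Givens rotations: zero the sub-diagonal entries one at a
    time, accumulating Q^T as the product of the rotations. A sketch."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Qt = np.eye(m)
    for q in range(n):                 # column being cleared
        for p in range(q + 1, m):      # rotate rows (q, p) to kill R[p, q]
            a, b = R[q, q], R[p, q]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r        # this choice leaves R[q, q] = r >= 0
            G = np.eye(m)
            G[q, q], G[q, p], G[p, q], G[p, p] = c, s, -s, c
            R = G @ R
            Qt = G @ Qt
    return Qt.T, R

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [2.0, 1.0],
              [0.0, 1.0]])
Q, R = givens_qr(A)
```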
Note that H_u is symmetric, and H_u² = I, so H_u is indeed orthogonal. To see that
it is a reflection, resolve x into parallel and perpendicular parts as
x = αu + w ∈ R^m, where α = uᵀx/(uᵀu) and uᵀw = 0. Then H_u x = −αu + w: the
component along u is reversed while the component perpendicular to u is fixed.
Note also that to apply H_u to a vector z we never need to form the matrix H_u
explicitly, since
    H_u z = z − 2 (uᵀz)/(uᵀu) u.
This only requires O(m) operations, which is nice.
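In code, this means H_u is applied as two dot products and a vector update, never as an explicit matrix. The vectors below are arbitrary illustrations.

```python
import numpy as np

def apply_householder(u, z):
    """Compute H_u z = z - 2 (u.z / u.u) u in O(m) work, without ever
    forming the m x m matrix H_u."""
    return z - 2.0 * (u @ z) / (u @ u) * u

u = np.array([1.0, -2.0, 2.0])
z = np.array([3.0, 1.0, 4.0])
# The explicit matrix, built only for comparison with the O(m) version.
explicit = np.eye(3) - 2.0 * np.outer(u, u) / (u @ u)
```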
L. 13-72
1. Let a, b ∈ R^m with a ≠ b but ‖a‖ = ‖b‖. Then if we pick u = a − b, we have
H_u a = b.
2. Let a, b ∈ R^m with (a_k, ..., a_m) ≠ (b_k, ..., b_m) but
Σ_{j=k}^m a_j² = Σ_{j=k}^m b_j². If u = (0, 0, ..., 0, a_k − b_k, ..., a_m − b_m)ᵀ,
then H_u a = (a_1, ..., a_{k−1}, b_k, ..., b_m)ᵀ.
1. Using ‖a‖ = ‖b‖,
    H_u a = a − [2(‖a‖² − aᵀb)/(‖a‖² − 2aᵀb + ‖b‖²)] (a − b) = a − (a − b) = b.
2. This is a generalization of 1. The proof is just straightforward verification
using the next lemma.
These results are obvious if we draw some pictures in low dimensions.
L. 13-73
Suppose the first k − 1 components of u are zero, then
1. For every x ∈ Rm , Hu x does not alter the first k − 1 components of x.
2. If the last m − k + 1 components of y ∈ Rm are zero, then Hu y = y.
These are all obvious from the definition. All this says is that reflections don’t
affect components perpendicular to u, and in particular H_u fixes all vectors
perpendicular to u.
So we begin our algorithm: we want to clear the first column of A. Let a be the
first column of A, and assume a ∈ R^m is not already in the correct form, ie. a is
not a multiple of e_1. Then we define u = a ∓ ‖a‖e_1, where either choice of the
sign is pure-mathematically valid. However, we will later see that there is one
choice that is better when we have to deal with inexact arithmetic. Then by 1 of
[L.13-72], H_1 a = H_u a = ±‖a‖e_1, hence

            [ × × ··· × ]
            [ 0 × ··· × ]
    H_1 A = [ :  :      : ]
            [ 0 × ··· × ]
To do the next step, we need to be careful not to destroy the previously created
zeros. Let a′ be the second column of H_1 A, and assume a′_3, ..., a′_m are not
all zero, ie. (0, a′_2, ..., a′_m)ᵀ is not a multiple of e_2. We choose
u′ = (0, a′_2 ∓ γ, a′_3, ..., a′_m)ᵀ where γ = (Σ_{j=2}^m (a′_j)²)^{1/2}. Then by
2 of [L.13-72], H_2 a′ = H_{u′} a′ = (a′_1, ±γ, 0, ..., 0)ᵀ, hence

                [ × × ··· × ]
                [ 0 × ··· × ]
    H_2 H_1 A = [ 0 0 ··· × ]
                [ :  :      : ]
                [ 0 0 ··· × ]
The first column (and row) is unchanged by [L.13-73]. Suppose we have reached
H_{k−1} ⋯ H_1 A, where the first k − 1 columns are of the correct form. We consider
a^(k), the kth column of H_{k−1} ⋯ H_1 A, and assume
(0, ..., 0, a_k^(k), ..., a_m^(k))ᵀ is not a multiple of e_k. Choosing
    u^(k) = (0, ..., 0, a_k^(k) ∓ γ^(k), a_{k+1}^(k), ..., a_m^(k))ᵀ
    where γ^(k) = (Σ_{j=k}^m (a_j^(k))²)^{1/2},
we find that H_k a^(k) = H_{u^(k)} a^(k) = (a_1^(k), ..., a_{k−1}^(k), ±γ^(k), 0, ..., 0)ᵀ.
Note that H_k does not alter the first k − 1 rows and columns of H_{k−1} ⋯ H_1 A.
There is one thing left to decide — which sign to pick. As mentioned, this does
not matter in pure mathematics, but with inexact arithmetic we should pick the
sign in a_k ∓ γ such that a_k ∓ γ has maximum magnitude, ie.
a_k ∓ γ = a_k + sgn(a_k)γ. It takes some analysis to justify why this is the right
choice, but it is not too surprising that some choice is better: the other choice
can suffer severe cancellation when a_k is close to γ. Alternatively, we could
choose the sign so that the diagonal entries are non-negative; then if A has full
rank, at the end we can simply remove the redundant rows/columns to get the unique
skinny version of the factorisation.
So how do Householder and Givens compare? The Givens method generates zeros one
entry at a time, while Householder does it column by column, so in general the
Householder method is superior. However, for certain matrices with special
structure, we might want the extra delicacy of introducing zeros one at a time.
For example, if A already has a lot of zero entries in its lower triangular part
(eg. a band matrix), then it might be beneficial to remove the few non-zero
entries one by one.
Note that A = Q̃R̃ is exactly the skinny QR factorisation of A, so the skinny factorisa-
tion is sufficient to solve our least squares problem (although the norm of the residual
would have to be calculated in a different way).
Now suppose that A ∈ R^{m×n} has rank p < n, and that we have a QR factorisation
of A with R in standard form. Again the least squares problem simplifies to
min_x ‖Rx − Qᵀb‖. Since the last m − p rows of R are zero, this simplified
minimisation problem again reduces to R̃x* = Q̃ᵀb, where R̃ ∈ R^{p×n} contains the
first p rows of R and Q̃ ∈ R^{m×p} contains the first p columns of Q. This however
is an under-determined system with infinitely many solutions depending on n − p
free parameters. Nevertheless, since R is in standard form, we can easily describe
the solutions. For any such solution x* ∈ R^n, the residual for our least squares
problem is exactly the same:
    ‖Rx* − Qᵀb‖ = ( Σ_{i=p+1}^m (Qᵀb)_i² )^{1/2}.
CHAPTER 14
Electromagnetism
Electromagnetism is one of the four fundamental forces of the universe. Apart from
gravity, most daily phenomena can be explained by electromagnetism. The very exis-
tence of atoms requires the electric force (plus weird quantum effects) to hold electrons
to the nucleus, and molecules are formed again due to electric forces between atoms.
More macroscopically, electricity powers all our electrical appliances. Finally, it is the
force that gives rise to light, and allows us to see things.
When we study charged sheets or lines, the charge density is instead charge per
unit area or per unit length, but this will be clear from context.
Electric current describes the coherent motion of electric charge across some
surface S: it counts the amount of charge passing through the surface per unit
time. The motion of charge is described by the current density J(x, t), the
“current per unit area”, so that the current passing through a small surface dS
located at x at time t is J(x, t) · dS. For any surface S, the current through it
is then
    I = ∫_S J · dS,
which counts the charge per unit time passing through S. Intuitively, if the
charges in the charge distribution ρ(x, t) move with velocity v(x, t), then
(neglecting relativistic effects) we have J = ρv.
It is well known that charge is conserved — we cannot create or destroy charge.
(The charge of quarks is actually −e/3 or 2e/3; this doesn’t change the spirit of
the discussion, since we could just change the basic unit. But, apart from in
extreme circumstances, quarks are confined inside protons and neutrons, so we
rarely have to worry about this.) However, the conservation of charge does not
simply say that “the total charge in the universe does not change”. We want to
rule out scenarios where a charge on Earth disappears and instantaneously appears
on the Moon. So what we really want to say is that charge is conserved locally:
if it disappears here, it must have moved to
somewhere nearby. Alternatively, charge density can only change due to continuous
currents. This is captured by the:
<Continuity equation>  ∂ρ/∂t + ∇ · J = 0
We can write this in a more intuitive integral form via the divergence theorem.
The charge Q in some region V is defined to be Q = ∫_V ρ dV. So
    dQ/dt = ∫_V ∂ρ/∂t dV = −∫_V ∇ · J dV = −∫_S J · dS.
Hence the continuity equation states that the change in total charge in a volume
is given by the total current passing through its boundary. In particular, we can
take V = R³, the whole of space. If there are no currents at infinity, then
dQ/dt = 0. So the continuity equation implies the conservation of charge.
E. 14-1
A wire is a cylinder of cross-sectional area A. Suppose there are n electrons per
unit volume, each of charge q = −e. Then ρ = nq = −ne, J = nqv and I = nqvA.
The presence of 4π in this formula isn’t telling us anything deep about Nature;
it’s more a reflection of the definition of the Coulomb as the unit of charge.
The two constants can be thought of as characterising the strength of the electric
interactions and the strength of the magnetic interactions respectively.
14.1 Electrostatics
Electrostatics is the study of stationary charges in the absence of magnetic fields. We
take ρ = ρ(x), J = 0 and B = 0. We then look for time-independent solutions. In this
case, the only relevant equations are ∇ · E = ρ/ε0 and ∇ × E = 0, and the other two
equations just give 0 = 0. In this section, our goal is to find E for any ρ.
Strictly speaking, this only holds when the charges are not moving. However,
for most practical purposes, we can still use this because the corrections required
when they are moving are tiny.
Consider ρ(r) = ρ I[r < R], where I is the indicator function: a uniformly charged
solid sphere. Outside the sphere of radius R, i.e. for r > R, we know that
E(r) = (Q/4πε₀r²)r̂. Now suppose we are inside the sphere, so r < R. With
S = ∂B_r(0), we have
    (Q/ε₀)(r³/R³) = ∫_S E · dS = E(r)4πr²  ⟹  E = (Qr/4πε₀R³) r̂.
So the field increases with radius inside the sphere.
E. 14-3
<Line charge> Consider an infinite line with uniform charge density per unit
length η. We use cylindrical polar coordinates, so our line is along z and
r = √(x² + y²). By symmetry, the field E is radial, ie. E(r) = E(r)r̂. Pick S to
be a cylinder of length L and radius r. The end caps do not contribute to the
flux since the field lines are perpendicular to the normal there, while the
curved surface has area 2πrL. Then
    ηL/ε₀ = ∫_S E · dS = E(r)2πrL  ⟹  E = (η/2πε₀r) r̂.
Note that the field varies as 1/r, not 1/r². Intuitively, this is because we have
one more dimension of “stuff” compared to the point charge, so the field does not
drop as fast.
E. 14-4
<Surface charge> Consider an infinite plane z = 0 with uniform charge per unit
area σ. By symmetry, the field points vertically, and the field below the plane
is the opposite of that above, so E = E(z)ẑ with E(z) = −E(−z). Consider a
vertical Gaussian cylinder of height 2z (with z > 0) and cross-sectional area A.
Now only the end caps contribute, so
    σA/ε₀ = ∫_S E · dS = E(z)A − E(−z)A  ⟹  E(z) = σ/2ε₀,
a constant. Note that this electric field is discontinuous across the surface:
we have E(0⁺) − E(0⁻) = σ/ε₀. Another example is a spherical shell of radius R
with surface charge σ, which has electric field
    E = (σ/ε₀)(R/r)²  for r > R,   E = 0  for r < R.
So again we find a jump E(R⁺) − E(R⁻) = σ/ε₀. This is in fact a general result,
true for any surface and any σ. We can prove it by considering a small cylinder
across the surface and shrinking it indefinitely; then we find that
    n̂ · E₊ − n̂ · E₋ = σ/ε₀.
So the components of E normal to the surface are not continuous across the
surface.
However, the components of E tangential to the surface are continuous. To see
this, let t be any tangent to the surface at a point. Consider a small straight
line of length L in the direction of t at that point on the surface. We make two
copies of this line, one slightly above the surface and one slightly below, a
distance a apart, and we join their ends with straight lines of length a, so that
we have a rectangular loop C. We integrate E around the loop. Using Stokes’
theorem, we have
    ∮_C E · dr = ∫_S ∇ × E · dS = 0,
since ∇ × E = 0 in electrostatics. Letting a → 0, the short sides contribute
nothing, and we are left with L t · (E₊ − E₋) = 0: the tangential components of
E agree on both sides of the surface.
Point charge
Consider a point particle with charge Q at the origin; then ρ(r) = Qδ³(r), where
δ³ is the generalization of the usual delta function for (3D) vectors. The
equation we have to solve is
    ∇²φ = −(Q/ε₀) δ³(r).
Away from the origin r = 0 we have δ³(r) = 0, leaving the Laplace equation. From
the IA Vector Calculus course, the general solution is
    φ = α/r  for some constant α.
The constant α is determined by the delta function. We integrate the equation
over a sphere of radius r centered at the origin. Then
    −Q/ε₀ = ∫_V ∇²φ dV = ∫_S ∇φ · dS = ∫_S −(α/r²) r̂ · dS = −4πα
    ⟹  α = Q/4πε₀,  and so  E = −∇φ = (Q/4πε₀r²) r̂.
This is just what we get from Coulomb’s law.
Dipole
A dipole consists of two point charges, +Q and −Q at r = 0 and r = −d
respectively. To find the potential of a dipole, we simply apply the principle of
superposition and obtain
    φ = (1/4πε₀)(Q/r − Q/|r + d|).
This is not a very helpful result, but we can consider the case when we are far
away, ie. r ≫ d. To do so, we Taylor expand the second term. For a general f(r),
we have
    f(r + d) = f(r) + d · ∇f(r) + ½(d · ∇)²f(r) + ⋯.
Applying this to the term we are interested in gives
    1/|r + d| = 1/r + d · ∇(1/r) + ½(d · ∇)²(1/r) + ⋯
              = 1/r − (d · r)/r³ − ½( (d · d)/r³ − 3(d · r)²/r⁵ ) + ⋯.
More generally, for a charge distribution ρ supported in a volume V, the potential
is
    φ(r) = −(1/ε₀) ∫_V ρ(r′) G(r, r′) d³r′ = (1/4πε₀) ∫_V ρ(r′)/|r − r′| d³r′.
We can ask what φ and E look like very far from V, ie. |r| ≫ |r′|. We again use
the Taylor expansion:
    1/|r − r′| = 1/r − r′ · ∇(1/r) + ⋯ = 1/r + (r · r′)/r³ + ⋯.
Then we get
    φ(r) = (1/4πε₀) ∫_V ρ(r′) (1/r + (r · r′)/r³ + ⋯) d³r′
         = (1/4πε₀) (Q/r + (p · r̂)/r² + ⋯),
where
    Q = ∫_V ρ(r′) dV′,   p = ∫_V r′ρ(r′) dV′,   r̂ = r/‖r‖.
So if we have a huge lump of charge, we can consider it to be a point charge Q,
plus some dipole correction terms. Here p is again called the electric dipole
moment; this is the general form.
E. 14-5
<Field lines and equipotentials> Vectors are usually visualized using arrows,
where longer arrows represent larger vectors. However, this is not a practical
approach when it comes to visualizing fields, since a field assigns a vector to
every single point in space, and we don’t want to draw infinitely many arrows.
Instead, we use field lines. A field line is a continuous line tangent to the
electric field E. The density of lines is proportional to |E|. They begin and end
only at charges (and infinity), and never cross. We can also draw the
equipotentials, which are surfaces of constant φ. Because E = −∇φ, they are
always perpendicular to field lines.
[figure: field lines (solid) and equipotentials (dashed) for a positive point
charge, a negative point charge, and a dipole]
where we set φ(∞) = 0. Now consider N charges q_i at positions r_i. The total
potential energy stored is the work done to assemble these particles. Let’s put
them in one by one.
1. The first charge is free: the work done is W₁ = 0.
2. To place the second charge at position r₂ takes work
    W₂ = (1/4πε₀) q₁q₂/|r₁ − r₂|.
3. To place the kth charge takes the sum of such terms over the k − 1 charges
already present, and so on.
The total work done is
    U = Σ_{i=1}^N W_i = (1/4πε₀) Σ_{i<j} q_iq_j/|r_i − r_j|
      = (1/4πε₀) (1/2) Σ_{i≠j} q_iq_j/|r_i − r_j|.
We can write this in an alternative form. The potential at point r_i due to all
the other particles is
    φ(r_i) = (1/4πε₀) Σ_{j≠i} q_j/|r_i − r_j|,
so we can write
    U = (1/2) Σ_{i=1}^N q_i φ(r_i).
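The agreement between the sum over pairs i < j and the form ½Σᵢ qᵢφ(rᵢ) is easy to verify numerically. The charges and positions below are arbitrary illustrations, and we work in units where 1/(4πε₀) = 1.

```python
import numpy as np

# Arbitrary example charges and positions (units with 1/(4 pi eps0) = 1).
q = np.array([1.0, -2.0, 1.5])
r = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
n = len(q)

# Form 1: sum over unordered pairs i < j.
U_pairs = sum(q[i] * q[j] / np.linalg.norm(r[i] - r[j])
              for i in range(n) for j in range(i + 1, n))

# Form 2: half the sum of q_i times the potential from the *other* charges.
phi = np.array([sum(q[j] / np.linalg.norm(r[i] - r[j])
                    for j in range(n) if j != i) for i in range(n)])
U_phi = 0.5 * (q @ phi)
```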
There is an obvious generalization to continuous charge distributions:
U = (1/2) ∫ ρ(r)φ(r) d³r. Hence we obtain
    U = (ε₀/2) ∫ (∇ · E)φ d³r = (ε₀/2) ∫ [∇ · (Eφ) − E · ∇φ] d³r.
The first term is a total derivative and vanishes. In the second term, we use the
definition E = −∇φ and obtain
    U = (ε₀/2) ∫ E · E d³r.
This derivation of potential energy is not satisfactory. The final result shows that the
potential energy depends only on the field itself, and not the charges. However, the
result was derived using charges and electric potentials — there should be a way to
derive this result directly with the field, and indeed there is, however we will not do
this.
Also, we have waved our hands a lot when generalizing to continuous distributions,
which was not entirely correct. If we have a single point particle, the original discrete
formula implies that there is no potential energy. However, since the associated field
is non-zero, our continuous formula gives a non-zero potential. This does not mean
that the final result is wrong. It is correct, but it describes a more sophisticated (and
preferred) conception of “potential energy”. Again, we shall not go into the details
here.
14.1.4 Conductors
A conductor is a region of space which contains lots of charges that are free to
move. In electrostatic situations, we must have E = 0 inside a conductor (in the
interior where the charges can move in all directions freely). Otherwise, the charges
inside the conductor would move till equilibrium. This almost describes the whole
of the section: if you apply an electric field onto a conductor, the charges inside the
conductor move until the external field is cancelled out. Since E = 0 inside a conductor,
the potential φ must be a constant throughout the conductor (also on the surface by
continuity).
From this, we can derive a lot of results. Since 0 = ∇ · E = ρ/ε0 inside the conductor,
we must have ρ = 0. Hence any net charge must live on the surface. Note that
inside the conductor there can still be charge, just that in the interior the positive and
negative charges must balance out to give ρ = 0. Since φ is constant throughout the
conductor, the surface (in fact the whole) of the conductor is an equipotential. Hence
the electric field is perpendicular to the surface. This makes sense since any electric
field with components parallel to the surface would cause the charges to move.
Recall also from before that across a surface we have
n̂ · E_outside − n̂ · E_inside = σ/ε₀, where σ is the surface charge density. Since
E_inside = 0, we obtain
    E_outside = (σ/ε₀) n̂.
This allows us to compute the surface charge given the field, and vice versa.
E. 14-6
<Faraday Cage> Consider some region of space that doesn’t contain any
charges, surrounded by a conductor. The conductor sits at constant φ = φ0 while,
since there are no charges inside, we must have ∇2 φ = 0. But this means that
φ = φ0 everywhere. This is because, if it didn’t then there would be a maximum
or minimum of φ somewhere inside, violating the maximum principle. Or alterna-
tively we know φ = φ0 is the unique solution since we have a Dirichlet boundary
condition φ = φ0 on the boundary. Therefore, inside a region surrounded by a
conductor, we must have E = 0. This is a very useful result if you want to shield
a region from electric fields. In this context, the surrounding conductor is called
a Faraday cage.
E. 14-7
Consider a spherical conductor with total charge Q = 0. We put a positive plate
on the left and a negative plate on the right, creating a field from left to
right. With the conductor in place, since the electric field lines must be
perpendicular to the surface, they have to bend towards the conductor. Since
field lines end and start at charges, there must be negative charges on the left
of the conductor and positive charges on the right: we get an induced surface
charge.
To find the exact potential φ we’ll work in spherical polar coordinates and choose
the original, constant electric field (in the absence of the conductor) to point
in the ẑ direction, E = E₀ẑ. This has potential φ₀ = −E₀z = −E₀r cos θ. Take the
conducting sphere to have radius R and be centred on the origin. Let’s add to
this an image dipole at the origin; this makes sense, as the induced − and +
charges in the picture suggest. The resulting potential is
    φ = −E₀ (r − R³/r²) cos θ.
Since we’ve added a dipole term, we can be sure that this still solves the Laplace
equation outside the conductor (as the Laplace equation is linear). Moreover, by
construction, φ = 0 when r = R. This is all we wanted from our solution. An
alternative way to solve this is to use the general solution we obtained for the
axisymmetric Laplace equation in [E.6-30] and equate boundary conditions. The
induced surface charge can be computed by evaluating the electric field just
outside the conductor. It is
    σ/ε₀ = −∂φ/∂r |_{r=R} = E₀ (1 + 2R³/r³) cos θ |_{r=R} = 3E₀ cos θ.
We see that the surface charge is positive in one hemisphere and negative in the
other. The total induced charge averages to zero.
E. 14-8
Suppose we have a conductor that fills all space x < 0. We
ground it such that φ = 0 throughout the conductor. Then
we place a charge q at x = d > 0. We are looking for a
potential that corresponds to a source at x = d and satisfies φ=0 +
φ = 0 for x < 0. Since the solution to the Poisson equation is
unique, we can use the method of images to guess a solution
and see if it works — if it does, we are done.
To get the solution we want, we “steal” part of this potential and declare our
potential to be
φ = (1/4πε0) [ q/√((x−d)² + y² + z²) − q/√((x+d)² + y² + z²) ]   if x > 0
φ = 0   if x ≤ 0
Using this solution, we can immediately see that it satisfies Poisson's equation
both outside and inside the conductor. To complete our solution, we need to find
the surface charge required such that the equations are satisfied on the surface as
well.
To do so, we can calculate the electric field near the surface, and use the relation
σ = ε0 Eoutside · n̂. To find σ, we only need the component of E in the x direction:
Ex = −∂φ/∂x = (q/4πε0) [ (x−d)/((x−d)²+y²+z²)^{3/2} − (x+d)/((x+d)²+y²+z²)^{3/2} ]   for x > 0.

σ = ε0 Ex |_{x=0} = −(q/2π) d/(d² + y² + z²)^{3/2}.
The total surface charge is then given by ∫ σ dy dz = −q.
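This integral can also be checked numerically. A minimal sketch (with hypothetical values of q and d) integrates σ over the plane in polar coordinates:

```python
import numpy as np

q = 1.0   # point charge (C), hypothetical
d = 0.5   # distance of the charge from the plane (m), hypothetical

# sigma depends only on s^2 = y^2 + z^2, so integrate with dA = 2 pi s ds;
# truncate at s_max = 2000, leaving a tail of order q d / s_max
N = 2000000
ds = 2000.0 / N
s = (np.arange(N) + 0.5) * ds
sigma = -q * d / (2.0 * np.pi * (d**2 + s**2)**1.5)
Q_surface = np.sum(sigma * 2.0 * np.pi * s) * ds

print(Q_surface)  # close to -q
```

The induced surface charge exactly mirrors the point charge, which is why the image-charge trick works.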
E. 14-9
<Capacitors> We'll consider a parallel plate capacitor. Suppose there are two identical flat plane conductors with surface area A, parallel to each other a distance d apart, one carrying charge Q, the other charge −Q. We assume that the distance d between the surfaces is much smaller than √A. This means that we can neglect the effects that arise around the edges of the plates, and we're justified in assuming that the electric field between the two plates is the same as it would be if the plates were infinite in extent.
Then between the plates we have electric field E = (σ/ε0)ẑ, where σ = Q/A and the plates are separated in the z-direction, with the plate carrying charge Q at z = 0 and the plate carrying −Q at z = d. We define the capacitance C to be C = Q/V, where V is the voltage or potential difference, ie. the difference in the potential on the two conductors. Since E = −dφ/dz is constant, we must have

φ = −Ez + c   ⟹   V = φ(0) − φ(d) = Ed = Qd/(Aε0)
and the capacitance for parallel plates of area A, separated by distance d, is
C = Aε0 /d. Because V was proportional to Q, the charge has dropped out of our
expression for the capacitance. Instead, C depends only on the geometry of the
set-up. This is a general property.
Capacitors are usually employed as a method to store electrical energy. Using our
result in previous section, the energy stored in a parallel plate capacitor is
U = (ε0/2) ∫ E·E dV = (Aε0/2) ∫₀^d (σ/ε0)² dz = Q²/(2C).
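For concreteness, a short sketch with hypothetical plate dimensions confirms that the field energy, (ε0/2)E² times the volume between the plates, equals Q²/2C:

```python
eps0 = 8.854e-12   # vacuum permittivity (F/m)
A = 1e-2           # plate area (m^2), hypothetical
d = 1e-3           # plate separation (m), hypothetical
Q = 1e-8           # plate charge (C), hypothetical

C = A * eps0 / d               # capacitance depends only on geometry
U = Q**2 / (2.0 * C)           # energy stored in the capacitor

E = Q / (A * eps0)             # uniform field between the plates
U_field = 0.5 * eps0 * E**2 * A * d   # energy density times volume

print(abs(U - U_field) / U)    # agreement up to floating-point rounding
```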
14.2 Magnetostatics
Charges give rise to electric fields, currents give rise to magnetic fields. In this section,
we study the magnetic fields induced by steady currents . This means that we are
again looking for time independent solutions to the Maxwell equations. We will also
restrict to situations in which the charge density vanishes, so ρ = 0. We can then set
E = 0 and focus our attention only on the magnetic field. The remaining Maxwell’s
equations are ∇ × B = µ0 J and ∇ · B = 0. The objective is, given a J, find the
resultant B.
Before we start, what does the condition ρ = 0 mean? It does not mean that there
are no charges around. We want charges to be moving around to create current.
What it means is that the positive and negative charges balance out exactly. More
importantly, it stays that way. We don’t have a net charge flow from one place to
another. At any specific point in space, the amount of charge entering is the same as
the amount of charge leaving. This is the case in many applications. For example, in a
wire, all electrons move together at the same rate, and we don’t have charge building
up at parts of the circuit. Mathematically, we can obtain the interpretation from the continuity equation ∂ρ/∂t + ∇·J = 0. For steady currents we have ∂ρ/∂t = 0, so ∇·J = 0, which says that there is no net flow into or out of any point.
E. 14-10
<A long straight wire> A wire is a cylinder with current I flowing through it.
We use cylindrical polar coordinates (r, ϕ, z), where z is along the direction of the
current, and r points in the radial direction.
By symmetry, the magnetic field can only depend on the radius, and must lie in the x, y plane. Since we require that ∇·B = 0, we cannot have a radial component. So the general form is B(r) = B(r)φ̂. To find B(r), we use Ampère's law, integrating around a circle C bounding a disc S that cuts through the wire horizontally:

μ0 I = ∮_C B·dr = 2πr B(r).

So

B(r) = (μ0 I/2πr) φ̂.
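The same answer can be recovered without Ampère's law by numerically integrating the Biot–Savart law (stated later in this section) along a long, finite wire; the length here is a hypothetical stand-in for an infinite wire, and I and r are hypothetical values.

```python
import numpy as np

mu0 = 4e-7 * np.pi   # permeability of free space (H/m)
I = 2.0              # current (A), hypothetical
r = 0.05             # distance from the wire (m), hypothetical

# Biot-Savart: dB = (mu0 I / 4 pi) dl x (r - r') / |r - r'|^3;
# for a wire along z and a field point at (r, 0, 0), |dl x (r - r')| = r dz'
N = 2000000
dz = 2000.0 / N
z = -1000.0 + (np.arange(N) + 0.5) * dz
B_num = mu0 * I / (4.0 * np.pi) * np.sum(r / (r**2 + z**2)**1.5) * dz

B_ampere = mu0 * I / (2.0 * np.pi * r)
print(abs(B_num - B_ampere) / B_ampere)  # small relative error
```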
E. 14-11
<Surface current> Consider the plane z = 0 with
surface current density k (ie. current per unit length).
Take the x-direction to be the direction of the current,
and the z direction to be the normal to the plane.
To compute B inside, use Ampere’s law with a curve C. Note that only the vertical
part (say of length L) inside the cylinder contributes to the integral. Then
BL = ∮_C B·dr = μ0 I N L,

where N is the number of wires per unit length and I is the current in each wire (so INL is the total current enclosed). Hence B = μ0 I N.
B = (μ0 I/4π) ∮_C dr′ × (r − r′)/|r − r′|³.

A(r) = (μ0/4π) ∫ J(r′)/|r − r′| dV′ = (μ0 I/4π) ∮_C dr′/|r − r′|.
Far from the loop, |r′|/|r| is small, and we can use the Taylor expansion 1/|r − r′| = 1/r + r·r′/r³ + ···, so

A(r) = (μ0 I/4π) ∮_C ( 1/r + r·r′/r³ + ··· ) dr′.

Note that r is a constant of the integral, and we can take it out. The first 1/r term vanishes because it is a constant, and when we integrate along a closed loop, we get 0.
So we only consider the second term. We claim that for any constant vector g,

∮_C (g·r′) dr′ = S × g,   where   S = ∫ dS

is the vector area of the surface bounded by C. Now we have

A(r) ≈ (μ0/4π) (m × r)/r³   where   m = IS.
m = IS is called the magnetic dipole moment. Now

B = ∇ × A = (μ0/4π) ( 3(m·r̂)r̂ − m )/r³.

This is the same form as the E field of an electric dipole! Note, however, that the B field due to a current loop and the E field due to two charges do not look the same close up; they merely have identical dipole long-range fall-offs.
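On the axis of a circular current loop the exact field is elementary, which gives a cheap check of the dipole formula at large distance (hypothetical loop current and radius):

```python
import numpy as np

mu0 = 4e-7 * np.pi
I = 1.0    # loop current (A), hypothetical
R = 0.01   # loop radius (m), hypothetical
z = 1.0    # on-axis distance with z >> R

m = I * np.pi * R**2                                     # dipole moment m = I S
B_exact = mu0 * I * R**2 / (2.0 * (R**2 + z**2)**1.5)    # exact on-axis field of a loop
B_dipole = mu0 * m / (2.0 * np.pi * z**3)                # dipole formula on the axis

print(abs(B_exact - B_dipole) / B_exact)  # of order (R/z)^2
```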
∂′_j (J_j r′_i r′_k) = (∂′_j J_j) r′_i r′_k + J_i r′_k + J_k r′_i = J_k r′_i + J_i r′_k.
E. 14-14
<Two parallel wires> Consider two straight wires (with currents)
pointing in the z direction, one passes through (0, 0, 0), the other
passes through (d, 0, 0), so there is a distance of d between them. We
know that the field produced by each current is
µ0 Ii
Bi = ϕ̂.
2πr d
The particles on the second wire will feel a force
µ0 I1
F = qv × B1 = qv × ŷ.
2πd
14.3 Electrodynamics
So far, we have only looked at fields that do not change with time. However, in real
life, fields do change with time. We will now look at time-dependent E and B fields.
We'll explore the Maxwell equation ∇×E + ∂B/∂t = 0. In short, if the magnetic field changes in time, ie. ∂B/∂t ≠ 0, this creates an E that accelerates charges, which creates a current in a wire. This process is called induction.
Consider a wire, which is a closed curve C, bounding a surface S. We integrate over the surface S to obtain

∫_S (∇×E)·dS = −∫_S (∂B/∂t)·dS.

By Stokes' theorem the left-hand side is the emf E = ∮_C E·dr, while the right-hand side is −dΦ/dt, where Φ = ∫_S B·dS is the magnetic flux. So we obtain

<Faraday's law of induction>   E = −dΦ/dt,

true also for moving curves, as we'll show later. Faraday's law of induction says that when we change the magnetic flux through S, a current is induced. In
practice, there are many ways we can change the magnetic flux, such as by moving
bar magnets or using an electromagnet and turning it on and off.
The minus sign has a significance. When we change a magnetic field, an emf is created.
This induces a current around the wire. However, we also know that currents produce
magnetic fields. The minus sign indicates that the induced magnetic field opposes the initial change in magnetic field. If instead it reinforced the change, we would get runaway behaviour and the world would explode. This is known as Lenz's law.
E. 14-15
E. 14-16
There is another, related way to induce currents in the presence of a magnetic field: you can keep the field fixed, but move the wire. Perhaps the simplest example is shown in the figure: a rectangular circuit in which one of the wires is a metal bar (of length d) that can slide backwards and forwards. This whole set-up is placed in a magnetic field, which passes up, perpendicular through the circuit.
Slide the bar to the left with speed v. Each charge q will experience a Lorentz force
F = qvB in the counterclockwise direction. The emf, defined as the work done
per unit charge, is E = vBd because work is only done when particles (charges)
pass through the bar.
Meanwhile, the change of flux is dΦ/dt = −vBd, since the area decreases at a rate vd. We again have E = −dΦ/dt. Note that we obtain the same formula but different physics: we used the Lorentz force law, not Maxwell's equations.
Now we consider the general case: a moving loop C(t) (which need not be a circle) bounding a surface S(t). As the curve moves, it sweeps out a cylinder-like surface Sc. The change in flux is
Φ(t + δt) − Φ(t) = ∫_{S(t+δt)} B(t + δt)·dS − ∫_{S(t)} B(t)·dS

 = ∫_{S(t+δt)} ( B(t) + (∂B/∂t) δt )·dS − ∫_{S(t)} B(t)·dS + O(δt²)

 = δt ∫_{S(t)} (∂B/∂t)·dS + ( ∫_{S(t+δt)} − ∫_{S(t)} ) B(t)·dS + O(δt²)

We know that S(t + δt), S(t) and Sc together form a closed surface. Since ∇·B = 0, the integral of B over a closed surface is 0. So we obtain ∫_{S(t+δt)−S(t)+Sc} B(t)·dS = 0.
Hence we have

Φ(t + δt) − Φ(t) = δt ∫_{S(t)} (∂B/∂t)·dS − ∫_{Sc} B(t)·dS.
We can simplify the integral over Sc by writing the surface element as dS = (dr×v) δt.
Then B · dS = δt(v × B) · dr. So
dΦ/dt = lim_{δt→0} δΦ/δt = ∫_{S(t)} (∂B/∂t)·dS − ∮_{C(t)} (v×B)·dr = −∮_{C(t)} (E + v×B)·dr,

where the last equality comes from the Maxwell equation ∂B/∂t = −∇×E. Now the emf (work done per charge moving around the curve) is E = ∮_C (E + v×B)·dr, which includes the force tangential to the wire from both the electric field and the motion of the wire in the presence of the magnetic field. We obtain Faraday's law of induction, E = −dΦ/dt, for the most general case where the curve itself can change.
We can use the idea of inductance to compute the energy stored in magnetic fields.
The idea is to compute the work done in building up a current. As we build the
current, the change in current results in a change in magnetic field. This produces an
induced emf that we need work to oppose. The emf is given by
E = −dΦ/dt = −L dI/dt.
This opposes the change in current by Lenz’s law. In time δt, a charge Iδt flows around
C. The work done is
δW = E I δt = −LI (dI/dt) δt   ⟹   dW/dt = −LI dI/dt = −(1/2) L d(I²)/dt.
So the work done to build up a current is W = ½LI² = ½IΦ. Note that we dropped the minus sign because we switched from talking about the work done by the emf to the work done to oppose the emf.
This work done is identified with the energy stored in the system. Recall that the
vector potential A is given by B = ∇ × A. So
U = ½ I ∫_S B·dS = ½ I ∫_S (∇×A)·dS = ½ I ∮_C A·dr = ½ ∫_{R³} J·A dV
Writing J = (1/μ0) ∇×B and using the identity (∇×B)·A = ∇·(B×A) + B·(∇×A), we get U = (1/2μ0) ∫ ( ∇·(B×A) + B·B ) dV. Assuming that B×A vanishes sufficiently fast at infinity, the integral of the divergence term vanishes. So we are left with

U = (1/2μ0) ∫ B·B dV.
14.3.2 Resistance
The story so far is that we change the flux, an emf is produced, and charges are
accelerated. In principle, we should be able to compute the current. But accelerating
charges are complicated (they emit light). Instead, we invoke a new effect, friction. In
a wire, this is called resistance. In most materials, the effect of resistance is that E is
proportional to the speed of the charged particles, rather than the acceleration.
We can think of the particles as accelerating for a very short period of time, and then reaching a terminal velocity. So we have Ohm's law E = IR, where the constant R is called the resistance. Note that E = ∫ E·dr and E = −∇φ, so E = V, the potential difference. So Ohm's law can also be written as V = IR.
For the wire of length L and cross-sectional area A, we define the resistivity to be
ρ = AR/L, and the conductivity to be σ = 1/ρ. These are properties only of the
substance and not the actual shape of the wire. Now Ohm’s law reads J = σE. One
can formally derive Ohm’s law by considering the field and interactions between the
electron and the atoms, but we’ll not do it.
With resistance, we need to do work to keep a constant current. In time δt, the work
needed is δW = EIδt = I 2 Rδt using Ohm’s law. So the Joule heating , the energy
lost as heat in a circuit due to friction, is given by
dW
= I 2 R.
dt
E. 14-18
Take again the circuit with the sliding bar. Suppose that the sliding bar has resistance R, and the remaining parts of the circuit are superconductors with no resistance. There are two dynamical variables, the position of the bar x(t) and the current I(t).
If a current I flows, the force on the bar is F = Iℓ ŷ × B ẑ = IBℓ x̂, since Iℓ = qv, where q is the total charge in the bar and v the speed of the charges. Suppose the bar can slide without friction, so we have mẍ = IBℓ. We can compute the emf as

E = −dΦ/dt = −Bℓẋ.

By Ohm's law the current is I = E/R = −Bℓẋ/R, so

mẍ = −(B²ℓ²/R) ẋ   ⟹   ẋ(t) = ẋ(0) e^{−B²ℓ²t/mR}.
So the speed of the bar decays exponentially from its initial speed. Here we assumed that all the emf is induced, but we can easily modify this if we have, say, an external supply of emf E0 (eg. a battery connected to the circuit), in which case E = E0 − dΦ/dt.
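The exponential decay can be confirmed by integrating the equation of motion directly (the values of m, B, ℓ, R and the initial speed are hypothetical):

```python
import math

m, B, ell, R = 0.1, 0.5, 0.2, 1.0   # bar mass (kg), field (T), bar length (m), resistance (ohm); hypothetical
v0 = 1.0                            # initial speed (m/s)

# forward-Euler integration of m dv/dt = -(B^2 ell^2 / R) v
dt, T = 1e-5, 1.0
v = v0
for _ in range(int(T / dt)):
    v += dt * (-(B**2 * ell**2 / R) * v / m)

v_analytic = v0 * math.exp(-B**2 * ell**2 * T / (m * R))
print(abs(v - v_analytic) / v_analytic)  # tiny integration error
```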
We have now studied almost every Maxwell equation; one term remains. Recall that the final Maxwell equation is

∇ × B = μ0 J + μ0 ε0 ∂E/∂t.

Its time-independent version is Ampère's law ∇×B = μ0 J. But now that we allow the field to change with time, we have the μ0ε0 ∂E/∂t term, which we haven't previously discussed. Historically this term is called the displacement current. The need for this term was discovered purely mathematically: people found that Maxwell's equations would be inconsistent with charge conservation without it.
Without the term, the last equation is ∇ × B = µ0 J. Take the divergence of the
equation to obtain µ0 ∇ · J = ∇ · (∇ × B) = 0, which says that any current that flows
into a given volume has to also flow out. But we know that’s not always the case. To
give a simple example, we can imagine putting lots of charge in a small region and
watching it disperse. Since the charge is leaving the central region, the current does
not obey ∇ · J = 0, seemingly in violation of Ampere’s Law. In fact we know charge
conservation says that ρ̇ + ∇·J = 0. With the new term, however, taking the divergence yields

μ0 ( ∇·J + ε0 ∇·(∂E/∂t) ) = 0.

Since ε0 ∇·(∂E/∂t) = ε0 ∂(∇·E)/∂t = ρ̇ by the first Maxwell equation, this gives ∇·J + ρ̇ = 0. So with the new term, not only is Maxwell's equation consistent with charge conservation: it actually implies charge conservation.
(1/c²) ∂²E/∂t² − ∇²E = 0   where   c = 1/√(μ0 ε0).

So the speed of the wave is c = 1/√(μ0 ε0) ≈ 3 × 10⁸ m s⁻¹, which is the speed of light!
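This is a one-line check with the standard values of the constants:

```python
import math

mu0 = 4e-7 * math.pi        # permeability of free space (H/m)
eps0 = 8.8541878128e-12     # permittivity of free space (F/m)

c = 1.0 / math.sqrt(mu0 * eps0)
print(c)  # about 2.998e8 m/s
```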
We now look for plane wave solutions which propagate in the x direction and are independent of y and z, so we can write our electric field as E = (Ex(x, t), Ey(x, t), Ez(x, t)), where all derivatives with respect to y and z vanish. Since we know that ∇·E = 0, Ex must be constant in x. We take Ex = 0, since any constant electric field can always be added as a solution to the Maxwell equations, so wlog we choose this constant to vanish (in fact we also can't have Ex(x, t) = At). Also wlog, assume Ez = 0, ie. the wave propagates in the x direction and oscillates in the y direction. Then we look for solutions of the form E = (0, E(x, t), 0) with
(1/c²) ∂²E/∂t² − ∂²E/∂x² = 0.
The general solution is E(x, t) = f (x − ct) + g(x + ct). The most important solutions
are the monochromatic waves
E = E0 sin(kx − ωt).
where E0 is the amplitude , k is the wave number , and ω is the (angular) frequency .
The wave number is related to the wavelength by λ = 2π/k. Since the wave has to travel at speed c, we must have ω² = c²k². So the value of k determines the value of ω, and vice versa.
To solve for B, we use ∇×E = −∂B/∂t and ∇·B = 0, so B = (0, 0, B) for some B(x, t). (If instead we use ∇×B = μ0ε0 ∂E/∂t, we find we cannot have Ex(x, t) = At.) Hence the equation gives

∂B/∂t = −∂E/∂x   ⟹   B = (E0/c) sin(kx − ωt).
Note that this is uniquely determined by E, and we do not get to choose our favorite
amplitude, frequency etc for the magnetic component.
We see that E and B oscillate in phase, orthogonal to each other, and orthogonal to
the direction of travel. These waves are what we usually consider to be “light”. Also
note that Maxwell’s equations are linear, so we can add up two solutions to get a
new one. This is particularly important, since it allows light waves to pass through
each other without interfering. It is useful to use complex notation. The most general monochromatic wave takes the form

E = E0 exp(i(k·x − ωt)),   B = B0 exp(i(k·x − ωt)),   with   ω² = c²|k|².
k is called the wave vector , which is real. The “actual” solutions are just the real
part of these expressions. There are some restrictions to the values of E0 etc due to
the Maxwell’s equations:
∇·E = 0   ⟹   k·E0 = 0
∇·B = 0   ⟹   k·B0 = 0
∇×E = −∂B/∂t   ⟹   k×E0 = ωB0
If E0 and B0 are real, then k, E0 /c and B0 form a right-handed orthogonal triad of
vectors. A solution with real E0 , B0 , k is said to be linearly polarized . This says that
the waves oscillate up and down in a fixed plane. If E0 and B0 are complex, then the
polarization is not in a fixed direction. If we write E0 = α + iβ for α, β ∈ R3 , then
the “real solution” is
Re(E) = α cos(k · x − ωt) − β sin(k · x − ωt).
Note that ∇·E = 0 requires that k·α = k·β = 0. It is not difficult to see that this traces
out an ellipse. If E0 and B0 are complex, then it is said to be elliptically polarized .
In the special case where |α| = |β| and α · β = 0, this is circular polarization .
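A short sketch (hypothetical amplitudes α, β with k along ẑ) confirms that Re(E) traces out an ellipse:

```python
import numpy as np

# orthogonal amplitudes alpha, beta perpendicular to k = k z-hat; values hypothetical
alpha = np.array([2.0, 0.0, 0.0])
beta = np.array([0.0, 1.0, 0.0])

phase = np.linspace(0.0, 2.0 * np.pi, 1000)   # phase = k.x - omega t
E = np.outer(np.cos(phase), alpha) - np.outer(np.sin(phase), beta)

# every field vector lies on the ellipse (Ex/|alpha|)^2 + (Ey/|beta|)^2 = 1
on_ellipse = (E[:, 0] / 2.0)**2 + (E[:, 1] / 1.0)**2
print(np.allclose(on_ellipse, 1.0))  # True
```

With |α| = |β| the ellipse becomes a circle, which is the circular polarization mentioned above.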
E. 14-19
We can have a simple application: why metals are shiny. A metal is
a conductor. Suppose the region x > 0 is filled with a conductor. A
light wave is incident on the conductor, ie. Einc = E0 ŷ exp(i(kx + Einc
ωt)) with ω = ck. We know that inside a conductor, E = 0, and
at the surface, Ek = 0, that is E0 · ŷ|x=0 = 0. Then clearly our
solution above does not satisfy the boundary conditions! To achieve
the boundary conditions, we add a reflected wave
Eref = −E0 ŷ exp(i(−kx − ωt)).
Then our total electric field is E = Einc +Eref . Then this is a solution to Maxwell’s
equations since it is a sum of two solutions, and satisfies E · ŷ|x=0 = 0 as required.
Maxwell's equation ∇×E = −∂B/∂t then gives

Binc = (E0/c) ẑ exp(i(kx − ωt)),   Bref = (E0/c) ẑ exp(i(−kx − ωt)).
This obeys B · n̂ = 0, where n̂ is the normal to the surface. But we also have
B·ẑ|_{x=0⁻} = (2E0/c) e^{−iωt},
So there is a magnetic field at the surface. However, we know that inside the
conductor, we have B = 0. This means that there is a discontinuity across the
surface! We know that discontinuity happens when there is a surface current.
Using the formula we’ve previously obtained, we know that the surface current is
given by

K = (2E0/μ0 c) ŷ e^{−iωt}.
We see that the surface current oscillates with the frequency of the reflected wave.
So shining a light onto a metal will cause an oscillating current. We can imagine
the process as the incident light hits the conductor, causes an oscillating current,
which generates a reflected wave (since accelerating charges generate light). We
can do the same for light incident at an angle, and prove that the incident angle
is equal to the reflected angle.
Electromagnetic waves carry energy — that’s how the Sun heats up the Earth! We
will compute how much. The energy stored in a field in a volume V is
U = ∫_V ( (ε0/2) E·E + (1/2μ0) B·B ) dV.
dU/dt = ∫_V ( ε0 E·(∂E/∂t) + (1/μ0) B·(∂B/∂t) ) dV
 = ∫_V ( (1/μ0) E·(∇×B) − E·J − (1/μ0) B·(∇×E) ) dV,

using the Maxwell equations. Since B·(∇×E) − E·(∇×B) = ∇·(E×B), the divergence theorem gives

dU/dt = −∫_V J·E dV − (1/μ0) ∮_S (E×B)·dS.
Recall that the work done on a particle of charge q moving with velocity v is δW = qv·E δt. So the J·E term is the rate of work done on the charged particles in V (note that no work is done by the magnetic field). We can thus write
<Poynting theorem>   dU/dt + ∫_V J·E dV = −(1/μ0) ∮_S (E×B)·dS
Then the left-hand side is the combined change in energy of both fields and particles in
region V . Since energy is conserved, the right-hand side must describe the energy that
escapes through the surface S of region V. The Poynting vector is S = (1/μ0) E×B. This is a vector field; it tells us the magnitude and direction of the flow of energy at any point in space. We can write the Poynting theorem in differential form as ∂u/∂t + J·E + ∇·S = 0, where u is the energy density given by the integrand of U. This is similar to the continuity equation ∂ρ/∂t + ∇·J = 0, but instead of describing the movement of charge it describes the movement of energy.
Because the Poynting vector is quadratic in E and B, we're not allowed to use the complex form of the waves; otherwise imaginary parts would leak into the real part when we multiply E and B together. For a linearly polarized wave E = E0 sin(k·x − ωt) with B = (1/c) k̂ × E0 sin(k·x − ωt), we get

S = (E0²/cμ0) sin²(k·x − ωt) k̂.

The average over a period T = 2π/ω is thus ⟨S⟩ = (E0²/2cμ0) k̂.
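The factor of 1/2 is just the time average of sin²; a numerical sketch with a hypothetical amplitude E0:

```python
import numpy as np

mu0 = 4e-7 * np.pi
c = 2.998e8
E0 = 100.0   # wave amplitude (V/m), hypothetical

# average |S| = (E0^2 / c mu0) sin^2(phase) over one full period
N = 10000
phase = (np.arange(N) + 0.5) * 2.0 * np.pi / N
S_avg = np.mean(E0**2 / (c * mu0) * np.sin(phase)**2)

print(S_avg / (E0**2 / (2.0 * c * mu0)))  # ratio of the average to E0^2/(2 c mu0)
```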
Instead of the usual Euclidean metric, we use the Minkowski metric

η_{μν} = diag(+1, −1, −1, −1).
Given the funny dot product, we have to be slightly more careful about our summation
convention. If we simply write X µ X µ for the dot product, then we will get (ct)2 +
x2 + y 2 + z 2 , which is nonsense. Instead, we define a quantity
X_μ = (ct, −x, −y, −z).
Then the dot product will be given by Xµ X µ . This is the general rule of special
relativity — when contracting indices, we must have one index up and one index
down. Summing over indices on the same side is forbidden.
Since we know that the dot product is also given by X T ηX, we can view Xµ as a
shorthand Xµ = ηµν X ν . If we are given Xµ and want to obtain X µ , we have to
multiply by the inverse of ηµν , which fortunately is the same matrix (but with indices
up): η µν = diag(+1, −1, −1, −1). Hence X µ = η µν Xν . In general, the metric tensor
can be used to raise and lower indices. Note that since we have to distinguish between
indices-up and indices-down, we will in general not write things simply as X or η. We
always write the indices out as well. In special relativity (and electromagnetism), we
will have a lot more other 4-vectors, and the rules above all apply.
Lorentz transformations
The basic laws of relativity tell us how things look from the viewpoints of different
observers moving relative to each other. Suppose our first observer sits in an inertial
frame S with coordinates (ct, x, y, z), while the second sits in S 0 with coordinates
(ct0 , x0 , y 0 , z 0 ). Suppose S 0 is moving with speed v in the x direction relative to S.
Then the two coordinates are related by the Lorentz transform
ct′ = γ( ct − (v/c)x ),   x′ = γ( x − (v/c)ct ),   y′ = y,   z′ = z,

where γ = 1/√(1 − (v/c)²) and c = 299 792 458 m s⁻¹.
Note that we think in terms of ct instead of t, since ct has the same dimensions as x.
Also, we look at v/c instead of v itself since v/c is dimensionless. In general, changing
the frame of reference corresponds to applying a Lorentz transformation Λµν . Vectors
transform according to X^μ ↦ Λ^μ_ν X^ν.
Of course, not all matrices Λµν represent Lorentz transforms. To represent a Lorentz
transform, Λµν must obey Λρµ ηρσ Λσν = ηµν . This definition is analogous to the defini-
tion of orthogonal matrices, which can be written (in a convoluted way) as O_{ij} δ_{ik} O_{kℓ} = δ_{jℓ}. In particular, this definition requires that Λ^μ_ν preserves the Minkowski metric.
Indeed, if we define the (pseudo) inner product of two tensors X, Y as
hX, Y i = X µ Yµ = X µ ηµν Y ν
and we write ΛX for Λµν X ν , then the equation above just says that we need hΛX, ΛY i =
hX, Y i for all tensors X, Y — the classic definition of orthogonal matrices! So a Lorentz
transform really is just an orthogonal transform under the Minkowski metric.
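This condition is easy to verify numerically for a concrete boost (v/c = 0.6 chosen arbitrarily):

```python
import numpy as np

v_over_c = 0.6
g = 1.0 / np.sqrt(1.0 - v_over_c**2)   # gamma = 1.25

# boost along x acting on coordinates (ct, x, y, z)
L = np.array([[g, -g * v_over_c, 0.0, 0.0],
              [-g * v_over_c, g, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
eta = np.diag([1.0, -1.0, -1.0, -1.0])

# the Lorentz condition Lambda^T eta Lambda = eta
print(np.allclose(L.T @ eta @ L, eta))  # True
```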
Excluding the strange time-reversal transformation, which we discard, there are two
classes of Lorentz transformations:
1. Rotations: here R^T R = 1, ie. R is an orthogonal 3×3 matrix, and

   Λ^μ_ν = [ 1  0 ]
           [ 0  R ]

2. Boosts: eg. a boost in the x direction is

   Λ^μ_ν = [ γ      −γv/c   0   0 ]
           [ −γv/c   γ      0   0 ]
           [ 0       0      1   0 ]
           [ 0       0      0   1 ]
We know that a 4-vector transforms as X^μ ↦ Λ^μ_ν X^ν. How does X_μ transform? We have

X_μ ↦ X′_μ = η_{μν} X′^ν = η_{μν} Λ^ν_σ X^σ = η_{μν} Λ^ν_σ η^{σρ} X_ρ = Λ_μ^ρ X_ρ,

where we used the rules for lowering and raising indices in the last step. What is this mysterious Λ_μ^ρ? We can view it as the transpose of Λ^ρ_μ. It turns out that this is also the inverse: recall that Λ^ρ_μ η_{ρσ} Λ^σ_ν = η_{μν}. We multiply both sides by η^{ντ} to obtain Λ^ρ_μ η_{ρσ} Λ^σ_ν η^{ντ} = δ_μ^τ. So raising and lowering indices gives Λ^ρ_μ Λ_ρ^τ = δ_μ^τ. This is analogous to the fact that the transpose of an orthogonal matrix is its inverse.
So the electric fields are inverted and the magnetic field is intact. Both Fµν and
F µν are tensors, since they are constructed out of Aµ , ∂µ and ηµν , which themselves
transform nicely under the Lorentz group. Under a Lorentz transformation, we have F′^{μν} = Λ^μ_ρ Λ^ν_σ F^{ρσ}. For example, under a rotation Λ = ( 1 0 ; 0 R ), we find that E′ = RE and B′ = RB. Under a boost by v in the x-direction, we have

E′_x = E_x                B′_x = B_x
E′_y = γ(E_y − vB_z)      B′_y = γ(B_y + (v/c²)E_z)
E′_z = γ(E_z + vB_y)      B′_z = γ(B_z − (v/c²)E_y)
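These rules can be spot-checked numerically: applying the boost to arbitrary (hypothetical) field values leaves the combination B² − E²/c² unchanged, one of the Lorentz invariants discussed later in this section.

```python
import numpy as np

c = 2.998e8
v = 0.8 * c
g = 1.0 / np.sqrt(1.0 - (v / c)**2)

# arbitrary hypothetical field components (SI units)
Ex, Ey, Ez = 3.0, -1.0, 2.0
Bx, By, Bz = 1e-8, 4e-8, -2e-8

# boost in the x direction, using the transformation rules above
Exp, Eyp, Ezp = Ex, g * (Ey - v * Bz), g * (Ez + v * By)
Bxp, Byp, Bzp = Bx, g * (By + v * Ez / c**2), g * (Bz - v * Ey / c**2)

inv = (Bx**2 + By**2 + Bz**2) - (Ex**2 + Ey**2 + Ez**2) / c**2
inv_boosted = (Bxp**2 + Byp**2 + Bzp**2) - (Exp**2 + Eyp**2 + Ezp**2) / c**2
print(np.isclose(inv, inv_boosted, rtol=1e-9, atol=0.0))  # True
```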
E. 14-20
<Boosted line charge> An infinite line along the x direction with uniform charge per unit length η has electric field

E = η/(2πε0 (y² + z²)) (0, y, z)

and magnetic field B = 0. Plugging this into the transformation above, an observer in frame S′ boosted with v = (v, 0, 0), ie. parallel to the wire, sees

E′ = ηγ/(2πε0 (y² + z²)) (0, y, z) = ηγ/(2πε0 (y′² + z′²)) (0, y′, z′),

B′ = ηγv/(2πε0 c² (y² + z²)) (0, z, −y) = ηγv/(2πε0 c² (y′² + z′²)) (0, z′, −y′),
where we have y = y′ and z = z′ because the boost is in the x-direction. In frame S′, the charge per unit length is Lorentz contracted to γη. Since the charge density is now moving, the observer in frame S′ sees a current I′ = −γηv. The magnetic field can be written as

B′ = ( μ0 I′ / 2π√(y′² + z′²) ) φ̂′   where   φ̂′ = (1/√(y′² + z′²)) (0, −z′, y′)
is the basis vector of cylindrical coordinates. This is just the magnetic field due
to a current in a wire. This is what we calculated from Ampere’s law previously.
But we didn’t use Ampere’s law here. We used Gauss’ law (to get the electric
field), and then applied a Lorentz boost.
We see that magnetic fields are relativistic effects of electric fields. They are what
we get when we apply a Lorentz boost to an electric field. So relativity is not only
about very fast objects. It is there when you stick a magnet onto your fridge!
E. 14-21
<Boosted point charge> A boosted point charge generates a current, but is
not the steady current we studied in magnetostatics. As the point charge moves,
the current density moves. A point charge Q, at rest in frame S has
E = Q/(4πε0 r²) r̂ = Q/(4πε0 (x² + y² + z²)^{3/2}) (x, y, z)   and   B = 0.
γ²x′² + y′² + z′² = (γ² − 1)x′² + r′² = (v²γ²/c²) x′² + r′²
 = ( (v²γ²/c²) cos²θ + 1 ) r′² = γ²( 1 − (v²/c²) sin²θ ) r′²,

writing x′ = r′ cos θ.
Lorentz invariants
We can ask the question “are there any combinations of E and B that all observers
agree on?” With index notation, all we have to do is to contract all indices such
that there are no dangling indices. It turns out that there are two such possible
combinations. The first thing we might try would be
½ F_{μν} F^{μν} = −E²/c² + B²,
which works great. To describe the second invariant, we need to introduce a new
object in Minkowski space. Analogous to the εijk in R3 we define the anti-symmetric
tensor
ε^{μνρσ} = +1 if μνρσ is an even permutation of 0123, −1 if μνρσ is an odd permutation of 0123, and 0 otherwise.
Under a Lorentz transformation, ε′^{μνρσ} = Λ^μ_κ Λ^ν_λ Λ^ρ_α Λ^σ_β ε^{κλαβ}. Since ε^{μνρσ} is fully anti-symmetric, so is ε′^{μνρσ}. Similar to what we did in R³, we can show that the only
∂_μ F^{μν} = μ0 J^ν,    ∂_μ F̃^{μν} = 0.
As we said before, if we find the right way of writing equations, they look really simple!
We don’t have to worry ourselves with where the c and µ0 , ε0 go! Note that each law is
actually 4 equations, one for each of ν = 0, 1, 2, 3. Under a Lorentz boost, the equations
are not invariant individually. Instead, they all transform nicely by left-multiplication
of Λνρ .
We now check that these agree with the Maxwell equations. Start with the first one: when ν = 0, we are left with ∂_i F^{i0} = μ0 J^0, where i ranges over 1, 2, 3. This is equivalent to saying

∇·(E/c) = μ0 ρc,   or   ∇·E = c²μ0 ρ = ρ/ε0.

From the second equation,

∂_i F̃^{i0} = 0   ⟹   ∇·B = 0,    ∂_μ F̃^{μi} = 0   ⟹   ∂B/∂t + ∇×E = 0.
So we recover Maxwell’s equations. Then we now see why the J ν term appears in the
first equation and not the second — it tells us that there is only electric charge, not
magnetic charge.
We can derive the continuity equation from Maxwell’s equation here. Since ∂ν ∂µ F µν =
0 due to anti-symmetry, we must have ∂ν J ν = 0. Recall that we once derived the
continuity equation from Maxwell’s equations without using relativity, which worked
but is not as clean as this.
Finally, we recall the long-forgotten potential A_μ. With F_{μν} defined in terms of it, F_{μν} = ∂_μ A_ν − ∂_ν A_μ, the equation ∂_μ F̃^{μν} = 0 comes for free, since

∂_μ F̃^{μν} = ½ ε^{μνρσ} ∂_μ F_{ρσ} = ½ ε^{μνρσ} ∂_μ (∂_ρ A_σ − ∂_σ A_ρ) = 0,
where the last equality holds because of the symmetry of the two derivatives, combined
with the anti-symmetry of the ε-tensor. This means that we can also write the Maxwell
equations as
∂_μ F^{μν} = μ0 J^ν   where   F_{μν} = ∂_μ A_ν − ∂_ν A_μ.
dP⁰/dτ = (1/c) dE/dτ = (q/c) γ E·v   ⟹   dE/dt = qE·v,

which is the work done by an electric field.
E. 14-22
<Motion in a constant field> Consider a particle in a vanishing magnetic
field and constant electric field E = (E, 0, 0) (E here is not the energy) and
u = (u(t), 0, 0). Assuming that the particle starts from rest at t = 0, then Lorentz
force gives
m d(γu)/dt = qE   ⟹   mγu = qEt   ⟹   u = dx/dt = qEt/√(m² + q²E²t²/c²).

Note that u → c as t → ∞. We can integrate to find

x = (mc²/qE) ( √(1 + q²E²t²/(m²c²)) − 1 ).
For small t, x ≈ ½(qE/m)t², which is the usual non-relativistic result for a particle undergoing constant acceleration in a straight line.
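Both claims can be checked numerically with electron-like values (the field strength is hypothetical):

```python
import numpy as np

c = 2.998e8
q, m, E = 1.6e-19, 9.11e-31, 1e5   # charge (C), mass (kg), field (V/m); E hypothetical

# u(t) from the text satisfies the first integral m gamma u = q E t
t = np.linspace(1e-12, 1e-8, 1000)
u = q * E * t / np.sqrt(m**2 + q**2 * E**2 * t**2 / c**2)
gamma = 1.0 / np.sqrt(1.0 - (u / c)**2)
print(np.allclose(m * gamma * u, q * E * t, rtol=1e-6, atol=0.0))  # True

# and x(t) reduces to the Newtonian (1/2)(qE/m) t^2 at small t
t0 = 1e-11
x0 = (m * c**2 / (q * E)) * (np.sqrt(1.0 + (q * E * t0 / (m * c))**2) - 1.0)
print(abs(x0 - 0.5 * q * E * t0**2 / m) / x0)  # small relative deviation
```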
E. 14-23
<Motion in constant magnetic field> Now suppose we have no electric field
and B = (0, 0, B). In the non-relativistic world, we know that particles turn circles
with frequency ω = qB/m. Then
dP⁰/dτ = qF^{0ν}U_ν = 0   ⟹   E = mγc² = constant.

So |u| is constant. Now

m d(γu)/dt = qu × B   ⟹   mγ du/dt = qu × B,

since |u|, and hence γ, is constant. This is the same equation as the non-relativistic case, except for the extra factor of γ. The particle goes in circles with frequency ω = qB/(mγ).
CHAPTER 15
Fluid Dynamics
In real life, we encounter a lot of fluids. For example, there is air and water. These
are known as Newtonian fluids, whose dynamics follow relatively simple equations.
This is fundamentally because they have simple composition — they are made up of
simple molecules. There are non-Newtonian fluid like toothpaste and shampoo, which
are more complex molecularly speaking and have complex properties. Sand, rice and
foams also have some fluid-like properties, although they are fundamentally made of
small granular solids. Here we will study only Newtonian fluids.
There are many applications of fluid dynamics. On a small scale, the dynamics of fluids
in cells is important for biology. On a larger scale, the fluid flow of the mantle affects
the movement of tectonic plates, while the dynamics of the atmosphere can be used
to explain climate and weather. On an even larger scale, we can use fluid dynamics to
analyse the flow of galactic systems in the universe.
15.0.6 Preliminaries
A fluid is a material that flows. A Newtonian fluid is a fluid with a linear relation-
ship between stress and rate of strain. The constant of proportionality is called the
(dynamic) viscosity . We will consider such fluids with viscosity, although sometimes
we will make a simplifying assumption, the inviscid approximation , where we set the
viscosity to 0. When we do not assume zero viscosity, we say we have a viscous flow .
Stress is force per unit area. For example, pressure is a stress. Strain is the extension
per unit length, and the rate of strain d(strain)/dt here is concerned with gradients
(w.r.t. space) of velocity. These quantities are in fact tensor fields, but we will not
treat them as such here. We will just consider “simplified” cases. Concepts we define
here will become clearer when we start to write down our equations later.
Stresses are the forces that exist inside the fluid. If we have a boundary, we can classify
the stress according to the direction of the force — whether it is normal to or parallel
to the boundary. The boundary can either be an actual physical boundary, or an
imaginary surface we cook up in order to compute things.
Normal stress
Suppose we have a fluid with pressure p acting on a surface with unit normal n,
pointing into the fluid. The normal stress is τp = −pn.
The normal stress is present everywhere, as long as
we have a fluid (with pressure). However, pressure by itself does not do anything,
since pressure acts in all directions, and the net effect cancels out. However, if we have
a pressure gradient, then it gives an actual force and drives fluid flow. For example,
suppose we have a pipe, with the pump on the left:
(Diagram: a pipe pumped from the left, with high pressure on the left and low, atmospheric pressure p_atm on the right; the body force −∇p points from high to low pressure.)
Then this gives a body force that drives the water from left to right.
Tangential stress
In short the tangential stress exerted by a fluid on a surface with normal n pointing
into the fluid is
$$\tau_s = \mu\frac{\partial u}{\partial n} = \mu\,\mathbf{n}\cdot\nabla u$$
where µ is the dynamic viscosity of the fluid and u is the flow velocity along the surface.
This tangential stress is exerted by the fluid on the side of the surface into which the
normal points.
Boundary conditions
How does a fluid behave at a physical boundary? In general, Newtonian fluids satisfy
(as experimentally shown) one of the following two boundary conditions:
1. No-slip condition : More precisely this is the no-slip no-penetration condition,
   which requires that at the boundary, the fluid velocity equals the velocity of the
   boundary. In particular, if the boundary is stationary, the fluid velocity is zero at
   the boundary. This no-slip condition is normally applied when we have a fluid-solid
   boundary, where we think of the fluid as “sticking” to the solid boundary. The
   no-slip condition of course only makes sense when we have a viscous fluid. In the
   case where we have an inviscid flow (ie. 0 viscosity) we simply apply the
   no-penetration condition and allow the fluid to slip.
2. Stress condition : At the boundary (with normal n pointing into the fluid), a
   tangential stress τ is imposed on the fluid. In this case,
   $$-\mu\frac{\partial u_T}{\partial n} = \tau.$$
   This stress condition is common when we have a fluid-fluid boundary (like liquid
   and gas), where we require the tangential stresses to match up.
(Diagram: streamline patterns at t = 0 and t = 1.)
Indeed the streamlines at time t ≠ 0 satisfy dy/dx = 1/t, so the streamlines are
y = x/t + c for c a constant. For t = 0, the streamline (x(s), y(s)) satisfies
dx/ds = 0 and dy/ds = 1, so (x(s), y(s)) = (c, s), and the streamlines are x = c.
What are the streaklines? At time t = T , suppose we wish to find the locus of points
that have passed through (x0 , y0 ) in the past. We know particle paths satisfy
x = t²/2 + A and y = t + B, so at time T we have x = T²/2 + A and y = T + B. For the
particle to have passed through (x0 , y0 ) at some point in the past we need A and B to
satisfy x0 = t′²/2 + A and y0 = t′ + B for some t′ < T . Hence we have x = T²/2 + x0 − t′²/2
and y = T + y0 − t′. Eliminating t′ we have x = T²/2 + x0 − (y − T − y0)²/2. So
the streakline is x − x0 = (y − y0)T − (y − y0)²/2.
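The algebra above can be checked by placing particles on their paths directly (a short sketch; the values of x0, y0 and T are arbitrary):

```python
# Streakline check for the flow u = (t, 1): a particle that passed through
# (x0, y0) at time t' < T sits, at time T, at
#   x = T^2/2 + x0 - t'^2/2,   y = T + y0 - t'.
# All such particles should lie on x - x0 = (y - y0) T - (y - y0)^2 / 2.
x0, y0, T = 1.0, 2.0, 3.0

for k in range(30):
    tp = T * k / 30.0                         # release time t' in [0, T)
    x = 0.5 * T**2 + x0 - 0.5 * tp**2
    y = T + y0 - tp
    assert abs((x - x0) - ((y - y0) * T - 0.5 * (y - y0)**2)) < 1e-12
print("all sampled particles lie on the derived streakline")
```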
• A parallel flow is a flow where the fluid flows in only one direction, and the velocity
  depends only on the coordinate perpendicular to the flow.
  For example u = (u(y), 0, 0) is a parallel flow: the fluid only flows in the x direction,
  and it only depends on y, which is perpendicular to the x-z plane. Note that our
velocity does not depend on the x direction. This can be justified by the assumption
that the fluid is incompressible. If we had a velocity gradient in the x-direction, then
we will have fluid “piling up” at certain places, violating incompressibility.
We will give a formal definition of incompressibility later. In general, though, fluids are
not incompressible. For example, sound waves are exactly waves of compression in air,
and cannot exist if air is incompressible. So we can alternatively state the assumption
of incompressibility as “sound travels at infinite speed”. Hence, compressibility matters
mostly when we are travelling near the speed of sound. If we are moving in low speeds,
we can just pretend the fluid is indeed incompressible. This is what we will do mostly,
since the speed of sound is in general very high.
We first consider the x direction. There are normal stresses on all the sides, and
tangential stresses at the top and bottom. The sum of forces in the x-direction (per
unit transverse width) gives
$$-\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial y^2} = 0.$$
In the y direction, only the pressure contributes:
$$-\frac{\partial p}{\partial y} = 0.$$
In the second equation, we keep the negative sign for consistency, but obviously in this
case it is not necessary.
We can extend this result a bit by allowing non-steady flows and external forces on the
fluid. Then the velocity is of the form u = (u(y, t), 0, 0). Writing the external body
force (per unit volume) as (fx , fy , 0), we obtain the equations
$$\rho\frac{\partial u}{\partial t} = -\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial y^2} + f_x, \qquad 0 = -\frac{\partial p}{\partial y} + f_y.$$
The derivation of these equations is straightforward. Here ρ is the density, ie. the
mass per unit volume. The following table gives the approximate values of ρ and µ for
water and air
$$\frac{\partial^2 u}{\partial y^2} = 0.$$
E. 15-4
<Poiseuille flow> Consider a flow driven by a pressure gradient between stationary
boundaries. We have a high pressure P1 on the left, and a low pressure P0 < P1 on
the right. We solve this problem again, but we will also include gravity. So the
equations of motion become
$$-\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial y^2} = 0, \qquad -\frac{\partial p}{\partial y} - \rho g = 0.$$
The boundary conditions are u = 0 at y = 0, h. The second equation implies
p = −ρgy + f (x) for some function f . So ∂p/∂x = f′(x). Substituting into the first
gives
$$\mu\frac{\partial^2 u}{\partial y^2} = f'(x).$$
The left is a function of y only, while the right depends only on x. So both must be
constant, say G. Write L for the length of the tube. Using the boundary conditions,
$$P_0 - P_1 = \int_0^L \frac{\partial p}{\partial x}\,dx = \int_0^L f'(x)\,dx = LG,$$
so we get
$$\mu\frac{\partial^2 u}{\partial y^2} = f'(x) = G = -\frac{P_1 - P_0}{L} \implies u = \frac{G}{2\mu}\,y(y - h) = \frac{-G}{2\mu}\,y(h - y).$$
Here the velocity is the greatest at the middle, where y = h/2. Since the equations
of motion are linear, if we have both a moving boundary and a pressure gradient,
we can just add the two solutions up.
3. We can also calculate the force exerted on the boundary by the fluid. Recall that
the tangential stress τs is the tangential force per unit area exerted by the fluid on
the surface, given by
$$\tau_s = \mu\frac{\partial u}{\partial n}, \quad\text{where } \mathbf{n} \text{ points into the fluid.}$$
We first solve the y momentum equation. The force (per unit area) in the y direction
is −ρg δy cos α. Hence the equation is
$$\frac{\partial p}{\partial y} = -\rho g\cos\alpha \implies p = p_0 - \rho g(y - h)\cos\alpha,$$
and the x momentum equation gives
$$\mu\frac{\partial^2 u}{\partial y^2} = -\rho g\sin\alpha \implies u = \frac{\rho g\sin\alpha}{2\mu}\,y(2h - y),$$
using the no-slip condition u = 0 when y = 0 and also the condition that there is no
stress at y = h, so that ∂u/∂y = 0 when y = h. This is a bit like the Poiseuille flow,
with −ρg sin α playing the role of the pressure gradient G. But instead of going to zero
at y = h, the velocity reaches zero at y = 2h. So this is half a Poiseuille flow.
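Both of the viscous profiles above (Poiseuille and the gravity-driven film) are easy to sanity-check numerically; the parameter values below are illustrative, not values from the notes:

```python
import math

# Check that u_P(y) = -G/(2 mu) y (h - y) satisfies mu u'' = G with
# u(0) = u(h) = 0, and that u_F(y) = rho g sin(a)/(2 mu) y (2h - y)
# satisfies mu u'' = -rho g sin(a) with u(0) = 0 and u'(h) = 0.
mu, h, rho, g, alpha = 1e-3, 0.01, 1000.0, 9.81, 0.3
G = -10.0                                   # pressure gradient (Pa/m), G < 0

def u_P(y): return -G / (2 * mu) * y * (h - y)
def u_F(y): return rho * g * math.sin(alpha) / (2 * mu) * y * (2 * h - y)

d = 1e-5
def dd(f, y):                               # second derivative, central difference
    return (f(y + d) - 2 * f(y) + f(y - d)) / d**2

assert abs(mu * dd(u_P, h / 2) - G) < 1e-4 * abs(G)
assert u_P(0.0) == 0.0 and abs(u_P(h)) < 1e-12
assert abs(mu * dd(u_F, h / 2) + rho * g * math.sin(alpha)) < 1e-2
assert abs((u_F(h + d) - u_F(h - d)) / (2 * d)) < 1e-6   # no stress at y = h
print("mid-channel Poiseuille speed:", u_P(h / 2))        # fastest at y = h/2
```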
$$\frac{\partial u}{\partial t} = \nu\frac{\partial^2 u}{\partial y^2} \quad\text{where}\quad \nu = \frac{\mu}{\rho}.$$
This is clearly the diffusion equation, with diffusivity ν, the kinematic viscosity. We
can view this as the diffusion coefficient for motion/momentum/vorticity. The diffusion
equation can be solved by the methods of IB Methods (including Fourier series or
Fourier/Laplace transforms in time etc.). The boundary conditions are u = 0 at t = 0
and u → 0 as y → ∞ for all t. The other boundary condition is obviously u = U at
y = 0, for t > 0.
Before we start, we try to do some dimensional analysis to gain some intuition about
the problem. We will try to figure out the approximate scales of things in our system, in
particular we will try to figure out how far we have to go away from the boundary before
we don’t feel any significant motion and how fast the movement of fluid propagates
up the y axis. We are already provided with a velocity U , which we use as our
characteristic speed. We let T be the time scale which we care about. We note that
in this case, we don’t really have an extrinsic length scale — in the case where we
have two boundaries, the distance between them is a natural length scale to compare
with, but here the fluid is infinitely thick. So what is a characteristic length scale
δ corresponding to the chosen time scale? Well, putting them into our differential
equation we obtain
$$\frac{U}{T} \sim \nu\frac{U}{\delta^2} \implies \delta \sim \sqrt{\nu T}.$$
So we expect the decay length of u up the y axis to be O(√(νt)), and we expect the
movement of fluid to propagate up the y axis like √(νt).
We now solve the problem properly. By symmetry of the system we expect the solution
u to depend only on the variables y and t. In an infinite domain with no extrinsic length
scale, the diffusion equation admits a similarity solution. Note that if u(y, t) is a
solution to our problem, then so is ũ(y, t) = u(λy, λ²t) for any λ > 0. If we are to have a
unique solution to the problem, the two solutions must be the same; that is, our solution
must be self-similar in the sense that u(y, t) = ũ(y, t) = u(λy, λ²t). This suggests that
our solution should only depend on the single variable y/√t.
Indeed if we write u(y, t) = U f (η), where f (η) is a dimensionless function of the
dimensionless variable η = y/δ = y/√(νt), and substitute this form of the solution into
the differential equation, we get
$$-\frac{1}{2}\eta f'(\eta) = f''(\eta) \quad\text{with boundary condition } f = 1 \text{ on } \eta = 0.$$
The solution is by definition f = erfc(η/2), where
$$\operatorname{erfc}(z) = 1 - \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_z^\infty e^{-s^2}\,ds, \quad\text{so}\quad u = U\operatorname{erfc}\left(\frac{y}{2\sqrt{\nu t}}\right), \qquad \delta \sim \sqrt{\nu t}.$$
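We can check the similarity solution numerically (a sketch; U and ν below are arbitrary illustrative values), confirming that u = U erfc(y/2√(νt)) satisfies the diffusion equation and the boundary conditions:

```python
import math

# Finite-difference check that u(y, t) = U erfc(y / (2 sqrt(nu t)))
# solves u_t = nu u_yy, with u = U at the wall and u -> 0 far away.
U, nu = 1.0, 1e-2

def u(y, t):
    return U * math.erfc(y / (2 * math.sqrt(nu * t)))

y0, t0 = 0.05, 2.0
hy, ht = 1e-4, 1e-4
u_t  = (u(y0, t0 + ht) - u(y0, t0 - ht)) / (2 * ht)
u_yy = (u(y0 + hy, t0) - 2 * u(y0, t0) + u(y0 - hy, t0)) / hy**2
assert abs(u_t - nu * u_yy) < 1e-6

assert u(0.0, 1.0) == U          # u = U on the boundary y = 0
assert u(10.0, 1.0) < 1e-12      # u -> 0 as y -> infinity
print("diffusion-equation residual:", abs(u_t - nu * u_yy))
```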
Using our table of values for kinematic viscosities, we find that ν_air ≈ 20 ν_water, so
motion is induced further/faster into air than into water. We can also compute the
tangential stress of the fluid in the above case to be
$$\tau_s = \mu\left.\frac{\partial u}{\partial y}\right|_{y=0} = -\mu U\,\frac{e^{-y^2/4\nu t}}{\sqrt{\pi\nu t}}\bigg|_{y=0} = -\frac{\mu U}{\sqrt{\pi\nu t}}.$$
Using values for viscosities, we find that µ/√ν is about 1 for water and about 5 × 10⁻³
for air. So water exerts a much greater shear stress for the same motion. This is
significant in, say, the motion of ocean currents. When the wind blows, it causes the
water in the ocean to move along with it. This happens in such a way that the
tangential stresses of the two fluids match at the boundary. Hence we see that even if the
air blows really quickly, the resultant ocean current is much smaller, say a hundredth
to a thousandth of it.
15.2 Kinematics
15.2.1 Material time derivative
We first want to consider the problem of how we can measure the change in a quantity,
say f . This might be pressure, velocity, temperature, or anything else you can imagine.
The obvious thing to do would be to consider the time derivative ∂f/∂t. In physical
terms, this would be equivalent to fixing our measurement instrument at a point and
measuring the quantity over time. This is known as the Eulerian picture . However,
often we want to consider something else. We pretend we are a fluid particle, and move
along with the flow. We then measure the change in f along our trajectory. This is
known as the Lagrangian picture .
Let’s look at these two pictures, and see how they relate to each other. Consider a
time-dependent field f (x, t). For example, it might be the pressure of the system, or
the temperature of the fluid. Consider a path x(t) through the field, and we want to
know how the field varies as we move along the path. Along the path x(t), the chain
rule gives
$$\frac{df}{dt}(\mathbf{x}(t), t) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} + \frac{\partial f}{\partial t} = \nabla f\cdot\dot{\mathbf{x}} + \frac{\partial f}{\partial t}.$$
If x(t) is the (Lagrangian) path followed by a fluid particle, then necessarily ẋ(t) = u
by definition. In this case, we write this as
$$\frac{Df}{Dt} = \mathbf{u}\cdot\nabla f + \frac{\partial f}{\partial t}.$$
This is called the material derivative or Lagrangian derivative , which is the change
in f as we follow the path. On the right hand side of this equation, the first term
u · ∇f is the advective derivative , the change due to change in position; the second
term ∂f/∂t is the Eulerian time derivative , the change at a fixed point.
For example, consider a river that gets wider as we go downstream. We know (from,
say, experience) that the flow is faster upstream than downstream. If the motion is
steady, then the Eulerian time derivative vanishes, but the Lagrangian derivative does
not, since as the fluid goes down the stream, the fluid slows down, and there is a spatial
variation.
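The relation Df/Dt = ∂f/∂t + u·∇f is easy to verify numerically for a concrete field and flow (a sketch; the rotating flow u = (y, −x) and the field f = x² + yt are arbitrary choices made for illustration):

```python
import math

# For u = (y, -x), a particle starting at (1, 0) follows x(t) = cos t,
# y(t) = -sin t. Compare Df/Dt = f_t + u . grad f against the rate of
# change of f measured along that particle path.

def f(x, y, t):
    return x**2 + y * t

def material_derivative(x, y, t):
    dfdt = y                       # partial f / partial t
    gradf = (2 * x, t)             # (f_x, f_y)
    vel = (y, -x)                  # velocity field u
    return dfdt + vel[0] * gradf[0] + vel[1] * gradf[1]

def along_path(t):
    return f(math.cos(t), -math.sin(t), t)

h = 1e-6
for t in (0.2, 1.0, 2.5):
    fd = (along_path(t + h) - along_path(t - h)) / (2 * h)  # d/dt following the particle
    x, y = math.cos(t), -math.sin(t)
    assert abs(fd - material_derivative(x, y, t)) < 1e-6
print("Df/Dt matches the rate of change along the particle path")
```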
Often, it is the Lagrangian derivative that is relevant, but the Eulerian time derivative
is what we can usually put into differential equations. So we will need both of them,
and relate them by the formula above. A final remark: later we will take the material
time derivative of a vector field, say f, so Df/Dt = u · ∇f + ∂f/∂t. While u · ∇f might
seem confusing, it is just u · ∇f = Σᵢ (u · ∇fᵢ)eᵢ = (u · ∇)f.
We have the negative sign since we picked the outward normal, and hence the integral
measures the outward flow of fluid. This makes sense since if we have flow going out,
then the mass should decrease. Since the domain is fixed, we can interchange the
derivative and the integral on the left; on the right, we can use the divergence theorem.
Then we get
$$\int_D\left(\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u})\right)dV = 0.$$
Since D was arbitrary, everywhere in space we must have
<Conservation law>
$$\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u}) = 0 \quad\text{or equivalently}\quad \frac{D\rho}{Dt} + \rho\nabla\cdot\mathbf{u} = 0.$$
This is the general form of a conservation law — the rate of change of “stuff” density
plus the divergence of the “stuff flux” is constantly zero. Similar conservation laws
appear everywhere in physics. In the (first) conservation equation, we can expand
∇ · (ρu) to get
$$\frac{\partial\rho}{\partial t} + \mathbf{u}\cdot\nabla\rho + \rho\nabla\cdot\mathbf{u} = 0 \implies \frac{D\rho}{Dt} + \rho\nabla\cdot\mathbf{u} = 0,$$
since the first two terms together are just the material derivative of ρ. With the conservation of mass,
we can now properly say what incompressibility is. What exactly happens when we
compress a fluid? In order to conserve mass, the density must increase. We say a fluid
is incompressible if the density of a fluid particle does not (cannot) change. If we don't
allow changes in density, then the material derivative Dρ/Dt must vanish. Hence an
incompressible flow satisfies ∇ · u = 0; this is known as the continuity equation .
Of course, incompressibility is just an approximation. Since we can hear things, ev-
erything must be compressible. So what really matters is whether the speed is small
compared to the speed of sound. If it is relatively small, then incompressibility is a
good approximation. In air, the speed of sound is approximately 340 m s−1 . In water,
the speed of sound is approximately 1500 m s−1 .
E. 15-5
For a flow u = (u, 0, 0) which flows in one direction, if the flow is incompressible,
we must have ∂u/∂x = ∇ · u = 0. So we consider u of the form u = u(y, z, t).
The ψ such that A = (0, 0, ψ) is called the stream function . Alternatively we can
derive this by noting that the incompressibility condition ∂u/∂x + ∂v/∂y = 0 means that
−v dx + u dy = 0 is an exact differential, hence we can find ψ such that ∇ψ = (−v, u).
This stream function is both physically significant, and mathematically convenient, as
we will soon see. We look at some properties of the stream function. The first thing
we can do is to look at the contours ψ = c. These have normal n = ∇ψ = (ψx , ψy , 0).
We immediately see that
$$\mathbf{u}\cdot\mathbf{n} = \frac{\partial\psi}{\partial y}\frac{\partial\psi}{\partial x} - \frac{\partial\psi}{\partial x}\frac{\partial\psi}{\partial y} = 0.$$
So the flow is perpendicular to the normal, ie. tangent to the contours of ψ. So
the contours of the stream function ψ are in fact the streamlines, which describe an
instantaneous picture of the flow. It is also worth noting that in this case we have u =
∇×(0, 0, ψ) = (∇ψ)×(0, 0, 1). Also, note that ψ must be constant on a stationary rigid
boundary, ie. the boundary is a streamline, since the flow is tangential at the boundary.
This is a consequence of u · n = 0. We often choose ψ = 0 as our boundary.
To draw streamlines we just need to draw the curves ψ = constant. Typically, we draw
streamlines that are “evenly spaced”, ie. we pick the streamlines ψ = c0 , ψ = c1 , . . .
with the cᵢ equally spaced; the flow is then faster where the streamlines are closer together.
2. Similarly this also works for spherical polar axisymmetric flow. In spherical polar
   coordinates (r, θ, ϕ) we consider the flow u = (ur , uθ , 0) with no ϕ dependence.
   The incompressibility condition reads
   $$\nabla\cdot\mathbf{u} = \frac{1}{r^2}\frac{\partial(r^2u_r)}{\partial r} + \frac{1}{r\sin\theta}\frac{\partial(u_\theta\sin\theta)}{\partial\theta} = 0.$$
We can also manually check that for these flows u · ∇ψ = 0, so the flow is tangential
to the surfaces ψ = constant. However, ψ = constant now defines a surface, which can't
be a streamline. But fortunately the intersection of the ψ = constant surface with any
plane of the form y = Ax (or x = Ay) is a streamline. So we can still use it to draw
the streamlines.
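The defining properties of the (two-dimensional) stream function can be checked on a concrete example (a sketch; the choice ψ = sin x cos y is arbitrary):

```python
import math

# For psi(x, y) = sin(x) cos(y), the velocity u = (psi_y, -psi_x) should be
# divergence-free and tangent to the contours psi = constant.

def psi_x(x, y):  return math.cos(x) * math.cos(y)
def psi_y(x, y):  return -math.sin(x) * math.sin(y)

def velocity(x, y):
    return (psi_y(x, y), -psi_x(x, y))

h = 1e-5
for (x, y) in [(0.3, 0.7), (1.2, -0.4), (2.0, 2.0)]:
    u, v = velocity(x, y)
    # divergence by central differences
    div = ((velocity(x + h, y)[0] - velocity(x - h, y)[0])
           + (velocity(x, y + h)[1] - velocity(x, y - h)[1])) / (2 * h)
    assert abs(div) < 1e-8
    # u . grad(psi) = 0: flow is tangent to the streamlines psi = const
    assert abs(u * psi_x(x, y) + v * psi_y(x, y)) < 1e-12
print("u is incompressible and tangent to the contours of psi")
```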
15.3 Dynamics
The equations for the parallel viscous flow were a good start, and it turns out the
general equation of motion is of a rather similar form. Unfortunately, the equation
is rather complicated and difficult to derive, and we will not derive it here. How-
ever, we will be able to derive some special cases of it later under certain simplifying
assumptions.
<Navier-Stokes equation>
$$\rho\frac{D\mathbf{u}}{Dt} = -\nabla p + \mu\nabla^2\mathbf{u} + \mathbf{f}$$
This is the general equation for fluid motion. The left hand side is mass times ac-
celeration, and the right is the individual forces — the pressure gradient, viscosity,
and the body forces (per unit volume) respectively. In general, these are very difficult
equations to solve because of non-linearity. For example, in the material derivative,
we have the term u · ∇u. There are a few things to note:
15.3.1 Pressure
In the Navier-Stokes equation, we have a pressure term. In general, we classify the
pressure into two categories. If there is gravity, then we will get pressure in the fluid due
to the weight of fluid above it. This is what we call hydrostatic pressure . Technically,
this is the pressure in a fluid at rest, ie. when u = 0. We denote the hydrostatic
pressure as pH .
To find this, we put u = 0 into the Navier-Stokes equation to get ∇pH = f = ρg.
We can integrate this to obtain pH = ρg · x + p0 where p0 is some arbitrary constant.
Usually, we have g = (0, 0, −g). Then pH = p0 − ρgz. This exactly says that the
hydrostatic pressure is the weight of the fluid above you.
What can we infer from this? Suppose we have a body D with boundary ∂D and
outward normal n. Then the force due to the pressure is
$$\mathbf{F} = -\int_{\partial D}p_H\mathbf{n}\,dS = -\int_D\nabla p_H\,dV = -\int_D\rho\mathbf{g}\,dV = -\mathbf{g}\int_D\rho\,dV = -M\mathbf{g},$$
where M is the mass of fluid displaced. The second equality holds since
$$\int_{\partial D}p_Hn_i\,dS = \int_{\partial D}p_H\mathbf{e}_i\cdot d\mathbf{S} = \int_D\nabla\cdot(p_H\mathbf{e}_i)\,dV = \int_D(\nabla p_H)_i\,dV.$$
This is Archimedes' principle .
In particular, if the body is less dense than the fluid, it will float; if the body is denser
than the fluid, it will sink; if the density is the same, then it does not move, and we
say it is neutrally buoyant. This is valid only when nothing is moving, since that was
our assumption. Things can be very different when things are moving, which is why
planes can fly.
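A minimal numerical illustration of Archimedes' principle (hypothetical numbers): summing the hydrostatic pressure force −p_H n over the faces of a submerged cube recovers the weight of the displaced fluid.

```python
# For a cube of side L in fluid of density rho, with hydrostatic pressure
# p(z) = p0 - rho g z, the net pressure force is rho * g * V upward.
rho, g, p0, L = 1000.0, 9.81, 1.0e5, 0.5

def p(z):                        # z measured upward
    return p0 - rho * g * z

A = L * L
# Horizontal faces at z = 0 (bottom, outward normal -e3) and z = L (top, +e3);
# force on each face is -p n times its area:
Fz = p(0.0) * A - p(L) * A
# On the vertical faces the pressure on opposite sides is equal, so the
# horizontal contributions cancel exactly.

V = L**3
assert abs(Fz - rho * g * V) < 1e-9 * rho * g * V
print("upward force:", Fz, "N; weight of displaced fluid:", rho * g * V, "N")
```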
In general, when there is motion, we might expect some other pressure gradient. It
can either be some external pressure gradient driving the motion (eg. in the case of
Poiseuille flow), or a pressure gradient caused by the flow itself. In either case, we can
write p = pH + p′ where pH is the hydrostatic pressure, and p′ is what causes/results
from motion. We substitute this into the Navier-Stokes equation to obtain
$$\rho\frac{D\mathbf{u}}{Dt} = -\nabla p' + \mu\nabla^2\mathbf{u}.$$
So the hydrostatic pressure term cancels with the gravitational term. What we usually
do is drop the “prime”, and just look at the deviation from hydrostatic pressure. What
this means is that gravity no longer plays a role, and we can ignore gravity in any flow
in which the density is constant. Then all fluid particles are neutrally buoyant. This is
the case in most of what we will do, except when we consider motion of water waves,
since there is a difference in air density and water density.
1. When Re ≪ 1, the inertia terms are negligible, and we now have P ∼ ρνU/L =
   µU/L. So the pressure balances the shear stress. We can approximate the Navier-
   Stokes equation by dropping the term on the left hand side; then we have
   $$0 = -\nabla p + \mu\nabla^2\mathbf{u} + \mathbf{f}.$$
(Diagram: a boundary layer of thickness ∼ Cδ near the wall, across which the velocity profile g′(η) in the similarity variable η adjusts to the outer flow Ex.)
At the scale δ, we get a Reynolds number of Reδ = U δ/ν ∼ O(1). This is the boundary
layer. For a larger extrinsic scale L ≫ δ, we get ReL = U L/ν ≫ 1. When interested
in flow on scales much larger than δ, we ignore the region y < δ (since it is small),
and we imagine a rigid boundary at y = δ at which the no-slip condition does not
apply.
When ReL ≫ 1, we solve the Euler equations, namely
$$\rho\frac{D\mathbf{u}}{Dt} = -\nabla p + \mathbf{f}, \qquad \nabla\cdot\mathbf{u} = 0.$$
We also have a boundary condition u · n = 0 at the stationary rigid boundary: we
don't allow fluid to flow through the boundary. The no-slip condition is no longer
satisfied. One can show that u = (Ex, −Ey, 0) satisfies the Euler equations in y > 0
with a rigid boundary at y = 0, with
$$p = p_0 - \frac{1}{2}\rho E^2(x^2 + y^2).$$
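We can verify this claim directly (a sketch with arbitrary values of ρ and E): the flow u = (Ex, −Ey, 0) with the quoted pressure satisfies the steady Euler equation and is incompressible.

```python
# Check rho (u . grad) u = -grad p for u = (E x, -E y) and
# p = p0 - rho E^2 (x^2 + y^2) / 2. Here div u = E - E = 0 identically.
rho, E, p0 = 1.0, 2.0, 0.0

def u(x, y):
    return (E * x, -E * y)

def grad_p(x, y):
    return (-rho * E**2 * x, -rho * E**2 * y)

for (x, y) in [(1.0, 0.5), (-2.0, 3.0), (0.0, 0.0)]:
    ux, uy = u(x, y)
    # (u . grad) u from the exact derivatives of u; cross terms vanish
    adv = (ux * E, uy * (-E))
    lhs = (rho * adv[0], rho * adv[1])
    rhs = tuple(-c for c in grad_p(x, y))
    assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
print("steady Euler equation satisfied; stagnation point at", u(0.0, 0.0))
```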
We can plot the curves of constant pressure, as well as the streamlines. As the flow
enters from the top, the pressure keeps increasing, and this slows down the flow; we
say the y-pressure gradient is adverse. As the flow moves down and out sideways, the
pressure pushes the flow, so the x-pressure gradient is favourable. At the origin, the
velocity is zero, and this is a stagnation point; it is also the point of highest pressure.
In general, velocity is high at low pressures and low at high pressures. Note that the
pressure acts as an internal reaction force imposing the constraint of incompressibility.
We will ignore the last one. We can then write the rate of change of the total momentum
as
$$\frac{d}{dt}\int_D\rho\mathbf{u}\,dV = -\int_{\partial D}\rho\mathbf{u}(\mathbf{u}\cdot\mathbf{n})\,dS - \int_{\partial D}p\mathbf{n}\,dS + \int_D\mathbf{f}\,dV. \qquad (*)$$
It is helpful to write this in suffix notation. In this case, the equation becomes
$$\frac{d}{dt}\int_D\rho u_i\,dV = -\int_{\partial D}\rho u_iu_jn_j\,dS - \int_{\partial D}pn_i\,dS + \int_Df_i\,dV.$$
Just as in the case of mass conservation, we can use the divergence theorem to
write
$$\int_D\left(\rho\frac{\partial u_i}{\partial t} + \rho\frac{\partial}{\partial x_j}(u_iu_j)\right)dV = \int_D\left(-\frac{\partial p}{\partial x_i} + f_i\right)dV.$$
Since D is arbitrary, we must have
$$\rho\frac{\partial u_i}{\partial t} + \rho u_j\frac{\partial u_i}{\partial x_j} + \rho u_i\frac{\partial u_j}{\partial x_j} = -\frac{\partial p}{\partial x_i} + f_i.$$
The last term on the left contains the divergence of u, which vanishes by incompressibility,
and the remaining terms are just the material derivative of u. So we get
<Euler momentum equation>
$$\rho\frac{D\mathbf{u}}{Dt} = -\nabla p + \mathbf{f}.$$
This is just the equation we get from the Navier-Stokes equation by ignoring the
viscous terms. However, we were able to derive this directly from momentum conser-
vation.
Suppose we have a conservative force f = −∇χ, where χ is a scalar potential. For
example, gravity can be given by f = ρg = ∇(ρg · x) (for ρ constant), so χ = −ρg · x =
ρgz if g = (0, 0, −g). Further suppose that we have a steady flow, so ∂u/∂t vanishes.
Then the momentum equation (∗) becomes
$$0 = -\int_{\partial D}\rho\mathbf{u}(\mathbf{u}\cdot\mathbf{n})\,dS - \int_{\partial D}p\mathbf{n}\,dS - \int_D\nabla\chi\,dV.$$
This can sometimes be useful. Of course, in the case that we don't have a steady flow
and a conservative force, we can always revert to using (∗).
E. 15-8
<Force on curved pipe> Suppose in space (no gravity) we have a curved circular
pipe with constant cross-section area A and a steady inviscid incompressible flow of
speed U flowing through it. What is the force the fluid exerts on the pipe? (In the
diagram, the pipe ends have cross-sections S1 and S2 with outward normals n1 and n2,
and S is the curved outer surface.)
We use the momentum integral for steady flow; ∂D in this case is made up of three
parts: S1 (with normal n1 ), S2 (with normal n2 ) and the curved surface S. Carrying
out the integrals over S1 and S2 and solving for the force on the pipe gives
$$\mathbf{F} = -A(p + \rho U^2)(\mathbf{n}_1 + \mathbf{n}_2).$$
Note that the force depends on the background pressure p determined by the pumping
station.
$$\frac{2(p - P)}{\rho} = -2gh \implies q = \sqrt{2gh}\,\frac{Aa}{\sqrt{A^2 - a^2}}.$$
Therefore we can measure h in order to find out the flow rate. This allows us to
measure fluid velocity just by creating a constriction and then putting in some
pipes.
E. 15-10
<Force on a fire hose nozzle> Suppose we have a fire hose nozzle like this:
(Diagram: water enters the hose at station (1) with speed U , cross-section A and pressure P , and exits the nozzle at station (5) with speed u, cross-section a and pressure p = 0; labels (2)-(4) mark the remaining parts of the dashed control surface, with (3) the converging nozzle wall.)
We consider the steady-flow equation and integrate along the surface indicated as
dashed lines. We integrate each section separately to find the total rate of change
of momentum in the x direction. The end (1) contributes ρU (−U )A − P A. On (2),
everything vanishes. On (3), the first term vanishes since the velocity is parallel
to the surface, so we get a contribution of ∫_nozzle p n · x̂ dS. Similarly,
everything in (4) vanishes. Finally, on (5), noting that p = 0, we get ρu²a. By
the steady flow equation, we know these all sum to zero. Hence, the force on the
nozzle is just
$$F = \int_{\text{nozzle}}p\,\mathbf{n}\cdot\hat{\mathbf{x}}\,dS = \rho AU^2 - \rho au^2 + PA.$$
We can again apply Bernoulli along a streamline in the middle, which says
½ρU² + P = ½ρu². So we get
$$F = \rho AU^2 - \rho au^2 + \frac{1}{2}\rho A(u^2 - U^2) = \frac{\rho}{2}\frac{A}{a^2}\,q^2\left(1 - \frac{a}{A}\right)^2.$$
Let’s now put some numbers in. Suppose A = (0.1)2 πm2 and a = (0.05)2 πm2 . So
we get A/a = 4. A typical fire hose has a flow rate of q = 0.01 m3 s−1 . So we get
2
1 4 3
F = · 1000 · · 10−4 · ≈ 14 N.
2 π/40 4
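Re-doing this arithmetic in code (the formula and numbers are exactly those above):

```python
import math

# F = (rho/2) (A/a^2) q^2 (1 - a/A)^2 with A = (0.1)^2 pi, a = (0.05)^2 pi (m^2),
# q = 0.01 m^3/s, rho = 1000 kg/m^3.
rho, q = 1000.0, 0.01
A = 0.1**2 * math.pi
a = 0.05**2 * math.pi

F = 0.5 * rho * (A / a**2) * q**2 * (1 - a / A)**2
print(round(F, 1), "N")
assert abs(F - 14.3) < 0.1   # matches the ~14 N quoted above
```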
E. 15-11
Consider a two-dimensional flow where a jet of water of width a travelling with speed
U is incident on an inclined plane. Assume we have no gravity, or that the jet is fast
enough that gravitational effects are negligible. Wlog assume the atmospheric pressure
is 0. Suppose also that we have reached the point where the flow is steady. We want
to find a1 , a2 and the force the fluid exerts on the plane. (In the diagram, the incident
jet of width a splits on the plane into two streams of widths a1 and a2 , with β the
angle between the jet and the plane.)
Far away upstream in the incident jet, the water is flowing in a straight line with
constant speed U , so 0 = ρDu/Dt = −∇p; hence the pressure p within the jet is
constant, and so equals 0, the atmospheric pressure, since the boundary has that
pressure. The same goes for the two streams created after the incidence. In fact the
speed of both of these streams far away from the incident point must be U as well:
along the free-surface streamline the pressure is always 0, so by Bernoulli
(½ρ|u|² + p = constant along a streamline) the speed of the streams must be U .
By mass conservation we must have U a = U a1 + U a2 . Let D be the volume of the
flow as shown in the diagram. The momentum integral for steady flow says
$$\int_{\partial D}\left(\rho\mathbf{u}(\mathbf{u}\cdot\mathbf{n}) + p\mathbf{n}\right)dS = 0.$$
Along the x-direction we have
$$0 = -\int_{\partial D}pn_1\,dS = \int_{\partial D}\rho u_1(\mathbf{u}\cdot\mathbf{n})\,dS = \rho U\cos\beta\,(-U)a + \rho U(U)a_2 + \rho(-U)(U)a_1.$$
Hence −ρa cos β + ρa2 − ρa1 = 0. Together with mass conservation, we find that
a2 = a(1 + cos β)/2 and a1 = a(1 − cos β)/2. The force the fluid exerts on the plane
is given by
$$F = \int_{y=0}p(-\mathbf{e}_2)\,dS = \int_{\partial D}p\,n_2\,dS = -\int_{\partial D}\rho u_2(\mathbf{u}\cdot\mathbf{n})\,dS = \rho U\sin\beta\,(-U)a = -\rho aU^2\sin\beta.$$
We can write the second part in terms of the vorticity ω = ∇ × u. Then we have
$$\boldsymbol\omega\times\mathbf{r} = (\nabla\times\mathbf{u})\times\mathbf{r} = \varepsilon_{piq}\varepsilon_{ijk}\frac{\partial u_k}{\partial x_j}r_q\mathbf{e}_p = (\delta_{qj}\delta_{pk} - \delta_{qk}\delta_{pj})\frac{\partial u_k}{\partial x_j}r_q\mathbf{e}_p$$
$$= \frac{\partial u_k}{\partial x_j}r_j\mathbf{e}_k - \frac{\partial u_k}{\partial x_j}r_k\mathbf{e}_j = \left(\frac{\partial u_i}{\partial x_j} - \frac{\partial u_j}{\partial x_i}\right)r_j\mathbf{e}_i = 2\Omega_{ij}r_j\mathbf{e}_i.$$
Since ωi ωj is symmetric, while Ωij is antisymmetric, the second term vanishes. In its
principal axes, E is diagonal. So we get
$$\frac{D}{Dt}\left(\frac{1}{2}|\boldsymbol\omega|^2\right) = E_1\omega_1^2 + E_2\omega_2^2 + E_3\omega_3^2.$$
Wlog, we assume E1 > 0 (since the Ei 's sum to 0), and imagine E2 , E3 < 0. So the flow
is stretched in the e1 direction and compressed radially. We consider what happens to
a vortex in the direction of the stretching, ω = (ω1 , 0, 0). We then get
$$\frac{D}{Dt}\left(\frac{1}{2}\omega_1^2\right) = E_1\omega_1^2.$$
$$\frac{D\mathbf{x}_2(t)}{Dt} = \mathbf{u}(\mathbf{x}_2), \qquad \frac{D\mathbf{x}_1(t)}{Dt} = \mathbf{u}(\mathbf{x}_1) \implies \frac{D\,\delta\boldsymbol{\ell}(t)}{Dt} = \mathbf{u}(\mathbf{x}_2) - \mathbf{u}(\mathbf{x}_1) = \delta\boldsymbol{\ell}\cdot\nabla\mathbf{u},$$
by taking a first-order Taylor expansion. This is exactly the same equation as that
for ω in an inviscid fluid. So vorticity increases as the length of a material line in-
creases.
We have so far assumed the density is constant. If the fluid has a non-uniform density
ρ(x), then it turns out
$$\frac{D\boldsymbol\omega}{Dt} = \boldsymbol\omega\cdot\nabla\mathbf{u} + \frac{1}{\rho^2}\nabla\rho\times\nabla p.$$
E. 15-12
Below are some examples of vortex amplification by stretching. (Diagrams: the bathtub vortex, where low vorticity is stretched into high vorticity as fluid drains; a hurricane, where convection between cool and warm regions stretches vortex lines; and vorticity built up near the ground around obstacles.)
What is the physical significance of the factor A? Consider the volume flux q across
the surface of the sphere r = a. Then
$$q = \int_S\mathbf{u}\cdot\mathbf{n}\,dS = \int_Su_r\,dS = \int_S\frac{\partial\phi}{\partial r}\,dS = \int_S\frac{A}{a^2}\,dS = 4\pi A.$$
So we can write φ = −q/(4πr). When q > 0, this corresponds to a point source of fluid;
when q < 0, this is a point sink of fluid. We can also derive this solution directly, using
incompressibility: since the flow is incompressible, the flux through any sphere containing
the source/sink should be constant. Since the surface area increases as 4πr², the
velocity must drop as 1/r², in agreement with what we obtained above. Notice that we
have ∇²φ = qδ(x). So φ is actually q times the Green's function for the Laplacian.
That was not too interesting. We can consider a more general solution, where φ
depends on r and θ but not ϕ. Then Laplace’s equation becomes
$$\nabla^2\phi = \frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2\sin\theta}\frac{\partial}{\partial\theta}\left(\sin\theta\frac{\partial\phi}{\partial\theta}\right) = 0.$$
As we know from IB Methods, we can use Legendre polynomials to write the solution
as
$$\phi = \sum_{n=0}^\infty(A_nr^n + B_nr^{-n-1})P_n(\cos\theta), \quad\text{and then}\quad \mathbf{u} = \left(\frac{\partial\phi}{\partial r},\ \frac{1}{r}\frac{\partial\phi}{\partial\theta},\ 0\right).$$
E. 15-14
We can immediately look at some possible flows.
1. Source/sink: In the case that only B0 ≠ 0 we have φ = B0 /r and u = ∇φ =
   −B0 er /r². This is a source if B0 < 0 and a sink if B0 > 0. Indeed if we look
   at the outward mass flux over a sphere of radius R,
   $$\int_{S_R}\rho\mathbf{u}\cdot d\mathbf{S} = \rho(-B_0/R^2)\,4\pi R^2 = -4\pi B_0\rho.$$
E. 15-15
<Uniform flow past sphere> We can look at uniform flow around a sphere of
radius a. We suppose the upstream flow is u = U x̂, so φ = U x = U r cos θ far
upstream. So we need to solve
$$\nabla^2\phi = 0 \quad\text{for } r > a, \qquad \phi \to Ur\cos\theta \text{ as } r\to\infty, \qquad \frac{\partial\phi}{\partial r} = 0 \text{ on } r = a.$$
The last condition is there to ensure no fluid flows into the sphere, ie. u · n = 0,
for n the outward normal.
Since P1 (cos θ) = cos θ, and the Pn are orthogonal, our boundary conditions at
infinity require φ to be of the form
$$\phi = \left(Ar + \frac{B}{r^2}\right)\cos\theta \quad\text{and in fact}\quad \phi = U\left(r + \frac{a^3}{2r^2}\right)\cos\theta,$$
where we determined the two constants using the two boundary conditions. The
condition that φ → U r cos θ tells us A = U , and the other condition tells us
A − 2B/a³ = 0. We can interpret U r cos θ as the uniform flow, and U (a³/2r²) cos θ as
the dipole response due to the sphere. We can compute the velocity to be
$$u_r = \frac{\partial\phi}{\partial r} = U\left(1 - \frac{a^3}{r^3}\right)\cos\theta, \qquad u_\theta = \frac{1}{r}\frac{\partial\phi}{\partial\theta} = -U\left(1 + \frac{a^3}{2r^3}\right)\sin\theta.$$
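The potential and velocity components above can be cross-checked numerically (a sketch; U and a are set to 1 for convenience):

```python
import math

# phi = U (r + a^3/(2 r^2)) cos(theta) for uniform flow past a sphere.
U, a = 1.0, 1.0

def phi(r, th):
    return U * (r + a**3 / (2 * r**2)) * math.cos(th)

def ur(r, th):   return U * (1 - a**3 / r**3) * math.cos(th)
def uth(r, th):  return -U * (1 + a**3 / (2 * r**3)) * math.sin(th)

h = 1e-6
for (r, th) in [(1.5, 0.4), (2.0, 1.1), (3.0, 2.8)]:
    # u_r = d(phi)/dr and u_theta = (1/r) d(phi)/d(theta)
    assert abs((phi(r + h, th) - phi(r - h, th)) / (2 * h) - ur(r, th)) < 1e-8
    assert abs((phi(r, th + h) - phi(r, th - h)) / (2 * h * r) - uth(r, th)) < 1e-8

assert abs(ur(a, 0.7)) < 1e-15                 # no flow through the sphere r = a
assert abs(phi(1e6, 0.3) - U * 1e6 * math.cos(0.3)) < 1e-5   # uniform flow far away
print("velocity field consistent with the potential; u_r = 0 on r = a")
```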
is the added mass of the bubble (and MD is the mass of the fluid displaced by
the bubble). Now suppose we raise the bubble by a distance h. The change in
potential energy of the system is ∆PE = −MD gh. So
$$\frac{1}{2}M_AU^2 - M_Dgh = \text{Energy}$$
is constant, since we assume there is no dissipation of energy due to viscosity. We
differentiate this to get MA U U̇ = MD g ḣ = MD gU . We can cancel the U 's
and get U̇ = (MD /MA )g = 2g, since MA = MD /2 for a sphere. So in an inviscid
fluid, the bubble rises at twice the acceleration of gravity.
$$\nabla^2\phi = \frac{1}{r}\frac{\partial}{\partial r}\left(r\frac{\partial\phi}{\partial r}\right) + \frac{1}{r^2}\frac{\partial^2\phi}{\partial\theta^2}, \qquad \mathbf{u} = \nabla\phi = \left(\frac{\partial\phi}{\partial r},\ \frac{1}{r}\frac{\partial\phi}{\partial\theta}\right).$$
The general solution to Laplace's equation is given by
$$\phi = A_0\log r + B_0\theta + \sum_{n=1}^\infty(A_nr^n + B_nr^{-n})\begin{Bmatrix}\cos n\theta\\\sin n\theta\end{Bmatrix} = A_0\log r + B_0\theta + \sum_{n=1}^\infty\left(A_nr^n\cos(n\theta + \alpha_n) + B_nr^{-n}\cos(n\theta + \beta_n)\right).$$
Note that even though we say the flow is two-dimensional, it could still be a flow in
three dimensions, eg. a flow that is confined to the x-y plane and is the same for all
z.
E. 15-17
We can immediately look at some possible flows.
1. Source/sink: In the case only A0 ≠ 0, we have φ = A0 log r and u = ∇φ = A0 er /r.
   The mass flux through a circle of radius R about the origin is
   $$\int\rho\mathbf{u}\cdot d\mathbf{S} = 2\pi R\,\rho A_0/R = 2\pi\rho A_0.$$
   So A0 > 0 corresponds to having a point source at the origin, and A0 < 0
   corresponds to having a sink at the origin. We call m = |2πA0 | the strength of
   the source/sink.
   Alternatively, to get a point source of strength q, we can either solve ∇²φ =
   qδ(r) or use conservation of mass to obtain 2πr ur = q; then
   $$u_r = \frac{q}{2\pi r} \implies \phi = \frac{q}{2\pi}\log r.$$
 ∇²φ = 0 for r > a,    φ → U r cos θ as r → ∞,    ∂φ/∂r = 0 on r = a.
678 CHAPTER 15. FLUID DYNAMICS
We already have the general solution above. So we just write it down. We find
 φ = U(r + a²/r) cos θ + (K/2π) θ,
and so
 u_r = U(1 − a²/r²) cos θ,    u_θ = −U(1 + a²/r²) sin θ + K/(2πr).
The last term in φ allows for a net circulation K around the cylinder, to account
for vorticity in the viscous boundary layer on the surface of the cylinder. We can
find the streamfunction for this as
 ψ = U r sin θ (1 − a²/r²) − (K/2π) log r.
If there is no circulation, ie. K = 0, then we get a flow similar to the flow around
a sphere. Again, there is no net force on the cylinder, since the flow is symmetric
before and after, and above and below. Again, we get two stagnation points at A, A′.
What happens when K ≠ 0 is more interesting. We first look at the stagnation
points. We get u_r = 0 if and only if r = a or cos θ = 0. For u_θ = 0, when
r = a, we require K = 4πaU sin θ. So provided |K| ≤ 4πaU, there is a solution to
this, and we get (two) stagnation points on the boundary.
For |K| > 4πaU, we do not get a stagnation point on the boundary. However, we
still have a stagnation point at cos θ = 0 (ie. θ = ±π/2) for some r. Looking at
the equation for u_θ = 0, only θ = π/2 works.
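The stagnation-point analysis can be sketched numerically. Below, U = a = 1 is a hypothetical choice; multiplying u_θ = 0 at θ = π/2 through by r² gives the quadratic U r² − (K/2π) r + U a² = 0 (a small derivation step not spelled out in the text), whose roots are real precisely when |K| ≥ 4πaU:

```python
import math

# Stagnation points of flow past a cylinder with circulation K.
U, a = 1.0, 1.0

def u_theta(r, theta, K):
    return -U * (1 + a**2 / r**2) * math.sin(theta) + K / (2 * math.pi * r)

# |K| <= 4 pi a U: stagnation points on the boundary, sin(theta) = K/(4 pi a U).
K = 2 * math.pi * a * U                     # below the threshold 4 pi a U
theta_s = math.asin(K / (4 * math.pi * a * U))
assert abs(u_theta(a, theta_s, K)) < 1e-12

# |K| > 4 pi a U: stagnation point off the boundary at theta = pi/2,
# at r solving U r^2 - (K / (2 pi)) r + U a^2 = 0.
K = 8 * math.pi * a * U
half = K / (2 * math.pi)
r_s = (half + math.sqrt(half**2 - 4 * U**2 * a**2)) / (2 * U)
assert r_s > a and abs(u_theta(r_s, math.pi / 2, K)) < 1e-12
```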
Let's now look at the effect on the cylinder. For
steady potential flow, Bernoulli works (ie. H is constant) everywhere, not just
along each streamline (see later). So we can calculate the pressure on the surface.
Let p be the pressure on the surface. Then we get
 p∞ + ½ρU² = p + ½ρ(K/(2πa) − 2U sin θ)²
 ⟹ p = p∞ + ½ρU² − ρK²/(8π²a²) + ρKU sin θ/(πa) − 2ρU² sin²θ.
We see the pressure is symmetrical before and after. So there is no force in the x
direction. However, we get a transverse force (per unit length) in the y-direction.
We have
 F_y = −∫₀^{2π} p sin θ (a dθ) = −∫₀^{2π} (ρKU/(πa)) sin²θ · a dθ = −ρUK,
where we have dropped all the terms that integrate to zero. So there is a sideways
force in the direction perpendicular to the flow, directly proportional to the
circulation of the system. In general, the Magnus force (lift force) resulting from
interaction between the flow U and the vortex K is F = ρU × K.
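The lift integral can be checked by quadrature. The values of ρ, U, a, K below are arbitrary; a midpoint rule is used since the integrand is smooth and periodic:

```python
import math

# Numerical check of the Magnus force: integrate the surface pressure
# p(theta) around the cylinder and confirm F_y = -rho U K.
rho, U, a, K = 1.2, 3.0, 0.5, 2.0
p_inf = 0.0  # constant offset; integrates to zero against sin(theta)

def p(theta):
    return (p_inf + 0.5 * rho * U**2 - rho * K**2 / (8 * math.pi**2 * a**2)
            + rho * K * U * math.sin(theta) / (math.pi * a)
            - 2 * rho * U**2 * math.sin(theta)**2)

n = 100_000
dth = 2 * math.pi / n
Fy = -sum(p((i + 0.5) * dth) * math.sin((i + 0.5) * dth) for i in range(n)) * a * dth
assert abs(Fy - (-rho * U * K)) < 1e-6   # -rho U K = -7.2 here
```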
where p0 (t) is positive for t > 0 and zero otherwise. We neglect gravity. What is
the velocity of the flow in the tube?
Firstly, starting from rest implies that the flow is irrotational, so we can write
the potential as φ = u(t)x. Note that the flow speed u cannot depend on x since
the flow is incompressible. Applying the time-dependent Bernoulli equation (∗)
at both ends of the tube, we find
 u̇L + ½u² + p_atm/ρ = f(t) = ½u² + (p_atm + p0)/ρ.
Therefore we have u̇ = p0(t)/(ρL) with initial condition u(0) = 0. In the case that
p0 is a constant, we have u = (p0/(ρL))t.
................................................................................
Consider a slight variation of the problem where instead of specifying the pressure
at x = 0 by p|x=0 = p_atm + p0(t), the pressure is controlled at x = −ξ, the back
of the container. We can think of this as there being a piston at x = −ξ pushing
on the water in the container with force per unit area p_atm + p0(t). We assume
the container is large enough that the flow velocity in the container is negligible.
Applying (∗) at x = L and x = −ξ gives u̇L + ½u² + p_atm/ρ = (p0(t) + p_atm)/ρ.
Therefore we have the non-linear ODE
 u̇ + u²/(2L) = p0(t)/(ρL)   with u(0) = 0.
In the case where p0 is constant, we define the velocity scale u0 = √(2p0/ρ) and
time scale t0 = 2L/u0. Then in terms of the dimensionless variables η = u/u0
and τ = t/t0 we have dη/dτ = 1 − η² with η(0) = 0. The solution is therefore
η = tanh τ, that is u = u0 tanh(u0 t/(2L)).
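The ODE can be checked against the tanh solution by integrating numerically; the parameter values below are arbitrary illustrative choices:

```python
import math

# Integrate du/dt + u^2/(2L) = p0/(rho L) with RK4 and compare with the
# closed form u = u0 tanh(u0 t / (2 L)) found above.
rho, L, p0 = 1000.0, 2.0, 5000.0
u0 = math.sqrt(2 * p0 / rho)

def f(u):
    return p0 / (rho * L) - u**2 / (2 * L)

u, t, dt = 0.0, 0.0, 1e-4
while t < 3.0:
    k1 = f(u); k2 = f(u + 0.5*dt*k1); k3 = f(u + 0.5*dt*k2); k4 = f(u + dt*k3)
    u += dt * (k1 + 2*k2 + 2*k3 + k4) / 6
    t += dt

assert abs(u - u0 * math.tanh(u0 * t / (2 * L))) < 1e-6
```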
E. 15-20
<Oscillations in a manometer> A manometer is a U-shaped tube. We use some
magic to set it up such that the water level in the left tube is h above the
equilibrium position H. Then when we release the system, the water levels on
both sides would oscillate. We can get quite far just by doing dimensional
analysis. There are only two parameters g, H. Hence the frequency must be
proportional to √(g/H). To get the constant of proportionality, we have to do
proper calculations.
We are going to assume the reservoir at the bottom is large, so velocities there
are negligible. So φ is constant in the reservoir, say φ = 0. We want to figure out
the velocity on the left. This only moves vertically. So we have φ = uy = ḣy, and
hence ∂φ/∂t = ḧy. On the right-hand side, we just have φ = −uy = −ḣy and
∂φ/∂t = −ḧy. We now apply the equation from one tube to the other to get
 ρḧ(H + h) + ½ρḣ² + p_atm + ρg(H + h) = f(t) = −ρḧ(H − h) + ½ρḣ² + p_atm + ρg(H − h).
Quite a lot of these terms cancel, and we are left with 2ρHḧ + 2ρgh = 0. Simplifying,
we get ḧ + gh/H = 0. So this is simple harmonic motion with frequency √(g/H).
E. 15-21
<Oscillations of a bubble> Suppose we have a spherical bubble of radius a(t)
in some fluid. Spherically symmetric oscillations induce a flow in the fluid. This
satisfies
 ∇²φ = 0 for r > a,    φ → 0 as r → ∞,    ∂φ/∂r = ȧ for r = a.
In spherical polars, we write Laplace's equation as
 (1/r²) ∂/∂r (r² ∂φ/∂r) = 0  ⟹  φ = A(t)/r,   u_r = ∂φ/∂r = −A(t)/r².
The boundary condition at r = a gives A = −a²ȧ, so φ = −a²ȧ/r and
 ∂φ/∂t|_{r=a} = (−(2aȧ² + a²ä)/r)|_{r=a} = −(aä + 2ȧ²).
We now consider the pressure on the surface of the bubble. We will ignore gravity,
and apply Euler's equation at the bubble surface and at infinity. Then we get
 −ρ(aä + 2ȧ²) + ½ρȧ² + p(a, t) = p∞  ⟹  ρ(aä + (3/2)ȧ²) = p(a, t) − p∞.  (∗)
We can take logs to obtain log p + γ log V = constant. Then taking small variations
of p and V about p∞ and V0 = (4/3)πa0³ gives
 δ(log p) + γ δ(log V) = 0,   that is   δp/p∞ = −γ δV/V0.
Thus we find
 p(a, t) − p∞ = δp = −p∞ γ δV/V0 = −3p∞ γ η/a0.
Thus we get ρa0 η̈ = −3(γp∞ /a0 )η. This is again simple harmonic motion with
frequency ω = (3γp∞ /ρa20 )1/2 . We know all these numbers, so we can put them
in. For a 1 cm bubble, we get ω ≈ 2 × 103 s−1 . For reference, the human audible
range is 20 − 20 000 s−1 . This is why we can hear, say, waves breaking.
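Plugging in rough ambient values (assumed here, not stated in the text: p∞ ≈ 10⁵ Pa, ρ ≈ 10³ kg m⁻³, γ ≈ 1.4) reproduces the quoted ω ≈ 2 × 10³ s⁻¹ for a 1 cm bubble:

```python
import math

# Bubble oscillation frequency omega = sqrt(3 gamma p_inf / (rho a0^2))
# from the analysis above, with assumed ambient values.
gamma, p_inf, rho = 1.4, 1.0e5, 1.0e3

def omega(a0):
    return math.sqrt(3 * gamma * p_inf / (rho * a0**2))

w = omega(0.01)           # a 1 cm bubble
assert 1.9e3 < w < 2.2e3  # roughly 2 x 10^3 s^-1, in the audible range
```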
Finally we will point out that the pressure does not attain its extremum at the
boundary of the bubble. Applying Euler's equation (time-dependent Bernoulli) at
a general point r and at infinity, we get
 ρ ∂/∂t (−a²ȧ/r) + ½ρ a⁴ȧ²/r⁴ + p(r, t) = p∞
 ⟹ p(r, t) = p∞ + (a/r)(p(a, t) − p∞) + ½ρȧ²(a/r − a⁴/r⁴).
In fact each vortex is moved by the velocity field due to all the other vortices, so
 ẋᵢ(t) = Σ_{j≠i} (Kⱼ/2π) e_z × (xᵢ − xⱼ)/‖xᵢ − xⱼ‖².
E. 15-22
• Consider a vortex pair N = 2 with K = K1 = −K2 > 0. Then
 ẋ1(t) = −(K/2π) e_z × (x1 − x2)/‖x1 − x2‖²,    ẋ2(t) = (K/2π) e_z × (x2 − x1)/‖x1 − x2‖².
So ẋ1 = ẋ2 = U. In particular the distance between the two vortices does not
change, d/dt ‖x1 − x2‖ = 0. Also we have |U| = |K|/(2π‖x1 − x2‖). Note that ∂φ/∂n = 0
on the perpendicular bisector of x1 − x2 (dashed line) since φ is symmetric.
• Consider a single vortex of strength K a distance d from a rigid straight boundary.
To find its evolution, we use the method of images. We place an image vortex of
strength −K at a distance d on the other side of the wall and remove the boundary.
So we have reduced it to the first case. An example of this happening is when planes
are taking off or landing: vortices form at the wing tips, the ground acts as the rigid
boundary, and the vortices would migrate away from the plane/runway
towards the sides.
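The point-vortex evolution equation can be simulated directly. For a pair (K, −K) the configuration translates rigidly, so even a forward-Euler step reproduces the exact behaviour; the values of d and K are arbitrary:

```python
import math

# One Euler step of dx_i/dt = sum_{j != i} (K_j / 2 pi) e_z x (x_i - x_j) / |x_i - x_j|^2.
def step(xs, Ks, dt):
    vs = []
    for i, (xi, yi) in enumerate(xs):
        vx = vy = 0.0
        for j, (xj, yj) in enumerate(xs):
            if i == j:
                continue
            dx, dy = xi - xj, yi - yj
            r2 = dx * dx + dy * dy
            # e_z x (dx, dy, 0) = (-dy, dx, 0)
            vx += Ks[j] / (2 * math.pi) * (-dy / r2)
            vy += Ks[j] / (2 * math.pi) * (dx / r2)
        vs.append((vx, vy))
    return [(x + dt * vx, y + dt * vy) for (x, y), (vx, vy) in zip(xs, vs)]

K, d = 1.0, 2.0
xs, Ks = [(0.0, 0.0), (d, 0.0)], [K, -K]
dt, n = 1e-3, 1000
for _ in range(n):
    xs = step(xs, Ks, dt)

# Separation unchanged; both vortices translated together at speed |K|/(2 pi d).
sep = math.hypot(xs[0][0] - xs[1][0], xs[0][1] - xs[1][1])
assert abs(sep - d) < 1e-9
assert abs(xs[0][1] - xs[1][1]) < 1e-9
assert abs(xs[0][1] - K / (2 * math.pi * d) * dt * n) < 1e-6
```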
(Figure: a free surface z = h(x, y, t) above a flat bottom z = −H.) The kinematic
boundary condition at the free surface is
 u_z = ∂φ/∂z = Dh/Dt = ∂h/∂t + u ∂h/∂x + v ∂h/∂y   when z = h.
We then have the dynamic boundary condition that the pressure at the surface is the
atmospheric pressure, that is at z = h we have p = p0 = constant. We need to relate
this to the flow. So we apply the time-dependent Bernoulli equation
 ρ ∂φ/∂t + ½ρ|∇φ|² + ρgh + p0 = f(t)   on z = h.
The equation is not hard, but the boundary conditions are. Apart from them being
non-linear, there is this surface h that we know nothing about.
whole of |∇φ|² in Bernoulli's equation since it is small. Next, we use a Taylor series to
write
 ∂φ/∂z|_{z=h} = ∂φ/∂z|_{z=0} + h ∂²φ/∂z²|_{z=0} + ··· ≈ ∂φ/∂z|_{z=0}.
Note that the last equation is just Bernoulli's equation, after removing the small terms
and absorbing constants and factors into the function f. We now have a nice,
straightforward problem: a linear equation with linear boundary conditions,
which we can solve. General strategies for solving these include looking for separable
solutions, or taking a Fourier transform with respect to x.
For this not to depend on x, we must have −iωφ0 cosh kH + gh0 = 0. A trivial
solution is h0 = φ0 = 0. Otherwise, we can solve to get ω² = gk tanh kH. This is the
dispersion relation, and relates the frequency to the wavelength of the wave. We can
use the dispersion relation to find the speed of the wave; it is
 c = ω/k = √((g/k) tanh kH).
Analysis of result
We can look at the limits for large and small H.
• In deep water (or short waves), we have kH ≫ 1. We know that as kH → ∞, we
get tanh kH → 1. So we get c = √(g/k).
• In shallow water (or long waves), we have kH ≪ 1. In the limit kH → 0, we get
tanh kH → kH. Then we get c = √(gH).
These are exactly as predicted using dimensional analysis, with all the dimensionless
constants being 1.
15.5. WATER WAVES 685
We can plot how the wave speed varies with k (c/√(gH) against kH). We see that the
wave speed decreases monotonically with k, and long waves
travel faster than short waves. This means if we start with,
say, a square wave, the long components of the wave travel
faster than the short components. So the square wave
disintegrates as it travels.
 ∇²φ = 0 for −H ≤ z ≤ 0, 0 ≤ x ≤ a, 0 ≤ y ≤ b;
 ∂φ/∂z = 0 on z = −H;   ∂φ/∂x = 0 on x = 0, a;   ∂φ/∂y = 0 on y = 0, b;
 ∂h/∂t − ∂φ/∂z = 0  and  ∂φ/∂t + gh = f(t)  on z = 0.
We seek solutions of the form φ = Re(φ̂(x, y, z)e^{−iωt}). In order for ∂ₜφ + gh to be
independent of x, y on z = 0 we need h = Re((iω/g)φ̂(x, y, 0)e^{−iωt}). Now the condition
∂ₜh − ∂_zφ = 0 on z = 0 says that ∂_zφ̂|_{z=0} = (ω²/g)φ̂|_{z=0}.
Performing separation of variables on φ̂(x, y, z) = α(x)β(y)γ(z), using the boundary
conditions and the fact that φ is harmonic, we find that α(x) = cos(mπx/a) and
β(y) = cos(nπy/b) for m, n ∈ Z. So we can write φ̂ = φ̂mn(z) cos(mπx/a) cos(nπy/b).
This solves ∇²φ̂ = 0 provided
 d²φ̂mn/dz² − k²mn φ̂mn = 0   where   k²mn = (mπ/a)² + (nπ/b)².  (∗)
The boundary conditions in z tell us that ∂_zφ̂mn(−H) = 0 and ∂_zφ̂mn(0) = (ω²/g)φ̂mn(0).
Using the first of these boundary conditions we find that the solution to (∗) is φ̂mn(z) =
C cosh(kmn(z + H)) for some constant C. Using the second boundary condition we
find that ω satisfies ω²mn = g kmn tanh(kmn H). Each of these solutions gives a
surface displacement
 h = Re( (iωmn/g) C cosh(kmn H) cos(mπx/a) cos(nπy/b) e^{−iωmn t} ).
So the amplitudes of the waves would fluctuate. We say the wave travels in groups.
It turns out the “packets” don't travel at the same velocity as the waves themselves.
The group velocity is given by
 c_g = ∂ω/∂k.
In particular, for deep water waves, where ω ∼ √(gk), we get c_g = ½√(g/k) = ½c. This
is also the velocity at which energy propagates.
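The deep-water group velocity can be checked by differentiating ω(k) = √(gk tanh kH) with a central difference; it should come out as c/2 (parameter values are illustrative):

```python
import math

# Group velocity c_g = d(omega)/dk, estimated by a central difference.
g, H = 9.81, 50.0

def omega(k):
    return math.sqrt(g * k * math.tanh(k * H))

k = 2.0                  # k H = 100 >> 1: deep water
dk = 1e-6
cg = (omega(k + dk) - omega(k - dk)) / (2 * dk)
c = omega(k) / k         # phase speed
assert abs(cg - c / 2) < 1e-4
```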
The scales of the terms are W/H, U/L and V/L respectively. Since H ≪ L, we know
W ≪ U, V, ie. most of the movement is horizontal, which makes sense, since there
isn't much vertical space to move around. We consider only horizontal velocities, and
write u = (u, v, 0) and f = (0, 0, f). Then from Euler's equations, we get
 ∂u/∂t − fv = −(1/ρ) ∂p/∂x,
 ∂v/∂t + fu = −(1/ρ) ∂p/∂y,
 0 = −(1/ρ) ∂p/∂z − g.
From the last equation, plus the boundary conditions, we know p = p0 + ρg(h − z).
This is just the hydrostatic balance. We now put this expression into the horizontal
components to get
 ∂u/∂t − fv = −g ∂h/∂x   and   ∂v/∂t + fu = −g ∂h/∂y.
Note that the right-hand sides of both equations are independent of z. So the
accelerations are independent of z. The initial conditions are usually that u and v are
independent of z. So we assume that the velocities never depend on z.
This has streamfunction ψ = −gh/f, called the shallow water streamfunction. The
streamlines are places where h is constant, ie. the surface is of constant height, ie. the
pressure is constant.
In general, near a low pressure zone, there is a pressure gradient pushing the flow
towards the low pressure area. As soon as the air starts to move, however, the
Coriolis force deflects it. As the air moves from the high-pressure area, its speed
increases, and so does its Coriolis deflection. The deflection increases until the
Coriolis and pressure gradient forces are in geostrophic balance: at this point, the
air flow is no longer moving from high to low pressure, but instead moves along an
isobar, like in a cyclone. The diagram shows what a cyclone looks like in the
Northern hemisphere. If we are on the other side of the Earth, cyclones go the
other way round.
We now look at the continuity equation, ie. the conservation of mass. We consider a
horizontal surface D in the water. Then we can compute
 d/dt ∫_D ρh dA = −∮_{∂D} ρh u_H · n dℓ = −∫_D ∇_H · (ρh u_H) dA,
15.6. FLUID DYNAMICS ON A ROTATING FRAME 689
where u_H is the horizontal velocity and n the normal of the line. In the last equality we
applied the divergence theorem, where ∇_H = (∂/∂x, ∂/∂y, 0). Since this was an arbitrary
surface, we can take the integral away, and we have the continuity equation
 ∂h/∂t + ∇_H · (u_H h) = 0.
So if there is water flowing into a point (ie. a vertical line), then the height of the
surface at that point increases, and vice versa. We can write this out in full, in Cartesian
coordinates:
 ∂h/∂t + ∂(uh)/∂x + ∂(vh)/∂y = 0.
 −(1/h0) ∂²η/∂t² − fζ = −g∇²η.
We now use the conservation of potential vorticity¹ ζ = Q0 + (f/h0)η to rewrite this
as
 ∂²η/∂t² − gh0∇²η + f²η = −h0 f Q0.
Note that the right hand side is just a constant (in time). So we have a nice differential
equation we can solve.
¹These assume small oscillations and small Rossby number. The more general non-linearised
version is D/Dt((ζ + f)/h) = 0.
E. 15-24
Suppose we have fluid with mean depth h0, and we start with the following scenario:
the free surface is displaced by +η0 for x > 0 and by −η0 for x < 0. Due to the
differences in height, we have higher pressure on the right and lower pressure on the
left.
If there is no rotation, then the final state is a flat surface with no flow. However,
this cannot be the case if there is rotation, since this violates the conservation of Q.
So what happens if there is rotation? Here
 Q0 = −(η0/h0) f sign(x).
In the steady state, our equation reduces to
 ∂²η/∂x² − (f²/gh0) η = (f/g) Q0 = −(f²/gh0) η0 sign(x).
It is convenient to define a new variable R = √(gh0)/f, which is a length scale called
the Rossby radius of deformation. This is the fundamental length scale to use in
rotating systems when gravity is involved as well. We know √(gh0) is the fastest
possible wave speed, and thus R is how far a wave can travel in one rotation period.
We rewrite our equation as
 d²η/dx² − η/R² = −(η0/R²) sign(x).
So there is still flow in this system: a flow in the y direction, into the paper.
This flow gives a Coriolis force to the right, and hence balances the pressure gradient
of the system. The final state is not one of rest, but one with motion in which the
Coriolis force balances the pressure gradient. This is geostrophic flow.
E. 15-25
Going back to our pressure maps, if we have high and low pressure systems, we can
have flows that look like the diagram on the right. Then the Coriolis force will
balance the pressure gradients.
Weather maps describe balanced flows. We can compute the scales here. In the
atmosphere, we have approximately
 R ≈ √(10 · 10³)/10⁻⁴ = 10⁶ m ≈ 1000 km.
So the scales of cyclones are approximately 1000 km. On the other hand, in the
ocean, we have
 R ≈ √(10 · 10)/10⁻⁴ = 10⁵ m = 100 km.
So ocean scales are much smaller than atmospheric scales.
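The two estimates can be reproduced with a one-line helper; the scale values g ≈ 10 m s⁻², f ≈ 10⁻⁴ s⁻¹ and the depths are the rough figures used above:

```python
import math

# Rossby radius of deformation R = sqrt(g h0) / f.
def rossby_radius(g, h0, f):
    return math.sqrt(g * h0) / f

R_atm = rossby_radius(10.0, 1.0e3, 1.0e-4)   # atmosphere: ~1000 km
R_oce = rossby_radius(10.0, 10.0, 1.0e-4)    # ocean: ~100 km
assert abs(R_atm - 1.0e6) < 1.0
assert abs(R_oce - 1.0e5) < 1.0
```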
APPENDIX A. SOME USEFUL RESULTS
Derivatives and integrals:
 d/dx sin⁻¹x = 1/√(1 − x²)      ∫ dx/√(a² − x²) = sin⁻¹(x/a)
 d/dx cos⁻¹x = −1/√(1 − x²)     ∫ dx/√(x² − a²) = cosh⁻¹(x/a)
 d/dx tan⁻¹x = 1/(1 + x²)       ∫ dx/(a² + x²) = (1/a) tan⁻¹(x/a)
 d/dx sinh⁻¹x = 1/√(1 + x²)     ∫ dx/(a² − x²) = (1/a) tanh⁻¹(x/a) = (1/2a) ln((a + x)/(a − x))
 d/dx cosh⁻¹x = 1/√(x² − 1)     ∫ dx/(x² − a²) = (1/2a) ln((x − a)/(x + a))
 d/dx tanh⁻¹x = 1/(1 − x²)      ∫ dx/(a² ± x²)^{3/2} = x/(a²√(a² ± x²))
                                ∫ dx/(x² − a²)^{3/2} = −x/(a²√(x² − a²))
 ∫ sin²(ax) dx = ∫ ½(1 − cos 2ax) dx = x/2 − sin(2ax)/(4a)
 ∫ cos²(ax) dx = ∫ ½(1 + cos 2ax) dx = x/2 + sin(2ax)/(4a)
 ∫ e^{ax}(cos bx + i sin bx) dx = e^{ax}/(a² + b²) [a cos bx + b sin bx + i(a sin bx − b cos bx)]
 ∫ xⁿ e^{mx} dx = Σ_{r=0}^{n} (−1)^r (n!/(n − r)!) x^{n−r} e^{mx}/m^{r+1}
 ∫ xⁿ cos(mx) dx = Σ_{r=1}^{⌈n/2⌉} (−1)^{r+1} (n!/(n − 2r + 1)!) x^{n−2r+1} cos(mx)/m^{2r}
                  + Σ_{r=0}^{⌊n/2⌋} (−1)^r (n!/(n − 2r)!) x^{n−2r} sin(mx)/m^{2r+1}
 ∫ xⁿ sin(mx) dx = Σ_{r=1}^{⌈n/2⌉} (−1)^{r+1} (n!/(n − 2r + 1)!) x^{n−2r+1} sin(mx)/m^{2r}
                  − Σ_{r=0}^{⌊n/2⌋} (−1)^r (n!/(n − 2r)!) x^{n−2r} cos(mx)/m^{2r+1}
 ∫₀^a sin(kπx/a) sin(nπx/a) dx = (a/2) δ_{k,n}.
 a cos(mx) + b sin(mx) = sgn(b) √(a² + b²) sin(mx + tan⁻¹(a/b))
                       = sgn(a) √(a² + b²) cos(mx − tan⁻¹(b/a))
A.2. COORDINATE SYSTEMS AND OPERATORS III
 ∇ = e_u (1/h_u) ∂/∂u + e_v (1/h_v) ∂/∂v + e_w (1/h_w) ∂/∂w

 ∇ × F = (1/(h_u h_v h_w)) det | h_u e_u   h_v e_v   h_w e_w |
                               | ∂/∂u      ∂/∂v      ∂/∂w    |
                               | h_u F_u   h_v F_v   h_w F_w |

 ∇ · F = (1/(h_u h_v h_w)) [∂(h_v h_w F_u)/∂u + ∂(h_u h_w F_v)/∂v + ∂(h_u h_v F_w)/∂w]
.............................................................................
In cylindrical polar coordinates (ρ, ϕ, z) we have
 ∇f = (∂f/∂ρ) ρ̂ + (1/ρ)(∂f/∂ϕ) ϕ̂ + (∂f/∂z) ẑ
 ∇ · F = (1/ρ) ∂(ρF_ρ)/∂ρ + (1/ρ) ∂F_ϕ/∂ϕ + ∂F_z/∂z
 ∇²f = (1/ρ) ∂/∂ρ (ρ ∂f/∂ρ) + (1/ρ²) ∂²f/∂ϕ² + ∂²f/∂z²
Laplace transforms:

 f(t)             F(p)                        p0
 c                c/p                         0
 c tⁿ             c n!/p^{n+1}                0
 sin(bt)          b/(p² + b²)                 0
 cos(bt)          p/(p² + b²)                 0
 e^{at}           1/(p − a)                   a
 tⁿ e^{at}        n!/(p − a)^{n+1}            a
 sinh(at)         a/(p² − a²)                 |a|
 cosh(at)         p/(p² − a²)                 |a|
 e^{at} sin(bt)   b/((p − a)² + b²)           a
 e^{at} cos(bt)   (p − a)/((p − a)² + b²)     a
 √t               ½√(π/p³)                    0
 1/√t             √(π/p)                      0
 δ(t − t0)        e^{−p t0}                   0
 H(t − t0)        e^{−p t0}/p                 0

Fourier transforms:

 f(x)                          F(k)
 1                             2πδ(k)
 δ(x − c)                      e^{−ick}
 −½ + H(x)                     1/(ik)
 H(x)                          πδ(k) + 1/(ik)
 e^{−αx} H(x)  (Re α > 0)      1/(ik + α)
 e^{−α|x|}  (Re α > 0)         2α/(α² + k²)
 cos(ax + b)                   π(e^{−ib} δ(k + a) + e^{ib} δ(k − a))
 sin(ax + b)                   iπ(e^{−ib} δ(k + a) − e^{ib} δ(k − a))
 rect(x/τ)                     τ sinc(τk/2)
 τ sinc(τx/2)                  2π rect(k/τ)
 (1 − 2|x|/τ) rect(x/τ)        (τ/2) sinc²(τk/4)
 (τ/2) sinc²(τx/4)             2π(1 − 2|k|/τ) rect(k/τ)

The above Laplace transforms are valid for p > p0. Here sinc(x) := sin(x)/x, rect(x/τ)
is the rectangular pulse function of width τ, H(x) is the Heaviside step function and
δ(x) is the Dirac delta function.
A.4 Distributions
Discrete distributions: (here q = 1 − p)

 Distribution                     PMF                                            Mean     Variance   PGF
 Discrete uniform U{1, …, n}      1/n,  k ∈ {1, 2, …, n}                         (n+1)/2  (n²−1)/12  (1/n) Σ_{i=1}^n zⁱ
 Bernoulli Bin(1, p)              p^k (1−p)^{1−k},  k ∈ {0, 1}                   p        p(1−p)     q + pz
 Binomial Bin(n, p)               C(n, k) p^k (1−p)^{n−k},  k ∈ {0, 1, …, n}     np       np(1−p)    (q + pz)ⁿ
 Geometric v.1                    (1−p)^k p,  k ∈ N0                             q/p      q/p²       p/(1 − qz)
 Geometric v.2                    (1−p)^{k−1} p,  k ∈ N                          1/p      q/p²       pz/(1 − qz)
 Negative binomial NegBin(n, p)   C(k−1, n−1) p^n (1−p)^{k−n},  k ∈ {n, n+1, …}  n/p      nq/p²      (pz)ⁿ/(1 − qz)ⁿ
 Poisson Poisson(λ)               (λ^k/k!) e^{−λ},  k ∈ N0                       λ        λ          e^{λ(z−1)}
Continuous distributions:

 Distribution                  PDF                                                 CDF           Mean       Variance             MGF
 Uniform U[a, b]               (1/(b−a)) I(a ≤ x ≤ b)                              (x−a)/(b−a)   (a+b)/2    (b−a)²/12            (e^{θb} − e^{θa})/(θ(b−a))
 Normal N(μ, σ²)               (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)}                      /             μ          σ²                   e^{θμ + σ²θ²/2}
 Exponential(λ)                λe^{−λx} I(x ≥ 0)                                   1 − e^{−λx}   1/λ        1/λ²                 λ/(λ−θ)
 Cauchy                        1/(π(1 + x²))                                       /             undefined  undefined            undefined
 Gamma(α, λ)                   (λ^α x^{α−1} e^{−λx}/Γ(α)) I(x ≥ 0)                 /             α/λ        α/λ²                 (λ/(λ−θ))^α for θ < λ
 Beta(a, b)                    (Γ(a+b)/(Γ(a)Γ(b))) x^{a−1}(1−x)^{b−1} I(0≤x≤1)     /             a/(a+b)    ab/((a+b)²(a+b+1))   /
 Multivariate normal Nn(μ, Σ)  e^{−½(z−μ)ᵀΣ⁻¹(z−μ)}/((2π)^{n/2}√(det Σ))          /             μ          Σ                    e^{tᵀμ + ½tᵀΣt}
R∞
• Here Γ(z) = 0 xz−1 e−x dx is the gamma function. It is such that Γ(n) = (n − 1)!
if n is a positive integer.
• For n ∈ N the χ²n distribution is the same as the Gamma(n/2, 1/2) distribution. If Y ∼
Gamma(n, λ), then 2λY ∼ χ²2n. The χ²n distribution is also the distribution of the sum
of squares of n iid standard normals N(0, 1).
Percentage points of tn
n 60.0% 66.7% 75.0% 80.0% 87.5% 90.0% 95.0% 97.5% 99.0% 99.5% 99.9%
1 0.325 0.577 1.000 1.376 2.414 3.078 6.314 12.706 31.821 63.657 318.31
2 0.289 0.500 0.816 1.061 1.604 1.886 2.920 4.303 6.965 9.925 22.327
3 0.277 0.476 0.765 0.978 1.423 1.638 2.353 3.182 4.541 5.841 10.215
4 0.271 0.464 0.741 0.941 1.344 1.533 2.132 2.776 3.747 4.604 7.173
5 0.267 0.457 0.727 0.920 1.301 1.476 2.015 2.571 3.365 4.032 5.893
6 0.265 0.453 0.718 0.906 1.273 1.440 1.943 2.447 3.143 3.707 5.208
7 0.263 0.449 0.711 0.896 1.254 1.415 1.895 2.365 2.998 3.499 4.785
8 0.262 0.447 0.706 0.889 1.240 1.397 1.860 2.306 2.896 3.355 4.501
9 0.261 0.445 0.703 0.883 1.230 1.383 1.833 2.262 2.821 3.250 4.297
10 0.260 0.444 0.700 0.879 1.221 1.372 1.812 2.228 2.764 3.169 4.144
11 0.260 0.443 0.697 0.876 1.214 1.363 1.796 2.201 2.718 3.106 4.025
12 0.259 0.442 0.695 0.873 1.209 1.356 1.782 2.179 2.681 3.055 3.930
13 0.259 0.441 0.694 0.870 1.204 1.350 1.771 2.160 2.650 3.012 3.852
14 0.258 0.440 0.692 0.868 1.200 1.345 1.761 2.145 2.624 2.977 3.787
15 0.258 0.439 0.691 0.866 1.197 1.341 1.753 2.131 2.602 2.947 3.733
16 0.258 0.439 0.690 0.865 1.194 1.337 1.746 2.120 2.583 2.921 3.686
17 0.257 0.438 0.689 0.863 1.191 1.333 1.740 2.110 2.567 2.898 3.646
18 0.257 0.438 0.688 0.862 1.189 1.330 1.734 2.101 2.552 2.878 3.610
19 0.257 0.438 0.688 0.861 1.187 1.328 1.729 2.093 2.539 2.861 3.579
20 0.257 0.437 0.687 0.860 1.185 1.325 1.725 2.086 2.528 2.845 3.552
21 0.257 0.437 0.686 0.859 1.183 1.323 1.721 2.080 2.518 2.831 3.527
22 0.256 0.437 0.686 0.858 1.182 1.321 1.717 2.074 2.508 2.819 3.505
23 0.256 0.436 0.685 0.858 1.180 1.319 1.714 2.069 2.500 2.807 3.485
24 0.256 0.436 0.685 0.857 1.179 1.318 1.711 2.064 2.492 2.797 3.467
25 0.256 0.436 0.684 0.856 1.178 1.316 1.708 2.060 2.485 2.787 3.450
26 0.256 0.436 0.684 0.856 1.177 1.315 1.706 2.056 2.479 2.779 3.435
27 0.256 0.435 0.684 0.855 1.176 1.314 1.703 2.052 2.473 2.771 3.421
28 0.256 0.435 0.683 0.855 1.175 1.313 1.701 2.048 2.467 2.763 3.408
29 0.256 0.435 0.683 0.854 1.174 1.311 1.699 2.045 2.462 2.756 3.396
30 0.256 0.435 0.683 0.854 1.173 1.310 1.697 2.042 2.457 2.750 3.385
35 0.255 0.434 0.682 0.852 1.170 1.306 1.690 2.030 2.438 2.724 3.340
40 0.255 0.434 0.681 0.851 1.167 1.303 1.684 2.021 2.423 2.704 3.307
45 0.255 0.434 0.680 0.850 1.165 1.301 1.679 2.014 2.412 2.690 3.281
50 0.255 0.433 0.679 0.849 1.164 1.299 1.676 2.009 2.403 2.678 3.261
55 0.255 0.433 0.679 0.848 1.163 1.297 1.673 2.004 2.396 2.668 3.245
60 0.254 0.433 0.679 0.848 1.162 1.296 1.671 2.000 2.390 2.660 3.232
120 0.254 0.677 1.289 1.658 1.980 2.358 2.617 3.160
∞ 0.253 0.431 0.674 0.842 1.150 1.282 1.645 1.960 2.326 2.576 3.090
A.5. STATISTICS TABLES VII
APPENDIX B. LIST OF SYMBOLS
cosec 1/sin
tan sin/cos
cot 1/tan
cosh hyperbolic function cosh
sinh hyperbolic function sinh
sech 1/cosh
cosech 1/sinh
tanh sinh/cosh
coth 1/tanh
limx→a limit as x → a
like likelihood
iid independent identically distributed
i.i.d. independent identically distributed
stab stabilizers
corr correlation coefficient
span span of
diag(?) Diagonal matrix with diagonal entries ?
Below are probable meanings for notations whose meaning is relatively less certain or
definite, ie. notations which are more easily reused for something else.
?′′ second derivative of ?
Cⁿ differentiability class
INDEX XV
value, 92
vector of decision variables, 73
vector of fitted values, 549
vector of residuals, 549
vector potential, 629
vector space, 103
Vectors, 642
velocity potential, 673
Venturi meter, 668
Vibrations of a circular membrane, 239
viscosity, 649
viscous flow, 649
volume flux, 654
volume form, 125
Vortex, 677
vortex lines, 668
vorticity, 654, 662
Vorticity equation, 671
z-test, 534
Zadunaisky device, 598
zero divisor, 364
zero-sum game, 91
Zorn’s lemma, 389