© 2002-2003 Aster, Borchers, and Thurber
Preface for Draft Versions
This textbook grew out of a course in geophysical inverse methods that has
been taught for over 10 years at New Mexico Tech, first by Rick Aster and for
the last four years by Rick Aster and Brian Borchers. The lecture notes were
first assembled into book format and used as the official text for the course in
the fall of 2001. The draft of the textbook was then used in Cliff Thurber’s
course at the University of Wisconsin during the spring semester of 2002. In
Fall, 2002, Aster, Borchers, and Thurber signed a contract with Academic Press
to produce a text by the end of 2003 for publication in 2004.
We expect that readers of this book will have some familiarity with calculus,
differential equations, linear algebra, probability and statistics at the undergrad-
uate level. Appendices A and B review the required linear algebra, probability,
and statistics. In our experience teaching this course, many students have had
weaknesses in their preparation in linear algebra or in probability and statistics.
For that reason, we typically spend the first two to three weeks of the course
reviewing this material.
Chapters one through five form the heart of the book and should be read
in sequence. Chapters six, seven, and eight are independent of each other, but
depend strongly on the material in Chapters one through five. They can be
covered in any order. Chapters nine and ten are independent of Chapters six,
seven, and eight, but should be covered in sequence. Chapter 11 is independent
of Chapters six through ten.
If significant time for review of linear algebra, probability, and statistics is
taken, then there will not be time to cover the entire book in one semester.
However, it should be possible to cover the majority of the material by skipping
one or two of the chapters after Chapter five.
The text is a work in progress, but already contains much usable material
that should be of interest to a large audience. We and Academic Press feel that
the final content, presentation, examples, index terms, references, and other
features will benefit substantially from having a large number of instructors,
students, and researchers use, and provide feedback on, draft versions. Such
comments and corrections are greatly appreciated, and significant contributions
will be acknowledged in the book.
Contents
1 Introduction 1-1
1.1 Classification of Inverse Problems 1-1
1.2 Examples of Parameter Estimation Problems 1-3
1.3 Examples of Inverse Problems 1-7
1.4 Why Inverse Problems Are Hard 1-11
1.5 Exercises 1-14
1.6 Notes and Further Reading 1-15
Chapter 1
Introduction
G(m) = d . (1.1)
may be coefficients or other constants in a functional relationship that describes a physical process. We can express the parameters as a vector m. Similarly, if there are only a finite number of data collected, then we can express the data as a vector d. Such problems are called discrete inverse problems or parameter estimation problems. A parameter estimation problem can be written as a nonlinear system of equations
G(m) = d . (1.2)
In other cases, where m and d are functions of time and space, we cannot
simply express them as vectors. Such problems are called continuous in-
verse problems. For convenience, we will refer to discrete inverse problems as
parameter estimation problems, and reserve the phrase “inverse problem” for
continuous inverse problems.
A central theme of this book is that continuous inverse problems can of-
ten be approximated by parameter estimation problems with a large number of
parameters. This process of discretizing a continuous inverse problem is dis-
cussed in Chapter 3. By solving the discretized parameter estimation problem,
we can avoid some of the mathematical difficulties associated with continuous
inverse problems. However, it happens that the parameter estimation prob-
lems that arise from discretization of continuous inverse problems still require
sophisticated solution techniques.
A class of mathematical models for which many useful results exist are linear systems, which obey superposition
$$G(m_1 + m_2) = G(m_1) + G(m_2) \quad (1.3)$$
and scaling
$$G(\alpha m) = \alpha G(m) \; . \quad (1.4)$$
In the case of a discrete linear inverse problem, (1.2) can always be written in the form of a linear system of algebraic equations
$$G(m) = Gm = d \; . \quad (1.5)$$
and
$$\int_a^b g(s, x)\, \alpha m(x)\, dx = \alpha \int_a^b g(s, x)\, m(x)\, dx \; .$$
Equations of the form of (1.6) where m(x) is the unknown are called Fredholm integral equations of the first kind (IFK). IFKs arise in a surprisingly large number of inverse problems. Unfortunately, they can also have mathematical properties that make it difficult to obtain useful solutions. In some cases, g(s, x) depends only on the difference s − x, and we have a convolution equation
$$d(s) = \int_{-\infty}^{\infty} g(s - x)\, m(x)\, dx \; . \quad (1.7)$$
to obtain φ(x). Although there are many tables and analytic methods of ob-
taining Fourier transforms and their inverses, it is sometimes desirable to obtain
numerical estimates of φ(x), such as where there is no analytic inverse, or where
we wish to find φ(x) from spectral data at a discrete collection of frequencies.
It is an intriguing question why linearity appears in many interesting geo-
physical problems. Many physical systems encountered in practice are linear to
a very high degree because only small departures from equilibrium occur. An
important example from geophysics is that of seismic wave propagation, where
the stresses caused by the propagation of seismic disturbances are usually very
small relative to the magnitudes of the elastic moduli. This situation leads to
small strains and a very nearly linear stress-strain relationship. Because of this,
many seismic wave field problems obey superposition and scaling. Other physi-
cal systems such as gravity and magnetic fields at the strengths encountered in
geophysics have inherently linear physics.
Because many important inverse problems are linear, Chapters 2 through
8 of this book deal with methods for the solution of linear inverse problems.
We discuss methods for nonlinear parameter estimation and inverse problems
in Chapters 9 and 10.
Example 1.1
An archetypical example of a linear parameter estimation problem
is the fitting of a function to a data set via linear regression. In
an ancient example, the observed data, y, are altitude observations
of a ballistic body observed at a set of times t (Figure 1.1), and we
wish to solve for a model, m, giving the initial altitude (m1), initial vertical velocity (m2), and acceleration (m3) in the quadratic relationship
$$y(t) = m_1 + m_2 t - \frac{1}{2} m_3 t^2 \; . \quad (1.8)$$
The data and model are related by the linear system
$$\begin{bmatrix} 1 & t_1 & -\frac{1}{2}t_1^2 \\ 1 & t_2 & -\frac{1}{2}t_2^2 \\ 1 & t_3 & -\frac{1}{2}t_3^2 \\ \vdots & \vdots & \vdots \\ 1 & t_m & -\frac{1}{2}t_m^2 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix} \; . \quad (1.9)$$
Even though the functional relationship (1.8) is quadratic, the equa-
tions for the three parameters are linear, so this is a linear parameter
estimation problem.
Because there are more (known) data than (unknown) model parameters in
(1.9), and because our forward modeling of the physics may be approximate the
However, (1.10) is not always the best misfit measure to use in solving para-
metric systems. Another misfit measure that we may wish to optimize instead,
and which is decidedly better in many situations, is the 1-norm
$$\|y - Gm\|_1 = \sum_{i=1}^{m} |y_i - (Gm)_i| \; . \quad (1.11)$$
Methods for linear parameter estimation using both of these misfit measures
are discussed in Chapter 2.
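As a minimal illustration (with made-up numbers, not from the text), the two misfit measures can be compared in MATLAB as follows.

G = [ones(5,1), (1:5)'];       % hypothetical straight-line fitting matrix
y = [1.1; 2.9; 5.2; 7.1; 8.8]; % invented observations
m = [-1; 2];                   % a candidate model
r = y - G*m;                   % residual vector
misfit2 = norm(r, 2)           % 2-norm misfit
misfit1 = norm(r, 1)           % 1-norm misfit, as in (1.11)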
Example 1.2
$$m = [x, \tau]^T \; , \quad (1.12)$$
$$d = [t_1, t_2, \ldots, t_m]^T \quad (1.13)$$
$$\begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ \vdots \\ t_m \end{bmatrix} = G((x, \tau);\, W) \; ,$$
where W is a matrix where the ith row, Wi,· gives the spatial co-
ordinates of the ith station. G returns arrival–times for the network
of seismic stations for a given source location and time by forward
modeling arrival times through some seismic velocity structure, v(x).
This problem is nonlinear even if the seismic velocity is a constant, v.
In this case, the ray paths in Figure 1.2 are straight. Furthermore,
the phase arrival time at station i located at position $W_{i,\cdot}$ is
$$t_i = \frac{\|W_{i,\cdot}^T - x\|_2}{v} + \tau \; . \quad (1.14)$$
Because the forward model (1.14) is nonlinear, there is no general way to write the relationship between the data, the hypocenter model m, and G as a linear system of equations.
Nonlinear parameter estimation problems are frequently solved by
guessing a starting solution (which is hopefully close to the true
Example 1.3
In vertical seismic profiling we wish to know the vertical seis-
mic velocity of the Earth surrounding a borehole. A downward-
propagating seismic disturbance is generated at the surface by a
source and seismic waves are sensed by a string of seismometers in
the borehole (Figure 1.3).
The arrival time of the wavefront at each instrument is measured
from the recorded seismograms. As in the tomography Example 1.3,
Example 1.4
where Γ is the gravitational constant. Note that the kernel has the
form g(x−s). Because the kernel depends only on x−s, this forward
problem is also an example of convolution. Because g(x) is a smooth
function, d(s) will be a smoothed version of m(x), and inverse solu-
tions for m(x) will consequently be some roughened transformation
of d(x).
Example 1.5
Figure 1.4: A simple linear inverse problem; find ∆ρ(x) from gravity anomaly
observations, d(x).
Figure 1.5: A simple nonlinear inverse problem; find h(x) from gravity anomaly
observations, d(x).
1.5). In this case, the data are still, of course, just given by the su-
perposition of the contributions to the gravitational anomaly field,
but the forward problem is now
$$d(s) = \Gamma \int_{-\infty}^{\infty} \frac{m(x)}{\left((x - s)^2 + m^2(x)\right)^{3/2}}\, \Delta\rho\, dx \; .$$
Example 1.6
The physical model for tomography in its most basic form assumes
that geometric ray theory (essentially a high frequency limit to the
wave equation) is valid for the problem under consideration, so that
we can consider electromagnetic or elastic energy to be propagating
along ray paths. The density of ray path coverage in a tomographic
problem may vary significantly throughout the model and thus pro-
vide much better information on the physical properties of interest
in regions of denser ray coverage while providing little or no infor-
mation in other regions.
$$t = \int_{\ell} s(x)\, d\ell \; .$$
In general, the ray paths l can bend due to refraction effects. How-
ever, if refraction effects are negligible, then it may be feasible to
approximate the ray paths with straight lines.
models that adequately fit the data. It is essential to characterize just what solution has been obtained, how "good" it is in terms of physical plausibility and fit to the data, and perhaps how consistent it is with other constraints. Some important issues that must be considered include solution existence, solution uniqueness, and instability of the solution process.
1. Existence. There may be no model that exactly fits a given data set if our mathematical model of the system's physics is approximate and/or the data contains noise.
2. Uniqueness. If exact solutions do exist, they may not be unique, even for
an infinite number of exact data. This can easily occur in potential field
problems. A trivial example is the case of the external gravitational field
from a spherically-symmetric mass distribution of unknown density, where
we can only determine the total mass, not the dimensions and density of
the body. Non-uniqueness is also a characteristic of rank deficient linear
problems because the matrix G can have a nontrivial null space.
One manifestation of this is that a model may fit a finite number of data
points acceptably, but predict wild and physically unrealistic values where
we have no observations. In linear parameter estimation problems, models
that lie in the null space of G, m0 , are solutions to Gm0 = 0, and hence fit
d = 0 in the forward problem exactly. These null space models can be
added to any particular model that satisfies (1.5), and not change the fit
to the data. The result is an infinite number of mathematically acceptable
models.
In practical terms, suppose that there exists a nonzero model m which
results in an instrument reading of zero. One cannot discriminate this
situation from the situation where m is really zero.
An important and thorny issue with problems that have nonunique solu-
tions is that an estimated model may be significantly smoothed relative to
the true situation. Characterizing this smoothing is essential to interpret-
ing models in terms of their likely correspondence to reality. This analysis
falls under the general topic of model resolution analysis.
g(s, x) = 1
Clearly, since the left hand side of (1.17) is independent of s, this system has
no solution unless y(s) is a constant. Thus, there are an infinite number of
mathematically conceivable data sets y(s), which are not constant and where
no exact solution exists. This is an existence issue; there are data sets for which
there are no exact solutions.
Conversely, where a solution to (1.17) does exist, the solution is nonunique,
since there are an infinite number of functions which integrate over the unit
interval to produce the same constant. This is a uniqueness issue; there are an
infinite number of models that satisfy the IFK exactly.
A more subtle example of nonuniqueness is shown by the equation
$$\int_0^1 g(s, x)\, m(x)\, dx = \int_0^1 s \cdot \sin(\pi x)\, m(x)\, dx = y(s) \; . \quad (1.18)$$
to any model solution, m(x), and obtain a new model that fits the data just as well because
$$\int_0^1 s \cdot \sin(\pi x)\,(m(x) + m_0(x))\, dx = \int_0^1 s \cdot \sin(\pi x)\, m(x)\, dx + \int_0^1 s \cdot \sin(\pi x)\, m_0(x)\, dx$$
$$= \int_0^1 s \cdot \sin(\pi x)\, m(x)\, dx + 0 \; .$$
Equation (1.18) thus allows an infinite range of very different solutions which
fit the data equally well.
Even if we do not encounter existence or uniqueness issues, it can be shown that instability is a fundamental feature of IFKs, because
$$\lim_{n \to \infty} \int_{-\infty}^{\infty} g(s, t) \sin(n\pi t)\, dt = 0 \quad (1.20)$$
1.5 Exercises
1. Consider a mathematical model (1.1) of the form G(m) = d, where m is
a vector of size n, and d is a vector of size m. Suppose that the model
obeys the superposition and scaling laws and is thus linear. Show that
G(m) can be written in the form
G(m) = Γm
Chapter 2
Linear Regression
One commonly used measure of the misfit is the 2–norm of the residual. The model that minimizes the 2–norm of the residual is called the least squares solution. The least squares or 2–norm solution is of special interest both because it is very amenable to analysis and geometric intuition, and because it turns out to be statistically the most likely solution if data errors are normally distributed.
A typical example of a linear parameter estimation problem is linear regression to a line, where we seek a model of the form y = m1 + m2 x which
best fits a set of m > 2 data points. We seek a minimum 2–norm solution for
$$Gm = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_m \end{bmatrix} = d \quad (2.3)$$
The 2–norm solution for m is, from the normal equations (A.7),
$$m_{L2} = (G^T G)^{-1} G^T d \; . \quad (2.4)$$
It can be shown that if G is of full rank then $(G^T G)^{-1}$ always exists.
Equation (2.4) is a general 2–norm minimizing solution for any full-rank
linear parameter estimation problem. Returning to our example of fitting a
straight line to m data points, we can now write a closed–form solution for this
particular problem as
model, as parameterized by the n elements in the vector m, for a set of m observations in the vector d. We will assume that the observations are statistically independent so that we can use a product form joint density function.
Given a model m, we have a probability density function f_i(d_i, m) for the ith observation. In general, this probability distribution will vary, depending on m. The joint probability density for a vector of observations d will then be
$$f(d, m) = \prod_{i=1}^{m} f_i(d_i, m) \; . \quad (2.6)$$
Note that the values of f are probability densities not probabilities. We can
only compute the probability of getting data in some range given a model m
by computing an integral of f (d, m) over that range. In fact, the probability of
getting any particular data set is precisely 0!
We can get around this problem by considering the probability of getting
a data set in a small box around a particular data set d. This probability
is approximately proportional to the probability density f (d, m). For many
possible models m, the probability density will be quite small. Such models
would be relatively unlikely to result in the data d. For other models, the
probability density might be much larger. These models would be relatively
likely to result in the data d.
For this reason, (2.6) is called the likelihood function. According to the
maximum likelihood principle we should select the model m that maximizes
the likelihood function. Estimates of the model obtained by following the max-
imum likelihood principle have many desirable properties [CB02, DS98]. It is
particularly interesting that when we have a discrete linear inverse problem and
the data errors are independent and normally distributed, then the maximum
likelihood principle leads us to the least squares solution.
Suppose that our data have independent random errors which are normally
distributed with expected value zero. Let the standard deviation of the ith
observation be σi . The likelihood for the ith observation is
$$f_i(d_i, m) = \frac{1}{(2\pi)^{1/2} \sigma_i}\, e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} \; . \quad (2.7)$$
Considering the entire data set, the likelihood function is the product of the individual likelihoods
$$f(d, m) = \frac{1}{(2\pi)^{m/2}\, \sigma_1 \cdots \sigma_m} \prod_{i=1}^{m} e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} \; . \quad (2.8)$$
The constant factor does not affect the maximization of f, so we can solve the maximization problem
$$\max_m \prod_{i=1}^{m} e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} \; . \quad (2.9)$$
Aside from the constant factors of 1/σi2 , this is the least squares problem for
Gm = d.
To incorporate the standard deviations, we scale the system of equations.
Let
W = diag(1/σ1 , 1/σ2, . . . , 1/σm ). (2.13)
Then let
Gw = WG (2.14)
and
dw = Wd . (2.15)
The weighted system of equations is
Gw m = dw . (2.16)
Now,
$$\|d_w - G_w m_w\|_2^2 = \sum_{i=1}^{m} (d_i - (Gm_w)_i)^2 / \sigma_i^2 \; . \quad (2.18)$$
Thus the least squares solution to Gw m = dw is the maximum likelihood solu-
tion.
The sum of squares also provides important statistical information about
the quality of our estimate of the model. Let
$$\chi^2_{\mathrm{obs}} = \sum_{i=1}^{m} (d_i - (Gm_w)_i)^2 / \sigma_i^2 \; . \quad (2.19)$$
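A minimal MATLAB sketch of the weighting and goodness-of-fit computation described above follows; it assumes that G, d, and a vector sigma of data standard deviations are already defined, and the p-value at the end uses the Statistics Toolbox function chi2cdf.

W = diag(1 ./ sigma);              % weighting matrix (2.13)
Gw = W * G;                        % weighted system matrix (2.14)
dw = W * d;                        % weighted data vector (2.15)
mw = (Gw' * Gw) \ (Gw' * dw);      % maximum likelihood (least squares) solution
chi2obs = norm(dw - Gw * mw)^2;    % observed chi-square value (2.19)
nu = length(d) - length(mw);       % degrees of freedom
p = 1 - chi2cdf(chi2obs, nu);      % p-value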
1. The p–value is not too small and not too large. Our least squares solution produces an acceptable data fit, and our assumptions are consistent with the data. In general use, p does not actually have to be very large to be deemed marginally "acceptable" in many cases (e.g., p ≈ $10^{-2}$), as truly "wrong" models (see below) will typically produce extremely small p–values (e.g., $10^{-12}$) because of the short-tailed nature of the normal distribution.
2. The p–value is very small. We are faced with three nonexclusive possibil-
ities, but something is clearly wrong.
3. The p–value is very large (very close to one). The fit of the forward model
to the data is almost exact. We should investigate the possibility that we
have overestimated the data errors. A more sinister possibility is that a
very high p–value is indicative of data fraud, such as might happen if data
were cooked-up ahead of time to fit a particular model.
A rule of thumb for problems with a large number of degrees of freedom is that the observed value of χ² should be close to ν. This arises because, as the central limit theorem predicts, the χ² random variable itself becomes approximately normally distributed with mean ν and standard deviation (2ν)^{1/2} for large ν.
In addition to examining the χ2obs statistic, it is always a good idea to
examine the residuals rw = Gw mw − dw . The residuals should be roughly
normally distributed with standard deviation one and no obvious patterns. In
some cases where an incorrect model has been fitted to the data, the residuals
will reveal the nature of the modeling error. For example, in a linear regression
to a straight line, it might be that all of the residuals are negative for small
values of the independent variable t and then positive for larger values of t.
This would indicate that perhaps an additional quadratic term is needed in the
regression model.
If we have two alternative models that we have fit to the data, it can be important to compare the models to determine whether or not one model fits the data significantly better than the other model. The F distribution (Appendix B) can be used to compare the fit of two models to assess whether or not a statistically significant residual reduction has occurred [DS98].
The respective χ²_obs values of the two models have χ² distributions with their respective degrees of freedom, ν₁ and ν₂. It can be shown that under our statistical assumptions, the ratio
$$R = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} = \frac{\chi_1^2\, \nu_2}{\chi_2^2\, \nu_1} \quad (2.22)$$
will have an F distribution with parameters ν₁ and ν₂ [DS98].
If the observed value of the ratio, R_obs, is extremely large or small, this indicates that one of the models fits the data significantly better than the other at a given confidence level. If R_obs is greater than the 95th percentile of the F distribution with parameters ν₁ and ν₂, then we can conclude that the second model has a statistically significant improvement in data fit over the first model.
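A short sketch of this comparison in MATLAB, assuming the two observed chi-square values and their degrees of freedom are available as chi2_1, nu1, chi2_2, and nu2:

Robs = (chi2_1 / nu1) / (chi2_2 / nu2);   % observed ratio (2.22)
Fcrit = finv(0.95, nu1, nu2);             % 95th percentile of the F distribution
better = Robs > Fcrit;                    % true if the second model fits significantly better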
Parameter estimates in linear regression are constructed of linear combina-
tions of independent data with normal errors. Because a linear combination of
normally distributed random variables is normally distributed (Appendix B),
the model parameters will also be normally distributed. Note that, even though
the data errors are uncorrelated, the covariance matrix of the model parameters
will generally have nonzero off-diagonal terms.
To derive the mapping between data and model covariances, consider the covariance of an m-length data vector, d, of normally distributed, independent random variables operated on by a general linear transformation specified by an n by m matrix, A. From Appendix B, we know that
$$\mathrm{Cov}(Ad) = A\, \mathrm{Cov}(d)\, A^T \; . \quad (2.23)$$
The least squares solution (2.17) has $A = (G_w^T G_w)^{-1} G_w^T$. Since the weighted data have an identity covariance matrix, the covariance for the model parameters is
$$C = (G_w^T G_w)^{-1} G_w^T I_m G_w (G_w^T G_w)^{-1} = (G_w^T G_w)^{-1} \; . \quad (2.24)$$
In the case of equal and uncorrelated normal data errors, so that the data covariance matrix Cov(d) is simply the variance σ² times the m by m identity matrix, I_m, (2.24) simplifies to
$$C = \sigma^2 (G^T G)^{-1} \; . \quad (2.25)$$
$$G^T G\, m_{\mathrm{true}} = G^T d_{\mathrm{true}} \; .$$
Thus
$$E[m_{L2}] = m_{\mathrm{true}} \; .$$
In statistical terms, the least squares solution is said to be unbiased.
We can compute 95% confidence intervals for the individual model parameters using the fact that each model parameter m_i has a normal distribution with mean $(m_{\mathrm{true}})_i$ and variance $\mathrm{Cov}(m_{L2})_{i,i}$. The 95% confidence intervals are given by
$$m_{L2} \pm 1.96 \cdot \mathrm{diag}(\mathrm{Cov}(m_{L2}))^{1/2} \; , \quad (2.26)$$
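In MATLAB, with a model estimate mL2 and covariance matrix C in hand, these intervals can be formed directly (a minimal sketch with assumed variable names):

halfwidth = 1.96 * sqrt(diag(C));          % 95% half-widths from the covariance diagonal
ci = [mL2 - halfwidth, mL2 + halfwidth];   % confidence interval endpoints, as in (2.26)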
Example 2.1
$$G_{i,\cdot} = [1, \; t_i, \; -\tfrac{1}{2} t_i^2]$$
so that
$$G = \begin{bmatrix} 1 & 1 & -0.5 \\ 1 & 2 & -2.0 \\ 1 & 3 & -4.5 \\ 1 & 4 & -8.0 \\ 1 & 5 & -12.5 \\ 1 & 6 & -18.0 \\ 1 & 7 & -24.5 \\ 1 & 8 & -32.0 \\ 1 & 9 & -40.5 \\ 1 & 10 & -50.0 \end{bmatrix} \; .$$
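This G can be constructed directly in MATLAB for the ten observation times t = 1, 2, . . ., 10 s (a minimal sketch):

t = (1:10)';                        % observation times in seconds
G = [ones(10,1), t, -0.5*t.^2];     % columns correspond to m1, m2, and m3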
Figure 2.2: Data and model predictions for the ballistics example.
Figure 2.2 shows the observed data and the fitted curve.
m_{L2} has a corresponding model covariance matrix, from (2.25), of
$$C = \begin{bmatrix} 88.53 & -33.60 & -5.33 \\ -33.60 & 15.44 & 2.67 \\ -5.33 & 2.67 & 0.48 \end{bmatrix} \; . \quad (2.30)$$
If we consider combinations of model parameters, the situation becomes more complex. To characterize model uncertainty more effectively, we can examine
95% confidence regions for pairs or larger collections of parameters. When
confidence regions for parameters considered jointly are projected onto model
coordinate axes, we obtain intervals for individual parameters which may be
significantly larger than parameter confidence intervals considered individually.
For a vector of estimated model parameters that have an n-dimensional
MVN distribution with mean mtrue and covariance matrix C, the quantity
$$C^{-1} = P^T \Lambda P \; , \quad (2.33)$$
$$\rho_{m_i, m_j} = \frac{\mathrm{Cov}(m_i, m_j)}{\sqrt{\mathrm{Var}(m_i) \cdot \mathrm{Var}(m_j)}}$$
which give
$$\rho_{m_1, m_2} = -0.91 \; , \quad \rho_{m_1, m_3} = -0.81 \; , \quad \rho_{m_2, m_3} = 0.97 \; .$$
This shows that the three model parameters are highly statistically
dependent, and that the error ellipsoid is thus very elliptical.
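These correlations follow directly from the covariance matrix; a minimal MATLAB sketch, assuming the covariance matrix C from (2.30) is available:

s = sqrt(diag(C));       % parameter standard deviations
rho = C ./ (s * s');     % correlation matrix; rho(1,2) is rho_{m1,m2}, etc.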
Application of (2.33) for this example shows that the directions of the error ellipsoid principal axes are
$$P = [P_{\cdot,1}, P_{\cdot,2}, P_{\cdot,3}] \approx \begin{bmatrix} -0.03 & -0.93 & 0.36 \\ -0.23 & 0.36 & 0.90 \\ 0.97 & 0.06 & 0.23 \end{bmatrix} \; ,$$
where $\sqrt{F_{\chi^2}^{-1}(0.95)} \approx 2.80$, and $F_{\chi^2}^{-1}(0.95) \approx 7.81$ is the 95th percentile of the χ² distribution with three degrees of freedom.
Projecting the 95% confidence ellipsoid axes back into the (m1 , m2 , m3 )
coordinate system (Figure 2.3) we obtain 95% confidence intervals
for the parameters considered jointly
which are about 40% broader than the single parameter confidence
estimates obtained using only the diagonal covariance matrix terms
(2.31). Note that there is actually a greater than 95% probability that the box in (2.34) will include the true values of the parameters. The reason is that the rectangular region defined by these intervals, considered together, includes many points which lie outside of the 95% confidence ellipsoid.
Figure 2.3: Projections of the 95% error ellipsoid onto model axes.
As you might expect, there is a statistical cost associated with our not
knowing the true standard deviation. If the data standard deviations are known
Example 2.3
Gm = y . (2.38)
y = −1.03 + 10.09x .
Figure 2.4 shows the data and the linear regression line. Our es-
timate of the standard deviation of the measurement errors is s =
30.74. Thus the estimated covariance matrix for the fit parameters is
$$C = s^2 (G^T G)^{-1} = \begin{bmatrix} 338.24 & -4.93 \\ -4.93 & 0.08 \end{bmatrix}$$
and
$$m_2 = 10.09 \pm t_{n-2,\,0.975} \sqrt{0.08} = 10.09 \pm 0.59 \; .$$
Gw mw = yw .
y = −12.24 + 10.25x .
m1 = −12.24 ± 22.39
and
m2 = 10.25 ± 0.47 .
Figure 2.6 shows the data and least squares fit. Figure 2.7 shows
the scaled residuals. Notice that this time there is no trend in the
magnitude of the residuals as x and y increase. The estimated stan-
dard deviation is 0.045, or 4.5%. In fact, these data were generated
according to y = 10x + 0, with standard deviations for the measure-
ment errors at 5% of the y value.
Figure 2.4: Data and linear regression line.
Figure 2.5: Residuals for the unweighted least squares fit.
Figure 2.6: Data and weighted least squares fit.
Figure 2.7: Scaled residuals for the weighted least squares fit.
rather than minimizing the 2–norm. The 1–norm solution, mL1 , will be more
outlier resistant, or robust, than the least squares solution, mL2 , because
(2.40) does not square each of the terms in the misfit measure, as (2.12) does.
The 1–norm solution mL1 also has a maximum likelihood interpretation; it
is the maximum likelihood estimator for data with errors distributed according
to a 2-sided exponential distribution
$$f(x) = \frac{1}{2\sigma}\, e^{-|x - \mu|/\sigma} \; . \quad (2.41)$$
Data sets distributed as (2.41) are unusual. Nevertheless, it is still often worthwhile to find a solution where (2.40) is minimized rather than (2.12), even if measurement errors are normally distributed, because of the all-too-frequent occurrence of incorrect measurements that do not follow the normal distribution.
Example 2.4
It is easy to demonstrate the advantages of 1–norm minimization
using the quadratic regression example discussed earlier. Figure 2.8
shows the original sequence of independent data points with unit
standard deviations, where one of the points (number 4) is clearly
an outlier if a model of the form (2.27) is appropriate (it is the
original data with 200 m subtracted from it). The data prediction
using the 2–norm minimizing solution for this data set,
mL2 = [26.4 m, 75.6 m/s, 4.9 m/s2 ]T , (2.42)
Figure 2.8: Elevation versus time data with an outlier, and the 2-norm (lower curve) and 1-norm (upper curve) model predictions.
(the lower of the two curves) is clearly skewed away from the major-
ity of the data points due to its efforts to accommodate the outlier
data point and thus minimize χ2 , and it is a poor estimate of the
true model (2.28). Even without a graphical view of the data fit, we can note immediately that (2.42) fails to fit the data acceptably because of the huge χ² value of ≈ 1109. This is clearly astronomically out of bounds for a problem with 7 degrees of freedom, where the value of χ² should not be far from 7. The corresponding p–value for this huge χ² value is effectively zero.
The upper data prediction is obtained using the 1–norm solution,
The derivative is 0 when exactly half of the data are less than m and half of
the data are greater than m. Of course, this can only happen when there are
an even number of observations. In this case, any value of m between the two
middle observations is a 1–norm solution. When there are an odd number of
data, the median data point is the unique 1–norm solution. Even an extreme
outlier will not have a large effect on the median of the data. This is the reason
for the robustness of the 1–norm solution.
The general problem of minimizing kd − Gmk1 is somewhat more compli-
cated. One practical way to find 1–norm solutions is iteratively reweighted
least squares, or IRLS [SGT88]. The IRLS algorithm solves a sequence of
weighted least squares problems whose solutions converge to a 1–norm minimiz-
ing solution.
Let
r = Gm − d (2.48)
We want to minimize
m
X
f (m) = krk1 = |ri |. (2.49)
i=1
$$\frac{\partial f(m)}{\partial m_k} = \sum_{i=1}^{m} G_{i,k}\, \mathrm{sgn}(r_i) \; . \quad (2.51)$$
Thus
$$\nabla f(m) = G^T R\, r = G^T R\, (Gm - d) \; , \quad (2.53)$$
where R is a diagonal weighting matrix with $R_{i,i} = 1/|r_i|$. To find the 1–norm minimizing solution, we solve ∇f(m) = 0:
$$G^T R G\, m = G^T R\, d \; . \quad (2.55)$$
Since R depends on m, this is a nonlinear system of equations which we cannot solve directly. Fortunately, we can use a simple iterative scheme to find the appropriate weights. The IRLS algorithm begins with the least squares solution $m^0 = m_{L2}$. We calculate the initial residual vector $r^0 = d - Gm^0$. We then solve (2.55) to obtain a new model $m^1$ and residual vector $r^1$. The process is repeated until the model and residual vector stabilize. A typical rule is to stop the iteration when
$$\frac{\|m^{k+1} - m^k\|_2}{1 + \|m^{k+1}\|_2} < \tau \quad (2.56)$$
for some tolerance τ.
The procedure will fail if any element of the residual vector becomes zero. A simple modification to the algorithm deals with this problem. We select a tolerance ε below which we consider the residuals to be effectively zero. If $|r_i| < \varepsilon$, then we set $R_{i,i} = 1/\varepsilon$. With this modification it can be shown that this procedure will always converge to an approximate 1–norm minimizing solution.
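A compact MATLAB sketch of the IRLS iteration described above, assuming G, d, a stopping tolerance tau, and a residual floor epsilon are given:

m = (G' * G) \ (G' * d);                 % start from the least squares solution
for iter = 1:100                         % limit the number of iterations
    r = G * m - d;                       % residual vector (2.48)
    r(abs(r) < epsilon) = epsilon;       % guard against zero residuals
    R = diag(1 ./ abs(r));               % diagonal weighting matrix
    mnew = (G' * R * G) \ (G' * R * d);  % solve the weighted system (2.55)
    if norm(mnew - m) / (1 + norm(mnew)) < tau   % stopping rule (2.56)
        m = mnew;
        break
    end
    m = mnew;
end
mL1 = m;                                 % approximate 1-norm minimizing solution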
As with the χ² misfit measure, there is a corresponding p-value that can be used under the assumption of normal data errors for 1–norm solutions [PM80]. For a 1–norm misfit measure given by (2.40) with ν degrees of freedom, the probability that a worse misfit than that observed, $\mu^{(1)}_{\mathrm{obs}}$, could have occurred given independent normally-distributed data is approximately given in terms of the quantities
$$\sigma_1 = \left((1 - 2/\pi)\,\nu\right)^{1/2} \; ,$$
$$\gamma = \frac{2 - \pi/2}{(\pi/2 - 1)^{3/2}}\, \frac{1}{\nu^{1/2}} \; ,$$
$$Z^{(2)}(x) = \frac{x^2 - 1}{\sqrt{2\pi}}\, e^{-x^2/2} \; ,$$
$$x = \frac{\mu^{(1)} - \bar{\mu}}{\sigma_1} \; ,$$
and
$$\bar{\mu} = \sqrt{2/\pi}\; \nu \; .$$
$$Gm_{L1} = \hat{d} \; .$$
We next resolve the IRLS problem many (q) times for 1–norm models corresponding to independent data realizations, to obtain a suite of 1–norm solutions to
$$Gm_{L1,\,i} = \hat{d} + n_i \; ,$$
where $n_i$ is the ith noise vector. If A is the q by n matrix where the ith row contains the difference between the ith model estimate and the average model, then
$$\mathrm{Cov}(m_{L1}) = \frac{A^T A}{q} \; .$$
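A sketch of this procedure in MATLAB; the function irls used here is a hypothetical helper implementing the IRLS algorithm of the previous section, and sigma is assumed to hold the data standard deviations:

dhat = G * mL1;                          % baseline predicted data
q = 1000;                                % number of noise realizations
M = zeros(q, length(mL1));
for i = 1:q
    ni = sigma .* randn(size(dhat));     % independent normal noise vector
    M(i,:) = irls(G, dhat + ni)';        % 1-norm solution for this realization (hypothetical irls)
end
A = M - repmat(mean(M), q, 1);           % differences from the average model
covL1 = (A' * A) / q;                    % Monte Carlo covariance estimate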
Example 2.5
Recall Example 2.4. An estimate of Cov(m_{L1}) for this example using
10,000 iterations of the Monte Carlo procedure is
$$\mathrm{Cov}(m_{L1}) = \begin{bmatrix} 123.92 & -46.96 & -7.43 \\ -46.96 & 15.44 & 3.72 \\ -7.43 & 3.72 & 0.48 \end{bmatrix} \approx 1.4 \cdot \mathrm{Cov}(m_{L2}) \; .$$
Although we have no reason to believe that the model parameters
will be normally distributed, we can still compute approximate 95%
confidence intervals for the parameters by simply acting as though
they were normally distributed. We obtain
mL1 = [17.6 ± 21.8 m, 96.4 ± 7.70 m/s, 9.3 ± 1.4 m/s2 ]T . (2.58)
2.5 Exercises
1. A seismic profiling experiment is performed where the first arrival times
of seismic energy from a mid-crustal refractor are made at distances (km)
of
$$x = \begin{bmatrix} 6.0000 \\ 10.1333 \\ 14.2667 \\ 18.4000 \\ 22.5333 \\ 26.6667 \end{bmatrix}$$
from the source, and are found to be (in seconds after the source origin
time)
$$t = \begin{bmatrix} 3.4935 \\ 4.2853 \\ 5.1374 \\ 5.8181 \\ 6.8632 \\ 8.1841 \end{bmatrix} \; .$$
The model for this arrival time pattern is a simple two-layer, flat-lying structure which predicts that
$$t_i = t_0 + s_2 x_i \; ,$$
where the intercept time, t_0, depends on the thickness and slowness of the upper layer, and s_2 is the slowness of the lower layer. The estimated noise is believed to be uncorrelated and normally distributed, with expected value 0 and standard deviation 0.1 s (i.e., σ = 0.1 s).
(a) Find the least squares solution for the two model parameters t_0 and s_2. Plot the data predictions from your model relative to the true data.
(b) Calculate and comment on the parameter correlation matrix. How
will the correlation entries be reflected in the appearance of the error
ellipsoid?
(c) Plot the error ellipsoid in the (t0 , s2 ) plane and calculate conservative
95% confidence intervals for t0 and s2 .
Hint: The following MATLAB code will plot a 2-dimensional covariance ellipse, where covm is the covariance matrix and m is the 2-vector of model parameters.
%scale factor for the 95% confidence ellipse with 2 degrees of freedom;
%delta was not defined in the original hint and is assumed here
delta=sqrt(chi2inv(0.95,2));
%diagonalize the inverse of the covariance matrix
[u,lam]=eig(inv(covm));
%generate a vector of angles from 0 to 2*pi
theta=(0:.01:2*pi)';
%calculate the x component of the ellipsoid for all angles
r(:,1)=(delta/sqrt(lam(1,1)))*u(1,1)*cos(theta)+...
 (delta/sqrt(lam(2,2)))*u(1,2)*sin(theta);
%calculate the y component of the ellipsoid for all angles
r(:,2)=(delta/sqrt(lam(1,1)))*u(2,1)*cos(theta)+...
 (delta/sqrt(lam(2,2)))*u(2,2)*sin(theta);
%plot the ellipse, adding in the model parameters
plot(m(1)+r(:,1),m(2)+r(:,2))
(d) Evaluate the p–value for this model (you may find the MATLAB
Statistics Toolbox function chi2cdf to be useful here).
(e) Evaluate the value of χ2 for 1000 Monte Carlo simulations using the
data prediction from your model perturbed by noise that is consistent
with the data assumptions. Compare a histogram of these χ2 values
with the theoretical χ2 distribution for the correct number of degrees
of freedom (you may find the MATLAB statistical toolbox function
chi2pdf to be useful here).
(f) Are your p–value and Monte Carlo χ2 distribution consistent with
the theoretical modeling and the data set? If not, explain what is
wrong.
(g) Use IRLS to evaluate 1–norm estimates for t0 and s2 . Plot the data
predictions from your model relative to the true data and compare
with (a).
(h) Use Monte Carlo error propagation and IRLS to estimate symmetric
95% confidence intervals on the 1–norm solution for t0 and s2 .
(i) Examining the contributions from each of the data points to the 1–
norm misfit measure, can you make a case that any of the data points
are statistical outliers?
and
$$b' = \frac{b - \bar{b}}{\sigma \sqrt{C_{2,2}}} \; ,$$
and demonstrate using a Q–Q plot that your estimates for a' and b' are distributed as N(0, 1).
(d) Show using a Q–Q plot that the squared residual lengths
$$\|r\|_2^2 = \|d - Gm\|_2^2$$
and
$$b' = \frac{b - \bar{b}}{s\, C_{2,2}^{1/2}} \; ,$$
where each solution is normalized by its respective standard deviation estimate.
(f) Demonstrate using a Q–Q plot that your estimates for a' and b' are distributed as the t PDF with ν = 3 degrees of freedom.
Chapter 3
Discretizing Continuous Inverse Problems
In this chapter, we will discuss inverse problems involving functions rather than
vectors of parameters. There is a large body of mathematical theory that can
be applied to such problems. Some of these problems can be solved analytically.
However, in practice such problems are often approximated by linear systems
of equations. Thus we will focus on techniques for discretizing continuous inverse problems here, and then consider techniques for solving the discretized problems in subsequent chapters.
$$d(s) = \int_a^b g(s, t)\, m(t)\, dt \; . \quad (3.1)$$
Here d(s) is a known function, typically representing the observed data. The
function g(s, t) is also known. It encodes the physics that relates the unknown
model m(t) to the observed d(s). The interval [a, b] may be finite, in which case
the analysis is somewhat simpler, or the interval may be infinite. The function
d(s) might in theory be known over an entire interval [c, d], but in practice we
will only have measurements of d(s) at a finite set of points.
We wish to solve for the unknown function m(t). This type of linear equation,
which we previously saw in Chapter 1, is called a Fredholm integral equation
of the first kind or IFK. A surprisingly large number of inverse problems
can be written as Fredholm integral equations of the first kind. Unfortunately,
some IFK’s have properties that can make them very difficult to solve.
or as
$$d_i = \int_a^b g_i(t)\, m(t)\, dt \quad (i = 1, 2, \ldots, m) \; , \quad (3.2)$$
where gi (t) = g(si , t). The functions gi (t) are referred to as representers or
data kernels.
In the quadrature approach to discretizing an IFK, we use a quadrature rule (an approximate numerical integration scheme) to approximate
$$\int_a^b g_i(t)\, m(t)\, dt \; .$$
Figure 3.1: Midpoint-rule points t1, t2, . . ., tn on the interval [a, b].
The simplest quadrature rule is the midpoint rule. We divide the interval
[a, b] into n subintervals, and pick points t1 , t2 , . . ., tn in the middle of each
interval. Thus
∆t
ti = a + + (i − 1)∆t
2
where
b−a
∆t = .
n
The integral is then approximated by
$$\int_a^b g_i(t)\, m(t)\, dt \approx \sum_{j=1}^{n} g_i(t_j)\, m(t_j)\, \Delta t \; . \quad (3.3)$$
If we let
$$G_{i,j} = g_i(t_j)\, \Delta t \quad (i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n) \quad (3.5)$$
and
$$m_j = m(t_j) \quad (j = 1, 2, \ldots, n) \; , \quad (3.6)$$
then we obtain a linear system of equations Gm = d.
The approach of using the midpoint rule to approximate the integral is known
as simple collocation. Of course, there are also more sophisticated quadrature
rules such as the trapezoidal rule and Simpson’s rule. In each case, we end up
with a similar linear system of equations.
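The following MATLAB fragment sketches this construction for an illustrative kernel; the kernel, interval, and numbers of points are assumptions made for the sake of the example, not quantities from the text.

g = @(s, t) exp(-s .* t);          % an assumed example kernel g(s,t)
s = linspace(0, 1, 20)';           % m measurement points s_i
a = 0; b = 1; n = 50;              % model interval and number of subintervals
dt = (b - a) / n;
t = a + dt/2 + (0:n-1) * dt;       % midpoints t_j
G = zeros(length(s), n);
for i = 1:length(s)
    G(i,:) = g(s(i), t) * dt;      % G_ij = g_i(t_j)*dt, as in (3.5)
end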
Example 3.2
Consider the vertical seismic profiling example (Example 1.3) of
Chapter 1, where we wish to estimate vertical seismic slowness using
arrival time measurements of downward propagating seismic waves
(Figure 3.2). As noted in Chapter 1, the data in this case are simply
Figure 3.2: Discretization of the vertical seismic profiling problem (n/m = 2).
The rows G_{i,·} of the matrix each consist of i · n/m entries ∆z on the left and n − (i · n/m) zeros on the right. For n = m, G is simply a lower triangular matrix with each nonzero entry equal to ∆z.
Example 3.3
Example 3.4
$$\frac{\partial C}{\partial t} = D\, \frac{\partial^2 C}{\partial x^2} - v\, \frac{\partial C}{\partial x} \quad (3.14)$$
$$C(0, t) = C_{\mathrm{in}}(t)$$
$$C(x, t) \to 0 \;\text{ as }\; x \to \infty$$
$$C(x, 0) = C_0(x)$$
d = Gm
di = d(si ) (i = 1, 2, . . . , m) . (3.21)
Γα = d .
Once this linear system has been solved, the corresponding model is
$$m(t) = \sum_{j=1}^{m} \alpha_j\, g_j(t) \; . \quad (3.22)$$
If the representers and Gram matrix are analytically expressible, then the
Gram matrix formulation facilitates finding continuous solutions for m(t) [Par94].
Where only numerical representations for the representers exist, the method can still be used, although the elements of Γ must be obtained by numerical integration.
It can be shown (see the Exercises) that if the representers gi (t) are linearly
independent, then the matrix Γ will be nonsingular.
However, as we will see in the next chapter, the Γ matrix tends to become
very badly conditioned as m increases. On the other hand, we want to use as
large as possible a value of m so as to increase the accuracy of the discretization.
There is thus a tradeoff between the discretization error due to using too small
a value of m and the ill-conditioning of Γ due to using too large a value of m.
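A hedged MATLAB sketch of the Gram matrix construction by numerical integration; the representers used here are placeholders, and d is assumed to hold the data vector.

a = 0; b = 1; nrep = 10;                       % number of representers (assumed)
gi = @(i, t) exp(-i * t);                      % assumed placeholder representers g_i(t)
Gamma = zeros(nrep, nrep);
for i = 1:nrep
    for j = 1:nrep
        Gamma(i,j) = integral(@(t) gi(i,t) .* gi(j,t), a, b);
    end
end
alpha = Gamma \ d;                             % solve Gamma*alpha = d (d assumed nrep-by-1)
tt = linspace(a, b, 200);                      % evaluate the model (3.22) on a grid
mt = zeros(size(tt));
for j = 1:nrep
    mt = mt + alpha(j) * gi(j, tt);
end
plot(tt, mt)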
This leads to the m by n linear system
$$G\alpha = d$$
where
$$G_{i,j} = \int_a^b g_i(t)\, h_j(t)\, dt \; .$$
If we define the dot product or inner product of two functions to be
$$f \cdot g = \int_a^b f(x)\, g(x)\, dx \; , \quad (3.27)$$
This norm can be very useful in approximating functions. The corresponding measure of the difference between functions f and g is
$$\|f - g\|_2 = \sqrt{\int_a^b (f(x) - g(x))^2\, dx} \; . \quad (3.29)$$
If it happens that our basis functions hj (x) are orthonormal with respect to
this inner product, then the projection of gi (t) onto the space H spanned by
the basis is
projH gi (t) = (gi · h1 )h1 (t) + (gi · h2 )h2 (t) + . . . + (gi · hn )hn (t)
The elements in the G matrix are given by the same dot products
Gi,j = gi · hj (3.30)
Thus we have effectively projected the original representers onto our function
space H.
A variety of basis functions have been used to discretize integral equations
including sines and cosines, spherical harmonics, B-splines, and wavelets. In
selecting the basis functions, it is important to select a basis that can reasonably
represent likely models. The basis functions should be linearly independent, so
that a function can be written in terms of the basis functions in exactly one
way, and (3.23) is thus unique. It is also desirable to use an orthonormal basis.
The selection of an appropriate basis for a particular problem is a fine art
that requires detailed knowledge of the problem as well as of the behavior of
the basis functions. Beware that a poorly selected basis may not adequately
approximate the solution, resulting in an estimated model m(t) which is very
wrong.
Example 3.5
Consider the discretization of a Fredholm integral equation of the first kind in which the interval of integration is [0, ∞):
$$d(s) = \int_0^\infty g(s, t)\, m(t)\, dt \; .$$
$$L_0(t) = 1$$
and
$$L_{n+1}(t) = \frac{2n + 1 - t}{n + 1}\, L_n(t) - \frac{n}{n + 1}\, L_{n-1}(t) \quad (n = 0, 1, \ldots) \; .$$
The first few polynomials include L1 (t) = 1−t, L2 (t) = 1−2t+t2 /2,
and L3 (t) = 1 − 3t + 3t2 /2 − t3 /6.
We will use a slightly modified inner product on this interval,
$$f \cdot g = \int_0^\infty f(x)\, g(x)\, e^{-x}\, dx \; .$$
The weighting factor of $e^{-x}$ helps to ensure that the integral will converge, but has the effect of downweighting the importance of values of the function for large x.
The Laguerre polynomial basis functions are orthonormal in the sense that
$$\int_0^\infty L_n(t)\, L_m(t)\, e^{-t}\, dt = \begin{cases} 0 & m \neq n \\ 1 & m = n \end{cases} \; .$$
$$G\alpha = d \; ,$$
where
$$G_{i,j} = \int_0^\infty g_i(t)\, L_{j-1}(t)\, e^{-t}\, dt \quad (i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n) \; .$$
In terms of the inner product defined above,
$$G_{i,j} = g_i \cdot L_{j-1} \; ,$$
which are exactly the coefficients of the L_{j-1} in the orthogonal expansion of g_i(t).
In this example, we selected an inner product and orthogonal basis that gives less weight to values of m(t) for large t. We are effectively measuring how close two functions f and g are by
$$\|f - g\| = \sqrt{\int_0^\infty (f(x) - g(x))^2\, e^{-x}\, dx} \; .$$
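As a numerical check on the recurrence and the weighted inner product, the polynomials can be built and tested in MATLAB (an illustrative sketch, not part of the example itself):

nmax = 4;
L = cell(nmax+1, 1);
L{1} = 1;                                  % L_0(t) = 1 (coefficients, descending powers)
L{2} = [-1 1];                             % L_1(t) = 1 - t
for n = 1:nmax-1
    % L_{n+1} = ((2n+1-t)*L_n - n*L_{n-1}) / (n+1)
    term1 = conv([-1, 2*n+1], L{n+1});
    term2 = [zeros(1, length(term1)-length(L{n})), n*L{n}];
    L{n+2} = (term1 - term2) / (n+1);
end
ip = @(p, q) integral(@(t) polyval(p,t).*polyval(q,t).*exp(-t), 0, Inf);
ip(L{3}, L{4})    % <L_2, L_3>, approximately 0
ip(L{4}, L{4})    % <L_3, L_3>, approximately 1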
3.5 Exercises
1. Consider the following data set:
y d(y)
0.025 0.2388
0.075 0.2319
0.125 0.2252
0.175 0.2188
0.225 0.2126
0.275 0.2066
0.325 0.2008
0.375 0.1952
0.425 0.1898
0.475 0.1846
0.525 0.1795
0.575 0.1746
0.625 0.1699
0.675 0.1654
0.725 0.1610
0.775 0.1567
0.825 0.1526
0.875 0.1486
0.925 0.1447
0.975 0.1410
This data can be found in the file ifk.mat.
The function d(y), 0 ≤ y ≤ 1 is related to an unknown function m(x),
0 ≤ x ≤ 1, through the IFK
$$d(y) = \int_0^1 x\, e^{-xy}\, m(x)\, dx \; . \quad (3.31)$$
(a) Using the data provided, discretize the integral equation using simple
collocation and solve the resulting system of equations.
(b) What is the condition number for this system of equations? Given
that the data d(y) are only accurate to about 4 digits, what does this
tell you about the accuracy of your solution?
2. Use the Gram matrix technique to discretize the integral equation from
problem 3.1.
(a) Solve the resulting linear system of equations, and plot the resulting
model.
(b) What was the condition number of Γ? What does this tell you about
the accuracy of your solution?
3. Show that if the representers gi (t) are linearly independent, then the Gram
matrix Γ is nonsingular.
Chapter 4
Rank Deficiency and Ill-Conditioning
Gm = d (4.1)
is factored into
$$G = USV^T \quad (4.2)$$
where
• U is an m by m orthogonal matrix,
• V is an n by n orthogonal matrix, and
• S is an m by n diagonal matrix.
The SVD matrices can be computed in MATLAB with the svd command.
The singular values along the diagonal of S are customarily arranged in
decreasing size, s1 ≥ s2 ≥ . . . ≥ smin(m,n) ≥ 0. Some of the singular values may
be zero. If only the first p singular values are nonzero, we can partition S as
$$S = \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} \quad (4.3)$$
where S_p is a p by p diagonal matrix composed of the positive singular values. Now, expand the SVD representation of G:
$$G = USV^T \quad (4.4)$$
$$= [U_{\cdot,1}, U_{\cdot,2}, \ldots, U_{\cdot,m}] \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} [V_{\cdot,1}, V_{\cdot,2}, \ldots, V_{\cdot,n}]^T \quad (4.5)$$
$$= [U_p, U_0] \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} [V_p, V_0]^T \quad (4.6)$$
where Up denotes the first p columns of U, U0 denotes the last m−p columns of
U, Vp denotes the first p columns of V, and V0 denotes the last n − p columns
of V. Because the last m − p columns of U and the last n − p columns of V
in (4.6) are multiplied by zeros in S, we can simplify the SVD of G into its
compact form
G = Up Sp VpT . (4.8)
Take any vector y in R(G). Using (4.8), we can write y as
$$y = Gx = U_p S_p V_p^T x \; . \quad (4.9)$$
Since the columns of Up span R(G), are linearly independent, and are orthonor-
mal, they form an orthonormal basis for R(G). Because this basis has p vectors,
rank(G) = p.
Since U is an orthogonal matrix, the columns of U form an orthonormal basis
for Rm . We have already seen that the p columns of Up form an orthonormal
basis for R(G). By theorem (A.5), N (GT ) + R(G) = Rm , so the remaining
m − p columns of U_0 form an orthonormal basis for N(G^T). Similarly, because $G^T = V_p S_p U_p^T$, the columns of V_p form an orthogonal basis for R(G^T) and the columns of V_0 form an orthogonal basis for N(G).
Two other important properties of the SVD are similar to properties of
eigenvalues and eigenvectors. Since the columns of V are orthogonal,
VT V·,i = ei .
Thus
GV·,i = USVT V·,i = USei = si U·,i (4.11)
and
GT U·,i = VST UT U·,i = VST ei = si V·,i . (4.12)
There is an important connection between the singular values of G and the eigenvalues of the matrices GG^T and G^T G:
$$GG^T U_{\cdot,i} = G\, s_i V_{\cdot,i} \quad (4.13)$$
$$= s_i\, G V_{\cdot,i} \quad (4.14)$$
$$= s_i^2\, U_{\cdot,i} \; . \quad (4.15)$$
Similarly,
$$G^T G V_{\cdot,i} = s_i^2 V_{\cdot,i} \; . \quad (4.16)$$
These relations show that we could, in theory, compute the SVD by finding
the eigenvalues and eigenvectors of GT G and GGT . In practice, more efficient
specialized algorithms are used [Dem97, GL96, TB97].
The SVD can be used to compute a generalized inverse of G, which is
called the Moore-Penrose Pseudoinverse because it has desirable properties
originally identified by Moore and Penrose [Moo20, Pen55]. The generalized
inverse is
$$G^\dagger = V_p S_p^{-1} U_p^T \; . \quad (4.17)$$
Using (4.17), let
$$m^\dagger = V_p S_p^{-1} U_p^T d = G^\dagger d \; . \quad (4.18)$$
We will show that m† is a least squares solution. Among the desirable properties
of (4.18) is that G† , and hence m† , always exist, unlike, for example, the inverse
of GT G in the normal equations (2.4) which does not exist when G is not of
full rank.
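A minimal MATLAB sketch of forming the generalized inverse solution (4.18) from the SVD, assuming G and d are defined:

[U, S, V] = svd(G);
s = diag(S);
p = sum(s > max(size(G)) * eps(s(1)));     % number of (numerically) nonzero singular values
Up = U(:, 1:p); Vp = V(:, 1:p); Sp = diag(s(1:p));
mdagger = Vp * (Sp \ (Up' * d));           % generalized inverse solution; same as pinv(G)*d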
To encapsulate what the SVD tells us about our linear system, G, and the
corresponding generalized inverse system G† , consider four cases:
1. Both the model and data null spaces, N(G) and N(G^T), are trivial. U_p and V_p are both square orthogonal matrices, so that $U^T = U^{-1}$ and $V^T = V^{-1}$. (4.18) gives
$$G^\dagger = V_p S_p^{-1} U_p^T = (U_p S_p V_p^T)^{-1} = G^{-1} \quad (4.19)$$
so the data are fit. However, the solution is nonunique, because of the existence of the nontrivial model null space N(G). The general solution is the sum of m† and an arbitrary model component in N(G):
$$m = m^\dagger + m_0 = m^\dagger + \sum_{i=p+1}^{n} \alpha_i V_{\cdot,i} \; . \quad (4.24)$$
where we have equality only if all of the model null space coefficients αi
are zero. The generalized inverse solution, which lies in R(GT ) is thus
a minimum 2–norm or minimum length solution for a rank deficient
system with m = rank(G) < n.
We can also write this solution in terms of G and G^T:
$$m^\dagger = V_p S_p^{-1} U_p^T d \quad (4.26)$$
$$= V_p S_p U_p^T U_p S_p^{-2} U_p^T d \quad (4.27)$$
$$= G^T (U_p S_p^{-2} U_p^T)\, d \quad (4.28)$$
$$= G^T (GG^T)^{-1} d \; . \quad (4.29)$$
The product Up UTp d gives the projection of d onto R(G). That is, Gm†
is equal to the projection of d onto the range of G. This shows that the
generalized inverse gives a least squares solution to Gm = d.
This solution is exactly the same as that obtained from the normal equa-
tions, because
and
$$m^\dagger = G^\dagger d \quad (4.36)$$
$$= V_p S_p^{-1} U_p^T d \quad (4.37)$$
$$= V_p S_p^{-2} V_p^T V_p S_p U_p^T d \quad (4.38)$$
$$= (G^T G)^{-1} G^T d \; . \quad (4.39)$$
4. Both N (GT ) and N (G) are nontrivial and rank(G) is less than both
m and n. In this case, the generalized inverse solution encapsulates the
behavior from both of the two previous cases and minimizes both kGm −
dk2 and kmk2 .
As in case 3,
Again,
$$\|m\|_2^2 = \|m^\dagger\|_2^2 + \sum_{i=p+1}^{n} \alpha_i^2 \geq \|m^\dagger\|_2^2 \; , \quad (4.44)$$
The generalized inverse thus produces an inverse solution that always exists,
and that is both least-squares and minimum length, regardless of the relative
sizes of rank(G), m and n. Relationships between the subspaces R(G), N (GT ),
R(GT ), N (G), and the operators G and G† , are shown schematically in Figure
4.1. Table 4.1 summarizes the SVD and its properties.
To examine the significance of the N(G) subspace, which is spanned by the columns of V_0, consider an arbitrary model m_0 which lies in N(G):
$$m_0 = \sum_{i=p+1}^{n} \alpha_i V_{\cdot,i} \; . \quad (4.45)$$
Figure 4.1: SVD model and data space mappings, where G is the forward
operator and G† is the generalized inverse. N (GT ) and N (G) are the data and
model null spaces, respectively.
For this reason, the null space of G is commonly called the model null space.
The existence of a nontrivial model null space (one that includes more than
just the zero vector) is at the heart of solution nonuniqueness. General models
in R^n include an infinite number of solutions that will fit the data equally well, because model components in N(G) have no effect on the data fit. To select a
particular preferred solution from this infinite set thus requires more constraints
(such as minimum length or smoothing constraints) than are encoded in the
matrix G.
To see the significance of the N (GT ) subspace, consider an arbitrary data
vector, d_0, which lies in N(G^T):
$$d_0 = \sum_{i=p+1}^{m} \beta_i U_{\cdot,i} \; . \quad (4.48)$$
In Chapter 2, we found that under the assumption of independent normally
distributed measurement errors, the least squares solution was an unbiased es-
timator of the true model, and that the estimated model parameters had a
multivariate normal distribution with covariance
We could attempt the same analysis for the generalized inverse solution m† .
The covariance matrix would be given by
Cov(m† ) = G† Cov(d)(G† )T .
However, unless the least squares problem is of full rank, the generalized inverse
solution is not an unbiased estimator of the true solution. This is because
the true model may have nonzero projections in the model null space which
are not represented in the generalized inverse solution. In practice, the bias
introduced by this can be far larger than the uncertainty due to measurement
error. Estimating this bias is a hard problem, which we will address in Chapter
5.
The concept of resolution is another way to characterize the quality of the generalized inverse solution. In this approach we see how closely the generalized inverse solution matches a given model, assuming that there are no errors in the data. We begin with any model m. By multiplying G times m, we can find a corresponding data vector d. If we then multiply G† times d, we get back a generalized inverse solution m†:
$$m^\dagger = G^\dagger G m \; . \quad (4.50)$$
The model space resolution matrix is
$$R_m = G^\dagger G \quad (4.51)$$
$$= V_p S_p^{-1} U_p^T U_p S_p V_p^T \quad (4.52)$$
$$= V_p V_p^T \; . \quad (4.53)$$
In practice, the model space resolution matrix is used in two different ways. If the diagonal entries of R_m are close to one, then we should have good resolution.
If any of the diagonal entries are close to zero, then the corresponding model
parameters will be poorly resolved. We can also multiply Rm times a particular
model m to see how that model would be resolved by the inverse solution.
We can perform the operations of multiplication by G† and G in the opposite
order to obtain a data space resolution matrix, Rd .
$$d^\dagger = Gm^\dagger \quad (4.55)$$
$$= GG^\dagger d \quad (4.56)$$
$$= R_d\, d \; , \quad (4.57)$$
where
$$R_d = U_p U_p^T \; . \quad (4.59)$$
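Using the MATLAB factors Up and Vp from the SVD sketch above, both resolution matrices are easy to form (a small sketch):

Rm = Vp * Vp';      % model space resolution matrix (4.53)
Rd = Up * Up';      % data space resolution matrix (4.59)
diag(Rm)            % diagonal entries near one indicate well resolved parameters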
The elements of the vector $U_p^T d$ are simply the dot products of the first p columns of U with d. This vector can be written as
$$U_p^T d = \begin{bmatrix} (U_{\cdot,1})^T d \\ (U_{\cdot,2})^T d \\ \vdots \\ (U_{\cdot,p})^T d \end{bmatrix} \quad (4.61)$$
$$S_p^{-1} U_p^T d = \begin{bmatrix} (U_{\cdot,1})^T d / s_1 \\ (U_{\cdot,2})^T d / s_2 \\ \vdots \\ (U_{\cdot,p})^T d / s_p \end{bmatrix} \quad (4.62)$$
and the generalized inverse solution can be expanded as
$$m^\dagger = V_p S_p^{-1} U_p^T d = \sum_{i=1}^{p} \frac{(U_{\cdot,i})^T d}{s_i}\, V_{\cdot,i} \; . \quad (4.63)$$
$$m^\dagger - m'^\dagger = G^\dagger (d - d') \; .$$
Thus
$$\|m^\dagger - m'^\dagger\|_2 \leq \|G^\dagger\|_2\, \|d - d'\|_2 \; .$$
From (4.63), it is clear that the largest difference in the inverse models will occur when d − d' is in the direction $U_{\cdot,p}$. If
$$d - d' = \alpha U_{\cdot,p} \; ,$$
then
$$\|m^\dagger - m'^\dagger\|_2 = \frac{\alpha}{s_p} \; .$$
Thus we have a bound on the instability of the generalized inverse solution
$$\frac{\|m^\dagger - m'^\dagger\|_2}{\|d - d'\|_2} \leq \frac{1}{s_p}$$
or
$$\|m^\dagger - m'^\dagger\|_2 \leq \frac{1}{s_p}\, \|d - d'\|_2 \; .$$
Similarly, we can see that the generalized inverse model is smallest in norm when d points in a direction parallel to $U_{\cdot,1}$. Thus
$$\|m^\dagger\|_2 \geq \frac{1}{s_1}\, \|d\|_2 \; .$$
Combining these inequalities, we obtain
$$\frac{\|m^\dagger - m'^\dagger\|_2}{\|m^\dagger\|_2} \leq \frac{s_1}{s_p}\, \frac{\|d - d'\|_2}{\|d\|_2} \; . \quad (4.64)$$
A condition that ensures solution stability is the discrete Picard condition [Han98]. The discrete Picard condition is satisfied when the dot products of the columns of U_p and the data vector decay to zero more quickly than the singular values, s_i. Under this condition, we should not see instability due to contributions from the smallest singular values.
If the discrete Picard condition is not satisfied, we may still be able to recover a usable solution by truncating (4.63) at some highest term p' < p, to produce a truncated SVD, or TSVD solution. One way to decide when to truncate (4.63) is to apply the discrepancy principle, where we consider all solutions which minimize ‖m‖, subject to fitting the data to some tolerance
$$\|G_w m - d_w\|_2 \leq \delta \; , \quad (4.65)$$
where Gw and dw are the weighted system matrix and data vector (2.16). A
simple way to chose δ for independent normal data errors is to note that, in this
case, the square of the weighting-normalized 2–norm misfit measure (4.65) will
be distributed as χ2 with ν = m − n degrees of freedom. In this case we can
choose a δ corresponding to some confidence interval for that χ2 distribution
(e.g., 95%). The truncated SVD solution is then straightforward; we sum up terms in (4.63) until the discrepancy principle is just met. This produces a stable solution that can be statistically justified as being consistent with both modeling assumptions and data errors.
This solution will not fit the data as well as solutions that include the small
singular value model space basis vectors. Perhaps surprisingly, this is what we
should generally do when we solve ill-posed problems with noise. If we fit the
data vector exactly or nearly exactly, we are in fact over fitting the data and
perhaps letting the noise control major features of the model. This corresponds
to using the available freedom of a large model space to produce a χ2 value that
is too good (Chapter 2). Truncating (4.63) also decreases the used dimension of the model space by p − p'. We should thus be careful to acknowledge that true model projections in the directions of the newly omitted columns of V cannot appear in the truncated SVD solution, even if they are present in the true model.
The truncated SVD solution is but one example of regularization, whereby
solutions are selected to sacrifice fit to the data for solution stability. Much of
the utility of inverse solutions in many situations, as we shall subsequently see,
hinges on the application of good regularization strategies.
If there is a distinct gap between the zero singular values and the smallest
truly non–zero singular value, then it can be easy to distinguish between the
non–zero and zero singular values. Rank deficient problems can then be solved in a
straightforward manner by applying the generalized inverse solution. Instability
of the resulting solution due to small singular values is seldom an issue.
Example 4.1
With the tools of the SVD at our disposal, let us reconsider the
straight-line tomography example (example 1.3; Figure 4.2), where
we had a rank deficient system in which we were constraining a
9-parameter model with 8 observations.
Gm =
[  1   0   0   1   0   0   1   0   0
   0   1   0   0   1   0   0   1   0
   0   0   1   0   0   1   0   0   1
   1   1   1   0   0   0   0   0   0
   0   0   0   1   1   1   0   0   0
   0   0   0   0   0   0   1   1   1
  √2   0   0   0  √2   0   0   0  √2
   0   0   0   0   0   0   0   0  √2 ]
[ s11  s12  s13  s21  s22  s23  s31  s32  s33 ]^T
= [ t1  t2  t3  t4  t5  t6  t7  t8 ]^T .   (4.66)
>> svd(G)
ans =
3.180e+00
2.000e+00
1.732e+00
1.732e+00
1.732e+00
1.607e+00
5.535e-01
4.230e-16
Recall that we can add any multiple of m0 to any solution and not
change our fit to the data. Three common features of the two model
null space basis vectors stand out
1. The sums along all rows and columns of m0 are zero
2. The upper left to lower right diagonal sum is zero
3. There is no projection of m0 in the m9 = s33 model space
direction.
The zero sum conditions (1) and (2) arise because the ray path geometry
constrains only the average slowness along each row or column; a path passing
through three blocks can only constrain the average of those three block values.
The zero value for s33 (3) occurs because s33 is uniquely constrained by the
observation t8.
The lone data space null basis vector, specified by the 8th column of
U, is
U0 = [ −0.408  −0.408  −0.408   0.408   0.408   0.408   0.000   0.000 ]^T .
This is a basis vector in the data space that we can never fit with any model
because it is inconsistent. Measurements of the form cU0, where c is a nonzero
constant, would require the average values of the si,· along rows to decrease or
increase while the average values of the s·,i along columns simultaneously
increase or decrease, with the upper left to lower right diagonal average and the
value of s33 not changing at all. No model can satisfy these requirements
simultaneously.
The n by n model space resolution matrix, Rm, tells us about the uniqueness of
the generalized inverse solution parameters. The diagonal elements are indicative
of how recoverable each parameter value will be. The reshaped diagonal of Rm
for this system is

reshape(diag(Rm), 3, 3) = [ 0.833  0.833  0.667
                            0.833  0.833  0.667
                            0.667  0.667  1.000 ] ,
which tells us that m9 = s33 is perfectly resolved, but that we can
expect loss of resolution (and hence smearing of the true model into
other blocks) for all of the other solution parameters. The full Rm
gives the specifics of how this smearing occurs
Rm =
[  0.833   0       0.167   0       0.167  −0.167   0.167  −0.167   0
   0       0.833   0.167   0.167   0      −0.167  −0.167   0.167   0
   0.167   0.167   0.667  −0.167  −0.167   0.333   0       0       0
   0       0.167  −0.167   0.833   0       0.167   0.167  −0.167   0
   0.167   0      −0.167   0       0.833   0.167  −0.167   0.167   0
  −0.167  −0.167   0.333   0.167   0.167   0.667   0       0       0
   0.167  −0.167   0       0.167  −0.167   0       0.667   0.333   0
  −0.167   0.167   0      −0.167   0.167   0       0.333   0.667   0
   0       0       0       0       0       0       0       0       1.000 ] .
(4.67)
As you can probably surmise, examining the entire model resolution
matrix becomes cumbersome in large problems. In such cases it is
instead common to perform a resolution test by generating syn-
thetic data for some known model of interest and then assessing the
recovery of that model in the inverse solution. One such synthetic
model commonly used in resolution tests is uniform or zero except
for a single perturbed model element. Examining the inverse recov-
ery of this model is commonly referred to as a spike or impulse
resolution test. For this example, consider the model perturbation
mtest = [ 0  0  0  0  1  0  0  0  0 ]^T .
The predicted data set for this mtest is
dtest = Gmtest = [ 0  1  0  0  1  0  √2  0 ]^T ,
and the reshaped generalized inverse model is
reshape(m†, 3, 3) = [  0.167   0      −0.167
                       0       0.833   0.167
                      −0.167   0.167   0.000 ] ,
which is just the 5th column of Rm, reshaped, showing that information about
the central block slowness will bleed into some, but not all, of the adjacent
blocks, even for noise-free data. A related resolution test that is commonly
performed in tomography work is a checkerboard test, which, as you would
surmise, consists of generating synthetic data from a model of alternating
positive and negative perturbations and then inverting to attempt to recover
it. For example, the true checkerboard slowness model

reshape(mtrue, 3, 3) = [ 2  1  2
                         1  2  1
                         2  1  2 ]

is recovered as

reshape(m†, 3, 3) = [ 2.333  1.000  1.667
                      1.000  1.667  1.333
                      1.667  1.333  2.000 ] .
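The resolution calculations in this example can be reproduced with a short
MATLAB sketch; the system matrix below is taken from (4.66), and the
tolerance used to separate the zero singular values is an assumed value.

% Resolution tests for the 9-block, 8-ray tomography example (sketch).
G = [1 0 0 1 0 0 1 0 0;
     0 1 0 0 1 0 0 1 0;
     0 0 1 0 0 1 0 0 1;
     1 1 1 0 0 0 0 0 0;
     0 0 0 1 1 1 0 0 0;
     0 0 0 0 0 0 1 1 1;
     sqrt(2) 0 0 0 sqrt(2) 0 0 0 sqrt(2);
     0 0 0 0 0 0 0 0 sqrt(2)];

[U,S,V] = svd(G);
s = diag(S);
p = sum(s > 1e-10);                              % effective rank (7 here)
Gdag = V(:,1:p)*diag(1./s(1:p))*U(:,1:p)';       % generalized inverse

Rm = Gdag*G;                                     % model resolution matrix
reshape(diag(Rm), 3, 3)                          % recoverability of each block

mtest = zeros(9,1); mtest(5) = 1;                % spike in the central block
reshape(Gdag*(G*mtest), 3, 3)                    % spike test

mtrue = [2;1;2;1;2;1;2;1;2];                     % checkerboard model
reshape(Gdag*(G*mtrue), 3, 3)                    % checkerboard test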
This happens frequently when we discretize Fredholm integral equations of the
first kind, as we did in Chapter 3. In particular, as we increase the number of
points n in the discretization, we typically find that the matrix G becomes more
and more badly conditioned. Discrete inverse problems such as these cannot
formally be called ill–posed, because the condition number remains finite,
although very large. However, we will refer to these problems as discrete
ill–posed problems.

The rate at which the singular values decay can be used to characterize
a discrete ill–posed problem as weakly, moderately, or severely ill–posed. If
si = O(i^{−α}) for α ≤ 1, then we call the problem weakly ill–posed. If si = O(i^{−α})
for α > 1, then the problem is moderately ill–posed. If si = O(e^{−αi}), then the
problem is severely ill–posed.
In addition to the general pattern of singular values which decay to 0, dis-
crete ill–posed problems are typically characterized by differences in the singular
vectors V·,j [Han98]. For large singular values, the singular vectors are typically
smooth, while the singular vectors corresponding to the smaller singular values
typically have lots of oscillations. These oscillations become apparent in the
generalized inverse solution.
When we attempt to solve such a problem with the SVD, it becomes difficult
to decide where to truncate the sum. If we truncate the sum too early, then we
effectively lose details in our model solution carried by the model space basis
vectors V·,k that are left out of the sum. If we include all of the singular
values, then the generalized inverse solution becomes extremely unstable in the
presence of noise. In particular, we can expect that high frequencies in the
generalized inverse solution will be strongly affected by the noise. Regularization
techniques are required to address this problem.
Example 4.2
g(t) = { g0 t e^{−t/T0}   (t ≥ 0)
       { 0                (t < 0) .   (4.68)
[Figure: the impulse response g(t) versus time (s).]
Note that the rows of G are time reversed, and the columns of G are
non-time-reversed, sampled versions of the impulse response g(t), lagged by i
and j, respectively. Using a time interval of [−5, 100] s, outside of which (4.68)
and (we assume) any model, m, of interest will be very small or zero, and a
discretization interval of ∆t = 0.5 s, we obtain a discretized m by n system
matrix G with m = n = 210 (Figure 4.4).
Figure 4.4: Grayscale image of the system matrix, G. Values range from 0
(black) to 0.5 (white). Discretization interval is ∆t = 0.5 s and the time interval
is [−5, 100] s.
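A minimal MATLAB sketch of this construction is given below; g0 and T0 are
assumed constants from (4.68), and the simple scaling by ∆t is one possible
discretization of the convolution integral.

% Build the discretized convolution matrix G (sketch).
dt = 0.5;                      % discretization interval (s)
t  = (-5:dt:100-dt)';          % 210 samples spanning [-5, 100) s
n  = length(t);                % n = 210
g  = @(tau) (tau >= 0).*(g0.*tau.*exp(-tau/T0));   % impulse response (4.68)

G = zeros(n);
for i = 1:n
    for j = 1:n
        G(i,j) = g(t(i) - t(j))*dt;   % row i: time-reversed, lagged copy of g
    end
end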
The singular values of G are all nonzero and range from about 25.3
to 0.017 (Figure 4.5), giving a condition number of approximately
1480, which is not an especially large range relative to what can arise
in extremely ill-posed problems. However, adding noise at the level
of 1 part in 1000 will be sufficient to make the generalized inverse
solution unstable. The reason for the moderately large condition
number can be seen by examining successive rows of G, which are
nearly (but not quite) identical.
(Gi,· Gi+1,·^T) / (kGi,·k kGi+1,·k) ≈ 0.999 .
[Figure 4.5: singular values si of G.]

[Figure: the true model; acceleration (m/s^2) versus time (s).]

[Figure 4.7: noise-free predicted data (V) versus time (s).]
The recovered 2–norm model from the full generalized inverse solution,

m = V S^{-1} U^T dsyn ,   (4.73)

is shown in Figure 4.8. The solution (4.73) fits its noiseless data vector, dsyn,
perfectly, and is essentially identical to the true model.

Next consider what happens when we add a modest amount of noise to (4.72).
Figure 4.9 shows the data after normally distributed noise with standard
deviation 0.05 has been added. The 2–norm solution for this noisy data vector,
dsyn + n, where the elements of n are uncorrelated N(0, (0.05)^2) random
variables, is shown in Figure 4.10.
Figure 4.8: Generalized inverse solution using all 210 singular values and the
noise-free data of Figure 4.7.
Figure 4.9: Predicted data from the true model plus uncorrelated N (0, (0.05)2 )
noise.
Figure 4.10: Generalized inverse solution using all 210 singular values, and the
noisy data of Figure 4.9.
Figure 4.11: Solution using the 26 largest singular values and N(0, (0.05)^2)
data noise.
Figure 4.12: Grayscale image of the model resolution matrix, Rm for the trun-
cated SVD solution including the 26 largest singular values. Values range from
0.35 (white) to −0.07 (black).
Figure 4.13: A column from the model resolution matrix, Rm for the truncated
SVD solution including the 26 largest singular values.
Example 4.3
Recall the Shaw example problem from Example 3.2. The MATLAB
Regularization Toolbox contains a routine shaw that computes the
G matrix for this problem. Using this routine, we computed the
G matrix for n = 20. We then computed the singular values of
this matrix. Figure 4.14 shows the singular values of the G matrix,
which decay very rapidly to zero. Because the singular values decay to 0 in an
exponential fashion, this is a severely ill–posed problem. There is no obvious
break point above which the singular values are nonzero and below which the
singular values are 0. The MATLAB rank command gives a value of p = 18,
suggesting that the last two singular values are effectively 0. The condition
number of this problem is enormous (> 10^14).
[Figure 4.14: singular values of the G matrix for the Shaw problem, n = 20.]
We can run the model forward and then use the generalized inverse
with various values of p to obtain inverse solutions. The spike in-
put is shown in Figure 4.17. The corresponding data (with no noise
added) is shown in Figure 4.18. If we compute the generalized in-
verse with p = 18, and apply it to the data in Figure 4.18, we get
essentially perfect recovery of the original spike. See Figure 4.19.
However, if we add a very small amount of noise to the data in Figure 4.18,
things change dramatically. Adding normally distributed random noise with
mean 0 and standard deviation 10^−6 to the data in Figure 4.18 and computing
the inverse solution using p = 18 produces the wild solution shown in Figure
4.20, which bears little resemblance to the input (notice that the vertical scale
is multiplied by 10^5!). This inverse operator is even more unstable than that
of the previous example.
Figure 4.24: Recovery of the spike model with noise, n = 100, p = 20.
4.6 Exercises
1. A large NS-EW oriented, nearly square plan view, sandstone quarry block
(16 m by 16 m) with a bulk P-wave seismic velocity of approximately
3000 m/s is suspected of harboring higher-velocity dinosaur remains. An
ultrasonic P-wave travel-time tomography scan is performed in a horizontal
plane bisecting the boulder, producing a data set consisting of 16
E→W, 16 N→S, 31 SW→NE, and 31 NW→SE travel times. Each travel-time
measurement has statistically independent errors with estimated standard
deviations of 15 µs.
The data files that you will need to load from your working directory into
your MATLAB program are: rowscan.m, colscan.m, diag1scan.m,
diag2scan.m (these contain the travel-time data), and std.m (which
contains the standard deviation of the data measurements). The travel
time contribution from a uniform background model (velocity of 3000 m/s)
has been subtracted from each travel-time measurement for you, so you
will be solving for perturbations from a uniform slowness model of 3000
m/s. The row format of each data file is (x1, y1, x2, y2, t) where the
starting point coordinate of each shot is (x1, y1), the end point coordinate
is (x2, y2), and the travel time along a ray path between the start and
end points is a path integral (in seconds)

t = ∫_l s(x) dl ≈ Σ_blocks s_block ∆l_block ,

where ∆l_block is 1 m for the row and column scans and √2 m for the
diagonal scans.
Utilize the SVD to find a minimum-length/least-squares solution, m† ,
for the 256 block slowness perturbations which fit the data as exactly as
possible. Perform two inversions:
(A) Using the row and column scans only, and
(B) Using the complete data set.
For each inversion:
(a) State and discuss the significance of the elements and dimensions of
the data and model null spaces.
(b) Note if there are any model parameters that have perfect resolution.
(c) Note the condition number of your raw system matrix relating the
data and model.
(d) Note the condition number of your generalized inverse matrix.
(e) Produce a 16 by 16 element contour or other plot of your slowness
perturbation model, displaying the maximum and minimum slowness
perturbations in the title of each plot. Anything in there? If so, how
fast or slow is it (in m/s)?
(f) Show the model resolution by contouring or otherwise displaying the
256 diagonal elements of the model resolution matrix, reshaped into
an appropriate 16 by 16 grid.
(g) Construct, and contour or otherwise display a nonzero model which
fits the trivial data set d = 0 exactly.
(h) Describe how one could use solutions of the type discussed in (g)
to demonstrate that very rough models exist which will fit any data
set just as well as a generalized inverse model. Show one such wild
model.
2. Find the singular value decomposition of the G matrix from problem 3.1.
Taking into account the fact that the measured data are only accurate to
about four digits, use the truncated SVD to compute a solution to this
problem.
Chapter 5
Tikhonov Regularization
We saw in Chapter 4 that, given the SVD of G (4.2), we can express a generalized
inverse solution by (4.63)
m† = Vp Sp^{-1} Up^T d = Σ_{i=1}^{p} ((U·,i)^T d / si) V·,i ,
and that this expression can become extremely unstable when one or more of the
singular values, si become small. We considered the option of dropping terms
in the sum that correspond to the smaller singular values. This regularized or
stabilized the solution in the sense that it made the solution less sensitive to
data noise. However, we paid a price for this stability in that the regularized
solution had reduced resolution.
In this chapter we will discuss Tikhonov regularization, which is perhaps
the most widely used technique for regularizing discrete ill–posed problems. It
turns out that the Tikhonov solution can be expressed quite easily in terms of
the SVD of G. We will derive a formula for the Tikhonov solution and see how
it is a variant on the generalized inverse solution that effectively gives larger
weight to larger singular values in the SVD solution and gives lower weight to
small singular values.
Figure 5.1: Minimum values of the model norm, kmk2 , for varying values of δ.
If the data errors are independent and normally distributed with constant
standard deviation σ, then the misfit measure (1/σ^2)kGmtrue − dk2^2 will have
a χ2 distribution with m degrees of freedom. If the standard deviations are not
all equal, but the errors are still normal and independent, we can divide each
equation by the standard deviation associated with its right hand side (2.14)
and obtain a weighted system of equations (2.17) in which kGw mtrue − dw k2^2
will follow a χ2 distribution.

We learned in Chapter 2 that when we estimate the solution to a full rank
least squares problem, kGw mL2 − dw k2^2 has a χ2 distribution with m − n
degrees of freedom. Unfortunately, when the number of model parameters n is
greater than or equal to the number of data m, this simply does not work,
because there is no χ2 distribution with fewer than one degree of freedom.

In practice, a common heuristic is to require kGw m − dw k2 to be smaller
than √m, since m is the approximate median of a χ2 distribution with m
degrees of freedom.
Under the discrepancy principle, we consider all solutions with kGm − dk2 ≤ δ,
and select from these the one that minimizes the norm of m:

min kmk2
subject to kGm − dk2 ≤ δ .   (5.1)
Why select the minimum norm solution from among those solutions that
adequately fit the data? One interpretation is that any significant nonzero
feature that appears in the regularized solution is there because it was necessary
to fit the data. Model features that are unnecessary to fit the data will be
removed by the regularization.
Notice that as δ increases, the set of feasible models expands, and the min-
imum value of kmk2 decreases. We can thus trace out a curve of minimum
values of kmk2 versus δ (Figure 5.1). It is also possible to trace out this curve
by considering problems of the form

min kGm − dk2
subject to kmk2 ≤ ε .

Figure 5.2: Optimal values of the model norm, kmk2, and the misfit norm,
kGm − dk2, as ε varies.

As ε decreases, the set of feasible solutions becomes smaller, and the minimum
value of kGm − dk2 increases. Again, as we adjust ε we trace out the curve of
optimal values of kmk2 and kGm − dk2. See Figure 5.2.
A third option is to consider the damped least squares problem

min kGm − dk2^2 + α^2 kmk2^2 ,   (5.2)

which can be written as an ordinary least squares problem for an augmented
system whose solution satisfies the system of normal equations

[ G ; αI ]^T [ G ; αI ] m = [ G ; αI ]^T [ d ; 0 ] ,   (5.5)

which simplifies to

(G^T G + α^2 I) m = G^T d .
In this formula, we pick k = min(m, n) so that all singular values are included,
no matter how small they might be. To show that this solution is correct, we
substitute (5.9) into the left hand side of (5.8)
(V S^T S V^T + α^2 I) Σ_{i=1}^{k} [ si^2 (U·,i)^T d / ((si^2 + α^2) si) ] V·,i
= Σ_{i=1}^{k} [ si^2 (U·,i)^T d / ((si^2 + α^2) si) ] (V S^T S V^T + α^2 I) V·,i .   (5.10)
(V S^T S V^T + α^2 I) V·,i can be simplified by noting that V^T V·,i is a standard
basis vector, ei. When we multiply S^T S times a standard basis vector, we get
a vector with si^2 in position i and zeros elsewhere. When we multiply V times
this vector, we get si^2 V·,i. Thus
Σ_{i=1}^{k} [ si^2 (U·,i)^T d / ((si^2 + α^2) si) ] (V S^T S V^T + α^2 I) V·,i
= Σ_{i=1}^{k} [ si^2 (U·,i)^T d / ((si^2 + α^2) si) ] (si^2 + α^2) V·,i
= Σ_{i=1}^{k} si (U·,i)^T d V·,i = Vp Sp^T Up^T d = V S^T U^T d .   (5.11)
The objects

fi = si^2 / (si^2 + α^2)   (5.12)
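A minimal MATLAB sketch of this SVD implementation, assuming that G, d,
and the regularization parameter alpha are already defined, is:

% Zeroth order Tikhonov solution via SVD filter factors (sketch).
[U,S,V] = svd(G);
s = diag(S);
k = min(size(G));                   % include every singular value

f = s.^2 ./ (s.^2 + alpha^2);       % filter factors (5.12)
malpha = zeros(size(G,2), 1);
for i = 1:k
    malpha = malpha + f(i)*(U(:,i)'*d/s(i))*V(:,i);
end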
Example 5.1
We will now use the regularization toolbox to analyze the Shaw
problem (with noise), as introduced in Example 3.2. We begin by
computing the L–curve and finding its corner.
>> [reg_corner,rho,eta,reg_param]=l_curve(U,diag(S),dspiken,’Tikh’);
>> reg_corner
reg_corner =
3.120998999813652e-07
>> [mtik,rho,eta]=tikhonov(U,diag(S),V,dspiken,3.12e-7)
mtik =
-0.01756815988942
0.08736794531934
-0.11352225772109
-0.02687189186254
0.13097279143005
0.02511897984851
-0.17334073966220
-0.02730844460233
0.38747989040776
0.52263562656346
0.24617936521424
-0.01194773792291
-0.03194830349568
0.00037512886287
-0.01136710435884
0.00811188289513
0.02757132302673
-0.01861173224197
-0.01291369695969
0.01038008213527
rho =
2.921973109767267e-07
eta =
0.74616447426107

Figure 5.4: Tikhonov solution for the Shaw problem, α = 3.12 × 10^−7.
Here mtik is the solution (shown in Figure 5.4), rho is the misfit norm, and
eta is the solution norm. Notice that this solution is much better than the wild
solution obtained by the truncated SVD with p = 18 (Figure 4.20).
We can also use the discrep command to find the appropriate α
to obtain a Tikhonov regularized solution. Because normally distributed random
noise with standard deviation 1 × 10^−6 was added to this data, we search for a
solution where the square of the norm of the noise vector (and of the residuals)
is roughly n = 20 times 10^−12, or a corresponding norm kGm − dk2 of roughly
√20 × 10^−6 ≈ 4.47 × 10^−6.
>> [mdisc,alpha]=discrep(U,diag(S),V,dspiken,1.0e-6*sqrt(20));
>> alpha
alpha =
5.3561e-05
Figure 5.5: Tikhonov solution for the Shaw problem, α = 5.36 × 10−5 .
[Figure: Picard plot for the Shaw problem, showing σi, |ui^T b|, and |ui^T b|/σi
versus index i.]
Example 5.2
>> diag(R)
ans =
0.8981
0.4720
0.4461
0.3837
0.4145
0.4087
0.4205
0.4375
0.4390
0.4475
0.4475
0.4390
0.4375
0.4205
0.4087
0.4145
0.3837
0.4461
0.4720
0.8981
indicating that most entries in the solution vector are rather poorly resolved.
Figure 5.7 displays this poor resolution by applying R to the (true) spike model
(4.54). Recall that this is the model recovered when the true model is a spike
and there is no noise added to the data vector; additional features of a recovered
model in practice (5.1) will also be influenced by noise.
[Figure 5.7: R times the spike model.]
Example 5.3
Recall our earlier example of the Shaw problem with the spike model
input. Figure 5.8 shows the true model, the solution obtained using
α = 5.36 × 10−5 , and 95% confidence intervals for the estimated
parameters. Notice that very few of the actual model parameters
are included within the confidence intervals.
These estimates of the error are rough, order–of–magnitude figures. They
should not be counted on to give a very accurate estimate of the error in the
solution.
Figure 5.8: Confidence intervals for the Tikhonov solution to the Shaw problem
and a spike model.
Example 5.4
Figure 5.9 shows a plot of the Hanke and Raus estimates of the
error in the Shaw problem with noisy data from the spike model.
The graph has a number of local minimum points, and it is not
obvious which value of α should be used. One minimum lies near α =
5.36 × 10−5 , which was the value of α determined by the discrepancy
principle. At this value of α, the estimated error is about 0.058,
which is somewhat smaller than the actual error of about 0.74.
Figure 5.9: Hanke and Raus estimates of the solution error for the Shaw problem
with spike model.
second derivative sense. In second order Tikhonov regularization, we use
the L from (5.18).

We have already seen (5.9) how to solve (5.17) when L = I. This is called
zeroth order Tikhonov regularization. It is possible to extend this idea to
higher orders, and to problems in which the models are higher (e.g., 2- or 3-)
dimensional.

To solve higher order Tikhonov regularization problems, we employ the
generalized singular value decomposition, or GSVD. Using the GSVD, the
solution to (5.17) can be expressed as a sum of filter factors times generalized
singular vectors.
Unfortunately, the definition of the GSVD and associated notation are not
standardized. In the following, we will generally follow the conventions used by
the MATLAB regularization toolbox and its cgsvd command. One important
difference is that we will use λi for the generalized singular value, while the
toolbox uses σi for the generalized singular value. Note that MATLAB also has
a built in command, gsvd, which uses a different definition of the GSVD.
We will assume that G is an m by n matrix, and that L is a p by n matrix,
with m ≥ n ≥ p, that rank(L) = p, and that the null spaces of G and L intersect
only at the zero vector. It is possible to define the GSVD in a more general
context, but these assumptions are necessary for this development of the GSVD
and are reasonable for most problems.
Under the above assumptions there exist matrices U, V, Λ, M and X with
the following properties and relationships:
• U is m by n and orthogonal.
• V is p by p and orthogonal.
• X is n by n and nonsingular.
• Λ is a p by p diagonal matrix with diagonal entries 0 ≤ λ1 ≤ λ2 ≤ . . . ≤ λp ≤ 1. (5.19)

• M is a p by p diagonal matrix with diagonal entries 1 ≥ µ1 ≥ µ2 ≥ . . . ≥ µp > 0. (5.20)

• The generalized singular values are γi = λi / µi. (5.22)

• G = U [ Λ 0 ; 0 I_{n−p} ] X^{-1} (5.24)

• L = V [ M 0 ] X^{-1} (5.25)

• X^T G^T G X = [ Λ^2 0 ; 0 I ] (5.26)

• X^T L^T L X = [ M^2 0 ; 0 0 ] (5.27)
• When p < n, the matrix L will have a nontrivial null space. It can be
shown that the vectors X·,p+1 , X·,p+2 , . . ., X·,n form a basis for the null
space of L.
When G comes from an IFK, the GSVD typically has two properties that
were also characteristic of the SVD. First, the generalized singular values γi
(5.22) tend to zero without any obvious break in the sequence. Second, the vec-
tors U·,i , V·,i , and X·,i tend to become rougher as i increases and γi decreases.
In applying the GSVD to solve (5.17), we note that we can reformulate the
problem as a least-squares system

min k [ G ; αL ] m − [ d ; 0 ] k2^2 .   (5.28)
Assuming that (5.28) is of full rank, the corresponding normal equations are

[ G ; αL ]^T [ G ; αL ] m = [ G ; αL ]^T [ d ; 0 ] .   (5.29)
We could solve these normal equations directly, but the GSVD provides a short
cut. Rather than deriving the GSVD solution from scratch, we will simply show
that the solution is
mα,L = Σ_{i=1}^{p} [ γi^2 / (γi^2 + α^2) ] [ (U·,i)^T d / λi ] X·,i
       + Σ_{i=p+1}^{n} ((U·,i)^T d) X·,i .   (5.30)
Substituting (5.30) into the left hand side of (5.29) gives

(G^T G + α^2 L^T L) mα,L =

( X^{-T} [ Λ^T 0 ; 0 I ] U^T U [ Λ 0 ; 0 I ] X^{-1}
  + α^2 X^{-T} [ M^T ; 0 ] V^T V [ M 0 ] X^{-1} )
× ( Σ_{i=1}^{p} [ γi^2 (U·,i)^T d / ((γi^2 + α^2) λi) ] X·,i
    + Σ_{i=p+1}^{n} (U·,i)^T d X·,i )   (5.31)

= X^{-T} [ Λ^2 + α^2 M^2 , 0 ; 0 , I ] X^{-1}
  ( Σ_{i=1}^{p} [ γi^2 (U·,i)^T d / ((γi^2 + α^2) λi) ] X·,i
    + Σ_{i=p+1}^{n} (U·,i)^T d X·,i )

= Σ_{i=1}^{p} [ γi^2 (U·,i)^T d / ((γi^2 + α^2) λi) ]
    X^{-T} [ Λ^2 + α^2 M^2 , 0 ; 0 , I ] X^{-1} X·,i
  + Σ_{i=p+1}^{n} (U·,i)^T d X^{-T} [ Λ^2 + α^2 M^2 , 0 ; 0 , I ] X^{-1} X·,i

= Σ_{i=1}^{p} [ γi^2 (U·,i)^T d / ((γi^2 + α^2) λi) ] X^{-T} (λi^2 + α^2 µi^2) ei
  + Σ_{i=p+1}^{n} (U·,i)^T d X^{-T} ei

= X^{-T} ( Σ_{i=1}^{p} [ γi^2 (U·,i)^T d / ((γi^2 + α^2) λi) ] µi^2 ( λi^2/µi^2 + α^2 ) ei
           + Σ_{i=p+1}^{n} (U·,i)^T d ei )

= X^{-T} ( Σ_{i=1}^{p} γi^2 µi^2 [ (U·,i)^T d / λi ] ei + Σ_{i=p+1}^{n} (U·,i)^T d ei )

= X^{-T} ( Σ_{i=1}^{p} λi (U·,i)^T d ei + Σ_{i=p+1}^{n} (U·,i)^T d ei ) .

We can also simplify the right hand side of the original system of equations:

G^T d = X^{-T} [ Λ^T 0 ; 0 I ] U^T d
      = X^{-T} ( Σ_{i=1}^{p} λi (U·,i)^T d ei + Σ_{i=p+1}^{n} (U·,i)^T d ei ) .   (5.32)
Thus

(G^T G + α^2 L^T L) mα,L = G^T d .

[Figure: the true slowness model (s/km) versus depth (m).]
Example 5.5
Figure 5.11: Data for the smooth model Shaw problem, noise added.
[Figure: L–curve for zeroth order Tikhonov regularization; residual norm
kGm − dk2 versus solution norm kmk2.]
[Figure: zeroth order Tikhonov solution; slowness (s/km) versus depth (m).]
Figure 5.14 shows the first–order L–curve for this problem obtained
using the regularization toolbox commands get_l, cgsvd, l_curve,
and tikhonov. The L–curve now has a well-defined corner near
α ≈ 137 for this particular data realization, and Figure 5.15 shows
the corresponding solution. The first–order regularized solution is
much smoother than the zeroth order regularized one, and is much
closer to the true solution.

Figure 5.16 shows the L–curve for second–order Tikhonov regularization,
which has a corner at α ≈ 2325, and Figure 5.17 shows the corresponding
solution. This solution is smoother still compared to the first–order
regularized solution. Both the first– and second–order solutions depart
most from the true solution at shallow depths, where the true slowness
has the greatest slope and curvature.
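A minimal MATLAB sketch of that workflow, using the Regularization Tools
commands named above (the exact calling sequences should be checked against
the toolbox documentation; G and d are assumed to be defined), is:

% Higher order Tikhonov regularization via the GSVD (sketch).
n = size(G,2);

L1 = get_l(n, 1);                                   % first order roughening matrix
[U1, sm1, X1] = cgsvd(G, L1);                       % GSVD of (G, L1)
[alpha1, rho1, eta1] = l_curve(U1, sm1, d, 'Tikh'); % L-curve corner
[m1, rho1, eta1] = tikhonov(U1, sm1, X1, d, alpha1);

L2 = get_l(n, 2);                                   % second order roughening matrix
[U2, sm2, X2] = cgsvd(G, L2);
[alpha2, rho2, eta2] = l_curve(U2, sm2, d, 'Tikh');
[m2, rho2, eta2] = tikhonov(U2, sm2, X2, d, alpha2);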
[Figure 5.14: L–curve for first order regularization; residual norm kGm − dk2
versus solution seminorm kLmk2.]
[Figure 5.15: first order regularized solution; slowness (s/km) versus depth (m).]
[Figure 5.16: L–curve for second order regularization; residual norm kGm − dk2
versus solution seminorm kLmk2.]
[Figure 5.17: second order regularized solution; slowness (s/km) versus depth (m).]
where
G♯ = (G^T G + α^2 L^T L)^{-1} G^T .   (5.34)
Using properties of the GSVD we can simplify this expression to
G♯ = X [ (Λ^2 + α^2 M^2)^{-1}  0 ; 0  I ] X^T X^{-T} [ Λ^T  0 ; 0  I ] U^T
   = X [ (Λ^2 + α^2 M^2)^{-1}  0 ; 0  I ] [ Λ^T  0 ; 0  I ] U^T
   = X [ F Λ^{-1}  0 ; 0  I ] U^T ,   (5.35)
where F is a diagonal matrix with diagonal elements
Fi,i = γi^2 / (γi^2 + α^2) .   (5.36)
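As a minimal MATLAB sketch (assuming G, L, and alpha are defined, and
forming the resolution operator from (5.34)), a spike test of a Tikhonov solution
can be carried out with:

% Resolution of a Tikhonov solution (sketch).
Gsharp = (G'*G + alpha^2*(L'*L)) \ G';    % regularized inverse operator (5.34)
R = Gsharp*G;                             % model resolution matrix

k0 = 50;                                  % assumed index of the spike element
mspike = zeros(size(G,2), 1);
mspike(k0) = 1;
rspike = R*mspike;                        % equals column k0 of R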
Example 5.6
To examine the resolution of the Tikhonov-regularized inversions
of Example 5.5, we perform a spike test using (5.37). Figure 5.18
shows the effect of multiplying Rα,L with a unit amplitude spike
model (spike depth 500 m) under zeroth–, first–, and second–order
Tikhonov regularization using the corresponding α values of 10.0,
137, and 2325, respectively.
5.18) are indicative of the ability of each inversion to recover abrupt
changes in slowness halfway through the model. Progressively in-
creasing widths with increasing regularization order demonstrate de-
creasing resolution as more misfit is allowed in the interest of obtain-
ing a progressively smoother solution. Under first– or second–order
regularization, the resolution of various model features will depend
critically on how smooth or rough these features are in the true
model. In this example, the higher–order solutions recover the true
model better because the true model is smooth.
Figure 5.18: Rα,L times the spike model for each of the regularized solutions of
Example 5.5.
Figure 5.19: L–curve for the TGSVD solution of the Shaw problem, smooth
model.
When we first started working with the singular value decomposition, we
examined the truncated singular value decomposition (TSVD) method of
regularization. We can conceptualize the TSVD as simply tossing out small
singular values and their associated model space basis vectors, or as a damped
SVD solution in which filter factors of one are used for larger singular values
and filter factors of zero are used for smaller singular values. This approach can
also be extended to work with the generalized singular value decomposition to
obtain a truncated generalized singular value decomposition or TGSVD
solution. In this case we include the k largest generalized singular values with
filter factors of one to obtain the solution

mk,L = Σ_{i=p−k+1}^{p} [ (U·,i)^T d / λi ] X·,i + Σ_{i=p+1}^{n} ((U·,i)^T d) X·,i .   (5.38)
Figure 5.19 shows the L–curve for the TGSVD solution of the Shaw problem
with the smooth true model. The corner occurs at k = 10. Figure 5.20 shows
the corresponding solution. The model recovery is reasonably good, but it is not
as close to the original model as the solution obtained by second order Tikhonov
regularization.
[Figure 5.20: TGSVD solution of the Shaw problem with the smooth model, k = 10.]
Here V (α) measures the misfit. As α increases, V (α) will also increase.
Below the corner of the L–curve, V (α) should increase very slowly. At the corner
of the L–curve, V (α) should begin to increase fairly rapidly. The denominator,
T (α) measures the closeness of the data resolution matrix to the identity matrix.
T (α) is a slowly increasing function of α. Intuitively, we would expect G(α) to
have a minimum near the corner of the L–curve. For values of α below the
corner of the curve, V (α) will be roughly constant and T (α) will be increasing,
so G(α) should be decreasing. For values of α above the corner of the L–curve,
V (α) should be increasing rapidly, while T (α) should be increasing slowly. Thus
G(α) should be increasing. In the GCV method, we pick the value α∗ which
minimizes G(α).
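A minimal MATLAB sketch of this selection, using the standard zeroth order
GCV function g(α) = m kGmα − dk2^2 / trace(I − GG♯)^2 (assumed here to match
the definition in the text; G and d are assumed to be defined), is:

% Pick alpha by generalized cross validation (sketch).
alphas = logspace(-6, 2, 100);
gcv = zeros(size(alphas));
[mm, n] = size(G);
for k = 1:length(alphas)
    Gsharp = (G'*G + alphas(k)^2*eye(n)) \ G';
    malpha = Gsharp*d;
    gcv(k) = mm*norm(G*malpha - d)^2 / trace(eye(mm) - G*Gsharp)^2;
end
[gmin, kmin] = min(gcv);
alphastar = alphas(kmin);     % alpha that minimizes the GCV function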
Wahba was able to show that, under reasonable assumptions about the
smoothness of mtrue and under the assumption that the noise is white, the
value of α that minimizes G(α) approaches the value that minimizes
E[kGmα,L − dtrue k2] as the number of data points m goes to infinity [Wah90].
Wahba was also able to show that under the same assumptions, E[kmtrue − mα,L k2]
goes to 0 as m goes to infinity [Wah90]. These results are too complicated to prove
Figure 5.21: GCV curve for the Shaw problem, second order regularization.
here, but they do provide an important theoretical justification for the GCV
method of selecting the regularization parameter.
Example 5.7
Figure 5.21 shows a graph of G(α) for our Shaw test problem, us-
ing second order Tikhonov regularization. The minimum occurs at
α = 7.04 × 10−6 . Figure 5.22 shows the corresponding solution.
This solution is actually the best of all the solutions that we have
obtained for this test problem. The norm of the difference between
this solution and the true solution is 0.028.
Figure 5.22: GCV solution for the Shaw problem, second order, α = 7.04×10−6 .
Theorem 5.1

Suppose that the problems

min kGm − dk2^2 + α^2 kmk2^2   (5.40)

and

min kḠm − d̄k2^2 + α^2 kmk2^2   (5.41)

are solved to obtain mα and m̄α. Then

kmα − m̄α k2 / kmα k2 ≤ (κ̄α / (1 − εκ̄α)) ( 2ε + kek2/kdα k2 + εκ̄α krα k2/kdα k2 )   (5.42)

where

κ̄α = kGk2 / α   (5.43)

E = G − Ḡ   (5.44)

e = d − d̄   (5.45)

ε = kEk2 / kGk2   (5.46)

dα = Gmα   (5.47)

rα = d − dα .   (5.48)
In the particular case when G = Ḡ, and the only difference between the two
problems is e = d − d̄, the inequality becomes even simpler:

kmα − m̄α k2 / kmα k2 ≤ κ̄α kek2 / kdα k2 .
Figure 5.24: GT times the spike model for the Shaw problem.
Figure 5.25: GT G times the spike model for the Shaw problem.
Theorem 5.2

Suppose that we use zeroth order Tikhonov regularization to solve
Gm = d and that mtrue can be expressed as

mtrue = { G^T w     p = 1
        { G^T G w   p = 2   (5.50)

and that

kGmtrue − dk2 ≤ ∆kwk2   (5.51)

for some ∆ > 0. Then

kmtrue − G♯ dk2 ≤ ( ∆/(2α) + γα^p ) kwk2   (5.52)

where

γ = { 1/2   p = 1
    { 1     p = 2 .   (5.53)

Furthermore, if we begin with the bound

kGmtrue − dk2 ≤ δ ,   (5.54)

we can let

∆ = δ / kwk2 .   (5.55)

Under this condition, it can be shown that the optimal value of α is

α̂ = ( ∆ / (2γp) )^{1/(p+1)} = O(∆^{1/(p+1)}) ,   (5.56)

giving

∆ = 2γp α̂^{p+1} .   (5.57)
This theorem tells us that the error in the Tikhonov regularization solution
depends on both the noise level ∆ and on the regularization parameter α. For
very large values of α, the error due to regularization will be dominant. For very
small values of α, the error due to noise in the data will be dominant. There
is an optimal value of α which balances these effects. Using the optimal α, we
can obtain an error bound of O(∆2/3 ) if p = 2, and an error bound of O(∆1/2 )
if p = 1.
Of course, the above result can only be applied when our true model lives
in the very restricted subspace of models in R(GT ). In practice this seldom
happens. The result also depends on ∆, which we can at best approximate by
∆ ≈ δ. As a rule of thumb, if the model mtrue is reasonably smooth, then
we can hope for an error in the Tikhonov regularized solution which is O(δ 1/2 ).
Another way of saying this is that we can hope for an answer with about half
as many accurate digits as the data.
5.9 Exercises
1. Use the method of Lagrange multipliers to derive the damped least squares
problem (5.2) from the discrepancy principle problem (5.1), and demonstrate
that (5.2) can be written as (5.4).
2. Consider the integral equation and data set from Problem 3.1. You can
find a copy of this data set in ifk.mat.
(e) Use second order Tikhonov regularization to solve the problem. Use
GCV, the discrepancy principle and the L–curve criterion to pick the
regularization parameter.
(f) Analyze the resolution of your solutions. Are the features you see
in your inverse solutions unambiguously real? Interpret your results.
Describe the size and location of any significant features in the solu-
tion.
(a) Use the truncated SVD to solve this inverse problem. Plot the result
using imagesc and colorbar.
(b) Use zeroth order Tikhonov regularization to solve this problem and
plot your solution. Explain why it is hard to use the discrepancy
principle to select the regularization parameter. Use the L–curve
criterion to select your regularization parameter. Plot the L–curve
as well as your solution.
(c) Use second order Tikhonov regularization to solve this problem and
plot your solution. Because this is a two dimensional problem, you
will need to implement a finite difference approximation to the Lapla-
cian (second derivative in the horizontal direction plus the second
derivative in the vertical direction). You can generate the appropri-
ate L roughening matrix using the following MATLAB code:
L=zeros(14*14,256);
k=1;
for i=2:15,
for j=2:15,
M=zeros(16,16);
M(i,j)=-4;
M(i,j+1)=1;
M(i,j-1)=1;
M(i+1,j)=1;
M(i-1,j)=1;
L(k,:)=(reshape(M,256,1))’;
k=k+1;
end
end
What if any problems did you have in using the L–curve criterion on
this problem? Plot the L–curve as well as your solution.
(d) Discuss your results. Can you explain the characteristic vertical
bands that appeared in some of your solutions?
Chapter 6
Iterative Methods
SVD based pseudo inverse and Tikhonov regularization solutions become im-
practical when we consider larger problems in which G has tens of thousands
of rows and columns. The problem is that while G may be a sparse matrix, the
U and V matrices in the SVD will typically be dense matrices. For example,
consider a tomography problem in which the model is of size 256 by 256 (65536
model elements), and there are 100,000 ray paths. Most of the ray paths miss
most of the model cells, so the majority of the entries in G are zero. Thus the
G matrix might have a density of less than 1%. If we stored G as a regular
dense matrix, it would require about 50 gigabytes of storage. Furthermore, the
U matrix would require 80 gigabytes of storage, and the V matrix would require
about 35 gigabytes of storage. However, the sparse matrix G can be stored in
less than one gigabyte of storage.
Because iterative methods can take advantage of the sparsity commonly
found in the G matrix, iterative methods are often the only applicable method
for large problems.
6-1
6-2 CHAPTER 6. ITERATIVE METHODS
Figure 6.2: Slow convergence occurs when hyperplanes are nearly parallel.
Example 6.1
The quantity qi+1 is an approximation to the sum of the slownesses along ray
path i + 1. The difference between qi+1 and di+1 is roughly the error in our
predicted travel time for ray i + 1. The denominator of the fraction becomes
Ni+1, the number of cells along ray path i + 1, or equivalently the number of
nonzero entries in row i + 1 of G.

Thus we are taking the total error in the travel time for ray i + 1 and
dividing it by the number of cells in ray path i + 1. This correction factor is
then multiplied by a vector which has ones in cells along the ray path i + 1.
This has the effect of smearing the needed correction in travel time over all of
the cells in ray path i + 1.
The new approximate update formula can be written as

mj^(i+1) = { mj^(i) − (qi+1 − di+1)/Ni+1    cell j in ray path i + 1
           { mj^(i)                          cell j not in ray path i + 1 .   (6.6)

The approximation can be improved by taking into account that the ray
paths through some cells are longer than the ray paths through other cells. Let
Li+1 be the length of ray path i + 1. We can improve the approximation by
using the following update formula:

mj^(i+1) = { mj^(i) + di+1/Li+1 − qi+1/Ni+1   cell j in ray path i + 1
           { mj^(i)                            cell j not in ray path i + 1 .   (6.7)
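A minimal MATLAB sketch of one sweep of this update is given below; the
logical matrix INRAY (true where cell j lies in ray path i), the ray path lengths
Lray, the data vector d, and the starting model m are assumed, hypothetical names.

% One ART sweep based on the update formula (6.7) (sketch).
N = sum(INRAY, 2);                   % number of cells in each ray path
for i = 1:size(INRAY, 1)
    cells = find(INRAY(i,:));        % cells crossed by ray path i
    q = sum(m(cells));               % approximate slowness sum along the ray
    m(cells) = m(cells) + d(i)/Lray(i) - q/N(i);   % apply (6.7)
end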
The main advantage of ART is that it saves storage: we need only store
information about which rays pass through which cells, and we do not need to
record the length of each ray in each cell. A second advantage of the method is
that it reduces the number of floating point multiplications required by
Kaczmarz's algorithm. Although in current computers floating point multiplications
and additions require roughly the same amount of time, during the 1970s, when
ART was first developed, multiplication was slower than addition.
One problem with ART is that the resulting tomographic images tend to
be somewhat more noisy than the images resulting from Kaczmarz’s algorithm.
The Simultaneous Iterative Reconstruction Technique (SIRT) is a variation on
ART which gives slightly better images in practice, at the expense of a slightly
slower algorithm. In the SIRT algorithm, we compute updates to cell j of the
model for each ray i that passes through cell j. The updates for cell j are added
together and then divided by Nj before being added to mj .
Example 6.2
Returning to our earlier tomography example, Figure 6.6 shows the
ART solution obtained after 200 iterations. Again, the solution is
very similar to the truncated SVD solution.
Figure 6.7 shows the SIRT solution for our example tomography
problem. This solution is similar to the Kaczmarz’s and ART solu-
tions.
∇φ(x) = Ax − b = 0 (6.9)
or
Ax = b . (6.10)
Thus solving the system of equations Ax = b is equivalent to minimizing φ(x).
The method of conjugate gradients approaches the problem of minimizing
φ(x) by constructing a basis for Rn in which the minimization problem is ex-
tremely simple. The basis vectors p0, p1, . . ., pn−1 are selected so that
pi^T A pj = 0 whenever i ≠ j; such vectors are said to be mutually conjugate
with respect to A. Writing the solution as

x = Σ_{i=0}^{n−1} αi pi ,   (6.12)

we have
φ(α) = (1/2) ( Σ_{i=0}^{n−1} αi pi )^T A ( Σ_{i=0}^{n−1} αi pi ) − b^T ( Σ_{i=0}^{n−1} αi pi ) .   (6.13)
Since the vectors are mutually conjugate with respect to A, this simplifies to
φ(α) = (1/2) Σ_{i=0}^{n−1} αi^2 pi^T A pi − b^T Σ_{i=0}^{n−1} αi pi   (6.15)
or
φ(α) = (1/2) Σ_{i=0}^{n−1} ( αi^2 pi^T A pi − 2 αi b^T pi ) .   (6.16)
Equation (6.16) shows that φ(α) consists of n terms, each of which is inde-
pendent of the other terms. Thus we can minimize φ(α) by selecting each αi to
minimize the ith term,
αi2 pTi Api − 2αi bT pi .
Differentiating with respect to αi and setting the derivative equal to zero, we
find that the optimal value for αi is

αi = b^T pi / (pi^T A pi) .   (6.17)
We’ve seen that if we have a basis of vectors which are mutually conjugate
with respect to A, then minimizing φ(x) is very easy. We have not yet shown
how to construct the mutually conjugate basis vectors.
Our algorithm will actually construct a sequence of solution vectors xi , resid-
ual vectors ri = b − Axi , and basis vectors pi . The algorithm begins with
x0 = 0, r0 = b, p0 = r0 , and α0 = (rT0 r0 )/(pT0 Ap0 ).
Suppose that at the start of iteration k of the algorithm we have constructed
x0 , x1 , . . ., xk , r0 , r1 , . . ., rk , p0 , p1 , . . ., pk and α0 , α1 , . . ., αk . We assume
that the first k + 1 basis vectors pi are mutually conjugate with respect to A,
the first k + 1 residual vectors ri are mutually orthogonal, and that rTi pj = 0
when i 6= j.
We let
xk+1 = xk + αk pk . (6.18)
This effectively adds one more term of (6.12) into the solution. Next, we let
We let

βk+1 = (rk+1^T rk+1) / (rk^T rk) .   (6.24)
Finally, we let
pk+1 = rk+1 + βk+1 pk (6.25)
We will now show that rk+1 is orthogonal to ri for i ≤ k. For every i < k,
rTk+1 ri = 0. (6.36)
3. Let xk+1 = xk + αk pk.

4. Let rk+1 = rk + αk Apk.

5. Let βk = krk+1 k2^2 / krk k2^2.

6. Let k = k + 1.
A major advantage of the CG method is that it requires storage only for the
vectors xk, pk, rk and the matrix A. If A is large and sparse, then sparse matrix
techniques can be used to store A. Unlike factorization methods (QR, SVD,
Cholesky), there will be no fill-in of the zero entries in A. Thus it is possible to
solve extremely large systems (n in the hundreds of thousands) using CG, while
direct factorization would require far too much storage.
GT Gm = GT d . (6.61)
3. Let mk+1 = mk + αk pk .
Example 6.3
Recall the image deblurring problem that we discussed in Chapter
1 and Chapter 3. Figure 6.8 shows an image that has been blurred
and also has a small amount of added noise.
This image is of size 200 pixels by 200 pixels, so the G matrix for the
blurring operator is of size 40,000 by 40,000. Fortunately, the blur-
ring matrix G is quite sparse, with less than 0.1% nonzero elements.
The matrix requires about 12 megabytes of storage. A dense matrix
of this size would require about 13 gigabytes of storage. Using the
SVD approach to Tikhonov regularization would require far more
storage than most current computers have. However, CGLS works
quite well on this problem.
Figure 6.9 shows the L–curve for the CGLS solution of this problem.
For the first 15 or so iterations, kGm − dk2 decreases quickly.
After that point, the improvement in misfit slows down, while kmk2
increases rapidly. Figure 6.10 shows the CGLS solution after 30 iterations.
The blurring has been greatly reduced. Note that 30 iterations is
far less than the size of the matrix, n = 40,000. Unfortunately, further
CGLS iterations do not significantly improve the image. In fact,
noise builds up rapidly. Figure 6.11 shows the CGLS solution after
100 iterations. In this image the noise has been greatly amplified,
with little or no improvement in the clarity of the image.
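A minimal CGLS sketch in MATLAB (following the standard CGLS recursion,
and assumed here rather than taken verbatim from the text) shows why only
matrix-vector products with the sparse G are needed:

% CGLS for min ||G*m - d||_2 with sparse G (sketch).
m = zeros(size(G,2), 1);
r = d - G*m;                 % data residual
g = G'*r;                    % gradient (normal equations residual)
p = g;
for k = 1:30                 % e.g., stop after 30 iterations, as in Example 6.3
    q     = G*p;
    alpha = (g'*g)/(q'*q);
    m     = m + alpha*p;
    r     = r - alpha*q;
    gnew  = G'*r;
    beta  = (gnew'*gnew)/(g'*g);
    p     = gnew + beta*p;
    g     = gnew;
end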
[Figure 6.9: L–curve for the CGLS solution; kGm − dk2 versus kmk2.]
6.4 Exercises
1. Consider the cross well tomography problem of exercise 5.2.
(a) Apply Kaczmarz’s algorithm to this problem.
(b) Apply ART to this problem.
(c) Apply SIRT to this problem.
(d) Comment on the solutions that you obtained.
2. The regularization toolbox command blur computes the system matrix for
the problem of deblurring an image that has been blurred by a Gaussian
point spread function. The file blur.mat contains a particular G matrix
and a data vector d.
(a) How large is the G matrix? How many nonzero entries does it have?
How much storage would be required for the G matrix if all of its en-
tries were nonzero? How much storage would the SVD of G require?
(b) Plot the raw image.
(c) Using CGLS, obtain an estimate of the true image. Plot your solu-
tion.
3. Show that if p0 , p1 , . . ., pn−1 are mutually conjugate with respect to an n
by n symmetric and positive definite matrix A, then the vectors are also
linearly independent. Hint: Use the definition of linear independence.
Chapter 7
Other Regularization
Techniques
We can also consider the bounded variables least squares (BVLS) problem

min kGm − dk2
subject to m ≥ l
           m ≤ u .   (7.2)
Stark and Parker developed an algorithm for solving the BVLS problem and
implemented their algorithm in Fortran [SP95]. A similar algorithm is given in
the 1995 edition of Lawson and Hanson’s book [LH95].
Given the BVLS algorithm for (7.2), we can perform Tikhonov regulariza-
tion with bounds. The MATLAB regularization toolbox includes a command
tikhcstr for Tikhonov regularization with lower and upper bounds on the model
parameters.
A related optimization problem involves minimizing or maximizing a linear
functional of the model subject to bounds constraints and a constraint on the
misfit. This problem can be formulated as

min c^T m
subject to kGm − dk2 ≤ δ
           m ≥ l
           m ≤ u .   (7.3)
This problem can be solved by an algorithm given in Stark and Parker [SP95].
Solutions to this problem can be used to obtain bounds on the maximum and
minimum possible values of model parameters.
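As a minimal MATLAB sketch (using lsqlin from the Optimization Toolbox as
a stand-in for a dedicated BVLS code such as Stark and Parker's; G, d, alpha,
L, and the bound vectors l and u are assumed to be defined):

% Bounded variables least squares (7.2) via lsqlin (sketch).
mbvls = lsqlin(G, d, [], [], [], [], l, u);

% Tikhonov regularization with bounds: augment the system and keep the bounds.
Gaug = [G; alpha*L];
daug = [d; zeros(size(L,1), 1)];
mreg = lsqlin(Gaug, daug, [], [], [], [], l, u);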
Example 7.1
Recall the source history reconstruction problem of Example 3.3.2.
Figure 7.1 shows a hypothetical source history. Figure 7.2 shows the
corresponding samples at T = 300, with noise added.
Figure 7.3 shows the least squares solution. Clearly, some regular-
ization is required. Figure 7.4 shows the nonnegative least squares
solution. This solution is somewhat better, in that concentrations
are all nonnegative. However, the true source history is not ac-
curately reconstructed. Furthermore, the NNLS solution includes
concentrations that are larger than one. Suppose for a moment that
the solubility limit of the contaminant in water is known to be 1.1.
This provides a natural upper bound on the elements of the model
vector. Figure 7.5 shows the corresponding BVLS solution. Further
regularization is required.
Figure 7.6 shows the L–curve for a second order Tikhonov regular-
ization solution with bounds on the model parameters. Figure 7.7
shows the regularized solution for λ = 0.0616. This solution cor-
rectly shows the two major peaks in the input concentration, but
[Figure 7.1: hypothetical source history, Cin(t) versus time.]
both peaks are slightly broader than they should be. The solution
misses one smaller peak at around t = 150.
We can use (7.3) to establish bounds on the values of the model
parameters. For example, we might want to establish bounds on the
average concentration from t = 125 to t = 150. These concentrations
appear in positions 51 through 60 of the model vector m. We let
ci be zero in positions 1 through 50 and 61 through 100, and let ci
be 0.1 in positions 51 through 60. The solution to (7.3) is then a
bound on the average concentration from t = 125 to t = 150. After
solving the optimization problem, we obtained a lower bound 0.36
and an upper bound 0.73 for the average concentration during this
time period.
[Figure 7.2: sampled concentration data C(x, 300) versus distance.]

[Figure 7.3: least squares solution; concentration versus time.]

[Figure 7.4: nonnegative least squares solution; concentration versus time.]

[Figure 7.5: BVLS solution; concentration versus time.]

[Figure 7.6: L–curve for the bounded second order Tikhonov solution; kLmk versus kGm − dk.]

[Figure 7.7: regularized solution with bounds; concentration versus time.]
ln(wx) + 1 = 0. (7.7)
[Figure: the functions x^2, x ln x, x ln 0.2x, and x ln 5x.]
x = 1/(ew) .   (7.8)
Thus maximum entropy regularization favors solutions with x values near 1/(ew),
and penalizes solutions with larger or smaller x values. For large values of x,
the function x2 grows faster than x ln wx. Maximum entropy regularization pe-
nalizes solutions with large x values, but not as much as 0th order Tikhonov
regularization. It should be clear that the choice of w, along with the magnitude
of the model parameters can have a significant effect on the maximum entropy
regularization solution.
Maximum entropy regularization is particularly popular in astronomy and
image processing applications [CE85, NN86, SB84]. Here the goal is to recover
a solution which consists of bright spots (stars, galaxies or other astronomical
objects) on a dark background. The non-negativity constraints in maximum
entropy furthermore ensure that the resulting image will not include features
with negative intensities. While conventional Tikhonov regularization tends
to broaden peaks in the solution, maximum entropy regularization may not
penalize sharp peaks as much.
Example 7.2
We will assume that the data errors are normally distributed with
mean 0 and standard deviation 1. Following the discrepancy princi-
ple, we will look for a solution with kGm − dk2 around 4.4.
For this problem, default weights of w = 1 are appropriate, since the
background noise level of about 0.5 is close to the minimum of the
regularization term. We next solved (7.5) for several values of the
regularization parameter α. We found that at α = 0.2, the misfit was
kGm − dk2 = 4.4. The corresponding solution is shown in Figure
7.10. The spike near θ = −0.5 is visible, but the magnitude of the
peak is not correctly estimated. The second spike near θ = 0.7 is
not very well resolved.
For comparison, we also used 0th order Tikhonov regularization with
a nonnegativity constraint to solve the same problem. Figure 7.11
shows the Tikhonov solution. This solution is similar to the solution
produced by maximum entropy regularization.
For this sample problem, maximum entropy regularization had no
advantage over 0th order Tikhonov regularization with non-negativity
constraints. The best maximum entropy solution was compara-
ble to the solution obtained by Tikhonov regularization with non-
negativity constraints. This result is consistent with the results from
a number of sample problems in [AH91]. Amato and Huges found
that maximum entropy regularization was at best comparable to,
and often worse than Tikhonov regularization with non-negativity
constraints.
[Figure 7.10: maximum entropy regularized solution.]

[Figure 7.11: zeroth order Tikhonov solution with non-negativity constraints.]
Per Hansen has suggested another approach which retains the two–norm
of the data misfit while incorporating the TV regularization term [Han98]. In
Hansen's PP-TSVD method, we begin by using the SVD and the k largest
singular values of G to obtain a rank k approximation to G. This approximation
is

Gk = Σ_{i=1}^{k} si U·,i V·,i^T .   (7.15)

Note that the matrix Gk will be rank deficient. The point of the approximation
is to obtain a matrix with a well defined null space. The vectors V·,k+1, . . .,
V·,n form a basis for the null space of Gk. We will need this basis later, so let

Bk = [ V·,k+1  V·,k+2  . . .  V·,n ] .
Using the model basis set [V·,1 . . . V·,k], the minimum length least squares
solution is, from the SVD,

mk = Σ_{i=1}^{k} [ (U·,i)^T d / si ] V·,i .   (7.17)
We can modify this solution by adding in any vector in the null space of Gk .
This will increase kmk2 , but have no effect on kGk m − dk.
We can use this to find solutions which minimize some regularization func-
tional, but have minimum misfit. For example, in the modified truncated SVD
(MTSVD) method, we find a model mL,k which minimizes kLmk among those
models which minimize kGk m−dk. Since all models which minimize kGk m−dk
can be written as m = mk − Bk z for some vector z, the MTSVD problem can
be written as
min kL(mk − Bk z)k2 (7.18)
or
min kLBk z − Lmk k2 . (7.19)
This is a least squares problem that can be solved with the SVD, QR
factorization, or by the normal equations

(LBk)^T (LBk) z = (LBk)^T L mk .
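A minimal MATLAB sketch of the MTSVD construction (G, d, L, and the
truncation level k are assumed to be defined) is:

% MTSVD solution built from (7.15), (7.17), and (7.19) (sketch).
[U,S,V] = svd(G);
s = diag(S);
n = size(G,2);

mk = V(:,1:k)*((U(:,1:k)'*d)./s(1:k));   % truncated SVD solution (7.17)
Bk = V(:,k+1:n);                         % basis for the null space of Gk

z = (L*Bk) \ (L*mk);                     % solve the least squares problem (7.19)
mmtsvd = mk - Bk*z;                      % MTSVD solution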
The PP-TSVD algorithm uses a similar approach. First, we minimize kGk m−
dk2 . Let β be the minimum value of kGk m − dk2 . Instead of minimizing the
two–norm of Lm, we minimize the 1-norm of Lm, subject to the constraint that
m must be a least squares solution.
min kLmk1
subject to kGk m − dk2 = β .   (7.20)
Substituting m = m_k − B_k z, this becomes

\min \| L B_k z - L m_k \|_1 .  (7.22)

This is a 1-norm minimization problem which can be solved in many ways,
including IRLS.
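A minimal IRLS sketch for a problem of the form of (7.22) is given below; it is an illustration under the assumption that L, Bk, and mk from the discussion above are available, not a substitute for Hansen's pptsvd routine:

% Minimal IRLS sketch for min ||A*z - b||_1 with A = L*Bk and b = L*mk.
A = L * Bk;  b = L * mk;
z = A \ b;                      % start from the least squares solution
tol = 1.0e-8;
for iter = 1:50
    r = A*z - b;
    w = 1 ./ max(abs(r), tol);  % weights approximate the 1-norm objective
    W = diag(w);
    znew = (A' * W * A) \ (A' * W * b);
    if norm(znew - z) < tol * (1 + norm(z)), z = znew; break; end
    z = znew;
end
m_pp = mk - Bk * z;             % piecewise-constant PP-TSVD style model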
Note that in (7.22) the matrix LBk has n − 1 rows and n − k columns.
In general, when we solve this 1-norm minimization problem, we can find a
solution in which n − k of the equations are satisfied exactly. Thus at most
k − 1 entries in L(mk − Bk z) will be nonzero. Since each nonzero entry in
L(mk − Bk z) corresponds to a discontinuity in the model, there will be at most
k −1 discontinuities in the model. Furthermore, the zero entries in L(mk −Bk z)
correspond to points at which the model is constant. For example, if we use
k = 1, we will get a flat model with no discontinuities. For k = 2, we can obtain
a model with two flat sections and one discontinuity, and so on.
The PP-TSVD method can also be extended to piecewise linear functions
and to piecewise higher order polynomials by using a matrix L which approxi-
mates the second or higher order derivatives. The MATLAB function pptsvd,
available from Per Hansen’s web page, implements the PP-TSVD algorithm.
Example 7.3
In this example we consider the Shaw problem with a target model
which consists of a single step function. Our target model is shown
in Figure 7.12. The corresponding data, with noise (normal, 0.001
standard deviation) added, are shown in Figure 7.13.
First, we solved the problem with conventional Tikhonov regular-
ization. Figure 7.14 shows the 0th order Tikhonov regularization solution.
\min\ c^T m
\text{subject to } \| G m - d \|_2^2 \le \delta^2 .  (7.23)

for some constant C. Show how this constraint can be incorporated into
(7.5) using a second Lagrange multiplier.
Chapter 8
Fourier Techniques
Inverse problems involving such systems can thus be usefully solved for m(t)
via an inverse or deconvolution operation. Here, the independent variable t is
time, and the data d, model m, and system kernel g are all functions of time.
However, the results are just as applicable to spatial problems and are also
generalizable to higher dimensions. In this chapter we briefly review the essentials
of Fourier theory in the context of performing convolutions and deconvolutions.
Readers are urged to consult one of the many references for a more complete
treatment.
Consider a linear time-invariant operator, G, that converts an unknown
model, m(t), into an observable data function d(t). Such an operator obeys
superposition (8.3) and scaling

G[\alpha m(t)] = \alpha G[m(t)] ,  (8.4)

where α is a scalar. (8.4) also implies that the output of the system is zero when
m(t) = 0:

G[0] = 0 .  (8.5)
To show that the operation of any system satisfying (8.3) and (8.4) can be
cast in the form of (8.1), we utilize the sifting property of the impulse or
delta function, δ(t). The delta function can be conceptualized as the limiting
case of a pulse as its width goes to zero, its height goes to infinity, and its area
stays constant and equal to one, e.g.,

\delta(t) = \lim_{\tau \to 0} \tau^{-1} \Pi(t/\tau) .  (8.6)

The sifting property is

\int_a^b f(t)\,\delta(t - t_0)\, dt = f(t_0) \qquad a \le t_0 \le b  (8.8)
                                  = 0 \qquad\;\; \text{elsewhere}  (8.9)

for any f(t) continuous at finite t = t_0. The impulse response of a system,
where the model and data are related by the operation G, is defined as the
output produced when the input is a delta function

g(t) = G[\delta(t)] .  (8.10)

An impulse response is also widely referred to as a system Green's function
in many applications.
We will now show that all linear, time-invariant system outputs for a given
input are characterizable via the convolution integral (8.1). First, note that any
input signal, m(t), can clearly be written as a summation of impulse functions
through the sifting property

m(t) = \int_{-\infty}^{\infty} m(\tau)\,\delta(t - \tau)\, d\tau .  (8.11)

Thus, a general linear system response d(t) to an arbitrary input m(t) can be
written as

d(t) = G\left[ \int_{-\infty}^{\infty} m(\tau)\,\delta(t - \tau)\, d\tau \right]  (8.12)

or, from the fundamental definition of the integral operation,

d(t) = G\left[ \lim_{\Delta\tau \to 0} \sum_{n=-\infty}^{\infty} m(\tau_n)\,\delta(t - \tau_n)\,\Delta\tau \right] .  (8.13)
where F denotes the Fourier transform operator, and F −1 denotes the inverse
Fourier transform. The impulse response g(t) is called the time domain
response, and its Fourier transform, G(f ), is called the spectrum of g(t). For
a system with impulse response g(t), G(f ) is called the frequency domain
response or transfer function of the system. The Fourier transform (8.16)
gives a formula for evaluating the spectrum, and the inverse Fourier transform
(8.17) says that the time domain function g(t) can be exactly reconstructed by a
complex weighted integration of functions of the form eı2πf t , where the weight-
ing is provided by G(f ). The essence of Fourier analysis is thus representing
functions using this particular infinite set of Fourier basis functions, eı2πf t .
It is important to note that, for a real-valued function g(t), the spectrum G(f)
will be complex. |G(f)| is called the spectral amplitude, and the angle
that G(f) makes in the complex plane, \tan^{-1}(\mathrm{imag}(G(f))/\mathrm{real}(G(f))), is called
the spectral phase.
Readers should note that in physics and geophysics applications the sign
convention chosen for the complex exponentials in the Fourier transform and its
inverse may be reversed, so that the forward transform (8.16) has a plus sign in
the exponent and the inverse transform (8.17) has a minus sign in the exponent.
This convention change simply causes a complex conjugation in the spectrum
that is reversed when the corresponding inverse transform is applied to return to
the time domain, and is thus simply an alternative spectral phase convention.
An additional convention issue arises as to whether to express frequency in
Hertz (f) or radians per second (ω = 2πf). Alternative Fourier transform formulations
using ω instead of f have scaling factors of 2π in the forward, reverse, or
both transforms.
Consider the Fourier transform of the convolution of two functions

F[m(t) * g(t)] = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} m(\tau)\, g(t - \tau)\, d\tau \right] e^{-\imath 2\pi f t}\, dt  (8.18)

which, after interchanging the order of integration, reduces to

F[m(t) * g(t)] = M(f)\, G(f) ,  (8.19)

which is the convolution theorem. The convolution theorem shows that con-
volution of two functions in the time domain has the simple effect of multiplying
their Fourier transforms in the frequency domain. The Fourier transform of the
impulse response, G(f) = F[g(t)], thus characterizes how M(f) is altered in
spectral amplitude and phase by the convolution.
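As a quick numerical check of the convolution theorem, time-domain convolution can be compared against multiplication of DFTs (a minimal MATLAB sketch; the signals and sampling interval are arbitrary illustrations):

% Compare conv() with multiplication of DFTs (zero padding avoids the
% circular wrap-around discussed later in this chapter).
dt = 0.01;                        % sampling interval (s), arbitrary
t  = (0:dt:1)';
m  = sin(2*pi*3*t);               % example "model" time series
g  = exp(-t/0.1);                 % example impulse response
N  = length(m) + length(g) - 1;   % length of the full serial convolution

d_time = conv(m, g) * dt;                       % time-domain convolution
d_freq = ifft(fft(m, N) .* fft(g, N)) * dt;     % frequency-domain product

max(abs(d_time - real(d_freq)))   % should be near machine precision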
To see the implication of the convolution theorem more explicitly, consider
the response of a linear system, characterized by the impulse response g(t) in
the time domain and the transfer function G(f) in the frequency domain, to
a single model Fourier basis function of frequency f_0, m(t) = e^{\imath 2\pi f_0 t}. The
spectrum of e^{\imath 2\pi f_0 t} can be seen to be δ(f − f_0), by examining the corresponding
inverse Fourier transform (8.17)

e^{\imath 2\pi f_0 t} = \int_{-\infty}^{\infty} \delta(f - f_0)\, e^{\imath 2\pi f t}\, df .  (8.20)

The frequency domain response of a linear system to this basis function model
is therefore (8.19)

F[e^{\imath 2\pi f_0 t}]\, G(f) = \delta(f - f_0)\, G(f_0) .  (8.21)

The corresponding time domain response is thus (8.17)

\int_{-\infty}^{\infty} G(f_0)\, \delta(f - f_0)\, e^{\imath 2\pi f t}\, df = G(f_0)\, e^{\imath 2\pi f_0 t} .  (8.22)
Linear time-invariant systems thus only map model Fourier basis functions to
data functions at the same frequency, and only alter them in spectral ampli-
tude and phase, with the alteration being characterized by the spectrum of the
system impulse response evaluated at the appropriate frequency. Of particular
interest here is the result that model basis functions at frequencies that are
weakly mapped to the data (frequencies where G(f) is small) may be difficult
or impossible to recover in an inverse problem.
The transfer function can be expressed in a particularly useful analytical
form if we can express a linear time-invariant system as a linear differential
equation
a_n \frac{d^n y}{dt^n} + a_{n-1} \frac{d^{n-1} y}{dt^{n-1}} + \cdots + a_1 \frac{dy}{dt} + a_0 y =
b_m \frac{d^m x}{dt^m} + b_{m-1} \frac{d^{m-1} x}{dt^{m-1}} + \cdots + b_1 \frac{dx}{dt} + b_0 x ,  (8.23)
where the coefficients a_i and b_i are constants. Because each of the terms in
(8.23) is linear with constant coefficients (there are no powers or other nonlinear
functions of x, y, or their derivatives), and because differentiation is itself a
linear operation, (8.23) expresses a linear time-invariant system obeying
superposition (8.3) and scaling (8.4).
If a system of the form of (8.23) operates on a model of the form m(t) =
eı2πf t , (8.22) indicates that the corresponding predicted data will be d(t) =
G(f )eı2πf t . A time derivative for such a function merely generates a multiplier
of 2πıf . Dividing the resulting equation on both sides by eı2πf t and solving for
G(f ) gives the transfer function
G(f) = \frac{D(f)}{M(f)} = \frac{\sum_{j=0}^{m} b_j (2\pi\imath f)^j}{\sum_{k=0}^{n} a_k (2\pi\imath f)^k} .  (8.24)
Example 8.1
An illustrative physical example of an ill-posed inverse system is down-
ward continuation (Figure 8.1). Consider a point mass, M, located
at the origin, which has a total gravitational field

g_t(r) = (g_x, g_y, g_z) = \frac{-M\gamma\,\hat{r}}{\|r\|^2} = \frac{-M\gamma(x\hat{x} + y\hat{y} + z\hat{z})}{\|r\|^3}  (8.25)

where γ is Newton's gravitational constant. At a general position
r = x\hat{x} + y\hat{y} + z\hat{z}, the vertical (\hat{z}) component of the gravitational
field will be

g_z = \hat{z} \cdot g_t = \frac{-M\gamma z}{\|r\|^3} .  (8.26)

We note that the integral of g_z over the xy plane is a constant with
respect to the observation height, z,

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g_z\, dx\, dy = -M\gamma z \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{dx\, dy}{\|r\|^3}  (8.27)

= -2\pi M\gamma ,  (8.28)
which is independent of the plane height z.
If we consider the vertical field at z = 0^+ from a point mass, we
thus have a delta function singularity at the origin with a magnitude
given by (8.26), as the field has no vertical component except exactly
at the origin. We can thus write the vertical field an infinitesimal
distance above the xy plane as

g_z|_{z=0^+} = -2\pi M\gamma\, \delta(x, y) .  (8.29)

The field observed at height z = h is then the two-dimensional convolution
of (8.29) with the impulse response

\frac{h}{2\pi(s_1^2 + s_2^2 + h^2)^{3/2}}\, ds_1\, ds_2 .  (8.31)
We can examine the effects of the convolution process, and address
the inverse problem of downward continuation of the vertical field
at z = h back to z = 0, by examining the transfer function cor-
responding to (8.30), which will reveal how sensitive observations
made at z = h are to the field values at z = 0 as a function of
spatial frequency, k. The Fourier transform of the impulse response
is the transfer function

G(k_x, k_y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{h\, e^{-\imath 2\pi k_x x}\, e^{-\imath 2\pi k_y y}}{2\pi(x^2 + y^2 + h^2)^{3/2}}\, dx\, dy
= e^{-2\pi h (k_x^2 + k_y^2)^{1/2}} ,  (8.32)

which is the upward continuation filter. In this case the model
and data basis functions are of the form e^{\imath 2\pi k_x x} and e^{\imath 2\pi k_y y}, where
k_{x,y} are spatial frequencies (analogous to f in temporal problems).
As z increases, (8.32) indicates that higher frequency (larger k) spa-
tial basis functions of the true seafloor gravity signal are exponen-
tially attenuated to greater and greater degrees. Such an operation
is thus referred to as a low-pass filter. Predicted observations
thus become smoother with increasing h.
(8.34) and (8.35) use the common conventions that the indexing ranges from 0
to n − 1, and also use the same complex exponential sign convention as (8.16)
and (8.17). (8.35) shows that a model time series can be expressed as a linear
combination of the n basis functions e^{2\pi\imath jk/n}, where the complex coefficients are
the discrete spectral values Mk . The DFT is also frequently referred to as the
FFT because a particularly efficient algorithm, the fast Fourier transform,
is widely exploited to evaluate (8.34) and (8.35). Figure 8.2 shows the frequency
and index mapping for an n-length DFT.

Figure 8.2: Frequency and index mapping describing the DFT of a real-valued
series (n = 16) sampled at a sampling rate f_s. For a real-valued series, the
n complex elements of the DFT exhibit even spectral amplitude symmetry
and odd spectral phase symmetry about the index n/2. Indices less than n/2
correspond to the zero and positive frequencies 0, f_s/n, 2f_s/n, ..., f_s/2 −
f_s/n. Indices greater than n/2 correspond to the negative frequencies −f_s/2 +
f_s/n, −f_s/2 + 2f_s/n, ..., −f_s/n. Index n/2 corresponds to the frequency at half
the sampling rate. For the DFT to adequately represent the spectrum of an
assumed periodic time series, f_s must be greater than or equal to the Nyquist
frequency (8.36).

The DFT spectra, M_k (8.34), are complex and discrete, with positive real
frequencies assigned to indices 1 through n/2 − 1, and negative real frequencies
assigned to indices n/2 through n. The k = 0 term is the sum of the m_j, or
n times the average series value. There is an implicit assumption in the DFT
that the underlying time series, m_j, is periodic over n terms. If m(t) is real
valued, substitution of n − k for k in (8.34) shows that the DFT has Hermitian
symmetry, where M_k = M*_{n−k}. Because of this complex conjugate symmetry
about the index n/2, the spectral amplitude, |M|, is symmetric and the spectral phase is
antisymmetric with respect to k. For this reason it is customary to only plot
the positive frequency spectral amplitude and phase for the spectrum of a real
signal.
For a sampled time series to accurately convey information about a contin-
uous function it can be shown that the continuous real-world function must be
sampled at a rate fs that is at least twice the highest frequency, fmax , at which
there is appreciable energy, or at or above the Nyquist frequency,
fs ≥ fN = 2fmax . (8.36)
Should the condition (8.36) not be met, a usually irreversible nonlinear distor-
tion called aliasing will occur, where energy at frequencies higher than f_s/2 will
appear in the spectrum to lie at frequencies below f_s/2.
The discrete convolution of two sampled time series with equal sampling
intervals ∆t = 1/f_s can be performed in two ways. The first of these is serial
fashion,

d_j = \sum_{i=0}^{n-1} m_i\, g_{j-i}\, \Delta t  (8.37)

where we assume that the shorter of the two time series is zero for all indices
greater than m. In serial convolution we get a result with at most n + m − 1
nonzero terms.
The second type of discrete convolution is circular convolution, which can be applied only
to two time series of equal length (n = m). If the time series are of different
lengths, the lengths can be equalized by padding the shorter of the two with
zeros. The result of a circular convolution is as if we had joined each time series
to its tail and convolved them in circular fashion. Applying the convolution
theorem, a circular convolution can be computed by multiplying the DFTs of the
two series and inverse transforming the result.
Because of the divide-and-conquer strategy used in the FFT algorithm, it is also desirable to pad m and g up to
lengths that are powers of two, or at least highly composite numbers.
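The following MATLAB sketch illustrates this padding strategy for computing a convolution with the FFT (series lengths and names here are arbitrary illustrations):

% Convolve two sampled series via the DFT, padding to avoid circular
% wrap-around and to reach an efficient (power of two) transform length.
dt = 0.5;                          % sampling interval, arbitrary
m  = randn(100, 1);                % example series
g  = exp(-(0:49)'/10);             % example impulse response
Nfull = length(m) + length(g) - 1; % serial convolution length
Nfft  = 2^nextpow2(Nfull);         % pad up to a power of two

D = fft(m, Nfft) .* fft(g, Nfft) * dt;   % frequency-domain product
d = real(ifft(D));
d = d(1:Nfull);                          % keep the serial-convolution part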
Consider the case where we have a theoretically known, or accurately esti-
mated, system impulse response, g(t), convolved with an unknown model, m(t).
We note in passing that, although we will examine the deconvolution problem
in 1 dimension for simplicity, the results are generalizable to higher dimensions.
The forward problem is

d(t) = \int_a^b g(t - \tau)\, m(\tau)\, d\tau .  (8.39)

This time domain representation of the forward problem was previously exam-
ined in Example 4.5.
Taking the Fourier perspective, an inverse solution can be obtained by first
padding d and g appropriately with zeros so that they are of equal and suffi-
cient length to avoid wrap-around artifacts associated with circular convolution.
Evaluating the discrete Fourier transforms of both vectors, (8.19) allows us to
cast the problem as a complex-valued linear system

D = G \cdot M\, \Delta t ,  (8.42)

where G is a diagonal matrix whose elements are the discrete spectral values of g,

G_{i,i} = G_i .  (8.43)
However, (8.44) does not avoid instability associated with deconvolution sim-
ply by transforming the problem into the frequency domain, because reciprocals
of any very small elements of G will dominate the diagonal of G^{-1}. (8.44) will
thus frequently require regularization to be useful. In the water level regularization
technique, small spectral amplitudes of G are reset to a water level w before the
spectral division, giving the regularized solution

m_w = \mathrm{DFT}^{-1}\left[ G_w^{-1} D \right] .  (8.45)
The colorful name for this technique arises from the analogy of pouring water
into the holes of the Fourier transform of g until the spectral amplitude level
there reaches w. The effect of the water level is to prevent unregularized spec-
tral division at frequencies where the spectral amplitude of the system transfer
function is small, and thus prevent undesirable noise amplification.
An optimal water level value w will reduce the sensitivity to noise in the
inverse solution while still recovering important model features. As is typical
of the regularization process, it is possible to choose a "best" solution by
assessing the tradeoff between the forward model residual and the model norm
as the regularization parameter w is varied. A useful property when evaluating
data misfit and model length in the frequency domain is that the 2-norm of a
Fourier transform vector (defined for complex vectors as the square root of the
sum of the squared complex element amplitudes) is proportional to the 2-norm
of the corresponding time-domain vector (the specific constant of proportionality
depends on the DFT conventions used). One can thus easily evaluate 2-norm
tradeoff metrics in the frequency domain without calculating inverse Fourier
transforms. The 2-norm of the water level-regularized solution, m_w, will
decrease monotonically as w increases because |G_{w,i,i}| ≥ |G_{i,i}|.
Example 8.2
Revisiting the time-domain seismometer deconvolution example, which
was regularized in Chapter 4 using SVD truncation (Example 4.5),
we investigate a frequency-domain solution regularized via the water
level technique. The impulse response, true model, and noisy data
for this example are plotted in Figures 4.3, 4.6, and 4.9, respec-
tively. We first pad the n = 210 point data and impulse response
Figure 8.3: Spectral amplitudes of the impulse response, noise-free, and noisy
data vectors for the seismometer deconvolution problem. Spectra range in fre-
quency from zero (not shown) to half of the sampling frequency (1 Hz). Because
spectral amplitudes for real-valued time series are symmetric in frequency,
spectra are shown only for f = k f_s/n > 0.
Figure 8.4: Spectral amplitudes of the Fourier transform of the noisy data
divided by the transfer function (the Fourier transform of the impulse response).
This spectrum is dominated by amplified noise at frequencies above about 0.1
Hz and is the Fourier transform amplitude of Figure 4.10.
Figure 8.3 also shows that the spectral amplitudes of the noise
dominate those of the signal at frequencies higher than ≈ 0.1 Hz. Because
of the small values of G_k at these frequencies, the spectral division
solution will result in a model that is dominated by noise, exactly
as in the time-domain solution (Figure 4.10). Figure 8.4 shows
the amplitude spectrum resulting from this spectral division.
To regularize the spectral division solution, an optimal water level is
needed. Examining Figure 8.3 it is readily observed that the value
of w that will deconvolve the portion of the data spectrum that is
unobscured by noise while suppressing the amplification of higher
frequency noise must be close to unity. Such a spectral observation
might be more difficult for real data with more complex spectra,
however. A more adaptable way to select w is to construct a tradeoff
curve between model smoothness and data misfit by trying a range
of water level values. Figure 8.5 shows the curve for this synthetic
example, showing that the optimum w is close to 3. Figure 8.6 shows
a corresponding range of solutions, and Figure 8.7 shows the solution
for w = 3.16.
The solution of Figure 8.7, chosen from the corner of the tradeoff
curve of Figure 8.5, shows the resolution reduction characteristic of
regularized solutions, manifested in reduced amplitude and increased
width relative to the true model.
Figure 8.5: Tradeoff curve between model 2–norm and data 2–norm misfit for
a range of water level values.
Figure 8.6: Models corresponding to a range of water level values and used to
construct Figure 8.5. Dashed curves show the true model (Figure 4.6).
Figure 8.7: Model corresponding to w = 3.16 (Figure 8.5; Figure 8.6). Dashed
curves show the true model (Figure 4.6).
8.5 Exercises
1. Consider regularized deconvolution as the solution to a pth order Tikhonov
regularized system

\begin{bmatrix} d \\ 0 \end{bmatrix} = \begin{bmatrix} G \\ \alpha L_p \end{bmatrix} m

which is, for example for p = 0,

\begin{bmatrix} d \\ 0 \end{bmatrix} =
\begin{bmatrix}
h^T & 0 & 0 & \cdots & 0 \\
0 & h^T & 0 & \cdots & 0 \\
0 & 0 & h^T & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & h^T \\
\alpha & 0 & 0 & \cdots & 0 \\
0 & \alpha & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \alpha
\end{bmatrix} m  (8.46)
Chapter 9

Nonlinear Regression
Using this approximation, we can obtain an approximate equation for the dif-
ference between x_0 and the unknown x*:

x^* - x_0 \approx \nabla F(x_0)^{-1} (d - F(x_0)) .  (9.7)

This leads to Newton's method.
Theorem 9.1
A simple modification to the basic Newton's method algorithm often helps
with convergence problems. In the damped Newton method, we use the New-
ton's method equations at each iteration to compute a direction in which to
move. However, instead of simply taking the full step s, we search along the
line from x_i to x_i + s for a point which minimizes ‖F(x_i + αs) − d‖₂, and take
the step which minimizes this norm.
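A minimal sketch of one damped Newton iteration for F(x) = d, assuming user-supplied routines F(x) and gradF(x) for the function and its Jacobian (illustrative names):

% One damped Newton step for solving F(x) = d (sketch).
s = gradF(x) \ (d - F(x));        % Newton direction from the linearized system
alphas = 2.^(0:-1:-10);           % simple backtracking line search candidates
best = inf;
for a = alphas
    misfit = norm(F(x + a*s) - d);
    if misfit < best
        best = misfit;  xnew = x + a*s;
    end
end
x = xnew;                         % accept the step with the smallest misfit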
Now suppose that we have a scalar valued function f(x), and want to min-
imize f. If we assume that f is twice continuously differentiable, then we have
the Taylor series approximation

f(x_0 + s) \approx f(x_0) + \nabla f(x_0)^T s + \frac{1}{2} s^T \nabla^2 f(x_0)\, s  (9.8)

where \nabla f(x_0) is the gradient of f,

\nabla f(x_0) = \begin{bmatrix} \frac{\partial f(x_0)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x_0)}{\partial x_m} \end{bmatrix} .  (9.9)
The theoretical properties of Newton’s method for minimizing f (x) are sum-
marized in the following theorem. Since Newton’s method for minimizing f (x)
is just Newton’s method for solving a nonlinear system of equations applied to
∇f (x) = 0, the proof follows immediately from the proof of theorem 9.1.
Theorem 9.2
If f is twice continuously differentiable in a neighborhood of a local
minimizer x*, there is a constant λ such that ‖∇²f(x) − ∇²f(y)‖₂ ≤
λ‖x − y‖₂ in the neighborhood, ∇²f(x*) is positive definite, and x_0
is close enough to x*, then Newton's method will converge quadrat-
ically to x*.
In this equation, the product of ∇f_i(m) and F(m) is the element-wise product
of the vectors. This formula can be simplified by using matrix notation to

\nabla^2 f(m) = \sum_{i=1}^{N} H_i(m) .  (9.22)

Here H_i(m) is the Hessian of f_i(m)². The j, k element of this Hessian matrix
is given by

H^i_{j,k}(m) = \frac{\partial^2 (f_i(m)^2)}{\partial m_j\, \partial m_k} .  (9.23)

H^i_{j,k}(m) = \frac{\partial}{\partial m_j}\left( 2 f_i(m) \frac{\partial f_i(m)}{\partial m_k} \right) .  (9.24)

H^i_{j,k}(m) = 2\left( \frac{\partial f_i(m)}{\partial m_j} \frac{\partial f_i(m)}{\partial m_k} + f_i(m) \frac{\partial^2 f_i(m)}{\partial m_j\, \partial m_k} \right) .  (9.25)

Thus

\nabla^2 f(m) = 2 J(m)^T J(m) + Q(m)  (9.26)

where

Q(m) = 2 \sum_{i=1}^{N} f_i(m)\, \nabla^2 f_i(m) .  (9.27)
In the context of nonlinear regression, we typically expect that the fi (m) terms
will be reasonably small as we approach the optimal parameters m∗ , so that this
is a reasonable approximation. Specialized methods are available for problems
with large residuals in which this approximation is not justified.
Using our approximation, we obtain the Gauss–Newton method equations. In the Levenberg–Marquardt (L–M) method, a damping term λI is added to the approximate Hessian in these equations.
Here the parameter λ is adjusted during the course of the algorithm to ensure
convergence. One important reason for using a nonzero value of λ is that the
λI term ensures that the matrix is nonsingular. For very large values of λ, we
get

m^{k+1} - m^k = -\frac{1}{\lambda} \nabla f(m) .  (9.32)

This is a steepest descent step. The algorithm simply moves downhill. The
steepest descent step provides very slow, but certain convergence. For very
small values of λ, we get the Gauss–Newton direction, which gives fast but
uncertain convergence.
The hard part of the Levenberg–Marquardt method is determining the right
value of λ. The general idea is to use small values of λ in situations where the
Gauss–Newton method is working well, but to switch to larger values of λ when
the Gauss–Newton method is not making progress. A very simple approach is
to start with a small value of λ, and then adjust it in every iteration. If the L–M
step leads to a reduction in f(m), then decrease λ by a constant factor (say 2).
If the L–M step does not lead to a reduction in f(m), then do not take the step.
Instead, increase λ by a constant factor (say 2), and try again. Repeat this
process until a step is found which actually does decrease the value of f(m).
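A minimal sketch of this λ adjustment strategy, assuming routines fval(m), jac(m), and res(m) for the objective value, Jacobian, and residual vector (illustrative names, not a robust implementation):

% Simple Levenberg-Marquardt loop (sketch).
lambda = 1.0e-3;
for iter = 1:100
    J = jac(m);  r = res(m);
    step = -(J'*J + lambda*eye(length(m))) \ (J'*r);   % L-M step
    if fval(m + step) < fval(m)
        m = m + step;           % accept the step, trust the model more
        lambda = lambda / 2;
    else
        lambda = lambda * 2;    % reject the step, move toward steepest descent
    end
    if norm(step) < 1.0e-8, break; end
end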
Robust implementations of the L–M method use sophisticated strategies for
adjusting the parameter λ. In practice, a careful implementation of the L–M
method typically has the good performance of the Gauss–Newton method as
well as very good convergence properties. In general, the L–M method is the
method of choice for small to medium sized nonlinear least squares problems.
Note that the λI term in the L–M method looks a lot like Tikhonov reg-
ularization. It is important to understand that this is not actually a case of
Tikhonov regularization, since the λI is only used as a way to improve the con-
vergence of the algorithm, and does not enter into the objective function being
minimized. We will discuss regularization for nonlinear problems in chapter 10.
\Delta d = J(m^*)\, \Delta m

Using this linear approximation, our nonlinear least squares problem becomes a
linear least squares problem. Thus the matrix J(m*) takes the place of G in an
approximate estimate of the covariance of the model parameters. Since we have
incorporated the σ_i into the formula for f, the Cov(d) matrix is the identity
matrix, and we obtain the formula

\mathrm{Cov}(m^*) \approx \left( J(m^*)^T J(m^*) \right)^{-1} .
When the measurement standard deviations are unknown but assumed to be equal, we set the σ_i in (9.1) to 1,
and minimize the sum of squared errors. Define a vector of residuals by

r_i = G(m^*, x_i) - d_i \qquad i = 1, 2, \ldots, N

and let \bar{r} be the mean of the residuals. Our estimate of the measurement
standard deviation is then given by

s = \sqrt{ \frac{\sum_{i=1}^{N} (r_i - \bar{r})^2}{N - n} } .  (9.37)
Once we have m∗ and Cov(m∗ ), we can establish confidence intervals for the
model parameters exactly as we did in Chapter 2. Just as with linear regression,
it is also important to examine the residuals for systematic patterns or deviations
from normality. If we have not estimated the measurement standard deviation
s, then it is also important to test the χ2 value for goodness of fit.
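The following sketch illustrates these covariance and confidence interval calculations, assuming a converged solution mstar and a routine jac(m) returning the Jacobian of the scaled residuals (the 1.96 factor gives approximate 95% intervals):

% Approximate covariance and 95% confidence intervals for the fitted parameters.
J    = jac(mstar);                 % Jacobian at the solution
covm = inv(J' * J);                % approximate Cov(m*), sigma_i already in f
sig  = sqrt(diag(covm));           % parameter standard deviations
ci   = [mstar - 1.96*sig, mstar + 1.96*sig];   % approximate 95% intervals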
The G–N and L–M methods are designed to converge to a local minimum, but de-
pending on the point at which we begin the search, there is no way to be certain
that we will converge to the global minimum.
Global optimization methods have been developed to deal with this prob-
lem. One simple global optimization procedure is called multistart. In this
procedure, we randomly generate a large number of initial solutions, and per-
form the L–M method starting with each of the random solutions. We then
examine the local minimum solutions found by the procedure, and select the
one with the smallest value of f(m).
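A minimal multistart sketch, assuming a routine lm(m0) that runs the L–M method from a starting model m0 and returns the local solution and its objective value, and a parameter count nparams (illustrative names):

% Multistart: run L-M from many random starting models, keep the best result.
nstarts = 100;
fbest = inf;
for i = 1:nstarts
    m0 = -2 + 4*rand(nparams, 1);        % random start, each parameter in [-2, 2]
    [mloc, floc] = lm(m0);               % local L-M solution and objective value
    if floc < fbest
        fbest = floc;  mbest = mloc;
    end
end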
9.5 An Example
Example 9.1
>> mtrue

mtrue =

    1.0000
   -0.5000
    2.0000
   -1.0000

Locally optimal solutions found from 100 random starting points:

Number   m1        m2         m3         m4          χ2         p-value
1        0.8380    -0.4616     2.2461     -1.0214      9.5957    0.73
2        3.3135    -0.6132    -7.5628     -2.8005     10.7563    0.63
3        1.8864     0.2017    -0.9294      0.0151     37.4439    < 1 × 10^{-3}
4        2.3725    -0.5059    -2.7754    -66.0067    271.5124    < 1 × 10^{-3}
>> [x y]
ans =
1.0000 1.3308
1.2500 1.2634
1.5000 1.1536
1.7500 1.0247
2.0000 0.9125
2.2500 0.8007
2.5000 0.6951
2.7500 0.6117
3.0000 0.5160
3.2500 0.4708
3.5000 0.3838
3.7500 0.3309
4.0000 0.2925
4.2500 0.2413
4.5000 0.2044
4.7500 0.1669
5.0000 0.1524
We next used the L–M method to solve the problem a total of 100
times, using random starting points with each parameter uniformly
distributed between -2 and 2. This produced a number of different
locally optimal solutions, which are shown in the table above. Since solution
number 1 has the best χ2 value, we will analyze it.
>> J=jac(mstar);
>> covm=inv(J’*J)
covm =
>> %
>> % Now, compute the confidence intervals.
>> %
ans =
0.4073 1.2687
-0.6077 -0.3156
1.5209 2.9713
-1.1206 -0.9221
Notice that the true parameters (1, -.5, 2, and -1) are all covered
by these confidence intervals. However, there is quite a bit of un-
certainty in the model parameters. This is an example of a poorly
conditioned nonlinear regression problem in which the data do not
constrain the possible parameter values very well.
The correlation matrix provides some insight into the nature of the
ill conditioning. For our solution, the correlation matrix is
>> corm
corm =
9.6 Exercises
1. A recording instrument sampling at 50 Hz is run with a high precision
sine wave generator connected to its input. The signal generator produces
a 39.98 s signal of the form
x(t) = A sin(2πfa t + φb )
and the recorded signal consists of instrumental noise, n(t), plus a mean
offset
y(t) = B sin(2πfb t + φb ) + c + n(t) .
Using the data in instdata.mat, and a starting model: fb = 0.5, φb = 0,
c = x̄, and B = (max(x) − min(x))/2, apply Newton’s method to solve for
the unknown parameters (B, fb , φb , c).
The arrival times of P-waves from the earthquakes are carefully measured
at the stations, with an estimated error of approximately 1 ms. The arrival
time estimates for each earthquake at each station (in seconds, with the
nearest clock second subtracted) are
(a) Apply the L–M method to this data set to estimate minimum 2-norm
misfit locations of the earthquakes.
(b) Estimate the errors in x, y, z (in meters) and origin time (in s) for each
earthquake using the diagonal elements of the appropriate covariance
matrix. Do the earthquakes form any sort of discernible trend?
(a) Use the arrival times at stations 1, 2, 4, 6, 7, 8, 10, and 13 to find the
time and location of the radio frequency source in the lightning flash.
Assume the radio waves travel in a straight line at the speed of light,
2.99731435 × 10^8 m/s.
Chapter 10

Nonlinear Inverse Problems
Here n is the number of data points and m is the number of model parameters.
Under this linear approximation, the damped least squares problem (10.3)
becomes the linearized problem (10.8), in which J(m) and d̂(m) are constant. This
problem is a damped linear least squares problem which we learned how to solve
when we studied Tikhonov regularization. The solution is given by

m + \Delta m = \left( J(m)^T J(m) + \alpha^2 L^T L \right)^{-1} J(m)^T \hat{d}(m) .  (10.10)
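A minimal sketch of iterating this update, assuming routines fwd(m) and jac(m) for the forward model and Jacobian, a roughening matrix L, data d, a fixed α, and taking d̂(m) to be the shifted data vector d − G(m) + J(m)m of the linearized problem (its definition precedes this excerpt; in practice α is usually readjusted at each iteration to match a target misfit):

% Iteratively apply the damped least squares update (10.10) (sketch).
for iter = 1:20
    J    = jac(m);
    dhat = d - fwd(m) + J*m;      % shifted data for the linearized subproblem
    mnew = (J'*J + alpha^2*(L'*L)) \ (J'*dhat);
    if norm(mnew - m) < 1.0e-6 * (1 + norm(m)), m = mnew; break; end
    m = mnew;
end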
Example 10.1
In this example we will consider the problem of estimating subsurface
electrical conductivities from above ground EM induction measure-
ments. The instrument used in this example is the Geonics EM–38
ground conductivity meter. A description of the instrument and the
mathematical model of the response of the instrument can be found
in [HBR+ 02]. The forward model is quite complicated, but we will
treat it as a black box model, and instead concentrate on the inverse
problem.
Measurements are taken at heights of 0, 10, 20, 30, 40, 50, 75, 100,
and 150 centimeters above the ground, with the coils oriented in
[Figures: recovered conductivity (mS/m) versus depth (m) models for Example 10.1.]
10.3 Exercises
1. Recall example 1.3, in which we had gravity anomaly observations above
a density perturbation of variable depth m(x), and fixed density ∆ρ. In
this exercise, you will use Occam’s inversion to solve an instance of this
inverse problem.
Consider a gravity perturbation along a one kilometer section, with ob-
servations taken every 50 meters, and a density perturbation of 200 kg/m³
(0.2 g/cm³). The perturbation is expected to be at a depth of roughly
200 meters.
The data file gravprob.mat contains a vector x of observation locations.
Use the same coordinates for your discretization of the model. The vector
obs contains the actual observations. Assume that the observations are
accurate to about 1.0 × 10−12 .
Chapter 11

Bayesian Methods

We then used Cov(m) to compute confidence intervals for the estimated param-
eters.
This approach worked quite well for linear regression problems in which the
least squares problem is well conditioned. However, we found that in many
cases the least squares problem is not well conditioned. The set of solutions
that adequately fits the data is huge, and contains many solutions which are
completely unreasonable.
We discussed a number of approaches to regularizing the least squares prob-
lem. All of these approaches pick one “best” solution out of the set of solutions
that adequately fit the data. The different regularization methods differ in what
constitutes the best solution. For example, Tikhonov regularization picks the
model that minimizes ‖m‖₂ subject to the constraint ‖Gm − d‖₂ < δ, while
higher order Tikhonov regularization selects the model that minimizes ‖Lm‖₂
subject to ‖Gm − d‖₂ < δ.
The regularization approach can be applied to both linear and nonlinear
problems. For relatively small linear problems the computation of the regu-
larized solution is generally done with the help of the SVD. This process is
straightforward and extremely robust. For large sparse linear problems iter-
ative methods such as CGLS can be used. For nonlinear problems things are
more complicated. We saw that the Gauss–Newton and Levenberg–Marquardt
methods could be used to find a local minimum of the least squares problem.
However, these nonlinear least squares problems often have a large number of
local minima. Finding the global minimum is challenging. We dis-
cussed the multistart approach in which we start the L–M method from a
number of random starting points. Other global optimization techniques for
solving the nonlinear least squares problem include simulated annealing and
genetic algorithms.
How can we justify selecting one solution from the set of models which
adequately fit the data? One justification is Occam’s razor. Occam’s razor is
the principle that when we have several different theories to consider, we should
select the simplest theory. The solutions selected by regularization are in some
sense the simplest models which fit the data. Any feature seen in the regularized
solution must be required by the data. If fitting the data did not require a feature
seen in the regularized solution, then that feature would have been smoothed
out by the regularization term. This answer is not entirely satisfactory.
It is also worth recalling that once we have regularized a least squares prob-
lem, we lose the ability to obtain statistically valid confidence intervals for the
parameters. The problem is that by regularizing the problem we bias the so-
lution. In particular, this means that the expected value of the regularized
solution is not the true solution.
is a specific (but unknown) model m_true that we would like to discover. In the
Bayesian approach, the model is a random variable. Our goal is to compute a
probability distribution for the model. Once we have a probability distribution
for the model, we can use the distribution to answer questions about the model
such as "What is the probability that m_5 is less than 0?" An important advan-
tage of the Bayesian approach is that it explicitly addresses the uncertainty in
the model, while the classical approach provides only a single solution.
A second very important difference between the classical and Bayesian ap-
proaches is that in the Bayesian approach we can incorporate additional in-
formation about the solution which comes from other data sets or our own
intuition. This prior information is expressed in the form of a probability distri-
bution for m. This prior distribution may incorporate the user’s subjective
judgment, or it may incorporate information from other experiments. If no
other information is available, then under the principle of indifference, we
may pick a prior distribution in which each possible model parameter has equal
likelihood. Such a prior distribution is said to be uninformative.
One of the main objections to the Bayesian approach is that the method is
“unscientific” because it allows the analyst to incorporate subjective judgments
into the model which are not based on the data alone. Bayesians reply that
the analyst is free to pick an uninformative prior distribution. Furthermore, it
is possible to complete the Bayesian analysis with a variety of prior distribu-
tions and examine the effects of different prior distributions on the posterior
distribution.
It should be pointed out that if the parameters m are contained in the range
(−∞, ∞), then the uninformative prior is not a proper probability distribution.
The problem is that there does not exist a probability distribution p(m) such
that

\int_{-\infty}^{\infty} p(m)\, dm = 1  (11.2)
and p(m) is constant. In practice, the use of this improper prior distribution can
be justified, because the posterior distribution for m is a proper distribution.
We will use the notation p(m) for the prior distribution. We also assume
that using the forward model we can compute the probability that, given a
particular model, a particular data value will be observed. This is a conditional
probability distribution. We will use the notation f(d|m) for this conditional
probability distribution. Of course we know the data and are attempting to
estimate the model. Thus we are interested in the conditional distribution of
the model parameter(s) given the data. We will use the notation q(m|d) for
this posterior probability distribution.
Bayes’ theorem relates these distributions in a way that makes it possible
for us to compute what we want. Recall Bayes’ theorem.
Theorem 11.1
q(m|d) = \frac{ f(d|m)\, p(m) }{ \int_{\text{all models}} f(d|m)\, p(m)\, dm } .  (11.3)
Notice that the denominator in this formula is simply a normalizing constant
which is used to ensure that the integral of the conditional distribution q is one.
Since the normalization constant is not always needed, this formula is sometimes
written as

q(m|d) \propto f(d|m)\, p(m) .  (11.4)
The posterior distribution q(m|d) does not provide a single model that we
can consider the “answer.” In cases where we want to single out one model as the
answer, it is appropriate to use the model with the largest value of q(m|d). This
is the so called maximum a posteriori (MAP) model. An alternative would be
to use the mean of the posterior distribution. In situations where the posterior
distribution is symmetric, the MAP model and the posterior mean model are
the same.
In general, the computation of a posterior distribution can be a complicated
process. The difficulty is in evaluating the integrals in (11.3). These are often
integrals in very high dimensions, for which numerical integration techniques are
computationally expensive. One important approach to such problems is the
Markov Chain Monte Carlo Method (MCMC) [GRS96]. Fortunately, there are
a number of special cases in which the computation of the posterior distribution
is greatly simplified.
One simplification occurs when the prior distribution p(m) is constant. In
this case, we have
q(m|d) ∝ f (d|m). (11.5)
In this context, the conditional distribution of d given m is known as the
likelihood of m (written L(m)). Under the maximum likelihood principle we
select the model m_ML which maximizes L(m). This is exactly the MAP model.
A further simplification occurs when the noise in the measured data is in-
dependent and normally distributed with standard deviation σ. Because the
measurement errors are independent, we can write the likelihood function as
the product of the likelihoods of each individual data point.
Since the individual data points d_i are normally distributed with mean (G(m))_i
and standard deviation σ, we can write f(d_i|m) as

f(d_i|m) = e^{ -\frac{ (d_i - (G(m))_i)^2 }{ 2\sigma^2 } } .  (11.7)

Maximizing the product of these likelihoods is equivalent to minimizing the sum
of the exponents. Except for the constant factor of 1/(2σ²), this is precisely the
least squares problem min ‖G(m) − d‖₂². Thus we have shown that the Bayesian
approach leads to the least squares solution when we have independent and
normally distributed measurement errors and we use a flat prior distribution.
Example 11.1
Consider the following very simple inverse problem. We have an
object of unknown mass m. We have a scale that can be used to
weigh the object. However, the measurement errors are normally
distributed with mean 0 and standard deviation σ = 1 kg. With
this error model, we have

f(d|m) = \frac{1}{\sqrt{2\pi}} e^{-(d-m)^2/2} .  (11.10)

With a flat prior distribution and an observed measurement of d = 10.3 kg,
the posterior distribution is

q(m|d = 10.3\ \mathrm{kg}) \propto \frac{1}{\sqrt{2\pi}} e^{-(10.3-m)^2/2} .  (11.12)

In fact, if we integrate the right hand side of this last equation from
m = −∞ to m = +∞, we find the integral is one, so the constant
of proportionality is one, and

q(m|d = 10.3\ \mathrm{kg}) = \frac{1}{\sqrt{2\pi}} e^{-(10.3-m)^2/2} .  (11.13)

This posterior distribution is shown in Figure 11.1.
Next, suppose that we obtain a second measurement of 10.1 kg.
Now, we use the distribution (11.13) as our prior distribution and
compute a new posterior distribution.

q(m|d_1 = 10.3\ \mathrm{kg}, d_2 = 10.1\ \mathrm{kg}) \propto f(d_2 = 10.1\ \mathrm{kg}|m)\, q(m|d_1 = 10.3\ \mathrm{kg})  (11.14)

q(m|d_1 = 10.3\ \mathrm{kg}, d_2 = 10.1\ \mathrm{kg}) \propto \frac{1}{\sqrt{2\pi}} e^{-(10.1-m)^2/2}\, \frac{1}{\sqrt{2\pi}} e^{-(10.3-m)^2/2} .  (11.15)

We can multiply the exponentials by adding exponents. We can also
absorb the factors of 1/\sqrt{2\pi} into the constant of proportionality:

q(m|d_1 = 10.3\ \mathrm{kg}, d_2 = 10.1\ \mathrm{kg}) \propto e^{-((10.3-m)^2 + (10.1-m)^2)/2} .  (11.16)
Completing the square in the exponent gives

q(m|d_1 = 10.3\ \mathrm{kg}, d_2 = 10.1\ \mathrm{kg}) \propto e^{-(2(10.2-m)^2 + 0.02)/2} .  (11.18)

The constant factor e^{-0.02/2} can also be absorbed into the constant
of proportionality. We are left with

q(m|d_1 = 10.3\ \mathrm{kg}, d_2 = 10.1\ \mathrm{kg}) \propto e^{-(10.2-m)^2} .  (11.19)

After normalization, this can be written as

q(m|d_1 = 10.3\ \mathrm{kg}, d_2 = 10.1\ \mathrm{kg}) = \frac{1}{(1/\sqrt{2})\sqrt{2\pi}} e^{-\frac{(10.2-m)^2}{2(1/\sqrt{2})^2}} .  (11.20)
Figure 11.2: Posterior distribution q(m|d1 = 10.3 kg, d2 = 10.1 kg), flat prior.
Thus

q(m|d) \propto e^{ -\frac{1}{2} \left( (Gm - d_{obs})^T C_D^{-1} (Gm - d_{obs}) + (m - m_{prior})^T C_M^{-1} (m - m_{prior}) \right) } .  (11.23)

C_{M'} = \left( G^T C_D^{-1} G + C_M^{-1} \right)^{-1} .  (11.25)

\min\ \| C_D^{-1/2} (Gm - d_{obs}) \|_2^2 + \| C_M^{-1/2} (m - m_{prior}) \|_2^2 .  (11.28)

\min\ \left\| \begin{bmatrix} C_D^{-1/2} G \\ C_M^{-1/2} \end{bmatrix} m - \begin{bmatrix} C_D^{-1/2} d_{obs} \\ C_M^{-1/2} m_{prior} \end{bmatrix} \right\|_2^2 .  (11.29)
For convenience, let us assume that the data are independent with variance one.
Then

C_{M'} \approx (G^T I\, G)^{-1} .  (11.31)

To generate random models distributed according to the posterior, we compute
the Cholesky factorization

C_{M'} = R^T R .  (11.34)
This can be done easily using the chol command in MATLAB. We then generate
a vector s of normally distributed random numbers with mean zero and standard
deviation one. This can be done using the randn command in MATLAB. The
covariance matrix of s is the identity matrix. Finally, we generate our random
solution with
m = R^T s + m_{MAP} .  (11.35)

The expected value of m is then E[m] = R^T E[s] + m_{MAP} = m_{MAP}, since
m_{MAP} is a constant and E[s] = 0. From Appendix A, we know how to find the
covariance of a matrix times a random vector:

\mathrm{Cov}(m) = R^T\, \mathrm{Cov}(s)\, R .  (11.41)

Since Cov(s) = I,

\mathrm{Cov}(m) = R^T R  (11.43)

\mathrm{Cov}(m) = C_{M'} .  (11.44)
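A minimal MATLAB sketch of this sampling procedure, assuming the MAP model mMAP and the posterior covariance matrix CMprime have already been computed (illustrative names):

% Generate a random model drawn from the posterior distribution (sketch).
R = chol(CMprime);          % Cholesky factor, CMprime = R'*R
s = randn(length(mMAP), 1); % independent standard normal random vector
m = R' * s + mMAP;          % random model with mean mMAP and covariance CMprime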
[Figure: MAP solution and target model.]
Example 11.2

11.4 Maximum Entropy Methods
Definition 11.1

The entropy of a discrete random variable X with probability distribution

P(X = x_i) = p_i

is given by

H(X) = -\sum_i p_i \ln p_i .
Example 11.3
Suppose that we know that X takes on only nonnegative values, and
that the mean value of X is µ. It can be shown using the calculus of
variations that the maximum entropy distribution is an exponential
distribution [KK92]. The maximum entropy distribution is
f_X(x) = \frac{1}{\mu} e^{-x/\mu} \qquad x \ge 0 .
Definition 11.2
Given a discrete probability distribution
P (X = xi ) = pi
and an alternative distribution
P (X = xi ) = qi ,
the Kullback–Leibler cross entropy is given by

D(p, q) = \sum_i p_i \ln \frac{p_i}{q_i} .

Given continuous distributions f(x) and g(x), the cross entropy of
the distributions is

D(f, g) = \int_{-\infty}^{\infty} f(x) \ln \frac{f(x)}{g(x)}\, dx .
The cross entropy is a measure of how close two distributions are. Notice that if
the two distributions are identical, then the cross entropy is 0. It can be shown
that in all other cases, the cross entropy is greater than zero.
Under the minimum cross entropy principle, if we are given a prior distri-
bution q and some additional constraints on the distribution, we should select
a posterior distribution p which minimizes the cross entropy of p and q subject
to the constraints.
For example, in the Minimum Relative Entropy (MRE) method of
Woodbury and Ulrych [UBL90, WU96], the minimum cross entropy principle
is applied to linear inverse problems of the form Gm = d subject to lower
and upper bounds on the model elements. First, a maximum entropy prior
distribution is computed using the lower and upper bounds and a given prior
mean. Then, a posterior distribution is selected to minimize the cross entropy
subject to the constraint that the posterior mean distribution must satisfy the
equations Gm = d.
11.5 Exercises
1. Consider the following coin tossing experiment. We repeatedly toss a coin,
and each time record whether it comes up heads (0), or tails (1). The bias
b of the coin is the probability that it comes up heads. We do not have
reason to believe that this is a fair coin, so we will not assume that b = 1/2.
Instead, we will begin with a flat prior distribution p(b) = 1, for 0 ≤ b ≤ 1.
(a) What is f (d|b)? Note that the only possible data are 0 and 1, so this
distribution will involve delta functions at d = 0, and d = 1.
(b) Suppose that on our first flip, the coin comes up heads. Compute
the posterior distribution q(b|d1 = 0).
(c) The second, third, fourth, and fifth flips are 1, 1, 1, and 1. Find
the posterior distribution q(b|d_1 = 0, d_2 = 1, d_3 = 1, d_4 = 1, d_5 = 1).
Plot the posterior distribution.
(d) What is your MAP estimate of the bias?
(e) Now, suppose that you initially felt that the coin was at least close
to fair, with
p(b) \propto e^{-10(b-0.5)^2} \qquad 0 \le b \le 1
Repeat the analysis of the five coin flips.
Appendix A

Review of Linear Algebra

The topics discussed in this appendix have been selected because they are log-
ically necessary to the development of the rest of the textbook. This is by no
means a complete review of linear algebra; a complete treatment would take an
entire textbook. Instead, the purpose of this appendix is to summarize some
important concepts, definitions, and theorems of linear algebra that will be used
throughout the book.
Example A.1
Consider the system of equations:
x +2y +3z = 14
x +2y +2z = 11 .
x +3y +4z = 19
Next, we can eliminate y from the first equation by subtracting two
times the second equation from the first equation:

x + z = 4
y + z = 5
−z = −3 .

Continuing the elimination gives the solution

x = 1
y = 2
z = 3 .
This elimination procedure can be formalized using the reduced row echelon
form (RREF) of the augmented matrix, generalizing the process used in the
previous example. In the example, the final version of the augmented matrix is

\begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 3 \end{bmatrix} .
Definition A.1
Example A.2
x +y = 1
x +y = 2
Example A.3
In this example, we will consider a system of two equations in three
variables which has many solutions. Our system of equations is:
x1 +x2 +x3 = 0
. (A.1)
x1 +2x2 +2x3 = 0
We put this system of equations into augmented matrix form and
then find the RREF
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix} .
We can translate this back into equation form as
x = 0
.
y +z = 0
Clearly, x must be 0 in any solution to the system of equations.
However, y and z are not fixed. We can treat z as a free variable
and allow it to take on any value. However, whatever value z takes
on, y must be equal to −z. Geometrically, this system of equations
describes the intersection of two planes. The intersection of the two
planes consists of the points on the line y = −z in the x = 0 plane.
Example A.4
Recall the system of equations (A.1)
x1 +x2 +x3 = 0
x1 +2x2 +2x3 = 0
from Example A.3. We can write this in vector form as

x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 1 \\ 2 \end{bmatrix} + x_3 \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} .
The expression on the left hand side of the equation in which scalars are
multiplied by vectors and then added together is called a linear combination.
Although this form is not convenient for solving the system of equations, it can
be useful in setting up a system of equations.
Example A.5
Let

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}

and

x = \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix} .

Then

A x = 1 \begin{bmatrix} 1 \\ 4 \end{bmatrix} + 0 \begin{bmatrix} 2 \\ 5 \end{bmatrix} + 2 \begin{bmatrix} 3 \\ 6 \end{bmatrix} = \begin{bmatrix} 7 \\ 16 \end{bmatrix} .
Notice that the formula for Ax involves a linear combination much like the
one that occurred in the vector form of a system of equations. It is possible
to take any linear system of equations and rewrite the system of equations as
Ax = b, where A is a matrix containing the coefficients of the variables in the
equations, b is a vector containing the coefficients on the right hand sides of the
equations, and x is a vector containing the variables.
Example A.6
The system of equations
x1 +x2 +x3 = 0
x1 +2x2 +2x3 = 0
Note that the product is only possible if the two matrices are of compatible
sizes. In general, if A has m rows and n columns, and B has n rows and r
columns, then the product AB exists and is of size m by r. In some cases, it is
possible to multiply AB but not BA. Also, it turns out that even when both
AB and BA exist, AB is not always equal to BA!
There is an alternate way to compute the matrix–matrix product. In the
row–column expansion method, we obtain the entry in row i and column j
of C by computing the matrix product of row i of A and column j of B.
Example A.7
Let

A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}

and

B = \begin{bmatrix} 5 & 2 \\ 3 & 7 \end{bmatrix}

and let C = AB. The matrix C will be of size 3 by 2. We compute
the product using both of the methods. First, using the matrix–
vector approach:

C = [ A B_1 \quad A B_2 ]

C = \begin{bmatrix} 5 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + 3 \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} & 2 \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + 7 \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} \end{bmatrix}

C = \begin{bmatrix} 11 & 16 \\ 27 & 34 \\ 43 & 52 \end{bmatrix} .

Next, we use the row–column approach

C = \begin{bmatrix} 1 \times 5 + 2 \times 3 & 1 \times 2 + 2 \times 7 \\ 3 \times 5 + 4 \times 3 & 3 \times 2 + 4 \times 7 \\ 5 \times 5 + 6 \times 3 & 5 \times 2 + 6 \times 7 \end{bmatrix}

C = \begin{bmatrix} 11 & 16 \\ 27 & 34 \\ 43 & 52 \end{bmatrix} .
The n by n identity matrix I_n consists of 1's on the diagonal and 0's on the
off diagonal. For example, the 3 by 3 identity matrix is

I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} .
We often write I without specifying the size of the matrix in situations where
the size of matrix is obvious from context. You can easily show that if A is an
m by n matrix, then
AIn = A
and
Im A = A.
Thus multiplying by I in matrix algebra is similar to multiplying by 1 in con-
ventional scalar algebra.
We have not defined division of matrices, but it is possible to define the
matrix algebra equivalent of the reciprocal.
Definition A.3
AB = BA = I, (A.3)
Since the columns of the identity matrix are known, and A is known, we can
solve

A B_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
to obtain B1 . In the same way, we can find the remaining columns of the inverse.
If any of these systems of equations are inconsistent, then A−1 does not exist.
Example A.8
Let

A = \begin{bmatrix} 2 & 1 \\ 5 & 2 \end{bmatrix} .

To find the first column of A^{-1}, we solve A B_1 = [1; 0]. The RREF of the
corresponding augmented matrix is

\begin{bmatrix} 1 & 0 & -2 \\ 0 & 1 & 5 \end{bmatrix} .

Thus the first column of A^{-1} is

\begin{bmatrix} -2 \\ 5 \end{bmatrix} .

Thus

A^{-1} = \begin{bmatrix} -2 & 1 \\ 5 & -2 \end{bmatrix} .
The inverse matrix can be used to solve a system of linear equations with n
equations and n variables. Given the system of equations Ax = b, and A−1 ,
we can calculate
Ax = b
−1
A Ax = A−1 b
Ix = A−1 b
x = A−1 b.
This argument shows that if A−1 exists, then for any right hand side b, the
system of equations Ax = b has a unique solution. If A−1 does not exist, then
the system Ax = b may either have many solutions or no solution.
Example A.9
Consider the system of equations Ax = b, where

A = \begin{bmatrix} 2 & 1 \\ 5 & 2 \end{bmatrix}

and

b = \begin{bmatrix} 3 \\ 7 \end{bmatrix} .
Definition A.4
Definition A.5
Example A.10
Let

A = \begin{bmatrix} 2 & 1 \\ 5 & 2 \end{bmatrix} .

Then

A^T = \begin{bmatrix} 2 & 5 \\ 1 & 2 \end{bmatrix} .
Definition A.6
A matrix is symmetric if A = AT .
Definition A.7
We will also work with nonsquare matrices which are upper triangular.
Definition A.8
An m by n matrix R is upper triangular if Ri,j = 0 whenever
i > j. A matrix L is lower triangular if LT is upper triangular.
Example A.11
The matrix

S = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \end{bmatrix}

is diagonal, and the matrix

R = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 2 & 4 \\ 0 & 0 & 5 \\ 0 & 0 & 0 \end{bmatrix}

is upper triangular.
Theorem A.1
The following statements are true for any scalars s and t and any
matrices A, B, and C. It is assumed that the matrices are of the
appropriate size for the operations involved and that whenever an
inverse occurs, the matrix is invertible.
1. A + 0 = 0 + A = A.
2. A + B = B + A.
3. (A + B) + C = A + (B + C).
4. A(BC) = (AB)C.
5. A(B + C) = AB + AC.
6. (A + B)C = AC + BC.
7. (st)A = s(tA).
8. s(AB) = (sA)B = A(sB).
9. (s + t)A = sA + tA.
10. s(A + B) = sA + sB.
11. (AT )T = A.
12. (sA)T = s(AT ).
The first ten rules in this list are identical to rules of conventional alge-
bra, and you should have little trouble in applying them. The rules involving
transposes and inverses are new, but they can be mastered without too much
trouble.
Many students have difficulty with the following statements, which would
appear to be true on the surface, but which are in fact false for at least some
matrices.
1. AB = BA.
2. If AB = 0, then A = 0 or B = 0.
3. If AB = AC and A 6= 0, then B = C.
Definition A.9
Example A.12
Let

A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} .

Are the columns of A linearly independent vectors? To determine
this we set up the system of equations Ax = 0 in an augmented
matrix, and then find the RREF

\begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} .
A.4 Subspaces of Rn
So far, we have worked with vectors of real numbers in the n dimensional space
Rn . There are a number of properties of Rn that make it convenient to work
with vectors. First, the operation of vector addition always works— we can take
any two vectors in Rn and add them together and get another vector in Rn .
Second, we can multiply any vector in Rn by a scalar and obtain another vector
in Rn . Finally, we have the 0 vector, with the property that for any vector x,
x + 0 = 0 + x = x.
Definition A.10
Example A.13
x1 + x2 + x3 = 0
Definition A.11
3. A0 = 0, so 0 is in N (A).
Example A.14
Let

A = \begin{bmatrix} 3 & 1 & 9 & 4 \\ 2 & 1 & 7 & 3 \\ 5 & 2 & 16 & 7 \end{bmatrix} .
However, we can take any vector in the null space of A and add it to
this solution to obtain another solution. Suppose that x is in N (A).
Then
A(x + p) = Ax + Ap
A(x + p) = 0 + b
A(x + p) = b.
For example,

x = \begin{bmatrix} 1 \\ 2 \\ 1 \\ 2 \end{bmatrix} + 2 \begin{bmatrix} -2 \\ -3 \\ 1 \\ 0 \end{bmatrix} + 3 \begin{bmatrix} -1 \\ -1 \\ 0 \\ 1 \end{bmatrix}
Definition A.12
A basis for a subspace W is a set of vectors v1 , . . ., vp such that
1. Any vector in W can be written as a linear combination of the
basis vectors.
2. The basis vectors are linearly independent.
Definition A.13
The standard basis for R^n is the set of vectors e_1, ..., e_n such
that the elements of e_i are all zero, except for the ith element, which
is one.
Any nontrivial subspace W of Rn will have many different bases. For exam-
ple, we can take any basis and multiply one of the basis vectors by 2 to obtain a
new basis. However, it is possible to show that all bases for a subspace W have
the same number of basis vectors.
Theorem A.2
Let W be a subspace of Rn with basis v1 , v2 , . . ., vp . Then all bases
for W have p basis vectors, and p is the dimension of W .
Example A.15
In the previous example, the vectors

v_1 = \begin{bmatrix} -2 \\ -3 \\ 1 \\ 0 \end{bmatrix} \qquad v_2 = \begin{bmatrix} -1 \\ -1 \\ 0 \\ 1 \end{bmatrix}
form a basis for the null space of A because any vector in the null
space can be written as a linear combination of v1 and v2 , and
because the vectors v1 and v2 are linearly independent. Since the
basis has two vectors, the dimension of the null space of A is two.
It can be shown that the procedure used in the above example always pro-
duces a basis for N(A).
Definition A.14
Let A be an m by n matrix. The column space or range of A
(written R(A)) is the set of all vectors b such that Ax = b has at
least one solution. In other words, the column space is the set of all
vectors b that can be written as a linear combination of the columns
of A.
Example A.16
As in the previous example, let
A = [ 3 1 9 4; 2 1 7 3; 5 2 16 7 ] .
Theorem A.3
In addition to the null space and range of a matrix A, we will often work
with the null space and range of the transpose of A. Since the columns of AT
are rows of A, the column space of AT is also called the row space of A. Since
each row of A can be written as a linear combination of the nonzero rows of the
RREF of A, the nonzero rows of the RREF form a basis for the row space of
A. There are exactly as many nonzero rows in the RREF of A as there are
pivot columns. Thus we have the following theorem.
Theorem A.4
Definition A.15
Definition A.16
x · y = xT y = x1 y1 + x2 y2 + . . . + xn yn
Definition A.17
Figure A.1: Relationship between the dot product and the angle between two
vectors.
Later we will introduce two other ways of measuring the “length” of a vector.
The subscript 2 is used to distinguish this 2–norm from the other norms.
You may be familiar with an alternative definition of the dot product in
which x · y = kxkkyk cos(θ) where θ is the angle between the two vectors. The
two definitions are equivalent. To see this, consider a triangle with sides x, y,
and x − y. See Figure A.1. The angle between sides x and y is θ. By the law of
cosines,
||x − y||_2^2 = ||x||_2^2 + ||y||_2^2 − 2 ||x||_2 ||y||_2 cos(θ) .
We can also use this formula to compute the angle between two vectors:
θ = cos^{−1} ( x^T y / (||x||_2 ||y||_2) ) .
Definition A.18
Definition A.20
Two subspaces V and W of Rn are orthogonal or perpendicular
if every vector in V is perpendicular to every vector in W .
Theorem A.5
Let A be an m by n matrix. Then
N (A) ⊥ R(AT ) .
Furthermore,
N (A) + R(AT ) = Rn .
That is, any vector x in Rn can be written uniquely as x = p + q
where p is in N (A) and q is in R(AT ).
Definition A.21
A basis in which the basis vectors are orthogonal is an orthogonal
basis.
Definition A.22
An n by n matrix Q is orthogonal if the columns of Q are orthog-
onal and each column of Q has length one.
Theorem A.6
If Q is an orthogonal matrix, then:
1. QT Q = I. In other words, Q−1 = QT .
Given a subspace W with orthogonal basis w1 , w2 , . . ., wp , the orthogonal
projection of a vector x onto W is
p = proj_W x = (x^T w1 / w1^T w1) w1 + (x^T w2 / w2^T w2) w2 + . . . + (x^T wp / wp^T wp) wp .
The Gram–Schmidt orthogonalization process turns a basis for a subspace
of Rn into an orthogonal basis. We begin with a basis v1 , v2 , . . ., vp . The
process recursively constructs an orthogonal basis by taking each vector in the
original basis and then subtracting off its projection onto the space spanned by
the previous vectors. The formulas are
w1 = v1
w2 = v2 − (v1^T v2 / v1^T v1) v1
. . .
wp = vp − (w1^T vp / w1^T w1) w1 − . . . − (w_{p−1}^T vp / w_{p−1}^T w_{p−1}) w_{p−1} .
Unfortunately, the Gram–Schmidt process is numerically unstable when applied
to large bases. In MATLAB the command orth provides a numerically stable
way to produce an orthogonal basis from a nonorthogonal basis.
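As an illustration, the following minimal MATLAB sketch (with a small hypothetical basis) carries out the classical Gram–Schmidt recursion above and then calls orth; in practice orth or a QR factorization should be preferred for numerical stability.

% Classical Gram-Schmidt on a small hypothetical basis (columns of V).
V = [1 1; 1 0; 0 1];          % basis vectors v1, v2 as columns
W = zeros(size(V));
for k = 1:size(V,2)
    w = V(:,k);
    for j = 1:k-1
        % subtract the projection of v_k onto w_j
        w = w - (W(:,j)'*V(:,k))/(W(:,j)'*W(:,j)) * W(:,j);
    end
    W(:,k) = w;
end
W'*W                          % off-diagonal entries should be (near) zero
Q = orth(V);                  % MATLAB's numerically stable orthonormal basis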
An important property of the orthogonal projection is that the projection
of x onto W is the point in W which is closest to x. In the special case that
x is in W , the projection of x onto W is x. This provides a convenient way to
write a vector in W as a linear combination of the orthogonal basis vectors.
Example A.17
In this example, we will find the point on the plane
x1 + x2 + x3 = 0
that is closest to a given vector x. The plane is the null space of the matrix
A = [ 1 1 1 ].
Using the RREF, we find that the null space has the basis
u1 = [ −1; 1; 0 ]    and    u2 = [ −1; 0; 1 ] .
Unfortunately, this basis is not orthogonal. Using the MATLAB
orth command, we obtain the orthogonal basis
w1 = [ −0.7071; 0; 0.7071 ]    and    w2 = [ 0.4082; −0.8165; 0.4082 ] .
Using this orthogonal basis, we compute the projection of x onto
the plane as
p = (x^T w1 / w1^T w1) w1 + (x^T w2 / w2^T w2) w2 .
p = [ −1; 0; 1 ] .
Setting the gradient of the sum of squared errors to zero gives
A^T (Ax − b) = 0 ,
or
A^T A x = A^T b .                                               (A.7)
This last system of equations is referred to as the normal equations for the
least squares problem. It can be shown that if the columns of A are linearly
independent, then the normal equations have exactly one solution. This solution
minimizes the sum of squared errors.
Example A.18
Let
A = [ 1 2; 1 3; 1 1; 2 2 ]
and
b = [ 1; 2; 3; 4 ] .
It is easy to see that the system of equations Ax = b is inconsis-
tent. We will find the least squares solution by solving the normal
equations. Here
A^T A = [ 7 10; 10 18 ]
and
A^T b = [ 14; 19 ] .
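As a minimal MATLAB sketch (using the A and b reconstructed in this example), the normal equations solution should agree with MATLAB's backslash least squares solution.

A = [1 2; 1 3; 1 1; 2 2];
b = [1; 2; 3; 4];
x_ne = (A'*A) \ (A'*b)     % solve the normal equations (A.7)
x_bs = A \ b               % least squares solution via backslash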
Definition A.23
Ax = λx .
x(k + 1) = Ax(k)
x′(t) = Ax(t) .
Ax = λx .
Suppose that we can find a set of n linearly independent eigenvectors
v1 , v2 , . . . , vn
of A, with associated eigenvalues
λ1 , λ2 , . . . , λn .
These eigenvectors form a basis for Rn . We can use the eigenvectors to diago-
nalize the matrix as
A = PΛP^{−1} ,
where
P = [ v1 v2 . . . vn ]
and
Λ = diag(λ1 , λ2 , . . . , λn ) .
To see that this works, simply compute AP:
AP = A [ v1 v2 . . . vn ]
   = [ λ1 v1 λ2 v2 . . . λn vn ]
   = PΛ .
Thus
A = PΛP^{−1} .
Not all matrices are diagonalizable, because not all matrices have n linearly
independent eigenvectors. However, there is an important special case in which
matrices can always be diagonalized.
Theorem A.7
If A is a real symmetric matrix, then A can be written as
A = QΛQ^T ,
where Q is an orthogonal matrix whose columns are eigenvectors of A and Λ
is a diagonal matrix whose diagonal entries are the (real) eigenvalues of A.
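A minimal MATLAB sketch (with a small hypothetical symmetric matrix) illustrating this orthogonal diagonalization with the eig command:

A = [2 1; 1 2];
[Q, Lambda] = eig(A);      % columns of Q are eigenvectors, Lambda is diagonal
norm(A - Q*Lambda*Q')      % should be near machine precision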
Definition A.24
Theorem A.8
A symmetric matrix A is positive semidefinite if and only if its eigen-
values are greater than or equal to 0.
Theorem A.9
Let A be an n by n positive definite and symmetric matrix. Then
A can be written uniquely as
A = RT R ,
where R is an upper triangular matrix. Furthermore, A can be
factored in this way only if it is positive definite.
The MATLAB command chol can be used to compute the Cholesky factor-
ization of a symmetric and positive definite matrix.
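As a minimal sketch (with a small hypothetical matrix), the following MATLAB lines compute and check a Cholesky factorization; chol returns the upper triangular factor R with A = R'*R.

A = [4 2; 2 3];          % symmetric positive definite
R = chol(A);             % upper triangular R with A = R'*R
norm(A - R'*R)           % should be near machine precision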
We will need to compute derivatives of quadratic forms.
Theorem A.10
Let f (x) = xT Ax where A is an n by n symmetric matrix. Then
∇f (x) = 2Ax
and
∇2 f (x) = 2A .
min f (x)
subject to g(x) = 0 .                                           (A.10)
Theorem A.11
A minimum point of (A.10) can occur only at a point x where
∇f (x) + λ∇g(x) = 0
for some λ.
min f (x)
subject to g(x) ≤ 0 .                                           (A.12)
Theorem A.12
A minimum point of (A.12) can occur only at a point x where
Example A.19
min x1 + x2
subject to x1^2 + x2^2 − 1 ≤ 0 .
Notice that (A.13) is (except for the condition λ > 0) the necessary condition
for a minimum point of the unconstrained minimization problem
Theorem A.13
Suppose that f (x) and its first and second partial derivatives are
continuous. Then given some point x0 , we can write f (x) as
f (x) = f (x0 ) + ∇f (x0 )^T (x − x0 ) + (1/2) (x − x0 )^T ∇^2 f (c) (x − x0 )     (A.15)
for some point c between x and x0 .
f (x) ≈ f (x0 ) + ∇f (x0 )^T (x − x0 ) + (1/2) (x − x0 )^T ∇^2 f (x0 ) (x − x0 ) .     (A.16)
We will frequently make use of this approximation. Taylor's theorem can easily
be specialized to functions of a single variable:
f (x) ≈ f (x0 ) + f ′(x0 )(x − x0 ) + (1/2) f ″(x0 )(x − x0 )^2 .
The theorem can also be extended to include terms associated with higher order
derivatives, but we will not need this.
The gradient ∇f (x0 ) has an important geometric interpretation. This vector
points in the direction in which f (x) increases most rapidly as we move away
from x0 . The Hessian, ∇2 f (x0 ), plays a role similar to the second derivative of
a function of a single variable.
Theorem A.14
Definition A.25
Definition A.26
It can be shown that for any p ≥ 1, the p–norm satisfies the conditions of
Definition A.9. The conventional Euclidean length is just the 2–norm. Two
other p–norms are commonly used. The 1–norm is the sum of the absolute
values of the entries in x. The infinity–norm is obtained by taking the limit
as p goes to infinity. The infinity–norm is the maximum of the absolute values
of the entries in x. The MATLAB command norm can be used to compute the
norm of a vector. It has options for the 1, 2, and infinity norms.
Example A.20
Let
x = [ −1; 2; −3 ] .
Then
||x||_1 = 6 ,
||x||_2 = √14 ,
and
||x||_∞ = 3 .
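The same computation in MATLAB, as a quick check using the vector from Example A.20:

x = [-1; 2; -3];
norm(x, 1)      % 1-norm: 6
norm(x)         % 2-norm: sqrt(14), about 3.7417
norm(x, inf)    % infinity-norm: 3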
Definition A.27
Any measure of the size or length of an m by n matrix that satisfies
the following five properties can be used as a matrix norm.
1. For any matrix A, kAk ≥ 0.
2. For any matrix A and any scalar s, ksAk = |s|kAk.
3. For any matrices A and B, kA + Bk ≤ kAk + kBk.
4. kAk = 0 if and only if A = 0.
5. For any two matrices A and B of compatible sizes, kABk ≤
kAkkBk.
Definition A.29
The Frobenius norm of an m by n matrix is given by
||A||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} A_{ij}^2 ) .
Definition A.30
A matrix norm and a vector norm are compatible if
||Ax|| ≤ ||A|| ||x|| .
The matrix p–norm is (by its definition) compatible with the vector p–norm
from which it was derived. It can also be shown that the Frobenius norm of a
matrix is compatible with the vector 2–norm. Thus the Frobenius norm is often
used with the vector 2–norm.
In practice, the Frobenius norm, 1–norm, and infinity–norm of a matrix are
easy to compute, while the 2–norm of a matrix can be difficult to compute for
large matrices. The MATLAB command norm has options for computing the
1, 2, infinity, and Frobenius norms of a matrix.
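A minimal sketch (with a small hypothetical matrix) of the corresponding MATLAB calls:

A = [1 2; 3 4];
norm(A, 1)        % 1-norm: maximum absolute column sum (6)
norm(A, inf)      % infinity-norm: maximum absolute row sum (7)
norm(A, 'fro')    % Frobenius norm: sqrt(1+4+9+16) = sqrt(30)
norm(A)           % 2-norm: largest singular value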
Suppose that we want to solve the system of equations
Ax = b ,
but that the right hand side vector is perturbed slightly, so that we actually
solve
Ax̂ = b̂ .
How far is x̂ from the desired solution x? Subtracting the two systems of
equations and using a compatible matrix norm,
A(x − x̂) = b − b̂
(x − x̂) = A^{−1} (b − b̂)
||x − x̂|| = ||A^{−1} (b − b̂)||
||x − x̂|| ≤ ||A^{−1}|| ||b − b̂|| .
This formula provides an absolute bound on the error in the solution. It is also
worthwhile to compute a relative error bound. The relative error in b is
||b − b̂|| / ||b|| .
The relative error in x is measured by
||x − x̂|| / ||x|| .
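Although the rest of the derivation is not reproduced here, the relative error in x is bounded by cond(A) = ||A|| ||A^{−1}|| times the relative error in b. A minimal MATLAB sketch (with a hypothetical, badly conditioned system) illustrating that bound:

A = [1 1; 1 1.0001];                  % nearly singular, badly conditioned
b = [2; 2.0001];
bhat = b + 1e-5*randn(2, 1);          % perturbed right hand side
x = A \ b;
xhat = A \ bhat;
rel_err_x = norm(x - xhat)/norm(x)
bound = cond(A)*norm(b - bhat)/norm(b)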
Theorem A.15
Let A be an m by n matrix with m > n. Then A can be written as
A = QR ,
where Q is an m by m orthogonal matrix and R is an m by n matrix of the form
R = [ R1; 0 ] ,
where R1 is n by n and upper triangular, and
Q = [ Q1 Q2 ] ,
where Q1 is m by n and Q2 is m by m − n. In this case the QR factorization
has some important properties.
Theorem A.16
Since multiplying a vector by an orthogonal matrix does not change its length,
this is equivalent to
min ||Q^T (Ax − b)||_2 .
But
Q^T A = Q^T QR = R .
So, we have
min ||Rx − Q^T b||_2 ,
or
min || [ R1 x − Q1^T b ; 0x − Q2^T b ] ||_2 .
Whatever value of x we pick, we will probably end up with nonzero error because
of the 0x − QT2 b part of the least squares problem. We cannot minimize the
norm of this part of the vector. However, we can find an x that exactly solves
R1 x = QT1 b. Thus we can minimize the least squares problem by solving the
square system of equations
R1 x = QT1 b. (A.20)
The advantage of solving this system of equations instead of the normal equa-
tions (A.7) is that the normal equations are typically much more badly condi-
tioned than (A.20).
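A minimal MATLAB sketch (reusing the hypothetical A and b from Example A.18) of solving the least squares problem through (A.20):

A = [1 2; 1 3; 1 1; 2 2];
b = [1; 2; 3; 4];
[Q, R] = qr(A);            % full QR factorization
n = size(A, 2);
Q1 = Q(:, 1:n);
R1 = R(1:n, 1:n);
x = R1 \ (Q1' * b)         % solves R1*x = Q1'*b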
is c1 = c2 = 0.
We can define the dot product of two functions f and g to be
f · g = ∫_a^b f (x) g(x) dx .
Another commonly used notation for this dot product or inner product of f
and g is
f · g = ⟨f, g⟩ .
It is easy to show that this inner product has all of the algebraic properties of
the dot product of two vectors in Rn . A more important motivation for defining
the dot product in this way is that it leads to a useful definition of the 2–norm
of a function. Following our earlier formula that ||x||_2 = sqrt(x^T x), we have
||f ||_2 = sqrt( ∫_a^b f (x)^2 dx ) .
The corresponding measure of the distance between two functions, ||f − g||_2 ,
is obviously zero when f (x) = g(x) everywhere, and is zero only when
f (x) = g(x) except possibly at some isolated points.
Using this inner product and norm, we can reconstruct the theory of lin-
ear algebra from Rn in our space of functions. This includes the concepts of
orthogonality, projections, norms, and least squares solutions.
A.13 Exercises
1. Construct an example of an over determined system of equations that has
infinitely many solutions.
2. Construct an example of an over determined system of equations that has
exactly one solution.
3. Construct an example of an under determined system of equations that
has no solutions.
4. Is it possible for an under determined system of equations to have exactly
one solution? If so, construct an example. If not, then explain why it is
not possible.
5. Let A be an m by n matrix with n pivot columns in its RREF. Can A
have infinitely many solutions?
6. Write a MATLAB routine to compute the RREF of a matrix.
%
% R=myrref(A)
%
% A an m by n input matrix.
% R output: the RREF(A)
%
Check your routine by comparing its results with results from MATLAB’s
rref command. Try some large (100 by 200) random matrices. Do you
always get the same result? If not, explain why.
7. If C = AB is a 5 by 4 matrix, then how many rows does A have? How
many columns does B have? Can you say anything about the number of
columns in A?
8. Write a MATLAB function to compute the product of two matrices
%
% C=myprod(A,B)
%
% A a matrix of size m by n
% B a matrix of size n by r
% C output: C=A*B
%
P = A(AT A)−1 AT .
We will assume that (AT A)−1 exists, but not that A−1 exists.
(a) Show that P is symmetric.
(b) Show that P2 = P.
13. Suppose that v1 , v2 and v3 are three vectors in R3 and that v3 = −2v1 +
3v2 . Are the vectors linearly dependent or linearly independent?
14. Let A be an n by n matrix. Show that if A−1 exists, then the columns
of A are linearly independent. Also show that if the columns of A are
linearly independent, then A−1 exists.
15. Show that any collection of four vectors in R3 must be linearly depen-
dent. Show that in general, any collection of more than n vectors in Rn
must be linearly dependent.
16. Let
A = [ 1 2 3 4; 2 2 1 3; 4 6 7 11 ] .
Find bases for N (A), R(A), N (AT ) and R(AT ). What are the dimensions
of the four subspaces?
17. Let A be an n by n matrix such that A−1 exists. What are N (A), R(A),
N (AT ), and R(AT )?
18. Let A be any 9 by 6 matrix. If the dimension of the null space of A is 5,
then what is the dimension of R(A)? What is the dimension of R(AT )?
What is the rank of A?
19. Suppose that a non homogeneous system of equations with four equations
and six unknowns has a solution with two free variables. Is it possible to
change the right hand side of the system of equations so that the modified
system of equations has no solutions?
20. Let W be the set of vectors x in R4 such that x1 x2 = 0. Is W a subspace
of R4 ?
21. Find vectors v1 and v2 such that
v1 ⊥ v2
v1 ⊥ x
v2 ⊥ x
where
x = [ 1; 2; 3 ] .
22. Let v1 , v2 , v3 be a set of three orthogonal vectors. Show that the vectors
are also linearly independent.
23. Show that if x ⊥ y, then
QT Q = I.
P = I − u u^T / (u^T u) .
Show that P is a symmetric matrix.
26. Prove the parallelogram law
λ1 , λ 2 , . . . , λn .
Show that
det(A) = λ1 λ2 · · · λn .
A = PΛP−1 .
30. Suppose that A is diagonalizable and that all eigenvalues of A have ab-
solute value less than one. What is the limit as k goes to infinity of Ak ?
31. Let A be an m by n matrix, and let
P = A(AT A)−1 AT .
We will assume that (AT A)−1 exists, but not that A−1 exists.
(a) Show that for any vector x, Px is in R(A).
(b) Show that for any vector x, (x − Px) ⊥ R(A). Hint: Show that
x − Px is in N (AT ).
This shows that multiplying P times x computes the orthogonal projection
of x onto R(A). Can you find a similar formula for projecting onto N (A)?
32. In this exercise, we will derive the formula (A.17) for the one–norm of a
matrix. Begin with the optimization problem
33. Derive the formula (A.19) for the infinity norm of a matrix.
34. In this exercise we will derive the formula (A.18) for the two–norm of a
matrix. Begin with the maximization problem
max_{||x||_2 = 1} ||Ax||_2^2 .
Note that we have squared ||Ax||_2 . We will take the square root at the
end of the problem.
(a) Using the formula ||x||_2 = sqrt(x^T x), rewrite the above maximization
problem without norms.
40. Let A be a symmetric and positive definite matrix with Cholesky factor-
ization
A = RT R.
Show how the Cholesky factorization can be used to solve Ax = b by
solving two systems of equations, each of which has R or RT as its matrix.
41. Show that if A is an m by n matrix with QR factorization
A = QR
then
kAkF = kRkF .
42. Let P3 [0, 1] be the space of polynomials of degree less than or equal to 3
on the interval [0, 1]. The polynomials p1 (x) = 1, p2 (x) = x, p3 (x) = x2 ,
and p4 (x) = x3 form a basis for P3 [0, 1], but they are not orthogonal with
respect to the inner product
f · g = ∫_0^1 f (x) g(x) dx .
Use the Gram-Schmidt process to construct an orthogonal basis for P3 [0, 1].
Once you have your basis, use it to find the third degree polynomial that
best approximates f (x) = e−x on the interval [0, 1].
Appendix B
Review of Probability and Statistics
Definition B.1
1. P (S) = 1.
2. For every event A ⊆ S, P (A) ≥ 0.
3. If events A1 , A2 , . . . are pairwise mutually exclusive (that is, if
Ai ∩ Aj is empty for i ≠ j), then
P ( ∪_{i=1}^{∞} Ai ) = Σ_{i=1}^{∞} P (Ai ) .
The list of properties of probability given in this definition is extremely
useful in developing the mathematics of probability theory. However, applying
this definition of probability to real world situations requires some ingenuity.
Example B.1
Consider the experiment of throwing a dart at a dart board. We will
assume that our dart thrower is an expert who always hits the dart
board. The sample space S consists of the points on the dart board.
We can define an event A that consists of the points in the bulls eye.
Then P (A) is the probability that the thrower hits the bulls eye.
Definition B.2
A random variable X is a function X(s) that assigns a value to
each outcome s in the sample space S.
Each time we perform an experiment, we obtain a specific value of
the random variable. These are called realizations of the random
variable.
We will typically use capital letters to denote random variables, with
lower case letters used to denote realizations of a random variable.
Example B.2
To continue our previous example, let X be the function that takes
a point on the dart board and returns the associated score. Suppose
that throwing the dart in the bulls eye scores 50 points. Then for
each point s in the bullseye, X(s) = 50.
In this book we will deal frequently with experimental results in the form of
measurements that can include some random measurement error.
Example B.3
Suppose we weigh an object five times and obtain masses of m1 =
10.1 kg, m2 = 10.0 kg, m3 = 10.0 kg, m4 = 9.9 kg, and m5 = 10.1
kg. We will assume that there is one true mass m, and that the mea-
surements we obtained varied because of some random measurement
error. Thus
Definition B.3
The uniform probability density on the interval [a, b] has
fU (x) = 1/(b − a)   (a ≤ x ≤ b)
fU (x) = 0           (x < a or x > b) .                          (B.1)
Definition B.4
The Gaussian or normal probability density (with parameters µ
and σ) has
fN (x) = (1/(σ √(2π))) e^{−(x−µ)^2 /(2σ^2)} = N (µ, σ^2 )         (B.2)
The standard normal random variable, N (0, 1), has µ = 0 and
σ = 1.
Definition B.5
The exponential random variable has
fexp (x) = λ e^{−λx}   (x ≥ 0)
fexp (x) = 0           (x < 0) .                                  (B.3)
Definition B.6
The double-sided exponential probability density has
fdexp (x) = (1/(√2 σ)) e^{−√2 |x−µ|/σ} .                          (B.4)
Definition B.7
The χ2 probability density with parameter ν has
fχ2 (x) = (1/(2^{ν/2} Γ(ν/2))) x^{ν/2 − 1} e^{−x/2} ,             (B.5)
where the gamma function is
Γ(x) = ∫_0^∞ ξ^{x−1} e^{−ξ} dξ .
Definition B.8
The Student’s t distribution with n degrees of freedom has
ft (x) = (Γ((n + 1)/2) / (Γ(n/2) √(nπ))) (1 + x^2 /n)^{−(n+1)/2} .     (B.6)
Definition B.9
The F distribution is a function of two degrees of freedom, ν1 and
ν2 :
fF (x) = ( (ν1 /ν2 )^{ν1 /2} x^{(ν1 −2)/2} (1 + x ν1 /ν2 )^{−(ν1 +ν2 )/2} ) / β(ν1 /2, ν2 /2) ,    (B.7)
where the beta function is
β(z, w) = ∫_0^1 t^{z−1} (1 − t)^{w−1} dt .                              (B.8)
Note that FX (a) must lie in the interval [0, 1] for all a, and is a non-decreasing
function of a because of the unit area and non negativity of the PDF.
For the uniform PDF on the unit interval, for example, the CDF is a ramp
function:
FU (a) = ∫_{−∞}^{a} fU (z) dz ,
so that
FU (a) = 0   (a ≤ 0)
FU (a) = a   (0 ≤ a ≤ 1)
FU (a) = 1   (a > 1) .
The PDF, fX (x), and CDF, FX (a), completely determine the probabilistic
properties of a random variable.
The probability that a particular realization of X will lie within a general
interval [a, b] is
P (a ≤ X ≤ b) = P (X ≤ b) − P (X ≤ a) = F (b) − F (a)
             = ∫_{−∞}^{b} f (x) dx − ∫_{−∞}^{a} f (x) dx = ∫_{a}^{b} f (x) dx .
Definition B.10
Some authors use the term “mean” for the expected value of a random
variable. We will reserve the term mean for the average of a set of data. Note
that the expected value of a random variable is not necessarily identical to the
mode (the value with the largest value of f (x)) nor is it necessarily identical
to the median, the value of x for which F (x) = 1/2.
Example B.4
The first integral term is µ because the integral of the entire PDF is 1,
and the second term is zero because it is an odd function integrated
over a symmetric interval. Thus
E[X] = µ
Definition B.11
The variance and standard deviation serve as measures of the spread of the
random variable about its expected value. Since the units of σ are the same as
the units of µ, the standard deviation is generally more practical as a measure of
the spread of the random variable. However, the variance has many properties
that make it more useful for certain calculations.
Definition B.12
Definition B.13
Definition B.14
Theorem B.1
The following properties of Var, Cov, and correlation hold for any
random variables X and Y and scalars s and a.
• Var(X) ≥ 0
• Var(X + a) = Var(X)
• Var(sX) = s2 Var(X)
• Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
• Cov(X, Y ) = Cov(Y, X)
• ρ(X, Y ) = ρ(Y, X)
• −1 ≤ ρ(X, Y ) ≤ 1 .
Example B.5
Let Z be a standard normal random variable and let
X = µ + σZ .
Then
E[X] = E[µ] + σE[Z]
so
E[X] = µ .
Also,
Var(X) = Var(µ) + σ 2 Var(Z) = σ 2 .
Thus if we have a program to generate random numbers with the
standard normal distribution, we can use it to generate normal ran-
dom numbers with any desired expected value and standard devia-
tion. The MATLAB command randn generates independent real-
izations of an N (0, 1) random variable.
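A minimal MATLAB sketch (with hypothetical values of µ and σ) of this construction:

mu = 10;
sigma = 2;
z = randn(1000, 1);      % independent N(0,1) realizations
x = mu + sigma*z;        % N(mu, sigma^2) realizations
[mean(x) std(x)]         % should be close to [10 2]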
Example B.6
What is the CDF (or PDF) of the sum of two independent random
variables X + Y ? To see this, we write the desired CDF in terms of
an appropriate integral over the JDF, f (x, y) (Figure B.8)
FX+Y (z) = P (X + Y ≤ z)
         = ∫∫_{x+y≤z} f (x, y) dx dy
         = ∫∫_{x+y≤z} fX (x) fY (y) dx dy
         = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} fX (x) fY (y) dx dy
         = ∫_{−∞}^{∞} ( ∫_{−∞}^{z−y} fX (x) dx ) fY (y) dy
         = ∫_{−∞}^{∞} FX (z − y) fY (y) dy .
The JDF can be used to evaluate the CDF or PDF arising from a general
function of independent random variables. The process is identical to the pre-
vious example except that the specific form of the integral limits is determined
by the specific function.
Example B.7
Let X and Y be independent normal random variables, each with
expected value zero and variance σ^2 , and consider the product
Z = XY .
Figure B.8: Integration of a joint probability density for two independent ran-
dom variables, X, and Y , to evaluate the CDF of Z = X + Y .
F (z) = P (Z ≤ z) = P (XY ≤ z) .
For z ≤ 0, this is the integral of the JDF over the exterior of the
hyperbolas defined by xy ≤ z ≤ 0, while for z ≥ 0, we integrate
over the interior of the complementary hyperbolas xy ≤ z ≥ 0. At
z = 0, the integral covers exactly half of the (x, y) plane (the 2nd
and 4th quadrants) and, because of the symmetry of the JDF, has
accumulated half of the probability, or 1/2.
The integral is thus
F (z) = 2 ∫_{−∞}^{0} ∫_{z/x}^{∞} (1/(2πσ^2)) e^{−(x^2 +y^2 )/(2σ^2 )} dy dx      (z ≤ 0)
and
F (z) = 1/2 + 2 ∫_{−∞}^{0} ∫_{z/x}^{0} (1/(2πσ^2)) e^{−(x^2 +y^2 )/(2σ^2 )} dy dx      (z ≥ 0) .
As in the previous example for the sum of two random variables, the
PDF may be obtained from the CDF by differentiating with respect
to z.
Definition B.16
The conditional probability of A given that B has occurred is
given by
P (A ∩ B)
P (A|B) = . (B.11)
P (B)
Theorem B.2
Suppose that B1 , B2 , . . ., Bn are mutually disjoint and exhaustive
events. That is, Bi ∩ Bj = ∅ for i ≠ j, and
∪_{i=1}^{n} Bi = S .
Then
P (A) = Σ_{i=1}^{n} P (A|Bi ) P (Bi ) .                                (B.12)
Theorem B.3
P (B|A) = P (A|B) P (B) / P (A) .                                      (B.13)
P (B|A) = P (A|B) P (B) / P (A) .
Thus
P (B|A) = (0.99 × 0.0001) / 0.010098 = 0.0098 .
In other words, even after a positive screening test, it is still unlikely
that the individual will have the disease. The vast majority of those
individuals who test positive will in fact not have the disease.
Theorem B.4
Given two random variables X and Y , with the distribution of X
depending on Y , we can compute
P (X ≤ a) = ∫_{−∞}^{∞} P (X ≤ a | Y = y) fY (y) dy .                   (B.14)
Example B.9
Let U be a random variable uniformly distributed on (1, 2). Let X
be an exponential random variable with parameter λ = U . We will
find the expected value of X.
E[X] = ∫_1^2 E[X | U = u] fU (u) du .
Since E[X | U = u] = 1/u and fU (u) = 1 on (1, 2), this gives
E[X] = ∫_1^2 (1/u) du = ln 2 ≈ 0.69 .
Definition B.17
If the random variables X1 , . . ., Xn have a multivariate normal
distribution (MVN), then the joint probability density function is
f (x) = (1/((2π)^{n/2} √(det(C)))) e^{−(x−µ)^T C^{−1} (x−µ)/2} ,        (B.16)
where µ = [µ1 , µ2 , . . . , µn ]T is a vector containing the expected
values along each of the coordinate directions of X1 , . . ., Xn , and C
contains the covariances between the random variables
Theorem B.5
Let X be a multivariate normal random vector with expected values
defined by the vector µ and covariance C, and let Y = AX. Then
Y is also multivariate normal, with
E[Y] = Aµ
and
Cov(Y) = ACAT .
Theorem B.6
If we have an n-dimensional MVN distribution with covariance ma-
trix C and expected value µ, and the covariance matrix is of full
rank, then the quantity
Z = (X − µ)T C−1 (X − µ)
has a χ2 distribution with n degrees of freedom.
Example B.10
We can generate vectors of random numbers according to an MVN
distribution by using the following process, which is very similar to
the process for generating random normal scalars.
1. Find the Cholesky factorization C = LLT .
2. Let Z be a vector of n independent N (0, 1) random numbers.
3. Let X = µ + LZ.
Because E[Z] = 0, E[X] = µ + L0 = µ. Also, since Cov(Z) = I and
Cov(µ) = 0, Cov(X) = Cov(µ + LZ) = L I L^T = LL^T = C.
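A minimal MATLAB sketch (with a hypothetical µ and C) of this procedure:

mu = [1; 2];
C = [2 1; 1 3];                       % symmetric positive definite covariance
R = chol(C);                          % C = R'*R, so take L = R'
Z = randn(2, 1000);                   % columns of independent N(0,1) values
X = repmat(mu, 1, 1000) + R'*Z;       % each column is one MVN realization
cov(X')                               % sample covariance, should be close to C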
Theorem B.7
Let X1 , X2 , . . ., Xn be independent and identically distributed (IID)
random variables with a finite expected value µ and variance σ 2 . Let
Zn = (X1 + X2 + . . . + Xn − nµ) / (√n σ) .
In the limit as n approaches infinity, the distribution of Zn ap-
proaches the standard normal distribution.
xi = F^{−1} ((i − 0.5)/n) ,    i = 1, 2, . . . , n ,
Figure B.10: Q–Q plot for the sample data set (quantiles of the input sample
versus standard normal quantiles).
Figure B.11: Q–Q plot for a second data set (quantiles of the input sample
versus standard normal quantiles).
where F (x) is the CDF of the distribution against which we wish to compare
our observations.
If we are testing to see if the elements of d could have come from the normal
distribution, then F (x) is the CDF for the standard normal distribution
FN (x) = (1/√(2π)) ∫_{−∞}^{x} e^{−z^2 /2} dz .
If the elements of d are normally distributed, the points (yi , xi ) will follow a
straight line with a slope and intercept determined by the standard deviation
and expected value, respectively, of the normal distribution that produced the
data. The MATLAB statistics toolbox includes a command, qqplot, that can
be used to generate a Q − Q plot.
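A minimal sketch (with hypothetical data sets) of generating Q–Q plots with qqplot; trnd, also from the statistics toolbox, generates t-distributed samples.

d1 = randn(100, 1);        % normal data: points should follow a line
d2 = trnd(5, 100, 1);      % t-distributed data with 5 degrees of freedom
figure; qqplot(d1);
figure; qqplot(d2);        % extreme points stray from the line in the tails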
Figure B.10 shows the Q − Q plot for our sample data set. The point at the
upper right corner of the plot has the value 4.62, but under a normal distribution
with expected value -0.03 and standard deviation 1.26, the largest of the 100
points should be about 2.5. Similarly, some points in the lower left corner stray
from the line. It is apparent that the data set contains more extreme values
than the normal distribution would predict. In fact, these data were generated
according to a t distribution with 5 degrees of freedom, which has broader tails
than the normal distribution (Figure B.6).
Figure B.11 shows a more extreme example. These data are clearly not normally
distributed. The Q–Q plot shows that the distribution is skewed to the right. It
would be unwise to treat these data as if they were normally distributed.
There are a number of statistical tests for normality. These tests, including
the Kolmogorov–Smirnov test, Anderson–Darling test, and Lilliefors test each
produce probabilistic measures called p-values. A small p-value indicates that
the observed data would be unlikely if the distribution were in fact normal,
while a large p-value is consistent with normality. In practice, these tests may
declare a data set to be non-normal even though the data are “normal enough”
for practical applications. For example, the Lilliefors test implemented by the
MATLAB statistics toolbox command lillietest rejects the data in Figure B.10
with a p-value of 0.04.
This sample mean m̄ will serve as our estimate of m. We will also compute
an estimate s of the standard deviation:
s = sqrt( Σ_{i=1}^{n} (mi − m̄)^2 / (n − 1) ) .
Theorem B.8
(The Sampling Theorem) Under the assumption that measurements
are independent and normally distributed with expected value m
and standard deviation σ, the random quantity
t = (m̄ − m) / (s/√n)
has a Student's t distribution with n − 1 degrees of freedom.
[9.96, 10.08] g .
The above procedure for constructing a confidence interval for the mean us-
ing the t distribution was based on the assumption that the measurements were
normally distributed. In situations where the data are not normally distributed
this procedure can fail in a very dramatic fashion. However, it may be safe
to generate an approximate confidence interval using this procedure if (1) the
number n of data is large (50 or more) or (2) the distribution of the data is not
strongly skewed and n is at least 15.
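A minimal MATLAB sketch of the procedure, using purely hypothetical measurements (not the data used in the example above) and tinv from the statistics toolbox:

m = [10.2 9.9 10.1 10.0 9.8];          % hypothetical repeated measurements
n = length(m);
mbar = mean(m);                        % sample mean
s = std(m);                            % sample standard deviation (n-1)
tcrit = tinv(0.975, n-1);              % 97.5th percentile of t with n-1 dof
ci = [mbar - tcrit*s/sqrt(n), mbar + tcrit*s/sqrt(n)]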
and only a 5% probability that t would lie outside this interval. Equivalently,
there is only a 5% probability that |tobs | ≥ tn−1, 0.975 .
This leads to the t–test: If |tobs | ≥ tn−1, 0.975 , then we reject the hypothesis
that µ = µ0 . On the other hand, if |t| < tn−1, 0.975 , then we cannot reject the
hypothesis that µ = µ0 . Although the 95% confidence level is traditional, we
can also perform the t-test at a 99% or some other confidence level. In general,
if we want a confidence level of 1 − α, then we compare |tobs | to tn−1, 1−α/2 .
In addition to reporting whether or not a set of data passes a t-test, it is good
practice to report the associated t-test p-value. The p-value associated with a
t-test is the largest value of α for which the data passes the t-test. Equivalently,
it is the probability that we could have gotten a larger value of |t| than we have
observed, given that all of our assumptions are correct.
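A minimal sketch (with hypothetical data) of computing tobs and the corresponding two-sided p-value with tcdf from the statistics toolbox:

d = [0.3 -0.1 0.8 1.2 0.5 0.9 -0.2 0.7 1.1 0.4];   % hypothetical data
mu0 = 0;
n = length(d);
tobs = (mean(d) - mu0) / (std(d)/sqrt(n));
p = 2*(1 - tcdf(abs(tobs), n-1))                   % two-sided p-value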
Example B.12
Because |tobs | is larger than t9, 0.975 = 2.262, we reject the hypoth-
esis that these data came from a normal distribution with expected
value 0 at the 95% confidence level.
The t-test (or any other statistical test) can fail in two ways. First, it could
be that the hypothesis that µ = µ0 is true, but our particular data set contained
some unlikely values and failed the t-test. Rejecting the hypothesis when it is
in fact true is called a type I error . We can control the probability of a type
I error by decreasing α.
The second way in which the t-test can fail is more difficult to control. It
could be that the hypothesis µ = µ0 was false, but the sample mean was close
enough to µ0 to pass the t-test. In this case, we have a type II error. The
probability of a type II error depends very much on how close the true mean is
to µ0 . If the true mean µ = µ1 is very close to µ0 , then a type II error is quite
likely. However, if the true mean µ = µ1 is very far from µ0 then a type II error
will be less likely. Given a particular alternative hypothesis, µ = µ1 , we call
the probability of a type II error β(µ1 ), and call the probability of not making
a type II error (1 − β(µ1 )) the power of the test. We can estimate β(µ1 ) by
repeatedly generating sets of n random numbers with µ = µ1 and performing
the hypothesis test on the sets of random numbers.
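A minimal sketch (with hypothetical values of µ0 , µ1 , σ, and n) of this Monte Carlo estimate of the power:

mu0 = 0; mu1 = 0.5; sigma = 1; n = 10; alpha = 0.05;
ntrials = 1000;
reject = 0;
for k = 1:ntrials
    d = mu1 + sigma*randn(n, 1);                  % data from the alternative
    tobs = (mean(d) - mu0) / (std(d)/sqrt(n));
    if abs(tobs) >= tinv(1 - alpha/2, n-1)
        reject = reject + 1;
    end
end
power = reject/ntrials                            % estimate of 1 - beta(mu1)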
The results of a hypothesis test should always be reported with care. It is
important to discuss and justify any assumptions (such as the normality as-
sumption made in the t-test) underlying the test. The p-value should always be
reported along with whether or not the hypothesis was rejected. If the hypoth-
esis was not rejected and some particular alternative hypothesis is available, it
is good practice to estimate the power of the hypothesis test against this alter-
native hypothesis. Confidence intervals for the mean should be reported along
with the results of a hypothesis test.
It is important to distinguish between the statistical significance of a hy-
pothesis test and the actual magnitude of any difference between the observed
mean and the hypothesized mean. For example, with very large n it is nearly
always possible to achieve statistical significance at the 95% confidence level,
even though the observed mean may differ from the hypothesis by only 1% or
less.
B.10 Exercises
1. Compute the expected value and variance of a uniform random variable
in terms of the parameters a and b.
P (|X| ≥ c) ≤ E[|X|] / c .
This result is known as Markov's inequality.
P (|X − µ| ≥ kσ) ≤ 1/k^2 .
This result is known as Chebychev's inequality.
7. Show that
Var(X) = E[X 2 ] − E[X]2
8. Show that
Cov(aX, Y ) = aCov(X, Y )
and that
Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)
9. Show that the PDF for the sum of two independent uniform random vari-
ables on [a, b] = [0, 1] is
f (x) = 0       (x ≤ 0)
f (x) = x       (0 ≤ x ≤ 1)
f (x) = 2 − x   (1 ≤ x ≤ 2)
f (x) = 0       (x ≥ 2) .
10. On a multiple choice question with four possible answers, there is a 75%
chance that a student will know the correct answer and 25% chance that
the student will randomly pick one of the four answers. What is the
probability that the student will get a correct answer? Given that the
answer was correct, what is the probability that the student actually knew
the correct answer?
11. Suppose that X and Y are independent random variables. Use condition-
ing to find a formula for the CDF of X + Y in terms of the PDF’s and
CDF’s of X and Y .
12. Suppose that X is a vector of 2 random variables with an MVN distribu-
tion (expected value µ, covariance C), and that A is a 2 by 2 matrix. Use
the properties of expected value and covariance to show that Y = AX has
expected value Aµ and covariance ACAT .
13. Consider a least squares problem Ax = b, where the A matrix is known
exactly, but the right hand side vector b includes some random measure-
ment noise. Thus
bnoise = btrue + η ,
where η is a vector of independent N (0, 1) random numbers. We would
like to find xtrue such that
Axtrue = btrue .
Find the sample mean and standard deviation. Use these to construct a
95% confidence interval for the mean. Test the hypothesis H0 : µ = 0
at the 95% confidence level. What do you conclude? What was the
corresponding p-value?
15. Using MATLAB, repeat the following experiment 1,000 times. Use the
statistics toolbox function exprnd() to generate 5 exponentially distributed
random numbers (B.3) with λ = 10. Use these 5 random numbers to gen-
erate a 95% confidence interval for the mean. How many times out of the
1,000 experiments did the 95% confidence interval cover the expected value
of 10? What happens if you instead generate 50 exponentially distributed
random numbers at a time? Discuss your results.
16. Using MATLAB, repeat the following experiment 1,000 times. Use the
randn function to generate a set of 10 normally distributed random num-
bers with expected value 10.5 and standard deviation 1. Perform a t-test
of the hypothesis µ = 10 at the 95% confidence level. How many type II
errors were committed? What is the approximate power of the t-test with
n = 10 against the alternative hypothesis µ = 10.5? Discuss your results.
17. Using MATLAB, repeat the following experiment 1,000 times. Using the
exprnd() function of the statistics toolbox, generate 5 exponentially dis-
tributed random numbers with expected value 10. Take the average of the
5 random numbers. Plot a histogram and a probability plot of the 1,000
averages that you computed. Are the averages approximately normally
distributed? Explain why or why not. What would you expect to happen
if you took averages of 50 exponentially distributed random numbers at a
time? Try it and discuss the results.
Glossary of Mathematical Terminology
• A, B, C, ... : Matrices.
• Ai,· : ith row of matrix A.
• A·,i : ith column of matrix A.
• Ai,j : (i, j)th element of matrix A.
• E[X]: Expected value of the random variable X.
• G† : Generalized inverse of the matrix G calculated from the SVD.
• A, B, C, ... : Functions; Random Variables.
• N (µ, σ 2 ): Normal probability density function with expected value µ and
variance σ 2 .
• N (A): Null space of the matrix A.
• rank(A): Rank of the matrix A.
• R(A): Range of the matrix A.
• Rn : Vector space of dimension n.
• A, B, C, ... : Fourier transforms.
• a, b, c, ... : Column vectors.
• ai : ith element of vector a.
• ā: Mean value of the elements in vector a.
• Cov(x), Cov(X, Y ): Covariance between the elements of vector x or
between the random variables X and Y .
• α, β, γ, ... : Scalars.
• a, b, c, ... : Scalars or functions.
• χ2 : Chi-square random variable.
• λi : Eigenvalues.
• µ(1) : 1-norm misfit measure.
• σ: Standard deviation.
• σ 2 : Variance.
• ν: Degrees of freedom.
• CDF: Cumulative distribution function.
• JDF: Joint probability density function.
• MVN: Multivariate normal probability density function.
• PDF: Probability density function.
• SVD: Singular value decomposition.
• tν,p : p-percentile of the t distribution with ν degrees of freedom.
• Fν1 ,ν2 ,p : p-percentile of the F distribution with ν1 and ν2 degrees of free-
dom.
Bibliography
[Bjö96] Åke Björck. Numerical Methods for Least Squares Problems. SIAM,
Philadelphia, 1996.
[DS96] J. E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Un-
constrained Optimization and Nonlinear Equations. SIAM, Philadel-
phia, 1996.
[GCSR03] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, FL,
second edition, 2003.
[HBR+ 02] J.M.H. Hendrickx, B. Borchers, J.D. Rhoades, D.L. Corwin, S.M.
Lesch, A.C. Hilgendorf, and J. Schlue. Inversion of soil conductivity
profiles from electromagnetic induction measurements; theory and
experimental verification. Soil Science Society of America Journal,
66(3):673–685, 2002.
[HH93] Martin Hanke and Per Christian Hansen. Regularization methods for
large–scale problems. Surveys on Mathematics for Industry, 3:253–
315, 1993.
[HR96] M. Hanke and T. Raus. A general heuristic for choosing the regular-
ization parameter in ill–posed problems. SIAM Journal on Scientific
Computing, 17(4):956–972, 1996.
[HV91] Sabine Van Huffel and Joos Vandewalle. The Total Least Squares
Problem: computational aspects and analysis. SIAM, Philadelphia,
1991.
[Rud87] Walter Rudin. Real and Complex Analysis. McGraw–Hill, New York,
third edition, 1987.
[Sea82] Shayle R. Searle. Matrix Algebra Useful for Statistics. Wiley, New
York, 1982.
[SGT88] John A. Scales, Adam Gersztenkorn, and Sven Treitel. Fast lp so-
lution of large, sparse, linear systems: Application to seismic travel
time tomography. Journal of Computational Physics, 75(2):314–333,
1988.
[SS97] John Scales and Martin Smith. DRAFT: Geophysical inverse theory.
http://landau.Mines.EDU/ samizdat/inverse theory/, 1997.
[Str88] Gilbert Strang. Linear Algebra and its Applications. Harcourt Brace
Jovanovich Inc., San Diego, third edition, 1988.
[Tar87] Albert Tarantola. Inverse Problem Theory: Methods for Data Fitting
and Model Parameter Estimation. Elsevier, New York, 1987.
[TB97] Lloyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM,
Philadelphia, 1997.
[TG65] A.N. Tikhonov and V.B. Glasko. Use of the regularization method in
non–linear problems. USSR Computational Mathematics and Math-
ematical Physics, 5(3):93–107, 1965.
tikhcstr, 7-2
time domain response, 8-3
time series, 8-8
tomography, 1-10
total least squares, 2-25
total variation regularization, 7-11
transfer function, 8-3
transpose, A-10
trivial solution, A-4
truncated generalized singular value
decomposition, 5-24
truncated SVD, 4-12
TSVD, 4-12
type I error, B-24
type II error, B-24
unbiased, 2-7
uncorrelated random variables, B-11
uniform, B-3
uninformative prior distribution, 11-3
upper triangular matrix, A-11
upward continuation filter, 8-7
variance, B-10
vector space, A-35
vertical seismic profiling, 1-7