Contents

1 Introduction to inverse theory
    1.1 Why is the inverse problem more difficult?
        1.1.1 Example: Non-uniqueness
    1.2 So, what can we do?
        1.2.1 Example: Instability
        1.2.2 Example: Null space
    1.3 Some terms

4 Tikhonov Regularization, variance and resolution
    4.1 Tikhonov Regularization
    4.2 SVD Implementation
    4.3 Resolution vs variance, the choice of α, or p
        4.3.1 Example 1: Shaw's problem
    4.4 Smoothing Norms or Higher-Order Tikhonov
        4.4.1 The discrete Case
    4.5 Fitting within tolerance
        4.5.1 Example 2
Chapter 1

Introduction to inverse theory
In geophysics we are often faced with the following situation: We have measurements at the surface of the Earth of some quantity (magnetic field, seismic
waveforms) and we want to know some property of the ground under the place
where we made the measurements. Inverse theory is a method to infer the
unknown physical properties (model) from these measurements (data).
This class is called Geophysical Inverse Theory (GIT) because it is assumed
we understand the physics of the system. That is, if we knew the properties
accurately, we would be able to reconstruct the observations that we have taken.
First, we need to be able to solve the forward problem
    d_i = G_i(m)    (1.1)
1.1 Why is the inverse problem more difficult?

Some examples of the data we measure:

- Altitude/Bathymetry measurements
- Magnetic field at the surface
- Gravity measurements
- Waveforms / Geodetic motion
- Arrival times / Waveforms
1.1.1 Example: Non-uniqueness
Imagine we want to describe the Earth's velocity structure. The forward problem could be described as an expansion in basis functions,

    v(\theta, \phi, r) = \sum_{l} \sum_{m} \sum_{n} a_{lmn} Y_{lm}(\theta, \phi) Z_n(r)    (1.3)

    d_i = \sum_{l} \sum_{m} \sum_{n} a_{lmn} Y_{lm}^{(i)} Z_n^{(i)}(r)    (1.4)

where i = 1, ..., M. We have an infinite number of parameters a_{lmn} to determine, leading to the non-uniqueness problem.
A commonly used strategy is to drastically oversimplify the model, for example

    v^{(i)}(\theta, \phi, r) = \sum_{l=0}^{6} \sum_{m=-l}^{l} \sum_{n=0}^{6} a_{lmn} Y_{lm}^{(i)} Z_n^{(i)}(r)    (1.5)

or even further, keeping only a radial dependence,

    v^{(i)}(r) = \sum_{n=0}^{20} Z_n^{(i)}(r) \, a_n    (1.6)

In this case the number of data points is larger than the number of model parameters, M > N, so the problem is overdetermined.
If the oversimplification (i.e., radial dependence only) is justified by observations, this may be a fine approach; but when there is no evidence for this arrangement, even if the data are fit we will be uncertain of the significance of the result. Another problem is that this may unreasonably limit the solution.
1.2 So, what can we do?

1.2.1 Example: Instability
Figure 1.1: Anti-plane slip for an infinitely long strike-slip fault (coordinates x1, x2, x3).
The displacement at the Earth's surface, u(x_1, x_2, x_3), is in the x_1 direction, due to slip S(ξ) as a function of depth ξ,

    u_1(x_2, x_3 = 0) = \frac{1}{\pi} \int_0^{\infty} S(\xi) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.7)

where S(ξ) is the slip along x_1 and varies only with depth x_3. If we had only discrete measurements,

    d_i = u_1(x_2^{(i)}) = \int S(\xi) \, g_i(\xi) \, d\xi    (1.8)

where

    g_i(\xi) = \frac{1}{\pi} \frac{x_2^{(i)}}{\left(x_2^{(i)}\right)^2 + \xi^2}

Now, let's assume that slip occurs only at some depth c, so that S(ξ) = δ(ξ − c):

    d(x_2) = \frac{1}{\pi} \int \delta(\xi - c) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.9)
           = \frac{1}{\pi} \frac{x_2}{x_2^2 + c^2}    (1.10)
Figure 1.2: Observations u_1(x_2) at the surface due to concentrated slip at depth c; the response follows x_2/(x_2^2 + c^2).
The results (Figure 1.2) show that:

1. The effect of concentrated slip is spread widely.
2. This will lead to trouble (instability) in the inverse problem.

So even if we did have data at every point on the surface of the Earth, the inverse problem would be unstable. The kernel g_i(ξ) of the functional smooths the focused deformation; the problem lies in the physical model, not really in how you solve it.
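To make the instability concrete, here is a minimal numerical sketch: it discretizes the kernel g_i(ξ) into a matrix and checks its condition number. The grid spacing, depth range, and observation points are illustrative choices, not values from the text.

```python
import numpy as np

# Illustrative discretization of d_i = (1/pi) * integral S(xi) x2_i / (x2_i^2 + xi^2) dxi
x2 = np.linspace(0.1, 10.0, 50)   # assumed surface observation points
xi = np.linspace(0.05, 10.0, 50)  # assumed depths at which slip S(xi) is defined
dxi = xi[1] - xi[0]

# Each row of G is the kernel g_i(xi) evaluated on the depth grid
G = (1.0 / np.pi) * x2[:, None] / (x2[:, None]**2 + xi[None, :]**2) * dxi

# A huge condition number means tiny data errors map into huge slip errors
print("condition number of G:", np.linalg.cond(G))
```

The smooth, nearly parallel rows of G are almost linearly dependent, which is exactly the smoothing of the kernel described above.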
1.2.2 Example: Null space

Consider a gravity anomaly d(s) measured at surface position s, produced by a mass distribution m(x) buried at depth h:

    d(s) = \int \frac{h}{\left[(x - s)^2 + h^2\right]^{3/2}} \, m(x) \, dx    (1.11)
         = \int g(x - s) \, m(x) \, dx    (1.12)
Suppose now we can find a smooth function m⁺(x) such that the integral in (1.12) vanishes, i.e. d(s) = 0. Because of the symmetry of the kernel g(x − s), if we choose m⁺(x) to be a line with a given slope, the observed anomaly d(s) will be zero. The consequence is that we can add such a function m⁺ to the true anomaly,

    m = m_{true} + m^{+}    (1.13)

and the new gravity anomaly profile will match the data just as well as m_true:

    d(s) = \int g(x - s) \, m_{true}(x) \, dx + \int g(x - s) \, m^{+}(x) \, dx    (1.14)
         = \int g(x - s) \, m_{true}(x) \, dx    (1.15)

From the field observations, even if error free and infinitely sampled, there is no way to distinguish between the real anomaly and any member of an infinitely large family of alternatives.
Models m⁺(x) that lie in the null space of g(x − s) are solutions to

    \int g(x - s) \, m(x) \, dx = 0

By superposition, any linear combination of these null-space models can be added to a particular model without changing the fit to the data. This kind of problem does not have a unique answer even with perfect data.
1.3 Some terms

Theory      Determinacy       Examples
Linear      Overdetermined    Line fit
Linear      Underdetermined   Interpolation
Nonlinear   Overdetermined    Earthquake location
Linear      Underdetermined   Fault slip
Nonlinear   Underdetermined   Tomography
Chapter 2
A matrix A ∈ R^{m×n} and a vector x can be written as

    A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & \\ \vdots & & \ddots & \\ a_{m1} & & \cdots & a_{mn} \end{pmatrix},
    \qquad
    x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}

We can also think of a matrix as an ordered collection of column vectors,

    A = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix}

where each a_j is a column of A.
There are a number of special matrices to keep in mind. These are useful since some of them are used to get matrix inverses.

Square matrix:        m = n
Diagonal matrix:      a_ij = 0 whenever i ≠ j
Tridiagonal matrix:   a_ij = 0 whenever |i − j| > 1
Upper triangular:     a_ij = 0 whenever i > j
Lower triangular:     a_ij = 0 whenever i < j
Sparse matrix:        most elements are zero
Note that the definition of upper and lower triangular matrices may apply to
non-square matrices as well as square ones.
A zero matrix is a matrix composed of all zero elements. It plays the same role in matrix algebra as the scalar 0:

    A + 0 = A = 0 + A

The unit matrix is the square, diagonal matrix with ones on the diagonal and zeros elsewhere, usually denoted I. Assuming the matrix sizes are compatible,

    AI = A = IA
2.1 Matrix operations

Scalar multiplication A = αB means

    a_{ij} = \alpha \, b_{ij}

where α is a scalar. Another basic manipulation is transposition: B = A^T means b_ij = a_ji.

Matrix multiplication C = AB means

    c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
Notice we can only multiply two matrices when the number of columns (n) in the first one equals the number of rows in the second. The other dimensions are not important, so non-square matrices can be multiplied.

Other standard arithmetic rules are valid, such as distributivity, A(B + C) = AB + AC. Less obviously, associativity of multiplication holds, A(BC) = (AB)C, as long as the matrix sizes permit. But multiplication is not commutative: AB ≠ BA.

A familiar physical example of a matrix-vector product is the relation between angular momentum, the inertia tensor, and angular velocity. In general,

    y = Ax = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix} x = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n    (2.1)

so the product is a linear combination of the columns of A.
For x ∈ R^p and y ∈ R^q, the outer product is the p × q matrix

    x y^T = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}
            \begin{pmatrix} y_1 & y_2 & \cdots & y_q \end{pmatrix}
          = \begin{pmatrix} x_1 y_1 & \cdots & x_1 y_q \\ \vdots & & \vdots \\ x_p y_1 & \cdots & x_p y_q \end{pmatrix}

while the inner product of two vectors of the same length is the scalar

    x^T y = \begin{pmatrix} x_1 & x_2 & \cdots & x_p \end{pmatrix}
            \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
          = x_1 y_1 + x_2 y_2 + \cdots + x_p y_p

The inner product is just the vector dot product of vector analysis.
2.1.1 Transposes and inverses of products

For products of matrices,

    (AB)^T = B^T A^T
    (AB)^{-1} = B^{-1} A^{-1}
    (A^T)^{-1} = (A^{-1})^T

2.1.2 Matrix Inverses

The inverse of a square matrix A is the matrix A^{-1} such that A^{-1}A = I and AA^{-1} = I. The inverse is useful for solving linear systems of algebraic equations. Starting with equation (2.1),

    y = Ax
    A^{-1} y = A^{-1} A x = I x = x
For a diagonal matrix,

    D = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix},
    \qquad
    D^{-1} = \begin{pmatrix} 1/d_1 & 0 & 0 \\ 0 & 1/d_2 & 0 \\ 0 & 0 & 1/d_3 \end{pmatrix}

that is, the inverse of a diagonal matrix is a diagonal matrix with the diagonal elements raised to the power −1.

For the permutation matrix

    P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
    \qquad
    P^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}

which exchanges rows 2 and 3, the inverse is simple: P is its own inverse.

For the elimination matrix

    E = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

the matrix is not diagonal, but we can use Gaussian elimination, which we will go through next.
2.2 Gaussian Elimination

Consider the system

    2x + y + z = 1
    4x + y     = −2
   −2x + 2y + z = 7

Eliminating x from the second and third equations, and then y from the third, gives successively

    2x + y + z = 1
        −y − 2z = −4
        3y + 2z = 8

and

    2x + y + z = 1
        −y − 2z = −4
           −4z = −4

so that, back-substituting,

    z = 1,  y = 2,  x = −1

or, in matrix form, Ax = b (in index notation, A_ij x_j = b_i),

    \begin{pmatrix} 2 & 1 & 1 \\ 4 & 1 & 0 \\ -2 & 2 & 1 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -2 \\ 7 \end{pmatrix}
We are going to try and get A = LU, where L is a lower triangular matrix and
U is upper triangular with the same Gaussian Elimination steps.
1. Subtract two times the first equation from the second, i.e. multiply both sides by

    E_1 = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

to get

    \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ -2 & 2 & 1 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -4 \\ 7 \end{pmatrix}

or, for short, E_1 A x = E_1 b, i.e. A_1 x = b_1.
2. Subtract −1 times the first equation from the third, i.e. multiply both sides by

    E_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}

to get

    \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 3 & 2 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -4 \\ 8 \end{pmatrix}

or, for short, E_2 A_1 x = E_2 b_1, i.e. A_2 x = b_2.
3. Subtract −3 times the second equation from the third, i.e. multiply both sides by

    E_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{pmatrix}

to get

    \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 0 & -4 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -4 \\ -4 \end{pmatrix}

or, for short, E_3 A_2 x = E_3 b_2, i.e. A_3 x = b_3.
This new matrix will be assigned a new name, so the system now looks like

    E_3 E_2 E_1 A x = E_3 E_2 E_1 b
    U x = c

and since U = E_3 E_2 E_1 A,

    A = E_1^{-1} E_2^{-1} E_3^{-1} U = L U,
    \qquad
    L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -3 & 1 \end{pmatrix}

so to solve the system we only need to solve U x = c and back-substitute. Easy, right?
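As a quick check of the worked example, here is a minimal sketch using scipy (note that scipy's routine also applies row pivoting, so its L and U need not match the hand computation exactly):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[ 2., 1., 1.],
              [ 4., 1., 0.],
              [-2., 2., 1.]])
b = np.array([1., -2., 7.])

lu, piv = lu_factor(A)      # PA = LU with partial pivoting
x = lu_solve((lu, piv), b)  # forward- and back-substitution
print(x)                    # [-1.  2.  1.], i.e. x = -1, y = 2, z = 1
```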
2.2.1 Basic steps

- Use multiples of the first equation to eliminate the first coefficient of subsequent equations.
- Repeat for the remaining n − 1 coefficients.
- Back-substitute in reverse order.
Problems:

- Zero in the first column
- Linearly dependent equations
- Inconsistent equations

Efficiency. If we count a division, multiplication, or sum as one operation and assume we have a matrix A ∈ R^{n×n}:

- n operations to get a zero in the first coefficient
- n − 1 rows to do
- n² − n operations so far
- N = (1² + ⋯ + n²) − (1 + ⋯ + n) = (n³ − n)/3 operations to do the remaining coefficients.

For large n, N ≈ n³/3. The back-substitution part takes N ≈ n²/2. There are other, more efficient ways to solve systems of equations.
2.2.2 Some examples
Non-square matrices

An example of a system with 3 equations and 4 unknowns (overdetermined? underdetermined?) is as follows:

    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

We can use Gaussian elimination by first setting to zero the first coefficients in rows 2 and 3,

    \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 6 & 2 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

and then eliminating the third coefficient of the last row,

    \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{pmatrix}
    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 6 & 2 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

The underlined values (the 1 in row 1 and the 3 in row 2) are the pivots. The pivots have a column of zeros below them and are to the right of and below other pivots.

Now we can try to solve the equations, but note that the last row has no information: the free variables x_2 and x_4 can take any value. From the second row,

    x_3 = -x_4/3

and from the first row,

    x_1 = -3x_2 - x_4

so

    x = \begin{pmatrix} -3x_2 - x_4 \\ x_2 \\ -x_4/3 \\ x_4 \end{pmatrix}
      = x_2 \begin{pmatrix} -3 \\ 1 \\ 0 \\ 0 \end{pmatrix}
      + x_4 \begin{pmatrix} -1 \\ 0 \\ -1/3 \\ 1 \end{pmatrix}

which means that all solutions to our initial problem Ax = b are combinations of these two vectors and form an infinite set of possible solutions. You can choose ANY value of x_2 or x_4 and you will always get a correct answer.
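A minimal numpy check of this statement (the matrix and the two vectors are exactly those above):

```python
import numpy as np

A = np.array([[ 1., 3., 3., 2.],
              [ 2., 6., 9., 5.],
              [-1.,-3., 3., 0.]])

v1 = np.array([-3., 1., 0., 0.])       # multiplied by the free variable x2
v2 = np.array([-1., 0., -1./3., 1.])   # multiplied by the free variable x4

# Any linear combination satisfies the (homogeneous) system
x = 2.0 * v1 - 5.0 * v2                # arbitrary choices of x2 and x4
print(A @ x)                           # ~[0, 0, 0]
```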
2.3 Linear vector spaces

A linear vector space (LVS) is a set of elements (vectors) f, g, h, ... together with two operations, addition f + g and scalar multiplication αf, satisfying

    f + g = g + f                                         (2.2)
    f + (g + h) = (f + g) + h                             (2.3)
    there is a zero element such that f + 0 = f           (2.4)
    for every f there is a −f such that f + (−f) = 0      (2.5)
    α(f + g) = αf + αg                                    (2.6)
    (α + β)f = αf + βf                                    (2.7)
    α(βf) = (αβ)f                                         (2.8)
    1·f = f                                               (2.9)
    αf = 0 only if either α = 0 or f = 0                  (2.10)
Some examples

The most obvious space is R^n, so x = [x_1, x_2, ..., x_n] is an element of R^n.

Perhaps less familiar are spaces whose elements are functions, not just a finite set of numbers. One could define a vector space C^N[a, b], the space of all functions with N continuous derivatives on the interval [a, b].

Linear independence

A set of elements a_1, ..., a_n is linearly independent if

    \sum_{j=1}^{n} \alpha_j a_j = 0  only if  α_1 = α_2 = ⋯ = α_n = 0

in words, the only linear combination of the elements that equals zero is the one in which all the scalars vanish.

Subspaces

A subspace of a linear vector space V is a subset of V that is itself an LVS, meaning all the laws apply. For example, R^n is a subspace of R^{n+1}, or C^{n+1}[a, b] is a subspace of C^n[a, b].
2.4 Functionals

A functional assigns a scalar to an element of a vector space. Some examples:

    \int_a^b g_i(x) \, m(x) \, dx, \qquad m \in C^0[a, b]

    D_2[f] = \left. \frac{d^2 f}{dx^2} \right|_{x=0}, \qquad f \in C^2[a, b]

    N_1[x] = |x_1| + |x_2| + \cdots + |x_n|, \qquad x \in R^n

There are two kinds of functionals that will be relevant to our work: linear functionals and norms. We will devote a section to the second one later.
2.4.1 Linear functionals

A linear functional can be written as an inner product; for vectors, for example,

    \sum_i x_i y_i

which is an example of an inner product. For finite models and data, the general relationship is

    d = g_j m_j

or, for multiple data,

    d_i = G_{ij} m_j

and in some way our forward problem is an inner product between the model and the mathematical theory that generates the data.
2.5 Norms

A norm is a functional ‖f‖ with the properties

    ‖f‖ ≥ 0                                  (2.12)
    ‖αf‖ = |α| ‖f‖                           (2.13)
    ‖f + g‖ ≤ ‖f‖ + ‖g‖   (the triangle inequality)   (2.14)
    ‖f‖ = 0  only if  f = 0                  (2.15)
For x ∈ R^N the so-called p-norms are defined as

    ‖x‖_p = \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p}, \qquad p ≥ 1

with p = 2 giving the Euclidean norm. The regions for which the p-norms are less than unity (‖x‖_p ≤ 1) can be drawn for p = 1, 2, 3, and ∞ (the unit balls of the different norms).

For functions, the corresponding norms are

    ‖f‖_1 = \int_a^b |f(x)| \, dx

    ‖f‖_2 = \left( \int_a^b |f(x)|^2 \, dx \right)^{1/2}

    ‖f‖_∞ = \max_{a ≤ x ≤ b} |f(x)|
and other norms can be designed to measure some aspect of the roughness of the functions, for example

    ‖f‖ = \left( f^2(a) + [f'(a)]^2 + \int_a^b [f''(x)]^2 \, dx \right)^{1/2}

or the Sobolev norm

    ‖f‖_S = \left( \int_a^b \left( w_0(x) f^2(x) + w_1(x) f'(x)^2 \right) dx \right)^{1/2}

This last set of norms will be useful when we try to solve underdetermined problems. They are typically applied to the model rather than the data.
2.5.1 Measuring misfit

In an inverse problem we form the residual

    r = d - \hat{d}    (2.16)

where from our physics we can make data predictions d̂, and we want our predictions to be as close as possible to the acquired data.

What do we mean by small? We use a norm to define how small is small, by making the length of r, namely the norm ‖r‖, as small as possible, e.g.

    L1:  ‖d − d̂‖_1
    L2:  ‖d − d̂‖_2
2.5.2 The condition number

Suppose we have a linear system and a slightly perturbed version of it,

    y = Ax      (2.17)
    y_0 = A x_0    (2.18)

Here, assume the perturbation is small. Note that in real life we have uncertainties in our observations, and we wish to know whether these small errors in the observations are severely affecting our end result.

Using a norm, we wish to know what the effect of the small perturbations is, so using the relations above,

    A(x − x_0) = y − y_0
    (x − x_0) = A^{-1}(y − y_0)
    ‖x − x_0‖ ≤ ‖A^{-1}‖ ‖y − y_0‖ = ‖A^{-1}‖ ‖Ax‖ \frac{‖y − y_0‖}{‖y‖} ≤ ‖A‖ ‖A^{-1}‖ ‖x‖ \frac{‖y − y_0‖}{‖y‖}

so that

    \frac{‖x − x_0‖}{‖x‖} ≤ ‖A‖ ‖A^{-1}‖ \frac{‖y − y_0‖}{‖y‖}    (2.19)

which shows the amount by which a small perturbation in the observations (y) is reflected in perturbations in the resultant estimated model x. For the L2 norm, the condition number of a matrix is κ = σ_max/σ_min, where the σ_i are the singular values (for a symmetric matrix, the eigenvalues) of the matrix in question.
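A short sketch of this error amplification, using a notoriously ill-conditioned matrix as an illustrative stand-in (not an example from the text):

```python
import numpy as np

# An ill-conditioned matrix (illustrative: a small Hilbert matrix)
n = 8
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
x0 = np.ones(n)
y0 = A @ x0

# Perturb the "observations" slightly and solve again
y = y0 + 1e-10 * np.random.default_rng(0).standard_normal(n)
x = np.linalg.solve(A, y)

print("condition number      :", np.linalg.cond(A))
print("relative data error   :", np.linalg.norm(y - y0) / np.linalg.norm(y0))
print("relative model error  :", np.linalg.norm(x - x0) / np.linalg.norm(x0))
```

The relative model error is larger than the relative data error by roughly the condition number, as equation (2.19) suggests.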
Chapter 3
Linear Regression
Sometimes we will talk about the term inverse problem, while other people will prefer the term regression. What is the difference? In practice, none. When we are dealing with a function-fitting procedure that can be cast as an inverse problem, the procedure is often referred to as a regression. In fact, economists use regressions quite extensively.

Finding a parameterized curve that approximately fits a set of data points is referred to as regression. For example, the parabolic trajectory problem is defined by

    y(t) = m_1 + m_2 t - \frac{1}{2} m_3 t^2

where y(t) represents the altitude of the object at time t, and the three (N = 3) model parameters m_i are associated with the constant, slope, and quadratic terms. Note that even if the function is quadratic, the problem in question is linear in the three parameters.
If we have M discrete measurements y_i at times t_i, the linear regression problem or inverse problem can be written in the form

    \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix}
    =
    \begin{pmatrix}
    1 & t_1 & -\tfrac{1}{2} t_1^2 \\
    1 & t_2 & -\tfrac{1}{2} t_2^2 \\
    \vdots & \vdots & \vdots \\
    1 & t_M & -\tfrac{1}{2} t_M^2
    \end{pmatrix}
    \begin{pmatrix} m_1 \\ m_2 \\ m_3 \end{pmatrix}

When the regression model is linear in the unknown parameters, we call this a linear regression or linear inverse problem.
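A minimal sketch of setting up and solving this regression; the times, true parameters, and noise level are invented for illustration:

```python
import numpy as np

# Illustrative times and noisy "altitude" data for the parabolic trajectory
t = np.linspace(0.0, 10.0, 11)
m_true = np.array([10.0, 100.0, 9.8])                    # assumed true m1, m2, m3
y = m_true[0] + m_true[1] * t - 0.5 * m_true[2] * t**2
y += np.random.default_rng(1).normal(0.0, 1.0, t.size)   # add noise

# Build the M x 3 design matrix of the linear regression
G = np.column_stack([np.ones_like(t), t, -0.5 * t**2])

# Least squares solution (equivalent to solving the normal equations)
m_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
print(m_hat)
```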
3.2 Least squares

We start the application of all those terms we have learned above by looking at an overdetermined linear problem (more equations than unknowns) involving the simplest of norms, the L2 or Euclidean norm.
Suppose we are given a collection of M measurements of a property to form a vector d ∈ R^M. From our geophysics we know the forward problem, such that we can predict the data from a known model m ∈ R^N. That is, we know the N vectors g_k ∈ R^M such that

    d = \sum_{k=1}^{N} g_k m_k = G m    (3.1)

where

    G = [g_1, g_2, \ldots, g_N]

We are looking for a model m̂ that minimizes the size of the residual vector, defined as

    r = d - \sum_{k=1}^{N} g_k \hat{m}_k

We do not expect to have an exact fit, so there will be some error, and we use a norm to measure the size of the residual,

    ‖r‖ = ‖d − G m̂‖

For the least squares problem we use the L2 or Euclidean norm,

    ‖r‖_2 = \sqrt{ \sum_{k=1}^{M} r_k^2 }
As a simple example, suppose the data d = [d_1, d_2, ..., d_M]^T are fit by a single constant m, so that r_i = d_i − m. Then

    \sum_{k=1}^{M} r_k^2 = \sum_{k=1}^{M} (d_k - m)^2
                         = \sum_{k=1}^{M} \left( d_k^2 - 2 m d_k + m^2 \right)
                         = \sum_{k=1}^{M} d_k^2 - 2 m \sum_{k=1}^{M} d_k + M m^2
Now, to minimize the residual, we take the derivative with respect to the model m and set it to zero,

    0 = -2 \sum_{k=1}^{M} d_k + 2 M m
    \quad\Rightarrow\quad
    \hat{m} = \frac{1}{M} \sum_{k=1}^{M} d_k

which shows that the sample mean is the result of a least squares solution for the measurements.

The corresponding estimate that minimizes the L1 norm is the median. Note that the median is not found by a linear operation on the data, which is a general feature of L1-norm estimates.
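A two-line illustration of this difference, with made-up data containing one outlier:

```python
import numpy as np

d = np.array([1.1, 0.9, 1.0, 1.2, 0.8, 15.0])   # one large outlier

print("L2 estimate (mean)  :", d.mean())        # pulled toward the outlier
print("L1 estimate (median):", np.median(d))    # robust to the outlier
```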
3.2.1 General LS Solution

The predicted data d̂ is a linear combination of the g_k's. Using linear vector space theory, we can show that the predicted data d̂ must lie in the estimation space, the set of ALL possible results that G can produce (its range).
Setting the L2 norm for the residuals between the data and the prediction,

    ‖r‖_2^2 = ‖d − d̂‖_2^2 = r^T r = (d - G\hat{m})^T (d - G\hat{m})
            = d^T d - 2 \hat{m}^T G^T d + \hat{m}^T G^T G \hat{m}

and taking the derivative with respect to the model and setting it to zero,

    \frac{d}{d\hat{m}} ‖r‖_2^2 = 0 = -2 G^T d + 2 G^T G \hat{m}

Note that it is worth pointing out that the derivative is of a scalar with respect to a vector. We will show below that this works as simply as it appears, by writing out all the components. Simplifying a bit more,

    G^T d = G^T G \hat{m}    (3.2)

    \hat{m} = (G^T G)^{-1} G^T d
Writing the misfit out in components,

    \sum_{j=1}^{M} r_j^2 = \sum_{j=1}^{M} \left( d_j - \sum_{i=1}^{N} g_{ji} m_i \right)^2

and differentiating with respect to a particular model parameter m_k,

    \frac{d}{dm_k} ‖r‖_2^2
      = \frac{d}{dm_k} \sum_{j=1}^{M} \left( d_j - \sum_{i=1}^{N} g_{ji} m_i \right) \left( d_j - \sum_{l=1}^{N} g_{jl} m_l \right)
      = \frac{d}{dm_k} \sum_{j=1}^{M} \left[ d_j d_j - d_j \sum_{l=1}^{N} g_{jl} m_l - d_j \sum_{i=1}^{N} g_{ji} m_i + \sum_{i=1}^{N} \sum_{l=1}^{N} g_{ji} g_{jl} m_i m_l \right]

The derivative of the linear terms gives

    -2 \sum_{j=1}^{M} d_j g_{jk} \;\leftrightarrow\; -2\, G^T d

while for the quadratic term,

    \frac{d}{dm_k} \sum_{j=1}^{M} \sum_{i,l=1}^{N} g_{ji} g_{jl} m_i m_l
      = \sum_{j=1}^{M} \sum_{l=1}^{N} \sum_{i=1}^{N} \left( \delta_{ik}\, g_{ji} g_{jl} m_l + \delta_{lk}\, g_{ji} g_{jl} m_i \right)
      = \sum_{j=1}^{M} \left[ \sum_{l=1}^{N} g_{jk} g_{jl} m_l + \sum_{i=1}^{N} g_{ji} g_{jk} m_i \right]

and now note that the two terms are equal by symmetry (relabelling i and l). So in the end we will have

    2 \sum_{j=1}^{M} \sum_{i=1}^{N} m_i\, g_{jk} g_{ji} \;\leftrightarrow\; 2\, G^T G m

and we have derived the same result as before for the normal equations.
3.2.2 The normal equations

    \hat{m} = (G^T G)^{-1} G^T d    (3.3)

The normal equations can be rearranged as

    G^T (G\hat{m}) = G^T d
    G^T (d - G\hat{m}) = 0

and, recalling the definition of the residual,

    G^T (d - G\hat{m}) = G^T r = 0

So, in other words, the normal equations in the least squares sense mean that

    G^T r = \begin{pmatrix} g_{.1} \cdot r \\ g_{.2} \cdot r \\ \vdots \\ g_{.N} \cdot r \end{pmatrix}
          = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}

suggesting that the residual vector is orthogonal to every one of the column vectors of the G matrix. The key thing here is that making the residual perpendicular to the estimation sub-space minimizes the length of r.
Geometrically, the prediction d̂ = Gm̂ is the projection of the data d onto the column subspace of G, with projection matrix

    P_G = G (G^T G)^{-1} G^T
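A small sketch verifying these two facts (the 3-point line fit below is an invented example):

```python
import numpy as np

# Illustrative overdetermined problem: straight-line fit with 3 data points
G = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
d = np.array([0.1, 1.1, 1.9])

m_hat = np.linalg.solve(G.T @ G, G.T @ d)   # normal equations
d_hat = G @ m_hat
r = d - d_hat

P = G @ np.linalg.inv(G.T @ G) @ G.T        # projection onto the column space of G
print(np.allclose(P @ d, d_hat))            # True: the prediction is the projection of d
print(G.T @ r)                              # ~[0, 0]: residual orthogonal to the columns of G
```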
As an example, consider fitting a straight line to three data points,

    \begin{pmatrix} \hat{d}_1 \\ \hat{d}_2 \\ \hat{d}_3 \end{pmatrix}
    =
    \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{pmatrix}
    \begin{pmatrix} \hat{m}_1 \\ \hat{m}_2 \end{pmatrix}

or

    \hat{d}_1 = g_{11} \hat{m}_1 + g_{12} \hat{m}_2
    \hat{d}_2 = g_{21} \hat{m}_1 + g_{22} \hat{m}_2
    \hat{d}_3 = g_{31} \hat{m}_1 + g_{32} \hat{m}_2

with residual r = d − d̂. Reorganizing, we have

    d = \hat{d} + r, \qquad r \perp \hat{d}
3.2.3 Maximum Likelihood

We can also use the Maximum Likelihood method in order to interpret the Least Squares method and the normal equations. This technique was developed by R. A. Fisher in the 1920s and has dominated the field of statistical inference since then. Its power is that it can (in principle) be applied to any type of estimation problem, provided that one can write down the joint probability distribution of the random variables which we are assuming model the observations.

Maximum likelihood looks for the optimum values of the unknown model parameters as those that maximize the probability that the observed data are due to the model, from a probabilistic point of view.

Suppose we have a random sample of M observations x = x_1, x_2, ..., x_M drawn from a probability density function (PDF) f(x_i, θ), where the parameter θ is unknown. We can extend this to a set of model parameters, f(x_i, m). The joint probability for all M observations is

    f(x, m) = f(x_1, m) f(x_2, m) \cdots f(x_M, m) = L(x, m)
Figure 3.2: The LS fit for a straight line. The estimation space is the straight line given by G\hat{m}; this is where all predictions will lie. The real measurements d_k lie above or below this line, and are in effect projected onto the line via the residuals.
For normally distributed errors with mean m and variance σ², the likelihood of the data is

    L(d, m) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(d_i - m)^2}{2\sigma^2} \right)

We maximize

    \max_m \ln\{L(d, m)\}
    \qquad
    \ln L(d, m) = -\frac{M}{2}\ln(2\pi) - M \ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{M}(d_i - m)^2

Setting the derivative with respect to m to zero,

    0 = \frac{1}{\sigma^2}\sum_{i=1}^{M}(d_i - m)
    \quad\Rightarrow\quad
    \sum_{i=1}^{M} d_i = M m

so that

    \hat{m} = \frac{1}{M}\sum_{i=1}^{M} d_i
We can also look for the maximum likelihood estimate of the variance σ²:

    0 = -\frac{M}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{M}(d_i - m)^2

and we get

    \hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}(d_i - m)^2
For the full linear problem with independent Gaussian errors of standard deviations σ_i, the likelihood is

    L = \frac{1}{(2\pi)^{M/2} \prod_{i=1}^{M}\sigma_i} \prod_{i=1}^{M} \exp\left( -\frac{(d_i - (Gm)_i)^2}{2\sigma_i^2} \right)
We want to maximize the function above, thus the constant term has no effect, leading to

    \max_m L = \max_m \exp\left( -\sum_{i=1}^{M} \frac{(d_i - (Gm)_i)^2}{2\sigma_i^2} \right)

which is equivalent to minimizing

    (d - Gm)^T \Sigma^{-1} (d - Gm)

Expanding and differentiating with respect to m,

    \frac{\partial}{\partial m}\left[ d^T \Sigma^{-1} d - 2 m^T G^T \Sigma^{-1} d + m^T G^T \Sigma^{-1} G m \right]
    = 0 = -2 G^T \Sigma^{-1} d + 2 G^T \Sigma^{-1} G m

finally leading to

    \hat{m} = (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1} d

which comes from the sometimes-called generalized normal equations

    (G^T \Sigma^{-1} G)\,\hat{m} = G^T \Sigma^{-1} d    (3.4)
3.3 Why least squares?

As you might have expected, the choice of norm is somewhat arbitrary. So why is the use of least squares so popular?

1. Least squares estimates are linear in the data and easy to program.
Figure 3.3: Schematic of a straight-line fit for (x, d) data points under the L1, L2, and L∞ norms. The L1 fit is not as affected by the single outlier.
2. It corresponds to the maximum likelihood estimate for normally distributed errors. The normal distribution comes from the central limit theorem: add up random effects and you get a Gaussian.

3. The estimate is linear in the data, so errors propagate as a linear mapping from the input (data) statistics.

4. Well-known statistical tests and confidence intervals can be obtained.

It has some disadvantages too. The main one is that the result is sensitive to outliers (see Figure 3.3).

Another popular norm is the L1 norm. Some characteristics include:

1. It is non-linear, solved by linear programming (to be seen later).
2. It is less sensitive to outliers.
3. Confidence intervals and hypothesis testing are somewhat more difficult, but can be done.
3.4 Summary

We have arrived at the normal equations in three ways:

1. Requiring the residual to be orthogonal to the predictions Gm̂,

    G^T (d - G\hat{m}) = G^T r = 0

which leads to

    \hat{m} = (G^T G)^{-1} G^T d

2. Minimizing the squared misfit,

    \frac{\partial}{\partial m}\left[ (d - Gm)^T (d - Gm) \right] = \frac{\partial}{\partial m}(r^T r) = 0
    \quad\Rightarrow\quad
    G^T (d - G\hat{m}) = 0

leading to

    \hat{m} = (G^T G)^{-1} G^T d

3. Maximum likelihood for a multivariate normal distribution:

    Maximize:  \exp\left( -(d - Gm)^T \Sigma^{-1} (d - Gm) \right)
    Minimize:  (d - Gm)^T \Sigma^{-1} (d - Gm)

which led to

    \hat{m} = (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1} d

from the generalized normal equations.
3.5
We come back to the general line fit problem, where we have two unknows,
the intercept m1 and the slope m2 . We have M observations di . The inverse
problem is
d = Gm
and the indexed matrix is then
d1
d2
.. =
.
dM
1
1
..
.
x1
x2
..
.
xM
m1
m2
As you are already aware of, the least squares solution of this problem is
= (GT G)1 GT d
m
which we are doing explicitly.
35
    G^T d =
    \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{pmatrix}
    \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{pmatrix}
    =
    \begin{pmatrix} \sum_{i=1}^{M} d_i \\ \sum_{i=1}^{M} x_i d_i \end{pmatrix}

    G^T G =
    \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{pmatrix}
    \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{pmatrix}
    =
    \begin{pmatrix} M & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}

so that

    (G^T G)^{-1} =
    \frac{1}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
    \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix}

and

    \hat{m} =
    \frac{1}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
    \begin{pmatrix} \sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i \\ M \sum_i x_i d_i - \sum_i x_i \sum_i d_i \end{pmatrix}

The covariance of the estimates is

    \mathrm{cov}(\hat{m}) = \sigma^2 (G^T G)^{-1}
    = \frac{\sigma^2}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
    \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix}
where σ² is the variance of the individual data. This equation shows that even if the data d_i are uncorrelated, the model parameters can be correlated:

    \mathrm{cov}(m_1, m_2) = \frac{-\sigma^2 \sum_i x_i}{M \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \neq 0

If instead we measure the positions relative to their mean, y_i = x_i − x̄ (so that Σ_i y_i = 0), then

    \mathrm{cov}(\hat{m}) = \sigma^2
    \begin{pmatrix} \frac{1}{M} & 0 \\ 0 & \frac{1}{\sum_i y_i^2} \end{pmatrix}

This new relation now shows independent intercept and slope, and if σ is the standard error in the observed data, then:

- Standard error of the intercept: σ/√M — with more data you reduce the variance of the intercept.
- Standard error of the slope: σ/\sqrt{\sum_i y_i^2} — showing that if the observation points on the x axis are close together, the uncertainties in the slope estimate are greater.
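The following sketch puts the explicit formulas together for synthetic data; the x positions, true line, and σ are assumed for illustration:

```python
import numpy as np

# Illustrative straight-line fit d = m1 + m2 * x, with covariance of the estimates
rng = np.random.default_rng(2)
x = np.linspace(0.0, 9.0, 10)
sigma = 0.5                                    # assumed standard error of each datum
d = 1.0 + 2.0 * x + rng.normal(0.0, sigma, x.size)

G = np.column_stack([np.ones_like(x), x])
GtG_inv = np.linalg.inv(G.T @ G)
m_hat = GtG_inv @ G.T @ d

cov_m = sigma**2 * GtG_inv                     # cov(m) = sigma^2 (G^T G)^{-1}
print("intercept, slope:", m_hat)
print("covariance of the estimates:\n", cov_m)  # off-diagonal != 0 unless sum(x_i) = 0

# Centering x (y_i = x_i - mean) makes the intercept and slope estimates uncorrelated
```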
Chapter 4
Tikhonov Regularization, variance and resolution

4.1 Tikhonov Regularization
Tikhonov regularization is one of the most common methods used for regularizing an inverse problem. The reason for doing this is that in many cases the inverse problem is ill-posed, and small errors in the data will give very large errors in the resultant model.

Another possible reason for using this method is if we have a mixed-determined problem, where for example we might have a model null space. For the overdetermined part, we would like to minimize the residual vector,

    \min ‖r‖ \quad\Rightarrow\quad \hat{m} = (G^T G)^{-1} G^T d

while for the underdetermined case we actually minimize the model norm,

    \min ‖m‖ \quad\Rightarrow\quad \hat{m} = G^T (G G^T)^{-1} d

and of course, for the mixed-determined case, we will be trying something in between,

    \Phi(m) = ‖d - Gm‖_2^2 + \alpha^2 ‖m‖_2^2

and as we have seen before, we want to minimize

    \min_m \Phi(m) = \min_m \left\| \begin{pmatrix} G \\ \alpha I \end{pmatrix} m - \begin{pmatrix} d \\ 0 \end{pmatrix} \right\|_2^2

or equivalently, the minimizer is

    \hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d

So the question is, what do we choose for α? If we choose α very large, we are focusing our attention on minimizing the model norm ‖m‖ while neglecting the residual norm. If we choose α too small, we are doing the complete contrary: we are trying to fit the data perfectly, which is probably not what we want.
A graphical way to see how the two norms interact depending on the choice of α is shown in Figure 5.1 of our book. The idea is that as the residual norm increases, the model norm decreases, leading to the so-called L-curve. This is because ‖m‖_2 is a strictly decreasing function of α, while ‖d − Gm‖_2 is a strictly increasing function of α.

Our job now is to find an optimal value of α. There are a few methods we are going to see that find the optimal α: the discrepancy criterion, the L-curve criterion, and cross-validation. Before going there, we want to understand the effect of the choice of α on the resolution of the estimate as well as on the covariance of the model parameters. Similarly, we want to understand the choice of the number of singular values used in solving the generalized inverse using the SVD, and the SVD implementation of the Tikhonov regularization. Finally, we will see how other norms can be chosen in order to penalize models with excessive roughness or curvature.
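A compact sketch of zeroth-order Tikhonov and the trade-off between the two norms. The test kernel, true model, and noise level are invented for illustration and are not the Shaw problem from the book:

```python
import numpy as np

def tikhonov(G, d, alpha):
    """Zeroth-order Tikhonov (damped least squares) solution."""
    N = G.shape[1]
    return np.linalg.solve(G.T @ G + alpha**2 * np.eye(N), G.T @ d)

# Illustrative ill-conditioned smoothing problem
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
G = np.exp(-5.0 * np.abs(x[:, None] - x[None, :])**2) * (x[1] - x[0])  # smoothing kernel
m_true = np.exp(-((x - 0.4) / 0.05)**2)
d = G @ m_true + 1e-4 * rng.standard_normal(x.size)

# Scan alpha and record the two norms that trade off along the L-curve
for alpha in [1e-6, 1e-4, 1e-2, 1.0]:
    m = tikhonov(G, d, alpha)
    print(alpha, np.linalg.norm(d - G @ m), np.linalg.norm(m))
```

As α grows, the residual norm increases while the model norm shrinks, tracing out the L-curve described above.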
4.2 SVD Implementation

Using our previous expression, but introducing the SVD of the G matrix, namely

    G = U S V^T

and from above,

    (G^T G + \alpha^2 I)\hat{m} = G^T d

we can substitute to get

    (V S^2 V^T + \alpha^2 I)\hat{m} = V S U^T d

and the solution is

    \hat{m} = \sum_i f_i \, \frac{u_i^T d}{s_i} \, v_i,
    \qquad
    f_i = \frac{s_i^2}{s_i^2 + \alpha^2}

The f_i are called filter factors. For s_i ≫ α the factor f_i ≈ 1, and these are essentially the terms we would have kept had we chosen a truncation level p of large singular values. In contrast, for s_i ≪ α the factor f_i → 0, and this part of the solution is damped out, or downweighted.
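The same solution can be assembled directly from the SVD; a minimal sketch (with a tiny made-up G and d) that also checks it against the normal-equations form:

```python
import numpy as np

def tikhonov_svd(G, d, alpha):
    """Tikhonov solution assembled from the SVD, m = sum_i f_i (u_i^T d / s_i) v_i."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    f = s**2 / (s**2 + alpha**2)   # filter factors: ~1 for s_i >> alpha, ~0 for s_i << alpha
    m = Vt.T @ (f * (U.T @ d) / s)
    return m, f

# Tiny illustrative check against the normal-equations form
G = np.array([[1.0, 0.5], [0.5, 1.0], [0.2, 0.1]])
d = np.array([1.0, 2.0, 0.3])
alpha = 0.1
m_svd, f = tikhonov_svd(G, d, alpha)
m_ne = np.linalg.solve(G.T @ G + alpha**2 * np.eye(2), G.T @ d)
print(np.allclose(m_svd, m_ne))   # True
print(f)
```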
4.3 Resolution vs variance, the choice of α, or p

From previous lectures, we can now discuss the resolution and variance of our resultant model using the generalized inverse, in this case the Tikhonov regularization. We had

    \hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d = G^{\#} d
            = V F S^{-1} U^T d
            = V_p S_p^{-1} U_p^T d

where the first two expressions use the general Tikhonov regularization, the third is the SVD with filter factors, and the last one is the result if we choose a number p of singular values.

The model resolution matrix R_m was defined via

    \hat{m} = G^{gen} d = G^{gen} G \, m_{true} = R_m \, m_{true}

and is then, for the three cases,

    R_{m,\alpha} = G^{\#} G
    R_{m,\alpha} = V F V^T
    R_{m,p} = V_p V_p^T

In all regularizations R_m ≠ I, the estimate will be biased, and m̂ ≠ m_true. The bias introduced by regularizing is

    \hat{m} - m_{true} = [R_m - I]\, m_{true}

but since we don't know m_true, we don't know the sense of the bias. We can't even bound the bias, since it depends on the true m as well.
Similarly, the model covariance is

    \mathrm{cov}(\hat{m}) = G^{\#} \,\mathrm{cov}(d)\, G^{\#T}

and, for uncorrelated data with variance σ²,

    \mathrm{cov}_{m,\alpha}(\hat{m}) = \sigma^2 G^{\#} G^{\#T}
    \mathrm{cov}_{m,\alpha}(\hat{m}) = \sigma^2 V F^2 S^{-2} V^T
    \mathrm{cov}_{m,p}(\hat{m}) = \sigma^2 V_p S_p^{-2} V_p^T

We could use this to evaluate confidence intervals or ellipses on the model but, since the model is biased by an unknown amount, the confidence intervals might not be representative of the true deviation of the estimated model.
4.3.1 Example 1: Shaw's problem

In this example I would like to show a practical application of the Tikhonov regularization, using both the general approach (the generalized matrix explicitly) and the SVD. I take the examples from Aster's book directly. In the Shaw problem, the measured data are the diffracted light intensity as a function of outgoing angle, d(s), where the angle is −π/2 ≤ s ≤ π/2. We use the discretized version of the problem as outlined in the book; namely, the mathematical model relating the observed data d and the model vector m is

    d = Gm

where d ∈ R^M and m ∈ R^N, but in our example we will have M = N. The G matrix is defined for the discrete case using the angles

    s_i = \frac{(i - 0.5)\pi}{N} - \frac{\pi}{2}, \qquad i = 1, 2, \ldots, N.
We estimate the model m̂ in two ways,

    \hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d
    \qquad\text{and}\qquad
    \hat{m} = V_p S_p^{-1} U_p^T d

where in the first case we need to choose a value of α, while in the second case (SVD) we need to choose a value of p, the number of singular values and vectors to use. Since the singular values are rarely exactly zero, the choice is not so easy to make. In addition to making a particular choice, we need to understand what effect our choice has on the model resolution and model covariance. In the next figures I present the results graphically in order to get an intuitive understanding of our choices.
Figure 4.1: Some models using the generalized inverse. Top-left: the L-curve for the residual norm ‖d − Gm‖ and model norm ‖m‖. Various choices of α are used, and the colored dots are three choices made. Top-right: true model (circles) and the estimated models for the three choices on the left. Bottom-left: the synthetic data (circles) and the three predicted data sets. Bottom-right: the resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.
42
10
1.5
10
real model
p = 14
p=8
p=2
|uTid|
|uTd/s |
i
10
1
0
10
Intensity
||dGm||
10
10
0.5
10
10
10
0
15
10
10
5
10
15
Number of Singular Value (p)
20
20
0.5
1.5
0.5
0.5
1.5
Resolution 14
10
10
12
Singular Value
Resolution 8
14
16
18
20
Resolution 2
20
20
20
15
15
15
10
10
10
0.7
real model
p = 14
p=8
p=2
0.6
0.5
Observed Intensity
10
15
20
10
15
20
10
15
20
0.4
0.3
Covariance 14
Covariance 8
Covariance 2
20
20
20
15
15
15
10
10
10
0.2
0.1
0.1
1.5
0.5
0
Outgoing angle
0.5
1.5
10
15
20
10
15
20
10
15
20
Figure 4.2: Some models using the Generalized inverse. Top-Left: The L-curve
for the residual norm and model norm. Various choices of are used, and the
colored dots are three choices made. Top-Right: True model (circles) and the
estimated models for the three choices on the left. Bottom-Left: The synthetic
data (circles) and the three predicted data. Bottom-Right: The resolution (top
panels) and covariance (bottom panels) matrices for the three choices. White
represents large amplitudes, black represents lower amplitudes.
4.4 Smoothing Norms or Higher-Order Tikhonov

Very often we seek solutions that minimize the misfit, but also some measure of the roughness of the solution. In some cases, when we minimize the minimum-norm solution with

    ‖f‖^2 = \int_a^b f(x)^2 \, dx

we may get the unwanted consequence of putting the estimated model only where we happen to have data. Instead, our geophysical intuition might suggest that the solution should not be very rough, so we minimize instead

    ‖f‖^2 = \int_a^b f'(x)^2 \, dx, \qquad f(a) = 0

where we need to add a boundary condition (the condition on the right). The boundary condition is needed since the derivative norm is insensitive to constants; that is, the norm of f + b is equal to the norm of f. This means we really have a semi-norm.
4.4.1 The discrete Case

Assuming the model parameters are ordered in physical space (e.g., with depth, or with lateral distance), we can define difference operators of the form

    D_1 = \begin{pmatrix}
    -1 & 1 & 0 & 0 & \cdots \\
    0 & -1 & 1 & 0 & \cdots \\
    0 & 0 & -1 & 1 & \cdots \\
    & & \vdots & & \ddots
    \end{pmatrix}

and, for the second derivative,

    D_2 = \begin{pmatrix}
    -2 & 1 & 0 & 0 & \cdots \\
    1 & -2 & 1 & 0 & \cdots \\
    0 & 1 & -2 & 1 & \cdots \\
    & & \vdots & & \ddots
    \end{pmatrix}
There are a few ways to implement this in the discrete case, namely:

1. Minimize a functional of the form

    \Phi(m) = ‖d - Gm‖_2^2 + \alpha^2 ‖Dm‖_2^2

which leads to

    \hat{m} = \left( G^T G + \alpha^2 D^T D \right)^{-1} G^T d

Note the similarity with our previous results, where instead of the matrix D^T D we had the identity matrix I. (A small code sketch of this approach is given after this list.)

2. Alternatively, we can try to solve the coupled system of equations

    \begin{pmatrix} d \\ 0 \end{pmatrix} = \begin{pmatrix} G \\ \alpha D \end{pmatrix} m + \epsilon

and rewrite this in a simplified way,

    d' = H m + \epsilon

which is the standard expression for the inverse problem to be solved. Due to the effect of the D matrix, the ill-posedness of the original expression can be significantly reduced (depending on the chosen value of α). The advantage of this approach is that one can impose additional constraints, like non-negativity.

3. We can also transform the system of equations in a similar way:

    d = Gm + \epsilon
    d = G D^{-1} D m + \epsilon
    d = G' m' + \epsilon

with

    G' = G D^{-1}, \qquad m' = D m

As you can see, we have not changed the condition of fitting the data, so that

    ‖d - G' m'‖^2 = ‖d - Gm‖^2

but we have also added a model norm of the form

    ‖m'‖^2 = ‖Dm‖^2

Note that for this to actually work, the matrix D needs to be invertible. Sometimes it is possible to do this analytically. We can also use the SVD at this stage.
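Here is a sketch of the first approach using simple difference operators. Note that these D1 and D2 are the (n−1)×n and (n−2)×n versions without the boundary rows shown above, and the test problem is invented for illustration:

```python
import numpy as np

def roughening_matrices(n):
    """First- and second-difference operators for a model of length n (no boundary rows)."""
    D1 = np.diff(np.eye(n), axis=0)        # (n-1) x n, rows like [-1, 1, 0, ...]
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n, rows like [1, -2, 1, 0, ...]
    return D1, D2

def higher_order_tikhonov(G, d, D, alpha):
    """Minimize ||d - Gm||^2 + alpha^2 ||Dm||^2 (approach 1 in the list above)."""
    return np.linalg.solve(G.T @ G + alpha**2 * D.T @ D, G.T @ d)

# Illustrative use with a random G, just to show the mechanics
rng = np.random.default_rng(4)
n = 10
G = rng.standard_normal((15, n))
m_true = np.sin(np.linspace(0, np.pi, n))
d = G @ m_true + 0.01 * rng.standard_normal(15)
D1, D2 = roughening_matrices(n)
print(higher_order_tikhonov(G, d, D2, alpha=0.5))
```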
As a cautionary note, it is important to keep in mind that the Tikhonov regularization will recover the true model only to the extent that the assumptions behind the additional norm (be it ‖m‖ or ‖Dm‖) are correct. We would not expect to get the right answer in the previous examples, since the true model m_true is a delta function.
4.5 Fitting within tolerance

In real life, the data that we have acquired have some level of uncertainty. This means there is some random error which we do not know, but whose statistical distribution we think we know (e.g., normally distributed with zero mean and variance σ²). So, in this respect, we should not try to fit the data exactly, but rather fit them to within the error bars.

This method is sometimes called the discrepancy principle, but I prefer to use the term fitting within tolerance. In our inverse problem we want to minimize a functional with two norms,

    \min ‖Dm‖ \qquad \min ‖d - Gm‖

and to do that we were looking at the L-curve, using the damped least squares or the SVD approach, that is, choosing an α or a number p of non-zero singular vectors.

In fact, for data with uncertainties we should actually be looking at a system that looks like

    \min ‖Dm‖ \qquad \text{subject to} \qquad ‖d - Gm‖ ≤ T
4.5.1 Example 2

First, we need to figure out the value of T. In our example, we said that the errors were

    \epsilon \sim N(0, \sigma^2), \qquad \sigma = 10^{-6}

Since we have M = 20 points, we need to find a solution whose residual norm is

    T = ‖\epsilon‖_2 = \sqrt{ \sum_{i=1}^{20} \sigma_i^2 } = \sqrt{20 \times 10^{-12}} = 4.47 \times 10^{-6}
Now that we have our value of the tolerance T, we can go back to our initial problem,

    \min_m ‖m‖_2 \qquad \text{subject to} \qquad ‖d - Gm‖_2 ≤ T

and find the value of α, or the ideal value of p, that satisfies our new criterion.
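A minimal sketch of this selection rule; the helper below assumes the damped-least-squares form and a user-supplied list of candidate α values:

```python
import numpy as np

# Discrepancy principle ("fitting within tolerance"): pick the regularization level
# whose residual norm is closest to T = sqrt(M) * sigma.  Values below are illustrative.
M, sigma = 20, 1e-6
T = np.sqrt(M) * sigma
print("target residual norm T =", T)   # ~4.47e-6

def pick_alpha(G, d, alphas, T):
    """Scan candidate alphas and return the one whose misfit is closest to T."""
    def misfit(a):
        m = np.linalg.solve(G.T @ G + a**2 * np.eye(G.shape[1]), G.T @ d)
        return np.linalg.norm(d - G @ m)
    return min(alphas, key=lambda a: abs(misfit(a) - T))

# Example usage (with some G, d at hand):
# alpha_best = pick_alpha(G, d, alphas=np.logspace(-8, 0, 50), T=T)
```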
In this example I will use the same graphical presentation as in the previous example. Now, in addition to the L-curves obtained for the SVD and the Damped Least Squares methods, we have our threshold value T represented by a vertical dashed line. We pick the value on the L-curve that is closest to T. In the SVD approach, since we have discrete singular values, we choose the one that is closest, while for the DLS we can in fact get really close. In both cases, I just show approximate values, using my discretization of the α values used for plotting the figure.
Figure 4.3: Fitting within tolerance with the DLS and SVD approaches. Our preferred model is the blue one. Top-left: the L-curve for the residual norm and model norm; the SVD curve has been shifted upwards for clarity, and the value of T is shown as a vertical dashed line. Various choices of α around T are chosen. Top-right: true model (circles) and estimated models for the choices on the left. Bottom-left: the synthetic data (circles) and predicted data. Bottom-right: for the SVD, the singular values and Picard criterion are shown.
Figure 4.4: Resolution matrix and covariance matrix for the DLS (top two rows of panels) and SVD (bottom two rows of panels) approaches, while fitting within tolerance. Note that since the SVD approach is discrete in nature, we might not get an ideal selection, hence the repeated value of p. Using the filter-factor approach might lead to better results. Our preferred value is the column in the middle.