Contents

1 Introduction to inverse theory
    1.1 Why is the inverse problem more difficult?
        1.1.1 Example: Non-uniqueness
    1.2 So, what can we do?
        1.2.1 Example: Instability
        1.2.2 Example: Null space
    1.3 Some terms

4 Tikhonov Regularization, variance and resolution
    4.1 Tikhonov Regularization
    4.2 SVD Implementation
    4.3 Resolution vs variance, the choice of α, or p
        4.3.1 Example 1: Shaw's problem
    4.4 Smoothing Norms or Higher-Order Tikhonov
        4.4.1 The discrete Case
    4.5 Fitting within tolerance
        4.5.1 Example 2
Chapter 1

Introduction to inverse theory
In geophysics we are often faced with the following situation: We have measurements at the surface of the Earth of some quantity (magnetic field, seismic
waveforms) and we want to know some property of the ground under the place
where we made the measurements. Inverse theory is a method to infer the
unknown physical properties (model) from these measurements (data).
This class is called Geophysical Inverse Theory (GIT) because it is assumed
we understand the physics of the system. That is, if we knew the properties
accurately, we would be able to reconstruct the observations that we have taken.
First, we need to be able to solve the forward problem
    d_i = G_i(m)    (1.1)
1.1 Why is the inverse problem more difficult?

Some examples of the data we measure:

- Altitude/Bathymetry measurements
- Magnetic field at the surface
- Gravity measurements
- Waveforms / Geodetic motion
- Arrival times / Waveforms
1.1.1 Example: Non-uniqueness
Imagine we want to describe the Earth's velocity structure. The forward problem could be described as an expansion in basis functions,

    v(\theta, \phi, r) = \sum_{l} \sum_{m} \sum_{n} a_{lmn} Y_{lm}(\theta, \phi) Z_n(r)    (1.3)

    d_i = \sum_{l} \sum_{m} \sum_{n} a_{lmn} Y_{lm}^{(i)} Z_n^{(i)}(r)    (1.4)

where i = 1, ..., M. We have an infinite number of parameters a_{lmn} to determine, leading to the non-uniqueness problem.
A commonly used strategy is to drastically oversimplify the model, for example

    v^{(i)}(\theta, \phi, r) = \sum_{l=0}^{6} \sum_{m=-l}^{l} \sum_{n=0}^{6} a_{lmn} Y_{lm}^{(i)} Z_n^{(i)}(r)    (1.5)

or even further, keeping only a radial dependence,

    v^{(i)}(r) = \sum_{n=0}^{20} Z_n^{(i)}(r) \, a_n    (1.6)

In this case the number of data points is larger than the number of model parameters, M > N, so the problem is overdetermined.
If the oversimplification (i.e., radial dependence only) is justified by observations, this may be a fine approach; but when there is no evidence for this arrangement, even if the data are fit we will be uncertain of the significance of the result. Another problem is that this may unreasonably limit the solution.
1.2 So, what can we do?

1.2.1 Example: Instability
Figure 1.1: Anti-plane slip for an infinitely long strike-slip fault (coordinates x1, x2, x3).
The displacement at the Earth's surface, u(x_1, x_2, x_3), is in the x_1 direction, due to slip S(ξ) as a function of depth ξ,

    u_1(x_2, x_3 = 0) = \frac{1}{\pi} \int_0^{\infty} S(\xi) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.7)

where S(ξ) is the slip along x_1 and varies only with depth x_3. If we had only discrete measurements,

    d_i = u_1(x_2^{(i)}) = \int S(\xi) \, g_i(\xi) \, d\xi    (1.8)

where

    g_i(\xi) = \frac{1}{\pi} \frac{x_2^{(i)}}{\left(x_2^{(i)}\right)^2 + \xi^2}

Now, let's assume that slip occurs only at some depth c, so that S(ξ) = δ(ξ − c):

    d(x_2) = \frac{1}{\pi} \int \delta(\xi - c) \, \frac{x_2}{x_2^2 + \xi^2} \, d\xi    (1.9)
           = \frac{1}{\pi} \frac{x_2}{x_2^2 + c^2}    (1.10)
Figure 1.2: Observations u_1(x_2) at the surface due to concentrated slip at depth c; the response follows x_2/(x_2^2 + c^2).
The results (Figure 1.2) show that:

1. The effect of concentrated slip is spread widely.
2. This will lead to trouble (instability) in the inverse problem.

So even if we did have data at every point on the surface of the Earth, the inverse problem would be unstable. The kernel g_i(ξ) of the functional smooths the focused deformation; the problem lies in the physical model, not really in how you solve it.
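To make the instability concrete, here is a minimal numerical sketch: it discretizes the kernel g_i(ξ) into a matrix and checks its condition number. The grid spacing, depth range, and observation points are illustrative choices, not values from the text.

```python
import numpy as np

# Illustrative discretization of d_i = (1/pi) * integral S(xi) x2_i / (x2_i^2 + xi^2) dxi
x2 = np.linspace(0.1, 10.0, 50)   # assumed surface observation points
xi = np.linspace(0.05, 10.0, 50)  # assumed depths at which slip S(xi) is defined
dxi = xi[1] - xi[0]

# Each row of G is the kernel g_i(xi) evaluated on the depth grid
G = (1.0 / np.pi) * x2[:, None] / (x2[:, None]**2 + xi[None, :]**2) * dxi

# A huge condition number means tiny data errors map into huge slip errors
print("condition number of G:", np.linalg.cond(G))
```

The smooth, nearly parallel rows of G are almost linearly dependent, which is exactly the smoothing of the kernel described above.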
1.2.2 Example: Null space

Consider a gravity anomaly d(s) measured at surface position s, produced by a mass distribution m(x) buried at depth h:

    d(s) = \int \frac{h}{\left[(x - s)^2 + h^2\right]^{3/2}} \, m(x) \, dx    (1.11)
         = \int g(x - s) \, m(x) \, dx    (1.12)
Suppose now we can find a smooth function m⁺(x) such that the integral in (1.12) vanishes, i.e. d(s) = 0. Because of the symmetry of the kernel g(x − s), if we choose m⁺(x) to be a line with a given slope, the observed anomaly d(s) will be zero. The consequence is that we can add such a function m⁺ to the true anomaly,

    m = m_{true} + m^{+}    (1.13)

and the new gravity anomaly profile will match the data just as well as m_true:

    d(s) = \int g(x - s) \, m_{true}(x) \, dx + \int g(x - s) \, m^{+}(x) \, dx    (1.14)
         = \int g(x - s) \, m_{true}(x) \, dx    (1.15)

From the field observations, even if error free and infinitely sampled, there is no way to distinguish between the real anomaly and any member of an infinitely large family of alternatives.
Models m⁺(x) that lie in the null space of g(x − s) are solutions to

    \int g(x - s) \, m(x) \, dx = 0

By superposition, any linear combination of these null-space models can be added to a particular model without changing the fit to the data. This kind of problem does not have a unique answer even with perfect data.
1.3 Some terms

Theory      Determinacy       Examples
Linear      Overdetermined    Line fit
Linear      Underdetermined   Interpolation
Nonlinear   Overdetermined    Earthquake location
Linear      Underdetermined   Fault slip
Nonlinear   Underdetermined   Tomography
Chapter 2
A matrix A ∈ R^{m×n} and a vector x can be written as

    A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & & & \\ \vdots & & \ddots & \\ a_{m1} & & \cdots & a_{mn} \end{pmatrix},
    \qquad
    x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}

We can also think of a matrix as an ordered collection of column vectors,

    A = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix}

where each a_j is a column of A.
There are a number of special matrices to keep in mind. These are useful since some of them are used to get matrix inverses.

Square matrix:        m = n
Diagonal matrix:      a_ij = 0 whenever i ≠ j
Tridiagonal matrix:   a_ij = 0 whenever |i − j| > 1
Upper triangular:     a_ij = 0 whenever i > j
Lower triangular:     a_ij = 0 whenever i < j
Sparse matrix:        most elements are zero
Note that the definition of upper and lower triangular matrices may apply to
non-square matrices as well as square ones.
A zero matrix is a matrix composed of all zero elements. It plays the same role in matrix algebra as the scalar 0:

    A + 0 = A = 0 + A

The unit matrix is the square, diagonal matrix with ones on the diagonal and zeros elsewhere, usually denoted I. Assuming the matrix sizes are compatible,

    AI = A = IA
2.1 Matrix operations

Scalar multiplication A = αB means

    a_{ij} = \alpha \, b_{ij}

where α is a scalar. Another basic manipulation is transposition: B = A^T means b_ij = a_ji.

Matrix multiplication C = AB means

    c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
Notice we can only multiply two matrices when the number of columns (n) in the first one equals the number of rows in the second. The other dimensions are not important, so non-square matrices can be multiplied.

Other standard arithmetic rules are valid, such as distributivity, A(B + C) = AB + AC. Less obviously, associativity of multiplication holds, A(BC) = (AB)C, as long as the matrix sizes permit. But multiplication is not commutative: AB ≠ BA.

A familiar physical example of a matrix-vector product is the relation between angular momentum, the inertia tensor, and angular velocity. In general,

    y = Ax = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix} x = x_1 a_1 + x_2 a_2 + \cdots + x_n a_n    (2.1)

so the product is a linear combination of the columns of A.
For x ∈ R^p and y ∈ R^q, the outer product is the p × q matrix

    x y^T = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}
            \begin{pmatrix} y_1 & y_2 & \cdots & y_q \end{pmatrix}
          = \begin{pmatrix} x_1 y_1 & \cdots & x_1 y_q \\ \vdots & & \vdots \\ x_p y_1 & \cdots & x_p y_q \end{pmatrix}

while the inner product of two vectors of the same length is the scalar

    x^T y = \begin{pmatrix} x_1 & x_2 & \cdots & x_p \end{pmatrix}
            \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
          = x_1 y_1 + x_2 y_2 + \cdots + x_p y_p

The inner product is just the vector dot product of vector analysis.
2.1.1 Transposes and inverses of products

For products of matrices,

    (AB)^T = B^T A^T
    (AB)^{-1} = B^{-1} A^{-1}
    (A^T)^{-1} = (A^{-1})^T

2.1.2 Matrix Inverses

The inverse of a square matrix A is the matrix A^{-1} such that A^{-1}A = I and AA^{-1} = I. The inverse is useful for solving linear systems of algebraic equations. Starting with equation (2.1),

    y = Ax
    A^{-1} y = A^{-1} A x = I x = x
For a diagonal matrix,

    D = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix},
    \qquad
    D^{-1} = \begin{pmatrix} 1/d_1 & 0 & 0 \\ 0 & 1/d_2 & 0 \\ 0 & 0 & 1/d_3 \end{pmatrix}

that is, the inverse of a diagonal matrix is a diagonal matrix with the diagonal elements raised to the power −1.

For the permutation matrix

    P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},
    \qquad
    P^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}

which exchanges rows 2 and 3, the inverse is simple: P is its own inverse.

For the elimination matrix

    E = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

the matrix is not diagonal, but we can use Gaussian elimination, which we will go through next.
2.2 Gaussian Elimination

Consider the system

    2x + y + z = 1
    4x + y     = −2
   −2x + 2y + z = 7

Eliminating x from the second and third equations, and then y from the third, gives successively

    2x + y + z = 1
        −y − 2z = −4
        3y + 2z = 8

and

    2x + y + z = 1
        −y − 2z = −4
           −4z = −4

so that, back-substituting,

    z = 1,  y = 2,  x = −1

or, in matrix form, Ax = b (in index notation, A_ij x_j = b_i),

    \begin{pmatrix} 2 & 1 & 1 \\ 4 & 1 & 0 \\ -2 & 2 & 1 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -2 \\ 7 \end{pmatrix}
We are going to try and get A = LU, where L is a lower triangular matrix and
U is upper triangular with the same Gaussian Elimination steps.
1. Subtract two times the first equation from the second, i.e. multiply both sides by

    E_1 = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}

to get

    \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ -2 & 2 & 1 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -4 \\ 7 \end{pmatrix}

or, for short, E_1 A x = E_1 b, i.e. A_1 x = b_1.
2. Subtract −1 times the first equation from the third, i.e. multiply both sides by

    E_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}

to get

    \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 3 & 2 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -4 \\ 8 \end{pmatrix}

or, for short, E_2 A_1 x = E_2 b_1, i.e. A_2 x = b_2.
3. Subtract −3 times the second equation from the third, i.e. multiply both sides by

    E_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{pmatrix}

to get

    \begin{pmatrix} 2 & 1 & 1 \\ 0 & -1 & -2 \\ 0 & 0 & -4 \end{pmatrix}
    \begin{pmatrix} x \\ y \\ z \end{pmatrix}
    =
    \begin{pmatrix} 1 \\ -4 \\ -4 \end{pmatrix}

or, for short, E_3 A_2 x = E_3 b_2, i.e. A_3 x = b_3.
This new matrix will be assigned a new name, so the system now looks like

    E_3 E_2 E_1 A x = E_3 E_2 E_1 b
    U x = c

and since U = E_3 E_2 E_1 A,

    A = E_1^{-1} E_2^{-1} E_3^{-1} U = L U,
    \qquad
    L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & -3 & 1 \end{pmatrix}

so to solve the system we only need to solve U x = c and back-substitute. Easy, right?
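As a quick check of the worked example, here is a minimal sketch using scipy (note that scipy's routine also applies row pivoting, so its L and U need not match the hand computation exactly):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[ 2., 1., 1.],
              [ 4., 1., 0.],
              [-2., 2., 1.]])
b = np.array([1., -2., 7.])

lu, piv = lu_factor(A)      # PA = LU with partial pivoting
x = lu_solve((lu, piv), b)  # forward- and back-substitution
print(x)                    # [-1.  2.  1.], i.e. x = -1, y = 2, z = 1
```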
2.2.1 Basic steps

- Use multiples of the first equation to eliminate the first coefficient of subsequent equations.
- Repeat for the remaining n − 1 coefficients.
- Back-substitute in reverse order.
Problems:

- Zero in the first column
- Linearly dependent equations
- Inconsistent equations

Efficiency. If we count a division, multiplication, or sum as one operation and assume we have a matrix A ∈ R^{n×n}:

- n operations to get a zero in the first coefficient
- n − 1 rows to do
- n² − n operations so far
- N = (1² + ⋯ + n²) − (1 + ⋯ + n) = (n³ − n)/3 operations to do the remaining coefficients.

For large n, N ≈ n³/3. The back-substitution part takes N ≈ n²/2. There are other, more efficient ways to solve systems of equations.
2.2.2 Some examples
Non-square matrices

An example of a system with 3 equations and 4 unknowns (overdetermined? underdetermined?) is as follows:

    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

We can use Gaussian elimination by first setting to zero the first coefficients in rows 2 and 3,

    \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 2 & 6 & 9 & 5 \\ -1 & -3 & 3 & 0 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 6 & 2 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

and then eliminating the third coefficient of the last row,

    \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \end{pmatrix}
    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 6 & 2 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

    \begin{pmatrix} 1 & 3 & 3 & 2 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
    =
    \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}

The underlined values (the 1 in row 1 and the 3 in row 2) are the pivots. The pivots have a column of zeros below them and are to the right of and below other pivots.

Now we can try to solve the equations, but note that the last row has no information: the free variables x_2 and x_4 can take any value. From the second row,

    x_3 = -x_4/3

and from the first row,

    x_1 = -3x_2 - x_4

so

    x = \begin{pmatrix} -3x_2 - x_4 \\ x_2 \\ -x_4/3 \\ x_4 \end{pmatrix}
      = x_2 \begin{pmatrix} -3 \\ 1 \\ 0 \\ 0 \end{pmatrix}
      + x_4 \begin{pmatrix} -1 \\ 0 \\ -1/3 \\ 1 \end{pmatrix}

which means that all solutions to our initial problem Ax = b are combinations of these two vectors and form an infinite set of possible solutions. You can choose ANY value of x_2 or x_4 and you will always get a correct answer.
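A minimal numpy check of this statement (the matrix and the two vectors are exactly those above):

```python
import numpy as np

A = np.array([[ 1., 3., 3., 2.],
              [ 2., 6., 9., 5.],
              [-1.,-3., 3., 0.]])

v1 = np.array([-3., 1., 0., 0.])       # multiplied by the free variable x2
v2 = np.array([-1., 0., -1./3., 1.])   # multiplied by the free variable x4

# Any linear combination satisfies the (homogeneous) system
x = 2.0 * v1 - 5.0 * v2                # arbitrary choices of x2 and x4
print(A @ x)                           # ~[0, 0, 0]
```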
2.3 Linear vector spaces

A linear vector space (LVS) is a set of elements (vectors) f, g, h, ... together with two operations, addition f + g and scalar multiplication αf, satisfying

    f + g = g + f                                         (2.2)
    f + (g + h) = (f + g) + h                             (2.3)
    there is a zero element such that f + 0 = f           (2.4)
    for every f there is a −f such that f + (−f) = 0      (2.5)
    α(f + g) = αf + αg                                    (2.6)
    (α + β)f = αf + βf                                    (2.7)
    α(βf) = (αβ)f                                         (2.8)
    1·f = f                                               (2.9)
    αf = 0 only if either α = 0 or f = 0                  (2.10)
Some examples

The most obvious space is R^n, so x = [x_1, x_2, ..., x_n] is an element of R^n.

Perhaps less familiar are spaces whose elements are functions, not just a finite set of numbers. One could define a vector space C^N[a, b], the space of all functions with N continuous derivatives on the interval [a, b].

Linear independence

A set of elements a_1, ..., a_n is linearly independent if

    \sum_{j=1}^{n} \alpha_j a_j = 0  only if  α_1 = α_2 = ⋯ = α_n = 0

in words, the only linear combination of the elements that equals zero is the one in which all the scalars vanish.

Subspaces

A subspace of a linear vector space V is a subset of V that is itself an LVS, meaning all the laws apply. For example, R^n is a subspace of R^{n+1}, or C^{n+1}[a, b] is a subspace of C^n[a, b].
2.4 Functionals

A functional assigns a scalar to an element of a vector space. Some examples:

    \int_a^b g_i(x) \, m(x) \, dx, \qquad m \in C^0[a, b]

    D_2[f] = \left. \frac{d^2 f}{dx^2} \right|_{x=0}, \qquad f \in C^2[a, b]

    N_1[x] = |x_1| + |x_2| + \cdots + |x_n|, \qquad x \in R^n

There are two kinds of functionals that will be relevant to our work: linear functionals and norms. We will devote a section to the second one later.
2.4.1 Linear functionals

A linear functional can be written as an inner product; for vectors, for example,

    \sum_i x_i y_i

which is an example of an inner product. For finite models and data, the general relationship is

    d = g_j m_j

or, for multiple data,

    d_i = G_{ij} m_j

and in some way our forward problem is an inner product between the model and the mathematical theory that generates the data.
2.5 Norms

A norm is a functional ‖f‖ with the properties

    ‖f‖ ≥ 0                                  (2.12)
    ‖αf‖ = |α| ‖f‖                           (2.13)
    ‖f + g‖ ≤ ‖f‖ + ‖g‖   (the triangle inequality)   (2.14)
    ‖f‖ = 0  only if  f = 0                  (2.15)
For x ∈ R^N the so-called p-norms are defined as

    ‖x‖_p = \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p}, \qquad p ≥ 1

with p = 2 giving the Euclidean norm. The regions for which the p-norms are less than unity (‖x‖_p ≤ 1) can be drawn for p = 1, 2, 3, and ∞ (the unit balls of the different norms).

For functions, the corresponding norms are

    ‖f‖_1 = \int_a^b |f(x)| \, dx

    ‖f‖_2 = \left( \int_a^b |f(x)|^2 \, dx \right)^{1/2}

    ‖f‖_∞ = \max_{a ≤ x ≤ b} |f(x)|
and other norms can be designed to measure some aspect of the roughness of the functions, for example

    ‖f‖ = \left( f^2(a) + [f'(a)]^2 + \int_a^b [f''(x)]^2 \, dx \right)^{1/2}

or the Sobolev norm

    ‖f‖_S = \left( \int_a^b \left( w_0(x) f^2(x) + w_1(x) f'(x)^2 \right) dx \right)^{1/2}

This last set of norms will be useful when we try to solve underdetermined problems. They are typically applied to the model rather than the data.
2.5.1 Measuring misfit

In an inverse problem we form the residual

    r = d - \hat{d}    (2.16)

where from our physics we can make data predictions d̂, and we want our predictions to be as close as possible to the acquired data.

What do we mean by small? We use a norm to define how small is small, by making the length of r, namely the norm ‖r‖, as small as possible, e.g.

    L1:  ‖d − d̂‖_1
    L2:  ‖d − d̂‖_2
2.5.2 The condition number

Suppose we have a linear system and a slightly perturbed version of it,

    y = Ax      (2.17)
    y_0 = A x_0    (2.18)

Here, assume the perturbation is small. Note that in real life we have uncertainties in our observations, and we wish to know whether these small errors in the observations are severely affecting our end result.

Using a norm, we wish to know what the effect of the small perturbations is, so using the relations above,

    A(x − x_0) = y − y_0
    (x − x_0) = A^{-1}(y − y_0)
    ‖x − x_0‖ ≤ ‖A^{-1}‖ ‖y − y_0‖ = ‖A^{-1}‖ ‖Ax‖ \frac{‖y − y_0‖}{‖y‖} ≤ ‖A‖ ‖A^{-1}‖ ‖x‖ \frac{‖y − y_0‖}{‖y‖}

so that

    \frac{‖x − x_0‖}{‖x‖} ≤ ‖A‖ ‖A^{-1}‖ \frac{‖y − y_0‖}{‖y‖}    (2.19)

which shows the amount by which a small perturbation in the observations (y) is reflected in perturbations in the resultant estimated model x. For the L2 norm, the condition number of a matrix is κ = σ_max/σ_min, where the σ_i are the singular values (for a symmetric matrix, the eigenvalues) of the matrix in question.
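A short sketch of this error amplification, using a notoriously ill-conditioned matrix as an illustrative stand-in (not an example from the text):

```python
import numpy as np

# An ill-conditioned matrix (illustrative: a small Hilbert matrix)
n = 8
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1.0)
x0 = np.ones(n)
y0 = A @ x0

# Perturb the "observations" slightly and solve again
y = y0 + 1e-10 * np.random.default_rng(0).standard_normal(n)
x = np.linalg.solve(A, y)

print("condition number      :", np.linalg.cond(A))
print("relative data error   :", np.linalg.norm(y - y0) / np.linalg.norm(y0))
print("relative model error  :", np.linalg.norm(x - x0) / np.linalg.norm(x0))
```

The relative model error is larger than the relative data error by roughly the condition number, as equation (2.19) suggests.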
Chapter 3
Linear Regression
Sometimes we will talk about the term inverse problem, while other people will prefer the term regression. What is the difference? In practice, none. When we are dealing with a function-fitting procedure that can be cast as an inverse problem, the procedure is often referred to as a regression. In fact, economists use regressions quite extensively.

Finding a parameterized curve that approximately fits a set of data points is referred to as regression. For example, the parabolic trajectory problem is defined by

    y(t) = m_1 + m_2 t - \frac{1}{2} m_3 t^2

where y(t) represents the altitude of the object at time t, and the three (N = 3) model parameters m_i are associated with the constant, slope, and quadratic terms. Note that even if the function is quadratic, the problem in question is linear in the three parameters.
If we have M discrete measurements y_i at times t_i, the linear regression problem or inverse problem can be written in the form

    \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix}
    =
    \begin{pmatrix}
    1 & t_1 & -\tfrac{1}{2} t_1^2 \\
    1 & t_2 & -\tfrac{1}{2} t_2^2 \\
    \vdots & \vdots & \vdots \\
    1 & t_M & -\tfrac{1}{2} t_M^2
    \end{pmatrix}
    \begin{pmatrix} m_1 \\ m_2 \\ m_3 \end{pmatrix}

When the regression model is linear in the unknown parameters, we call this a linear regression or linear inverse problem.
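A minimal sketch of setting up and solving this regression; the times, true parameters, and noise level are invented for illustration:

```python
import numpy as np

# Illustrative times and noisy "altitude" data for the parabolic trajectory
t = np.linspace(0.0, 10.0, 11)
m_true = np.array([10.0, 100.0, 9.8])                    # assumed true m1, m2, m3
y = m_true[0] + m_true[1] * t - 0.5 * m_true[2] * t**2
y += np.random.default_rng(1).normal(0.0, 1.0, t.size)   # add noise

# Build the M x 3 design matrix of the linear regression
G = np.column_stack([np.ones_like(t), t, -0.5 * t**2])

# Least squares solution (equivalent to solving the normal equations)
m_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
print(m_hat)
```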
3.2 Least squares

We start the application of all those terms we have learned above by looking at an overdetermined linear problem (more equations than unknowns) involving the simplest of norms, the L2 or Euclidean norm.
Suppose we are given a collection of M measurements of a property to form a vector d ∈ R^M. From our geophysics we know the forward problem, such that we can predict the data from a known model m ∈ R^N. That is, we know the N vectors g_k ∈ R^M such that

    d = \sum_{k=1}^{N} g_k m_k = G m    (3.1)

where

    G = [g_1, g_2, \ldots, g_N]

We are looking for a model m̂ that minimizes the size of the residual vector, defined as

    r = d - \sum_{k=1}^{N} g_k \hat{m}_k

We do not expect to have an exact fit, so there will be some error, and we use a norm to measure the size of the residual,

    ‖r‖ = ‖d − G m̂‖

For the least squares problem we use the L2 or Euclidean norm,

    ‖r‖_2 = \sqrt{ \sum_{k=1}^{M} r_k^2 }
As a simple example, suppose the data d = [d_1, d_2, ..., d_M]^T are fit by a single constant m, so that r_i = d_i − m. Then

    \sum_{k=1}^{M} r_k^2 = \sum_{k=1}^{M} (d_k - m)^2
                         = \sum_{k=1}^{M} \left( d_k^2 - 2 m d_k + m^2 \right)
                         = \sum_{k=1}^{M} d_k^2 - 2 m \sum_{k=1}^{M} d_k + M m^2
Now, to minimize the residual, we take the derivative with respect to the model m and set it to zero,

    0 = -2 \sum_{k=1}^{M} d_k + 2 M m
    \quad\Rightarrow\quad
    \hat{m} = \frac{1}{M} \sum_{k=1}^{M} d_k

which shows that the sample mean is the result of a least squares solution for the measurements.

The corresponding estimate that minimizes the L1 norm is the median. Note that the median is not found by a linear operation on the data, which is a general feature of L1-norm estimates.
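A two-line illustration of this difference, with made-up data containing one outlier:

```python
import numpy as np

d = np.array([1.1, 0.9, 1.0, 1.2, 0.8, 15.0])   # one large outlier

print("L2 estimate (mean)  :", d.mean())        # pulled toward the outlier
print("L1 estimate (median):", np.median(d))    # robust to the outlier
```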
3.2.1 General LS Solution

The predicted data d̂ is a linear combination of the g_k's. Using linear vector space theory, we can show that the predicted data d̂ must lie in the estimation space, the set of ALL possible results that G can produce (its range).
Setting the L2 norm for the residuals between the data and the prediction,

    ‖r‖_2^2 = ‖d − d̂‖_2^2 = r^T r = (d - G\hat{m})^T (d - G\hat{m})
            = d^T d - 2 \hat{m}^T G^T d + \hat{m}^T G^T G \hat{m}

and taking the derivative with respect to the model and setting it to zero,

    \frac{d}{d\hat{m}} ‖r‖_2^2 = 0 = -2 G^T d + 2 G^T G \hat{m}

Note that it is worth pointing out that the derivative is of a scalar with respect to a vector. We will show below that this works as simply as it appears, by writing out all the components. Simplifying a bit more,

    G^T d = G^T G \hat{m}    (3.2)

    \hat{m} = (G^T G)^{-1} G^T d
Writing the misfit out in components,

    \sum_{j=1}^{M} r_j^2 = \sum_{j=1}^{M} \left( d_j - \sum_{i=1}^{N} g_{ji} m_i \right)^2

and differentiating with respect to a particular model parameter m_k,

    \frac{d}{dm_k} ‖r‖_2^2
      = \frac{d}{dm_k} \sum_{j=1}^{M} \left( d_j - \sum_{i=1}^{N} g_{ji} m_i \right) \left( d_j - \sum_{l=1}^{N} g_{jl} m_l \right)
      = \frac{d}{dm_k} \sum_{j=1}^{M} \left[ d_j d_j - d_j \sum_{l=1}^{N} g_{jl} m_l - d_j \sum_{i=1}^{N} g_{ji} m_i + \sum_{i=1}^{N} \sum_{l=1}^{N} g_{ji} g_{jl} m_i m_l \right]

The derivative of the linear terms gives

    -2 \sum_{j=1}^{M} d_j g_{jk} \;\leftrightarrow\; -2\, G^T d

while for the quadratic term,

    \frac{d}{dm_k} \sum_{j=1}^{M} \sum_{i,l=1}^{N} g_{ji} g_{jl} m_i m_l
      = \sum_{j=1}^{M} \sum_{l=1}^{N} \sum_{i=1}^{N} \left( \delta_{ik}\, g_{ji} g_{jl} m_l + \delta_{lk}\, g_{ji} g_{jl} m_i \right)
      = \sum_{j=1}^{M} \left[ \sum_{l=1}^{N} g_{jk} g_{jl} m_l + \sum_{i=1}^{N} g_{ji} g_{jk} m_i \right]

and now note that the two terms are equal by symmetry (relabelling i and l). So in the end we will have

    2 \sum_{j=1}^{M} \sum_{i=1}^{N} m_i\, g_{jk} g_{ji} \;\leftrightarrow\; 2\, G^T G m

and we have derived the same result as before for the normal equations.
3.2.2 The normal equations

    \hat{m} = (G^T G)^{-1} G^T d    (3.3)

The normal equations can be rearranged as

    G^T (G\hat{m}) = G^T d
    G^T (d - G\hat{m}) = 0

and, recalling the definition of the residual,

    G^T (d - G\hat{m}) = G^T r = 0

So, in other words, the normal equations in the least squares sense mean that

    G^T r = \begin{pmatrix} g_{.1} \cdot r \\ g_{.2} \cdot r \\ \vdots \\ g_{.N} \cdot r \end{pmatrix}
          = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}

suggesting that the residual vector is orthogonal to every one of the column vectors of the G matrix. The key thing here is that making the residual perpendicular to the estimation sub-space minimizes the length of r.
Geometrically, the prediction d̂ = Gm̂ is the projection of the data d onto the column subspace of G, with projection matrix

    P_G = G (G^T G)^{-1} G^T
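A small sketch verifying these two facts (the 3-point line fit below is an invented example):

```python
import numpy as np

# Illustrative overdetermined problem: straight-line fit with 3 data points
G = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
d = np.array([0.1, 1.1, 1.9])

m_hat = np.linalg.solve(G.T @ G, G.T @ d)   # normal equations
d_hat = G @ m_hat
r = d - d_hat

P = G @ np.linalg.inv(G.T @ G) @ G.T        # projection onto the column space of G
print(np.allclose(P @ d, d_hat))            # True: the prediction is the projection of d
print(G.T @ r)                              # ~[0, 0]: residual orthogonal to the columns of G
```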
As an example, consider fitting a straight line to three data points,

    \begin{pmatrix} \hat{d}_1 \\ \hat{d}_2 \\ \hat{d}_3 \end{pmatrix}
    =
    \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \end{pmatrix}
    \begin{pmatrix} \hat{m}_1 \\ \hat{m}_2 \end{pmatrix}

or

    \hat{d}_1 = g_{11} \hat{m}_1 + g_{12} \hat{m}_2
    \hat{d}_2 = g_{21} \hat{m}_1 + g_{22} \hat{m}_2
    \hat{d}_3 = g_{31} \hat{m}_1 + g_{32} \hat{m}_2

with residual r = d − d̂. Reorganizing, we have

    d = \hat{d} + r, \qquad r \perp \hat{d}
3.2.3 Maximum Likelihood

We can also use the Maximum Likelihood method in order to interpret the Least Squares method and the normal equations. This technique was developed by R. A. Fisher in the 1920s and has dominated the field of statistical inference since then. Its power is that it can (in principle) be applied to any type of estimation problem, provided that one can write down the joint probability distribution of the random variables which we are assuming model the observations.

Maximum likelihood looks for the optimum values of the unknown model parameters as those that maximize the probability that the observed data are due to the model, from a probabilistic point of view.

Suppose we have a random sample of M observations x = x_1, x_2, ..., x_M drawn from a probability density function (PDF) f(x_i, θ), where the parameter θ is unknown. We can extend this to a set of model parameters, f(x_i, m). The joint probability for all M observations is

    f(x, m) = f(x_1, m) f(x_2, m) \cdots f(x_M, m) = L(x, m)
Figure 3.2: The LS fit for a straight line. The estimation space is the straight line given by G\hat{m}; this is where all predictions will lie. The real measurements d_k lie above or below this line, and are in effect projected onto the line via the residuals.
For normally distributed errors with mean m and variance σ², the likelihood of the data is

    L(d, m) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(d_i - m)^2}{2\sigma^2} \right)

We maximize

    \max_m \ln\{L(d, m)\}
    \qquad
    \ln L(d, m) = -\frac{M}{2}\ln(2\pi) - M \ln\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{M}(d_i - m)^2

Setting the derivative with respect to m to zero,

    0 = \frac{1}{\sigma^2}\sum_{i=1}^{M}(d_i - m)
    \quad\Rightarrow\quad
    \sum_{i=1}^{M} d_i = M m

so that

    \hat{m} = \frac{1}{M}\sum_{i=1}^{M} d_i
We can also look for the maximum likelihood estimate of the variance σ²:

    0 = -\frac{M}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{M}(d_i - m)^2

and we get

    \hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}(d_i - m)^2
For the full linear problem with independent Gaussian errors of standard deviations σ_i, the likelihood is

    L = \frac{1}{(2\pi)^{M/2} \prod_{i=1}^{M}\sigma_i} \prod_{i=1}^{M} \exp\left( -\frac{(d_i - (Gm)_i)^2}{2\sigma_i^2} \right)
We want to maximize the function above, thus the constant term has no effect, leading to

    \max_m L = \max_m \exp\left( -\sum_{i=1}^{M} \frac{(d_i - (Gm)_i)^2}{2\sigma_i^2} \right)

which is equivalent to minimizing

    (d - Gm)^T \Sigma^{-1} (d - Gm)

Expanding and differentiating with respect to m,

    \frac{\partial}{\partial m}\left[ d^T \Sigma^{-1} d - 2 m^T G^T \Sigma^{-1} d + m^T G^T \Sigma^{-1} G m \right]
    = 0 = -2 G^T \Sigma^{-1} d + 2 G^T \Sigma^{-1} G m

finally leading to

    \hat{m} = (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1} d

which comes from the sometimes-called generalized normal equations

    (G^T \Sigma^{-1} G)\,\hat{m} = G^T \Sigma^{-1} d    (3.4)
3.3 Why least squares?

As you might have expected, the choice of norm is somewhat arbitrary. So why is the use of least squares so popular?

1. Least squares estimates are linear in the data and easy to program.
Figure 3.3: Schematic of a straight-line fit for (x, d) data points under the L1, L2, and L∞ norms. The L1 fit is not as affected by the single outlier.
2. It corresponds to the maximum likelihood estimate for normally distributed errors. The normal distribution comes from the central limit theorem: add up random effects and you get a Gaussian.

3. The estimate is linear in the data, so errors propagate as a linear mapping from the input (data) statistics.

4. Well-known statistical tests and confidence intervals can be obtained.

It has some disadvantages too. The main one is that the result is sensitive to outliers (see Figure 3.3).

Another popular norm is the L1 norm. Some characteristics include:

1. It is non-linear, solved by linear programming (to be seen later).
2. It is less sensitive to outliers.
3. Confidence intervals and hypothesis testing are somewhat more difficult, but can be done.
3.4 Summary

We have arrived at the normal equations in three ways:

1. Requiring the residual to be orthogonal to the predictions Gm̂,

    G^T (d - G\hat{m}) = G^T r = 0

which leads to

    \hat{m} = (G^T G)^{-1} G^T d

2. Minimizing the squared misfit,

    \frac{\partial}{\partial m}\left[ (d - Gm)^T (d - Gm) \right] = \frac{\partial}{\partial m}(r^T r) = 0
    \quad\Rightarrow\quad
    G^T (d - G\hat{m}) = 0

leading to

    \hat{m} = (G^T G)^{-1} G^T d

3. Maximum likelihood for a multivariate normal distribution:

    Maximize:  \exp\left( -(d - Gm)^T \Sigma^{-1} (d - Gm) \right)
    Minimize:  (d - Gm)^T \Sigma^{-1} (d - Gm)

which led to

    \hat{m} = (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1} d

from the generalized normal equations.
3.5
We come back to the general line fit problem, where we have two unknows,
the intercept m1 and the slope m2 . We have M observations di . The inverse
problem is
d = Gm
and the indexed matrix is then
d1
d2
.. =
.
dM
1
1
..
.
x1
x2
..
.
xM
m1
m2
As you are already aware of, the least squares solution of this problem is
= (GT G)1 GT d
m
which we are doing explicitly.
35
    G^T d =
    \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{pmatrix}
    \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_M \end{pmatrix}
    =
    \begin{pmatrix} \sum_{i=1}^{M} d_i \\ \sum_{i=1}^{M} x_i d_i \end{pmatrix}

    G^T G =
    \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_M \end{pmatrix}
    \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{pmatrix}
    =
    \begin{pmatrix} M & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}

so that

    (G^T G)^{-1} =
    \frac{1}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
    \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix}

and

    \hat{m} =
    \frac{1}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
    \begin{pmatrix} \sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i \\ M \sum_i x_i d_i - \sum_i x_i \sum_i d_i \end{pmatrix}

The covariance of the estimates is

    \mathrm{cov}(\hat{m}) = \sigma^2 (G^T G)^{-1}
    = \frac{\sigma^2}{M \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
    \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & M \end{pmatrix}
where σ² is the variance of the individual data. This equation shows that even if the data d_i are uncorrelated, the model parameters can be correlated:

    \mathrm{cov}(m_1, m_2) = \frac{-\sigma^2 \sum_i x_i}{M \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \neq 0

If instead we measure the positions relative to their mean, y_i = x_i − x̄ (so that Σ_i y_i = 0), then

    \mathrm{cov}(\hat{m}) = \sigma^2
    \begin{pmatrix} \frac{1}{M} & 0 \\ 0 & \frac{1}{\sum_i y_i^2} \end{pmatrix}

This new relation now shows independent intercept and slope, and if σ is the standard error in the observed data, then:

- Standard error of the intercept: σ/√M — with more data you reduce the variance of the intercept.
- Standard error of the slope: σ/\sqrt{\sum_i y_i^2} — showing that if the observation points on the x axis are close together, the uncertainties in the slope estimate are greater.
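The following sketch puts the explicit formulas together for synthetic data; the x positions, true line, and σ are assumed for illustration:

```python
import numpy as np

# Illustrative straight-line fit d = m1 + m2 * x, with covariance of the estimates
rng = np.random.default_rng(2)
x = np.linspace(0.0, 9.0, 10)
sigma = 0.5                                    # assumed standard error of each datum
d = 1.0 + 2.0 * x + rng.normal(0.0, sigma, x.size)

G = np.column_stack([np.ones_like(x), x])
GtG_inv = np.linalg.inv(G.T @ G)
m_hat = GtG_inv @ G.T @ d

cov_m = sigma**2 * GtG_inv                     # cov(m) = sigma^2 (G^T G)^{-1}
print("intercept, slope:", m_hat)
print("covariance of the estimates:\n", cov_m)  # off-diagonal != 0 unless sum(x_i) = 0

# Centering x (y_i = x_i - mean) makes the intercept and slope estimates uncorrelated
```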
Chapter 4
Tikhonov Regularization, variance and resolution

4.1 Tikhonov Regularization
Tikhonov regularization is one of the most common methods used for regularizing an inverse problem. The reason for doing this is that in many cases the inverse problem is ill-posed, and small errors in the data will give very large errors in the resultant model.

Another possible reason for using this method is if we have a mixed-determined problem, where for example we might have a model null space. For the overdetermined part, we would like to minimize the residual vector,

    \min ‖r‖ \quad\Rightarrow\quad \hat{m} = (G^T G)^{-1} G^T d

while for the underdetermined case we actually minimize the model norm,

    \min ‖m‖ \quad\Rightarrow\quad \hat{m} = G^T (G G^T)^{-1} d

and of course, for the mixed-determined case, we will be trying something in between,

    \Phi(m) = ‖d - Gm‖_2^2 + \alpha^2 ‖m‖_2^2

and as we have seen before, we want to minimize

    \min_m \Phi(m) = \min_m \left\| \begin{pmatrix} G \\ \alpha I \end{pmatrix} m - \begin{pmatrix} d \\ 0 \end{pmatrix} \right\|_2^2

or equivalently, the minimizer is

    \hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d

So the question is, what do we choose for α? If we choose α very large, we are focusing our attention on minimizing the model norm ‖m‖ while neglecting the residual norm. If we choose α too small, we are doing the complete contrary: we are trying to fit the data perfectly, which is probably not what we want.
A graphical way to see how the two norms interact depending on the choice of α is shown in Figure 5.1 of our book. The idea is that as the residual norm increases, the model norm decreases, leading to the so-called L-curve. This is because ‖m‖_2 is a strictly decreasing function of α, while ‖d − Gm‖_2 is a strictly increasing function of α.

Our job now is to find an optimal value of α. There are a few methods we are going to see that find the optimal α: the discrepancy criterion, the L-curve criterion, and cross-validation. Before going there, we want to understand the effect of the choice of α on the resolution of the estimate as well as on the covariance of the model parameters. Similarly, we want to understand the choice of the number of singular values used in solving the generalized inverse using the SVD, and the SVD implementation of the Tikhonov regularization. Finally, we will see how other norms can be chosen in order to penalize models with excessive roughness or curvature.
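A compact sketch of zeroth-order Tikhonov and the trade-off between the two norms. The test kernel, true model, and noise level are invented for illustration and are not the Shaw problem from the book:

```python
import numpy as np

def tikhonov(G, d, alpha):
    """Zeroth-order Tikhonov (damped least squares) solution."""
    N = G.shape[1]
    return np.linalg.solve(G.T @ G + alpha**2 * np.eye(N), G.T @ d)

# Illustrative ill-conditioned smoothing problem
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
G = np.exp(-5.0 * np.abs(x[:, None] - x[None, :])**2) * (x[1] - x[0])  # smoothing kernel
m_true = np.exp(-((x - 0.4) / 0.05)**2)
d = G @ m_true + 1e-4 * rng.standard_normal(x.size)

# Scan alpha and record the two norms that trade off along the L-curve
for alpha in [1e-6, 1e-4, 1e-2, 1.0]:
    m = tikhonov(G, d, alpha)
    print(alpha, np.linalg.norm(d - G @ m), np.linalg.norm(m))
```

As α grows, the residual norm increases while the model norm shrinks, tracing out the L-curve described above.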
4.2 SVD Implementation

Using our previous expression, but introducing the SVD of the G matrix, namely

    G = U S V^T

and from above,

    (G^T G + \alpha^2 I)\hat{m} = G^T d

we can substitute to get

    (V S^2 V^T + \alpha^2 I)\hat{m} = V S U^T d

and the solution is

    \hat{m} = \sum_i f_i \, \frac{u_i^T d}{s_i} \, v_i,
    \qquad
    f_i = \frac{s_i^2}{s_i^2 + \alpha^2}

The f_i are called filter factors. For s_i ≫ α the factor f_i ≈ 1, and these are essentially the terms we would have kept had we chosen a truncation level p of large singular values. In contrast, for s_i ≪ α the factor f_i → 0, and this part of the solution is damped out, or downweighted.
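The same solution can be assembled directly from the SVD; a minimal sketch (with a tiny made-up G and d) that also checks it against the normal-equations form:

```python
import numpy as np

def tikhonov_svd(G, d, alpha):
    """Tikhonov solution assembled from the SVD, m = sum_i f_i (u_i^T d / s_i) v_i."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    f = s**2 / (s**2 + alpha**2)   # filter factors: ~1 for s_i >> alpha, ~0 for s_i << alpha
    m = Vt.T @ (f * (U.T @ d) / s)
    return m, f

# Tiny illustrative check against the normal-equations form
G = np.array([[1.0, 0.5], [0.5, 1.0], [0.2, 0.1]])
d = np.array([1.0, 2.0, 0.3])
alpha = 0.1
m_svd, f = tikhonov_svd(G, d, alpha)
m_ne = np.linalg.solve(G.T @ G + alpha**2 * np.eye(2), G.T @ d)
print(np.allclose(m_svd, m_ne))   # True
print(f)
```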
4.3 Resolution vs variance, the choice of α, or p

From previous lectures, we can now discuss the resolution and variance of our resultant model using the generalized inverse, in this case the Tikhonov regularization. We had

    \hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d = G^{\#} d
            = V F S^{-1} U^T d
            = V_p S_p^{-1} U_p^T d

where the first two expressions use the general Tikhonov regularization, the third is the SVD with filter factors, and the last one is the result if we choose a number p of singular values.

The model resolution matrix R_m was defined via

    \hat{m} = G^{gen} d = G^{gen} G \, m_{true} = R_m \, m_{true}

and is then, for the three cases,

    R_{m,\alpha} = G^{\#} G
    R_{m,\alpha} = V F V^T
    R_{m,p} = V_p V_p^T

In all regularizations R_m ≠ I, the estimate will be biased, and m̂ ≠ m_true. The bias introduced by regularizing is

    \hat{m} - m_{true} = [R_m - I]\, m_{true}

but since we don't know m_true, we don't know the sense of the bias. We can't even bound the bias, since it depends on the true m as well.
Similarly, the model covariance is

    \mathrm{cov}(\hat{m}) = G^{\#} \,\mathrm{cov}(d)\, G^{\#T}

and, for uncorrelated data with variance σ²,

    \mathrm{cov}_{m,\alpha}(\hat{m}) = \sigma^2 G^{\#} G^{\#T}
    \mathrm{cov}_{m,\alpha}(\hat{m}) = \sigma^2 V F^2 S^{-2} V^T
    \mathrm{cov}_{m,p}(\hat{m}) = \sigma^2 V_p S_p^{-2} V_p^T

We could use this to evaluate confidence intervals or ellipses on the model but, since the model is biased by an unknown amount, the confidence intervals might not be representative of the true deviation of the estimated model.
4.3.1 Example 1: Shaw's problem

In this example I would like to show a practical application of the Tikhonov regularization, using both the general approach (the generalized matrix explicitly) and the SVD. I take the examples from Aster's book directly. In the Shaw problem, the measured data are the diffracted light intensity as a function of outgoing angle, d(s), where the angle is −π/2 ≤ s ≤ π/2. We use the discretized version of the problem as outlined in the book; namely, the mathematical model relating the observed data d and the model vector m is

    d = Gm

where d ∈ R^M and m ∈ R^N, but in our example we will have M = N. The G matrix is defined for the discrete case using the angles

    s_i = \frac{(i - 0.5)\pi}{N} - \frac{\pi}{2}, \qquad i = 1, 2, \ldots, N.
We estimate the model m̂ in two ways,

    \hat{m} = (G^T G + \alpha^2 I)^{-1} G^T d
    \qquad\text{and}\qquad
    \hat{m} = V_p S_p^{-1} U_p^T d

where in the first case we need to choose a value of α, while in the second case (SVD) we need to choose a value of p, the number of singular values and vectors to use. Since the singular values are rarely exactly zero, the choice is not so easy to make. In addition to making a particular choice, we need to understand what effect our choice has on the model resolution and model covariance. In the next figures I present the results graphically in order to get an intuitive understanding of our choices.
Figure 4.1: Some models using the generalized inverse. Top-left: the L-curve for the residual norm ‖d − Gm‖ and model norm ‖m‖. Various choices of α are used, and the colored dots are three choices made. Top-right: true model (circles) and the estimated models for the three choices on the left. Bottom-left: the synthetic data (circles) and the three predicted data sets. Bottom-right: the resolution (top panels) and covariance (bottom panels) matrices for the three choices. White represents large amplitudes, black represents lower amplitudes.
42
10
1.5
10
real model
p = 14
p=8
p=2
|uTid|
|uTd/s |
i
10
1
0
10
Intensity
||dGm||
10
10
0.5
10
10
10
0
15
10
10
5
10
15
Number of Singular Value (p)
20
20
0.5
1.5
0.5
0.5
1.5
Resolution 14
10
10
12
Singular Value
Resolution 8
14
16
18
20
Resolution 2
20
20
20
15
15
15
10
10
10
0.7
real model
p = 14
p=8
p=2
0.6
0.5
Observed Intensity
10
15
20
10
15
20
10
15
20
0.4
0.3
Covariance 14
Covariance 8
Covariance 2
20
20
20
15
15
15
10
10
10
0.2
0.1
0.1
1.5
0.5
0
Outgoing angle
0.5
1.5
10
15
20
10
15
20
10
15
20
Figure 4.2: Some models using the Generalized inverse. Top-Left: The L-curve
for the residual norm and model norm. Various choices of are used, and the
colored dots are three choices made. Top-Right: True model (circles) and the
estimated models for the three choices on the left. Bottom-Left: The synthetic
data (circles) and the three predicted data. Bottom-Right: The resolution (top
panels) and covariance (bottom panels) matrices for the three choices. White
represents large amplitudes, black represents lower amplitudes.
4.4 Smoothing Norms or Higher-Order Tikhonov

Very often we seek solutions that minimize the misfit, but also some measure of the roughness of the solution. In some cases, when we minimize the minimum-norm solution with

    ‖f‖^2 = \int_a^b f(x)^2 \, dx

we may get the unwanted consequence of putting the estimated model only where we happen to have data. Instead, our geophysical intuition might suggest that the solution should not be very rough, so we minimize instead

    ‖f‖^2 = \int_a^b f'(x)^2 \, dx, \qquad f(a) = 0

where we need to add a boundary condition (the condition on the right). The boundary condition is needed since the derivative norm is insensitive to constants; that is, the norm of f + b is equal to the norm of f. This means we really have a semi-norm.
4.4.1 The discrete Case

Assuming the model parameters are ordered in physical space (e.g., with depth, or with lateral distance), we can define difference operators of the form

    D_1 = \begin{pmatrix}
    -1 & 1 & 0 & 0 & \cdots \\
    0 & -1 & 1 & 0 & \cdots \\
    0 & 0 & -1 & 1 & \cdots \\
    & & \vdots & & \ddots
    \end{pmatrix}

and, for the second derivative,

    D_2 = \begin{pmatrix}
    -2 & 1 & 0 & 0 & \cdots \\
    1 & -2 & 1 & 0 & \cdots \\
    0 & 1 & -2 & 1 & \cdots \\
    & & \vdots & & \ddots
    \end{pmatrix}
There are a few ways to implement this in the discrete case, namely:

1. Minimize a functional of the form

    \Phi(m) = ‖d - Gm‖_2^2 + \alpha^2 ‖Dm‖_2^2

which leads to

    \hat{m} = \left( G^T G + \alpha^2 D^T D \right)^{-1} G^T d

Note the similarity with our previous results, where instead of the matrix D^T D we had the identity matrix I. (A small code sketch of this approach is given after this list.)

2. Alternatively, we can try to solve the coupled system of equations

    \begin{pmatrix} d \\ 0 \end{pmatrix} = \begin{pmatrix} G \\ \alpha D \end{pmatrix} m + \epsilon

and rewrite this in a simplified way,

    d' = H m + \epsilon

which is the standard expression for the inverse problem to be solved. Due to the effect of the D matrix, the ill-posedness of the original expression can be significantly reduced (depending on the chosen value of α). The advantage of this approach is that one can impose additional constraints, like non-negativity.

3. We can also transform the system of equations in a similar way:

    d = Gm + \epsilon
    d = G D^{-1} D m + \epsilon
    d = G' m' + \epsilon

with

    G' = G D^{-1}, \qquad m' = D m

As you can see, we have not changed the condition of fitting the data, so that

    ‖d - G' m'‖^2 = ‖d - Gm‖^2

but we have also added a model norm of the form

    ‖m'‖^2 = ‖Dm‖^2

Note that for this to actually work, the matrix D needs to be invertible. Sometimes it is possible to do this analytically. We can also use the SVD at this stage.
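Here is a sketch of the first approach using simple difference operators. Note that these D1 and D2 are the (n−1)×n and (n−2)×n versions without the boundary rows shown above, and the test problem is invented for illustration:

```python
import numpy as np

def roughening_matrices(n):
    """First- and second-difference operators for a model of length n (no boundary rows)."""
    D1 = np.diff(np.eye(n), axis=0)        # (n-1) x n, rows like [-1, 1, 0, ...]
    D2 = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n, rows like [1, -2, 1, 0, ...]
    return D1, D2

def higher_order_tikhonov(G, d, D, alpha):
    """Minimize ||d - Gm||^2 + alpha^2 ||Dm||^2 (approach 1 in the list above)."""
    return np.linalg.solve(G.T @ G + alpha**2 * D.T @ D, G.T @ d)

# Illustrative use with a random G, just to show the mechanics
rng = np.random.default_rng(4)
n = 10
G = rng.standard_normal((15, n))
m_true = np.sin(np.linspace(0, np.pi, n))
d = G @ m_true + 0.01 * rng.standard_normal(15)
D1, D2 = roughening_matrices(n)
print(higher_order_tikhonov(G, d, D2, alpha=0.5))
```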
As a cautionary note, it is important to keep in mind that the Tikhonov regularization will recover the true model only to the extent that the assumptions behind the additional norm (be it ‖m‖ or ‖Dm‖) are correct. We would not expect to get the right answer in the previous examples, since the true model m_true is a delta function.
4.5 Fitting within tolerance

In real life, the data that we have acquired have some level of uncertainty. This means there is some random error which we do not know, but whose statistical distribution we think we know (e.g., normally distributed with zero mean and variance σ²). So, in this respect, we should not try to fit the data exactly, but rather fit them to within the error bars.

This method is sometimes called the discrepancy principle, but I prefer to use the term fitting within tolerance. In our inverse problem we want to minimize a functional with two norms,

    \min ‖Dm‖ \qquad \min ‖d - Gm‖

and to do that we were looking at the L-curve, using the damped least squares or the SVD approach, that is, choosing an α or a number p of non-zero singular vectors.

In fact, for data with uncertainties we should actually be looking at a system that looks like

    \min ‖Dm‖ \qquad \text{subject to} \qquad ‖d - Gm‖ ≤ T
4.5.1 Example 2

First, we need to figure out the value of T. In our example, we said that the errors were

    \epsilon \sim N(0, \sigma^2), \qquad \sigma = 10^{-6}

Since we have M = 20 points, we need to find a solution whose residual norm is

    T = ‖\epsilon‖_2 = \sqrt{ \sum_{i=1}^{20} \sigma_i^2 } = \sqrt{20 \times 10^{-12}} = 4.47 \times 10^{-6}
Now that we have our value of the tolerance T, we can go back to our initial problem,

    \min_m ‖m‖_2 \qquad \text{subject to} \qquad ‖d - Gm‖_2 ≤ T

and find the value of α, or the ideal value of p, that satisfies our new criterion.
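A minimal sketch of this selection rule; the helper below assumes the damped-least-squares form and a user-supplied list of candidate α values:

```python
import numpy as np

# Discrepancy principle ("fitting within tolerance"): pick the regularization level
# whose residual norm is closest to T = sqrt(M) * sigma.  Values below are illustrative.
M, sigma = 20, 1e-6
T = np.sqrt(M) * sigma
print("target residual norm T =", T)   # ~4.47e-6

def pick_alpha(G, d, alphas, T):
    """Scan candidate alphas and return the one whose misfit is closest to T."""
    def misfit(a):
        m = np.linalg.solve(G.T @ G + a**2 * np.eye(G.shape[1]), G.T @ d)
        return np.linalg.norm(d - G @ m)
    return min(alphas, key=lambda a: abs(misfit(a) - T))

# Example usage (with some G, d at hand):
# alpha_best = pick_alpha(G, d, alphas=np.logspace(-8, 0, 50), T=T)
```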
In this example I will use the same graphical presentation as in the previous example. Now, in addition to the L-curves obtained for the SVD and the Damped Least Squares methods, we have our threshold value T represented by a vertical dashed line. We pick the value on the L-curve that is closest to T. In the SVD approach, since we have discrete singular values, we choose the one that is closest, while for the DLS we can in fact get really close. In both cases, I just show approximate values, using my discretization of the α values used for plotting the figure.
Figure 4.3: Fitting within tolerance with the DLS and SVD approaches. Our preferred model is the blue one. Top-left: the L-curve for the residual norm and model norm; the SVD curve has been shifted upwards for clarity, and the value of T is shown as a vertical dashed line. Various choices of α around T are chosen. Top-right: true model (circles) and estimated models for the choices on the left. Bottom-left: the synthetic data (circles) and predicted data. Bottom-right: for the SVD, the singular values and Picard criterion are shown.
Figure 4.4: Resolution matrix and covariance matrix for the DLS (top two rows of panels) and SVD (bottom two rows of panels) approaches, while fitting within tolerance. Note that since the SVD approach is discrete in nature, we might not get an ideal selection, hence the repeated value of p. Using the filter-factor approach might lead to better results. Our preferred value is the column in the middle.