Sari Lasanen
March 3, 2014
Introduction to inverse problems (4 op)
Additional material:
Chapter 1
Inverse problems
Inverse problems belong to the field of applied mathematics, but they also contain
elements of pure mathematics. Several inverse problems contribute both to practical
applications and to the pure mathematics underneath them. Even the best mathematical
journals, like the Annals of Mathematics, contain publications about inverse problems. The
main scientific journals that are dedicated solely to inverse problems are Inverse Problems
(IP), Inverse Problems and Imaging (IPI), Journal of Inverse and Ill-posed Problems
and Inverse Problems in Science and Engineering. These journals are available at the Oulu
University Library, in particular through the Nelli portal (which also offers remote access).
Definition 1. The mapping F that takes the unknown x to the corresponding data y is
called the direct theory (also the forward mapping).

The inverse problem is to seek the x that has produced the data y. In mathematical terms,
the question is plainly about the determination of the inverse mapping F^{-1}, but we
will see later that inaccurate data makes things more complicated.
1.2 Examples of inverse problems and their typical
properties
Example 1
Direct problem: Add all numbers that are on the same row or on the same column or of
the same color.
? ? ? ? ?
? 1 5 7 ?
? 4 3 8 ?
? 6 2 9 ?
Inverse problem: Determine the numbers when only row, column and color sums are
known.
3 11 10 24 10
13 ? ? ? 13
15 ? ? ? 9
17 ? ? ? 10
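As a small aside, the color sums are really needed: row and column sums alone (6 equations for 9 unknowns) do not determine the numbers. A numpy sketch with the grid of Example 1 (the row-by-row flattening order is an arbitrary choice here):

```python
import numpy as np

# 3x3 grid flattened row by row: x = (x11, x12, x13, x21, ..., x33).
# Row-sum and column-sum equations only (no color sums): 6 equations, 9 unknowns.
A = np.zeros((6, 9))
for r in range(3):            # row sums
    A[r, 3 * r:3 * r + 3] = 1
for c in range(3):            # column sums
    A[3 + c, c::3] = 1

grid = np.array([[1, 5, 7],
                 [4, 3, 8],
                 [6, 2, 9]], dtype=float)
y = A @ grid.ravel()

# A different grid with the same row and column sums: move +1/-1 around a cycle.
other = grid + np.array([[1, -1, 0],
                         [-1, 1, 0],
                         [0, 0, 0]], dtype=float)
assert np.allclose(A @ other.ravel(), y)          # same data ...
assert not np.allclose(other, grid)               # ... different unknown

# The number of undetermined directions is dim N(A) = 9 - rank(A).
print(9 - np.linalg.matrix_rank(A))               # 4
```

The rank of A is only 5 (the three row sums and the three column sums share one linear relation: both add up to the total sum), so a 4-dimensional family of grids produces the same data.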
Example 2
Direct problem: Determine the function f ∈ C^1(0, 1) when its derivative f′(t) = 3t^2 and
initial value f(0) = 0 are given.

    f(t) = t^3.

This problem is easy to solve, but some difficulties arise when the given data is
inaccurate, say instead of f we are given

    g(t) = f(t) + (1/100) sin(100t),

which has derivative

    g′(t) = 3t^2 + cos(100t).
Solutions of inverse problems are often strongly disturbed by small changes
in data.
Figure 1.1: The disturbed data g is close to f ... but the corresponding derivatives are not!
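The phenomenon is easy to reproduce numerically with the functions of Example 2; the exact derivatives are used, so the only assumption below is the sampling grid on [0, 1]:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)
f = t ** 3
g = f + np.sin(100 * t) / 100          # disturbed data
df = 3 * t ** 2                        # f'(t)
dg = 3 * t ** 2 + np.cos(100 * t)      # g'(t)

# The data are uniformly close ...
assert np.max(np.abs(g - f)) <= 0.01
# ... but the derivatives differ by 1 in the sup norm.
assert np.isclose(np.max(np.abs(dg - df)), 1.0, atol=1e-3)
```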
Example 3
In image deblurring, one tries to improve a blurred photo.

Direct problem: Mathematically model how a crisp photo can be transformed into a
blurred photo.

Inverse problem: Given the blurred photo, reconstruct the crisp photo.

A black-and-white image can be represented as a matrix M ∈ R^{n×m},
whose elements M_ij represent the pixel color: the larger the value, the lighter the color
(see Figures 1.2 and 1.3).
Figure 1.2: A black-and-white picture consists of pixels (rectangular elements of a single
color).
Figure 1.3: A 9×9 matrix of grey-shade pixels and the scale of pixel colors.
Image blurring can be modeled with a Gaussian convolution (choose n = m for simplicity)

    M̃_kl = C_kl Σ_{i,j=1}^{n} e^{-(|k-i|^2/n^2 + |l-j|^2/n^2)/2} M_ij,

where the normalization constant is

    C_kl = ( Σ_{i,j=1}^{n} e^{-(|k-i|^2/n^2 + |l-j|^2/n^2)/2} )^{-1}.
For a small image n, m = 256, but for a good-quality picture n and m can be several
thousand, so that the matrix contains millions of elements. In inverse problems, the
unknown is often very high dimensional.
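A direct numpy implementation of the blurring formula above (a sketch; it exploits the fact that the Gaussian weight factorizes into row and column parts):

```python
import numpy as np

def blur(M):
    """Gaussian blur of an n-by-n image M as in the convolution formula."""
    n = M.shape[0]
    i = np.arange(n)
    # w[k, i] = exp(-|k - i|^2 / (2 n^2)); the 2-D weight factorizes as w[k,i] w[l,j].
    w = np.exp(-np.subtract.outer(i, i) ** 2 / (2.0 * n ** 2))
    weighted = w @ M @ w.T                  # sum_{i,j} w[k,i] w[l,j] M[i,j]
    C = w.sum(axis=1)                       # normalization: C_kl^{-1} = C[k] * C[l]
    return weighted / np.outer(C, C)

M = np.zeros((9, 9))
M[4, 4] = 1.0                               # a single bright pixel
B = blur(M)
assert np.isclose(blur(np.ones((9, 9))).max(), 1.0)   # constant images are preserved
assert B[4, 4] < 1.0 and B[0, 0] > 0.0                # brightness spreads out
```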
Example 4
A weather radar transmits electromagnetic pulses in microwave frequencies (5600-5650
MHz, wavelength approximately 5.3 cm). The pulses are reflected back from obstacles,
like rain drops, snowflakes and hails. The weather radar detects the reflected pulses, and
the pulse travel time tells the distance between the transmitter and the obstacle. Typical
maximum detection range of a weather radar is around 250 km. Radar measurements
can be made in several directions by moving the transmitting/receiving antenna. The
magnitude of the reflected signal tells how heavy the rain is. Doppler phenomenon tells
also how fast the rain drops are moving.
Direct problem: Determine the reflected signal when the distribution of the rain drops
(and their speed) is known.
Inverse problem: Determine the distribution of the rain drops (and their speed) when the
reflected signal is known.
For simplicity, let's consider here the mathematical description of the problem for a
single object. The transmitted signal is of the form

    s(t) = P e(t) sin(ω_0 t),

where ω_0 is the carrier frequency, P is the transmitted power and e(t) describes the pulse
shape. The equation of motion for a single object is

    r(t) = x_2 + x_3 t + (1/2) x_4 t^2,

where x_2 is the distance to the radar, x_3 is the velocity of the object and x_4 is the
acceleration of the object. The reflected signal is modeled as

    z(t) = x_1 e(t - (2/c) x_2) exp( -i (2ω_0/c) (x_3 t + (1/2) x_4 t^2) ),
where x_1 is the power of the reflected signal and c is the speed of light. The power x_1
satisfies the so-called radar equation

    x_1 = C σ P / ( (4π)^2 x_2^4 ),

where C is a constant (independent of the radar) and the radar cross section σ depends
on the reflectivity and size of the object.

Inverse problem: Determine (x_1, x_2, x_3, x_4) when the function t ↦ z(t) (or some values
z(t_k), t_k > 0, k = 1, ..., m) is given.
In inverse problems, one uses indirect data about the unknown object.
Example 5
In medical computerized tomography (CT scan), one forms slice images of interior parts
of the patient from x-ray data
Different tissues (like muscle and bone) absorp different amounts of x-ray radiation:
their mass absoption coefficients have different values. When the total x-ray absorption
across the body is measured in several different directions, one can form slice images of
the body. Actually, one reconstructs the mass absorption coefficient across the body.
Figure 1.6: Construction of slice images vs. ordinary x-ray images. In an ordinary x-ray
image, only the absorption directly through the body is seen. In tomography, several
directions are used to create a slice image.
Let (x, y) ↦ f(x, y) be a piecewise continuous function that represents the mass
absorption coefficient at the point (x, y). When I_0 is the intensity of the transmitted x-ray
radiation and I_1 is the intensity of the received x-ray radiation that has traveled through
the body along the path C with parametrization r(t), t ∈ [t_1, t_2], then

    ln(I_0 / I_1) = ∫_C f ds = ∫_{t_1}^{t_2} f(r(t)) |r′(t)| dt,

where the term in the middle denotes the path integral of f over the path C. Here we
assume for simplicity that the cross-section of the body is contained in the square S =
[-1, 1] × [-1, 1], so that f(x, y) = 0 when |x| > 1 or |y| > 1.
Figure 1.7: The path integrals of function f are calculated along different paths (like in
this picture) that are straight lines.
Direct problem: Calculate the path integrals of f along all different line segments C that
start and end at the boundary of S.

Inverse problem: Determine the function f when its path integrals are known along all
different line segments C that start and end at the boundary of S.
The inverse problem can be solved with the help of Fourier analysis. However,
this is outside the scope of this course. Therefore, we proceed to the following practical
problem.
In practice, the measurements cannot be taken along every line, but only along finitely
many lines. The fewer lines there are, the less information there is about the unknown
function f. The problem is that different functions can have the same integrals along finitely
many lines. For example, take f(x, y) = x^2 + y^2 when (x, y) ∈ B(0, 1) and
f(x, y) = 0 otherwise. Then the integral of f along C = {(x, y) ∈ R^2 : y = 0, -1 ≤ x ≤ 1}
is

    ∫_{-1}^{1} x^2 dx = 2/3,

which is the same as the integral of the function f̃(x, y) = 1/3 on B(0, 1) (and 0 otherwise).
Since f is rotationally symmetric, the integral along any straight line C = {(x, y) ∈ R^2 :
y = ax, (x, y) ∈ B(0, 1)}, a ∈ R, has the same value.
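This can be verified numerically; the sketch below integrates both functions along the chord y = ax of the unit disc, parametrized by x (the trapezoid rule is written out to keep the code self-contained):

```python
import numpy as np

def trapezoid(vals, x):
    """Trapezoid rule, written out for self-containedness."""
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(x)) / 2.0)

def chord_integral(fun, a, n=200001):
    """Path integral of fun over the chord y = a x of the unit disc B(0, 1)."""
    r = 1.0 / np.sqrt(1.0 + a * a)          # the chord is |x| <= r
    x = np.linspace(-r, r, n)
    ds = np.sqrt(1.0 + a * a)               # arc-length factor of x -> (x, a x)
    return trapezoid(fun(x, a * x) * ds, x)

f = lambda x, y: x ** 2 + y ** 2            # the rotationally symmetric example
f_tilde = lambda x, y: np.full_like(x, 1.0 / 3.0)

for a in [0.0, 0.5, 1.0, 3.0]:
    assert np.isclose(chord_integral(f, a), 2.0 / 3.0, atol=1e-6)
    assert np.isclose(chord_integral(f_tilde, a), 2.0 / 3.0, atol=1e-6)
```

Both functions produce the value 2/3 on every line through the origin, so this family of measurements cannot distinguish them.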
In tomography, the limitedness of the data is usually compensated by restricting the form
of the allowed solutions. Instead of an arbitrary function, we allow the solution to be only
of the form

    f(x, y) = Σ_{j=1}^{n} c_j φ_j(x, y),

where n is a fixed positive number, the functions φ_j are known and the coefficients c_j ∈ R are
unknown. For example, the functions φ_j can be characteristic functions of disjoint rectangles
(the pixels):

    φ_j(x, y) = 1 when (x, y) ∈ I_j, and φ_j(x, y) = 0 otherwise.
(Figure: the square S divided into rectangular pixels I_1, ..., I_24.)
Let's assume that there are m paths, say C_i, whose parametrizations are r_i : [t_1, t_2] →
R^2, i = 1, ..., m. Then the data is modeled by the equations

    y_i = ∫_{C_i} f ds = ∫_{t_1}^{t_2} Σ_{j=1}^{n} φ_j(r_i(t)) |r_i′(t)| c_j dt = Σ_{j=1}^{n} M_ij c_j,

where

    M_ij = ∫_{t_1}^{t_2} φ_j(r_i(t)) |r_i′(t)| dt
and i = 1, ..., m, j = 1, ..., n. The inverse problem is now to determine the coefficients
c = (c_1, ..., c_n) when the vector y = (y_1, ..., y_m) is given. The matrix M of the direct theory
is known, because the functions φ_j and the paths C_i are known. Additionally, the data will also
contain disturbances!
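A minimal sketch of this discretization, under simplifying assumptions of my own (square pixels and purely horizontal measurement lines, so that M_ij is just the length of line i inside pixel j):

```python
import numpy as np

npix = 4                                     # npix x npix pixels covering S = [-1, 1]^2
edges = np.linspace(-1.0, 1.0, npix + 1)
width = edges[1] - edges[0]

ys = np.linspace(-0.9, 0.9, 6)               # heights of the horizontal lines

# M[i, j] = length of line i inside pixel j (pixels numbered row by row).
M = np.zeros((len(ys), npix * npix))
for i, y in enumerate(ys):
    for row in range(npix):
        if edges[row] <= y < edges[row + 1]:
            for col in range(npix):
                M[i, row * npix + col] = width

assert np.allclose(M.sum(axis=1), 2.0)       # every line crosses the full square

c = np.zeros(npix * npix)
c[5] = 1.0                                   # "body": coefficient 1 in one pixel
y_data = M @ c                               # the modeled measurements

# Horizontal lines alone cannot separate pixels within the same row:
c_other = np.zeros(npix * npix)
c_other[6] = 1.0                             # a different pixel in the same row
assert np.allclose(M @ c_other, y_data)
```

The last assertion shows the discretized problem is still ill posed with this poor choice of measurement lines; in practice lines in several directions are used.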
Figure 1.10: CT scan: different shades represent different values of f. (Image: Siemens Press
Picture.)
Example 6

In electrical impedance tomography (EIT), small currents are fed into the body through
electrodes on the surface and the resulting voltages are measured; the aim is to reconstruct
the electrical conductivity σ inside the body. The electric potential, a
function u ∈ C^2(D) ∩ C^1(D̄), satisfies

    ∇ · (σ ∇u)(x) = 0,  x ∈ D,
    u(x) = f(x),  x ∈ ∂D.
This type of inverse problem is more commonly known as an inverse boundary value
problem.
The solution of this particular inverse problem is known for quite general D and σ.
The solution method requires knowledge of e.g. partial differential equations, so it is not
within the scope of this course. However, the last chapter contains some relevant methods
for the practical solution of the problem.
Where is it used?
Noninvasive testing (e.g. integrity tests for glass jars, airplane wings, bridges).
The problem also has a coarser counterpart that is widely used: body composition
monitoring by bioelectrical impedance analysis. The measuring setup is the same: small
currents are fed to the body and voltages are measured. The main difference to EIT is
the forward model: instead of the more accurate partial differential equation, the forward
theory consists of a parametrized coarse approximation, where the main assumption is
that the person's body is a cylinder that has the same height as the person and the
water content of the body corresponds to the volume of the cylinder. The voltage is used to
calculate the water content of the cylinder. Other parameters that are commonly used
to tune the model are age, weight and gender.
Inverse problems are a way to extract information about objects that are
difficult or impossible to study otherwise.
Example 7
Medical ultrasound imaging gives a picture of patients interior structures on the basis of
sound waves. The main principle is the following: sound pulses are transmitted inside a
patient (at frequency of 2-15 MHz). The pulses partially reflect backwards from different
structures of the body. The backscattered signal is receive and transformed to brightness
values. The procedure is repeated along different measurement lines.
Figure 1.12: Ultrasound imaging 1. Pulse reflects from boundaries (the same color cor-
responds here to homogeneous structure)
Ultrasound imaging is based on several physical simplifications that are more or less
inaccurate. First of all, the sound speed is assumed to be constant regardless of the
tissue. This makes the sizes of the organs slightly distorted. Moreover, the model does
not take into account multipath propagation and wave diffraction, which can place objects
in wrong places in the images. Very rough surfaces can also produce speckled images.
Figure 1.13: Ultrasound imaging 2. The backscattered pulse (blue line) is received and trans-
formed to brightness values (red curve) by using an envelope curve.
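The envelope transformation of Figure 1.13 can be sketched with the analytic-signal trick: zero out the negative frequencies of the received signal and take the modulus. The synthetic pulse below (carrier frequency and pulse width are made-up values) stands in for a real backscattered signal:

```python
import numpy as np

n = 4096
t = np.linspace(0.0, 1.0, n, endpoint=False)
window = np.exp(-(t - 0.5) ** 2 / 0.005)       # pulse envelope (the "brightness")
signal = window * np.cos(2 * np.pi * 100 * t)  # synthetic backscattered pulse

# Analytic signal via the FFT: keep DC, double positive frequencies, drop negative.
X = np.fft.fft(signal)
h = np.zeros(n)
h[0] = 1.0
h[1:n // 2] = 2.0
h[n // 2] = 1.0
envelope = np.abs(np.fft.ifft(X * h))

assert np.max(np.abs(envelope - window)) < 0.02  # envelope recovered accurately
```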
A more precise mathematical model would be to use the physics of acoustic wave
propagation through inhomogeneous media. A time-harmonic acoustic wave u in a domain
D ⊂ R^n satisfies the equation

    Δu(x) + (ω^2 / c^2(x)) u(x) = 0,  x ∈ D,

where ω is the frequency and c(x) is the speed of sound in the inhomogeneous media.
The sound source is described with the equation

    ∂u/∂n(x) = f(x),  x ∈ ∂D,

where n(x) is the outer normal vector of the surface ∂D. The sound wave at the surface
is

    g(x) = u(x),  x ∈ ∂D.

The function x ↦ u(x) is connected to the physical pressure wave p(x, t) through the equation

    p(x, t) = Re( u(x) e^{-iωt} ).
Direct problem: Determine u when f and c are given.

Inverse problem: Determine the speed of sound c when f and g are given.

The same acoustic equation can be used to describe the propagation of seismic waves
(earthquake waves). This method is used in mapping the inner structure of the Earth.
Sound waves also propagate very well in water, which leads to the so-called sonar imaging.
Example 8
Inverse scattering problems are a class of mathematically challenging inverse problems.
In inverse scattering, a wave (e.g. sound or electromagnetic) is transmitted towards an
unknown obstacle or inhomogeneity, which then distorts the propagating wave. The
distortion of the incident wave is described by a scattered wave, which is finally observed
far away from the unknown. The inverse problem is to deduce the properties of the
unknown from the observations of the scattered wave.
Figure 1.14: The scattering. The incident wave is u^i. The scatterer produces the scattered
wave u^s. The total wave is u = u^i + u^s.
In mathematical scattering theory, the term wave is usually replaced with the term
field, which means a multivariate function. A common simplification is to assume that
the time-dependence is time-harmonic, meaning that u(x, t) = e^{-iωt} u(x), where ω is the
frequency of the incident wave. Additionally, we require the radiation condition

    lim_{|x|→∞} |x| ( (x/|x|) · ∇u^s(x) - i (ω/c) u^s(x) ) = 0  uniformly in every direction x/|x|.
The function c(x) describes the speed of sound in the media. It is a physical quantity depend-
ing on the structure of the media (e.g. its molecular composition). Above, it is assumed
that c > 0 is a smooth function that has a constant value far away from the unknown.

In the direct scattering problem, one determines the scattered field u^s when u^i and c are
known. The incident field is usually a plane wave u^i(x) = e^{iωa·x}, where a is a unit vector.
In the inverse acoustic scattering problem, one tries to determine the function c when u^s
is given far away from the unknown scatterer and also u^i is known.
Let E and H denote the electric field and the magnetic field, respectively. In isotropic
media, these fields satisfy the following Maxwell's equations

    ∇ × E(x, t) + μ_0 ∂H/∂t (x, t) = 0,
    ∇ × H(x, t) - ε(x) ∂E/∂t (x, t) = σ(x) E(x, t).

Assuming time-harmonic time dependence gives

    E(x, t) = ε_0^{-1/2} E(x) e^{-iωt},   H(x, t) = μ_0^{-1/2} H(x) e^{-iωt},

where ω is the frequency and ε_0, μ_0 are the permittivity and permeability of the vacuum.
The corresponding time-harmonic Maxwell's equations are (in the standard scaling)

    ∇ × E - ik H = 0,   ∇ × H + ik n(x) E = 0,

where k = ω (ε_0 μ_0)^{1/2} is the wave number and n(x) = (ε(x) + i σ(x)/ω) / ε_0 is the
refractive index, together with the radiation condition

    lim_{|x|→∞} ( H^s × x - |x| E^s ) = 0

uniformly in every direction x/|x|.

Inverse problem: Determine n(x) when H^s and E^s are given far away from the scatterer
and E^i and H^i are known.
1.3 Classification of inverse problems
(A) Mathematical inverse problems. For example

Inverse scattering problems (e.g. scattering from media, data acquired with
single or several frequencies or directions)

Inverse boundary value problems (like electrical impedance tomography)

Mathematical tomography

Inverse initial value problems

Inverse eigenvalue problems
(B) Computational inverse problems. For example

Image enhancement

Remote sensing (including ecological, geological and astronomical targets)

Medical imaging

Noninvasive testing (including industrial process monitoring)

Retrospective inverse problems (like determination of the source of pollution
particles in the atmosphere)

Biological inverse problems (like phylogenetic problems: from DNA differ-
ences, determine the family tree of different species)

Economic problems (determine the parameters of economic models)
1.4 Recap

In inverse problems, one uses indirect observations to obtain information about the un-
known target. Applications can be found in several fields (where?). Inverse problems
can be divided into mathematical and computational problems. Typical properties are the
sensitivity of solutions to small changes in the data, very high-dimensional unknowns,
and the use of indirect data.

Please be prepared to

mention practical examples of inverse problems that utilize indirect data (direct
data = observing the values of the unknown, indirect data ≠ direct data),

explain what is meant by image deblurring when the blurring function is given,

explain what is meant by an inverse boundary value problem when the equations
of the direct problem are given,
explain what is meant by an inverse scattering problem when the equations of the
direct scattering are given,
Chapter 2
In this chapter, we study inverse problems in normed vector spaces. Especially, we pay
attention to the solvability and stability of the inverse problem.
Definition 2. A set V is a linear vector space over the scalar field K if there exist
mappings V × V ∋ (x, y) ↦ x + y ∈ V and K × V ∋ (λ, x) ↦ λx ∈ V that satisfy

1. (x + y) + z = x + (y + z)

2. x + y = y + x

3. there exists 0 ∈ V such that x + 0 = x for every x ∈ V

4. for every x ∈ V there exists (-x) ∈ V such that x + (-x) = 0 (we also denote
x - y := x + (-y))

5. λ(μx) = (λμ)x

6. 1x = x

7. λ(x + y) = λx + λy

8. (λ + μ)x = λx + μx

Definition 3. A normed vector space is a pair (V, ‖·‖), where V is a linear vector space
over the field K and x ↦ ‖x‖ is a real-valued mapping such that

1. ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0

2. ‖λx‖ = |λ| ‖x‖

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖

for every x, y ∈ V and λ ∈ K.
Let (V, ‖·‖) be a normed vector space. The set

    B(x_0, r) = {x ∈ V : ‖x - x_0‖ < r},

where x_0 ∈ V and r > 0, is called an open ball at x_0 with radius r. The (ordinary)
topology of a normed space is defined by open balls (a set U ⊂ V is open if and only if
for every x ∈ U there exists r > 0 such that B(x, r) ⊂ U).
Let V_1 and V_2 be two linear vector spaces with norms ‖·‖_1 and ‖·‖_2, respectively.
Let D ⊂ V_1. Recall that a function F : D → V_2 is continuous at x_1 ∈ D if for every
ε > 0 there exists δ > 0 so that ‖F(x_1) - F(x_2)‖_2 < ε if ‖x_1 - x_2‖_1 < δ and x_2 ∈ D. The
function F is continuous on D if it is continuous at every x ∈ D.
The linear vector space R^n, n ≥ 1, is equipped with the usual norm, where the norm |x|
of a vector x = (x_1, ..., x_n) ∈ R^n is

    |x| = ( Σ_{i=1}^{n} |x_i|^2 )^{1/2}.
Recall that a finite-dimensional vector space suits well for describing the unknowns in
imaging problems, since an image can be represented as a matrix (which, in turn, can be
represented as a vector by rearranging the elements).
For every y ∈ W there has to be x ∈ V such that y = F(x). In other words, the
direct theory F : V → W needs to be a surjection, meaning that F(V) = W.
Sub-problem A: Characterisation. What is the image F(V)? Characterise
those y ∈ W that correspond to unknowns x ∈ V.

When F is both injective and surjective, then it is a bijection and the inverse
mapping F^{-1} : W → V exists. The inverse mapping F^{-1} : W → V has to be
continuous.
In practical inverse problems there is a rule of thumb: the given data is never
exactly the same as in the mathematical formulation. There are several rea-
sons for this. (i) Measurement equipment has limited accuracy. (ii) Electrical
devices suffer from intrinsic errors, like thermal noise. (iii) Direct theory is not
necessarily completely correct. It may contain approximations. (iv) There can
be external disturbances in the measurement environment. (v) In numerical
calculations, the real numbers are replaced by floating point numbers that are
of finite accuracy.
When F^{-1} is continuous at y_1 ∈ W, then for a given ε > 0 there is δ > 0 such
that ‖F^{-1}(y_1) - F^{-1}(y_2)‖ < ε whenever y_2 ∈ W and ‖y_1 - y_2‖ < δ. Especially,
if y_1 = F(x_1) for some x_1 ∈ V and y_2 ∈ W is of the form

    y_2 = F(x_1) + e,

then F^{-1}(y_2) is close to x_1 when the error e is small.

Sub-problem C: stability. How do small changes in the data disturb the corre-
sponding mathematical solutions?
Example 1. (Uniqueness) An object with mass m experiences a time-dependent force
F(t) = (F_1(t), F_2(t), F_3(t)), where the real functions F_i are continuous for i = 1, 2, 3. The
path of the object is described with a function g(t) = (g_1(t), g_2(t), g_3(t)). When F and
g(0), g′(0) are known, then the path g(t) of the object satisfies

    F(t) = m g″(t),  t ∈ (0, 1)    (2.1)
    g′(0) = (v_1, v_2, v_3)    (2.2)
    g(0) = (a_1, a_2, a_3).    (2.3)

Let

    V_1 = C([0, 1]; R^3) = {F = (F_1, F_2, F_3) : F_i : [0, 1] → R, i = 1, 2, 3, is continuous}

equipped with the norm ‖F‖_1 = sup_{t∈[0,1]} |F(t)| and

    V_2 = C^2([0, 1]; R^3) = {g = (g_1, g_2, g_3) : g_i : [0, 1] → R, i = 1, 2, 3, is twice
    continuously differentiable and the derivatives are continuous up to the end-points}

equipped with the norm

    ‖g‖_2 = sup_{t∈[0,1], k=0,1,2} |d^k g / dt^k (t)|.
Inverse problem: Determine F ∈ V_1 when the object moves along the path g ∈ V_2.

Let's prove the uniqueness. Let F_1, F_2 ∈ V_1 be two forces that satisfy (2.1)-(2.3)
when the path g is given. Then

    ‖F_1 - F_2‖_1 = sup_{t∈[0,1]} |F_1(t) - F_2(t)| = sup_{t∈[0,1]} |m g″(t) - m g″(t)| = 0.

Hence F_1 = F_2. A similar inverse problem was used in the 19th century to find the planet
Neptune. The data was the path of a known planet, Uranus, whose orbit had curious
irregularities that could not be explained unless the gravitational pull of an unknown planet
affected it.
Example 2. (Reconstruction) Let V_1 = V_2 = C([0, 1]) equipped with the norm ‖f‖ =
sup_{t∈[0,1]} |f(t)|. Let's consider the direct theory

    F f(t) = ∫_0^1 f(ts) ds.

Let g ∈ C([0, 1]) be given. The inverse problem is to determine f ∈ C([0, 1]) so that g = F f.

Because F f(0) = f(0), we may assume that t ≠ 0 in the following. By the change of
variables ts = r we get

    g(t) = F f(t) = (1/t) ∫_0^t f(r) dr.

Especially

    ∫_0^t f(r) dr = t g(t),

and we may differentiate (by using the fundamental theorem of calculus) to obtain

    f(r) = d/dt ( t g(t) ) |_{t=r}

when r ∈ (0, 1].
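The reconstruction formula can be sanity-checked numerically with a test function, say f(t) = t^3 (an arbitrary choice for the check; then g(t) = t^3/4):

```python
import numpy as np

f = lambda t: t ** 3                         # hypothetical test unknown

def g(t, n=20001):
    """Direct theory g(t) = int_0^1 f(t s) ds, by the trapezoid rule."""
    s = np.linspace(0.0, 1.0, n)
    vals = f(t * s)
    return float(np.sum((vals[1:] + vals[:-1]) * (s[1] - s[0])) / 2.0)

def f_reconstructed(r, h=1e-4):
    """Reconstruction f(r) = d/dt (t g(t)) at t = r, by a central difference."""
    return ((r + h) * g(r + h) - (r - h) * g(r - h)) / (2.0 * h)

for r in [0.2, 0.5, 0.9]:
    assert abs(f_reconstructed(r) - f(r)) < 1e-5
```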
2.2 Ill-posed inverse problems
Definition 6. If the problem is not well-posed, then it is ill-posed.
(∗) Find x ∈ V that satisfies y = F(x) when the given data y ∈ W is arbitrary.
1. There is no solution.
We may end up in this situation if the given data contains disturbances. For
example, if we are given y = F(x) + e instead of y = F(x), where e ∈ V_2 is
not known and y ∉ Im(F). Despite this fact, we would like to extract information
about the unknown x.
Example 3. Let us consider the Fredholm integral equation of the first kind

    g(x) = ∫_0^1 K(x, y) f(y) dy,  x ∈ [0, 1],    (2.4)

where the integral kernel (x, y) ↦ K(x, y) is a C^1-function. Inverse problem: deter-
mine the function f ∈ C([0, 1]) when the function g ∈ C([0, 1]) is given.
When g is a continuous function that is not differentiable, then the solution does
not exist. Indeed, the right-hand side of the integral equation (2.4) is always
differentiable, since it holds that

    (1/h) ( ∫_0^1 K(x+h, y) f(y) dy - ∫_0^1 K(x, y) f(y) dy )
      = ∫_0^1 ( (K(x+h, y) - K(x, y)) / h ) f(y) dy
      = ∫_0^1 ( (1/h) ∫_x^{x+h} ∂_x K(x′, y) dx′ ) f(y) dy
      = (1/h) ∫_x^{x+h} ∫_0^1 ∂_x K(x′, y) f(y) dy dx′.
Example 4. Let the direct theory be a linear mapping F : C(0, 1) → C^1(0, 1) that
takes the function f into its integral

    F f(x) = ∫_0^x f(y) dy.
The inverse problem is to determine f ∈ C(0, 1) so that F f = g when such g ∈ C^1(0, 1)
is given that g(0) = 0. The solution is obtained by differentiating, i.e. f = g′, and
it is easy to see that the solution is unique. However, if the function g is only given
at points t_1, ..., t_n ∈ (0, 1), then there exist several different functions g that give
the same data g(t_i), i = 1, ..., n. Each function that is compatible with the data
gives a different derivative g′ = f.
Example 5. In practical inverse problems the unknown is often a higher-dimensional
vector than the given data vector. One simple example is any matrix equation

    y_i = Σ_{j=1}^{n} M_ij x_j,

where i = 1, ..., m and n > m. Then there are n unknown variables x_j and they
satisfy only m equations. In the next chapter, we consider more linear finite-
dimensional inverse problems.
Example 6. See Example 5 of Chapter 1.

but

    ‖Dg‖ = sup_{x∈[0,1]} |cos(x)| = 1.
2.3 Recap
Inverse problems can be either well posed or ill posed.
Please memorise
Chapter 3
Definition 7. An inverse problem is called finite-dimensional if the unknown and the data
belong to finite-dimensional vector spaces. An inverse problem is called linear if the direct
theory is linear.
When x = (x_1, ..., x_n) and y = (y_1, ..., y_n) ∈ K^n, then the norm and the inner
product satisfy

    |x|^2 = (x, x),

and the Cauchy–Schwarz inequality

    |(x, y)| ≤ |x| |y|,

i.e.

    | Σ_{i=1}^{n} x_i ȳ_i | ≤ ( Σ_{i=1}^{n} |x_i|^2 )^{1/2} ( Σ_{i=1}^{n} |y_i|^2 )^{1/2}.

Moreover, (ax + by, z) = a(x, z) + b(y, z) when a, b ∈ K and x, y, z ∈ K^n.
A basis of a linear subspace V ⊂ K^n consists of linearly independent vectors
{e_1, ..., e_k} that satisfy

    V = {x ∈ K^n : x = Σ_{i=1}^{k} a_i e_i, a_i ∈ K, i = 1, ..., k}.

The basis is orthonormal if |e_i| = 1 for every i and (e_i, e_j) = 0
when i ≠ j and i, j = 1, ..., k.
Notation: (x_1, ..., x_n) denotes the column vector (x_1, ..., x_n)^T, and the natural basis
vectors of R^m are

    f_1 = (1, 0, 0, ..., 0), f_2 = (0, 1, 0, ..., 0), ..., f_m = (0, ..., 0, 1).
When x = Σ_{j=1}^{n} x_j e_j, then

    F(x) = F( Σ_{j=1}^{n} x_j e_j ) = Σ_{j=1}^{n} x_j F(e_j).

Especially

    F(x)_i = (F(x), f_i) = Σ_{j=1}^{n} x_j (F(e_j), f_i) = Σ_{j=1}^{n} M_ij x_j,

where

    M_ij = (F(e_j), f_i),
i.e. F(x) = M x when x ∈ V. Here all matrices are defined with respect to the natural basis
of R^n.
3.1.1 Injectivity
Recall the following result from linear algebra.
Theorem 1. Let V ⊂ R^n be a linear subspace. The linear mapping F : V → R^m is an
injection if and only if its kernel N(F) = {x ∈ V : F(x) = 0} contains only the zero
vector. This is equivalent to the fact that the kernel N(M) = {x ∈ R^n : M x = 0} of
the matrix M of the linear mapping F satisfies V ∩ N(M) = {0}.
Proof. See courses of linear algebra.
Corollary 1 (Identifiability). The solution of the inverse problem (∗) is unique if and only
if N(F) = {0}. This is equivalent to V ∩ N(M) = {0}.

The direct theory in (∗) is injective if and only if M x = 0 has only the trivial solution in
the subspace V.
Example 8. Let W = R^2, V = R^3 and

    M = [ 1 1 0
          0 0 1 ].

Then M x = 0 if and only if x_1 + x_2 = 0 and x_3 = 0. In other words,

    N(M) = {(x_1, -x_1, 0) : x_1 ∈ R} ≠ {0}.

The inverse problem (∗) is ill posed, because the solution is not unique.
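The kernel N(M) can also be computed mechanically, e.g. from the SVD; a small numpy sketch for the matrix of Example 8:

```python
import numpy as np

M = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# Null space from the SVD: right singular vectors beyond the rank span N(M).
_, s, Vt = np.linalg.svd(M)
rank = int(np.sum(s > 1e-12))
null_basis = Vt[rank:]

assert null_basis.shape == (1, 3)            # one-dimensional kernel
assert np.allclose(M @ null_basis.T, 0.0)

# The kernel direction is (x1, -x1, 0) up to scaling.
v = null_basis[0]
assert np.isclose(v[0], -v[1]) and np.isclose(v[2], 0.0)
```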
Remark 3. When V = R^n, W = R^m and n > m, the direct theory is not injective.
Remark 4 (Affine problems). An affine subspace V_af ⊂ R^n is a set such that V_af +
x_0 = {x + x_0 : x ∈ V_af} is a linear subspace for some x_0 ∈ R^n. If in Example 9 the
linear subspace V is replaced with an affine subspace, say V_af = {(x_1, x_2, x_3) ∈ R^3 :
x_1 + x_2 + x_3 = 1}, then either (a) the uniqueness has to be checked from the condition
M x = M x̃ ⟹ x = x̃, or (b) the affine problem needs to be reformulated as a linear problem.
In (b), we may proceed as follows. The unknown x = (x_1, x_2, x_3) has to satisfy x_1 + x_2 + x_3 = 1
in order to have x ∈ V_af. This equation can be added to the matrix equation, and we obtain
a new linear problem

    ( y_1 )   ( 1 0 0 ) ( x_1 )
    ( y_2 ) = ( 1 3 1 ) ( x_2 ) ,   i.e.  ỹ = M̃ x,
    ( 1   )   ( 1 1 1 ) ( x_3 )

where x ∈ R^3 and ỹ = (y_1, y_2, 1) ∈ R^3.
Example 10. Let us study Example 1 of Chapter 1. Denote by y = (y_1, ..., y_11) the
data vector and by x = (x_1, ..., x_9) the unknown vector. Let's form the direct theory
F : R^9 → R^11 from the row, column and color sums by using a matrix M ∈ R^{11×9}.

    y_1   y_2  y_3  y_4  y_5
    y_9   x_1  x_4  x_7  y_6
    y_10  x_2  x_5  x_8  y_7
    y_11  x_3  x_6  x_9  y_8

Chapter 1.2, Example 1: determine the numbers from their row, column and color sums.
Then y = M x with

    M = [ 0 0 0 0 1 0 0 0 0
          1 1 1 0 0 0 0 0 0
          0 0 0 1 1 1 0 0 0
          0 0 0 0 0 0 1 1 1
          0 1 1 0 0 0 0 0 0
          0 0 0 1 0 0 0 1 0
          0 0 0 0 0 1 1 0 0
          1 0 0 0 0 0 0 0 1
          1 0 0 1 0 0 1 0 0
          0 1 0 0 1 0 0 1 0
          0 0 1 0 0 1 0 0 1 ].
From linear algebra, we know that the equation 0 = M x has only the trivial solution x = 0
if and only if M can be transformed by the Gauss–Jordan elimination method into the
form

    M → [ I_9
          0  ],

that is, a 9×9 identity matrix followed by two zero rows.
Let's proceed with the Gauss–Jordan elimination method. Carrying out the row
reductions (the lengthy intermediate matrices are omitted here) shows that M indeed
reduces to the above form: a 9×9 identity block followed by two zero rows. Hence
0 = M x has only the trivial solution x = 0, and the solution of the inverse problem is
unique.
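The injectivity conclusion can be double-checked numerically; below M is the 11×9 matrix of Example 10 (rows ordered y_1, ..., y_11) and the data vector collects the sums from Example 1 of Chapter 1:

```python
import numpy as np

M = np.array([
    [0, 0, 0, 0, 1, 0, 0, 0, 0],   # y1
    [1, 1, 1, 0, 0, 0, 0, 0, 0],   # y2
    [0, 0, 0, 1, 1, 1, 0, 0, 0],   # y3
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # y4
    [0, 1, 1, 0, 0, 0, 0, 0, 0],   # y5
    [0, 0, 0, 1, 0, 0, 0, 1, 0],   # y6
    [0, 0, 0, 0, 0, 1, 1, 0, 0],   # y7
    [1, 0, 0, 0, 0, 0, 0, 0, 1],   # y8
    [1, 0, 0, 1, 0, 0, 1, 0, 0],   # y9
    [0, 1, 0, 0, 1, 0, 0, 1, 0],   # y10
    [0, 0, 1, 0, 0, 1, 0, 0, 1],   # y11
], dtype=float)

assert np.linalg.matrix_rank(M) == 9       # N(M) = {0}: the solution is unique

# Data from Example 1 of Chapter 1 (color, column and row sums).
y = np.array([3, 11, 10, 24, 10, 13, 9, 10, 13, 15, 17], dtype=float)
x, *_ = np.linalg.lstsq(M, y, rcond=None)
assert np.allclose(x, [1, 4, 6, 5, 3, 2, 7, 8, 9])   # the grid of Example 1
```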
3.1.2 Surjectivity
The inverse problem (∗) has a solution if for every y ∈ W there exists x ∈ V that
satisfies

    y = M x.
Lemma 1. Let V ⊂ R^n be a linear subspace. The image F(V) ⊂ R^m of the linear
mapping F : V → R^m, whose matrix is M, is a linear subspace that consists of those
linear combinations y of the columns M_i of M that satisfy

    y = Σ_{i=1}^{n} a_i M_i,

where a = (a_1, ..., a_n) ∈ V.

    y_i = Σ_{j=1}^{n} M_ij x_j,  i = 1, ..., m    (3.1)

Denote in (3.1) by (M_j)_i := M_ij the ith element of the jth column of M.
Remark 5. When V = R^n, W = R^m and n < m, then (∗) is ill posed, since the image
R(M) ⊂ R^m is, by Theorem 1, a linear subspace of dimension at most n, where n < m.
Let's recall more linear algebra. The transpose of a matrix M ∈ R^{m×n} is the matrix
M^T ∈ R^{n×m} with (M^T)_ij = M_ji, i = 1, ..., n, j = 1, ..., m. The inner product of R^n
satisfies

    (x, y) = y^T x = ( y_1 ⋯ y_n ) (x_1, ..., x_n)^T = Σ_{i=1}^{n} x_i y_i.
The matrix Q_V is called the orthogonal projection onto V. Writing E = (e_1 e_2 ⋯ e_k)
for the matrix whose columns are the orthonormal basis vectors of V, we have Q_V = E E^T.
Especially Q_V^T = Q_V and

    Q_V^2 = Q_V Q_V = E E^T E E^T = E ( (e_i, e_j) )_{i,j=1}^{k} E^T = E I_k E^T = Q_V,

since (e_i, e_i) = 1 and (e_i, e_j) = 0 when i ≠ j. A vector x ∈ V if and only if Q_V x = x,
because every x ∈ V can be represented as a linear combination x = Σ_{i=1}^{k} (x, e_i) e_i
of the basis vectors.
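With the orthonormal basis vectors collected as columns of E, the projection Q_V = E Eᵀ and its properties are quick to verify (a sketch with an arbitrarily chosen two-dimensional subspace of R^3):

```python
import numpy as np

# Orthonormal basis of a 2-D subspace V of R^3, stored as columns of E.
E = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [0.0, 0.0]]) / np.sqrt(2.0)
Q = E @ E.T                                  # orthogonal projection onto V

assert np.allclose(Q, Q.T)                   # Q_V^T = Q_V
assert np.allclose(Q @ Q, Q)                 # Q_V^2 = Q_V

x_in = np.array([2.0, -1.0, 0.0])            # a vector in V (the xy-plane here)
x_out = np.array([2.0, -1.0, 5.0])           # a vector not in V
assert np.allclose(Q @ x_in, x_in)           # Q_V x = x exactly when x is in V
assert not np.allclose(Q @ x_out, x_out)
```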
Then

    |F(x) - F(y)|^2 = |F(x - y)|^2 = Σ_{i=1}^{m} ( Σ_{j=1}^{n} M_ij (x_j - y_j) )^2.

By the Cauchy–Schwarz inequality,

    ( Σ_{j=1}^{n} M_ij (x_j - y_j) )^2 ≤ ( Σ_{j=1}^{n} M_ij^2 ) ( Σ_{j=1}^{n} (x_j - y_j)^2 ) = ( Σ_{j=1}^{n} M_ij^2 ) |x - y|^2,

which implies
    |F(x) - F(y)|^2 ≤ ( Σ_{i=1}^{m} Σ_{j=1}^{n} M_ij^2 ) |x - y|^2.

Therefore F is continuous.
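In other words, F is Lipschitz with constant at most the Frobenius norm of M. A quick randomized check (fixed seed, arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 6))
frob = np.sqrt(np.sum(M ** 2))               # (sum_{i,j} M_ij^2)^{1/2}

for _ in range(100):
    x, y = rng.normal(size=6), rng.normal(size=6)
    lhs = np.linalg.norm(M @ x - M @ y)
    assert lhs <= frob * np.linalg.norm(x - y) + 1e-12
```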
Suppose then that V is a linear subspace and {e_1, ..., e_k} its orthonormal basis. Denote
by x = Σ_{j=1}^{k} x_j e_j and y = Σ_{j=1}^{k} y_j e_j two arbitrarily chosen vectors from V.
Similarly as before,

    |F(x) - F(y)|^2 ≤ ( Σ_{i=1}^{m} Σ_{j=1}^{k} M̃_ij^2 ) Σ_{j=1}^{k} (x_j - y_j)^2,

where

    M̃_ij = (F(e_j))_i,  i = 1, ..., m, j = 1, ..., k.
Moreover

    Σ_{j=1}^{k} (x_j - y_j)^2 = ( Σ_{j=1}^{k} (x_j - y_j) e_j , Σ_{j=1}^{k} (x_j - y_j) e_j ) = |x - y|^2.
    F^{-1}(a y + b ỹ) = F^{-1}(a F x + b F x̃) = F^{-1}(F(a x + b x̃)) = a x + b x̃ = a F^{-1}(y) + b F^{-1}(ỹ).

By Theorem 3, F^{-1} is continuous.
A square matrix M is invertible if its determinant det(M) is non-zero. In such a case,

    M^{-1} = adj(M) / det(M),

where adj(M) is the adjugate matrix of M. In the case of a 2×2 matrix,

    det [ a b ; c d ] = ad - bc   and   [ a b ; c d ]^{-1} = (1/(ad - bc)) [ d -b ; -c a ].
Since det(M) = 1·1 - 2·3 = -5 ≠ 0, the matrix M has the inverse

    M^{-1} = [ 1 2 ; 3 1 ]^{-1} = -(1/5) [ 1 -2 ; -3 1 ].

Moreover,

    M x = M x̃ ⟹ M^{-1} M x = M^{-1} M x̃ ⟹ x = x̃.

The solution also depends continuously on the data by Corollary 2. Hence (∗) is well
posed.
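The 2×2 computation above is easy to check numerically (a throwaway sketch):

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 1.0]])
det = M[0, 0] * M[1, 1] - M[0, 1] * M[1, 0]
assert det == -5.0

# Adjugate formula for 2x2 matrices: swap the diagonal, negate the off-diagonal.
M_inv = np.array([[M[1, 1], -M[0, 1]],
                  [-M[1, 0], M[0, 0]]]) / det

assert np.allclose(M_inv @ M, np.eye(2))
assert np.allclose(M_inv, np.linalg.inv(M))
```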
Theorem 4 (The case of a square matrix). Let M ∈ R^{n×n} and V = W = R^n. The inverse
problem (∗) is well posed if and only if det(M) ≠ 0.

Proof. When det(M) ≠ 0, there exists the inverse M^{-1}. It follows that the solution exists
(x = M^{-1} y), it is unique (M x = M x̃ ⟹ M^{-1} M x = M^{-1} M x̃ ⟹ x = x̃) and the
solution depends continuously on the data by Corollary 2. Hence the inverse problem is
well posed. We also need to show that the inverse problem is ill posed when det(M) = 0.
In this case, the matrix M is not invertible, which implies that the linear mapping F is
not bijective. Because the direct theory of a well-posed problem is always a bijection, the
problem is ill posed.
Solution: The matrix $M$ is a square matrix and the direct theory maps between full spaces. By Theorem 4, we only need to calculate the determinant of $M$:

$$\begin{aligned}
\det(M) &= 11\,(11\cdot(-66) - (-13)\cdot 13) - 10\,(12\cdot(-66) - (-13)\cdot 14) + 14\,(12\cdot 13 - 11\cdot 14) \\
&= 11\cdot(-557) - 10\cdot(-610) + 14\cdot 2 = -6127 + 6100 + 28 = 1.
\end{aligned}$$
3.2 Ill conditioned solutions
An ill posed problem can be sensitive to disturbances of data but also well posed problems
can have different behaviors with respect to disturbances of fixed magnitude in data.
Loosely speaking, we say that a problem A is more ill conditioned than problem B if
disturbances of the same order produce larger changes in the solution of problem A than
in problem B.
$$M^{-1}y = x + M^{-1}\varepsilon = (1, 1, 1, 1, 1, 1, 1, 1.16) \quad\text{and}\quad \widetilde M^{-1}y = x + \widetilde M^{-1}\varepsilon = (1, 1, 1, 1, 1, 1, 1, 1 + 2^8\cdot 0.02).$$

The last element is changed by $2^8 \cdot 0.02 = 5.12$. Although the problem is well posed, the disturbed data produces a solution that is not a very accurate approximation of the unknown.
A well posed problem that is very ill conditioned closely resembles an ill posed problem where the solution does not depend continuously on the data. Ill conditioned problems need to be taken into account, since most practical situations generate inaccurate and noisy data.
Definition 8. The singular values of a matrix $M \in \mathbb{C}^{m\times n}$ are the numbers $\sigma_i(M)$ that are the square roots of the eigenvalues $\lambda_i$ of $M^*M$, i.e. $\sigma_i(M) = \sqrt{\lambda_i}$, where $i = 1, \ldots, n$.

The matrix $M^*M$ has only non-negative eigenvalues $\lambda_i$, since the corresponding eigenvectors $e_i$ satisfy

$$0 \le (Me_i, Me_i) = (M^*Me_i, e_i) = \lambda_i(e_i, e_i) = \lambda_i|e_i|^2.$$

Moreover, $\det(M^*M) = \det(M^*)\det(M) = \overline{\det(M^T)}\,\det(M) = |\det(M)|^2 \neq 0$ for non-singular matrices, so that zero is not an eigenvalue of $M^*M$ when $M$ is non-singular.
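The definition can be checked numerically with a generic matrix; the random example below is our own illustration, and the last check anticipates Theorem 5 below.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))        # a generic full-rank example matrix

# Singular values as square roots of the eigenvalues of M* M ...
eigvals = np.linalg.eigvalsh(M.T @ M)  # real, non-negative
sigma_from_eig = np.sort(np.sqrt(np.clip(eigvals, 0.0, None)))[::-1]

# ... agree with those returned by the SVD.
sigma_svd = np.linalg.svd(M, compute_uv=False)
assert np.allclose(sigma_from_eig, sigma_svd)

# For a non-singular square matrix, sigma_max(M^-1) = 1 / sigma_min(M).
A = rng.standard_normal((3, 3))
s_A = np.linalg.svd(A, compute_uv=False)
s_Ainv = np.linalg.svd(np.linalg.inv(A), compute_uv=False)
assert np.isclose(s_Ainv.max(), 1.0 / s_A.min())
```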
Different condition numbers can be defined by using different matrix norms.

Theorem 5. Let $M \in \mathbb{C}^{n\times n}$ be a non-singular matrix. The largest singular value of $M^{-1}$ is

$$\sigma_{\max}(M^{-1}) = \frac{1}{\sigma_{\min}(M)},$$

where $\sigma_{\min}(M)$ is the smallest singular value of $M$.
We apply here two lemmas.

Lemma 2. Let $A, B \in \mathbb{C}^{n\times n}$ be non-singular. Then $AB$ and $BA$ have equal eigenvalues.

Proof. Let's look at the characteristic polynomials. By the calculation rules for determinants, we obtain

$$\det(AB - \lambda I) = \det(A(B - \lambda A^{-1})) = \det(A)\det(B - \lambda A^{-1}) = \det(B - \lambda A^{-1})\det(A) = \det((B - \lambda A^{-1})A) = \det(BA - \lambda I).$$

The matrices $AB$ and $BA$ share the characteristic polynomial, which implies that their eigenvalues are equal.
Lemma 3. Let $A \in \mathbb{C}^{n\times n}$ be non-singular. The eigenvalues of the matrix $A^{-1}$ are the inverses of the eigenvalues of $A$.

Proof. Let's study the characteristic polynomial

$$\det(A - \lambda I) = \det(-\lambda A(A^{-1} - \lambda^{-1}I)) = (-\lambda)^n\det(A)\det(A^{-1} - \lambda^{-1}I).$$

Because $A$ is non-singular, its eigenvalues are non-zero. Hence $\lambda$ is a zero of the characteristic polynomial of $A$ if and only if $\lambda^{-1}$ is a zero of the characteristic polynomial of $A^{-1}$.
Proof (Theorem 5). Let us find the singular values of $M^{-1}$ by calculating

$$(M^{-1})^*M^{-1} = (M^*)^{-1}M^{-1} = (MM^*)^{-1},$$

whose eigenvalues are by Lemma 3 the inverses of the eigenvalues of $MM^*$. By Lemma 2, the eigenvalues of $MM^*$ are equal to the eigenvalues of $M^*M$. Therefore, the singular values of $M^{-1}$ are the inverses of the singular values of $M$. Especially, $\sigma_{\max}(M^{-1}) = \frac{1}{\sigma_{\min}(M)}$.
Theorem 6. Let $A: V \to V$ be a self-adjoint linear mapping on a finite-dimensional inner product space $V$. Then $V$ has an orthonormal basis that consists of eigenvectors of $A$.

Now $M^*M$ is self-adjoint (meaning that $(M^*M)^* = M^*M$). Denote by $\lambda_i$ the eigenvalues of $M^*M$ and by $e_i$ the corresponding eigenvectors. By Theorem 6, a vector $x$ can be represented as $x = \sum_{i=1}^n x_i'e_i$, where $(x_1', \ldots, x_n') \in \mathbb{R}^n$ are the coordinates of $x$ with respect to the basis $e_i$. Then the quadratic form (3.3) can be represented as

$$(M^*Mx, x) = \left(\sum_{i=1}^n x_i'M^*Me_i,\ \sum_{i=1}^n x_i'e_i\right) = \sum_{i=1}^n \lambda_i|x_i'|^2.$$

By estimating the expression from above with the largest eigenvalue, we obtain

$$|y| = |Mx| \le \max_{1\le i\le n}\sqrt{\lambda_i}\,|x| = \sigma_{\max}(M)|x| \quad\implies\quad |x| \ge \frac{|y|}{\sigma_{\max}(M)}, \tag{3.4}$$
where Lemma 3 has been applied. By (3.4) and (3.5), the relative errors satisfy

$$\frac{|\Delta x|}{|x|} \le \kappa(M)\,\frac{|\Delta y|}{|y|}.$$
The condition number hence gives an upper bound for the relative errors. When the condition number is large (say, $> 10^5$), even numerical rounding errors start to affect the inversion of a matrix.

Example 15. The identity matrix has condition number 1. This is the smallest possible condition number.
Example 16. In Example 14, the condition numbers are

$$\kappa(M) = 8 \quad\text{and}\quad \kappa(\widetilde M) = \frac{2^8}{2} = 128.$$
Example 17. Let's calculate the condition number of

$$M = \begin{pmatrix} 11 & 10 & 14 \\ 12 & 11 & -13 \\ 14 & 13 & -66 \end{pmatrix}.$$

First compute

$$M^TM = \begin{pmatrix} 11 & 12 & 14 \\ 10 & 11 & 13 \\ 14 & -13 & -66 \end{pmatrix}\begin{pmatrix} 11 & 10 & 14 \\ 12 & 11 & -13 \\ 14 & 13 & -66 \end{pmatrix} = \begin{pmatrix} 461 & 424 & -926 \\ 424 & 390 & -861 \\ -926 & -861 & 4721 \end{pmatrix}.$$
The eigenvalues are the zeros of the characteristic polynomial

$$p(\lambda) = \det\begin{pmatrix} 461-\lambda & 424 & -926 \\ 424 & 390-\lambda & -861 \\ -926 & -861 & 4721-\lambda \end{pmatrix},$$

so let's set

$$p(\lambda) = (461-\lambda)\big((390-\lambda)(4721-\lambda) - 861^2\big) - 424\big(424(4721-\lambda) - 861\cdot 926\big) - 926\big(-424\cdot 861 + 926(390-\lambda)\big) = 0. \tag{3.6}$$

The equation (3.6) has solutions $\lambda_1$, $\lambda_2$ and $\lambda_3$, whose square roots are

$$(\sqrt{\lambda_1}, \sqrt{\lambda_2}, \sqrt{\lambda_3}) \approx (0.0006,\ 21.8,\ 71.4).$$
Furthermore,

$$\det(M) = 11(11\cdot(-66) - (-13)\cdot 13) - 10(12\cdot(-66) - (-13)\cdot 14) + 14(12\cdot 13 - 11\cdot 14) = 1,$$

so $M^{-1} = \operatorname{adj}(M)$, i.e.

$$M^{-1} = \begin{pmatrix} -557 & 842 & -284 \\ 610 & -922 & 311 \\ 2 & -3 & 1 \end{pmatrix}.$$
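The determinant and singular values of this example can be checked numerically; the signs of the matrix entries below are read from the cofactor expansion above, which is a reconstruction.

```python
import numpy as np

# The example matrix, with signs as in the cofactor expansion above.
M = np.array([[11.0, 10.0, 14.0],
              [12.0, 11.0, -13.0],
              [14.0, 13.0, -66.0]])

assert np.isclose(np.linalg.det(M), 1.0)       # det(M) = 1, so M is invertible

sigma = np.linalg.svd(M, compute_uv=False)     # approx (71.4, 21.8, 0.0006)
kappa = sigma.max() / sigma.min()              # condition number
assert kappa > 1e4                             # well posed but badly conditioned
```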
where $\theta \in [-\pi, \pi]$ and the functions $R$ and $f$ are twice continuously differentiable $2\pi$-periodic functions, i.e. $R(\theta + 2\pi n) = R(\theta)$ and $f(\theta + 2\pi n) = f(\theta)$ for every $n \in \mathbb{Z}$. Assume additionally that $R(-\theta) = R(\theta)$ and $R(\theta) \ge 0$, $\theta \in [0, \pi]$.

We are given the data

$$g(\theta_1), \ldots, g(\theta_n),$$

where $\theta_j = hj$, $j = 1, \ldots, n$, $h = \frac{2\pi}{n}$ and $n = 2^m$ for some $m > 3$, and we know the function $R$. What can we say about the function $f$? The value $g(\theta)$ is defined as a convolution integral of $R$ and $f$.

Denote

$$M_{kj} = R(\theta_k - \theta_j)h$$

and

$$x_k = f(\theta_k) \quad\text{and}\quad y_k = g(\theta_k)$$

for $k, j = 1, \ldots, n$. We replace the problem with the matrix problem

$$y = Mx + e.$$
By the periodicity of $R$, the matrix $M$ is a so-called circulant matrix. A matrix $M \in \mathbb{R}^{n\times n}$ is called circulant if

$$M = \begin{pmatrix} m_1 & m_n & m_{n-1} & \cdots & m_3 & m_2 \\ m_2 & m_1 & m_n & \cdots & m_4 & m_3 \\ m_3 & m_2 & m_1 & \cdots & m_5 & m_4 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ m_n & m_{n-1} & m_{n-2} & \cdots & m_2 & m_1 \end{pmatrix}$$

for some vector $(m_1, \ldots, m_n) \in \mathbb{R}^n$, and a circulant matrix $M$ is unitarily similar to a diagonal matrix (i.e. there exists a unitary matrix $U$ so that $U^*MU$ is a diagonal matrix).
Proof. There exists $F^{(k)} \in \mathbb{C}^n$ so that $MF^{(k)} = \lambda_k F^{(k)}$ for $k = 1, \ldots, n$. Indeed, take

$$F_j^{(k)} = \exp(2\pi i(j-1)(k-1)/n), \quad k, j = 1, \ldots, n.$$

Straightforwardly,

$$(MF^{(k)})_j = \sum_{l=1}^n M_{jl}F_l^{(k)} = \sum_{l=1}^n m_{(j-l+1)\bmod n}\exp(2\pi i(l-1)(k-1)/n)$$

$$= \sum_{L=1}^n m_L\exp(2\pi i(j-L)(k-1)/n) = \lambda_k\exp(2\pi i(j-1)(k-1)/n) = \lambda_k F_j^{(k)},$$

where $\lambda_k = \sum_{L=1}^n m_L\exp(-2\pi i(L-1)(k-1)/n)$.
where we apply the geometric progression for $z = \exp(2\pi i(k-l)/n) \neq 1$. If $k = l$, then

$$(F^{(k)}, F^{(k)}) = \sum_{j=1}^n \exp(2\pi i(j-1)(k-1)/n)\,\overline{\exp(2\pi i(j-1)(k-1)/n)} = n.$$

The absolute values of the eigenvalues of $M$ are also its singular values, since

$$M^*M = U\operatorname{diag}(\bar\lambda_1, \ldots, \bar\lambda_n)U^*U\operatorname{diag}(\lambda_1, \ldots, \lambda_n)U^* = U\operatorname{diag}(|\lambda_1|^2, \ldots, |\lambda_n|^2)U^*$$

is similar to $\operatorname{diag}(|\lambda_1|^2, \ldots, |\lambda_n|^2)$, and similar matrices share their eigenvalues.
Let $m_j = R(h(j-1))h$, $j = 1, \ldots, n$. The eigenvalues of the corresponding circulant matrix $M$ are then

$$\lambda_k = \sum_{j=1}^n hR(h(j-1))\exp(-2\pi i(j-1)(k-1)/n).$$
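This diagonalisation is easy to check numerically: the eigenvalues of a circulant matrix are the discrete Fourier transform of its first column. The vector `m` below is an arbitrary illustrative choice.

```python
import numpy as np

n = 8
m = np.array([3.0, 1.0, 0.5, 0.25, 0.1, 0.25, 0.5, 1.0])  # first column (m_1, ..., m_n)

# Circulant matrix: M[j, l] = m_{(j - l) mod n} (0-based indices).
J, L = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
M = m[(J - L) % n]

# Eigenvalues lambda_k = sum_j m_j exp(-2 pi i j k / n), i.e. the DFT of m,
# with eigenvectors F^(k)_j = exp(2 pi i j k / n).
lam = np.fft.fft(m)
for k in range(n):
    F_k = np.exp(2j * np.pi * np.arange(n) * k / n)
    assert np.allclose(M @ F_k, lam[k] * F_k)
```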
Then we divide the summation into two parts: one containing the integrals over $[0, \pi]$ and one containing the integrals over $[\pi, 2\pi]$:

$$\begin{aligned}
|\lambda_{n/2+1}| &= \left|\,\sum_{J=0}^{n/4-1}\int_{(2J)h}^{(2J+1)h}\frac{dR}{d\theta}(\theta)\,d\theta + \sum_{J=n/4}^{n/2-1}\int_{(2J)h}^{(2J+1)h}\frac{dR}{d\theta}(\theta)\,d\theta\,\right| \\
&\overset{J' = J - n/4}{=} \left|\,\sum_{J=0}^{n/4-1}\int_{(2J)h}^{(2J+1)h}\frac{dR}{d\theta}(\theta)\,d\theta + \sum_{J'=0}^{n/4-1}\int_{(2(J'+n/4))h}^{(2(J'+n/4)+1)h}\frac{dR}{d\theta}(\theta)\,d\theta\,\right| \\
&\overset{\pi = nh/2}{=} \left|\,\sum_{J=0}^{n/4-1}\int_{(2J)h}^{(2J+1)h}\frac{dR}{d\theta}(\theta)\,d\theta + \sum_{J'=0}^{n/4-1}\int_{(2J')h+\pi}^{(2J'+1)h+\pi}\frac{dR}{d\theta}(\theta)\,d\theta\,\right| \\
&\overset{\text{periodicity}}{=} \left|\,\sum_{J=0}^{n/4-1}\int_{(2J)h}^{(2J+1)h}\frac{dR}{d\theta}(\theta)\,d\theta + \sum_{J'=0}^{n/4-1}\int_{(2J')h}^{(2J'+1)h}\frac{dR}{d\theta}(\theta+\pi)\,d\theta\,\right|.
\end{aligned}$$
implying

$$\kappa(M_{n\times n}) \ge \frac{hR(0)}{h^2\,2\sup_\theta|R''(\theta)|} = \frac{R(0)}{2h\sup_\theta|R''(\theta)|} = O(n).$$

The larger $n$ is, the more unstable the inversion of the matrix $M_{n\times n}$ becomes. This is typical behavior for finite-dimensional approximations of smoothing convolutions.
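A minimal numerical sketch of this growth, using the smooth, even, $2\pi$-periodic kernel $R(\theta) = \exp(\cos\theta)$ (an illustrative choice of ours, not one from the text):

```python
import numpy as np

def conv_matrix(n):
    """Circulant discretisation M_kj = R(theta_k - theta_j) h of the
    kernel R(theta) = exp(cos(theta))."""
    h = 2 * np.pi / n
    theta = h * np.arange(n)
    J, L = np.meshgrid(theta, theta, indexing="ij")
    return np.exp(np.cos(J - L)) * h

# Condition numbers blow up as the discretisation is refined.
conds = [np.linalg.cond(conv_matrix(n)) for n in (8, 16, 32)]
assert conds[0] < conds[1] < conds[2]
```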
3.3 Recap

In finite-dimensional linear inverse problems the direct theory $F: V \to W$ is a linear mapping between two finite-dimensional linear subspaces $V, W$. The direct theory can be represented with the help of a matrix.

The finite-dimensional linear inverse problem is ill posed if at least one of the following claims holds:

For the well-posedness of the inverse problem in the case of a square matrix $M$, it is enough to find out whether the determinant of $M$ vanishes.

If the data contains too much disturbance, the solution of a well-posed problem can be far from the true solution. A well posed problem which is highly ill conditioned can resemble an ill posed problem where the solution does not depend continuously on the data.
Learning outcomes:
student knows how to study whether a finite-dimensional linear inverse problem is
well posed.
student can recognise and give examples of ill posed finite-dimensional linear inverse
problems.
what exact data and inexact (disturbed) data mean;

that a well posed finite-dimensional approximation of an ill posed function-valued problem can have increasing condition numbers as the dimensionality of the problem grows.
Chapter 4

Let's study classical approximate solutions for ill posed or ill conditioned finite-dimensional linear problems.
In other words,

$$\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n}|Mx - y|.$$
The notation $\operatorname{argmin}$ means the argument $x$ of the functional $x \mapsto |Mx - y|$ that gives the minimum value. The hat above the vector $x$ is used to indicate that the value is not necessarily the exact solution but an estimate.

Definition 10. The least squares (LS) solution of () is

$$\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n}|Mx - y|.$$
$$f(x_1, x_2) = |Mx - y|^2 = \left|\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - \begin{pmatrix} 1 \\ \frac{1}{10} \end{pmatrix}\right|^2 = (x_1 - 1)^2 + \frac{1}{100} > 0.$$

At the minimum point, $x_1 = 1$ and $x_2$ is a free parameter. In other words, the least squares solutions are

$$\hat x = (1, x_2),$$

where $x_2 \in \mathbb{R}$. A problem that has no solutions has infinitely many approximate solutions. (Here the true values were $x_0 = (1, 0)$ and the disturbance $\varepsilon = (0, \frac{1}{10})$.)
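The same example can be run through an SVD-based least squares routine, which picks one particular solution out of the infinitely many; this check is our own addition.

```python
import numpy as np

M = np.array([[1.0, 0.0],
              [0.0, 0.0]])
y = np.array([1.0, 0.1])               # y = M x0 + eps with x0 = (1, 0), eps = (0, 1/10)

# Every (1, x2) minimises |Mx - y|; lstsq (SVD-based) returns the
# minimum norm solution among them.
x_hat, *_ = np.linalg.lstsq(M, y, rcond=None)
assert np.allclose(x_hat, [1.0, 0.0])

# Any other choice of the free parameter x2 gives the same residual.
assert np.isclose(np.linalg.norm(M @ x_hat - y),
                  np.linalg.norm(M @ np.array([1.0, 7.0]) - y))
```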
$$\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n}|Mx - y| \quad\iff\quad M^TM\hat x = M^Ty.$$

$$\begin{aligned} f(x) = |Mx - y|^2 &= (Mx - y, Mx - y) \\ &= (Mx, Mx) - (y, Mx) - (Mx, y) + (y, y) \\ &= (M^TMx, x) - 2(M^Ty, x) + (y, y). \end{aligned} \tag{4.2}$$
All extreme values of $f: \mathbb{R}^n \to \mathbb{R}$, especially its minimum, are located at critical points (that is, the minimum satisfies $\nabla f(\hat x) = 0$). The partial derivatives are

$$\frac{\partial}{\partial x_k}(M^Ty, x) = \frac{\partial}{\partial x_k}\sum_{j=1}^n (M^Ty)_j x_j = \sum_{j=1}^n (M^Ty)_j\frac{\partial x_j}{\partial x_k} = (M^Ty)_k \tag{4.3}$$

and

$$\frac{\partial}{\partial x_k}(M^TMx, x) = \sum_{j=1}^n (M^TM)_{jk}x_j + (M^TMx)_k = 2(M^TMx)_k. \tag{4.4}$$
$$V^\perp = \{x \in \mathbb{R}^n : (x, y) = 0\ \forall y \in V\}.$$

$$R(A^*)^\perp = N(A).$$
Proof. Let $x \in R(A^*)^\perp$. For every $y \in \mathbb{C}^m$, we have

$$0 = (A^*y, x) = (y, Ax). \tag{4.6}$$

By choosing $y = Ax$ in (4.6), we see that $0 = |Ax|^2$. Then $Ax = 0$, i.e. $x \in N(A)$. Therefore, $R(A^*)^\perp \subset N(A)$. On the other hand, if $x \in N(A)$, then

$$(A^*y, x) = (y, Ax) = 0$$

for every $y \in \mathbb{C}^m$, so that $x \in R(A^*)^\perp$. Hence $N(A) \subset R(A^*)^\perp$.
Theorem 8. Let $M \in \mathbb{R}^{m\times n}$ and $y \in \mathbb{R}^m$. Then there exists a least squares solution

$$\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n}|Mx - y|.$$
Example 21. We have the following noisy observations of the unknown $x = (x_1, x_2) \in \mathbb{R}^2$:

$$\begin{aligned} 1 &= x_1 + e_1 \\ 3 &= x_1 + x_2 + e_2 \\ 4 &= x_1 + x_2 + e_3 \\ 2 &= x_2 + e_4. \end{aligned}$$

Find an approximate solution by using the least squares method. Denote

$$y = \begin{pmatrix} 1 \\ 3 \\ 4 \\ 2 \end{pmatrix}, \quad M = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix} \quad\text{and}\quad e = \begin{pmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{pmatrix}.$$
Determine the LS solution for the given data $y = Mx + e$. Let's calculate

$$M^TM = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$$

and

$$M^Ty = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 3 \\ 4 \\ 2 \end{pmatrix} = \begin{pmatrix} 8 \\ 9 \end{pmatrix}.$$

We obtain the equation

$$M^TM\hat x = M^Ty \quad\iff\quad \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} \hat x_1 \\ \hat x_2 \end{pmatrix} = \begin{pmatrix} 8 \\ 9 \end{pmatrix},$$

which has the solution $(\hat x_1, \hat x_2) = (\frac{6}{5}, \frac{11}{5})$.
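The hand calculation in Example 21 can be verified in a few lines:

```python
import numpy as np

M = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0],
              [0.0, 1.0]])
y = np.array([1.0, 3.0, 4.0, 2.0])

# Normal equations M^T M x = M^T y ...
x_ne = np.linalg.solve(M.T @ M, M.T @ y)
# ... agree with the SVD-based least squares routine.
x_ls, *_ = np.linalg.lstsq(M, y, rcond=None)

assert np.allclose(x_ne, [6/5, 11/5])
assert np.allclose(x_ne, x_ls)
```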
$$\hat x = (x_1,\ 1 - x_1,\ 1)$$
Definition 11. A least squares solution $\hat x$ of () is called the minimum norm solution if

$$|\hat x| = \min\{|x| : x \in \mathbb{R}^n,\ M^TMx = M^Ty\}.$$
The next theorem shows that working with minimum norm solutions removes the nonuniqueness of the approximate solutions (but the cost is an increased error in the LS solution).

Proof. Let $\hat x$ be a LS solution of (). Denote by $Q_{R(M^T)}$ the orthogonal projection onto $R(M^T)$. By Lemma 5, $\mathbb{R}^n = R(M^T) \oplus N(M)$ and by (4.7)

$$M^TM(Q_{R(M^T)}\hat x) = M^TM(Q_{R(M^T)}\hat x + Q_{N(M)}\hat x) = M^TM\hat x = M^Ty.$$

Remark 8. According to the above proof, the minimum norm solution is of the form

$$Q_{R(M^T)}\hat x \overset{\text{Lemma 5}}{=} \hat x - Q_{N(M)}\hat x.$$
$$M = UDV^*, \qquad D_{ij} = \sigma_j\delta_{ij}$$
Example 24. When

$$M = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix}, \quad\text{then}\quad M^+ = \begin{pmatrix} \frac14 & 0 & 0 & 0 \\ 0 & \frac13 & 0 & 0 \\ 0 & 0 & \frac12 & 0 \end{pmatrix}.$$
1. $Q_{R(M)} = MM^+$.

2. $Q_{N(M)^\perp} = M^+M$.

3. $Q_{N(M)^\perp}M^+ = M^+Q_{R(M)} = M^+$.
Proof. Let $M = UDV^*$ be the SVD of $M$, where $D_{ii} > 0$ if and only if $i \le r$. Denote $V = (V_1 \cdots V_n)$. Then

$$0 = Mx = UDV^*x \quad\iff\quad 0 = U^{-1}Mx = DV^*x = \begin{pmatrix} D_{11}(x, V_1) \\ \vdots \\ D_{rr}(x, V_r) \\ 0_{(m-r)\times 1} \end{pmatrix}$$

if and only if $x = \sum_{i=r+1}^n x_i'V_i$. Then

$$N(M) = \operatorname{span}\{V_{r+1}, \ldots, V_n\}. \tag{4.8}$$
Then

$$MM^+ = (UDV^*)(VD^+U^*) = U(DD^+)U^* = U\begin{pmatrix} I_{r\times r} & 0_{r\times(m-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(m-r)} \end{pmatrix}U^* = (U_1, \ldots, U_r)\begin{pmatrix} U_1^* \\ \vdots \\ U_r^* \end{pmatrix},$$

which is the orthogonal projection onto the image space by (4.9). Claim 2 follows from (4.8) similarly. Claim 3 follows from using the above formulas for the projections and noticing that

$$M^+ = VD^+U^* = (V_1 \cdots V_r)\,\widetilde D^{-1}\begin{pmatrix} U_1^* \\ \vdots \\ U_r^* \end{pmatrix}, \qquad D^+ = \begin{pmatrix} \widetilde D^{-1} & 0_{r\times(m-r)} \\ 0_{(n-r)\times r} & 0_{(n-r)\times(m-r)} \end{pmatrix},$$

where $\widetilde D = \operatorname{diag}(D_{11}, \ldots, D_{rr})$.
$$\hat x = M^+y.$$
Proof. The vector $\hat x = M^+y$ satisfies

$$M^TM\hat x = M^TM(M^+y) = M^T(MM^+)y = M^Ty,$$

since $MM^+ = Q_{R(M)}$ by Theorem 10 and $y = Q_{R(M)}y + (I - Q_{R(M)})y = Q_{R(M)}y + Q_{N(M^T)}y$ by Lemma 5.

The vector $\hat x$ has minimum norm by the Remark after Theorem 9, since

$$Q_{R(M^T)}M^+y \overset{\text{Lemma 5}}{=} M^+y - Q_{N(M)}M^+y \overset{\text{Theorem 10}}{=} M^+y.$$
Solution:

$$\hat x = M^+y = VD^+U^Ty = \begin{pmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2} \\[2pt] \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{pmatrix}\begin{pmatrix} \frac12 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \frac{1}{\sqrt2} & \frac{1}{\sqrt2} \\[2pt] \frac{1}{\sqrt2} & -\frac{1}{\sqrt2} \end{pmatrix}\begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} \frac14 & \frac14 \\[2pt] \frac14 & \frac14 \end{pmatrix}\begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} \frac74 \\[2pt] \frac74 \end{pmatrix}.$$
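The displayed factors are consistent with the theory matrix $M = \begin{pmatrix}1&1\\1&1\end{pmatrix}$ (an assumption read off from the reconstructed SVD), which lets us check the result with `numpy`:

```python
import numpy as np

# Theory matrix consistent with the SVD factors above (our reconstruction).
M = np.array([[1.0, 1.0],
              [1.0, 1.0]])
y = np.array([3.0, 4.0])

M_plus = np.linalg.pinv(M)                      # Moore-Penrose pseudoinverse
assert np.allclose(M_plus, 0.25 * np.ones((2, 2)))

x_hat = M_plus @ y                              # minimum norm LS solution
assert np.allclose(x_hat, [7/4, 7/4])
```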
where $\widetilde D$ is invertible. Recall that $N(M)^\perp = \operatorname{span}\{V_1, \ldots, V_r\}$ and $R(M) = \operatorname{span}\{U_1, \ldots, U_r\}$. Then the linear mapping $F: N(M)^\perp \ni x \mapsto Mx \in R(M)$ is invertible, and its matrix with respect to the bases $\{V_1, \ldots, V_r\}$ and $\{U_1, \ldots, U_r\}$ is $\widetilde D$. Especially, $\widetilde D$ is non-singular and has condition number

$$\kappa(\widetilde D) = \frac{\widetilde D_{11}}{\widetilde D_{rr}}.$$

Hence, the relative error satisfies

$$\frac{|M^+\varepsilon|}{|Q_{N(M)^\perp}x_0|} \le \kappa(\widetilde D)\,\frac{|Q_{R(M)}\varepsilon|}{|Mx_0|}.$$

The worst case relative error is explained by the non-zero singular values of $M$. Problems occur when $M$ has very small (compared to the matrix norm) non-zero singular values.
4.1.3 Regularisation

In the least squares method, the ill-posed problem $y = Mx$ is replaced with a closely related well-posed problem $M^TMx = M^Ty$. In general, regularisation means a method to replace an ill conditioned problem with a better conditioned problem. A linear finite-dimensional inverse problem that contains disturbances is often presented in the following form.
With the TSVD of $M$ we can get a regularised problem

$$y = M_{(k)}x, \tag{4.10}$$

whose LS solution is

$$\hat x_{(k)} = M_{(k)}^+y. \tag{4.11}$$

$$\kappa_k = \frac{D_{11}}{D_{kk}}.$$

$$\hat x_{(k)} = M_{(k)}^+y.$$

Remark 10. The cost of the smaller condition number is the larger kernel $N(M_{(k)})$. This decreases the accuracy of the estimate $M_{(k)}^+y$. In Chapter ?? we saw that the minimum norm solution, here $M_{(k)}^+y$, does not contain components from the linear subspace $N(M_{(k)})$.
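A TSVD solver is a few lines with numpy's SVD; `tsvd_solve` is a hypothetical helper name of our own.

```python
import numpy as np

def tsvd_solve(M, y, k):
    """Truncated SVD solution x_(k) = M_(k)^+ y: keep the k largest
    singular values and invert only those."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_inv = np.zeros_like(s)
    s_inv[:k] = 1.0 / s[:k]
    return Vt.T @ (s_inv * (U.T @ y))

# Keeping all singular values reproduces the pseudoinverse solution.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
assert np.allclose(tsvd_solve(M, y, 3), np.linalg.pinv(M) @ y)
```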
$$M = U\begin{pmatrix} \frac{1}{10} & 0 & 0 & 0 \\ 0 & \frac{1}{10} & 0 & 0 \\ 0 & 0 & \frac{1}{10} & 0 \\ 0 & 0 & 0 & \frac{1}{10000} \end{pmatrix}V^T,$$

and the given data is $y = Mx_0 + \varepsilon$. Then

$$M_{(3)} = U\begin{pmatrix} \frac{1}{10} & 0 & 0 & 0 \\ 0 & \frac{1}{10} & 0 & 0 \\ 0 & 0 & \frac{1}{10} & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}V^T \quad\text{and}\quad M_{(3)}^+ = V\begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}U^T.$$
Let's study the case where $|\varepsilon| \le \frac{1}{10}$. Say, $\varepsilon = U\left(\frac{2}{100}, \frac{1}{100}, \frac{3}{100}, \frac{1}{100}\right)^T$ for example. Then the equation $y = Mx$ has the solution

$$M^{-1}y = x_0 + V\begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 10000 \end{pmatrix}U^TU\begin{pmatrix} \frac{2}{100} \\[2pt] \frac{1}{100} \\[2pt] \frac{3}{100} \\[2pt] \frac{1}{100} \end{pmatrix} = x_0 + V\begin{pmatrix} \frac{2}{10} \\[2pt] \frac{1}{10} \\[2pt] \frac{3}{10} \\[2pt] 100 \end{pmatrix}.$$
Now $|x_0 - M^{-1}y| \ge 100$. But with the TSVD we get

$$M_{(3)}^+y = M_{(3)}^+Mx_0 + V\begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}U^TU\begin{pmatrix} \frac{2}{100} \\[2pt] \frac{1}{100} \\[2pt] \frac{3}{100} \\[2pt] \frac{1}{100} \end{pmatrix}$$

$$= V\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}V^Tx_0 + V\begin{pmatrix} \frac{2}{10} \\[2pt] \frac{1}{10} \\[2pt] \frac{3}{10} \\[2pt] 0 \end{pmatrix}$$

$$= x_0 - V\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}V^Tx_0 + V\begin{pmatrix} \frac{2}{10} \\[2pt] \frac{1}{10} \\[2pt] \frac{3}{10} \\[2pt] 0 \end{pmatrix} = x_0 - V_4V_4^Tx_0 + V\begin{pmatrix} \frac{2}{10} \\[2pt] \frac{1}{10} \\[2pt] \frac{3}{10} \\[2pt] 0 \end{pmatrix},$$
The regularised solution $\hat x_{(k)}$ does not contain those components of the unknown $x_0$ that belong to the kernel of $M_{(k)}$, i.e. $\operatorname{span}\{V_{k+1}, \ldots, V_n\}$.
4.1.5 Tikhonov regularization

Let $x_0 \in \mathbb{R}^n$ be the unknown, $M \in \mathbb{R}^{m\times n}$ the theory matrix and

$$y = Mx_0 + \varepsilon \in \mathbb{R}^m \tag{4.12}$$

$$\hat x = M^+y = M^+Mx_0 + M^+\varepsilon = Q_{N(M)^\perp}x_0 + M^+\varepsilon,$$

i.e.

$$\hat x_\alpha = \operatorname*{argmin}_{x\in\mathbb{R}^n}\big(|Mx - y|^2 + \alpha|x|^2\big).$$

$$|M\hat x_\alpha - y|^2 + \alpha|\hat x_\alpha|^2 = \min_{x\in\mathbb{R}^n}\big(|Mx - y|^2 + \alpha|x|^2\big)$$

has a unique solution $\hat x_\alpha$. The solution $\hat x_\alpha$ coincides with the unique solution of

$$(M^TM + \alpha I)\hat x = M^Ty,$$

which leads to a LS problem. We can use Theorem 7 to find the minimiser by solving

$$\begin{pmatrix} M \\ \sqrt\alpha\, I \end{pmatrix}^T\begin{pmatrix} M \\ \sqrt\alpha\, I \end{pmatrix}\hat x = \begin{pmatrix} M \\ \sqrt\alpha\, I \end{pmatrix}^T\begin{pmatrix} y \\ 0 \end{pmatrix},$$

in other words

$$(M^TM + \alpha I)\hat x = M^Ty.$$
This equation has a unique solution by Theorem 8, since the kernel of $\begin{pmatrix} M \\ \sqrt\alpha\, I \end{pmatrix}$ contains only zero by the lower row of

$$0 = \begin{pmatrix} M \\ \sqrt\alpha\, I \end{pmatrix}x = \begin{pmatrix} Mx \\ \sqrt\alpha\, x \end{pmatrix}.$$

Remark 12. Above, it was actually verified that the Tikhonov regularisation corresponds to the LS solution of

$$\begin{pmatrix} M \\ \sqrt\alpha\, I \end{pmatrix}x = \begin{pmatrix} y \\ 0 \end{pmatrix}.$$
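The equivalence stated in Remark 12 can be checked numerically; the data below are synthetic, and the regularisation parameter is denoted `alpha` (the symbol is our reconstruction).

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
alpha = 0.1                     # regularisation parameter

# Normal equations (M^T M + alpha I) x = M^T y ...
x_ne = np.linalg.solve(M.T @ M + alpha * np.eye(4), M.T @ y)

# ... equal the LS solution of the stacked system [M; sqrt(alpha) I] x = [y; 0].
M_stack = np.vstack([M, np.sqrt(alpha) * np.eye(4)])
y_stack = np.concatenate([y, np.zeros(4)])
x_stack, *_ = np.linalg.lstsq(M_stack, y_stack, rcond=None)

assert np.allclose(x_ne, x_stack)
```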
and

$$y = Mx_0 + \varepsilon = \begin{pmatrix} 14.1 & 13.1 & -65.9 \end{pmatrix}^T.$$

We saw in Example ?? that

$$M^{-1}(Mx_0 + \varepsilon) = x_0 + \begin{pmatrix} 168\cdot 10^{-6} & 184\cdot 10^{-3} & 10\cdot 10^{-3} \end{pmatrix}^T.$$
Choosing the regularisation parameter

How does the parameter $\alpha$ affect the regularised solution? First, we ask what happens to $\hat x_\alpha$ when $\alpha \to 0$ or when $\alpha \to \infty$. We need to calculate the limits

$$\lim_{\alpha\to 0^+}(M^TM + \alpha I)^{-1}M^Ty \quad\text{and}\quad \lim_{\alpha\to\infty}(M^TM + \alpha I)^{-1}M^Ty,$$

if they exist.

Assume for simplicity that zero is not an eigenvalue of $M^TM$. Then the inverse $(M^TM)^{-1}$ exists and we can study

$$|\hat x_\alpha - \hat x| = |(M^TM + \alpha I)^{-1}M^Ty - (M^TM)^{-1}M^Ty|.$$

The difference of two inverse matrices satisfies

$$B^{-1} - C^{-1} = B^{-1}(I - BC^{-1}) = B^{-1}(C - B)C^{-1}.$$

Especially,

$$(M^TM + \alpha I)^{-1} - (M^TM)^{-1} = (M^TM + \alpha I)^{-1}(-\alpha I)(M^TM)^{-1}.$$

Then

$$|(M^TM + \alpha I)^{-1}M^Ty - (M^TM)^{-1}M^Ty| \le \alpha\,\|(M^TM + \alpha I)^{-1}\|\,|(M^TM)^{-1}M^Ty|.$$

Recall that $\|(M^TM + \alpha I)^{-1}\|$ is the inverse of the smallest eigenvalue $\lambda_{\min}$ of the matrix $(M^TM + \alpha I)$. Denote by $u_{\min}$ the corresponding normed eigenvector. We can estimate the smallest eigenvalue as follows:

$$\lambda_{\min} = ((M^TM + \alpha I)u_{\min}, u_{\min}) \ge (M^TMu_{\min}, u_{\min}) \ge \lambda_{\min}(M^TM).$$

Then we get the estimate

$$|(M^TM + \alpha I)^{-1}M^Ty - (M^TM)^{-1}M^Ty| \le \alpha\,\lambda_{\min}(M^TM)^{-1}|(M^TM)^{-1}M^Ty|,$$

which implies that

$$\lim_{\alpha\to 0^+}\hat x_\alpha = \lim_{\alpha\to 0^+}(M^TM + \alpha I)^{-1}M^Ty = (M^TM)^{-1}M^Ty. \tag{4.14}$$

For large values of $\alpha$ the regularised solution approaches zero. For small values of $\alpha$ the regularised solution approaches the LS minimum norm solution.
The choice of $\alpha$ can be done as follows.

Definition 16. Let $y = Mx_0 + \varepsilon$ be the given data, where $|\varepsilon| \le e$. According to Morozov's discrepancy principle, the parameter $\alpha$ is chosen so that

$$|M\hat x_\alpha - y| = e,$$

$$|M\hat x_\alpha - y| = |(M\hat x_\alpha - Mx_0) - \varepsilon| \approx |\varepsilon|.$$

$$\hat x_\alpha = (M^TM + \alpha I)^{-1}M^Ty,$$

where

$$(M^TM + \alpha I) = VD^TU^TUDV^T + \alpha I = VD^TDV^T + \alpha VV^T = V(D^TD + \alpha I)V^T$$
Hence

can be written as

$$(M\hat x_\alpha)_i = \sum_{j=1}^{\min(m,n)}U_{ij}\,\frac{D_{jj}^2}{D_{jj}^2 + \alpha}\sum_{k=1}^m U_{kj}y_k.$$

$$f(\alpha) := |M\hat x_\alpha - y|^2 = \sum_{j=\min(m,n)+1}^m (U^Ty)_j^2 + \sum_{j=1}^{\min(m,n)}\left(\frac{\alpha}{D_{jj}^2 + \alpha}\right)^2(U^Ty)_j^2.$$

$$\begin{aligned}
f'(\alpha) &= \sum_{j=1}^{\min(m,n)}\frac{d}{d\alpha}\left(\frac{\alpha}{D_{jj}^2 + \alpha}\right)^2(U^Ty)_j^2 \\
&= \sum_{j=1}^{\min(m,n)}2\,\frac{\alpha}{D_{jj}^2 + \alpha}\left(\frac{1}{D_{jj}^2 + \alpha} - \frac{\alpha}{(D_{jj}^2 + \alpha)^2}\right)(U^Ty)_j^2 \\
&= \sum_{j=1}^{\min(m,n)}\frac{2\alpha D_{jj}^2}{(D_{jj}^2 + \alpha)^3}(U^Ty)_j^2 \ge 0,
\end{aligned}$$

and by (4.15)

$$\lim_{\alpha\to 0^+}f(\alpha) = |MM^+y - y|^2 \overset{\text{Theorem 10}}{=} |Q_{R(M)^\perp}y|^2.$$
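Since $f(\alpha) = |M\hat x_\alpha - y|$ is increasing in $\alpha$ (as the derivative computation above shows), the Morozov parameter can be located by bisection. The helper names below are hypothetical and the test problem is synthetic.

```python
import numpy as np

def discrepancy(M, y, alpha):
    """f(alpha) = |M x_alpha - y| with x_alpha from the normal equations."""
    n = M.shape[1]
    x = np.linalg.solve(M.T @ M + alpha * np.eye(n), M.T @ y)
    return np.linalg.norm(M @ x - y)

def morozov_alpha(M, y, e, lo=1e-12, hi=1e12, iters=200):
    """Bisection (on a log scale) using the monotonicity of f."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if discrepancy(M, y, mid) < e:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 3))
x0 = rng.standard_normal(3)
eps = 0.05 * rng.standard_normal(8)
y = M @ x0 + eps

alpha = morozov_alpha(M, y, e=np.linalg.norm(eps))
assert np.isclose(discrepancy(M, y, alpha), np.linalg.norm(eps), rtol=1e-3)
```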
$$|x_0 - \hat x_\alpha| = |x_0 - (M^TM + \alpha I)^{-1}M^TMx_0 - (M^TM + \alpha I)^{-1}M^T\varepsilon|,$$

which depends on two terms that have opposite behavior as a function of $\alpha$. Denote

$$G_1(\alpha) = (I - (M^TM + \alpha I)^{-1}M^TM)x_0 \quad\text{and}\quad G_2(\alpha) = (M^TM + \alpha I)^{-1}M^T\varepsilon.$$
By equations (4.14)-(4.16), we have that

The larger the regularization parameter $\alpha$ is, the smaller the effect of the noise on the solution. However, the distortion caused by the penalization simultaneously increases.

The penalization distorts the regularised solution even for exact data.
Generalizations

More generally, Tikhonov regularisation means the minimisation problem

where $B \in \mathbb{R}^{n'\times n}$ is usually some matrix whose singular values are positive. The vector $Bx$ represents some unwanted feature of the approximative solution.

Example 29.

$$B = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ 0 & 0 & -1 & \ddots & 0 & 0 \\ \vdots & & & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{pmatrix}$$

The matrix $B$ penalizes the differences of neighboring points. This forces the approximative solution to be smoother.
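A sketch of this generalised penalty in code, with a synthetic problem and the difference matrix built as above (the minus signs are our reconstruction):

```python
import numpy as np

def difference_matrix(n):
    """B with rows (..., -1, 1, ...): penalises jumps between neighbours."""
    B = np.eye(n)
    B[1:, :-1] -= np.eye(n - 1)
    return B

# Generalised Tikhonov: minimise |Mx - y|^2 + alpha |Bx|^2, i.e. solve
# (M^T M + alpha B^T B) x = M^T y.
rng = np.random.default_rng(4)
n = 20
M = rng.standard_normal((30, n))
y = rng.standard_normal(30)
B = difference_matrix(n)
alpha = 1.0

x_smooth = np.linalg.solve(M.T @ M + alpha * B.T @ B, M.T @ y)
x_plain = np.linalg.solve(M.T @ M, M.T @ y)

# The penalty shrinks the penalised feature Bx.
assert np.linalg.norm(B @ x_smooth) < np.linalg.norm(B @ x_plain)
```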
In regularisation, also other norms than inner product norms can be used. For example,

$$\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^n}\Big(|Mx - y|^2 + \alpha\sum_{i=1}^n |x_i|\Big),$$

where the penalisation term is the so-called $\ell^1$ norm. In such a case, the minimisation problem is solved numerically by different methods than in the case of Tikhonov regularisation.
4.2 Recap

Least squares method:

Tikhonov regularisation:

Please be able to

Please remember

what SVD is.
Chapter 5

A statistical inverse problem does not give an answer to the question "what is the unknown $x_0$" but rather to the question "what do we know about the unknown $x_0$".

The aim of this chapter is to understand the solution principle of a statistical inverse problem. In Section 5.1 we recap necessary concepts from probability theory (see the blue text below) and multidimensional integration, which is used in calculating the probabilities. In Section 5.2 we meet statistical inverse problems.
3. The distribution of the unknown is called the prior distribution. It represents our knowledge about the unknown.

4. The solution of the statistical inverse problem is the posterior distribution, which is the conditional distribution of $X$ given $Y = y$, and it has the probability density function (pdf)

$$f(x|y) = c\,f(y|x)f_{pr}(x) \qquad\text{(Bayes' formula)}$$

where $f(y|x)$ is the conditional pdf of $Y$ given $X = x$, $f_{pr}(x)$ is the prior pdf of $X$ and $c > 0$ a norming constant.
Remark 13. The word prior refers to the time when the observation $y$ of the value of $Y$ is not yet available. The word posterior refers to the time when $Y = y$ is available.
Example 30. What does it mean that the distribution represents knowledge about the
unknown? Consider two simple cases (a) unknown x0 R and (b) unknown x0 R2 .
(a) Let the unknown $X$ be tomorrow's temperature at noon. Today, we do not know for sure what the exact value of $X$ is. However, $X$ may have a probability distribution whose pdf is, say, $f(x)$. Below are some examples of $f$.
Figure 5.1: Pdf $f$: Temperatures in the interval $[-10, 0]$ seem to be unlikely, as are temperatures in $[+5, +10]$. The temperature $+2$ has the highest density value. On the basis of $f$, we believe that tomorrow's temperature at noon is around $+2$ degrees.
Figure 5.2: Pdf $f$: Temperatures in $[5, 10]$ look unlikely. The temperature $-2$ has the highest density value, but the density is quite wide. This reflects uncertainty about tomorrow's temperature at noon.
Figure 5.3: Pdf $f$: Temperatures in $[-10, -5]$ and $[5, 10]$ seem to be quite unlikely. The temperature $+2$ has the highest value, but $-2$ is also at a local maximum. This reflects uncertainty about tomorrow's temperature at noon. We actually have two scenarios of how the weather will develop.
(b) Let the unknown be $X = (X_1, X_2)$, where $X_1$ and $X_2$ are, for instance, parameters of the elliptic path $X_1x^2 + X_2y^2 = 10$ of an asteroid moving in the same plane as the planets. Let the pdf of $X$ be $f(x) = f(x_1, x_2)$.
Figure 5.4: Pdf $f$: the function $f = f(x_1, x_2)$ of two variables $x_1$ and $x_2$ can be presented either with colors or with elevations. The values of $x_1$ are on the horizontal axis and the values of $x_2$ are on the vertical axis. For example, the value $f(10, 5)$ is the number corresponding to the color at the coordinates $x_1 = 10$, $x_2 = 5$. It seems that the values of the unknown are close to $(10, 5)$, since $f$ has its highest values there. On the other hand, values near the point $(10, 10)$ seem to be unlikely.
In the same way, we can fix pdfs for the finite-dimensional unknowns of any inverse problem, like color values in image deblurring, the coefficients of approximated mass absorption coefficients in CT scans, and even the coefficients of approximated conductivities in electrical impedance tomography.

Remark 14. In statistical inverse problems, the random vectors are usually very high dimensional. The visualisation of high-dimensional pdfs is usually done a few coordinates at a time or by using statistics of the distributions.
Random vectors

Let $(\Omega, \mathcal{F}, P)$ be a probability space. The collection of Borel sets of $\mathbb{R}^n$ is the smallest $\sigma$-algebra $\mathcal{B}(\mathbb{R}^n)$ that contains all open sets of $\mathbb{R}^n$.

We skip the proof of the next theorem, which relies on properties of Borel sets (namely, the generation of Borel sets with the help of hyperrectangles).

Theorem 12. The mapping $X: \Omega \to \mathbb{R}^n$ is a random vector if and only if the components $X_i$, $i = 1, \ldots, n$ of $X = (X_1, \ldots, X_n)$ are random variables.
$$P(X \in A,\ Y \in B) = P(X \in A)\,P(Y \in B)$$
The axiomatisation was finally done after the development of abstract measure and integration theory at the end of the 1920s. The father of the axioms of probability theory is A. N. Kolmogorov (1903-1987). This has been the only consistent way of treating probability theory.

As mathematical objects, random variables and random vectors are just ordinary functions: they do not have any randomness attached to them and there is no mechanism to produce random numbers. This may seem a little odd... randomness is modeled without any randomly occurring phenomena...?
The values of $X$ are real numbers, but we do not know beforehand which value $X$ will take. Our knowledge of $X$ is imperfect.

When the elevator arrives at time $x_0$, then $x_0$ is a sample of $X$. This means that $x_0 = X(\omega_0)$ for some $\omega_0 \in \Omega$.

Mathematics does not tell how we ended up with $X(\omega_0)$. The mechanism for producing the elementary event $\omega_0$ is unknown.

Although we know exactly the function $X$, the set $\Omega$ and the probability $P$, we cannot say anything else about the value of $X$ except what the distribution $P(X \in B)$, where $B \in \mathcal{B}(\mathbb{R})$, reveals.
for every step function $s: B \to \mathbb{R}$ such that $s \le f$, and for every step function $S: B \to \mathbb{R}$ such that $f \le S$, then we say that $f$ is Riemann integrable (over $B$) and we denote

$$\int_B f(x)\,dx = I.$$
Let $K(B)$ denote the set of all step functions $f: B \to \mathbb{R}$.

Theorem 13. A bounded function $f: B \to \mathbb{R}$ is Riemann integrable if and only if

$$\sup_{\substack{s \in K(B) \\ s \le f}}\int s(x)\,dx = I = \inf_{\substack{S \in K(B) \\ f \le S}}\int S(x)\,dx,$$

in which case

$$\int_B f(x)\,dx = I.$$
as long as all the integrals are well-defined. Moreover, we can change the order of inte-
gration.
The integral over the whole space is defined as an improper integral, i.e. we take a limit of integrals over increasing sets.
Similarly, when $f$ is non-negative, Fubini's theorem is still true when the compact sets are replaced with whole spaces.
if possible.
Example 31. Let

$$f(x) = \begin{cases} \dfrac{1}{2^n}, & x \in [-1, 1]^n \\ 0, & x \notin [-1, 1]^n. \end{cases}$$

Then

$$\int f(x)\,dx = \int_{[-1,1]^n}\frac{1}{2^n}\,dx \overset{\text{Fubini}}{=} \left(\int_{-1}^1\frac{1}{2}\,dx\right)^n = 1.$$
for all $a, b \in \mathbb{R}$, $a \le b$.
Definition 26. Let $(\Omega, \mathcal{F}, P)$ be a probability space. The random vector $X = (X_1, \ldots, X_n): \Omega \to \mathbb{R}^n$ is said to have a pdf $f_X$ if $f_X: \mathbb{R}^n \to [0, \infty)$ is such a pdf that

$$P(a_i \le X_i \le b_i,\ i = 1, \ldots, n) = \int_{[a_1,b_1]\times\cdots\times[a_n,b_n]}f_X(x)\,dx$$

for all $a_i, b_i \in \mathbb{R}$, $a_i \le b_i$, $i = 1, \ldots, n$. The pdf $f_X$ is called the joint probability density function of $X_1, \ldots, X_n$.
Definition 27. The function

$$f_{X_i}(x) = \int_{x_1=-\infty}^\infty\cdots\int_{x_{i-1}=-\infty}^\infty\int_{x_{i+1}=-\infty}^\infty\cdots\int_{x_n=-\infty}^\infty f_X(x_1, \ldots, x_n)\,dx_1\cdots dx_{i-1}\,dx_{i+1}\cdots dx_n$$
Remark 16. Random vectors do not always have an expectation.

Definition 29. Let $X$ be a random vector with pdf $f_X: \mathbb{R}^n \to [0, \infty)$ and expectation $E[X] = (m_1, \ldots, m_n)$. The covariance matrix $C_X \in \mathbb{R}^{n\times n}$ of $X$ is defined by the equations

$$(C_X)_{ij} = \int_{\mathbb{R}^n}(x_i - m_i)(x_j - m_j)f_X(x)\,dx,$$
Gaussian distributions

The random vector $Z: \Omega \to \mathbb{R}^n$ has a Gaussian (or multinormal) distribution if its pdf is of the form

$$f_Z(x) = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\,e^{-\frac12(x-m)^TC^{-1}(x-m)},$$

where $m \in \mathbb{R}^n$ and $C \in \mathbb{R}^{n\times n}$ is a symmetric non-singular matrix whose eigenvalues are positive. We denote $Z \sim N(m, C)$, meaning that $Z$ has a Gaussian distribution with expectation $m$ and covariance matrix $C$.
Lemma 6. The function

$$f_Z(x) = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\,e^{-\frac12(x-m)^TC^{-1}(x-m)}$$

is a pdf. If $Z: \Omega \to \mathbb{R}^n$ is a random vector and $Z \sim N(m, C)$, then

$$E[Z] = m$$

and the covariance matrix of $Z$ is

$$C_Z = C.$$

Proof. Clearly, $f_Z \ge 0$. Let's check what

$$I = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n}e^{-\frac12(x-m)^TC^{-1}(x-m)}\,dx$$

is. Perform the change of variables $x' = x - m$:

$$I = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n}e^{-\frac12(x')^TC^{-1}x'}\,dx'.$$

Perform another change of variables $x'' = C^{-\frac12}x'$. Recall that $C^{-\frac12} = U\operatorname{diag}\big(\frac{1}{\sqrt{\lambda_1}}, \ldots, \frac{1}{\sqrt{\lambda_n}}\big)U^T$, where we have used the eigenvalue decomposition $C = U\operatorname{diag}(\lambda_1, \ldots, \lambda_n)U^T$. We get

$$I = \frac{1}{\sqrt{(2\pi)^n\det(C)}}\int_{\mathbb{R}^n}e^{-\frac12|x''|^2}\,|\det(C^{1/2})|\,dx''.$$
We need to calculate the integrals

$$I = \frac{1}{\sqrt{(2\pi)^n}}\int_{\mathbb{R}^n}e^{-\frac12(x_1^2 + x_2^2 + \cdots + x_n^2)}\,dx_1\cdots dx_n = \left(\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}}e^{-\frac12 x^2}\,dx\right)^n.$$
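The normalisation can also be checked numerically on a grid; the two-dimensional mean and covariance below are illustrative choices of ours.

```python
import numpy as np

# Numerical check that the N(m, C) density integrates to 1 (2-D case).
m = np.array([1.0, -0.5])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
C_inv = np.linalg.inv(C)
norm_const = 1.0 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(C))

t1 = np.linspace(m[0] - 12, m[0] + 12, 801)
t2 = np.linspace(m[1] - 12, m[1] + 12, 801)
X1, X2 = np.meshgrid(t1, t2, indexing="ij")
D = np.stack([X1 - m[0], X2 - m[1]], axis=-1)      # x - m at every grid point
pdf = norm_const * np.exp(-0.5 * np.einsum("...i,ij,...j->...", D, C_inv, D))

dx1, dx2 = t1[1] - t1[0], t2[1] - t2[0]
assert np.isclose(pdf.sum() * dx1 * dx2, 1.0, atol=1e-4)
```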
The concept of probability

There are rarely disputes in mathematics, but the meaning of $P(X \in B)$ is one of them. The question is simply: what does $P(X \in B)$ stand for? There are two schools:

2. Bayesian: The probability of an event is the degree of our belief that the event will happen. (We use this one!)

Remark 19. Why the Bayesian viewpoint? In inverse problems, there is scarcely objective information about the unknown. The Bayesian viewpoint allows us to complement the objective information by using believable prior distributions.
Pros:

Honesty: The prior distribution contains all the prior information about the unknown that is used in the problem solving. For example, in regularisation methods the prior information is contained in the choice of the method, which makes the applied prior information harder to compare.

Robustness: Prior distributions help to compensate for the effects of noise and disturbances in the data.

Cons:
Probabilities and density functions

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X: \Omega \to \mathbb{R}^n$ a random vector. The pdf is a tool for calculating the probabilities $P(X \in B)$.
Example 33 (Probability distribution without a pdf). Let $X$ be a random variable with pdf $f_X: \mathbb{R} \to [0, \infty)$. The random vector $(X, X)$ does not have a pdf.

Proof by contradiction: Assume that there is a pdf $f_{(X,X)}(x, y)$. Denote

$$B = \{(x, y) \in \mathbb{R}\times\mathbb{R} : x \neq y\}$$

($B$ is a Borel set whose indicator function $1_B(x, y)$ is Riemann integrable). The probability distribution gives the set $B$ the value $P((X, X) \in B) = 0$, since $(X, X) \notin B$. From the existence of the pdf, it follows that

$$0 = P((X, X) \in B) = \int_B f_{(X,X)}(x, y)\,dx\,dy = \int_{x=-\infty}^\infty\int_{y=-\infty}^x f_{(X,X)}(x, y)\,dy\,dx + \int_{x=-\infty}^\infty\int_{y=x}^\infty f_{(X,X)}(x, y)\,dy\,dx = 1,$$

which is impossible. (The slightly dubious use of Fubini is not actually necessary here, since we can also divide the integration area into the corresponding parts.) Hence, there is no pdf for $(X, X)$.
Example 34 (The pdf is not unique). Let $X: \Omega \to \mathbb{R}$ be a random variable with pdf given by (5.1). Hence, also

$$\widetilde f_X(x) = 1_{(0,1)}(x)$$

is a pdf of $X$. Clearly $\widetilde f_X \neq f_X$. For a multidimensional example, take the $n$-dimensional random vector $X = (X_1, \ldots, X_n)$ with statistically independent components with pdf given by (5.1). Then

$$f_X(x_1, \ldots, x_n) = 1_{[0,1]^n}(x_1, \ldots, x_n)$$

and

$$\widetilde f_X(x_1, \ldots, x_n) = 1_{(0,1)^n}(x_1, \ldots, x_n)$$

define the same probability distribution.
Definition 31. Let $X: \Omega \to \mathbb{R}^n$ be a random vector. The different pdfs $f: \mathbb{R}^n \to [0, \infty)$ satisfying

$$P(X \in B) = \int_B f_X(x)\,dx$$

for all hyperrectangles $B \subset \mathbb{R}^n$ are called versions of the pdf of $X$.
Remark 20. Let $X$ be an $n$-dimensional and $Y$ such an $m$-dimensional random vector that the random vector $(X, Y)$ has a (joint) pdf $f_{(X,Y)}(x, y)$. When the marginal pdf

$$f_X(x) = \int f_{(X,Y)}(x, y)\,dy$$
Conditional pdfs

Let $(\Omega, \mathcal{F}, P)$ be a probability space.

$$\mathbb{R}^n \ni x \mapsto f_X(x|Y = y_0) = \frac{f_{(X,Y)}(x, y_0)}{f_Y(y_0)}. \tag{5.2}$$
Other aspects of the conditional pdf can be seen by using abstract measure theory (not done in this course).

Remark 21. The condition $Y = y_0$ means that a random event $\omega_0 \in \Omega$ has happened and the random vector $Y$ has attained the sample value $y_0 = Y(\omega_0)$. In practical inverse problems this means that the noisy value of the data is observed/measured (the noisy data is then available, aka given).

If $X$ and $Y$ are statistically independent, then knowing the value of $Y$ does not affect the distribution of $X$, since

The significance of the conditional pdf in statistical inverse problems is based on the fact that there is dependence between the unknown $X$ and the data $Y$. When $Y = y_0$ is given, it can change the distribution of the unknown $X$.
Conditional pdfs are more easily handled by the Bayes formula.
Proof. Skipped. (Not hard with the Lebesgue integral, but requires considerably more material with the Riemann integral.)
Corollary 4. Let $X: \Omega \to \mathbb{R}^n$ and $Y: \Omega \to \mathbb{R}^m$ be two random vectors. If there exist open sets $O_1 \subset \mathbb{R}^n$ and $O_2 \subset \mathbb{R}^m$ satisfying

$$\int_{O_1}f_X(x)\,dx = 1 \quad\text{and}\quad \int_{O_2}f_Y(y|X = x)\,dy = 1\ \forall x, \tag{5.3}$$

then there is a conditional pdf of $X$ given $Y = y_0$ that is uniquely determined and continuous on $O_1$, whenever $y_0 \in O_2$ and $\int f_Y(y_0|X = x)f_X(x)\,dx > 0$.

Proof. The product of two Riemann integrable bounded functions is Riemann integrable. By Theorem 16, the product $f_Y(y|X = x)f_X(x)$ is a version of $f_{(X,Y)}(x, y)$. Since

$$\int_{O_1\times O_2}f_X(x)f_Y(y|X = x)\,dx\,dy \overset{\text{Fubini}}{=} \int_{O_1}f_X(x)\left(\int_{O_2}f_Y(y|X = x)\,dy\right)dx \overset{(5.3)}{=} 1,$$

then Lemma 7 holds for $f_X(x)f_Y(y|X = x)$ when $O = O_1\times O_2$, which proves the uniqueness of $f_{(X,Y)}$ on $O$ by continuity. By Def. 32,

whenever the denominator is positive. The finiteness of the denominator follows from the boundedness of the pdfs. The value

$$\int f_Y(y_0|X = x)f_X(x)\,dx = \int_{O_1}f_Y(y_0|X = x)f_X(x)\,dx + \int_{O_1^C}f_Y(y_0|X = x)f_X(x)\,dx$$
Definition 33. Let X and Y be as in Def. 32. The conditional expectation of X given Y = y_0 is
E[X \mid Y = y_0] = \int_{\mathbb{R}^n} x f_X(x \mid Y = y_0)\, dx,
if the integral exists.
We do not prove the following theorem, since the proof requires measure theoretic
tools.
Theorem 17. Let X be an \mathbb{R}^n-valued random vector that is statistically independent of the \mathbb{R}^m-valued random vector Z. Let G : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^k be a continuous mapping and let G(x_0, Z) have a pdf for every x_0 \in \mathbb{R}^n. Then
f_{G(X,Z)}(y \mid X = x_0) = f_{G(x_0,Z)}(y)
for all y \in \mathbb{R}^k.
5.2 Statistical inverse problems
Consider an inverse problem, where we are given the noisy data y_0 = F(x_0) + \epsilon \in \mathbb{R}^m about the unknown x_0 \in \mathbb{R}^n. The direct theory F : \mathbb{R}^n \to \mathbb{R}^m is here continuous.
We often have statistical information about the noise \epsilon. For example, \epsilon = (\epsilon_1, \dots, \epsilon_m) could consist of statistically independent components with the probability distributions
P(a \le \epsilon_i \le b) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_a^b \exp\left( -\frac{y^2}{2\sigma^2} \right) dy,
where i = 1, \dots, m, a < b \in \mathbb{R} and \sigma > 0.
5. The solution of the statistical inverse problem is the posterior distribution, whose pdf is
f_{post}(x) = \frac{f_Y(y_0 \mid X = x) f_{pr}(x)}{\int_{\mathbb{R}^n} f_Y(y_0 \mid X = x) f_{pr}(x)\, dx}.
Example 37. Let the noise \epsilon \sim N(0, C_\epsilon) and the unknown X \sim N(0, C_X), let the noise and the unknown be statistically independent, and let F : \mathbb{R}^n \to \mathbb{R}^m be linear with matrix M. The given data y_0 = M x_0 + \epsilon_0 is a sample of the random variable Y = M X + \epsilon. Then by Cor. 5, we have
f_Y(y \mid X = x) = f_\epsilon(y - M x) = \frac{1}{\sqrt{(2\pi)^m \det(C_\epsilon)}} e^{-\frac{1}{2} (y - Mx)^T C_\epsilon^{-1} (y - Mx)},
which is continuous and bounded. The prior pdf
f_{pr}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_X)}} e^{-\frac{1}{2} x^T C_X^{-1} x}
is also continuous and bounded. By Cor. 4 the posterior pdf is
f_{post}(x) = C_{y_0} e^{-\frac{1}{2} (y_0 - Mx)^T C_\epsilon^{-1} (y_0 - Mx)} e^{-\frac{1}{2} x^T C_X^{-1} x},
When the noise is distributed as N(0, \sigma^2 I) and the prior distribution is N(0, cI), the posterior expectation coincides with a Tikhonov-regularised solution with regularisation parameter \sigma^2/c.
The prior can be interpreted so that
X_i \sim N(0, c)
represents knowledge of the unknown telling us that the values of the unknown are not exactly known, but we feel that negative and positive values of the components are equally likely and that large values of the components are quite unlikely. Independence between the components allows large variation between the values of the components.
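This equivalence can be checked numerically; a minimal sketch (Python with NumPy, not part of the notes; the matrix M and data y0 are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 5
M = rng.standard_normal((m, n))
y0 = rng.standard_normal(m)
sigma2, c = 0.5, 2.0          # noise variance sigma^2, prior variance c

# Posterior mean for noise N(0, sigma^2 I) and prior N(0, c I)
m_post = np.linalg.solve(M.T @ M / sigma2 + np.eye(n) / c, M.T @ y0 / sigma2)

# Tikhonov-regularised solution with regularisation parameter sigma^2 / c:
# argmin_x |Mx - y0|^2 + (sigma^2/c) |x|^2 = (M^T M + (sigma^2/c) I)^{-1} M^T y0
delta = sigma2 / c
x_tik = np.linalg.solve(M.T @ M + delta * np.eye(n), M.T @ y0)

assert np.allclose(m_post, x_tik)
```

Multiplying the normal equations of the posterior mean by sigma^2 shows the two expressions are algebraically identical.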
Recap: Finite-dimensional linear Gaussian statistical inverse problem
The given data is y_0 = M x_0 + \epsilon_0, where M \in \mathbb{R}^{m \times n}.
The statistical model of the noise is an m-dimensional Gaussian random vector \epsilon that is distributed according to N(0, C_\epsilon), i.e.
f_\epsilon(y) = \frac{1}{\sqrt{(2\pi)^m \det(C_\epsilon)}} e^{-\frac{1}{2} y^T C_\epsilon^{-1} y}
for all y \in \mathbb{R}^m.
The statistical model of the unknown is an n-dimensional Gaussian random vector X that is independent of \epsilon and distributed according to N(0, C_X), i.e.
f_{pr}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_X)}} e^{-\frac{1}{2} x^T C_X^{-1} x}
for all x \in \mathbb{R}^n.
The statistical model of the data is Y = M X + \epsilon.
The solution is the posterior pdf
f_{post}(x) = \frac{f_Y(y_0 \mid X = x) f_{pr}(x)}{\int_{\mathbb{R}^n} f_Y(y_0 \mid X = x) f_{pr}(x)\, dx} = c_{y_0} e^{-\frac{1}{2} (y_0 - Mx)^T C_\epsilon^{-1} (y_0 - Mx)} e^{-\frac{1}{2} x^T C_X^{-1} x},
which simplifies to
f_{post}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_{post})}} e^{-\frac{1}{2} (x - m_{post})^T C_{post}^{-1} (x - m_{post})},
where
m_{post} = \left( M^T C_\epsilon^{-1} M + C_X^{-1} \right)^{-1} M^T C_\epsilon^{-1} y_0
and
C_{post} = \left( M^T C_\epsilon^{-1} M + C_X^{-1} \right)^{-1}.
In more general cases, the unknown and the noise can have non-zero expectations and the unknown and the noise need not be independent.
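The recap formulas translate directly into code; a sketch (Python/NumPy, not part of the notes, with made-up M, C_\epsilon and C_X):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4
M = rng.standard_normal((m, n))
C_eps = 0.3 * np.eye(m)       # noise covariance
C_X = 1.5 * np.eye(n)         # prior covariance
x_true = rng.standard_normal(n)
y0 = M @ x_true + rng.multivariate_normal(np.zeros(m), C_eps)

# Posterior covariance and mean of the linear Gaussian model Y = MX + eps
A = M.T @ np.linalg.inv(C_eps) @ M + np.linalg.inv(C_X)
C_post = np.linalg.inv(A)
m_post = C_post @ M.T @ np.linalg.inv(C_eps) @ y0

# C_post is symmetric positive definite by construction
assert np.allclose(C_post, C_post.T)
assert np.all(np.linalg.eigvalsh(C_post) > 0)
```

In practice one would solve the linear system instead of forming the inverses explicitly; the explicit form is kept here to mirror the formulas above.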
The case of independent noise term
Let X and \epsilon be independent random vectors and denote Y = F(X) + \epsilon, where the forward mapping F : \mathbb{R}^n \to \mathbb{R}^m is continuous.
If the random vector \epsilon has a pdf, then the conditional pdf of Y = F(X) + \epsilon given X = x is, by Corollary 5,
f_Y(y_0 \mid X = x) = f_{\epsilon + F(x)}(y_0) = f_\epsilon(y_0 - F(x)).   (5.6)
Example 38 (CT scan). The unknown X-ray mass absorption coefficient f = f(x', y') is approximated by the equation
f(x', y') = \sum_{j=1}^n x_j \phi_j(x', y'), \quad (x', y') \in \mathbb{R}^2,
where x = (x_1, \dots, x_n) \in \mathbb{R}^n contains the unknowns and the functions \phi_j are fixed. The data can be (coarsely) modeled as a vector y = (y_1, \dots, y_m) whose components are
y_i = \int_{C_i} f\, ds + \epsilon_i = \sum_{j=1}^n \left( \int_{C_i} \phi_j\, ds \right) x_j + \epsilon_i = (Mx)_i + \epsilon_i,
where i = 1, \dots, m and the random vector \epsilon is distributed according to N(0, \sigma^2 I). Then we end up with the statistical inverse problem
Y = M X + \epsilon.
When X and \epsilon are taken to be statistically independent, the likelihood function is
f_Y(y_0 \mid X = x) = \frac{1}{(2\pi\sigma^2)^{m/2}} e^{-\frac{1}{2\sigma^2} |y_0 - Mx|^2}.
Model errors
Next, we allow model errors for the direct theory and the unknown.
Theorem 19. Let Y be an m-dimensional rv, X an n-dimensional rv and U a k-dimensional rv such that the joint pdf f_{(X,U)} is positive and the conditional pdfs f_Y(y \mid (X, U) = (x, u)) and f_U(u \mid X = x) are given. Then the conditional pdf
f_Y(y \mid X = x) = \int_{\mathbb{R}^k} f_Y(y \mid (X, U) = (x, u)) f_U(u \mid X = x)\, du,
where the integrand is determined by Theorem 16. Then
f_Y(y \mid X = x) = \int_{\mathbb{R}^k} \frac{f_{(X,Y,U)}(x, y, u)}{f_{(X,U)}(x, u)} \frac{f_{(X,U)}(x, u)}{f_X(x)}\, du,
which gives the claim by the definition of conditional pdfs.
Example 39 (Approximation error). Consider the statistical inverse problem Y = F(X) + \epsilon, where the unknown X and the noise \epsilon are statistically independent. For computational reasons, a high-dimensional X is often approximated by a lower-dimensional rv X_n. Let's take X_n = P_n X, where P_n : \mathbb{R}^N \to \mathbb{R}^N is an orthogonal projection onto some n-dimensional subspace of \mathbb{R}^N, where n < N (and also m < N). Then
F(X) = F(X_n) + (F(X) - F(X_n)) =: F(X_n) + U,
which leads to
Y = F(X) + \epsilon = F(X_n) + U + \epsilon.
According to Theorem 19, the likelihood function for X_n can be expressed as
f_Y(y \mid X_n = x) = \int_{\mathbb{R}^m} f_U(u \mid X_n = x) f_\epsilon(y - F(x) - u)\, du,   (5.7)
whenever the assumptions of the theorem are fulfilled. Especially f_U(u \mid X_n = x) needs to be available.
The integral (5.7) is often computationally costly. One approximation is to replace U by a rv \tilde U that is similarly distributed but independent from X. When the prior distribution of X is given, then \tilde U + \epsilon has a known probability distribution. When this distribution has a pdf, then
f_Y(y \mid X_n = x) = f_{\epsilon + \tilde U}(y - F(x)).
Example 40 (Inaccuracies of the forward model). Let the forward model F : \mathbb{R}^n \to \mathbb{R}^m be a linear mapping whose matrix M = M_\lambda depends continuously on \lambda \in \mathbb{R}, where the value of \lambda is not precisely known. For example, in image enhancing (Chapter 1.2) the blurring map
\tilde m_{kl} = C_{kl} \sum_{i,j=1}^n e^{-(|k-i|^2/n^2 + |l-j|^2/n^2)/(2\lambda^2)}\, m_{ij}
contains such a parameter. Then we may model the inaccuracies of \lambda with a probability distribution. Say \Lambda, X and \epsilon are statistically independent and f_\Lambda(s) is the pdf of \Lambda. Then
Y = M_\Lambda X + \epsilon = G(\Lambda, X, \epsilon)
is a random vector, since
G : \mathbb{R} \times \mathbb{R}^n \times \mathbb{R}^m \ni (s, x, z) \mapsto M_s x + z
is continuous. By Theorem 17,
f_Y(y \mid (X, \Lambda) = (x, s)) = f_{G(s,x,\epsilon)}(y) = f_\epsilon(y - M_s x).
Under the assumptions of Theorem 19, we have
f_Y(y \mid X = x) = \int_{\mathbb{R}} f_\epsilon(y - M_s x) f_\Lambda(s)\, ds.
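The one-dimensional integral over the model parameter can be approximated by quadrature. A minimal sketch (Python/NumPy, illustration only; the scalar forward map M_s x = s x and the N(1, 0.1^2) parameter distribution are made up here):

```python
import numpy as np

sigma = 0.2
x, y = 1.0, 0.9

def f_eps(r):
    # pdf of the noise N(0, sigma^2)
    return np.exp(-r**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Toy scalar forward map M_s x = s*x with uncertain parameter s ~ N(1, 0.1^2)
s = np.linspace(0.5, 1.5, 2001)
ds = s[1] - s[0]
f_Lambda = np.exp(-(s - 1)**2 / (2 * 0.1**2)) / (np.sqrt(2 * np.pi) * 0.1)

# f_Y(y | X = x) = integral of f_eps(y - M_s x) f_Lambda(s) ds, by a Riemann sum
lik = np.sum(f_eps(y - s * x) * f_Lambda) * ds

# For this linear toy model the marginal is Gaussian: N(x, sigma^2 + (0.1*x)^2)
v = sigma**2 + (0.1 * x)**2
exact = np.exp(-(y - x)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
assert abs(lik - exact) < 1e-3
```

The closed-form check works only because the toy model is linear in s; for a genuine blurring matrix M_s the quadrature would be the practical route.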
5.2.2 The prior pdf fpr(x)
The prior pdf represents the information that we have about the unknown and also describes our perception of the lack of information.
Assume that x \in \mathbb{R}^n corresponds to values of some unknown function g at fixed points of [0, 1] \times [0, 1], say
x_i = g(t_i),
where t_i \in [0, 1] \times [0, 1] for i = 1, \dots, n.

Function g | Vector x
Some values of g are known exactly or inexactly. | Some components of x are known exactly or inexactly.
Smoothness of g. | Behavior of the neighboring components in x.
The image of g is known, e.g. g \ge 0, or monotonicity. | The subset where x belongs is known, e.g. x_i \ge 0, x_i \le x_{i+1}.
Symmetry of g. | Symmetry of x.
Other restrictions for g, e.g. if g : \mathbb{R}^3 \to \mathbb{R}^3 is a magnetic field, then \nabla \cdot g = 0. | Restrictions for x: equations G(x) = 0.
Uniform distribution
Let B \subset \mathbb{R}^n be a closed and bounded hyperrectangle
B = \{x \in \mathbb{R}^n : a_i \le x_i \le b_i, \ i = 1, \dots, n\}.
The random vector X is uniformly distributed on B if
f_{pr}(x) = \frac{1}{|B|} 1_B(x),
where the number |C| := \int_C dx.
The unknown belongs to the set B, i.e. the ith component belongs to the interval [a_i, b_i]. This prior reflects almost perfect uncertainty about the values of the unknown: we only know that they belong to B.
ℓ¹-prior
Define the norm \| \cdot \|_1 by
\|x\|_1 = \sum_{i=1}^n |x_i|
for all x \in \mathbb{R}^n.
A random vector X has an ℓ¹-prior if
f_{pr}(x) = \left( \frac{\alpha}{2} \right)^n e^{-\alpha \|x\|_1}.
The parameter \alpha reflects how certain we are that the unknown does not attain large values.
5.3.1 ℓ²-prior
A random vector X has an ℓ²-prior if
f_{pr}(x) = \left( \frac{\alpha}{\pi} \right)^{n/2} e^{-\alpha |x|^2}.
The components of X are independent and normally distributed.
[Two figures: 1D pdfs of the ℓ¹-prior and the ℓ²-prior for α = 0.5, 1, 2.]
Cauchy prior
A random vector X has a Cauchy prior if
f_{pr}(x) = \left( \frac{\alpha}{\pi} \right)^n \prod_{i=1}^n \frac{1}{1 + \alpha^2 x_i^2},
where x \in \mathbb{R}^n.
It best reflects a situation where some of the components of the unknown can attain large values.
[Figure: 1D pdf of the Cauchy prior for α = 0.5, 1, 2.]
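The heavier tails of the ℓ¹ and Cauchy priors can be checked numerically from the closed-form tail probabilities; a small sketch (Python, illustration only, comparing P(|X| > 3) for the 1D N(0,1), ℓ¹ and Cauchy priors with α = 1):

```python
import math

t, alpha = 3.0, 1.0

# P(|X| > t) under the three 1D priors
p_gauss = math.erfc(t / math.sqrt(2))                  # N(0, 1)
p_l1 = math.exp(-alpha * t)                            # l1 (Laplace) prior
p_cauchy = 1 - 2 / math.pi * math.atan(alpha * t)      # Cauchy prior

# Tails grow heavier in this order: Gaussian < l1 < Cauchy
assert p_gauss < p_l1 < p_cauchy
```

With these values p_gauss is about 0.003, p_l1 about 0.05 and p_cauchy about 0.2, which is why the Cauchy prior suits unknowns with occasional large components.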
A neighborhood system N_i \subset \{1, \dots, n\}, i = 1, \dots, n, satisfies:
1. i \notin N_i
2. i \in N_j if and only if j \in N_i.
[Figure 5.8: Pdf of N(0, 1), pdf of the 1D Cauchy prior and pdf of the 1D ℓ¹-prior.]
Definition 35. A random vector X is a discrete Markov field with respect to the neighborhood system N_i, i = 1, \dots, n, if
f_{X_i}(x \mid (X_1, X_2, \dots, X_{i-1}, X_{i+1}, X_{i+2}, \dots, X_n) = (x_1, x_2, \dots, x_{i-1}, x_{i+1}, x_{i+2}, \dots, x_n)) = f_{X_i}(x \mid X_k = x_k \ \forall k \in N_i).
The components X_i of a discrete Markov field depend only on the neighboring components X_k, k \in N_i.
Theorem 20 (Hammersley-Clifford). Let the rv X : \Omega \to \mathbb{R}^n be a discrete Markov field with respect to the neighborhood system N_i, i = 1, \dots, n. If X has a pdf f_X > 0, then
f_X(x) = c e^{-\sum_{i=1}^n V_i(x)},
where
V_j(x) = \alpha \sum_{i \in N_j} l_{ij} |x_i - x_j|
and the neighborhood N_j of index j consists only of the indices of those pixels i that share an edge with the pixel j. Moreover, the number l_{ij} is the length of the common edge between pixels i and j.
The total variation \sum_{j=1}^n \frac{1}{2} \sum_{i \in N_j} l_{ij} |x_i - x_j| is small if the difference between the color value x_i of pixel i and the corresponding values of its neighboring components is small, except possibly for those pixel sets whose borders have very short length.
Example 42 (1D Gaussian smoothness priors). Let X be a rv that corresponds to the values of an unknown function g at points t_i \in [0, 1], i = 1, \dots, n, where 0 = t_0 < t_1 < \dots < t_n < 1 are equidistant points and g(t) = 0 for t \le 0.
Fix the prior pdf of X as
f_{pr}(x) = c e^{-\alpha (x_1^2 + \sum_{i=2}^n (x_i - x_{i-1})^2)}.
If \alpha is large, then the neighboring components of X are more likely to be close to each other. Here N_j contains only the indices of the points t_i that are next to the point t_j (above it, below it, or to its left or right).
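Sampling from this prior is straightforward, since x_1^2 + \sum_{i=2}^n (x_i - x_{i-1})^2 = |Lx|^2 for the bidiagonal difference matrix L; a sketch (Python/NumPy, illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 100, 50.0

# f_pr(x) ~ exp(-alpha*(x_1^2 + sum (x_i - x_{i-1})^2)) = exp(-alpha*|Lx|^2),
# with L bidiagonal, so X = L^{-1} W / sqrt(2*alpha) for W ~ N(0, I)
L = np.eye(n) - np.diag(np.ones(n - 1), -1)
W = rng.standard_normal(n)
x = np.linalg.solve(L, W) / np.sqrt(2 * alpha)

# The increments x_i - x_{i-1} are i.i.d. N(0, 1/(2*alpha)):
# large alpha keeps neighboring components close
increments = L @ x
assert abs(np.var(increments) - 1 / (2 * alpha)) < 0.01
```

The sample paths are discrete random walks pinned near zero at the left endpoint, which matches the condition g(t) = 0 for t \le 0.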
Positivity constraint
If we know that the unknown has non-negative components, then we may restrict and renormalise the pdf:
f_{pr}(x) = c f_+(x) f_X(x),
where f_+(x) = 1 if x_i \ge 0 for all i = 1, \dots, n, and f_+(x) = 0 otherwise.
Hierarchical priors
When the unknown is modeled as a random vector whose pdf depends continuously on a parameter \lambda \in \mathbb{R}^{n'}, it is possible to model the uncertainty of the parameter by attaching a pdf to it.
Let X : \Omega \to \mathbb{R}^n be the rv that models the unknown and let the pdf of X be f_X. Let \Lambda : \Omega \to \mathbb{R}^{n'} be a rv that models the unknown parameter and let its pdf be f_\Lambda. Assume that we have the conditional pdf of X given \Lambda = s. Then
f_{post}(x, s) = c f_Y(y \mid (X, \Lambda) = (x, s)) f_{pr}(x, s) = c f_Y(y \mid X = x) f_{pr}(x, s)
whenever f_Y(y) > 0 (note that the likelihood function does not depend on s but only on x).
In options 1, 2 the prior pdf is called a hierarchical prior, the parameter \Lambda : \Omega \to \mathbb{R}^{n'} is called a hyperparameter, and its distribution a hyperprior.
Example 44. Let X : \Omega \to \mathbb{R}^3 be a rv that models the unknown and has the conditional pdf
f_{pr}(x; s) = \frac{\sqrt{s}}{(\sqrt{2\pi})^3} \exp\left( -\frac{1}{2} x_1^2 - \frac{s}{2} (x_2 - x_1)^2 - \frac{1}{2} (x_3 - x_2)^2 \right),
and let the hyperparameter \Lambda have the exponential hyperprior f_\Lambda(s) = \lambda f_+(s) e^{-\lambda s}, where \lambda > 0 and f_+(s) = 1 for s > 0 and 0 otherwise. Then
f_{(X,\Lambda)}(x, s) = f_+(s) \frac{\sqrt{s}}{(\sqrt{2\pi})^3} \exp\left( -\frac{1}{2} x_1^2 - \frac{s}{2} (x_2 - x_1)^2 - \frac{1}{2} (x_3 - x_2)^2 \right) \lambda e^{-\lambda s}
and
f_X(x) = \frac{\lambda}{(\sqrt{2\pi})^3} \exp\left( -\frac{1}{2} x_1^2 - \frac{1}{2} (x_3 - x_2)^2 \right) \int_0^\infty \sqrt{s} \exp\left( -s \left( \frac{1}{2} (x_2 - x_1)^2 + \lambda \right) \right) ds
= \frac{\lambda}{(\sqrt{2\pi})^3} \exp\left( -\frac{1}{2} x_1^2 - \frac{1}{2} (x_3 - x_2)^2 \right) \frac{1}{\left( \frac{1}{2} (x_2 - x_1)^2 + \lambda \right)^{3/2}} \int_0^\infty s^{1/2} e^{-s}\, ds
= \frac{\lambda}{(\sqrt{2\pi})^3} \exp\left( -\frac{1}{2} x_1^2 - \frac{1}{2} (x_3 - x_2)^2 \right) \frac{\Gamma(3/2)}{\left( \frac{1}{2} (x_2 - x_1)^2 + \lambda \right)^{3/2}}
= \frac{\lambda \exp\left( -\frac{1}{2} x_1^2 - \frac{1}{2} (x_3 - x_2)^2 \right)}{2\pi \left( (x_2 - x_1)^2 + 2\lambda \right)^{3/2}}.
The value of the Gamma function \Gamma(3/2) = \sqrt{\pi}/2.
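The marginalisation over the hyperparameter can be checked by numerical quadrature; a sketch (Python/NumPy, illustration only, with made-up values for \lambda and x):

```python
import numpy as np

lam = 1.0
x1, x2, x3 = 0.3, 1.0, -0.5

def f_pr(s):
    # conditional prior pdf f_pr(x; s) at the fixed point x = (x1, x2, x3)
    return (np.sqrt(s) / (2 * np.pi) ** 1.5
            * np.exp(-0.5 * x1**2 - 0.5 * s * (x2 - x1)**2 - 0.5 * (x3 - x2)**2))

# f_X(x) = integral over s > 0 of f_pr(x; s) * lam * exp(-lam*s), by a Riemann sum
ds = 1e-4
s = np.arange(1, 400001) * ds            # grid up to s = 40 (integrand is negligible beyond)
num = np.sum(f_pr(s) * lam * np.exp(-lam * s)) * ds

# closed form from the worked example, using Gamma(3/2) = sqrt(pi)/2
closed = (lam / (2 * np.pi)
          * np.exp(-0.5 * x1**2 - 0.5 * (x3 - x2)**2)
          / ((x2 - x1)**2 + 2 * lam) ** 1.5)
assert abs(num - closed) < 1e-4
```

The resulting marginal has Cauchy-like polynomial tails in x_2 - x_1 even though every conditional prior is Gaussian, which is the point of the hierarchical construction.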
[Figure 5.9: Pdf f(x) = \lambda / (x^2 + 2\lambda)^{3/2} for \lambda = 0.3, 1, 2.]
[Figure 5.10: Cauchy prior and the pdf f(x) = \lambda / (x^2 + 2\lambda)^{3/2} (transformed Beta).]
When the data is y we look for h(y) that gives the smallest possible posterior expectation of the loss. The number
r(h) = \int_{\mathbb{R}^m} \left( \int_{\mathbb{R}^n} L(x, h(y)) f_{post}(x; y)\, dx \right) f_Y(y)\, dy
is called the Bayes risk of the estimator h. The interpretation of the Bayes risk is that when the unknown is X and the noisy data is Y, the Bayes risk r(h) of the estimator h is the expected loss with respect to the joint distribution of X and Y, i.e. r(h) = E[L(X, h(Y))].
Example 45 (CM estimate). Take L(x, z) = |x - z|^2 as the loss function. Let m_{post}(y) denote the posterior expectation
m_{post}(y) = \int_{\mathbb{R}^n} x f_{post}(x; y)\, dx.
Then
\int_{\mathbb{R}^n} L(x, h(y)) f_{post}(x; y)\, dx = \int_{\mathbb{R}^n} |x - h(y)|^2 f_{post}(x; y)\, dx = \int_{\mathbb{R}^n} |x - m_{post}(y)|^2 f_{post}(x; y)\, dx + |m_{post}(y) - h(y)|^2.
The minimum loss is attained when |m_{post}(y) - h(y)|^2 = 0, i.e. h(y) = m_{post}(y), so that
\int_{\mathbb{R}^n} L(x, h(y)) f_{post}(x; y)\, dx = \sum_{i=1}^n (C_{post}(y))_{ii}.
In other words, the expectation of the loss function is the sum of the diagonal elements of the posterior covariance matrix, i.e. its trace.
The posterior expectation is often denoted by x_{CM} (CM = conditional mean).
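The trace identity can be checked by Monte Carlo; a sketch (Python/NumPy, illustration only, with a made-up 2D Gaussian posterior):

```python
import numpy as np

rng = np.random.default_rng(3)
m_post = np.array([1.0, -2.0])
C_post = np.array([[0.5, 0.2], [0.2, 0.3]])

# Expected quadratic loss of the CM estimate h(y) = m_post equals trace(C_post)
X = rng.multivariate_normal(m_post, C_post, size=200000)
mc_loss = np.mean(np.sum((X - m_post) ** 2, axis=1))
assert abs(mc_loss - np.trace(C_post)) < 0.02
```

Here trace(C_post) = 0.8, and the Monte Carlo estimate matches it to within sampling error.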
Example 46 (MAP estimate). We say that a pdf is unimodal if its global maximum is attained at only one point.
Let \epsilon > 0 and L_\epsilon(x, z) = 1_{B(z,\epsilon)^C}(x) for x, z \in \mathbb{R}^n. Let x \mapsto f_{post}(x; y) be unimodal for the given data y \in \mathbb{R}^m. The limit of the estimate
h_\epsilon(y) = \operatorname{argmin}_{z \in \mathbb{R}^n} \int_{\mathbb{R}^n} 1_{B(z,\epsilon)^C}(x) f_{post}(x; y)\, dx = \operatorname{argmin}_{z \in \mathbb{R}^n} \int_{\mathbb{R}^n \setminus B(z,\epsilon)} f_{post}(x; y)\, dx
is
\lim_{\epsilon \to 0^+} h_\epsilon(y) = x_{MAP}(y),
where
x_{MAP}(y) = \operatorname{argmax}_{x \in \mathbb{R}^n} f_{post}(x; y).
The maximum a posteriori estimate x_{MAP}(y) is useful when expectations are hard to obtain. It can also be written as
x_{MAP}(y) = \operatorname{argmax}_{x \in \mathbb{R}^n} f_Y(y \mid X = x) f_{pr}(x),
since the norming constant of the posterior pdf does not depend on x.
The MAP estimate is often used also in situations where the posterior pdf is not unimodal, whereby it is not unique.
In addition to estimates \hat x we can also determine their componentwise Bayesian confidence intervals by choosing a, e.g., in such a way that
P_{post}(|X_i - \hat x_i| \le a) = 1 - \gamma,
where e.g. \gamma = 0.05.
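For a Gaussian posterior marginal with standard deviation \sigma_i, the half-width a solves erf(a / (\sigma_i \sqrt{2})) = 0.95; a bisection sketch (Python, illustration only, with a made-up \sigma_i):

```python
import math

sigma_i = 0.5       # posterior standard deviation of component i (assumed Gaussian)
gamma = 0.05

# Solve P(|X_i - x_i| <= a) = erf(a / (sigma_i*sqrt(2))) = 1 - gamma by bisection
lo, hi = 0.0, 10.0 * sigma_i
for _ in range(60):
    a = 0.5 * (lo + hi)
    if math.erf(a / (sigma_i * math.sqrt(2))) < 1 - gamma:
        lo = a
    else:
        hi = a

# The 95% half-width is about 1.96 posterior standard deviations
assert abs(a - 1.96 * sigma_i) < 0.01
```

For non-Gaussian posteriors the interval would instead be computed from samples or by numerical integration of the posterior marginal.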
5.5 Recap
About probability theory
The conditional pdf of a random vector X given Y = y (with marginal pdf f_Y(y) > 0) is
f_X(x \mid Y = y) = \frac{f_{(X,Y)}(x, y)}{f_Y(y)}.
Bayes formula:
f_X(x \mid Y = y) = \frac{f_Y(y \mid X = x) f_X(x)}{\int_{\mathbb{R}^n} f_Y(y \mid X = x) f_X(x)\, dx}.
Statistical inverse problem
- The unknown and the data are modeled as random vectors X and Y.
- The probability distributions of X and Y represent quantitative and qualitative information about X and Y, and the lack of such information.
- The given data y_0 is a sample of Y, i.e. y_0 = Y(\omega_0) for some elementary event \omega_0 \in \Omega.
- The solution of a statistical inverse problem is the conditional pdf of X given Y = y_0 (with f_Y(y_0) > 0), the posterior pdf.
Typical priors include Gaussian priors (especially smoothness priors), the ℓ¹-prior, the Cauchy prior and the total variation prior (e.g. for 2D images).
Please learn:
- how to define the posterior pdf (up to the norming constant) when the unknown and the noise are statistically independent and the needed pdfs are continuous;
- how to write the expressions for the posterior pdf, its mean and covariance, in the linear Gaussian case;
- how to explain the connection between Tikhonov regularisation and Gaussian linear inverse problems;
- how to form the hierarchical prior pdf when the conditional pdf and the hyperprior are given.