
Least Squares

Data Fitting
with Applications
Per Christian Hansen
Víctor Pereyra
Godela Scherer
Least Squares Data Fitting with
Applications
Per Christian Hansen
Department of Informatics and Mathematical Modeling
Technical University of Denmark
Víctor Pereyra
Energy Resources Engineering
Stanford University
Godela Scherer
Department of Mathematics and Statistics
University of Reading
Contents
Foreword
Preface
Symbols and Acronyms
1 The Linear Data Fitting Problem
1.1 Parameter estimation, data approximation
1.2 Formulation of the data fitting problem
1.3 Maximum likelihood estimation
1.4 The residuals and their properties
1.5 Robust regression
2 The Linear Least Squares Problem
2.1 Linear least squares problem formulation
2.2 The QR factorization and its role
2.3 Permuted QR factorization
3 Analysis of Least Squares Problems
3.1 The pseudoinverse
3.2 The singular value decomposition
3.3 Generalized singular value decomposition
3.4 Condition number and column scaling
3.5 Perturbation analysis
4 Direct Methods for Full-Rank Problems
4.1 Normal equations
4.2 LU factorization
4.3 QR factorization
4.4 Modifying least squares problems
4.5 Iterative refinement
4.6 Stability and condition number estimation
4.7 Comparison of the methods
5 Direct Methods for Rank-Deficient Problems
5.1 Numerical rank
5.2 Peters-Wilkinson LU factorization
5.3 QR factorization with column permutations
5.4 UTV and VSV decompositions
5.5 Bidiagonalization
5.6 SVD computations
6 Methods for Large-Scale Problems
6.1 Iterative versus direct methods
6.2 Classical stationary methods
6.3 Non-stationary methods, Krylov methods
6.4 Practicalities: preconditioning and stopping criteria
6.5 Block methods
7 Additional Topics in Least Squares
7.1 Constrained linear least squares problems
7.2 Missing data problems
7.3 Total least squares (TLS)
7.4 Convex optimization
7.5 Compressed sensing
8 Nonlinear Least Squares Problems
8.1 Introduction
8.2 Unconstrained problems
8.3 Optimality conditions for constrained problems
8.4 Separable nonlinear least squares problems
8.5 Multiobjective optimization
9 Algorithms for Solving Nonlinear LSQ Problems
9.1 Newton's method
9.2 The Gauss-Newton method
9.3 The Levenberg-Marquardt method
9.4 Additional considerations and software
9.5 Iteratively reweighted LSQ algorithms for robust data fitting problems
9.6 Variable projection algorithm
9.7 Block methods for large-scale problems
10 Ill-Conditioned Problems
10.1 Characterization
10.2 Regularization methods
10.3 Parameter selection techniques
10.4 Extensions of Tikhonov regularization
10.5 Ill-conditioned NLLSQ problems
11 Linear Least Squares Applications
11.1 Splines in approximation
11.2 Global temperatures data fitting
11.3 Geological surface modeling
12 Nonlinear Least Squares Applications
12.1 Neural networks training
12.2 Response surfaces, surrogates or proxies
12.3 Optimal design of a supersonic aircraft
12.4 NMR spectroscopy
12.5 Piezoelectric crystal identification
12.6 Travel time inversion of seismic data
Appendixes
A Sensitivity Analysis
A.1 Floating-point arithmetic
A.2 Stability, conditioning and accuracy
B Linear Algebra Background
B.1 Norms
B.2 Condition number
B.3 Orthogonality
B.4 Some additional matrix properties
C Advanced Calculus Background
C.1 Convergence rates
C.2 Multivariable calculus
D Statistics
D.1 Definitions
D.2 Hypothesis testing
References
Index
Foreword
Scientific computing is founded in models that capture the properties of
systems under investigation, be they engineering systems; systems in natural
sciences; financial, economic, and social systems; or conceptual systems
such as those that arise in machine learning or speech processing. For models
to be useful, they must be calibrated against "real-world" systems and
informed by data. The recent explosion in availability of data opens up
unprecedented opportunities to increase the fidelity, resolution, and power
of models, but only if we have access to algorithms for incorporating this
data into models, effectively and efficiently.
For this reason, least squares, the first and best-known technique
for fitting models to data, remains central to scientific computing. This
problem class remains a fascinating topic of study from a variety of perspectives.
Least-squares formulations can be derived from statistical principles,
as maximum-likelihood estimates of models in which the model-data discrepancies
are assumed to arise from Gaussian white noise. In scientific
computing, they provide the vital link between model and data, the final
ingredient in a model that brings the other elements together. In their linear
variants, least-squares problems were a foundational problem in numerical
linear algebra, as this field grew rapidly in the 1960s and 1970s. From the
perspective of optimization, nonlinear least-squares has appealing structure
that can be exploited with great effectiveness in algorithm design.
Least squares is foundational in another respect: it can be extended
in a variety of ways, namely to alternative loss functions that are more robust to
outliers in the observations, to one-sided "hinge loss" functions, to regularized
models that impose structure on the model parameters in addition to
fitting the data, and to "total least squares" models in which errors appear
in the model coefficients as well as the observations.
This book surveys least-squares problems from all these perspectives. It
is both a comprehensive introduction to the subject and a valuable resource
to those already well versed in the area. It covers statistical motivations
along with thorough treatments of direct and iterative methods for linear
least squares and optimization methods for nonlinear least squares. The
later chapters contain compelling case studies of both linear and nonlinear
models, with discussions of model validation as well as model construction
and interpretation. It conveys both the rich history of the subject and its
ongoing importance, and reflects the many contributions that the authors
have made to all aspects of the subject.
Stephen Wright
Madison, Wisconsin, USA
Preface
This book surveys basic modern techniques for the numerical solution of
linear and nonlinear least squares problems and introduces the treatment
of large and ill-conditioned problems. The theory is extensively illustrated
with examples from engineering, environmental sciences, geophysics and
other application areas.
In addition to the treatment of the numerical aspects of least squares
problems, we introduce some important topics from the area of regression
analysis in statistics, which can help to motivate, understand and evaluate
the computed least squares solutions. The inclusion of these topics is one
aspect that distinguishes the present book from other books on the subject.
The presentation of the material is designed to give an overview, with
the goal of helping the reader decide which method would be appropriate
for a given problem, point toward available algorithms/software and, if necessary,
help in modifying the available tools to adapt for a given application.
The emphasis is therefore on the properties of the different algorithms and
few proofs are presented; the reader is instead referred to the appropriate
articles/books. Unfortunately, several important topics had to be left out,
among them, direct methods for sparse problems.
The content is geared toward scientists and engineers who must analyze
and solve least squares problems in their fields. It can be used as course
material for an advanced undergraduate or graduate course in the sciences
and engineering, presupposing a working knowledge of linear algebra and
basic statistics. It is written mostly in a terse style in order to provide a
quick introduction to the subject, while treating some of the not so well-known
topics in more depth. This in fact presents the reader with an
opportunity to verify the understanding of the material by completing or
providing the proofs without checking the references.
The least squares problem is known under different names in different
disciplines. One of our aims is to help bridge the communication gap between
the statistics and the numerical analysis literature on the subject,
often due to the use of different terminology, such as $\ell_2$-approximation,
regularization, regression analysis, parameter estimation, filtering, process
identification, etc.
Least squares methods have been with us for many years, since Gauss
invented and used them in his surveying activities [83]. In 1965, the paper
by G. H. Golub [92] on using the QR factorization and later his devel-
opment of a stable algorithm for the computation of the SVD started a
renewed interest in the subject in the, by then, changed work environment
of computers.
Thanks also to, among many others, Å. Björck, L. Eldén, C. C. Paige,
M. A. Saunders, G. W. Stewart, S. Van Huffel and P.-Å. Wedin, the topic
is now available in a robust, algorithmic and well-founded form.
There are many books partially or completely dedicated to linear and
nonlinear least squares. The first and one of the fundamental references for
linear problems is Lawson and Hanson's monograph [150]. Besides summarizing
the state of the art at the time of its publication, it highlighted the
practical aspects of solving least squares problems. Bates and Watts [9]
have an early comprehensive book focused on the nonlinear least squares
problem with a strong statistical approach. Björck's book [20] contains a
very careful and comprehensive survey of numerical methods for both linear
and nonlinear problems, including the treatment of large, sparse problems.
Golub and Van Loan's Matrix Computations [105] includes several
chapters on different aspects of least squares solution and on total least
squares. The total least squares problem, known in statistics as latent root
regression, is discussed in the book by S. Van Huffel and J. Vandewalle
[239]. Seber and Wild [223] consider exhaustively all aspects of nonlinear
least squares estimation and modeling. Although it is a general treatise
on optimization, Nocedal and Wright's book [170] includes a very clear
chapter on nonlinear least squares. Additional material can be found in
[21, 63, 128, 232, 242, 251].
We would like to acknowledge the help of Michael Saunders (iCME, Stanford
University), who read carefully the whole manuscript and made a
myriad of observations and corrections that have greatly improved the final
product.
Per Christian Hansen would like to thank several colleagues from DTU
Informatics who assisted with the statistical aspects.
Godela Scherer gives thanks for all the support at the Department of
Mathematics and Statistics, University of Reading, where she was a visiting
research fellow while working on this book. In particular, she would like
to thank Professor Mike J. Baines and Dr. I Llatas for numerous inspiring
discussions.
Víctor Pereyra acknowledges Weidlinger Associates Inc. and most especially
David Vaughan and Howard Levine, for their unflagging support
and for letting him keep his office and access to computing facilities after
retirement. Also Ms. P. Tennant helped immensely by improving many of
the figures.
Special thanks are due to the professional handling of the manuscript
by the publishers and more specifically to executive editor Vincent J. Burke
and production editor Andre Barnett.
Prior to his untimely death in November 2007, Professor Gene Golub
had been an integral part of this project team. Although the book has
changed significantly since then, it has greatly benefited from his insight
and knowledge. He was an inspiring mentor and great friend, and we miss
him dearly.
Symbols and Acronyms
Symbol                          Represents
$A$                             $m \times n$ matrix
$A^\dagger$, $A^-$, $A^T$       pseudoinverse, generalized inverse and transpose of $A$
$b$                             right-hand side, length $m$
cond($\cdot$)                   condition number of matrix in $\ell_2$-norm
Cov($\cdot$)                    covariance matrix
diag($\cdot$)                   diagonal matrix
$e$                             vector of noise, length $m$
$e_i$                           noise component in data
$\mathbf{e}_i$                  canonical unit vector
$\varepsilon_M$                 machine precision
$\mathcal{E}(\cdot)$            expected value
$f_j(t)$                        model basis function
$\Gamma(t)$                     pure-data function
$M(x, t)$                       fitting model
$\mathcal{N}$                   normal (or Gaussian) distribution
null($A$)                       null space of $A$
$p$                             degree of polynomial
$P$, $P_i$, $P_x$               probability
$\mathcal{P}_X$                 projection onto space $X$
$\Pi$                           permutation matrix
$Q$                             $m \times m$ orthogonal matrix, partitioned as $Q = (\, Q_1 \ \ Q_2 \,)$
$r = r(A)$                      rank of matrix $A$
$r$, $r^*$                      residual vector, least squares residual vector
$r_i$                           residual for $i$th data
range($A$)                      range of matrix $A$
$R$, $R_1$                      $m \times n$ and $n \times n$ upper triangular matrix
span$\{w_1, \dots, w_p\}$       subspace generated by the vectors
$\Sigma$                        diagonal SVD matrix
$\varsigma$, $\varsigma_i$      standard deviation
$\sigma_i$                      singular value
$t$                             independent variable in data fitting problem
$t_i$                           abscissa in data fitting problem
$U$, $V$                        $m \times m$ and $n \times n$ left and right SVD matrices
$u_i$, $v_i$                    left and right singular vectors, length $m$ and $n$ respectively
$W$                             $m \times m$ diagonal weight matrix
$w_i$                           weight in weighted least squares problem
$x$, $x^*$                      vector of unknowns, least squares solution, length $n$
$\hat{x}^*$                     minimum-norm LSQ solution
$x_B$, $x_{TLS}$                basic LSQ solution and total least squares solution
$x_i$                           coefficient in a linear fitting model
$y$                             vector of data in a data fitting problem, length $m$
$y_i$                           data in data fitting problem
$\| \cdot \|_2$                 2-norm, $\|x\|_2 = (x_1^2 + \cdots + x_n^2)^{1/2}$
                                perturbed version of
Acronym Name
CG conjugate gradient
CGLS conjugate gradient for LSQ
FG fast Givens
GCV generalized cross validation
G-N Gauss-Newton
GS Gram-Schmidt factorization
GSVD generalized singular value decomposition
LASVD SVD for large, sparse matrices
L-M Levenberg-Marquardt
LP linear prediction
LSE equality constrained LSQ
LSI inequality constrained LSQ
LSQI quadratically constrained LSQ
LSQ least squares
LSQR Paige-Saunders algorithm
LU LU factorization
MGS modified Gram-Schmidt
NLLSQ nonlinear least squares
NMR nuclear magnetic resonance
NN neural network
QR QR decomposition
RMS root mean square
RRQR rank revealing QR decomposition
SVD singular value decomposition
TLS total least squares problem
TSVD truncated singular value decomposition
UTV UTV decomposition
VARPRO variable projection algorithm, Netlib version
VP variable projection
Chapter 1
The Linear Data Fitting
Problem
This chapter gives an introduction to the linear data fitting problem: how
it is defined, its mathematical aspects and how it is analyzed. We also give
important statistical background that provides insight into the data fitting
problem. Anyone with more interest in the subject is encouraged to consult
the pedagogical expositions by Bevington [13], Rust [213], Strutz [233] and
van den Bos [242].
We start with a couple of simple examples that introduce the basic
concepts of data fitting. Then we move on to a more formal definition, and
we discuss some statistical aspects. Throughout the first chapters of this
book we will return to these data fitting problems in order to illustrate the
ensemble of numerical methods and techniques available to solve them.
1.1 Parameter estimation, data approximation
Example 1. Parameter estimation. In food-quality analysis, the amount
and mobility of water in meat have been shown to affect quality attributes like
appearance, texture and storage stability. The water content can be measured
by means of nuclear magnetic resonance (NMR) techniques, in which
the measured signal reflects the amount and properties of different types of
water environments in the meat. Here we consider a simplified example
involving frozen cod, where the ideal time signal $\Gamma(t)$ from NMR is a sum
of two damped exponentials plus a constant background,
$$\Gamma(t) = x_1 e^{-\lambda_1 t} + x_2 e^{-\lambda_2 t} + x_3, \qquad \lambda_1, \lambda_2 > 0.$$
In this example we assume that we know the parameters $\lambda_1$ and $\lambda_2$ that
control the decay of the two exponential components. In practice we do not
measure this pure signal, but rather a noisy realization of it as shown in
Figure 1.1.1.
Figure 1.1.1: Noisy measurements of the time signal $\Gamma(t)$ from NMR, for
the example with frozen cod meat.
The parameters $\lambda_1 = 27\ \mathrm{s}^{-1}$ and $\lambda_2 = 8\ \mathrm{s}^{-1}$ characterize two different
types of proton environments, responsible for two different water mobilities.
The amplitudes $x_1$ and $x_2$ are proportional to the amount of water contained
in the two kinds of proton environments. The constant $x_3$ accounts for an
undesired background (bias) in the measurements. Thus, there are three
unknown parameters in this model, namely, $x_1$, $x_2$ and $x_3$. The goal of data
fitting in relation to this problem is to use the measured data to estimate
the three unknown parameters and then compute the different kinds of water
contents in the meat sample. The actual fit is presented in Figure 1.2.1.
In this example we used the technique of data fitting for the
purpose of estimating unknown parameters in a mathematical
model from measured data. The model was dictated by
the physical or other laws that describe the data.
Example 2. Data approximation. We are given measurements of air
pollution, in the form of the concentration of NO, over a period of 24 hours,
on a busy street in a major city. Since the NO concentration is mainly due
to the cars, it has maximum values in the morning and in the afternoon,
when the traffic is most intense. The data is shown in Table 1.1 and the
plot in Figure 1.2.2.
For further analysis of the air pollution we need to fit a smooth curve
to the measurements, so that we can compute the concentration at an arbitrary
time between 0 and 24 hours. For example, we can use a low-degree
polynomial to model the data, i.e., we assume that the NO concentration
can be approximated by
$$f(t) = x_1 t^p + x_2 t^{p-1} + \cdots + x_p t + x_{p+1},$$
where $t$ is the time, $p$ is the degree of the polynomial and
$x_1, x_2, \dots, x_{p+1}$ are the unknown coefficients in the polynomial. A better
model however, since the data repeats every day, would use periodic functions:
$$f(t) = x_1 + x_2 \sin(\omega t) + x_3 \cos(\omega t) + x_4 \sin(2\omega t) + x_5 \cos(2\omega t) + \cdots$$
where $\omega = 2\pi/24$ is the angular frequency corresponding to a period of
24 hours. Again, $x_1, x_2, \dots$ are the unknown coefficients.
The goal of data fitting in relation to this problem is to estimate the
coefficients $x_1, x_2, \dots$, such that we can evaluate the function $f(t)$ for any
argument $t$. At the same time we want to suppress the influence of errors
present in the data.

 t_i     y_i     t_i     y_i     t_i     y_i     t_i     y_i     t_i     y_i
   0  110.49       5   29.37      10  294.75      15  245.04      20  216.73
   1   73.72       6   74.74      11  253.78      16  286.74      21  185.78
   2   23.39       7  117.02      12  250.48      17  304.78      22  171.19
   3   17.11       8  298.04      13  239.48      18  288.76      23  171.73
   4   20.31       9  348.13      14  236.52      19  247.11      24  164.05

Table 1.1: Measurements of NO concentration $y_i$ as a function of time $t_i$.
The units of $y_i$ and $t_i$ are $\mu\mathrm{g/m^3}$ and hours, respectively.

In this example we used the technique of data fitting for the
purpose of approximating measured discrete data: we fitted a
model to given data in order to be able to compute smoothed
data for any value of the independent variable in the model.
We were free to choose the model as long as it gave an adequate
fit to the data.
Both examples illustrate that we are given data with measurement errors
and that we want to fit a model to these data that captures their overall
behavior without being too sensitive to the errors. The difference between
the two examples is that in the first case the model arises from a
physical theory, while in the second there is an arbitrary continuous
approximation to a set of discrete data.
Data fitting is distinctly different from the problem of interpolation,
where we seek a model, a function $f(t)$, that interpolates the given data,
i.e., it satisfies $f(t_i) = y_i$ for all the data points. We are not interested
in interpolation (which is not suited for noisy data); rather, we want to
approximate the noisy data with a parametric model that is either given or
that we can choose, in such a way that the result is not too sensitive to the
noise. In this data fitting approach there is redundant data, i.e., more data
than unknown parameters, which also helps to decrease the uncertainty in
the parameters of the model. See Example 15 in the next chapter for a
justification of this.
1.2 Formulation of the data fitting problem
Let us now give a precise definition of the data fitting problem. We assume
that we are given $m$ data points
$$(t_1, y_1), (t_2, y_2), \dots, (t_m, y_m),$$
which can be described by the relation
$$y_i = \Gamma(t_i) + e_i, \qquad i = 1, 2, \dots, m. \tag{1.2.1}$$
The function $\Gamma(t)$, which we call the pure-data function, describes the
noise-free data (it may be unknown, or given by the application), while
$e_1, e_2, \dots, e_m$ are the data errors (they are unknown, but we may have
some statistical information about them). The data errors, also referred
to as noise, represent measurement errors as well as random variations
in the physical process that generates the data. Without loss of generality
we can assume that the abscissas $t_i$ appear in non-decreasing order, i.e.,
$$t_1 \le t_2 \le \cdots \le t_m.$$
In data fitting we wish to compute an approximation to $\Gamma(t)$, typically
in the interval $[t_1, t_m]$. The approximation is given by the fitting model
$M(x, t)$, where the vector $x = (x_1, x_2, \dots, x_n)^T$ contains $n$ parameters
that characterize the model and are to be determined from the given noisy
data. In the linear data fitting problem we always have a model of the form
$$\text{Linear fitting model:} \qquad M(x, t) = \sum_{j=1}^{n} x_j f_j(t). \tag{1.2.2}$$
The functions $f_j(t)$ are called the model basis functions, and the number
$n$, the order of the fit, should preferably be smaller than the number
$m$ of data points. A notable modern exception is related to the so-called
compressed sensing, which we discuss briefly in Section 7.5.
The form of the function $M(x, t)$ (i.e., the choice of basis functions)
depends on the precise goal of the data fitting. These functions may be
given by the underlying mathematical model that describes the data (in
which case $M(x, t)$ is often equal to, or an approximation to, the pure-data
function $\Gamma(t)$), or the basis functions may be chosen arbitrarily among all
functions that give the desired approximation and allow for stable numerical
computations.
The method of least squares (LSQ) is a standard technique for determining
the unknown parameters in the fitting model. The least squares fit
is defined as follows. We introduce the residual $r_i$ associated with the data
points as
$$r_i = y_i - M(x, t_i), \qquad i = 1, 2, \dots, m,$$
and we note that each residual is a function of the parameter vector $x$, i.e.,
$r_i = r_i(x)$. A least squares fit is a choice of the parameter vector $x$ that
minimizes the sum-of-squares of the residuals:
$$\text{LSQ fit:} \qquad \min_x \sum_{i=1}^{m} r_i(x)^2 = \min_x \sum_{i=1}^{m} \bigl( y_i - M(x, t_i) \bigr)^2. \tag{1.2.3}$$
In the next chapter we shall describe in which circumstances the least
squares fit is unique, and in the following chapters we shall describe a
number of efficient computational methods for obtaining the least squares
parameter vector $x$.
We note in passing that there are other related criteria used in data
fitting; for example, one could replace the sum-of-squares in (1.2.3) with
the sum-of-absolute-values:
$$\min_x \sum_{i=1}^{m} |r_i(x)| = \min_x \sum_{i=1}^{m} \bigl| y_i - M(x, t_i) \bigr|. \tag{1.2.4}$$
Below we shall use a statistical perspective to describe when these two
choices are appropriate. However, we emphasize that the book focuses on
the least squares fit.
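To get a feel for how the two criteria differ, consider the simplest possible fitting model, a constant $M(x, t) = x_1$ (so $n = 1$ and $f_1(t) = 1$). It is a standard fact that the sum-of-squares criterion (1.2.3) is then minimized by the mean of the data, while the sum-of-absolute-values criterion (1.2.4) is minimized by the median. The following minimal sketch checks this by brute force on a grid; the data values, including the single outlier, are made up for illustration and do not come from the text.

```python
import numpy as np

# Hypothetical data: four consistent values and one outlier.
y = np.array([1.0, 1.2, 0.9, 1.1, 5.0])

# Candidate values of the single parameter x_1 in the constant model.
grid = np.linspace(0.0, 6.0, 6001)

# Residuals r_i(x) = y_i - x_1 for every candidate (one row per candidate).
R = y[None, :] - grid[:, None]

sum_sq = np.sum(R**2, axis=1)        # criterion (1.2.3)
sum_abs = np.sum(np.abs(R), axis=1)  # criterion (1.2.4)

x_l2 = grid[np.argmin(sum_sq)]
x_l1 = grid[np.argmin(sum_abs)]

print("sum-of-squares minimizer:", x_l2, " mean:  ", y.mean())
print("sum-of-abs minimizer:    ", x_l1, " median:", np.median(y))
```

The outlier pulls the sum-of-squares minimizer away from the bulk of the data, whereas the sum-of-absolute-values minimizer stays with the majority of the points.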
In order to obtain a better understanding of the least squares data fitting
problem we take a closer look at the residuals, which we can write as
$$r_i = y_i - M(x, t_i) = \bigl( y_i - \Gamma(t_i) \bigr) + \bigl( \Gamma(t_i) - M(x, t_i) \bigr) = e_i + \bigl( \Gamma(t_i) - M(x, t_i) \bigr), \qquad i = 1, 2, \dots, m. \tag{1.2.5}$$
We see that the $i$th residual consists of two components: the data error
$e_i$ comes from the measurements, while the approximation error
$\Gamma(t_i) - M(x, t_i)$ is due to the discrepancy between the pure-data function and the
computed fitting model. We emphasize that even if $\Gamma(t)$ and $M(x, t)$ have
the same form, there is no guarantee that the estimated parameters $x$ used
in $M(x, t)$ will be identical to those underlying the pure-data function $\Gamma(t)$.
At any rate, we see from this dichotomy that a good fitting model $M(x, t)$
is one for which the approximation errors are of the same size as the data
errors.
Underlying the least squares formulation in (1.2.3) are the assumptions
that the data and the errors are independent and that the errors are white
noise. The latter means that all data errors are uncorrelated and of the
same size or, in more precise statistical terms, that the errors $e_i$ have mean
zero and identical variance: $\mathcal{E}(e_i) = 0$ and $\mathcal{E}(e_i^2) = \varsigma^2$ for $i = 1, 2, \dots, m$
(where $\varsigma$ is the standard deviation of the errors).
This ideal situation is not always the case in practice! Hence, we also
need to consider the more general case where the standard deviation depends
on the index $i$, i.e.,
$$\mathcal{E}(e_i) = 0, \qquad \mathcal{E}(e_i^2) = \varsigma_i^2, \qquad i = 1, 2, \dots, m,$$
where $\varsigma_i$ is the standard deviation of $e_i$. In this case, the maximum likelihood
principle in statistics (see Section 1.3) tells us that we should minimize
the weighted residuals, with weights equal to the reciprocals of the
standard deviations:
$$\min_x \sum_{i=1}^{m} \left( \frac{r_i(x)}{\varsigma_i} \right)^2 = \min_x \sum_{i=1}^{m} \left( \frac{y_i - M(x, t_i)}{\varsigma_i} \right)^2. \tag{1.2.6}$$
Now consider the expected value of the weighted sum-of-squares:
$$\mathcal{E}\left( \sum_{i=1}^{m} \left( \frac{r_i(x)}{\varsigma_i} \right)^2 \right) = \sum_{i=1}^{m} \mathcal{E}\left( \frac{r_i(x)^2}{\varsigma_i^2} \right) = \sum_{i=1}^{m} \mathcal{E}\left( \frac{e_i^2}{\varsigma_i^2} \right) + \sum_{i=1}^{m} \mathcal{E}\left( \frac{(\Gamma(t_i) - M(x, t_i))^2}{\varsigma_i^2} \right) = m + \sum_{i=1}^{m} \frac{\mathcal{E}\bigl( (\Gamma(t_i) - M(x, t_i))^2 \bigr)}{\varsigma_i^2},$$
where we have used that $\mathcal{E}(e_i) = 0$ and $\mathcal{E}(e_i^2) = \varsigma_i^2$. The consequence of
this relation is the intuitive result that we can allow the expected value of
the approximation errors to be larger for those data $(t_i, y_i)$ that have larger
standard deviations (i.e., larger errors). Example 4 illustrates the usefulness
of this approach. See Chapter 3 in [233] for a thorough discussion on how
to estimate weights for a given data set.
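The middle step of the relation above relies on the splitting (1.2.5) of each residual into data error and approximation error. A sketch of the omitted algebra for a single term follows; it assumes, as the derivation implicitly does, that the cross term between the data error $e_i$ and the approximation error has zero expected value:

$$\mathcal{E}\left( \frac{r_i(x)^2}{\varsigma_i^2} \right) = \frac{\mathcal{E}(e_i^2)}{\varsigma_i^2} + \frac{2\, \mathcal{E}\bigl( e_i\, (\Gamma(t_i) - M(x, t_i)) \bigr)}{\varsigma_i^2} + \frac{\mathcal{E}\bigl( (\Gamma(t_i) - M(x, t_i))^2 \bigr)}{\varsigma_i^2} = 1 + 0 + \frac{\mathcal{E}\bigl( (\Gamma(t_i) - M(x, t_i))^2 \bigr)}{\varsigma_i^2}.$$

The first term equals 1 because $\mathcal{E}(e_i^2) = \varsigma_i^2$, and summing the $m$ terms gives the constant $m$ plus the weighted approximation errors.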
We are now ready to state the least squares data fitting problem in
terms of matrix-vector notation. We define the matrix $A \in \mathbb{R}^{m \times n}$ and the
vectors $y, r \in \mathbb{R}^m$ as follows,
$$A = \begin{pmatrix} f_1(t_1) & f_2(t_1) & \cdots & f_n(t_1) \\ f_1(t_2) & f_2(t_2) & \cdots & f_n(t_2) \\ \vdots & \vdots & & \vdots \\ f_1(t_m) & f_2(t_m) & \cdots & f_n(t_m) \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \qquad r = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{pmatrix},$$
i.e., $y$ is the vector of observations, $r$ is the vector of residuals and the
matrix $A$ is constructed such that the $j$th column is the $j$th model basis
function sampled at the abscissas $t_1, t_2, \dots, t_m$. Then it is easy to see that
for the un-weighted data fitting problem we have the relations
$$r = y - A x \qquad \text{and} \qquad \rho(x) = \sum_{i=1}^{m} r_i(x)^2 = \|r\|_2^2 = \|y - A x\|_2^2.$$
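As an illustration of this construction, the sketch below builds $A$ column by column from a list of basis functions and checks that the sum of squared residuals equals $\|y - Ax\|_2^2$. The helper name `design_matrix`, the straight-line model and the data values are hypothetical choices made for this example, and NumPy's `numpy.linalg.lstsq` is used only as a convenient solver, not as the method developed later in the book.

```python
import numpy as np

def design_matrix(basis, t):
    """Return the m-by-n matrix A with entries A[i, j] = f_j(t_i)."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([f(t) for f in basis])

# Hypothetical example: straight-line model M(x, t) = x_1 + x_2 t.
basis = [lambda t: np.ones_like(t), lambda t: t]
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

A = design_matrix(basis, t)

# Solve the un-weighted linear least squares problem min ||y - A x||_2.
x, *_ = np.linalg.lstsq(A, y, rcond=None)

r = y - A @ x          # residual vector
rho = np.sum(r**2)     # sum of squared residuals
print("x =", x, " rho(x) =", rho)
```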
Similarly, for the weighted problem we have
$$\rho_W(x) = \sum_{i=1}^{m} \left( \frac{r_i(x)}{\varsigma_i} \right)^2 = \|W (y - A x)\|_2^2,$$
with the weighting matrix and weights
$$W = \mathrm{diag}(w_1, \dots, w_m), \qquad w_i = \varsigma_i^{-1}, \qquad i = 1, 2, \dots, m.$$
In both cases, the computation of the coefficients in the least squares fit is
identical to the solution of a linear least squares problem for $x$. Throughout
the book we will study these least squares problems in detail and give
efficient computational algorithms to solve them.
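Since $W$ is diagonal, the weighted problem can be handed to any ordinary least squares solver after scaling row $i$ of $A$ and entry $i$ of $y$ by $w_i$. The sketch below shows this under stated assumptions: the function name `weighted_lstsq` and the small data set are hypothetical, and `numpy.linalg.lstsq` stands in for the algorithms discussed in later chapters.

```python
import numpy as np

def weighted_lstsq(A, y, sigma):
    """Minimize ||W (y - A x)||_2 with W = diag(1 / sigma_i)."""
    w = 1.0 / np.asarray(sigma, dtype=float)
    # Scaling row i of A and entry i of y by w_i gives an ordinary LSQ problem.
    x, *_ = np.linalg.lstsq(w[:, None] * A, w * y, rcond=None)
    return x

# Hypothetical data: straight line, the last two points are less reliable.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.0, 2.1, 2.5, 5.0])
sigma = np.array([0.1, 0.1, 0.1, 0.5, 0.5])
A = np.column_stack([np.ones_like(t), t])

x_w = weighted_lstsq(A, y, sigma)
x_u = weighted_lstsq(A, y, np.ones_like(t))  # equal weights: un-weighted fit
print("weighted:   ", x_w)
print("un-weighted:", x_u)
```

With equal standard deviations the scaling has no effect on the minimizer, so the same routine covers the un-weighted case.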
Example 3. We return to the NMR data fitting problem from Example 1.
For this problem there are 50 measured data points and the model basis
functions are
$$f_1(t) = e^{-\lambda_1 t}, \qquad f_2(t) = e^{-\lambda_2 t}, \qquad f_3(t) = 1,$$
and hence we have $m = 50$ and $n = 3$. In this example the errors in all
data points have the same standard deviation $\varsigma = 0.1$, so we can use the
un-weighted approach. The solution to the $50 \times 3$ least squares problem is
$$x_1 = 1.303, \qquad x_2 = 1.973, \qquad x_3 = 0.305.$$
The exact parameters used to generate the data are 1.27, 2.04 and 0.3,
respectively. These data were then perturbed with random errors. Figure
1.2.1 shows the data together with the least squares fit $M(x, t)$; note how
the residuals are distributed on both sides of the fit.
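A rough re-creation of this experiment can be sketched as follows. The decay rates, exact parameters, number of data points and noise level are taken from Examples 1 and 3; the sampling times (equidistant on $[0, 0.4]$ s) and the random seed are assumptions, since the text does not state them, so the computed estimates will not reproduce the values above.

```python
import numpy as np

lam1, lam2 = 27.0, 8.0                 # decay rates from Example 1 (1/s)
x_exact = np.array([1.27, 2.04, 0.3])  # exact parameters from Example 3
m, sigma = 50, 0.1

# Assumption: equidistant abscissas on [0, 0.4] s; the text does not give them.
t = np.linspace(0.0, 0.4, m)

# Columns of A are the three basis functions sampled at the abscissas.
A = np.column_stack([np.exp(-lam1 * t), np.exp(-lam2 * t), np.ones(m)])

rng = np.random.default_rng(0)         # arbitrary seed
y_pure = A @ x_exact                   # noise-free (pure) data
y = y_pure + sigma * rng.standard_normal(m)

x_fit, *_ = np.linalg.lstsq(A, y, rcond=None)
r = y - A @ x_fit
print("estimated parameters:", x_fit)
print("residual norm:", np.linalg.norm(r))
```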
For the data fitting problem in Example 2, we try both the polynomial
fit and the trigonometric fit. In the first case the basis functions are the
monomials $f_j(t) = t^{n-j}$, for $j = 1, \dots, n = p + 1$, where $p$ is the degree of
the polynomial. In the second case the basis functions are the trigonometric
functions:
$$f_1(t) = 1, \quad f_2(t) = \sin(\omega t), \quad f_3(t) = \cos(\omega t), \quad f_4(t) = \sin(2\omega t), \quad f_5(t) = \cos(2\omega t), \quad \dots$$
Figure 1.2.2 shows the two fits using a polynomial of degree $p = 8$ (giving
a fit of order $n = 9$) and a trigonometric fit with $n = 9$. The trigonometric
fit looks better. We shall later introduce computational tools that let us
investigate this aspect in more rigorous ways.
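The two fits can be tried out on the data of Table 1.1 with a few lines of NumPy. This is a sketch, not the computation used to produce Figure 1.2.2: the polynomial basis is written in the scaled variable $s = t/24$ (an assumption made here to improve conditioning; it spans the same polynomials of degree 8), and `numpy.linalg.lstsq` is used as a generic solver.

```python
import numpy as np

# Data from Table 1.1: hours t_i = 0, ..., 24 and NO concentrations y_i.
t = np.arange(25.0)
y = np.array([110.49, 73.72, 23.39, 17.11, 20.31,
              29.37, 74.74, 117.02, 298.04, 348.13,
              294.75, 253.78, 250.48, 239.48, 236.52,
              245.04, 286.74, 304.78, 288.76, 247.11,
              216.73, 185.78, 171.19, 171.73, 164.05])

n = 9  # order of both fits

# Polynomial basis of degree p = n - 1, in the scaled variable s = t / 24
# (assumption: scaling chosen here for conditioning, same polynomial space).
s = t / 24.0
A_poly = np.column_stack([s**(n - j) for j in range(1, n + 1)])

# Trigonometric basis: 1, sin(k w t), cos(k w t) for k = 1, ..., 4.
w = 2.0 * np.pi / 24.0
cols = [np.ones_like(t)]
for k in range(1, (n - 1) // 2 + 1):
    cols += [np.sin(k * w * t), np.cos(k * w * t)]
A_trig = np.column_stack(cols)

def fit(A, y):
    """Return the LSQ coefficients and the residual vector y - A x."""
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x, y - A @ x

x_poly, r_poly = fit(A_poly, y)
x_trig, r_trig = fit(A_trig, y)
print("polynomial    residual norm:", np.linalg.norm(r_poly))
print("trigonometric residual norm:", np.linalg.norm(r_trig))
```

Comparing residual norms is only one crude way to judge the two models; plotting both fits against the data, as in Figure 1.2.2, is at least as informative.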
Figure 1.2.1: The least squares fit (solid line) to the measured NMR data
(dots) from Figure 1.1.1 in Example 1.
Figure 1.2.2: Two least squares fits (both of order $n = 9$) to the measured
NO data from Example 2, using a polynomial (left) and trigonometric
functions (right).
Figure 1.2.3: Histograms of the computed values of $x_2$ for the modified
NMR data in which the first 10 data points have larger errors. The left
plot shows results from solving the un-weighted LSQ problem, while the
right plot shows the results when weights $w_i = \varsigma_i^{-1}$ are included in the LSQ
problem.
Example 4. This example illustrates the importance of using weights when
computing the fit. We use again the NMR data from Example 1, except
this time we add larger Gaussian noise to the first 10 data points, with
standard deviation 0.5. Thus we have $\varsigma_i = 0.5$ for $i = 1, 2, \dots, 10$ (the
first 10 data with larger errors) and $\varsigma_i = 0.1$ for $i = 11, 12, \dots, 50$ (the
remaining data with smaller errors). The corresponding weights $w_i = \varsigma_i^{-1}$
are therefore $2, 2, \dots, 2, 10, 10, \dots, 10$. We solve the data fitting problem
with and without weights for 10,000 instances of the noise. To evaluate the
results, we consider how well we estimate the second parameter $x_2$ whose
exact value is 2.04. The results are shown in Figure 1.2.3 in the form of
histograms of the computed values of $x_2$. Clearly, the weighted fit gives more
robust results because it is less influenced by the data with large errors.
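A small Monte Carlo experiment along these lines can be sketched as follows. The model, exact parameters, standard deviations and weights follow Examples 3 and 4; the sampling times, the seed and the reduced number of trials (2,000 instead of 10,000) are assumptions chosen to keep the sketch short and fast.

```python
import numpy as np

lam1, lam2 = 27.0, 8.0
x_exact = np.array([1.27, 2.04, 0.3])
m = 50
t = np.linspace(0.0, 0.4, m)            # assumed abscissas (not given in the text)
A = np.column_stack([np.exp(-lam1 * t), np.exp(-lam2 * t), np.ones(m)])

sigma = np.full(m, 0.1)
sigma[:10] = 0.5                        # the first 10 data points are noisier
w = 1.0 / sigma                         # weights 2, ..., 2, 10, ..., 10

n_trials = 2000                         # the text uses 10,000 instances
rng = np.random.default_rng(1)          # arbitrary seed

# Each row of Y is one noisy realization of the data.
Y = A @ x_exact + sigma * rng.standard_normal((n_trials, m))

# lstsq accepts several right-hand sides, stored as columns.
X_unw, *_ = np.linalg.lstsq(A, Y.T, rcond=None)
X_wgt, *_ = np.linalg.lstsq(w[:, None] * A, (w * Y).T, rcond=None)

x2_unw, x2_wgt = X_unw[1], X_wgt[1]     # estimates of the second parameter
print("un-weighted: mean %.3f, std %.3f" % (x2_unw.mean(), x2_unw.std()))
print("weighted:    mean %.3f, std %.3f" % (x2_wgt.mean(), x2_wgt.std()))
```

One would expect the weighted estimates of $x_2$ to show the smaller spread, in line with the histograms in Figure 1.2.3.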
1.3 Maximum likelihood estimation
At first glance the problems of interpolation and data fitting seem to resemble
each other. In both cases, we use approximation theory to select
a good model (through the model basis functions) for the given data that
results in small approximation errors (in the case of data fitting, given by
the term $\Gamma(t) - M(x, t)$ for $t \in [t_1, t_m]$). The main difference between the
two problems comes from the presence of data errors and the way we deal
with these errors.
In data fitting we deliberately avoid interpolating the data and instead
settle for fewer degrees of freedom in the model, in order to reduce the model's
sensitivity to errors. Also, it is clear that the data noise plays an important
role in data fitting problems, and we should use concepts and tools from
statistics to deal with it.
The classical statistical motivation for the least squares fit is based
on the maximum likelihood principle. Our presentation follows [13]. We
assume that the data are given by (1.2.1), that the errors $e_i$ are unbiased
and uncorrelated, and that each error $e_i$ has a Gaussian distribution with
standard deviation $\varsigma_i$, i.e.,
$$ y_i = \Gamma(t_i) + e_i, \qquad e_i \sim \mathcal{N}(0, \varsigma_i^2). $$
Here, $\mathcal{N}(0, \varsigma_i^2)$ denotes the normal (Gaussian) distribution with zero mean
and standard deviation $\varsigma_i$. Gaussian errors arise, e.g., from the measurement
process or the measuring devices, and they are also good models of
composite errors that arise when several sources contribute to the noise.
Then the probability $P_i$ for making the observation $y_i$ is given by
$$ P_i = \frac{1}{\varsigma_i \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{y_i - \Gamma(t_i)}{\varsigma_i} \right)^2} = \frac{1}{\varsigma_i \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{e_i}{\varsigma_i} \right)^2}. $$
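The probability $P$ for making the observed set of measurements $y_1, y_2, \ldots, y_m$
is the product of these probabilities:
$$ P = P_1 P_2 \cdots P_m = \prod_{i=1}^{m} \frac{1}{\varsigma_i \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{e_i}{\varsigma_i} \right)^2} = K \prod_{i=1}^{m} e^{-\frac{1}{2} \left( \frac{e_i}{\varsigma_i} \right)^2}, $$
where the factor $K = \prod_{i=1}^{m} \left( \varsigma_i \sqrt{2\pi} \right)^{-1}$ is independent of the pure-data
function.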
Now assume that the pure-data function $\Gamma(t)$ is identical to the fitting
model $M(x, t)$, for a specific but unknown parameter vector $x^*$. The
probability for the given data $y_1, y_2, \ldots, y_m$ to be produced by the model $M(x, t)$
for an arbitrary parameter vector $x$ is given by
$$ P_x = K \prod_{i=1}^{m} e^{-\frac{1}{2} \left( \frac{y_i - M(x, t_i)}{\varsigma_i} \right)^2} = K \, e^{-\frac{1}{2} \sum_{i=1}^{m} \left( \frac{y_i - M(x, t_i)}{\varsigma_i} \right)^2}. \tag{1.3.1} $$
The method of maximum likelihood applied to data fitting consists of
making the assumption that the given data are more likely to have come from
our model $M(x, t)$ with specific but unknown parameters $x^*$ than any
other model of this form (i.e., with any other parameter vector $x$). The best
estimate of the model is therefore the one that maximizes the probability
$P_x$ given above, as a function of $x$. Clearly, $P_x$ is maximized by
minimizing the sum in the exponent, i.e., the weighted sum-of-squared residuals.
We have thus, under the assumption of Gaussian errors, derived the weighted
least squares data fitting problem (1.2.6). In practice, we use the same principle
when there is a discrepancy between the pure-data function and the fitting
model.
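As a minimal numerical sketch of this equivalence (not part of the original presentation), the following Python fragment fits a straight-line model $M(x, t) = x_1 + x_2 t$ by weighted least squares and evaluates the negative logarithm of $P_x$ from (1.3.1). The data values, the standard deviations and all function names are illustrative assumptions; perturbing the computed parameters in any direction should only increase the negative log-likelihood.

```python
import math

def weighted_line_fit(t, y, sd):
    """Minimize sum(((y_i - x1 - x2*t_i) / sd_i)**2) via the 2x2 normal equations."""
    w = [1.0 / s ** 2 for s in sd]            # squared weights 1 / sd_i^2
    s0 = sum(w)
    s1 = sum(wi * ti for wi, ti in zip(w, t))
    s2 = sum(wi * ti * ti for wi, ti in zip(w, t))
    b0 = sum(wi * yi for wi, yi in zip(w, y))
    b1 = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))
    det = s0 * s2 - s1 * s1
    return ((s2 * b0 - s1 * b1) / det, (s0 * b1 - s1 * b0) / det)

def neg_log_likelihood(x, t, y, sd):
    """-log(P_x) for independent Gaussian errors with standard deviations sd_i."""
    total = 0.0
    for ti, yi, si in zip(t, y, sd):
        z = (yi - (x[0] + x[1] * ti)) / si
        total += 0.5 * z * z + math.log(si * math.sqrt(2.0 * math.pi))
    return total

# Hypothetical data: three noisy points (sd = 0.5) and three accurate ones (sd = 0.1).
t = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.4, 2.6, 5.3, 7.1, 8.9, 11.0]
sd = [0.5, 0.5, 0.5, 0.1, 0.1, 0.1]

x_wls = weighted_line_fit(t, y, sd)
best = neg_log_likelihood(x_wls, t, y, sd)

# Negative log-likelihood at eight perturbed parameter vectors around x_wls.
worse = [neg_log_likelihood((x_wls[0] + d1, x_wls[1] + d2), t, y, sd)
         for d1 in (-0.1, 0.0, 0.1) for d2 in (-0.1, 0.0, 0.1)
         if (d1, d2) != (0.0, 0.0)]
print(x_wls, best, min(worse))
```

The term $\log(\varsigma_i \sqrt{2\pi})$ does not depend on $x$, which is why it plays no role in the minimization.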
The same approach can be used to derive the maximum likelihood fit
for other types of errors. As an important example, we consider errors
that follow a Laplace distribution, for which the probability of making the
observation $y_i$ is given by
$$ P_i = \frac{1}{2 \varsigma_i} \, e^{-\frac{|y_i - \Gamma(t_i)|}{\varsigma_i}} = \frac{1}{2 \varsigma_i} \, e^{-\frac{|e_i|}{\varsigma_i}}, $$
where $\varsigma_i$ now acts as the scale parameter of the distribution (the standard
deviation of this density is $\sqrt{2}\,\varsigma_i$). The Laplace density function
decays more slowly than the Gaussian density function, and thus the Laplace
distribution describes measurement situations where we are more likely to
have large errors than in the case of the Gaussian distribution.
Following once again the maximum likelihood approach we arrive at the
problem of maximizing the function
$$ P_x = \hat{K} \prod_{i=1}^{m} e^{-\frac{|y_i - M(x, t_i)|}{\varsigma_i}} = \hat{K} \, e^{-\sum_{i=1}^{m} \left| \frac{y_i - M(x, t_i)}{\varsigma_i} \right|} $$
with $\hat{K} = \prod_{i=1}^{m} (2 \varsigma_i)^{-1}$. Hence, for these errors we should minimize the
sum-of-absolute-values of the weighted residuals. This is the linear 1-norm
minimization problem that we mentioned in (1.2.4).
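The step from the Laplace likelihood to the 1-norm criterion can be checked numerically. The sketch below, with made-up data and hypothetical function names, evaluates $-\log P_x$ for a straight-line model and compares it with the weighted sum of absolute residuals; the two should differ only by the constant $\sum_i \log(2\varsigma_i)$, whatever parameter vector is tried.

```python
import math

def laplace_neg_log_likelihood(x, t, y, scale):
    """-log(P_x) for independent Laplace errors with scale parameters scale_i."""
    total = 0.0
    for ti, yi, si in zip(t, y, scale):
        total += abs(yi - (x[0] + x[1] * ti)) / si + math.log(2.0 * si)
    return total

def weighted_abs_sum(x, t, y, scale):
    """Sum of absolute values of the weighted residuals (the 1-norm criterion)."""
    return sum(abs(yi - (x[0] + x[1] * ti)) / si
               for ti, yi, si in zip(t, y, scale))

# Hypothetical data and scale parameters, chosen only for illustration.
t = [0.0, 1.0, 2.0, 3.0]
y = [0.9, 3.2, 4.8, 7.5]
scale = [0.2, 0.2, 0.4, 0.4]

trial_parameters = [(0.0, 0.0), (1.0, 2.0), (-3.0, 0.5)]
offsets = [laplace_neg_log_likelihood(x, t, y, scale) - weighted_abs_sum(x, t, y, scale)
           for x in trial_parameters]
constant = sum(math.log(2.0 * s) for s in scale)
print(offsets, constant)
```

Because the offset is the same for every $x$, maximizing $P_x$ and minimizing the weighted 1-norm of the residuals select the same parameters.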
While the principle of maximum likelihood is universally applicable, it
can lead to complicated or intractable computational problems. As an
example, consider the case of Poisson data, where $y_i$ comes from a Poisson
distribution, with expected value $\Gamma(t_i)$ and standard deviation $\Gamma(t_i)^{1/2}$.
Poisson data typically show up in counting measurements, such as the
photon counts underlying optical detectors. Then the probability for making
the observation $y_i$ is
$$ P_i = \frac{\Gamma(t_i)^{y_i}}{y_i!} \, e^{-\Gamma(t_i)}, $$
and hence, we should maximize the probability
$$ P_x = \prod_{i=1}^{m} \frac{M(x, t_i)^{y_i}}{y_i!} \, e^{-M(x, t_i)} = \prod_{i=1}^{m} \frac{1}{y_i!} \; \prod_{i=1}^{m} M(x, t_i)^{y_i} \; e^{-\sum_{i=1}^{m} M(x, t_i)}. $$
Unfortunately, it is computationally demanding to maximize this quantity
with respect to $x$, and instead one usually makes the assumption that the
Poisson errors for each data value $y_i$ are nearly Gaussian, with standard deviation
$\varsigma_i = \Gamma(t_i)^{1/2} \approx y_i^{1/2}$ (see, e.g., pp. 342-343 in [146] for a justification of this
assumption). Hence, the above weighted least squares approach derived for
Gaussian noise, with weights $w_i = y_i^{-1/2}$, will give a good approximation
to the maximum likelihood fit for Poisson data.
The Gauss and Laplace errors discussed above are used to model
additive errors in the data. We finish this section with a brief look at relative
errors, which arise when the size of the error $e_i$ is, perhaps to a good
approximation, proportional to the magnitude of the pure data $\Gamma(t_i)$. A
straightforward way to model such errors, which fits into the above
framework, is to assume that the data $y_i$ can be described by a normal
distribution with mean $\Gamma(t_i)$ and standard deviation $\varsigma_i = |\Gamma(t_i)| \, \varsigma$. This relative
Gaussian errors model can also be written as
$$ y_i = \Gamma(t_i) \, (1 + e_i), \qquad e_i \sim \mathcal{N}(0, \varsigma^2). \tag{1.3.2} $$
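Then the probability for making the observation $y_i$ is
$$ P_i = \frac{1}{|\Gamma(t_i)| \, \varsigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{y_i - \Gamma(t_i)}{\Gamma(t_i) \, \varsigma} \right)^2}. $$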
Using the maximum likelihood principle again and substituting the
measured data $y_i$ for the unknown pure data $\Gamma(t_i)$, we arrive at the following
weighted least squares problem:
$$ \min_x \sum_{i=1}^{m} \left( \frac{y_i - M(x, t_i)}{y_i} \right)^2 = \min_x \| W (y - A x) \|_2^2, \tag{1.3.3} $$
with weights $w_i = y_i^{-1}$.
An alternative formulation, which is suited for problems with positive
data $\Gamma(t_i) > 0$ and $y_i > 0$, is to assume that $y_i$ can be described by a
log-normal distribution, for which $\log y_i$ has a normal distribution, with mean
$\log \Gamma(t_i)$ and standard deviation $\varsigma$:
$$ y_i = \Gamma(t_i) \, e^{\epsilon_i} \quad \Leftrightarrow \quad \log y_i = \log \Gamma(t_i) + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \varsigma^2). $$
In this case we again arrive at a sum-of-squares minimization problem,
but now involving the difference of the logarithms of the data $y_i$ and the
model $M(x, t_i)$. Even when $M(x, t)$ is a linear model, this is not a linear
problem in $x$.
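In the above log-normal model with standard deviation $\varsigma$, the
probability $P_i$ for making the observation $y_i$ is given by
$$ P_i = \frac{1}{y_i \, \varsigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{\log y_i - \log \Gamma(t_i)}{\varsigma} \right)^2} = \frac{1}{y_i \, \varsigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{\log \hat{y}_i}{\varsigma} \right)^2}, $$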
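with $\hat{y}_i = y_i / \Gamma(t_i)$. Now let us assume that $\varsigma$ is small compared to $\Gamma(t_i)$,
such that $y_i \approx \Gamma(t_i)$ and $\hat{y}_i \approx 1$. Then, we can write $y_i = \Gamma(t_i) \, (1 + e_i) \Leftrightarrow
\hat{y}_i = 1 + e_i$, with $e_i \ll 1$ and $\log \hat{y}_i = e_i + O(e_i^2)$. Hence, the exponential
factor in $P_i$ becomes
$$ e^{-\frac{1}{2} \left( \frac{\log \hat{y}_i}{\varsigma} \right)^2} = e^{-\frac{1}{2} \left( \frac{e_i + O(e_i^2)}{\varsigma} \right)^2} = e^{-\frac{1}{2} \left( \frac{e_i^2}{\varsigma^2} + \frac{O(e_i^3)}{\varsigma^2} \right)} = e^{-\frac{1}{2} \left( \frac{e_i}{\varsigma} \right)^2} e^{-\frac{O(e_i^3)}{\varsigma^2}} = e^{-\frac{1}{2} \left( \frac{e_i}{\varsigma} \right)^2} O(1), $$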
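while the other factor in $P_i$ becomes
$$ \frac{1}{y_i \, \varsigma \sqrt{2\pi}} = \frac{1}{\Gamma(t_i) \, (1 + e_i) \, \varsigma \sqrt{2\pi}} = \frac{1 + O(e_i)}{\Gamma(t_i) \, \varsigma \sqrt{2\pi}}. $$
Hence, as long as $\varsigma \ll \Gamma(t_i)$ we have the approximation
$$ P_i \approx \frac{1}{\Gamma(t_i) \, \varsigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{e_i}{\varsigma} \right)^2} = \frac{1}{\Gamma(t_i) \, \varsigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{y_i - \Gamma(t_i)}{\Gamma(t_i) \, \varsigma} \right)^2}, $$
which is the probability introduced above for the case of relative Gaussian
errors. Hence, for small noise levels $\varsigma \ll |\Gamma(t_i)|$, the two different models
for introducing relative errors in the data are practically identical, leading
to the same weighted LSQ problem (1.3.3).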
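To get a feeling for how close the two relative-error models are, one can compare their densities numerically. The following sketch is only an illustration under assumed values (pure data value 2.0, three noise levels, hypothetical function names); it measures the largest relative difference between the log-normal density and the relative Gaussian density over the range $\Gamma(t_i)\,(1 \pm 3\varsigma)$, which should shrink as $\varsigma$ decreases.

```python
import math

def relative_gaussian_pdf(y, gamma, s):
    """Density of y ~ N(gamma, (|gamma|*s)^2), the relative Gaussian model (1.3.2)."""
    sd = abs(gamma) * s
    return math.exp(-0.5 * ((y - gamma) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def lognormal_pdf(y, gamma, s):
    """Density of y when log y ~ N(log gamma, s^2); requires y > 0 and gamma > 0."""
    z = (math.log(y) - math.log(gamma)) / s
    return math.exp(-0.5 * z * z) / (y * s * math.sqrt(2.0 * math.pi))

def max_relative_gap(gamma, s, n=201):
    """Largest relative difference of the two densities for y in gamma*(1 +/- 3s)."""
    gap = 0.0
    for k in range(n):
        e = -3.0 * s + 6.0 * s * k / (n - 1)   # relative error between -3s and +3s
        y = gamma * (1.0 + e)
        p_rel = relative_gaussian_pdf(y, gamma, s)
        p_log = lognormal_pdf(y, gamma, s)
        gap = max(gap, abs(p_log - p_rel) / p_rel)
    return gap

# Assumed pure-data value and noise levels, for illustration only.
gaps = {s: max_relative_gap(2.0, s) for s in (0.1, 0.01, 0.001)}
print(gaps)
```

The remaining gap comes from the factors $e^{-O(e_i^3)/\varsigma^2}$ and $1 + O(e_i)$ in the derivation above, both of which approach 1 as the noise level goes to zero.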
1.4 The residuals and their properties
This section focuses on the residuals $r_i = y_i - M(x, t_i)$ for a given fit
and how they can be used to analyze the quality of the fit $M(x, t)$ that
we have computed. Throughout the section we assume that the residuals
behave like a time series, i.e., they have a natural ordering $r_1, r_2, \ldots, r_m$
associated with the ordering $t_1 < t_2 < \cdots < t_m$ of the samples of the
independent variable $t$.
As we already saw in Equation (1.2.5), each residual $r_i$ consists of two
components: the data error $e_i$ and the approximation error $\Gamma(t_i) - M(x, t_i)$.
For a good fitting model, the approximation error should be of the same
size as the data errors (or smaller). At the same time, we do not want
the residuals to be too small, since then the model $M(x, t)$ may overfit the
data: i.e., not only will it capture the behavior of the pure-data function
$\Gamma(t)$, but it will also adapt to the errors, which is undesirable.
In order to choose a good fitting model $M(x, t)$ we must be able to
analyze the residuals $r_i$ and determine whether the model captures the
pure-data function well enough. We can say that this is achieved when the
approximation errors are smaller than the data errors, so that the residuals
are practically dominated by the data errors. In that case, some of the
statistical properties of the errors will carry over to the residuals. For
example, if the noise is white (cf. Section 1.1), then we will expect that the
residuals associated with a satisfactory fit show properties similar to white
noise.
If, on the other hand, the fitting model does not capture the main
behavior of the pure-data function, then we can expect that the residuals
are dominated by the approximation errors. When this is the case, the
residuals will not have the characteristics of noise, but instead they will
tend to behave as a sampled signal, i.e., the residuals will show strong
local correlations. We will use the term trend to characterize a long-term
movement in the residuals when considered as a time series.
Below we will discuss some statistical tests that can be used to determine
whether the residuals behave like noise or include trends. These and many
other tests are often used in time series analysis and signal processing.
Throughout this section we make the following assumptions about the data
errors $e_i$:
• They are random variables with mean zero and identical variance,
  i.e., $\mathcal{E}(e_i) = 0$ and $\mathcal{E}(e_i^2) = \varsigma^2$ for $i = 1, 2, \ldots, m$.
• They belong to a normal distribution, $e_i \sim \mathcal{N}(0, \varsigma^2)$.
We will describe three tests with three different properties:
• Randomness test: check for randomness of the signs of the residuals.
• Autocorrelation test: check whether the residuals are uncorrelated.
• White noise test: check for randomness of the residuals.
The use of the tools introduced here is illustrated below and in Chapter 11
on applications.
Test for random signs
Perhaps the simplest analysis of the residuals is based on the statistical
question: can we consider the signs of the residuals to be random? (This
will often be the case when $e_i$ is white noise with zero mean.) We can
answer this question by means of a run test from time series analysis; see,
e.g., Section 10.4 in [134].
Given a sequence of two symbols (in our case, + and − for positive
and negative residuals $r_i$), a run is defined as a succession of identical
symbols surrounded by different symbols. For example, the sequence
+ + + − − − − − − − + + − − + + + has $m = 17$ elements, $n_+ = 8$ pluses,
$n_- = 9$ minuses and $u = 5$ runs: + + +, − − − − − − −, + +, − − and
+ + +. The distribution of runs $u$ (not the residuals!) can be approximated
by a normal distribution with mean $\mu_u$ and standard deviation $\varsigma_u$ given by
$$ \mu_u = \frac{2 \, n_+ n_-}{m} + 1, \qquad \varsigma_u^2 = \frac{(\mu_u - 1)(\mu_u - 2)}{m - 1}. \tag{1.4.1} $$
With a 5% significance level we will accept the sign sequence as random if
$$ z_\pm = \frac{|u - \mu_u|}{\varsigma_u} < 1.96 \tag{1.4.2} $$
(other values of the threshold, for other significance levels, can be found in
any book on statistics). If the signs of the residuals are not random, then
it is likely that trends are present in the residuals. In the above example
with 5 runs we have $z_\pm = 2.25$, and according to (1.4.2) the sequence of
signs is not random.
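The run test is easy to carry out by hand or in a few lines of code. The sketch below uses hypothetical function names and treats a zero residual as negative (an assumption, since the text does not discuss ties); it applies (1.4.1) and (1.4.2) to the 17-symbol example sequence.

```python
import math

def residual_signs(residuals):
    """Map residuals to '+' / '-' symbols; zeros are counted as '-' (an assumption)."""
    return "".join("+" if r > 0 else "-" for r in residuals)

def run_test(signs):
    """Return (u, mu_u, sd_u, z) for a string of '+' and '-' symbols, cf. (1.4.1)-(1.4.2)."""
    m = len(signs)
    n_plus = signs.count("+")
    n_minus = signs.count("-")
    # A new run starts at every position where the symbol changes.
    u = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mu = 2.0 * n_plus * n_minus / m + 1.0
    sd = math.sqrt((mu - 1.0) * (mu - 2.0) / (m - 1.0))
    z = abs(u - mu) / sd
    return u, mu, sd, z

# The example sequence from the text: m = 17, 8 pluses, 9 minuses, 5 runs.
signs = "+++-------++--+++"
u, mu, sd, z = run_test(signs)
print(u, round(mu, 3), round(sd, 3), round(z, 2))
print("signs look random" if z < 1.96 else "signs are not random")
```

The formulas require both symbols to occur and $m > 1$; a sequence with only one sign gives $\varsigma_u = 0$ and should be treated separately.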
Test for correlation
Another question we can ask is whether short sequences of residuals are
correlated, which is a clear indication of trends. The autocorrelation of the
residuals is a statistical tool for analyzing this. We define the
autocorrelation $\varrho$ of the residuals, as well as the trend threshold $T_\varrho$, as the quantities
$$ \varrho = \sum_{i=1}^{m-1} r_i \, r_{i+1}, \qquad T_\varrho = \frac{1}{\sqrt{m-1}} \sum_{i=1}^{m} r_i^2. \tag{1.4.3} $$
Since $\varrho$ is the sum of products of neighboring residuals, it is in fact the
unit-lag autocorrelation. Autocorrelations with larger lags, or distances in
the index, can also be considered. Then, we say that trends are likely to be
present in the residuals if the absolute value of the autocorrelation exceeds
the trend threshold, i.e., if $|\varrho| > T_\varrho$. Similar techniques, based on shorter
sequences of residuals, are used for placing knots in connection with spline
fitting; see Chapter 6 in [125].
We note that in some presentations, the mean of the residuals is
subtracted before computing $\varrho$ and $T_\varrho$. In our applications this should not be
necessary, as we assume that the errors have zero mean.
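A direct transcription of (1.4.3) might look as follows. The two residual sequences are artificial and chosen only so that the outcome can be checked by hand: a linear trend, and a pattern whose neighboring products nearly cancel. The function name is hypothetical.

```python
import math

def autocorrelation_test(r):
    """Return (rho, T_rho, trend_flag) for residuals r, following (1.4.3)."""
    m = len(r)
    rho = sum(r[i] * r[i + 1] for i in range(m - 1))        # unit-lag autocorrelation
    threshold = sum(ri * ri for ri in r) / math.sqrt(m - 1)  # trend threshold T_rho
    return rho, threshold, abs(rho) > threshold

# Artificial residuals with an obvious linear trend: -3.5, -2.5, ..., 3.5.
trend = [i - 3.5 for i in range(8)]
# Artificial residuals whose neighboring products almost cancel.
no_trend = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]

print(autocorrelation_test(trend))
print(autocorrelation_test(no_trend))
```

For the trend sequence $\varrho = 26.25$ exceeds $T_\varrho = 42/\sqrt{7} \approx 15.9$, while for the second sequence $\varrho = 1$ stays below $T_\varrho = 8/\sqrt{7} \approx 3.0$.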
Test for white noise
Yet another question we can ask is whether the sequence of residuals
behaves like white noise, which can be answered by means of the normalized
cumulative periodogram. The underlying idea is that white noise has a flat
spectrum, i.e., all frequency components in the discrete Fourier spectrum
have the same probability; hence, we must determine whether this is the
case. Let the complex numbers $\hat{r}_k$ denote the components of the discrete
Fourier transform of the residuals, i.e.,
$$ \hat{r}_k = \sum_{i=1}^{m} r_i \, e^{-\hat{\imath} \, (2\pi (i-1)(k-1)/m)}, \qquad k = 1, \ldots, m, $$
where $\hat{\imath}$ denotes the imaginary unit. Our indices are in the range $1, \ldots, m$
and thus shifted by 1 relative to the range $0, \ldots, m-1$ that is common in
signal processing. Note that $\hat{r}_1$ is the sum of the residuals (called the DC
component in signal processing), while $\hat{r}_{q+1}$ with $q = \lfloor m/2 \rfloor$ is the component of
the highest frequency. The squared absolute values $|\hat{r}_1|^2, |\hat{r}_2|^2, \ldots, |\hat{r}_{q+1}|^2$
are known as the periodogram (in statistics) or the power spectrum (in
signal processing). Then the normalized cumulative periodogram consists
of the $q$ numbers
$$ c_i = \frac{|\hat{r}_2|^2 + |\hat{r}_3|^2 + \cdots + |\hat{r}_{i+1}|^2}{|\hat{r}_2|^2 + |\hat{r}_3|^2 + \cdots + |\hat{r}_{q+1}|^2}, \qquad i = 1, \ldots, q, \quad q = \lfloor m/2 \rfloor, $$
which form an increasing sequence from 0 to 1. Note that the sums exclude
the first term in the periodogram.
Figure 1.4.1: Two artificially created data sets used in Example 5. The
second data set is inspired by the data on p. 60 in [125].
If the residuals are white noise, then the expected values of the
normalized cumulative periodogram lie on a straight line from (0, 0) to (q, 1). Any
realization of white noise residuals should produce a normalized
cumulative periodogram close to a straight line. For example, with the common
5% significance level from statistics, the numbers $c_i$ should lie within the
Kolmogorov-Smirnov limit $\pm 1.35/q$ of the straight line. If the maximum
deviation $\max_i |c_i - i/q|$ is smaller than this limit, then we recognize the
residual as white noise.
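The white noise test can be sketched with a plain discrete Fourier transform. The code below is an illustration only: it uses a direct $O(m^2)$ sum instead of an FFT, 0-based indices instead of the 1-based indices of the text, hypothetical function names, and an artificial low-frequency residual vector with $m = 43$ so that $q = 21$ as in Example 5.

```python
import cmath
import math

def normalized_cumulative_periodogram(r):
    """Return [c_1, ..., c_q] with q = floor(m/2); the DC term is excluded."""
    m = len(r)
    q = m // 2
    # power[k] equals |r_hat_{k+1}|^2 in the 1-based notation of the text.
    power = []
    for k in range(q + 1):
        rk = sum(r[i] * cmath.exp(-2j * math.pi * i * k / m) for i in range(m))
        power.append(abs(rk) ** 2)
    total = sum(power[1:])            # must be nonzero (r not constant)
    c, acc = [], 0.0
    for k in range(1, q + 1):
        acc += power[k]
        c.append(acc / total)
    return c

def white_noise_test(r):
    """Return (max deviation from the straight line, limit 1.35/q, accepted flag)."""
    c = normalized_cumulative_periodogram(r)
    q = len(c)
    deviation = max(abs(ci - (i + 1) / q) for i, ci in enumerate(c))
    limit = 1.35 / q
    return deviation, limit, deviation < limit

# Artificial residuals dominated by a single low-frequency component.
m = 43
slow = [math.sin(2.0 * math.pi * i / m) for i in range(m)]
deviation, limit, accepted = white_noise_test(slow)
print(round(deviation, 3), round(limit, 4), accepted)
```

For these residuals almost all power sits in the lowest nonzero frequency, so the cumulative periodogram jumps to 1 immediately and the test rejects the white noise hypothesis.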
Example 5. Residual analysis. We finish this section with an example
that illustrates the above analysis techniques. We use the two different data
sets shown in Figure 1.4.1; both sets are artificially generated (in fact, the
second set is the first set with the $t_i$ and $y_i$ values interchanged). In both
examples we have $m = 43$ data points, and in the test for white noise
we have $q = 21$ and $1.35/q = 0.0643$. The fitting model $M(x, t)$ is the
polynomial of degree $p = n - 1$.
For fitting orders $n = 2, 3, \ldots, 9$, Figure 1.4.2 shows the residuals and
the normalized cumulative periodograms, together with $z_\pm$ (1.4.2), the ratios
$\varrho / T_\varrho$ from (1.4.3) and the maximum distance of the normalized cumulative
periodogram to the straight line. A visual inspection of the residuals in the
left part of the figure indicates that for small values of $n$ the polynomial
model does not capture all the information in the data, as there are obvious
trends in the residuals, while for $n \geq 5$ the residuals appear to be more
random. The test for random signs confirms this: for $n \geq 5$ the numbers
$z_\pm$ are less than 1.96, indicating that the signs of the residuals could be
considered random. The autocorrelation analysis leads to approximately the
same conclusion: for $n = 6$ and 7 the absolute value of the autocorrelation $\varrho$
is smaller than the threshold $T_\varrho$.
Figure 1.4.2: Residual analysis for polynomial fits to artificial data set 1.
Figure 1.4.3: Residual analysis for polynomial fits to artificial data set 2.
The normalized cumulative periodograms are shown in the right part
of Figure 1.4.2. For small values of $n$, the curves rise fast toward a flat part,
showing that the residuals are dominated by low-frequency components. The
closest we get to a straight line is for $n = 6$, but the maximum distance 0.134
to the straight line is still too large to clearly signify that the residuals are
white noise. The conclusion from these three tests is nevertheless that $n = 6$
is a good choice of the order of the fit.
Figure 1.4.3 presents the residual analysis for the second data set. A
visual inspection of the residuals clearly shows that the polynomial model
is not well suited for this data set: the residuals have a slowly varying
trend for all values of $n$. This is confirmed by the normalized cumulative
periodograms that show that the residuals are dominated by low-frequency
components. The random-sign test and the autocorrelation analysis also
give a clear indication of trends in the residuals.
1.5 Robust regression
The least squares fit introduced in this chapter is convenient and useful in
a large number of practical applications, but it is not always the right
choice for a data fitting problem. In fact, we have already seen in Section
1.3 that the least squares fit is closely connected to the assumption about
Gaussian errors in the data. There we also saw that other types of noise,
in the framework of maximum likelihood estimation, lead to other criteria
for a best fit, such as the sum-of-absolute-values of the residuals (the
1-norm) associated with the Laplace distribution for the noise. The more
dominating the tails of the probability density function for the noise, the
more important it is to use another criterion than the least squares fit.
Another situation where the least squares fit is not appropriate is when
the data contain outliers, i.e., observations with exceptionally large errors
and residuals. We can say that an outlier is a data point $(t_i, y_i)$ whose value
$y_i$ is unusual compared to its predicted value (based on all the reliable data
points). Such outliers may come from different sources:
• The data errors may come from more than one statistical distribution.
  This could arise, e.g., in an astronomical CCD camera, where we have
  Poisson noise (or photon noise) from the incoming light, Gaussian
  noise from the electronic circuits (amplifier and A/D-converter), and
  occasional large errors from cosmic radiation (so-called cosmic ray
  events).
• The outliers may be due to data recording errors arising, e.g., when
  the measurement device has a malfunction or the person recording
  the data makes a blunder and enters a wrong number.
A manual inspection can sometimes be used to delete blunders from the
data set, but it may not always be obvious which data are blunders or
outliers. Therefore we prefer to have a mathematical formulation of the
data fitting problem that handles outliers in such a way that all data are
used, and yet the outliers do not have a deteriorating influence on the fit.
This is the goal of robust regression. Quoting from [82] we say that an
estimator or statistical procedure is robust if it provides useful information
even if some of the assumptions used to justify the estimation method are
not applicable.
Example 6. Mean and median. Assume we are given $n - 1$ samples
$z_1, \ldots, z_{n-1}$ from the same distribution and a single sample $z_n$ that is an
outlier. Clearly, the arithmetic mean $\frac{1}{n}(z_1 + z_2 + \cdots + z_n)$ is not a good
estimate of the expected value because the outlier contributes with the same
weight as all the other data points. On the other hand, the median gives a
robust estimate of the expected value since it is insensitive to a few outliers;
we recall that if the data are sorted then the median is $z_{(n+1)/2}$ if $n$ is odd,
and $\frac{1}{2}(z_{n/2} + z_{n/2+1})$ if $n$ is even.
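The effect is easy to reproduce with Python's standard `statistics` module. The sample values below are invented for illustration: seven values scattered around 1 and one gross outlier.

```python
import statistics

# Seven made-up samples from the same distribution (mean exactly 1.0) ...
samples = [1.02, 0.97, 1.01, 0.99, 1.03, 0.98, 1.00]
# ... plus a single outlier as the last sample.
with_outlier = samples + [25.0]

mean_value = statistics.mean(with_outlier)      # pulled far away by the outlier
median_value = statistics.median(with_outlier)  # average of the two middle values
print(mean_value, median_value)
```

With $n = 8$ the median is the average of the fourth and fifth sorted values, 1.00 and 1.01, while the mean is dragged to about 4.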
The most common method for robust data fitting (or robust regression,
as statisticians call it) is based on the principle of M-estimation introduced
by Huber [130], which can be considered as a generalization of maximum
likelihood estimation. Here we consider un-weighted problems only (the
extension to weighted problems is straightforward). The underlying idea
is to replace the sum of squared residuals in (1.2.3) with the sum of some
function of the residuals:
$$ \text{Robust fit:} \quad \min_x \sum_{i=1}^{m} \rho\bigl( r_i(x) \bigr) = \min_x \sum_{i=1}^{m} \rho\bigl( y_i - M(x, t_i) \bigr), \tag{1.5.1} $$
where the function $\rho$ defines the contribution of each residual to the function
to be minimized. In particular, we obtain the least squares fit when $\rho(r) = \frac{1}{2} r^2$.
The function $\rho$ must satisfy the following criteria:
1. Non-negativity: $\rho(r) \geq 0 \ \forall r$.
2. Zero only when the argument is zero: $\rho(r) = 0 \Leftrightarrow r = 0$.
3. Symmetry: $\rho(-r) = \rho(r)$.
4. Monotonicity: $\rho(r') \geq \rho(r)$ for $|r'| \geq |r|$.
Figure 1.5.1: Four functions $\rho(r)$ used in the robust data fitting problem.
All of them increase more slowly than $\frac{1}{2} r^2$ that defines the LSQ problem and thus
they lead to robust data fitting problems that are less sensitive to outliers.
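Some well-known examples of the function $\rho$ are (cf. [82, 171]):
$$ \text{Huber:} \quad \rho(r) = \begin{cases} \frac{1}{2} r^2, & |r| \leq \beta \\ \beta |r| - \frac{1}{2} \beta^2, & |r| > \beta \end{cases} \tag{1.5.2} $$
$$ \text{Talwar:} \quad \rho(r) = \begin{cases} \frac{1}{2} r^2, & |r| \leq \beta \\ \frac{1}{2} \beta^2, & |r| > \beta \end{cases} \tag{1.5.3} $$
$$ \text{Bisquare:} \quad \rho(r) = \beta^2 \log\left( \cosh\left( \frac{r}{\beta} \right) \right) \tag{1.5.4} $$
$$ \text{Logistic:} \quad \rho(r) = \beta^2 \left( \frac{|r|}{\beta} - \log\left( 1 + \frac{|r|}{\beta} \right) \right) \tag{1.5.5} $$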
Note that all four functions include a problem-dependent positive parameter
$\beta$ that is used to control the behavior of the function for large values of
$r$, corresponding to the outliers. Figure 1.5.1 shows these functions for the
case $\beta = 1$, and we see that all of them increase more slowly than the function
$\frac{1}{2} r^2$, which underlies the LSQ problem. This is precisely why they lead to a
robust data fitting problem whose solution is less sensitive to outliers than
the LSQ solution. The parameter $\beta$ should be chosen from our knowledge
of the standard deviation $\varsigma$ of the noise; if $\varsigma$ is not known, then it can be
estimated from the fit as we will discuss in Section 2.2.
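The four functions (1.5.2)-(1.5.5) translate directly into code. In the sketch below the function names follow the labels used in the text, the parameter name `beta` stands for $\beta$, and the grid of residual values is an arbitrary choice; the loop simply confirms on that grid that each function stays below $\frac{1}{2} r^2$.

```python
import math

def rho_lsq(r):
    """The least squares function 0.5 * r^2."""
    return 0.5 * r * r

def rho_huber(r, beta):
    """Huber function (1.5.2): quadratic for |r| <= beta, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= beta else beta * a - 0.5 * beta * beta

def rho_talwar(r, beta):
    """Talwar function (1.5.3): quadratic for |r| <= beta, constant beyond."""
    return 0.5 * r * r if abs(r) <= beta else 0.5 * beta * beta

def rho_bisquare(r, beta):
    """Function labelled Bisquare in (1.5.4); cosh overflows for very large |r|/beta."""
    return beta * beta * math.log(math.cosh(r / beta))

def rho_logistic(r, beta):
    """Function labelled Logistic in (1.5.5)."""
    a = abs(r) / beta
    return beta * beta * (a - math.log(1.0 + a))

beta = 1.0
grid = [k / 10.0 for k in range(-50, 51)]     # residual values -5.0, -4.9, ..., 5.0
for rho in (rho_huber, rho_talwar, rho_bisquare, rho_logistic):
    largest_excess = max(rho(r, beta) - rho_lsq(r) for r in grid)
    print(rho.__name__, rho(3.0, beta), largest_excess <= 1e-12)
```

For $\beta = 1$ and $r = 3$ the Huber value is $3 - \frac{1}{2} = 2.5$ and the Talwar value is $\frac{1}{2}$, compared with $4.5$ for the least squares function.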
It appears that the choice of function for a given problem relies mainly
on experience with the specific data for that problem. Still, the Huber
function has attained widespread use, perhaps due to the natural way it
distinguishes between large and small residuals:
• Small residuals satisfying $|r_i| \leq \beta$ are treated in the same way as in
  the LSQ fitting problem; if there are no outliers then we obtain the
  LSQ solution.
• Large residuals satisfying $|r_i| > \beta$ are essentially treated as $|r_i|$ and
  the robust fit is therefore not so sensitive to the corresponding data
  points.
Thus, robust regression is a compromise between excluding the outliers
entirely from the analysis and including all the data points and treating
all of them equally in the LSQ regression. The idea of robust regression
is to weight the observations differently based on how well behaved these
observations are. For an early use in seismic data processing see [48, 219].
Figure 1.5.2: The pure-data function $\Gamma(t)$ (thin line) and the data with
Gaussian noise (dots); the outlier $(t_{60}, y_{60}) = (3, 2.5)$ is outside the plot.
The fitting model $M(x, t)$ is a polynomial with $n = 9$. Left: the LSQ fit
and the corresponding residuals; this fit is dramatically influenced by the
outlier. Right: the robust Huber fit, using $\beta = 0.025$, together with the
residuals; this is a much better fit to the given data because it approximates
the pure-data function well.
Example 7. Robust data fitting with the Huber function. This
example illustrates that the Huber function gives a more robust fit than the
LSQ fit. The pure-data function $\Gamma(t)$ is given by
$$ \Gamma(t) = \sin\left( \pi \, e^{-t} \right), \qquad 0 \leq t \leq 5. $$
We use $m = 100$ data points with $t_i = 0.05\,i$, and we add Gaussian noise
with standard deviation $\varsigma = 0.05$. Then we change the 60th data point to
an outlier with $(t_{60}, y_{60}) = (3, 2.5)$; Figure 1.5.2 shows the function $\Gamma(t)$
and the noisy data. We note that the outlier is located outside the plot.
As fitting model $M(x, t)$ we use a polynomial with $n = 9$, and the left
part of Figure 1.5.2 shows the least squares fit with this model, together
with the corresponding residuals. Clearly, this fit is dramatically influenced
by the outlier, which is evident from the plot of the fit as well as from the
behavior of the residuals, which exhibit a strong positive trend in the range
$2 \leq t \leq 4$. This illustrates the inability of the LSQ fit to handle outliers in
a satisfactory way.
The right part of Figure 1.5.2 shows the robust Huber fit, with parameter
$\beta = 0.025$ (this parameter is chosen to reflect the noise level in the data).
The resulting fit is not influenced by the outlier, and the residuals do not
seem to exhibit any strong trend. This is a good illustration of robust
regression.
In Section 9.5 we describe numerical algorithms for computing the
solutions to robust data fitting problems.
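As a rough illustration of the idea, and without claiming that it is one of the algorithms of Section 9.5, the sketch below uses iteratively reweighted least squares (IRLS) to compute a Huber fit. To keep it self-contained it fits a straight line rather than the polynomial of Example 7; the data, the value $\beta = 0.1$, the fixed iteration count and the function names are all assumptions made for this example.

```python
def weighted_line_fit(t, y, w):
    """Minimize sum(w_i * (y_i - x1 - x2*t_i)**2) via the 2x2 normal equations."""
    s0 = sum(w)
    s1 = sum(wi * ti for wi, ti in zip(w, t))
    s2 = sum(wi * ti * ti for wi, ti in zip(w, t))
    b0 = sum(wi * yi for wi, yi in zip(w, y))
    b1 = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))
    det = s0 * s2 - s1 * s1
    return ((s2 * b0 - s1 * b1) / det, (s0 * b1 - s1 * b0) / det)

def huber_line_fit(t, y, beta, iterations=20):
    """IRLS for the Huber function: weight 1 if |r| <= beta, else beta / |r|."""
    x = weighted_line_fit(t, y, [1.0] * len(t))      # start from the plain LSQ fit
    for _ in range(iterations):
        r = [yi - (x[0] + x[1] * ti) for ti, yi in zip(t, y)]
        w = [1.0 if abs(ri) <= beta else beta / abs(ri) for ri in r]
        x = weighted_line_fit(t, y, w)
    return x

# Artificial data on the line y = 1 + 2 t, with one gross outlier at t = 5.
t = [float(i) for i in range(10)]
y = [1.0 + 2.0 * ti for ti in t]
y[5] += 10.0

x_lsq = weighted_line_fit(t, y, [1.0] * len(t))
x_hub = huber_line_fit(t, y, beta=0.1)
print("LSQ fit:  ", x_lsq)
print("Huber fit:", x_hub)
```

The LSQ intercept is pulled well away from 1 by the single outlier, whereas the Huber fit stays close to the underlying line; it does not recover it exactly, because the outlier still contributes a bounded pull proportional to $\beta$.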
Chapter 2
The Linear Least Squares Problem
This chapter covers some of the basic mathematical facts of the linear least
squares problem (LSQ), as well as some important additional statistical
results for data fitting. We introduce the two formulations of the least
squares problem: the linear system of normal equations and the
optimization problem form.
The computation of the LSQ solution via an optimization problem has
two aspects: simplification of the problem structure and actual
minimization. In this and in the next chapter we present a number of matrix
factorizations, for both full-rank and rank-deficient problems, which transform
the original problem to one easier to solve. The QR factorization is
emphasized, for both the analysis and the solution of the LSQ problem, while in
the last section we look into the more expensive complete factorizations.
Some very interesting historical papers on Gaussian elimination, which
also cover least squares problems, can be found in Grcar [109, 110].
2.1 Linear least squares problem formulation
As we saw in the previous chapter, underlying the linear (and possibly
weighted) least squares data fitting problem is the linear least squares
problem
$$ \min_x \| b - A x \|_2 \quad \text{or} \quad \min_x \| W (b - A x) \|_2, $$
where $A \in \mathbb{R}^{m \times n}$ is the matrix with samples of the model basis functions,
$x$ is the vector of parameters to be determined, the right-hand side $b$ is
the vector of observations and $W$ is a diagonal weight matrix (possibly the
identity matrix). Since the weights can always be absorbed into $A$ and $b$
in the mathematical formulation we can, without loss of generality, restrict
our discussion to the un-weighted case. Also, from this point on, when
discussing the generic linear least squares problem, we will use the notation
$b$ for the right-hand side (instead of $y$ that we used in Chapter 1), which
is more common in the LSQ literature.
Figure 2.1.1: The geometric interpretation of the linear least squares
solution $x^*$. The plane represents the range of $A$, and if the vector $b$ has
a component outside this subspace, then we have an inconsistent system.
Moreover, $b^* = A x^*$ is the orthogonal projection of $b$ on range($A$), and $r^*$
is the LSQ residual vector.
Although most of the material in this chapter is also applicable to the
underdetermined case ($m < n$), for notational simplicity we will always
consider the overdetermined case $m \geq n$. We denote by $r$ the rank of
$A$, and we consider both full-rank and rank-deficient problems. Thus we
always have $r \leq n \leq m$ in this chapter.
It is appropriate to remember at this point that an $m \times n$ matrix $A$ is
always a representation of a linear transformation $x \rightarrow A x$ with $A : \mathbb{R}^n \rightarrow \mathbb{R}^m$,
and therefore there are two important subspaces associated with it:
the range or column space,
$$ \mathrm{range}(A) = \{ z \in \mathbb{R}^m \mid z = A x, \ x \in \mathbb{R}^n \}, $$
and its orthogonal complement, the null space of $A^T$:
$$ \mathrm{null}(A^T) = \{ y \in \mathbb{R}^m \mid A^T y = 0 \}. $$
When $A$ is square and has full rank, then the LSQ problem $\min_x \| A x - b \|_2$
reduces to the linear system of equations $A x = b$. In all other cases,
due to the data errors, it is highly probable that the problem is inconsistent,
i.e., $b \notin \mathrm{range}(A)$, and as a consequence there is no exact solution, i.e., no
coefficients $x_j$ exist that express $b$ as a linear combination of columns of $A$.
Instead, we can find the coefficients $x_j$ for a vector $b^*$ in the range of $A$
and closest to $b$. As we have seen, for data fitting problems it is natural
to use the Euclidean norm as our measure of closeness, resulting in the least
squares problem
$$ \text{Problem LSQ:} \quad \min_x \| b - A x \|_2^2, \qquad A \in \mathbb{R}^{m \times n}, \quad r \leq n \leq m \tag{2.1.1} $$
with the corresponding residual vector $r$ given by
$$ r = b - A x. \tag{2.1.2} $$
See Figure 2.1.1 for a geometric interpretation. The minimizer, i.e., the
least squares solution (which may not be unique, as will be seen later),
is denoted by $x^*$. We note that the vector $b^* \in \mathrm{range}(A)$ mentioned above
is given by $b^* = A x^*$.
The LSQ problem can also be looked at from the following point of view.
When our data are contaminated by errors, then the data are not in the
span of the model basis functions $f_j(t)$ underlying the data fitting problem
(cf. Chapter 1). In that case the data vector $b$ cannot, and should not, be
precisely predicted by the model, i.e., the columns of $A$. Hence, it must
be perturbed by a minimum amount $r$, so that it can then be represented
by $A$, in the form of $b^* = A x^*$. This approach will establish a viewpoint
used in Section 7.3 to introduce the total least squares problem.
As already mentioned, there are good statistical reasons to use the
Euclidean norm. The underlying statistical assumption that motivates this
norm is that the vector $r$ has random error elements, uncorrelated, with
zero mean and a common variance. This is justified by the following
theorem.
Theorem 8. (Gauss-Markov) Consider the problem of fitting a model
$M(x, t)$ with the $n$-parameter vector $x$ to a set of data $b_i = \Gamma(t_i) + e_i$
for $i = 1, \ldots, m$ (see Chapter 1 for details).
In the case of a linear model $b = A x$, if the errors are uncorrelated with
mean zero and constant variance $\varsigma^2$ (not necessarily normally distributed)
and assuming that the $m \times n$ matrix $A$ obtained by evaluating the model at
the data abscissas $\{t_i\}_{i=1,\ldots,m}$ has full rank $n$, then the best linear unbiased
estimator is the least squares estimator $x^*$, obtained by solving the problem
$\min_x \| b - A x \|_2^2$.
For more details see [20] Theorem 1.1.1. Recall also the discussion
on maximum likelihood estimation in Chapter 1. Similarly, for nonlinear
models, if the errors $e_i$ for $i = 1, \ldots, m$ have a normal distribution, the
unknown parameter vector $x$ estimated from the data using a least squares
criterion is the maximum likelihood estimator.
There are also clear mathematical and computational advantages
associated with the Euclidean norm: the objective function in (2.1.1) is
differentiable, and the resulting gradient system of equations has convenient
properties. Since the Euclidean norm is preserved under orthogonal
transformations, this gives rise to a range of stable numerical algorithms for the
LSQ problem.
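Theorem 9. A necessary and sufficient condition for $x^*$ to be a minimizer
of $\| b - A x \|_2^2$ is that it satisfies
$$ A^T (b - A x) = 0. \tag{2.1.3} $$
Proof. The minimizer of $\rho(x) = \| b - A x \|_2^2$ must satisfy $\nabla \rho(x) = 0$, i.e.,
$\partial \rho(x) / \partial x_k = 0$ for $k = 1, \ldots, n$. The $k$th partial derivative has the form
$$ \frac{\partial \rho(x)}{\partial x_k} = \sum_{i=1}^{m} 2 \Bigl( b_i - \sum_{j=1}^{n} x_j a_{ij} \Bigr) (-a_{ik}) = -2 \sum_{i=1}^{m} r_i \, a_{ik} = -2 \, r^T A(:, k) = -2 \, A(:, k)^T r, $$
where $A(:, k)$ denotes the $k$th column of $A$. Hence, the gradient can be
written as
$$ \nabla \rho(x) = -2 \, A^T r = -2 \, A^T (b - A x) $$
and the requirement that $\nabla \rho(x) = 0$ immediately leads to (2.1.3). Since
$\rho(x)$ is a convex quadratic function of $x$, this condition is also sufficient.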
Definition 10. The two conditions (2.1.2) and (2.1.3) can be written as a
symmetric $(m + n) \times (m + n)$ system in $x$ and $r$, the so-called augmented
system:
$$ \begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}. \tag{2.1.4} $$
This formulation preserves any special structure that $A$ might have,
such as sparsity. Also, it is the formulation used in an iterative refinement
procedure for the LSQ solution (discussed in Section 4.5), because of the
relevance it gives to the residual.
Theorem 9 leads to the normal equations for the solution x

of the least
squares problem:
Normal equations: A
T
Ax = A
T
b. (2.1.5)
The normal equation matrix A
T
A, which is sometimes called the Gram-
mian, is square, symmetric and additionally:
If r = n (A has full rank), then A
T
A is positive denite and the
LSQ problem has a unique solution. (Since the Hessian for the least
squares problem is equal to 2A
T
A, this establishes the uniqueness of
x

.)
THE LINEAR LEAST SQUARES PROBLEM 29
If r < n (A is rank decient), then A
T
A is non-negative denite. In
this case, the set of solutions forms a linear manifold of dimension
n r that is a translation of the subspace null(A).
Theorem 9 also states that the residual vector of the LSQ solution lies in null($A^T$). Hence, the right-hand side $b$ can be decomposed into two orthogonal components
$$b = Ax^* + r^*,$$
with $Ax^* \in \mathrm{range}(A)$ and $r^* \in \mathrm{null}(A^T)$, i.e., $Ax^*$ is the orthogonal projection of $b$ onto range($A$) (the subspace spanned by the columns of $A$) and $r^*$ is orthogonal to range($A$).
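The orthogonality condition of Theorem 9 and the resulting splitting of $b$ can be checked numerically. The following is a minimal sketch on random data (the matrix, right-hand side and variable names are illustrative assumptions, not the NMR problem); it solves the normal equations (2.1.5) directly, which is adequate here only because the test matrix is well conditioned.

```python
import numpy as np

# Illustrative random full-rank problem (an assumption, not data from the text).
rng = np.random.default_rng(0)
m, n = 50, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Solve the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
r_star = b - A @ x_star

# Theorem 9: the residual is orthogonal to every column of A.
grad = A.T @ r_star

# b = A x* + r* is an orthogonal splitting, so the squared norms add up.
fit = A @ x_star
pythagoras_gap = abs(b @ b - fit @ fit - r_star @ r_star)

print("max |A^T r*| =", np.max(np.abs(grad)))
print("Pythagoras gap =", pythagoras_gap)
```

Both printed quantities should be at the level of rounding errors.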
Example 11. The normal equations for the NMR problem in Example 1 take the form
$$\begin{pmatrix} 2.805 & 4.024 & 5.055 \\ 4.024 & 8.156 & 15.21 \\ 5.055 & 15.21 & 50 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 13.14 \\ 25.98 \\ 51.87 \end{pmatrix},$$
giving the least squares solution $x^* = (1.303,\ 1.973,\ 0.305)^T$.
Example 12. Simplified NMR problem. In the NMR problem, let us assume that we know that the constant background is 0.3, corresponding to fixing $x_3 = 0.3$. The resulting $2 \times 2$ normal equations for $x_1$ and $x_2$ take the form
$$\begin{pmatrix} 2.805 & 4.024 \\ 4.024 & 8.156 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 10.74 \\ 20.35 \end{pmatrix}$$
and the LSQ solution to this simplified problem is $x_1 = 1.287$ and $x_2 = 1.991$. Figure 2.1.2 illustrates the geometry of the minimization associated with the simplified LSQ problem for the two unknowns $x_1$ and $x_2$. The left plot shows the residual norm surface as a function of $x_1$ and $x_2$, and the right plot shows the elliptic contour curves for this surface; the unique minimum (the LSQ solution) is marked with a dot.
In the rank-deficient case the LSQ solution is not unique, but one can reduce the solution set by imposing additional constraints. For example, the linear least squares problem often arises from a linearization of a nonlinear least squares problem, and it may be of interest then to impose the additional constraint that the solution has minimal $l_2$-norm, $\widehat{x}^* = \arg\min_x \|x\|_2$ over the set of least squares solutions, so that the solution stays in the region where the linearization is valid. Because the set of all minimizers is convex there is a unique solution. Another reason for imposing minimal length is stability, as we will see in the section on regularization.
For data approximation problems, where we are free to choose the model basis functions $f_j(t)$, cf. (1.2.2), one should do it in a way that gives $A$ full rank. A necessary condition is that the (continuous) functions $f_1(t), \ldots, f_n(t)$ are linearly independent, but furthermore, they have to define linearly independent vectors when evaluated on the specific discrete set of abscissas. More formally:
A necessary and sufficient condition for the matrix $A$ to have full rank is that the model basis functions be linearly independent over the abscissas $t_1, \ldots, t_m$:
$$\sum_{j=1}^{n} \alpha_j f_j(t_i) = 0 \ \text{ for } i = 1, \ldots, m \quad\Longleftrightarrow\quad \alpha_j = 0 \ \text{ for } j = 1, \ldots, n.$$
Figure 2.1.2: Illustration of the LSQ problem for the simplified NMR problem. Left: the residual norm as a function of the two unknowns $x_1$ and $x_2$. Right: the corresponding contour lines for the residual norm.
Example 13. Consider the linearly independent functions $f_1(t) = \sin(t)$, $f_2(t) = \sin(2t)$ and $f_3(t) = \sin(3t)$; if we choose the data abscissas $t_i = \pi/4 + i\pi/2$, $i = 1, \ldots, m$, the matrix $A$ has rank $r = 2$, whereas the same functions generate a full-rank matrix $A$ when evaluated on the abscissas $t_i = \pi(i/m)$, $i = 1, \ldots, m-1$.
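The rank statement in this example is easy to reproduce. The sketch below builds the matrix for both sets of abscissas and computes a numerical rank; the value $m = 20$ and the rank tolerance are assumptions made for illustration. On the first set of abscissas $\sin(3t_i) = \sin(t_i)$ for every $i$, which is why one column is redundant.

```python
import numpy as np

def basis_matrix(t):
    """Columns sin(t), sin(2t), sin(3t) sampled at the abscissas t."""
    return np.column_stack([np.sin(t), np.sin(2 * t), np.sin(3 * t)])

m = 20  # illustrative number of data points (assumption)

# Abscissas t_i = pi/4 + i*pi/2, i = 1, ..., m: rank-deficient case.
i = np.arange(1, m + 1)
t_bad = np.pi / 4 + i * np.pi / 2

# Abscissas t_i = pi*(i/m), i = 1, ..., m-1: full-rank case.
t_good = np.pi * np.arange(1, m) / m

# An explicit tolerance separates rounding-level singular values from real ones.
rank_bad = np.linalg.matrix_rank(basis_matrix(t_bad), tol=1e-10)
rank_good = np.linalg.matrix_rank(basis_matrix(t_good), tol=1e-10)

print(rank_bad, rank_good)
```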
An even stronger requirement is that the model basis functions f
j
(t) be
such that the columns of A are orthogonal.
Example 14. In general, for data fitting problems, where the model basis functions $f_j(t)$ arise from the underlying model, the properties of the matrix $A$ are dictated by these functions. In the case of polynomial data fitting, it is possible to choose the functions $f_j(t)$ so that the columns of $A$ are orthogonal, as described by Forsythe [81], which simplifies the computation of the LSQ solution. The key is to choose a clever representation of the fitting polynomials, different from the standard one with the monomials $f_j(t) = t^{j-1}$, $j = 1, \ldots, n$, such that the sampled polynomials satisfy
$$\sum_{i=1}^{m} f_j(t_i)\, f_k(t_i) = 0 \quad \text{for } j \neq k. \tag{2.1.6}$$
When this is the case we say that the functions are orthogonal over the given abscissas. This is satisfied by the family of orthogonal polynomials defined by the recursion:
$$f_1(t) = 1, \qquad f_2(t) = t - \alpha_1, \qquad f_{j+1}(t) = (t - \alpha_j)\, f_j(t) - \beta_j\, f_{j-1}(t), \quad j = 2, \ldots, n-1,$$
where the constants are given by
$$\alpha_j = \frac{1}{s_j^2} \sum_{i=1}^{m} t_i\, f_j(t_i)^2, \quad j = 1, \ldots, n-1,$$
$$\beta_j = \frac{s_j^2}{s_{j-1}^2}, \quad j = 2, \ldots, n-1,$$
$$s_j^2 = \sum_{i=1}^{m} f_j(t_i)^2, \quad j = 1, \ldots, n,$$
i.e., $s_j$ is the $l_2$-norm of the $j$th column of $A$. These polynomials satisfy (2.1.6), hence, the normal equations matrix $A^T A$ is diagonal, and it follows that the LSQ coefficients are given by
$$x_j^* = \frac{1}{s_j^2} \sum_{i=1}^{m} y_i\, f_j(t_i), \quad j = 1, \ldots, n.$$
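The recursion can be turned into a few lines of code. The following is a minimal sketch, assuming the constants $\alpha_j$ and $\beta_j$ are needed only for the indices that actually enter the recursion; the function name `forsythe_basis`, the abscissas and the data are hypothetical choices made for illustration.

```python
import numpy as np

def forsythe_basis(t, n):
    """Sample n polynomials that are orthogonal over the abscissas t.

    Returns F (m x n, column j holds f_{j+1}(t_i)) and s2 (squared column norms).
    """
    t = np.asarray(t, dtype=float)
    F = np.empty((t.size, n))
    s2 = np.empty(n)
    F[:, 0] = 1.0                      # f_1(t) = 1
    s2[0] = t.size
    if n > 1:
        alpha = np.sum(t * F[:, 0] ** 2) / s2[0]
        F[:, 1] = t - alpha            # f_2(t) = t - alpha_1
        s2[1] = F[:, 1] @ F[:, 1]
    for j in range(1, n - 1):
        alpha = np.sum(t * F[:, j] ** 2) / s2[j]
        beta = s2[j] / s2[j - 1]
        F[:, j + 1] = (t - alpha) * F[:, j] - beta * F[:, j - 1]
        s2[j + 1] = F[:, j + 1] @ F[:, j + 1]
    return F, s2

# Illustrative data (assumption): 30 equidistant abscissas and smooth data.
t = np.linspace(0.0, 1.0, 30)
y = np.cos(3.0 * t)
F, s2 = forsythe_basis(t, 5)

# A^T A is diagonal, so each LSQ coefficient is a scaled inner product.
x = (F.T @ y) / s2
gram = F.T @ F
```

Because the normal equation matrix is diagonal, no linear system has to be solved; the fitted values `F @ x` agree with those of a monomial fit of the same degree.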
When $A$ has full rank, it follows from the normal equations (2.1.5) that we can write the least squares solution as
$$x^* = (A^T A)^{-1} A^T b,$$
which allows us to analyze the solution and the residual vector in statistical terms. Consider the case where the data errors $e_i$ are independent, uncorrelated and have identical standard deviations $\varsigma$, meaning that the covariance for $b$ is given by
$$\mathrm{Cov}(b) = \varsigma^2 I_m,$$
since the errors $e_i$ are independent of the exact $\Gamma(t_i)$. Then a standard result in statistics says that the covariance matrix for the LSQ solution is
$$\mathrm{Cov}(x^*) = (A^T A)^{-1} A^T\, \mathrm{Cov}(b)\, A\, (A^T A)^{-1} = \varsigma^2 (A^T A)^{-1}.$$
We see that the unknown coefficients in the fit (the elements of $x^*$) are uncorrelated if and only if $A^T A$ is a diagonal matrix, i.e., when the columns of $A$ are orthogonal. This is the case when the model basis functions are orthogonal over the abscissas $t_1, \ldots, t_m$; cf. (2.1.6).
Example 15. More data gives better accuracy. Intuitively we expect that if we increase the number of data points then we can compute a more accurate LSQ solution, and the present example confirms this. Specifically we give an asymptotic analysis of how the solution's variance depends on the number $m$ of data points, in the case of linear data fitting. There is no assumption about the distribution of the abscissas $t_i$ except that they belong to the interval $[a, b]$ and appear in increasing order. Now let $h_i = t_i - t_{i-1}$ for $i = 2, \ldots, m$ and let $h = (b - a)/m$ denote the average spacing between the abscissas. Then for $j, k = 1, \ldots, n$ the elements of the normal equation matrix can be approximated as
$$(A^T A)_{jk} = \sum_{i=1}^{m} h_i^{-1} f_j(t_i) f_k(t_i)\, h_i \simeq \frac{1}{h} \sum_{i=1}^{m} f_j(t_i) f_k(t_i)\, h_i \simeq \frac{m}{b - a} \int_a^b f_j(t) f_k(t)\, dt,$$
and the accuracy of these approximations increases as $m$ increases. Hence, if $F$ denotes the matrix whose elements are the scaled inner products of the model basis functions,
$$F_{jk} = \frac{1}{b - a} \int_a^b f_j(t) f_k(t)\, dt, \quad j, k = 1, \ldots, n,$$
then for large $m$ the normal equation matrix approximately satisfies
$$A^T A \simeq m F \quad\Longleftrightarrow\quad (A^T A)^{-1} \simeq \frac{1}{m} F^{-1},$$
where the matrix $F$ is independent of $m$. Hence, the asymptotic result (as $m$ increases) is that no matter the choice of abscissas and basis functions, as long as $A^T A$ is invertible we have the approximation for the white-noise case:
$$\mathrm{Cov}(x^*) = \varsigma^2 (A^T A)^{-1} \simeq \frac{\varsigma^2}{m} F^{-1}.$$
We see that the solution's variance is (to a good approximation) inversely proportional to the number $m$ of data points.
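The $1/m$ behavior can be observed directly on the matrix $(A^T A)^{-1}$. The sketch below uses a hypothetical quadratic basis $1, t, t^2$ on $[0, 1]$ sampled at interval midpoints (both are assumptions chosen for illustration, not the basis of the example that follows) and compares $m = 50$ with $m = 200$.

```python
import numpy as np

def normal_matrix_inverse(m, a=0.0, b=1.0):
    """(A^T A)^{-1} for the basis 1, t, t^2 sampled at m midpoints of [a, b]."""
    t = a + (np.arange(m) + 0.5) * (b - a) / m
    A = np.column_stack([np.ones(m), t, t ** 2])
    return np.linalg.inv(A.T @ A)

inv_50 = normal_matrix_inverse(50)
inv_200 = normal_matrix_inverse(200)

# Elementwise ratios; the asymptotic analysis predicts values close to 200/50 = 4.
ratios = inv_50 / inv_200
mean_ratio = ratios.mean()

print(ratios)
```

Since $\mathrm{Cov}(x^*)$ is proportional to $(A^T A)^{-1}$ for white noise, a ratio near 4 in the covariance corresponds to a reduction of the standard deviations by a factor of about 2.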
To illustrate the above result we consider again the frozen cod meat example, this time with two sets of abscissas $t_i$ uniformly distributed in $[0, 0.4]$ for $m = 50$ and $m = 200$, leading to the two matrices $(A^T A)^{-1}$ given by
$$\begin{pmatrix} 1.846 & 1.300 & 0.209 \\ 1.300 & 1.200 & 0.234 \\ 0.209 & 0.234 & 0.070 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} 0.535 & 0.359 & 0.057 \\ 0.359 & 0.315 & 0.061 \\ 0.057 & 0.061 & 0.018 \end{pmatrix},$$
respectively. The average ratio between the elements in the two matrices is 3.71, i.e., fairly close to the factor 4 we expect from the above analysis when increasing $m$ by a factor 4.
We also solved the two LSQ problems for 1000 realizations of additive white noise, and Figure 2.1.3 shows histograms of the error norms $\|x^{\mathrm{exact}} - x^*\|_2$, where $x^{\mathrm{exact}} = (1.27,\ 2.04,\ 0.3)^T$ is the vector of exact parameters for the problem. These results confirm that the errors are reduced by a factor of 2 corresponding to the expected reduction of the standard deviation by the same factor.
Figure 2.1.3: Histograms of the error norms $\|x^{\mathrm{exact}} - x^*\|_2$ for the two test problems with additive white noise; the errors are clearly reduced by a factor of 2 when we increase $m$ from 50 to 200.
2.2 The QR factorization and its role
In this and the next section we discuss the QR factorization and its role in the analysis and solution of the LSQ problem. We start with the simpler case of full-rank matrices in this section and then move on to rank-deficient matrices in the next section.
The first step in the computation of a solution to the least squares problem is the reduction of the problem to an equivalent one with a more convenient matrix structure. This can be done through an explicit factorization, usually based on orthogonal transformations, where instead of solving the original LSQ problem (2.1.1) one solves an equivalent problem with a triangular matrix. The basis of this procedure is the QR factorization, the less expensive decomposition that takes advantage of the isometric properties of orthogonal transformations (proofs for all the theorems in this section can be found in [20], [105] and many other references).
Theorem 16. QR factorization. Any real $m \times n$ matrix $A$ can be factored as
$$A = QR \quad\text{with}\quad Q \in \mathbb{R}^{m \times m}, \quad R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix} \in \mathbb{R}^{m \times n}, \tag{2.2.1}$$
where $Q$ is orthogonal (i.e., $Q^T Q = I_m$) and $R_1 \in \mathbb{R}^{n \times n}$ is upper triangular. If $A$ has full rank, then so has $R$ and therefore all its diagonal elements are nonzero.
Theorem 17. Economical QR factorization. Let $A \in \mathbb{R}^{m \times n}$ have full column rank $r = n$. The economical (or thin) QR factorization of $A$ is
$$A = Q_1 R_1 \quad\text{with}\quad Q_1 \in \mathbb{R}^{m \times n}, \quad R_1 \in \mathbb{R}^{n \times n}, \tag{2.2.2}$$
where $Q_1$ has orthonormal columns (i.e., $Q_1^T Q_1 = I_n$) and the upper triangular matrix $R_1$ has nonzero diagonal entries. Moreover, $Q_1$ can be chosen such that the diagonal elements of $R_1$ are positive, in which case $R_1$ is the Cholesky factor of $A^T A$.
Similar theorems hold if the matrix $A$ is complex, with the factor $Q$ now a unitary matrix.
Remark 18. If we partition the $m \times m$ matrix $Q$ in the full QR factorization (2.2.1) as
$$Q = (\, Q_1 \ \ Q_2 \,),$$
then the sub-matrix $Q_1$ is the one that appears in the economical QR factorization (2.2.2). The $m \times (m - n)$ matrix $Q_2$ satisfies $Q_2^T Q_1 = 0$ and $Q_1 Q_1^T + Q_2 Q_2^T = I_m$.
Geometrically, the QR factorization corresponds to an orthogonalization of the linearly independent columns of $A$. The columns of matrix $Q_1$ are an orthonormal basis for range($A$) and those of $Q_2$ are an orthonormal basis for null($A^T$).
The following theorem expresses the least squares solution of the full-rank problem in terms of the economical QR factorization.
Theorem 19. Let $A \in \mathbb{R}^{m \times n}$ have full column rank $r = n$, with the economical QR factorization $A = Q_1 R_1$ from Theorem 17. Considering that
$$\|b - Ax\|_2^2 = \|Q^T (b - Ax)\|_2^2 = \left\| \begin{pmatrix} Q_1^T b \\ Q_2^T b \end{pmatrix} - \begin{pmatrix} R_1 \\ 0 \end{pmatrix} x \right\|_2^2 = \|Q_1^T b - R_1 x\|_2^2 + \|Q_2^T b\|_2^2,$$
then, the unique solution of the LSQ problem $\min_x \|b - Ax\|_2^2$ can be computed from the simpler, equivalent problem
$$\min_x \|Q_1^T b - R_1 x\|_2^2,$$
whose solution is
$$x^* = R_1^{-1} Q_1^T b \tag{2.2.3}$$
and the corresponding least squares residual is given by
$$r^* = b - Ax^* = (I_m - Q_1 Q_1^T)\, b = Q_2 Q_2^T b, \tag{2.2.4}$$
with the matrix $Q_2$ that was introduced in Remark 18.
Of course, (2.2.3) is short-hand for solving $R_1 x^* = Q_1^T b$, and one point of this reduction is that it is much simpler to solve a triangular system of equations than a full one. Further on we will also see that this approach has better numerical properties, as compared to solving the normal equations introduced in the previous section.
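The steps of Theorem 19 can be sketched as follows. The example uses NumPy's `numpy.linalg.qr` on random data (an illustrative assumption), and a small hand-written back substitution, named `back_substitution` here for illustration, to emphasize that only a triangular system is solved.

```python
import numpy as np

def back_substitution(R, c):
    """Solve R x = c for an upper triangular, nonsingular R."""
    n = R.shape[0]
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):
        x[k] = (c[k] - R[k, k + 1:] @ x[k + 1:]) / R[k, k]
    return x

# Illustrative random full-rank problem (assumption).
rng = np.random.default_rng(1)
m, n = 30, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Full QR factorization (2.2.1), partitioned as in Remark 18.
Q, R = np.linalg.qr(A, mode="complete")
Q1, Q2, R1 = Q[:, :n], Q[:, n:], R[:n, :]

x_star = back_substitution(R1, Q1.T @ b)   # solves R1 x = Q1^T b, cf. (2.2.3)
r_star = Q2 @ (Q2.T @ b)                   # residual from (2.2.4)

print(np.linalg.norm(r_star - (b - A @ x_star)))
```

The product `R1.T @ R1` reproduces $A^T A$, in line with the Cholesky remark in Theorem 17 (NumPy does not force the diagonal of `R1` to be positive, so `R1` may differ from the Cholesky factor by row signs).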
Example 20. In Example 11 we saw the normal equations for the NMR problem from Example 1; here we take a look at the economical QR factorization for the same problem:
$$A = \begin{pmatrix} 1.00 & 1.00 & 1 \\ 0.80 & 0.94 & 1 \\ 0.64 & 0.88 & 1 \\ \vdots & \vdots & \vdots \\ 3.2 \cdot 10^{-5} & 4.6 \cdot 10^{-2} & 1 \\ 2.5 \cdot 10^{-5} & 4.4 \cdot 10^{-2} & 1 \\ 2.0 \cdot 10^{-5} & 4.1 \cdot 10^{-2} & 1 \end{pmatrix},$$
$$Q_1 = \begin{pmatrix} 0.597 & 0.281 & 0.172 \\ 0.479 & 0.139 & 0.071 \\ 0.384 & 0.029 & 0.002 \\ \vdots & \vdots & \vdots \\ 1.89 \cdot 10^{-5} & 0.030 & 0.224 \\ 1.52 \cdot 10^{-5} & 0.028 & 0.226 \\ 1.22 \cdot 10^{-5} & 0.026 & 0.229 \end{pmatrix},$$
$$R_1 = \begin{pmatrix} 1.67 & 2.40 & 3.02 \\ 0 & 1.54 & 5.16 \\ 0 & 0 & 3.78 \end{pmatrix}, \qquad Q_1^T b = \begin{pmatrix} 7.81 \\ 4.32 \\ 1.19 \end{pmatrix}.$$
We note that the upper triangular matrix $R_1$ is also the Cholesky factor of the normal equation matrix, i.e., $A^T A = R_1^T R_1$.
The QR factorization allows us to study the residual vector in more detail. Consider first the case where we augment $A$ with an additional column, corresponding to adding an additional model basis function in the data fitting problem.
Theorem 21. Let the augmented matrix $\bar{A} = (\, A,\ a_{n+1} \,)$ have the QR factorization
$$\bar{A} = (\, \bar{Q}_1 \ \ \bar{Q}_2 \,) \begin{pmatrix} \bar{R}_1 \\ 0 \end{pmatrix},$$
with $\bar{Q}_1 = (\, Q_1 \ \ \bar{q} \,)$, $Q_1^T \bar{q} = 0$ and $\bar{Q}_2^T \bar{q} = 0$. Then the norms of the least squares residual vectors $r^* = (I_m - Q_1 Q_1^T)\, b$ and $\bar{r}^* = (I_m - \bar{Q}_1 \bar{Q}_1^T)\, b$ are related by
$$\|r^*\|_2^2 = \|\bar{r}^*\|_2^2 + (\bar{q}^T b)^2.$$
Proof. From the relation $\bar{Q}_1 \bar{Q}_1^T = Q_1 Q_1^T + \bar{q}\, \bar{q}^T$ it follows that $I_m - Q_1 Q_1^T = I_m - \bar{Q}_1 \bar{Q}_1^T + \bar{q}\, \bar{q}^T$, and hence,
$$\|r^*\|_2^2 = \|(I_m - Q_1 Q_1^T)\, b\|_2^2 = \|(I_m - \bar{Q}_1 \bar{Q}_1^T)\, b + \bar{q}\, \bar{q}^T b\|_2^2 = \|(I_m - \bar{Q}_1 \bar{Q}_1^T)\, b\|_2^2 + \|\bar{q}\, \bar{q}^T b\|_2^2 = \|\bar{r}^*\|_2^2 + (\bar{q}^T b)^2,$$
where we used that the two components of $r^*$ are orthogonal and that $\|\bar{q}\, \bar{q}^T b\|_2 = |\bar{q}^T b|\, \|\bar{q}\|_2 = |\bar{q}^T b|$.
This theorem shows that, when we increase the number of model basis functions for the fit in such a way that the matrix retains full rank, then the least squares residual norm decreases (or stays fixed if $b$ is orthogonal to $\bar{q}$).
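The identity in Theorem 21 can be verified numerically. The sketch below appends one column to a random matrix (all data are illustrative assumptions) and takes $\bar{q}$ as the last column of the thin Q factor of the augmented matrix, which is a unit vector orthogonal to range($A$).

```python
import numpy as np

# Illustrative random data (assumption).
rng = np.random.default_rng(2)
m, n = 40, 3
A = rng.standard_normal((m, n))
a_new = rng.standard_normal(m)          # the additional column a_{n+1}
b = rng.standard_normal(m)
A_bar = np.column_stack([A, a_new])

def ls_residual(M, rhs):
    """Least squares residual rhs - M x* for a full-rank matrix M."""
    x, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return rhs - M @ x

r = ls_residual(A, b)
r_bar = ls_residual(A_bar, b)

# Thin QR of the augmented matrix; its last column plays the role of q-bar.
Q1_bar, _ = np.linalg.qr(A_bar)
q_bar = Q1_bar[:, -1]

lhs = r @ r
rhs = r_bar @ r_bar + (q_bar @ b) ** 2
print(lhs, rhs)
```

The two printed numbers agree to rounding accuracy, and `r_bar @ r_bar` never exceeds `r @ r`.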
To obtain more insight into the least squares residual we study the influence of the approximation and data errors. According to (1.2.1) we can write the right-hand side as
$$b = \Gamma + e,$$
where the two vectors
$$\Gamma = \bigl( \Gamma(t_1), \ldots, \Gamma(t_m) \bigr)^T \quad\text{and}\quad e = (e_1, \ldots, e_m)^T$$
contain the pure data (the sampled pure-data function) and the data errors, respectively. Hence, the least squares residual vector is
$$r^* = \Gamma - Ax^* + e, \tag{2.2.5}$$
where the vector $\Gamma - Ax^*$ is the approximation error. From (2.2.5) it follows that the least squares residual vector can be written as
$$r^* = (I_m - Q_1 Q_1^T)\, \Gamma + (I_m - Q_1 Q_1^T)\, e = Q_2 Q_2^T \Gamma + Q_2 Q_2^T e.$$
We see that the residual vector consists of two terms. The first term $Q_2 Q_2^T \Gamma$ is an approximation residual, due to the discrepancy between the $n$ model basis functions (represented by the columns of $A$) and the pure-data function. The second term is the projected error, i.e., the component of the data errors that lies in the subspace null($A^T$). We can summarize the statistical properties of the least squares residual vector as follows.
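Theorem 22. The least squares residual vector $r^* = b - Ax^*$ has the following properties:
$$\mathcal{E}(r^*) = Q_2 Q_2^T \Gamma, \qquad \mathrm{Cov}(r^*) = Q_2 Q_2^T\, \mathrm{Cov}(e)\, Q_2 Q_2^T,$$
$$\mathcal{E}(\|r^*\|_2^2) = \|Q_2^T \Gamma\|_2^2 + \mathcal{E}(\|Q_2^T e\|_2^2).$$
If $e$ is white noise, i.e., $\mathrm{Cov}(e) = \varsigma^2 I_m$, then
$$\mathrm{Cov}(r^*) = \varsigma^2 Q_2 Q_2^T, \qquad \mathcal{E}(\|r^*\|_2^2) = \|Q_2^T \Gamma\|_2^2 + (m - n)\, \varsigma^2.$$
Proof. It follows immediately that
$$\mathcal{E}(Q_2 Q_2^T e) = Q_2 Q_2^T\, \mathcal{E}(e) = 0 \quad\text{and}\quad \mathcal{E}(\Gamma^T Q_2 Q_2^T e) = 0,$$
as well as
$$\mathrm{Cov}(r^*) = Q_2 Q_2^T\, \mathrm{Cov}(\Gamma + e)\, Q_2 Q_2^T \quad\text{and}\quad \mathrm{Cov}(\Gamma + e) = \mathrm{Cov}(e).$$
Moreover,
$$\mathcal{E}(\|r^*\|_2^2) = \mathcal{E}(\|Q_2 Q_2^T \Gamma\|_2^2) + \mathcal{E}(\|Q_2 Q_2^T e\|_2^2) + \mathcal{E}(2\, \Gamma^T Q_2 Q_2^T e).$$
It follows that
$$\mathrm{Cov}(Q_2^T e) = \varsigma^2 I_{m-n} \quad\text{and}\quad \mathcal{E}(\|Q_2^T e\|_2^2) = \mathrm{trace}(\mathrm{Cov}(Q_2^T e)) = (m - n)\, \varsigma^2.$$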
From the above theorem we see that if the approximation error $\Gamma - Ax^*$ is somewhat smaller than the data error $e$ then, in the case of white noise, the scaled residual norm $s^*$ (sometimes referred to as the standard error), defined by
$$s^* = \frac{\|r^*\|_2}{\sqrt{m - n}}, \tag{2.2.6}$$
provides an estimate for the standard deviation $\varsigma$ of the errors in the data. Moreover, provided that the approximation error decreases sufficiently fast when the fitting order $n$ increases, then we should expect that for large enough $n$ the least squares residual norm becomes dominated by the projected error term, i.e.,
$$r^* \approx Q_2 Q_2^T e \quad \text{for } n \text{ sufficiently large}.$$
Hence, if we monitor the scaled residual norm $s^* = s^*(n)$ as a function of $n$, then we expect to see that $s^*(n)$ initially decreases (when it is dominated by the approximation error) while at a later stage it levels off, when the projected data error dominates. The transition between the two stages of the behavior of $s^*(n)$ indicates a good choice for the fitting order $n$.
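The claim that $s^*$ estimates the noise level can be illustrated with a small Monte Carlo experiment. This is a sketch under stated assumptions: a hypothetical quadratic model, pure data lying exactly in range($A$) (so the approximation error is zero), and Gaussian white noise with standard deviation 0.1 (called `sigma` in the code).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 3
sigma = 0.1                                  # noise standard deviation (assumption)
t = np.linspace(0.0, 1.0, m)
A = np.column_stack([np.ones(m), t, t ** 2])
pure = A @ np.array([1.0, -2.0, 0.5])        # pure data in range(A): no approximation error

def scaled_residual_norm(A, b):
    """s* = ||r*||_2 / sqrt(m - n), cf. (2.2.6)."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x
    return np.linalg.norm(r) / np.sqrt(A.shape[0] - A.shape[1])

# Repeat the fit for many independent noise realizations.
s_values = np.array([
    scaled_residual_norm(A, pure + sigma * rng.standard_normal(m))
    for _ in range(2000)
])
mean_s_squared = np.mean(s_values ** 2)

print(np.sqrt(mean_s_squared), "vs", sigma)
```

By Theorem 22 the expected value of $(s^*)^2$ equals the noise variance when the approximation error vanishes, so the averaged value should be close to `sigma**2`.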


Example 23. We return to the air pollution example from Example 2. We compute the polynomial fit for $n = 1, 2, \ldots, 19$ and the trigonometric fit for $n = 1, 3, 5, \ldots, 19$ (only odd values of $n$ are used, because we always need a sin-cos pair). Figure 2.2.1 shows the residual norm $\|r^*\|_2$ and the scaled residual norm $s^*$ as functions of $n$.
The residual norm decreases monotonically with $n$, while the scaled residual norm shows the expected behavior mentioned above, i.e., a decaying phase (when the approximation error dominates), followed by a more flat or slightly increasing phase when the data errors dominate.
The standard error $s^*$ introduced in (2.2.6) above, defined as the residual norm adjusted by the degrees of freedom in the residual, is just one example of a quantity from statistics that plays a central role in the analysis of LSQ problems. Another quantity arising from statistics is the coefficient of determination $R^2$, which is used in the context of linear regression analysis (statistical modeling) as a measure of how well a linear model fits the data. Given a model $M(x, t)$ that predicts the observations $b_1, b_2, \ldots, b_m$ and the residual vector $r = (b_1 - M(x, t_1), \ldots, b_m - M(x, t_m))^T$, the coefficient of determination is defined by
$$R^2 = 1 - \frac{\|r\|_2^2}{\sum_{i=1}^{m} (b_i - \bar{b})^2}, \tag{2.2.7}$$
where $\bar{b}$ is the mean of the observations. In general, the second term is an approximation of the fraction of unexplained variance, since it compares the variance in the model's errors with the total variance of the data. Yet another useful quantity for analysis is the adjusted coefficient of determination, adj $R^2$, defined in the same way as the coefficient of determination $R^2$, but adjusted using the residual degrees of freedom,
$$\mathrm{adj}\, R^2 = 1 - \frac{(s^*)^2}{\sum_{i=1}^{m} (b_i - \bar{b})^2 / (m - 1)}, \tag{2.2.8}$$
making it similar in spirit to the squared standard error $(s^*)^2$. In Chapter 11 we demonstrate the use of these statistical tools.
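Both coefficients follow directly from the residual of a linear fit. The following sketch evaluates (2.2.7) and (2.2.8) for a hypothetical straight-line fit; the function name `fit_statistics` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_statistics(A, b):
    """Return (R^2, adj R^2) for the linear least squares fit of b by A."""
    m, n = A.shape
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = b - A @ x
    total = np.sum((b - b.mean()) ** 2)       # total variation of the data
    s_squared = (r @ r) / (m - n)             # squared standard error, cf. (2.2.6)
    r2 = 1.0 - (r @ r) / total                # (2.2.7)
    adj_r2 = 1.0 - s_squared / (total / (m - 1))   # (2.2.8)
    return r2, adj_r2

# Illustrative data (assumption): a line plus a small smooth perturbation.
m = 20
t = np.linspace(0.0, 1.0, m)
b = 1.0 + 2.0 * t + 0.05 * np.cos(7.0 * t)
A = np.column_stack([np.ones(m), t])

r2, adj_r2 = fit_statistics(A, b)
print(r2, adj_r2)
```

The two quantities are linked by $1 - \mathrm{adj}\,R^2 = (1 - R^2)(m-1)/(m-n)$, so the adjusted value never exceeds $R^2$ and penalizes fits that use many parameters.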
Figure 2.2.1: The residual norm and the scaled residual norm, as functions of the fitting order $n$, for the polynomial and trigonometric fits to the air pollution data.
2.3 Permuted QR factorization
The previous section covered in detail full-rank problems, and we saw that the QR factorization was well suited for solving such problems. However, for parameter estimation problems, where the model is given, there is no guarantee that $A$ always has full rank, and therefore we must also consider the rank-deficient case. We give first an overview of some matrix factorizations that are useful for detecting and treating rank-deficient problems, although they are of course also applicable in the full-rank case. The minimum-norm solution from Definition 27 below plays a central role in this discussion.
When $A$ is rank deficient we cannot always compute a QR factorization (2.2.1) that has a convenient economical version, where the range of $A$ is spanned by the first columns of $Q$. The following example illustrates that a column permutation is needed to achieve such a form.
Example 24. Consider the factorization
$$A = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} c & -s \\ s & c \end{pmatrix} \begin{pmatrix} 0 & s \\ 0 & c \end{pmatrix}, \quad \text{for any } c^2 + s^2 = 1.$$
This QR factorization has the required form, i.e., the first factor is orthogonal and the second is upper triangular, but range($A$) is not spanned by the first column of the orthogonal factor. However, a permutation $\Pi$ of the columns of $A$ gives a QR factorization of the desired form,
$$A \Pi = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} = QR = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix},$$
with a triangular $R$ and such that the range of $A$ is spanned by the first column of $Q$.
In general, we need a permutation of columns that selects the linearly independent columns of $A$ and places them first. The following theorem formalizes this idea.
Theorem 25. QR factorization with column permutation. If $A$ is real, $m \times n$ with rank($A$) $= r < n \le m$, then there exists a permutation $\Pi$, not necessarily unique, and an orthogonal matrix $Q$ such that
$$A \Pi = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & 0 \end{pmatrix} \begin{matrix} r \\ m - r \end{matrix}, \tag{2.3.1}$$
where $R_{11}$ is $r \times r$ upper triangular with positive diagonal elements. The range of $A$ is spanned by the first $r$ columns of $Q$.
Similar results hold for complex matrices where $Q$ now is unitary.
The first $r$ columns of the matrix $A \Pi$ are guaranteed to be linearly independent. For a model with basis functions that are not linearly independent over the abscissas, this provides a method for choosing $r$ linearly independent functions. The rank-deficient least squares problem can now be solved as follows.
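Theorem 26. Let $A$ be a rank-deficient $m \times n$ matrix with the pivoted QR factorization in Theorem 25. Then the LSQ problem (2.1.1) takes the form
$$\min_x \|Q^T A \Pi\, \Pi^T x - Q^T b\|_2^2 = \min_y \left\| \begin{pmatrix} R_{11} & R_{12} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} - \begin{pmatrix} d_1 \\ d_2 \end{pmatrix} \right\|_2^2 = \min_y \left( \|R_{11} y_1 + R_{12} y_2 - d_1\|_2^2 + \|d_2\|_2^2 \right),$$
where we have introduced
$$Q^T b = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix} \quad\text{and}\quad y = \Pi^T x = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}.$$
The general solution is
$$x^* = \Pi \begin{pmatrix} R_{11}^{-1} (d_1 - R_{12}\, y_2) \\ y_2 \end{pmatrix}, \quad y_2 = \text{arbitrary} \tag{2.3.2}$$
and any choice of $y_2$ leads to a least squares solution with residual norm $\|r^*\|_2 = \|d_2\|_2$.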
Definition 27. Given the LSQ problem with a rank deficient matrix $A$ and the general solution given by (2.3.2), we define $\widehat{x}^*$ as the solution of minimal $l_2$-norm that satisfies
$$\widehat{x}^* = \arg\min_x \|x\|_2 \quad\text{subject to}\quad \|b - Ax\|_2 = \min.$$
The choice $y_2 = 0$ in (2.3.2) is an important special case that leads to the so-called basic solution,
$$x_B = \Pi \begin{pmatrix} R_{11}^{-1} Q_1^T b \\ 0 \end{pmatrix},$$
with at least $n - r$ zero components. This corresponds to using only the first $r$ columns of $A \Pi$ in the solution, while setting the remaining elements to zero. As already mentioned, this is an important choice in data fitting as well as other applications because it implies that $b$ is represented by the smallest subset of $r$ columns of $A$, i.e., it is fitted with as few variables as possible. It is also related to the new field of compressed sensing [4, 35, 235].
Example 28. Linear prediction. We consider a digital signal, i.e., a vector $s \in \mathbb{R}^N$, and we seek a relation between neighboring elements of the form
$$s_i = \sum_{j=1}^{\ell} x_j\, s_{i-j}, \quad i = \ell + 1, \ldots, N, \tag{2.3.3}$$
for some (small) value of $\ell$. The technique of estimating the $i$th element from a number of previous elements is called linear prediction (LP), and the LP coefficients $x_j$ can be used to characterize various underlying properties of the signal. Throughout this book we will use a test problem where the elements of the noise-free signal are given by
$$s_i = \alpha_1 \sin(\omega_1 t_i) + \alpha_2 \sin(\omega_2 t_i) + \cdots + \alpha_p \sin(\omega_p t_i), \quad i = 1, 2, \ldots, N.$$
In this particular example, we use $N = 32$, $p = 2$, $\alpha_1 = 2$, $\alpha_2 = 1$ and no noise.
There are many ways to estimate the LP coefficients in (2.3.3). One of the popular methods amounts to forming a Toeplitz matrix $A$ (i.e., a matrix with constant diagonals) and a right-hand side $b$ from the signal, with elements given by
$$a_{ij} = s_{n+i-j}, \qquad b_i = s_{n+i}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, n,$$
where the matrix dimensions $m$ and $n$ satisfy $m + n = N$ and $\min(m, n) \ge \ell$. We choose $N = 32$ and $n = 7$ giving $m = 25$, and the first 7 rows of $A$ and the first 7 elements of $b$ are
$$\begin{pmatrix} 1.011 & 1.151 & 0.918 & 2.099 & 0.029 & 2.770 & 0.875 \\ 2.928 & 1.011 & 1.151 & 0.918 & 2.099 & 0.029 & 2.770 \\ 1.056 & 2.928 & 1.011 & 1.151 & 0.918 & 2.099 & 0.029 \\ 1.079 & 1.056 & 2.928 & 1.011 & 1.151 & 0.918 & 2.099 \\ 1.197 & 1.079 & 1.056 & 2.928 & 1.011 & 1.151 & 0.918 \\ 2.027 & 1.197 & 1.079 & 1.056 & 2.928 & 1.011 & 1.151 \\ 1.160 & 2.027 & 1.197 & 1.079 & 1.056 & 2.928 & 1.011 \end{pmatrix}, \qquad \begin{pmatrix} 2.928 \\ 0.0101 \\ 1.079 \\ 1.197 \\ 2.027 \\ 1.160 \\ 2.559 \end{pmatrix}.$$
The matrix $A$ is rank deficient and it turns out that for this problem we can safely compute the ordinary QR factorization without pivoting, corresponding to $\Pi = I$. The matrix $R_1$ and the vector $Q_1^T b$ are
$$\begin{pmatrix} 7.970 & 2.427 & 3.392 & 4.781 & 5.273 & 1.890 & 7.510 \\ 0 & 7.678 & 3.725 & 1.700 & 3.289 & 6.482 & 0.542 \\ 0 & 0 & 6.041 & 2.360 & 4.530 & 1.136 & 2.765 \\ 0 & 0 & 0 & 5.836 & 0.563 & 4.195 & 0.252 \\ 0 & 0 & 0 & 0 & \epsilon & \epsilon & \epsilon \\ 0 & 0 & 0 & 0 & 0 & \epsilon & \epsilon \\ 0 & 0 & 0 & 0 & 0 & 0 & \epsilon \end{pmatrix}, \qquad \begin{pmatrix} 2.573 \\ 4.250 \\ 1.942 \\ 5.836 \\ \epsilon \\ \epsilon \\ \epsilon \end{pmatrix},$$
where $\epsilon$ denotes an element whose absolute value is of the order $10^{-14}$ or smaller. We see that the numerical rank of $A$ is $r = 4$ and that $b$ is the weighted sum of columns 1 through 4 of the matrix $A$, i.e., four LP coefficients are needed in (2.3.3). A basic solution is obtained by setting the last three elements of the solution to zero:
$$x_B = (\, 0.096,\ 0.728,\ 0.096,\ 1.000,\ 0,\ 0,\ 0 \,)^T.$$
A numerically safer approach for rank-deficient problems is to use the QR factorization with column permutations from Theorem 25, for which we get
$$\Pi = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$
and
$$R_{11} = \begin{pmatrix} 8.052 & 1.747 & 3.062 & 4.725 \\ 0 & 7.833 & 4.076 & 2.194 \\ 0 & 0 & 5.972 & 3.434 \\ 0 & 0 & 0 & 5.675 \end{pmatrix}, \qquad d_1 = \begin{pmatrix} 3.277 \\ 3.990 \\ 5.952 \\ 6.79 \end{pmatrix}.$$
The basic solution corresponding to this factorization is
$$x_B = (\, 0,\ 0.721,\ 0,\ 0.990,\ 0.115,\ 0,\ 0.027 \,)^T.$$
This basic solution expresses $b$ as a weighted sum of columns 2, 4, 5 and 7 of $A$, i.e., the columns that the permutation $\Pi$ above places first. The example shows that the basic solution is not unique: both basic solutions given above solve the LSQ problem associated with the linear prediction problem.
The basic solution introduced above is one way of defining a particular type of solution of the rank-deficient LSQ problem, and it is useful in some applications. However, in other applications we require the minimum-norm solution $\widehat{x}^*$ from Definition 27, whose computation reduces to solving the least squares problem
$$\min_x \|x\|_2 = \min_{y_2} \left\| \Pi \begin{pmatrix} R_{11}^{-1} (d_1 - R_{12}\, y_2) \\ y_2 \end{pmatrix} \right\|_2.$$
Using the basic solution $x_B$ the problem is reduced to
$$\min_{y_2} \left\| x_B - \Pi \begin{pmatrix} R_{11}^{-1} R_{12} \\ -I \end{pmatrix} y_2 \right\|_2.$$
This is a full-rank least squares problem with matrix $\Pi \begin{pmatrix} R_{11}^{-1} R_{12} \\ -I \end{pmatrix}$. The right-hand side $x_B$ and the solution $y_2^*$ can be obtained via a QR factorization. The following result from [101] relates the norms of the basic and minimum-norm solutions:
$$1 \le \frac{\|x_B\|_2}{\|\widehat{x}^*\|_2} \le \sqrt{1 + \|R_{11}^{-1} R_{12}\|_2^2}.$$
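The chain from pivoted QR factorization to basic solution to minimum-norm solution can be sketched in NumPy. Everything below is an illustration under stated assumptions: `pivoted_qr` is a hypothetical helper based on modified Gram-Schmidt with column pivoting (in practice one would call a library routine, for instance SciPy's `scipy.linalg.qr` with `pivoting=True`), the rank tolerance is arbitrary, and the frequencies 0.7 and 1.9 of the test signal are made up, since the text does not list them.

```python
import numpy as np

def pivoted_qr(A, tol=1e-10):
    """Gram-Schmidt QR with column pivoting: A[:, perm] ~ Q1 @ R, numerical rank r."""
    W = np.array(A, dtype=float)
    m, n = W.shape
    perm = np.arange(n)
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    scale = np.linalg.norm(W)
    r = 0
    for k in range(min(m, n)):
        norms = np.linalg.norm(W[:, k:], axis=0)
        p = k + int(np.argmax(norms))
        if norms[p - k] <= tol * scale:
            break                                  # remaining columns are numerically zero
        W[:, [k, p]] = W[:, [p, k]]                # bring the largest column to position k
        R[:, [k, p]] = R[:, [p, k]]
        perm[[k, p]] = perm[[p, k]]
        R[k, k] = np.linalg.norm(W[:, k])
        Q[:, k] = W[:, k] / R[k, k]
        R[k, k + 1:] = Q[:, k] @ W[:, k + 1:]
        W[:, k + 1:] -= np.outer(Q[:, k], R[k, k + 1:])
        r += 1
    return Q[:, :r], R[:r, :], perm, r

# Linear prediction test problem: N = 32, n = 7, two sinusoids, no noise.
N, n = 32, 7
m = N - n
idx = np.arange(1, N + 1)
s = 2.0 * np.sin(0.7 * idx) + np.sin(1.9 * idx)    # hypothetical frequencies
A = np.array([[s[n + i - j - 1] for j in range(1, n + 1)] for i in range(1, m + 1)])
b = s[n:]                                          # b_i = s_{n+i}

Q1, R, perm, r = pivoted_qr(A)
R11, R12 = R[:, :r], R[:, r:]
d1 = Q1.T @ b

# Basic solution: y_2 = 0 in (2.3.2), then undo the permutation.
x_basic = np.zeros(n)
x_basic[perm[:r]] = np.linalg.solve(R11, d1)

# Minimum-norm solution: min over y_2 of || x_B - Pi [C; -I] y_2 ||_2.
C = np.linalg.solve(R11, R12)
M = np.zeros((n, n - r))
M[perm[:r], :] = C
M[perm[r:], :] = -np.eye(n - r)
y2, *_ = np.linalg.lstsq(M, x_basic, rcond=None)
x_min = x_basic - M @ y2

ratio = np.linalg.norm(x_basic) / np.linalg.norm(x_min)
bound = np.sqrt(1.0 + np.linalg.norm(C, 2) ** 2)
print(r, ratio, bound)
```

For this noise-free signal the numerical rank is 4, both solutions reproduce $b$ to rounding accuracy, and the norm ratio lies between 1 and the bound from the inequality above.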
Complete orthogonal factorization
As demonstrated above, we cannot immediately compute the minimum-norm least squares solution $\widehat{x}^*$ from the pivoted QR factorization. However, the QR factorization with column permutations can be considered as a first step toward the so-called complete orthogonal factorization. This is a decomposition that, through basis changes by means of orthogonal transformations in both $\mathbb{R}^m$ and $\mathbb{R}^n$, concentrates the whole information of $A$ into a leading square nonsingular matrix of size $r \times r$. This gives a more direct way of computing $\widehat{x}^*$. The existence of complete orthogonal factorizations is stated in the following theorem.
Theorem 29. Let $A$ be a real $m \times n$ matrix of rank $r$. Then there is an $m \times m$ orthogonal matrix $U$ and an $n \times n$ orthogonal matrix $V$ such that
$$A = U R V^T \quad\text{with}\quad R = \begin{pmatrix} R_{11} & 0 \\ 0 & 0 \end{pmatrix}, \tag{2.3.4}$$
where $R_{11}$ is an $r \times r$ nonsingular triangular matrix.
A similar result holds for complex matrices with $U$ and $V$ unitary. The LSQ solution can now be obtained as stated in the following theorem.
Theorem 30. Let $A$ have the complete orthogonal decomposition (2.3.4) and introduce the auxiliary vectors
$$U^T b = g = \begin{pmatrix} g_1 \\ g_2 \end{pmatrix} \begin{matrix} r \\ m - r \end{matrix}, \qquad V^T x = y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \begin{matrix} r \\ n - r \end{matrix}. \tag{2.3.5}$$
Then the solutions to $\min_x \|b - Ax\|_2$ are given by
$$x^* = V \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad y_1 = R_{11}^{-1} g_1, \quad y_2 = \text{arbitrary}, \tag{2.3.6}$$
and the residual norm is $\|r^*\|_2 = \|g_2\|_2$. In particular, the minimum-norm solution $\widehat{x}^*$ is obtained by setting $y_2 = 0$.
Proof. Replacing $A$ by its complete orthogonal decomposition we get
$$\|b - Ax\|_2^2 = \|b - U R V^T x\|_2^2 = \|U^T b - R V^T x\|_2^2 = \|g_1 - R_{11} y_1\|_2^2 + \|g_2\|_2^2.$$
Since the sub-vector $y_2$ cannot lower this minimum, it can be chosen arbitrarily and the result follows.
The triangular matrix $R_{11}$ contains all the fundamental information of $A$. The SVD, which we will introduce shortly, is a special case of a complete orthogonal factorization, which is more computationally demanding and involves an iterative part. The most sparse structure that can be obtained by a finite number of orthogonal transformations, the bidiagonal case, is left to be analyzed exhaustively in the chapter on direct numerical methods.
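Since the SVD is a special case of a complete orthogonal factorization (with a diagonal $R_{11}$ holding the nonzero singular values), Theorem 30 can be illustrated with `numpy.linalg.svd`. The sketch below uses a random matrix of known rank (an assumption for illustration) and compares the choice $y_2 = 0$ with an arbitrary $y_2$.

```python
import numpy as np

# Illustrative rank-deficient matrix of rank r (assumption).
rng = np.random.default_rng(4)
m, n, r = 12, 6, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
b = rng.standard_normal(m)

# Full SVD: A = U diag(sv) V^T, used here as the factorization A = U R V^T.
U, sv, Vt = np.linalg.svd(A)
V = Vt.T

g = U.T @ b
g1, g2 = g[:r], g[r:]
y1 = g1 / sv[:r]                      # y_1 = R11^{-1} g_1 with R11 = diag(sv[:r])

x_min = V[:, :r] @ y1                 # y_2 = 0: minimum-norm solution
y2 = rng.standard_normal(n - r)       # any other choice of y_2
x_other = V @ np.concatenate([y1, y2])

res_min = np.linalg.norm(b - A @ x_min)
res_other = np.linalg.norm(b - A @ x_other)
print(res_min, res_other, np.linalg.norm(g2))
```

Both residual norms equal $\|g_2\|_2$, while `x_other` is longer than `x_min` because the columns of $V$ are orthonormal and $y_2 \neq 0$ only adds length.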
Example 31. We return to the linear prediction example from the previous section; this time we compute the complete orthogonal factorization from Theorem 29 and get
$$R_{11} = \begin{pmatrix} 9.027 & 4.690 & 1.193 & 5.626 \\ 0 & 8.186 & 3.923 & 0.373 \\ 0 & 0 & 9.749 & 5.391 \\ 0 & 0 & 0 & 9.789 \end{pmatrix}, \qquad g_1 = \begin{pmatrix} 1.640 \\ 3.792 \\ 6.704 \\ 0.712 \end{pmatrix},$$
and
$$V = \begin{pmatrix} 0.035 & 0.355 & 0.109 & 0.521 & 0.113 & 0.417 & 0.634 \\ 0.582 & 0.005 & 0.501 & 0.103 & 0.310 & 0.498 & 0.237 \\ 0.078 & 0.548 & 0.044 & 0.534 & 0.507 & 0.172 & 0.347 \\ 0.809 & 0.034 & 0.369 & 0.001 & 0.277 & 0.324 & 0.163 \\ 0 & 0.757 & 0.006 & 0.142 & 0.325 & 0.083 & 0.543 \\ 0 & 0 & 0.774 & 0.037 & 0.375 & 0.408 & 0.305 \\ 0 & 0 & 0 & 0.641 & 0.558 & 0.520 & 0.086 \end{pmatrix}.$$
This factorization is not unique and the zeros in $V$ are due to the particular algorithm from [78] used here. The minimum-norm solution is
$$\widehat{x}^* = V \begin{pmatrix} R_{11}^{-1} g_1 \\ 0 \end{pmatrix} = (\, 0.013,\ 0.150,\ 0.028,\ 0.581,\ 0.104,\ 0.566,\ 0.047 \,)^T.$$
This solution, as well as the basic solutions from the previous example, all solve the rank-deficient least squares problem.
Example 32. We will show yet another way to compute the linear prediction coefficients that uses the null space information in the complete orthogonal factorization. In particular, we observe that the last three columns of $V$ span the null space of $A$. If we extract them to form the matrix $V_0$ and compute a QR factorization of its transpose, then we get
$$V_0^T = Q_0 R_0, \qquad R_0^T = \begin{pmatrix} 0.767 & 0 & 0 \\ 0.031 & 0.631 & 0 \\ 0.119 & 0.022 & 0.626 \\ 0.001 & 0.452 & 0.060 \\ 0.446 & 0.001 & 0.456 \\ 0.085 & 0.624 & 0.060 \\ 0.436 & 0.083 & 0.626 \end{pmatrix}.$$
Since $A R_0^T = 0$, we can normalize the last column (by dividing it by its maximum norm) to obtain the null vector
$$v_0 = (\, 0,\ 0,\ 1,\ 0.096,\ 0.728,\ 0.096,\ 1 \,)^T,$$
which is another way to describe the linear dependence between five neighboring columns $a_j$ of $A$. Specifically, this v
0
states that a
7
= 0.096a
6

0.728a
5
0.96a
4
a
3
, and it follows that the LP coecients are x
1
=
0.728, x
2
= 0.096, x
3
= 0.096 and x
4
= 1, which is identical to the
results in Example 28.
Chapter 7
Additional Topics in Least
Squares
In this chapter we collect some more specialized topics, such as problems
with constraints, sensitivity analysis, total least squares and compressed
sensing.
7.1 Constrained linear least squares problems
The inclusion of constraints in the linear least squares problem is often a
convenient way to incorporate a priori knowledge about the problem. Lin-
ear equality constraints are the easiest to deal with, and their solution is
an important part of solving problems with inequality constraints. Bound
constraints and quadratic constraints for linear least squares are also con-
sidered in this chapter.
Least squares with linear constraints
Linear equality constraints (LSE)
The general form of a linear least squares problem with linear equality constraints is as follows: find a vector $x \in \mathbb{R}^n$ that minimizes $\|b - Ax\|_2$ subject to the constraints $C^T x = d$, where $C$ is a given $n \times p$ matrix with $p \le n$ and $d$ is a given vector of length $p$. This results in the LSE problem:
$$\text{Problem LSE:}\quad \min_x \|b - Ax\|_2 \quad\text{subject to}\quad C^T x = d. \tag{7.1.1}$$
A solution exists if $C^T x = d$ is consistent, which is the case if rank($C$) $= p$. For simplicity, we assume that this is satisfied; Björck ([20], pp. 188) explains ways to proceed when there is no a priori knowledge about consistency. Furthermore, the minimizing solution will be unique if the augmented matrix
$$A_{\mathrm{aug}} = \begin{pmatrix} C^T \\ A \end{pmatrix} \tag{7.1.2}$$
has full rank $n$.
The idea behind the different algorithms for the LSE problem is to
reduce it to an unconstrained (if possible lower-dimensional) LSQ problem.
To use Lagrange multipliers is of course an option, but we will instead
describe two more direct methods using orthogonal transformations, each
with different advantages.
One option is to reformulate the LSE problem as a weighted LSQ problem,
assigning large weights (or penalties) for the constraints, thus enforcing
that they are almost satisfied:
$$\min_x \left\| \begin{pmatrix} \lambda C^T \\ A \end{pmatrix} x - \begin{pmatrix} \lambda d \\ b \end{pmatrix} \right\|_2, \qquad \lambda \ \text{large}. \tag{7.1.3}$$
Using the generalized singular value decomposition (GSVD), it can be
proved that if $x(\lambda)$ is the solution to (7.1.3) with weight $\lambda$, then
$\|x(\lambda) - x\|_2 = O(\lambda^{-2})$, so that in fact $x(\lambda) \to x$ as $\lambda \to \infty$.
The LU factorization algorithm from Section 4.2 is well suited to solve this
problem, because $p$ steps of Gaussian elimination will usually produce a
well-conditioned $L$ matrix.
Although in principle only a general LSQ solver is needed, there are
numerical difficulties because the matrix becomes poorly conditioned for
increasing values of $\lambda$. However, a strategy described in [150], based on
Householder transformations combined with appropriate row and column
interchanges, has been found to give satisfactory results. In practice, as [20]
mentions, if one uses the LSE equations in the form (7.1.3), it is sufficient
to apply a QR factorization with column permutations.
To get an idea of the size of the weight $\lambda$ we refer to an example by
van Loan described in [20], pp. 193. One can obtain almost 14 digits of
accuracy with a weight of $\lambda = 10^7$ using a standard QR factorization with
permutations (e.g., MATLAB's function qr), if the constraints are placed
in the first rows. Inverting the order in the equations, though, gives only
10 digits for the same weight $\lambda$, and an increase in $\lambda$ only degrades the
computed solution. In addition, Björck [20] mentions a QR decomposition
based on self-scaling Givens rotations that can be used without the risk of
overshooting the optimal weights.
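The weighting approach can be tried out in a few lines. The following is a minimal sketch, not the pivoted-QR procedure referred to above: it assumes NumPy, uses the SVD-based solver `numpy.linalg.lstsq` as a stand-in for a general LSQ solver, and the function name `lse_by_weighting`, the random test data and the chosen weights are all hypothetical.

```python
import numpy as np

def lse_by_weighting(A, b, C, d, weight=1e7):
    """Approximate the LSE solution by solving the weighted problem (7.1.3)."""
    # Stack the weighted constraints on top of A (constraints in the first rows).
    K = np.vstack([weight * C.T, A])
    rhs = np.concatenate([weight * d, b])
    x, *_ = np.linalg.lstsq(K, rhs, rcond=None)
    return x

# Hypothetical test data: 20 observations, 4 unknowns, constraint sum(x) = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
C = np.ones((4, 1))
d = np.zeros(1)

# Constraint violation |C^T x - d| for increasing weights.
violation = {}
for weight in (1e2, 1e4, 1e7):
    x = lse_by_weighting(A, b, C, d, weight)
    violation[weight] = float(np.abs(C.T @ x - d).max())
    print(f"weight = {weight:.0e}, violation = {violation[weight]:.2e}")
```

The printed violations should shrink roughly like the inverse square of the weight until rounding errors dominate, in line with the $O(\lambda^{-2})$ estimate.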
Another commonly used algorithm directly eliminates some of the variables
by using the constraints. The actual steps to solve problem (7.1.1) are:
1. Compute the QR factorization of $C$ to obtain:
$$C = Q \begin{pmatrix} R_1 \\ 0 \end{pmatrix} \quad \Longleftrightarrow \quad C^T = ( R_1^T \;\; 0 )\, Q^T .$$
2. Use the orthogonal matrix to define new variables:
$$\begin{pmatrix} u \\ v \end{pmatrix} = Q^T x \quad \Longleftrightarrow \quad x = Q \begin{pmatrix} u \\ v \end{pmatrix} \tag{7.1.4}$$
and also to compute the value of the unknown $p$-vector $u$ by solving
the lower triangular system $R_1^T u = d$.
3. Introduce the new variables into the equation
$$\|b - Ax\|_2 = \|b - AQQ^T x\|_2 \equiv \left\| b - \tilde{A} \begin{pmatrix} u \\ v \end{pmatrix} \right\|_2,$$
where the matrix $\tilde{A} = AQ$ is partitioned according to the dimensions
of $\begin{pmatrix} u \\ v \end{pmatrix}$: $\tilde{A} = ( \tilde{A}_1 \;\; \tilde{A}_2 )$, allowing the reduced $(n-p)$-dimensional
LSQ problem to be written as
$$\min_v \| (b - \tilde{A}_1 u) - \tilde{A}_2 v \|_2. \tag{7.1.5}$$
4. Solve the unconstrained, lower-dimensional LSQ problem in (7.1.5)
and obtain the original unknown vector $x$ from (7.1.4).
This approach has the advantage that there are fewer unknowns in each
system that need to be solved for and moreover the reformulated LSQ
problem is better conditioned since, due to the interlacing property of the
singular values, $\operatorname{cond}(\tilde{A}_2) \le \operatorname{cond}(\tilde{A}) = \operatorname{cond}(A)$. The drawback is that
sparsity will be destroyed by this process.
If the augmented matrix in (7.1.2) has full rank, one obtains a unique
solution $x_C^*$; if not, one has to apply a QR factorization with column
permutations while solving problem (7.1.5) to obtain the minimum-length
solution vector.
The method of direct elimination compares favorably with another QR-based
procedure, the null space method (see, for example, [20, 150]), both
in numerical stability and in operational count.
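The four elimination steps can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: it relies on NumPy's full QR factorization (`mode="complete"`), assumes that $C$ has full column rank and that the reduced problem has full rank, and the function name and test data are hypothetical.

```python
import numpy as np

def lse_direct_elimination(A, b, C, d):
    """Solve min ||b - A x||_2 subject to C^T x = d via the QR factorization of C."""
    n, p = C.shape
    # Step 1: C = Q [R1; 0] with Q of size n x n.
    Q, R = np.linalg.qr(C, mode="complete")
    R1 = R[:p, :]
    # Step 2: the constraints determine u from the lower triangular system R1^T u = d.
    u = np.linalg.solve(R1.T, d)
    # Step 3: partition A Q = (A1  A2) according to the sizes of u and v.
    A_tilde = A @ Q
    A1, A2 = A_tilde[:, :p], A_tilde[:, p:]
    # Step 4: solve the reduced problem (7.1.5) and map back with (7.1.4).
    v, *_ = np.linalg.lstsq(A2, b - A1 @ u, rcond=None)
    return Q @ np.concatenate([u, v])

# Hypothetical test data with the single constraint sum(x) = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
C = np.ones((4, 1))
d = np.zeros(1)
x = lse_direct_elimination(A, b, C, d)
print(x, x.sum())
```

In contrast to the weighting approach, the constraints are satisfied here to rounding accuracy, since $u$ is computed from them directly.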
Example 74. Here we return to the linear prediction problem from Example 28,
where we saw that 4 coefficients were sufficient to describe the
particular signal used there. Hence, if we use $n = 4$ unknown LP coefficients,
then we obtain a full-rank problem with a unique solution given
by
x

= ( 1, 0.096, 0.728, 0.096 )


T
.
If we want the prediction scheme in (2.3.3) to preserve the mean of the
predicted signal, then we should add a linear constraint to the LSQ problem,
forcing the prediction coefficients to sum to zero, i.e., $\sum_{i=1}^{4} x_i = 0$. In the
above notation this corresponds to the linear equality constraint
$$C^T x = d, \qquad C^T = (1, 1, 1, 1), \qquad d = 0.$$
Following the direct elimination process described above, this corresponds to
the following steps for the actual problem.
1. Compute the QR factorization of C:
C =
_
_
_
_
1
1
1
1
_
_
_
_
= QR =
_
_
_
_
0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5
_
_
_
_
_
_
_
_
2
0
0
0
_
_
_
_
.
2. Solve R
T
1
u = d 2u = 0 u = 0.
3. Solve min
v
_
_
b

A
2
v
_
_
2
with

A
2
= A
_
_
_
_
0.5 0.5 0.5
0.5 0.5 0.5
0.5 0.5 0.5
0.5 0.5 0.5
_
_
_
_
,
giving v = ( 0.149, 0.755, 0.324 )
T
.
4. Compute the constrained solution
x

C
= Q
_
u
v
_
=
_
_
_
_
0.5 0.5 0.5
0.5 0.5 0.5
0.5 0.5 0.5
0.5 0.5 0.5
_
_
_
_
_
_
0.149
0.755
0.324
_
_
=
_
_
_
_
0.290
0.141
0.465
0.614
_
_
_
_
.
It is easy to verify that the elements of $x_C^*$ sum to zero. Alternatively we can
use the weighting approach in (7.1.3), with the weighted row $\lambda C^T$ added on top of the
matrix $A$; with weights $\lambda = 10^2$, $10^4$ and $10^8$ we obtain solutions almost identical
to the above $x_C^*$, with elements that sum to $1.72 \cdot 10^{-3}$, $1.73 \cdot 10^{-7}$ and
$1.78 \cdot 10^{-15}$, respectively.
Figure 7.1.1: Constrained least squares polynomial fits ($m = 30$, $n = 10$).
The unconstrained fit is shown by the dashed lines, while the constrained
fits are shown by the solid lines. Left: equality constraints $M(x, 0.5) = 0.65$
and $M'(x, 0) = M'(x, 1) = 0$. Right: inequality constraints $M'(x, t_i) \ge 0$
for $i = 1, \ldots, m$.
Linear inequality constraints (LSI)
Instead of linear equality constraints we can impose linear inequality
constraints on the least squares problem. Then the problem to be solved is
$$\text{Problem LSI:} \quad \min_x \|b - Ax\|_2 \quad \text{subject to} \quad l \le C^T x \le u, \tag{7.1.6}$$
where the inequalities are understood to be component-wise.
There are several important cases extensively discussed in [20, 150], and
Fortran subroutines are available from [88]. A good reference for this part
is [91]. A constrained problem may have a minimizer of $\|b - Ax\|_2$ that
is feasible, in which case it can be solved as an unconstrained problem.
Otherwise a constrained minimizer will be located on the boundary of the
feasible region. At such a solution, one or more constraints will be active,
i.e., they will be satisfied with equality.
Thus the solver needs to find which constraints are active at the solution.
If those were known a priori, then the problem could be solved as an equality
constrained one, using one of the methods we discussed above. If not, then
a more elaborate algorithm is necessary to verify the status of the variables
and modify its behavior according to which constraints become active or
inactive at any given time.
As we showed above, a way to avoid all this complication is to use
penalty functions that convert the problem into a sequence of unconstrained
problems. After a long hiatus, this approach has become popular again in
the form of interior point methods [259]. It is, of course, not devoid of its
own complications (principle of conservation of difficulty!).
Example 75. This example illustrates how equality and inequality constraints
can be used to control the properties of the fitting model $M(x, t)$,
using the rather noisy data ($m = 30$) shown in Figure 7.1.1 and a polynomial
of degree 9 (i.e., $n = 10$). In both plots the dashed line shows the
unconstrained fit.
Assume that we require that the model $M(x, t)$ must interpolate the
point $(t_{\mathrm{int}}, y_{\mathrm{int}}) = (0.5, 0.65)$ and have zero derivative at the end points $t = 0$
and $t = 1$, i.e., $M'(x, 0) = M'(x, 1) = 0$. The interpolation requirement
corresponds to the equality constraint
$$( t_{\mathrm{int}}^9, \; t_{\mathrm{int}}^8, \; \ldots, \; t_{\mathrm{int}}, \; 1 )\, x = y_{\mathrm{int}},$$
while the constraint on $M'(x, t)$ has the form
$$( 9t^8, \; 8t^7, \; \ldots, \; 2t, \; 1, \; 0 )\, x = M'(x, t).$$
Hence the matrix and the right-hand side for the linear constraints in the
LSE problem (7.1.1) have the specific form
$$C^T = \begin{pmatrix} t_{\mathrm{int}}^9 & t_{\mathrm{int}}^8 & \cdots & t_{\mathrm{int}}^2 & t_{\mathrm{int}} & 1 \\ 0 & 0 & \cdots & 0 & 1 & 0 \\ 9 & 8 & \cdots & 2 & 1 & 0 \end{pmatrix}, \qquad d = \begin{pmatrix} y_{\mathrm{int}} \\ 0 \\ 0 \end{pmatrix}.$$
The resulting constrained fit is shown as the solid line in the left plot in
Figure 7.1.1.
Now assume instead that we require that the model $M(x, t)$ be monotonically
non-decreasing in the data interval, i.e., that $M'(x, t) \ge 0$ for
$t \in [0, 1]$. If we impose this requirement at the data points $t_1, \ldots, t_m$, we
obtain a matrix $C$ in the LSI problem (7.1.6) of the form
$$C^T = \begin{pmatrix} 9t_1^8 & 8t_1^7 & \cdots & 2t_1 & 1 & 0 \\ 9t_2^8 & 8t_2^7 & \cdots & 2t_2 & 1 & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots \\ 9t_m^8 & 8t_m^7 & \cdots & 2t_m & 1 & 0 \end{pmatrix},$$
while the two vectors with the bounds are $l = 0$ and $u = \infty$. The resulting
monotonic fit is shown as the solid line in the right plot in Figure 7.1.1.
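The equality-constrained part of this example can be reproduced in outline as follows. The sketch assumes NumPy and synthetic noisy data, since the data behind Figure 7.1.1 are not available here; the helper names `poly_constraints` and `solve_lse` are hypothetical, and the solver is the QR-based elimination described earlier in this section.

```python
import numpy as np

def poly_constraints(t_int, y_int, degree=9):
    """Rows of C^T and vector d for M(t_int) = y_int, M'(0) = 0, M'(1) = 0."""
    powers = np.arange(degree, -1, -1)            # 9, 8, ..., 1, 0

    def value_row(t):
        return float(t) ** powers                 # (t^9, ..., t, 1)

    def deriv_row(t):
        # (9 t^8, ..., 2 t, 1, 0); the exponent is clamped at 0 for the constant term.
        return powers * float(t) ** np.maximum(powers - 1, 0)

    Ct = np.vstack([value_row(t_int), deriv_row(0.0), deriv_row(1.0)])
    d = np.array([y_int, 0.0, 0.0])
    return Ct, d

def solve_lse(A, b, Ct, d):
    """Equality-constrained LSQ via the QR factorization of C = Ct^T."""
    p = Ct.shape[0]
    Q, R = np.linalg.qr(Ct.T, mode="complete")
    u = np.linalg.solve(R[:p].T, d)
    AQ = A @ Q
    v, *_ = np.linalg.lstsq(AQ[:, p:], b - AQ[:, :p] @ u, rcond=None)
    return Q @ np.concatenate([u, v])

# Synthetic data (an assumption): m = 30 noisy samples on [0, 1].
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 30)
y = 0.5 + 0.3 * np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(30)

A = np.vander(t, 10)                              # columns t^9, ..., t, 1
Ct, d = poly_constraints(0.5, 0.65)
x = solve_lse(A, y, Ct, d)
print(np.polyval(x, 0.5), np.polyval(np.polyder(x), [0.0, 1.0]))
```

The coefficient ordering (highest power first) matches both `numpy.vander` and `numpy.polyval`, so the rows of `Ct` line up with the constraint rows given in the example.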
Bound constraints
It is worthwhile to discuss the special case of bounded variables, i.e., when
$C = I$ in (7.1.6), also known as box constraints. Starting from a feasible
point (i.e., a vector $x$ that satisfies the bounds), we iterate in the usual way
for an unconstrained nonlinear problem (see Chapter 9), until a constraint
is about to be violated. That identifies a face of the constraint box and a
particular variable whose bound is becoming active. We set that variable
to the bound value and project the search direction into the corresponding
face, in order to continue our search for a constrained minimum on it. See
Figure 7.1.2 for an illustration. If the method used is gradient oriented,
then this technique is called the gradient projection method [210].

Figure 7.1.2: Gradient projection method for a bound-constrained problem
with a solution at the upper right corner. From the starting point $x$ we move
along the search direction $h$ until we hit the active constraint in the third
coordinate, which brings us to the point $x_1$. In the next step we hit the
active constraint in the second coordinate, bringing us to the point $x_2$. The
third step brings us to the solution at $x_3$, in which all three coordinates are
at active constraints.
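A much simplified relative of this idea is sketched below. It is not the face-by-face search described above: it is a plain projected gradient iteration with the fixed step $1/\|A\|_2^2$, in which every gradient step is followed by a projection onto the box. NumPy is assumed, and the function name and the small test problem are hypothetical.

```python
import numpy as np

def bounded_lsq(A, b, lower, upper, iters=2000):
    """Projected gradient iteration for min ||b - A x||_2 s.t. lower <= x <= upper."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / (largest singular value)^2
    x = np.clip(np.zeros(A.shape[1]), lower, upper)   # feasible starting point
    for _ in range(iters):
        gradient = A.T @ (A @ x - b)              # gradient of 0.5 * ||A x - b||^2
        x = np.clip(x - step * gradient, lower, upper)  # step, then project onto the box
    return x

# Small hypothetical problem whose unconstrained minimizer (3, -1) is infeasible.
A = np.diag([1.0, 2.0])
b = np.array([3.0, -2.0])
x = bounded_lsq(A, b, lower=0.0, upper=1.0)
print(x)
```

For this separable test problem both bounds become active and the iteration reaches the corner $(1, 0)$ after two steps; for ill-conditioned problems a fixed-step iteration of this kind can be very slow.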
Example 76. The use of bound constraints can sometimes have a profound
impact on the LSQ solution, as illustrated in this example where we return
to the CT problem from Example 72. Here we add noise $e$ with relative
noise levels $\|e\|_2 / \|b\|_2$ equal to $10^{-3}$, $3 \cdot 10^{-3}$ and $10^{-2}$, and we enforce
non-negativity constraints, i.e., $l = 0$ (and $u = \infty$). Figure 7.1.3 compares
the LSQ solutions (bottom row) with the non-negativity constrained NNLS
solutions (top row), and we see that even for the largest noise level $10^{-2}$
the NNLS solution includes recognizable small features which are lost in
the LSQ solution even for the smaller noise level $3 \cdot 10^{-3}$.

Figure 7.1.3: Reconstructions of the CT problem for three different relative
noise levels $\|e\|_2 / \|b\|_2$. Top: LSQ solutions with non-negativity
constraints. Bottom: standard LSQ solutions.

Sensitivity analysis
Eldén [74] gave a complete sensitivity analysis for problem LSE (7.1.1),
including perturbations of all quantities involved in this problem; here we
specialize his results to the case where only the right-hand side $b$ is perturbed.
Specifically, if $\tilde{x}_C^*$ denotes the solution with the perturbed right-hand side
$b + e$, then Eldén showed that
$$\|x_C^* - \tilde{x}_C^*\|_2 \le \|A_C^\dagger\|_2 \, \|e\|_2, \qquad A_C = A \left( I - (C^\dagger)^T C^T \right).$$
To derive a simpler expression for the matrix $A_C$ consider the QR factorization
$C = Q_1 R_1$ introduced above. Then $C^\dagger = R_1^{-1} Q_1^T$ and it follows
that if $Q = ( Q_1 \;\; Q_2 )$ then $I - (C^\dagger)^T C^T = I - Q_1 Q_1^T = Q_2 Q_2^T$ and
hence $A_C = A Q_2 Q_2^T = \tilde{A}_2 Q_2^T$. Moreover, we have $A_C^\dagger = Q_2 \tilde{A}_2^\dagger$ and thus
$\|A_C^\dagger\|_2 = \|\tilde{A}_2^\dagger\|_2$. The following theorem then follows immediately.
Theorem 77. Let $x_C^*$ and $\tilde{x}_C^*$ denote the solutions to problem LSE (7.1.1)
with right-hand sides $b$ and $b + e$, respectively. Then, neglecting second-order
terms,
$$\frac{\|x_C^* - \tilde{x}_C^*\|_2}{\|x_C^*\|_2} \le \operatorname{cond}(\tilde{A}_2) \, \frac{\|e\|_2}{\|\tilde{A}_2\|_2 \, \|x_C^*\|_2}.$$
This implies that the equality-constrained LS solution is typically less
sensitive to perturbations, since the condition number of $\tilde{A}_2 = A Q_2$ is less
than or equal to the condition number of $A$.
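The inequality between the two condition numbers is easy to check numerically. The sketch below assumes NumPy and uses random test matrices; the sizes and the seed are arbitrary choices.

```python
import numpy as np

# Hypothetical test data: a 30 x 6 matrix A and p = 2 constraints.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 6))
C = rng.standard_normal((6, 2))

# Full QR factorization of C; the last n - p columns of Q form Q2.
Q, _ = np.linalg.qr(C, mode="complete")
Q2 = Q[:, 2:]

cond_A = np.linalg.cond(A)          # 2-norm condition number of A
cond_A2 = np.linalg.cond(A @ Q2)    # condition number of the reduced matrix A Q2
print(cond_A2, "<=", cond_A)
```

Because $Q_2$ has orthonormal columns, the singular values of $A Q_2$ interlace those of $A$, which is what the comparison illustrates.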
Least squares with quadratic constraints (LSQI)
If we add a quadratic constraint to the least squares problem, then we
obtain a problem of the form
$$\text{Problem LSQI:} \quad \min_x \|b - Ax\|_2^2 \quad \text{subject to} \quad \|d - Bx\|_2 \le \gamma, \tag{7.1.7}$$
where $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times n}$. We assume that
$$\operatorname{rank}(B) = r \quad \text{and} \quad \operatorname{rank} \begin{pmatrix} A \\ B \end{pmatrix} = n,$$
which guarantees a unique solution of the LSQI problem. Least squares
problems with quadratic constraints arise in many applications, including
ridge regression in statistics, Tikhonov regularization of inverse problems,
and generalized cross-validation; we refer to [121] for more details.
To facilitate the analysis it is convenient to transform the problem
into diagonal form by using the generalized singular value decomposition
(GSVD) from Section 3.3:
$$U^T A X = D_A, \quad V^T B X = D_B, \quad \tilde{b} = U^T b, \quad \tilde{d} = V^T d, \quad x = X y.$$
The matrices $D_A$, $D_B$ are diagonal with non-negative elements $\alpha_1, \alpha_2, \ldots, \alpha_n$
and $\beta_1, \beta_2, \ldots, \beta_q$, where $q = \min\{p, n\}$ and there are $r$ values $\beta_i > 0$. The
reformulated problem is now
$$\min_y \|\tilde{b} - D_A y\|_2^2 = \min_y \left( \sum_{i=1}^{n} (\alpha_i y_i - \tilde{b}_i)^2 + \sum_{i=n+1}^{m} \tilde{b}_i^2 \right) \tag{7.1.8}$$
subject to
$$\|\tilde{d} - D_B y\|_2^2 = \sum_{i=1}^{r} (\beta_i y_i - \tilde{d}_i)^2 + \sum_{i=r+1}^{p} \tilde{d}_i^2 \le \gamma^2. \tag{7.1.9}$$
A necessary and sufficient condition for the existence of a solution is that
$\sum_{i=r+1}^{p} \tilde{d}_i^2 \le \gamma^2$. The way to solve problem (7.1.8)-(7.1.9) will depend on
the size of the term $\sum_{i=r+1}^{p} \tilde{d}_i^2$.
1. If $\sum_{i=r+1}^{p} \tilde{d}_i^2 = \gamma^2$, the only $y$ that can satisfy the constraints has as
first $r$ elements $y_i = \tilde{d}_i / \beta_i$, $i = 1, \ldots, r$. The remaining elements
$y_i$ are defined from the minimization of (7.1.8). The minimum is
attained if $\sum_{i=1}^{n} (\alpha_i y_i - \tilde{b}_i)^2 = 0$ or is as small as possible, so that for
$i = r + 1, \ldots, n$ we set $y_i = \tilde{b}_i / \alpha_i$ if $\alpha_i \ne 0$ and $y_i = 0$ otherwise.
2. If $\sum_{i=r+1}^{p} \tilde{d}_i^2 < \gamma^2$, one could use the Lagrange multipliers method
directly (cf. Appendix C.2.1), but it is also possible to try a simpler
approach to reach a feasible solution: define the vector $\bar{y}$ that minimizes
(7.1.8), which implies choosing $y_i = \tilde{b}_i / \alpha_i$ if $\alpha_i \ne 0$ as before.
If $\alpha_i = 0$, then try to make the left-hand side of the constraints as
small as possible by defining $y_i = \tilde{d}_i / \beta_i$ if $\beta_i \ne 0$ or else $y_i = 0$.
3. If the resulting solution $\bar{y}$ is feasible, i.e., it satisfies the constraints
(7.1.9), then $x = X \bar{y}$ is the LSQI solution.
4. If not, look for a solution on the boundary of the feasible set, i.e.,
where the constraint is satisfied with equality. That is the standard
form for the use of Lagrange multipliers, so the problem is now, for
$$\Phi(y; \lambda) = \|\tilde{b} - D_A y\|_2^2 + \lambda \left( \|\tilde{d} - D_B y\|_2^2 - \gamma^2 \right), \tag{7.1.10}$$
find $y_\lambda$ and $\lambda$ so that $\nabla \Phi = 0$.
It can be shown that the solution $y_\lambda$ in step 4 satisfies the "normal
equations"
$$(D_A^T D_A + \lambda D_B^T D_B)\, y_\lambda = D_A^T \tilde{b} + \lambda D_B^T \tilde{d}, \tag{7.1.11}$$
where $\lambda$ satisfies the secular equation:
$$\phi(\lambda) = \|\tilde{d} - D_B y_\lambda\|_2^2 - \gamma^2 = 0.$$
An iterative Newton-based procedure can be defined as follows. Starting
from an initial guess $\lambda_0$, at each successive step calculate $y_{\lambda_i}$ from (7.1.11),
then compute a Newton step for the secular equation obtaining a new value
$\lambda_{i+1}$. It can be shown that this iteration is monotonically convergent to a
unique positive root if one starts with an appropriate positive initial value
and if instead of $\phi(\lambda) = 0$ one uses the more convenient form
$$\frac{1}{\|\tilde{d} - D_B y_\lambda\|_2^2} - \frac{1}{\gamma^2} = 0.$$
Therefore the procedure can be used to obtain the unique solution of the
LSQI problem. This is the most stable, but also the most expensive numerical
algorithm. If instead of using the GSVD reformulation one works with
the original equations (7.1.7), an analogous Newton-based method can be
applied, in which the first stage at each step is
$$\min_{x_\lambda} \left\| \begin{pmatrix} A \\ \sqrt{\lambda}\, B \end{pmatrix} x_\lambda - \begin{pmatrix} b \\ \sqrt{\lambda}\, d \end{pmatrix} \right\|_2.$$
Efficient methods for solution of this kind of least squares problem have
been studied for several particular cases; see [20] for details.
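The Newton iteration on the secular equation can be illustrated for the special case $B = I$, $d = 0$, where the GSVD reduces to the ordinary SVD of $A$ and the constraint reads $\|x\|_2 \le \gamma$. The following is a sketch under these assumptions only: it assumes NumPy and a full-rank $A$, the function name `lsqi_ball` is hypothetical, and a simple bracketing safeguard (bisection when a Newton step leaves the current bracket) has been added that is not part of the description above.

```python
import numpy as np

def lsqi_ball(A, b, gamma, tol=1e-12, max_iter=100):
    """min ||b - A x||_2 subject to ||x||_2 <= gamma (special case B = I, d = 0)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b                       # right-hand side in SVD coordinates
    y = beta / s                         # unconstrained minimizer
    if np.linalg.norm(y) <= gamma:
        return Vt.T @ y, 0.0             # the constraint is inactive
    lam, lo, hi = 0.0, 0.0, np.inf
    for _ in range(max_iter):
        y = s * beta / (s**2 + lam)      # solves (D_A^T D_A + lam I) y = D_A^T b
        norm2 = y @ y
        g = 1.0 / norm2 - 1.0 / gamma**2             # "convenient form" of the secular equation
        if abs(g) <= tol / gamma**2:
            break
        if g < 0:
            lo = lam                     # root lies to the right of lam
        else:
            hi = lam                     # root lies to the left of lam
        dnorm2 = -2.0 * np.sum((s * beta) ** 2 / (s**2 + lam) ** 3)
        dg = -dnorm2 / norm2**2          # derivative of g with respect to lam
        lam_new = lam - g / dg           # Newton step
        if not (lo < lam_new < hi):      # safeguard: fall back to the bracket
            lam_new = 2.0 * lo + 1.0 if np.isinf(hi) else 0.5 * (lo + hi)
        lam = lam_new
    return Vt.T @ (s * beta / (s**2 + lam)), lam

# Hypothetical test problem: force the constraint to be active.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
gamma = 0.5 * np.linalg.norm(x_ls)
x, lam = lsqi_ball(A, b, gamma)
print(np.linalg.norm(x), gamma, lam)
```

The returned pair satisfies $(A^T A + \lambda I)\, x = A^T b$, which is (7.1.11) for this special case, with $\|x\|_2 = \gamma$ whenever the constraint is active.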
7.2 Missing data problems
The problem of missing data occurs frequently in scientific research. It
may be the case that in some experimental plan, where observations had to
be made at regular intervals, occasional omissions arise. Examples would
be clinical trials with incomplete case histories, editing of incomplete surveys
or, as in the example given at the end of this section, gene expression
microarray data, where some values are missing. Let us assume that the
missing data are MCAR (missing completely at random), i.e., the probability
that a particular data element is missing is unrelated to its value or
any of the variables. For example, in the case of data arrays, missingness is
independent of the column or row.
The appropriate technique for data imputation (a statistical term, meaning
the estimation of missing values) will depend, among other factors, on
the size of the data set, the proportion of missing values and the type
of missing data pattern. If only a few values are missing, say, 1–5%, one
could use a single regression substitution, i.e., predict the missing values
using linear regression with the available data and assign the predicted
value to the missing score. The disadvantage of this approach is that this
information is only determined from the now reduced available data set.
However, with MCAR data, any subsequent statistical analysis remains unbiased.
This method can be improved by adding to the predicted value a
residual drawn to reflect uncertainty (see [152], Chapter 4).
Other classic processes to fill in the data to obtain a complete set are
as follows:
• Listwise deletion: omit the cases with missing values and work with
the remaining set. It may lead to a substantial decrease in the available
sample size, but the parameter estimates are unbiased.
• Hot deck imputation: replace the missing data with a random value
drawn from the collection of data of similar participants. Although
widely used in some applications, there is scarce information about
the theoretical properties of the method.
• Mean substitution: substitute a mean of the available data for the
missing values. The mean may be formed within classes of data. The
mean of the resulting set is unchanged, but the variance is underestimated.
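The effect of mean substitution on the mean and the variance can be seen in a small experiment. The sketch assumes NumPy and synthetic, normally distributed data with roughly 20% of the values removed completely at random; all numbers are arbitrary choices.

```python
import numpy as np

# Synthetic sample (an assumption) with values missing completely at random.
rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=200)
missing = rng.random(200) < 0.2
data[missing] = np.nan

# Mean substitution: replace every missing value by the mean of the observed ones.
observed_mean = np.nanmean(data)
filled = np.where(np.isnan(data), observed_mean, data)

# Sample variances of the observed values and of the completed data set.
var_observed = np.nanvar(data, ddof=1)
var_filled = filled.var(ddof=1)
print(observed_mean, filled.mean())
print(var_observed, var_filled)
```

The two means coincide, while the variance of the completed set is smaller than that of the observed values, because the substituted entries contribute nothing to the sum of squares but still increase the sample size.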
More computationally intensive approaches based on least squares and
maximum-likelihood principles have been studied extensively in the past
decades and a number of software packages that implement the procedures
have been developed.
Maximum likelihood estimation
These methods rely on probabilistic modeling, where we wish to find the
maximum likelihood (ML) estimate for the parameters of a model, including
both the observed and the missing data.
Expectation-maximization (or EM) algorithm
One ML algorithm is the expectation-maximization algorithm. This algorithm
estimates the model parameters iteratively, starting from some initial
guess of the ML parameters, using, for example, a model for the listwise
deleted data. Then follows a recursion until the parameters stabilize, each
step containing two processes.
• E-step: the distribution of the missing data is estimated given the
observed data and the current estimate of the parameters.
• M-step: the parameters are re-estimated to those with maximum likelihood,
assuming the complete data set generated in the E-step.
Once the iteration has converged, the final maximum likelihood estimates
of the regression coefficients are used to estimate the final missing data.
It has been proved that the method converges, because at each step the
likelihood is non-decreasing, until a local maximum is reached, but the
convergence may be slow and some acceleration method must be applied.
The global maximum can be obtained by starting the iteration several times
with randomly chosen initial estimates.
For additional details see [149, 152, 220]. For software, see the missing
value analysis module of IBM SPSS. Free software such as NORM is also
available from [163].
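The E-step and M-step can be made concrete for one simple model. The sketch below assumes bivariate normal data $(x, y)$ in which $x$ is fully observed and some $y$ values are missing completely at random; this model, the function name `em_bivariate_normal` and the synthetic data are illustrative assumptions, not part of the text above. NumPy is assumed.

```python
import numpy as np

def em_bivariate_normal(x, y, iters=500):
    """EM estimates of (mu_x, mu_y, s_xx, s_xy, s_yy); missing y values are NaN."""
    miss = np.isnan(y)
    mu_x, s_xx = x.mean(), x.var()                 # ML estimates from the complete x
    # Initial guess from the listwise-deleted data.
    mu_y, s_yy = np.nanmean(y), np.nanvar(y)
    s_xy = np.mean((x[~miss] - x[~miss].mean()) * (y[~miss] - mu_y))
    for _ in range(iters):
        # E-step: expected y and y^2 for the missing entries, given x.
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        ey = np.where(miss, mu_y + slope * (x - mu_x), y)
        ey2 = np.where(miss, ey**2 + resid_var, y**2)
        # M-step: re-estimate the parameters from the expected sufficient statistics.
        mu_y = ey.mean()
        s_yy = ey2.mean() - mu_y**2
        s_xy = np.mean(x * ey) - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy

# Synthetic data (an assumption): y depends linearly on x, about 30% of y missing.
rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = 1.0 + 2.0 * x + rng.standard_normal(300)
y[rng.random(300) < 0.3] = np.nan

mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate_normal(x, y)
print(mu_y, s_xy / s_xx)
```

For this particular missing-data pattern the ML estimates are also available in closed form (regression of $y$ on $x$ over the complete cases), which provides a convenient check of the iteration.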
Multiple imputation (MI)
Instead of filling in a single value for each missing value, Rubin's multiple
imputation procedure [152] replaces each missing value with a set of plausible
values that represent the uncertainty about the right value to impute.
Multiple imputation (MI) is a Monte Carlo simulation process in which a
number of full imputed data sets are generated. Statistical analysis is performed
on each set, and the results are combined [68] to produce an overall
analysis that takes the uncertainty of the imputation into consideration.
Depending on the fraction of missing values, a number between 3 and 10
sets must be generated to give good final estimates.
The critical step is the generation of the imputed data set. The choice
of the method used for this generation depends on the type of the missing
data pattern (see, for example, [132]). For monotone patterns, a parametric
regression method can be used, where a regression model is fitted for the
variable with missing values with other variables as covariates (there is a
hierarchy of missingness: if $z_b$ is observed, then $z_a$, for $a < b$, is also
observed). This procedure allows the incorporation of additional information
into the model, for example, to use predictors that one knows are related
to the missing data. Based on the resulting model, a new regression model
is simulated and used to impute the missing values.
For arbitrary missing data patterns, a computationally expensive Markov
Chain Monte Carlo (MCMC) method, based on an assumption of multivariate
normality, can be used (see [220]). MI is robust to minor departures
from the normality assumption, and it gives reasonable results even in the
case of a large fraction of missing data or small size of the samples.
For additional information see [132, 149, 152]. For software see [132].
Least squares approximation
We consider here the data as a matrix $A$ with $m$ rows and $n$ columns, which
is approximated by a low-rank model matrix. The disadvantage is that no
information about the data distribution is included, which may be important
when the data belong to a complex distribution. Two types of methods
are in common use: SVD-based and local least squares imputations.
SVD-based imputation
In this method, the singular value decomposition is used to obtain a set of
orthogonal vectors that are linearly combined to estimate the missing data
values. As the SVD can only be performed on complete matrices, one works
with an auxiliary matrix $A'$ obtained by substituting any missing position
in $A$ by a row average. The SVD of $A'$ is as usual $A' = U \Sigma V^T$. We select
first the $k$ most significant right singular vectors of $A'$.
Then, one estimates the missing $ij$ value of $A$ by first regressing the row
$i$ against the significant right singular vectors and then using a linear
combination of the regression coefficients to compute an estimate for the element
$ij$. (Note that the $j$ components of the right singular vectors are not used in
the regression.) This procedure is repeated until it converges, and an important
point is that the convergence depends on the configuration of the
missing entries.
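A compact version of this iteration might look as follows. It is a sketch under assumptions: NumPy is used, missing entries are marked by NaN, every row contains at least $k$ observed entries, and the function name `svd_impute`, the stopping rule and the test matrix are hypothetical choices.

```python
import numpy as np

def svd_impute(A, k, iters=100, tol=1e-8):
    """Iterative SVD-based imputation; missing entries of A are marked by NaN."""
    mask = np.isnan(A)
    row_means = np.nanmean(A, axis=1, keepdims=True)
    X = np.where(mask, row_means, A)               # auxiliary complete matrix A'
    rows_with_gaps = np.unique(np.nonzero(mask)[0])
    for _ in range(iters):
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Vk = Vt[:k].T                              # k most significant right singular vectors
        X_new = X.copy()
        for i in rows_with_gaps:
            obs = ~mask[i]
            # Regress the observed part of row i on the observed components of Vk.
            coef, *_ = np.linalg.lstsq(Vk[obs], A[i, obs], rcond=None)
            # Predict the missing entries from the remaining components.
            X_new[i, mask[i]] = Vk[mask[i]] @ coef
        change = np.linalg.norm(X_new - X) / np.linalg.norm(X)
        X = X_new
        if change < tol:
            break
    return X

# Hypothetical test: a rank-2 matrix with three entries removed.
rng = np.random.default_rng(0)
truth = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
A = truth.copy()
A[0, 1] = A[3, 4] = A[6, 0] = np.nan
X = svd_impute(A, k=2)
print(np.abs(X - truth).max())
```

Observed entries are never modified; only the positions marked as missing are updated from one sweep to the next.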
Local least squares imputation
We will illustrate the method with the imputation of a DNA microarray.
The so-called DNA microarray, or DNA chip, is a technology used to experiment
with many genes at once. To this end, single strands of complementary
DNA for the genes of interest (which can be many thousands)
are immobilized on spots arranged in a grid ("array") on a support that will
typically be a glass slide, a quartz wafer or a nylon membrane.
The data from microarray experiments are usually in the form of large
matrices of expression levels of genes (gene expression is the process by
which information from a gene is used in the synthesis of a functional gene
product), under different experimental conditions and frequently with some
values missing. Missing values occur for diverse reasons, including insufficient
resolution, image corruption or simply due to dust or scratches on
the slide. Robust missing value estimation methods are needed, since many
algorithms for gene expression analysis require a complete matrix of gene
array values.
One method, developed by Kim, Golub and Park [145] for estimating
the missing values, is a least squares-based imputation approach using the
concept of coherent genes. The method is a local least squares (LLS) algorithm,
since only the genes with high similarity with the target gene, the
one with incomplete values, are used. The coherent genes are identified
using similarity measures based on the $\ell_2$-norm or the Pearson correlation
coefficient. Once identified, two approaches are used to estimate the missing
values, depending on the relative sizes between the number of selected
similar genes and the available experiments:
1. Missing values can be estimated by representing the target gene
with missing values as a linear combination of the similar genes.
2. The target experiment that has missing values can be represented as
a linear combination of related experiments.
Denote with $G \in \mathbb{R}^{m \times n}$ a gene expression data matrix with $m$ genes
and $n$ experiments and assume $m \gg n$. The row $g_i^T$ of $G$ represents
expressions of the $i$th gene in $n$ experiments. Assume for now that only
one value is missing and it corresponds to the first position of the first gene,
$G(1, 1) = g_1(1)$, denoted by $\alpha$ for simplicity.
Now, among the genes with a known first position value, either the $k$
nearest-neighbors gene vectors are located using the $\ell_2$-norm or the $k$ most
similar genes are identified using the Pearson correlation coefficient. Then,
based on these $k$ closest gene vectors, a matrix $A \in \mathbb{R}^{k \times (q-1)}$ and two
vectors $b \in \mathbb{R}^{k}$ and $w \in \mathbb{R}^{q-1}$ are formed. The $k$ rows of the matrix $A$
consist of the $k$ closest gene vectors with their first values deleted; $q$ varies
depending on the number of known values in these similar genes (i.e., the
number of experiments recorded successfully). The elements of $b$ consist of
the first values of these gene vectors, and the elements of $w$ are the $q - 1$
known elements of $g_1$.
When $k < q - 1$, the missing value in the target gene is approximated
using the same-position value of the nearest genes:
1. Solve the local least squares problem $\min_x \|A^T x - w\|_2$.
2. Estimate the missing value as a linear combination of the first values
of the coherent genes: $\alpha = b^T x$.
When $k \ge q - 1$, on the other hand, the missing value is estimated using
the experiments:
1. Solve the local least squares problem $\min_x \|A x - b\|_2$.
2. Estimate the missing value as a linear combination of values of experiments,
not taking into account the first experiment in the gene $g_1$,
i.e., $\alpha = w^T x$.
An improvement is to add weights of similarity for the $k$ nearest neighbors,
leading to weighted LSQ problems. In the actual DNA microarray data,
missing values may occur in several locations. For each missing value the
arrays $A$, $b$ and $w$ are generated and the LLS solved. When building the
matrix $A$ for a missing value, the already estimated values of the gene are
taken into consideration.
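For a single missing value the two variants can be sketched as follows. The code assumes NumPy, a matrix $G$ whose only missing entry is $G(1,1)$ (index `[0, 0]` in Python, stored as NaN), neighbor selection by the $\ell_2$-norm, and no similarity weights; the function name `lls_impute_first_entry` and the test matrices are hypothetical.

```python
import numpy as np

def lls_impute_first_entry(G, k):
    """Estimate the missing value G[0, 0] from the k nearest genes (rows)."""
    w = G[0, 1:]                                   # known part of the target gene
    others = G[1:]
    # k nearest neighbors of the target gene, measured on the known positions.
    dist = np.linalg.norm(others[:, 1:] - w, axis=1)
    nearest = np.argsort(dist)[:k]
    A = others[nearest, 1:]                        # k x (q - 1)
    b = others[nearest, 0]                         # first values of the neighbors
    if k < A.shape[1]:
        # Target gene as a linear combination of similar genes.
        x, *_ = np.linalg.lstsq(A.T, w, rcond=None)
        return b @ x
    # Target experiment as a linear combination of the other experiments.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w @ x

# Case k < q - 1: the target gene is an exact combination of two other genes.
rng = np.random.default_rng(0)
G1 = rng.standard_normal((3, 6))
G1[0] = 0.5 * G1[1] + 0.5 * G1[2]
true_value_1 = G1[0, 0]
G1[0, 0] = np.nan
estimate_1 = lls_impute_first_entry(G1, k=2)

# Case k >= q - 1: the first experiment is an exact combination of the others.
G2 = rng.standard_normal((6, 3))
G2[:, 0] = G2[:, 1:] @ np.array([0.3, -1.2])
true_value_2 = G2[0, 0]
G2[0, 0] = np.nan
estimate_2 = lls_impute_first_entry(G2, k=5)
print(estimate_1, true_value_1, estimate_2, true_value_2)
```

Both test matrices are constructed so that the underlying linear relation is exact, in which case the estimate reproduces the removed value up to rounding.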
An interesting result, based on experiments with data from the Stan-
ford Microarray Database (SMD), is that the most robust missing value
estimation method is the one based on representing a target experiment
that has a missing value as a linear combination of the other experiments.
A program called LSimpute is available from the authors of [145].
There are no theoretical results comparing different imputation algorithms,
but the test results of [145, 238] are consistent and suggest that
the above-described method is more robust for noisy data and less sensitive
to $k$, the number of nearest neighbors used, than the SVD method. The
appropriate choice of the $k$ closest genes is still a matter of trial and error,
although some experiments with random matrices point to thresholds for
it [156]. See also [250].
7.3 Total least squares (TLS)
The assumption used until now for the LSQ problem is that errors are
restricted to the right-hand side $b$, i.e., the linear model is $Ax = b + r$
where $r$ is the residual. A new problem arises when the data matrix $A$ is
also not known exactly, so both $A$ and $b$ have random errors. For simplicity,
the statistical hypothesis will be that the rows of the errors are independent
and have a uniform distribution with zero mean and common variance (a
more general case is treated in [106]). This leads to the total least squares
(TLS) problem of calculating a vector $x_{\mathrm{TLS}}$ so that the augmented residual
matrix $( E \;\; r )$ is minimized:
$$\text{Problem TLS:} \quad \min \|( E \;\; r )\|_F^2 \quad \text{subject to} \quad (A + E)\, x = b + r. \tag{7.3.1}$$
We note that
$$\|( E \;\; r )\|_F^2 = \|E\|_F^2 + \|r\|_2^2.$$
The relation to ordinary least squares problems is as follows:
• In the least squares approximation, one replaces the inconsistent problem
$Ax = b$ by the consistent system $Ax = b'$, where $b'$ is the vector
closest to $b$ in $\operatorname{range}(A)$, obtained by minimizing the orthogonal distance
to $\operatorname{range}(A)$.
• In the total least squares approximation, one goes further and replaces
the inconsistent problem $Ax = b$ by the consistent system
$A' x = b'$, finding the closest $A'$ and $b'$ to $A$ and $b$, respectively, by
minimizing simultaneously the sum of squared orthogonal distances
from the columns $a_i$ to $a_i'$ and $b$ to $b'$.
Example 78. To illustrate the idea behind TLS we consider the following
small problem:
$$A = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
The TLS solution $x_{\mathrm{TLS}} = 1.618$ is obtained with
$$E = \begin{pmatrix} -0.276 \\ 0.447 \end{pmatrix}, \qquad r = \begin{pmatrix} 0.171 \\ -0.276 \end{pmatrix}.$$
Figure 7.3.1 illustrates the geometrical aspects of this simple problem. We
see that the perturbed right-hand side $b + r$ and the perturbed first (and only)
column $A + E$ are both orthogonal projections of $b$ and $A$, respectively, on
the subspace $\operatorname{range}(A + E)$. The perturbations $r$ and $E$ are therefore orthogonal
to $b + r$ and $A + E$, respectively, and they are chosen such that their lengths
are minimal. Then $x_{\mathrm{TLS}}$ is the ratio between the lengths of these two projections.
Figure 7.3.1: Illustration of the geometry underlying the LSQ problem
(left) and the TLS problem (right). The LSQ solution $x^*$ is chosen such that
the residual $b' - b$ (the dashed line) is orthogonal to the vector $b' = A x^*$.
The TLS solution $x_{\mathrm{TLS}}$ is chosen such that the residuals $b' - b$ and $A' - A$
(the dashed lines) are orthogonal to $b' = A' x_{\mathrm{TLS}}$ and $A'$, respectively.
We will only consider the case when $\operatorname{rank}(A) = n$ and $b \notin \operatorname{range}(A)$.
The trivial case when $b \in \operatorname{range}(A)$ means that the system is consistent,
$( E \;\; r ) = 0$ and the TLS solution is identical to the LSQ solution.
We note that the rank-deficient matrix case, $\operatorname{rank}(A) < n$, has in principle
only a solution in the trivial case $b \in \operatorname{range}(A)$. In the general case
it is treated by reducing the TLS problem to a smaller, full-rank problem
using column subset selection techniques. More details are given in ([239],
section 3.4). The work of Paige and Strakoš [180] on core problems is also
relevant here. The following example taken from [105], p. 595, illustrates
this case.
Example 79. Consider the problem defined by
$$A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.$$
The rank of $A$ is 1 and the matrix is rank deficient, with $b \notin \operatorname{range}(A)$, so
there is no solution of the problem as is. Note that
$$E_\varepsilon = \begin{pmatrix} 0 & 0 \\ 0 & \varepsilon \\ 0 & \varepsilon \end{pmatrix}$$
is a perturbation such that for any $\varepsilon \ne 0$ we have $b \in \operatorname{range}(A + E_\varepsilon)$, but
there is no smallest $\|E_\varepsilon\|_F$.
The TLS problem and the singular value decomposition
Let us rewrite the TLS problem as a system:
$$( A \;\; b ) \begin{pmatrix} x \\ -1 \end{pmatrix} + ( E \;\; r ) \begin{pmatrix} x \\ -1 \end{pmatrix} = 0,$$
or equivalently
$$C z + F z = 0 \quad \text{with} \quad z = \begin{pmatrix} x \\ -1 \end{pmatrix},$$
where $C = ( A \;\; b )$ and $F = ( E \;\; r )$ are matrices of size $m \times (n + 1)$.
We seek a solution $z$ to the homogeneous equation
$$(C + F)\, z = 0 \tag{7.3.2}$$
that minimizes $\|F\|_F^2$. For the TLS problem to have a non-trivial solution
$z$, the matrix $C + F$ must be singular, i.e., $\operatorname{rank}(C + F) < n + 1$. To attain
this, the SVD
$$C = ( A \;\; b ) = U \Sigma V^T = \sum_{i=1}^{n+1} \sigma_i u_i v_i^T$$
can be used to determine an acceptable perturbation matrix $F$. The singular
values of $A$, here denoted by $\sigma_1' \ge \sigma_2' \ge \cdots \ge \sigma_n'$, will also enter in the
discussion.
The solution technique is easily understood in the case when the singular values of $C$ satisfy $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n > \sigma_{n+1}$, i.e., when the smallest singular value is isolated. Since we are considering the full-rank, non-trivial case, we also have $\sigma_{n+1} > 0$.
From the Eckart-Young-Mirsky theorem (Theorem 43), the matrix nearest to $C$ with a rank lower than $n + 1$ is at distance $\sigma_{n+1}$ from $C$, and it is given by $\sum_{i=1}^{n} \sigma_i u_i v_i^T$. Thus, selecting $C + F = \sum_{i=1}^{n} \sigma_i u_i v_i^T$ implies choosing a perturbation $F = -\sigma_{n+1} u_{n+1} v_{n+1}^T$ with minimal norm: $\|F\|_F = \sigma_{n+1}$.
The solution $z$ is now constructed using the fact that the $v_i$ are orthonormal, and therefore $(C + F)\, v_{n+1} = 0$. Thus, a general solution of the TLS problem is obtained by scaling the right singular vector $v_{n+1}$ corresponding to the smallest singular value, in order to enforce the condition that $z_{n+1} = -1$:
\[ z_i = -\frac{v_{i,n+1}}{v_{n+1,n+1}}, \quad i = 1, 2, \ldots, n + 1. \tag{7.3.3} \]
This is possible provided that $v_{n+1,n+1} \neq 0$. If $v_{n+1,n+1} = 0$, a solution does not exist and the problem is called nongeneric.
A theorem proved in [106] gives sufficient conditions for the existence of a unique solution. If the smallest singular value $\sigma'_n$ of the full-rank matrix $A$ is strictly larger than the smallest singular value $\sigma_{n+1}$ of $(A \;\, b) = C$, then $v_{n+1,n+1} \neq 0$ and a unique solution exists.
Theorem 80. Denote by $\sigma_1, \ldots, \sigma_{n+1}$ the singular values of the augmented matrix $(A \;\, b)$ and by $\sigma'_1, \ldots, \sigma'_n$ the singular values of the matrix $A$. If $\sigma'_n > \sigma_{n+1}$, then there is a TLS correction matrix $(E \;\, r) = -\sigma_{n+1} u_{n+1} v_{n+1}^T$ and a unique solution vector
\[ x_{\mathrm{TLS}} = -(v_{1,n+1}, \ldots, v_{n,n+1})^T / v_{n+1,n+1} \tag{7.3.4} \]
that solves the TLS problem.
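Moreover, there are closed-form expressions for the solution:
\[ x_{\mathrm{TLS}} = (A^T A - \sigma_{n+1}^2 I)^{-1} A^T b \]
and the residual norm:
\[ \|A x_{\mathrm{TLS}} - b\|_2^2 = \sigma_{n+1}^2 \left( 1 + \sum_{i=1}^{n} \frac{(u_i^T b)^2}{(\sigma'_i)^2 - \sigma_{n+1}^2} \right). \]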
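The SVD construction of Theorem 80 can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a generic, full-rank problem; the function name `tls`, the test data and the tolerance used to detect the nongeneric case are assumptions made here, not part of the text. The sketch compares the SVD-based solution (7.3.4) with the closed-form expression $(A^T A - \sigma_{n+1}^2 I)^{-1} A^T b$.

```python
import numpy as np

def tls(A, b):
    """Total least squares solution from the SVD of C = (A b), cf. (7.3.4)."""
    n = A.shape[1]
    C = np.column_stack([A, b])
    _, s, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[-1, :]                      # right singular vector for sigma_{n+1}
    if abs(v[n]) < 1e-12:              # assumed tolerance for the nongeneric case
        raise ValueError("nongeneric TLS problem: v_{n+1,n+1} is numerically zero")
    x = -v[:n] / v[n]                  # scale so that the last component of z is -1
    return x, s[-1]

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
m, n = 30, 3
A = rng.standard_normal((m, n))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.standard_normal(m)

x_tls, sigma = tls(A, b)

# Closed-form expression for the same solution.
x_closed = np.linalg.solve(A.T @ A - sigma**2 * np.eye(n), A.T @ b)

print("x_TLS (SVD)        :", x_tls)
print("x_TLS (closed form):", x_closed)
```

The SVD route is the one to prefer in practice; the closed form involves $A^T A - \sigma_{n+1}^2 I$, which is worse conditioned than $A^T A$.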
The following example, taken from p. 179 in [20], illustrates the difficulty associated with the nongeneric case in terms of the SVDs.
Example 81. Consider the augmented matrix
\[ (A \;\, b) = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}, \quad \text{where} \quad A = \begin{pmatrix} 1 \\ 0 \end{pmatrix}. \]
The smallest singular value of $A$ is $\sigma'_1 = 1$, and that of the augmented matrix is $\sigma_2 = 1$. So here $\sigma'_n = \sigma_{n+1}$ and $v_{n+1,n+1} = 0$, and therefore no solution exists. The formal algorithm would choose the perturbation matrix from the dyadic form $(A + E \;\, b + r) = \sum_{i=1}^{n} \sigma_i u_i v_i^T$; then
\[ (A + E \;\, b + r) = \begin{pmatrix} 0 & 0 \\ 0 & 2 \end{pmatrix} \]
and it is easy to see that $b + r \notin \mathrm{range}(A + E)$, and there is no solution.
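The nongeneric situation of Example 81 is easy to reproduce. The sketch below, assuming NumPy, computes the singular values of $A$ and of $(A \;\, b)$ and inspects the last component of the right singular vector belonging to the smallest singular value; the variable names are illustrative only.

```python
import numpy as np

C = np.array([[1.0, 0.0], [0.0, 2.0]])   # augmented matrix (A b) of Example 81
A = C[:, :1]                              # first column only

sigma_A = np.linalg.svd(A, compute_uv=False)   # singular values of A
_, sigma_C, Vt = np.linalg.svd(C)              # SVD of (A b)
v_last = Vt[-1, :]                             # right singular vector for sigma_{n+1}

print("smallest singular value of A    :", sigma_A[-1])
print("smallest singular value of (A b):", sigma_C[-1])
print("v_{n+1,n+1}                     :", v_last[-1])
```

Since the last component of `v_last` vanishes, the scaling in (7.3.3) is not possible.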
There are further complications with the nongeneric case, but we shall not go into these issues here, primarily because they seem to have been resolved by the recent work of Paige and Strakoš [180].
In [239, p. 86], conditions for the existence and uniqueness of solutions are given, as well as closed-form expressions for these solutions. In [239, p. 87], a general algorithm for the TLS problem is described. It solves any generic and nongeneric TLS problem, including the case with multiple right-hand sides. The software developed by Van Huffel is available from Netlib [263]. Subroutine PTLS solves the TLS problem by using the partial singular value decomposition (PSVD) mentioned in Section 5.6, thereby improving considerably the computational efficiency with respect to the classical TLS algorithm. A large-scale algorithm based on Rayleigh quotient iteration is described in [25]. See [155] for a recent survey of TLS methods.
Figure 7.3.2: Illustration of LSQ and TLS for the case $n = 1$. The $i$th data point is $(a_i, b_i)$ and the point vertically above it on the line is $(a_i, a_i x)$, hence the vertical distance is $r_i = a_i x - b_i$. The orthogonal distance to the line is $d_i = (x a_i - b_i)/\sqrt{x^2 + 1}$.
Geometric interpretation
It is possible to give a geometric interpretation of the difference between the fits using the least squares and total least squares methods. Define the rows of the matrix $(A \;\, b)$ as the $m$ points $c_i = (a_{i1}, \ldots, a_{in}, b_i)^T \in \mathbb{R}^{n+1}$, to which we try to fit the linear subspace $\mathcal{S}_x$ of dimension $n$ that is orthogonal to the vector $z = \begin{pmatrix} x \\ -1 \end{pmatrix}$:
\[ \mathcal{S}_x = \mathrm{span}\left\{ \begin{pmatrix} x \\ -1 \end{pmatrix} \right\}^{\perp} = \left\{ \begin{pmatrix} a \\ b \end{pmatrix} \;\middle|\; a \in \mathbb{R}^n,\ b \in \mathbb{R},\ b = x^T a \right\}. \]
In LSQ, one minimizes
\[ \left\| (A \;\, b) \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2^2 = \sum_{k=1}^{m} (c_k^T z)^2, \qquad z = \begin{pmatrix} x \\ -1 \end{pmatrix}, \]
which measures the sum of squared distances along the $(n+1)$st coordinate axis from the points $c_k$ to the subspace $\mathcal{S}_x$ (already known as the residuals $r_i$). For the case $n = 1$, where $A = (a_1, a_2, \ldots, a_m)^T$ and $x$ is a single unknown, the LSQ approximation minimizes the vertical distance of the points $(a_i, b_i)$ to the line through the origin with slope $x$ (whose normal vector is $(x, -1)^T$); see Figure 7.3.2.
The TLS approach can be formulated as an equivalent problem. From the SVD of $(A \;\, b)$ and the definition of the matrix 2-norm we have
\[ \| (A \;\, b)\, z \|_2 / \|z\|_2 \geq \sigma_{n+1} \]
for any nonzero $z$. If $\sigma_{n+1}$ is isolated and the vector $z$ has unit norm, then equality is attained exactly for $z = \pm v_{n+1}$, in which case
\[ \| (A \;\, b)\, z \|_2^2 = \sigma_{n+1}^2. \]
So, in fact the TLS problem consists of finding a vector $x \in \mathbb{R}^n$ such that
\[ \frac{\left\| (A \;\, b) \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2^2}{\left\| \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2^2} = \frac{\|A x - b\|_2^2}{\|x\|_2^2 + 1} = \sigma_{n+1}^2. \tag{7.3.5} \]
The left-hand quantity is $\sum_{i=1}^{m} \frac{(a_i^T x - b_i)^2}{x^T x + 1}$, where the $i$th term is the square of the true Euclidean (or orthogonal) distance from $c_i$ to the nearest point in the subspace $\mathcal{S}_x$. Again, see Figure 7.3.2 for an illustration in the case $n = 1$.
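For $n = 1$ the two criteria can be compared directly. The following is a minimal sketch, assuming NumPy and synthetic data invented for illustration; the function names `sum_sq_vertical` and `sum_sq_orthogonal` are hypothetical. It fits a line through the origin with both methods and evaluates the vertical and orthogonal sums of squares.

```python
import numpy as np

# Synthetic points (a_i, b_i) scattered around a line through the origin.
rng = np.random.default_rng(1)
a = np.linspace(-1.0, 1.0, 21)
b = 0.8 * a + 0.1 * rng.standard_normal(a.size)

def sum_sq_vertical(x):
    """Sum of squared vertical distances r_i = a_i x - b_i (LSQ criterion)."""
    return np.sum((a * x - b) ** 2)

def sum_sq_orthogonal(x):
    """Sum of squared orthogonal distances d_i, cf. (7.3.5) (TLS criterion)."""
    return np.sum((a * x - b) ** 2) / (x**2 + 1.0)

# LSQ slope from the normal equations for a single unknown.
x_lsq = (a @ b) / (a @ a)

# TLS slope from the right singular vector of (A b) for the smallest singular value.
C = np.column_stack([a, b])
_, s, Vt = np.linalg.svd(C, full_matrices=False)
v = Vt[-1]
x_tls = -v[0] / v[1]

print("LSQ slope:", x_lsq, " TLS slope:", x_tls)
print("orthogonal sum of squares at x_TLS:", sum_sq_orthogonal(x_tls))
print("sigma_{n+1}^2                     :", s[-1] ** 2)
```

Each slope is optimal for its own criterion, and the orthogonal sum of squares at the TLS slope equals $\sigma_{n+1}^2$, as stated in (7.3.5).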
The equivalent TLS formulation $\min_x \|A x - b\|_2^2 / (\|x\|_2^2 + 1)$ is convenient for regularizing the TLS problem additively; see next section. For more details see Section 10.2.
Further aspects of TLS problems
The sensitivity of the TLS problem
A study of the sensitivity of the TLS problem when there is a unique solution, i.e., when $\sigma'_n > \sigma_{n+1}$, is given in [106]. The starting point for the analysis is the formulation of the TLS problem as an eigenvalue problem for $C^T C = \sum_{i=1}^{n+1} v_i \sigma_i^2 v_i^T$ (i.e., $\sigma_i^2$ are the eigenvalues of $C^T C$ and $v_i$ are the corresponding eigenvectors). The main tool used is the singular value interlacing property, Theorem 4 from Appendix B. Thus, if $x \in \mathbb{R}^n$ and
\[ C^T C \begin{pmatrix} x \\ -1 \end{pmatrix} = \sigma_{n+1}^2 \begin{pmatrix} x \\ -1 \end{pmatrix}, \]
then $x$ solves the TLS problem.
One interesting result is that the total least squares problem can be considered as a de-regularization process, which is apparent when looking at the first block row of the above eigenvalue problem:
\[ (A^T A - \sigma_{n+1}^2 I)\, x_{\mathrm{TLS}} = A^T b. \]
This implies that the TLS problem is worse conditioned than the associated LSQ problem.
An upper bound for the difference of the LSQ and TLS solutions is
\[ \frac{\|x_{\mathrm{TLS}} - x^*\|_2}{\|x^*\|_2} \leq \frac{\sigma_{n+1}^2}{(\sigma'_n)^2 - \sigma_{n+1}^2}, \]
so that $(\sigma'_n)^2 - \sigma_{n+1}^2$ measures the sensitivity of the TLS problem. This suggests that the TLS solution is unstable (the bounds are large) if $\sigma'_n$ is close to $\sigma_{n+1}$, for example, if $\sigma_{n+1}$ is (almost) a multiple singular value. Stewart [231] proves that up to second-order terms in the error, $x_{\mathrm{TLS}}$ is insensitive to column scalings of $A$. Thus, unlike ordinary LSQ problems, TLS cannot be better conditioned by column scalings.
The mixed LS-TLS problem
Suppose that only some of the columns of $A$ have errors. This leads to the model
\[ \min_{E_2,\, r} \left( \|E_2\|_F^2 + \|r\|_2^2 \right) \quad \text{subject to} \quad (A_1 \;\, A_2 + E_2)\, x = b + r, \tag{7.3.6} \]
with $A_1 \in \mathbb{R}^{m \times n_1}$ and $A_2, E_2 \in \mathbb{R}^{m \times n_2}$. This mixed LS-TLS problem formulation encompasses the LSQ problem (when $n_2 = 0$), the TLS problem (when $n_1 = 0$), as well as a combination of both.
How can one find the solution to (7.3.6)? The basic idea, developed in [95], is to use a QR factorization of $(A \;\, b)$ to transform the problem to a block structure, thereby reducing it to the solution of a smaller TLS problem that can be solved independently, plus an LSQ problem. First, compute the QR factorization $(A \;\, b) = QR$ with
\[ R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}, \qquad R_{11} \in \mathbb{R}^{n_1 \times n_1}, \quad R_{22} \in \mathbb{R}^{(n_2+1) \times (n_2+1)}. \]
Note that $\|F\|_F = \|Q^T F\|_F$ and therefore the constraints on the original and transformed systems are equivalent. Now compute the SVD of $R_{22}$ and let $z_2 = v_{n_2+1}$ be the right singular vector corresponding to the smallest singular value. We can then set $F_{12} = 0$ and solve the standard least squares problem
\[ R_{11} z_1 = -R_{12} z_2. \]
Note that, as the TLS part of $z$ has been obtained already and therefore $F_{22}$ is minimized, there is no loss of generality when $F_{12} = 0$. Finally, compute the mixed LS-TLS solution as
\[ x = -z(1{:}n)/z(n+1), \qquad z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}. \]
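The QR-plus-SVD procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: NumPy is available, $m \geq n + 1$, $R_{11}$ is nonsingular and the TLS subproblem is generic. The function name `mixed_ls_tls` and the test data are hypothetical.

```python
import numpy as np

def mixed_ls_tls(A1, A2, b):
    """Mixed LS-TLS solution: columns of A1 are exact, columns of A2 carry errors."""
    n1, n2 = A1.shape[1], A2.shape[1]
    n = n1 + n2
    C = np.column_stack([A1, A2, b])
    R = np.linalg.qr(C, mode="r")            # (n+1) x (n+1) triangular factor
    R11, R12, R22 = R[:n1, :n1], R[:n1, n1:], R[n1:, n1:]
    # TLS part: right singular vector of R22 for its smallest singular value.
    _, _, Vt = np.linalg.svd(R22)
    z2 = Vt[-1, :]
    # LS part: solve the square triangular system R11 z1 = -R12 z2.
    z1 = np.linalg.solve(R11, -R12 @ z2) if n1 > 0 else np.zeros(0)
    z = np.concatenate([z1, z2])
    return -z[:n] / z[n]                      # scale so the last component is -1

# Illustrative data: an exact intercept column and two noisy columns.
rng = np.random.default_rng(2)
m = 25
A1 = np.ones((m, 1))
A2_exact = rng.standard_normal((m, 2))
b = A1 @ np.array([0.5]) + A2_exact @ np.array([1.0, -2.0])
A2 = A2_exact + 0.02 * rng.standard_normal((m, 2))
b = b + 0.02 * rng.standard_normal(m)

x_mixed = mixed_ls_tls(A1, A2, b)
print("mixed LS-TLS solution:", x_mixed)
```

Passing an empty `A2` (zero columns) reduces the routine to ordinary least squares, and an empty `A1` reduces it to the TLS solution, in line with the remark that (7.3.6) encompasses both problems.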
7.4 Convex optimization
In an earlier section of this chapter we have already met a convex optimization problem: least squares with quadratic constraints. But there are many more problems of interest and some of them are related to the subject of this book.
Convex optimization has gained much attention in recent years, thanks to advances in algorithms to solve such problems. Also, the efforts of Stephen Boyd and his group at Stanford have helped to popularize the subject and provide invaluable software to solve problems of all sizes. One can say now that convex optimization problems have reached the point where they can be solved as assuredly as the basic linear algebra problems of the past.
One of the reasons for the success is that, although nonlinear, strictly convex problems have unique solutions. General nonlinear programming problems can have goal landscapes that are very bumpy and therefore it can be hard to find a desired optimal point and even harder to find a global one.
An important contribution in recent times has been in producing tools to identify convexity and to find formulations of problems that are convex. These problems look like general nonlinear programming problems, but the objective function $f(x)$ and the constraints $g_i(x)$ and $h_i(x)$ are required to be convex (for the equality constraints this means that $h_i$ is affine):
\[ \begin{array}{ll} \min & f(x) \\ \text{subject to} & g_i(x) \leq 0, \quad i = 1, \ldots, m, \\ & h_i(x) = 0, \quad i = 1, \ldots, p. \end{array} \]
In the past 30 years several researchers discovered that adding a property (self-concordance) to these problems makes it possible to use interior point methods via Newton iterations that are guaranteed to converge in polynomial time [167]. That was the breakthrough that allows us now to solve large problems in a reasonable amount of computer time and has provided much impetus to the application of these techniques to an ever increasing set of engineering and scientific problems.
A notable effort stemming from the research group of S. Boyd at Stanford is M. Grant's cvx MATLAB package, available from http://cvxr.com. Van den Berg and Friedlander [240, 241] have also made some interesting contributions, including efficient software for the solution of medium- and large-scale convex optimization problems; we also mention the Python software cvxopt by Dahl and Vandenberghe [56]. A comprehensive book on the subject is [30].
7.5 Compressed sensing
The theory and practice of compressed sensing has become an important subject in the past few years. Compressed sensing is predicated on the fact that we often collect too much data and then compress it, i.e., a good deal of that data is unnecessary in some sense. Think of a mega-pixel digital photograph and its JPEG compressed version. Although this is a lossy compression scheme, going back and forth allows for almost perfect reconstruction of the original image. Compressed sensing addresses the question: is it possible to collect much less data and use reconstruction algorithms that provide a perfect image? The answer in many cases is yes.
The idea of compressed sensing is that solution vectors can be represented by far fewer coefficients if an appropriate basis is used, i.e., they are sparse in such a basis, say, of wavelets [164]. Following [34, 35] we consider the general problem of reconstructing a vector $x \in \mathbb{R}^N$ from linear measurements, $y_k = \varphi_k^T x$ for $k = 1, \ldots, K$. Assembling all these measurements in matrix form we obtain the relation $y = \Phi x$, where $y \in \mathbb{R}^K$ and the $\varphi_k^T$ are the rows of the matrix $\Phi$.
The important next concept is that compressed sensing attempts to reconstruct the original vector $x$ from a number of samples smaller than what the Nyquist-Shannon theorem requires. The latter theorem states that if a signal contains a maximum frequency $f$ (i.e., a minimum wavelength $w = 1/f$), then in order to reconstruct it exactly it is necessary to have samples in the time domain that are spaced no more than $1/(2f)$ units apart. The key to overcoming this restriction in the compressed sensing approach is to use nonlinear reconstruction methods that combine $\ell_2$ and $\ell_1$ terms.
We emphasize that in this approach the system $y = \Phi x$ is underdetermined, i.e., the matrix $\Phi$ has more columns than rows. It turns out that if one assumes that the vector $x$ is sparse in some basis $B$, then solving the following convex optimization problem:
\[ \min_{\tilde{x}} \|\tilde{x}\|_1 \quad \text{subject to} \quad \Phi B \tilde{x} = y, \tag{7.5.1} \]
reconstructs $x = B \tilde{x}$ exactly with high probability. The precise statement is as follows.
Theorem 82. Assume that $x$ has at most $S$ nonzero elements and that we take $K$ random linear measurements, where $K$ satisfies
\[ K \geq C \cdot S \cdot \log N, \]
where $C = 22(\delta + 1)$ and $\delta > 0$. Then, solving (7.5.1) reconstructs $x$ exactly with probability greater than $1 - O(N^{-\delta})$.
The secret is that the Nyquist-Shannon theorem talks about a whole band of frequencies, from zero to the highest frequency, while here we are considering a sparse set of frequencies that may be in small disconnected intervals. Observe that for applications of interest, the dimension $N$ in problem (7.5.1) can be very large, and therein resides its complexity. This problem is convex and can be reduced to a linear programming one. For large $N$ it is proposed to use an interior-point method that basically solves the optimality conditions by Newton's method, taking care of staying feasible throughout the iteration. E. Candès (Caltech, Stanford) offers a number of free software packages for this type of application. Some very good recent references are [240, 241].
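The reduction of the $\ell_1$ problem to a linear program can be written out explicitly. The following is the standard reformulation, given here as a sketch; the auxiliary vector $u$ and the symbols $\Phi$, $B$ and $\tilde{x}$ for the measurement matrix, the basis and the coefficient vector are notation assumed for illustration.

```latex
% Bound each |\tilde{x}_i| by an auxiliary variable u_i and minimize the sum of the bounds.
\begin{array}{ll}
  \displaystyle \min_{\tilde{x},\, u} & \displaystyle \sum_{i=1}^{N} u_i \\[2ex]
  \text{subject to} & -u_i \leq \tilde{x}_i \leq u_i, \quad i = 1, \ldots, N, \\[1ex]
                    & \Phi B \tilde{x} = y .
\end{array}
% At the optimum u_i = |\tilde{x}_i|, so the objective equals \|\tilde{x}\|_1.
```

The objective and all constraints are linear in the $2N$ unknowns $(\tilde{x}, u)$, which is what makes interior-point methods for linear programming applicable.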
As usual, in the presence of noise and even if the constraints in (7.5.1) are theoretically satisfied, it is more appropriate to replace that problem by $\min_{\tilde{x}} \|\tilde{x}\|_1$ subject to $\|\Phi B \tilde{x} - y\|_2 \leq \varepsilon$, where the positive parameter $\varepsilon$ is an estimate of the data noise level.
It is interesting to notice the connection with the basic solution of rank-deficient problems introduced in Section 2.3. Already in 1964 J. B. Rosen wrote a paper [209] about its calculation, and in [193] an algorithm and implementation for calculating the minimal and basic solutions was described. That report contains extensive numerical experimentation and a complete Algol program. Unfortunately, the largest matrix size treated there was 40 and therefore, without further investigation, that algorithm may not be competitive today for these large problems. Basic solutions are used in some algorithms as initial guesses to solve this problem, where one generally expects many fewer nonzero elements in the solution than the rank of the matrix.
Chapter 8
Nonlinear Least Squares
Problems
So far we have discussed data fitting problems in which all the unknown parameters appear linearly in the fitting model $M(x, t)$, leading to linear least squares problems for which we can, in principle, write down a closed-form solution. We now turn to nonlinear least squares problems (NLLSQ) for which this is not true, due to (some of) the unknown parameters appearing nonlinearly in the model.
8.1 Introduction
To make precise what we mean by the term "nonlinear" we give the following definitions. A parameter $\alpha$ of the function $f$ appears nonlinearly if the derivative $\partial f / \partial \alpha$ is a function of $\alpha$. A parameterized fitting model $M(x, t)$ is nonlinear if at least one of the parameters in $x$ appears nonlinearly. For example, in the exponential decay model
\[ M(x_1, x_2, t) = x_1 e^{-x_2 t} \]
we have $\partial M / \partial x_1 = e^{-x_2 t}$ (which is independent of $x_1$, so $M$ is linear in it) and $\partial M / \partial x_2 = -t\, x_1 e^{-x_2 t}$ (which depends on $x_2$), and thus $M$ is a nonlinear model with the parameter $x_2$ appearing nonlinearly. We start with a few examples of nonlinear models.
Example 83. Gaussian model. All measuring devices somehow influence the signal that they measure. If we measure a time signal $g_{\mathrm{mes}}$, then we can often model it as a convolution of the true signal $g_{\mathrm{true}}(t)$ and an instrument response function $\Gamma(t)$:
\[ g_{\mathrm{mes}}(t) = \int_{-\infty}^{\infty} \Gamma(t - \tau)\, g_{\mathrm{true}}(\tau)\, d\tau. \]
A very common model for $\Gamma(t)$ is the non-normalized Gaussian function
\[ M(x, t) = x_1 e^{-(t - x_2)^2 / (2 x_3^2)}, \tag{8.1.1} \]
where $x_1$ is the amplitude, $x_2$ is the time shift and $x_3$ determines the width of the Gaussian function. The parameters $x_2$ and $x_3$ appear nonlinearly in this model. The Gaussian model also arises in many other data fitting problems.
Example 84. Rational model. Another model function that often arises in nonlinear data fitting problems is the rational function, i.e., a quotient of two polynomials of degree $p$ and $q$, respectively,
\[ M(x, t) = \frac{P(t)}{Q(t)} = \frac{x_1 t^p + x_2 t^{p-1} + \cdots + x_p t + x_{p+1}}{x_{p+2} t^q + x_{p+3} t^{q-1} + \cdots + x_{p+q+1} t + x_{p+q+2}}, \tag{8.1.2} \]
with a total of $n = p + q + 2$ parameters.
Rational models arise, e.g., in parameter estimation problems such as chemical reaction kinetics, signal analysis (through Laplace and z transforms), system identification and in general as a transfer function for a linear time-invariant system (similar to the above-mentioned instrument response function). In these models, the coefficients of the two polynomials or, equivalently, their roots, characterize the dynamical behavior of the system.
Rational functions are also commonly used as empirical data approximation functions. Their advantage is that they have a broader scope than polynomials, yet they are still simple to evaluate.
The basic idea of nonlinear data fitting is the same as described in Chapter 1, the only difference being that we now use a fitting model $M(x, t)$ in which (some of) the parameters $x$ appear nonlinearly and that leads to a nonlinear optimization problem. We are given noisy measurements $y_i = \Gamma(t_i) + e_i$ for $i = 1, \ldots, m$, where $\Gamma(t)$ is the pure-data function that gives the true relationship between $t$ and the noise-free data, and we seek to identify the model parameters $x$ such that $M(x, t)$ gives the best fit to the noisy data. The likelihood function $P_x$ for $x$ is the same as before, cf. (1.3.1), leading us to consider the weighted residuals $r_i(x)/\varsigma_i = (y_i - M(x, t_i))/\varsigma_i$ and to minimize the weighted sum-of-squares residuals.
Definition 85. Nonlinear least squares problem (NLLSQ). In the case of white noise (where the errors have a common variance $\varsigma^2$), find a minimizer $x^*$ of the nonlinear objective function $f$ with the special form
\[ \min_x f(x) \equiv \min_x \tfrac{1}{2} \|r(x)\|_2^2 = \min_x \tfrac{1}{2} \sum_{i=1}^{m} r_i(x)^2, \tag{8.1.3} \]
where $x \in \mathbb{R}^n$ and $r(x) = (r_1(x), \ldots, r_m(x))^T$ is a vector-valued function of the residuals for each data point:
\[ r_i(x) = y_i - M(x, t_i), \quad i = 1, \ldots, m, \]
in which $y_i$ are the measured data corresponding to $t_i$ and $M(x, t)$ is the model function.
Figure 8.1.1: Fitting a non-normalized Gaussian function (full line) to noisy data (circles).
The factor $\frac{1}{2}$ is added for practical reasons as will be clear later. Simply replace $r_i(x)$ by $w_i r_i(x) = r_i(x)/\varsigma_i$ in (8.1.3), where $w_i = \varsigma_i^{-1}$ is the inverse of the standard deviation for the noise in $y_i$, to obtain the weighted problem $\min_x \frac{1}{2} \|W r(x)\|_2^2$, with $W = \mathrm{diag}(w_1, \ldots, w_m)$, for problems where the errors do not all have the same variance.
Example 86. Fitting with the Gaussian model. As an example, we consider fitting the instrument response function in Example 83 by the Gaussian model (8.1.1). First note that if we choose $g_{\mathrm{true}}(\tau) = \delta(\tau)$, a delta function, then we have $g_{\mathrm{mes}}(t) = \Gamma(t)$, meaning that we ideally measure the instrument response function. In practice, by sampling $g_{\mathrm{mes}}(t)$ for selected values $t_1, t_2, \ldots, t_m$ of $t$ we obtain noisy data $y_i = \Gamma(t_i) + e_i$ for $i = 1, \ldots, m$ to which we fit the Gaussian model $M(x, t)$. Figure 8.1.1 illustrates this. The circles are the noisy data generated from an exact Gaussian model with $x_1 = 2.2$, $x_2 = 0.26$ and $x_3 = 0.2$, to which we added Gaussian noise with standard deviation $\varsigma = 0.1$. The full line is the least squares fit with parameters $x_1 = 2.179$, $x_2 = 0.264$ and $x_3 = 0.194$. In the next chapter we discuss how to compute this fit.
Example 87. Nuclear magnetic resonance spectroscopy. This is a technique that measures the response of nuclei of certain atoms that possess spin when they are immersed in a static magnetic field and they are exposed to a second oscillating magnetic field. NMR spectroscopy, which studies the interaction of the electromagnetic radiation with matter, is used as a noninvasive technique to obtain in vivo information about chemical changes (e.g., concentration, pH) in living organisms. An NMR spectrum of, for example, the human brain can be used to identify pathology or biochemical processes or to monitor the effect of medication.
The model used to represent the measured NMR signals $y_i \in \mathbb{C}$ in the frequency domain is a truncated Fourier series of the Lorentzian distribution:
\[ y_i \simeq M(a, \phi, d, \omega, t_i) = \sum_{k=1}^{K} a_k e^{j \phi_k} e^{(-d_k + j \omega_k) t_i}, \quad i = 1, \ldots, m, \]
where $j$ denotes the imaginary unit. The parameters of the NMR signal provide information about the molecules: $K$ represents the number of different resonance frequencies, the angular frequency $\omega_k$ of the individual spectral components identifies the molecules, the damping $d_k$ characterizes the mobility, the amplitude $a_k$ is proportional to the number of molecules present, and $\phi_k$ is the phase. All these parameters are real. To determine the NMR parameters through a least squares fit, we minimize the squared modulus of the difference between the measured spectrum and the model function:
\[ \min_{a, \phi, d, \omega} \sum_{i=1}^{m} |y_i - M(a, \phi, d, \omega, t_i)|^2. \]
This is another example of a nonlinear least squares problem.
8.2 Unconstrained problems
Generally, nonlinear least squares problems will have multiple solutions
and without a priori knowledge of the objective function it would be too
expensive to determine numerically a global minimum, as one can only
calculate the function and its derivatives at a limited number of points.
Thus, the algorithms in this book will be limited to the determination of
local minima. A certain degree of smoothness of the objective function will
be required, i.e., having either one or possibly two continuous derivatives,
so that the Taylor expansion applies.
Optimality conditions
Recall (Appendix C) that the gradient $\nabla f(x)$ of a scalar function $f(x)$ of $n$ variables is the vector with elements
\[ \frac{\partial f(x)}{\partial x_j}, \quad j = 1, \ldots, n, \]
and the Hessian $\nabla^2 f(x)$ is the symmetric matrix with elements
\[ [\nabla^2 f(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \quad i, j = 1, \ldots, n. \]
The conditions for $x^*$ to be a local minimum of a twice continuously differentiable function $f$ are
1. First-order necessary condition. $x^*$ is a stationary point, i.e., the gradient of $f$ at $x^*$ is zero: $\nabla f(x^*) = 0$.
2. Second-order sufficient conditions. The Hessian $\nabla^2 f(x^*)$ is positive definite.
Now we consider the special case where $f$ is the function in the nonlinear least squares problem (8.1.3), i.e.,
\[ f(x) = \tfrac{1}{2} \sum_{i=1}^{m} r_i(x)^2, \qquad r_i(x) = y_i - M(x, t_i). \]
The Jacobian $J(x)$ of the vector function $r(x)$ is defined as the matrix with elements
\[ [J(x)]_{ij} = \frac{\partial r_i(x)}{\partial x_j} = -\frac{\partial M(x, t_i)}{\partial x_j}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n. \]
Note that the $i$th row of $J(x)$ equals the transpose of the gradient of $r_i(x)$ and also:
\[ \nabla r_i(x)^T = -\nabla M(x, t_i)^T, \quad i = 1, \ldots, m. \]
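The elements of the gradient and the Hessian of $f$ are given by
\[ \frac{\partial f(x)}{\partial x_j} = \sum_{i=1}^{m} r_i(x) \frac{\partial r_i(x)}{\partial x_j}, \]
\[ [\nabla^2 f(x)]_{k\ell} = \frac{\partial^2 f(x)}{\partial x_k \partial x_\ell} = \sum_{i=1}^{m} \frac{\partial r_i(x)}{\partial x_k} \frac{\partial r_i(x)}{\partial x_\ell} + \sum_{i=1}^{m} r_i(x) \frac{\partial^2 r_i(x)}{\partial x_k \partial x_\ell}, \]
and it follows immediately that the gradient and Hessian can be written in matrix form as
\[ \nabla f(x) = J(x)^T r(x), \]
\[ \nabla^2 f(x) = J(x)^T J(x) + \sum_{i=1}^{m} r_i(x) \nabla^2 r_i(x), \]
\[ [\nabla^2 r_i(x)]_{k\ell} = -[\nabla^2 M(x, t_i)]_{k\ell} = -\frac{\partial^2 M(x, t_i)}{\partial x_k \partial x_\ell}, \quad k, \ell = 1, \ldots, n. \]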
Notice the minus sign in the expression for $\nabla^2 r_i(x)$. The optimality conditions now take the special form
\[ \nabla f(x) = J(x)^T r(x) = 0 \tag{8.2.1} \]
and
\[ \nabla^2 f(x) = J(x)^T J(x) + \sum_{i=1}^{m} r_i(x) \nabla^2 r_i(x) \ \text{is positive definite.} \tag{8.2.2} \]
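The gradient formula $\nabla f(x) = J(x)^T r(x)$ can be checked against finite differences. The following is a minimal sketch for the Gaussian model (8.1.1), assuming NumPy; the sampling grid, the noise seed, the trial point `x0` and the function names are assumptions made for illustration, and the Jacobian entries are obtained by differentiating (8.1.1) and applying $J = -\partial M / \partial x$.

```python
import numpy as np

def model(x, t):
    """Non-normalized Gaussian model (8.1.1)."""
    return x[0] * np.exp(-(t - x[1]) ** 2 / (2.0 * x[2] ** 2))

def residual(x, t, y):
    """Residual vector r(x) with r_i = y_i - M(x, t_i)."""
    return y - model(x, t)

def jacobian(x, t):
    """Jacobian of r(x); note the minus sign, since J = -dM/dx."""
    d = t - x[1]
    g = np.exp(-d**2 / (2.0 * x[2] ** 2))
    return -np.column_stack([g, x[0] * d / x[2] ** 2 * g, x[0] * d**2 / x[2] ** 3 * g])

def objective(x, t, y):
    """f(x) = 0.5 * ||r(x)||_2^2."""
    r = residual(x, t, y)
    return 0.5 * (r @ r)

# Assumed sampling grid and noisy data (illustrative only).
rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 50)
y = model(np.array([2.2, 0.26, 0.2]), t) + 0.1 * rng.standard_normal(t.size)

x0 = np.array([2.0, 0.3, 0.25])
grad_analytic = jacobian(x0, t).T @ residual(x0, t, y)

# Central finite differences of f, one coordinate direction at a time.
h = 1e-6
grad_fd = np.array([
    (objective(x0 + h * e, t, y) - objective(x0 - h * e, t, y)) / (2.0 * h)
    for e in np.eye(3)
])

print("J^T r             :", grad_analytic)
print("finite differences:", grad_fd)
```

A check of this kind is a cheap safeguard against sign errors in a hand-coded Jacobian, which are easy to make because of the minus sign in $r_i(x) = y_i - M(x, t_i)$.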
The fact that distinguishes the least squares problem from among the general optimization problems is that the first and often dominant term $J(x)^T J(x)$ of the Hessian $\nabla^2 f(x)$ contains only the Jacobian $J(x)$ of $r(x)$, i.e., only first derivatives. Observe that in the second term the second derivatives are multiplied by the residuals. Thus, if the model is adequate to represent the data, then these residuals will be small near the solution and therefore this term will be of secondary importance. In this case one gets an important part of the Hessian "for free" if one has already computed the gradient. Most specialized algorithms exploit this structural property of the Hessian.
An inspection of the Hessian of $f$ will show two comparatively easy NLLSQ cases. As we observed above, the term $\sum_{i=1}^{m} r_i(x) \nabla^2 r_i(x)$ will be small if the residuals are small. An additional favorable case occurs when the problem is only mildly nonlinear, i.e., all the Hessians $\nabla^2 r_i(x) = -\nabla^2 M(x, t_i)$ have small norm. Most algorithms that neglect this second term will then work satisfactorily. The smallest eigenvalue $\lambda_{\min}$ of $J(x)^T J(x)$ can be used to quantify the relative importance of the two terms of $\nabla^2 f(x)$: the first term of the Hessian dominates if for all $x$ near a minimum $x^*$ the quantities $|r_i(x)|\, \|\nabla^2 r_i(x)\|_2$ for $i = 1, \ldots, m$ are small relative to $\lambda_{\min}$. This obviously holds in the special case of a consistent problem where $r(x^*) = 0$.
The optimality conditions can be interpreted geometrically by observing that the gradient $\nabla f(x)$ is always a vector normal to the level set that passes through the point $x$ (we are assuming in what follows that $f(x)$ is twice differentiable, for simplicity). A level set $L(c)$ (level curve in $\mathbb{R}^2$) is defined formally as
\[ L(c) = \{ x : f(x) = c \}. \]
The tangent plane at $x$ is $\{ y \in \mathbb{R}^n \mid \nabla f(x)^T (y - x) = 0 \}$, which shows that $\nabla f(x)$ is its normal and thus normal to the level set at $x$.
Figure 8.2.1: Level sets $L(c)$ (in $\mathbb{R}^2$ they are level curves) for a function $f$ whose minimizer $x^*$ is located at the black dot. The tangent plane is a line perpendicular to the gradient $\nabla f(x)$, and the negative gradient is the direction of steepest descent at $x$.
This leads to a geometric interpretation of directions of descent. A direction $p$ with $\|p\|_2 = 1$ is said to be of descent (from the point $x$) if $f(x + tp) < f(x)$ for $0 < t < t_0$. It turns out that directions of descent can be characterized using the gradient vector. In fact, by Taylor's expansion we have
\[ f(x + tp) = f(x) + t \nabla f(x)^T p + O(t^2). \]
Thus, for the descent condition to hold we need to have $\nabla f(x)^T p < 0$, since for sufficiently small $t$ the linear term will dominate. In addition we have $\nabla f(x)^T p = \cos(\phi) \|\nabla f(x)\|_2$, where $\phi$ is the angle between $p$ and the gradient. From this we conclude that the direction $p = -\nabla f(x)$ is of maximum descent, or steepest descent, while any direction in the half-space defined by the tangent plane and containing the negative gradient is also a direction of descent, since $\cos(\phi)$ is negative there. Figure 8.2.1 illustrates this point. Note that a stationary point is one for which there are no descent directions, i.e., $\nabla f(x) = 0$.
We conclude with a useful geometric result when $J(x^*)$ has full rank. In this case the characterization of a minimum can be expressed by using the so-called normal curvature matrix (see [9]) associated with the surface defined by $r(x)$ and with respect to the normalized vector $-r(x)/\|r(x)\|_2$, defined as
\[ K(x) = -\frac{1}{\|r(x)\|_2} (J(x)^{\dagger})^T \left( \sum_{i=1}^{m} r_i(x) \nabla^2 r_i(x) \right) J(x)^{\dagger}. \]
Then the condition for a minimum can be reformulated as follows.
Condition 88. If $J(x^*)^T \left( I - \|r(x^*)\|_2 K(x^*) \right) J(x^*)$ is positive definite, then $x^*$ is a local minimum.
The role of local approximations
Local approximations always play an important role in the treatment of nonlinear problems and this case is no exception. We start with the Taylor expansion of the model function:
\[ M(x + h, t) = M(x, t) + \nabla M(x, t)^T h + \tfrac{1}{2} h^T \nabla^2 M(x, t) h + O(\|h\|_2^3). \]
Now consider the $i$th residual $r_i(x) = y_i - M(x, t_i)$, whose Taylor expansion for $i = 1, \ldots, m$ is given by
\[ r_i(x + h) = y_i - M(x + h, t_i) = y_i - M(x, t_i) - \nabla M(x, t_i)^T h - \tfrac{1}{2} h^T \nabla^2 M(x, t_i) h + O(\|h\|_2^3). \]
Hence we can write
\[ r(x + h) = r(x) + J(x)\, h + O(\|h\|_2^2), \]
which is a local linear approximation at $x$ valid for small $h$. If we keep $x$ fixed and consider the minimization problem
\[ \min_h \|r(x + h)\|_2 \simeq \min_h \|J(x)\, h + r(x)\|_2, \tag{8.2.3} \]
it is clear that we can approximate locally the nonlinear least squares problem with a linear one in the variable $h$. Moreover, we see that the $h$ that minimizes $\|r(x + h)\|_2$ can be approximated by
\[ h \approx -(J(x)^T J(x))^{-1} J(x)^T r(x) = -J(x)^{\dagger} r(x), \]
where we have neglected higher-order terms. In the next chapter we shall see how this expression provides a basis for some of the numerical methods for solving the NLLSQ problem.
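To make the role of the step $h \approx -J(x)^{\dagger} r(x)$ concrete, the sketch below repeats it a fixed number of times for the Gaussian model (8.1.1). This is only an illustration under stated assumptions: NumPy is available, the data are noise-free so that $r(x^*) = 0$, the starting guess is close to the solution, and no step-length control or stopping test is used. The function name `repeated_linearization`, the grid and the starting guess are hypothetical, and the sketch is not meant to anticipate the specific algorithms of the next chapter.

```python
import numpy as np

def model(x, t):
    """Non-normalized Gaussian model (8.1.1)."""
    return x[0] * np.exp(-(t - x[1]) ** 2 / (2.0 * x[2] ** 2))

def jacobian(x, t):
    """Jacobian of r(x) = y - M(x, t), i.e., J = -dM/dx."""
    d = t - x[1]
    g = np.exp(-d**2 / (2.0 * x[2] ** 2))
    return -np.column_stack([g, x[0] * d / x[2] ** 2 * g, x[0] * d**2 / x[2] ** 3 * g])

def repeated_linearization(x0, t, y, iterations=20):
    """Repeatedly solve the local linear problem min_h ||J(x) h + r(x)||_2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iterations):
        r = y - model(x, t)
        J = jacobian(x, t)
        h, *_ = np.linalg.lstsq(J, -r, rcond=None)   # h = -J^+ r
        x = x + h
    return x

# Noise-free data on an assumed grid, so the exact parameters give r(x*) = 0.
t = np.linspace(0.0, 1.0, 50)
x_exact = np.array([2.2, 0.26, 0.2])
y = model(x_exact, t)

x_fit = repeated_linearization([2.0, 0.25, 0.22], t, y)
print("computed parameters:", x_fit)
```

Solving the linear subproblem with a least squares solver rather than forming $(J^T J)^{-1}$ avoids squaring the condition number of $J$.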
As a special case, let us now consider the local approximation (8.2.3) at a least squares solution $x^*$ (a local minimizer) where the local, linear least squares problem takes the form
\[ \min_h \|J(x^*)\, h + r(x^*)\|_2. \]
If we introduce $x = x^* + h$, then we can reformulate the above problem as one in $x$,
\[ \min_x \|J(x^*)\, x - J(x^*)\, x^* + r(x^*)\|_2, \]
which leads to an approximation of the covariance matrix for $x^*$ by using the covariance matrix for the above linear least squares problem, i.e.,
\begin{align*}
\mathrm{Cov}(x^*) &\simeq J(x^*)^{\dagger}\, \mathrm{Cov}\big( J(x^*)\, x^* - r(x^*) \big)\, (J(x^*)^{\dagger})^T \\
&= J(x^*)^{\dagger}\, \mathrm{Cov}\big( r(x^*) - J(x^*)\, x^* \big)\, (J(x^*)^{\dagger})^T \\
&= J(x^*)^{\dagger}\, \mathrm{Cov}(y)\, (J(x^*)^{\dagger})^T. \tag{8.2.4}
\end{align*}
Figure 8.2.2: Histograms and standard deviations for the least squares solutions to 1000 noisy realizations of the Gaussian model fitting problem. The vertical lines show the exact values of the parameters.
Here, we have used that $r(x^*) = y - (M(x^*, t_1), \ldots, M(x^*, t_m))^T$ and that all the terms except $y$ in $-r(x^*) + J(x^*)\, x^*$ are considered constant.
The above equation provides a way to approximately assess the uncertainties in the least squares solution $x^*$ for the nonlinear case, similar to the result in Section 2.1 for the linear case. In particular, we see that the Jacobian $J(x^*)$ at the solution plays the role of the matrix in the linear case. If the errors $e$ in the data are uncorrelated with the exact data and have covariance $\mathrm{Cov}(e) = \varsigma^2 I$, then $\mathrm{Cov}(x^*) \simeq \varsigma^2 \left( J(x^*)^T J(x^*) \right)^{-1}$.
Example 89. To illustrate the use of the above covariance matrix estimate,
we return to the Gaussian model M(x, t) from Examples 83 and 86, with
exact parameters x
1
= 2.2, x
2
= 0.26 and x
3
= 0.2 and with noise level
= 0.1. The elements of the Jacobian for this problem are, for i = 1, . . . , m:
[J(x)]
i,1
= e
(t
i
x
2
)
2
/(2x
2
3
)
,
[J(x)]
i,2
= x
1
t
i
x
2
x
2
3
e
(t
i
x
2
)
2
/(2x
2
3
)
,
[J(x)]
i,3
= x
1
(t
i
x
2
)
2
x
3
3
e
(t
i
x
2
)
2
/(2x
2
3
)
.
156 LEAST SQUARES DATA FITTING WITH APPLICATIONS
The approximate standard deviations for the three estimated parameters
are the square roots of the diagonal of
2
_
J(x

)
T
J(x

)
_
1
. In this case we
get the three values
6.58 10
2
, 6.82 10
3
, 6.82 10
3
.
We compute the least squares solution $x^*$ for 1000 realizations of the noise. Histograms of these parameters are shown in Figure 8.2.2. The standard deviations of these estimated parameters are
$$\mathrm{std}(x_1^*) = 6.89 \cdot 10^{-2}, \qquad \mathrm{std}(x_2^*) = 7.30 \cdot 10^{-3}, \qquad \mathrm{std}(x_3^*) = 6.81 \cdot 10^{-3}.$$
In this example, the theoretical standard deviation estimates are in very
good agreement with the observed ones.
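The estimate $\sigma^2 (J^T J)^{-1}$ used in this example can be sketched in a few lines of plain Python. The sampling grid `t` (20 equispaced points in [0, 1]) is an assumption made only for illustration, since the abscissas of Examples 83 and 86 are not repeated here, so the printed numbers need not match the values quoted above. The Jacobian is evaluated at the exact parameters as a stand-in for $x^*$, and the helper names are hypothetical.

```python
import math

def gaussian_jacobian(x, t):
    """Jacobian of r(x) = y - M(x, t), M = x1*exp(-(t - x2)^2 / (2*x3^2))."""
    x1, x2, x3 = x
    rows = []
    for ti in t:
        e = math.exp(-(ti - x2) ** 2 / (2.0 * x3 ** 2))
        rows.append([-e,
                     -x1 * (ti - x2) / x3 ** 2 * e,
                     -x1 * (ti - x2) ** 2 / x3 ** 3 * e])
    return rows

def invert(A):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                fac = M[r][c]
                M[r] = [vr - fac * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def approx_std(x, t, sigma):
    """Square roots of the diagonal of sigma^2 * (J^T J)^{-1}."""
    J = gaussian_jacobian(x, t)
    n = len(x)
    JTJ = [[sum(row[i] * row[j] for row in J) for j in range(n)] for i in range(n)]
    C = invert(JTJ)
    return [sigma * math.sqrt(C[i][i]) for i in range(n)]

t = [i / 19 for i in range(20)]          # assumed sampling grid (not from the text)
stds = approx_std([2.2, 0.26, 0.2], t, 0.1)
print(stds)
```

A Monte Carlo check like the one in the example would repeat the fit for many noise realizations and compare the sample standard deviations with `stds`.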
8.3 Optimality conditions for constrained problems
In the previous section we reviewed the conditions for a point $x^*$ to be a local minimizer of an unconstrained nonlinear least squares problem. Now we consider a constrained problem
$$\min_{x \in C} \tfrac{1}{2} \| r(x) \|_2^2 , \qquad (8.3.1)$$
where the residual vector $r(x)$ is as above and the set $C \subset \mathbb{R}^n$ (called the feasible region) is defined by a set of inequality constraints:
$$C = \{\, x \mid c_i(x) \le 0, \ i = 1, \ldots, p \,\},$$
where the functions $c_i(x)$ are twice differentiable. In the unconstrained case, it could be that the only minimum is at infinity (think of $f(x) = a + bx$). If instead we limit the variables to be in a closed, bounded set, then the Bolzano-Weierstrass theorem [79] ensures that there will be at least one minimum point.
A simple way to define a bounded set $C$ is to impose constraints on the variables. The simplest constraints are those that set bounds on the variables: $l_i \le x_i \le u_i$ for $i = 1, \ldots, p$. A full set of such constraints (i.e., $p = n$) defines a bounded box in $\mathbb{R}^n$ and then we are guaranteed that $f(x)$ will have a minimum in the set. For general constraints some of the functions $c_i(x)$ can be nonlinear.
The points that satisfy all the constraints are called feasible. A constraint is active if it is satisfied with equality, i.e., the point is on a boundary of the feasible set. If we ignore the constraints, unconstrained minima can occur inside or outside the feasible region. If they occur inside, then the optimality conditions are the same as in the unconstrained case. However, if all unconstrained minima are infeasible, then a constrained minimum must occur on the boundary of the feasible region and the optimality conditions will be different.

Figure 8.3.1: Optimality condition for constrained optimization. The shaded area illustrates the feasible region $C$ defined by three constraints. One of the constraints is active at the solution $x^*$ and at this point the gradient of $f$ is collinear with the normal to the constraint.
Let us think geometrically (better still, in $\mathbb{R}^2$, see Figure 8.3.1). The level curve values of $f(x)$ decrease toward the unconstrained minimum. Thus, a constrained minimum will lie on the level curve with the lowest value of $f(x)$ that is still in contact with the feasible region and at the constrained minimum there should not be any feasible descent directions. A moment of reflection tells us that the level curve should be tangent to the active constraint (the situation is more complicated if more than one constraint is active). That means that the normal to the active constraint and the gradient (pointing to the outside of the constraint region) at that point are collinear, which can be expressed as
$$\nabla f(x) + \lambda \nabla c_i(x) = 0, \qquad (8.3.2)$$
where $c_i(x)$ represents the active constraint and $\lambda > 0$ is a so-called Lagrange multiplier for the constraint. (If the constraint were $c_i(x) \ge 0$, then we should require $\lambda < 0$.) We see that (8.3.2) is true because, if $p$ is a descent direction, we have
$$p^T \nabla f(x) + \lambda \, p^T \nabla c_i(x) = 0$$
or, since $p^T \nabla f(x) < 0$ for a descent direction and $\lambda > 0$,
$$p^T \nabla c_i(x) > 0,$$
and therefore $p$ must point outside the feasible region; see Figure 8.3.1 for an illustration. This is a simplified version of the famous Karush-Kuhn-Tucker first-order optimality conditions for general nonlinear optimization [139, 147]. Curiously enough, in the original Kuhn-Tucker paper these conditions are derived from considerations based on multiobjective optimization!
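As a small worked illustration of condition (8.3.2), consider the following hypothetical problem (it is not taken from the text): minimize
$$f(x) = \tfrac{1}{2} \bigl( (x_1 - 2)^2 + (x_2 - 1)^2 \bigr) \quad \text{subject to} \quad c_1(x) = x_1 - 1 \le 0 .$$
The unconstrained minimizer $(2, 1)^T$ is infeasible, so the constrained minimizer lies on the boundary $x_1 = 1$, namely $x^* = (1, 1)^T$. There
$$\nabla f(x^*) = \begin{pmatrix} -1 \\ 0 \end{pmatrix}, \qquad \nabla c_1(x^*) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \nabla f(x^*) + \lambda \nabla c_1(x^*) = 0 \ \text{ with } \ \lambda = 1 > 0 .$$
Any descent direction $p$ satisfies $p^T \nabla f(x^*) = -p_1 < 0$, i.e., $p_1 > 0$, so that $p^T \nabla c_1(x^*) = p_1 > 0$ and $p$ indeed points out of the feasible region.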
The general case, when more than one constraint is active at the solution, can be discussed similarly. Now the infeasible directions, instead of being in a half-space determined by the tangent plane of a single constraint, will be in a cone formed by the normals to the tangent planes associated with the various active constraints. The corresponding optimality condition is
$$\nabla f(x) + \sum_{i=1}^{p} \lambda_i \nabla c_i(x) = 0, \qquad (8.3.3)$$
where the Lagrange multipliers satisfy $\lambda_i > 0$ for the active constraints and $\lambda_i = 0$ for the inactive ones. We refer to [170] for more details about optimality conditions and Lagrange multipliers.
8.4 Separable nonlinear least squares problems
In many practical applications the unknown parameters of the NLLSQ problem can be separated into two disjoint sets, so that the optimization with respect to one set is easier than with respect to the other. This suggests the idea of eliminating the parameters in the easy set and minimizing the resulting functional, which will then depend only on the remaining variables. A natural situation that will be considered in detail here arises when some of the variables appear linearly.
The initial approach to this problem was considered by H. Scolnik in his Doctoral Thesis at the University of Zurich. A first paper with generalizations was [113]. This was followed by an extensive generalization that included a detailed algorithmic description [101, 102] and a computer program called VARPRO that became very popular and is still in use. For a recent detailed survey of applications see [103] and [198]. A paper that includes constraints is [142]. For multiple right-hand sides see [98, 141].
A separable nonlinear least squares problem has a model of the special form
$$M(a, \alpha, t) = \sum_{j=1}^{n} a_j \, \phi_j(\alpha, t), \qquad (8.4.1)$$
where the two vectors $a \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}^k$ contain the parameters to be determined and $\phi_j(\alpha, t)$ are functions in which the parameters $\alpha$ appear nonlinearly. Fitting this model to a set of $m$ data points $(t_i, y_i)$ with $m > n + k$ leads to the least squares problem
$$\min_{a, \alpha} f(a, \alpha) = \min_{a, \alpha} \tfrac{1}{2} \| r(a, \alpha) \|_2^2 = \min_{a, \alpha} \tfrac{1}{2} \| y - \Phi(\alpha)\, a \|_2^2 .$$
In this expression, $\Phi(\alpha)$ is an $m \times n$ matrix function with elements
$$[\Phi(\alpha)]_{ij} = \phi_j(\alpha, t_i), \qquad i = 1, \ldots, m, \quad j = 1, \ldots, n.$$
The special case when $k = n$ and $\phi_j = e^{\alpha_j t}$ is called exponential data fitting. In this as in other separable nonlinear least squares problems the matrix $\Phi(\alpha)$ is often ill-conditioned.
The variable projection algorithm to be discussed in detail in the next chapter is based on the following ideas. For any fixed $\alpha$, the problem
$$\min_{a} \tfrac{1}{2} \| y - \Phi(\alpha)\, a \|_2^2$$
is linear with minimum-norm solution $a^* = \Phi(\alpha)^\dagger y$, where $\Phi(\alpha)^\dagger$ is the Moore-Penrose generalized inverse of the rectangular matrix $\Phi(\alpha)$, as we saw in Chapter 3.
Substituting this value of $a$ into the original problem, we obtain a reduced nonlinear problem depending only on the nonlinear parameters $\alpha$, with the associated function
$$f_2(\alpha) = \tfrac{1}{2} \| y - \Phi(\alpha) \Phi(\alpha)^\dagger y \|_2^2 = \tfrac{1}{2} \| P_{\Phi(\alpha)}\, y \|_2^2 ,$$
where, for a given $\alpha$, the matrix
$$P_{\Phi(\alpha)} = I - \Phi(\alpha) \Phi(\alpha)^\dagger$$
is the projector onto the orthogonal complement of the column space of the matrix $\Phi(\alpha)$. Thus the name variable projection given to this reduction procedure. The solution of the nonlinear least squares problem $\min_{\alpha} f_2(\alpha)$ is discussed in the next chapter.
Once a minimizer $\alpha^*$ for $f_2(\alpha)$ is obtained, it is substituted into the linear problem $\min_a \tfrac{1}{2} \| y - \Phi(\alpha^*)\, a \|_2^2$, which is then solved to obtain $a^*$. The least squares solution to the original problem is then $(a^*, \alpha^*)$.
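The elimination of the linear parameters can be sketched for the simplest separable model $M(a, \alpha, t) = a\, e^{\alpha t}$ (so $n = k = 1$), where $\Phi(\alpha)$ is a single column and $\Phi(\alpha)^\dagger y$ reduces to a ratio of inner products. The data, the candidate grid for $\alpha$ and the function names below are illustrative assumptions; the grid scan is only a crude stand-in for the nonlinear solvers of the next chapter and is not the VARPRO program.

```python
import math

def reduced_fit(alpha, t, y):
    """For fixed alpha, solve the linear problem for a; return (a, f2(alpha))."""
    phi = [math.exp(alpha * ti) for ti in t]            # the single column of Phi(alpha)
    a = sum(p * yi for p, yi in zip(phi, y)) / sum(p * p for p in phi)
    f2 = 0.5 * sum((yi - a * p) ** 2 for p, yi in zip(phi, y))
    return a, f2

def variable_projection(t, y, alphas):
    """Minimize the reduced functional f2 over a list of candidate alphas."""
    best_alpha = min(alphas, key=lambda al: reduced_fit(al, t, y)[1])
    best_a, best_f2 = reduced_fit(best_alpha, t, y)
    return best_a, best_alpha, best_f2

t = [0.25 * i for i in range(9)]                        # t = 0, 0.25, ..., 2
y = [2.0 * math.exp(-1.5 * ti) for ti in t]             # noise-free synthetic data
alphas = [-3.0 + 0.01 * i for i in range(301)]          # candidates in [-3, 0]
a_hat, alpha_hat, f2_hat = variable_projection(t, y, alphas)
print(a_hat, alpha_hat, f2_hat)
```

Only the nonlinear parameter is searched over; the linear coefficient is recovered afterwards from one linear least squares solve, which is the point of the reduction.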
The justification for the variable projection algorithm is the following theorem (Theorem 2.1 proved in [101]), which essentially states that the set of stationary points of the original functional and that of the reduced one are identical.
Theorem 90. Let $f(a, \alpha)$ and $f_2(\alpha)$ be defined as above. We assume that in an open set $\Omega$ containing the solution, the matrix $\Phi(\alpha)$ has constant rank $r \le \min(m, n)$.
1. If $\alpha^*$ is a critical point (or a global minimizer in $\Omega$) of $f_2(\alpha)$ and $a^* = \Phi(\alpha^*)^\dagger y$, then $(a^*, \alpha^*)$ is a critical point of $f(a, \alpha)$ (or a global minimizer in $\Omega$) and $f(a^*, \alpha^*) = f_2(\alpha^*)$.
2. If $(a^*, \alpha^*)$ is a global minimizer of $f(a, \alpha)$ for $\alpha \in \Omega$, then $\alpha^*$ is a global minimizer of $f_2(\alpha)$ in $\Omega$ and $f_2(\alpha^*) = f(a^*, \alpha^*)$. Furthermore, if there is a unique $a^*$ among the minimizing pairs of $f(a, \alpha)$ then $a^*$ must satisfy $a^* = \Phi(\alpha^*)^\dagger y$.
8.5 Multiobjective optimization
This is a subject that is seldom treated in optimization books, although it arises frequently in financial optimization, decision and game theory and other areas. In recent times it has increasingly been recognized that many engineering problems are also of this kind. We have also already mentioned that most regularization procedures are actually bi-objective problems (see [240, 241] for an interesting theory and algorithm).
The basic unconstrained problem is
$$\min_x f(x),$$
where now $f(x) \in \mathbb{R}^k$ is a vector function. Such problems arise in areas of interest of this book, for instance, in cooperative fitting and inversion, when measurements of several physical processes on the same sample are used to determine properties that are interconnected.
In general, it will be unlikely that the $k$ objective functions share a common minimum point, so the theory of multiobjective optimization is different from the single-objective optimization case. The objectives are usually in conflict and a compromise must be chosen. The optimality concept is here the so-called Pareto equilibrium: a point $x$ is a Pareto equilibrium point if there is no other point for which all functions have smaller values.
This notion can be global (as stated) or local if it is restricted to a neighborhood of $x$. In general there will be infinitely many such points. In the absence of additional information, each Pareto point would be equally acceptable. Let us now see if we can give a more geometric interpretation of this condition, based on some of the familiar concepts used earlier.
For simplicity let us consider the case $k = 2$. First of all we observe that the individual minimum points
$$x_i^* = \arg\min_x f_i(x), \qquad i = 1, 2$$
are Pareto optimal. We consider the level sets corresponding to the two functions and observe that at a point at which these two level sets are tangent and their corresponding normals are in opposite directions, there will be no common directions of descent for both functions. But that means that there is no direction in which we can move so that both functions are improved, i.e., this is a Pareto equilibrium point. Analytically this is expressed as
$$\frac{\nabla f_1(x)}{\| \nabla f_1(x) \|_2} + \frac{\nabla f_2(x)}{\| \nabla f_2(x) \|_2} = 0,$$
which can be rewritten as
$$\lambda \nabla f_1(x) + (1 - \lambda) \nabla f_2(x) = 0, \qquad 0 \le \lambda \le 1.$$

Figure 8.5.1: Illustration of a multiobjective optimization problem. The two functions $f_1(x)$ and $f_2(x)$ have minimizers $x_1^*$ and $x_2^*$. The curve between these two points is the set of Pareto points.
It can be shown that for convex functions, all the Pareto points are parametrized by this aggregated formula, which coincides with the optimality condition for the scalar optimization problems
$$\min_x \, [\lambda f_1(x) + (1 - \lambda) f_2(x)], \qquad 0 \le \lambda \le 1.$$
This furnishes a way to obtain the Pareto points by solving a number of scalar optimization problems. Figure 8.5.1 illustrates this; the curve between the two minimizers $x_1^*$ and $x_2^*$ is the set of Pareto points.
Curiously enough, these optimality conditions were already derived in the seminal paper on nonlinear programming by Kuhn and Tucker in 1951 [147].
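The weighted-sum construction can be tried on a toy problem. The sketch below assumes two convex quadratics $f_1(x) = \|x - p_1\|_2^2$ and $f_2(x) = \|x - p_2\|_2^2$ (an illustrative choice, not an example from the text), for which the condition $\lambda \nabla f_1 + (1 - \lambda) \nabla f_2 = 0$ gives the minimizer $x = \lambda p_1 + (1 - \lambda) p_2$ in closed form, so the Pareto set is the segment between the two individual minimizers.

```python
def pareto_points(p1, p2, num=5):
    """Weighted-sum minimizers for f1 = |x - p1|^2 and f2 = |x - p2|^2."""
    points = []
    for i in range(num):
        lam = i / (num - 1)
        # lam*grad f1 + (1 - lam)*grad f2 = 0  =>  x = lam*p1 + (1 - lam)*p2
        x = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
        f1 = sum((xi - a) ** 2 for xi, a in zip(x, p1))
        f2 = sum((xi - b) ** 2 for xi, b in zip(x, p2))
        points.append((lam, x, f1, f2))
    return points

front = pareto_points([0.0, 0.0], [2.0, 1.0])
for lam, x, f1, f2 in front:
    print(lam, x, f1, f2)
```

Plotting the pairs `(f1, f2)` gives a sampled Pareto front; note that equally spaced weights do not in general give equally spaced points on the front.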
A useful tool is given by the graph in phase space of the Pareto points, the so-called Pareto front (see Figure 8.5.2). In fact, this graph gives a complete picture of all possible solutions, and it is then usually straightforward to make a decision by choosing an appropriate trade-off between the objectives. Since one will usually be able to calculate only a limited number of Pareto points, it is important that the Pareto front be uniformly sampled. In [190, 194], a method based on continuation in $\lambda$ with added distance constraints automatically produces equispaced representations of the Pareto front. In Figure 8.5.2 we show a uniformly sampled Pareto front for a bi-objective problem.
The $\epsilon$-constraint method of Haimes [114] is frequently used to solve this type of problem in the case that a hierarchical order of the objectives is known and one can make an a priori call on a compromise upper value for the secondary objective. It transforms the bi-objective problem into a single-objective constrained minimization of the main goal. The constraint is the upper bound of the second objective. In other words, one minimizes the first objective subject to an acceptable level in the second objective (which had better be larger than its minimum). From what we saw before, the resulting solution may be sub-optimal.

Figure 8.5.2: Evenly spaced Pareto front (in $(f_1, f_2)$ space) for a bi-objective problem [190].
A good reference for the theoretical and practical aspects of nonlinear
multiobjective optimization is [161].
Chapter 9
Algorithms for Solving
Nonlinear LSQ Problems
The classical method of Newton and its variants can be used to solve the nonlinear least squares problem formulated in the previous chapter. Newton's method for optimization is based on a second-order Taylor approximation of the objective function $f(x)$ and subsequent minimization of the resulting approximate function. Alternatively, one can apply Newton's method to solve the nonlinear system of equations $\nabla f(x) = 0$.
The Gauss-Newton method is a simplification of the latter approach for problems that are almost consistent or mildly nonlinear, and for which the second term in the Hessian $\nabla^2 f(x)$ can be safely neglected. The Levenberg-Marquardt method can be considered a variant of the Gauss-Newton method in which stabilization (in the form of Tikhonov regularization, cf. Section 10) is applied to the linearized steps in order to solve problems with ill-conditioned or rank-deficient Jacobian $J(x)$.
Due to the local character of the Taylor approximation one can only
obtain local convergence, in general. In order to get global convergence, the
methods need to be combined with a line search. Global convergence means
convergence to a local minimum from any initial point. For convergence
to a global minimum one needs to resort, for instance, to a Monte Carlo
technique that provides multiple random initial points, or to other costlier
methods [191].
We will first describe the different methods and then consider the common difficulties, such as how to start and end and how to ensure descent at each step.
9.1 Newton's method
If we assume that $f(x)$ is twice continuously differentiable, then we can use Newton's method to solve the system of nonlinear equations
$$\nabla f(x) = J(x)^T r(x) = 0,$$
which provides local stationary points for $f(x)$. Written in terms of derivatives of $r(x)$ and starting from an initial guess $x_0$, this version of the Newton iteration takes the form
$$\begin{aligned}
x_{k+1} &= x_k - \bigl( \nabla^2 f(x_k) \bigr)^{-1} \nabla f(x_k) \\
&= x_k - \bigl( J(x_k)^T J(x_k) + S(x_k) \bigr)^{-1} J(x_k)^T r(x_k), \qquad k = 0, 1, 2, \ldots
\end{aligned}$$
where $S(x_k)$ denotes the matrix
$$S(x_k) = \sum_{i=1}^{m} r_i(x_k) \, \nabla^2 r_i(x_k). \qquad (9.1.1)$$
As usual, no inverse is calculated to obtain a new iterate, but rather a linear system is solved by a direct or iterative method to obtain the correction $\Delta x_k = x_{k+1} - x_k$:
$$\bigl( J(x_k)^T J(x_k) + S(x_k) \bigr) \, \Delta x_k = -J(x_k)^T r(x_k). \qquad (9.1.2)$$
The method is locally quadratically convergent as long as $\nabla^2 f(x)$ is Lipschitz continuous and positive definite in a neighborhood of $x^*$. This follows from a simple adaptation of the Newton convergence theorem in [66], which leads to the result
$$\| x_{k+1} - x^* \|_2 \le \gamma \, \| x_k - x^* \|_2^2 , \qquad k = 0, 1, 2, \ldots .$$
The constant $\gamma$ is a measure of the nonlinearity of $f(x)$; it depends on the Lipschitz constant for $\nabla^2 f(x)$ and a bound for $\| \nabla^2 f(x^*)^{-1} \|_2^2$, but the size of the residual does not appear. The convergence rate will depend on the nonlinearity, but the convergence itself will not. The foregoing result implies that Newton's method is usually very fast in the final stages, close to the solution.
In practical applications Newton's iteration may not be performed exactly and Theorems 4.22 and 4.30 in [143, 183] give convergence results when the errors are controlled appropriately.
Note that the Newton correction $\Delta x_k$ will be a descent direction (as explained in the previous chapter), as long as $\nabla^2 f(x_k) = J(x_k)^T J(x_k) + S(x_k)$ is positive definite. In fact, using the definition of positive definite matrices one obtains, by multiplying both sides of (9.1.2) by $\Delta x_k^T$,
$$0 < \Delta x_k^T \bigl( J(x_k)^T J(x_k) + S(x_k) \bigr) \Delta x_k = -\Delta x_k^T J(x_k)^T r(x_k).$$
Therefore, the correction $\Delta x_k$ is in the same half-space as the steepest descent direction $-J(x_k)^T r(x_k)$. Although the Newton correction is a descent direction, the step size may be too large, since the linear model is only locally valid. Therefore, to ensure convergence, Newton's method is used with step-length control to produce a more robust algorithm.
One of the reasons why Newton's method is not used more frequently for nonlinear least squares problems is that its good convergence properties come at a price: the $mn^2$ derivatives appearing in $S(x_k)$ must be computed! This can be expensive and often the derivatives are not even available and thus must be substituted by finite differences. Special cases where Newton's method is a good option are when $S(x_k)$ is sparse, which happens frequently if $J(x_k)$ is sparse, or when $S(x_k)$ involves exponentials or trigonometric functions that are easy to compute. Also, if one has access to the code that calculates the model, automatic differentiation can be used [111].
If the second derivatives term in $\nabla^2 f(x_k) = J(x_k)^T J(x_k) + S(x_k)$ is unavailable or too expensive to compute and hence approximated, the resulting algorithm is called a quasi-Newton method and although the convergence will no longer be quadratic, superlinear convergence can be attained. It is important to point out that only the second term of $\nabla^2 f(x_k)$ needs to be approximated since the first term $J(x_k)^T J(x_k)$ has already been computed. A successful strategy is to approximate $S(x_k)$ by a secant-type term, using updated gradient evaluations:
$$S(x_k) \approx \sum_{i=1}^{m} r_i(x_k) \, \widetilde{G}_i(x_k),$$
where the Hessian terms $\nabla^2 r_i(x_k)$ are approximated from the condition
$$\widetilde{G}_i(x_k) \, (x_k - x_{k-1}) = \nabla r_i(x_k) - \nabla r_i(x_{k-1}).$$
This condition would determine a secant approximation for $n = 1$, but in the higher-dimensional case it must be complemented with additional requirements on the matrix $\widetilde{G}_i(x_k)$: it must be symmetric and it must satisfy the so-called least-change secant condition, i.e., that it be most similar to the approximation $\widetilde{G}_i(x_{k-1})$ in the previous step. In [67], local superlinear convergence is proved, under the assumptions of Lipschitz continuity and bounded inverse of $\nabla^2 f(x)$.
9.2 The Gauss-Newton method
If the problem is only mildly nonlinear or if the residual at the solution (and therefore in a reasonable-sized neighborhood) is small, a good alternative to Newton's method is to neglect the second term $S(x_k)$ of the Hessian altogether. The resulting method is referred to as the Gauss-Newton method, where the computation of the step $\Delta x_k$ involves the solution of the linear system
$$\bigl( J(x_k)^T J(x_k) \bigr) \, \Delta x_k = -J(x_k)^T r(x_k) \qquad (9.2.1)$$
and $x_{k+1} = x_k + \Delta x_k$.
Note that in the full-rank case these are actually the normal equations for the linear least squares problem
$$\min_{\Delta x_k} \| J(x_k) \, \Delta x_k - (-r(x_k)) \|_2 \qquad (9.2.2)$$
and thus $\Delta x_k = -J(x_k)^\dagger r(x_k)$. By the same argument as used for the Newton method, this is a descent direction if $J(x_k)$ has full rank.
We note that the linear least squares problem in (9.2.2), which defines the Gauss-Newton direction $\Delta x_k$, can also be derived from the principle of local approximation discussed in the previous chapter. When we linearize the residual vector $r(x_k)$ in the $k$th step we obtain the approximation in (8.2.3) with $x = x_k$, which is identical to (9.2.2).
The convergence properties of the Gauss-Newton method can be summarized as follows [66] (see also [184] for an early proof of convergence):
Theorem 91. Assume that
• $f(x)$ is twice continuously differentiable in an open convex set $\mathcal{D}$.
• $J(x)$ is Lipschitz continuous with constant $\gamma$, and $\| J(x) \|_2 \le \alpha$.
• There is a stationary point $x^* \in \mathcal{D}$, and
• for all $x \in \mathcal{D}$ there exists a constant $\sigma$ such that
$$\| (J(x) - J(x^*))^T r(x^*) \|_2 \le \sigma \, \| x - x^* \|_2 .$$
If $\sigma$ is smaller than $\lambda$, the smallest eigenvalue of $J(x^*)^T J(x^*)$, then for any $c \in (1, \lambda/\sigma)$ there exists a neighborhood of $x^*$ so that the Gauss-Newton sequence converges linearly to $x^*$ starting from any initial point $x_0$ in that neighborhood:
$$\| x_{k+1} - x^* \|_2 \le \frac{c \sigma}{\lambda} \, \| x_k - x^* \|_2 + \frac{c \alpha \gamma}{2 \lambda} \, \| x_k - x^* \|_2^2$$
and
$$\| x_{k+1} - x^* \|_2 \le \frac{c \sigma + \lambda}{2 \lambda} \, \| x_k - x^* \|_2 < \| x_k - x^* \|_2 .$$
If the problem is consistent, i.e., $r(x^*) = 0$, there exists a (maybe smaller) neighborhood where the convergence will be quadratic.
The Gauss-Newton method can produce the whole spectrum of convergence: if $S(x_k) = 0$, or is very small, the convergence can be quadratic, but if $S(x_k)$ is too large there may not even be local convergence. The important constant is $\sigma$, which can be considered as an approximation of $\| S(x^*) \|_2$ since, for $x$ sufficiently close to $x^*$, it can be shown that
$$\| (J(x) - J(x^*))^T r(x^*) \|_2 \approx \| S(x^*) \|_2 \, \| x - x^* \|_2 .$$
The ratio $\sigma/\lambda$ must be less than 1 for convergence. The rate of convergence is inversely proportional to the size of the nonlinearity or the residual, i.e., the larger $\| S(x_k) \|_2$ is in comparison to $\| J(x_k)^T J(x_k) \|_2$, the slower the convergence.
As we saw above, the Gauss-Newton method produces a direction $\Delta x_k$ that is of descent, but due to the local character of the underlying approximation, the step length may be incorrect, i.e., $f(x_k + \Delta x_k) > f(x_k)$. This suggests a way to improve the algorithm, namely, to use damping, which is a simple mechanism that controls the step length to ensure a sufficient reduction of the function. An alternative is to use a trust-region strategy to define the direction; this leads to the Levenberg-Marquardt algorithm discussed later on.
Algorithm damped Gauss-Newton
• Start with an initial guess $x_0$ and iterate for $k = 0, 1, 2, \ldots$
• Solve $\min_{\Delta x_k} \| J(x_k) \, \Delta x_k + r(x_k) \|_2$ to compute the correction $\Delta x_k$.
• Choose a step length $\alpha_k$ so that there is enough descent.
• Calculate the new iterate: $x_{k+1} = x_k + \alpha_k \Delta x_k$.
• Check for convergence.
Several methods for choosing the step-length parameter $\alpha_k$ have been proposed and the key idea is to ensure descent by the correction step $\alpha_k \Delta x_k$. One popular choice is the Armijo condition (see Section 9.4), which uses a constant $\alpha_k \in (0, 1)$ to ensure enough descent in the value of the objective:
$$\begin{aligned}
f(x_k + \alpha_k \Delta x_k) &< f(x_k) + \alpha_k \nabla f(x_k)^T \Delta x_k \\
&= f(x_k) + \alpha_k \, r(x_k)^T J(x_k) \, \Delta x_k . \qquad (9.2.3)
\end{aligned}$$
This condition ensures that the reduction is (at least) proportional to both the parameter $\alpha_k$ and the directional derivative $\nabla f(x_k)^T \Delta x_k$.
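The damped iteration can be sketched for a scalar problem, here the one-parameter exponential model $e^{xt}$ fitted to the three points $(1, 2)$, $(2, 4)$, $(3, -8)$. The sketch makes several assumptions that are not prescribed above: the step length is found by simple halving, the Armijo test is used in its common form with an explicit sufficient-decrease constant `c` multiplying the directional derivative, and the tolerance, iteration cap and function names are illustrative.

```python
import math

T = [1.0, 2.0, 3.0]
Y = [2.0, 4.0, -8.0]            # data with a large residual at the minimizer

def f(x):
    """Objective 0.5 * sum of squared residuals for the model exp(x*t)."""
    return 0.5 * sum((math.exp(x * t) - y) ** 2 for t, y in zip(T, Y))

def grad_and_jtj(x):
    """Gradient J^T r and the scalar J^T J."""
    g = sum((math.exp(x * t) - y) * t * math.exp(x * t) for t, y in zip(T, Y))
    jtj = sum((t * math.exp(x * t)) ** 2 for t in T)
    return g, jtj

def damped_gauss_newton(x, c=1e-4, tol=1e-6, max_iter=1000):
    for k in range(max_iter):
        g, jtj = grad_and_jtj(x)
        if abs(g) <= tol:                      # stop on a small gradient
            return x, k, True
        dx = -g / jtj                          # Gauss-Newton correction
        alpha, fx = 1.0, f(x)
        # Backtracking: halve alpha until the (assumed) Armijo test holds
        while f(x + alpha * dx) > fx + c * alpha * g * dx and alpha > 1e-12:
            alpha *= 0.5
        x += alpha * dx
    return x, max_iter, False

x_hat, iters, ok = damped_gauss_newton(-0.7)
print(x_hat, iters, ok)
```

Because every accepted step decreases $f$, the iterates stay in a bounded level set around the minimizer, even though the full Gauss-Newton step would overshoot on this data set.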
Using the two properties above, a descent direction and an appropriate step length, the damped Gauss-Newton method is locally convergent and often globally convergent. Still, the convergence rate may be slow for problems for which the standard algorithm is inadequate. Also, the inefficiency of the Gauss-Newton method when applied to problems with ill-conditioned or rank-deficient Jacobian has not been addressed; the next algorithm, Levenberg-Marquardt, will consider this issue.

   y       x*        f(x*)     J(x*)^T J(x*)    S(x*)
   8     0.6932      0           644.00         0
   3     0.4401      1.639       151.83         9.0527
  -1     0.0447      6.977        17.6492       8.3527
  -8    -0.7915     41.145         0.4520       2.9605

Table 9.1: Data for the test problem for four values of the scalar $y$. Note that $x^*$, $J(x^*)$ and $S(x^*)$ are scalars in this example.
Example 92. The following example from [66] will clarify the above convergence results. We fit the one-parameter model $M(x, t) = e^{xt}$ to the data set
$$(t_1, y_1) = (1, 2), \qquad (t_2, y_2) = (2, 4), \qquad (t_3, y_3) = (3, y),$$
where $y$ can take one of the four values $8, 3, -1, -8$. The least squares problem is, for every $y$, to determine the single parameter $x$ that minimizes the function
$$f(x) = \tfrac{1}{2} \sum_{i=1}^{3} r_i(x)^2 = \tfrac{1}{2} \bigl[ (e^{x} - 2)^2 + (e^{2x} - 4)^2 + (e^{3x} - y)^2 \bigr].$$
For this function of a single variable $x$, both the simplified and full Hessian are scalars:
$$J(x)^T J(x) = \sum_{i=1}^{3} (t_i e^{x t_i})^2 \qquad \text{and} \qquad S(x) = \sum_{i=1}^{3} r_i(x) \, t_i^2 e^{x t_i},$$
where $r_i(x) = e^{x t_i} - y_i$ as in the expression for $f(x)$. Table 9.1 lists, for each of the four values of $y$, the minimizer $x^*$ and values of several functions at the minima.
We use the Newton and Gauss-Newton methods with several starting values $x_0$. Table 9.2 lists the different convergence behaviors for every case, with two different starting values. The stopping criterion for the iterations was $|\nabla f(x_k)| \le 10^{-10}$. Note that for the consistent problem with $y = 8$, Gauss-Newton achieves quadratic convergence, since $S(x^*) = 0$. For $y = 3$, the convergence factor for Gauss-Newton is $\sigma/\lambda \approx 0.06$, which is small compared to $0.47$ for $y = -1$, although for this value of $y$ there is still linear but slow convergence. For $y = -8$ the ratio is $6.5 \gg 1$ and there is no convergence.

                  Newton                 Gauss-Newton
   y     x_0     # iter.   rate         # iter.   rate
   8     1         7       quadratic      5       quadratic
         0.6       6       quadratic      4       quadratic
   3     1         9       quadratic     12       linear
         0.5       5       quadratic      9       linear
  -1     1        10       quadratic     34       linear (slow)
         0         4       quadratic     32       linear (slow)
  -8     1        12       quadratic     no convergence
        -0.7       4       quadratic     no convergence

Table 9.2: Convergence behavior for the four cases and two different starting values.

Figure 9.2.1: Single-parameter exponential fit example: the scalars $J(x)^T J(x)$ (left) and $\nabla^2 f(x)$ (right) as functions of $x$ and $y$ (the white lines are level curves). The second term in $\nabla^2 f(x)$ has increasing importance as $y$ decreases.
To support the above results, Figure 9.2.1 shows plots of the Hessian $\nabla^2 f(x)$ and its first term $J(x)^T J(x)$ (recall that both are scalars) as functions of $x$ and $y$, for $x \in [-0.7, 0.7]$ and $y \in [-8, 8]$. We see that the deterioration of the convergence rate of the Gauss-Newton method, compared to that of Newton's method, indeed coincides with the increasing importance of the term $S(x)$ in $\nabla^2 f(x)$, due to larger residuals at the solution as $y$ decreases.
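The behavior described in this example can be explored with a short script. The sketch below implements the scalar Newton and Gauss-Newton iterations for the model $e^{xt}$ and reports, for each data set, the iteration counts and the ratio $|S(x^*)| / (J(x^*)^T J(x^*))$ evaluated at the Newton solution. The function names and the iteration cap are assumptions, and the iteration counts it prints are not claimed to reproduce Table 9.2 exactly.

```python
import math

T = [1.0, 2.0, 3.0]

def derivatives(x, y3):
    """Return the gradient, J^T J and S for f(x) = 0.5*sum((exp(x*t_i) - y_i)^2)."""
    g = jtj = s = 0.0
    for t, y in zip(T, [2.0, 4.0, y3]):
        e = math.exp(x * t)
        r = e - y
        g += r * t * e
        jtj += (t * e) ** 2
        s += r * t * t * e
    return g, jtj, s

def iterate(x, y3, use_s, tol=1e-10, max_iter=100):
    """Newton (use_s=True) or Gauss-Newton (use_s=False), without damping."""
    for k in range(max_iter):
        try:
            g, jtj, s = derivatives(x, y3)
        except OverflowError:                  # iterates ran away
            return x, k, False
        if abs(g) <= tol:
            return x, k, True
        x -= g / (jtj + s if use_s else jtj)
    return x, max_iter, False

for y3, x0 in [(8.0, 0.6), (3.0, 0.5), (-1.0, 0.0), (-8.0, -0.7)]:
    x_n, k_n, ok_n = iterate(x0, y3, use_s=True)
    x_g, k_g, ok_g = iterate(x0, y3, use_s=False)
    g, jtj, s = derivatives(x_n, y3)
    print(y3, x0, (k_n, ok_n), (k_g, ok_g), abs(s) / jtj)
```

A ratio below 1 goes together with linear convergence of Gauss-Newton, while for the last data set the ratio exceeds 1 and the undamped Gauss-Newton iterates do not settle at the minimizer.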
For large-scale problems, where memory and computing time are limiting factors, it may be infeasible to solve the linear LSQ problem for the correction $\Delta x_k$ to high accuracy in each iteration. Instead one can use one of the iterative methods discussed in Chapter 6 to compute an approximate step $\Delta x_k$, by terminating the iterations once the approximation is accurate enough, e.g., when the residual in the normal equations is sufficiently small:
$$\| J(x_k)^T J(x_k) \, \Delta x_k + J(x_k)^T r(x_k) \|_2 < \tau \, \| J(x_k)^T r(x_k) \|_2 ,$$
for some $\tau \in (0, 1)$. In this case, the algorithm is referred to as an inexact Gauss-Newton method. For these nested iterations an approximate Newton method theory applies [61, 143, 183].
9.3 The Levenberg-Marquardt method
While the Gauss-Newton method can be used for ill-conditioned problems, it is not efficient, as one can see from the above convergence relations when $\lambda \to 0$. The main difficulty is that the correction $\Delta x_k$ is too large and goes in a bad direction that gives little reduction in the function. For such problems, it is common to add an inequality constraint to the linear least squares problem (9.2.2) that determines the step, namely, that the length of the step $\| \Delta x_k \|_2$ should be bounded by some constant. This so-called trust region technique improves the quality of the step.
As we saw in the previous chapter, we can handle such an inequality constraint via the use of a Lagrange multiplier and thus replace the problem in (9.2.2) with a problem of the form
$$\min_{\Delta x_k} \bigl\{ \| J(x_k) \, \Delta x_k + r(x_k) \|_2^2 + \lambda_k \| \Delta x_k \|_2^2 \bigr\},$$
where $\lambda_k > 0$ is the Lagrange multiplier for the constraint at the $k$th iteration. There are two equivalent forms of this problem, either the modified normal equations
$$\bigl( J(x_k)^T J(x_k) + \lambda_k I \bigr) \, \Delta x_k = -J(x_k)^T r(x_k) \qquad (9.3.1)$$
or the modified least squares problem
$$\min_{\Delta x_k} \left\| \begin{pmatrix} J(x_k) \\ \lambda_k^{1/2} I \end{pmatrix} \Delta x_k - \begin{pmatrix} -r(x_k) \\ 0 \end{pmatrix} \right\|_2 . \qquad (9.3.2)$$
The latter is best suited for numerical computations. This approach, which also handles a rank-deficient Jacobian $J(x_k)$, leads to the Levenberg-Marquardt method, which takes the following form:
Algorithm Levenberg-Marquardt
• Start with an initial guess $x_0$ and iterate for $k = 0, 1, 2, \ldots$
• At each step $k$ choose the Lagrange multiplier $\lambda_k$.
• Solve (9.3.1) or (9.3.2) for $\Delta x_k$.
• Calculate the next iterate $x_{k+1} = x_k + \Delta x_k$.
• Check for convergence.
The parameter $\lambda_k$ influences both the direction and the length of the step $\Delta x_k$. Depending on the size of $\lambda_k$, the correction $\Delta x_k$ can vary from a Gauss-Newton step for $\lambda_k = 0$, to a very short step approximately in the steepest descent direction for large values of $\lambda_k$. As we see from these considerations, the LM parameter acts similarly to the step control for the damped Gauss-Newton method, but it also changes the direction of the correction.
The Levenberg-Marquardt step can be interpreted as solving the normal equations used in the Gauss-Newton method, but shifted by a scaled identity matrix, so as to convert the problem from having an ill-conditioned (or positive semidefinite) matrix $J(x_k)^T J(x_k)$ into a positive definite one. Notice that the positive definiteness implies that the Levenberg-Marquardt direction is always of descent and therefore the method is well defined.
Another way of looking at the Levenberg-Marquardt iteration is to consider the matrix $\lambda_k I$ as an approximation to the second derivative term $S(x_k)$ that was neglected in the definition of the Gauss-Newton method.
The local convergence of the Levenberg-Marquardt method is summarized in the following theorem.
Theorem 93. Assume the same conditions as in Theorem 91 and in addition assume that the Lagrange multipliers $\lambda_k$ for $k = 0, 1, 2, \ldots$ are nonnegative and bounded by $b > 0$. If $\sigma < \lambda$, then for any $c \in (1, (\lambda + b)/(\sigma + b))$, there exists an open ball $D$ around $x^*$ such that the Levenberg-Marquardt sequence converges linearly to $x^*$, starting from any initial point $x_0 \in D$:
$$\| x_{k+1} - x^* \|_2 \le \frac{c (\sigma + b)}{\lambda + b} \, \| x_k - x^* \|_2 + \frac{c \alpha \gamma}{2 (\lambda + b)} \, \| x_k - x^* \|_2^2$$
and
$$\| x_{k+1} - x^* \|_2 \le \frac{c (\sigma + b) + (\lambda + b)}{2 (\lambda + b)} \, \| x_k - x^* \|_2 < \| x_k - x^* \|_2 .$$
If $r(x^*) = 0$ and $\lambda_k = O\bigl( \| J(x_k)^T r(x_k) \|_2 \bigr)$, then the iterates $x_k$ converge quadratically to $x^*$.
Within the trust-region framework introduced above, there are several general step-length determination techniques, see, for example, [170]. Here we give the original strategy devised by Marquardt for the choice of the parameter $\lambda_k$, which is simple and often works well. The underlying principles are
• The initial value $\lambda_0$ should be of the same size as $\| J(x_0)^T J(x_0) \|_2$.
• For subsequent steps, an improvement ratio is defined as in the trust region approach:
$$\rho_k = \frac{\text{actual reduction}}{\text{predicted reduction}} = \frac{f(x_k) - f(x_{k+1})}{-\tfrac{1}{2} \Delta x_k^T \bigl( J(x_k)^T r(x_k) - \lambda_k \Delta x_k \bigr)} .$$
Here, the denominator is the reduction in $f$ predicted by the local linear model. If $\rho_k$ is large, then the pure Gauss-Newton model is good enough, so $\lambda_{k+1}$ can be made smaller than at the previous step. If $\rho_k$ is small (or even negative), then a short, steepest-descent step should be used, i.e., $\lambda_{k+1}$ should be increased. Marquardt's updating strategy is widely used with some variations in the thresholds.
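Algorithm Levenberg-Marquardt's parameter updating
• If $\rho_k > 0.75$, then $\lambda_{k+1} = \lambda_k / 3$.
• If $\rho_k < 0.25$, then $\lambda_{k+1} = 2 \lambda_k$.
• Otherwise use $\lambda_{k+1} = \lambda_k$.
• If $\rho_k > 0$, then perform the update $x_{k+1} = x_k + \Delta x_k$.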
A detailed description can be found in [154].
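The parameter-updating rules can be sketched for the scalar exponential test problem with data $(1, 2)$, $(2, 4)$, $(3, -8)$. In the scalar case (9.3.1) reduces to a single division. The gradient tolerance, the iteration cap and the function names are illustrative assumptions; this is not the MINPACK-1 implementation, and the iteration counts need not match those reported for the book's experiments.

```python
import math

T = [1.0, 2.0, 3.0]
Y = [2.0, 4.0, -8.0]            # large-residual data set

def f(x):
    """Objective 0.5 * sum of squared residuals for the model exp(x*t)."""
    return 0.5 * sum((math.exp(x * t) - y) ** 2 for t, y in zip(T, Y))

def grad_and_jtj(x):
    """Gradient J^T r and the scalar J^T J."""
    g = sum((math.exp(x * t) - y) * t * math.exp(x * t) for t, y in zip(T, Y))
    jtj = sum((t * math.exp(x * t)) ** 2 for t in T)
    return g, jtj

def levenberg_marquardt(x, tol=1e-6, max_iter=500):
    lam = grad_and_jtj(x)[1]                   # lambda_0 of the size of J^T J
    for k in range(max_iter):
        g, jtj = grad_and_jtj(x)
        if abs(g) <= tol:
            return x, k, True
        dx = -g / (jtj + lam)                  # scalar form of (9.3.1)
        predicted = -0.5 * dx * (g - lam * dx) # reduction predicted by the model
        rho = (f(x) - f(x + dx)) / predicted   # improvement ratio
        if rho > 0.0:                          # accept only improving steps
            x += dx
        if rho > 0.75:
            lam /= 3.0
        elif rho < 0.25:
            lam *= 2.0
    return x, max_iter, False

x_hat, iters, ok = levenberg_marquardt(-0.7)
print(x_hat, iters, ok)
```

Rejected steps leave the iterate unchanged and only increase the multiplier, which is how the method recovers from the overshooting of the pure Gauss-Newton step on this data set.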
The software package MINPACK-1 available from Netlib [263] includes
a robust implementation based on the Levenberg-Marquardt algorithm.
Similar to the Gauss-Newton method, we have inexact versions of the Levenberg-Marquardt method, where the modified least squares problem (9.3.2) for the correction is solved only to sufficient accuracy, i.e., for some $\tau \in (0, 1)$ we accept the solution if:
$$\| (J(x_k)^T J(x_k) + \lambda_k I) \, \Delta x_k + J(x_k)^T r(x_k) \|_2 < \tau \, \| J(x_k)^T r(x_k) \|_2 .$$
If this system is to be solved for several values of the Lagrange parameter $\lambda_k$, then the bidiagonalization strategy from Section 5.5 can be utilized such that only one partial bidiagonalization needs to be performed; for more details see [121, 260].
Example 94. We return to Example 92, and this time we use the
Levenberg-Marquardt method with the above parameter-updating algorithm and the
same starting values and stopping criterion as before. Table 9.3 compares
the performance of the Levenberg-Marquardt algorithm with that of the
Gauss-Newton algorithm. For the consistent problem with $y = 8$, the convergence
is still quadratic, but slower. As the residual increases, the advantage of
the Levenberg-Marquardt strategy sets in, and we are now able to solve the
large-residual problem for $y = -8$, although the convergence is very slow.
               Gauss-Newton               Levenberg-Marquardt
  y    x_0     # iter.   rate             # iter.   rate
  8    1       5         quadratic        10        quadratic
       0.6     4         quadratic        7         quadratic
  3    1       12        linear           13        linear
       0.5     9         linear           10        linear
 -1    1       34        linear (slow)    26        linear (slow)
       0       32        linear (slow)    24        linear (slow)
 -8    1       no convergence             125       linear (very slow)
       0.7     no convergence             120       linear (very slow)
Table 9.3: Comparison of the convergence behavior of the Gauss-Newton
and Levenberg-Marquardt methods for the same test problem as in Table
9.2.
Figure 9.3.1: Number of iterations to reach an absolute accuracy of $10^{-6}$
in the solution by three different NLLSQ methods. Each histogram shows
results for a particular method and a particular standard deviation of the
perturbation of the starting point (top: 0.1, middle: 0.2, bottom: 0.3).
Example 95. This example is mainly concerned with a comparison of the
robustness of the iterative methods with regard to the initial guess $x_0$. We
use the Gaussian model from Examples 83 and 86 with noise level 0.1,
and we use three different algorithms to solve the NLLSQ problem:
• The standard Gauss-Newton method.
• An implementation of the damped Gauss-Newton algorithm from MATLAB's
Optimization Toolbox, available via the options
LargeScale=off and LevenbergMarquardt=off in the function lsqnonlin.
• An implementation of the Levenberg-Marquardt algorithm from MATLAB's
Optimization Toolbox, available via the options
Jacobian=on and Algorithm=levenberg-marquardt in the
function lsqnonlin.
The starting guess was chosen equal to the exact parameters $(2.2, 0.26, 0.2)^T$
plus a Gaussian perturbation with a prescribed standard deviation. We used the
standard deviations 0.1, 0.2 and 0.3, and for each value we created 500
different realizations of the initial point. Figure 9.3.1 shows the number of
iterations in the form of histograms; notice the different axes in the nine
plots. We give the number of iterations necessary to reach an absolute
accuracy of $10^{-6}$, compared to a reference solution computed with much
higher accuracy. The italic numbers in the three left plots are the number of
times the Gauss-Newton method did not converge.
We see that when the standard Gauss-Newton method converges, it con-
verges rapidly but also that it may not converge. Moreover, as the starting
point moves farther from the minimizer, the number of instances of non-
convergence increases dramatically.
The damped Gauss-Newton method used here always converged, thus was
much more robust than the undamped version. For starting points close to
the minimum, it may converge quickly, but it may also require about 40
iterations. For starting points further away from the solution, this method
always uses about 40 iterations.
The Levenberg-Marquardt method used here is also robust, in that it
converges for all starting points. In fact, for starting points close to the
solution it requires, on the average, fewer iterations than the damped
Gauss-Newton method. For starting points farther from the solution it still
converges, but requires many more iterations. The main advantage of the
Levenberg-Marquardt algorithm, namely, to handle ill-conditioned and
rank-deficient Jacobian matrices, does not come into play here as the
particular problem is well conditioned.
We emphasize that this example should not be considered as representative
for these methods in general; rather, it is an example of the performance
of specific implementations for one particular (and well-conditioned)
problem.
Figure 9.3.2: Convergence histories from the solution of a particular
NLLSQ problem with three parameters by means of the standard Gauss-Newton
(top) and the Levenberg-Marquardt algorithm (bottom). Each figure shows the
behavior of pairs of the three components of the iterates $x_k$ together with
level sets of the objective function.
Example 96. We illustrate the robustness of the Levenberg-Marquardt
algorithm by solving the same test problem as in the previous example by
means of the standard Gauss-Newton and Levenberg-Marquardt algorithms.
The progress of a typical convergence history for these two methods is shown
in Figure 9.3.2. There are three unknowns in the model, and the starting
point is $x_0 = (1, 0.3, 0.4)^T$. The top figures show different component
pairs of the iterates $x_k$ for $k = 1, 2, \ldots$ for the Gauss-Newton
algorithm (for example, the leftmost plot shows the third component of $x_k$
versus its second component). Clearly, the Gauss-Newton iterates overshoot
before they finally converge to the LSQ solution. The iterates of the
Levenberg-Marquardt algorithm are shown in the bottom figures, and we see that
they converge much faster toward the LSQ solution without any big detour.
Characteristics                Newton      G-N              L-M
Ill-conditioned Jacobian       yes         yes (but slow)   yes
Rank-deficient Jacobian        yes         no               yes
Convergence, S(x_k) = 0        quadratic   quadratic        quadratic
Convergence, S(x_k) small      quadratic   linear           linear
Convergence, S(x_k) large      quadratic   slow or none     slow or none
Table 9.4: Comparison of some properties of the Newton, Gauss-Newton
(G-N) and Levenberg-Marquardt (L-M) algorithms for nonlinear least
squares problems.
9.4 Additional considerations and software
An overall comparison between the methods discussed so far is given in
Table 9.4. The entries "yes" and "no" indicate whether the corresponding
algorithm is appropriate or not for the particular problem. Below we comment
on some further issues that are relevant for these methods.
Hybrid methods
During the iterations we do not know whether we are in the region where
the convergence conditions of a particular method hold, so the idea in
hybrid methods is to combine a fast method, such as Newton (if second
derivatives are available), with a safe method such as steepest descent.
One hybrid strategy is to combine Gauss-Newton or Levenberg-Marquardt,
which in general are only linearly convergent, with a superlinearly convergent
quasi-Newton method, where $S(x_k)$ is approximated by a secant term.
An example of this efficient hybrid method is the algorithm NL2SOL by
Dennis, Gay and Welsch [65]. It contains a trust-region strategy for global
convergence and uses Gauss-Newton and Levenberg-Marquardt steps
initially, until it has enough good second-order information. Its performance
for large and very nonlinear problems is somewhat better than
Levenberg-Marquardt, in that it requires fewer iterations. For more details
see [66].
Starting and stopping
In many cases there will be no good a priori estimates available for the
initial point $x_0$. As mentioned in the previous section, nonlinear least
squares problems are usually non-convex and may have several local minima.
Therefore some global optimization technique must be used to escape from
undesired local minima if the global minimum is required. One possibility is
a simple Monte Carlo strategy, in which multiple initial points are chosen
at random and the least squares algorithm is started several times. It is
hoped that some of these iterations will converge, and one can then choose
the best solution. See, e.g., [131] for an overview of global optimization
algorithms.
In order to make such a process more efficient, especially when function
evaluations are costly, a procedure has been described in [191] that saves
all iterates and confidence radii. An iteration is stopped if an iterate
lands in a previously calculated iterate's confidence sphere. The assumption
underlying this decision is that the iteration will lead to a minimum already
calculated. The algorithm is trivially parallelizable and runs very efficiently
in a distributed network. Up to 40% savings have been observed from using
the early termination strategy. This simple approach can, however, only be
applied to low-dimensional problems.
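The plain Monte Carlo multistart strategy can be sketched as follows. This is a minimal illustration under several assumptions: the local solver is an undamped Gauss-Newton iteration with a fixed number of steps, the sampling box `lower`/`upper`, the function names and the one-parameter test problem (with a zero-residual minimum at $x = 1$ and a second local minimum near $x = -1$) are all hypothetical, and the early-termination test with confidence spheres from [191] is not included.

```python
import numpy as np

def gauss_newton(r, J, x0, iters=50):
    """Undamped Gauss-Newton used here as a simple local solver."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x + np.linalg.lstsq(J(x), -r(x), rcond=None)[0]
    return x

def multistart(r, J, lower, upper, n_starts=20, seed=0):
    """Run the local solver from random starting points and keep the best result."""
    rng = np.random.default_rng(seed)
    best_x, best_f = None, np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(lower, upper)        # random point in the box
        x = gauss_newton(r, J, x0)
        f = 0.5 * np.sum(r(x) ** 2)
        if np.isfinite(f) and f < best_f:     # discard diverged runs
            best_x, best_f = x, f
    return best_x, best_f

# Hypothetical problem with two local minima (near x = 1 and near x = -1)
r = lambda x: np.array([x[0] ** 2 - 1.0, 0.1 * (x[0] - 1.0)])
J = lambda x: np.array([[2.0 * x[0]], [0.1]])
best_x, best_f = multistart(r, J, np.array([-3.0]), np.array([3.0]))
```

Each start is independent of the others, which is why the strategy parallelizes trivially.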
Several criteria ought to be taken into consideration for stopping the
iterative methods described above.
• The sequence convergence criterion: $\| x_{k+1} - x_k \|_2 \le$ tolerance
(not a particularly good one!).
• The consistency criterion: $| f(x_{k+1}) | \le$ tolerance (relevant only for
problems with $r(x^*) = 0$).
• The absolute function criterion: $\| \nabla f(x_{k+1}) \|_2 \le$ tolerance.
• The maximum iteration count criterion: $k > k_{\max}$.
The absolute function criterion has a special interpretation for NLLSQ
problems, namely, that the residual at x
k+1
is nearly orthogonal to the
subspace generated by the columns of the Jacobian. For the Gauss-Newton
and Levenberg-Marquardt algorithms, the necessary information to check
near-orthogonality is easily available.
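The four criteria can be evaluated together once the residual and Jacobian at the new iterate are available. The following sketch assumes $f(x) = \frac{1}{2}\|r(x)\|_2^2$, so that $\nabla f = J^T r$; the function name `stopping_status`, the single shared tolerance and the dictionary keys are choices made here for illustration only.

```python
import numpy as np

def stopping_status(x_old, x_new, r_new, J_new, k, tol=1e-8, k_max=100):
    """Evaluate the four stopping criteria at the new iterate x_new."""
    f_new = 0.5 * (r_new @ r_new)
    grad = J_new.T @ r_new                    # gradient of f; small means r is
                                              # nearly orthogonal to range(J)
    return {
        "sequence": bool(np.linalg.norm(x_new - x_old) <= tol),
        "consistency": bool(abs(f_new) <= tol),
        "absolute_function": bool(np.linalg.norm(grad) <= tol),
        "max_iter": bool(k > k_max),
    }

# Hypothetical inconsistent problem at its least squares solution:
# the residual is nonzero but orthogonal to the single Jacobian column.
J_demo = np.array([[1.0], [1.0]])
r_demo = np.array([1.0, -1.0])
status = stopping_status(np.array([0.0]), np.array([0.0]), r_demo, J_demo, k=3)
```

In this small demonstration the consistency criterion is not met, while the absolute function criterion is, which matches the remark that the former is relevant only for zero-residual problems.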
Methods for step-length determination
Control of the step length is important for ensuring a robust algorithm that
converges from starting points far from the minimum. Given an iterate $x_k$
and a descent direction $p$, there are two common ways to choose the
step-length $\alpha_k$.
• Take $\alpha_k$ as the solution to the one-dimensional minimization problem
$$ \min_{\alpha} \| r(x_k + \alpha p) \|_2 . $$
This is expensive if one tries to find the exact solution. Fortunately,
this is not necessary, and so-called soft or inexact line search strategies
can be used.
• Inexact line searches: an $\alpha_k$ is accepted if a sufficient descent
condition is satisfied for the new point $x_k + \alpha_k p$ and if $\alpha_k$
is large enough that there is a gain in the step. Use the so-called
Armijo-Goldstein step-length principle, where $\alpha_k$ is chosen as the
largest number from the sequence $\alpha = 1, \frac{1}{2}, \frac{1}{4}, \ldots$
for which the inequality
$$ \| r(x_k) \|_2^2 - \| r(x_k + \alpha_k p) \|_2^2 \ge \tfrac{1}{2} \, \alpha_k \, \| J(x_k) \, p \|_2^2 $$
holds.
For details see, for example, chapter 3 in [170] and for a brief overview see
[20] p. 344.
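A damped Gauss-Newton iteration with the Armijo-Goldstein principle can be sketched as below. The sketch uses the inequality in its squared-norm form, with $p$ taken as the Gauss-Newton direction; the function names, the cap of 30 halvings, the gradient-based stopping test and the small consistent test problem are assumptions made for illustration.

```python
import numpy as np

def armijo_goldstein_step(r, Jx, x, p, max_halvings=30):
    """Largest alpha in 1, 1/2, 1/4, ... satisfying the Armijo-Goldstein test."""
    r0 = r(x)
    half_Jp2 = 0.5 * np.sum((Jx @ p) ** 2)    # 0.5*||J(x) p||_2^2
    alpha = 1.0
    for _ in range(max_halvings):
        r1 = r(x + alpha * p)
        if r0 @ r0 - r1 @ r1 >= alpha * half_Jp2:
            return alpha
        alpha *= 0.5
    return alpha                               # fall back to a very short step

def damped_gauss_newton(r, J, x0, tol=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Jx, rx = J(x), r(x)
        if np.linalg.norm(Jx.T @ rx) <= tol:  # assumed stopping test
            break
        p = np.linalg.lstsq(Jx, -rx, rcond=None)[0]   # Gauss-Newton direction
        x = x + armijo_goldstein_step(r, Jx, x, p) * p
    return x

# Hypothetical test problem: fit exp(x*t) to (t, y) = (1, 2), (2, 4), (3, 8)
t = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 8.0])
r = lambda x: np.exp(x[0] * t) - y
J = lambda x: (t * np.exp(x[0] * t)).reshape(-1, 1)
x_dgn = damped_gauss_newton(r, J, [1.0])
```

Close to a zero-residual solution the full step $\alpha_k = 1$ is accepted, so the fast local convergence of Gauss-Newton is retained.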
Software
In addition to the software already mentioned, the PORT library, avail-
able from AT&T Bell Labs and partly from Netlib [263], has a range of
codes for nonlinear least squares problems, some requiring the Jacobian
and others that need only information about the objective function. The
Gauss-Newton algorithm is not robust enough to be used on its own, but
most of the software for nonlinear least squares that can be found in NAG,
MATLAB and TENSOLVE (using tensor methods) has enhanced Gauss-Newton
codes. MINPACK-1, MATLAB and IMSL contain Levenberg-Marquardt
algorithms. The NEOS Guide on the Web [264] is a source of
information about optimization in general, with a section on NLLSQ. Also,
the National Institute of Standards and Technology Web page at [265] is
very useful, as well as the book [165].
Some software packages for large-scale problems with a sparse Jacobian
are VE10 and LANCELOT. VE10 [264], developed by P. Toint, implements
a line search method, where the search direction is obtained by a trun-
cated conjugate gradient technique. It uses an inexact Newton method for
partially separable nonlinear problems with sparse structure. LANCELOT
[148] is a software package for nonlinear optimization developed by A. Conn,
N. Gould and P. Toint. The algorithm combines a trust-region approach,
adapted to handle possible bound constraints and projected gradient tech-
niques. In addition it has preconditioning and scaling provisions.
9.5 Iteratively reweighted LSQ algorithms for
robust data fitting problems
In Chapter 1 we introduced the robust data fitting problem based on the
principle of M-estimation, leading to the nonlinear minimization problem
$\min_x \sum_{i=1}^m \rho(r_i(x))$, where the function $\rho$ is chosen so that
it gives less weight to residuals $r_i(x)$ with large absolute value. Here we
consider the important case of robust linear data fitting, where the residuals
are the elements of the residual vector $r = b - A x$. Below we describe
several iterative algorithms that, in spite of their differences, are commonly
referred to as iteratively reweighted least squares algorithms due to their
use of a sequence of linear least squares problems with weights that change
during the iterations.
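To derive the algorithms we consider the problem of minimizing the
function
$$ f(x) = \sum_{i=1}^m \rho(r_i(x)), \qquad r = b - A x. $$
We introduce the vector $g \in \mathbb{R}^m$ and the diagonal matrix
$D \in \mathbb{R}^{m \times m}$ with elements defined by
$$ g_i = \rho'(r_i), \qquad d_{ii} = \rho''(r_i), \qquad i = 1, \ldots, m, \tag{9.5.1} $$
where $\rho'$ and $\rho''$ denote the first and second derivatives of $\rho(r)$
with respect to $r$. Then the gradient and the Hessian of $f$ are given by
$$ \nabla f(x) = -A^T g(x) \qquad \text{and} \qquad \nabla^2 f(x) = A^T D(x) \, A, \tag{9.5.2} $$
where the minus sign in the gradient arises because $r = b - A x$, and where
we use the notation $g(x)$ and $D(x)$ to emphasize that these quantities
depend on $x$.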
The original iteratively reweighted least squares algorithm [8] uses a
fixed-point iteration to solve $\nabla f(x) = 0$. From (9.5.1) and (9.5.2) and
introducing the diagonal $m \times m$ weight matrix
$$ W(r) = \mathrm{diag}\bigl( \rho'(r_i) / r_i \bigr) \quad \text{for } i = 1, \ldots, m, \tag{9.5.3} $$
we obtain the nonlinear equations
$$ A^T g(x) = A^T W(r) \, r = A^T W(r) \, (b - A x) = 0 $$
or
$$ A^T W(r) \, A x = A^T W(r) \, b. $$
The fixed-point iteration scheme used in [8] is
$$ x_{k+1} = \bigl( A^T W(r_k) A \bigr)^{-1} A^T W(r_k) \, b
 = \arg\min_x \| W(r_k)^{1/2} (b - A x) \|_2 , \qquad k = 1, 2, \ldots \tag{9.5.4} $$
where $r_k = b - A x_k$ is the residual from the previous iteration. Hence,
each new iterate is the solution of a weighted least squares problem, in
which the weights depend on the solution from the previous iteration. This
algorithm, with seven different choices of the function $\rho$, is implemented
in a Fortran software package described in [50], where more details can be
found.
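The fixed-point scheme (9.5.4) can be sketched as follows. This is not the Fortran package of [50]; it is a minimal illustration that assumes the Huber function, for which the weights are $\rho'(r_i)/r_i = 1$ when $|r_i| \le \beta$ and $\beta/|r_i|$ otherwise. The threshold name `beta`, the relative-step stopping test and the small data set with one outlier are assumptions.

```python
import numpy as np

def huber_weights(r, beta):
    """Weights rho'(r_i)/r_i for the Huber function with threshold beta."""
    a = np.abs(r)
    return np.where(a <= beta, 1.0, beta / np.maximum(a, beta))

def irls(A, b, beta=1.0, tol=1e-10, max_iter=100):
    """Fixed-point iteration (9.5.4): repeated weighted least squares solves."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # ordinary LSQ start
    for _ in range(max_iter):
        sw = np.sqrt(huber_weights(b - A @ x, beta))  # W(r_k)^{1/2}
        x_new = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
        if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
            return x_new
        x = x_new
    return x

# Hypothetical data: fit a constant to four zeros and one outlier
A = np.ones((5, 1))
b = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
x_huber = irls(A, b, beta=1.0)
```

For this data set the ordinary least squares fit is the mean, 2, whereas the Huber fit is 0.25 (the value $c$ that solves $-4c + \beta = 0$), which shows how the outlier is downweighted.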
A faster algorithm for solving the robust linear data fitting problem
(also commonly referred to as an iteratively reweighted least squares
algorithm) is obtained by applying Newton's method from Section 9.1 to the
function $f(x) = \sum_{i=1}^m \rho(r_i(x))$; see [20, Section 4.5.3] for
details. According to (9.5.2), and since $r = b - A x$ gives
$\nabla f(x) = -A^T g(x)$, the Newton iteration is
$$ x_{k+1} = x_k + \bigl( A^T D(x_k) A \bigr)^{-1} A^T g(x_k), \qquad k = 1, 2, \ldots . \tag{9.5.5} $$
We emphasize that the Newton update is not a least squares solution, because
the diagonal matrix $D(x_k)$ weights the matrix $A^T D(x_k) A$ but does not
appear in front of the vector $g(x_k)$.
O'Leary [171] suggests a variant of this algorithm where, instead of
updating the solution vectors, the residual vector is updated as
$$ r_{k+1} = r_k - \alpha_k \, A \bigl( A^T D(x_k) A \bigr)^{-1} A^T g(x_k), \qquad k = 1, 2, \ldots , \tag{9.5.6} $$
and the step-length $\alpha_k$ is determined via line search. Upon convergence
the robust solution $x_k$ is computed from the final residual vector $r_k$ as
the solution to the consistent system $A x = b - r_k$. Five different numerical
methods for computing the search direction in (9.5.6) are compared in [171],
where it is demonstrated that the choice of the best method depends on
the function $\rho$. Wolke and Schwetlick [258] extend the algorithm to also
estimate the parameter that appears in the function $\rho$ and which must
reflect the noise level in the data.
As starting vector for the above iterative methods one can use the
ordinary LSQ solution or, alternatively, the solution to the 1-norm problem
$\min_x \| b - A x \|_1$ (see below).
The iteratively reweighted least squares formalism can also be used to
solve the linear $p$-norm problem $\min_x \| b - A x \|_p$ for $1 < p < 2$.
To see this, we note that
$$ \| b - A x \|_p^p = \sum_{i=1}^m | r_i(x) |^p = \sum_{i=1}^m | r_i(x) |^{p-2} \, r_i(x)^2
 = \| W_p(r)^{1/2} (b - A x) \|_2^2 , $$
where $W_p(r) = \mathrm{diag}( | r_i |^{p-2} )$. The iterations of the
corresponding Newton-type algorithm take the form
$$ x_{k+1} = x_k + \Delta x_k , \qquad k = 1, 2, \ldots , \tag{9.5.7} $$
where $\Delta x_k$ is the solution to the weighted LSQ problem
$$ \min_{\Delta x} \| W_p(r_k)^{1/2} (r_k - A \, \Delta x) \|_2 , \tag{9.5.8} $$
see Björck [20, Section 4.5.2] for details. Experience shows that, for $p$
close to 1, the LSQ problem (9.5.8) tends to become ill-conditioned as the
iterations converge to the robust solution. Special algorithms are needed
for the case $p = 1$; some references are [6, 51, 153].
For early use in geophysical prospecting see [48, 219].
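A minimal sketch of the iteration (9.5.7)-(9.5.8) is given below. The safeguard `eps`, which keeps the weights $|r_i|^{p-2}$ finite when a residual is close to zero, is an assumption added here and is not part of the formulation above; the function name, the stopping test and the test data are likewise illustrative.

```python
import numpy as np

def irls_pnorm(A, b, p=1.5, eps=1e-8, tol=1e-12, max_iter=200):
    """Iteratively reweighted least squares for min ||b - A x||_p, 1 < p < 2."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]           # ordinary LSQ start
    for _ in range(max_iter):
        r = b - A @ x
        # W_p(r)^{1/2} = diag(|r_i|^{(p-2)/2}); eps guards against r_i = 0
        w_sqrt = np.maximum(np.abs(r), eps) ** ((p - 2.0) / 2.0)
        dx = np.linalg.lstsq(w_sqrt[:, None] * A, w_sqrt * r, rcond=None)[0]
        x = x + dx                                      # update (9.5.7)
        if np.linalg.norm(dx) <= tol * (1.0 + np.linalg.norm(x)):
            break
    return x

# Hypothetical data: fit a constant to four zeros and one outlier
A = np.ones((5, 1))
b = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
x_p = irls_pnorm(A, b, p=1.5)
```

For this data and $p = 1.5$ the minimizer $c$ satisfies $4 c^{1/2} = (10 - c)^{1/2}$, i.e., $c = 10/17$, which lies between the 1-norm solution 0 and the least squares solution 2.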
9.6 Variable projection algorithm
We finish this chapter with a brief discussion of algorithms to solve the
separable NLLSQ problem introduced in Section 8.4, where we listed some
important properties of this problem. The simplest method of solution that
takes into account the special structure is the algorithm NIPALS [256],
where an alternating optimization procedure is employed: the linear and
nonlinear parameters are successively fixed and the problem is minimized
over the complementary set. This iteration, however, only converges
linearly.
The variable projection algorithm, developed by Golub and Pereyra
[101], takes advantage instead of the explicit coupling between the
parameters $\alpha$ and $a$, in order to reduce the original minimization
problem to two subproblems that are solved in sequence, without alternation.
The nonlinear subproblem is smaller (although generally more complex) and
involves only the $k$ parameters in $\alpha$. One linear subproblem is solved
a posteriori to determine the $n$ linear parameters in $a$. The difference
from NIPALS is perhaps subtle, but it makes the algorithm much more powerful,
as we will see below.
For the special case when $\phi_j(\alpha, t) = e^{\alpha_j t}$, the problem is
called exponential data fitting, which is a very common and important
application that is notoriously difficult to solve. Fast methods that
originated with Prony [204] are occasionally used, but, unfortunately, the
original method is not suitable for problems with noise. The best modified
versions are from M. Osborne [175, 176, 177], who uses the separability to
achieve better results. As shown in [198], the other method frequently used
with success is the variable projection algorithm. A detailed discussion of
both methods and their relative merits is found in chapter 1 of [198], while
the remaining chapters show a number of applications in very different fields
where either or both methods are used and compared.
The key idea behind the variable projection algorithm is to eliminate
the linear parameters analytically by using the pseudoinverse, solve for
the nonlinear parameters first and then solve a linear LSQ problem for
the remaining parameters. Figure 9.6.1 illustrates the variable projection
principle in action. We depict $I - \Phi(\alpha) \Phi(\alpha)^\dagger$ for a
fixed $\alpha$ as a linear mapping from $\mathbb{R}^n \to \mathbb{R}^m$. That
is, its range is a linear subspace of dimension $n$ (or less, if the matrix is
rank deficient). When $\alpha$ varies, this subspace pivots around the origin.
For each $\alpha$ the residual is equal to the Euclidean distance from the
data vector $y$ to the corresponding subspace. As usual, there may not be any
subspace to which the data belong, i.e., the problem is inconsistent and there
is a nonzero residual at the solution. This residual is related to the $l_2$
approximation ability of the basis functions $\phi_j(\alpha, t)$.
Figure 9.6.1: The geometry behind the variable projection principle. For
each $\alpha_i$ the projection $I - \Phi(\alpha) \Phi(\alpha)^\dagger$ maps
$\mathbb{R}^n$ into a subspace in $\mathbb{R}^m$ (depicted as a line).
There are several important results proved in [101], [227] and [212],
which show that the reduced function is better conditioned than the original
one. Also, we observe that:
• The reduction in dimensionality of the problem has as a consequence
that fewer initial parameters need to be guessed to start a minimization
procedure.
• The algorithm is valid in the rank-deficient case. To guarantee
continuity of the Moore-Penrose generalized inverse, only local constant
rank of $\Phi(\alpha)$ needs to be assumed.
• The reduced nonlinear function, although more complex and therefore
apparently costlier to evaluate, gives rise to a better-conditioned
problem, which always takes fewer iterations to converge than the full
problem [212]. This may include convergence when the Levenberg-Marquardt
algorithm for the full function does not converge.
• By careful implementation of the linear algebra involved and use of
a simplification due to Kaufman [140], the cost per iteration for the
reduced function is similar to that for the full function, and thus
minimization of the reduced problem is always faster. However, in
hard problems this simplification may lead to a less robust algorithm
[172].
The linear problem is easily solved by using the methods discussed in
previous chapters. There are two types of iterative methods to solve the NLLSQ
problem in the variable projection algorithm:
• Derivative-free methods such as PRAXIS [263] require only an
efficient computation of the nonlinear function
$$ f_2(\alpha) = \tfrac{1}{2} \| y - \Phi(\alpha) \Phi(\alpha)^\dagger y \|_2^2
 = \tfrac{1}{2} \| P^\perp_{\Phi(\alpha)} y \|_2^2 . $$
Instead of the more expensive pseudoinverse computation it is possible
to obtain $P^\perp_{\Phi(\alpha)}$ by orthogonally transforming $\Phi(\alpha)$
into trapezoidal form. One then obtains a simple expression for the evaluation
of the function from the same orthogonal transformation when applied to
$y$. For details see [101].
• Methods that need derivatives of
$r(\alpha) = y - \Phi(\alpha) \Phi(\alpha)^\dagger y$. In [101], a formula for
the Fréchet derivative of the orthogonal projector $P_{\Phi(\alpha)}$ was
developed and then used in the Levenberg-Marquardt algorithm, namely:
$$ D r(\alpha) = - \Bigl[ P^\perp_{\Phi(\alpha)} D(\Phi(\alpha)) \, \Phi(\alpha)^\dagger
 + \bigl( P^\perp_{\Phi(\alpha)} D(\Phi(\alpha)) \, \Phi(\alpha)^\dagger \bigr)^T \Bigr] y . $$
Kaufman [140] ignores the second term, producing a saving of up to
25% in computer time, without a significant increase in the number
of iterations. The Levenberg-Marquardt algorithm used in [101] and
[140] starts with an arbitrary $\alpha_0$ and determines the iterates from the
relation:
$$ \alpha_{k+1} = \alpha_k - \begin{pmatrix} D r(\alpha_k) \\ \sqrt{\lambda_k} \, I \end{pmatrix}^{\dagger}
 \begin{pmatrix} r(\alpha_k) \\ 0 \end{pmatrix} . $$
At each iteration, this linear LSQ problem is solved by orthogonal
transformations. Also, the Marquardt parameter $\lambda_k$ can be determined
so that divergence is prevented by enforcing descent:
$$ \| r(\alpha_{k+1}) \|_2 < \| r(\alpha_k) \|_2 . $$
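The derivative-free variant can be sketched as follows: the linear parameters are eliminated by a linear least squares solve, and the reduced function $f_2(\alpha)$ is minimized over the single nonlinear parameter. Everything specific here is an assumption made for illustration: the model $a_1 + a_2 e^{-\alpha t}$, the noise-free synthetic data, and the outer minimizer (a coarse grid followed by golden-section search), which stands in for a method such as PRAXIS and is not the algorithm of [101].

```python
import numpy as np

def phi(alpha, t):
    """Basis matrix Phi(alpha) for the hypothetical model a1 + a2*exp(-alpha*t)."""
    return np.column_stack([np.ones_like(t), np.exp(-alpha * t)])

def reduced_functional(alpha, t, y):
    """Return f_2(alpha) = 0.5*||y - Phi Phi^+ y||_2^2 and a = Phi^+ y."""
    Phi = phi(alpha, t)
    a = np.linalg.lstsq(Phi, y, rcond=None)[0]   # linear parameters
    res = y - Phi @ a                             # projected residual
    return 0.5 * (res @ res), a

def golden_section(fun, lo, hi, tol=1e-10):
    """Derivative-free minimization of a unimodal function on [lo, hi]."""
    gr = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - gr * (hi - lo), lo + gr * (hi - lo)
    fc, fd = fun(c), fun(d)
    while hi - lo > tol:
        if fc < fd:                               # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - gr * (hi - lo)
            fc = fun(c)
        else:                                     # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + gr * (hi - lo)
            fd = fun(d)
    return 0.5 * (lo + hi)

# Noise-free synthetic data with a = (0.5, 2.0) and alpha = 0.3
t = np.linspace(0.0, 10.0, 21)
y = 0.5 + 2.0 * np.exp(-0.3 * t)
f2 = lambda al: reduced_functional(al, t, y)[0]
grid = np.linspace(0.01, 2.0, 200)                # coarse scan, step 0.01
alpha0 = grid[np.argmin([f2(al) for al in grid])]
alpha_hat = golden_section(f2, alpha0 - 0.01, alpha0 + 0.01)
a_hat = reduced_functional(alpha_hat, t, y)[1]    # a posteriori linear solve
```

Only the nonlinear parameter needs a starting guess (here supplied by the grid); the linear parameters follow from one final linear solve.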
The original VARPRO program by Pereyra is listed in a Stanford
University report [100]; there the minimization is done via the Osborne
modification of the Levenberg-Marquardt algorithm. A public domain version,
including modifications and additions by John Bolstad, Linda Kaufman
and Randy LeVeque, can be found in Netlib [263] under the same name.
It incorporates the Kaufman modification, where the second term in the
Fréchet derivative is ignored and the information provided by the program
is used to generate a statistical analysis, including uncertainty bounds for
the estimated linear and nonlinear parameters. Recent work by O'Leary
and Rust indicates that in certain problems the Kaufman modification can
make the algorithm less robust. They present in that work a modern and
more modular implementation. In the PORT library [86] of Netlib there
are careful implementations by Gay and Kaufman of variable projection
versions for the case of unconstrained and constrained separable NLLSQ
problems, including the option of using finite differences to approximate
the derivatives. VARP2 [98, 141] is an extension for problems with multiple
right-hand sides and it is also available in Netlib.
Figure 9.6.2: Data, fit and residuals for the exponential fitting problem.
The VARPRO program was influenced by the state of the art in computing
at the time of its writing, leading to a somewhat convoluted algorithm,
and in a way most of the sequels inherited this approach. The computational
constraints in memory and operation speeds have now been removed,
since for most problems to which the algorithm is applicable on standard
current machines, results are produced quickly, even if multiple runs are
executed in order to try to get a global minimum. In [198, 228] there is a
description of a simplified approach for the case of complex exponentials,
where some efficiency is sacrificed in order to achieve a clearer
implementation that is easier to use and maintain.
Example 97. The following example from [174] illustrates the
competitiveness of the variable projection method, both in speed and
robustness. Given a set of $m = 33$ data points $(t_i, y_i)$ (the data set is
listed in the VARPRO code in Netlib), fit the exponential model:
$$ M(a, \alpha, t) = a_1 + a_2 e^{-\alpha_1 t} + a_3 e^{-\alpha_2 t} . $$
We compare the Levenberg-Marquardt algorithm for minimization of the
full function with VARPRO, which uses the reduced function. The two
algorithms needed 32 and 8 iterations, respectively, to reduce the objective
function to less than $5 \cdot 10^{-5}$; this is a substantial saving,
considering that the cost per iteration is similar for the two approaches.
Moreover, the condition numbers of the respective linearized problems close to
the solution are 48845 and 6.9; a clear example showing that the reduced
problem is better conditioned than the original one. Figure 9.6.2 shows the
fit and the residuals for the VP algorithm. The autocorrelation analysis from
Section 1.4 gives $\varrho = 5 \cdot 10^{-6}$ and
$T_\varrho = 9.7 \cdot 10^{-6}$, showing that the residuals can
be considered as uncorrelated and therefore that the fit is acceptable. The
least squares solution for this problem is
$$ a^* = (0.375, \, -1.46, \, 1.94)^T , \qquad \alpha^* = (0.0221, \, 0.0129)^T . $$
As explained in Section 3.5, the sensitivity of this solution can be assessed
via the diagonal elements of the estimated covariance matrix
$$ \varsigma^2 \bigl( J(x^*)^T J(x^*) \bigr)^{-1} . $$
In particular, the square roots of the diagonal elements of this matrix are
estimates of the standard deviations of the five parameters. If we use the
estimate $\varsigma^2 \approx \| r(x^*) \|_2^2 / m$, then these estimated
standard deviations are
0.00191, 0.204, 0.203, 0.000824, 0.000413,
showing that the two linear parameters $a_2$ and $a_3$ are potentially very
sensitive to perturbations. To illustrate this, the two alternative sets of
parameters
$$ \hat{a} = (0.374, \, -1.29, \, 1.76)^T , \qquad \hat{\alpha} = (0.0231, \, 0.0125)^T $$
and
$$ \tilde{a} = (0.378, \, -2.29, \, 2.76)^T , \qquad \tilde{\alpha} = (0.0120, \, 0.0140)^T , $$
both give fits that are almost indistinguishable from the least squares fit.
The corresponding residual norms are
$$ \| r(x^*) \|_2 = 7.39 \cdot 10^{-3} , \qquad \| r(\hat{x}) \|_2 = 7.79 \cdot 10^{-3} , \qquad
 \| r(\tilde{x}) \|_2 = 8.24 \cdot 10^{-3} , $$
showing that the large changes in $a_2$ and $a_3$ give rise to only small
variations in the residuals, again demonstrating the lack of sensitivity of
the residual to changes in those parameters.
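The standard-deviation estimates used in this example can be computed from the Jacobian and the residual at the solution, as in the following sketch. It uses the variance estimate $\|r(x^*)\|_2^2/m$ from the example and a QR factorization $J = QR$, so that $(J^T J)^{-1} = R^{-1} R^{-T}$; the function name and the tiny test matrices are hypothetical, and the code assumes that $J$ has full column rank.

```python
import numpy as np

def parameter_std(J, r):
    """Square roots of the diagonal of s2 * (J^T J)^{-1}, with s2 = ||r||^2 / m."""
    m = J.shape[0]
    s2 = (r @ r) / m                              # noise variance estimate
    R = np.linalg.qr(J, mode="r")                 # J^T J = R^T R
    R_inv = np.linalg.inv(R)                      # assumes full column rank
    cov_diag = s2 * np.sum(R_inv ** 2, axis=1)    # diag of s2 * R^{-1} R^{-T}
    return np.sqrt(cov_diag)

# Hypothetical 3-by-2 Jacobian and residual vector
J_demo = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
r_demo = np.array([0.0, 0.0, 3.0])
std = parameter_std(J_demo, r_demo)
```

Working with $R$ avoids forming $J^T J$ explicitly, which would square the condition number.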
9.7 Block methods for large-scale problems
We have already mentioned the inexact Gauss-Newton and
Levenberg-Marquardt methods, based on truncated iterative solvers, as a way to
deal with large computational problems. A different approach is to use a
divide-and-conquer strategy that, by decomposing the problem into blocks, may
allow the use of standard solvers for the (smaller) blocks. First, one
subdivides the observations in appropriate non-overlapping groups. Through an
SVD analysis one can select those variables that are more relevant to each
subset of data; details of one such method for large-scale, ill-conditioned
nonlinear least squares problems are given below and in [188, 189, 191].
The procedure works best if the data can be broken up in such a way that
the associated variables have minimum overlap and only weak couplings are
left with the variables outside the block. Of course, we cannot expect the
blocks to be totally uncoupled; otherwise, the problem would decompose
into a collection of problems that can be independently solved.
Thus, in general, the procedure consists of an outer block nonlinear
Gauss-Seidel or Jacobi iteration, in which the NLLSQ problems correspond-
ing to the individual blocks are solved approximately for their associated
parameters. The block solver is initialized with the current value of the vari-
ables. The full parameter set is updated either after each block is processed
(Gauss-Seidel strategy of information updating as soon as it is available),
or after all the block solves have been completed (Jacobi strategy). The ill-
conditioning is robustly addressed by the use of the Levenberg-Marquardt
algorithm for the subproblems and by the threshold used to select the vari-
ables for each block.
The pre-processing for converting a large problem into block form starts
by scanning the data and subdividing it. The ideal situation is one in which
the data subsets are only sensitive to a small subset of the variables. Hav-
ing performed that subdivision, we proceed to analyze the data blocks to
determine which parameters are actually well determined by each subset of
data. During this data analysis phase we compute the SVD of the Jaco-
bian of each block; this potentially very large matrix is trimmed by deleting
columns that are zero or smaller than a given threshold. Finally, the right
singular vectors of the SVD are used to complete the analysis.
Selecting subspaces through the SVD
Jupp and Vozoff [138] introduced the idea of relevant and irrelevant
parameters based on the SVD. We first write the Taylor expansion of the
residual vector at a given point $x$ (see Section 8.2):
$$ r(x + h) = r(x) + J(x) \, h + O(\| h \|_2^2). \tag{9.7.1} $$
Considering the SVD of the Jacobian $J(x) = U \Sigma V^T$, we can introduce
the so-called rotated perturbations
$$ p = \sigma_1 V^T h, $$
where $\sigma_1$ is the largest singular value of $J(x)$. Neglecting
higher-order terms in (9.7.1) we can write this system as
$$ r(x + h) - r(x) = J(x) \, h = \sum_{i=1}^{r} \left( \frac{\sigma_i}{\sigma_1} \right) p_i \, u_i , $$
where $\sigma_i / \sigma_1$ are the normalized singular values and $r$ is the
rank of the Jacobian. This shows the direct relationship between the
normalized singular values, the rotated perturbations
$p_i = \sigma_1 v_i^T h$, and their influence on the variation of the residual
vector. Thus,
$$ \| r(x + h) - r(x) \|_2^2 = \sum_{i=1}^{r} \left( \frac{\sigma_i}{\sigma_1} \right)^2 p_i^2 , $$
which shows that those parameters $p_i$ that are associated with small
normalized singular values will not contribute much to variations in the
residual. Here we have assumed that all the components of the perturbation
vector $h$ are of similar size.
The above analysis is the key to the algorithm for partitioning the
parameter set into blocks, once a partition of the data set has been chosen.
Let $RI_k$ denote the row indices for the $k$th block of data with $m^{[k]}$
elements. Given a base point $x$, calculate the Jacobian $J(x)^{[k]}$ for this
data set, i.e., $J(x)^{[k]}$ has only $m^{[k]}$ rows and fewer than $n$
columns, since if the data have been partitioned appropriately and due to the
local representation of the model, we expect that there will be a significant
number of columns with small components that can be safely neglected. Then
the procedure is as follows:
1. Compute the SVD of $J(x)^{[k]}$ and normalize the singular values by
dividing them by the largest one.
2. Select the first $n^{[k]}$ normalized singular values that are above a given
threshold, and their associated right singular vectors.
3. Inspect the set of chosen singular vectors and select the largest
components of $V^{[k]}$ in absolute value.
4. Choose the indices of the variables in parameter space corresponding
to the columns of $V^{[k]}$ that contain large entries to form the set $CI_k$
of column indices. This selects the subset of parameters that have most
influence on the variation in the misfit functional for the given data
set.
With this blocking strategy, variables may be repeated in different subsets.
Observe also that it is possible for the union of all the subsets to be smaller
than the entire set of variables; this will indicate that there are variables
that cannot be resolved by the given data, at least in a neighborhood of
the base point $x$ and for the chosen threshold. Since this analysis is local,
it should be periodically revised as the optimization process advances.
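The four-step selection procedure can be sketched as follows for one data block. The two thresholds (`sv_threshold` for the normalized singular values and `entry_threshold` for what counts as a "large" entry of a singular vector) and the function name are hypothetical choices, and the sketch interprets step 4 as selecting every variable, i.e., every row of $V^{[k]}$, that has a large entry in at least one retained right singular vector.

```python
import numpy as np

def select_block_variables(J_k, sv_threshold=1e-2, entry_threshold=0.3):
    """Return the column index set CI_k and the normalized singular values."""
    U, s, Vt = np.linalg.svd(J_k, full_matrices=False)
    s_norm = s / s[0]                                  # step 1: normalize
    n_k = int(np.sum(s_norm > sv_threshold))           # step 2: keep n_k values
    V_k = Vt[:n_k].T                                   # retained right singular vectors
    large = np.abs(V_k) >= entry_threshold             # step 3: large components
    CI_k = np.flatnonzero(large.any(axis=1))           # step 4: variable indices
    return CI_k, s_norm

# Hypothetical block Jacobian: the data depend on variables 0 and 1 only
# (variable 3 has a negligible influence, variable 2 none at all).
J_k = np.array([[1.0, 0.0, 0.0, 1e-6],
                [0.0, 2.0, 0.0, 0.0]])
CI_k, s_norm = select_block_variables(J_k)
```

Variables that appear in no index set $CI_k$ for any block are the ones that cannot be resolved by the data at the chosen threshold.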
Once this partition has been completed, we use an outer block nonlinear
Gauss-Seidel or Jacobi iteration [173] in order to obtain the solution of the
full problem. To make this more precise, let us partition the index sets
$M = \{1, 2, \ldots, m\}$ and $N = \{1, 2, \ldots, n\}$ into the subsets $RI_k$ and $CI_k$,
i.e., the index sets that describe the partition of our problem into blocks.
Each subproblem can now be written as
$$\min_x \tfrac{1}{2} \| r(x)_{[k]} \|_2^2 \quad \text{subject to} \quad x_i = x_i^*, \; i \in \{1, 2, \ldots, n\} \setminus CI_k, \qquad (9.7.2)$$
where $r(x)_{[k]}$ is the sub-vector of $r(x)$ with elements $\{r(x)_i\}_{i \in RI_k}$.
In other words, we fix the values of the parameters that are not in block $k$
to their current values in the global set of variables $x^*$. Observe that the
dimension of the subproblem is then $m_{[k]} \times n_{[k]}$. By considering enough
subsets $k = 1, \ldots, K$ we can make these dimensions small, especially $n_{[k]}$,
and therefore make the subproblems (9.7.2) amenable to direct techniques (and
global optimization, if necessary).
One step of a sequential block Gauss-Seidel iteration consists then in
sweeping over all the blocks, solving the subproblems to a certain level of
accuracy, and replacing the optimal values in the central repository of all
the variables at once. A sequential block Jacobi iteration does the same,
but it does not replace the values until the sweep over all the blocks is
completed.
Since we allow repetitions of variables in the sub-blocks, it is prudent
to introduce averaging of the multiple appearances of variables. In the case
of Jacobi, this can be done naturally at the end of a sweep. In the case of
Gauss-Seidel, one needs to keep a count of the repetitions and perform a
running average for each repeated variable.
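To make the difference between the two sweeps concrete, here is a minimal sketch for the simplified case of a linear residual $r(x) = Ax - b$, so that each subproblem is a small linear least squares problem. The function `block_sweep`, its arguments and the use of `numpy.linalg.lstsq` are assumptions of this sketch, not part of the method as described; a nonlinear problem would replace the inner solve by a few steps of a nonlinear solver.

```python
import numpy as np

def block_sweep(A, b, x, row_blocks, col_blocks, mode="gauss-seidel"):
    """One sweep over all blocks for r(x) = A x - b (illustrative sketch)."""
    x = x.copy()
    x_start = x.copy()                    # values frozen at the start (Jacobi)
    sums = np.zeros_like(x)               # accumulated block solutions
    counts = np.zeros(x.size, dtype=int)  # how often each variable appeared
    all_cols = np.arange(x.size)
    for RI, CI in zip(row_blocks, col_blocks):
        base = x if mode == "gauss-seidel" else x_start
        fixed = np.setdiff1d(all_cols, CI)
        # Variables outside the block are held at their current values.
        rhs = b[RI] - A[np.ix_(RI, fixed)] @ base[fixed]
        sol, *_ = np.linalg.lstsq(A[np.ix_(RI, CI)], rhs, rcond=None)
        sums[CI] += sol
        counts[CI] += 1
        if mode == "gauss-seidel":
            # Update immediately, as a running average over repetitions.
            x[CI] = sums[CI] / counts[CI]
    if mode == "jacobi":
        # Update only once the sweep is complete, averaging repetitions.
        updated = counts > 0
        x[updated] = sums[updated] / counts[updated]
    return x

A = np.diag([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 3.0, 4.0])
blocks = [[0, 1], [2, 3]]
x_gs = block_sweep(A, b, np.zeros(4), blocks, blocks, mode="gauss-seidel")
x_jac = block_sweep(A, b, np.zeros(4), blocks, blocks, mode="jacobi")
```

For this decoupled example both variants reach the solution in a single sweep; with overlapping column blocks the averaging branches become active.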
Parallel asynchronous block nonlinear Gauss-Seidel iteration
The idea of chaotic relaxation for linear systems originated with Rosenfeld
in 1967 [211]. Other early actors in this important topic were A. Ostrowski
[178] and S. Schechter [221]. Chazan and Miranker published in 1969 a
detailed paper [43] describing and formalizing chaotic iterations for the
parallel iterative solution of systems of linear equations. This was extended
to the nonlinear case in [159, 160].
The purpose of chaotic relaxation is to facilitate the parallel implementation
of iterative methods in a multi-processor system or in a network of
computers by reducing the amount of communication and synchronization
between cooperating processes and by allowing assigned sub-tasks to go
unfulfilled. This is achieved by not requiring that the relaxation follow a
pre-determined sequence of computations, but rather letting the different
processes start their evaluations from a current, centrally managed value of
the unknowns.
Baudet [10] defines the class of asynchronous iterative methods and
shows that it includes chaotic iterations. Besides the classical Jacobi and
Gauss-Seidel approaches he introduces the purely asynchronous method in
which, at each iteration within a block, current values are always used.
This is a stronger cooperative approach than Gauss-Seidel and he shows
in numerical experimentation how it is more efficient with an increasing
number of processors.
We paraphrase a somewhat more restricted definition for the case of
block nonlinear optimization that concerns us. Although Baudet's method
would apply directly to calculating the zeros of $\nabla f(x)$, we prefer to describe
the method in the optimization context in which it will be used. We use
the notation introduced above for our partitioned problem.
Let
$$f(x) = \tfrac{1}{2} \| r(x) \|_2^2 = \sum_{k=1}^{K} \tfrac{1}{2} \| r(x)_{[k]} \|_2^2 = \sum_{k=1}^{K} f(x)_{[k]},$$
where the data are partitioned into $K$ non-overlapping blocks. An
asynchronous iteration for calculating $\min_x f(x)$ starting from the vector $x_0$ is
a sequence of vectors $x_k$ with elements defined by
$$x^{[k]}_j = \arg\min_x f(x)_{[k]} \quad \text{for } j \in CI_k,$$
subject to
$$x^{[k]}_j = x_j \quad \text{for } j \notin CI_k.$$
The initial vector for the $k$th minimization is
$$x^{[k]} = (x_1(s_1(k)), \ldots, x_n(s_n(k))),$$
where $S = \{s_1(k), \ldots, s_n(k)\}$ is a sequence of elements in $\mathbb{N}^n$ that indicates
at which iteration a particular component was last updated. In addition,
the following conditions should be satisfied:
$$s_i(k) \le k - 1 \quad \text{and} \quad s_i(k) \to \infty, \text{ as } k \to \infty.$$
These conditions guarantee that all the variables are updated often enough,
while the formulation allows for the use of updated subsets of variables as
they become available. Baudet [10] gives sufficient conditions for the
convergence of this type of process for systems of nonlinear equations. The
convergence of an asynchronous block Gauss-Seidel process and, as a special
case, of the previously described sequential algorithm, follows from the
following theorem proved in [188].
Theorem 98. Convergence of the asynchronous BGS L-M iteration.
Let the operator
$$T(x) = x - (J(x)^T J(x) + \lambda I)^{-1} J(x)^T r(x)$$
be Lipschitz, with constant $\gamma$, and uniformly monotone with constant $\mu$, in
a neighborhood of a stationary point of $f(x)$, with $2\mu/\gamma^2 > 1$. Then $T$ is
vector contracting and the asynchronous Levenberg-Marquardt method for
minimizing $\tfrac{1}{2} \| r(x) \|_2^2$ is convergent.
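The operator $T$ in the theorem is a damped Gauss-Newton (Levenberg-Marquardt) step. A minimal sketch of its evaluation follows; the names `lm_operator` and `lam` (the scalar multiplying $I$) and the small linear test problem are assumptions made for illustration, and the loop is an ordinary synchronous fixed-point iteration, not the asynchronous scheme itself.

```python
import numpy as np

def lm_operator(x, residual, jacobian, lam):
    """Evaluate T(x) = x - (J^T J + lam*I)^{-1} J^T r(x) (illustrative)."""
    J = jacobian(x)
    r = residual(x)
    # Solve the damped normal equations instead of forming an inverse.
    step = np.linalg.solve(J.T @ J + lam * np.eye(x.size), J.T @ r)
    return x - step

# Linear residual r(x) = A x - b, so the minimizer is the least squares solution.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
residual = lambda x: A @ x - b
jacobian = lambda x: A

x = np.zeros(2)
for _ in range(100):                      # repeated application of T
    x = lm_operator(x, residual, jacobian, lam=1.0)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
```

For this problem the fixed point of $T$ coincides with the least squares solution, since $T(x) = x$ exactly when $J^T r(x) = 0$.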
As we indicated above, an alternative to this procedure is a chaotic
block Jacobi iteration, where all the processes use the same initial vector
at the beginning of a sweep, at the end of which the full parameter vector
is updated. In general, asynchronous block Gauss-Seidel with running av-
erages is the preferred algorithm, since Jacobi requires a synchronization
step at the end of each sweep that creates issues of load balancing.
Appendix A
Sensitivity Analysis
A.1 Floating-point arithmetic
The classic reference for finite precision arithmetic is Wilkinson's
monograph Rounding Errors in Algebraic Processes [255], while a more recent
treatment is Higham's Accuracy and Stability of Numerical Algorithms
[128]. Almost any numerical analysis book has an introductory chapter
about this topic. Here we list some of the basic ideas used in our text.
Digital computers use floating-point representation for real and complex
numbers based on the binary system, i.e., the basis is 2. Real numbers are
rewritten in a special normalized form, where the mantissa is less than 1.
Usually there is the option to use single ($t$-digit) or double ($2t$-digit) length
mantissa representation and arithmetic. If we denote by $fl(x)$ the
floating-point computer representation of a real number $x$, and by $\oplus$ the
floating-point addition, then the unit round-off $\mu$ (for a given computer) is
defined as the smallest $\mu$ such that in floating-point arithmetic:
$fl(1 \oplus \mu) > fl(1)$. For a binary $t$-digit floating-point system $\mu = 2^{-t}$.
The machine epsilon $\varepsilon_M = 2\mu$ is the gap between 1 and the next larger
floating-point number (and, thus, in a relative sense, gives an indication of
the gap between the floating-point numbers). Several of the bounds in this
book contain the unit round-off or the machine precision; it is therefore
advisable to check the size of $\varepsilon_M$ for a particular machine and word length.
A small Fortran program is available from Netlib to compute the machine
precision for double precision and can be adapted easily for single.
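In the same spirit as the Netlib program mentioned above, the machine epsilon can be found by repeated halving. The following is only a sketch in Python (whose `float` is assumed to be IEEE double precision); the function name `machine_epsilon` is ours.

```python
import sys

def machine_epsilon():
    """Gap between 1.0 and the next larger float, found by halving."""
    eps = 1.0
    # Stop as soon as adding half of eps no longer changes 1.0.
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps_M = machine_epsilon()
unit_roundoff = eps_M / 2.0   # machine epsilon is twice the unit round-off
```

The value found this way can be compared with `sys.float_info.epsilon`, which reports the same gap for the platform's double precision format.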
Representation error. The relative error in the computer
representation $fl(x)$ of a real number $x \ne 0$ satisfies
$$\frac{|fl(x) - x|}{|x|} \le \mu,$$
implying that $fl(x) \in [x(1 - \varepsilon_M), x(1 + \varepsilon_M)]$.
Rounding error. The error in a given floating-point operation $\odot$,
corresponding to the real operation $\circ$, satisfies
$$fl(x) \odot fl(y) = (x \circ y)(1 + \varepsilon), \quad \text{with } |\varepsilon| \le \mu.$$
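The rounding error bound can be checked exactly for a single addition by using rational arithmetic. This sketch assumes IEEE double precision with round-to-nearest, for which the unit round-off is $2^{-53}$; the helper name `relative_error_of_sum` is ours.

```python
from fractions import Fraction

U = Fraction(1, 2 ** 53)   # assumed unit round-off of IEEE double precision

def relative_error_of_sum(x, y):
    """Exact relative error of the floating-point sum of two floats."""
    exact = Fraction(x) + Fraction(y)   # exact rational sum of the two floats
    computed = Fraction(x + y)          # rounded sum, converted exactly
    return abs(computed - exact) / abs(exact)

delta = relative_error_of_sum(0.1, 0.2)
```

Here `delta` is nonzero, because the exact sum of the two stored numbers is not representable, but it does not exceed the unit round-off.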
To measure the cost of the different algorithms described in the book
we use as the unit the flop. A word of warning though: its definition differs
from one author to the other; here we follow the one used in [105, 128],
which is also common in many articles in the literature.
Definition 1. A flop is roughly the work associated with a floating-point
operation (addition, subtraction, multiplication, or division).
In March 2011 the cheapest cost per Gigaflop (10^9 flops) was $1.80,
achieved on the computer cluster HPU4Science, made of six dual Core 2
Quad off-the-shelf machines at a cost of $30,000, with performance
enhanced by combining the CPUs with the Graphical PUs. In comparison,
the cost in 1984 was $15 million on a Cray X-MP.
A.2 Stability, conditioning and accuracy
A clear and concise review of these topics can be found in [57, 128, 237].
One general comment first: given a $t$-digit arithmetic, there is a limit to the
attainable accuracy of any computation, because even the data themselves
may not be representable by a $t$-digit number. Additionally, in practical
applications, one should not lose sight of the fact that usually the data,
derived from observations, already have a physical error much larger than
the one produced by the floating-point representation.
Let us formally define a mathematical problem by a function that relates
a data space $X$ with a solutions space $Y$, i.e., $P : X(\text{data}) \to Y(\text{solutions})$.
Let us also define a specific algorithm for this problem as a function
$\tilde{P} : X \to Y$. One is interested in evaluating how close the solution computed
by the algorithm is to the exact solution of the mathematical problem.
The accuracy will depend on the sensitivity of the mathematical
problem $P$ to perturbations of its data, the condition of the problem, and on
the sensitivity of the algorithm $\tilde{P}$ to perturbations of the input data, the
stability of the algorithm.
The condition of the mathematical problem is commonly measured by
the condition number $\kappa(x)$. We emphasize the problem dependency, so,
for example, the same matrix $A$ may give rise to an ill-conditioned least
squares problem and a well-conditioned eigenvector problem.
A formal definition of the condition number follows.
Definition 2. The condition number is defined by
$$\kappa(x) = \sup_{\delta x} \frac{\| P(x + \delta x) - P(x) \|_2}{\| P(x) \|_2} \Big/ \frac{\| \delta x \|_2}{\| x \|_2}.$$
If the mapping $P$ is differentiable with Jacobian $[J(x)]_{ij} = \partial P_i / \partial x_j$, the above
formula can be replaced by
$$\kappa(x) = \frac{\| J(x) \|_2 \, \| x \|_2}{\| P(x) \|_2}.$$
The condition number $\kappa(x)$ is the leading coefficient of the data
perturbation in the error bound for the solution.
Among the possible formalizations of the stability concept, backward
stability is a convenient one; an algorithm is backward stable if it computes
the exact solution for slightly perturbed data, or in other words, quoting
[237], an algorithm is backward stable if "[it] gives the right answer to nearly
the right question": $\tilde{P}(x) = P(\tilde{x}) = P(x + \Delta x)$. The perturbation of the
input data $\| \Delta x \|_2$ is the backward error. Formally,
Definition 3. The algorithm $\tilde{P}$ is backward stable if the computed solution
$\tilde{x}$ is the exact solution of a slightly perturbed problem; i.e., if
$$\tilde{P}(x) = P(\tilde{x})$$
for some $\tilde{x}$ with
$$\frac{\| \tilde{x} - x \|_2}{\| x \|_2} = O(\varepsilon_M).$$
One can bound the actual error in the solution $P(x)$ of a problem with
condition number $\kappa(x)$ if it is computed using a backward stable algorithm
$\tilde{P}(x)$:
$$\frac{\| \tilde{P}(x) - P(x) \|_2}{\| P(x) \|_2} = O(\kappa(x)\, \varepsilon_M).$$
In words, a backward stable algorithm, when applied to a well-conditioned
problem, yields an accurate solution, and if the backward error is smaller
than the data errors, the problem is solved to the same extent that it is
actually known. On the other hand, the computed solution to an ill-conditioned
problem can have a very large error, even if the algorithm used was
backward stable, the condition number acting as an amplification factor of the
data errors.
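A small numerical illustration of the Jacobian form of the condition number may help. The example problem, the subtraction $P(x) = x_1 - x_2$, and the function name `condition_number` are our own choices for this sketch; the problem is well conditioned when the operands differ in sign and ill conditioned when they nearly cancel.

```python
import numpy as np

def condition_number(P, J, x):
    """kappa(x) = ||J(x)||_2 ||x||_2 / ||P(x)||_2 for a differentiable P."""
    x = np.asarray(x, dtype=float)
    Jx = np.atleast_2d(J(x))
    Px = np.atleast_1d(P(x))
    return np.linalg.norm(Jx, 2) * np.linalg.norm(x) / np.linalg.norm(Px)

# Assumed example problem: subtraction of two numbers.
P = lambda x: x[0] - x[1]
J = lambda x: np.array([[1.0, -1.0]])

kappa_good = condition_number(P, J, [1.0, -1.0])    # no cancellation
kappa_bad = condition_number(P, J, [1.0001, 1.0])   # severe cancellation
```

With a backward stable subtraction, the relative error of the second result can therefore be several orders of magnitude larger than the machine precision, in line with the bound above.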
Appendix B
Linear Algebra Background
B.1 Norms
Vector norms
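The most commonly used norms for vectors are the $l_1$-, $l_2$- and $l_\infty$-norms,
denoted by $\| \cdot \|_1$, $\| \cdot \|_2$, and $\| \cdot \|_\infty$, respectively. These norms are defined
by:
$$\| v \|_1 = \sum_{i=1}^{n} |v_i|.$$
$$\| v \|_2 = \Big( \sum_{i=1}^{n} |v_i|^2 \Big)^{1/2} = (v^T v)^{1/2} \quad \text{(Euclidean norm)}.$$
$$\| v \|_\infty = \max_{1 \le i \le n} |v_i| \quad \text{(Chebyshev norm)}.$$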
A useful relationship between an inner product and the $l_2$-norms of its
factors is the Cauchy-Schwarz inequality:
$$| x^T y | \le \| x \|_2 \, \| y \|_2.$$
Norms are continuous functions of the entries of their arguments. It follows
that a sequence of vectors $x_0, x_1, \ldots$ converges to a vector $x$ if and only if
$\lim_{k \to \infty} \| x_k - x \| = 0$ for any norm.
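The three vector norms and the Cauchy-Schwarz inequality are easy to check on a small example. The helper names below are ours and the vectors are arbitrary.

```python
import math

def norm1(v):
    return sum(abs(vi) for vi in v)

def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def norm_inf(v):
    return max(abs(vi) for vi in v)

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

x = [3.0, -4.0, 0.0]
y = [1.0, 2.0, -2.0]
# Cauchy-Schwarz: |x^T y| <= ||x||_2 ||y||_2
cauchy_schwarz_holds = abs(dot(x, y)) <= norm2(x) * norm2(y)
```

For these vectors the inner product is $-5$, while the product of the Euclidean norms is $5 \cdot 3 = 15$.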
Matrix norms
A natural definition of a matrix norm is the so-called induced or operator
norm that, starting from a vector norm $\| \cdot \|$, defines the matrix norm as
the maximum amount that the matrix $A$ can stretch a unit vector, or more
formally: $\| A \| = \max_{\| v \| = 1} \| A v \|$. Thus, for $A \in \mathbb{R}^{m \times n}$, the induced norms
associated with the usual vector norms are
$$\| A \|_1 = \max_j \sum_{i=1}^{m} |a_{ij}|.$$
$$\| A \|_2 = [\max(\text{eigenvalue of } (A^T A))]^{1/2}.$$
$$\| A \|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}|.$$
In addition the so-called Frobenius norm (Euclidean length of $A$ considered
as an $nm$-vector) is
$$\| A \|_F = \Big( \sum_{j=1}^{n} \sum_{i=1}^{m} |a_{ij}|^2 \Big)^{1/2} = \operatorname{trace}(A^T A)^{1/2}.$$
For square, orthogonal matrices $Q \in \mathbb{R}^{n \times n}$ we have $\| Q \|_2 = 1$ and
$\| Q \|_F = \sqrt{n}$. Both the Frobenius and the matrix $l_2$-norms are compatible with
the Euclidean vector norm. This means that $\| A x \| \le \| A \| \, \| x \|$ is true
when using the $l_2$-norm for the vector and either the $l_2$ or the Frobenius
norm for the matrix. Also, both are invariant with respect to orthogonal
transformations $Q$:
$$\| Q A \|_2 = \| A \|_2, \quad \| Q A \|_F = \| A \|_F.$$
In terms of the singular values of $A$, the $l_2$ norm can be expressed as
$\| A \|_2 = \max_i \sigma_i = \sigma_1$, where $\sigma_i$, $i = 1, \ldots, \min(m, n)$ are the singular
values of $A$ in descending order of size. In the special case of symmetric
matrices the $l_2$-norm reduces to $\| A \|_2 = \max_i |\lambda_i|$, with $\lambda_i$ an eigenvalue
of $A$. This is also called the spectral radius of the matrix $A$.
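The formulas for the matrix norms can be verified directly with NumPy. The matrix below is an arbitrary $3 \times 2$ example chosen for this sketch.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0, 4.0],
              [0.0, 1.0]])

norm_1 = np.abs(A).sum(axis=0).max()      # largest column sum: max(4, 7)
norm_inf = np.abs(A).sum(axis=1).max()    # largest row sum: max(3, 7, 1)
norm_2 = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())   # sqrt of largest eigenvalue
sigma = np.linalg.svd(A, compute_uv=False)            # singular values, descending
norm_F = np.sqrt(np.trace(A.T @ A))                   # Frobenius norm

# Invariance under an orthogonal transformation Q (here from a QR factorization).
Q, _ = np.linalg.qr(np.array([[2.0, 1.0, 0.0],
                              [1.0, 3.0, 1.0],
                              [0.0, 1.0, 4.0]]))
norm_2_QA = np.linalg.norm(Q @ A, 2)
norm_F_QA = np.linalg.norm(Q @ A, "fro")
```

The values agree with `numpy.linalg.norm(A, ord)` for `ord` equal to `1`, `2`, `numpy.inf` and `"fro"`, and the $l_2$-norm equals the largest singular value.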
B.2 Condition number
The condition number of a general matrix $A$ in norm $\| \cdot \|_p$ is
$$\kappa_p(A) = \| A \|_p \, \| A^\dagger \|_p.$$
For a vector-induced norm, the condition number of $A$ is the ratio of the
maximum to the minimum stretch produced by the linear transformation
represented by this matrix, and therefore it is greater than or equal to 1.
In the $l_2$-norm, $\kappa_2(A) = \sigma_1 / \sigma_r$, where $\sigma_r$ is the smallest nonzero singular
value of $A$ and $r$ is the rank of $A$.
In finite precision arithmetic, a large condition number can be an
indication that the exact matrix is close to singular, as some of the zero
singular values may be represented by very small numbers.
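A sketch of the computation of $\kappa_2(A) = \sigma_1/\sigma_r$ follows. Deciding which singular values count as nonzero requires a tolerance in floating-point arithmetic; the default used here (largest dimension times machine epsilon times $\sigma_1$) and the function name `cond2` are assumptions of this sketch.

```python
import numpy as np

def cond2(A, rank_tol=None):
    """sigma_1 / sigma_r, with r the numerical rank (illustrative)."""
    s = np.linalg.svd(A, compute_uv=False)       # descending order
    if rank_tol is None:
        # Assumed default tolerance for deciding which sigma_i are nonzero.
        rank_tol = max(A.shape) * np.finfo(float).eps * s[0]
    r = int(np.count_nonzero(s > rank_tol))
    if r == 0:
        raise ValueError("zero matrix: condition number undefined")
    return s[0] / s[r - 1]

A = np.diag([4.0, 2.0, 0.0])          # rank 2, singular values 4, 2, 0
kappa_A = cond2(A)
B = np.array([[2.0, 0.0], [0.0, 1.0]])
kappa_B = cond2(B)
```

For a nonsingular matrix such as `B` the result coincides with `numpy.linalg.cond(B)`, whereas for the rank-deficient `A` the zero singular value is excluded.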
B.3 Orthogonality
The notation used for the inner product of vectors is $v^T w = \sum_i v_i w_i$. Note
that $\| v \|_2^2 = v^T v$. If $v, w \ne 0$, and $v^T w = 0$, then these vectors are
orthogonal, and they are orthonormal if in addition they have unit length.
Orthogonal matrices. A square matrix $Q$ is orthogonal if $Q^T Q = I$
or $Q Q^T = I$, i.e., the columns or rows are orthonormal vectors and thus
$\| Q \|_2 = 1$. It follows that orthogonal matrices represent isometric
transformations that can only change the direction of a vector (rotation,
reflection), but not its Euclidean norm, a reason for their practical importance:
$\| Q v \|_2 = \| v \|_2$, $\| Q A \|_2 = \| A Q \|_2 = \| A \|_2$.
Permutation matrix. A permutation matrix is an identity matrix
with permuted rows or columns. It is orthogonal, and products of permu-
tation matrices are again permutations.
Orthogonal projection onto a subspace of an inner product
space. Given an orthonormal basis $\{u_1, u_2, \ldots, u_n\}$ of a subspace $S \subseteq X$,
where $X$ is an inner product space, the orthogonal projection $P : X \to S$
satisfies
$$P x = \sum_{i=1}^{n} (x^T u_i)\, u_i.$$
The operator $P$ is linear and satisfies $P x = x$ if $x \in S$ (idempotent) and
$\| P x \|_2 \le \| x \|_2$ $\forall x \in X$. Therefore, the associated square matrix $P$ is an
orthogonal projection matrix if it is Hermitian and idempotent, i.e., if
$$P^T = P \quad \text{and} \quad P^2 = P.$$
Note that an orthogonal projection divides the whole space into two
orthogonal subspaces. If $P$ projects a vector onto the subspace $S$, then $I - P$
projects it onto $S^\perp$, the orthogonal complement of $S$ with $S + S^\perp = X$ and
$S \cap S^\perp = \{0\}$. An orthogonal projection matrix $P$ is not necessarily an
orthogonal matrix, but $I - 2P$ is orthogonal (see Householder transformations
in Section 4.3).
An important projector is defined by a matrix of the form $P_U = U U^T$,
where $U$ has $p$ orthonormal columns $u_1, u_2, \ldots, u_p$. This is a projection
onto the subspace spanned by the columns of $U$. In particular, the
projection onto the subspace spanned by a single (not necessarily of norm 1)
vector $u$ is defined by the rank-one matrix $P_u = \dfrac{u u^T}{u^T u}$.
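These properties of projectors can be confirmed numerically. In the sketch below the orthonormal columns of $U$ are obtained from a QR factorization of a random matrix; the seed and the dimensions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# U: 5 x 2 matrix with orthonormal columns (reduced QR of a random matrix).
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))

P = U @ U.T                      # projector onto range(U)
H = np.eye(5) - 2.0 * P          # I - 2P is orthogonal although P is not

u = np.array([1.0, 2.0, 2.0, 0.0, 0.0])
P_u = np.outer(u, u) / (u @ u)   # rank-one projector onto span{u}

x = rng.standard_normal(5)
residual = x - P @ x             # (I - P) x lies in the orthogonal complement
```

The vector `residual` is orthogonal to every column of `U`, which is the statement that $I - P$ projects onto $S^\perp$.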
Gram matrix. The Gram matrix or Grammian $\operatorname{Gram}(A)$ of an $m \times n$
matrix $A$ is $A^T A$. Its elements are thus the $n^2$ possible inner products
between pairs of columns of $A$.
B.4 Some additional matrix properties
The Sherman-Morrison-Woodbury formula gives a representation of
the inverse of a rank-one perturbation of a matrix in terms of its inverse:
$$(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u},$$
provided that $1 + v^T A^{-1} u \ne 0$. As usual, for calculations, $A^{-1} u$ is
shorthand for "solve the system $A x = u$."
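Following the remark that $A^{-1}u$ stands for a linear solve, the formula can be applied to a right-hand side $b$ using two solves with $A$ and no explicit inverse. The function name `sherman_morrison_solve` and the test data are ours.

```python
import numpy as np

def sherman_morrison_solve(A, u, v, b):
    """Solve (A + u v^T) x = b using only solves with A (illustrative)."""
    y = np.linalg.solve(A, b)          # y = A^{-1} b
    z = np.linalg.solve(A, u)          # z = A^{-1} u
    denom = 1.0 + v @ z                # 1 + v^T A^{-1} u must be nonzero
    if denom == 0.0:
        raise ValueError("rank-one update makes the matrix singular")
    return y - z * (v @ y) / denom

A = np.array([[4.0, 1.0], [2.0, 3.0]])
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
b = np.array([1.0, 2.0])
x_sm = sherman_morrison_solve(A, u, v, b)
x_direct = np.linalg.solve(A + np.outer(u, v), b)
```

In practice one would reuse a factorization of $A$ for both solves; `numpy.linalg.solve` is called twice here only to keep the sketch short.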
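The next theorem presents the interlacing property of the singular
values of $A$ with those of matrices obtained by removing or adding a
column or a row (see [20]).
Theorem 4. Let $A$ be bordered by a column $u \in \mathbb{R}^m$, $\hat{A} = (A, u) \in \mathbb{R}^{m \times n}$,
$m \ge n$. Then, the ordered singular values $\sigma_i$ of $A$ separate the singular
values $\hat{\sigma}_i$ of $\hat{A}$ as follows:
$$\hat{\sigma}_1 \ge \sigma_1 \ge \hat{\sigma}_2 \ge \sigma_2 \ge \cdots \ge \hat{\sigma}_{n-1} \ge \sigma_{n-1} \ge \hat{\sigma}_n.$$
Similarly, if $A$ is bordered by a row $v \in \mathbb{R}^n$,
$$\hat{A} = \begin{pmatrix} A \\ v^T \end{pmatrix} \in \mathbb{R}^{m \times n}, \quad m \ge n,$$
then
$$\hat{\sigma}_1 \ge \sigma_1 \ge \hat{\sigma}_2 \ge \sigma_2 \ge \cdots \ge \hat{\sigma}_{n-1} \ge \sigma_{n-1} \ge \hat{\sigma}_n \ge \sigma_n.$$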
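The column case of the interlacing property can be observed numerically. The random matrix, the seed and the small tolerance that absorbs rounding errors are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))          # m x (n-1) with m = 6, n = 4
u = rng.standard_normal(6)
A_hat = np.column_stack([A, u])          # A bordered by the column u

s = np.linalg.svd(A, compute_uv=False)          # sigma_1 >= ... >= sigma_{n-1}
s_hat = np.linalg.svd(A_hat, compute_uv=False)  # hat sigma_1 >= ... >= hat sigma_n

tol = 1e-12                              # allowance for rounding errors
interlaced = all(
    s_hat[i] >= s[i] - tol and s[i] >= s_hat[i + 1] - tol
    for i in range(len(s))
)
```

Each singular value of `A` is thus bracketed by two consecutive singular values of the bordered matrix.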
Appendix C
Advanced Calculus Background
C.1 Convergence rates
Definition 5. Let $x^*, x_k \in \mathbb{R}$ for $k = 0, 1, \ldots$ The sequence $\{x_k\}$ is said to
converge to $x^*$ if
$$\lim_{k \to \infty} |x_k - x^*| = 0.$$
The convergence is
linear if $\exists\, c \in [0, 1)$ and an integer $K > 0$ such that for $k \ge K$,
$$|x_{k+1} - x^*| \le c\, |x_k - x^*|;$$
superlinear if $\exists\, c_k \to 0$ and an integer $K > 0$ such that for $k \ge K$,
$$|x_{k+1} - x^*| \le c_k\, |x_k - x^*|;$$
quadratic if $\exists\, c \in [0, 1)$ and an integer $K > 0$ such that for $k \ge K$,
$$|x_{k+1} - x^*| \le c\, |x_k - x^*|^2.$$
Definition 6. A locally convergent iterative algorithm converges
to the correct answer if the iteration starts close enough. A globally
convergent iterative algorithm converges when starting from almost
any point. For minimization, this is not to be confused with finding the
global minimum of a functional on a compact domain (see below).
Global and local minimum: $x^*$ is a global minimizer of a function
$f : \mathbb{R}^n \to \mathbb{R}$ on a compact domain $D$ if $f(x^*) \le f(x)$, $\forall x \in D$. $x^*$ is a
local minimizer inside a certain region, usually defined as an open ball of
size $\delta$ around $x^*$, if $f(x^*) \le f(x)$ for $\| x - x^* \|_2 < \delta$.
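The definition of quadratic convergence can be illustrated with Newton's method for $x^2 - 2 = 0$, an example of our own choosing. The sketch records the errors $|x_k - x^*|$ and the ratios $|x_{k+1} - x^*| / |x_k - x^*|^2$, which stay bounded by a constant smaller than 1.

```python
import math

def newton_sqrt2_errors(x0=1.5, steps=3):
    """Errors |x_k - sqrt(2)| of Newton's method applied to x^2 - 2 = 0."""
    target = math.sqrt(2.0)
    x = x0
    errors = [abs(x - target)]
    for _ in range(steps):
        x = 0.5 * (x + 2.0 / x)        # Newton step for f(x) = x^2 - 2
        errors.append(abs(x - target))
    return errors

errors = newton_sqrt2_errors()
# Ratios e_{k+1} / e_k^2; bounded ratios indicate quadratic convergence.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(len(errors) - 1)]
```

Only a few steps are taken because the error soon reaches the level of the rounding errors, after which the ratios are no longer meaningful.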
C.2 Multivariable calculus
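The gradient and Hessian of a scalar function of several variables $f(x)$ are
a vector and a matrix, respectively, defined by
$$\nabla f(x) \equiv (\partial f / \partial x_1, \ldots, \partial f / \partial x_n)^T, \qquad \nabla^2 f(x) \equiv \left[ \frac{\partial^2 f}{\partial x_i \partial x_j} \right].$$
For a vector function of several variables
$$r(x) = (r_1(x), r_2(x), \ldots, r_m(x))^T,$$
with each $r_k(x) : \mathbb{R}^n \to \mathbb{R}$, we denote by $J(x)$ the Jacobian of $r(x)$ and by
$G_k$ the Hessian of a component function $r_k$:
$$J(x) = \left[ \frac{\partial r_i}{\partial x_j} \right], \qquad G_k(x) = \left[ \frac{\partial^2 r_k}{\partial x_i \partial x_j} \right].$$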
Definition 7. Descent direction $p$ of a function $f : \mathbb{R}^n \to \mathbb{R}$.
$p$ is a descent direction at $x_c$ for a function $f(x)$ if for sufficiently
small and positive $\alpha$, $f(x_c + \alpha p) < f(x_c)$. Alternatively, $p$ is a descent
direction at $x_c$ if the directional derivative (projection of the gradient on
a given direction) of the function $f(x)$ at $x_c$ in the direction $p$ is negative:
$\nabla f(x_c)^T p < 0$.
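Both characterizations of a descent direction can be checked on a simple quadratic. The function $f(x) = x_1^2 + 3x_2^2$, the point $x_c$ and the step length are arbitrary choices for this sketch.

```python
def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return [2.0 * x[0], 6.0 * x[1]]

def is_descent_direction(grad, p):
    """True if the directional derivative grad^T p is negative."""
    return sum(g * pi for g, pi in zip(grad, p)) < 0.0

x_c = [1.0, 1.0]
p = [-g for g in grad_f(x_c)]          # steepest descent direction
alpha = 1e-3                           # a small positive step length
x_new = [xi + alpha * pi for xi, pi in zip(x_c, p)]
decreases = f(x_new) < f(x_c)
```

The negative gradient always passes the test wherever the gradient is nonzero, while a direction such as $(1, 0)^T$ fails it at this point.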
Theorem 8. Taylor's theorem for a scalar function.
If $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, then, for some $t \in [0, 1]$,
$$f(x + p) = f(x) + \nabla f(x + t p)^T p.$$
If $f(x)$ is twice continuously differentiable then, for some $t \in [0, 1]$,
$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2}\, p^T \nabla^2 f(x + t p)\, p.$$
A necessary condition for $x^*$ to be a stationary point of $f(x)$ is that
$\nabla f(x^*) = 0$. The sufficient conditions for $x^*$ to be a local minimizer are
$\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite.
The derivative $DA(x)$ of an $m \times n$ nonlinear matrix function $A(x)$,
where $x \in \mathbb{R}^k$, is a tri-dimensional tensor formed with $k$ matrices (slabs)
of dimension $m \times n$, each one containing the partial derivatives of the
elements of $A$ with respect to one of the variables of the vector $x$. Thus, the
second derivative of the vector function $r(x)$ is the tri-dimensional tensor
$G = [G_1, \ldots, G_m]$.
The derivative of the orthogonal projector $P_{A(x)}$ onto the column space
of a differentiable $m \times n$ matrix function $A(x)$ of local constant rank can
be obtained as follows.
Lemma 9. (Lemma 4.1 in [101]) Let $A^\dagger(x)$ be the pseudoinverse of $A(x)$.
Then $P_{A(x)} = A A^\dagger$ and
$$D P_{A(x)} = P^\perp_{A(x)}\, DA\, A^\dagger + (P^\perp_{A(x)}\, DA\, A^\dagger)^T,$$
where $P^\perp_A = I - P_{A(x)}$.
Definition 10. A function $f$ is Lipschitz continuous with constant $\gamma$ in a
set $X \subseteq \mathbb{R}$ if
$$\forall x, y \in X, \quad |f(x) - f(y)| \le \gamma\, |x - y|. \qquad \text{(C.2.1)}$$
Lipschitz continuity is an intermediary concept between continuity and
differentiability. The operator $V : \mathbb{R}^n \to \mathbb{R}^n$ is Lipschitz continuous with
constant $\gamma$ in a set $X \subseteq \mathbb{R}^n$ if
$$\forall x, y \in X, \quad \| V(x) - V(y) \|_2 \le \gamma\, \| x - y \|_2.$$
$V$ is contracting if it is Lipschitz continuous with constant less than unity:
$\gamma < 1$. $V$ is uniformly monotone if there exists a positive $m$ such that
$$m\, \| x - y \|_2^2 \le (V(x) - V(y))^T (x - y).$$
$V$ is vector Lipschitz if
$$|V(x) - V(y)| \le A\, |x - y|,$$
where $A$ is a non-negative $n \times n$ matrix and the inequality is meant
element-wise. $V$ is vector contracting if it is vector Lipschitz continuous with
$\| A \| < 1$.
Lagrange multipliers
Lagrange multipliers are used to find the local extrema of a function $f(x)$
of $n$ variables subject to $k$ equality constraints $g_i(x) = c_i$, by converting the
problem to an $(n + k)$-variable problem without constraints. Formally, new
scalar variables $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_k)^T$, one for each constraint, are introduced
and a new function is defined as
$$F(x, \lambda) = f(x) + \sum_{i=1}^{k} \lambda_i\, (g_i(x) - c_i).$$
The local extrema of this extended function $F$ (the Lagrangian), occur at
the points where its gradient is zero: $\nabla F(x, \lambda) = 0$, $\lambda \ne 0$, or equivalently,
$$\nabla_x F(x, \lambda) = 0, \qquad \nabla_\lambda F(x, \lambda) = 0.$$
This form encodes compactly the constraints, because
$$\nabla_{\lambda_i} F(x, \lambda) = 0 \;\Leftrightarrow\; g_i(x) = c_i.$$
An alternative to this method is to use the equality constraints to eliminate
k of the original variables. If the problem is large and sparse, this elimina-
tion may destroy the sparsity and thus the Lagrange multipliers approach
is preferable.
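As a small worked example of our own, consider minimizing $f(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$. The Lagrangian is $F(x, \lambda) = x_1^2 + x_2^2 + \lambda (x_1 + x_2 - 1)$, and setting its gradient to zero gives a linear system in the three unknowns $x_1, x_2, \lambda$, which the sketch below solves with NumPy.

```python
import numpy as np

# Stationarity conditions of F(x, lam) = x1^2 + x2^2 + lam*(x1 + x2 - 1):
#   dF/dx1  = 2*x1 + lam = 0
#   dF/dx2  = 2*x2 + lam = 0
#   dF/dlam = x1 + x2 - 1 = 0
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x1, x2, lam = np.linalg.solve(K, rhs)
```

The solution is $x_1 = x_2 = 1/2$ with $\lambda = -1$; eliminating $x_2 = 1 - x_1$ instead would lead to the same point through a one-variable problem.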
Appendix D
Statistics
D.1 Definitions
We list some basic definitions and techniques taken from statistics that are
needed to understand some sections of this book.
A sample space is the set of possible outcomes of an experiment
or a random trial. For some kind of experiments (trials), there may
be several plausible sample spaces available. The complete set of
outcomes can be constructed as a Cartesian product of the individual
sample spaces. An event is a subset of a sample space.
We denote by Pr{event} the probability of an event. We often
consider finite sample spaces, such that each outcome has the same
probability (equiprobable).
$X$ is a random variable defined on a sample space if it assigns
a unique numerical value to every outcome, i.e., it is a real-valued
function defined on a sample space.
$X$ is a continuous random variable if it can assume every value in
an interval, bounded or unbounded (continuous distribution). It can
be characterized by a probability density function (pdf) $p(x)$
defined by
$$\Pr\{a \le X \le b\} = \int_a^b p(x)\, dx.$$
$X$ is a discrete random variable if it assumes at most a countable
set of different values (discrete distribution). It can be characterized
by its probability function (pf), which specifies the probability
that the random variable takes each of the possible values from say
$\{x_1, x_2, \ldots, x_i, \ldots\}$:
$$p(x_i) = \Pr\{X = x_i\}.$$
All distributions, discrete or continuous, can be characterized through
their (cumulative) distribution function, the total probability up
to, and including, a point $x$. For a discrete random variable,
$$P(x_i) = \Pr\{X \le x_i\}.$$
The expectation or expected value $E[X]$ for a random variable $X$
or equivalently, the mean $\mu$ of its distribution, contains a summary of
the probabilistic information about $X$. For a continuous variable $X$,
the expectation is defined by
$$E[X] = \int_{-\infty}^{\infty} x\, p(x)\, dx.$$
For a discrete random variable with probability function $p(x)$ the
expectation is defined as
$$E[X] = \sum_i x_i\, p(x_i).$$
For a discrete random variable $X$ with values $x_1, x_2, \ldots, x_N$ and with
all $p(x_i)$ equal (all $x_i$ equiprobable, $p(x_i) = \frac{1}{N}$), the expected value
coincides with the arithmetic mean
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.$$
The expectation is linear, i.e., for $a, b, c$ constants and $X, Y$ two
random variables,
$$E[X + c] = E[X] + c,$$
$$E[aX + bY] = E[aX] + E[bY] = a\,E[X] + b\,E[Y].$$
A convenient measure of the dispersion (or spread about the average)
is the variance of the random variable, $\operatorname{var}[X]$ or $\sigma^2$:
$$\operatorname{var}[X] = \sigma^2 = E[(X - E(X))^2].$$
In the equiprobable case this reduces to
$$\operatorname{var}[X] = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2.$$
For a random sample of observations $x_1, x_2, \ldots, x_N$, the following
formula is an estimate of the variance $\sigma^2$:
$$s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2.$$
In the case of a sample from a normal distribution this is a particularly
good estimate.
The square root of the variance is the standard deviation, $\sigma$. It is
known by physicists as RMS or root mean square in the equiprobable
case:
$$\sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 }.$$
Given two random variables $X$ and $Y$ with expected values $E[X]$ and
$E[Y]$, respectively, the covariance is a measure of how the values of
$X$ and $Y$ are related:
$$\operatorname{Cov}\{X, Y\} = E(XY) - E(X)E(Y).$$
For random vectors $X \in \mathbb{R}^m$ and $Y \in \mathbb{R}^n$ the covariance is the $m \times n$
matrix
$$\operatorname{Cov}\{X, Y\} = E(XY^T) - E(X)E(Y)^T.$$
The $(i, j)$th element of this matrix is the covariance between the $i$th
component of $X$ and the $j$th component of $Y$.
In order to estimate the degree of interrelation between variables in a
manner not influenced by measurement units, the (Pearson)
correlation coefficient is used:
$$c_{XY} = \frac{\operatorname{Cov}\{X, Y\}}{(\operatorname{var}[X]\, \operatorname{var}[Y])^{1/2}}.$$
Correlation is a measure of the strength of the linear relationship
between the random variables; nonlinear ones are not measured
satisfactorily.
If $\operatorname{Cov}\{X, Y\} = 0$, the correlation coefficient is zero and the variables
$X, Y$ are uncorrelated.
The random variables $X$ and $Y$, with distribution functions $P_X(x)$,
$P_Y(y)$ and densities $p_X(x)$, $p_Y(y)$, respectively, are statistically
independent if and only if the combined random variable $(X, Y)$ has
a joint cumulative distribution function $P_{X,Y}(x, y) = P_X(x) P_Y(y)$ or
equivalently a joint density $p_{X,Y}(x, y) = p_X(x) p_Y(y)$. The expectation
and variance operators have the properties $E[XY] = E[X]\,E[Y]$,
$\operatorname{var}[X + Y] = \operatorname{var}[X] + \operatorname{var}[Y]$. It follows that independent random
variables have zero covariance.
Independent variables are uncorrelated, but the opposite is not true
unless the variables belong to a normal distribution.
A random vector $w$ is a white noise vector if $E(w) = 0$ and $E(w w^T) = \sigma^2 I$.
That is, it is a zero mean random vector where all the elements
have identical variance. Its autocorrelation matrix is a multiple of the
identity matrix; therefore the vector elements are uncorrelated. Note
that Gaussian noise $\ne$ white noise.
The coefficient of variation is a normalized measure of the
dispersion:
$$c_v = \frac{\sigma}{\mu}.$$
In signal processing the reciprocal ratio $\mu / \sigma$ is referred to as the signal
to noise ratio.
The coefficient of determination $R^2$ is used in the context of
linear regression analysis (statistical modeling), as a measure of how
well a linear model fits the data. Given a model $M$ that predicts values
$M_i$, $i = 1, \ldots, N$ for the observations $x_1, x_2, \ldots, x_N$ and the residual
vector $r = (M_1 - x_1, \ldots, M_N - x_N)^T$, the most general definition for
the coefficient of determination is
$$R^2 = 1 - \frac{\| r \|_2^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}.$$
In general, it is an approximation of the unexplained variance, since
the second term compares the variance in the model's errors with the
total variance of the data $x_1, \ldots, x_N$.
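The normal (or Gaussian) probability function $N(\mu, \sigma^2)$ has mean
$\mu$ and variance $\sigma^2$. Its probability density function takes the form
$$p(x) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right].$$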
If $X_i$, $i = 1, \ldots, n$ are independent random variables with normal
distributions $N(0, 1)$, then $\sum_{i=1}^{n} X_i^2$ has a chi-square distribution with
$n$ degrees of freedom. Given two independent random variables $Y, W$ with
chi-square distributions with $m$ and $n$ degrees of freedom, respectively,
the random variable $X$,
$$X = \frac{Y/m}{W/n},$$
has an F-distribution with $m$ and $n$ degrees of freedom, $F(m, n)$.
The Student's $t_n$ random variable with $n$ degrees of freedom is
defined as
$$t_n = \frac{Z}{\sqrt{\chi_n^2 / n}},$$
where $Z$ is a standard normal random variable, $\chi_n^2$ is a chi-square
random variable with $n$ degrees of freedom and $Z$ and $\chi_n^2$ are
independent.
The F-distribution becomes relevant when testing hypotheses about
the variances of normal distributions. For example, assume that we
have two independent random samples (of sizes $n_1$ and $n_2$
respectively) from two normal distributions; the ratio of the unbiased
estimators $s_1$ and $s_2$ of the two variances,
$$\frac{s_1 / (n_1 - 1)}{s_2 / (n_2 - 1)},$$
is distributed according to an F-distribution $F(n_1, n_2)$ if the variances
are equal: $\sigma_1^2 = \sigma_2^2$.
A time series is an ordered sequence of observations. The ordering
is usually in time, often in terms of equally spaced time intervals, but
it can also be ordered in, for example, space. The time series elements
are frequently considered random variables with the same mean and
variance.
A periodogram is a data analysis technique for examining the
frequency domain of an equispaced time series and searching for hidden
periodicities. Given a time series vector of $N$ observations $x(j)$, the
discrete Fourier transform (DFT) is given by a complex vector of
length $N$:
$$X(k) = \sum_{j=1}^{N} x(j)\, \omega_N^{(j-1)(k-1)}, \quad 1 \le k \le N,$$
with $\omega_N = \exp((-2\pi i)/N)$.
The magnitude squared of the discrete Fourier transform components
$|X(k)|^2$ is called the power. The periodogram is a plot of the power
components versus the frequencies $\{\frac{1}{N}, \frac{2}{N}, \ldots, \frac{k}{N}, \ldots\}$.
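A periodogram can be computed with a fast Fourier transform. The sketch below assumes that `numpy.fft.fft` follows the same sign convention as the DFT above, with zero-based indices, so that entry `k` of the result corresponds to the frequency `k/N`; the function name `periodogram` and the test signal are ours.

```python
import numpy as np

def periodogram(x):
    """Return frequencies k/N and power |X(k)|^2 of an equispaced series."""
    x = np.asarray(x, dtype=float)
    X = np.fft.fft(x)                    # DFT with exp(-2*pi*i*j*k/N), zero-based
    power = np.abs(X) ** 2
    freqs = np.arange(x.size) / x.size   # cycles per sample
    return freqs, power

# Test signal with a hidden periodicity of 5 cycles over N = 64 samples.
N = 64
j = np.arange(N)
signal = np.sin(2.0 * np.pi * 5.0 * j / N)
freqs, power = periodogram(signal)
peak = int(np.argmax(power[: N // 2]))   # search the first half of the spectrum
```

For a real-valued series the power is symmetric about the frequency $1/2$, which is why only the first half of the spectrum is searched for the peak.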
D.2 Hypothesis testing
In hypothesis testing one uses statistics to determine whether a given
hypothesis is true. The process consists of four steps:
1. Formulate the null hypothesis $H_0$ and the alternative hypothesis $H_a$,
which are mutually exclusive.
2. Identify a test statistic $t$ that can be used to assess the truth of the
null hypothesis from the sample data. It can involve means,
proportions, standard deviations, etc.
3. Determine the corresponding distribution function, assuming that the
null hypothesis is true and, after choosing an acceptable level of
significance $\alpha$ (common choices are 0.01, 0.05, 0.1), determine the
critical region for $t$ (see details below).
4. Compute $t$ for the observations. If the computed value of the test
statistic falls into the critical region, reject $H_0$.
In other words, the test of a hypothesis on the significance level $\alpha$ is
performed by means of a statistic $t$ and critical values $t_\alpha$ and $\bar{t}_\alpha$ so that
$$1 - \alpha = \Pr\{ t_\alpha \le t \le \bar{t}_\alpha \}, \quad 0 \le \alpha \le 1$$
if $H_0$ holds. The hypothesis $H_0$ is rejected if $t$ falls outside the range
$[t_\alpha, \bar{t}_\alpha]$. This guarantees that $H_0$ will only be erroneously rejected in
$100 \times \alpha\%$ of the cases.
For example, assume that we have approximated $m$ observations
$y_1, y_2, \ldots, y_m$ with a linear model $M_{full}$ with $n$ terms and we want to know
if $k$ of these terms are redundant, i.e., if one could get a good enough model
when setting $k$ coefficients of the original model to 0. We have the residual
norms $\rho_{full}$ and $\rho_{red}$ for both possible models, with $\rho_{full} < \rho_{red}$.
We formulate the hypothesis $H_0$: the reduced model is good enough,
i.e., the reduction in residual when using all $n$ terms is negligible.
Under the assumption that the errors in the data $y_1, y_2, \ldots, y_m$ are
normally distributed with constant variance and zero mean, we can choose
a statistic based on a proportion of variance estimates:
$$f_{obs} = \frac{\rho_{red}^2 - \rho_{full}^2}{\rho_{full}^2}\, \frac{m - n}{k}.$$
This statistic follows an F-distribution with $k$ and $m - n$ degrees of freedom:
$F(k, m - n)$. The common practice is to denote by $f_\alpha$ the value of the
statistic with cumulative probability $F_\alpha(k, m - n) = 1 - \alpha$.
If $f_{obs} > f_\alpha$, the computed statistic falls into the critical region for the
F-distribution and the $H_0$ hypothesis should be rejected, with a possible
error of $\alpha$, i.e., we do need all the terms of the full model.
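The test can be sketched in a few lines. We assume here that the statistic is formed from the squared residual norms, which is what makes it a ratio of variance estimates; the function names and the numerical values are ours, and the critical value $f_\alpha$ is taken as an input (from a table or a statistics library) rather than computed.

```python
def f_statistic(rho_red, rho_full, m, n, k):
    """F statistic for dropping k of the n terms, from residual 2-norms."""
    # Assumption: squared residual norms enter the variance estimates.
    return (rho_red ** 2 - rho_full ** 2) / rho_full ** 2 * (m - n) / k

def reject_reduced_model(f_obs, f_alpha):
    """Reject H0 (reduced model is good enough) if f_obs exceeds f_alpha."""
    return f_obs > f_alpha

# Hypothetical data: m = 12 observations, n = 4 terms, k = 2 candidate terms.
f_obs = f_statistic(rho_red=2.0, rho_full=1.0, m=12, n=4, k=2)
# 4.46 is approximately the tabulated 5% critical value of F(2, 8).
reject = reject_reduced_model(f_obs, f_alpha=4.46)
```

With these numbers the observed statistic is well above the critical value, so the two extra terms would be kept.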
References
[1] J. J. Alonso, I. M. Kroo and A. Jameson, Advanced algorithms for
design and optimization of quiet supersonic platforms. AIAA Paper
02-0114, 40th AIAA Aerospace Sciences Meeting and Exhibit, Reno,
NV, 2002.
[2] J. J. Alonso, P. Le Gresley, E. van der Weide and J. Martins, pyMDO:
a framework for high-fidelity multidisciplinary optimization. 10th
AIAA/ISSMO Multidisciplinary Analysis and Optimization
Conference, Albany, NY, 2004.
[3] A. Anda and H. Park, Fast plane rotations with dynamic scaling.
SIAM J. Matrix Anal. Appl. 15:162-174, 1994.
[4] E. Angelosante, D. Giannakis and G. B. Grossi, Compressed sensing
of time-varying signals. Digital Signal Processing, 16th International
Conference, Santorini, Greece, 2009.
[5] H. Ashley and M. Landahl, Aerodynamics of Wings and Bodies.
Dover, New York, 1985.
[6] I. Barrodale and F. D. K. Roberts, An improved algorithm for
discrete $\ell_1$ linear approximation. SIAM J. Numer. Anal. 10:839-848,
1973.
[7] R. Bartels, J. Beatty and B. Barsky, An Introduction to Splines for Use
in Computer Graphics and Parametric Modeling. Morgan Kaufmann
Publishers, Los Altos, CA, 1987.
[8] A. E. Beaton and J. W. Tukey, The fitting of power series, meaning
polynomials, illustrated on band-spectroscopic data. Technometrics
16:147-185, 1974.
[9] D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and Its
Applications. J. Wiley, New York, 1988.
[10] G. M. Baudet, Asynchronous iterative methods for multiprocessors.
J. ACM 15:226-244, 1978.
[11] A. Beck, A. Ben-Tal and M. Teboulle, Finding a global optimal
solution for a quadratically constrained fractional quadratic problem with
applications to the regularized total least squares. SIAM J. Matrix
Anal. Appl. 28:425-445, 2006.
[12] M. Berry, Large scale sparse singular value computations.
International J. Supercomp. Appl. 6:13-49, 1992.
[13] P. R. Bevington, Data Reduction and Error Analysis for the Physical
Sciences. McGraw-Hill, New York, 1969.
[14] C. H. Bischof and G. Quintana-Ortí, Computing rank-revealing QR
factorization of dense matrices. ACM TOMS 24:226-253, 1998.
[15] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford
University Press, New York, 1995.
[16] Å. Björck, Solving linear least squares problems by Gram-Schmidt
orthogonalization. BIT 7:1-21, 1967.
[17] Å. Björck, Iterative refinement of linear least squares solutions. BIT
7:257-278, 1967.
[18] Å. Björck, Iterative refinement of linear least squares solutions II.
BIT 8:8-30, 1968.
[19] Å. Björck, Stability analysis of the method of seminormal equations
for linear least squares problems. Lin. Alg. Appl. 88/89:31-48, 1987.
[20] Å. Björck, Numerical Methods for Least Squares Problems. SIAM,
Philadelphia, 1996.
[21] Å. Björck, The calculation of linear least squares problems. Acta
Numerica 13:1-53, 2004.
[22] . Bjrck and I. S. Du, A direct method for the solution of sparse
linear least squares problems. Lin. Alg. Appl. 34:4367, 1980.
[23] . Bjrck and G. H. Golub, Iterative renement of linear least
squares solution by Householder transformation. BIT 7:322337,
1967.
[24] . Bjrck, E. Grimme and P. Van Dooren, An implicit shift bidiag-
onalization algorithm for ill-posed systems. BIT 34:510534, 1994.
REFERENCES 283
[25] . Bjrck, P. Heggernes and P. Matstoms, Methods for large scale
total least squares problems. SIAM J. Matrix Anal. Appl. 22:413
429, 2000.
[26] . Bjrck and C. C. Paige, Loss and recapture of orthogonality in the
modied Gram-Schmidt algorithm. SIAM J. Matrix Anal. 13:176
190, 1992.
[27] . Bjrck and V. Pereyra, Solution of Vandermonde systems of equa-
tions. Math. Comp. 24:893904, 1970.
[28] C. Bckman, A modication of the trust-region Gauss-Newton
method for separable nonlinear least squares problems. J. Math. Sys-
tems, Estimation and Control 5:116, 1995.
[29] C. de Boor, A Practical Guide to Splines. Applied Mathematical Sci-
ences 27. Springer, New York, 1994.
[30] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge
University Press, Cambridge, 2003.
[31] R. Brent, Algorithms for Minimization Without Derivatives. Prentice
Hall, Englewood Clis, NJ, 1973. Reprinted by Dover, New York,
2002.
[32] D. Calvetti, P. C. Hansen and L. Reichel, L-curve curvature bounds
via Lanczos bidiagonalization. Electr. Transact. on Num. Anal.
14:134149, 2002.
[33] D. Calvetti, G. H. Golub and L. Reichel, Estimation of the L-curve
via Lanczos bidiagonalization algorithm for ill-posed systems. BIT
39:603619, 1999.
[34] E. J. Candes, Compressive sampling. Proceedings of the Interna-
tional Congress of Mathematicians, Madrid, Spain, 2006.
[35] E. J. Candes and M. B. Wakin, People hearing without listening: An
introduction to compressive sampling. Signal Processing Magazine,
IEEE 25:2130, 2008.
[36] L. Carcione, J. Mould, V. Pereyra, D. Powell and G. Wojcik, Nonlin-
ear inversion of piezoelectrical transducer impedance data. J. Comp.
Accoustics 9:899910, 2001.
[37] L. Carcione, V. Pereyra and D. Woods, GO: Global Optimization.
Weidlinger Associates Report, 2005.
284 REFERENCES
[38] R. I. Carmichael and L. I. Erickson, A higher order panel method for
predicting subsonic or supersonic linear potential ow about arbitrary
congurations. American Institute of Aeronautics and Astronautics
Paper 811255, 1981.
[39] M. Chan, Supersonic aircraft optimization for minimizing drag and
sonic boom. Ph.D. Thesis, Stanford University, Stanford, CA, 2003.
[40] T. F. Chan, Rank revealing QR-factorizations. Lin. Alg. Appl.
88/89:6782, 1987.
[41] T. F. Chan and D. E. Foulser, Eectively well-conditioned linear
systems. SIAM J. Sci. Stat. Comput. 9:963969, 1988.
[42] T. F. Chan and P. C. Hansen, Computing truncated SVD least
squares solutions by rank revealing QR-factorizations. SIAM J. Sci.
Statist. Comput. 11:519530, 1991.
[43] D. Chazan and W. L. Miranker, Chaotic relaxation. Lin. Alg. Appl.
2:199222, 1969.
[44] W. Cheney and D. Kincaid, Numerical Mathematics and Computing.
Brooks/Cole, Belmont, CA, 2007.
[45] S. Choi, J. J. Alonso and H. S. Chung, Design of low-boom supersonic
business jet using evolutionary algorithms and an adaptive unstruc-
tured mesh method. 45th AIAA/ASME/ASCE/AHS/ASC Struc-
tures, Structural Dynamics and Materials Conference, Palm Springs,
CA, 2004.
[46] H. S. Chung, S. Choi and J. J. Alonso, Supersonic business jet
design using knowledge-based genetic algorithms with adaptive, un-
structured grid methodology. AIAA 2003-3791, 21st Applied Aero-
dynamic Conference, Orlando, Fl., June 2003.
[47] H. S. Chung, Multidisciplinary design optimization of supersonic
business jets using approximation model-based genetic algorithms.
Ph.D. Thesis, Stanford University, Stanford, CA, 2004.
[48] J. F. Claerbout and F. Muir, Robust modeling with erratic data.
Geophysics 38:826844, 1973.
[49] A. K. Cline, An elimination method for the solution of linear least
squares problems. SIAM J. Numer. Anal. 10:283289, 1973.
[50] D. Coleman, P. Holland, N. Kaden, V. Klema and S. C. Peters, A
system of subroutines for iteratively reweighted least squares compu-
tations. ACM TOMS 6:328336, 1980.
REFERENCES 285
[51] T. F. Coleman and Y. Li, A globally and quadratically convergent
ane scaling method for linear
1
problems. Mathematical Program-
ming, Series A 56:189222, 1992.
[52] T. P. Collignon, Ecient iterative solution of large linear systems
on heterogeneous computing systems. Ph. D. Thesis, TU Delft, The
Netherlands, 2011.
[53] T. P. Collignon and M. B. van Gijzen. Parallel scientic computing
on loosely coupled networks of computers. In B. Koren and C. Vuik,
editors, Advanced Computational Methods in Science and Engineer-
ing. Springer Series Lecture Notes in Computational Science and En-
gineering, 71:79106. Springer Verlag, Berlin/Heidelberg, Germany,
2010.
[54] G. Cybenko, Approximation by superpositions of a sigmoidal func-
tion. Math. Control Signals Systems 2:303314, 1989.
[55] J. Dahl, P. C. Hansen, S. H. Jensen and T. L. Jensen, Algorithms
and software for total variation image reconstruction via rst-order
methods. Numer. Algo. 53:6792, 2010.
[56] J. Dahl and L. Vanderberghe, CVXOPT: A Python Package for Con-
vex Optimization. http://abel.ee.ucla.edu/cvxopt, 2012.
[57] G. Dahlquist and . Bjrck, Numerical Methods in Scientic Com-
puting. SIAM, Philadelphia, 2008.
[58] T. A. Davis and Y. Hu, The University of Florida sparse matrix
collection. ACM TOMS 38:125, 2011.
[59] K. Deb, A. Pratap, S. Agrawal and T. Meyarivan, A Fast and Elitist
Multiobjective Genetic Algorithm: NSGA-II. Technical Report No.
2000001. Indian Institute of Technology, Kanpur, India, 2000.
[60] P. Deift, J. Demmel, L.-C. Li and C. Tomei, The bidiagonal singular
value decomposition and Hamiltonian mechanics. SIAM J. Numer.
Anal. 28:14631516, 1991.
[61] R. S. Dembo, S. C. Eisenstat and T. Steihaug, Inexact Newton meth-
ods. SIAM J. Numer. Anal. 19:400408, 1982.
[62] C. J. Demeure and L. L. Scharf, Fast least squares solution of Van-
dermonde systems of equations. Acoustics, Speech and Signal Pro-
cessing 4:21982210, 1989.
[63] J. W. Demmel, Applied Numerical Linear Algebra. SIAM, Philadel-
phia, 1997.
286 REFERENCES
[64] J. Demmel, Y. Hida, W. Riedy and X. S. Li, Extra-precise iterative
renement for overdetermined least squares problems. ACM TOMS
35:132, 2009.
[65] J. Dennis, D. M. Gay and R. Welsch, Algorithm 573 NL2SOL An
adaptive nonlinear least-squares algorithm. ACM TOMS 7:369383,
1981.
[66] J. Dennis and R. Schnabel, Numerical Methods for Unconstrained
Optimization and Nonlinear Equations. SIAM, Philadelphia, 1996.
[67] J. E. Dennis and H. F. Walker, Convergence theorems for least-
change secant update methods. SIAM J. Num. Anal. 18:949987,
1981.
[68] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood
from incomplete data via the EM algorithm (with discussion). Jour-
nal of the Royal Statistical Society B 39:138, 1977.
[69] P. Dierckx, Curve and Surface Fitting with Splines. Clarendon Press,
Oxford, 1993.
[70] J. Dongarra, J. R. Bunch, C. B. Moler and G. W. Stewart, LINPACK
Users Guide. SIAM, Philadelphia, 1979.
[71] N. Draper and H. Smith, Applied Regression Analysis. J. Wiley, New
York, 1981.
[72] C. Eckart and G. Young, The approximation of one matrix by an-
other of lower rank. Psychometrika 1:211218, 1936.
[73] L. Eldn, A note on the computation of the generalized cross-
validation function for ill-conditioned least squares problems. BIT
24:467472, 1984.
[74] L. Eldn, Perturbation theory for the linear least squares problem
with linear equality constraints. BIT 17:338350, 1980.
[75] L. Eldn, Matrix Methods in Data Mining and Pattern Recognition.
SIAM, Philadelphia, 2007.
[76] J. Eriksson, P.-. Wedin, M. E. Gulliksson and I. Sderkvist, Reg-
ularization methods for uniformly rank-decient nonlinear least-
squares problems. J. Optimiz. Theory and Appl. 127:126, 2005.
[77] J. Eriksson and P.-. Wedin, Truncated Gauss-Newton algorithms
for ill-conditioned nonlinear least squares problems. Optimiz. Meth.
and Software 19:721737, 2004.
REFERENCES 287
[78] R. D. Fierro, P. C. Hansen and P. S. K. Hansen, UTV tools: MAT-
LAB templates for rank-revealing UTV decompositions. Numer.
Algo. 20:165194, 1999. The software is available from:
htpp://www.netlib.org/numeralgo.
[79] P. M. Fitzpatrick, Advanced Calculus. Second edition Brooks/Cole,
Belmont, CA, 2006.
[80] D. Fong and M. A. Saunders, LSMR: An iterative algorithm for
sparse least-squares problems. SIAM J. Sci. Comput. 33:29502971,
2011.
[81] G. E. Forsythe, Generation and use of orthogonal polynomials for
data-tting with a digital computer. J. SIAM 5:7488, 1957.
[82] J. Fox and S. Weisberg, Robust Regression in R, An Appendix
to An R Companion to Applied Regression, Second edition.
http://socserv.socsci.mcmaster.ca/jfox/Books
/Companion/appendix.html.
[83] C. F. Gauss, Theory of the Combination of Observations Least Sub-
ject to Errors. Parts 1 and 2, Supplement, G. W. Stewart. SIAM,
Philadelphia, 1995.
[84] D. M. Gay, Usage Summary for Selected Optimization Routines.
AT&T Bell Labs. Comp. Sc. Tech. Report, 1990.
[85] D. M. Gay and L. Kaufman, Tradeos in Algorithms for Separable
Nonlinear Least Squares. AT&T Bell Labs. Num. Anal. Manuscript,
9011, 1990.
[86] D. M. Gay, NSF and NSG; PORT Library. AT&T Bell Labs.
http://www.netlib.org/port, 1997.
[87] P. E. Gill, G. H. Golub, W. Murray and M. A. Saunders, Methods
for modifying matrix factorisations. Math. Comp. 28:505535, 1974.
[88] P. E. Gill, S. J. Hammarling, W. Murray, M. A. Saunders and M.
H. Wright, Users Guide for LSSOL (Version 1.0): A Fortran Pack-
age for Constrained Linear Least-Squares and Convex Quadratic Pro-
gramming. Report 86-1 Department of Operation Research, Stanford
University, CA, 1986.
[89] P. E. Gill, W. Murray and M. A. Saunders, SNOPT: An SQP al-
gorithm for large-scale constrained optimization. SIAM Rev. 47:99
131, 2005.
288 REFERENCES
[90] P. E. Gill, W. Murray, M. A. Saunders and M. H. Wright, Maintain-
ing LU factors of a general sparse matrix. Linear Algebra and its
Applications 8889:239270, 1987.
[91] P. E. Gill, W. Murray and M. H. Wright, Practical Optimization.
Academic Press, 1981.
[92] G. H. Golub, Numerical methods for solving linear least squares
problems. Numer. Math. 7:20616, 1965.
[93] G. H. Golub, P. C. Hansen and D. P. OLeary, Tikhonov regulariza-
tion and total least squares. SIAM J. Matrix Anal. Appl. 21:185194,
1999.
[94] G. H. Golub, M. Heath and G. Wahba, Generalized cross-validation
as a method for choosing a good ridge parameter. Technometrics
21:215223, 1979.
[95] G. H. Golub, A. Homan and G. W. Stewart, A generalization of
the Eckhard-Young-Mirsky matrix approximation theorem. Linear
Algebra Appl. 88/89:317327, 1987.
[96] G. H. Golub and W. Kahan, Calculating the singular values and
pseudo-inverse of a matrix. SIAM J, Numer. Anal. Ser. B 2:205
224, 1965.
[97] G. H. Golub, V. Klema and G. W. Stewart, Rank Degeneracy and
Least Squares. Report TR-456, Computer Science Department, Uni-
versity of Maryland, College Park, 1977.
[98] G. H. Golub and R. Le Veque, Extensions and uses of the variable
projection algorithm for solving nonlinear least squares problems.
Proceedings of the Army Numerical Analysis and Computers Confer-
ence, White Sands Missile Range, New Mexico, pp. 112, 1979.
[99] G. H. Golub and U. von Matt, Tikhonov regularization for large
problems. Workshop on Scientic Computing, Ed. G. H. Golub, S.
H. Lui, F. Luk and R. Plemmons, Springer, New York, 1997.
[100] G. H. Golub and V. Pereyra, The Dierentiation of Pseudo-Inverses
and Nonlinear Least Squares Problems Whose Variables Separate.
STAN-CS-72-261, Stanford University, Computer Sciences Depart-
ment, 1972. (It contains the original VARPRO computer code.)
[101] G. H. Golub and V. Pereyra, The dierentiation of pseudo-inverses
and nonlinear least squares problems whose variables separate. SIAM
J. Numer. Anal. 10:413432, 1973.
REFERENCES 289
[102] G. H. Golub and V. Pereyra, Dierentiation of pseudoinverses, sepa-
rable nonlinear least squares problems, and other tales. Proceedings
MRC Seminar on Generalized Inverses and Their Applications, Ed.
Z. Nashed, pp. 302324, Academic Press, NY, 1976.
[103] G. H. Golub and V. Pereyra, Separable nonlinear least squares: The
variable projection method and its applications. Inverse Problems
19:R1R26, 2003.
[104] G. H. Golub and J. M. Varah, On a characterization of the best

2
-scaling of a matrix. SIAM J. Numer. Anal. 11:472279, 1974.
[105] G. H. Golub and C. F. Van Loan, Matrix Computations. Third edi-
tion, John Hopkins University Press, Baltimore, 1996.
[106] G. H. Golub and C. F. Van Loan, An analysis of the total least
squares problem. SIAM J. Numer. Anal. 17:883893, 1980.
[107] G. H. Golub and J. H. Wilkinson, Note on iterative renement of
least squares solutions. Numer. Math. 9:139148, 1966.
[108] J. F. Grcar, Optimal Sensitivity Analysis of Linear Least Squares.
Lawrence Berkeley National Laboratory, Report LBNL-52434, 2003.
[109] J. F. Grcar, Mathematicians of Gaussian elimination. Notices of the
AMS 58:782792, 2011.
[110] J. F. Grcar, John von Neumanns analysis of Gaussian elimina-
tion and the origins of modern Numerical Analysis. SIAM Review
53:607682, 2011.
[111] A. Griewank and A. Walther, Evaluating Derivatives: Principles and
Techniques of Algorithmic Dierentiation. Other Titles in Applied
Mathematics 105. Second edition, SIAM, Philadelphia, 2008.
[112] E. Grosse, Tensor spline approximation. Linear Algebra Appl.
34:2941, 1980.
[113] I. Gutman, V. Pereyra and H. D. Scolnik, Least squares estimation
for a class of nonlinear models. Technometrics 15:209218, 1973.
[114] Y. Y. Haimes, L. S. Ladon and D. A. Wismer, On a bicriterion
formulation of the problem of integrated system identication and
system optimization. IEEE Trans. on Systems, Man and Cybernetics
1:296297, 1971.
[115] S. Hammarling, A note on modications to the Givens plane rota-
tions. J. Inst. Math. Applic. 13:215218, 1974.
290 REFERENCES
[116] S. Hammarling, A Survey of Numerical Aspects of Plane Rotations.
University of Manchester, MIMS Eprint 2008.69, 1977.
[117] M. Hanke and P. C. Hansen, Regularization methods for large-scale
problems. Surv. Math. Ind. 3:253315, 1993.
[118] P. C. Hansen, Rank-Decient and Discrete Ill-Posed Problems. SIAM,
Philadelphia, 1998.
[119] P. C. Hansen, The L-curve and its use in the numerical treatment of
inverse problems. In Comp. Inverse Problems in Electrocardiology.
Ed. P. Johnston, pp. 119142, WIT Press, Southampton, UK, 2001.
[120] P. C. Hansen, Regularization tools version 4.0 for MATLAB 7.3.
Numer. Algorithms 46:189194, 2007.
[121] P. C. Hansen, Discrete Inverse Problems Insight and Algorithms.
SIAM, Philadelphia, 2010.
[122] P. C. Hansen, M. Kilmer and R. H. Kjeldsen, Exploiting residual
information in the parameter choice for discrete ill-posed problems.
BIT 46:4159, 2006.
[123] P. C. Hansen and M. Saxild-Hansen, AIR tools A MATLAB pack-
age of algebraic iterative reconstruction methods. J. Comp. Appl.
Math. 236:21672178, 2011.
[124] P. C. Hansen and P. Yalamov, Computing symmetric rank-revealing
decompositions via triangular factorization. SIAM J. Matrix Anal.
Appl. 23:443458, 2001.
[125] J. G. Hayes, ed., Numerical Approximation to Functions and Data.
Athlone Press, London, 1970.
[126] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for
solving linear systems. J. Res. Nat. Stan. B. 49:409432, 1952.
[127] N. Higham, Analysis of the Cholesky decomposition of a semi-denite
matrix. In Reliable Numerical Computing, ed. M. G. Cox and S. J.
Hammarling, Oxford University Press, London, 1990.
[128] N. Higham, Accuracy and Stability of Numerical Algorithms. Second
edition, SIAM, Philadelphia, 2002.
[129] H. P. Hong and C. T. Pan, Rank-revealing QR factorization and
SVD. Math. Comp. 58:213232, 1992.
[130] P. J. Huber and E. M. Ronchetti, Robust Statistics. Second edition,
J. Wiley, NJ, 2009.
REFERENCES 291
[131] R. Horst, P. M. Pardalos and N. Van Thoai, Introduction to Global
Optimization. Second edition, Springer, New York, 2000.
[132] N. J. Horton and S. R. Lipsitz, Multiple imputation in practice:
Comparison of software packages for regression models with missing
variables. The American Statistician 55, 2011.
[133] I. C. F. Ipsen and C. D. Meyer, The idea behind Krylov methods.
Amer. Math. Monthly 105:889899, 1998.
[134] R. A. Johnson, Miller & Freunds Probability and Statistics for En-
gineers. Pearson Prentice Hall, Upper Saddle River, NJ, 2005.
[135] P. Jones, Data Tables of Global and Hemispheric Temperature
Anomalies
http://cdiac.esd.ornl.gov/trends/temp/jonescru/data.html.
[136] T. L. Jordan, Experiments on error growth associated with some
linear least squares procedures. Math. Comp. 22:579588, 1968.
[137] D. L. Jupp, Approximation to data by splines with free knots. SIAM
J. Numer. Anal. 15:328, 1978.
[138] D. L. Jupp and K. Vozo, Stable iterative methods for the inversion
of geophysical data. J. R. Astr. Soc. 42:957976, 1975.
[139] W. Karush, Minima of functions of several variables with inequalities
as side constraints. M. Sc. Dissertation. Department of Mathematics,
University of Chicago, Chicago, Illinois, 1939.
[140] L. Kaufman, A variable projection method for solving nonlinear least
squares problems. BIT 15:4957, 1975.
[141] L. Kaufman and G. Sylvester, Separable nonlinear least squares with
multiple right-hand sides. SIAM J. Matrix Anal. and Appl. 13:68
89, 1992.
[142] L. Kaufman and V. Pereyra, A method for separable nonlinear least
squares problems with separable equality constraints. SIAM J. Nu-
mer. Anal. 15:1220, 1978.
[143] H. B. Keller and V. Pereyra, Finite Dierence Solution of Two-Point
BVP. Preprint 69, Department of Mathematics, University of South-
ern California, 1976.
[144] M. E. Kilmer and D. P. OLeary, Choosing regularization parameters
in iterative methods for ill-posed problems. SIAM. J. Matrix Anal.
Appl. 22:12041221, 2001.
292 REFERENCES
[145] H. Kim, G. H. Golub and H. Park, Missing value estimation for DNA
microarray expression data: Least squares imputation. In Proceed-
ings of CSB2004, Stanford, CA, pp. 572573, 2004.
[146] S. Kotz, N. L. Johnson and A. B. Read, Encyclopedia of Statistical
Sciences. 6. Wiley-Interscience, New York, 1985.
[147] H. W. Kuhn and A. W. Tucker, Nonlinear Programming. Proceedings
of 2nd Berkeley Symposium, 481492, University of California Press,
Berkeley, 1951.
[148] Lancelot: Nonlinear Programming Code
http://www.numerical.rl.ac.uk/lancelot/
[149] K. Lange, Numerical Analysis for Statisticians. Second edition,
Springer, New York, 2010.
[150] C. Lawson and R. Hanson, Solving Least Squares Problems. Prentice
Hall, Englewood Clis, NJ, 1974.
[151] K. Levenberg, A method for the solution of certain nonlinear prob-
lems in least squares. Quart. J. App. Math. 2:164168, 1948.
[152] R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing
Data. Second edition. J. Wiley, Hoboken, NJ, 2002.
[153] K. Madsen and H. B. Nielsen, A nite smoothing algorithm for linear

1
estimation. SIAM J. Optimiz. 3:223235, 1993.
[154] K. Madsen and H. B. Nielsen, Introduction to Optimization and
Data Fitting. Lecture notes, Informatics and Mathematical Mod-
elling, Technical University of Denmark, Lyngby, 2008.
[155] I. Markovsky and S. Van Huel, Overview of total least-squares
methods. Signal Processsing 87:22832302, 2007.
[156] O. Mate, Missing value problem. Masters Thesis, Stanford Univer-
sity, Stanford, CA, 2007.
[157] Matrix Market. http://math.nist.gov/MatrixMarket
[158] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. SIAM,
Philadelphia, 2000.
[159] J.-C. Miellou, Iterations chaotiques a retards. C. R. Acad. Sci. Paris
278:957960, 1974.
[160] J.-C. Miellou, Algorithmes de relaxation chaotique a retards.
RAIRO R1:5582, 1975.
REFERENCES 293
[161] K. M. Miettinen, Nonlinear Multiobjective Optimization. Kluwers
Academic, Boston, 1999.
[162] L. Miranian and M. Gu, Strong rank revealing LU factorizations.
Lin. Alg. Appl. 367:116, 2003.
[163] Software for multiple imputation.
http://www.stat.psu.edu/~jls/misoftwa.html
[164] M. Mohlenkamp and M. C. Pereyra, Wavelets, Their Friends, and
What They Can Do For You. EMS Lecture Series in Mathematics,
European Mathematical Society, Zurich, Switzerland, 2008.
[165] J. Mor and S. J. Wright, Optimization Software Guide. SIAM,
Philadelphia, 1993.
[166] A. Nedic, D. P. Bertsekas and V. S. Borkar, Distributed asyn-
chronous incremental subgradient methods. Studies in Computa-
tional Mathematics 8:381407, 2001.
[167] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Methods
in Convex Programming. Studies in Applied Mathematics 13. SIAM,
Philadelphia, 1994.
[168] L. Ngia, System modeling using basis functions and applications to
echo cancellation. Ph.D. Thesis, Chalmers Institute of Technology,
Sweden, Goteborg, 2000.
[169] L. Ngia, Separable Nonlinear Least Squares Methods for On-Line Es-
timation of Neural Nets Hammerstein Models. Department of Signals
and Systems, Chalmers Institute of Technology, Sweden, 2001.
[170] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, New
York, second edition, 2006.
[171] D. P. OLeary, Robust regression computation using iteratively
reweighted least squares. SIAM J. Matrix. Anal. Appl. 22:466480,
1990.
[172] D. P. OLeary and B. W. Rust, Variable projection for non-
linear least squares problems. Submitted for publication in
Computational Optimization and Applications (2011). Also at:
http://www.cs.umd.edu/users/oleary/software/varpro.pdf
[173] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear
Equations in Several Variables. Academic Press, New York, 1970.
[174] M. R. Osborne, Some special nonlinear least squares problems.
SIAM J. Numer. Anal. 12:571592, 1975.
294 REFERENCES
[175] M. R. Osborne and G. K. Smyth, A modied Prony algorithm for
tting functions dened by dierence equations. SIAM J. Sci. Comp.
12:362382, 1991.
[176] M. R. Osborne and G. K. Smyth, A modied Prony algorithm for
exponential function tting. SIAM J. Sci. Comp. 16:119138, 1995.
[177] M. R. Osborne, Separable least squares, variable projections, and
the Gauss-Newton algorithm. ETNA 28:115, 2007.
[178] A. Ostrowski, Determinanten mit ueberwiegender Hauptdiagonale
and die absolute Konvergenz von linearen Iterationsprozessen.
Comm. Math. Helv. 30:175210, 1955.
[179] C. C. Paige and M. A. Saunders, LSQR: An algorithm for sparse lin-
ear equations and sparse least squares. ACM TOMS 8:4371, 1982.
[180] C. C. Paige and Z. Strakos, Core problems in linear algebraic sys-
tems. SIAM J. Matrix Anal. Appl. 27:861875, 2006.
[181] R. Parisi, E. D. Di Claudio, G. Orlandi and B. D. Rao, A generalized
learning paradigm exploiting the structure of feedforward neural net-
works. IEEE Transactions on Neural Networks 7:14501460, 1996.
[182] Y. C. Pati and P. S. Krishnaprasad, Analysis and synthesis of feedfor-
ward neural networks using discrete ane wavelet transformations.
IEEE Trans. Neural Networks 4:7385, 1993.
[183] V. Pereyra, Accelerating the convergence of discretization algo-
rithms. SIAM J. Numer. Anal. 4:508533, 1967.
[184] V. Pereyra, Iterative methods for solving nonlinear least squares
problems. SIAM J. Numer. Anal. 4:2736, 1967.
[185] V. Pereyra, Stability of general systems of linear equations. Aeq.
Math. 2:194206, 1969.
[186] V. Pereyra, Stabilizing linear least squares problems. Proc. IFIP,
Suppl. 68:119121, 1969.
[187] V. Pereyra, Modeling, ray tracing, and block nonlinear travel-time
inversion in 3D. Pure and App. Geoph. 48:345386, 1995.
[188] V. Pereyra, Asynchronous distributed solution of large scale nonlin-
ear inversion problems. J. App. Numer. Math. 30:3140, 1999.
[189] V. Pereyra, Ray tracing methods for inverse problems. Invited top-
ical review. Inverse Problems 16:R1R35, 2000.
REFERENCES 295
[190] V. Pereyra, Fast computation of equispaced Pareto manifolds
and Pareto fronts for multiobjective optimization problems. Math.
Comp. in Simulation 79:19351947, 2009.
[191] V. Pereyra, M. Koshy and J. Meza, Asynchronous global optimiza-
tion techniques for medium and large inversion problems. SEG An-
nual Meeting Extended Abstracts 65:10911094, 1995.
[192] V. Pereyra and P. Reynolds, Application of optimization techniques
to nite element analysis of piezocomposite devices. IEEE Ultrason-
ics Symposium, Montreal, CANADA, 2004.
[193] V. Pereyra and J. B. Rosen, Computation of the Pseudoinverse of
a Matrix of Unknown Rank. Report CS13, 39 pp. Computer Science
Department, Stanford University, Stanford, CA, 1964.
[194] V. Pereyra, M. Saunders and J. Castillo, Equispaced Pareto Front
Construction for Constrained Biobjective Optimization. SOL Report-
2010-1, Stanford University, Stanford, CA, 2010. Also CSRCR2009-
05, San Diego State University, 2009. In Press, Mathematical and
Computer Modelling, 2012.
[195] V. Pereyra and G. Scherer, Ecient computer manipulation of ten-
sor products with applications in multidimensional approximation.
Math. Comp. 27:595605, 1973.
[196] V. Pereyra and G. Scherer, Least squares scattered data tting by
truncated SVD. Applied Numer. Math. 40:7386, 2002.
[197] V. Pereyra and G. Scherer, Large scale least squares data tting.
Applied Numer. Math. 44:225239, 2002.
[198] V. Pereyra and G. Scherer, Exponential data tting. In Exponential
Data Fitting and Its Applications, Ed. V. Pereyra and G. Scherer.
Bentham Books, Oak Park, IL, 2010.
[199] V. Pereyra, G. Scherer and F. Wong, Variable projections neural net-
work training. Mathematics and Computers in Simulation 73:231
243, 2006.
[200] V. Pereyra, G. Wojcik, D. Powell, C. Purcell and L. Carcione, Folded
shell projectors and virtual optimization. US Navy Workshop on
Acoustic Transduction Materials and Devices, Baltimore, 2001.
[201] G. Peters and J. H. Wilkinson, The least squares problem and
pseudo-inverses. The Computer Journal 13:309316, 1969.
296 REFERENCES
[202] M. J. D. Powell, Approximation Theory and Methods. Cambridge Uni-
versity Press, New York, 1981.
[203] L. Prechelt, Proben 1-A Set of Benchmark Neural Network Problems
and Benchmarking Rules. Technical Report 21, Fakultaet fuer Infor-
matik, Universitaet Karlsruhe, 1994.
[204] Baron Gaspard Riche de Prony, Essai experimental et analytique:
sur les lois de la dilatabilite de uides elastique et sur celles de la
force expansive de la vapeur de lalkool, a dierentes temperatures.
J. Ecole Polyt. 1:2476, 1795.
[205] PZFLEX: Weidlinger Associates Inc. Finite Element Code to Simu-
late Piezoelectric Phenomena. http://www.wai.com, 2006.
[206] L. Reichel, Fast QR decomposition of Vandermonde-like matrices
and polynomial least squares approximations. SIAM J. Matrix Anal.
Appl. 12:552564, 1991.
[207] R. A. Renaut and H. Guo, Ecient algorithms for solution of regu-
larized total least squares. SIAM J. Matrix Anal. Appl. 26:457476,
2005.
[208] J. R. Rice, Experiments on Gram-Schmidt orthogonalization. Math.
Comp. 20:325328, 1966.
[209] J. B. Rosen, Minimum and basic solutions to singular systems. J.
SIAM 12:156162, 1964.
[210] J. B. Rosen and J. Kreuser, A gradient projection algorithm for non-
linear constraints. In Numerical Methods for Non-Linear Optimiza-
tion. Ed. F. A. Lootsma, Academic Press, New York, pp. 297300,
1972.
[211] J. Rosenfeld, A Case Study on Programming for Parallel Processors.
Research Report RC-1864, IBM Watson Research Center, Yorktown
Heights, New York, 1967.
[212] . Ruhe and P-. Wedin, Algorithms for separable non-linear least
squares problems. SIAM Rev. 22:318337, 1980.
[213] B. W. Rust, Fitting natures basis functions. Parts I-IV, Computing
in Sci. and Eng. 200103.
[214] B. W. Rust, http://math.nist.gov/~BRust/Gallery.html, 2011.
REFERENCES 297
[215] B. W. Rust, Truncating the Singular Value Decomposition for Ill-
Posed Problems. Tech. Report NISTIR 6131, Mathematics and Com-
puter Sciences Division, National Institute of Standards and Technol-
ogy, 1998.
[216] B. W. Rust and D. P. OLeary, Residual periodograms for choosing
regularization parameters for ill-posed problems. Inverse Problems
24, 2008.
[217] S. Saarinen, R. Bramley and G. Cybenko, Ill-conditioning in neural
network training problems. SIAM J. Sci. Stat. Comput. 14:693714,
1993.
[218] S. A. Savari and D. P. Bertsekas, Finite termination of asynchronous
iterative algorithms. Parallel Comp. 22:3956, 1996.
[219] J. A. Scales, A. Gersztenkorn and S. Treitel, Fast l
p
solution of large,
sparse, linear systems: application to seismic travel time tomogra-
phy. J. Comp. Physics 75:314333, 1988.
[220] J. L. Schafer, Analysis of Incomplete Multivariate Data. Chapman &
Hall, London, 1997.
[221] S. Schechter, Relaxation methods for linear equations. Comm. Pure
Appl. Math. 12:313335, 1959.
[222] M. E. Schlesinger and N. Ramankutty, An oscillation in the global
climate system of period 6570 years. Nature 367:723726, 1994.
[223] G. A. F. Seber and C. J. Wild, Nonlinear Regression. J. Wiley, New
York, 1989.
[224] R. Sheri and L. Geldart, Exploration Seismology. Cambridge Uni-
versity Press, second edition, Cambridge, 1995.
[225] D. Sima, S. Van Huel and G. Golub, Regularized total least squares
based on quadratic eigenvalue problem solvers. BIT 44:793812,
2004.
[226] Spline2. http://www.structureandchange.3me.tudelft.nl/.
[227] J. Sjberg and M. Viberg, Separable non-linear least squares
minimizationpossible improvements for neural net tting. IEEE
Workshop in Neural Networks for Signal Processing, Amelia Island
Plantation, FL, 1997.
298 REFERENCES
[228] N. Srivastava, R. Suaya, V. Pereyra and K. Banerjee, Accurate cal-
culations of the high-frequency impedance matrix for VLSI intercon-
nects and inductors above a multi-layer substrate: A VARPRO suc-
cess story. In Exponential Data Fitting and Its Applications, Eds. V.
Pereyra and G. Scherer. Bentham Books, Oak Park, IL, 2010.
[229] T. Steihaug and Y. Yalcinkaya, Asynchronous methods in least
squares: An example of deteriorating convergence. Proc. 15 IMACS
World Congress on Scientific Computation, Modeling and Applied
Mathematics, Berlin, Germany, 1997.
[230] G. W. Stewart, Rank degeneracy. SIAM J. Sci. Stat. Comput.
5:403-413, 1984.
[231] G. W. Stewart, On the invariance of perturbed null vectors under
column scaling. Numer. Math. 44:61-65, 1984.
[232] G. W. Stewart, Matrix Algorithms. Volume I: Basic Decompositions.
SIAM, Philadelphia, 1998.
[233] T. Strutz, Data Fitting and Uncertainty. Vieweg+Teubner Verlag,
Wiesbaden, 2011.
[234] B. J. Thijsse and B. Rust, Freestyle data fitting and global
temperatures. Computing in Sci. and Eng. pp. 49-59, 2008.
[235] R. Tibshirani, Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society. Series B (Methodological)
58:267-288, 1996.
[236] A. N. Tikhonov, On the stability of inverse problems. Dokl. Akad.
Nauk SSSR 39:195-198, 1943.
[237] L. N. Trefethen and D. Bau, Numerical Linear Algebra. SIAM,
Philadelphia, 1997.
[238] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R.
Tibshirani, D. Botstein and R. B. Altman, Missing value estimation
methods for DNA microarrays. Bioinformatics 17:520-525, 2001.
[239] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem.
Computational Aspects and Analysis, SIAM, Philadelphia, 1991.
[240] E. van den Berg and M. P. Friedlander, Probing the Pareto frontier
for basis pursuit solutions. SIAM J. Sci. Comp. 31:890-912, 2008.
[241] E. van den Berg and M. P. Friedlander, Sparse optimization with
least-squares constraints. SIAM J. Optim. 21:1201-1229, 2011.
[242] A. van den Bos, Parameter Estimation for Scientists and Engineers.
Wiley-Interscience, Hoboken, NJ, 2007.
[243] A. van der Sluis, Condition numbers and equilibration of matrices.
Numer. Math. 14:14-23, 1969.
[244] J. W. van der Veen, R. de Beer, P. R. Luyten and D. Van Ormondt,
Accurate quantification of in vivo 31P NMR signals using the variable
projection method and prior knowledge. Magn. Reson. Med. 6:92-98,
1988.
[245] L. Vanhamme, A. van den Boogaart and S. Van Huffel, Improved
method for accurate and efficient quantification of MRS data with
use of prior knowledge. J. Magn. Reson. 129:35-43, 1997.
[246] L. Vanhamme, S. Van Huffel, P. Van Hecke and D. van Ormondt,
Time-domain quantification of series of biomedical magnetic
resonance spectroscopy signals. J. Magn. Reson. 140:120-130, 1999.
[247] L. Vanhamme, T. Sundin, P. Van Hecke, S. Van Huffel and R.
Pintelon, Frequency-selective quantification of biomedical magnetic
resonance spectroscopy data. J. Magn. Reson. 143:1-16, 2000.
[248] S. Van Huffel, Partial singular value decomposition algorithm. J.
Comp. Appl. Math. 33:105-112, 1990.
[249] B. Walden, R. Karlsson and J.-G. Sun, Optimal backward pertur-
bation bounds for the linear least squares problem. Numer. Linear
Algebra Appl. 2:271-286, 2005.
[250] I. Wasito and B. Mirkin, Nearest neighbors in least squares data
imputation algorithms with different missing patterns. Comp. Stat.
& Data Analysis 50:926-949, 2006.
[251] D. S. Watkins, Fundamentals of Matrix Computations. J. Wiley, New
York, 2002.
[252] K. Weigl and M. Berthod, Neural Networks as Dynamical Bases in
Function Space. Report 2124, INRIA, Programe Robotique, Image et
Vision, Sophia-Antipolis, France, 1993.
[253] K. Weigl, G. Giraudon and M. Berthod, Application of Projection
Learning to the Detection of Urban Areas in SPOT Satellite Im-
ages. Report #2143, INRIA, Programe Robotique, Image et Vision,
Sophia-Antipolis, France, 1993.
[254] K. Weigl and M. Berthod, Projection learning: Alternative approach
to the computation of the projection. Proceedings of the European
Symposium on Artificial Neural Networks pp. 19-24, Brussels,
Belgium, 1994.
[255] J. H. Wilkinson, Rounding Errors in Algebraic Processes.
Prentice-Hall, Englewood Cliffs, NJ, 1963. Reprint Dover, New York, 1994.
[256] H. Wold and E. Lyttkens, Nonlinear iterative partial least squares
(NIPALS) estimation procedures. Bull. ISI 43:29-51, 1969.
[257] S. Wold, A. Ruhe, H. Wold and W. J. Dunn, The collinearity prob-
lem in linear regression. The partial least squares (PLS) approach to
generalized inverses. SIAM J. Sci. Stat. Comput. 5:735-743, 1984.
[258] R. Wolke and H. Schwetlick, Iteratively reweighted least squares:
Algorithms, convergence analysis, and numerical comparisons. SIAM
J. Sci. Stat. Comput. 9:907-921, 1988.
[259] S. J. Wright, Primal-Dual Interior-Point Methods. SIAM, Philadel-
phia, 1997.
[260] S. J. Wright, J. N. Holt, Algorithms for non-linear least squares
with linear inequality constraints. SIAM J. Sci. and Stat. Comput.
6:1033-1048, 1985.
[261] P. Zadunaisky and V. Pereyra, On the convergence and precision of
a process of successive differential corrections. Proc. IFIPS 65, 1965.
[262] H. Zha, Singular values of a classical matrix. The American
Mathematical Monthly 104:172-173, 1997.
[263] Netlib Repository at UTK and ORNL. http://www.netlib.org.
[264] NEOS Wiki. http://www.neos-guide.org.
[265] NIST/SEMATECH e-Handbook of Statistical Methods.
http://www.itl.nist.gov/div898/strd/nls/nls_info.shtml, 2011.
Index
Activation function, 232
Active constraint, 127, 156
Aerodynamic modeling, 242
Annual temperature anomalies, 213
Asynchronous
block Gauss-Seidel, 190
iterative methods, 189
method, 117
Autocorrelation, 15, 216, 218
test, 14
B-spline, 260
B-splines representation, 203
Back-scattering, 258
Backward stable, 88, 265
Basic solution, 41, 42, 91, 94, 97, 145
Bidiagonalization, 99-102, 112
method
Golub-Kahan, 111
LSQR, 111
Block
Gauss-Seidel, 117, 224
Jacobi, 117
methods, 117, 186
nonlinear Gauss-Seidel, 186, 259
nonlinear Jacobi, 186
Bolzano-Weierstrass theorem, 156
Bound constraints, 126, 127
Cauchy-Schwarz inequality, 267
CFL condition, 254
CGLS solution convergence, 116
Chaotic relaxation, 117, 188
Chebyshev
acceleration, 107
method, 107
norm, 267
Chi-square distribution, 278
Cholesky factor, 34, 66, 114
Coefficient of determination, 38, 278
adjusted, 38, 216
Complex exponentials, 184
Compressed sensing, 4, 41, 91, 121,
144
Computed tomography (CT), 110
Condition number, 55, 268
estimation, 88
Conjugate gradient method, 109
Control vertices, 205
Covariance, 31, 277
matrix, 31
approximation, 155
method, 81
Cubic splines, 203
Data approximation, 2
Data fitting, 4
problem, 1
Data imputation, 131
Derivative free methods, 183
Descent direction, 153, 272
Direct methods, 65, 91
Distribution function, 276
Eckart-Young-Mirsky theorem, 53
Elastic waves, 258
Euclidean norm, 267
Expectation, 276
F-distribution, 279
Feasible region, 156
Finite-element simulation, 238
First-order necessary condition, 151
Fitting model, 4, 13, 16
Flop, 264
Fréchet derivative, 50
Frobenius norm, 268
Gaussian
density function, 11
distribution, 10, 11
errors, 12
model, 147, 155
probability function, 278
Gauss-Newton
damped, 167
direction, 166
inexact, 170
method, 163
Generalized cross-validation, 197
Genetic algorithm, 248
Geological
medium, 259
surface modeling, 221
Geophone receiver array, 259
Givens rotations, 71
fast, 72
Globally convergent algorithm, 271
Global minimum, 271
Global optimization, 177, 241
Gradient vector, 272
Grammian, 28
matrix, 65
Gram-Schmidt
modified, 77
orthogonalization, 70, 75
Ground boom signature, 242
Hessian, 28, 152, 163, 165, 272
Householder transformations, 71, 122
Hybrid methods, 176
Hypothesis testing, 280
IEEE standard shapes, 252
Ill-conditioned, 191, 207, 223
Jacobian, 168, 176
Ill-conditioning, 191
Inexact methods, 186
Interlacing property, 80, 123, 141,
270
Interpolation, 3
Inverse problem, 129
Iterative
methods, 105, 107, 108, 119
process, 117
refinement, 85
Jacobian, 272
Kronecker product, 209, 210
Krylov
iterations, 194
process, 111
subspace, 108
methods, 225
Lagrange multipliers, 158, 273
Lagrangian, 273
Laplace
density function, 11
distribution, 11
errors, 12
Large-scale problems, 105
Least squares
data fitting problem, 6
fit, 5, 7, 9
overdetermined, 68
recursive, 81
Least squares problems
condition number, 47
large
linear, 117
linear, 25
constrained, 121
minimum-norm solution, 93
modifying, 80
rank deficient, 39
Least squares solution
full-rank problem, 34
Level curves, 63
Level of significance, 280
Level set, 153
Levenberg-Marquardt
algorithm, 170
asynchronous BGS iteration, 190
direction, 171
implementation, 250
method, 170
programs, 178
step, 171
Linear constraints, 121
equality, 121
inequality, 125
Linear data fitting, 4
Linearized sensitivity analysis, 255
Linear least squares applications, 203
Linear prediction, 41
Lipschitz continuity, 273
Local
minimizer, 156, 271
minimum, 151
Locally convergent algorithm, 271
Log-normal model, 12
LU factorization, 68
Peters-Wilkinson, 93
Machine epsilon, 263
Matrix norm, 267
Maximum likelihood principle, 6, 9
Minimal solution, 145
Model basis functions, 4
Model validation, 217
Monte Carlo, 176
Moore-Penrose
conditions, 53
generalized inverse, 182
Multiobjective optimization, 160
Multiple initial points, 176
random, 163
Multiple right-hand sides, 184
Neural networks, 231
training algorithm, 231
Newton's method, 163
NIPALS, 181
NMR, 7
nuclear magnetic resonance, 2,
248
spectroscopy, 150
problem, 29
Non-dominated solutions, 248
Nonlinear data fitting, 148
Nonlinear least squares, 147, 148
ill-conditioned, 201
large scale, 186
separable, 158, 231, 234, 236
unconstrained, 150
Non-stationary methods
Krylov methods, 108
Normal distribution, 14
Normal equations, 28, 65
Normalized cumulative periodograms,
16
Numerically rank-deficient problems,
192
Numerical rank, 92
Operator norm, 267
Optimality conditions, 151
constrained problems, 156
Order of fit, 4
Orthogonal factorization
complete, 43
Orthogonal matrices, 269
Orthogonal projection, 269
derivative, 49
Parameter estimation, 1
Pareto
equilibrium point, 160
front, 161, 248
equispaced representation, 161
optimal, 160
Pearson correlation coefficient, 277
Permutation matrix, 269
Perturbation analysis, 58
Piezoelectric transducer, 251
Poisson data, 11
PRAXIS, 183, 252
Preconditioning, 114
Probability density function, 275
Probability function, 275
Probability of an event, 275
Projection matrix, 48
Proper splitting, 107
Proxies, 238
Pseudoinverse, 47
Pure-data function, 4, 10
QR factorization, 33, 70, 94
economical, 34
full, 82
pivoted, 39, 96
rank-revealing, 95
Quadratic constraints, 129
Randomness test, 14
Random signs, 14
Random variable, 275
continuous, 275
discrete, 275
Rank deficiency, 191
Rank-deficient problems, 91
Rational model, 148
Regularization, 192, 241
Residual analysis, 16
Response surfaces, 238, 244
Root mean square, 277
Sample space, 275
Second-order
sufficient conditions, 151
Secular equation, 130
Seismic
ray tracing, 259
survey, 258
Sensitivity analysis, 59, 127
Sherman-Morrison-Woodbury
formula, 270
Sigmoid activation functions, 244
Signal to noise ratio, 278
Singular value, 50, 110
normalized, 92
Singular value decomposition, 47, 50
economical, 50
generalized, 54, 122, 129
partial, 140
Singular vectors, 50
Stability, 88
Standard deviation, 185, 277
Standard error, 37
Stationary methods, 107
Stationary point, 272
Statistically independent, 277
Steepest descent, 153
Step length, 177
Stopping criteria, 114
Sum-of-absolute-values, 5, 11
Surrogates, 238
SVD computations, 101
Systems of linear equations
parallel iterative solution, 188
Taylor's theorem, 272
Tensor product, 209
bivariate splines, 208
cubic B-splines, 222
data, 224
fitting, 224
Tikhonov regularization, 118, 163
Toeplitz matrix, 41, 88
Total least squares, 136
Truncated SVD, 118
Uniformly monotone operator, 273
Unitary matrix, 34
Unit round-off, 263
UTV decomposition, 98
Vandermonde matrix, 57
Variable projection, 231, 235, 249
algorithm, 181, 183
principle, 181
Variance, 276
VARP2, 184
VARPRO program, 183, 236
Vector
contracting, 273
Lipschitz, 273
norms, 267
Weighted residuals, 6
White noise, 5, 14
test, 14, 15
vector, 278