
# Fundamental Numerical Methods and Data Analysis

by

George W. Collins, II

© George W. Collins, II 2003

Table of Contents
List of Figures .....................................................................................................................................vi

List of Tables.......................................................................................................................................ix

Preface .............................................................................................................................. xi
Notes to the Internet Edition ....................................................................................xiv

## 2.2 Direct Methods for the Solution of Linear Algebraic Equations............................. 28

a. Solution by Cramer's Rule............................................................................ 28
b. Solution by Gaussian Elimination................................................................ 30
c. Solution by Gauss Jordan Elimination......................................................... 31
d. Solution by Matrix Factorization: The Crout Method................................. 34
e. The Solution of Tri-diagonal Systems of Linear Equations........................ 37

## 2.3 Solution of Linear Equations by Iterative Methods ................................................. 38

a. Solution by The Gauss and Gauss-Seidel Iteration Methods ...................... 38
b. The Method of Hotelling and Bodewig ..................................................... 41
c. Relaxation Methods for the Solution of Linear Equations.......................... 44
d. Convergence and Fixed-point Iteration Theory........................................... 46

## 2.4 The Similarity Transformations and the Eigenvalues and Vectors of a Matrix ........................................................................................................................ 47

Chapter 2 Exercises ............................................................................................................... 52

## 3.1 Polynomials and Their Roots.................................................................................... 56

a. Some Constraints on the Roots of Polynomials........................................... 57
b. Synthetic Division......................................................................................... 58
c. The Graffe Root-Squaring Process .............................................................. 60
d. Iterative Methods .......................................................................................... 61

## 3.2 Curve Fitting and Interpolation................................................................................. 64

a. Lagrange Interpolation ................................................................................. 65
b. Hermite Interpolation.................................................................................... 72
c. Splines ........................................................................................................... 75
d. Extrapolation and Interpolation Criteria ...................................................... 79

## 3.3 Orthogonal Polynomials ........................................................................................... 85

a. The Legendre Polynomials........................................................................... 87
b. The Laguerre Polynomials ........................................................................... 88
c. The Hermite Polynomials............................................................................. 89
d. Additional Orthogonal Polynomials ............................................................ 90
e. The Orthogonality of the Trigonometric Functions..................................... 92

## 4.1 Numerical Differentiation ..........................................................................................98

a. Classical Difference Formulae ......................................................................98
b. Richardson Extrapolation for Derivatives...................................................100

## 4.2 Numerical Evaluation of Integrals: Quadrature ......................................................102

a. The Trapezoid Rule .....................................................................................102
b. Simpson's Rule.............................................................................................103
c. Quadrature Schemes for Arbitrarily Spaced Functions..............................105
d. Gaussian Quadrature Schemes ....................................................................107
e. Romberg Quadrature and Richardson Extrapolation..................................111
f. Multiple Integrals.........................................................................................113

## 4.3 Monte Carlo Integration Schemes and Other Tricks...............................................115

a. Monte Carlo Evaluation of Integrals...........................................................115
b. The General Application of Quadrature Formulae to Integrals .................117

## 5.1 The Numerical Integration of Differential Equations .............................................122

a. One Step Methods of the Numerical Solution of Differential
Equations......................................................................................................123
b. Error Estimate and Step Size Control .........................................................131
c. Multi-Step and Predictor-Corrector Methods .............................................134
d. Systems of Differential Equations and Boundary Value
Problems.......................................................................................................138
e. Partial Differential Equations ......................................................................146

## 5.2 The Numerical Solution of Integral Equations........................................................147

a. Types of Linear Integral Equations.............................................................148
b. The Numerical Solution of Fredholm Equations........................................148
c. The Numerical Solution of Volterra Equations ..........................................150
d. The Influence of the Kernel on the Solution...............................................154

## 6.1 Legendre's Principle of Least Squares.....................................................................160

a. The Normal Equations of Least Squares.....................................................161
b. Linear Least Squares....................................................................................162
c. The Legendre Approximation .....................................................................164

## 6.2 Least Squares, Fourier Series, and Fourier Transforms..........................................165

a. Least Squares, the Legendre Approximation, and Fourier Series..............165
b. The Fourier Integral.....................................................................................166
c. The Fourier Transform ................................................................................167
d. The Fast Fourier Transform Algorithm ......................................................169

## 6.3 Error Analysis for Linear Least-Squares .................................................................176

a. Errors of the Least Square Coefficients ......................................................176
b. The Relation of the Weighted Mean Square Observational Error
to the Weighted Mean Square Residual......................................................178
c. Determining the Weighted Mean Square Residual ....................................179
d. The Effects of Errors in the Independent Variable .....................................181

## 6.4 Non-linear Least Squares .........................................................................................182

a. The Method of Steepest Descent.................................................................183
b. Linear approximation of $f(a_j, x)$ ...................................................................184
c. Errors of the Least Squares Coefficients.....................................................186

## 6.5 Other Approximation Norms ...................................................................................187

a. The Chebyschev Norm and Polynomial Approximation ...........................188
b. The Chebyschev Norm, Linear Programming, and the Simplex
Method .........................................................................................................189
c. The Chebyschev Norm and Least Squares .................................................190

## 7.1 Basic Aspects of Probability Theory .......................................................................200

a. The Probability of Combinations of Events................................................201
b. Probabilities and Random Variables...........................................................202
c. Distributions of Random Variables.............................................................203

## 7.2 Common Distribution Functions .............................................................................204

a. Permutations and Combinations..................................................................204
b. The Binomial Probability Distribution........................................................205
c. The Poisson Distribution .............................................................................206
d. The Normal Curve .......................................................................................207
e. Some Distribution Functions of the Physical World ..................................210

## 7.4 The Foundations of Statistical Analysis ..................................................................217

a. Moments of the Binomial Distribution .......................................................218
b. Multiple Variables, Variance, and Covariance ...........................................219
c. Maximum Likelihood ..................................................................................221

Chapter 7 Exercises .............................................................................................................223

## 8.1 The t, $\chi^2$, and F Statistical Distribution Functions..................................................226

a. The t-Density Distribution Function ...........................................................226
b. The $\chi^2$-Density Distribution Function ........................................................227
c. The F-Density Distribution Function ..........................................................229

## 8.2 The Level of Significance and Statistical Tests ......................................................231

a. The "Students" t-Test...................................................................................232
b. The $\chi^2$-test ....................................................................................................233
c. The F-test .....................................................................................................234
d. Kolmogorov-Smirnov Tests ........................................................................235

## 8.3 Linear Regression, and Correlation Analysis..........................................................237

a. The Separation of Variances and the Two-Variable Correlation
Coefficient....................................................................................................238
b. The Meaning and Significance of the Correlation Coefficient ..................240
c. Correlations of Many Variables and Linear Regression ............................242
d. Analysis of Variance....................................................................................243

## 8.4 The Design of Experiments .....................................................................................246

a. The Terminology of Experiment Design ....................................................249
b. Blocked Designs ..........................................................................................250
c. Factorial Designs .........................................................................................252

## Chapter 8 References and Supplemental Reading .............................................................257

Index......................................................................................................................................259

List of Figures

Figure 1.1 shows two coordinate frames related by the transformation angles $\phi_{ij}$. Four
coordinates are necessary if the frames are not orthogonal.................................................. 11

Figure 1.2 shows two neighboring points P and Q in two adjacent coordinate systems
X and X'. The differential distance between the two is $d\vec{x}$. The vectorial
distance to the two points is $\vec{X}(P)$ or $\vec{X}'(P)$ and $\vec{X}(Q)$ or $\vec{X}'(Q)$ respectively.................. 15

Figure 1.3 schematically shows the divergence of a vector field. In the region where
the arrows of the vector field converge, the divergence is positive, implying an
increase in the source of the vector field. The opposite is true for the region
where the field vectors diverge. ............................................................................................ 19

Figure 1.4 schematically shows the curl of a vector field. The direction of the curl is
determined by the "right hand rule" while the magnitude depends on the rate of
change of the x- and y-components of the vector field with respect to y and x. ................. 19

Figure 1.5 schematically shows the gradient of the scalar dot-density in the form of a
number of vectors at randomly chosen points in the scalar field. The direction of
the gradient points in the direction of maximum increase of the dot-density,
while the magnitude of the vector indicates the rate of change of that density. ................ 20

Figure 3.1 depicts a typical polynomial with real roots. Construct the tangent to the
curve at the point $x_k$ and extend this tangent to the x-axis. The crossing point
$x_{k+1}$ represents an improved value for the root in the Newton-Raphson
algorithm. The point $x_{k-1}$ can be used to construct a secant providing a second
method for finding an improved value of x. ......................................................................... 62

Figure 3.2 shows the behavior of the data from table 3.1. The results of various forms
of interpolation are shown. The approximating polynomials for the linear and
parabolic Lagrangian interpolation are specifically displayed. The specific
results for cubic Lagrangian interpolation, weighted Lagrangian interpolation
and interpolation by rational first degree polynomials are also indicated. ......................... 69

Figure 4.1 shows a function whose integral from a to b is being evaluated by the
trapezoid rule. In each interval ∆xi the function is approximated by a straight
line.........................................................................................................................................103

Figure 4.2 shows the variation of a particularly complicated integrand. Clearly it is not
a polynomial and so could not be evaluated easily using standard quadrature
formulae. However, we may use Monte Carlo methods to determine the ratio
area under the curve compared to the area of the rectangle. ...............................................117

Figure 5.1 shows the solution space for the differential equation y' = g(x,y). Since the
initial value is different for different solutions, the space surrounding the
solution of choice can be viewed as being full of alternate solutions. The two
dimensional Taylor expansion of the Runge-Kutta method explores this solution
space to obtain a higher order value for the specific solution in just one step....................127

Figure 5.2 shows the instability of a simple predictor scheme that systematically
underestimates the solution leading to a cumulative build up of truncation error..............135

Figure 6.1 compares the discrete Fourier transform of the function $e^{-|x|}$ with the
continuous transform for the full infinite interval. The oscillatory nature of the
discrete transform largely results from the small number of points used to
represent the function and the truncation of the function at t = ±2. The only
points in the discrete transform that are even defined are denoted by ...............................173

Figure 6.2 shows the parameter space defined by the $\phi_j(x)$'s. Each $f(a_j, x_i)$ can be
represented as a linear combination of the $\phi_j(x_i)$ where the $a_j$ are the coefficients
of the basis functions. Since the observed variables $Y_i$ cannot be expressed in
terms of the $\phi_j(x_i)$, they lie out of the space. ........................................................................180

Figure 6.3 shows the $\chi^2$ hypersurface defined on the $a_j$ space. The non-linear least
square seeks the minimum regions of that hypersurface. The gradient method
moves the iteration in the direction of steepest descent based on local values of
the derivative, while surface fitting tries to locally approximate the function in
some simple way and determines the local analytic minimum as the next guess
for the solution. .....................................................................................................................184

Figure 6.4 shows the Chebyschev fit to a finite set of data points. In panel a the fit is
with a constant $a_0$ while in panel b the fit is with a straight line of the form
$f(x) = a_1 x + a_0$. In both cases, the adjustment of the parameters of the function
can only produce n+2 maximum errors for the (n+1) free parameters. ..............................188

Figure 6.5 shows the parameter space for fitting three points with a straight line under
the Chebyschev norm. The equations of condition denote half-planes which
satisfy the constraint for one particular point.......................................................................189

Figure 7.1 shows a sample space giving rise to events E and F. In the case of the die, E
is the probability of the result being less than three and F is the probability of
the result being even. The intersection of circle E with circle F represents the
probability of E and F [i.e. P(EF)]. The union of circles E and F represents the
probability of E or F. If we were to simply sum the area of circle E and that of
F we would double count the intersection. ..........................................................................202

Figure 7.2 shows the normal curve approximation to the binomial probability
distribution function. We have chosen the coin tosses so that p = 0.5. Here µ
and σ can be seen as the most likely value of the random variable x and the
'width' of the curve respectively. The tail end of the curve represents the region
approximated by the Poisson distribution............................................................................209

Figure 7.3 shows the mean of a function f(x) as <x>. Note this is not the same as the
most likely value of x as was the case in figure 7.2. However, in some real
sense σ is still a measure of the width of the function. The skewness is a
measure of the asymmetry of f(x) while the kurtosis represents the degree to
which the f(x) is 'flattened' with respect to a normal curve. We have also
marked the location of the values for the upper and lower quartiles, median and
mode......................................................................................................................................214

Figure 8.1 shows a comparison between the normal curve and the t-distribution
function for N = 8. The symmetric nature of the t-distribution means that the
mean, median, mode, and skewness will all be zero while the variance and
kurtosis will be slightly larger than their normal counterparts. As N → ∞, the
t-distribution approaches the normal curve with unit variance. ..........................................227

Figure 8.2 compares the $\chi^2$-distribution with the normal curve. For N=10 the curve is
quite skewed near the origin with the mean occurring past the mode ($\chi^2 = 8$).
The Normal curve has µ = 8 and $\sigma^2 = 20$. For large N, the mode of the
$\chi^2$-distribution approaches half the variance and the distribution function
approaches a normal curve with the mean equal to the mode. ................................................228

Figure 8.3 shows the probability density distribution function for the F-statistic with
values of $N_1 = 3$ and $N_2 = 5$ respectively. Also plotted are the limiting
distribution functions $f(\chi^2/N_1)$ and $f(t^2)$. The first of these is obtained from $f(F)$
in the limit of $N_2 \to \infty$. The second arises when $N_1 \geq 1$. One can see the tail of
the $f(t^2)$ distribution approaching that of $f(F)$ as the value of the independent
variable increases. Finally, the normal curve which all distributions approach
for large values of N is shown with a mean equal to $\bar{F}$
and a variance equal to the variance for $f(F)$.........................................................................230

Figure 8.4 shows a histogram of the sampled points xi and the cumulative probability
of obtaining those points. The Kolmogorov-Smirnov tests compare that
probability with another known cumulative probability and ascertain the odds
that the differences occurred by chance. ..............................................................................237

Figure 8.5 shows the regression lines for the two cases where the variable X2 is
regarded as the dependent variable (panel a) and the variable X1 is regarded as
the dependent variable (panel b). ........................................................................................240

List of Tables

Table 2.2 Sample Iterative Solution for the Relaxation Method.............................................. 45

Table 3.1 Sample Data and Results for Lagrangian Interpolation Formulae .......................... 67

Table 3.4 Parameters for Quotient Polynomial Interpolation .................................................. 83

Table 3.5 The First Five Members of the Common Orthogonal Polynomials ........................ 90

Table 5.2 Sample Runge-Kutta Solutions................................................................................130

Table 5.3 Solutions of a Sample Boundary Value Problem for Various Orders of
Approximation .........................................................................................................145

Table 5.4 Solutions of a Sample Boundary Value Problem Treated as an Initial
Value Problem..........................................................................................................145

Table 7.1 Grade Distribution for Sample Test Results............................................................216

Table 7.2 Examination Statistics for the Sample Test.............................................................216

Table 8.2 Factorial Combinations for Two-level Experiments with n=2-4............................253

Preface

• • •

The origins of this book can be found years ago when I was
a doctoral candidate working on my thesis and finding that I needed numerical tools that I should have
been taught years before. In the intervening decades, little has changed except for the worse. All fields
of science have undergone an information explosion while the computer revolution has steadily and
irrevocably been changing our lives. Although the crystal ball of the future is at best "seen through a
glass darkly", most would declare that the advent of the digital electronic computer will change
civilization to an extent not seen since the coming of the steam engine. Computers with the power that
could be offered only by large institutions a decade ago now sit on the desks of individuals. Methods of
analysis that were only dreamed of three decades ago are now used by students to do homework
exercises. Entirely new methods of analysis have appeared that take advantage of computers to perform
logical and arithmetic operations at great speed. Perhaps students of the future may regard the
multiplication of two two-digit numbers without the aid of a calculator in the same vein that we regard
the formal extraction of a square root. The whole approach to scientific analysis may change with the
advent of machines that communicate orally. However, I hope the day never arrives when the
investigator no longer understands the nature of the analysis done by the machine.

Unfortunately, instruction in the uses and applicability of new methods of analysis rarely
appears in the curriculum. This is no surprise as such courses in any discipline always are the last to be
developed. In rapidly changing disciplines this means that active students must fend for themselves.
With numerical analysis this has meant that many simply take the tools developed by others and apply
them to problems with little knowledge as to the applicability or accuracy of the methods. Numerical
algorithms appear as neatly packaged computer programs that are regarded by the user as "black boxes"
into which they feed their data and from which come the publishable results. The complexity of many of
the problems dealt with in this manner makes determining the validity of the results nearly impossible.
This book is an attempt to correct some of these problems.

Some may regard this effort as a survey and to that I would plead guilty. But I do not regard the
word survey as pejorative, for to survey, condense, and collate the knowledge of man is one of the
responsibilities of the scholar. There is an implication inherent in this responsibility that the information
be made more comprehensible so that it may more readily be assimilated. The extent to which I have
succeeded in this goal I will leave to the reader. The discussion of so many topics may be regarded by
some to be an impossible task. However, the subjects I have selected have all been required of me
during my professional career and I suspect most research scientists would make a similar
claim. Unfortunately, few of these subjects were ever covered in even the introductory level of treatment
given here during my formal education and certainly they were never placed within a coherent context
of numerical analysis.

The basic format of the first chapter is a very wide ranging view of some concepts of
mathematics based loosely on axiomatic set theory and linear algebra. The intent here is not so much to
provide the specific mathematical foundation for what follows, which is done as needed throughout the
text, but rather to establish what I call, for lack of a better term, "mathematical sophistication". There is
a general acquaintance with mathematics that a student should have before embarking on the study of
numerical methods. The student should realize that there is a subject called mathematics which is
artificially broken into sub-disciplines such as linear algebra, arithmetic, calculus, topology, set theory,
etc. All of these disciplines are related and the sooner the student realizes that and becomes aware of the
relations, the sooner mathematics will become a convenient and useful language of scientific
expression. The ability to use mathematics in such a fashion is largely what I mean by "mathematical
sophistication". However, this book is primarily intended for scientists and engineers so while there is a
certain familiarity with mathematics that is assumed, the rigor that one expects with a formal
mathematical presentation is lacking. Very little is proved in the traditional mathematical sense of the
word. Indeed, derivations are resorted to mainly to emphasize the assumptions that underlie the results.
However, when derivations are called for, I will often write several forms of the same expression on the
same line. This is done simply to guide the reader in the direction of a mathematical development. I will
often give "rules of thumb" for which there is no formal proof. However, experience has shown that
these "rules of thumb" almost always apply. This is done in the spirit of providing the researcher with
practical ways to evaluate the validity of his or her results.

The basic premise of this book is that it can serve as the basis for a wide range of courses that
discuss numerical methods used in science. It is meant to support a series of lectures, not replace them.
To reflect this, the subject matter is wide ranging and perhaps too broad for a single course. It is
expected that the instructor will neglect some sections and expand on others. For example, the social
scientist may choose to emphasize the chapters on interpolation, curve-fitting and statistics, while the
physical scientist would stress those chapters dealing with numerical quadrature and the solution of
differential and integral equations. Others might choose to spend a large amount of time on the principle
of least squares and its ramifications. All these approaches are valid and I hope all will be served by this
book. While it is customary to direct a book of this sort at a specific pedagogic audience, I find that task
somewhat difficult. Certainly advanced undergraduate science and engineering students will have no
difficulty dealing with the concepts and level of this book. However, it is not at all obvious that second
year students couldn't cope with the material. Some might suggest that they have not yet had a formal
course in differential equations at that point in their career and are therefore not adequately prepared.
However, it is far from obvious to me that a student’s first encounter with differential equations should
be in a formal mathematics course. Indeed, since most equations they are liable to encounter will require
a numerical solution, I feel the case can be made that it is more practical for them to be introduced to the
subject from a graphical and numerical point of view. Thus, if the instructor exercises some care in the
presentation of material, I see no real barrier to using this text at the second year level in some areas. In
any case I hope that the student will at least be exposed to the wide range of the material in the book lest
he feel that numerical analysis is limited only to those topics of immediate interest to his particular
specialty.

Nowhere is this philosophy better illustrated than in the first chapter where I deal with
a wide range of mathematical subjects. The primary objective of this chapter is to show that
mathematics is "all of a piece". Here the instructor may choose to ignore much of the material and jump
directly to the solution of linear equations and the second chapter. However, I hope that some
consideration would be given to discussing the material on matrices presented in the first chapter before
embarking on their numerical manipulation. Many will feel the material on tensors is irrelevant and will
skip it. Certainly it is not necessary to understand covariance and contravariance or the notion of tensor
and vector densities in order to numerically interpolate in a table of numbers. But those in the physical
sciences will generally recognize that they encountered tensors for the first time too late in their
educational experience and that they form the fundamental basis for understanding vector algebra and
calculus. While the notions of set and group theory are not directly required for the understanding of
cubic splines, they do form a unifying basis for much of mathematics. Thus, while I expect most
instructors will heavily select the material from the first chapter, I hope they will encourage the students
to at least read through the material so as to reduce their surprise when they see it again.

The next four chapters deal with fundamental subjects in basic numerical analysis. Here, and
throughout the book, I have avoided giving specific programs that carry out the algorithms that are
discussed. There are many useful and broadly based programs available from diverse sources. To pick
specific packages or even specific computer languages would be to unduly limit the student's range and
selection. Excellent packages are contained in the IMSL library and one should not overlook the excellent
collection provided along with the book by Press et al. (see reference 4 at the end of Chapter 2). In
general collections compiled by users should be preferred for they have at least been screened initially
for efficacy.

Chapter 6 is a lengthy treatment of the principle of least squares and associated topics. I have
found that algorithms based on least squares are among the most widely used and poorest understood of
all algorithms in the literature. Virtually all students have encountered the concept, but very few see and
understand its relationship to the rest of numerical analysis and statistics. Least squares also provides a
logical bridge to the last chapters of the book. Here the huge field of statistics is surveyed with the hope
of providing a basic understanding of the nature of statistical inference and how to begin to use
statistical analysis correctly and with confidence. The foundation laid in Chapter 7 and the tests
presented in Chapter 8 are not meant to be a substitute for a proper course of study in the subject.
However, it is hoped that the student unable to fit such a course in an already crowded curriculum will
at least be able to avoid the pitfalls that trap so many who use statistical analysis without the appropriate
care.

Throughout the book I have tried to provide examples integrated into the text of the more
difficult algorithms. In testing an earlier version of the book, I found myself spending most of my time
with students giving examples of the various techniques and algorithms. Hopefully this initial
shortcoming has been overcome. It is almost always appropriate to carry out a short numerical example
of a new method so as to test the logic being used for the more general case. The problems at the end of
each chapter are meant to be generic in nature so that the student is not left with the impression that this
algorithm or that is only used in astronomy or biology. It is a fairly simple matter for an instructor to
find examples in diverse disciplines that utilize the techniques discussed in each chapter. Indeed, the
student should be encouraged to undertake problems in disciplines other than his/her own if for no other
reason than to find out about the types of problems that concern those disciplines.

Here and there throughout the book, I have endeavored to convey something of the
philosophy of numerical analysis along with a little of the philosophy of science. While this is certainly
not the central theme of the book, I feel that some acquaintance with the concepts is essential to anyone
aspiring to a career in science. Thus I hope those ideas will not be ignored by the student on his/her way
to find some tool to solve an immediate problem. The philosophy of any subject is the basis of that
subject and to ignore it while utilizing the products of that subject is to invite disaster.

There are many people who knowingly and unknowingly had a hand in generating this book.
Those at the Numerical Analysis Department of the University of Wisconsin who took a young
astronomy student and showed him the beauty of this subject while remaining patient with his bumbling
understanding have my perpetual gratitude. I am also grateful to my colleagues at The Ohio State University who years ago
saw the need for the presentation of this material and provided the environment for the
development of a formal course in the subject. Special thanks are due Professor Philip C. Keenan who
encouraged me to include the sections on statistical methods in spite of my shortcomings in this area.
Peter Stoychoeff has earned my gratitude by turning my crude sketches into clear and instructive
drawings. Certainly the students who suffered through this book as an experimental text have my
admiration as well as my thanks.

George W. Collins, II
September 11, 1990

## A Note Added for the Internet Edition

A significant amount of time has passed since I first put this effort together. Much has changed in
Numerical Analysis. Researchers now seem often content to rely on packages prepared by others even
more than they did a decade ago. Perhaps this is the price to be paid by tackling increasingly
ambitious problems. Also the advent of very fast and cheap computers has enabled investigators to
use inefficient methods and still obtain answers in a timely fashion. However, with the avalanche of
data about to descend on more and more fields, it does not seem unreasonable to suppose that
numerical tasks will overtake computing power and there will again be a need for efficient and
accurate algorithms to solve problems. I suspect that many of the techniques described herein will be
rediscovered before the new century concludes. Perhaps efforts such as this will still find favor with
those who wish to know if numerical results can be believed.

George W. Collins, II
January 30, 2001

## A Further Note for the Internet Edition

Since I put up a version of this book two years ago, I have found numerous errors which
largely resulted from the generations of word processors through which the text evolved. During the
last effort, not all the fonts used by the text were available in the word processor and PDF translator.
This led to errors that were more widespread than I realized. Thus, the main force of this effort is to
bring some uniformity to the various software codes required to generate the version that will be
available on the internet. Having spent some time converting Fundamentals of Stellar Astrophysics
and The Virial Theorem in Stellar Astrophysics to Internet compatibility, I have learned to better
understand the problems of taking old manuscripts and setting them in the contemporary format. Thus
I hope this version of my Numerical Analysis book will be more error free and therefore useable. Have
I found all the errors? That is most unlikely, but I can assure the reader that the number of those
errors is significantly reduced from the earlier version. In addition, I have attempted to improve the
presentation of the equations and other aspects of the book so as to make it more attractive to the
reader. All of the software coding for the index was lost during the travels through various word
processors. Therefore, the current version was prepared by means of a page comparison between an
earlier correct version and the current presentation. Such a table has an intrinsic error of at least ± 1
page and the index should be used with that in mind. However, it should be good enough to guide the
reader to the general area of the desired subject.

Having re-read the earlier preface and note I wrote, I find I still share the sentiments
expressed therein. Indeed, I find the flight of the student to “black-box” computer programs to obtain
solutions to problems has proceeded even faster than I thought it would. Many of these programs such
as MATHCAD are excellent and provide quick and generally accurate ‘first looks’ at problems.
However, the researcher would be well advised to understand the methods used by the “black-boxes”
to solve their problems. This effort still provides the basis for many of the operations contained in
those commercial packages and it is hoped will provide the researcher with the knowledge of their
applicability to his/her particular problem. However, it has occurred to me that there is an additional
view provided by this book. Perhaps, in the future, a historian may wonder what sort of numerical
skills were expected of a researcher in the mid twentieth century. In my opinion, the contents of this
book represent what I feel scientists and engineers of the mid twentieth century should have known
and many did. I am confident that the knowledge-base of the mid twenty first century scientist will be
quite different. One can hope that the difference will represent an improvement.

Finally, I would like to thank John Martin and Charles Knox who helped me adapt this
version for the Internet and the Astronomy Department at the Case Western Reserve University for
making the server-space available for the PDF files. As is the case with other books I have put on the
Internet, I encourage anyone who is interested to download the PDF files as they may be of use to
them. I would only request that they observe the courtesy of proper attribution should they find my
efforts to be of use.

George W. Collins, II
April, 2003
Case Western Reserve University

# 1 Introduction and Fundamental Concepts

The numerical expression of a scientific statement has traditionally been
the manner by which scientists have verified a theoretical description of the physical world. During this
century there has been a revolution in both the nature and extent to which this numerical comparison can be
made. Indeed, it seems likely that when the history of this century is definitively written, it will be the
development of the computer, which will be regarded as its greatest technological achievement - not nuclear
power. While it is true that the origins of the digital computer can be traced through the work of Charles
Babbage, Hermann Hollerith, and others in the nineteenth century, the real advance came after the Second
World War when machines were developed that were able to carry out an extended sequence of instructions
at a rate that was very much greater than a human could manage. We call such machines programmable.

The electronic digital computer of the sort developed by John von Neumann and others in the 1950s really
ushered in the present computer revolution. While it is still too soon to delineate the form and consequences
of this revolution, it is already clear that it has forever changed the way in which science and engineering
will be done. The entire approach to numerical analysis has changed in the past two decades and that change
will most certainly continue rapidly into the future. Prior to the advent of the electronic digital computer, the
emphasis in computing was on short cuts and methods of verification which ensured that computational
errors could be caught before they propagated through the solution. Little attention was paid to "round off
error" since the "human computer" could easily control such problems when they were encountered. Now
the reliability of electronic machines has nearly eliminated concerns of random error, but round off error can
be a persistent problem.

The extreme speed of contemporary machines has tremendously expanded the scope of numerical
problems that may be considered as well as the manner in which such computational problems may even be
approached. However, this expansion of the degree and type of problem that may be numerically solved has
removed the scientist from the details of the computation. For this, most would shout "Hooray"! But this
removal of the investigator from the details of computation may permit the propagation of errors of various
types to intrude and remain undetected. Modern computers will almost always produce numbers, but
whether they represent the solution to the problem or the result of error propagation may not be obvious.
This situation is made worse by the presence of programs designed for the solution of broad classes of
problems. Almost every class of problems has its pathological example for which the standard techniques
will fail. Generally little attention is paid to the recognition of these pathological cases which have an
uncomfortable habit of turning up when they are least expected.

Thus the contemporary scientist or engineer should be skeptical of the answers presented by the
modern computer unless he or she is completely familiar with the numerical methods employed in obtaining
that solution. In addition, the solution should always be subjected to various tests for "reasonableness".
There is often a tendency to regard the computer and the programs it runs as "black boxes" from
which come infallible answers. Such an attitude can lead to catastrophic results and belies the attitude of
"healthy skepticism" that should pervade all science. It is necessary to understand, at least at some level,
what the "Black Boxes" do. That understanding is one of the primary aims of this book.

It is not my intention to teach the techniques of programming a computer. There are many excellent
texts on the multitudinous languages that exist for communicating with a computer. I will assume that the
reader has sufficient capability in this area to at least conceptualize the manner by which certain processes
could be communicated to the computer or at least recognize a computer program that does so. However, the
programming of a computer does represent a concept that is not found in most scientific or mathematical
presentations. We will call that concept an algorithm. An algorithm is simply a sequence of mathematical
operations which, when performed in sequence, lead to the numerical answer to some specified problem.
Much time and effort is devoted to ascertaining the conditions under which a particular algorithm will work.
In general, we will omit the proof and give only the results when they are known. The use of algorithms and
the ability of computers to carry out vastly more operations in a short interval of time than the human
programmer could do in several lifetimes leads to some unsettling differences between numerical analysis
and other branches of mathematics and science.

Much as the scientist may be unwilling to admit it, some aspects of art creep into numerical analysis.
Knowing when a particular algorithm will produce correct answers to a given problem often involves a non-
trivial amount of experience as well as a broad based knowledge of machines and computational procedures.
The student will achieve some feeling for this aspect of numerical analysis by considering problems for
which a given algorithm should work, but doesn't. In addition, we shall give some "rules of thumb" which
indicate when a particular numerical method is failing. Such "rules of thumb" are not guarantees of either
success or failure of a specific procedure, but represent instances when a greater degree of skepticism on the
part of the investigator may be warranted.

As already indicated, a broad base of experience is useful when trying to ascertain the validity of the
results of any computer program. In addition, when trying to understand the utility of any algorithm for
calculation, it is useful to have as broad a range of mathematical knowledge as possible. Mathematics is
indeed the language of science and the more proficient one is in the language the better. So a student should
realize as soon as possible that there is essentially one subject called mathematics, which for reasons of
convenience we break down into specific areas such as arithmetic, algebra, calculus, tensors, group theory,
etc. The more areas that the scientist is familiar with, the more he/she may see the relations between them.
The more the relations are apparent, the more useful mathematics will be. Indeed, it is all too common for
the modern scientist to flee to a computer for an answer. I cannot emphasize too strongly the need to analyze
a problem thoroughly before any numerical solution is attempted. Very often a better numerical approach
will suggest itself during the analysis and occasionally one may find that the answer has a closed form
analytic solution and a numerical solution is unnecessary.

However, it is too easy to say "I don't have the background for this subject" and thereby never
attempt to learn it. The complete study of mathematics is too vast for anyone to acquire in his or her lifetime.
Scientists simply develop a base and then continue to add to it for the rest of their professional lives. To be a
successful scientist one cannot know too much mathematics. In that spirit, we shall "review" some
mathematical concepts that are useful to understanding numerical methods and analysis. The word review
should be taken to mean a superficial summary of the area mainly done to indicate the relation to other areas.
Virtually every area mentioned has itself been a subject for many books and has occupied the study of some
investigators for a lifetime. This short treatment should not be construed in any sense as being complete.
Some of this material will indeed be viewed as elementary and if thoroughly understood may be skimmed.
However, many will find some of these concepts far from elementary. Nevertheless they will sooner
or later be useful in understanding numerical methods and providing a basis for the knowledge that
mathematics is "all of a piece".

## 1.1 Basic Properties of Sets and Groups

Most students are introduced to the notion of a set very early in their educational experience.
However, the concept is often presented in a vacuum without showing its relation to any other area of
mathematics and thus it is promptly forgotten. Basically a set is a collection of elements. The notion of an
element is left deliberately vague so that it may represent anything from cows to the real numbers. The
number of elements in the set is also left unspecified and may or may not be finite. Just over a century ago
Georg Cantor basically founded set theory and in doing so clarified our notion of infinity by showing that
there are different types of infinite sets. He did this by generalizing what we mean when we say that two sets
have the same number of elements. Certainly if we can identify each element in one set with a unique
element in the second set and there are none left over when the identification is completed, then we would be
entitled to say that the two sets had the same number of elements. Cantor did this formally with the
infinite set composed of the positive integers and the infinite set of the real numbers. He showed that it is not
possible to identify each real number with an integer so that there are more real numbers than integers and
thus different degrees of infinity which he called cardinality. He used the first letter of the Hebrew alphabet
to denote the cardinality of an infinite set so that the integers had cardinality ℵ0 while the set of real numbers
had the larger cardinality 2^ℵ0, which equals ℵ1 only if the continuum hypothesis is assumed. Some of the brightest minds of the twentieth century have been concerned with the
properties of infinite sets.


Our main interest will center on those sets which have constraints placed on their elements for it will
be possible to make some very general statements about these restricted sets. For example, consider a set
wherein the elements are related by some "law". Let us denote the "law" by the symbol ‡. If two elements
are combined under the "law" so as to yield another element in the set, the set is said to be closed with
respect to that law. Thus if a, b, and c are elements of the set and
a‡b = c , (1.1.1)
then the set is said to be closed with respect to ‡. We generally consider ‡ to be some operation like + or ×,
but we shouldn't feel that the concept is limited to such arithmetic operations alone. Indeed, one might
consider operations such as b 'follows' a to be an example of a law operating on a and b.

If we place some additional conditions on the elements of the set, we can create a somewhat more
restricted collection of elements called a group. Let us suppose that one of the elements of the set is what we
call a unit element. Such an element is one which, when combined with any other element of the set under
the law, produces that same element. Thus
a‡i = a . (1.1.2)
This suggests another useful constraint, namely that there are elements in the set that can be designated
"inverses". An inverse of an element is one that when combined with its element under the law produces the
unit element i or
a⁻¹‡a = i . (1.1.3)

Now with one further restriction on the law itself, we will have all the conditions required to
produce a group. The restriction is known as associativity. A law is said to be associative if the order in
which it is applied to three elements does not determine the outcome of the application. Thus

(a‡b)‡c = a‡(b‡c) . (1.1.4)

If a set possesses a unit element and inverse elements and is closed under an associative law, that set is called a
group under the law. Therefore the normal integers form a group under addition. The unit is zero and the
inverse operation is clearly subtraction and certainly the addition of any two integers produces another
integer. The law of addition is also associative. However, it is worth noting that the integers do not form a
group under multiplication as the inverse operation (reciprocal) does not produce a member of the group (an
integer). One might think that these very simple constraints would not be sufficient to tell us much that is
new about the set, but the notion of a group is so powerful that an entire area of mathematics known as group
theory has developed. It is said that Eugene Wigner once described all of the essential aspects of the
thermodynamics of heat transfer on one sheet of paper using the results of group theory.
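
The four conditions can be tested by brute force for a small finite set. The following is only a sketch under stated assumptions: the integers themselves are infinite and cannot be enumerated, so the integers modulo 5 stand in for them, and the helper name `is_group` is hypothetical rather than anything defined in this book.

```python
from itertools import product

def is_group(elements, law):
    """Brute-force test of closure, unit element, inverses, and associativity."""
    elements = list(elements)
    # Closure: combining any two elements must give an element of the set.
    if any(law(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False
    # Unit element i: combining i with any a returns a, as in equation (1.1.2).
    units = [i for i in elements
             if all(law(a, i) == a and law(i, a) == a for a in elements)]
    if not units:
        return False
    unit = units[0]
    # Inverses: every a needs some b with b (law) a = i, as in equation (1.1.3).
    if any(all(law(b, a) != unit for b in elements) for a in elements):
        return False
    # Associativity: (a b) c must equal a (b c), as in equation (1.1.4).
    return all(law(law(a, b), c) == law(a, law(b, c))
               for a, b, c in product(elements, repeat=3))

n = 5  # illustrative modulus, an assumption of this sketch

def add_mod(a, b):
    return (a + b) % n

def mul_mod(a, b):
    return (a * b) % n

print(is_group(range(n), add_mod))     # True: addition modulo 5
print(is_group(range(n), mul_mod))     # False: zero has no inverse
print(is_group(range(1, n), mul_mod))  # True once zero is excluded
```

The middle case fails for the same reason the integers fail under multiplication: an inverse is missing from the set.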

While the restrictions that enable the elements of a set to form a group are useful, they are not the
only restrictions that frequently apply. The notion of commutativity is certainly present for the laws of
addition and scalar multiplication and, if present, may enable us to say even more about the properties of our
set. A law is said to be commutative if
a‡b = b‡a . (1.1.5)
A further restriction that may be applied involves two laws say ‡ and ∧. These laws are said to be
distributive with respect to one another if
a‡(b∧c) = (a‡b)∧(a‡c) . (1.1.6)
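
Equations (1.1.5) and (1.1.6) can be spot-checked numerically. The sketch below tests them over a small sample of integers; the function names are hypothetical, and a test over a finite sample can refute a property but cannot prove it in general.

```python
from itertools import product

def is_commutative(elements, law):
    """True if a (law) b == b (law) a for every pair in the sample."""
    return all(law(a, b) == law(b, a) for a, b in product(elements, repeat=2))

def distributes_over(elements, outer, inner):
    """True if a outer (b inner c) == (a outer b) inner (a outer c) on the sample."""
    return all(outer(a, inner(b, c)) == inner(outer(a, b), outer(a, c))
               for a, b, c in product(elements, repeat=3))

def add(a, b):
    return a + b

def sub(a, b):
    return a - b

def mul(a, b):
    return a * b

sample = range(-3, 4)

print(is_commutative(sample, add))         # True
print(is_commutative(sample, sub))         # False
print(distributes_over(sample, mul, add))  # True:  a*(b+c) == a*b + a*c
print(distributes_over(sample, add, mul))  # False: a+(b*c) != (a+b)*(a+c)
```

The last two lines show that the distributive relation is not symmetric in the two laws: multiplication distributes over addition, but addition does not distribute over multiplication.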


Although the laws of addition and scalar multiplication satisfy all three restrictions, we will
encounter common laws in the next section that do not. Sets that form a group under addition and (once zero is excluded) under scalar multiplication, with both laws commutative and multiplication distributive over addition, are called fields. The notion of a field is very useful in science as most theoretical descriptions
of the physical world are made in terms of fields. One talks of gravitational, electric, and magnetic fields in
physics. Here one is describing scalars and vectors whose elements are real numbers and for which there are
laws of addition and multiplication which cause these quantities to form not just groups, but fields. Thus all
the abstract mathematical knowledge of groups and fields is available to the scientist to aid in understanding
physical fields.

## 1.2 Scalars, Vectors, and Matrices

In the last section we mentioned specific sets of elements called scalars and vectors without being
too specific about what they are. In this section we will define the elements of these sets and the various laws
that operate on them. In the sciences it is common to describe phenomena in terms of specific quantities
which may take on numerical values from time to time. For example, we may describe the atmosphere of the
planet at any point in terms of the temperature, pressure, humidity, ozone content or perhaps a pollution
index. Each of these items has a single value at any instant and location and we would call them scalars. The
common laws of arithmetic that operate on scalars are addition and multiplication. As long as one is a little
careful not to allow division by zero (often known as the cancellation law) such scalars form not only
groups, but also fields.

Although one can generally describe the condition of the atmosphere locally in terms of scalar
fields, the location itself requires more than a single scalar for its specification. Now we need two (three if
we include altitude) numbers, say the latitude and longitude, which locate that part of the atmosphere for
further description by scalar fields. A quantity that requires more than one number for its specification may
be called a vector. Indeed, some have defined a vector as an "ordered n-tuple of numbers". While many may
not find this too helpful, it is essentially a correct statement, which emphasizes the multi-component side of
the notion of a vector. The number of components that are required for the vector's specification is usually
called the dimensionality of the vector. We most commonly think of vectors in terms of spatial vectors, that
is, vectors that locate things in some coordinate system. However, as suggested in the previous section,
vectors may represent such things as an electric or magnetic field where the quantity not only has a
magnitude or scalar length associated with it at every point in space, but also has a direction. As long as such
quantities obey laws of addition and some sort of multiplication, they may indeed be said to form vector
fields. Indeed, there are various types of products that are associated with vectors. The most common of
these and the one used to establish the field nature of most physical vector fields is called the "scalar
product" or inner product, or sometimes simply the dot product from the manner in which it is usually
written. Here the result is a scalar and we can operationally define what we mean by such a product by

$$\vec{A} \cdot \vec{B} = c = \sum_i A_i B_i . \tag{1.2.1}$$

One might object that the result of the operation is a scalar, not a vector, but that would be to put too restrictive
an interpretation on what we mean by a vector. Specifically, any scalar can be viewed as a vector having only
one component (i.e. a 1-dimensional vector). Thus scalars become a subgroup of vectors and since the vector
scalar product degenerates to the ordinary scalar product for 1-dimensional vectors, they are actually a sub-
field of the more general notion of a vector field.
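
Equation (1.2.1) translates almost directly into code. This is a minimal sketch in plain Python; the function name `dot` and the sample vectors are illustrative assumptions.

```python
def dot(a, b):
    """Scalar (inner) product: the sum over i of A_i * B_i, equation (1.2.1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x * y for x, y in zip(a, b))

A = [1.0, 2.0, 3.0]
B = [4.0, -5.0, 6.0]

print(dot(A, B))          # 1*4 + 2*(-5) + 3*6 = 12.0
print(dot([3.0], [7.0]))  # 21.0
```

The second call treats two scalars as 1-dimensional vectors, for which the scalar product reduces to ordinary multiplication.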


It is possible to place additional constraints (laws) on a field without destroying the field nature of
the elements. We most certainly do this with vectors. Thus we can define an additional type of product
known as the "vector product" or simply cross product again from the way it is commonly written. Thus in
Cartesian coordinates the cross product can be written as
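
$$\vec{A} \times \vec{B} = \begin{vmatrix} \hat{i} & \hat{j} & \hat{k} \\ A_i & A_j & A_k \\ B_i & B_j & B_k \end{vmatrix} = \hat{i}\,(A_j B_k - A_k B_j) - \hat{j}\,(A_i B_k - A_k B_i) + \hat{k}\,(A_i B_j - A_j B_i) . \tag{1.2.2}$$
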
The result of this operation is a vector, but we shall see later that it will be useful to sharpen our definition of
vectors so that this result is a special kind of vector.
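
The component form of equation (1.2.2) can be sketched as follows. The function names and sample vectors are assumptions made for illustration; the checks at the end confirm two well-known properties, namely that the result is perpendicular to both factors and that interchanging the factors reverses its sign.

```python
def cross(a, b):
    """Cross product of two 3-component Cartesian vectors, equation (1.2.2)."""
    ai, aj, ak = a
    bi, bj, bk = b
    return [aj * bk - ak * bj,
            -(ai * bk - ak * bi),
            ai * bj - aj * bi]

def dot(a, b):
    """Scalar product, used here only to test perpendicularity."""
    return sum(x * y for x, y in zip(a, b))

A = [1, 2, 3]
B = [4, 5, 6]
C = cross(A, B)

print(C)                      # [-3, 6, -3]
print(dot(A, C), dot(B, C))   # 0 0
print(cross(B, A))            # [3, -6, 3]
```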

Finally, there is the "tensor product" or vector outer product that is defined as

$$\left. \begin{aligned} \vec{A}\vec{B} &= \mathbf{C} \\ C_{ij} &= A_i B_j \end{aligned} \right\} . \tag{1.2.3}$$

Here the result of applying the "law" is an ordered array of (n×m) numbers where n and m are the
dimensions of the vectors $\vec{A}$ and $\vec{B}$ respectively. Again, here the result of applying the law is not a vector in
any sense of the normal definition, but is a member of a larger class of objects we will call tensors. But
before discussing tensors in general, let us consider a special class of them known as matrices.
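
A short sketch of equation (1.2.3), with the array stored as a list of rows so that the first index of C selects the row. The function name `outer` and the sample vectors are illustrative assumptions.

```python
def outer(a, b):
    """Outer product: C[i][j] = A_i * B_j, equation (1.2.3)."""
    return [[x * y for y in b] for x in a]

A = [1, 2, 3]   # n = 3
B = [10, 20]    # m = 2
C = outer(A, B)

for row in C:
    print(row)
# [10, 20]
# [20, 40]
# [30, 60]

print(len(C), len(C[0]))  # 3 2
```

The two dimensions are kept separately, which is exactly the information that would be lost if the six numbers were stored as a single 6-component vector.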

The result of equation (1.2.3) while needing more than one component for its specification is clearly
not simply a vector with dimension (n×m). The values of n and m are separately specified and to specify
only the product would be to throw away information that was initially specified. Thus, in order to keep this
information, we can represent the result as an array of numbers having n rows and m columns. Such an array
can be called a matrix. For matrices, the products already defined have no simple interpretation. However,
there is an additional product known as a matrix product, which will allow us to at least define a matrix
group. Consider the product defined by

$$\left. \begin{aligned} \mathbf{A}\mathbf{B} &= \mathbf{C} \\ C_{ij} &= \sum_k A_{ik} B_{kj} \end{aligned} \right\} . \tag{1.2.4}$$

With this definition of a product, the unit matrix denoted by 1 will have elements δij specified for n = m = 2
by

$$\delta_{ij} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} . \tag{1.2.5}$$

The quantity δij is called the Kronecker delta and may be generalized to n-dimensions.
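
Equations (1.2.4) and (1.2.5) can be sketched in a few lines. Matrices are represented here as lists of rows, and the names `kronecker_delta`, `identity`, and `matmul` are hypothetical helpers chosen for this illustration.

```python
def kronecker_delta(i, j):
    """Kronecker delta: 1 when the indices agree, 0 otherwise."""
    return 1 if i == j else 0

def identity(n):
    """Unit matrix of order n, built element by element from the Kronecker delta."""
    return [[kronecker_delta(i, j) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Matrix product C[i][j] = sum over k of A[i][k] * B[k][j], equation (1.2.4)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of A must match rows of B")
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[1, 2], [3, 4]]

print(identity(2))                   # [[1, 0], [0, 1]]
print(matmul(A, identity(2)) == A)   # True: the unit matrix leaves A unchanged
print(matmul(A, [[0, 1], [1, 0]]))   # [[2, 1], [4, 3]]
```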

Thus the inverse elements of the group will have to satisfy the relation

$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{1} , \tag{1.2.6}$$


and we shall spend some time in the next chapter discussing how these members of the group may be
calculated. Since matrix addition can simply be defined as the scalar addition of the elements of the matrix,
and the 'unit' matrix under addition is simply a matrix with zero elements, it is tempting to think that the
group of matrices also form a field. However, the matrix product as defined by equation (1.2.4), while being
distributive with respect to addition, is not commutative. Thus we shall have to be content with matrices
forming a group under both addition and matrix multiplication but not a field.
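
A small numerical case makes both statements concrete. In this sketch the matrices A, B, and C and the helper names are arbitrary choices for illustration; integer elements keep the arithmetic exact.

```python
def matmul(a, b):
    """Matrix product as defined by equation (1.2.4)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matadd(a, b):
    """Matrix addition: scalar addition of corresponding elements."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

print(matmul(A, B))  # [[2, 1], [4, 3]]
print(matmul(B, A))  # [[3, 4], [1, 2]]

# Distributive with respect to addition: A(B + C) equals AB + AC.
print(matmul(A, matadd(B, C)) == matadd(matmul(A, B), matmul(A, C)))  # True
```

Multiplying by B on the right interchanges the columns of A, while multiplying on the left interchanges its rows, so the two products differ.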

There is much more that can be said about matrices as was the case with other subjects of this
chapter, but we will limit ourselves to a few properties of matrices which will be particularly useful later. For
example, the transpose of a matrix with elements Aij is defined as

$$\left(\mathbf{A}^{T}\right)_{ij} = A_{ji} . \tag{1.2.7}$$

We shall see that there is an important class of matrices (i.e. the orthonormal matrices) whose inverse is their
transpose. This makes the calculation of the inverse trivial.
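
As an illustration, a rotation of the plane is a familiar orthonormal matrix. The sketch below assumes an arbitrary rotation angle and checks, to within floating-point tolerance, that the product of the matrix with its transpose is the unit matrix; the helper names are hypothetical.

```python
import math

def transpose(a):
    """Transpose: element (i, j) of the result is a[j][i], equation (1.2.7)."""
    return [list(column) for column in zip(*a)]

def matmul(a, b):
    """Matrix product as defined by equation (1.2.4)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

theta = 0.3  # arbitrary rotation angle in radians
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

product = matmul(R, transpose(R))

# Compare with the unit matrix, allowing for round-off in the last digits.
is_unit = all(abs(product[i][j] - (1.0 if i == j else 0.0)) < 1e-12
              for i in range(2) for j in range(2))
print(is_unit)  # True
```

No elimination or factorization was needed to obtain the inverse; the transpose costs only a rearrangement of the elements.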

Another important scalar quantity is the trace of a matrix defined as

$$\mathrm{Tr}\,\mathbf{A} = \sum_i A_{ii} . \tag{1.2.8}$$

A matrix is said to be symmetric if Aij = Aji. If, in addition, the elements are themselves complex numbers,
then should the elements of the transpose be the complex conjugates of the original matrix, the matrix is said
to be Hermitian or self-adjoint. The conjugate transpose of a matrix A is usually denoted by A†. If the
Hermitian conjugate of A is also A⁻¹, then the matrix is said to be unitary. Should the matrix A commute
with its Hermitian conjugate so that

$$\mathbf{A}\mathbf{A}^{\dagger} = \mathbf{A}^{\dagger}\mathbf{A} , \tag{1.2.9}$$

then the matrix is said to be normal. For matrices with only real elements, Hermitian is the same as
symmetric, unitary means the same as orthonormal and both classes would be considered to be normal.
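
These three classes can be told apart with a few lines of code. The following sketch uses Python's built-in complex numbers; the sample matrices H, U, and N and all helper names are assumptions chosen for illustration, and comparisons are made within a small tolerance to allow for round-off.

```python
import math

def dagger(a):
    """Conjugate transpose: element (i, j) is the complex conjugate of a[j][i]."""
    return [[complex(a[j][i]).conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def matmul(a, b):
    """Matrix product as defined by equation (1.2.4)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def close(a, b, tol=1e-12):
    """Element-by-element comparison within a tolerance."""
    return all(abs(x - y) < tol
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def is_hermitian(a):
    return close(a, dagger(a))

def is_unitary(a):
    return close(matmul(a, dagger(a)), identity(len(a)))

def is_normal(a):
    return close(matmul(a, dagger(a)), matmul(dagger(a), a))

s = 1 / math.sqrt(2)
H = [[2, 1 - 1j], [1 + 1j, 3]]    # equal to its own conjugate transpose
U = [[s, s * 1j], [s * 1j, s]]    # conjugate transpose is the inverse
N = [[1, 1], [0, 1]]              # commutes with neither property

for name, m in (("H", H), ("U", U), ("N", N)):
    print(name, is_hermitian(m), is_unitary(m), is_normal(m))
# H True False True
# U False True True
# N False False False
```

The first two matrices illustrate that Hermitian and unitary matrices are both normal, while the third shows that an ordinary real matrix need not be.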

Finally, a most important characteristic of a matrix is its determinant. It may be calculated by
expansion of the matrix by "minors" so that

$$\det \mathbf{A} = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31}) . \tag{1.2.10}$$

Fortunately there are more straightforward ways of calculating the determinant which we will consider in the
next chapter. There are several theorems concerning determinants that are useful for the manipulation of
determinants and which we will give without proof.

1. If each element in a row or column of a matrix is zero, the determinant of the
matrix is zero.

2. If each element in a row or column of a matrix is multiplied by a scalar q, the
determinant is multiplied by q.

3. If each element of a row or column is a sum of two terms, the determinant equals
the sum of the two corresponding determinants.


4. If two rows or two columns are proportional, the determinant is zero. This clearly
follows from theorems 1, 2 and 3.

5. If two rows or two columns are interchanged, the determinant changes sign.

6. If rows and columns of a matrix are interchanged, the determinant of the matrix is
unchanged.

7. The value of a determinant of a matrix is unchanged if a multiple of one row or
column is added to another.

8. The determinant of the product of two matrices is the product of the determinants of
the two matrices.
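
Several of these theorems are easy to verify numerically. The sketch below computes the determinant by recursive expansion by minors along the first row, as in equation (1.2.10), and then checks theorems 2, 4, 5, 6, 7, and 8 on two arbitrary integer matrices. The function names and sample matrices are assumptions for illustration; expansion by minors grows factorially in cost and is practical only for small matrices.

```python
def det(a):
    """Determinant by expansion by minors along the first row."""
    n = len(a)
    if n == 1:
        return a[0][0]
    total = 0
    for j in range(n):
        # Minor: delete the first row and column j.
        minor = [row[:j] + row[j + 1:] for row in a[1:]]
        total += (-1) ** j * a[0][j] * det(minor)
    return total

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(column) for column in zip(*a)]

A = [[2, 0, 1], [1, 3, 2], [1, 1, 4]]
B = [[1, 2, 0], [0, 1, 1], [2, 0, 1]]
q = 7

scaled = [[q * x for x in A[0]], A[1], A[2]]                      # theorem 2
proportional = [A[0], [2 * x for x in A[0]], A[2]]                # theorem 4
swapped = [A[1], A[0], A[2]]                                      # theorem 5
added = [A[0], A[1], [z + 3 * x for x, z in zip(A[0], A[2])]]     # theorem 7

print(det(A), det(B))                       # 18 5
print(det(scaled) == q * det(A))            # True
print(det(proportional) == 0)               # True
print(det(swapped) == -det(A))              # True
print(det(transpose(A)) == det(A))          # True (theorem 6)
print(det(added) == det(A))                 # True
print(det(matmul(A, B)) == det(A) * det(B)) # True (theorem 8)
```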

One of the important aspects of the determinant is that it is a single parameter that can be used to
characterize the matrix. Any such single parameter (e.g. the sum of the absolute value of the elements) can be
so used and is often called a matrix norm. We shall see that various matrix norms are useful in determining
which numerical procedures will be useful in operating on the matrix. Let us now consider a broader class of
objects that include scalars, vectors, and to some extent matrices.

## 1.3 Coordinate Systems and Coordinate Transformations

There is an area of mathematics known as topology, which deals with the description of spaces. To
most students the notion of a space is intuitively obvious and is restricted to the three-dimensional Euclidean
space of everyday experience. A little reflection might persuade that student to include the flat plane as an
allowed space. However, a little further generalization would suggest that any set of several
independent variables could be used to form a space for the description of some phenomenon. In the
area of topology the notion of a space is far more general than that and many of the more exotic spaces have
no known counterpart in the physical world.

We shall restrict ourselves to spaces of independent variables, which generally have some physical
interpretation. These variables can be said to constitute a coordinate frame, which describes the space; such spaces are
fairly high up in the hierarchy of spaces catalogued by topology. To understand what is meant by a
coordinate frame, imagine a set of rigid rods or vectors all connected at a point. We shall call such a
collection of rods a reference frame. If every point in space can be projected onto the rods so that a unique
set of rod-points represent the space point, the vectors are said to span the space.

If the vectors that define the space are locally perpendicular, they are said to form an orthogonal
coordinate frame. If the vectors defining the reference frame are also unit vectors, say $\hat{e}_i$, then the condition
for orthogonality can be written as
$$ \hat{e}_i \cdot \hat{e}_j = \delta_{ij} , \qquad (1.3.1) $$
where $\delta_{ij}$ is the Kronecker delta. Such a set of vectors will span a space of dimensionality equal to the
number of vectors $\hat{e}_j$. Such a space need not be Euclidean, but if it is then the coordinate frame is said to be
a Cartesian coordinate frame. The conventional xyz-coordinate frame is Cartesian, but one could imagine
such a coordinate system drawn on a rubber sheet, and then distorted so that locally the orthogonality
conditions are still met, but the space would no longer be Euclidean or Cartesian.

Of the orthogonal coordinate systems, there are several that are particularly useful for the description
of the physical world. Certainly the most common is the rectangular or Cartesian coordinate frame where
coordinates are often denoted by x, y, z or $x_1, x_2, x_3$. Other common three-dimensional frames include
spherical polar coordinates (r, θ, ϕ) and cylindrical coordinates (ρ, ϑ, z). Often the most important part of
solving a numerical problem is choosing the proper coordinate system to describe the problem. For example,
there are a total of thirteen orthogonal coordinate frames in which Laplace's equation is separable (see Morse
and Feshbach¹).

In order for coordinate frames to be really useful it is necessary to know how to get from one to
another. That is, if we have a problem described in one coordinate frame, how do we express that same
problem in another coordinate frame? For quantities that describe the physical world, we wish their meaning
to be independent of the coordinate frame that we happen to choose. Therefore we should expect the process
to have little to do with the problem, but rather involve relationships between the coordinate frames
themselves. These relationships are called coordinate transformations. While there are many such
transformations in mathematics, for the purposes of this summary we shall concern ourselves with linear
transformations. Such coordinate transformations relate the coordinates in one frame to those in a second
frame by means of a system of linear algebraic equations. Thus if a vector $\vec{x}$ in one coordinate system has
components $x_j$, in a primed-coordinate system a vector $\vec{x}'$ to the same point will have components $x'_i$ given by
$$ x'_i = \sum_j A_{ij} x_j + B_i . \qquad (1.3.2) $$

In vector notation we could write this as
$$ \vec{x}' = A\vec{x} + \vec{B} . \qquad (1.3.3) $$
This defines the general class of linear transformation where A is some matrix and $\vec{B}$ is a vector. This
general linear form may be divided into two constituents, the matrix A and the vector $\vec{B}$. It is clear that the
vector $\vec{B}$ may be interpreted as a shift in the origin of the coordinate system, while the elements $A_{ij}$ are the
cosines of the angles between the axes $X'_i$ and $X_j$, and are called the direction cosines (see Figure 1.1).
Indeed, the vector $\vec{B}$ is nothing more than a vector from the origin of the un-primed coordinate frame to the
origin of the primed coordinate frame. Now if we consider two points that are fixed in space and a vector
connecting them, then the length and orientation of that vector will be independent of the origin of the
coordinate frame in which the measurements are made. That places an additional constraint on the types of
linear transformations that we may consider. For instance, transformations that scaled each coordinate by a
constant amount, while linear, would change the length of the vector as measured in the two coordinate
systems. Since we are only using the coordinate system as a convenient way to describe the vector, the
coordinate system can play no role in controlling the length of the vector. Thus we shall restrict our
investigations of linear transformations to those that transform orthogonal coordinate systems while
preserving the length of the vector.

Thus the matrix A must satisfy the following condition
$$ \vec{x}' \cdot \vec{x}' = (A\vec{x}) \cdot (A\vec{x}) = \vec{x} \cdot \vec{x} , \qquad (1.3.4) $$
which in component form becomes
$$ \sum_i \Big( \sum_j A_{ij} x_j \Big) \Big( \sum_k A_{ik} x_k \Big) = \sum_j \sum_k \Big( \sum_i A_{ij} A_{ik} \Big) x_j x_k = \sum_i x_i^2 . \qquad (1.3.5) $$

This must be true for all vectors in the coordinate system so that
$$ \sum_i A_{ij} A_{ik} = \delta_{jk} = \sum_i A^{-1}_{ji} A_{ik} . \qquad (1.3.6) $$

Now remember that the Kronecker delta $\delta_{ij}$ is the unit matrix and any element of a group that multiplies
another and produces that group's unit element is defined as the inverse of that element. Therefore
$$ A_{ji} = [A_{ij}]^{-1} . \qquad (1.3.7) $$
Interchanging the rows with the columns of a matrix produces a new matrix which we have called the
transpose of the matrix. Thus orthogonal transformations that preserve the length of vectors have inverses
that are simply the transpose of the original matrix so that
$$ A^{-1} = A^{T} . \qquad (1.3.8) $$
This means that given the transformation A in the linear system of equations (1.3.3), we may invert the
transformation, or solve the linear equations, by multiplying those equations by the transpose of the original
matrix or
$$ \vec{x} = A^{T}\vec{x}' - A^{T}\vec{B} . \qquad (1.3.9) $$
Such transformations are called orthogonal unitary transformations, or orthonormal transformations, and the
result given in equation (1.3.9) greatly simplifies the process of carrying out a transformation from one
coordinate system to another and back again.
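
The round trip described by equations (1.3.3) and (1.3.9) can be sketched in a few lines. This is only an illustration: the orthonormal matrix `A` is assumed to be a rotation through a hypothetical angle of 30 degrees, and the shift `B` and test vector `x` are arbitrary choices, none of which appear in the text.

```python
import math


def matvec(a, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def transpose(a):
    """Interchange the rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


phi = math.radians(30.0)                 # hypothetical rotation angle
c, s = math.cos(phi), math.sin(phi)
A = [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]   # an orthonormal matrix
B = [1.0, -2.0, 0.5]                     # hypothetical shift of the origin
x = [3.0, 4.0, 12.0]                     # test vector of length 13

# Forward transformation, eq. (1.3.3): x' = A x + B
x_prime = [ax + b for ax, b in zip(matvec(A, x), B)]

# Inverse transformation, eq. (1.3.9): x = A^T x' - A^T B
At = transpose(A)
x_back = [p - q for p, q in zip(matvec(At, x_prime), matvec(At, B))]

# The matrix part alone leaves the length of the vector unchanged, eq. (1.3.4).
length_before = math.sqrt(sum(v * v for v in x))
length_after = math.sqrt(sum(v * v for v in matvec(A, x)))

print(x_back)
print(length_before, length_after)
```

No matrix inversion routine is needed: the transpose does the work, which is the simplification the text refers to.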

We can further divide orthonormal transformations into two categories. These are most easily
described by visualizing the relative orientation between the two coordinate systems. Consider a
transformation that carries one coordinate into the negative of its counterpart in the new coordinate system
while leaving the others unchanged. If the changed coordinate is, say, the x-coordinate, the transformation
matrix would be
$$ A = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} , \qquad (1.3.10) $$

which is equivalent to viewing the first coordinate system in a mirror. Such transformations are known as
reflection transformations and will take a right handed coordinate system into a left handed coordinate
system.

The length of any vector will remain unchanged. The x-component of such a vector will simply be
replaced by its negative in the new coordinate system. However, this will not be true of "vectors" that result
from the vector cross product. The values of the components of such a vector will remain unchanged
implying that a reflection transformation of such a vector will result in the orientation of that vector being
changed. If you will, this is the origin of the "right hand rule" for vector cross products. A left hand rule
results in a vector pointing in the opposite direction. Thus such vectors are not invariant to reflection
transformations because their orientation changes and this is the reason for putting them in a separate class,
namely the axial (pseudo) vectors. It is worth noting that an orthonormal reflection transformation will have
a determinant of -1. The unitary magnitude of the determinant is a result of the magnitude of the vector being
unchanged by the transformation, while the sign shows that some combination of coordinates has undergone
a reflection.
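
The different behavior of cross products under reflection can be seen in a small numerical sketch. The vectors `a` and `b` below are arbitrary illustrative choices, and `R` is the reflection matrix of equation (1.3.10). The sketch compares what the ordinary (polar) vector rule would give for the cross product with what is actually obtained by forming the cross product of the reflected vectors.

```python
def cross(a, b):
    """Vector cross product of two 3-component vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(m, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


R = [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]   # reflection of the x-coordinate, eq. (1.3.10)
a = [1, 2, 3]                            # arbitrary polar vectors
b = [4, 5, 6]

c = cross(a, b)
polar_rule = matvec(R, c)                     # how an ordinary vector would transform
actual = cross(matvec(R, a), matvec(R, b))    # cross product formed in the reflected frame

print(c, polar_rule, actual)
print(actual == [-v for v in polar_rule])     # differs by the determinant of R, which is -1
```

The two results differ by an overall sign, equal to the determinant of the reflection, which is the sense in which the cross product is an axial (pseudo) vector.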

Figure 1.1 shows two coordinate frames related by the transformation angles $\phi_{ij}$. Four
coordinates are necessary if the frames are not orthogonal.

As one might expect, the elements of the second class of orthonormal transformations have
determinants of +1. These represent transformations that can be viewed as a rotation of the coordinate
system about some axis. Consider a transformation between the two coordinate systems displayed in Figure
1.1. The components of any vector $\vec{C}$ in the primed coordinate system will be given by
$$ \begin{pmatrix} C_{x'} \\ C_{y'} \\ C_{z'} \end{pmatrix} = \begin{pmatrix} \cos\phi_{11} & \cos\phi_{12} & 0 \\ \cos\phi_{21} & \cos\phi_{22} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} C_x \\ C_y \\ C_z \end{pmatrix} . \qquad (1.3.11) $$
If we require the transformation to be orthonormal, then the direction cosines of the transformation will
not be linearly independent since the angles between the axes must be π/2 in both coordinate systems.
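Thus the angles must be related by
$$ \left. \begin{aligned} \phi_{11} &= \phi_{22} = \phi \\ \phi_{12} &= \phi_{11} + \pi/2 = \phi + \pi/2 \\ (2\pi - \phi_{21}) &= \pi/2 - \phi_{11} = \pi/2 - \phi \end{aligned} \right\} . \qquad (1.3.12) $$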

Using the addition identities for trigonometric functions, equation (1.3.11) can be given in terms of the single
angle ϕ by
$$ \begin{pmatrix} C_{x'} \\ C_{y'} \\ C_{z'} \end{pmatrix} = \begin{pmatrix} \cos\phi & \sin\phi & 0 \\ -\sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} C_x \\ C_y \\ C_z \end{pmatrix} . \qquad (1.3.13) $$
This transformation can be viewed as a simple rotation of the coordinate system about the Z-axis through an
angle ϕ. Thus,
$$ \mathrm{Det} \begin{vmatrix} \cos\phi & \sin\phi & 0 \\ -\sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{vmatrix} = \cos^2\phi + \sin^2\phi = +1 . \qquad (1.3.14) $$

In general, the rotation of any Cartesian coordinate system about one of its principal axes can be
written in terms of a matrix whose elements can be expressed in terms of the rotation angle. Since these
transformations are about one of the coordinate axes, the components along that axis remain unchanged. The
rotation matrices for each of the three axes are
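$$ \left. \begin{aligned}
P_x(\phi) &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & \sin\phi \\ 0 & -\sin\phi & \cos\phi \end{pmatrix} \\
P_y(\phi) &= \begin{pmatrix} \cos\phi & 0 & -\sin\phi \\ 0 & 1 & 0 \\ \sin\phi & 0 & \cos\phi \end{pmatrix} \\
P_z(\phi) &= \begin{pmatrix} \cos\phi & \sin\phi & 0 \\ -\sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{pmatrix}
\end{aligned} \right\} . \qquad (1.3.15) $$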
It is relatively easy to remember the form of these matrices, for the row and column of the matrix
corresponding to the rotation axis always contain the elements of the unit matrix since that component is not
affected by the transformation. The other two diagonal elements always contain the cosine of the rotation angle while
the remaining off-diagonal elements always contain the sine of the angle modulo a sign. For rotations about
the x- or z-axes, the sign of the upper right off-diagonal element is positive and the other negative. The
situation is just reversed for rotations about the y-axis. So important are these rotation matrices that it is
worth remembering their form so that they need not be re-derived every time they are needed.
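
As a quick check on the forms in equation (1.3.15), the three matrices can be coded directly and tested for the two properties derived above: the transpose is the inverse, and the determinant is +1. This is a sketch only; the function names `P_x`, `P_y`, `P_z` mirror the notation of the text, while the helpers and the test angle of 0.7 radians are assumptions made for the illustration.

```python
import math


def P_x(phi):
    c, s = math.cos(phi), math.sin(phi)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]


def P_y(phi):
    c, s = math.cos(phi), math.sin(phi)
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]


def P_z(phi):
    c, s = math.cos(phi), math.sin(phi)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [list(col) for col in zip(*a)]


def det3(m):
    """Determinant of a 3x3 matrix by expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def is_identity(m, tol=1e-12):
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(3) for j in range(3))


phi = 0.7   # arbitrary test angle in radians
for P in (P_x, P_y, P_z):
    m = P(phi)
    print(P.__name__, is_identity(matmul(m, transpose(m))), abs(det3(m) - 1.0) < 1e-12)
```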

One can show that it is possible to get from any given orthogonal coordinate system to another
through a series of three successive coordinate rotations. Thus a general orthonormal transformation can
always be written as the product of three coordinate rotations about the orthogonal axes of the coordinate
systems. It is important to remember that the matrix product is not commutative so that the order of the
rotations is important.
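
The dependence on order is easy to demonstrate. The sketch below assumes rotations of 90 degrees, for which the matrices of equation (1.3.15) have exact integer elements, and compares the two possible orderings of a rotation about the x-axis and a rotation about the z-axis.

```python
def matmul(a, b):
    """Ordinary matrix product of two 3x3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Rotation matrices of eq. (1.3.15) evaluated at 90 degrees (cos = 0, sin = 1).
Px90 = [[1, 0, 0], [0, 0, 1], [0, -1, 0]]
Pz90 = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]

x_then_z = matmul(Pz90, Px90)   # rotate about x first, then about z
z_then_x = matmul(Px90, Pz90)   # rotate about z first, then about x

print(x_then_z)
print(z_then_x)
print(x_then_z == z_then_x)
```

The matrix applied first stands on the right of the product, since it is the one that acts directly on the vector.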

## 1.4 Tensors and Transformations

Many students find the notion of tensors to be intimidating and therefore avoid them as much as
possible. After all, Einstein was once quoted as saying that there were not more than ten people in the world
who would understand what he had done when he published the General Theory of Relativity. Since tensors are
the foundation of general relativity, that must mean that they are so esoteric that only ten people could
manage them. Wrong! This is a beautiful example of misinterpretation of a quote taken out of context. What
Einstein meant was that the notation he used to express the General Theory of Relativity was sufficiently
obscure that there were unlikely to be more than ten people who were familiar with it and could therefore
understand what he had done. So unfortunately, tensors have generally been represented as being far more
complex than they really are. Thus, while readers of this book may not have encountered them before, it is
high time they did. Perhaps they will be somewhat less intimidated the next time, for if they have any
ambition of really understanding science, they will have to come to an understanding of them sooner or later.

In general a tensor has $N^n$ components or elements. N is known as the dimensionality of the tensor
by analogy with vectors, while n is called the rank of the tensor. Thus scalars are tensors of rank zero and
vectors of any dimension are rank one. So scalars and vectors are subsets of tensors. We can define the law
of addition in the usual way by the addition of the tensor elements. Thus the null tensor (i.e. one whose
elements are all zero) forms the unit under addition and arithmetic subtraction is the inverse operation.
Clearly tensors form a commutative group under addition. Furthermore, the scalar or dot product can be
generalized for tensors of rank m and n so that the result is a tensor of rank $|m - n|$. In a similar manner the outer product
can be defined so that the result is a tensor of rank $m + n$. It is clear that all of these operations are closed;
that is, the results remain tensors. However, while these products are in general distributive, they are not
commutative and thus tensors will not form a field unless some additional restrictions are made.

One obvious way of representing tensors of rank 2 is as N×N square matrices. Thus, the scalar
product of a tensor of rank 2 with a vector would be written as
$$ \left. \begin{aligned} \mathbf{A} \cdot \vec{B} &= \vec{C} \\ C_i &= \sum_j A_{ij} B_j \end{aligned} \right\} , \qquad (1.4.1) $$

while the tensor outer product of the same tensor and vector could be written as
$$ \left. \begin{aligned} \mathbf{A}\vec{B} &= \mathbf{C} \\ C_{ijk} &= A_{ij} B_k \end{aligned} \right\} . \qquad (1.4.2) $$
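
The two products of equations (1.4.1) and (1.4.2) can be written out with nested lists. The sketch below assumes an arbitrary rank 2 tensor `A` and vector `B` in three dimensions (illustrative values only) and confirms that the scalar product has rank one while the outer product has rank three, with $3^3 = 27$ elements.

```python
N = 3
A = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]   # arbitrary rank 2 tensor
B = [2, -1, 1]                          # arbitrary vector (rank 1)

# Scalar (dot) product, eq. (1.4.1): C_i = sum_j A_ij B_j  -> rank 2 - 1 = 1
C_inner = [sum(A[i][j] * B[j] for j in range(N)) for i in range(N)]

# Outer product, eq. (1.4.2): C_ijk = A_ij B_k  -> rank 2 + 1 = 3
C_outer = [[[A[i][j] * B[k] for k in range(N)] for j in range(N)]
           for i in range(N)]

element_count = sum(len(row) for plane in C_outer for row in plane)

print(C_inner)
print(element_count)
```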

It is clear from the definition and specifically from equation (1.4.2) that tensors may frequently have
a rank of more than two. However, it becomes more difficult to display all the elements in a simple
geometrical fashion so they are generally just listed or described. A particularly important tensor of rank
three is known as the Levi-Civita Tensor (or more correctly the Levi-Civita Tensor Density). It plays a role that is
somewhat complementary to that of the Kronecker delta in that when any two indices are equal the tensor
element is zero. When the indices are all different the tensor element is +1 or -1 depending on whether the
index sequence can be obtained as an even or odd permutation from the sequence 1, 2, 3 respectively. If we
try to represent the tensor $\varepsilon_{ijk}$ as a succession of 3×3 matrices we would get
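$$ \left. \begin{aligned}
\varepsilon_{1jk} &= \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & +1 \\ 0 & -1 & 0 \end{pmatrix} \\
\varepsilon_{2jk} &= \begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ +1 & 0 & 0 \end{pmatrix} \\
\varepsilon_{3jk} &= \begin{pmatrix} 0 & +1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\end{aligned} \right\} . \qquad (1.4.3) $$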
This somewhat awkward looking third rank tensor allows us to write the equally awkward vector cross
product in summation notation as
$$ \vec{A} \times \vec{B} = \boldsymbol{\varepsilon} : (\vec{A}\vec{B}) = \sum_j \sum_k \varepsilon_{ijk} A_j B_k = C_i . \qquad (1.4.4) $$

Here the symbol : denotes the double dot product which is explicitly specified by the double sum of the right
hand term. The quantity $\varepsilon_{ijk}$ is sometimes called the permutation symbol as it changes sign with every
interchange of two of its indices. This, and the identity
$$ \sum_i \varepsilon_{ijk}\varepsilon_{ipq} = \delta_{jp}\delta_{kq} - \delta_{jq}\delta_{kp} , \qquad (1.4.5) $$
makes the evaluation of some complicated vector identities much simpler (see exercise 13).
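
Both the cross product of equation (1.4.4) and the identity (1.4.5) can be verified by brute force over all index values. The following sketch assumes indices that run over 0, 1, 2 rather than 1, 2, 3 (a programming convenience, not the convention of the text), and the closed form used in the hypothetical helper `levi_civita` is one of several equivalent ways of generating the symbol.

```python
def levi_civita(i, j, k):
    """+1 for even permutations of (0, 1, 2), -1 for odd ones, 0 if any index repeats."""
    return (i - j) * (j - k) * (k - i) // 2


def delta(i, j):
    """Kronecker delta."""
    return 1 if i == j else 0


def cross(a, b):
    """Cross product written as the double sum of eq. (1.4.4)."""
    return [sum(levi_civita(i, j, k) * a[j] * b[k]
                for j in range(3) for k in range(3))
            for i in range(3)]


idx = range(3)

# Check the identity (1.4.5) for every combination of the free indices j, k, p, q.
identity_holds = all(
    sum(levi_civita(i, j, k) * levi_civita(i, p, q) for i in idx)
    == delta(j, p) * delta(k, q) - delta(j, q) * delta(k, p)
    for j in idx for k in idx for p in idx for q in idx
)

print(cross([1, 0, 0], [0, 1, 0]))
print(cross([1, 2, 3], [4, 5, 6]))
print(identity_holds)
```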

In section 1.3 we added a condition to what we meant by a vector, namely we required that the
length of a vector be invariant to a coordinate transformation. Here we see the way in which additional
constraints of what we mean by vectors can be specified by the way in which they transform. We further
limited what we meant by a vector by noting that some vectors behave strangely under a reflection
transformation and calling these pseudo-vectors. Since the Levi-Civita tensor generates the vector cross
product from the elements of ordinary (polar) vectors, it must share this strange transformation property.
Tensors that share this transformation property are, in general, known as tensor densities or pseudo-tensors.
Therefore we should call $\varepsilon_{ijk}$ defined in equation (1.4.3) the Levi-Civita tensor density. Indeed, it is the
invariance of tensors, vectors, and scalars to orthonormal transformations that is most correctly used to
define the elements of the group called tensors.

Figure 1.2 shows two neighboring points P and Q in two adjacent coordinate systems X
and X'. The differential distance between the two is $d\vec{x}$. The vectorial distance to the two
points is $\vec{X}(P)$ or $\vec{X}'(P)$ and $\vec{X}(Q)$ or $\vec{X}'(Q)$ respectively.

Since vectors are just a special case of the broader class of objects called tensors, we should expect these
transformation constraints to extend to the general class. Indeed the only fully appropriate way to define
tensors is to define the way in which they transform from one coordinate system to another. To further refine
the notion of tensor transformation, we will look more closely at the way vectors transform. We have written
a general linear transformation for vectors in equation (1.3.2). However, except for rotational and reflection
transformations, we have said little about the nature of the transformation matrix A. So let us consider how
we would express a coordinate transformation from some point P in a space to a nearby neighboring point Q.
Each point can be represented in any coordinate system we choose. Therefore, let us consider two coordinate
systems having a common origin where the coordinates are denoted by $x_i$ and $x'_i$ respectively.

Since P and Q are near each other, we can relate the coordinates of Q to those of P in either
coordinate system by
$$ \left. \begin{aligned} x_i(Q) &= x_i(P) + dx_i \\ x'_i(Q) &= x'_i(P) + dx'_i \end{aligned} \right\} . \qquad (1.4.6) $$

Now the coordinates of the vector from P to Q will be $dx_i$ and $dx'_i$, in the un-primed and primed coordinate
systems respectively. By the chain rule the two coordinates will be related by
$$ dx'_i = \sum_j \frac{\partial x'_i}{\partial x_j} dx_j . \qquad (1.4.7) $$
Note that equation (1.4.7) does not involve the specific location of point Q but rather is a general expression
of the local relationship between the two coordinate frames. Since equation (1.4.7) basically describes how
the coordinates of P or Q will change from one coordinate system to another, we can identify the elements
$A_{ij}$ from equation (1.3.2) with the partial derivatives in equation (1.4.7). Thus we could expect any vector
$\vec{x}$ to transform according to
$$ x'_i = \sum_j \frac{\partial x'_i}{\partial x_j} x_j . \qquad (1.4.8) $$
Vectors that transform in this fashion are called contravariant vectors. In order to distinguish them from
covariant vectors, which we shall shortly discuss, we shall denote the components of the vector with
superscripts instead of subscripts. Thus the correct form for the transformation of a contravariant vector is
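$$ x'^{\,i} = \sum_j \frac{\partial x'^{\,i}}{\partial x^j} x^j . \qquad (1.4.9) $$
We can generalize this transformation law to contravariant tensors of rank two by
$$ T'^{\,ij} = \sum_k \sum_l \frac{\partial x'^{\,i}}{\partial x^k} \frac{\partial x'^{\,j}}{\partial x^l} T^{kl} . \qquad (1.4.10) $$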
Higher rank contravariant tensors transform as one would expect with additional coordinate changes. One
might think that the use of superscripts to represent contravariant indices would be confused with exponents,
but such is generally not the case and the distinction between this sort of vector transformation and
covariance is sufficiently important in physical science to make the accommodation. The sorts of objects that
transform in a contravariant manner are those associated with, but not limited to, geometrical objects. For
example, the infinitesimal displacements of coordinates that make up the tangent vector to a curve show
that it is a contravariant vector. While we have used vectors to develop the notion of contravariance, it is
clear that the concept can be extended to tensors of any rank including rank zero. The transformation rule for
such a tensor would simply be
$$ T' = T . \qquad (1.4.11) $$
In other words scalars will be invariant to contravariant coordinate transformations.

Now instead of considering vector representations of geometrical objects imbedded in the space and
their transformations, let us consider a scalar function of the coordinates themselves. Let such a function be
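$\Phi(x_i)$. Now consider components of the gradient of Φ in the $x'_i$-coordinate frame. Again by the chain rule
$$ \frac{\partial \Phi}{\partial x'_i} = \sum_j \frac{\partial x_j}{\partial x'_i} \frac{\partial \Phi}{\partial x_j} . \qquad (1.4.12) $$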

If we call $\partial\Phi/\partial x_i$ a vector with components $V_i$, then the transformation law given by equation (1.4.12)
appears very like equation (1.4.8), but with the partial derivatives inverted. Thus we would identify the
elements $A_{ij}$ of the linear vector transformation represented by equation (1.3.2) as
$$ A_i^{\ j} = \partial x^j / \partial x'^{\,i} , \qquad (1.4.13) $$
and the vector transformation would have the form
$$ V'_i = \sum_j A_i^{\ j} V_j . \qquad (1.4.14) $$

Vectors that transform in this manner are called covariant vectors. In order to distinguish them from
contravariant vectors, the component indices are written as subscripts. Again, it is not difficult to see how the
concept of covariance would be extended to tensors of higher rank and specifically for a second rank
covariant tensor we would have
$$ T'_{ij} = \sum_k \sum_l \frac{\partial x^l}{\partial x'^{\,i}} \frac{\partial x^k}{\partial x'^{\,j}} T_{lk} . \qquad (1.4.15) $$
The use of the scalar invariant Φ to define what is meant by a covariant vector is a clue as to the types of
vectors that behave as covariant vectors. Specifically the gradient of physical scalar quantities such as
temperature and pressure would behave as a covariant vector while coordinate vectors themselves are
contravariant. Basically equations (1.4.15) and (1.4.10) define what is meant by a covariant or contravariant
tensor of second rank. It is possible to have a mixed tensor where one index represents covariant
transformation while the other is contravariant so that
$$ T'^{\ j}_{i} = \sum_k \sum_l \frac{\partial x^l}{\partial x'^{\,i}} \frac{\partial x'^{\,j}}{\partial x^k} T_l^{\ k} . \qquad (1.4.16) $$
Indeed the Kronecker delta can be regarded as a tensor as it is a two index symbol and in particular it is a
mixed tensor of rank two and when covariance and contravariance are important should be written as $\delta^{j}_{i}$.

Remember that both contravariant and covariant transformations are locally linear transformations
of the form given by equation (1.3.2). That is, they both preserve the length of vectors and leave scalars
unchanged. The introduction of the terms contravariance and covariance simply generates two subgroups of
what we earlier called tensors and defines the members of those groups by means of their detailed
transformation properties. One can generally tell the difference between the two types of transformations by
noting how the components depend on the coordinates. If the components denote 'distances' or depend
directly on the coordinates, then they will transform as contravariant tensor components. However, should
the components represent quantities that change with the coordinates such as gradients, divergences, and
curls, then dimensionally the components will depend inversely on the coordinates and they will transform
covariantly. The use of subscripts and superscripts to keep these transformation properties straight is
particularly useful in the development of tensor calculus as it allows for the development of rules for the
manipulation of tensors in accord with their specific transformation characteristics. While coordinate
systems have been used to define the tensor characteristics, those characteristics are properties of the tensors
themselves and do not depend on any specific coordinate frame. This is of considerable importance when
developing theories of the physical world as anything that is fundamental about the universe should be
independent of man-made coordinate frames. This is not to say that the choice of coordinate frames is
unimportant when actually solving a problem. Quite the reverse is true. Indeed, as the properties of the
physical world represented by tensors are independent of coordinates and their explicit representation and
transformation properties from one coordinate system to another are well defined, they may be quite useful
in reformulating numerical problems in different coordinate systems.
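
The difference between the two transformation laws shows up most clearly when the transformation is not orthonormal. The sketch below assumes a hypothetical linear change of coordinates $x' = Mx$ in two dimensions, so that $\partial x'_i/\partial x_j = M_{ij}$ and $\partial x_j/\partial x'_i = (M^{-1})_{ji}$; the matrix, the displacement, and the gradient values are arbitrary illustrative numbers. A displacement is transformed by the contravariant law (1.4.7), a gradient by the covariant law (1.4.12), and their contraction (the change in the scalar Φ along the displacement) is found to be the same in both frames. For an orthonormal transformation the inverse is the transpose, and the two laws coincide.

```python
# Hypothetical linear coordinate change x' = M x (not orthonormal on purpose).
M = [[2.0, 1.0], [0.0, 1.0]]
M_inv = [[0.5, -0.5], [0.0, 1.0]]       # inverse of M, so dx_j/dx'_i = M_inv[j][i]

dx = [0.3, -0.2]       # displacement components (contravariant)
grad = [5.0, 7.0]      # gradient components of some scalar (covariant)

# Contravariant law, eq. (1.4.7): dx'_i = sum_j (dx'_i/dx_j) dx_j
dx_new = [sum(M[i][j] * dx[j] for j in range(2)) for i in range(2)]

# Covariant law, eq. (1.4.12): V'_i = sum_j (dx_j/dx'_i) V_j
grad_new = [sum(M_inv[j][i] * grad[j] for j in range(2)) for i in range(2)]

# The contraction is the change in the scalar and must not depend on the frame.
d_phi_old = sum(g * d for g, d in zip(grad, dx))
d_phi_new = sum(g * d for g, d in zip(grad_new, dx_new))

print(dx_new, grad_new)
print(d_phi_old, d_phi_new)
```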

## 1.5 Operators

The notion of a mathematical operator is extremely important in mathematical physics and there are
entire books written on the subject. Most students first encounter operators in calculus when the notation
[d/dx] is introduced to denote the operations involved in finding the derivative of a function. In this instance
the operator stands for taking the limit of the difference between adjacent values of some function of x
divided by the difference between the adjacent values of x as that difference tends toward zero. This is a
fairly complicated set of instructions represented by a relatively simple set of symbols.

The designation of some symbol to represent a collection of operations is said to represent the
definition of an operator. Depending on the details of the definition, the operators can often be treated as if
they were quantities and subject to algebraic manipulations. The extent to which this is possible is
determined by how well the operators satisfy the conditions for the group on which the algebra or
mathematical system in question is defined. The operator [d/dx] is a scalar operator. That is, it provides a
single result after operating on some function defined in an appropriate coordinate space. It and the operator
∫ represent the fundamental operators of the infinitesimal calculus. Since [d/dx] and ∫ carry out inverse
operations on functions, one can define an identity operator by [d/dx]∫ so that continuous differentiable
functions will form a group under the action of these operators.

In numerical analysis there are analogous operators ∆ and Σ that perform similar functions but
without taking the limit to vanishingly small values of the independent variable. Thus we could define the
forward finite difference operator ∆ by its operation on some function f(x) so that
$$ \Delta f(x) = f(x+h) - f(x) , \qquad (1.5.1) $$
where the problem is usually scaled so that h = 1. In a similar manner Σ can be defined as
$$ \sum_{i=0}^{n} f(x_i) = f(x) + f(x+h) + f(x+2h) + \cdots + f(x+ih) + \cdots + f(x+nh) . \qquad (1.5.2) $$

Such operators are most useful in expressing formulae in numerical analysis. Indeed, it is possible to build
up an entire calculus of finite differences. Here the base for such a calculus is 2 instead of e = 2.7182818... as
in the infinitesimal calculus. Other operators that are useful in the finite difference calculus are the shift
operator E[f(x)] and the identity operator I[f(x)] which are defined as
$$ \left. \begin{aligned} E[f(x)] &\equiv f(x+h) \\ I[f(x)] &\equiv f(x) \end{aligned} \right\} . \qquad (1.5.3) $$
These operators are not linearly independent as we can write the forward difference operator as
$$ \Delta = E - I . \qquad (1.5.4) $$
The finite difference and summation calculus are extremely powerful when summing series or evaluating
convergence tests for series. Before attempting to evaluate an infinite series, it is useful to know if the series
converges. If possible, the student should spend some time studying the calculus of finite differences.
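
Because these operators act on functions and return functions, they map naturally onto higher-order functions. The sketch below assumes the usual scaling h = 1 and uses a cubic as an arbitrary test function; the names `shift`, `identity`, and `forward_diff` are hypothetical stand-ins for E, I, and ∆. It checks equation (1.5.4), shows that repeated differencing of a cubic ends in a constant, and illustrates that summation undoes differencing in the same way that integration undoes differentiation.

```python
def shift(f, h=1):
    """Shift operator E: returns the function x -> f(x + h)."""
    return lambda x: f(x + h)


def identity(f):
    """Identity operator I: returns the function x -> f(x)."""
    return lambda x: f(x)


def forward_diff(f, h=1):
    """Forward difference operator, eq. (1.5.1): x -> f(x + h) - f(x)."""
    return lambda x: f(x + h) - f(x)


def cube(x):
    return x ** 3


d1 = forward_diff(cube)
d3 = forward_diff(forward_diff(forward_diff(cube)))

# Eq. (1.5.4): the forward difference equals E - I when applied to any function.
same = all(d1(x) == shift(cube)(x) - identity(cube)(x) for x in range(-5, 6))

# Summing the differences telescopes: sum_{x=0}^{4} (Delta f)(x) = f(5) - f(0).
telescoped = sum(d1(x) for x in range(5))

print(same)
print([d3(x) for x in range(4)])   # third difference of a cubic is constant
print(telescoped, cube(5) - cube(0))
```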

In addition to scalar operators, it is possible to define vector and tensor operators. One of the most
common vector operators is the "del" operator or "nabla". It is usually denoted by the symbol ∇ and is
defined in Cartesian coordinates as
$$ \nabla = \hat{i}\frac{\partial}{\partial x} + \hat{j}\frac{\partial}{\partial y} + \hat{k}\frac{\partial}{\partial z} . \qquad (1.5.5) $$
This single operator, when combined with some of the products defined above, constitutes the foundation
of vector calculus. Thus the divergence, gradient, and curl are defined as
$$ \left. \begin{aligned} \nabla \cdot \vec{A} &= b \\ \nabla a &= \vec{B} \\ \nabla \times \vec{A} &= \vec{C} \end{aligned} \right\} , \qquad (1.5.6) $$
respectively. If we consider $\vec{A}$ to be a continuous vector function of the independent variables that make up
the space in which it is defined, then we may give a physical interpretation for both the divergence and curl.
The divergence of a vector field is a measure of the amount that the field spreads or contracts at some given
point in the space (see Figure 1.3).

Figure 1.3 schematically shows the divergence of a vector field. In the region where the
arrows of the vector field converge, the divergence is negative, implying a sink of the vector field.
The opposite is true for the region where the field vectors diverge.

Figure 1.4 schematically shows the curl of a vector field. The direction of the curl is determined
by the "right hand rule" while the magnitude depends on the rate of change of the x- and y-components
of the vector field with respect to y and x.

The curl is somewhat harder to visualize. In some sense it represents the amount that the field rotates about a
given point. Some have called it a measure of the "swirliness" of the field. If in the vicinity of some point in
the field, the vectors tend to veer to the left rather than to the right, then the curl will be a vector pointing up
normal to the net rotation with a magnitude that measures the degree of rotation (see Figure 1.4). Finally, the
gradient of a scalar field is simply a measure of the direction and magnitude of the maximum rate of change
of that scalar field (see Figure 1.5).
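
These three quantities can be estimated numerically by replacing each partial derivative in the Del-operator with a central finite difference. The following is a rough sketch under stated assumptions: the step size `h`, the sample point, and the three test fields are arbitrary choices made for illustration, and central differencing is only one of many possible approximations. The test fields are a radially spreading field (x, y, z), a field (-y, x, 0) that circulates counter-clockwise about the z-axis, and the scalar x² + y² + z².

```python
def partial(f, point, axis, h=1e-5):
    """Central difference estimate of the partial derivative of f along one axis."""
    up, down = list(point), list(point)
    up[axis] += h
    down[axis] -= h
    return (f(*up) - f(*down)) / (2.0 * h)


def gradient(scalar, p):
    return [partial(scalar, p, axis) for axis in range(3)]


def divergence(field, p):
    # Sum of d(F_i)/d(x_i) over the three components.
    return sum(partial(lambda *q, i=i: field(*q)[i], p, i) for i in range(3))


def curl(field, p):
    def d(component, axis):
        return partial(lambda *q: field(*q)[component], p, axis)
    return [d(2, 1) - d(1, 2), d(0, 2) - d(2, 0), d(1, 0) - d(0, 1)]


def radial(x, y, z):          # spreads away from the origin
    return (x, y, z)


def swirl(x, y, z):           # circulates counter-clockwise about the z-axis
    return (-y, x, 0.0)


def r_squared(x, y, z):       # scalar field
    return x * x + y * y + z * z


point = (1.0, 2.0, 3.0)       # arbitrary sample point
print(divergence(radial, point), curl(radial, point))
print(divergence(swirl, point), curl(swirl, point))
print(gradient(r_squared, point))
```

The circulating field has no divergence but a curl pointing along +z, in accord with the right hand rule, while the radial field has divergence but no curl.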

Figure 1.5 schematically shows the gradient of the scalar dot-density in the form of a
number of vectors at randomly chosen points in the scalar field. The direction of the
gradient points in the direction of maximum increase of the dot-density while the magnitude
of the vector indicates the rate of change of that density.

With these simple pictures in mind and what we developed in section 1.4 it is possible to generalize the
notion of the Del-operator to other quantities. Consider the gradient of a vector field. This represents the
outer product of the Del-operator with a vector. While one doesn't see such a thing often in freshman
physics, it does occur in more advanced descriptions of fluid mechanics (and many other places). We now
know enough to understand that the result of this operation will be a tensor of rank two which we can
represent as a matrix. What do the components mean? Generalize from the scalar case. The nine elements
of the vector gradient can be viewed as three vectors denoting the direction of the maximum rate of
change of each of the components of the original vector.

The nine elements represent a perfectly well defined quantity and it has a useful purpose in describing many
physical situations. One can also consider the divergence of a second rank tensor, which is clearly a vector.

In hydrodynamics, the divergence of the pressure tensor may reduce to the gradient of the scalar gas
pressure if the macroscopic flow of the material is small compared to the internal speed of the particles that
make up the material. With some care in the definition of a collection of operators, their action on the
elements of a field or group will preserve the field or group nature of the original elements. These are the
operators that are of the greatest use in mathematical physics.


By combining the various products defined in this chapter with the familiar notions of vector
calculus, we can formulate a much richer description of the physical world. This review of scalar and vector
mathematics along with the all-too-brief introduction to tensors and matrices will be useful in setting up
problems for their eventual numerical solution. Indeed, it is clear from the transformations described in the
last sections that a prime aspect in numerically solving problems will be dealing with linear equations and
matrices, and that will be the subject of the next chapter.


Chapter 1 Exercises
1. Show that the rational numbers (not including zero) form a group under addition and multiplication.
Do they also form a scalar field?

2. Show that it is not possible to put the rational numbers into a one to one correspondence with the
real numbers.

4. Show that the matrix product is not commutative.

5. Is the scalar product of two second rank tensors commutative? If so show how you know.
If not, give a counter example.

7. Show that the Kronecker delta $\delta^{i}_{j}$ is indeed a mixed tensor.

8. Determine the nature (i.e. contravariant, covariant, or mixed) of the Levi-Civita tensor density.

9. Show that the vector cross product does indeed give rise to a pseudo-vector.

10. Use the forward finite difference operator to define a second order finite difference operator and use
it to evaluate $\Delta^2[f(x)]$, where $f(x) = x^2 + 5x + 12$.

11. If $g_n(x) = x^{(n)} \equiv x(x-1)(x-2)(x-3)\cdots(x-n+1)$, show that $\Delta[g_n(x)] = n\,g_{n-1}(x)$.
$g_n(x)$ is known as the factorial function.

12. Show that if f(x) is a polynomial of degree n, then it can be expressed as a sum of factorial functions
(see problem 11).

and use the result to prove

$$
\nabla \times (\nabla \times \vec{F}) = \nabla(\nabla \cdot \vec{F}) - \nabla^2 \vec{F} .
$$


## Chapter 1 References and Additional Reading

One of the great books in theoretical physics, and the only one I know that gives a complete list of
the coordinate frames for which Laplace's equation is separable is

1. Morse, P.M., and Feshbach, H., "Methods of Theoretical Physics" (1953) McGraw-Hill Book Co.,
Inc. New York, Toronto, London, pp. 665-666.

It is a rather formidable book on theoretical physics, but any who aspire to a career in the area should be
familiar with its contents.

While many books give excellent introductions to modern set and group theory, I have found

2. Andree, R.V., "Selections from Modern Abstract Algebra" (1958) Henry Holt & Co. New York,

to be clear and concise. A fairly complete and concise discussion of determinants can be found in

3. Sokolnikoff, I.S., and Redheffer, R.M., "Mathematics of Physics and Modern Engineering" (1958)
McGraw-Hill Book Co., Inc. New York, Toronto, London, pp. 741-753.

A particularly clear and approachable book on Tensor Calculus which has been reprinted by Dover is

4. Synge, J.L., and Schild, A., "Tensor Calculus" (1949) University of Toronto Press, Toronto.

I would strongly advise any student of mathematical physics to become familiar with this book before
attempting books on relativity theory that rely heavily on tensor notation. While there are many books on
operator calculus, a venerable book on the calculus of finite differences is

5. Milne-Thomson, L.M., "The Calculus of Finite Differences" (1933) Macmillan and Co., LTD,
London.

A more elementary book on the use of finite difference equations in the social sciences is

6. Goldberg, S., "Introduction to Difference Equations", (1958) John Wiley & Sons, Inc., London.

There are many fine books on numerical analysis and I will refer to many of them in later chapters.
However, there are certain books that are virtually unique in the area and foremost is

7. Abramowitz, M. and Stegun, I.A., "Handbook of Mathematical Functions" National Bureau of
Standards Applied Mathematics Series 55 (1964) U.S. Government Printing Office, Washington
D.C.

While this book has also been reprinted, it is still available from the Government Printing Office and
represents an exceptional buy. Approximation schemes and many numerical results have been collected and
are clearly presented in this book. One of the more obscure series of books is collectively known as the
Bateman manuscripts, or

8. Bateman, H., "The Bateman Manuscript Project" (1954) Ed. A. Erdélyi, 5 Volumes,
McGraw-Hill Book Co., Inc. New York, Toronto, London.

Harry Bateman was a mathematician of considerable skill who enjoyed collecting obscure functional
relationships. When he died, this collection was organized, catalogued, and published as the Bateman
Manuscripts. It is a truly amazing collection of relations. When all else fails in an analysis of a problem,
before fleeing to the computer for a solution, one should consult the Bateman Manuscripts to see if the
problem could not be transformed to a different more tractable problem by means of one of the remarkable
relations collected by Harry Bateman. A book of similar utility but easier to obtain and use is

9. Lebedev, N.N., "Special Functions and Their Applications" (1972), Trans. R.A. Silverman. Dover
Publications, Inc. New York.

## 2 The Numerical Methods for Linear Equations and Matrices

We saw in the previous chapter that linear equations play an important role
in transformation theory and that these equations could be simply expressed in terms of matrices. However,
this is only a small segment of the importance of linear equations and matrix theory to the mathematical
description of the physical world. Thus we should begin our study of numerical methods with a description
of methods for manipulating matrices and solving systems of linear equations. However, before we begin
any discussion of numerical methods, we must say something about the accuracy to which those calculations
can be made.


## 2.1 Errors and Their Propagation

One of the most reliable aspects of numerical analysis programs for the electronic digital computer
is that they almost always produce numbers. As a result of the considerable reliability of the machines, it is
common to regard the results of their calculations with a certain air of infallibility. However, the results can
be no better than the method of analysis and implementation program utilized by the computer and these are
the works of highly fallible man. This is the origin of the aphorism "garbage in − garbage out". Because of
the large number of calculations carried out by these machines, small errors at any given stage can rapidly
propagate into large ones that destroy the validity of the result.

We can divide computational errors into two general categories: the first of these we will call round
off error, and the second truncation error. Round off error is perhaps the more insidious of the two and is
always present at some level. Indeed, its omnipresence indicates the first problem facing us. How accurate an
answer do we require? Digital computers utilize a certain number of digits in their calculations and this base
number of digits is known as the precision of the machine. Often it is possible to double or triple the number
of digits and hence the phrase "double" or "triple" precision is commonly used to describe a calculation
carried out using this expanded number of digits. It is a common practice to use more digits than are justified
by the problem simply to be sure that one has "got it right". For the scientist, there is a subtle danger in this
in that the temptation to publish all the digits presented by the computer is usually overwhelming. Thus
published articles often contain numerical results consisting of many more decimal places than are justified
by the calculation or the physics that went into the problem. This can lead to some reader unknowingly using
the results at an unjustified level of precision thereby obtaining meaningless conclusions. Certainly the full
machine precision is never justified, as after the first arithmetical calculation, there will usually be some
uncertainty in the value of the last digit. This is the result of the first kind of error we called round off error.
As an extreme example, consider a machine that keeps only one significant figure and the exponent of the
calculation so that 6+3 will yield $9\times10^{0}$. However, 6+4, 6+5, and 6+8 will all yield the same answer, namely
$1\times10^{1}$. Since the machine only carries one digit, all the other information will be lost. It is not immediately
obvious what the result of 6+9, or 7+9 will be. If the result is $2\times10^{1}$, then the machine is said to round off the
calculation to the nearest significant digit. However, if the result remains $1\times10^{1}$, then the machine is said to
truncate the addition to the nearest significant digit. Which is actually done by the computer will depend on
both the physical architecture (hardware) of the machine and the programs (software) which instruct it to
carry out the operation. Should a human operator be carrying out the calculation, it would usually be
possible to see when this is happening and allow for it by keeping additional significant figures, but this is
generally not the case with machines. Therefore, we must be careful about the propagation of round off error
into the final computational result. It is tempting to say that the above example is only for a 1-digit machine
and therefore unrealistic. However, consider the common 6-digit machine. It will be unable to distinguish
between 1 million dollars and 1 million and nine dollars. Subtraction of those two numbers would yield zero.
This would be significant to any accountant at a bank. Repeated operations of this sort can lead to a
completely meaningless result in the first digit.
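The one-digit and six-digit machines described above can be imitated with Python's standard `decimal` module, which lets the number of significant digits and the rounding rule be chosen explicitly. What follows is only a sketch: the helper names `machine_add` and `machine_sub` are invented here for illustration, and the six-digit case assumes a machine that truncates rather than rounds.

```python
from decimal import Context, ROUND_DOWN, ROUND_HALF_UP


def machine_add(a, b, digits, rounding):
    """Add a and b on a hypothetical decimal machine keeping `digits` significant figures."""
    ctx = Context(prec=digits, rounding=rounding)
    # Operands are stored at machine precision before the sum is formed.
    return ctx.add(ctx.create_decimal(a), ctx.create_decimal(b))


def machine_sub(a, b, digits, rounding):
    """Subtract b from a on the same hypothetical machine."""
    ctx = Context(prec=digits, rounding=rounding)
    return ctx.subtract(ctx.create_decimal(a), ctx.create_decimal(b))


# One-digit machine: 6+4 and 6+8 cannot be told apart.
print(int(machine_add(6, 4, 1, ROUND_HALF_UP)), int(machine_add(6, 8, 1, ROUND_HALF_UP)))

# 6+9 depends on whether the machine rounds or truncates.
print(int(machine_add(6, 9, 1, ROUND_HALF_UP)), int(machine_add(6, 9, 1, ROUND_DOWN)))

# Six-digit truncating machine: the nine dollars vanish.
print(int(machine_sub(1000009, 1000000, 6, ROUND_DOWN)))
```

Changing only the `rounding` argument switches the same machine between the "round off" and "truncate" behavior discussed in the text.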

This emphasizes the question of 'how accurate an answer do we need?'. For the accountant, we
clearly need enough digits to account for all the money at a level decided by the bank. For example, the
Internal Revenue Service allows taxpayers to round all calculations to the nearest dollar. This sets a lower
bound for the number of significant digits. One's income usually sets the upper bound. In the physical world
very few constants of nature are known to more than four digits (the speed of light is a notable exception).
Thus the results of physical modeling are rarely important beyond four figures. Again there are exceptions
such as in null experiments, but in general, scientists should not deceive themselves into believing their
answers are better answers than they are.

How do we detect the effects of round off error? Entire studies have been devoted to this subject by
considering that round off errors occur in basically a random fashion. Although computers are basically
deterministic (i.e. given the same initial state, the computer will always arrive at the same answer), a large
collection of arithmetic operations can be considered as producing a random collection of round-ups and
round-downs. However, the number of digits that are affected will also be variable, and this makes the
problem far more difficult to study in general. Thus in practice, when the effects of round off error are of
great concern, the problem can be run in double precision. Should both calculations yield the same result at
the acceptable level of precision, then round off error is probably not a problem. An additional "rule of
thumb" for detecting the presence of round off error is the appearance of a large number of zeros at the
right-hand side of the answers. Should the number of zeros depend on parameters of the problem that determine
the size or numerical extent of the problem, then one should be concerned about round off error. Certainly
one can think of exceptions to this rule, but in general, they are just that - exceptions.
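The practical test of repeating a calculation at higher precision can be sketched as follows. Python floats are double precision, so single precision is imitated here by rounding every intermediate result through the standard `struct` module; the function names and the choice of summing 0.1 one hundred thousand times are assumptions made purely for illustration.

```python
import struct


def to_single(x):
    """Round a double-precision float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def repeated_sum(term, count, single=False):
    """Add `term` to a running total `count` times, optionally at single precision."""
    if single:
        term = to_single(term)
    total = 0.0
    for _ in range(count):
        total += term
        if single:
            total = to_single(total)  # discard the extra digits after every addition
    return total


single_result = repeated_sum(0.1, 100_000, single=True)
double_result = repeated_sum(0.1, 100_000)

print("single precision:", single_result)
print("double precision:", double_result)
print("disagreement    :", abs(single_result - double_result))
```

If the two results disagree at the level of accuracy required, round off error in the lower-precision run cannot be ignored.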

The second form of error we called truncation error and it should not be confused with errors
introduced by the "truncation" process that happens half the time in the case of round off errors. This type of
error results from the inability of the approximation method to properly represent the solution to the
problem. The magnitude of this kind of error depends on both the nature of the problem and the type of
approximation technique. For example, consider a numerical approximation technique that will give exact
answers should the solution to the problem of interest be a polynomial (we shall show in chapter 3 that the
majority of methods of numerical analysis are indeed of this form). Since the solution is exact for
polynomials, the extent that the correct solution differs from a polynomial will yield an error. However, there
are many different kinds of polynomials and it may be that a polynomial of higher degree approximates the
solution more accurately than one of lower degree.

This provides a hint for the practical evaluation of truncation errors. If the calculation is repeated at
different levels of approximation (i.e. for approximation methods that are correct for different degree
polynomials) and the answers change by an unacceptable amount, then it is likely that the truncation error is
larger than the acceptable amount. There are formal ways of estimating the truncation error and some
'black-box' programs do this. Indeed, there are general programs for finding the solutions to differential equations
that use estimates of the truncation error to adjust parameters of the solution process to optimize efficiency.
However, one should remember that these estimates are just that - estimates subject to all the errors of
calculation we have been discussing. In many cases the correct calculation of the truncation error is a more
formidable problem than the one of interest. In general, it is useful for the analyst to have some prior
knowledge of the behavior of the solution to the problem of interest before attempting a detailed numerical
solution. Such knowledge will generally provide a 'feeling' for the form of the truncation error and the extent
to which a particular numerical technique will manage it.
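As an illustration of repeating a calculation at two levels of approximation, the sketch below integrates $e^x$ from 0 to 1 with the trapezoid rule (exact for polynomials of degree one) and with Simpson's rule (exact for polynomials up to degree three). The integrand, the interval, and the number of steps are arbitrary choices made here, not taken from the text.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal steps."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def simpson(f, a, b, n):
    """Composite Simpson rule with n equal steps; n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))


low = trapezoid(math.exp, 0.0, 1.0, 8)
high = simpson(math.exp, 0.0, 1.0, 8)
exact = math.e - 1.0  # known here only because the test integrand is simple

print("change between the two approximations:", abs(low - high))
print("true error of the trapezoid result   :", abs(low - exact))
```

The change between the two answers is close to the true error of the lower-order result, which is what makes the comparison useful when the exact answer is not available.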

We must keep in mind that both round-off and truncation errors will be present at some level in any
calculation and be wary lest they destroy the accuracy of the solution. The acceptable level of accuracy is
determined by the analyst and he must be careful not to aim too high and carry out grossly inefficient
calculations, or too low and obtain meaningless results.

We now turn to the solution of linear algebraic equations and problems involving matrices
associated with those solutions. In general we can divide the approaches to the solution of linear algebraic
equations into two broad areas. The first of these involves algorithms that lead directly to a solution of the
problem after a finite number of steps while the second class involves an initial "guess" which then is
improved by a succession of finite steps, each set of which we will call an iteration. If the process is
applicable and properly formulated, a finite number of iterations will lead to a solution.

## 2.2 Direct Methods for the Solution of Linear Algebraic Equations

In general, we may write a system of linear algebraic equations in the form

$$
\left.
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \cdots + a_{1n}x_n &= c_1 \\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \cdots + a_{2n}x_n &= c_2 \\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + \cdots + a_{3n}x_n &= c_3 \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + a_{n3}x_3 + \cdots + a_{nn}x_n &= c_n
\end{aligned}
\right\} , \tag{2.2.1}
$$

which in vector notation is

$$
A\vec{x} = \vec{c} . \tag{2.2.2}
$$

Here $\vec{x}$ is an n-dimensional vector the elements of which represent the solution of the equations.
$\vec{c}$ is the constant vector of the system of equations and A is the matrix of the system's coefficients.

We can write the solution to these equations as

$$
\vec{x} = A^{-1}\vec{c} , \tag{2.2.3}
$$

thereby reducing the solution of any algebraic system of linear equations to finding the inverse of the
coefficient matrix. We shall spend some time describing a number of methods for doing just that. However,
there are a number of methods that enable one to find the solution without finding the inverse of the matrix.
Probably the best known of these is Cramer's Rule.

## a. Solution by Cramer's Rule

It is unfortunate that usually the only method for the solution of linear equations that
students remember from secondary education is Cramer's rule or expansion by minors. As we shall see, this
method is rather inefficient and relatively difficult to program for a computer. However, as it forms sort of a
standard by which other methods can be judged, we will review it here. In Chapter 1 [equation (1.2.10)] we
gave the form for the determinant of a 3×3 matrix. The more general definition is inductive so that the
determinant of the matrix A would be given by

$$
\mathrm{Det}\,A = \sum_{i=1}^{n} (-1)^{i+j}\, a_{ij} M_{ij} , \quad \forall j . \tag{2.2.4}
$$

Here the summation may be taken over either i or j, or indeed, any monotonically increasing sequence of
both. The quantity $M_{ij}$ is the determinant of the matrix A with the ith row and jth column removed and, with
the sign carried by $(-1)^{i+j}$, is called the cofactor $C_{ij}$ of the minor element $a_{ij}$. With all this terminology, we can
simply write the determinant as

$$
\mathrm{Det}\,A = \sum_{i=1}^{n} C_{ij}\, a_{ij} , \ \forall j , \; = \sum_{j=1}^{n} a_{ij}\, C_{ij} , \ \forall i . \tag{2.2.5}
$$

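By making use of theorems 2 and 7 in section 1.2, we can write the solution in terms of the determinant of
A as

$$
\begin{aligned}
x_1 \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}
&= \begin{vmatrix} a_{11}x_1 & a_{12} & \cdots & a_{1n} \\ a_{21}x_1 & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1}x_1 & a_{n2} & \cdots & a_{nn} \end{vmatrix} \\
&= \begin{vmatrix} (a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n) & a_{12} & \cdots & a_{1n} \\ (a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n) & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ (a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n) & a_{n2} & \cdots & a_{nn} \end{vmatrix} \\
&= \begin{vmatrix} c_1 & a_{12} & \cdots & a_{1n} \\ c_2 & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ c_n & a_{n2} & \cdots & a_{nn} \end{vmatrix} ,
\end{aligned} \tag{2.2.6}
$$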
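which means that the general solution of equation (2.2.1) is given by

$$
x_j = \begin{vmatrix} a_{11} & \cdots & a_{1\,j-1} & c_1 & a_{1\,j+1} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2\,j-1} & c_2 & a_{2\,j+1} & \cdots & a_{2n} \\ \vdots & & \vdots & \vdots & \vdots & & \vdots \\ a_{n1} & \cdots & a_{n\,j-1} & c_n & a_{n\,j+1} & \cdots & a_{nn} \end{vmatrix} \times [\mathrm{Det}\,A]^{-1} . \tag{2.2.7}
$$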
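A direct transcription of equation (2.2.7) is sketched below. The determinant is evaluated by cofactor expansion along the first row, which is equation (2.2.5) with i = 1. The function names are invented for this illustration, exact rational arithmetic is used so that round off plays no role, and the entries are assumed to be integers or `Fraction` values. The test system is the one in equation (2.2.13).

```python
from fractions import Fraction


def det_by_minors(m):
    """Determinant by cofactor expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Minor: remove the first row and the jth column.
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det_by_minors(minor)
    return total


def cramer_solve(a, c):
    """Solve A x = c by Cramer's rule; a is a list of rows, c the constant vector."""
    d = det_by_minors(a)
    if d == 0:
        raise ValueError("matrix is singular")
    x = []
    for j in range(len(a)):
        # Replace the jth column of A by the constant vector.
        replaced = [row[:j] + [c[i]] + row[j + 1:] for i, row in enumerate(a)]
        x.append(Fraction(det_by_minors(replaced), d))
    return x


A = [[1, 2, 3], [3, 2, 1], [2, 1, 3]]
c = [12, 24, 36]
print(det_by_minors(A), cramer_solve(A, c))
```

The recursion makes the cost of this sketch grow very quickly with n, so it is useful only as a check on small systems.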
Equation (2.2.7) requires evaluating the determinant of the matrix A as well as that of an augmented matrix where the jth
column has been replaced by the elements of the constant vector $c_i$. Evaluation of the determinant of an n×n
matrix requires about $3n^2$ operations and this must be repeated for each unknown, thus solution by Cramer's
rule will require at least $3n^3$ operations. In addition, to maintain accuracy, an optimum path through the
matrix (finding the least numerically sensitive cofactors) will require a significant amount of logic. Thus,
solution by Cramer's rule is not a particularly desirable approach to the numerical solution of linear
equations either for a computer or a hand calculation.

## b. Solution by Gaussian Elimination

Let us consider a simpler algorithm, which forms the
basis for one of the most reliable and stable direct methods for the solution of linear equations. It also
provides a method for the inversion of matrices. Let us begin by describing the method and then trying to
understand why it works.


Consider representing the set of linear equations given in equation (2.2.1) by

$$
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix} . \tag{2.2.8}
$$
Here we have suppressed the presence of the elements of the solution vector $x_j$. Now we will perform a
series of operations on the rows and columns of the coefficient matrix A and we shall carry through the row
operations to include the elements of the constant vector $c_i$. In other words, we shall treat the rows as if they
were indeed the equations so that anything done to one element is done to all. One begins by dividing each
row including the constant element by the lead element in the row. The first row is then subtracted from all
the lower rows. Thus all rows but the first will have zero in the first column. Now repeat these operations for
all but the first equation starting with the second element of the second equation producing ones in the
second column of the remaining equations. Subtracting the resulting second line from all below will yield
zeros in the first two columns of equation three and below. This process can be repeated until one has arrived
at the last line representing the last equation. When the diagonal coefficient there is unity, the last term of the
constant vector contains the value of $x_n$. This can be used in the (n-1)th equation represented by the second
to the last line to obtain $x_{n-1}$ and so on right up to the first line which will yield the value of $x_1$. The name of
this method simply derives from the elimination of each unknown from the equations below it producing a
triangular system of equations represented by

$$
\begin{pmatrix} 1 & a'_{12} & \cdots & a'_{1n} \\ 0 & 1 & \cdots & a'_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}
\begin{pmatrix} c'_1 \\ c'_2 \\ \vdots \\ c'_n \end{pmatrix} , \tag{2.2.9}
$$

which can then be easily solved by back substitution where

$$
\left.
\begin{aligned}
x_n &= c'_n \\
x_i &= c'_i - \sum_{j=i+1}^{n} a'_{ij}\, x_j
\end{aligned}
\right\} . \tag{2.2.10}
$$
One of the disadvantages of this approach is that errors (principally round off errors) from the
successive subtractions build up through the process and accumulate in the last equation for $x_n$. The errors
thus incurred are further magnified by the process of back substitution forcing the maximum effects of the
round-off error into $x_1$. A simple modification to this process allows us to more evenly distribute the effects
of round off error yielding a solution of more uniform accuracy. In addition, it will provide us with an
efficient mechanism for calculation of the inverse of the matrix A.
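The elimination and back-substitution steps of equations (2.2.8) through (2.2.10) can be sketched as follows. The routine follows the description in the text (each remaining row is divided by its lead element, then the current row is subtracted from those below) and does no row interchange, so it fails on a zero diagonal element. The name `gauss_solve` and the use of exact fractions are choices made here for illustration.

```python
from fractions import Fraction


def gauss_solve(a, c):
    """Solve A x = c by Gaussian elimination and back substitution (no pivoting)."""
    n = len(a)
    # Each row carries its constant element, as in expression (2.2.8).
    rows = [[Fraction(v) for v in a[i]] + [Fraction(c[i])] for i in range(n)]

    for k in range(n):
        # Divide row k and every lower row by its lead element in column k.
        for i in range(k, n):
            lead = rows[i][k]
            if lead == 0:
                if i == k:
                    raise ZeroDivisionError("zero diagonal element; rows must be interchanged")
                continue  # this row already has its zero in column k
            rows[i] = [v / lead for v in rows[i]]
        # Subtract row k from the rows below to produce zeros in column k.
        for i in range(k + 1, n):
            if rows[i][k] != 0:
                rows[i] = [vi - vk for vi, vk in zip(rows[i], rows[k])]

    # Back substitution, equation (2.2.10).
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        x[i] = rows[i][n] - sum(rows[i][j] * x[j] for j in range(i + 1, n))
    return x


print(gauss_solve([[1, 2, 3], [3, 2, 1], [2, 1, 3]], [12, 24, 36]))
```

With floating point numbers in place of fractions, the same loop structure exhibits the accumulation of round off error described above.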

## c. Solution by Gauss Jordan Elimination

Let us begin by writing the system of linear equations as we did in equation (2.2.8), but now
include a unit matrix with elements $\delta_{ij}$ on the right hand side of the expression. Thus,

$$
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}
\begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} . \tag{2.2.11}
$$
We will treat the elements of this matrix as we do the elements of the constant vector $c_i$. Now proceed as we
did with the Gauss elimination method producing zeros in the columns below and to the left of the diagonal
element. However, in addition to subtracting the line whose diagonal element has been made unity from all
those below it, also subtract from the equations above it as well. This will require that these equations be
normalized so that the corresponding elements are made equal to one and the diagonal element will no
longer be unity. In addition to operating on the rows of the matrix A and the elements of $\vec{c}$, we will operate
on the elements of the additional matrix which is initially a unit matrix. Carrying out these operations row by
row until the last row is completed will leave us with a system of equations that resemble

$$
\begin{pmatrix} a'_{11} & 0 & \cdots & 0 \\ 0 & a'_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a'_{nn} \end{pmatrix}
\begin{pmatrix} c'_1 \\ c'_2 \\ \vdots \\ c'_n \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{pmatrix} . \tag{2.2.12}
$$

If one examines the operations we have performed in light of theorems 2 and 7 from section 1.2, it is
clear that so far we have done nothing to change the determinant of the original matrix A so that expansion
by minors of the modified matrix represented by the elements $a'_{ij}$ is simply accomplished by multiplying the
diagonal elements $a'_{ii}$ together. A final step of dividing each row by $a'_{ii}$ will yield the unit matrix on the left
hand side and elements of the solution vector $x_i$ will be found where the $c'_i$ were. The final elements of B
will be the elements of the inverse matrix of A. Thus we have both solved the system of equations and found
the inverse of the original matrix by performing the same steps on the constant vector as well as an
additional unit matrix. Perhaps the simplest way to see why this works is to consider the system of linear
equations and what the operations mean to them. Since all the operations are performed on entire rows
including the constant vector, it is clear that they constitute legal algebraic operations that won't change the
nature of the solution in any way. Indeed these are nothing more than the operations that one would perform
by hand if he/she were solving the system by eliminating the appropriate variables. We have simply
formalized that procedure so that it may be carried out in a systematic fashion. Such a procedure lends itself
to computation by machine and may be relatively easily programmed. The reason for the algorithm yielding
the matrix inverse is somewhat less easy to see. However, the product of A and B will be the unit matrix 1,
and the operations that go into that matrix-multiply are the inverse of those used to generate B.

To see specifically how the Gauss-Jordan routine works, consider the following system of
equations:

$$
\left.
\begin{aligned}
x_1 + 2x_2 + 3x_3 &= 12 \\
3x_1 + 2x_2 + x_3 &= 24 \\
2x_1 + x_2 + 3x_3 &= 36
\end{aligned}
\right\} . \tag{2.2.13}
$$
If we put this in the form required by expression (2.2.11) we have

$$
\begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \\ 2 & 1 & 3 \end{pmatrix}
\begin{pmatrix} 12 \\ 24 \\ 36 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} . \tag{2.2.14}
$$
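Now normalize all the rows by factoring out the lead elements of the first column so that

$$
(1)(3)(2)
\begin{pmatrix} 1 & 2 & 3 \\ 1 & \tfrac{2}{3} & \tfrac{1}{3} \\ 1 & \tfrac{1}{2} & \tfrac{3}{2} \end{pmatrix}
\begin{pmatrix} 12 \\ 8 \\ 18 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & \tfrac{1}{2} \end{pmatrix} . \tag{2.2.15}
$$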
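The first row can then be subtracted from the remaining rows (i.e. rows 2 and 3) to yield

$$
(6)
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -\tfrac{4}{3} & -\tfrac{8}{3} \\ 0 & -\tfrac{3}{2} & -\tfrac{3}{2} \end{pmatrix}
\begin{pmatrix} 12 \\ -4 \\ +6 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ -1 & \tfrac{1}{3} & 0 \\ -1 & 0 & \tfrac{1}{2} \end{pmatrix} . \tag{2.2.16}
$$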

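Now repeat the cycle normalizing by factoring out the elements of the second column getting

$$
(6)\left(-\tfrac{4}{3}\right)\left(-\tfrac{3}{2}\right)(2)
\begin{pmatrix} \tfrac{1}{2} & 1 & \tfrac{3}{2} \\ 0 & 1 & 2 \\ 0 & 1 & 1 \end{pmatrix}
\begin{pmatrix} +6 \\ +3 \\ -4 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} & 0 & 0 \\ \tfrac{3}{4} & -\tfrac{1}{4} & 0 \\ \tfrac{2}{3} & 0 & -\tfrac{1}{3} \end{pmatrix} . \tag{2.2.17}
$$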
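Subtracting the second row from the remaining rows (i.e. rows 1 and 3) gives

$$
(24)
\begin{pmatrix} \tfrac{1}{2} & 0 & -\tfrac{1}{2} \\ 0 & 1 & 2 \\ 0 & 0 & -1 \end{pmatrix}
\begin{pmatrix} +3 \\ +3 \\ -7 \end{pmatrix}
\begin{pmatrix} -\tfrac{1}{4} & \tfrac{1}{4} & 0 \\ \tfrac{3}{4} & -\tfrac{1}{4} & 0 \\ -\tfrac{1}{12} & \tfrac{1}{4} & -\tfrac{1}{3} \end{pmatrix} . \tag{2.2.18}
$$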

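Again repeat the cycle normalizing by the elements of the third column so

$$
(24)\left(-\tfrac{1}{2}\right)(2)(-1)
\begin{pmatrix} -1 & 0 & 1 \\ 0 & \tfrac{1}{2} & 1 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} -6 \\ \tfrac{3}{2} \\ +7 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} & -\tfrac{1}{2} & 0 \\ \tfrac{3}{8} & -\tfrac{1}{8} & 0 \\ \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix} , \tag{2.2.19}
$$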

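and subtract from the remaining rows to yield

$$
(24)
\begin{pmatrix} -1 & 0 & 0 \\ 0 & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} -13 \\ -\tfrac{11}{2} \\ +7 \end{pmatrix}
\begin{pmatrix} \tfrac{5}{12} & -\tfrac{1}{4} & -\tfrac{1}{3} \\ \tfrac{7}{24} & \tfrac{1}{8} & -\tfrac{1}{3} \\ \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix} . \tag{2.2.20}
$$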
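Finally normalize by the remaining elements so as to produce the unit matrix on the left hand side so that

$$
(24)(-1)\left(\tfrac{1}{2}\right)(+1)
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} +13 \\ -11 \\ +7 \end{pmatrix}
\begin{pmatrix} -\tfrac{5}{12} & \tfrac{1}{4} & \tfrac{1}{3} \\ \tfrac{7}{12} & \tfrac{1}{4} & -\tfrac{2}{3} \\ \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix} . \tag{2.2.21}
$$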
The solution to the equations is now contained in the center vector while the right hand matrix contains the
inverse of the original matrix that was on the left hand side of expression (2.2.14). The scalar quantity
accumulating at the front of the matrix is the determinant as it represents factors of individual rows of the
original matrix. Here we have repeatedly used theorems 2 and 7 given in section 1.2 of chapter 1. Theorem 2
allows us to build up the determinant by factoring out elements of the rows, while theorem 7 guarantees that
the row subtraction shown in expressions (2.2.16), (2.2.18), and (2.2.20) will not change the value of the
determinant. Since the determinant of the unit matrix on the left side of expression (2.2.21) is one, the
determinant of the original matrix is just the product of the factored elements. Thus our complete solution is

$$
\left.
\begin{aligned}
\vec{x} &= [\,13,\ -11,\ +7\,] \\
\mathrm{Det}\,A &= -12 \\
A^{-1} &= \begin{pmatrix} -\tfrac{5}{12} & \tfrac{1}{4} & \tfrac{1}{3} \\ \tfrac{7}{12} & \tfrac{1}{4} & -\tfrac{2}{3} \\ \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix}
\end{aligned}
\right\} . \tag{2.2.22}
$$

In carrying out this procedure, we have been careful to maintain full accuracy by keeping the
fractions that explicitly appear as a result of the division. In general, this will not be practical and the
perceptive student will have noticed that there is the potential for great difficulty as a result of the division.
Should any of the elements of the matrix A be zero when they are to play the role of divisor, then a
numerical singularity will result. Indeed, should the diagonal elements be small, division would produce
such large row elements that subtraction of them from the remaining rows would generate significant
roundoff error. However, interchanging two rows or two columns of a system of equations doesn't alter the
solution of these equations and, by theorem 5 of chapter 1 (sec 1.2), only the sign of the determinant is
changed. Since the equations at each step represent a system of equations, which have the same solution as
the original set, we may interchange rows and columns at any step in the procedure without altering the
solution. Thus, most Gauss-Jordan programs include a search of the matrix to place the largest element on
the diagonal prior to division by that element so as to minimize the effects of round off error. Should it be
impossible to remove a zero from the division part of this algorithm, then one column of the matrix can be
made to be completely zero. Such a matrix has a determinant that is zero and the matrix is said to be
singular. Systems of equations that are characterized by singular matrices have no unique solution.
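A compact Gauss-Jordan routine that returns the solution, the determinant, and the inverse together might look like the sketch below. It is a common variant rather than a transcription of the bookkeeping in expressions (2.2.14) through (2.2.21): only the pivot row is normalized at each step, and rows (not columns) are interchanged to bring the largest element of the current column onto the diagonal. The name `gauss_jordan` and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction


def gauss_jordan(a, c):
    """Return (x, det, inverse) for A x = c using Gauss-Jordan elimination with row pivoting."""
    n = len(a)
    # Augmented rows [A | c | unit matrix], as in expression (2.2.11).
    rows = [
        [Fraction(v) for v in a[i]] + [Fraction(c[i])]
        + [Fraction(int(i == j)) for j in range(n)]
        for i in range(n)
    ]
    det = Fraction(1)

    for k in range(n):
        # Bring the largest remaining element of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(rows[i][k]))
        if rows[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            rows[k], rows[p] = rows[p], rows[k]
            det = -det  # a row interchange changes the sign of the determinant
        pivot = rows[k][k]
        det *= pivot  # factor taken out of row k
        rows[k] = [v / pivot for v in rows[k]]
        # Clear column k in every other row, above as well as below.
        for i in range(n):
            if i != k and rows[i][k] != 0:
                f = rows[i][k]
                rows[i] = [vi - f * vk for vi, vk in zip(rows[i], rows[k])]

    x = [r[n] for r in rows]
    inverse = [r[n + 1:] for r in rows]
    return x, det, inverse


x, det, inverse = gauss_jordan([[1, 2, 3], [3, 2, 1], [2, 1, 3]], [12, 24, 36])
print(x, det)
for row in inverse:
    print(row)
```

Applied to the system of equation (2.2.13), the routine can be compared entry by entry with equation (2.2.22).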

It is clear that one could approach the singular state without actually reaching it. The result of this
would be to produce a solution of only marginal accuracy. In such circumstances the initial matrix might
have coefficients with six significant figures and the solution have one or less. While there is no a priori way
of knowing how nearly singular the matrix may be, there are several "rules of thumb" which while not
guaranteed to resolve the situation, generally work. First consider some characteristic of the matrix that
measures the typical size of its elements. Most any reasonable criterion will do such as the absolute value of
the largest element, the sum of the absolute values of the elements, or possibly the trace. Divide this
characteristic by the absolute value of the determinant and if the result exceeds the machine precision, the
result of the solution should be regarded with suspicion. Thus if we denote this characteristic of the matrix
by M, then
N ≥ log10│M/d│ , (2.2.23)

where d is the determinant of the original matrix. This should be regarded as a necessary, but not sufficient,
condition for the solution to be accurate. Indeed a rough guess as to the number of significant figures in the
resultant solution is
Ns ~ N ─ log10│M/d│ . (2.2.24)

Since most Gauss-Jordan routines return the determinant as a byproduct of the solution, it is irresponsible to
fail to check to see if the solution passes this test.
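The rule of thumb in equations (2.2.23) and (2.2.24) takes only a few lines to apply. In the sketch below M is taken to be the absolute value of the largest element, one of the choices suggested in the text; the function name, the seven-digit machine, and the nearly singular 2×2 test matrix are assumptions made here for illustration.

```python
import math


def significant_figures_estimate(matrix, det, machine_digits):
    """Rough estimate N_s ~ N - log10|M/d| of equation (2.2.24)."""
    size = max(abs(v) for row in matrix for v in row)  # M: largest element in absolute value
    return machine_digits - math.log10(abs(size / det))


# The well-behaved matrix of equation (2.2.14), whose determinant is -12.
good = [[1, 2, 3], [3, 2, 1], [2, 1, 3]]
print(significant_figures_estimate(good, -12, 7))

# A nearly singular matrix: the two rows are almost identical.
poor = [[1.0, 1.0], [1.0, 1.000001]]
poor_det = poor[0][0] * poor[1][1] - poor[0][1] * poor[1][0]
print(significant_figures_estimate(poor, poor_det, 7))
```

For the nearly singular case the estimate falls to about one significant figure on a seven-digit machine, which is the kind of warning the test is meant to give.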

An additional test would be the substitution of the solution back into the original equations to see
how accurately the elements of the constant vector are reproduced. For the inverse matrix, one can always
multiply the original matrix by the inverse and see to what extent the unit matrix results. This raises an
interesting question. What do we mean when we say that a solution to a system of equations is accurate? One
could mean that each element of the solution vector contains a certain number of significant figures, or one
might mean that the solution vector satisfies the equations at some acceptable level of accuracy (i.e. all
elements of the constant vector are reproduced to some predetermined number of significant figures). It is
worth noting that these two conditions are not necessarily the same. Consider the situation of a poorly
conditioned system of equations where the constant vector is only weakly specified by one of the unknowns.
Large changes in its value will make little change in the elements of the constant vector so that tight
tolerances on the constant vector will not yield values of that particular unknown with commensurate
accuracy. This system would not pass the test given by equation (2.2.23). In general, there should always be
an a priori specification of the required accuracy of the solution and an effort must be made to ascertain if
that level of accuracy has been reached.

## d. Solution by Matrix Factorization: The Crout Method

Consider two triangular matrices U and V with the following properties

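$$
\left.
\begin{aligned}
U &= \begin{cases} u_{ij}, & i \le j \\ 0, & i > j \end{cases} \\
V &= \begin{cases} 0, & i < j \\ v_{ij}, & i \ge j \end{cases}
\end{aligned}
\right\} . \tag{2.2.25}
$$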
Further assume that A can be written in terms of these triangular matrices so that
A = VU . (2.2.26)
Then our linear system of equations [equation (2.2.2)] could be written as
$$ A\vec{x} = \vec{c} = V(U\vec{x}) . \tag{2.2.27} $$
Multiplying by $V^{-1}$ we have that the solution will be given by a different set of equations
$$ U\vec{x} = V^{-1}\vec{c} = \vec{c}\,' , \tag{2.2.28} $$
where
$$ \vec{c} = V\vec{c}\,' . \tag{2.2.29} $$
If the vector $\vec{c}\,'$ can be determined, then equation (2.2.28) has the form of the result of the Gauss elimination
and would resemble expression (2.2.9) and have a solution similar to equation (2.2.10). In addition, equation
(2.2.29) is triangular and has a similarly simple solution for the vector $\vec{c}\,'$. Thus, we have replaced the
general system of linear equations by two triangular systems. Now the constraints on U and V only depend
on the matrix A and the triangular constraints. In no way do they depend on the constant vector $\vec{c}$. Thus, if
one has a large number of equations differing only in the constant vector, the matrices U and V need only be
found once.

The matrices U and V can be found from the matrix A in a fairly simple way by
$$
\left.
\begin{aligned}
u_{ij} &= a_{ij} - \sum_{k=1}^{i-1} v_{ik} u_{kj} \\
v_{ij} &= \left( a_{ij} - \sum_{k=1}^{j-1} v_{ik} u_{kj} \right) \Big/ u_{jj}
\end{aligned}
\right\} , \tag{2.2.30}
$$
which is justified by Hildebrandt$^1$. The solution of the resulting triangular equations is then just
$$
\left.
\begin{aligned}
c'_i &= \left( c_i - \sum_{k=1}^{i-1} v_{ik} c'_k \right) \Big/ v_{ii} \\
x_i &= \left( c'_i - \sum_{k=i+1}^{n} u_{ik} x_k \right) \Big/ u_{ii}
\end{aligned}
\right\} . \tag{2.2.31}
$$
Both equations (2.2.30) and (2.2.31) are recursive in nature in that the unknown relies on previously
determined values of the same set of unknowns. Thus round-off error will propagate systematically
throughout the solution. So it is useful if one attempts to arrange the initial equations in a manner which
minimizes the error propagation. However, the method involves a minimum of readily identifiable divisions
and so tends to be exceptionally stable. The stability will clearly be improved as long as the system of
equations contains large diagonal elements. Therefore the Crout method provides stability similar to or
greater than that of the Gauss-Jordan method and considerable efficiency in dealing with systems differing only
in the constant vector. In instances where the matrix A is symmetric the equations for $u_{ij}$ simplify to
$$ u_{ij} = v_{ji}\,u_{ii} . \tag{2.2.32} $$

As we shall see the normal equations for the least squares formalism always have this form so that the Crout
method provides a good basis for their solution.

While equations (2.2.30) and (2.2.31) specifically delineate the elements of the factored matrices U
and V, it is useful to see the manner in which they are obtained. Therefore let us consider the same equations
that served as an example for the Gauss-Jordan method [i.e. equations (2.2.13)]. In order to implement the
Crout method we wish to be able to express the coefficient matrix as

$$
A = VU = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \\ 2 & 1 & 3 \end{pmatrix}
= \begin{pmatrix} v_{11} & 0 & 0 \\ v_{21} & v_{22} & 0 \\ v_{31} & v_{32} & v_{33} \end{pmatrix}
\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix} . \tag{2.2.33}
$$
The constant vector $\vec{c}$ that appears in equation (2.2.31) is
$$ \vec{c} = (12, 24, 36) . \tag{2.2.34} $$
To factor the matrix A into the matrices U and V in accordance with equation (2.2.30), we proceed column
by column through the matrix so that the necessary elements of U and V required by equation (2.2.30) are
available when they are needed. Carrying out the factoring process specified by equations (2.2.30)
sequentially column by column yields

$$
\begin{aligned}
&\left.
\begin{aligned}
u_{11} &= a_{11} - 0 = 1 \\
v_{11} &= (a_{11} - 0)/u_{11} = 1 \\
v_{21} &= (a_{21} - 0)/u_{11} = 3 \\
v_{31} &= (a_{31} - 0)/u_{11} = 2
\end{aligned}
\right\} \quad j = 1 \\[6pt]
&\left.
\begin{aligned}
u_{12} &= a_{12} - 0 = 2 \\
u_{22} &= a_{22} - (v_{21} u_{12}) = 2 - (3 \times 2) = -4 \\
v_{22} &= [a_{22} - (v_{21} u_{12})]/u_{22} = [2 - (3 \times 2)]/(-4) = 1 \\
v_{32} &= [a_{32} - (v_{31} u_{12})]/u_{22} = [1 - (2 \times 2)]/(-4) = \tfrac{3}{4}
\end{aligned}
\right\} \quad j = 2 \\[6pt]
&\left.
\begin{aligned}
u_{13} &= a_{13} - 0 = 3 \\
u_{23} &= a_{23} - (v_{21} u_{13}) = 1 - (3 \times 3) = -8 \\
u_{33} &= a_{33} - (v_{31} u_{13} + v_{32} u_{23}) = 3 - [(2 \times 3) + (\tfrac{3}{4} \times -8)] = 3 \\
v_{33} &= [a_{33} - (v_{31} u_{13} + v_{32} u_{23})]/u_{33} = [3 - (2 \times 3) - (-8 \times \tfrac{3}{4})]/3 = 1
\end{aligned}
\right\} \quad j = 3
\end{aligned}
\tag{2.2.35}
$$

Therefore we can write the original matrix A in accordance with equation (2.2.33) as

$$
A = \begin{pmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 2 & \tfrac{3}{4} & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -4 & -8 \\ 0 & 0 & 3 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 3 & (6-4) & (9-8) \\ 2 & (4-3) & (6-6+3) \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \\ 2 & 1 & 3 \end{pmatrix} . \tag{2.2.36}
$$
Here the explicit multiplication of the two factored matrices U and V demonstrates that the factoring has
been done correctly.
        Now we need to obtain the augmented constant vector $\vec{c}\,'$ specified by equations (2.2.31). These
equations must be solved recursively so that the results appear in the order in which they are needed. Thus
$$
\left.
\begin{aligned}
c'_1 &= (c_1 - 0)/v_{11} = 12/1 = 12 \\
c'_2 &= [c_2 - (v_{21} c'_1)]/v_{22} = [24 - (3 \times 12)]/1 = -12 \\
c'_3 &= [c_3 - (v_{31} c'_1 + v_{32} c'_2)]/v_{33} = [36 - (2 \times 12) + (12 \times \tfrac{3}{4})]/1 = 21
\end{aligned}
\right\} . \tag{2.2.37}
$$

Finally the complete solution can be obtained by back-solving the second set of equations (2.2.31) so that

$$
\left.
\begin{aligned}
x_3 &= c'_3/u_{33} = 21/3 = 7 \\
x_2 &= (c'_2 - u_{23} x_3)/u_{22} = [-12 + (8 \times 7)]/(-4) = -11 \\
x_1 &= (c'_1 - u_{12} x_2 - u_{13} x_3)/u_{11} = [12 - (2 \times -11) - (3 \times 7)]/1 = 13
\end{aligned}
\right\} . \tag{2.2.38}
$$

As anticipated, we have obtained the same solution as in equation (2.2.22). The strength of the Crout method
resides in the minimal number of operations required to solve a second set of equations differing only in the
constant vector. The factoring of the matrix remains the same and only the steps specified by equations
(2.2.37) and (2.2.38) need be repeated. In addition, the method is particularly stable.
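
The factoring and the two triangular solutions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `factor` and `solve_factored` are hypothetical, the division in the second of equations (2.2.30) is taken to be by the diagonal element of the column being processed (as in the worked example, which gives V a unit diagonal), and no row interchanges are attempted, so a zero diagonal element of U simply raises an error.

```python
def factor(a):
    """Factor a square matrix as A = V U, working column by column.

    U is upper triangular and V is lower triangular with a unit diagonal.
    """
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    v = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Elements of U in column j (rows i <= j).
        for i in range(j + 1):
            u[i][j] = a[i][j] - sum(v[i][k] * u[k][j] for k in range(i))
        if u[j][j] == 0.0:
            raise ZeroDivisionError("zero diagonal element; reorder the equations")
        # Elements of V in column j (rows i >= j).
        for i in range(j, n):
            v[i][j] = (a[i][j] - sum(v[i][k] * u[k][j] for k in range(j))) / u[j][j]
    return v, u

def solve_factored(v, u, c):
    """Solve V c' = c by forward substitution, then U x = c' by back substitution."""
    n = len(c)
    c_prime = [0.0] * n
    for i in range(n):
        c_prime[i] = (c[i] - sum(v[i][k] * c_prime[k] for k in range(i))) / v[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (c_prime[i] - sum(u[i][k] * x[k] for k in range(i + 1, n))) / u[i][i]
    return x

a = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 1.0, 3.0]]
v, u = factor(a)
x = solve_factored(v, u, [12.0, 24.0, 36.0])
print(x)
```

A second constant vector can be handled by calling `solve_factored` again with the same `v` and `u`, which is the saving described above.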

## e. The Solution of Tri-diagonal Systems of Linear Equations

        All the methods described so far generally require about n³ operations to obtain the solution.
However, there is one frequently occurring system of equations for which extremely efficient solution
algorithms exist. This system of equations is called tri-diagonal because there are never more than three
unknowns in any equation and they can be arranged so that the coefficient matrix is composed of non-zero
elements on the main diagonal and the diagonal immediately adjacent to either side. Thus such a system
would have the form

$$
\left.
\begin{array}{l}
a_{11} x_1 + a_{12} x_2 + 0 + 0 + \cdots + 0 = c_1 \\
a_{21} x_1 + a_{22} x_2 + a_{23} x_3 + 0 + \cdots + 0 = c_2 \\
0 + a_{32} x_2 + a_{33} x_3 + a_{34} x_4 + \cdots + 0 = c_3 \\
\qquad \vdots \\
0 + 0 + \cdots + a_{n-1\,n-2}\, x_{n-2} + a_{n-1\,n-1}\, x_{n-1} + a_{n-1\,n}\, x_n = c_{n-1} \\
0 + 0 + \cdots + 0 + a_{n\,n-1}\, x_{n-1} + a_{n\,n}\, x_n = c_n
\end{array}
\right\} . \tag{2.2.39}
$$

Equations of this type often occur as a result of using a finite difference operator to replace a differential
operator for the solution of differential equations (see chapter 5). A routine that performed straight Gauss
elimination would only be involved in one subtraction below the diagonal normalization element and so
would reach its 'triangular' form after n steps. Since the resulting equations would only contain two terms,
the back substitution would also only require two steps meaning that the entire process would require
something of the order of 3n steps for the entire solution. This is so very much more efficient than the
general solution and equations of this form occur sufficiently frequently that the student should be aware of
this specialized solution.
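
A minimal Python sketch of such a specialized routine follows. It is straight Gauss elimination restricted to the three diagonals, without any row interchanges, so it assumes that no zero appears on the diagonal during the elimination. The function name `solve_tridiagonal` and the argument layout (three separate lists for the sub-diagonal, main diagonal and super-diagonal) are choices made for this example only.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tri-diagonal system in a number of steps proportional to n.

    lower[i] is the coefficient a(i+1, i), diag[i] is a(i, i) and
    upper[i] is a(i, i+1), using zero-based row numbers.
    """
    n = len(diag)
    d = [float(value) for value in diag]
    c = [float(value) for value in rhs]
    # Forward elimination: one subtraction below each diagonal element.
    for i in range(1, n):
        if d[i - 1] == 0.0:
            raise ZeroDivisionError("zero diagonal element encountered")
        factor = lower[i - 1] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        c[i] -= factor * c[i - 1]
    # Back substitution: each remaining equation contains only two terms.
    x = [0.0] * n
    x[n - 1] = c[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (c[i] - upper[i] * x[i + 1]) / d[i]
    return x

# A small system of the kind produced by a second-difference operator.
solution = solve_tridiagonal([-1.0, -1.0, -1.0],
                             [2.0, 2.0, 2.0, 2.0],
                             [-1.0, -1.0, -1.0],
                             [0.0, 0.0, 0.0, 5.0])
print(solution)
```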

## 2.3 Solution of Linear Equations by Iterative Methods

So far we have dealt with methods that will provide a solution to a set of linear equations after a
finite number of steps (generally of the order of n³). The accuracy of the solution at the end of this sequence
of steps is fixed by the nature of the equations and to a lesser extent by the specific algorithm that is used.
We will now consider a series of algorithms that provide answers to a linear system of equations in
considerably fewer steps, but at a level of accuracy that will depend on the number of times the algorithm is
applied. Such methods are generally referred to as iterative methods and they usually require of the order of
n² steps for each iteration. Clearly for very large systems of equations, these methods may prove very much
faster than direct methods providing they converge quickly to an accurate solution.

## a. Solution by the Gauss and Gauss-Seidel Iteration Methods

All iterative schemes begin by assuming that an approximate answer is known and then the
scheme proceeds to improve that answer. Thus we will have a solution vector that is constantly changing
from iteration to iteration. In general, we will denote this by a superscript in parentheses so that $x^{(i)}$ will
denote the value of x at the ith iteration. Therefore in order to begin, we will need an initial value of the
solution vector $\vec{x}^{(0)}$. The concept of the Gauss iteration scheme is extremely simple. Take the system of
linear equations as expressed in equations (2.2.1) and solve each one for the diagonal value of x so that
$$ x_i = \frac{c_i - \sum_{j \ne i}^{n} a_{ij} x_j}{a_{ii}} . \tag{2.3.1} $$
Now use the components of the initial value of $\vec{x}^{(0)}$ on the right hand side of equation (2.3.1) to obtain an
improved value for the elements. This procedure can be repeated until a solution of the desired accuracy is
obtained. Thus the general iteration formula would have the form
$$ x_i^{(k)} = \frac{c_i - \sum_{j \ne i}^{n} a_{ij} x_j^{(k-1)}}{a_{ii}} . \tag{2.3.2} $$
It is clear that, should any of the diagonal elements be zero, there will be a problem with the stability of the
method. Thus the order in which the equations are arranged will make a difference in the manner in which
this scheme proceeds. One might suppose that the value of the initial guess might influence whether or not
the method would find the correct answer, but as we shall see in section 2.4 that is not the case. However, the
choice of the initial guess will determine the number of iterations required to arrive at an acceptable answer.

The Gauss-Seidel scheme is an improvement on the basic method of Gauss. Let us rewrite equations
(2.3.1) as follows:
$$ x_i^{(k)} = \frac{c_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k-1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k-1)}}{a_{ii}} . \tag{2.3.3} $$
When using this as a basis for an iteration scheme, we can note that all the values of $x_j$ in the first
summation for the kth iteration will have been determined before the value of $x_i^{(k)}$ so that we could write
the iteration scheme as
$$ x_i^{(k)} = \frac{c_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k-1)}}{a_{ii}} . \tag{2.3.4} $$
Here the improved values of xi are utilized as soon as they are obtained. As one might expect, this can
lead to a faster rate of convergence, but there can be a price for the improved speed. The Gauss-Seidel
scheme may not be as stable as the simple Gauss method. In general, there seems to be a trade off
between speed of convergence and the stability of iteration schemes.

        Indeed, if we were to apply either of the Gauss iterative methods to equations (2.2.13) that served
as an example for the direct method, we would find that the iterative solutions would not converge. We
shall see later (sections 2.3d and 2.4) that those equations fail to satisfy the simple sufficient convergence
criteria given in section 2.3d and the necessary and sufficient condition of section 2.4. With that in mind,
let us consider another 3×3 system of equations which does satisfy those conditions. These equations are
much more strongly diagonal than those of equation (2.2.13) so
$$
\left.
\begin{aligned}
3x_1 + x_2 + x_3 &= 8 \\
x_1 + 4x_2 + 2x_3 &= 15 \\
2x_1 + x_2 + 5x_3 &= 19
\end{aligned}
\right\} . \tag{2.3.5}
$$
For these equations, the solution under the Gauss-iteration scheme represented by equations (2.3.2) takes the
form
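$$
\left.
\begin{aligned}
x_1^{(k+1)} &= \left[ 8 - x_2^{(k)} - x_3^{(k)} \right] / 3 \\
x_2^{(k+1)} &= \left[ 15 - x_1^{(k)} - 2x_3^{(k)} \right] / 4 \\
x_3^{(k+1)} &= \left[ 19 - 2x_1^{(k)} - x_2^{(k)} \right] / 5
\end{aligned}
\right\} . \tag{2.3.6}
$$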

However, if we were to solve equations (2.3.5) by means of the Gauss-Seidel method the iterative equations
for the solution would be
$$
\left.
\begin{aligned}
x_1^{(k+1)} &= \left[ 8 - x_2^{(k)} - x_3^{(k)} \right] / 3 \\
x_2^{(k+1)} &= \left[ 15 - x_1^{(k+1)} - 2x_3^{(k)} \right] / 4 \\
x_3^{(k+1)} &= \left[ 19 - 2x_1^{(k+1)} - x_2^{(k+1)} \right] / 5
\end{aligned}
\right\} . \tag{2.3.7}
$$
If we take the initial guess to be
$$ x_1^{(0)} = x_2^{(0)} = x_3^{(0)} = 1 , \tag{2.3.8} $$
then repetitive use of equations (2.3.6) and (2.3.7) yields the results given in Table 2.1.

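Table 2.1

Convergence of Gauss and Gauss-Seidel Iteration Schemes

| k | 0 | 0 | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5 | 5 | 10 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scheme | G | GS | G | GS | G | GS | G | GS | G | GS | G | GS | G | GS |
| x1 | 1.00 | 1.00 | 2.00 | 2.00 | 0.60 | 0.93 | 1.92 | 0.91 | 0.71 | 0.98 | 1.28 | 1.00 | 0.93 | 1.00 |
| x2 | 1.00 | 1.00 | 3.00 | 2.75 | 1.65 | 2.29 | 2.64 | 2.03 | 1.66 | 1.99 | 2.32 | 2.00 | 1.92 | 2.00 |
| x3 | 1.00 | 1.00 | 3.20 | 2.45 | 1.92 | 2.97 | 3.23 | 3.03 | 2.51 | 3.01 | 3.18 | 3.00 | 2.95 | 3.00 |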

As is clear from the results labeled "G" in table 2.1, the Gauss-iteration scheme converges very
slowly. The correct solution which would eventually be obtained is
$$ \vec{x}^{(\infty)} = [1, 2, 3] . \tag{2.3.9} $$
There is a tendency for the solution to oscillate about the correct solution with the amplitude slowly damping
out toward convergence. However, the Gauss-Seidel iteration method damps this oscillation very rapidly by
employing the improved values of the solution as soon as they are obtained. As a result, the Gauss-Seidel
scheme has converged on this problem in about 5 iterations while the straight Gauss scheme still shows
significant error after 10 iterations.
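
The two schemes differ only in whether the new values are used immediately, which the following Python sketch tries to make explicit for the system (2.3.5). The function names `gauss_step`, `gauss_seidel_step` and `iterate` are hypothetical names chosen for this illustration, and the sketch makes no attempt to test for convergence; it simply applies a fixed number of iterations.

```python
def gauss_step(a, c, x):
    """One Gauss iteration, equation (2.3.2): only old values appear on the right."""
    n = len(c)
    return [(c[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)]

def gauss_seidel_step(a, c, x):
    """One Gauss-Seidel iteration, equation (2.3.4): new values are used at once."""
    n = len(c)
    x = list(x)
    for i in range(n):
        x[i] = (c[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
    return x

def iterate(step, a, c, x0, iterations):
    """Apply the chosen step function a fixed number of times."""
    x = list(x0)
    for _ in range(iterations):
        x = step(a, c, x)
    return x

a = [[3.0, 1.0, 1.0], [1.0, 4.0, 2.0], [2.0, 1.0, 5.0]]
c = [8.0, 15.0, 19.0]
start = [1.0, 1.0, 1.0]

for k in (1, 5, 10):
    print(k, iterate(gauss_step, a, c, start, k),
          iterate(gauss_seidel_step, a, c, start, k))
```

The first iteration of each scheme reproduces the k = 1 columns of the table for the same starting guess.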

## b. The Method of Hotelling and Bodewig

Assume that the correct solution to equation (2.2.3) can be written as

$$ \vec{x}_c = A^{-1}\vec{c} , \tag{2.3.10} $$
but that the actual solution that is obtained by matrix inversion is really
$$ \vec{x}^{(k)} = (A^{-1})^{(k)}\vec{c} . \tag{2.3.11} $$
Substitution of this solution into the original equations would yield a slightly different constant vector,
namely
$$ \vec{c}^{(k)} = A\vec{x}^{(k)} . \tag{2.3.12} $$
Let us define a residual vector in terms of the constant vector we started with and the one that results from
the substitution of the approximate solution into the original equations so that
$$ \vec{R}^{(k)} = \vec{c}^{(k)} - \vec{c} = A\vec{x}^{(k)} - A\vec{x}_c = A(\vec{x}^{(k)} - \vec{x}_c) . \tag{2.3.13} $$
Solving this for the true solution $\vec{x}_c$ we get
$$ \vec{x}_c = \vec{x}^{(k)} - [A^{-1}]^{(k)}\vec{R}^{(k)} = \vec{x}^{(k)} - [A^{-1}]^{(k)}\vec{c}^{(k)} + [A^{-1}]^{(k)}\vec{c} = [A^{-1}]^{(k)}[2\vec{c} - \vec{c}^{(k)}] . \tag{2.3.14} $$
The solution of equation (2.3.13) will involve basically the same steps as required to solve equation (2.3.11).
Thus the quantity $(\vec{x}^{(k)} - \vec{x}_c)$ will be found with the same accuracy as $\vec{x}^{(k)}$ providing $\vec{R}^{(k)}$ is not too large.

Now we can write $\vec{c}^{(k)}$ in terms of $\vec{c}$ by using equations (2.3.11, 12) and get
$$ \vec{c}^{(k)} = A\vec{x}^{(k)} = A[A^{-1}]^{(k)}\vec{c} . \tag{2.3.15} $$
Using this result to eliminate $\vec{c}^{(k)}$ from equation (2.3.14) we can write the "correct" solution $\vec{x}_c$ in terms of
the approximate matrix inverse $[A^{-1}]^{(k)}$ as
$$ \vec{x}_c = [A^{-1}]^{(k)}\{2 \times \mathbf{1} - A[A^{-1}]^{(k)}\}\vec{c} . \tag{2.3.16} $$

Here 1 denotes the unit matrix with elements equal to the Kronecker delta δij. Round-off error and other
problems that gave rise to the initially inaccurate answer will in reality keep $\vec{x}_c$ from being the correct
answer, but it may be regarded as an improvement over the original solution. It is tempting to use equation
(2.3.16) as the basis for a continuous iteration scheme, but in practice very little improvement can be made
over a single application as the errors that prevent equation (2.3.16) from producing the correct answer will
prevent any further improvement over a single iteration.

        If we compare equations (2.3.10) and (2.3.16), we see that this method provides us with a
mechanism for improving the inverse of a matrix since
$$ A^{-1} = [A^{-1}]^{(k)}\{2 \times \mathbf{1} - A[A^{-1}]^{(k)}\} . \tag{2.3.17} $$

All of the problems of using equation (2.3.16) as an iteration formula are present in equation (2.3.17).
However, the matrix inverse as obtained from equation (2.3.17) should be an improvement over $[A^{-1}]^{(k)}$.

To see how this method works, consider the equations used to demonstrate the Gauss-Jordan and
Crout methods. The exact matrix inverse is given in equations (2.2.22) so we will be able to compare the
iterated matrix with the correct value to judge the improvement. For demonstration purposes, assume that the
inverse in equation (2.2.22) is known only to two significant figures so that

$$
(A^{-1})^{(k)} = \begin{pmatrix} -0.42 & 0.25 & 0.33 \\ 0.58 & 0.25 & -0.67 \\ 0.08 & -0.25 & 0.33 \end{pmatrix} . \tag{2.3.18}
$$

Taking the constant vector to be the same as equation (2.2.13), the solution obtained from the imperfect
matrix inverse would be
$$
\vec{x}^{(k)} = (A^{-1})^{(k)}\vec{c} = \begin{pmatrix} -0.42 & 0.25 & 0.33 \\ 0.58 & 0.25 & -0.67 \\ 0.08 & -0.25 & 0.33 \end{pmatrix}
\begin{pmatrix} 12 \\ 24 \\ 36 \end{pmatrix} = \begin{pmatrix} 12.84 \\ -11.16 \\ 6.84 \end{pmatrix} , \tag{2.3.19}
$$
and substitution of this solution into the original equations [i.e. equation (2.2.13)] will yield the constant
vector $\vec{c}^{(k)}$ with the elements
$$
A\vec{x}^{(k)} = \vec{c}^{(k)} = \begin{pmatrix} 1.0 & 2.0 & 3.0 \\ 3.0 & 2.0 & 1.0 \\ 2.0 & 1.0 & 3.0 \end{pmatrix}
\begin{pmatrix} 12.84 \\ -11.16 \\ 6.84 \end{pmatrix} = \begin{pmatrix} 11.04 \\ 23.04 \\ 35.04 \end{pmatrix} , \tag{2.3.20}
$$
that are used to obtain the residual vector in equation (2.3.13).

The method of Hotelling and Bodewig operates by basically finding an improved value for the
matrix inverse and then using that with the original constant vector to obtain an improved solution.
Therefore, using equation (2.3.17) to improve the matrix inverse we get

$$ A^{-1} = [A^{-1}]^{(k)}\{2 \times \mathbf{1} - A[A^{-1}]^{(k)}\} , $$
or for example
$$
\begin{aligned}
A^{-1} &= [A^{-1}]^{(k)} \left[ \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}
- \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \\ 2 & 1 & 3 \end{pmatrix}
\begin{pmatrix} -0.42 & 0.25 & 0.33 \\ 0.58 & 0.25 & -0.67 \\ 0.08 & -0.25 & 0.33 \end{pmatrix} \right] \\
&= [A^{-1}]^{(k)} \left[ \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{pmatrix}
- \begin{pmatrix} 0.98 & 0.00 & -0.02 \\ -0.02 & 1.00 & -0.02 \\ -0.02 & 0.00 & 0.98 \end{pmatrix} \right] \\
&= [A^{-1}]^{(k)} \begin{pmatrix} 1.02 & 0.00 & 0.02 \\ 0.02 & 1.00 & 0.02 \\ 0.02 & 0.00 & 1.02 \end{pmatrix}
\end{aligned}
\tag{2.3.21}
$$

$$
A^{-1} = \begin{pmatrix} -0.42 & 0.25 & 0.33 \\ 0.58 & 0.25 & -0.67 \\ 0.08 & -0.25 & 0.33 \end{pmatrix}
\begin{pmatrix} 1.02 & 0.00 & 0.02 \\ 0.02 & 1.00 & 0.02 \\ 0.02 & 0.00 & 1.02 \end{pmatrix}
= \begin{pmatrix} -0.4168 & 0.2500 & 0.3332 \\ 0.5832 & 0.2500 & -0.6668 \\ 0.0832 & -0.2500 & 0.3332 \end{pmatrix} . \tag{2.3.22}
$$

This can be compared with the six figure version of the exact inverse from equation (2.2.22) which is

$$
A^{-1} = \begin{pmatrix} -0.416667 & 0.250000 & 0.333333 \\ 0.583333 & 0.250000 & -0.666667 \\ 0.083333 & -0.250000 & 0.333333 \end{pmatrix} . \tag{2.3.23}
$$

Every element experienced a significant improvement over the two figure value [equation (2.3.18)]. It is
interesting that the elements of the original inverse for which two figures yield an exact result (i.e.
$a^{-1}_{12}$, $a^{-1}_{22}$, $a^{-1}_{32}$) remain unchanged. This result can be traced back to the augmentation matrix [i.e. the right
hand matrix in equation (2.3.21) third line]. The second column is identical to the unit matrix so that the
second column of the initial inverse will be left unchanged.

We may now use this improved inverse to re-calculate the solution from the initial constant vector
and get
$$
\vec{x}_c = A^{-1}\vec{c} = \begin{pmatrix} -0.4168 & 0.2500 & 0.3332 \\ 0.5832 & 0.2500 & -0.6668 \\ 0.0832 & -0.2500 & 0.3332 \end{pmatrix}
\begin{pmatrix} 12 \\ 24 \\ 36 \end{pmatrix} = \begin{pmatrix} 12.99 \\ -11.00 \\ 6.994 \end{pmatrix} . \tag{2.3.24}
$$

As one would expect from the improved matrix inverse, the solution represents a significant improvement
over the initial values given by equation (2.3.19). Indeed the difference between this solution and the exact
solution given by equation (2.2.22) is in the fifth significant figure, which is smaller than the calculation accuracy
used to obtain the improved inverse. Thus we see that the method of Hotelling and Bodewig is a powerful
algorithm for improving a matrix inverse and hence a solution to a system of linear algebraic equations.
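
A single application of equation (2.3.17) is easy to sketch in Python. The helper names `matmul` and `improve_inverse` are hypothetical and chosen for this illustration; the data are the coefficient matrix of the worked example and the two figure inverse of equation (2.3.18).

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def improve_inverse(a, approx_inv):
    """One application of equation (2.3.17): B (2 x 1 - A B) for an approximate inverse B."""
    n = len(a)
    product = matmul(a, approx_inv)
    correction = [[(2.0 if i == j else 0.0) - product[i][j] for j in range(n)]
                  for i in range(n)]
    return matmul(approx_inv, correction)

a = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 1.0, 3.0]]
two_figure_inverse = [[-0.42, 0.25, 0.33],
                      [0.58, 0.25, -0.67],
                      [0.08, -0.25, 0.33]]

improved = improve_inverse(a, two_figure_inverse)
for row in improved:
    print([round(element, 4) for element in row])
```

Rounded to four figures, the rows printed agree with the right hand side of equation (2.3.22).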

## c. Relaxation Methods for the Solution of Linear Equations

The Method of Hotelling and Bodewig is basically a specialized relaxation technique and
such techniques can be used with virtually any iteration scheme. In general, relaxation methods tend to play
off speed of convergence for stability. Rather than deal with the general theory of relaxation techniques, we
will illustrate them by their application to linear equations.
        As in equation (2.3.13) we can define a residual vector $\vec{R}^{(k)}$ as
$$ \vec{R}^{(k)} = A\vec{x}^{(k)} - \vec{c} . \tag{2.3.25} $$
Let us assume that each element of the solution vector $\vec{x}^{(k)}$ is subject to an improvement $\delta\vec{x}^{(k)}$ so that
$$ x_j^{(k+1)} = x_j^{(k)} + \delta x_j . \tag{2.3.26} $$
Since each element of the solution vector may appear in each equation, a single correction to an element can
change the entire residual vector. The elements of the new residual vector will differ from the initial residual
vector by an amount
$$ \delta R_{im} = -a_{im}\,\delta x_m . \tag{2.3.27} $$

Now search the elements of the matrix $\delta R_{im}$ over the index m for the largest value of $\delta R_{im}$ and reduce the
corresponding residual by that maximum value so that
$$ \rho_i^{(k)} \equiv -R_i^{(k)} \Big/ \max_m\,(-\delta R_{im}) . \tag{2.3.28} $$
The parameter $\rho_i$ is known as the relaxation parameter for the ith equation and may change from iteration
to iteration. The iteration formula then takes the form
$$ x_j^{(k+1)} = x_j^{(k)} + \rho_j^{(k)}\,\delta x_j . \tag{2.3.29} $$

Clearly the smaller ρi is, the smaller the correction to xi will be and the longer the iteration will take
to converge. The advantage of this technique is that it treats each unknown in an individual manner and thus
tends to be extremely stable.

Providing a specific example of a relaxation process runs the risk of appearing to limit the concept.
Unlike the other iterative procedures we have described, relaxation schemes leave the choice of the
correction to the elements of the solution vector completely arbitrary. However, having picked the
corrections, the scheme describes how much of them to apply by calculating a relaxation parameter for each
element of the solution vector. While convergence of these methods is generally slow, their stability is often
quite good. We shall demonstrate that by applying the method to the same system of equations used to
demonstrate the other iterative processes [i.e. equations (2.3.5)].


        We begin by choosing the same initial solution that we used for the initial guess of the iterative
schemes [i.e. $\vec{x}$ = (1, 1, 1)]. Inserting that initial guess into equation (2.3.5), we obtain the approximate
constant vector $\vec{c}^{(k)}$, which yields a residual vector
$$
\vec{R}_0 = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 4 & 2 \\ 2 & 1 & 5 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} - \begin{pmatrix} 8 \\ 15 \\ 19 \end{pmatrix}
= \begin{pmatrix} 5 \\ 7 \\ 8 \end{pmatrix} - \begin{pmatrix} 8 \\ 15 \\ 19 \end{pmatrix}
= \begin{pmatrix} -3 \\ -8 \\ -11 \end{pmatrix} . \tag{2.3.30}
$$

It should be emphasized that the initial guesses are somewhat arbitrary as they only define a place from
which to start the iteration scheme. However, we will be able to compare the results given in Table 2.2 with
the other iterative methods.

We will further arbitrarily choose to vary all the unknowns by the same amount so that

$$ \delta x_m = 0.3 . \tag{2.3.31} $$

Now calculate the variational residual matrix specified by equation (2.3.27) and get
$$
\delta R_{im} = -\begin{pmatrix} 3 & 1 & 1 \\ 1 & 4 & 2 \\ 2 & 1 & 5 \end{pmatrix} \times 0.3
= -\begin{pmatrix} 0.9 & 0.3 & 0.3 \\ 0.3 & 1.2 & 0.6 \\ 0.6 & 0.3 & 1.5 \end{pmatrix} . \tag{2.3.32}
$$

The element of the matrix with the largest magnitude is δR33 = 1.5. We may now calculate the elements of
the relaxation vector in accordance with equation (2.3.28) and modify the solution vector as in equation
(2.3.29). Repeating this process we get the results in Table 2.2

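Table 2.2

Sample Iterative Solution for the Relaxation Method

| k | 0 | 0 | 1 | 1 | 4 | 4 | 7 | 7 | 10 | 10 | ∞ | ∞ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i | xi | ρi | xi | ρi | xi | ρi | xi | ρi | xi | ρi | xi | ρi |
| 1 | 1.00 | 2.00 | 1.60 | -1.07 | 1.103 | -0.02 | 1.036 | -.107 | 0.998 | .006 | 1.00 | 0.00 |
| 2 | 1.00 | 4.44 | 2.33 | -1.55 | 2.072 | +0.41 | 2.07 | -.224 | 2.00 | .002 | 2.00 | 0.00 |
| 3 | 1.00 | 7.33 | 3.20 | -7.02 | 3.011 | -.119 | 3.01 | -.119 | 2.99 | .012 | 3.00 | 0.00 |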

We see that the solution does indeed converge at a rate that is intermediate between that obtained for the Gauss
method and that of the Gauss-Seidel method. This application of relaxation techniques allows the relaxation
vector to change, approaching zero as the solution converges. Another approach is to use the relaxation
parameter to change the correction given by another type of iteration scheme such as Gauss-Seidel. Under
these conditions, it is the relaxation parameter that is chosen and usually held constant while the corrections
approach zero.
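
The second approach, a fixed relaxation parameter applied to the Gauss-Seidel correction, can be sketched in Python as follows. This is an illustration only: the function name `relaxed_gauss_seidel` and the choice ρ = 0.8 are assumptions made for the example, and the scheme shown is the common one in which each new value is the old value plus ρ times the Gauss-Seidel correction.

```python
def relaxed_gauss_seidel(a, c, x0, rho, iterations):
    """Gauss-Seidel iteration with each correction scaled by a fixed parameter rho."""
    n = len(c)
    x = list(x0)
    for _ in range(iterations):
        for i in range(n):
            # Value the plain Gauss-Seidel scheme would assign to x[i].
            gs_value = (c[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            # Apply only the fraction rho of the correction.
            x[i] += rho * (gs_value - x[i])
    return x

a = [[3.0, 1.0, 1.0], [1.0, 4.0, 2.0], [2.0, 1.0, 5.0]]
c = [8.0, 15.0, 19.0]
start = [1.0, 1.0, 1.0]

print(relaxed_gauss_seidel(a, c, start, 1.0, 1))    # rho = 1 is plain Gauss-Seidel
print(relaxed_gauss_seidel(a, c, start, 0.8, 200))  # under-relaxed iteration
```

With ρ = 1 the first iteration is identical to the Gauss-Seidel step of equations (2.3.7), which gives a simple check on the implementation.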

There are many ways to arrive at suitable values for the relaxation parameter but the result will
usually be in the range ½ρ½. For values of ρ<½, the rate of convergence is so slow that one is not sure when
the solution has been obtained. On rare occasions one may choose a relaxation parameter greater than unity.
Such a procedure is said to be over relaxed and is likely to become unstable. If ρ ≥ 2, then instability is
almost guaranteed. We have said a great deal about convergence, but little that is quantitative so let us turn to
a brief discussion of convergence within the confines of fixed-point iteration theory.

## d. Convergence and Fixed-point Iteration Theory

The problems of deciding when a correct numerical solution to a system of equations has
been reached are somewhat more complicated for the iterative methods than with the direct methods. Not
only does the practical problem of what constitutes a sufficiently accurate solution have to be dealt with, but
the problem of whether or not the iteration method is approaching that solution has to be solved. The
iteration method will most certainly produce a new solution set, but whether that set is any closer to the
correct set is not immediately obvious. However, we may look to fixed-point iteration theory for some help
with this problem.

Just as there is a large body of knowledge connected with relaxation theory, there is an equally large
body of knowledge relating to fixed-point iteration theory$^2$. Before looking at iteration methods of many
variables such as the Gauss iteration scheme, let us consider a much simpler iteration scheme of only one
variable. We could write such a scheme as
$$ x^{(k)} = \Phi[x^{(k-1)}] . \tag{2.3.33} $$

Here $\Phi[x^{(k-1)}]$ is any function or algorithm that produces a new value of x based on a previous value. Such a
function is said to possess a fixed-point $x_0$ if
$$ x_0 = \Phi(x_0) . \tag{2.3.34} $$

If Φ(x) provides a steady succession of values of x that approach the fixed-point x0, then it can be said to be
a convergent iterative function. There is a little-known theorem which states that a necessary and sufficient
condition for Φ(x) to be a convergent iterative function is

$$ \left| \frac{d\Phi(x)}{dx} \right| < 1 \quad \forall x \in x^{(k)} \le x \le x_0 . \tag{2.3.35} $$
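
A one-variable example may help to fix ideas. The Python sketch below iterates Φ(x) = cos x, a function chosen purely for illustration; its derivative, −sin x, has magnitude less than one everywhere between the starting value and the fixed-point, so the condition (2.3.35) is met. The function name `fixed_point`, the tolerance and the iteration limit are all assumptions of this example.

```python
import math

def fixed_point(phi, x_start, tolerance=1e-12, max_iterations=1000):
    """Iterate x = phi(x) until successive values differ by less than the tolerance."""
    x = x_start
    for k in range(1, max_iterations + 1):
        x_new = phi(x)
        if abs(x_new - x) < tolerance:
            return x_new, k
        x = x_new
    raise RuntimeError("iteration did not converge within the allowed steps")

x0, steps = fixed_point(math.cos, 1.0)
print(x0, steps)
print(abs(math.sin(x0)))  # magnitude of the derivative at the fixed-point
```

The printed derivative magnitude is roughly two thirds, so each iteration reduces the error by about that factor near the fixed-point.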

For an iteration scheme of many variables of the form
$$ x_i^{(k+1)} = \Phi_i(x_j^{(k)}) , \tag{2.3.36} $$
the theorem becomes
$$ \sum_{j=1}^{n} \left| \frac{d\Phi_i(x_i)}{dx_j} \right| < 1 , \quad \forall x_i \in x_i^{(k)} \le x_i \le x_{i0} . \tag{2.3.37} $$

However, it no longer provides necessary conditions, only sufficient ones. If we apply this to the Gauss
iteration scheme as described by equation (2.3.1) we have

$$ \sum_{j \ne i}^{n} \left| \frac{a_{ij}}{a_{ii}} \right| < 1 , \quad \forall i . \tag{2.3.38} $$

It is clear that the convergence process is strongly influenced by the size of the diagonal elements present in
the system of equations. Thus the equations should be initially arranged so that the largest possible elements
are present on the main diagonal of the coefficient matrix. Since the equations are linear, the sufficient
condition given in equation (2.3.38) means that the convergence of a system of equations under the Gauss
iteration scheme is independent of the solution and hence the initial guess. If equation (2.3.38) is satisfied
then the Gauss iteration method is guaranteed to converge. However, the number of iterations required to
achieve that convergence will still depend on the accuracy of the initial guess.

If we apply these conditions to equations (2.2.13) which we used to demonstrate the direct methods
of solution, we find that
$$ \sum_{j \ne i}^{n} \left| \frac{a_{ij}}{a_{ii}} \right| = \begin{pmatrix} 5 \\ 2 \\ 1 \end{pmatrix} . \tag{2.3.39} $$
Each equation fails to satisfy the sufficient convergence criteria given in equation (2.3.38). Thus it is
unlikely that these equations can be solved by most iterative techniques. The fact that the method of
Hotelling and Bodewig gave a significantly improved solution is a testament to the stability of that method.
However, it must be remembered that the method of Hotelling and Bodewig is not meant to be used in an
iterative fashion so comparison of iterative techniques with it is not completely justified.
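
The row sums of equation (2.3.38) are simple to evaluate before committing to an iterative method. The Python sketch below does so for the two 3×3 systems used in this chapter; the function names `convergence_sums` and `satisfies_sufficient_condition` are hypothetical and serve only this illustration.

```python
def convergence_sums(a):
    """Sum of |a_ij / a_ii| over the off-diagonal elements of each row."""
    n = len(a)
    return [sum(abs(a[i][j]) for j in range(n) if j != i) / abs(a[i][i])
            for i in range(n)]

def satisfies_sufficient_condition(a):
    """True when every row sum is strictly less than one, as in equation (2.3.38)."""
    return all(total < 1.0 for total in convergence_sums(a))

direct_example = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 1.0, 3.0]]     # equations (2.2.13)
iterative_example = [[3.0, 1.0, 1.0], [1.0, 4.0, 2.0], [2.0, 1.0, 5.0]]  # equations (2.3.5)

print(convergence_sums(direct_example), satisfies_sufficient_condition(direct_example))
print(convergence_sums(iterative_example), satisfies_sufficient_condition(iterative_example))
```

Because the condition is only sufficient, a failure here does not prove that an iteration diverges; it only removes the guarantee of convergence.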

        The sufficient convergence criterion given by equation (2.3.38) essentially says that if the sum of the
absolute values of the off-diagonal elements of every row is less than the absolute value of the diagonal
element, then the iteration sequence will converge. The necessary and sufficient condition for convergence
of this and the Gauss-Seidel scheme is that the eigenvalues of the corresponding iteration matrix all have absolute value less than one.
Thus it is appropriate that we spend a little time to define what eigenvalues are, their importance to science,
and how they may be obtained.

## 2.4 The Similarity Transformations and the Eigenvalues and Vectors of a Matrix

In Chapter 1 (section 1.3) we saw that it is often possible to represent one vector in terms of another
by means of a system of linear algebraic equations which we called a coordinate transformation. If this
transformation preserved the length of the vector, it was called an orthonormal transformation and the matrix
of the transformation coefficients had some special properties. Many problems in science can be represented
in terms of linear equations of the form
$$ \vec{y} = A\vec{x} . \tag{2.4.1} $$
In general, these problems could be made much simpler by finding a coordinate frame so that each element
of the transformed vector is proportional to the corresponding element of the original vector. In other words,
does there exist a space wherein the basis vectors are arranged so that the transformation is a diagonal matrix
of the form
$$ \vec{y}\,' = S\vec{x}\,' , \tag{2.4.2} $$
where $\vec{x}\,'$ and $\vec{y}\,'$ represent the vectors $\vec{x}$ and $\vec{y}$ in this new space where the transformation matrix becomes
diagonal. Such a transformation is called a similarity transformation as each element of $\vec{y}\,'$ would be similar
(proportional) to the corresponding element of $\vec{x}\,'$. Now the space in which we express $\vec{x}$ and $\vec{y}$ is defined by a
set of basis vectors $\hat{e}_i$ and the space in which $\vec{x}\,'$ and $\vec{y}\,'$ are expressed is spanned by $\hat{e}'_i$. If we let the
transformation that relates the unprimed and primed coordinate frames be D, then the basis vectors are
related by
$$
\left.
\begin{aligned}
\hat{e}'_i &= \sum_j d_{ij}\,\hat{e}_j \\
\vec{e}\,' &= D\vec{e}
\end{aligned}
\right\} . \tag{2.4.3}
$$
Any linear transformation that relates the basis vectors of two coordinate frames will transform any vector
from one frame to the other. Therefore
$$
\left.
\begin{aligned}
\vec{x} &= D^{-1}\vec{x}\,' \\
\vec{y} &= D^{-1}\vec{y}\,'
\end{aligned}
\right\} . \tag{2.4.4}
$$
If we use the results of equation (2.4.4) to eliminate $\vec{x}$ and $\vec{y}$ from equation (2.4.1) in favor of $\vec{x}\,'$ and $\vec{y}\,'$ we
get
$$ \vec{y}\,' = [DAD^{-1}]\vec{x}\,' = S\vec{x}\,' . \tag{2.4.5} $$
Comparing this result with equation (2.4.2) we see that the conditions for S to be diagonal are
$$ DAD^{-1} = S , \tag{2.4.6} $$
which we can rewrite as
$$ AD^T = D^T S . \tag{2.4.7} $$
Here we have made use of an implicit assumption that the transformations are orthonormal and so preserve
the length of vectors. Thus the conditions that lead to equation (1.3.8) are met and $\mathbf{D}^{-1} = \mathbf{D}^{T}$. We can write
these equations in component form as
$$\sum_{k=1}^{n} a_{ik} d_{jk} = d_{ji} s_{jj} = \sum_{k=1}^{n} d_{jk}\delta_{ki} s_{jj} , \quad i = 1 \cdots n, \; j = 1 \cdots n . \qquad (2.4.8)$$

These are n systems of linear homogeneous equations of the form
$$\sum_{k=1}^{n} (a_{ik} - \delta_{ki} s_{jj})\, d_{jk} = 0 , \quad i = 1 \cdots n, \; j = 1 \cdots n , \qquad (2.4.9)$$

which have a nontrivial solution if and only if
$$\mathrm{Det}\left| a_{ik} - \delta_{ki} s_{jj} \right| = 0 , \quad \forall j . \qquad (2.4.10)$$

Now the nature of D and S depends only on the matrix A and in no way on the values of $\vec{x}$ or $\vec{y}$.
Thus they may be regarded as properties of the matrix A. The elements $s_{jj}$ are known as the eigenvalues (also
as the proper values or characteristic values) of A, while the rows of D (equivalently, the columns of $\mathbf{D}^{T}$) are called the
eigenvectors (or proper vectors or characteristic vectors) of A. In addition, equation (2.4.10) is known as the
eigen (or characteristic) equation of the matrix A. It is not obvious that a similarity transformation exists for
all matrices and indeed, in general they do not. However, should the matrix be symmetric, then such a
transformation is guaranteed to exist. Equation (2.4.10) suggests the manner by which we can find the
eigenvalues of a matrix. The expansion of equation (2.4.10) by minors as in equation (1.2.10), or more
generally in equation (2.2.5), makes it clear that the resulting expression will be a polynomial of degree n in
$s_{jj}$ which will have n roots which are the eigenvalues. Thus one approach to finding the eigenvalues of a
matrix is equivalent to finding the roots of the eigen-equation (2.4.10). We shall say more about finding the
roots of a polynomial in the next chapter so for the moment we will restrict ourselves to some special
techniques for finding the eigenvalues and eigenvectors of a matrix.

We saw in section (2.2c) that diagonalization of a matrix will not change the value of its
determinant. Since the application of the transformation matrix D and its inverse effectively accomplishes a
diagonalization of A to the matrix S we should expect the determinant to remain unchanged. Since the
determinant of S will just be the product of the diagonal elements we can write

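$$\mathrm{Det}\,\mathbf{A} = \prod_i s_{ii} . \qquad (2.4.11)$$
The trace of a matrix is likewise unchanged by a similarity transformation, so that
$$\mathrm{Tr}\,\mathbf{A} = \sum_i s_{ii} . \qquad (2.4.12)$$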

These two constraints are always enough to enable one to find the eigenvalues of a 2 × 2 matrix and may be
used to reduce the eigen-equation by two in its degree. However, for the more interesting case where n is
large, we shall have to find a more general method. Since any such method will be equivalent to finding the
roots of a polynomial, we may expect such methods to be fairly complicated as finding the roots of
polynomials is one of the trickiest problems in numerical analysis. So it is with finding the eigenvalues of a
matrix.
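
For the 2 × 2 case the two constraints fix the eigenvalues completely: since their sum is the trace and their product is the determinant, they are the two roots of s² - (Tr A)s + Det A = 0. A minimal Python sketch of that observation follows; the function name `eigenvalues_2x2` and the sample matrix are illustrative assumptions, not taken from the text.

```python
import cmath

def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix from its trace and determinant alone."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    # s1 + s2 = trace and s1 * s2 = det, so s solves s^2 - trace*s + det = 0.
    # cmath.sqrt is used so that complex eigenvalue pairs are also returned.
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

# Hypothetical sample matrix chosen only for illustration.
s1, s2 = eigenvalues_2x2([[2.0, 1.0], [1.0, 2.0]])
print(s1, s2)
```

Checking that `s1 + s2` and `s1 * s2` reproduce the trace and determinant is the same test the text recommends applying to the output of any eigenvalue package.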

While we noted that the transformation that gives rise to S is a similarity transformation [equation
(2.4.6)], not all similarity transformations need diagonalize a matrix, but simply have the form
$$\mathbf{B}^{-1}\mathbf{A}\mathbf{B} = \mathbf{Q} . \qquad (2.4.13)$$
The invariance of the eigenvalues to similarity transformations provides the basis for the general strategy
employed by most "canned" eigenvalue programs. The basic idea is to force the matrix A toward diagonal
form by employing a series of similarity transformations. The details of such procedures are well beyond the
scope of this book but can be found in the references suggested at the end of this chapter$^{3,4}$. However,
whatever approach is selected, the prudent investigator will see how well the constraints given by equations
(2.4.11, 12) are met before being satisfied that the "canned" package has actually found the correct
eigenvalues of the matrix.

Having found the eigenvalues, the corresponding eigenvectors can be found by appealing to
equation (2.4.9). However, these equations are still homogeneous, implying that the elements of the
eigenvectors are not uniquely determined. Indeed, it is the magnitude of the eigenvector that is usually
considered to be unspecified so that all that is missing is a scale factor to be applied to each eigenvector. A
common approach is to simply define one of the elements of the eigenvector to be unity thereby making the
system of equations (2.4.9) nonhomogeneous and of the form
$$\sum_{k=2}^{n} (a_{ik} - \delta_{ik} s_{jj})\, d_{jk}/d_{j1} = -a_{i1} . \qquad (2.4.14)$$

In this form the elements of the eigenvector will be found relative to the element $d_{j1}$.

Let us conclude our discussion of eigenvalues and eigen-vectors by again considering the matrix of
the equations (2.2.13) used to illustrate the direct solution schemes. We have already seen from equation
(2.3.39) that these equations failed the sufficient conditions for the existence of Gauss-Seidel iterative
solution. By evaluating the eigenvalues for the matrix we can evaluate the necessary and sufficient
conditions for convergence, namely that the eigenvalues all be positive and less than unity.

The matrix for equations (2.2.13) is
$$\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \\ 2 & 1 & 3 \end{pmatrix} , \qquad (2.4.14)$$
so that the eigen-equation delineated by equation (2.4.10) becomes
$$\mathrm{Det}\begin{vmatrix} (1-s) & 2 & 3 \\ 3 & (2-s) & 1 \\ 2 & 1 & (3-s) \end{vmatrix} = -s^3 + 6s^2 + 2s - 12 = 0 . \qquad (2.4.15)$$
The cubic polynomial that results has three roots which are the eigenvalues of the matrix. However before
solving for the eigenvalues we can evaluate the constraints given by equations (2.4.11) and (2.4.12) and get
$$\left. \begin{aligned} \mathrm{Det}\,\mathbf{A} &= \prod_i s_{ii} = -12 \\ \mathrm{Tr}\,\mathbf{A} &= \sum_i s_{ii} = +6 \end{aligned} \right\} . \qquad (2.4.16)$$
The determinant tells us that the eigenvalues cannot all be positive so that the necessary and sufficient
conditions for the convergence of Gauss-Seidel are not fulfilled, confirming the result of the sufficient condition
given by equation (2.3.39). The constraints given by equation (2.4.16) can also aid us in finding roots for the
eigen-equation (2.4.15). The fact that the product of the roots is the negative of twice their sum suggests that
two of the roots occur as a pair with opposite sign. This conjecture is supported by Descartes' "rule of signs"
discussed in the next chapter (section 3.1a). With that knowledge coupled with the values for the trace and
determinant we find that the roots are
$$s_i = \begin{pmatrix} 6 \\ +\sqrt{2} \\ -\sqrt{2} \end{pmatrix} . \qquad (2.4.17)$$
Thus, not only does one of the eigenvalues violate the necessary and sufficient convergence criteria by being
negative, they all do as they all have a magnitude greater than unity.

We may complete the study of this matrix by finding the eigen-vectors with the aid of equation
(2.4.9) so that
$$\begin{pmatrix} (1-s) & 2 & 3 \\ 3 & (2-s) & 1 \\ 2 & 1 & (3-s) \end{pmatrix} \begin{pmatrix} d_{i1} \\ d_{i2} \\ d_{i3} \end{pmatrix} = 0 . \qquad (2.4.18)$$

As we noted earlier, these equations are homogeneous so that they have no unique solution. This means that
the length of the eigen-vectors is indeterminate. Many authors normalize them so that they are of unit length
thereby constituting a set of unit basis vectors for further analysis. However, we shall simply take one
component $d_{i1}$ to be unity thereby reducing the 3 × 3 system of homogeneous equations (2.4.18) to a 2 × 2
system of inhomogeneous equations,

$$\left. \begin{aligned} (2 - s_i)\, d_{i2} + d_{i3} &= -3 \\ d_{i2} + (3 - s_i)\, d_{i3} &= -2 \end{aligned} \right\} , \qquad (2.4.19)$$

which have a unique solution for the remaining elements of the eigen-vectors. For our example the solution
is
$$\left. \begin{aligned} s_1 = +6 &: \ \vec{D}_1 = [\,1.0,\ 1.0,\ 1.0\,] \\ s_2 = +\sqrt{2} &: \ \vec{D}_2 = [\,1.0,\ -(7 - 3\sqrt{2})/(7 - 5\sqrt{2}),\ +(2\sqrt{2} - 1)/(7 - 5\sqrt{2})\,] \\ s_3 = -\sqrt{2} &: \ \vec{D}_3 = [\,1.0,\ -(7 + 3\sqrt{2})/(7 + 5\sqrt{2}),\ -(2\sqrt{2} + 1)/(7 + 5\sqrt{2})\,] \end{aligned} \right\} . \qquad (2.4.20)$$

Should one wish to re-normalize these vectors to be unit vectors, one need only divide each element by the
magnitude of the vectors. Each eigenvalue has its own associated eigen-vector so that equation (2.4.20)
completes the analysis of the matrix A.
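
The analysis above can be checked numerically. The sketch below, in plain Python, solves the 2 × 2 system (2.4.19) by Cramer's rule for each eigenvalue, confirms that the first row of (2.4.18) (which was not used in the solution) is also satisfied, and tests the constraints (2.4.16). The helper names `char_poly`, `eigenvector` and `residual` are illustrative assumptions, not part of the text.

```python
import math

A = [[1.0, 2.0, 3.0],
     [3.0, 2.0, 1.0],
     [2.0, 1.0, 3.0]]

def char_poly(s):
    """Left-hand side of the eigen-equation (2.4.15)."""
    return -s**3 + 6.0 * s**2 + 2.0 * s - 12.0

def eigenvector(s):
    """Solve the 2x2 system (2.4.19) with the first component set to unity."""
    det = (2.0 - s) * (3.0 - s) - 1.0
    d2 = (-3.0 * (3.0 - s) + 2.0) / det
    d3 = (-2.0 * (2.0 - s) + 3.0) / det
    return [1.0, d2, d3]

def residual(s, d):
    """Largest component of A d - s d; zero for an exact eigen-pair."""
    return max(abs(sum(A[i][k] * d[k] for k in range(3)) - s * d[i])
               for i in range(3))

eigenvalues = [6.0, math.sqrt(2.0), -math.sqrt(2.0)]
for s in eigenvalues:
    d = eigenvector(s)
    print(s, d, char_poly(s), residual(s, d))

# Constraints (2.4.16): the product should be Det A = -12, the sum Tr A = 6.
print(math.prod(eigenvalues), sum(eigenvalues))
```

Because only rows two and three of (2.4.18) enter the solution, a small residual in the first row is an independent check that each eigenvalue and eigenvector belong together.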

We introduced the notion of an eigenvalue initially to provide a necessary and sufficient condition
for the convergence of the Gauss-Seidel iteration method for a system of linear equations. Clearly, this is an
excellent example of the case where the error or convergence criteria pose a more difficult problem than the
original problem. There is far more to the detailed determination of the eigenvalues of a matrix than merely
the inversion of a matrix. All the different classes of matrices described in section 1.2 pose special problems
even in the case where distinct eigenvalues exist. The solution of the eigen-equation (2.4.10) involves
finding the roots of polynomials. We shall see in the next chapter that this is a tricky problem indeed.


Chapter 2 Exercises
1. Find the inverse, eigenvalues, and eigenvectors for

Describe the accuracy of your answer and how you know.

2. Solve the following set of equations both by means of a direct method and iterative method.
Describe the methods used and why you chose them.

X2 + 5X3 - 7X4 + 23X5 - X6 + 7X7 + 8X8 + X9 - 5X10 = 10

17X1 - 24X3 - 75X4 +100X5 - 18X6 + 10X7 - 8X8 + 9X9 - 50X10 = -40
3X1 - 2X2 + 15X3 - 78X5 - 90X6 - 70X7 +18X8 -75X9 + X10 = -17
5X1 + 5X2 - 10X3 - 72X5 - X6 + 80X7 - 3X8+10X9 -18X10 = 43
100X1 - 4X2 - 75X3 - 8X4 + 83X6 - 10X7 - 75X8+ 3X9 - 8X10 = -53
70X1 + 85X2 - 4X3 - 9X4 + 2X5 + 3X7 - 17X8 - X9- 21X10 = 12
X1 + 15X2+100X3 - 4X4 - 23X5 + 13X6 + 7X8 - 3X9+17X10 = -60
16X1 + 2X2 - 7X3 + 89X4 - 17X5 + 11X6 - 73X7 - 8X9- 23X10 = 100
51X1 + 47X2 - 3X3 + 5X4 - 10X5 + 18X6 - 99X7 - 18X8+12X10 = 0
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 = 100
3. Solve the equations $\mathbf{A}\vec{x} = \vec{c}$ where $a_{ij} = (i+j-1)^{-1}$, and $c_i = i$ for i ≤ 5, and j ≤ 5. Use both Gauss-Jordan
and Gauss-Seidel methods and comment on which gives the better answer.

4. Solve the following system of equations by Gauss-Jordan and Gauss-Seidel iteration starting with an
initial guess of X=Y=Z=1.

8X + 3Y + 2Z = 20.00
16X + 6Y + 4.001Z = 40.02
4X + 1.501Y + Z = 10.01 .

Comment on the accuracy of your solution and the relative efficiency of the two methods.

5. Show that if A is an orthonormal matrix, then $\mathbf{A}^{-1} = \mathbf{A}^{T}$.

6. If $\vec{x} = \mathbf{A}\vec{x}\,'$ where

$$\mathbf{A} = \begin{pmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \\ 0 & 1 & 1 \end{pmatrix} ,$$

find the components of $\vec{x}\,'$ in terms of the components of $\vec{x}$ for φ = π/6.


## Chapter 2 References and Supplemental Reading

A reasonably complete description of the Crout factorization method is given by

1. Hildebrand, F.B., "Introduction to Numerical Analysis" (1956) McGraw-Hill Book Co.,
Inc., New York, Toronto, London.

A very nice introduction to fixed-point iteration theory is given by

2. Moursund, D.G., and Duris, C.S., "Elementary Theory and Applications of Numerical
Analysis" (1988) Dover Publications, Inc. New York.

The next two references provide an excellent introduction to the determination of eigenvalues and
eigenvectors. Householder's discussion is highly theoretical, but provides the underpinnings for
contemporary methods. The work titled "Numerical Recipes" is just that with some description on how the
recipes work. It represents probably the most complete and useful compilation of contemporary numerical
algorithms currently available.

3. Householder, A.S., "Principles of Numerical Analysis" (1953) McGraw-Hill Book Co.,
Inc., New York, Toronto, London, pp.143-184.

4. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., "Numerical Recipes The
Art of Scientific Computing" (1986), Cambridge University Press, Cambridge, New York,
Melbourne, pp. 335-380.

Richard Hamming's most recent numerical analysis provides a good introduction to the methods for
handling error analysis, while reference 6 is an excellent example of the type of effort one may find in the
Russian literature on numerical methods. Their approach tends to be fundamentally different than the typical
western approach and is often superior as they rely on analysis to a far greater degree than is common in the
west.

5. Hamming, R.W., "Introduction to Applied Numerical Analysis" (1971) McGraw-Hill Book
Co., Inc., New York, San Francisco, Toronto, London.

6. Faddeeva, V.N., "Computational Methods of Linear Algebra" (1959), Trans. C.D. Benster,
Dover Publications, Inc. New York.

3

Polynomial Approximation,
Interpolation, and
Orthogonal Polynomials

• • •

In the last chapter we saw that the eigen-equation for a matrix was
a polynomial whose roots were the eigenvalues of the matrix. However, polynomials play a much larger role
in numerical analysis than providing just eigenvalues. Indeed, the foundation of most numerical analysis
methods rests on the understanding of polynomials. As we shall see, numerical methods are usually tailored
to produce exact answers for polynomials. Thus, if the solution to a problem is a polynomial, it is often
possible to find a method of analysis, which has zero formal truncation error. So the extent to which a
problem's solution resembles a polynomial will generally determine the accuracy of the solution. Therefore
we shall spend some time understanding polynomials themselves so that we may better understand the
methods that rely on them.


## 3.1 Polynomials and Their Roots

When the term polynomial is mentioned, one generally thinks of a function made up of a sum of
terms of the form $a_i x^i$. However, it is possible to have a much broader definition where instead of the simple
function $x^i$ we may use any general function $\phi_i(x)$ so that a general definition of a polynomial would have
the form
$$P(x) = \sum_{i=0}^{n} a_i \phi_i(x) . \qquad (3.1.1)$$

Here the quantity n is known as the degree of the polynomial and is usually one less than the number of
terms in the polynomial. While most of what we develop in this chapter will be correct for general
polynomials such as those in equation (3.1.1), we will use the more common representation of the
polynomial so that
$$\phi_i(x) = x^i . \qquad (3.1.2)$$
Thus the common form for a polynomial would be
$$P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n . \qquad (3.1.3)$$

Familiar as this form may be, it is not the most convenient form for evaluating the polynomial. Consider the
last term in equation (3.1.3). It will take n multiplications to evaluate that term alone and n-1 multiplications
for the next lowest order term. If one sums the series, it is clear that it will take (n+1)n/2 multiplications and
n additions to evaluate P(x). However, if we write equation (3.1.3) as

$$P(x) = a_0 + (a_1 + \cdots (a_{n-1} + a_n x)x \cdots)x , \qquad (3.1.4)$$

then, while there are still n additions required for the evaluation of P(x), the number of multiplications has
been reduced to n. Since the time required for a computer to carry out a multiplication is usually an order of
magnitude greater than that required for addition, equation (3.1.4) is a considerably more efficient way to
evaluate P(x) than the standard form given by equation (3.1.3). Equation (3.1.4) is sometimes called the
"factored form" of the polynomial and can be immediately written down for any polynomial. However, there
is another way of representing the polynomial in terms of factors, namely

$$P(x) = a_n (x - x_1)(x - x_2)(x - x_3) \cdots (x - x_n) . \qquad (3.1.5)$$

Here the last n coefficients of the polynomial have been replaced by n quantities known as the roots of the
polynomial. It is important to note that, in general, there are (n+1) parameters specifying a polynomial of
degree n. These parameters can be either the (n+1) coefficients or the n roots and a multiplicative scale factor
an. In order to fully specify a polynomial this many parameters must be specified. We shall see that this
requirement sets constraints for interpolation.
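
The nested form of equation (3.1.4) translates directly into a short loop. The following Python sketch is one way to write it; the function name `horner` and the ascending coefficient ordering are assumptions made for illustration. The sample coefficients are those of the quartic that appears in the synthetic division tableau (3.1.15).

```python
def horner(coeffs, x):
    """Evaluate P(x) = a0 + a1*x + ... + an*x**n by nested multiplication.

    `coeffs` lists a0..an in ascending order, as in equation (3.1.3).
    """
    result = coeffs[-1]                  # start from the innermost term a_n
    for a in reversed(coeffs[:-1]):
        result = result * x + a          # one multiplication, one addition
    return result

# x^4 + 3x^3 - 17x^2 + 3x - 18, stored as a0..a4.
p = [-18.0, 3.0, -17.0, 3.0, 1.0]
print(horner(p, 2.0), horner(p, 3.0))
```

The loop body runs n times for a polynomial of degree n, which is the count of n multiplications and n additions quoted for equation (3.1.4).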

The n quantities known as the roots are not related to the coefficients in a simple way. Indeed, it is
not obvious that the polynomial should be able to be written in the form of equation (3.1.5). The fact that a
polynomial of degree n has exactly n such roots is known as the fundamental theorem of algebra and its
proof is not simple. As we shall see, finding the roots is also far from simple and constitutes one of the more
difficult problems in numerical analysis. Since the roots may be either real or complex, the most general
approach will have to utilize complex arithmetic. Some polynomials may have multiple roots (i.e. more than
one root with the same numerical value). This causes trouble for some root finding methods. In general, it is
useful to remove a root (or a pair if they are complex) once it is found thereby reducing the polynomial to a
lower degree. Once it has been reduced to a quadratic or even a cubic, the analytic formulae for these roots
may be used. There is an analytic form for the general solution of a quartic (i.e. polynomial of 4th degree),
but it is so cumbersome that it is rarely used. Since it has been shown that there is no general formula in radicals for the
roots of polynomials of degree 5 or higher, one will usually have to resort to numerical methods in order to
find the roots of such polynomials. The absence of a general scheme for finding the roots in terms of the
coefficients means that we shall have to learn as much about the polynomial as possible before looking for
the roots.

## a. Some Constraints on the Roots of Polynomials

This subject has been studied by some of the greatest mathematical minds of the last several
centuries and there are numerous theorems that can be helpful in describing the roots. For example, if we
re-multiply equation (3.1.5) the coefficient of $x^{n-1}$ is just $a_n$ times the negative summation of the roots so that
$$a_{n-1} = -a_n \sum_{i=1}^{n} x_i . \qquad (3.1.6)$$

In a similar manner we find that

$$a_{n-2} = a_n \sum_{j} \sum_{i<j} x_i x_j . \qquad (3.1.7)$$

We will see that it is possible to use these relations to obtain estimates of the magnitude of the roots. In
addition, the magnitude of the roots is bounded by
$$(|a_{\max}| + 1)^{-1} \le |x_j| \le (|a_{\max}| + 1) . \qquad (3.1.8)$$

Finally there is Descartes' rule of signs which we all learned at one time but usually forgot. If we
reverse the order of equation (3.1.3) so that the terms appear in descending powers of x as
$$P(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \cdots + a_0 , \qquad (3.1.9)$$
then any change of sign between two successive terms is called a variation in sign. Coefficients that are zero
are ignored. With that definition of a sign variation we can state Descartes' rule of signs as

The number of positive roots of P(x)=0 cannot exceed the number of variations of sign in
P(x) and, in any case, differs from the number of variations by an even integer.

A useful and easily proved corollary to this is

The number of negative roots of P(x)=0 cannot exceed the number of variations in sign in
P(-x) and, in any case, differs from the number of variations by an even integer.


The phrasing concerning the "even integer" results from the possibility of the existence of complex roots,
which occur in pairs (providing the coefficients are real) where one is the complex conjugate of the other.
With these tools, it is often possible to say a good deal about the properties of the roots of the polynomial in
question. Since most of the methods for finding roots are sequential and require the removal of the roots
leading to a new polynomial of lower degree, we should say something about how this is accomplished.
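
Counting sign variations is easily mechanized. The Python sketch below applies the rule and its corollary to the eigen-equation (2.4.15) of the previous chapter, whose coefficients in descending powers are -1, 6, 2, -12. The function names `sign_variations` and `descartes_bounds` are illustrative assumptions.

```python
def sign_variations(coeffs):
    """Count sign changes between successive non-zero coefficients."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def descartes_bounds(coeffs_descending):
    """Upper bounds on the number of positive and negative real roots."""
    n = len(coeffs_descending) - 1
    # P(-x): the coefficient of x**k changes sign whenever k is odd.
    reflected = [c if (n - i) % 2 == 0 else -c
                 for i, c in enumerate(coeffs_descending)]
    return sign_variations(coeffs_descending), sign_variations(reflected)

# -s^3 + 6s^2 + 2s - 12 from equation (2.4.15).
print(descartes_bounds([-1, 6, 2, -12]))
```

For this cubic the bounds are two positive roots and one negative root, in agreement with the eigenvalues 6, +√2 and -√2 found in equation (2.4.17).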

## b. Synthetic Division

If we wish to remove a factor from a polynomial we may proceed as if we were doing long
division with the added proviso that we keep track of the appropriate powers of x. Thus if (x-r) is to be
factored out of P(x) we could proceed in exactly the same fashion as long division. Consider the specific
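case where r = 2 and
$$P(x) = x^4 + 3x^3 - 17x^2 + 3x - 18 . \qquad (3.1.10)$$

The long division would then look like

$$\begin{array}{r|rrrrr}
 & & x^3 & +5x^2 & -7x & -11 \\ \hline
(x-2) & x^4 & +3x^3 & -17x^2 & +3x & -18 \\
 & x^4 & -2x^3 & & & \\ \cline{2-3}
 & & 5x^3 & -17x^2 & & \\
 & & 5x^3 & -10x^2 & & \\ \cline{3-4}
 & & & -7x^2 & +3x & \\
 & & & -7x^2 & +14x & \\ \cline{4-5}
 & & & & -11x & -18 \\
 & & & & -11x & +22 \\ \cline{5-6}
 & & & & & -40
\end{array} \qquad (3.1.11)$$

Thus we can write P(x) as
$$P(x) = (x-2)(x^3 + 5x^2 - 7x - 11) - 40 , \qquad (3.1.12)$$
or in general as
$$P(x) = (x-r)Q(x) + R . \qquad (3.1.13)$$

So if we evaluate the polynomial for x = r we get

$$P(r) = R . \qquad (3.1.14)$$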


Now if R is zero, then r is a root by definition. Indeed, one method for improving roots is to carry out
repeated division, varying r until the remainder R is acceptably close to zero. A cursory inspection of the
long division expression (3.1.11) shows that much more is being written down than is necessary. In order for
the division to proceed in an orderly fashion, there is no freedom in what is to be done with the lead
coefficients of the largest powers of x. Indeed, the coefficients of the resultant polynomial Q(x) are repeated
below. Also, when searching for a root, the lead coefficient of x in the divisor is always one and therefore
need not be written down. Thus if we write down only the coefficients and r-value for the division process,
we can compress the notation so that
$$\begin{array}{r|rrrr|rl}
r = 2 & +1 & +3 & -17 & +3 & -18 & = P(x) \\
 & & +2 & +10 & -14 & -22 & \\ \hline
Q(x) = & +1 & +5 & -7 & -11 & -40 & = R
\end{array} \qquad (3.1.15)$$

This shorthand form of keeping track of the division is known as synthetic division. Even this notation can
be formulated in terms of a recursive procedure. If we let the coefficients of the quotient polynomial Q(x) be
$b_i$ so that
$$Q(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{n-1} x^{n-1} , \qquad (3.1.16)$$

then the process of finding the $b_i$'s in terms of the coefficients $a_i$ of the original polynomial P(x) can be
written as
$$\left. \begin{aligned} b_{n-1} &= a_n \\ b_{i-1} &= r b_i + a_i , \quad i = n-1, \ldots, 0 \\ R &= b_{-1} \end{aligned} \right\} . \qquad (3.1.17)$$

Here the remainder R is given by $b_{-1}$ and should it be zero, then r is a root. Therefore, once a root has been
found, it can be removed by synthetic division leading to a new polynomial Q(x). One can then begin again
to find the roots of Q(x) until the original polynomial has been reduced to a cubic or less. Because of the
complexity of the general cubic, one usually uses the quadratic formula. However, even here Press et al$^1$
suggest caution and recommend the use of both forms of the formula, namely
$$\left. \begin{aligned} x &= \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \\ x &= \frac{2c}{-b \pm \sqrt{b^2 - 4ac}} \end{aligned} \right\} . \qquad (3.1.18)$$
Should a or c be small the square root of the discriminant will be nearly the same as |b| and the resultant solution will suffer
from round-off error. They suggest the following simple solution to this problem. Define
$$q = -\tfrac{1}{2}\left[ b + \mathrm{sgn}(b)\sqrt{b^2 - 4ac} \right] . \qquad (3.1.19)$$

Then the two roots will be given by

$$\left. \begin{aligned} x &= q/a \\ x &= c/q \end{aligned} \right\} . \qquad (3.1.20)$$
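
A short Python sketch of this prescription follows. It assumes the auxiliary quantity q = -[b + sgn(b)√(b² - 4ac)]/2 recommended by Press et al, treats only the case of real roots, and uses an illustrative function name `quadratic_roots`; the test coefficients are hypothetical values chosen so that c is small compared with b².

```python
import math

def quadratic_roots(a, b, c):
    """Real roots of a*x**2 + b*x + c = 0 using q = -(b + sgn(b)*sqrt(disc))/2."""
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("complex roots: this sketch handles real roots only")
    # b and the square root are added with the same sign, so nothing cancels.
    q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
    if q == 0.0:                         # happens only when b = c = 0
        return 0.0, 0.0
    return q / a, c / q

# Hypothetical example with widely separated roots near 1e8 and 1e-8.
x1, x2 = quadratic_roots(1.0, -1.0e8, 1.0)
print(x1, x2)
```

With these coefficients the textbook formula must subtract two nearly equal numbers to obtain the small root, whereas the form above obtains it from the ratio c/q and so retains full precision.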

Let us see how one might analyze our specific polynomial in equation (3.1.10). Descartes’ rule of
signs for P(x) tells us that we will have no more than three real positive roots while for P(-x) it states that we
will have no more than one real negative root. The degree of the polynomial itself indicates that there will be
four roots in all. When the coefficients of a polynomial are integer, it is tempting to look for integer roots. A
little exploring with synthetic division shows that we can find two roots so that

$$P(x) = (x-3)(x+6)(x^2+1) , \qquad (3.1.21)$$

and clearly the last two roots are complex. For polynomials with real coefficients, one can even use synthetic
division to remove complex roots. Since the roots will appear in conjugate pairs, simply form the quadratic
polynomial
$$(x-r)(x-r^*) = x^2 - (r+r^*)x + r r^* , \qquad (3.1.22)$$

which will have real coefficients as the imaginary part of r cancels out of (r+r*) and rr* is real by definition.
One then uses synthetic division to divide out the quadratic form of equation (3.1.22). A general recurrence
relation similar to equation (3.1.17) can be developed for the purposes of machine computation.
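
For a single real root the recurrence (3.1.17) can be coded in a few lines. The Python sketch below is one possible arrangement; the function name `synthetic_division` and the ascending storage of the coefficients are assumptions for illustration. It reproduces the tableau (3.1.15) and then removes the two real roots of the quartic one after the other.

```python
def synthetic_division(a, r):
    """Divide P(x) by (x - r) using the recurrence of equation (3.1.17).

    `a` holds a0..an in ascending order (n >= 1). Returns (b, R) where b
    holds the quotient coefficients b0..b(n-1) and R is the remainder P(r).
    """
    n = len(a) - 1
    b = [0] * n
    b[n - 1] = a[n]
    for i in range(n - 1, 0, -1):
        b[i - 1] = r * b[i] + a[i]
    remainder = r * b[0] + a[0]          # b(-1) in the notation of the text
    return b, remainder

p = [-18, 3, -17, 3, 1]                  # coefficients of (3.1.15), ascending
print(synthetic_division(p, 2))          # quotient and remainder for r = 2

q1, r1 = synthetic_division(p, 3)        # remove the root x = 3
q2, r2 = synthetic_division(q1, -6)      # then remove the root x = -6
print(q2, r1, r2)
```

Both remainders vanish and the final quotient has coefficients 1, 0, 1, that is x² + 1, which is the quadratic factor of equation (3.1.21).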

Normally the coefficients of interesting polynomials are not integers and the roots are not simple
numbers. Therefore the synthetic division will have a certain round off error so that R(r) will not be zero.
This points out one of the greatest difficulties to be encountered in finding the roots of a polynomial. The
round off error in R(r) accumulates from root to root and will generally depend on the order in which the
roots are found. Thus the final quadratic polynomial that yields the last two roots may be significantly
different than the correct polynomial that would result in the absence of round off error. One may get a
feeling for the extent of this problem by redoing the calculation but finding the roots in a different order. If
the values are independent of the order in which they are found, then they are probably accurate. If not, then
they are not.

## c. The Graeffe Root-Squaring Process

We discuss this process not so much for its practical utility as to show the efficacy of the
constraints given in equations (3.1.6,7). Consider evaluating a polynomial for values of $x = x_i$ where $x_i$ are
the roots so that

$$P(x_i) = \sum_j a_j x_i^{j} = \sum_k a_{2k} x_i^{2k} + \sum_k a_{2k+1} x_i^{2k+1} . \qquad (3.1.23)$$

We may separate the terms of the polynomial into even and odd powers of x and since $P(x_i)=0$, we may
arrange the odd powers so that they are all on one side of the equation as
$$\left[ \sum_k a_{2k} x_i^{2k} \right]^2 = \left[ \sum_k a_{2k+1} x_i^{2k+1} \right]^2 . \qquad (3.1.24)$$


Squaring both sides produces exponents with even powers and a polynomial with new coefficients $a_i^{(p)}$
and having the form
$$S(x) = a_n^{(p)} x^{2pn} + a_{n-1}^{(p)} x^{2pn-2} + \cdots + a_0^{(p)} . \qquad (3.1.25)$$

These new coefficients can be generated by the recurrence relation from

$$\left. \begin{aligned} a_i^{(p+1)} &= 2 a_n^{(p)} a_{2l}^{(p)} - 2\sum_{k=l-1}^{n-1} a_k^{(p)} a_{2l-k}^{(p)} + (-1)^l \left( a_1^{(p)} \right)^2 \\ a_i^{(p)} &= 0 , \quad i > n \end{aligned} \right\} . \qquad (3.1.26)$$

If we continue to repeat this process it is clear that the largest root will dominate the sum in equation (3.1.6)
so that
$$x_{\max}^{2p} = \lim_{p\to\infty} \sum_{i=1}^{n} x_i^{2p} = \lim_{p\to\infty} \left[ a_{n-1}^{(p)} / a_n^{(p)} \right] . \qquad (3.1.27)$$

Since the product of the largest two roots will dominate the sums of equation (3.1.7), we may generalize the
result of eq (3.1.27) so that each root will be given by

$$x_i^{2p} \cong \lim_{p\to\infty} \left[ a_{i-1}^{(p)} / a_n^{(p)} \right] . \qquad (3.1.28)$$

While this method will in principle yield all the roots of the polynomial, the coefficients grow so fast that
roundoff error quickly begins to dominate the polynomial. However, in some instances it may yield
approximate roots that will suffice for initial guesses required by more sophisticated methods. Impressive as
this method is theoretically, it is rarely used. While the algorithm is reasonably simple, the large number of
digits required by even a few steps makes the programming of the method exceedingly difficult.
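
The idea can be illustrated with a few lines of Python. The sketch below does not use the recurrence (3.1.26); instead it assumes the equivalent construction of multiplying P(x) by P(-x), which leaves only even powers and yields a polynomial in x² whose roots are the squares of the original roots. Ratios of successive coefficients then estimate the root magnitudes, here taking the number of squarings p to mean that the roots have been raised to the power 2^p. Python's exact integers are used to sidestep the digit growth noted above. The function names and the test cubic (x-1)(x-2)(x-3) are illustrative assumptions, and the estimate presumes roots of distinct magnitude.

```python
def graeffe_step(a):
    """One root-squaring step; `a` holds a0..an in ascending powers."""
    n = len(a) - 1
    # Coefficients of P(-x): odd powers change sign.
    a_neg = [c if k % 2 == 0 else -c for k, c in enumerate(a)]
    # Product P(x) * P(-x) by direct convolution.
    prod = [0] * (2 * n + 1)
    for i, ci in enumerate(a):
        for j, cj in enumerate(a_neg):
            prod[i + j] += ci * cj
    # Only even powers survive; substitute y = x**2 and fix the overall sign.
    sign = -1 if n % 2 else 1
    return [sign * prod[2 * k] for k in range(n + 1)]

def graeffe_magnitudes(a, steps):
    """Estimate root magnitudes, largest first, after `steps` squarings."""
    n = len(a) - 1
    q = list(a)
    for _ in range(steps):
        q = graeffe_step(q)
    power = 2 ** steps
    # Ratios of successive coefficients isolate one squared root at a time.
    return [abs(q[n - i] / q[n - i + 1]) ** (1.0 / power)
            for i in range(1, n + 1)]

# (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6, stored as a0..a3.
print(graeffe_step([-6, 11, -6, 1]))
print(graeffe_magnitudes([-6, 11, -6, 1], 5))
```

The first line printed holds the coefficients of (y-1)(y-4)(y-9), and after five steps the estimates lie very close to 3, 2 and 1. Only magnitudes are recovered; the signs of the roots must be settled separately, for instance by evaluating P at the candidates.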

## d. Iterative Methods

Most of the standard algorithms used to find the roots of polynomials scan the polynomial
in an orderly fashion searching for the root. Any such scheme requires an initial guess, a method for
predicting a better guess, and a system for deciding when a root has been found. It is possible to cast any
such method in the form of a fixed-point iterative function such as was discussed in section 2.3d. Methods
having this form are legion so we will discuss only the simplest and most widely used. Putting aside the
problem of establishing the initial guess, we will turn to the central problem of predicting an improved value
for the root. Consider the simple case of a polynomial with real roots and having a value $P(x_k)$ for some
value of the independent variable $x_k$ (see Figure 3.1).


Figure 3.1 depicts a typical polynomial with real roots. Construct the tangent to the curve
at the point $x_k$ and extend this tangent to the x-axis. The crossing point $x_{k+1}$ represents an
improved value for the root in the Newton-Raphson algorithm. The point $x_{k-1}$ can be used to
construct a secant providing a second method for finding an improved value of x.

Many iterative techniques use a straight line extension of the function P(x) to the x-axis as a means
of determining an improved value for x. In the case where the straight-line approximation to the function is
obtained from the local tangent to the curve at the point $x_k$, we call the method the Newton-Raphson method.
We can cast this in the form of a fixed-point iterative function since we are looking for the place where P(x)
= 0. In order to find the iterative function that will accomplish this let us assume that an improved value of
the root $x^{(k)}$ will be given by

$$x^{(k+1)} = x^{(k)} + [x^{(k+1)} - x^{(k)}] \equiv x^{(k)} + \Delta x^{(k)} . \qquad (3.1.29)$$

Now since we are approximating the function locally by a straight line, we may write
$$\left. \begin{aligned} P[x^{(k)}] &= \alpha x^{(k)} + \beta \\ P[x^{(k+1)}] &= \alpha x^{(k+1)} + \beta \end{aligned} \right\} . \qquad (3.1.30)$$
Subtracting these two equations, and requiring that the straight line vanish at the improved value so that $P[x^{(k+1)}] = 0$, we get

$$P[x^{(k)}] = \alpha[x^{(k)} - x^{(k+1)}] = -\alpha\Delta x^{(k)} . \qquad (3.1.31)$$


However the slope of the tangent line α is given by the derivative so that

$$\alpha = dP[x^{(k)}]/dx . \qquad (3.1.32)$$

Thus the Newton-Raphson iteration scheme can be written as

$$x^{(k+1)} = x^{(k)} - P[x^{(k)}]/P'[x^{(k)}] . \qquad (3.1.33)$$

By comparing equation (3.1.33) to equation (2.3.18) it is clear that the fixed-point iterative function for
Newton-Raphson iteration is
$$\Phi(x) = x - P(x)/P'(x) . \qquad (3.1.34)$$

We can also apply the convergence criterion given by equation (2.3.20) and find that the necessary
and sufficient condition for the convergence of the Newton-Raphson iteration scheme is
$$\left| \frac{P(x)P''(x)}{[P'(x)]^2} \right| < 1 , \quad \forall x \in x^{(k)} \le x \le x_0 . \qquad (3.1.35)$$
Since this involves only one more derivative than is required for the implementation of the scheme, it
provides a quite reasonable convergence criterion and it should be used in conjunction with the iteration
scheme.
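
The iteration (3.1.33) together with the convergence measure of equation (3.1.35) can be sketched in Python as follows. The function names, the stopping tolerance and the starting guess are illustrative assumptions; the test polynomial is the quartic of the synthetic division tableau (3.1.15), which equation (3.1.21) shows to have a root at x = 3.

```python
def newton_raphson(p, dp, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x - P(x)/P'(x), equation (3.1.33), from the guess x0."""
    x = x0
    for _ in range(max_iter):
        slope = dp(x)
        if slope == 0.0:
            raise ZeroDivisionError("tangent is horizontal; choose another guess")
        step = -p(x) / slope             # the correction delta x^(k)
        x += step
        if abs(step) <= tol * max(1.0, abs(x)):
            return x
    raise RuntimeError("no convergence within max_iter iterations")

def convergence_measure(p, dp, d2p, x):
    """|P P'' / P'^2| of equation (3.1.35), evaluated at the point x."""
    return abs(p(x) * d2p(x) / dp(x) ** 2)

P = lambda x: x**4 + 3*x**3 - 17*x**2 + 3*x - 18
dP = lambda x: 4*x**3 + 9*x**2 - 34*x + 3
d2P = lambda x: 12*x**2 + 18*x - 34

root = newton_raphson(P, dP, 4.0)
print(root, P(root), convergence_measure(P, dP, d2P, 4.0))
```

The measure is below unity at the starting guess, and the iteration settles on the root at x = 3. Printing P(root) is the residual check recommended later in this section for any root returned by a general program.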

The Newton-Raphson iteration scheme is far more general than is implied by its use in polynomial
root finding. Indeed, many non-linear equations can be dealt with by means of equations (3.1.34, 35). From
equation (3.1.33), it is clear that the scheme will yield 'exact' answers for first degree polynomials or straight
lines. Thus we can expect that the error at any step will depend on $[\Delta x^{(k)}]^2$. Such schemes are said to be
second order schemes and converge quite rapidly. In general, if the error at any step can be written as

$$E(x) = K \times (\Delta x)^n , \qquad (3.1.36)$$

where K is approximately constant throughout the range of approximation, the approximation scheme is said
to be of (order) $O(\Delta x)^n$. It is also clear that problems can occur for this method in the event that the root of
interest is a multiple root. Any multiple root of P(x) will also be a root of P'(x). Geometrically this implies
that the root will occur at a point where the polynomial becomes tangent to the x-axis. Since the denominator
of equation (3.1.35) will approach zero at least quadratically while the numerator may approach zero linearly
in the vicinity of the root(s), it is unlikely that the convergence criterion will be met. In practice, the shallow
slope of the tangent will cause a large correction to $x^{(k)}$ moving the iteration scheme far from the root.

A modest variation of this approach yields a rather more stable iteration scheme. If instead of using
the local value of the derivative to obtain the slope of our approximating line, we use a prior point from the
iteration sequence, we can construct a secant through the prior point and the present point instead of the local
tangent. The straight line approximation through these two points will have the form
$$\left. \begin{aligned} P[x^{(k)}] &= \alpha x^{(k)} + \beta \\ P[x^{(k-1)}] &= \alpha x^{(k-1)} + \beta \end{aligned} \right\} , \qquad (3.1.37)$$
which, in the same manner as was done with equation (3.1.30) yields a value for the slope of the line of
$$\alpha = \frac{P[x^{(k)}] - P[x^{(k-1)}]}{x^{(k)} - x^{(k-1)}} . \qquad (3.1.38)$$
So the iterative form of the secant iteration scheme is

$$x^{(k+1)} = x^{(k)} - \frac{P[x^{(k)}]\,[x^{(k)} - x^{(k-1)}]}{P[x^{(k)}] - P[x^{(k-1)}]} . \qquad (3.1.39)$$
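
A Python sketch of the secant scheme (3.1.39) is given below. No derivative is required, only two starting points; the function name `secant`, the tolerance and the starting values are illustrative assumptions, and the test polynomial is again the quartic of the tableau (3.1.15).

```python
def secant(p, x_prev, x_curr, tol=1e-12, max_iter=100):
    """Secant iteration of equation (3.1.39) from two starting points."""
    f_prev, f_curr = p(x_prev), p(x_curr)
    for _ in range(max_iter):
        if f_curr == f_prev:
            raise ZeroDivisionError("secant is horizontal; choose other starting points")
        x_next = x_curr - f_curr * (x_curr - x_prev) / (f_curr - f_prev)
        # Shift the pair of points forward; only one new evaluation per step.
        x_prev, f_prev = x_curr, f_curr
        x_curr, f_curr = x_next, p(x_next)
        if abs(x_curr - x_prev) <= tol * max(1.0, abs(x_curr)):
            return x_curr
    raise RuntimeError("no convergence within max_iter iterations")

P = lambda x: x**4 + 3*x**3 - 17*x**2 + 3*x - 18
x_root = secant(P, 5.0, 4.0)
print(x_root, P(x_root))
```

Each step costs a single evaluation of P(x), which is the practical attraction of the method when the derivative is expensive or unavailable.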

Useful as these methods are for finding real roots, as presented they will be ineffective in locating complex roots. There are numerous more sophisticated methods that amount to searching the complex plane for roots. For example, Bairstow's method synthetically divides the polynomial of interest by an initial quadratic factor, which yields a remainder of the form

$$R = \alpha x + \beta , \tag{3.1.40}$$

where α and β depend on the coefficients of the trial quadratic form. For that form to contain two roots of the polynomial, both α and β must be zero. These two constraints allow a two-dimensional search in the complex plane to be made, usually using a scheme such as Newton-Raphson or versions of the secant method. Press et al. strongly suggest the use of the Jenkins-Traub method or the Lehmer-Schur method. These rather sophisticated schemes are well beyond the scope of this book, but may be studied in Acton².

Before leaving this subject, we should say something about the determination of the initial guess.
The limits set by equation (3.1.8) are useful in choosing an initial value of the root. They also allow for us to
devise an orderly progression of finding the roots - say from large to small. While most general root finding
programs will do this automatically, it is worth spending a little time to see if the procedure actually follows
an orderly scheme. Following this line, it is worth repeating the cautions raised earlier concerning the
difficulties of finding the roots of polynomials. The blind application of general programs is almost certain to
lead to disaster. At the very least, one should check to see how well any given root satisfies the original polynomial. That is, to what extent is $P(x_i) = 0$? While even this doesn't guarantee the accuracy of the root, it is often sufficient to justify its use in some other problem.

## 3.2 Curve Fitting and Interpolation

The very processes of interpolation and curve fitting are basically attempts to get "something for
nothing". In general, one has a function defined at a discrete set of points and desires information about the
function at some other point. Well that information simply doesn't exist. One must make some assumptions
about the behavior of the function. This is where some of the "art of computing" enters the picture. One
needs some knowledge of what the discrete entries of the table represent. In picking an interpolation scheme
to generate the missing information, one makes some assumptions concerning the functional nature of the
tabular entries. That assumption is that they behave as polynomials. All interpolation theory is based on
polynomial approximation. To be sure the polynomials need not be of the simple form of equation (3.1.3),
but nevertheless they will be polynomials of some form such as equation (3.1.1).

Having identified that missing information will be generated on the basis that the tabular function is represented by a polynomial, the problem is reduced to determining the coefficients of that polynomial. Actually some thought should be given to the form of the functions $\phi_i(x)$, which determine the basic form of the polynomial. Unfortunately, more often than not, the functions are taken to be $x^i$ and any difficulties in representing the function are offset by increasing the order of the polynomial. As we shall see, this is a dangerous procedure at best and can lead to absurd results. It is far better to see if the basic data are, say, exponential or periodic in form and use basis functions of the form $e^{ix}$, $\sin(i\pi x)$, or some other appropriate functional form. One will then be able to use interpolative functions of lower order, which are subject to fewer large and unexpected fluctuations between the tabular points, thereby producing a more reasonable result.

Having picked the basis functions of the polynomial, one then proceeds to determine the
coefficients. We have already observed that an nth degree polynomial has (n+1) coefficients which may be
regarded as (n+1) degrees of freedom, or n+1 free parameters to adjust so as to provide the best fit to the
tabular entry points. However, one still has the choice of how much of the table to fit at any given time. For
interpolation or curve-fitting, one assumes that the tabular data are known with absolute precision. Thus we
expect the approximating polynomial to reproduce the data points exactly, but the number of data points for
which we will make this demand at any particular part of the table remains at the discretion of the
investigator. We shall develop our interpolation formulae initially without regard to the degree of the
polynomial that will be used. In addition, although there is a great deal of literature developed around
interpolating equally spaced data, we will allow the spacing to be arbitrary. While we will forgo the elegance
of the finite difference operator in our derivations, we will be more than compensated by the generality of
the results. These more general formulae can always be used for equally spaced data. However, we shall limit our generality to the extent that, for our examples, we shall confine ourselves to basis functions of the form $x^i$. The generalization to more exotic basis functions is usually straightforward. Finally, some authors make a distinction between interpolation and curve fitting, with the latter being extended to a single functional relation that fits an entire tabular range. However, the approaches are basically the same, so we shall treat the two subjects as one. Let us then begin by developing the Lagrange interpolation formulae.

a. Lagrange Interpolation

Let us assume that we have a set of data points $Y(x_i)$ and that we wish to approximate the behavior of the function between the data points by a polynomial of the form
$$\Phi(x) = \sum_{j=0}^{n} a_j x^j . \tag{3.2.1}$$

Now we require exact conformity between the interpolative function $\Phi(x_i)$ and the data points $Y(x_i)$ so that
$$Y(x_i) = \Phi(x_i) = \sum_{j=0}^{n} a_j x_i^j , \quad i = 0 \ldots n . \tag{3.2.2}$$

Equation (3.2.2) represents n+1 inhomogeneous equations in the n+1 coefficients aj which we could solve
using the techniques in chapter 2. However, we would then have a single interpolation formula that would
have to be changed every time we changed the values of the dependent variable Y(xi). Instead, let us
combine equations (3.2.1) and (3.2.2) to form n+2 homogeneous equations of the form

$$\left.\begin{aligned} \sum_{j=0}^{n} a_j x_i^j - Y(x_i) \\ \sum_{j=0}^{n} a_j x^j - \Phi(x) \end{aligned}\right\} = 0 . \tag{3.2.3}$$
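These equations will have a solution if and only if
$$\mathrm{Det}\begin{vmatrix} 1 & x_0 & x_0^2 & x_0^3 & \cdots & x_0^n & -Y_0 \\ 1 & x_1 & x_1^2 & x_1^3 & \cdots & x_1^n & -Y_1 \\ 1 & x_2 & x_2^2 & x_2^3 & \cdots & x_2^n & -Y_2 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & x & x^2 & x^3 & \cdots & x^n & -\Phi(x) \end{vmatrix} = 0 . \tag{3.2.4}$$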

Now let $x = x_i$ and subtract the last row of the determinant from the ith row so that expansion by minors along that row will yield
$$[\Phi(x_i) - Y_i]\,\left|x_j^k\right|_i = 0 . \tag{3.2.5}$$
Since $\left|x_j^k\right|_i \ne 0$, the value of $\Phi(x_i)$ must be $Y(x_i)$, satisfying the requirements given by equation (3.2.2). Now expand equation (3.2.4) by minors about the last column so that

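$$\Phi(x)\left|x_j^k\right| = \begin{vmatrix} 1 & x_0 & x_0^2 & x_0^3 & \cdots & x_0^n & -Y_0 \\ 1 & x_1 & x_1^2 & x_1^3 & \cdots & x_1^n & -Y_1 \\ 1 & x_2 & x_2^2 & x_2^3 & \cdots & x_2^n & -Y_2 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & x & x^2 & x^3 & \cdots & x^n & 0 \end{vmatrix} = \sum_{i=0}^{n} Y(x_i)A_i(x) . \tag{3.2.6}$$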
Here the $A_i(x)$ are the minors that arise from the expansion down the last column and they are independent of the $Y_i$'s. They are simply linear combinations of the $x^j$'s, and the coefficients of the linear combination depend only on the $x_i$'s. Thus it is possible to calculate them once for any set of independent variables $x_i$ and use the results for any set of $Y_i$'s. The determinant $\left|x_j^k\right|$ depends only on the spacing of the tabular values of the independent variable; it is called the Vandermonde determinant and is given by

$$V_d = \left|x_j^k\right| = \prod_{i>j=0}^{n} (x_i - x_j) . \tag{3.2.7}$$
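
As a quick numerical check of equation (3.2.7), the sketch below compares the product formula with a directly evaluated determinant. The helper names `determinant` and `vandermonde_product` and the sample abscissae are assumptions made for illustration; exact rational arithmetic is used so that the two sides can be compared without round-off.

```python
from fractions import Fraction
from itertools import combinations


def determinant(rows):
    """Exact determinant by Gaussian elimination with row exchanges."""
    a = [[Fraction(v) for v in row] for row in rows]
    n, det = len(a), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # a zero column means a singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                  # a row exchange flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def vandermonde_product(xs):
    """Right-hand side of (3.2.7): product of (x_i - x_j) over all i > j."""
    prod = Fraction(1)
    for j, i in combinations(range(len(xs)), 2):    # every pair with j < i
        prod *= Fraction(xs[i]) - Fraction(xs[j])
    return prod


xs = [2, 3, 5, 8]                                   # hypothetical sample abscissae
matrix = [[Fraction(x) ** k for k in range(len(xs))] for x in xs]
```

The product vanishes whenever two abscissae coincide, which is why distinct tabular values of the independent variable are required.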

Therefore, dividing $A_i(x)$ in equation (3.2.6) by the Vandermonde determinant, we can write the interpolation formula given by equation (3.2.6) as
$$\Phi(x) = \sum_{i=0}^{n} Y(x_i)L_i(x) , \tag{3.2.8}$$

where $L_i(x)$ is known as the Lagrange Interpolative Polynomial and is given by
$$L_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{(x - x_j)}{(x_i - x_j)} . \tag{3.2.9}$$

This is a polynomial of degree n with roots $x_j$ for j ≠ i since one term is skipped (i.e. when i = j) in a product of n+1 terms. It has some interesting properties. For example
$$L_i(x_k) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{(x_k - x_j)}{(x_i - x_j)} = \delta_{ki} , \tag{3.2.10}$$

where $\delta_{ki}$ is Kronecker's delta. It is clear that for values of the independent variable equally separated by an amount h the Lagrange polynomials become
$$L_i(x) = \frac{(-1)^{n-i}}{(n-i)!\,i!\,h^n} \prod_{\substack{j=0 \\ j \ne i}}^{n} (x - x_j) . \tag{3.2.11}$$
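
Equations (3.2.8) to (3.2.10) translate almost directly into code. The following is a minimal sketch, assuming hypothetical function names and a small made-up data set sampled from $x^2 + x + 1$; rational arithmetic keeps the Kronecker delta property of equation (3.2.10) exact.

```python
from fractions import Fraction


def lagrange_basis(xs, i, x):
    """L_i(x) of equation (3.2.9) for the tabular points xs."""
    value = Fraction(1)
    for j, xj in enumerate(xs):
        if j != i:                       # the term j = i is skipped
            value *= Fraction(x - xj) / Fraction(xs[i] - xj)
    return value


def lagrange_interpolate(xs, ys, x):
    """Phi(x) of equation (3.2.8): the data weighted by the L_i(x)."""
    return sum(y * lagrange_basis(xs, i, x) for i, y in enumerate(ys))


nodes = [0, 1, 3]                        # hypothetical tabular points
values = [1, 3, 13]                      # samples of x**2 + x + 1

# Equation (3.2.10): L_i(x_k) is 1 when i = k and 0 otherwise.
delta = [[lagrange_basis(nodes, i, xk) for xk in nodes] for i in range(len(nodes))]

phi_at_2 = lagrange_interpolate(nodes, values, 2)
```

Because the $L_i(x)$ depend only on the abscissae, they can be computed once and reused for any set of dependent values tabulated at the same points.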

The use of the Lagrangian interpolation polynomials as described by equations (3.2.8) and (3.2.9) suggests that the entire range of tabular entries be used for the interpolation. This is not generally the case. One picks a subset of tabular points and uses them for the interpolation. The use of all available tabular data will generally result in a polynomial of a very high degree possessing rapid variations between the data points that are unlikely to represent the behavior of the tabular data.

Here we confront specifically one of the "artistic" aspects of numerical analysis. We know only the
values of the tabular data. The scheme we choose for representing the tabular data at other values of the
independent variable must only satisfy some aesthetic sense that we have concerning that behavior. That
sense cannot be quantified for the objective information on which to evaluate it simply does not exist. To
illustrate this and quantify the use of the Lagrangian polynomials, consider the functional values for xi and
Yi given in Table 3.1. We wish to obtain a value for the dependent variable Y when the independent variable
x = 4. As shown in figure 3.2, the variation of the tabular values Yi is rapid, particularly in the vicinity of x =
4. We must pick some set of points to determine the interpolative polynomials.

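Table 3.1
Sample Data and Results for Lagrangian Interpolation Formulae

| $i$ | $x_i$ | ${}^{1}_{2}L_i(4)$ | ${}^{2}_{1}L_i(4)$ | ${}^{2}_{2}L_i(4)$ | ${}^{3}_{1}L_i(4)$ | $Y_i$ | ${}^{1}_{2}\Phi(4)$ | ${}^{2}_{1}\Phi(4)$ | ${}^{2}_{2}\Phi(4)$ | ${}^{3}_{1}\Phi(4)$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 |  |  |  |  | 1 |  |  |  |  |
| 1 | 2 |  | -1/3 |  | -2/9 | 3 |  |  |  |  |
| 2 | 3 | +1/2 | +1 | +2/5 | +4/5 | 8 |  |  |  |  |
|  | 4 |  |  |  |  |  | 6 | 25/3 | 86/15 | 112/15 |
| 3 | 5 | +1/2 | +1/3 | +2/3 | +4/9 | 4 |  |  |  |  |
| 4 | 8 |  |  | -1/15 | -1/45 | 2 |  |  |  |  |
| 5 | 10 |  |  |  |  | 1 |  |  |  |  |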

The number of points will determine the order and we must decide which points will be used. The points are usually picked for their proximity to the desired value of the independent variable. Let us pick them consecutively beginning with tabular entry $x_k$. Then the nth degree Lagrangian polynomials will be
$${}^{n}_{k}L_i(x) = \prod_{\substack{j=k \\ j \ne i}}^{n+k} \frac{(x - x_j)}{(x_i - x_j)} . \tag{3.2.12}$$

Should we choose to approximate the tabular entries by a straight line passing through points bracketing the desired value of x = 4, we would get

$$\left.\begin{aligned} {}^{1}_{2}L_2(x) &= \frac{(x - x_3)}{(x_2 - x_3)} = \tfrac{1}{2} \quad \text{for } x = 4 \\ {}^{1}_{2}L_3(x) &= \frac{(x - x_2)}{(x_3 - x_2)} = \tfrac{1}{2} \quad \text{for } x = 4 \end{aligned}\right\} . \tag{3.2.13}$$

Thus the interpolative value ${}^{1}_{2}\Phi(4)$ given in table 3.1 is simply the average of the adjacent values of $Y_i$. As can be seen in figure 3.2, this instance of linear interpolation yields a reasonably pleasing result. However, should we wish to be somewhat more sophisticated and approximate the behavior of the tabular function with a parabola, we are faced with the problem of which three points to pick. If we bracket the desired point with two points on the left and one on the right we get Lagrangian polynomials of the form

$$\left.\begin{aligned} {}^{2}_{1}L_1(x) &= \frac{(x - x_2)(x - x_3)}{(x_1 - x_2)(x_1 - x_3)} = -\frac{1}{3} , \quad x = 4 \\ {}^{2}_{1}L_2(x) &= \frac{(x - x_1)(x - x_3)}{(x_2 - x_1)(x_2 - x_3)} = +1 , \quad x = 4 \\ {}^{2}_{1}L_3(x) &= \frac{(x - x_1)(x - x_2)}{(x_3 - x_1)(x_3 - x_2)} = +\frac{1}{3} , \quad x = 4 \end{aligned}\right\} . \tag{3.2.14}$$

Figure 3.2 shows the behavior of the data from Table 3.1. The results of
various forms of interpolation are shown. The approximating polynomials for
the linear and parabolic Lagrangian interpolation are specifically displayed.
The specific results for cubic Lagrangian interpolation, weighted Lagrangian
interpolation and interpolation by rational first degree polynomials are also
indicated.

Substituting these polynomials into equation (3.2.8) and using the values for $Y_i$ from Table 3.1, we get an interpolative polynomial of the form

$$P_1(x) = 3\,{}^{2}_{1}L_1(x) + 8\,{}^{2}_{1}L_2(x) + 4\,{}^{2}_{1}L_3(x) = -(7x^2 - 50x + 63)/3 . \tag{3.2.15}$$

Had we chosen the bracketing points to include one on the left and two on the right, the polynomial would have the form
$$P_2(x) = 8\,{}^{2}_{2}L_2(x) + 4\,{}^{2}_{2}L_3(x) + 2\,{}^{2}_{2}L_4(x) = 2(2x^2 - 31x + 135)/15 . \tag{3.2.16}$$

However, it is not necessary to functionally evaluate these polynomials to obtain the interpolated value. Only the numerical values of the Lagrangian polynomials for the specific value of the independent variable, given on the right hand side of equations (3.2.14), need be substituted directly into equation (3.2.8) along with the appropriate values of $Y_i$. This leads to the values for ${}^{2}_{1}\Phi(4)$ and ${}^{2}_{2}\Phi(4)$ given in Table 3.1. The values are quite different, but bracket the result of the linear interpolation.
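
The entries of Table 3.1 can be reproduced with a short sketch of equation (3.2.12). The function names `windowed_basis` and `windowed_phi` are hypothetical; the data are the $x_i$ and $Y_i$ of Table 3.1, and the arguments n and k are the degree and the first tabular entry used, as in the text.

```python
from fractions import Fraction

X = [1, 2, 3, 5, 8, 10]     # x_i from Table 3.1, i = 0 ... 5
Y = [1, 3, 8, 4, 2, 1]      # Y_i from Table 3.1


def windowed_basis(n, k, i, x):
    """L_i(x) of degree n built on x_k ... x_{k+n}, equation (3.2.12).

    x must be an integer or a Fraction so that the arithmetic stays exact.
    """
    value = Fraction(1)
    for j in range(k, k + n + 1):
        if j != i:
            value *= Fraction(x - X[j], X[i] - X[j])
    return value


def windowed_phi(n, k, x):
    """Interpolated value from equation (3.2.8) restricted to the same points."""
    return sum(Y[i] * windowed_basis(n, k, i, x) for i in range(k, k + n + 1))


# Linear, the two parabolas, and the cubic, all evaluated at x = 4.
results = {(n, k): windowed_phi(n, k, 4) for (n, k) in [(1, 2), (2, 1), (2, 2), (3, 1)]}
```

The dictionary `results` holds 6, 25/3, 86/15 and 112/15 for the four choices of points, matching the last four columns of Table 3.1.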

While equations (3.2.13) - (3.2.16) provide an acceptable method of carrying out the interpolation, there are methods that are more efficient and more readily programmed. One of the most direct of these is a recursive procedure where values of an interpolative polynomial of degree k are fit to successive sets of the data points. In this method the polynomial's behavior with x is not found, just its value for a specific choice of x. This value is given by
$$\left.\begin{aligned} P_{i,i+1,\ldots,i+k}(x) &= \frac{(x - x_{i+k})\,P_{i,i+1,\ldots,i+k-1}(x) + (x_i - x)\,P_{i+1,i+2,\ldots,i+k}(x)}{(x_i - x_{i+k})} \\ P_{i,i}(x) &= Y_i , \quad \text{for } k = 0 \end{aligned}\right\} . \tag{3.2.17}$$

For our test data given in table 3.1 the recursive formula given by equation (3.2.17) yields
$$\left.\begin{aligned} P_{1,2}(4) &= \frac{(4 - x_2)Y_1 + (x_1 - 4)Y_2}{(x_1 - x_2)} = \frac{(4 - 3)\times 3 + (2 - 4)\times 8}{(2 - 3)} = +13 \\ P_{2,3}(4) &= \frac{(4 - x_3)Y_2 + (x_2 - 4)Y_3}{(x_2 - x_3)} = \frac{(4 - 5)\times 8 + (3 - 4)\times 4}{(3 - 5)} = +6 \\ P_{3,4}(4) &= \frac{(4 - x_4)Y_3 + (x_3 - 4)Y_4}{(x_3 - x_4)} = \frac{(4 - 8)\times 4 + (5 - 4)\times 2}{(5 - 8)} = +\frac{14}{3} \end{aligned}\right\} \tag{3.2.18}$$
for k = 1. Here we see that $P_{2,3}(4)$ corresponds to the linear interpolative value obtained using points $x_2$ and $x_3$, given in table 3.1 as ${}^{1}_{2}\Phi(4)$. In general, the values of $P_{i,i+1}(x)$ correspond to the value of the straight line passing through points $x_i$ and $x_{i+1}$ evaluated at x. The next generation of recursive polynomial-values will correspond to parabolas passing through the points $x_i$, $x_{i+1}$, and $x_{i+2}$ evaluated at x.

$$\left.\begin{aligned} P_{1,2,3}(4) &= \frac{(4 - x_3)P_{1,2}(4) + (x_1 - 4)P_{2,3}(4)}{(x_1 - x_3)} = \frac{(4 - 5)\times 13 + (2 - 4)\times 6}{(2 - 5)} = +\frac{25}{3} \\ P_{2,3,4}(4) &= \frac{(4 - x_4)P_{2,3}(4) + (x_2 - 4)P_{3,4}(4)}{(x_2 - x_4)} = \frac{(4 - 8)\times 6 + (3 - 4)\times (14/3)}{(3 - 8)} = +\frac{86}{15} \end{aligned}\right\} , \tag{3.2.19}$$

which correspond to the values for ${}^{2}_{1}\Phi(4)$ and ${}^{2}_{2}\Phi(4)$ in table 3.1 respectively. The cubic which passes through points $x_1$, $x_2$, $x_3$, and $x_4$ is the last generation of the polynomials calculated here by this recursive procedure and is
$$P_{1,2,3,4}(4) = \frac{(4 - x_4)P_{1,2,3}(4) + (x_1 - 4)P_{2,3,4}(4)}{(x_1 - x_4)} = \frac{(4 - 8)\times (25/3) + (2 - 4)\times (86/15)}{(2 - 8)} = +\frac{112}{15} . \tag{3.2.20}$$
The procedure described by equation (3.2.17) is known as Neville's algorithm and can nicely be summarized in Table 3.2.

The fact that these results exactly replicate those of table 3.1 is no surprise, as the polynomial of a particular degree k that passes through a set of k+1 points is unique. Thus this algorithm describes a particularly efficient method for carrying out Lagrangian interpolation and, like most recursive procedures, is easily implemented on a computer.
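
A compact sketch of the recursion in equation (3.2.17) follows. The function name `neville` and the in-place overwriting of one working list are implementation choices assumed here; the text itself only specifies the recursion. Rational arithmetic is used so the values can be compared exactly with the tables.

```python
from fractions import Fraction


def neville(xs, ys, x):
    """Value at x of the polynomial through all points (xs, ys), by (3.2.17)."""
    p = [Fraction(y) for y in ys]            # generation k = 0: P_{i,i} = Y_i
    n = len(xs)
    for k in range(1, n):                    # build generation k from generation k-1
        for i in range(n - k):
            # p[i] still holds P_{i..i+k-1}; p[i+1] still holds P_{i+1..i+k}.
            p[i] = ((x - xs[i + k]) * p[i] + (xs[i] - x) * p[i + 1]) / (xs[i] - xs[i + k])
    return p[0]


# The cubic through x_1 ... x_4 of Table 3.1, evaluated at x = 4.
cubic_value = neville([2, 3, 5, 8], [3, 8, 4, 2], 4)
```

Calling the routine on the subsets (2, 3, 5) and (3, 5, 8) returns the two parabolic values, and on (3, 5) the linear one, so each entry of the tableau can be checked independently.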

How are we to decide which of the parabolas is "better"? In some real sense, both are equally likely. The large value of ${}^{2}_{1}\Phi(4)$ results from the rapid variation of the tabular function through the three chosen points (see figure 3.2), and most would reject the result as being too high. However, we must remember that this is a purely subjective judgment. Perhaps one would be well advised always to have the same number of points on either side so as to ensure that the tabular variation on either side is equally weighted. This would lead to interpolation by polynomials of an odd degree. If we choose two points on either side of the desired value of the independent variable, we fit a cubic through the local points and obtain ${}^{3}_{1}\Phi(4)$, which is rather close to ${}^{2}_{1}\Phi(4)$. It is clear that the rapid tabular variation of the points preceding x = 4 dominates the interpolative polynomials. So which one is correct? We must emphasize that there is no objectively "correct" answer to this question. Generally one prefers an interpolative function that varies no more rapidly than the tabular values themselves, but when those values are sparse this criterion is difficult to impose. We shall consider additional interpolative forms that tend to meet this subjective notion over a wide range of conditions. Let us now turn to methods of their construction.

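Table 3.2
Parameters for the Polynomials Generated by Neville's Algorithm

| $i$ | $x_i$ | $Y_i$ | $P_{i,i}$ | $P_{i,i+1}$ | $P_{i,i+1,i+2}$ | $P_{i,i+1,i+2,i+3}$ |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 3 | 3 | +13 | +25/3 | +112/15 |
| 2 | 3 | 8 | 8 | +6 | +86/15 | 0 |
| 3 | 5 | 4 | 4 | +14/3 | 0 |  |
| 4 | 8 | 2 | 2 | 0 |  |  |
| 5 | 10 | 1 | 0 |  |  |  |

Each polynomial value is listed in the row of its first index $i$ and is evaluated at x = 4. Entries of 0 mark polynomials that were not used in the calculation.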

It is possible to place additional constraints on the interpolative function which will make the appropriate interpolative polynomials somewhat more difficult to obtain, but it will always be possible to obtain them through consideration of a determinantal equation similar to equation (3.2.6). For example, let us consider the case where constraints are placed on the derivative of the function at a given number of values for the independent variable.

b. Hermite Interpolation

While we will use the Hermite interpolation formula to obtain some highly efficient
quadrature formulae later, the primary reason for discussing this form of interpolation is to show a powerful
approach to generating interpolation formulae from the properties of the Lagrange polynomials. In addition
to the functional constraints of Lagrange interpolation given by equation (3.2.2), let us assume that the
functional values of the derivative Y'(xi) are also specified at the points xi. This represents an additional
(n+1) constraints. However, since we have assumed that the interpolative function will be a polynomial, the
relationship between a polynomial and its derivative means we shall have to be careful in order that these
2n+2 constraints remain linearly independent. While a polynomial of degree n has (n+1) coefficients, its
derivative will have only n coefficients. Thus the specification of the derivative at the various values of the independent variable allows a polynomial with 2n+2 coefficients to be used, which is a polynomial of degree 2n+1.

Rather than obtain the determinantal equation for the 2n+2 constraints and the functional form of the
interpolative function, let us derive the interpolative function from what we know of the properties of Li(x).
For the interpolative function to be independent of the values of the dependent variable and its derivative, it
must have a form similar to equation (3.2.8) so that
$$\Phi(x) = \sum_{j=0}^{n} \left[ Y(x_j)\,h_j(x) + Y'(x_j)\,H_j(x) \right] . \tag{3.2.21}$$

As before we shall require that the interpolative function yield the exact values of the function at the tabular
values of the independent variable. Thus,
$$\Phi(x_i) = Y(x_i) = \sum_{j=0}^{n} \left[ Y(x_j)\,h_j(x_i) + Y'(x_j)\,H_j(x_i) \right] . \tag{3.2.22}$$

Now the beauty of an interpolation formula is that it is independent of the values of the dependent variable and, in this case, its derivative. Thus equation (3.2.22) must hold for any set of data points $Y_i$ and their derivatives $Y'_i$. So let us consider a very specific set of data points given by
$$\left.\begin{aligned} Y(x_i) &= 1 \\ Y(x_j) &= 0 , \quad j \ne i \\ Y'(x_j) &= 0 , \quad \forall j \end{aligned}\right\} . \tag{3.2.23}$$

This certainly implies that $h_i(x_i)$ must be one. A different set of data points that have the properties that
$$\left.\begin{aligned} Y(x_j) &= 0 , \quad j \ne k \\ Y(x_k) &= 1 \\ Y'(x_j) &= 0 , \quad \forall j \end{aligned}\right\} , \tag{3.2.24}$$

will require that $h_k(x_j)$ be zero for j ≠ k. However, the conditions on $h_i(x_j)$ must be independent of the values of the independent variable so that both conditions must hold. Therefore

$$h_j(x_i) = \delta_{ij} , \tag{3.2.25}$$

where $\delta_{ij}$ is Kronecker's delta. Finally one can consider a data set where
$$\left.\begin{aligned} Y(x_j) &= 0 \\ Y'(x_j) &= 1 \end{aligned}\right\} \ \forall j . \tag{3.2.26}$$
Substitution of this set of data into equation (3.2.22) clearly requires that

$$H_j(x_i) = 0 . \tag{3.2.27}$$

Now let us differentiate equation (3.2.21) with respect to x and evaluate at the tabular values of the independent variable $x_i$. This yields
$$\Phi'(x_i) = Y'(x_i) = \sum_{j=0}^{n} \left[ Y(x_j)\,h'_j(x_i) + Y'(x_j)\,H'_j(x_i) \right] . \tag{3.2.28}$$

By choosing our data sets to have the same properties as in equations (3.2.23), (3.2.24) and (3.2.26), but with the roles of the function and its derivative reversed, we can show that

$$\left.\begin{aligned} h'_j(x_i) &= 0 \\ H'_j(x_i) &= \delta_{ij} \end{aligned}\right\} . \tag{3.2.29}$$

We have now placed constraints on the interpolative functions $h_j(x)$, $H_j(x)$ and their derivatives at each of the n+1 values of the independent variable. Since we know that both $h_j(x)$ and $H_j(x)$ are polynomials, we need only express them in terms of polynomials of the correct degree which have the same properties at the points $x_i$ to uniquely determine their form.

We have already shown that the interpolative polynomials will have a degree of (2n+1). Thus we need only find a polynomial that has the form specified by equations (3.2.25) and (3.2.29). From equation (3.2.10) we can construct such a polynomial to have the form

$$h_j(x) = v_j(x)L_j^2(x) , \tag{3.2.30}$$

where $v_j(x)$ is a linear polynomial in x which will have only two arbitrary constants. We can use the constraint on the amplitude and derivative of $h_j(x_i)$ to determine those two constants. Making use of the constraints in equations (3.2.25) and (3.2.29) we can write that

$$\left.\begin{aligned} h_i(x_i) &= v_i(x_i)L_i^2(x_i) = 1 \\ h'_j(x_i) &= v'_j(x_i)L_j^2(x_i) + 2v_j(x_i)L'_j(x_i)L_j(x_i) = 0 \end{aligned}\right\} . \tag{3.2.31}$$

Since $v_i(x)$ is linear, it must have the form

$$v_i(x) = a_i x + b_i . \tag{3.2.32}$$

Specifically, putting the linear form for $v_i(x)$ into equation (3.2.31) we get
$$\left.\begin{aligned} v_i(x_i) &= a_i x_i + b_i = 1 \\ v'_i(x_i) &= a_i = -2(a_i x_i + b_i)L'_i(x_i) \end{aligned}\right\} , \tag{3.2.33}$$
which can be solved for $a_i$ and $b_i$ to get
$$\left.\begin{aligned} a_i &= -2L'_i(x_i) \\ b_i &= 1 + 2x_i L'_i(x_i) \end{aligned}\right\} . \tag{3.2.34}$$
Therefore the linear polynomial $v_i(x)$ will have the particular form
$$v_i(x) = 1 - 2(x - x_i)L'_i(x_i) . \tag{3.2.35}$$

We must follow the same procedure to specify $H_j(x)$. Like $h_j(x)$, it will also be a polynomial of degree 2n+1 so let us try the same form for it as we did for $h_j(x)$. So
$$H_j(x) = u_j(x)L_j^2(x) , \tag{3.2.36}$$

where $u_j(x)$ is also a linear polynomial whose coefficients must be determined from
$$\left.\begin{aligned} H_i(x_i) &= u_i(x_i)L_i^2(x_i) = 0 \\ H'_i(x_i) &= u'_i(x_i)L_i^2(x_i) + 2u_i(x_i)L'_i(x_i)L_i(x_i) = 1 \end{aligned}\right\} . \tag{3.2.37}$$
Since $L_i^2(x_i)$ is unity, these constraints clearly limit the values of $u_i(x)$ and its derivative at the tabular points to be
$$\left.\begin{aligned} u_i(x_i) &= 0 \\ u'_i(x_i) &= 1 \end{aligned}\right\} . \tag{3.2.38}$$
Since $u_i(x)$ is linear and must have the form
$$u_i(x) = \alpha_i x + \beta_i , \tag{3.2.39}$$

we can use equation (3.2.38) to find the constants $\alpha_i$ and $\beta_i$ as

$$\left.\begin{aligned} \alpha_i &= 1 \\ \beta_i &= -x_i \\ u_i(x) &= (x - x_i) \end{aligned}\right\} , \tag{3.2.40}$$
thereby completely specifying $u_i(x)$. Therefore, the two functions $h_j(x)$ and $H_j(x)$ will have the specific form
$$\left.\begin{aligned} h_j(x) &= [1 - 2(x - x_j)L'_j(x_j)]\,L_j^2(x) \\ H_j(x) &= (x - x_j)\,L_j^2(x) \end{aligned}\right\} . \tag{3.2.41}$$
All that remains is to find $L'_j(x_j)$. By differentiating equation (3.2.9) with respect to x and setting x to $x_j$, we get
$$L'_j(x_j) = \sum_{k \ne j} (x_j - x_k)^{-1} , \tag{3.2.42}$$

which means that $v_j(x)$ will simplify to

$$v_j(x) = 1 - 2\sum_{k \ne j} \frac{(x - x_j)}{(x_j - x_k)} . \tag{3.2.43}$$

Therefore the Hermite interpolative function will take the form

$$\Phi(x) = \sum_{i} \left[ Y_i v_i(x) + Y'_i u_i(x) \right] \left[ \prod_{j \ne i} \frac{(x - x_j)}{(x_i - x_j)} \right]^2 . \tag{3.2.44}$$

This function will match the original function $Y_i$ and its derivative at each of the tabular points. This function is a polynomial of degree 2n+1 with 2n+2 coefficients. These 2n+2 coefficients are specified by the 2n+2 constraints on the function and its derivative. Therefore this polynomial is unique, and whether it is obtained in the above manner or by expansion of the determinantal equation is irrelevant to the result. While such a specification is rarely needed, this procedure does indicate how the form of the Lagrange polynomials can be used to specify interpolative functions that meet more complicated constraints. We will now consider the imposition of a different set of constraints that lead to a class of interpolative functions that have found wide application.
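
The Hermite formula can be checked numerically with a short sketch built from equations (3.2.35), (3.2.40), (3.2.42) and (3.2.44). The function name `hermite_interpolate` and the test data (samples of $x^3$ and $x^5$ with their derivatives) are assumptions chosen for illustration; integer tabular points are assumed so that the arithmetic stays exact.

```python
from fractions import Fraction


def hermite_interpolate(xs, ys, dys, x):
    """Phi(x) of equation (3.2.44) from values ys and derivatives dys at integer xs."""
    x = Fraction(x)
    total = Fraction(0)
    for i, xi in enumerate(xs):
        li = Fraction(1)     # L_i(x), equation (3.2.9)
        dli = Fraction(0)    # L'_i(x_i), equation (3.2.42)
        for j, xj in enumerate(xs):
            if j != i:
                li *= (x - xj) / Fraction(xi - xj)
                dli += Fraction(1, xi - xj)
        v = 1 - 2 * (x - xi) * dli       # equation (3.2.35)
        u = x - xi                       # equation (3.2.40)
        total += (ys[i] * v + dys[i] * u) * li ** 2
    return total


# Two points fix a cubic: x**3 and its derivative 3x**2 sampled at x = 0 and x = 2.
cubic_at_1 = hermite_interpolate([0, 2], [0, 8], [0, 12], 1)

# Three points fix a quintic: x**5 and 5x**4 sampled at x = 0, 1, 2.
quintic_at_3 = hermite_interpolate([0, 1, 2], [0, 1, 32], [0, 5, 80], 3)
```

With n+1 points the interpolant has degree 2n+1, so the cubic and the quintic are reproduced exactly: `cubic_at_1` equals 1 and `quintic_at_3` equals 243.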

c. Splines

Splines are interpolative polynomials that involve information concerning the derivative
of the function at certain points. Unlike Hermite interpolation, which explicitly invokes knowledge of the derivative, splines utilize that information implicitly so that specific knowledge of the derivative is not required. Unlike general interpolation formulae of the Lagrangian type, which may be used in a small section of a table, splines are constructed to fit an entire run of tabular entries of the independent variable.
While one can construct splines of any order, the most common ones are cubic splines as they generate
tri-diagonal equations for the coefficients of the polynomials. As we saw in chapter 2, tri-diagonal
equations lend themselves to rapid solution involving about N steps. In this case N would be the number
of tabular entries of the independent variable. Thus for relatively few arithmetic operations, one can
construct a set of cubic polynomials which will represent the function over its entire tabular range. If one
were to make a distinction between interpolation and curve fitting, that would be it. That is, one may
obtain a local value of a function by interpolation, but if one desires to describe the entire range of a
tabular function, one would call that curve fitting. Because of the common occurrence of cubic splines,
we shall use them as the basis for our discussion. Generalization to higher orders is not difficult, but will
generate systems of equations for their coefficients that are larger than tri-diagonal. That removes much
of the attractiveness of the splines for interpolation.

To understand how splines can be constructed, consider a function with n tabular points whose independent variable we will denote as $x_i$ and dependent values as $Y_i$. We will approximate the functional values between any two adjacent points $x_i$ and $x_{i+1}$ by a cubic polynomial denoted by $\Psi_i(x)$. Also let the interval between $x_{i+1}$ and $x_i$ be called
$$\Delta x_i \equiv x_{i+1} - x_i . \tag{3.2.45}$$

Since the cubic interpolative polynomials $\Psi_i(x)$ cover each of the n-1 intervals between the n tabular points, there will be 4(n-1) constants to be determined to specify the interpolative functions. As with Lagrange interpolation theory we will require that the interpolative function reproduce the tabular entries so that

$$\left.\begin{aligned} \Psi_i(x_i) &= Y_i \\ \Psi_i(x_{i+1}) &= Y_{i+1} \end{aligned}\right\} \quad i = 1 \ldots n-1 . \tag{3.2.46}$$
Requiring that a single polynomial match two successive points means that two adjacent polynomials will have the same value where they meet, or
$$\Psi_i(x_{i+1}) = \Psi_{i+1}(x_{i+1}) , \quad i = 1 \ldots n-2 . \tag{3.2.47}$$
The requirement to match n tabular points represents n linearly independent constraints on the 4n-4
coefficients of the polynomials. The remaining constraints come from conditions placed on the functional
derivatives. Specifically we shall require that
$$\left.\begin{aligned} \Psi'_{i-1}(x_i) &= \Psi'_i(x_i) \\ \Psi''_{i-1}(x_i) &= \Psi''_i(x_i) \end{aligned}\right\} \quad i = 2 \ldots n-1 . \tag{3.2.48}$$
Unlike Hermite interpolation, we have not specified the magnitude of the derivatives at the tabular points,
but only that they are the same for two adjacent functions Ψi-1(xi) and Ψi(xi) at the points xi all across the
tabular range. Only at the end points have we made no restrictions. Requiring the first two derivatives of
adjacent polynomials to be equal where they overlap will guarantee that the overall effect of the splines
will be to generate a smoothly varying function over the entire tabular range. Since all the interpolative
polynomials are cubics, their third derivatives are constants throughout the interval ∆xi so that
$$\Psi'''_i(x_i) = \Psi'''_i(x_{i+1}) = \text{const.} , \quad i = 1 \ldots n-1 . \tag{3.2.49}$$
Thus the specification of the functional value and equality of the first two derivatives of adjacent
functions essentially forces the value of the third derivative on each of the functions Ψi(x). This represents
n-1 constraints. However, the particular value of that constant for all polynomials is not specified so that
this really represents only n-2 constraints. In a similar manner, the specification of the equality of the
derivative of two adjacent polynomials for n-2 points represents another n-2 constraints. Since two
derivatives are involved we have an additional 2n-4 constraints bringing the total to 4n-6. However, there
were 4n-4 constants to be determined in order that all the cubic splines be specified. Thus the system as
specified so far is under-determined. Since we have said nothing about the end points it seems clear that
that is where the added constraints must be made. Indeed, we shall see that additional constraints must be
placed either on the first or second derivative of the function at the end points in order that the problem
have a unique solution. However, we shall leave the discussion of the specification of the final two
constraints until we have explored the consequences of the 4n-6 constraints we have already developed.

Since the value of the third derivative of any cubic is a constant, the constraints on the equality of
the second derivatives of adjacent splines require that the constant be the same for all splines. Thus the
second derivative for all splines will have the form
$$\Psi''_i(x) = ax + b . \tag{3.2.50}$$
If we apply this form to two successive tabular points, we can write
$$\left.\begin{aligned} \Psi''_i(x_i) &= ax_i + b = Y''_i \\ \Psi''_{i+1}(x_{i+1}) &= ax_{i+1} + b = Y''_{i+1} \end{aligned}\right\} . \tag{3.2.51}$$
Here we have introduced the notation that $\Psi''_i(x_i) = Y''_i$. The fact of the matter is that $Y''_i$ doesn't exist. We have no knowledge of the real values of the derivatives of the tabular function anywhere. All our constraints are applied to the interpolative polynomials $\Psi_i(x)$, otherwise known as the cubic splines. However, the notation is clear, and as long as we keep the philosophical distinction clear, there should be no confusion about what $Y''_i$ means. In any event they are unknown and must eventually be found. Let us press on and solve equations (3.2.51) for a and b getting
$$\left.\begin{aligned} a &= (Y''_i - Y''_{i+1})/(x_i - x_{i+1}) = (Y''_{i+1} - Y''_i)/\Delta x_i \\ b &= Y''_i - x_i(Y''_{i+1} - Y''_i)/\Delta x_i \end{aligned}\right\} . \tag{3.2.52}$$
Substituting these values into equation (3.2.50) we obtain the form of the second derivative of the cubic spline as
$$\Psi''_i(x) = [Y''_{i+1}(x - x_i) - Y''_i(x - x_{i+1})]/\Delta x_i . \tag{3.2.53}$$

Now we may integrate this expression twice making use of the requirement that the function and its first derivative are continuous across a tabular entry point, and evaluate the constants of integration to get
$$\Psi_i(x) = \{Y_i - Y''_i[(\Delta x_i)^2 - (x_{i+1} - x)^2]/6\}[(x_{i+1} - x)/\Delta x_i] - \{Y_{i+1} - Y''_{i+1}[(\Delta x_i)^2 - (x_i - x)^2]/6\}[(x_i - x)/\Delta x_i] . \tag{3.2.54}$$

This fairly formidable expression for the cubic spline has no quadratic term and depends on those unknown constants $Y''_i$.

To get equation (3.2.54) we did not explicitly use the constraints on $\Psi'_i(x)$ so we can use them now to get a set of equations that the constants $Y''_i$ must satisfy. If we differentiate equation (3.2.54) and make use of the condition on the first derivative that
$$\Psi'_{i-1}(x_i) = \Psi'_i(x_i) , \tag{3.2.55}$$
we get after some algebra that

$$Y''_{i-1}\Delta x_{i-1} + 2Y''_i(\Delta x_{i-1} + \Delta x_i) + Y''_{i+1}\Delta x_i = 6[(Y_{i+1} - Y_i)/\Delta x_i - (Y_i - Y_{i-1})/\Delta x_{i-1}] , \quad i = 2 \ldots n-1 . \tag{3.2.56}$$

Everything on the right hand side is known from the tabular entries while the left hand side contains three of the unknown constants $Y''_i$. Thus we see that the equations have the form of a tri-diagonal system of equations amenable to fast solution algorithms. Equation (3.2.56) represents n-2 equations in n unknowns, clearly pointing out that the problem is still under-determined by two constants. If we arbitrarily take $Y''_1 = Y''_n = 0$, then the splines that arise from the solution of equation (3.2.56) are called natural splines. Keeping the second derivative zero will reduce the variation in the function near the end points and this is usually considered desirable. While this arbitrary choice may introduce some error near the end points, the effect of that error will diminish as one moves toward the middle of the tabular range. If one is given nothing more than the tabular entries $Y_i$ and $x_i$, then there is little more that one can do and the natural splines are as good as any other assumption. However, should anything be known about the first or second derivatives at the end points, one can make a more rational choice for the last two constants of the problem. For example, if the values of the first derivatives are known at the end points, then differentiating equation (3.2.54) and evaluating it at the end points yields two more equations of condition which depend on the end point first derivatives as
$$\left.\begin{aligned} Y''_1 + \tfrac{1}{2}Y''_2 &= 3[(Y_2 - Y_1)/\Delta x_1 - Y'_1]/\Delta x_1 \\ Y''_n + \tfrac{1}{2}Y''_{n-1} &= 3[(Y_{n-1} - Y_n)/\Delta x_{n-1} + Y'_n]/\Delta x_{n-1} \end{aligned}\right\} . \tag{3.2.57}$$

These two added conditions complete the system of equations without destroying their tri-diagonal form
and pose a significant alternative to natural splines should some knowledge of the endpoint derivatives
exist. It is clear that any such information at any point in the tabular range could be used to further
constrain the system so that a unique solution exists. In the absence of such information one has little
choice but to use the aesthetically pleasing natural splines. One may be somewhat disenchanted that it is
necessary to appeal to esthetics to justify a solution to a problem, but again remember that we are trying
to get "something for nothing" in any interpolation or curve fitting problem. The "true" nature of the
solution between the tabular points simply does not exist. Thus we have another example of where the
"art of computing" is utilized in numerical analysis.

In order to see the efficacy of splines, consider the same tabular data given in Table 3.1 and
investigate how splines would yield a value for the table at x = 4. Unlike Lagrangian interpolation, the
constraints that determine the values for the splines will involve the entire table. Thus we shall have to set
up the equations specified by equation (3.2.56). We shall assume that natural splines will be appropriate
for the example so that
Y"0 = Y"5 = 0 . (3.2.58)

For i = 1, equation (3.2.56) and the tabular values from Table 3.1 yield

$$
4Y''_1 + Y''_2 = 42 , \tag{3.2.59}
$$

and the entire system of linear equations for the Y"i can be written as

$$
\begin{pmatrix}
4 & 1 & 0 & 0 \\
1 & 6 & 2 & 0 \\
0 & 2 & 10 & 3 \\
0 & 0 & 3 & 10
\end{pmatrix}
\begin{pmatrix} Y''_1 \\ Y''_2 \\ Y''_3 \\ Y''_4 \end{pmatrix}
=
\begin{pmatrix} 42 \\ 18 \\ -16 \\ -7 \end{pmatrix} . \tag{3.2.60}
$$
The solution for this tri-diagonal system can be found by any of the methods described in Chapter 2, but it is
worth noting the increase in efficiency afforded by the tri-diagonal nature of the system. The solution is
given in Table 3.3.
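
The efficiency referred to here comes from eliminating only along the three diagonals, so that the work grows linearly with the number of equations. A minimal Python sketch of such an elimination, applied to the system (3.2.60), might look as follows. The function name `solve_tridiagonal` and the way the three diagonals are passed are assumptions made for this illustration, and no pivoting is attempted.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tri-diagonal system by forward elimination and back substitution.

    lower[i] multiplies unknown i-1 in row i (lower[0] is unused),
    upper[i] multiplies unknown i+1 in row i (upper[-1] is unused).
    No pivoting is done, so a zero pivot raises ZeroDivisionError.
    """
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# System (3.2.60) for the four interior second derivatives.
lower = [0.0, 1.0, 2.0, 3.0]
diag = [4.0, 6.0, 10.0, 10.0]
upper = [1.0, 2.0, 3.0, 0.0]
rhs = [42.0, 18.0, -16.0, -7.0]
second_derivs = solve_tridiagonal(lower, diag, upper, rhs)
print([round(v, 4) for v in second_derivs])
```

The four values returned should be close to the entries for Y"1 through Y"4 listed in Table 3.3.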

The first item to note is that the assumption of natural splines may not be the best, for the value of
Y"1 ≈ 10 is significantly different from the zero assumed for Y"0. The value of Y" then proceeds to drop
smoothly toward the other boundary implying that the assumption of Y"5 = 0 is pretty good. Substituting the
solution for Y"i into equation (3.2.54) we get that

$$
\Psi_2(4) = \{8 - 1.9876[4-(5-4)^2]/6\}(5-4)/2 - \{4 - (-1.9643)[4-(3-4)^2]/6\}(3-4)/2 = 5.9942 . \tag{3.2.61}
$$
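
The substitution into equation (3.2.54) is easily checked by machine. The following Python sketch evaluates one spline segment from its end point values and second derivatives; the function name `spline_piece` and its argument order are assumptions made here, and the numbers are those appearing in equation (3.2.61).

```python
def spline_piece(x, x_lo, x_hi, y_lo, y_hi, ypp_lo, ypp_hi):
    """Evaluate equation (3.2.54) on the interval x_lo <= x <= x_hi.

    y_lo, y_hi are the tabular values and ypp_lo, ypp_hi the second
    derivatives at the two ends of the interval.
    """
    dx = x_hi - x_lo
    left = (y_lo - ypp_lo * (dx**2 - (x_hi - x)**2) / 6.0) * (x_hi - x) / dx
    right = (y_hi - ypp_hi * (dx**2 - (x_lo - x)**2) / 6.0) * (x_lo - x) / dx
    return left - right


# Interval 3 <= x <= 5 of the table, with Y"2 = 1.9876 and Y"3 = -1.9643.
value = spline_piece(4.0, 3.0, 5.0, 8.0, 4.0, 1.9876, -1.9643)
print(round(value, 4))
```

Evaluating the same function at x = 3 and x = 5 returns the tabular values 8 and 4, which is a quick check that the segment passes through its end points.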

As can be seen from Table 3.3, the results for the natural cubic splines are nearly identical to the
linear interpolation, and are similar to that of the second parabolic Lagrangian interpolation. However, the
most appropriate comparison would be with the cubic Lagrangian interpolation $^{3}_{1}\Phi(4)$ as both approximating
functions are cubic polynomials. Here the results are quite different illustrating the importance of the
constraints on the derivatives of the cubic splines. The Lagrangian cubic interpolation utilizes tabular
information for 2 ≤ x ≤ 8 in order to specify the interpolating polynomial. The splines rely on the more local
information involving the function and its derivatives specified in the range 3 ≤ x ≤ 5. This minimizes the influence of the large
tabular variations elsewhere in the table that affect the Lagrangian cubic polynomial and makes for a
smoother functional variation. The negative aspect of the spline approach is that it requires a solution
throughout the table. If the number of tabular entries is large and the required number of interpolated values
is small, the additional numerical effort may be difficult to justify. In the next section we shall find esthetics
and efficiency playing an even larger role in choosing the appropriate approximating method.

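Table 3.3

## A Comparison of Different Types of Interpolation Formulae

| i | x | $Y_i$ | $^{1}_{1}\Phi(4)$ | $^{2}_{1}\Phi(4)$ | $^{2}_{2}\Phi(4)$ | $^{3}_{1}\Phi(4)$ | $\Delta x_i$ | $Y''_i$ | $\Psi_2(4)$ | $R_{1,2,3,4}$ | $^{2}_{1,2}\Phi_w(4)$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | | | | | 1 | 0.0000 | | | |
| 1 | 2 | 3 | | | | | 1 | 10.003 | | | |
| 2 | 3 | 8 | | | | | 2 | 1.988 | | | |
| | 4 | | 6.000 | 8.333 | 5.733 | 7.467 | | | 5.994 | 5.242 | 6.000 |
| 3 | 5 | 4 | | | | | 3 | -1.965 | | | |
| 4 | 8 | 2 | | | | | 2 | -0.111 | | | |
| 5 | 10 | 1 | | | | | -- | -0.000 | | | |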

## d. Extrapolation and Interpolation Criteria

So far we have obtained several different types of interpolation schemes, but said little
about choosing the degree of the polynomial to be used, or the conditions under which one uses Lagrange
interpolation or splines to obtain the information missing from the table. The reason for this was alluded to in
the previous paragraph - there is no correct answer to the question. One can dodge the philosophical question
of the "correctness" of the interpolated number by appealing to the foundations of polynomial approximation
- namely that to the extent that the function represented by the tabular points can be represented by a
polynomial, the answer is correct. But this is indeed a dodge. For if it were true that the tabular function was
indeed a polynomial, one would simply use the interpolation scheme to determine the polynomial that fit the
entire table and use it. In science, one generally does know something about the nature of a tabular function.
For example, many such tables result from a complicated computation of such length that it is not practical
to repeat the calculation to obtain additional tabular values. One can usually guarantee that the results of
such calculations are at least continuously differentiable functions. Or if there are discontinuities, their location
is known and can be avoided. This may not seem like much knowledge, but it guarantees that one can locally
approximate the table by a polynomial. The next issue is what sort of polynomial should be used and over
what part of the tabular range.

In section 3.1 we pointed out that a polynomial can have a very general form [see equation (3.1.1)].
While we have chosen our basis functions φi(x) to be $x^i$ for most of the discussion, this need not have been
the case. Interpolation formulae of the type developed here for $x^i$ can be developed for any set of basis
functions φi(x). For example, should the table exhibit exponential growth with the independent variable, it
might be advisable to choose
$$
\phi_i(x) = e^{i\alpha x} = [e^{\alpha x}]^i \equiv z^i . \tag{3.2.62}
$$

The simple transformation of $z = e^{\alpha x}$ allows all previously generated formulae to be immediately carried over
to the exponential polynomials. The choice of α will be made to suit the particular table. In general, it is far
better to use basis functions φi(x) that characterize the table than to use some set of functions such as the
convenient $x^i$ and a larger degree for interpolation. One must always make the choice between fitting the
tabular form and using the lowest degree polynomial possible. The choice of basis functions that have the
proper form will allow the use of a lower degree polynomial.
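
A small numerical experiment illustrates the point. The Python sketch below is not taken from the text: the test function 1 + 2e^{αx}, the value α = 0.5, and the helper name `linear_interp` are all assumptions chosen so that the exact answer is known. It interpolates linearly between two tabular points, once in x and once in the transformed variable z.

```python
import math


def linear_interp(t0, y0, t1, y1, t):
    """Straight-line interpolation in whatever variable t is supplied."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


ALPHA = 0.5  # assumed growth rate of the hypothetical table


def f(x):
    """Hypothetical tabulated function with exponential growth."""
    return 1.0 + 2.0 * math.exp(ALPHA * x)


def z(x):
    """Transformed independent variable z = exp(alpha x)."""
    return math.exp(ALPHA * x)


x0, x1, x = 0.0, 2.0, 1.0
exact = f(x)
interp_in_x = linear_interp(x0, f(x0), x1, f(x1), x)
interp_in_z = linear_interp(z(x0), f(x0), z(x1), f(x1), z(x))
print(exact, interp_in_x, interp_in_z)
```

Because the hypothetical function is linear in z, the first degree formula in z reproduces it to round-off, while the same formula in x is in error by about 0.4.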

Why is it so desirable to choose the lowest degree polynomial for interpolation? There is the
obvious reason that the lower the degree the faster the computation and there are some cases where this may
be an overriding concern. However, plausibility of the result is usually the determining factor. When one fits
a polynomial to a finite set of points, the value of the polynomial tends to oscillate between the points of
constraint. The higher the degree of the polynomial, the larger is the amplitude and frequency of these
oscillations. These considerations become even more important when one considers the use of the
interpolative polynomial outside the range specified by the tabular entries. We call such use extrapolation
and it should be done with only the greatest care and circumspection. It is a fairly general characteristic of
polynomials to vary quite rapidly outside the tabular range to which they have been constrained. The
variation is usually characterized by the largest exponent of the polynomial. Thus if one is using polynomials
of the fourth degree, one is likely to find the interpolative polynomial varying as $x^4$ immediately outside the
tabular range. This is likely to be unacceptable. Indeed, there are some who regard any extrapolation beyond
the tabular range that varies more than linearly to be unjustified. There are, of course, exceptions to such a
hard and fast rule. Occasionally asymptotic properties of the function that yield the tabular entries are
known, and then extrapolative functions that mimic the asymptotic behavior may be justifiable.
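
The danger is easily demonstrated with the x and Y columns of Table 3.3. The Python sketch below passes the fifth degree Lagrange polynomial through all six tabular points and evaluates it at x = 12, one tabular spacing beyond the last entry, and compares that with the straight line through the last two points. The function name `lagrange` is an assumption of this illustration, and exact rational arithmetic is used so that round-off plays no role.

```python
from fractions import Fraction


def lagrange(xs, ys, x):
    """Value at x of the Lagrange polynomial through the points (xs, ys)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total


xs = [1, 2, 3, 5, 8, 10]
ys = [1, 3, 8, 4, 2, 1]

full_degree = lagrange(xs, ys, 12)          # fifth degree, all six points
linear = lagrange(xs[-2:], ys[-2:], 12)     # straight line, last two points
print(full_degree, linear)
```

The fifth degree polynomial gives -175 while the straight line gives 0, for a table whose entries never leave the range 1 to 8.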

There is one form of extrapolation that reduces the instabilities associated with polynomials. It is a
form of approximation that abandons the classical basis for polynomial approximation and that is
approximation by rational functions or more specifically quotient polynomials. Let us fit such a function
through the (k+1) points i → i+k. Then we can define a quotient polynomial as
$$
R_{i,i+1,\ldots,i+k}(x) = \frac{P(x)}{Q(x)} = \frac{\sum_{j=0}^{n} a_j x^j}{\sum_{j=0}^{m} b_j x^j} . \tag{3.2.63}
$$

This function would appear to have (m+n+2) free parameters, but we can factor a0 from the numerator
and b0 from the denominator so that only their ratio is a free parameter. Therefore there are only (m+n+1)
free parameters so we must have
$$
k+1 = m+n+1 , \tag{3.2.64}
$$

functional points to determine them. However, the values of n and m must also be specified separately.

Normally the determination of the coefficients of such a function is rather difficult, but Stoer and Bulirsch³
have obtained a recurrence formula for the value of the function itself, which is

$$
\left.
\begin{aligned}
R_{i,i+1,\ldots,i+k}(x) &= R_{i+1,\ldots,i+k}(x) + \frac{R_{i+1,\ldots,i+k}(x) - R_{i,\ldots,i+k-1}(x)}{\left[\dfrac{x - x_i}{x - x_{i+k}}\right]\left[1 - \dfrac{R_{i+1,\ldots,i+k}(x) - R_{i,\ldots,i+k-1}(x)}{R_{i+1,\ldots,i+k}(x) - R_{i+1,\ldots,i+k-1}(x)}\right] - 1} \\
R_{i,i} &= f(x_i) \\
R_{i,k} &= 0 , \quad k < i
\end{aligned}
\right\} . \tag{3.2.65}
$$
This recurrence relation produces a function where n = m if the number of points used is odd, but where m
= n+1 should the number of points be even. However, its use eliminates the need for actually knowing the
values of the coefficients as the relationship gives the value of the approximated function itself. That is

$$
f(x) \cong R_{i,i+1,\ldots,i+k}(x) . \tag{3.2.66}
$$
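
The recurrence (3.2.65) translates almost line for line into a short program. The Python sketch below is one possible arrangement rather than a definitive implementation: the function name `rational_interp` is an assumption, no protection against a vanishing denominator is included, and x must not coincide with a tabular point. The data are also hypothetical, sampled from (1+2x)/(1+x), so that the exact answer is known; with three points the recurrence yields a quotient of two first degree polynomials and should therefore reproduce that function.

```python
def rational_interp(xs, ys, x):
    """Value at x of the quotient polynomial through (xs, ys), built
    generation by generation from the recurrence (3.2.65)."""
    n = len(xs)
    older = [0.0] * (n + 1)   # generation before the data: all zero
    prev = list(ys)           # first generation: R_{i,i} = f(x_i)
    for k in range(1, n):
        cur = []
        for i in range(n - k):
            diff = prev[i + 1] - prev[i]          # R_{i+1..i+k} - R_{i..i+k-1}
            ratio = (x - xs[i]) / (x - xs[i + k])
            inner = prev[i + 1] - older[i + 1]    # R_{i+1..i+k} - R_{i+1..i+k-1}
            denom = ratio * (1.0 - diff / inner) - 1.0
            cur.append(prev[i + 1] + diff / denom)
        older, prev = prev, cur
    return prev[0]


# Hypothetical table sampled from f(x) = (1 + 2x)/(1 + x).
xs = [0.0, 1.0, 3.0]
ys = [1.0, 1.5, 1.75]
value = rational_interp(xs, ys, 2.0)
print(value)   # f(2) = 5/3
```

Only two generations need to be kept at any time, which is why such schemes are economical in storage as well as in time.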

Equation (3.2.65) conceals most of the difficulties of using rational functions or quotient
polynomials. While the great utility of such approximating functions is their stability for extrapolation, we
shall demonstrate their use for interpolation so as to compare the results with the other forms of interpolation
we have investigated. Since the bulk of the other methods have four parameters available for the
specification of the interpolating polynomial (i.e. they are cubic polynomials), we shall consider a quotient
polynomial with four free parameters. This will require that we use four tabular entries which we shall
choose to bracket the point x = 4 symmetrically. Such an approximating function would have the form

$$
R_{1,2,3,4}(x) = (ax+b)/(\alpha x+\beta) . \tag{3.2.67}
$$

However, the recursive form of equation (3.2.65) means that we will never determine the values of a, b, α,
and β. The subscript notation used in equations (3.2.63) − (3.2.66) is designed to explicitly convey the
recursive nature of the determination of the interpolative function. Each additional subscript denotes a
successive "generation" in the development of the final result. One begins with the tabular data and the
second of equations (3.2.65). Taking the data from Table 3.3 so that f(xi) = Yi, we get for the second
generation that represents the interpolative value at x = 4

$$
\left.
\begin{aligned}
R_{1,2}(x) &= R_{2,2}(x) + \frac{R_{2,2}(x) - R_{1,1}(x)}{\left[\dfrac{x - x_1}{x - x_2}\right]\left[1 - \dfrac{R_{2,2}(x) - R_{1,1}(x)}{R_{2,2}(x)}\right] - 1} = 8 + \frac{8-3}{\left[\dfrac{4-2}{4-3}\right]\left[1 - \dfrac{8-3}{8}\right] - 1} = -12 \\
R_{2,3}(x) &= R_{3,3}(x) + \frac{R_{3,3}(x) - R_{2,2}(x)}{\left[\dfrac{x - x_2}{x - x_3}\right]\left[1 - \dfrac{R_{3,3}(x) - R_{2,2}(x)}{R_{3,3}(x)}\right] - 1} = 4 + \frac{4-8}{\left[\dfrac{4-3}{4-5}\right]\left[1 - \dfrac{4-8}{4}\right] - 1} = +\frac{16}{3} \\
R_{3,4}(x) &= R_{4,4}(x) + \frac{R_{4,4}(x) - R_{3,3}(x)}{\left[\dfrac{x - x_3}{x - x_4}\right]\left[1 - \dfrac{R_{4,4}(x) - R_{3,3}(x)}{R_{4,4}(x)}\right] - 1} = 2 + \frac{2-4}{\left[\dfrac{4-5}{4-8}\right]\left[1 - \dfrac{2-4}{4}\right] - 1} = +\frac{26}{5}
\end{aligned}
\right\} . \tag{3.2.68}
$$
The third generation will contain only two terms so that

$$
\left.
\begin{aligned}
R_{1,2,3}(x) &= R_{2,3}(x) + \frac{R_{2,3}(x) - R_{1,2}(x)}{\left[\dfrac{x - x_1}{x - x_3}\right]\left[1 - \dfrac{R_{2,3}(x) - R_{1,2}(x)}{R_{2,3}(x) - R_{2,2}(x)}\right] - 1} \\
R_{2,3,4}(x) &= R_{3,4}(x) + \frac{R_{3,4}(x) - R_{2,3}(x)}{\left[\dfrac{x - x_2}{x - x_4}\right]\left[1 - \dfrac{R_{3,4}(x) - R_{2,3}(x)}{R_{3,4}(x) - R_{3,3}(x)}\right] - 1}
\end{aligned}
\right\} . \tag{3.2.69}
$$

Finally the last generation will have the single result

$$
R_{1,2,3,4}(x) = R_{2,3,4}(x) + \frac{R_{2,3,4}(x) - R_{1,2,3}(x)}{\left[\dfrac{x - x_1}{x - x_4}\right]\left[1 - \dfrac{R_{2,3,4}(x) - R_{1,2,3}(x)}{R_{2,3,4}(x) - R_{2,3}(x)}\right] - 1} . \tag{3.2.70}
$$

We can summarize this process neatly in the form of a "difference" Table (similar to Table 3.2 and Table
4.2) below.

Note how the recursion process drives the successive 'generations' of R toward the final result. This
is a clear demonstration of the stability of this sort of scheme. It is this type of stability that makes the
method desirable for extrapolation. In addition, such recursive procedures are very easy to program and quite
fast in execution. The final result is given in equation (3.2.70), tabulated for comparison with other methods
in Table 3.3, and displayed in Figure 3.2. This result is the smallest of the six results listed indicating that the
rapid tabular variation of the middle four points has been minimized. However, it still compares favorably
with the second parabolic Lagrangian interpolation. While there is not a great deal of differentiation between
these methods for interpolation, there is for extrapolation. The use of quotient polynomials for extrapolation
is vastly superior to the use of polynomials, but one should always remember that one is basically after "free-
lunch" and that more sophisticated is not necessarily better. Generally, it is risky to extrapolate any function
far beyond one typical tabular spacing.

We have seen that the degree of the polynomial that is used for interpolation should be as low as
possible to avoid unrealistic rapid variation of the interpolative function. This notion of providing a general
"smoothness" to the function was also implicit in the choice of constraints for cubic splines. The constraints
at the interior tabular points guarantee continuity up through the second derivative of the interpolative
function throughout the full tabular range. The choice of Y"1 = Y"n = 0 that produces "natural" splines means
that the interpolative function will vary no faster than linearly near the endpoints. In general, when one has
to make an assumption concerning unspecified constants in an interpolation scheme, one chooses them so as
to provide a slowly varying function. The extension of this concept to more complicated interpolation
schemes is illustrated in the following highly successful interpolation algorithm.

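Table 3.4
Parameters for Quotient Polynomial Interpolation

| i | x | $Y_i$ | $R_{i,i}$ | $R_{i,i+1}$ | $R_{i,i+1,i+2}$ | $R_{i,i+1,i+2,i+3}$ |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | | | |
| | | | | 0 | | |
| 1 | 2 | 3 | 3 | | 0 | |
| | | | | -12 | | |
| 2 | 3 | 8 | 8 | | 6.5714 | 0 |
| | 4 | | | +16/3 | | 5.2147 |
| 3 | 5 | 4 | 4 | | 5.3043 | 0 |
| | | | | +26/5 | | |
| 4 | 8 | 2 | 2 | | 0 | |
| | | | | 0 | | |
| 5 | 10 | 1 | 0 | | | |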

One of the most commonly chosen polynomials to be used for interpolation is the parabola. It tends
not to vary rapidly and yet is slightly more sophisticated than linear interpolation. It will clearly require three
tabular points to specify the three arbitrary constants of the parabola. One is then confronted with the
problem of deciding in which of the two intervals between the tabular points the point to be interpolated should be
placed. A scheme that removes this problem while, far more importantly, providing a gently varying function
proceeds as follows: Use four points symmetrically placed about the point to be interpolated. But instead of
fitting a cubic to these four points, fit two parabolas, one utilizing the first three points and one utilizing the
last three points. At this point one exercises an artistic judgment. One may choose to use the parabola
that exhibits the least curvature (i.e. the smallest value of the quadratic coefficient).

However, one may combine both polynomials to form a single quadratic polynomial where the
contribution of each is weighted inversely by its curvature. Specifically, one could write this as
$$
{}^{2}_{k}\Phi_w(x) = \{a_k[{}^{2}_{k}\Phi(x)] + a_{k+1}[{}^{2}_{k+1}\Phi(x)]\}/(a_k + a_{k+1}) , \tag{3.2.71}
$$

where the $a_k$ are the inverse of the coefficient of the $x^2$ term of the two polynomials and are given by
$$
a_k = \left[\sum_{i=k}^{k+2} \frac{Y(x_i)}{\prod_{j \ne i}(x_i - x_j)}\right]^{-1} , \tag{3.2.72}
$$

and are just twice the inverse of the curvature of that polynomial. The ${}^{2}_{k}\Phi(x)$ are the Lagrange polynomials
of second degree and are
$$
{}^{2}_{k}\Phi(x) = \sum_{i=k}^{k+2} Y(x_i)L_i(x) . \tag{3.2.73}
$$
Since each of the ${}^{2}_{k}\Phi(x)$ will produce the value of Y(xi) when x = xi, it is clear that equation (3.2.71)
will produce the values of Y(x2) and Y(x3) at the points x2 and x3 adjacent to the interpolative point. The
functional behavior between these two points will reflect the curvature of both polynomials giving higher
weight to the flatter, or more slowly varying polynomial. This scheme was developed in the 1960s by
researchers at Harvard University who needed a fast and reliable interpolation scheme for the construction of
model stellar atmospheres. While the justification of this algorithm is strictly aesthetic, it has been found to
function well in a wide variety of situations. We may compare it to the other interpolation formulae by
applying it to the same data from tables 3.1 and 3.3 that we have used throughout this section. In developing
the parabolic Lagrangian formulae in section 3.1, we obtained the actual interpolative polynomials in
equations (3.2.15) and (3.2.16). By differentiating these expressions twice, we obtain the $a_k$ required by
equation (3.2.71) so that
$$
\left.
\begin{aligned}
a_1 &= 2[P''_1(4)]^{-1} = 3/7 \\
a_2 &= 2[P''_2(4)]^{-1} = 15/4
\end{aligned}
\right\} . \tag{3.2.74}
$$

Substitution of these values into equation (3.2.71) yields a weighted Lagrangian interpolated value of
$$
{}^{2}_{1,2}\Phi_w(4) = \{[3P_1(4)/7] + [15P_2(4)/4]\}/[(3/7)+(15/4)] = 6.000 . \tag{3.2.75}
$$

We have evaluated equation (3.2.75) by using the rational fraction values for P1(4) and P2(4) which are
identical to the interpolative values given in Table 3.1. The values for the relative weights given in equation
(3.2.74) show that the first parabola will only contribute about 15% to the final answer due to its rapid
variation. The more gently changing second parabola contributes the overwhelming majority of the final
result reflecting our aesthetic judgment that slowly varying functions are more plausible for interpolating
functions. The fact that the result is identical to the result for linear interpolation is a numerical accident.
Indeed, had round-off error not been a factor, it is likely that the result for the cubic splines would have also
been exactly 6. However, this coincidence points up a common truth: "more sophisticated is not necessarily
better".
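
The whole scheme occupies only a few lines of code. The Python sketch below fits the two parabolas to the four tabular points bracketing x = 4 and combines them as in equation (3.2.75), each parabola being weighted by the inverse of its own x² coefficient. The function names are assumptions of this illustration, and the use of the magnitude of each weight is also an assumption, made so as to reproduce the values 3/7 and 15/4 quoted in equation (3.2.74). Rational arithmetic is used to keep round-off out of the comparison.

```python
from fractions import Fraction


def parabola_value(xs, ys, x):
    """Second degree Lagrange polynomial through three points, evaluated at x."""
    total = Fraction(0)
    for i in range(3):
        term = Fraction(ys[i])
        for j in range(3):
            if j != i:
                term *= Fraction(x - xs[j], xs[i] - xs[j])
        total += term
    return total


def x2_coefficient(xs, ys):
    """Coefficient of x**2 of that parabola."""
    total = Fraction(0)
    for i in range(3):
        denom = 1
        for j in range(3):
            if j != i:
                denom *= xs[i] - xs[j]
        total += Fraction(ys[i], denom)
    return total


def weighted_parabola(xs, ys, x):
    """Combine the parabolas through the first and last three of four points.

    Three collinear points give a zero x**2 coefficient and are not handled.
    """
    p1 = parabola_value(xs[0:3], ys[0:3], x)
    p2 = parabola_value(xs[1:4], ys[1:4], x)
    a1 = abs(1 / x2_coefficient(xs[0:3], ys[0:3]))
    a2 = abs(1 / x2_coefficient(xs[1:4], ys[1:4]))
    return (a1 * p1 + a2 * p2) / (a1 + a2)


xs = [2, 3, 5, 8]
ys = [3, 8, 4, 2]
result = weighted_parabola(xs, ys, 4)
print(result)
```

In exact arithmetic the two parabolas give 25/3 and 86/15 at x = 4, and the weighted combination is exactly 6.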

Although slightly more complicated than quadratic Lagrangian interpolation, this scheme is rather
more stable against rapid variation and is certainly more sophisticated than linear interpolation. In my
opinion, its only real competition is the use of cubic splines and then only when the entire range of the table
is to be used as in curve fitting. Even here there is no clear distinction as to which produces the more
appropriate interpolative values, but an edge might be given to cubic splines on the basis of speed depending
on the table size and number of required interpolative values.

It is worth taking a last look at the results in Table 3.3. We used the accuracy implied by the tables
to provide a basis for the comparison of different interpolative methods. Indeed, some of the calculations
were carried out as rational fractions to eliminate round-off error as the possible source of the difference
between methods. The plausible values range from about 5.2 to 6.00. However, based on the tabular data,
there is no real reason to prefer one value over another. The appropriate choice should revolve around the
extent that one should expect an answer of a particular accuracy. None of the tabular data contain more than
two significant figures. There would have to be some compelling reason to include more in the final result.
Given the data spacing and the tabular variability, even two significant figures are difficult to justify. With
that in mind, one could argue persuasively that linear interpolation is really all that is justified by this
problem. This is an important lesson to be learned for it lies at the root of all numerical analysis. There is no
need to use numerical methods that are vastly superior to the basic data of the problem.

## 3.3 Orthogonal Polynomials

Before leaving this chapter on polynomials, it is appropriate that we discuss a special, but very important
class of polynomials known as the orthogonal polynomials. Orthogonal polynomials are defined in terms of
their behavior with respect to each other and throughout some predetermined range of the independent
variable. Therefore the orthogonality of a specific polynomial is not an important notion. Indeed, by itself
that statement does not make any sense. The notion of orthogonality implies the existence of something to
which the object in question is orthogonal. In the case of polynomials, that something happens to be other
polynomials. In section 1.3 we discussed the notion of orthogonality for vectors and found that for a set of
vectors to be orthogonal, no element of the set could be expressed in terms of the other members of the set.
This will also be true for orthogonal polynomials. In the case of vectors, if the set was complete it was said
to span a vector space and any vector in that space could be expressed as a linear combination of the
orthogonal basis vectors. Since the notion of orthogonality seems to hinge on two things being perpendicular
to each other, it seems reasonable to say that two functions f1(x) and f2(x) are orthogonal if they are
everywhere perpendicular to each other. If we imagine tangent vectors $\vec{t}_1(x)$ and $\vec{t}_2(x)$ defined at every
point of each function, then if
$$
\vec{t}_1(x) \cdot \vec{t}_2(x) = 0 \quad \forall x , \tag{3.3.1}
$$

one could conclude from equation (3.3.1) that f1(x) and f2(x) were mutually perpendicular at each value of x.
If one considers the range of x to represent an infinite dimension vector space with each value of x
representing a dimension so that the vectors $\vec{t}_1(x)$ represented basis vectors in that space, then orthogonality
could be expressed as
$$
\int_a^b t_1(x)\,t_2(x)\,dx = 0 . \tag{3.3.2}
$$

Thus, it is not unreasonable to generalize orthogonality of the functions themselves by
$$
\int_a^b f_i(x)\,f_j(x)\,dx = 0 , \quad i \ne j . \tag{3.3.3}
$$

Again, by analogy to the case of vectors and linear transformations discussed in chapter 1 we can define two
functions as being orthonormal if
$$
\int_a^b w(x)\,f_i(x)\,f_j(x)\,dx = \delta_{ij} . \tag{3.3.4}
$$

Here we have included an additional function w(x) which is called a weight function. Thus the proper
statement is that two functions are said to be orthonormal in the interval a ≤ x ≤ b, relative to a weight function
w(x), if they satisfy equation (3.3.4). In this section we shall consider the subset of functions known as
polynomials.

It is clear from equation (3.3.4) that orthonormal polynomials come in sets defined by the weight
function and range of x. These parameters provide for an infinite number of such sets, but we will discuss
only a few of the more important ones. While we will find it relatively easy to characterize the range of the
independent variable by three distinct categories, the conditions for the weight function are far less stringent.
Indeed the only constraint on w(x) is
$$
w(x) > 0 \quad \forall x , \; a \le x \le b . \tag{3.3.5}
$$

While one can find orthogonal functions for non-positive weight functions, it turns out that they are not
unique and therefore not well defined. Simply limiting the weight function to positive definite functions in
the interval a-b, still allows for an infinite number of such weight functions and hence an infinite number of
sets of orthogonal polynomials.

Let us begin our search for orthogonal polynomials by using the orthogonality conditions to see how
such polynomials can be generated. For simplicity, let us consider a finite interval from a to b. Now an
orthogonal polynomial φi(x) will be orthogonal to every member of the set of polynomials other than itself.
In addition, we will assume (it can be proven) that the polynomials will form a complete set so that any
polynomial can be generated from a linear combination of the orthogonal polynomials of the same degree or
less. Thus, if qi(x) is an arbitrary polynomial of degree i, we can write
$$
\int_a^b w(x)\,\phi_i(x)\,q_{i-1}(x)\,dx = 0 . \tag{3.3.6}
$$

Now let
$$
w(x)\phi_i(x) = \frac{d^i U_i(x)}{dx^i} \equiv U_i^{(i)}(x) . \tag{3.3.7}
$$

The function Ui(x) is called the generating function of the polynomials φi(x) and is itself a polynomial of
degree 2i so that the ith derivative $U_i^{(i)}$ is an ith degree polynomial. Now substitute equation (3.3.7) into
equation (3.3.6) and integrate by parts i times to get

$$
\int_a^b U_i^{(i)}(x)\,q_{i-1}(x)\,dx = 0 = \left[ U_i^{(i-1)}(x)\,q_{i-1}(x) - U_i^{(i-2)}(x)\,q'_{i-1}(x) + \cdots + (-1)^{i-1} U_i(x)\,q_{i-1}^{(i-1)}(x) \right]_a^b . \tag{3.3.8}
$$
Since $q_{i-1}(x)$ is an arbitrary polynomial each term in equation (3.3.8) must vanish separately so that
$$
\left.
\begin{aligned}
U_i(a) = U'_i(a) = \cdots = U_i^{(i-1)}(a) &= 0 \\
U_i(b) = U'_i(b) = \cdots = U_i^{(i-1)}(b) &= 0
\end{aligned}
\right\} . \tag{3.3.9}
$$
Since φi(x) is a polynomial of degree i we may differentiate it i+1 times to get
$$
\frac{d^{i+1}}{dx^{i+1}}\left[\frac{1}{w(x)}\frac{d^i U_i(x)}{dx^i}\right] = 0 . \tag{3.3.10}
$$
This constitutes a differential equation of order 2i+1 subject to the 2i boundary conditions given by equation
(3.3.9). The remaining condition required to uniquely specify the solution comes from the normalization
constant required to make the integral of $\phi_i^2(x)$ unity. So at this point we can leave Ui(x) uncertain by a scale
factor. Let us now turn to the solution of equation (3.3.10) subject to the boundary conditions given by
equation (3.3.9) for some specific weight functions w(x).

## a. The Legendre Polynomials

Let us begin by restricting the range to -1 ≤ x ≤ 1 and taking the simplest possible weight
function, namely
$$
w(x) = 1 , \tag{3.3.11}
$$
so that equation (3.3.10) becomes
$$
\frac{d^{2i+1}}{dx^{2i+1}}[U_i(x)] = 0 . \tag{3.3.12}
$$

Since Ui(x) is a polynomial of degree 2i, an obvious solution which satisfies the boundary conditions is

$$
U_i(x) = C_i(x^2-1)^i . \tag{3.3.13}
$$

Therefore the polynomials that satisfy the orthogonality conditions will be given by
$$
\phi_i(x) = C_i\frac{d^i(x^2-1)^i}{dx^i} . \tag{3.3.14}
$$
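If we apply the normalization criterion we get
$$
\int_{-1}^{+1}\phi_i^2(x)\,dx = 1 = C_i^2\int_{-1}^{+1}\left[\frac{d^i(x^2-1)^i}{dx^i}\right]^2 dx . \tag{3.3.15}
$$

It is conventional, however, to fix the constant by requiring φi(1) = 1 instead, so that
$$
C_i = [2^i i!]^{-1} . \tag{3.3.16}
$$

We call the polynomials with that constant and satisfying equation (3.3.14) the
Legendre polynomials and denote them by
$$
P_i(x) = [2^i i!]^{-1}\,d^i(x^2-1)^i/dx^i . \tag{3.3.17}
$$
With this convention the integral of $P_i^2(x)$ over the interval is 2/(2i+1) rather than unity, so the members of
unit norm are $[(2i+1)/2]^{1/2}P_i(x)$.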

One can use equation (3.3.17) to verify that these polynomials will satisfy the recurrence relation
$$
\left.
\begin{aligned}
P_{i+1}(x) &= \left[\frac{2i+1}{i+1}\right] x P_i(x) - \left[\frac{i}{i+1}\right] P_{i-1}(x) \\
P_0(x) &= 1 \\
P_1(x) &= x
\end{aligned}
\right\} . \tag{3.3.18}
$$
The set of orthogonal polynomials that covers the finite interval from -1 to +1 and whose members are
orthogonal relative to the weight function w(x) = 1 are clearly the simplest of the orthogonal polynomials.
One might be tempted to say that we have been unduly restrictive to limit ourselves to such a specific
interval, but such is not the case. We may transform equation (3.3.15) to any finite interval by means of a
linear transformation of the form
y(x) = x[(b-a)/2] +(a+b)/2 , (3.3.19)
so that we obtain an integral
$$
\left[\frac{2}{b-a}\right]\int_a^b \phi_i(y)\,\phi_j(y)\,dy = \delta_{ij} , \tag{3.3.20}
$$

that resembles equation (3.3.4). Thus the Legendre polynomials form an orthogonal set that spans any
finite interval relative to the unit weight function.
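
The recurrence relation (3.3.18) is the practical way to generate these polynomials. The Python sketch below builds their coefficients in exact rational arithmetic and integrates products of them over -1 ≤ x ≤ 1 term by term. The function names and the convention of storing coefficients with the lowest power first are assumptions made for this illustration.

```python
from fractions import Fraction


def legendre(n_max):
    """Coefficient lists (lowest power first) for P_0 ... P_n_max
    generated from the recurrence (3.3.18)."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for i in range(1, n_max):
        shifted = [Fraction(0)] + polys[i]                 # x * P_i
        older = polys[i - 1] + [Fraction(0)] * 2           # pad P_{i-1}
        nxt = [Fraction(2 * i + 1, i + 1) * s - Fraction(i, i + 1) * o
               for s, o in zip(shifted, older)]
        polys.append(nxt)
    return polys[:n_max + 1]


def inner_product(p, q):
    """Exact integral of p(x) q(x) over -1 <= x <= 1 with unit weight."""
    total = Fraction(0)
    for j, pj in enumerate(p):
        for k, qk in enumerate(q):
            if (j + k) % 2 == 0:        # odd powers integrate to zero
                total += pj * qk * Fraction(2, j + k + 1)
    return total


P = legendre(4)
print(P[4])                      # coefficients of (35x^4 - 30x^2 + 3)/8
print(inner_product(P[2], P[3])) # different members: 0
print(inner_product(P[3], P[3])) # same member: 2/(2*3 + 1)
```

The integral of the product of two different members vanishes, while the integral of the square of P3 is 2/7, the standard value 2/(2i+1) for these polynomials.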

## b. The Laguerre Polynomials

While we noted that the Legendre polynomials could be defined over any finite interval
since the linear transformation required to reach such an interval didn't affect the polynomials, we had earlier
mentioned that there are three distinct intervals that would have to be treated differently. Here we move to
the second of these - the semi-infinite interval where 0 ≤ x ≤ ∞. Clearly the limits of this interval cannot be
reached from any finite interval by a linear transformation. A non-linear transformation that would
accomplish that result would destroy the polynomial nature of any polynomials obtained in the finite interval.
In addition, we shall have to consider a weight function that asymptotically approaches zero as x → ∞, as any
polynomial in x will diverge, making it impossible to satisfy the normalization condition. Perhaps the
simplest weight function that will force a diverging polynomial to zero as x → ∞ is $e^{-\alpha x}$. Therefore our
orthogonal polynomials will take the form
$$
\phi_i(x) = e^{\alpha x}\frac{d^i U_i(x)}{dx^i} , \tag{3.3.21}
$$
where the generating function will satisfy the differential equation
$$
\frac{d^{i+1}}{dx^{i+1}}\left[e^{\alpha x}\frac{d^i U_i(x)}{dx^i}\right] = 0 , \tag{3.3.22}
$$
and be subject to the boundary conditions

$$
\left.
\begin{aligned}
U_i(0) = U'_i(0) = \cdots = U_i^{(i-1)}(0) &= 0 \\
U_i(\infty) = U'_i(\infty) = \cdots = U_i^{(i-1)}(\infty) &= 0
\end{aligned}
\right\} . \tag{3.3.23}
$$
When subjected to those boundary conditions, the general solution to equation (3.3.22) will be

$$
U_i(x) = C_i x^i e^{-\alpha x} , \tag{3.3.24}
$$

so that the polynomials can be obtained from
$$
\phi_i(x) = \frac{e^{\alpha x}}{i!}\frac{d^i(x^i e^{-\alpha x})}{dx^i} . \tag{3.3.25}
$$
If we set α = 1, then the resultant polynomials are called the Laguerre polynomials and when normalized
have the form
$$
L_i(x) = \frac{e^{x}}{i!}\frac{d^i(x^i e^{-x})}{dx^i} , \tag{3.3.26}
$$
and will satisfy the recurrence relation
$$
\left.
\begin{aligned}
L_{i+1}(x) &= \left[\frac{2i+1-x}{i+1}\right] L_i(x) - \left[\frac{i}{i+1}\right] L_{i-1}(x) \\
L_0(x) &= 1 \\
L_1(x) &= 1 - x
\end{aligned}
\right\} . \tag{3.3.27}
$$
These polynomials form an orthonormal set in the semi-infinite interval relative to the weight function $e^{-x}$.
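
Both statements can be checked exactly, since the integral of $x^k e^{-x}$ over the semi-infinite interval is simply k!. The Python sketch below generates the coefficients from the recurrence (3.3.27) and forms the weighted integrals from that result. The function names and the lowest-power-first storage of the coefficients are assumptions of this illustration.

```python
from fractions import Fraction
from math import factorial


def laguerre(n_max):
    """Coefficient lists (lowest power first) for L_0 ... L_n_max
    generated from the recurrence (3.3.27)."""
    polys = [[Fraction(1)], [Fraction(1), Fraction(-1)]]
    for i in range(1, n_max):
        cur, prev = polys[i], polys[i - 1]
        nxt = []
        for k in range(i + 2):
            c = Fraction(0)
            if k < len(cur):
                c += (2 * i + 1) * cur[k]
            if 1 <= k <= len(cur):
                c -= cur[k - 1]        # the -x L_i term raises each power by one
            if k < len(prev):
                c -= i * prev[k]
            nxt.append(c / (i + 1))
        polys.append(nxt)
    return polys[:n_max + 1]


def inner_product(p, q):
    """Exact integral of exp(-x) p(x) q(x) over 0 <= x < infinity,
    using the fact that the integral of x**k exp(-x) is k!."""
    return sum(pj * qk * factorial(j + k)
               for j, pj in enumerate(p)
               for k, qk in enumerate(q))


L = laguerre(4)
print(L[4])                       # 1 - 4x + 3x^2 - (2/3)x^3 + (1/24)x^4
print(inner_product(L[2], L[3]))  # different members: 0
print(inner_product(L[3], L[3]))  # same member: 1
```

The recurrence gives L4(x) = (24 - 96x + 72x² - 16x³ + x⁴)/24, and the weighted integrals come out as 0 for different members and 1 for the square of a member, as orthonormality requires.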

## c. The Hermite Polynomials

Clearly the remaining interval that cannot be reached from either a finite interval or semi-infinite
interval by means of a linear transformation is the full infinite interval -∞ ≤ x ≤ +∞. Again we will
need a weight function that will drive the polynomial to zero at both end points so that it must be symmetric
in x. Thus the weight function for the semi-infinite interval will not do. Instead, we pick the simplest
symmetric exponential $e^{-\alpha^2 x^2}$, which leads to polynomials of the form
$$
\phi_i(x) = e^{\alpha^2 x^2}\frac{d^i U_i(x)}{dx^i} , \tag{3.3.28}
$$
that satisfy the differential equation
$$
\frac{d^{i+1}}{dx^{i+1}}\left[e^{\alpha^2 x^2}\frac{d^i U_i(x)}{dx^i}\right] = 0 , \tag{3.3.29}
$$
subject to the boundary conditions
$$
U_i(\pm\infty) = U'_i(\pm\infty) = \cdots = U_i^{(i-1)}(\pm\infty) = 0 . \tag{3.3.30}
$$

This has a general solution satisfying the boundary conditions that looks like

89
Numerical Methods and Data Analysis

2x2
Ui(x) = Cie-α , (3.3.31)

which when normalized and with α = 1, leads to the Hermite polynomials that satisfy
$$H_i(x) = (-1)^i e^{x^2}\,\frac{d^i e^{-x^2}}{dx^i} . \tag{3.3.32}$$

Table 3.5
The First Five Members of the Common Orthogonal Polynomials

| i | Pi(x) | Li(x) | Hi(x) |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 1 | x | 1-x | 2x |
| 2 | (3x²-1)/2 | (2-4x+x²)/2 | 2(2x²-1) |
| 3 | x(5x²-3)/2 | (6-18x+9x²-x³)/6 | 4x(2x²-3) |
| 4 | (35x⁴-30x²+3)/8 | (24-96x+72x²-16x³+x⁴)/24 | 4(4x⁴-12x²+3) |

Like the other polynomials, the Hermite polynomials can be obtained from a recurrence relation. For the
Hermite polynomials that relation is
$$\left.\begin{aligned} H_{i+1}(x) &= 2xH_i(x) - 2iH_{i-1}(x) \\ H_0(x) &= 1 \\ H_1(x) &= 2x \end{aligned}\right\} . \tag{3.3.33}$$

We have now developed sets of orthonormal polynomials that span the three fundamental ranges of the real
variable. Many other polynomials can be developed which are orthogonal relative to other weight functions,
but these polynomials are the most important and they appear frequently in all aspects of science.
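
As a sketch of how the Hermite recurrence can be used in practice, the fragment below builds the coefficients of H_0 through H_n from H_{i+1} = 2xH_i − 2iH_{i−1}. The function name `hermite_coefficients` and the list-of-coefficients representation (lowest power first) are illustrative assumptions, not part of the text.

```python
def hermite_coefficients(n):
    """Return coefficient lists (lowest power first) for H_0 ... H_n."""
    polys = [[1], [0, 2]]                       # H_0 = 1, H_1 = 2x
    for i in range(1, n):
        h_i, h_prev = polys[i], polys[i - 1]
        nxt = [0] * (i + 2)
        for power, c in enumerate(h_i):         # 2x * H_i(x)
            nxt[power + 1] += 2 * c
        for power, c in enumerate(h_prev):      # -2i * H_{i-1}(x)
            nxt[power] -= 2 * i * c
        polys.append(nxt)
    return polys[: n + 1]


for i, coeffs in enumerate(hermite_coefficients(4)):
    print(i, coeffs)
```

Because the arithmetic is carried out on integers, the coefficients are exact; for i = 4 the recurrence gives 16x⁴ − 48x² + 12.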

## d. Additional Orthogonal Polynomials

There are as many additional orthogonal polynomials as there are positive definite weight
functions. Below we list some of those that are considered to be classical orthogonal polynomials as they
turn up frequently in mathematical physics. A little inspection of Table 3.6 shows that the Chebyschev
polynomials are special cases of the more general Gegenbauer or Jacobi polynomials. However, they turn up
sufficiently frequently that it is worth saying more about them. They can be derived from the generating
function in the same manner that the other orthogonal polynomials were, so we will only quote the results.
The Chebyschev polynomials of the first kind can be obtained from the reasonably simple trigonometric
formula
$$T_i(x) = \cos[i \cos^{-1}(x)] . \tag{3.3.34}$$

Table 3.6
Classical Orthogonal Polynomials of the Finite Interval

| Name | Weight Function w(x) |
|---|---|
| Legendre | 1 |
| Gegenbauer or Ultraspherical | $(1-x^2)^{\lambda-\frac{1}{2}}$ |
| Jacobi or Hypergeometric | $(1-x)^{\alpha}(1+x)^{\beta}$ |
| Chebyschev of the first kind | $(1-x^2)^{-\frac{1}{2}}$ |
| Chebyschev of the second kind | $(1-x^2)^{+\frac{1}{2}}$ |

However, in practice they are usually obtained from a recurrence formula similar to those for the other
polynomials. Specifically
$$\left.\begin{aligned} T_{i+1}(x) &= 2xT_i(x) - T_{i-1}(x) \\ T_0(x) &= 1 \\ T_1(x) &= x \end{aligned}\right\} . \tag{3.3.35}$$

The Chebyschev polynomials of the second kind are represented by the somewhat more complicated
trigonometric formula
$$V_i(x) = \frac{\sin[(i+1)\cos^{-1}(x)]}{\sin[\cos^{-1}(x)]} , \tag{3.3.36}$$

and obey the same recurrence formula as Chebyschev polynomials of the first kind so
$$\left.\begin{aligned} V_{i+1}(x) &= 2xV_i(x) - V_{i-1}(x) \\ V_0(x) &= 1 \\ V_1(x) &= 2x \end{aligned}\right\} . \tag{3.3.37}$$

Only the starting values are slightly different. Since they may be obtained from a more general class of
polynomials, we should not be surprised if there are relations between them. There are, and they take the
form
$$\left.\begin{aligned} T_i(x) &= V_i(x) - xV_{i-1}(x) \\ (1-x^2)V_{i-1}(x) &= xT_i(x) - T_{i+1}(x) \end{aligned}\right\} . \tag{3.3.38}$$
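
The trigonometric formulae, the recurrence relations, and the relations between the two kinds of Chebyschev polynomials can be checked against one another numerically. The sketch below is illustrative only; the function names `chebyshev_t` and `chebyshev_v` are assumptions introduced here.

```python
import math


def chebyshev_t(i, x):
    """First-kind polynomial T_i(x) from recurrence (3.3.35)."""
    prev, curr = 1.0, x                  # T_0 and T_1
    if i == 0:
        return prev
    for _ in range(i - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr


def chebyshev_v(i, x):
    """Second-kind polynomial V_i(x) from recurrence (3.3.37)."""
    prev, curr = 1.0, 2.0 * x            # V_0 and V_1
    if i == 0:
        return prev
    for _ in range(i - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr


x, i = 0.3, 5
theta = math.acos(x)
# Each line prints two quantities that should agree to rounding error.
print(chebyshev_t(i, x), math.cos(i * theta))                             # (3.3.34)
print(chebyshev_v(i, x), math.sin((i + 1) * theta) / math.sin(theta))     # (3.3.36)
print(chebyshev_t(i, x), chebyshev_v(i, x) - x * chebyshev_v(i - 1, x))   # (3.3.38)
```

The two functions differ only in their starting values, which mirrors the remark that both kinds obey the same recurrence formula.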

Since the orthogonal polynomials form a complete set enabling one to express an arbitrary
polynomial in terms of a linear combination of the elements of the set, they make excellent basis functions
for interpolation formulae. We shall see in later chapters that they provide a basis for curve fitting that
provides great numerical stability and ease of solution. In the next chapter, they will enable us to generate
formulae to evaluate integrals that yield great precision for a minimum of effort. The utility of these
functions is of central importance to numerical analysis. However, all of the polynomials that we have
discussed so far form orthogonal sets over a continuous range of x. Before we leave the subject of
orthogonality, let us consider a set of functions, which form a complete orthogonal set with respect to a
discrete set of points in the finite interval.

## e. The Orthogonality of the Trigonometric Functions

At the beginning of the chapter where we defined polynomials, we represented the most
general polynomial in terms of basis functions φi(x). Consider for a moment the case where

$$\phi_i(x) = \sin(i\pi x) . \tag{3.3.39}$$

Now integration by parts twice, recovering the initial integral but with a sign change, or perusal of any good
table of integrals⁴ will convince one that
$$\int_{-1}^{+1} \sin(k\pi x)\sin(j\pi x)\,dx = \int_{-1}^{+1} \cos(k\pi x)\cos(j\pi x)\,dx = \delta_{kj} . \tag{3.3.40}$$
Thus sines and cosines form orthogonal sets of functions of the real variable in the finite interval. This will
come as no surprise to the student with some familiarity with Fourier transforms and we will make much of
it in chapters to come. But what is less well known is that

$$\frac{1}{N}\sum_{x=0}^{2N-1} \sin(k\pi x/N)\sin(j\pi x/N) = \frac{1}{N}\sum_{x=0}^{2N-1} \cos(k\pi x/N)\cos(j\pi x/N) = \delta_{kj} , \quad 0 < (k+j) < 2N , \tag{3.3.41}$$

which implies that these functions also form an orthogonal set on the finite interval for a discrete set of
points. The proof of this result can be obtained in much the same way as the integral, but it requires some
knowledge of the finite difference calculus (see Hamming⁵, pages 44-45). We shall see that it is this discrete
orthogonality that allows for the development of Fourier series and the numerical methods for the calculation
of Power Spectra and "Fast Fourier Transforms". Thus the concept of orthogonal functions and polynomials
will play a role in much of what follows in this book.
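
The discrete orthogonality relation (3.3.41) is easy to check numerically. The following sketch is not from the text; the helper name `discrete_inner` and the choice N = 8 are assumptions made for illustration. It forms the sum over the 2N points x = 0, 1, ..., 2N−1 for a few pairs (k, j) that satisfy 0 < k + j < 2N.

```python
import math


def discrete_inner(kind, k, j, n):
    """(1/N) * sum over x = 0 .. 2N-1 of trig(k*pi*x/N) * trig(j*pi*x/N)."""
    trig = math.sin if kind == "sin" else math.cos
    total = sum(trig(k * math.pi * x / n) * trig(j * math.pi * x / n)
                for x in range(2 * n))
    return total / n


N = 8
for k, j in [(3, 3), (3, 5), (1, 6)]:
    print(k, j, round(discrete_inner("sin", k, j, N), 12),
          round(discrete_inner("cos", k, j, N), 12))
```

For k = j the sums come out as one, and for k ≠ j they vanish to rounding error, as equation (3.3.41) asserts.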

Chapter 3 Exercises
1. Find the roots of the following polynomial

a. by the Graffe Root-squaring method,
b. any iterative method,
c. then compare the accuracy of the two methods.

d. P(x) = +0.0021(x³+x) + 1.000000011x² + 0.000000011.

Comment on the accuracy of your solution.

3. Find Lagrangian interpolation formulae for the cases where the basis functions are

a. φi(x) = $e^{ix}$

b. φi(x) = sin(iπx/h) ,

where h is a constant interval spacing between the points xi.

4. Use the results from problem 3 to obtain values for f(x) at x=0.5, 0.9 and 10.3 in the
following table:

xi f(xi)
0.0 1.0
0.4 2.0
0.8 3.0
1.2 5.0
2.0 3.0
5.0 1.0
8.0 8.0 .

Compare with ordinary Lagrangian interpolation for the same degree polynomials and cubic
splines. Comment on the result.

5. Given the following table, approximate f(x) by

$$f(x) = \sum_{i=1}^{n} a_i \sin(ix) .$$

Determine the "best" value of n for fitting the table. Discuss your reasoning for making the
choice you made.
xi f(xi)
1.0 +.4546
2.0 -.3784
3.0 -.1397
4.0 +.4947
5.0 -.2720
6.0 -.2683
7.0 +.4953
8.0 -.1439

6. Find the normalization constants for

a. Hermite polynomials
b. Laguerre polynomials
c. Legendre polynomials that are defined in the interval -1→ +1.

7. Use the rules for the manipulation of determinants given in chapter 1 (page 8) to show how
the Vandermonde determinant takes the form given by equation (3.3.7)

8. In a manner similar to problem 7, show how the Lagrangian polynomials take the form
given by equation (3.2.9).

9. Explicitly show how equation (3.2.29) is obtained from equations (3.2.23), (3.2.24), and
(3.2.26).

10. Integrate equation (3.2.53) to obtain the tri-diagonal equations (3.2.54). Show explicitly
how the constraints of the derivatives of Yi enter into the problem.

11. By obtaining equation (3.3.18) from equation (3.3.17) show that one can obtain the
recurrence relations for orthogonal polynomials from the defining differential equation.

12. Find the generating function for Gegenbauer polynomials and obtain the recurrence relation
for them.

13. Show that equation (3.3.41) is indeed correct.

## Chapter 3 References and Supplemental Reading

1. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art of
Scientific Computing" (1986), Cambridge University Press, Cambridge, New York, New Rochelle,
Melbourne, Sydney.

2. Acton, Forman S., "Numerical Methods That Work", (1970) Harper and Row, New York.

3. Stoer, J. and Bulirsch, R., "Introduction to Numerical Analysis" (1980), Springer-Verlag, New
York, §2.2.

4. Gradshteyn, I.S. and Ryzhik,I.M., "Table of Integrals, Series, and Products : corrected and enlarged
edition" (1980), (ed. A. Jeffrey), Academic Press, New York, London, Toronto, Sydney, San
Francisco, pp 139-140.

5. Hamming, R.W., "Numerical Methods for Scientists and Engineers" (1962) McGraw-Hill Book Co.,
Inc., New York, San Francisco, Toronto, London.

For an excellent general discussion of polynomials one should read

6. Moursund, D.G., and Duris, C.S., "Elementary Theory and Applications of Numerical Analysis"
(1988) Dover Publications, Inc. New York, pp 108-140.

A very complete discussion of classical orthogonal polynomials can be found in

7. Bateman, H., The Bateman Manuscript Project, "Higher Transcendental Functions" (1954) Ed. A.
Erdélyi, Vol. 3, McGraw-Hill Book Co., Inc. New York, Toronto, London, pp 153-228.

4

Numerical Evaluation of
Derivatives and Integrals

• • •

The mathematics of the Greeks was insufficient to handle the
concept of time. Perhaps the clearest demonstration of this is Zeno's Paradox regarding the flight of arrows.
Zeno reasoned that since an arrow must cover half the distance between the bow and the target before
traveling all the distance and half of that distance (i.e. a quarter of the whole) before that, etc., that the total
number of steps the arrow must cover was infinite. Clearly the arrow could not accomplish that in a finite
amount of time so that its flight to the target was impossible. This notion of a limiting process of an
infinitesimal distance being crossed in an infinitesimal time producing a constant velocity seems obvious to
us now, but it was a fundamental barrier to the development of Greek science. The calculus developed in the
17th century by Newton and Leibnitz has permitted, not only a proper handling of time and the limiting
process, but the mathematical representation of the world of phenomena which science seeks to describe.
While the analytic representation of the calculus is essential in this description, ultimately we must
numerically evaluate the analytic expressions that we may develop in order to compare them with the real
world.

Again we confront a series of subjects about which books have been written and entire courses of
study developed. We cannot hope to provide an exhaustive survey of these areas of numerical analysis, but
only develop the basis for the approach to each. The differential and integral operators reviewed in chapter 1
appear in nearly all aspects of the scientific literature. They represent mathematical processes or operations
to be carried out on continuous functions and therefore can only be approximated by a series of discrete
numerical operations. So, as with any numerical method, we must establish criteria for which the discrete
operations will accurately represent the continuous operations of differentiation and integration. As in the
case of interpolation, we shall find the criteria in the realm of polynomial approximation.

## 4.1 Numerical Differentiation

Compared with other subjects to be covered in the study of numerical methods, little is usually taught about
numerical differentiation. Perhaps that is because the processes should be avoided whenever possible. The
reason for this can be seen in the nature of polynomials. As was pointed out in the last chapter on
interpolation, high degree polynomials tend to oscillate between the points of constraint. Since the derivative
of a polynomial is itself a polynomial, it too will oscillate between the points of constraint, but perhaps not
quite so wildly. To minimize this oscillation, one must use low degree polynomials which then tend to
reduce the accuracy of the approximation. Another way to see the dangers of numerical differentiation is to
consider the nature of the operator itself. Remember that

$$\frac{df(x)}{dx} = \lim_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x} . \tag{4.1.1}$$

Since there are always computational errors associated with the calculation of f(x), they will tend to be
present as ∆x → 0, while similar errors will not be present in the calculation of ∆x itself. Thus the ratio will
end up being largely determined by the computational error in f(x). Therefore numerical differentiation
should only be done if no other method for the solution of the problem can be found, and then only with
considerable circumspection.
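
The competition between truncation error and computational error described above can be seen in a short experiment. This is a sketch under stated assumptions: the test function sin(x), the point x0 = 1, and the helper name `forward_difference` are chosen here for illustration and do not come from the text.

```python
import math


def forward_difference(f, x, dx):
    """Finite-step version of the limit in equation (4.1.1)."""
    return (f(x + dx) - f(x)) / dx


x0 = 1.0
exact = math.cos(x0)                      # derivative of sin(x) at x0
for exponent in range(1, 16, 2):
    dx = 10.0 ** (-exponent)
    error = abs(forward_difference(math.sin, x0, dx) - exact)
    print(f"dx = 1e-{exponent:02d}   |error| = {error:.3e}")
```

In double precision arithmetic the error typically falls as ∆x shrinks, reaches a minimum at some intermediate ∆x, and then grows again as the round-off in f(x) comes to dominate the ratio.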

## a. Classical Difference Formulae

With these caveats clearly in mind, let us develop the formalisms for numerically
differentiating a function f(x). We have to approximate the continuous operator with a finite operator and the
finite difference operators described in chapter 1 are the obvious choice. Specifically, let us take the finite
difference operator to be defined as it was in equation (1.5.1). Then we may approximate the derivative of a
function f(x) by
$$\frac{df(x)}{dx} = \frac{\Delta f(x)}{\Delta x} . \tag{4.1.2}$$
The finite difference operators are linear so that repeated operations with the operator lead to

$$\Delta^n f(x) = \Delta[\Delta^{n-1} f(x)] . \tag{4.1.3}$$

This leads to the Fundamental Theorem of the Finite Difference Calculus which is

The nth difference of a polynomial of degree n is a constant ($a_n n!\,h^n$), and the (n+1)st
difference is zero.

Clearly the extent to which equation (4.1.3) is satisfied will depend partly on the value of h. Also the ability
to repeat the finite difference operation will depend on the amount of information available. To find a non-
trivial nth order finite difference will require that the function be approximated by an nth degree polynomial
which has n+1 linearly independent coefficients. Thus one will have to have knowledge of the function for at
least n+1 points. For example, if one were to calculate finite differences for the function x² at a finite set of
points xi, then one could construct a finite difference table of the form:

Table 4.1
A Typical Finite Difference Table for f(x) = x²

| xi | f(xi) | ∆f(xi) | ∆²f(xi) | ∆³f(xi) |
|---|---|---|---|---|
| 2 | 4 | 5 | 2 | 0 |
| 3 | 9 | 7 | 2 | 0 |
| 4 | 16 | 9 | 2 | |
| 5 | 25 | 11 | | |
| 6 | 36 | | | |

This table nicely demonstrates the fundamental theorem of the finite difference calculus while pointing out
an additional problem with repeated differences. While we have chosen f(x) to be a polynomial so that the
differences are exact and the fundamental theorem of the finite difference calculus is satisfied exactly, one
can imagine the situation that would prevail should f(x) only approximately be a polynomial. The truncation
error that arises from the approximation would be quite significant for ∆f(xi) and compounded for ∆2f(xi).
The propagation of the truncation error gets progressively worse as one proceeds to higher and higher
differences. The table illustrates an additional problem with finite differences. Consider the values of ∆f(xi).
They are not equal to the values of the derivative at xi implied by the definition of the forward difference
operator at which they are meant to apply. For example ∆f(3)=7 and with h=1 for this table would suggest
that f '(3)=7, but simple differentiation of the polynomial will show that f '(3)=6. One might think that this
could be corrected by averaging ∆f (2) and ∆f (3), or by re-defining the difference operator so that it didn't
always refer backward. Such an operator is known as the central difference operator which is defined as

$$\delta f(x) \equiv f(x+\tfrac{1}{2}h) - f(x-\tfrac{1}{2}h) . \tag{4.1.4}$$

However, this does not remove the problem that the value of the nth difference, being derived from
information spanning a large range in the independent variable, may not refer to the nth derivative at the
point specified by the difference operator.
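
A difference table such as Table 4.1 can be generated mechanically. The sketch below is an illustration only (the function name `difference_table` is an assumption); it reproduces the columns for f(x) = x² and contrasts the forward difference at x = 3 with the central difference of equation (4.1.4).

```python
def difference_table(values):
    """Return [f, Δf, Δ²f, ...] as successive forward-difference columns."""
    columns = [list(values)]
    while len(columns[-1]) > 1:
        prev = columns[-1]
        columns.append([b - a for a, b in zip(prev, prev[1:])])
    return columns


xs = [2, 3, 4, 5, 6]
table = difference_table([x * x for x in xs])
for order, column in enumerate(table):
    print(order, column)

# With h = 1 the forward difference at x = 3 is 7, while the central
# difference of equation (4.1.4), f(3.5) - f(2.5), reproduces f'(3) = 6.
f = lambda x: x * x
print(table[1][1], f(3.5) - f(2.5))
```

The central difference happens to be exact here because f(x) is a quadratic; for a general function it is only a better centered estimate.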

In Chapter 1 we mentioned other finite difference operators, specifically the shift operator E and the
identity operator I (see equation 1.5.3). We may use these operators and the relation between them given by
equation (1.5.4), and the binomial theorem to see that
$$\Delta^k[f(x)] = [E - I]^k[f(x)] = \sum_{i=0}^{k} (-1)^{k-i}\binom{k}{i}E^i[f(x)] = \sum_{i=0}^{k} (-1)^{k-i}\binom{k}{i}f(x+i) , \tag{4.1.5}$$
where $\binom{k}{i}$ is the binomial coefficient which can be written as
$$\binom{k}{i} = \frac{k!}{(k-i)!\,i!} . \tag{4.1.6}$$
One can use equation (4.1.5) to find the kth difference for equally spaced data without constructing the entire
difference table for the function. If a specific value of f(xj) is missing from the table, and one assumes that
the function can be represented by a polynomial of degree k-1, then, since ∆kf (xi) = 0, equation (4.1.5) can
be solved for the missing value of f(xj).
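
Both uses of equation (4.1.5) can be sketched in a few lines. The code below is illustrative and rests on assumptions: it takes the data to be equally spaced, uses the sign factor (−1)^(k−i) that follows from the binomial expansion of (E − I)^k, and introduces the hypothetical helper names `kth_difference` and `fill_missing`.

```python
from math import comb


def kth_difference(values, k, start=0):
    """Δ^k f at index `start` via the binomial sum of equation (4.1.5)."""
    return sum((-1) ** (k - i) * comb(k, i) * values[start + i]
               for i in range(k + 1))


def fill_missing(values, missing, k):
    """Solve Δ^k f = 0 for the one entry of `values` that is None.

    Assumes the k+1 points starting at index 0 include the missing one
    and that the data follow a polynomial of degree k-1.
    """
    known = sum((-1) ** (k - i) * comb(k, i) * values[i]
                for i in range(k + 1) if i != missing)
    return -known / ((-1) ** (k - missing) * comb(k, missing))


squares = [4, 9, 16, 25, 36]              # f(x) = x^2 at x = 2 ... 6
print(kth_difference(squares, 2), kth_difference(squares, 3))
print(fill_missing([4, 9, None, 25], missing=2, k=3))
```

The second and third differences agree with a hand-built difference table for x², and the missing tabular entry at x = 4 is recovered from the condition that the third difference of a quadratic vanishes.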

While equation (4.1.5) can be used to find the differences of any equally spaced function f(xi) and
hence is an estimate of the kth derivative, the procedure is equivalent to finding the value of a polynomial of
degree n-k at a specific value of xi. Therefore, we may use any interpolation formula to obtain an expression
for the derivative at some specific point by differentiation of the appropriate formula. If we do this for
Lagrangian interpolation, we obtain
$$\Phi'(x) = \sum_{i=1}^{n} f(x_i)L_i'(x) , \tag{4.1.7}$$

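where, by applying the product rule to the Lagrange polynomial $L_i(x)$,
$$L_i'(x) = \sum_{\substack{k=1 \\ k \ne i}}^{n} \frac{1}{(x_i - x_k)} \prod_{\substack{j=1 \\ j \ne i,\; j \ne k}}^{n} \frac{(x - x_j)}{(x_i - x_j)} . \tag{4.1.8}$$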

Higher order formulae can be derived by successive differentiation, but one must always use numerical
differentiation with great care.
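
Differentiating the Lagrange interpolation formula can be sketched as follows. This is an illustrative implementation, not the author's: the function name `lagrange_derivative` is an assumption, and the derivative of each L_i(x) is formed with the product rule, so that differentiating the k-th factor leaves 1/(x_i − x_k) multiplied by the remaining factors.

```python
def lagrange_derivative(xs, fs, x):
    """Derivative of the Lagrange interpolating polynomial at x."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        dLi = 0.0
        for k in range(n):
            if k == i:
                continue
            # Differentiating the k-th factor leaves 1/(x_i - x_k) ...
            term = 1.0 / (xs[i] - xs[k])
            # ... times all the remaining factors of L_i(x).
            for j in range(n):
                if j != i and j != k:
                    term *= (x - xs[j]) / (xs[i] - xs[j])
            dLi += term
        total += fs[i] * dLi
    return total


xs = [2.0, 3.0, 4.0, 5.0]
fs = [x ** 3 for x in xs]
print(lagrange_derivative(xs, fs, 3.5))    # derivative of x^3 is 3x^2
```

Four points determine a cubic uniquely, so for f(x) = x³ the result agrees with 3x² to rounding error; for a general function the same caveats about numerical differentiation apply.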

## b. Richardson Extrapolation for Derivatives

We will now consider a "clever trick" that enables the improvement of nearly all formulae
that we have discussed so far in this book and a number yet to come. It is known as Richardson
extrapolation, but differs from what is usually meant by extrapolation. In chapter 3 we described
extrapolation in terms of extending some approximation formula beyond the range of the data which
constrained that formula. Here we use it to describe a process that attempts to approximate the results of any
difference or difference-based formula in the limit where the spacing h approaches zero. Since h is usually a
small number, the extension, or extrapolation, to zero doesn't seem so unreasonable. Indeed, it may not seem
very important, but remember the limit of the accuracy on nearly all approximation formulae is set by the
influence of round-off error in the case where an approximating interval becomes small. This will be
particularly true for problems of the numerical solution of differential equations discussed in the next
chapter. However, we can develop and use it here to obtain expressions for derivatives that have greater
accuracy and are obtained with greater efficiency than the classical difference formulae. Let us consider the
special case where a function f(x) can be represented by a Taylor series so that if
x = x0 + kh , (4.1.9)
then
$$f(x_0+kh) = f(x_0) + khf'(x_0) + \frac{(kh)^2 f''(x_0)}{2!} + \frac{(kh)^3 f^{(3)}(x_0)}{3!} + \cdots + \frac{(kh)^n f^{(n)}(x_0)}{n!} . \tag{4.1.10}$$
Now let us make use of the fact that the terms of equation (4.1.10) alternate between even and odd powers of h. Thus if
we subtract the Taylor series for -k from the one for +k, the terms with even powers of h will vanish leaving
$$f(x_0+kh) - f(x_0-kh) = 2khf'(x_0) + \frac{2(kh)^3 f^{(3)}(x_0)}{3!} + \cdots + \frac{2(kh)^{2n+1} f^{(2n+1)}(x_0)}{(2n+1)!} . \tag{4.1.11}$$
The functional relationship on the left hand side of equation (4.1.11) is considered to be some mathematical
function whose value is precisely known, while the right hand side is the approximate relationship for that
function. That relationship now only involves odd powers of h so that it converges much faster than the
original Taylor series. Now evaluate equation (4.1.11) for k = 1 and 2 explicitly keeping just the first two
terms on the right hand side so that
$$\left.\begin{aligned} f(x_0+h) - f(x_0-h) &= 2hf'(x_0) + 2h^3 f^{(3)}(x_0)/6 + \cdots + R(h^5) \\ f(x_0+2h) - f(x_0-2h) &= 4hf'(x_0) + 16h^3 f^{(3)}(x_0)/6 + \cdots + \tilde{R}(h^5) \end{aligned}\right\} . \tag{4.1.12}$$
We now have two equations from which the term involving the third derivative may be eliminated yielding
$$f(x_0-2h) - 8f(x_0-h) + 8f(x_0+h) - f(x_0+2h) = 12hf'(x_0) + 8R(h^5) - \tilde{R}(h^5) , \tag{4.1.13}$$
and solving for f'(x0) we get
$$f'(x_0) = \frac{f(x_0-2h) - 8f(x_0-h) + 8f(x_0+h) - f(x_0+2h)}{12h} + O(h^4) . \tag{4.1.14}$$

It is not hard to show that the error term in equation (4.1.13) divided by h is O(h4). Thus we have an
expression for the derivative of the function f(x) evaluated at some value of x = x0 which requires four values
of the function and is exact for cubic polynomials. This is not too surprising as we have four free parameters
with which to fit a Taylor series or alternately a cubic polynomial and such polynomials will be unique.
What is surprising is the rapid rate of convergence with decreasing interval h. But what is even more
amazing is that this method can be generalized to any approximation formulae that can be written as
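$$\left.\begin{aligned} f(x) &= \Phi(x,\alpha h) + C(\alpha h)^n + O(h^m) \\ & m > n , \quad \alpha > 0 , \quad \alpha \ne 1 \end{aligned}\right\} , \tag{4.1.15}$$
so that, on eliminating C between this expression and the same formula evaluated with spacing h,
$$f(x) = \frac{\alpha^n \Phi(x,h) - \Phi(x,\alpha h)}{\alpha^n - 1} + O(h^m) . \tag{4.1.16}$$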

Indeed, it could be used to obtain an even higher order approximation for the derivative utilizing more
tabular points. We shall revisit this method when we consider the solution to differential equations in
Chapter 5.
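
The derivative formula (4.1.14) and the general extrapolation (4.1.16) can be compared directly. The sketch below is illustrative: the function names are assumptions, the test function is sin(x), and the general formula is applied with α = 2 and n = 2 to the two-point central difference, whose leading error is proportional to h².

```python
import math


def central_difference(f, x, h):
    """Two-point central estimate of f'(x); its leading error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def five_point(f, x, h):
    """Equation (4.1.14): four function values, error O(h^4)."""
    return (f(x - 2 * h) - 8 * f(x - h) + 8 * f(x + h) - f(x + 2 * h)) / (12.0 * h)


def richardson(phi, h, alpha, n):
    """Equation (4.1.16): combine phi(h) and phi(alpha*h) to cancel C*h^n."""
    return (alpha ** n * phi(h) - phi(alpha * h)) / (alpha ** n - 1.0)


x0, h = 1.0, 0.1
exact = math.cos(x0)
phi = lambda step: central_difference(math.sin, x0, step)
print(abs(phi(h) - exact))                           # error of the O(h^2) formula
print(abs(five_point(math.sin, x0, h) - exact))      # error of equation (4.1.14)
print(abs(richardson(phi, h, 2.0, 2) - exact))       # same result via (4.1.16)
```

With these choices the extrapolated value is algebraically identical to equation (4.1.14), which shows that the four-point derivative formula is one instance of the general procedure.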

## 4.2 Numerical Evaluation of Integrals: Quadrature

While the term quadrature is an old one, it is the correct term to use for describing the numerical evaluation
of integrals. The term numerical integration should be reserved for describing the numerical solution of
differential equations (see chapter 5). There is a genuine necessity for the distinction because the very nature
of the two problems is quite different. Numerically evaluating an integral is a rather common and usually
stable task. One is basically assembling a single number from a series of independent evaluations of a
function. Unlike numerical differentiation, numerical quadrature tends to average out random computational
errors.

Because of the inherent stability of numerical quadrature, students are generally taught only the
simplest of techniques and thereby fail to learn the more sophisticated, highly efficient techniques that can be
so important when the integrand of the integral is extremely complicated or occasionally the result of a
separate lengthy study. Virtually all numerical quadrature schemes are based on the notion of polynomial
approximation. Specifically, the quadrature scheme will give the exact value of the integral if the integrand is
a polynomial of some degree n. The scheme is then said to have a degree of precision equal to n. In general,
since an nth degree polynomial has n+1 linearly independent coefficients, a quadrature scheme will have to
have n+1 adjustable parameters in order to accurately represent the polynomial and its integral.
Occasionally, one comes across a quadrature scheme that has a degree of precision that is greater than the
number of adjustable parameters. Such a scheme is said to be hyper-efficient and there are a number of such
schemes known for multiple integrals. For single, or one dimensional, integrals, there is only one which we
will discuss later.

## a. The Trapezoid Rule

The notion of evaluating an integral is basically the notion of evaluating a sum. After all the
integral sign ∫ is a stylized S that stands for a continuous "sum". The symbol Σ as introduced in equation
(1.5.2) stands for a discrete or finite sum, which, if the interval is taken small enough, will approximate the
value for the integral. Such is the motivation for the Trapezoid rule which can be stated as
$$\int_a^b f(x)\,dx = \sum_{i=1}^{n-1} \frac{f(x_{i+1}) + f(x_i)}{2}\,\Delta x_i . \tag{4.2.1}$$
The formula takes the form of the sum of a discrete set of average values of the function each of which is
multiplied by some sort of weight Wi. Here the weights play the role of the adjustable parameters of the
quadrature formula and in the case of the trapezoid rule the weights are simply the intervals between
functional evaluations. A graphical representation of this can be seen below in Figure 4.1.

The meaning of the rule expressed by equation (4.2.1) is that the integral is approximated by a series
of trapezoids whose upper boundaries in the interval ∆xi are straight lines. In each interval this formula
would have a degree of precision equal to 1 (i.e. equal to the number of free parameters in the interval minus
one). The other "adjustable" parameter is the 2 used in obtaining the average of f(xi) in the interval. If we
divide the interval a → b equally then the ∆xi's have the particularly simple form

$$\Delta x_i = (b-a)/(n-1) . \tag{4.2.2}$$

In Chapter 3, we showed that the polynomic form of the integrand of an integral was unaffected by a linear
transformation [see equations (3.3.19) and (3.3.20)]. Therefore, we can rewrite equation (4.2.1) as
$$\int_a^b f(x)\,dx = \frac{(b-a)}{2}\int_{-1}^{+1} f(y)\,dy = \frac{(b-a)}{2}\sum_{i=1}^{n-1} \frac{f[x(y_{i+1})] + f[x(y_i)]}{2}\,W'_i , \tag{4.2.3}$$

where the weights for an equally spaced interval are
$$W'_i = 2/(n-1) . \tag{4.2.4}$$
If we absorb the factor of (b-a)/2 into the weights we see that for both representations of the integral [i.e.
equation (4.2.1) and equation (4.2.3)] we get
$$\sum_{i=1}^{n} W_i = b - a . \tag{4.2.5}$$

Notice that the function f(x) plays absolutely no role in determining the weights so that once they are
determined, they can be used for the quadrature of any function. Since any quadrature formula that is exact
for polynomials of some degree greater than zero must be exact for f(x) = x⁰, the sum of the weights of any
quadrature scheme must be equal to the total interval for which the formula holds.

Figure 4.1 shows a function whose integral from a to b is being evaluated by the trapezoid
rule. In each interval ∆xi a straight line approximates the function.
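
The trapezoid rule of equation (4.2.1) with the equal spacing of equation (4.2.2) can be written in a few lines. This is a minimal sketch; the function name `trapezoid` and the two test integrands are assumptions chosen for illustration.

```python
def trapezoid(f, a, b, n):
    """Equation (4.2.1) with n equally spaced points (n - 1 intervals)."""
    dx = (b - a) / (n - 1)                      # equation (4.2.2)
    xs = [a + i * dx for i in range(n)]
    return sum(0.5 * (f(xs[i + 1]) + f(xs[i])) * dx for i in range(n - 1))


print(trapezoid(lambda x: 3.0 * x + 1.0, 0.0, 2.0, 5))   # linear integrand
print(trapezoid(lambda x: x * x, 0.0, 2.0, 5))           # quadratic integrand
```

The linear integrand is integrated exactly (the integral is 8), in keeping with a degree of precision of 1, while the quadratic gives 2.75 rather than 8/3.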

## b. Simpson's Rule

The trapezoid rule has a degree of precision of 1 as it fits straight lines to the function in the
interval. It would seem that we should be able to do better than this by fitting a higher order polynomial to
the function. So instead of using the functional values at the endpoints of the interval to represent the
function by a straight line, let us try three equally spaced points. That should allow us to fit a polynomial
with three adjustable parameters (i.e. a parabola) and obtain a quadrature formula with a degree of precision
of 2. However, we shall see that this quadrature formula actually has a degree of precision of 3 making it a
hyper-efficient quadrature formula and the only one known for integrals in one dimension.

In general, we can construct a quadrature formula from an interpolation formula by direct
integration. In chapter 3 we developed interpolation formulae that were exact for polynomials of an arbitrary
degree n. One of the more general forms of these interpolation formulae was the Lagrange interpolation
formula given by equation (3.2.8). In that equation Φ(x) was a polynomial of degree n and was made up of a
linear combination of the Lagrange polynomials Li(x). Since we are interested in using three equally spaced
points, n will be 2. Also, we have seen that any finite interval is equivalent to any other for the purposes of
fitting polynomials, so let us take the interval to be 2h so that our formula will take the form
$$\int_0^{2h} f(x)\,dx = \sum_{i=0}^{2} f(x_i)W_i = \sum_{i=0}^{2} f(x_i)\int_0^{2h} L_i(x)\,dx . \tag{4.2.6}$$

Here we see that the quadrature weights Wi are given by

$$W_i = \int_0^{2h} L_i(x)\,dx = \int_0^{2h} \prod_{\substack{j=0 \\ j \ne i}}^{2} \frac{(x - x_j)}{(x_i - x_j)}\,dx . \tag{4.2.7}$$

Now the three equally spaced points in the interval 2h will have x = 0, h, and 2h. For equal intervals we can
use equation (3.2.11) to evaluate the Lagrange polynomials to get
$$\left.\begin{aligned} L_0(x) &= \frac{(x-h)(x-2h)}{2h^2} = \frac{(x^2 - 3xh + 2h^2)}{2h^2} \\ L_1(x) &= \frac{(x-0)(x-2h)}{-h^2} = -\frac{(x^2 - 2xh)}{h^2} \\ L_2(x) &= \frac{(x-0)(x-h)}{2h^2} = \frac{(x^2 - xh)}{2h^2} \end{aligned}\right\} . \tag{4.2.8}$$
Therefore the weights for Simpson's rule become
$$\left.\begin{aligned} W_0 &= \int_0^{2h} L_0(x)\,dx = \frac{(8h^3/3 - 12h^3/2 + 4h^3)}{2h^2} = \frac{h}{3} \\ W_1 &= \int_0^{2h} L_1(x)\,dx = -\frac{(8h^3/3 - 8h^3/2)}{h^2} = \frac{4h}{3} \\ W_2 &= \int_0^{2h} L_2(x)\,dx = \frac{(8h^3/3 - 4h^3/2)}{2h^2} = \frac{h}{3} \end{aligned}\right\} . \tag{4.2.9}$$

Actually we need only to have calculated two of the weights since we know that the sum of the weights had
to be 2h. Now since h is only half the interval we can write

h = ∆x/2 , (4.2.10)

so that the approximation formula for Simpson's quadrature becomes

$$\int_0^{\Delta x} f(x)\,dx = \sum_{i=0}^{2} f(x_i)W_i = \frac{\Delta x}{6}\left[f(x_0) + 4f(x_1) + f(x_2)\right] . \tag{4.2.11}$$

Now let us confirm the assertion that Simpson's rule is hyper-efficient. We know that the quadrature
formula will yield exact answers for quadratic polynomials, so consider the evaluation of a quartic. We pick
the extra power of x in anticipation of the result. Thus we can write
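$$\begin{aligned} \int_0^{\Delta x} (\alpha x^3 + \beta x^4)\,dx = \frac{\alpha\,\Delta x^4}{4} + \frac{\beta\,\Delta x^5}{5} &= \frac{\Delta x}{6}\left(4\alpha\left[\frac{\Delta x}{2}\right]^3 + \alpha(\Delta x)^3 + 4\beta\left[\frac{\Delta x}{2}\right]^4 + \beta(\Delta x)^4\right) + R(\Delta x) \\ &= \frac{\alpha(\Delta x)^4}{4} + \frac{5\beta(\Delta x)^5}{24} + R(\Delta x) . \end{aligned} \tag{4.2.12}$$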
Here R(∆x) is the error term for the quadrature formula. Completing the algebra in equation (4.2.12) we get
$$R(\Delta x) = -\beta(\Delta x)^5/120 . \tag{4.2.13}$$
Clearly the error in the integral goes as the interval to the fifth power and not the fourth power. So the
quadrature formula will have no error for cubic terms in the integrand and the formula is indeed hyper-
efficient. Therefore Simpson's rule is a surprisingly good quadrature scheme having a degree of precision of
3 over the interval ∆x. Should one wish to span a larger interval (or reduce the spacing for a given interval),
one could write
$$\int_0^{n\Delta x} f(x)\,dx = \sum_{i=1}^{n} \int_{(i-1)\Delta x}^{i\Delta x} f(x)\,dx = \frac{\Delta x}{6}\left[f(x_1) + 4f(x_2) + 2f(x_3) + 4f(x_4) + \cdots + 4f(x_{n-1}) + f(x_n)\right] . \tag{4.2.14}$$

By breaking the integral up into sub-intervals, the function need only be well approximated locally
by a cubic. Indeed, the function need not even be continuous across the separate boundaries of the sub-
intervals. This form of Simpson's rule is sometimes called a running Simpson's rule and is quite easy to
implement on a computer. The hyper-efficiency of this quadrature scheme makes this a good "all purpose"
equal interval quadrature algorithm.
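
A running Simpson's rule can be sketched as a sum of three-point panels. The code below is illustrative rather than definitive: the names `simpson_panel` and `running_simpson` are assumptions, each panel of width ∆x is evaluated at its two ends and its midpoint as in equation (4.2.11), and the panels are taken to be of equal width.

```python
def simpson_panel(f, a, b):
    """Equation (4.2.11) applied to a single interval of width b - a."""
    dx = b - a
    return dx / 6.0 * (f(a) + 4.0 * f(a + 0.5 * dx) + f(b))


def running_simpson(f, a, b, panels):
    """Sum of Simpson panels over `panels` equal sub-intervals of [a, b]."""
    dx = (b - a) / panels
    return sum(simpson_panel(f, a + i * dx, a + (i + 1) * dx)
               for i in range(panels))


# A cubic is integrated without error on a single panel (exact value 1/4).
print(simpson_panel(lambda x: x ** 3, 0.0, 1.0))
# For x^4 on a unit interval the error has magnitude 1/120, cf. (4.2.13).
print(abs(1.0 / 5.0 - simpson_panel(lambda x: x ** 4, 0.0, 1.0)), 1.0 / 120.0)
# Splitting the interval into 8 panels reduces the quartic error sharply.
print(abs(1.0 / 5.0 - running_simpson(lambda x: x ** 4, 0.0, 1.0, 8)))
```

Evaluating the shared panel boundaries twice keeps the sketch short; a production version would reuse those function values, which is what produces the 1, 4, 2, 4, ..., 4, 1 pattern of coefficients.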

As we saw above, it is possible to obtain a quadrature formula from an interpolation
formula and maintain the same degree of precision as the interpolation formula. This provides the basis for
obtaining quadrature formula for functions that are specified at arbitrarily spaced values of the independent
variable xi. For example, simply evaluating equation (4.2.6) for an arbitrary interval yields
$$\int_a^b f(x)\,dx = \sum_{i=0}^{n} f(x_i)\int_a^b L_i(x)\,dx , \tag{4.2.15}$$

which means that the weights associated with the arbitrarily spaced points xi are
$$W_i = \int_a^b L_i(x)\,dx . \tag{4.2.16}$$

However, the analytic integration of Li(x) can become tedious when n becomes large so we give an
alternative strategy for obtaining the weights for such a quadrature scheme. Remember that the scheme is to
have a degree of precision of n so that it must give the exact answers for any polynomial of degree n. But
there can only be one set of weights, so we specify the conditions that must be met for a set of polynomials
for which we know the answer, namely $x^i$. Therefore we can write

$$\int_a^b x^i\,dx = \frac{b^{i+1} - a^{i+1}}{i+1} = \sum_{j=0}^{n} x_j^i W_j , \quad i = 0 \cdots n . \tag{4.2.17}$$

The integral on the left is easily evaluated to yield the center term which must be equal to the sum on the
right if the formula is to have the required degree of precision n. Equations (4.2.17) represent n+1 linear
equations in the n+1 weights Wi. Since we have already discussed the solution of linear equations in some
detail in chapter 2, we can consider the problem of finding the weights to be solved.

While the spacing of the points given in equations (4.2.17) is completely arbitrary, we can use these
equations to determine the weights for Simpson's rule as an example. Assume that we are to evaluate an
integral in the interval 0 → 2h. Then the equations (4.2.17) for the weights would be

$$\int_0^{2h} x^i\,dx = \frac{(2h)^{i+1}}{i+1} = \sum_{j=0}^{n} x_j^{\,i}\,W_j , \quad i = 0 \cdots n . \tag{4.2.18}$$

For $x_j = [0, h, 2h]$, the equations specifically take the form

$$\left.\begin{aligned}
2h &= W_1 + W_2 + W_3 \\
\frac{(2h)^2}{2} = 2h^2 &= h\,W_2 + 2h\,W_3 \\
\frac{(2h)^3}{3} = \frac{8h^3}{3} &= h^2 W_2 + 4h^2 W_3
\end{aligned}\right\} , \tag{4.2.19}$$

which upon removal of the common powers of h are

$$\left.\begin{aligned}
2h &= W_1 + W_2 + W_3 \\
2h &= W_2 + 2W_3 \\
\frac{8h}{3} &= W_2 + 4W_3
\end{aligned}\right\} . \tag{4.2.20}$$

These have the solution

$$W_i = [1/3,\ 4/3,\ 1/3]\,h . \tag{4.2.21}$$

The weights given in equation (4.2.21) are identical to those found for Simpson's rule in equation (4.2.9)
which lead to the approximation formula given by equation (4.2.11). The details of finding the weights by
this method are sufficiently simple that it is generally preferred over the method discussed in the previous
section (section 4.2b).
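
The equations of condition (4.2.17) can be solved mechanically for any set of distinct points. The following is a minimal sketch, assuming exact rational arithmetic is acceptable; the function name `quadrature_weights` and the small Gauss-Jordan solver are illustrative choices and are not taken from the text.

```python
from fractions import Fraction

def quadrature_weights(points, a, b):
    """Solve sum_j x_j**i * W_j = (b**(i+1) - a**(i+1)) / (i+1) for i = 0..n."""
    size = len(points)
    xs = [Fraction(x) for x in points]
    a, b = Fraction(a), Fraction(b)
    # Augmented matrix: row i holds the powers x_j**i and the exact integral of x**i.
    rows = [[x ** i for x in xs] + [(b ** (i + 1) - a ** (i + 1)) / (i + 1)]
            for i in range(size)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [v / scale for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [rows[i][-1] for i in range(size)]

# Simpson's rule on 0 -> 2h with h = 1 gives the weights 1/3, 4/3, 1/3.
print(quadrature_weights([0, 1, 2], 0, 2))
```

The same call works for unequally spaced points, for example `quadrature_weights([0, 1, 3], 0, 3)`. Exact fractions sidestep round-off here, but they do not remove the near-singularity of the system when many points are used.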

There are still other alternatives for determining the weights. For example, the integral in equation
(4.2.16) is itself the integral of a polynomial of degree n and as such can be evaluated exactly by any
quadrature scheme with that degree of precision. It need not have the spacing of the desired scheme at all.
Indeed, the integral could be evaluated at a sufficient level of accuracy by using a running Simpson's rule
with a sufficient total number of points. Or the weights could be obtained using the highly efficient Gaussian
type quadrature schemes described below. In any event, a quadrature scheme can be tailored to fit nearly any
problem by writing down the equations of condition that the weights must satisfy in order to have the desired
degree of precision. There are, of course, some potential pitfalls with this approach. If formulae of very high
degree of precision are sought, the equations (4.2.17) may become nearly singular and be quite difficult to
solve with the accuracy required for reliable quadrature schemes. If formulae of such high degree of precision
are really required, then one should consider Gaussian quadrature schemes.

## d. Gaussian Quadrature Schemes

We turn now to a class of quadrature schemes first suggested by that brilliant 19th century
mathematician Karl Friedrich Gauss. Gauss noted that one could obtain a much higher degree of precision
for a quadrature scheme designed for a function specified at a given number of points, if the location of those
points were regarded as additional free parameters. So, if in addition to the N weights one also had N
locations to specify, one could obtain a formula with a degree of precision of 2N-1 for a function specified at
only N points. However, they would have to be the proper N points. That is, their location would no longer
be arbitrary so that the function would have to be known at a particular set of values of the independent
variable xi. Such a formula would not be considered a hyper-efficient formula since the degree of precision
does not exceed the number of adjustable parameters. One has simply enlarged the number of such
parameters available in a given problem.

The question then becomes how to locate the proper places for the evaluation of the function given
the fact that one wishes to obtain a quadrature formula with this high degree of precision. Once more we
may appeal to the notion of obtaining a quadrature formula from an interpolation formula. In section (3.2b)
we developed Hermite interpolation which had a degree of precision of 2N-1. (Note: in that discussion the
actual numbering of the points began with zero so that N=n+1 where n is the limit of the sums in the
discussion.) Since equation (3.2.12) has the required degree of precision, we know that its integral will
provide a quadrature formula of the appropriate degree. Specifically

$$\int_a^b \Phi(x)\,dx = \sum_{j=0}^{n} f(x_j)\int_a^b h_j(x)\,dx + \sum_{j=0}^{n} f'(x_j)\int_a^b H_j(x)\,dx . \tag{4.2.22}$$

Now equation (4.2.22) would resemble the desired quadrature formula if the second sum on the right hand
side could be made to vanish. While the weight functions $H_j(x)$ themselves will not always be zero, we can
ask under what conditions their integral will be zero so that

$$\int_a^b H_j(x)\,dx = 0 . \tag{4.2.23}$$

Here the secret is to remember that those weight functions are polynomials [see equation (3.2.32)] of degree
2n+1 (i.e. 2N-1) and in particular $H_i(x)$ can be written as

$$H_i(x) = \frac{\Pi(x)\,L_i(x)}{\prod_{j\ne i}^{n}(x_i - x_j)} , \tag{4.2.24}$$

where

$$\Pi(x) \equiv \prod_{j=0}^{n}(x - x_j) . \tag{4.2.25}$$

Here the additional multiplicative linear polynomial uj(x) that appears in equation has been included in one
of the Lagrange polynomials Lj(x) to produce the n+1 degree polynomial Π(x). Therefore the condition for
the weights of $f'(x_i)$ to vanish [equation (4.2.23)] becomes

$$\frac{\int_a^b \Pi(x)\,L_i(x)\,dx}{\prod_{j\ne i}^{n}(x_i - x_j)} = 0 . \tag{4.2.26}$$

The product in the denominator is simply a constant which is not zero so it may be eliminated from the
equation. The remaining integral looks remarkably like the integral for the definition of orthogonal
polynomials [equation (3.3.6)]. Indeed, since Li(x) is a polynomial of degree n [or (N-1)] and Π(x) is a
polynomial of degree n+1 (also N), the conditions required for equation (4.2.26) to hold will be met if Π(x)
is a member of the set of polynomials which are orthogonal in the interval a → b. But we have not
completely specified Π(x) for we have not chosen the values xj where the function f(x) and hence Π(x) are to
be evaluated. Now it is clear from the definition of Π(x) [equation (4.2.25)] that the values of xj are the roots
of a polynomial of degree n+1 (or N) that Π(x) represents. Thus, we now know how to choose the xj's so that
the weights of f'(x) will vanish. Simply choose them to be the roots of the (n+1)th degree polynomial which
is a member of an orthogonal set on the interval a → b. This will ensure that the second sum in equation
(4.2.22) will always vanish and the condition becomes

$$\int_a^b \Phi(x)\,dx = \sum_{j=0}^{n} f(x_j)\int_a^b h_j(x)\,dx . \tag{4.2.27}$$

This expression is exact as long as Φ(x) is a polynomial of degree 2n+1 (or 2N-1) or less. Thus, Gaussian
quadrature schemes have the form

$$\int_a^b f(x)\,dx = \sum_{j=0}^{n} f(x_j)\,W_j , \tag{4.2.28}$$

where the $x_i$'s are the roots of the Nth degree polynomial which is orthogonal in the interval
a → b, and the weights $W_i$ can be written with the aid of equation (3.2.32) as

$$W_i = \int_a^b h_i(x)\,dx = \int_a^b \left[1 - 2(x - x_i)L'_i(x_i)\right]L_i^2(x)\,dx . \tag{4.2.29}$$

Now these weights can be evaluated analytically should one have the determination, or they can be evaluated
from the equations of condition [equation (4.2.17)] which any quadrature weights must satisfy. Since the
extent of the finite interval can always be transformed into the interval −1 → +1 where the appropriate
orthonormal polynomials are the Legendre polynomials, and the weights are independent of the function
f(x), they will be specified by the value of N alone and may be tabulated once and for all. Probably the most
complete tables of the roots and weights for Gaussian quadrature can be found in Abramowitz and Stegun¹
and unless a particularly unusual quadrature scheme is needed these tables will suffice.
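
When tables are not at hand, the Gauss-Legendre roots and weights for the interval −1 → +1 can be generated numerically. The sketch below is one common approach and is not the method of the text: it finds the roots of $P_N(y)$ by Newton's method and uses the standard closed form $W_i = 2/[(1-y_i^2)\,P_N'(y_i)^2]$ for the weights. The function names and the starting guess for each root are assumptions made for illustration.

```python
import math

def legendre(n, y):
    """Return P_n(y) and its derivative using the three-term recurrence."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, y
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * y * p - (k - 1) * p_prev) / k
    dp = n * (y * p - p_prev) / (y * y - 1.0)  # valid for |y| != 1
    return p, dp

def gauss_legendre(n):
    """Roots (in descending order) and weights of the n-point Gauss-Legendre rule."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        y = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # approximate root as a start
        for _ in range(100):
            p, dp = legendre(n, y)
            step = p / dp
            y -= step
            if abs(step) < 1e-15:
                break
        p, dp = legendre(n, y)
        nodes.append(y)
        weights.append(2.0 / ((1.0 - y * y) * dp * dp))
    return nodes, weights

nodes, weights = gauss_legendre(5)
print(nodes)
print(weights)
```

A quick sanity check is that the weights sum to 2, the length of the interval, and that an N-point rule integrates $y^{2N-2}$ exactly.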

Before continuing with our discussion of Gaussian quadrature, it is perhaps worth considering a
specific example of such a formula. Since the Gaussian formulae make use of orthogonal polynomials, we
should first express the integral in the interval over which the polynomials form an orthogonal set. To that
end, let us examine an integral with a finite range so that

$$\int_a^b f(x)\,dx = \left(\frac{b-a}{2}\right)\int_{-1}^{+1} f\{[(b-a)y + (a+b)]/2\}\,dy . \tag{4.2.30}$$

Here we have transformed the integral into the interval −1 → +1. The appropriate transformation can be
obtained by evaluating a linear function at the respective end points of the two integrals. This will specify the
slope and intercept of the straight line in terms of the limits and yields

$$\left.\begin{aligned}
y &= [2x - (a+b)]/(b-a) \\
dy &= [2/(b-a)]\,dx
\end{aligned}\right\} . \tag{4.2.31}$$
We have no complicating weight function in the integrand so that the appropriate polynomials are the
Legendre polynomials. For simplicity, let us take n=2. We gave the first few Legendre polynomials in Table
3.4 and for n = 2 we have

$$P_2(y) = (3y^2 - 1)/2 . \tag{4.2.32}$$

The points at which the integrand is to be evaluated are simply the roots of that polynomial which we can
find from the quadratic formula to be

$$\left.\begin{aligned}
(3y^2 - 1)/2 &= 0 \\
y_i &= \pm 1/\sqrt{3}
\end{aligned}\right\} . \tag{4.2.33}$$

Quadrature formulae of larger n will require the roots of much larger degree polynomials which
have been tabulated by Abramowitz and Stegun¹. The weights of the quadrature formula are yet to be
determined, but having already specified where the function is to be evaluated, we may use equations
(4.2.17) to find them. Alternatively, for this simple case we need only remember that the weights sum to the
interval so that

$$W_1 + W_2 = 2 . \tag{4.2.34}$$

Since the weights must be symmetric in the interval, they must both be unity. Substituting the values for $y_i$
and $W_i$ into equation (4.2.28), we get

$$\int_a^b f(x)\,dx \cong \frac{(b-a)}{2}\left\{ f\!\left[\frac{(b-a)}{2\sqrt{3}} + \tfrac{1}{2}(a+b)\right] + f\!\left[\frac{(a-b)}{2\sqrt{3}} + \tfrac{1}{2}(a+b)\right]\right\} . \tag{4.2.35}$$

While equation (4.2.35) contains only two terms, it has a degree of precision of three (2n-1) or the same as
the three term hyper-efficient Simpson's rule. This nicely illustrates the efficiency of the Gaussian schemes.
They rapidly pass the fixed abscissa formulae in their degree of precision as [(2n-1)/n].
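
Equation (4.2.35) translates directly into a few lines of code. The sketch below, with the hypothetical function name `gauss2`, checks the claimed degree of precision: a cubic is integrated exactly, while a quartic is not.

```python
import math

def gauss2(f, a, b):
    """Two-point Gauss-Legendre estimate of the integral of f over a -> b, eq. (4.2.35)."""
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    offset = half / math.sqrt(3.0)
    return half * (f(mid + offset) + f(mid - offset))

def cubic(x):
    # The exact integral over 0 -> 2 is x**4 - x**3 + x**2 - x evaluated at 2, i.e. 10.
    return 4 * x**3 - 3 * x**2 + 2 * x - 1

def quartic(x):
    # The exact integral over -1 -> +1 is 2/5.
    return x**4

print(gauss2(cubic, 0.0, 2.0))     # agrees with 10 to round-off
print(gauss2(quartic, -1.0, 1.0))  # gives 2/9, not 2/5
```

The quartic result shows where the two-point formula stops being exact, which is consistent with a degree of precision of three.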

So far we have restricted our discussion of Gaussian quadrature to the finite interval. However, there
is nothing in the entire discussion that would affect general integrals of the form

$$I = \int_\alpha^\beta w(x)f(x)\,dx . \tag{4.2.36}$$

Here w(x) is a weight function which need not be a polynomial and should not be confused with the quadrature
weights $W_i$. Such integrals can be evaluated exactly as long as f(x) is a polynomial of degree 2N-1 or less. One
simply uses a Gaussian scheme where the points are chosen so that the values of xi are the roots of the Nth
degree polynomial that is orthogonal in the interval α → β relative to the weight function w(x). We have
already studied such polynomials in section 3.3 so that we may use Gaussian schemes to evaluate integrals in
the semi-infinite interval [0 → +∞] and full infinite interval [−∞ → +∞] as well as the finite interval [−1 →
+1] as long as the appropriate weight function is used. Below is a table of the intervals and weight functions
that can be used for some common types of Gaussian quadrature.

Table 4.2
Types of Polynomials for Gaussian Quadrature

| Interval | Weight function w(x) | Type of polynomial |
|---|---|---|
| −1 → +1 | $(1-x^2)^{-1/2}$ | Chebyschev: 1st kind |
| −1 → +1 | $(1-x^2)^{+1/2}$ | Chebyschev: 2nd kind |
| 0 → +∞ | $e^{-x}$ | Laguerre |
| −∞ → +∞ | $e^{-x^2}$ | Hermite |

It is worth noting from the entries in Table 4.2 that there are considerable opportunities for creativity
available for the evaluation of integrals by a clever choice of the weight function. Remember that it is only
f(x) of the product w(x)f(x) making up the integrand that need be well approximated by a polynomial in
order for the quadrature formula to yield accurate answers. Indeed the weight function for Gaussian-
Chebyschev quadrature of the first kind has singularities at the end points of the interval. Thus if one's
integral has similar singularities, it would be a good idea to use Gauss-Chebyschev quadrature instead of
Gauss-Legendre quadrature for evaluating the integral. Proper choice of the weight function may simply be
used to improve the polynomic behavior of the remaining part of the integrand. This will certainly improve
the accuracy of the solution.
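
As an illustration of absorbing an end-point singularity into the weight function, the sketch below applies Gauss-Chebyschev quadrature of the first kind. It relies on the standard result, not derived in the text, that the N roots are $x_k = \cos[(2k-1)\pi/2N]$ and that all the weights equal $\pi/N$. The function name `gauss_chebyshev` is a hypothetical choice.

```python
import math

def gauss_chebyshev(f, n):
    """Approximate the integral over -1 -> +1 of f(x) / sqrt(1 - x**2)."""
    total = 0.0
    for k in range(1, n + 1):
        x = math.cos((2 * k - 1) * math.pi / (2 * n))  # root of the nth Chebyschev polynomial
        total += f(x)
    return math.pi / n * total

# The integrand x**2 / sqrt(1 - x**2) is singular at both end points,
# yet only the smooth factor f(x) = x**2 is ever evaluated.
print(gauss_chebyshev(lambda x: x * x, 2))  # exact value is pi/2
```

Only the polynomial part f(x) enters the sum, so two points already reproduce the exact value π/2 to round-off.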

In any event, the quadrature formulae can always be written to have the form

$$\int_\alpha^\beta w(x)f(x)\,dx = \sum_{j=0}^{n} f(x_j)\,W_j , \tag{4.2.37}$$

where the weights, which may include the weight function w(x), can be found from

$$W_i = \int_\alpha^\beta w(x)\,h_i(x)\,dx . \tag{4.2.38}$$

Here hi(x) is the appropriate orthogonal polynomial for the weight function and interval.

## e. Romberg Quadrature and Richardson Extrapolation

So far we have given explicit formulae for the numerical evaluation of a definite integral. In
reality, we wish the result of the application of such formulae to specific problems. Romberg quadrature
produces this result without obtaining the actual form for the quadrature formula. The basic approach is to
use the general properties of the equal-interval formulae such as the Trapezoid rule and Simpson's rule to
generate the results for formulae successively applied with smaller and smaller step size. The results can be
further improved by means of Richardson's extrapolation to yield results for formulae of greater accuracy
[i.e. higher order $O(h^m)$]. Since the Romberg algorithm generates these results recursively, the application is
extremely efficient, readily programmable, and allows an on-going estimate of the error. Let us define a step
size that will always yield equal intervals throughout the interval a → b as

$$h_j = (b-a)/2^j . \tag{4.2.39}$$

The general Trapezoid rule for an integral over this range can be written as

$$F(b-a) = \int_a^b f(x)\,dx = \frac{h_j}{2}\left[f(a) + f(b) + 2\sum_{i=1}^{2^j-1} f(a + i h_j)\right] . \tag{4.2.40}$$
The Romberg recursive quadrature algorithm states that the results of applying this formula for successive
values of j (i.e. smaller and smaller step sizes $h_j$) can be obtained from

$$\left.\begin{aligned}
F_j^0 &= \tfrac{1}{2}\left(F_{j-1}^0 + Q_{j-1}\right) \\
Q_{j-1} &= h_{j-1}\sum_{i=1}^{2^{(j-1)}} f\left[a + (i - \tfrac{1}{2})h_{j-1}\right] \\
F_0^0 &= (b-a)[f(a) + f(b)]/2
\end{aligned}\right\} . \tag{4.2.41}$$

Each estimate of the integral will require $2^{(j-1)}$ evaluations of the function and should yield a value for the
integral, but can have a degree of precision no greater than $2^{(j-1)}$. Since a sequence of j steps must be executed
to reach this level, the efficiency of the method is poor compared to Gaussian quadrature. However the
difference $(F_j^0 - F_{j-1}^0)$ does provide a continuous estimate of the error in the integral.

We can significantly improve the efficiency of the scheme by using Romberg extrapolation to
improve the nature of the quadrature formulae that the iteration scheme is using. Remember that successive
values of h differ by a factor of two. This is exactly the form that we used to develop the Richardson formula
for the derivative of a function [equation (4.1.15)]. Thus we can use the generalization of the Richardson
algorithm given by equation (4.1.15), utilizing two successive values of $F_j^0$ to "extrapolate" to the result
for a higher order formula. Each value of the integral corresponding to the higher order quadrature formula can,
in turn, serve as the basis for an additional extrapolation. This procedure also can be cast as a recurrence
formula where

$$F_j^k = \frac{2^{2k}F_{j+1}^{k-1} - F_j^{k-1}}{2^{2k} - 1} . \tag{4.2.42}$$

There is a trade off between the results generated by equation (4.2.42) and equation (4.2.41). Larger values
of j produce values for $F_j^k$ which correspond to decreasing values of h (see table 4.3). However, increasing
values of k yield values for $F_j^k$ which correspond to quadrature formulae with smaller error terms, but with larger
values of h. Thus it is not obvious which sequence, equation (4.2.41) or equation (4.2.42), will yield the
better value for the integral.

In order to see how this method works, consider applying it to the analytic integral

$$\int_0^{+1} e^{5x}\,dx = \frac{e^5 - 1}{5} = 29.48263182 . \tag{4.2.43}$$

Table 4.3
Sample Results for Romberg Quadrature

| j | $F_j^0$ | $F_j^1$ | $F_j^2$ | $F_j^3$ | $F_j^4$ |
|---|---|---|---|---|---|
| 0 | 74.7066 | 33.0238 | 29.6049 | 29.4837 | 29.4827 |
| 1 | 43.4445 | 29.8186 | 29.4856 | 29.4826 | |
| 2 | 33.2251 | 29.5064 | 29.4827 | | |
| 3 | 30.4361 | 29.4824 | | | |
| 4 | 29.722113 | | | | |

Here it is clear that improving the order of the quadrature formula rapidly leads to a converged solution. The
convergence of the non-extrapolated quadrature is not impressive considering the number of evaluations
required to reach, say, $F_4^0$. Table 4.4 gives the results of applying some of the other quadrature methods we
have developed to the integral in equation (4.2.43).
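
The recursion of equations (4.2.41) and (4.2.42) is short enough to program directly. The following is a minimal sketch, assuming the midpoints in $Q_{j-1}$ are measured from the lower limit a; the function name `romberg` and the storage of $F_j^k$ as `F[j][k]` are illustrative choices. Applied to the integral of equation (4.2.43), it should generate the triangular array of Table 4.3.

```python
import math

def romberg(f, a, b, levels):
    """Return the triangular table F[j][k] of equations (4.2.41) and (4.2.42)."""
    F = [[0.0] * (levels + 1 - j) for j in range(levels + 1)]
    F[0][0] = (b - a) * (f(a) + f(b)) / 2.0
    # First column: trapezoid rule with the step size halved at each level.
    for j in range(1, levels + 1):
        h_prev = (b - a) / 2 ** (j - 1)
        # Q_{j-1}: sum over the midpoints of the previous set of sub-intervals.
        q = h_prev * sum(f(a + (i - 0.5) * h_prev) for i in range(1, 2 ** (j - 1) + 1))
        F[j][0] = 0.5 * (F[j - 1][0] + q)
    # Remaining columns: Richardson extrapolation, equation (4.2.42).
    for k in range(1, levels + 1):
        for j in range(levels + 1 - k):
            F[j][k] = (4 ** k * F[j + 1][k - 1] - F[j][k - 1]) / (4 ** k - 1)
    return F

table = romberg(lambda x: math.exp(5.0 * x), 0.0, 1.0, 4)
for row in table:
    print("  ".join(f"{value:9.4f}" for value in row))
```

The difference between successive entries in the first column gives the running error estimate mentioned above, so a practical version would stop as soon as that difference falls below a chosen tolerance.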

We obtain the results for the Trapezoid rule by applying equation (4.2.1) to the integral given by
equation (4.2.43). The results for Simpson's rule and the two-point Gaussian quadrature come from
equations (4.2.11) and (4.2.35) respectively. In the last two columns of Table 4.4 we have given the
percentage error of the method and the number of evaluations of the function required for the determination
of the integral. While the Romberg extrapolated integral is five times more accurate than its nearest
competitor, it takes twice the number of evaluations. This situation gets rapidly worse so that the Gaussian
quadrature becomes the most efficient and accurate scheme when n exceeds about five. The trapezoid rule
and Romberg $F_0^0$ yield identical results as they are the same approximation. Similarly Romberg $F_0^1$ yields
the same results as Simpson's rule. This is to be expected as the Richardson extrapolation of the Romberg
quadrature equivalent to the Trapezoid rule should lead to the next higher order quadrature formula which is
Simpson's rule.

Table 4.4
Test Results for Various Quadrature Formulae

| Type | F(x) | \|∆F(%)\| | N[f(x)] |
|---|---|---|---|
| Analytic Result | 29.48263182 | 0.0 | 1 |
| Trapezoid Rule | 74.70658 | 153.39 | 2 |
| Simpson's Rule | 33.02386 | 12.01 | 3 |
| 2-point Gauss Quad. | 27.23454 | 7.63 | 2 |
| Romberg Quadrature $F_0^0$ | 74.70658 | 153.39 | 2 |
| Romberg Quadrature $F_1^1$ | 29.8186 | 1.14 | 4 |
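
The single-interval entries of Table 4.4 can be checked with a few lines of arithmetic. The sketch below evaluates the trapezoid rule, Simpson's rule, and the two-point Gauss formula on the interval 0 → 1 for the integrand of equation (4.2.43); the variable and function names are illustrative only.

```python
import math

def f(x):
    return math.exp(5.0 * x)

exact = (math.exp(5.0) - 1.0) / 5.0

trapezoid = 0.5 * (f(0.0) + f(1.0))                  # two function evaluations
simpson = (f(0.0) + 4.0 * f(0.5) + f(1.0)) / 6.0     # three function evaluations
offset = 0.5 / math.sqrt(3.0)
gauss2 = 0.5 * (f(0.5 + offset) + f(0.5 - offset))   # two function evaluations

def percent_error(value):
    return abs(value - exact) / exact * 100.0

for name, value in [("Trapezoid", trapezoid), ("Simpson", simpson), ("2-point Gauss", gauss2)]:
    print(f"{name:14s} {value:10.5f} {percent_error(value):8.2f}")
```

With the same two function evaluations, the Gauss formula reduces the error of the trapezoid rule from about 153 percent to under 8 percent.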

## f. Multiple Integrals

Most of the work on the numerical evaluation of multiple integrals was done in the
middle of the twentieth century at the University of Wisconsin by Preston C. Hammer and his students. A reasonably
complete summary of much of this work can be found in the book by Stroud². Unfortunately the work is not
widely known, even though problems associated with multiple integrals occur frequently in the sciences, particularly
in the area of the modeling of physical systems. From what we have already developed for quadrature
schemes one can see some of the problems. For example, should it take N points to accurately represent an
integral in one dimension, then it will take $N^m$ points to calculate an m-dimensional integral. Should the
integrand be difficult to calculate, the computation involved in evaluating it at $N^m$ points can be prohibitive.
Thus we shall consider only those quadrature formulae that are the most efficient - the Gaussian formulae.
The first problem in numerically evaluating multiple integrals is to decide what will constitute an
approximation criterion. Like integrals of one dimension, we shall appeal to polynomial approximation. That
is, in some sense, we shall look for schemes that are exact for polynomials of the multiple variables that
describe the multiple dimensions. However, there are many distinct types of such polynomials so we shall
choose a subset. Following Stroud² let us look for quadrature schemes that will be exact for polynomials that
can be written as simple products of polynomials of a single variable. Thus the approximating polynomial
will be a product polynomial in m-dimensions. Now we will not attempt to derive the general theory for
multiple Gaussian quadrature, but rather pick a specific space. Let the space be m-dimensional and of the full
infinite interval. This allows us, for the moment, to avoid the problem of boundaries. Thus we can represent
our integral by

$$V = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} e^{-(x_1^2 + x_2^2 + \cdots + x_m^2)}\, f(x_1, x_2, \cdots, x_m)\,dx_1\,dx_2 \cdots dx_m . \tag{4.2.44}$$

Now we have seen that we lose no generality by assuming that our nth order polynomial is a monomial of
the form $x^\alpha$ so let us continue with this assumption that $f(x_1, x_2, \cdots, x_m)$ has the form

$$f(x) = \prod_{i=1}^{m} x_i^{\alpha_i} . \tag{4.2.45}$$

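We can then write equation (4.2.44) as

$$V = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} e^{-\sum_{i=1}^{m} x_i^2}\prod_{j=1}^{m} x_j^{\alpha_j}\,dx_j = \prod_{j=1}^{m}\int_{-\infty}^{+\infty} e^{-x_j^2}\, x_j^{\alpha_j}\,dx_j . \tag{4.2.46}$$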

The right hand side has the relatively simple form due to the linearity of the integral operator. Now make a
coordinate transformation to general spherical coordinates by means of

$$\left.\begin{aligned}
x_1 &= r\cos\theta_{m-1}\cos\theta_{m-2}\cdots\cos\theta_2\cos\theta_1 \\
x_2 &= r\cos\theta_{m-1}\cos\theta_{m-2}\cdots\cos\theta_2\sin\theta_1 \\
&\ \ \vdots \\
x_{m-1} &= r\cos\theta_{m-1}\sin\theta_{m-2} \\
x_m &= r\sin\theta_{m-1}
\end{aligned}\right\} , \tag{4.2.47}$$
which has a Jacobian of the transformation equal to

$$J = r^{m-1}\cos^{m-2}(\theta_{m-1})\cos^{m-3}(\theta_{m-2})\cdots\cos(\theta_2) . \tag{4.2.48}$$

This allows the expression of the integral to take the form

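$$V = \left(\int_{-\infty}^{+\infty} e^{-r^2}\, r^{m-1}\, r^{\left(\sum_{j=1}^{m}\alpha_j\right)}\,dr\right)\prod_{i=1}^{m-1}\int_{-\pi/2}^{+\pi/2}(\cos\theta_i)^{i-1}(\cos\theta_i)^{\left(\sum_{j=1}^{i}\alpha_j\right)}(\sin\theta_i)^{\alpha_{i+1}}\,d\theta_i . \tag{4.2.49}$$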
Consider how we could represent a quadrature scheme for any single integral in the running product. For
example

$$\int_{-\pi/2}^{+\pi/2}(\cos\theta_i)^{i-1}(\cos\theta_i)^{\alpha_i}(\sin\theta_i)^{\alpha_{i+1}}\,d\theta_i = \sum_{j=1}^{N} B_{ij}(\cos\theta_{ij})^{\alpha_i} . \tag{4.2.50}$$

Here we have chosen the quadrature points for $\theta_i$ to be at $\theta_{ij}$ and we have let

$$\alpha = \sum \alpha_i . \tag{4.2.51}$$

Now make one last transformation of the form

$$y_i = \cos\theta_i , \tag{4.2.52}$$

which leads to

$$\int_{-1}^{+1}(1 - y_i^2)^{(i-2)/2}\, y_i^{\alpha}\,dy = \sum_{j=1}^{N} B_{ij}\, y_j^{\alpha} = \int_{-1}^{+1} w(y_i)\, y_i^{\alpha}\,dy_i , \quad i = 1 \cdots (m-1) . \tag{4.2.53}$$

The integral on the right hand side can be evaluated exactly if we take the $y_i$'s to be the roots of a polynomial
of degree (α+1)/2 which is a member of an orthogonal set in the interval −1 → +1, relative to the weight
function $w(y_i)$ which is

$$w(y_i) = (1 - y_i^2)^{(i-2)/4}(1 + y_i^2)^{(i-2)/4} . \tag{4.2.54}$$

By considering Table 3.1 we see that the appropriate polynomials will be members of the Jacobi
polynomials for α = β = ( i-2 )/4. The remaining integral over the radial coordinate has the form

$$\int_{-\infty}^{+\infty} e^{-r^2}\, r^{\alpha'}\,dr , \tag{4.2.55}$$

which can be evaluated using Gauss-Hermite quadrature. Thus we see that multiple dimensional quadratures
can be carried out with a Gaussian degree of precision for product polynomials by considering each integral
separately and using the appropriate Gaussian scheme for that dimension. For example, if one desires to
integrate over the solid sphere, one would choose Gauss-Hermite quadrature for the radial quadrature,
Gauss-Legendre quadrature for the polar angle θ, and Gauss-Chebyschev quadrature for the azimuthal angle
φ. Such a scheme can be used for integrating over the surface of spheres or surfaces that can be distorted
from a sphere by a polynomial in the angular variables with good accuracy. The use of Gaussian quadrature
schemes can save on the order of Nm/2 evaluations of the functions which is usually significant.
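
The idea of treating each dimension with its own Gaussian scheme can be tried on equation (4.2.44) directly. The sketch below does not follow the spherical-coordinate route above. Instead it applies a two-point Gauss-Hermite rule to each Cartesian variable, using the standard roots $\pm 1/\sqrt{2}$ and weights $\sqrt{\pi}/2$, which are not tabulated in the text. The function name `product_gauss_hermite` is hypothetical.

```python
import itertools
import math

# Two-point Gauss-Hermite rule for the weight exp(-x**2) on -inf -> +inf.
NODES = (-1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
WEIGHTS = (math.sqrt(math.pi) / 2.0, math.sqrt(math.pi) / 2.0)

def product_gauss_hermite(f, m):
    """Approximate the m-dimensional integral of exp(-(x_1**2 + ... + x_m**2)) f(x).

    The rule uses 2**m points and is exact when f is a product of
    polynomials of degree three or less in each variable.
    """
    total = 0.0
    for idx in itertools.product(range(2), repeat=m):
        point = [NODES[i] for i in idx]
        weight = math.prod(WEIGHTS[i] for i in idx)
        total += weight * f(point)
    return total

# f = x1**2 * x2**2 in two dimensions; the exact value is (sqrt(pi)/2)**2 = pi/4.
print(product_gauss_hermite(lambda p: p[0] ** 2 * p[1] ** 2, 2))
```

The loop over `itertools.product` makes the $N^m$ growth in the number of function evaluations explicit: with N = 2 the cost doubles with every added dimension.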

For multi-dimensional integrals, there are a number of hyper-efficient quadrature formulae that are
known. However, they depend on the boundaries of the integration and are generally of rather low order.
Nevertheless such schemes should be considered when the boundaries are simple and the function well
behaved. When the boundaries are not simple, one may have to resort to a modeling scheme such as a Monte
Carlo method.

It is clear that the number of points required to evaluate an integral in m-dimensions will increase as
$N^m$. It does not take many dimensions for this to require an enormous number of points and hence,
evaluations of the integrand. Thus for multiple integrals, efficiency may dictate another approach.

## 4.3 Monte Carlo Integration Schemes and Other Tricks

The Monte Carlo approach to quadrature is a philosophy as much as it is an algorithm. It is an
application of a much more widely used method due to John von Neumann. The method was developed
during the Second World War to facilitate the solution to some problems concerning the design of the atomic
bomb. The basic philosophy is to describe the problem as a sequence of causally related physical
phenomena. Then by determining the probability that each separate phenomenon can occur, the joint
probability that all can occur is a simple product. The procedure can be fashioned sequentially so that even
probabilities that depend on prior events can be handled. One can conceptualize the entire process by
following a series of randomly chosen initial states each of which initiates a causal sequence of events
leading to the desired final state. The probability distribution of the final state contains the answer to the
problem. While the method derives its name from the casino at Monte Carlo in order to emphasize the
probabilistic nature of the method, it is most easily understood by example. One of the simplest examples of
Monte Carlo modeling techniques involves the numerical evaluation of integrals.

## a. Monte Carlo Evaluation of Integrals

Let us consider a one dimensional integral defined over a finite interval. The graph of the integrand
might look like that in Figure 4.2. Now the area under the curve is related to the integral of the function.
Therefore we can replace the problem of finding the integral of the function with that of finding the area under
the curve. However, we must place some units on the integral and we do that by finding the relative area
under the curve. For example, consider the integral

$$\int_a^b f_{max}\,dx = (b-a)\,f_{max} . \tag{4.3.1}$$

The graphical representation of this integral is just the area of the rectangle bounded by y = 0, x = a, x = b,
and y = $f_{max}$. Now if we were to randomly select values of $x_i$ and $y_i$, one could ask if

$$y_i \le f(x_i) . \tag{4.3.2}$$

If we let the ratio of the number of successful trials to the total number of trials be R, then

$$\int_a^b f(x)\,dx = R\,(b-a)\,f_{max} . \tag{4.3.3}$$
Clearly the accuracy of the integral will depend on the accuracy of R and this will improve with the number
N of trials. In general, the value of R will approach its actual value as $N^{-1/2}$. This emphasizes the major
difference between Monte Carlo quadrature and the other types of quadrature. In the case of the quadrature
formulae that depend on a direct calculation of the integral, the error of the result is determined by the extent
to which the integrand can be approximated by a polynomial (neglecting round-off error). If one is
sufficiently determined he/she can determine the magnitude of the error term and thereby place an absolute
limit on the magnitude of the error. However, Monte Carlo schemes are not based on polynomial
approximation so such an absolute error estimate cannot be made even in principle. The best we can hope for
is that there is a certain probability that the value of the integral lies within ε of the correct answer. Very
often this is sufficient, but it should always be remembered that the certainty of the calculation rests on a
statistical basis and that the approximation criterion is different from that used in most areas of numerical
analysis.
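
The procedure of equations (4.3.1) through (4.3.3) can be sketched in a few lines. The version below assumes that the integrand is non-negative and bounded above by `f_max` on the interval. The function name, the use of Python's built-in generator, and the fixed seed are illustrative choices made so that the run is repeatable.

```python
import random

def monte_carlo_integral(f, a, b, f_max, trials, seed=0):
    """Estimate the integral of f over a -> b as R (b - a) f_max, equation (4.3.3)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(a, b)
        y = rng.uniform(0.0, f_max)
        if y <= f(x):          # the test of equation (4.3.2)
            hits += 1
    ratio = hits / trials      # the ratio R of successful trials
    return ratio * (b - a) * f_max

# The integral of x**2 over 0 -> 1 is 1/3; the estimate is only statistically close.
print(monte_carlo_integral(lambda x: x * x, 0.0, 1.0, 1.0, 100000))
```

Changing the seed changes the answer slightly, which is a reminder that the result carries a statistical uncertainty rather than a bounded truncation error.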

If the calculation of f(x) is involved, the time required to evaluate the integral may be very great
indeed. This is one of the major drawbacks to the use of Monte Carlo methods in general. Another lesser
problem concerns the choice of the random variables xi and yi. This can become a problem when very large
numbers of random numbers are required. Most random number generators are subject to periodicities and
other non-random behavior after a certain number of selections have been made. Any non-random behavior
will destroy the probabilistic nature of the Monte Carlo scheme and thereby limit the accuracy of the answer.
Thus, one may be deceived into believing the answer is better than it is. One should use Monte Carlo
methods with great care. It should usually be the method of last choice. However, there are problems that can
be solved by Monte Carlo methods that defy solution by any other method. This modern method of
modeling the integral is reminiscent of a method used before the advent of modern computers. One simply
graphed the integrand on a piece of graph paper and then cut out the area that represented the integral. By
comparing the carefully measured weight of the cutout with that of a known area of graph paper, one
obtained a crude estimate of the integral.

While we have discussed Monte Carlo schemes for one-dimensional integrals only, the technique
can easily be generalized to multiple dimensions. Here the accuracy is basically governed by the number of
points required to sample the "volume" represented by the integrand and limits. This sampling can generally
be done more efficiently than the $N^m$ points required by the direct multiple dimension quadrature schemes.
Thus, the Monte-Carlo scheme is likely to efficiently compete with those schemes as the number of
dimensions increases. Indeed, should m > 2, this is likely to be the case.
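
The generalization to several dimensions changes almost nothing in the code. As a hedged illustration, the sketch below estimates the volume of the unit sphere in m dimensions by sampling the enclosing cube of side 2; the function name `ball_volume` and the seed are hypothetical choices.

```python
import random

def ball_volume(m, trials, seed=0):
    """Estimate the volume of the unit ball in m dimensions by random sampling."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A point is a "hit" if it falls inside the sphere of unit radius.
        if sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(m)) <= 1.0:
            hits += 1
    return hits / trials * 2.0 ** m   # fraction of hits times the volume of the cube

# For m = 3 the exact volume is 4 pi / 3, approximately 4.18879.
print(ball_volume(3, 200000))
```

The number of trials needed for a given statistical accuracy does not grow as $N^m$, although the fraction of hits does fall as m increases, so the sampling becomes less efficient in very high dimensions.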

Figure 4.2 shows the variation of a particularly complicated integrand. Clearly it is not a
polynomial and so could not be evaluated easily using standard quadrature formulae. However, we
may use Monte Carlo methods to determine the ratio of the area under the curve to the area of
the rectangle.

One should not be left with the impression that other quadrature formulae are without their
problems. We cannot leave this subject without describing some methods that can be employed to improve
the accuracy of the numerical evaluation of integrals.

## b. The General Application of Quadrature Formulae to Integrals

Additional tricks that can be employed to produce more accurate answers involve the proper
choice of the interval. Occasionally the integrand will display pathological behavior at some point in the
interval. It is generally a good idea to break the interval at that point and represent the integral by two (or
more) separate integrals each of which may separately be well represented by a polynomial. This is
particularly useful in dealing with integrals on the semi-infinite interval, which have pathological integrands
in the vicinity of zero. One can separate such an integral into two parts so that

$$\int_0^{+\infty} f(x)\,dx = \int_0^a f(x)\,dx + \int_a^{+\infty} f(x)\,dx . \tag{4.3.4}$$

The first of these can be transformed into the interval -1→ +1 and evaluated by means of any combination of
the finite interval quadrature schemes shown in table 4.2. The second of these integrals can be transformed
back into the semi-infinite interval by means of the linear transformation

$$y = x - a , \tag{4.3.5}$$

so that

$$\int_a^{+\infty} f(x)\,dx = \int_0^{+\infty} e^{-y}\left[e^{+y} f(y+a)\right] dy . \tag{4.3.6}$$

Gauss-Laguerre quadrature can be used to determine the value of the second integral. By judiciously
choosing places to break an integral that correspond to locations where the integrand is not well
approximated by a polynomial, one can significantly increase the accuracy and ease with which integrals
may be evaluated.
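
The treatment of the second integral in equation (4.3.6) can be sketched with a two-point Gauss-Laguerre rule. The roots $2 \mp \sqrt{2}$ and weights $(2 \pm \sqrt{2})/4$ used below are the standard two-point values and are not derived in the text; the function name `tail_integral` is hypothetical.

```python
import math

# Two-point Gauss-Laguerre rule for the weight exp(-y) on 0 -> +inf.
LAGUERRE_NODES = (2.0 - math.sqrt(2.0), 2.0 + math.sqrt(2.0))
LAGUERRE_WEIGHTS = ((2.0 + math.sqrt(2.0)) / 4.0, (2.0 - math.sqrt(2.0)) / 4.0)

def tail_integral(f, a):
    """Approximate the integral of f over a -> +inf using equation (4.3.6)."""
    # The bracketed factor exp(+y) f(y + a) is what must resemble a polynomial.
    return sum(w * math.exp(y) * f(y + a)
               for y, w in zip(LAGUERRE_NODES, LAGUERRE_WEIGHTS))

# For f(x) = x exp(-x) the bracketed factor is linear in y, so two points are exact.
# The analytic value of the integral from 1 to infinity is 2/e.
print(tail_integral(lambda x: x * math.exp(-x), 1.0))
```

For integrands that do not decay like $e^{-x}$, the factor $e^{+y} f(y+a)$ is no longer close to a polynomial, and the rule should be expected to converge slowly.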

Having decided on the range over which to evaluate the integral, one has to pick the order of the
quadrature formula to be used. Unlike the case for numerical differentiation, the higher the degree of
precision of the quadrature formula, the better. However, there does come a point where the round-off error
involved in the computation of the integrand exceeds the incremental improvement from the increased
degree of precision. This point is usually difficult to determine. However, if one evaluates an integral with
formulae of increasing degree of precision, the value of the integral will steadily change, reach a plateau, and
then change slowly reflecting the influence of round-off error. As a rule of thumb 8 to 10 point Gauss-
Legendre quadrature is sufficient to evaluate any integral over a finite range. If this is not the case, then the
integral is somewhat pathological and other approaches should be considered. In some instances, one may
use very high order quadrature (roots and weights for Legendre polynomials can be found up to N = 212),
but these instances are rare. There are many other quadrature formulae that have utility in specific
circumstances. Should the quadrature present special problems, or require highly efficient evaluation,
these formulae should be considered.
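
The plateau described above can be observed directly. The following Python sketch is an illustration only: it generates Gauss-Legendre roots and weights by Newton iteration on the Legendre recurrence relation (a standard construction, not one taken from this text) and applies them to the integral of e^{-x} over the interval 0 to 1 for an increasing number of points. The function names `gauss_legendre` and `integrate` are assumptions made for the example.

```python
import math

def gauss_legendre(n):
    """Roots and weights of n-point Gauss-Legendre quadrature on [-1, +1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess for the i-th root
        for _ in range(100):
            p0, p1 = 1.0, x                              # P_0 and P_1
            for k in range(2, n + 1):                    # recurrence up to P_n
                p0, p1 = p1, ((2 * k - 1) * x * p1 - (k - 1) * p0) / k
            dp = n * (x * p1 - p0) / (x * x - 1.0)       # derivative of P_n at x
            dx = p1 / dp
            x -= dx                                      # Newton correction
            if abs(dx) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights

def integrate(f, a, b, n):
    """n-point Gauss-Legendre estimate of the integral of f over [a, b]."""
    nodes, weights = gauss_legendre(n)
    half, mid = 0.5 * (b - a), 0.5 * (b + a)             # map [-1, +1] onto [a, b]
    return half * sum(w * f(mid + half * t) for t, w in zip(nodes, weights))

f = lambda x: math.exp(-x)
exact = 1.0 - math.exp(-1.0)
for n in range(1, 11):
    print(n, abs(integrate(f, 0.0, 1.0, n) - exact))
```

For this smooth integrand the printed error falls rapidly with the number of points and then stops improving once it reaches the level of round-off.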


Chapter 4 Exercises
1. Numerically differentiate the function

$$ f(x) = e^{-x} , $$

at the points x = 0, .5, 1, 5, 10. Describe the numerical method you used and why you chose it.
Discuss the accuracy by comparing your results with the analytic closed form derivatives.

2. Numerically evaluate

$$ f = \int_0^1 e^{-x}\,dx . $$

Carry out this evaluation using

a. 5-point Gaussian quadrature
b. a 5-point equal interval formula that you choose
c. 5 point trapezoid rule
d. analytically.
Compare and discuss your results.

$$ \int_{-1}^{+1} |x|\,dx . $$

4. What method would you use to evaluate

$$ \int_1^{+\infty} (x^{-4} + 3x^{-2})\tanh(x)\,dx \; ? $$

Explain your choice.

5. Use the techniques described in section (4.2e) to find the volume of a sphere. Discuss all the choices
you make regarding the type of quadrature used and the accuracy of the result.


## Chapter 4 References and Supplemental Reading

1. Abramowitz, M. and Stegun, I.A., "Handbook of Mathematical Functions" National Bureau of
Standards Applied Mathematics Series 55 (1964) U.S. Government Printing Office, Washington
D.C.

2. Stroud, A.H., "Approximate Calculation of Multiple Integrals", (1971), Prentice-Hall Inc.
Englewood Cliffs.

Because of the numerical instabilities encountered with most approaches to numerical
differentiation, there is not a great deal of accessible literature beyond the introductory level.
For example

3. Abramowitz, M. and Stegun, I.A., "Handbook of Mathematical Functions" National Bureau of
Standards Applied Mathematics Series 55 (1964) U.S. Government Printing Office, Washington
D.C., p. 877, devote less than a page to the subject quoting a variety of difference formulae.

The situation with regard to quadrature is not much better. Most of the results are in technical papers
in various journals related to computation. However, there are three books in English on the subject:

4. Davis, P.J., and Rabinowitz, P., "Numerical Integration", Blaisdell,

5. Krylov, V.I., "Approximate Calculation of Integrals" (1962) (trans. A.H. Stroud), The Macmillan
Company

6. Stroud, A.H., and Secrest, D. "Gaussian Quadrature Formulas", (1966), Prentice-Hall Inc.,
Englewood Cliffs.

Unfortunately they are all out of print and are to be found only in the better libraries. A very good
summary of various quadrature schemes can be found in

7. Abramowitz, M. and Stegun, I.A., "Handbook of Mathematical Functions" National Bureau of
Standards Applied Mathematics Series 55 (1964) U.S. Government Printing Office, Washington
D.C., pp. 885-899.

This is also probably the reference for the most complete set of Gaussian quadrature tables for the roots and
weights with the possible exception of the reference by Stroud and Secrest (i.e. ref 6). They also give some
hyper-efficient formulae for multiple integrals with regular boundaries. The book by Art Stroud on the
evaluation of multiple integrals

8. Stroud, A.H., "Approximate Calculation of Multiple Integrals", (1971), Prentice-Hall Inc.,
Englewood Cliffs.

represents largely the present state of work on multiple integrals, but it is also difficult to find.

# 5 Numerical Solution of Differential and Integral Equations

The aspect of the calculus of Newton and Leibnitz that allowed the
mathematical description of the physical world is the ability to incorporate derivatives and integrals into
equations that relate various properties of the world to one another. Thus, much of the theory that describes
the world in which we live is contained in what are known as differential and integral equations. Such
equations appear not only in the physical sciences, but in biology, sociology, and all scientific disciplines
that attempt to understand the world in which we live. Innumerable books and entire courses of study are
devoted to the study of the solution of such equations and most college majors in science and engineering
require at least one such course of their students. These courses generally cover the analytic closed form
solution of such equations. But many of the equations that govern the physical world have no solution in
closed form. Therefore, to find the answer to questions about the world in which we live, we must resort to
solving these equations numerically. Again, the literature on this subject is voluminous, so we can only hope
to provide a brief introduction to some of the basic methods widely employed in finding these solutions.
Also, the subject is by no means closed so the student should be on the lookout for new techniques that
prove increasingly efficient and accurate.


## 5.1 The Numerical Integration of Differential Equations

When we speak of a differential equation, we simply mean any equation where the dependent
variable appears as well as one or more of its derivatives. The highest derivative that is present determines
the order of the differential equation while the highest power of the dependent variable or its derivative
appearing in the equation sets its degree. Theories which employ differential equations usually will not be
limited to single equations, but may include sets of simultaneous equations representing the phenomena they
describe. Thus, we must say something about the solutions of sets of such equations. Indeed, changing a high
order differential equation into a system of first order differential equations is a standard approach to finding
the solution to such equations. Basically, one simply replaces the higher order terms with new variables and
includes the equations that define the new variables to form a set of first order simultaneous differential
equations that replace the original equation. Thus a third order differential equation that had the form

$$ f'''(x) + \alpha f''(x) + \beta f'(x) + \gamma f(x) = g(x) , \tag{5.1.1} $$

could be replaced with a system of first order differential equations that looked like

$$ \left. \begin{aligned} y'(x) + \alpha z'(x) + \beta f'(x) + \gamma f(x) &= g(x) \\ z'(x) &= y(x) \\ f'(x) &= z(x) \end{aligned} \right\} . \tag{5.1.2} $$

This simplification means that we can limit our discussion to the solution of sets of first order differential
equations with no loss of generality.
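
As a minimal sketch of this reduction, the following Python function evaluates the three first derivatives of the system (5.1.2) once it has been solved for f', z', and y'. The function name `third_order_rhs`, the ordering of the state as (f, z, y), and the test problem are illustrative assumptions and do not appear in the original text.

```python
import math

def third_order_rhs(x, state, alpha, beta, gamma, g):
    """Return (f', z', y') for f''' + alpha f'' + beta f' + gamma f = g(x).

    state = (f, z, y) with z = f' and y = z' = f'', as in equations (5.1.2).
    """
    f, z, y = state
    df = z                                          # f' = z
    dz = y                                          # z' = y
    dy = g(x) - alpha * y - beta * z - gamma * f    # first of (5.1.2) solved for y'
    return (df, dz, dy)

# Hypothetical check: f(x) = exp(x) satisfies f''' + f'' + f' + f = 4 exp(x),
# so at x = 0 with f = f' = f'' = 1 every derivative should equal 1.
derivs = third_order_rhs(0.0, (1.0, 1.0, 1.0), 1.0, 1.0, 1.0,
                         lambda x: 4.0 * math.exp(x))
print(derivs)
```

Any of the first order methods discussed below can then be applied to the three components of the state simultaneously.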

One remembers from beginning calculus that the derivative of a constant is zero. This means that it
is always possible to add a constant to the general solution of a first order differential equation unless some
additional constraint is imposed on the problem. These are generally called the constants of integration.
These constants will be present even if the equations are inhomogeneous and in this respect differential
equations differ significantly from functional algebraic equations. Thus, for a problem involving differential
equations to be fully specified, the constants corresponding to the derivatives present must be given in
advance. The nature of the constants (i.e. the fact that their derivatives are zero) implies that there is some
value of the independent variable for which the dependent variable has the value of the constant. Thus,
constants of integration not only have a value, but they have a "place" where the solution has that value. If
all the constants of integration are specified at the same place, they are called initial values and the problem
of finding a solution is called an initial value problem. In addition, to find a numerical solution, the range of
the independent variable for which the solution is desired must also be specified. This range must contain the
initial value of the independent variable (i.e. that value of the independent variable corresponding to the
location where the constants of integration are specified). On occasion, the constants of integration are
specified at different locations. Such problems are known as boundary value problems and, as we shall see,
these require a special approach. So let us begin our discussion of the numerical solution of ordinary
differential equations by considering the solution of first order initial value differential equations.

The general approach to finding a solution to a differential equation (or a set of differential
equations) is to begin the solution at the value of the independent variable for which the solution is equal to
the initial values. One then proceeds in a step by step manner to change the independent variable and move
across the required range. Most methods for doing this rely on the local polynomial approximation of the
solution and all the stability problems that were a concern for interpolation will be a concern for the
numerical solution of differential equations. However, unlike interpolation, we are not limited in our choice
of the values of the independent variable to where we can evaluate the dependent variable and its derivatives.
Thus, the spacing between solution points will be a free parameter. We shall use this variable to control the
process of finding the solution and estimating its error.

Since the solution is to be locally approximated by a polynomial, we will have constrained the
solution and the values of the coefficients of the approximating polynomial. This would seem to imply that
before we can take a new step in finding the solution, we must have prior information about the solution in
order to provide those constraints. This "chicken or egg" aspect to solving differential equations would be
removed if we could find a method that only depended on the solution at the previous step. Then we could
start with the initial value(s) and generate the solution at as many additional values of the independent
variable as we needed. Therefore let us begin by considering one-step methods.

Equations

Probably the most conceptually simple method of numerically integrating differential
equations is Picard's method. Consider the first order differential equation

$$ y'(x) = g(x, y) . \tag{5.1.3} $$

Let us directly integrate this over the small but finite range h so that

$$ \int_{y_0}^{y} dy = \int_{x_0}^{x_0+h} g(x, y)\,dx , \tag{5.1.4} $$

which becomes

$$ y(x) = y_0 + \int_{x_0}^{x_0+h} g(x, y)\,dx . \tag{5.1.5} $$

Now to evaluate the integral and obtain the solution, one must know the answer to evaluate g(x,y). This can
be done iteratively by turning eq (5.1.5) into a fixed-point iteration formula so that

$$ \left. \begin{aligned} y^{(k)}(x_0+h) &= y_0 + \int_{x_0}^{x_0+h} g[x, y^{(k-1)}(x)]\,dx \\ y^{(k-1)}(x) &= y^{(k-1)}(x_0+h) \end{aligned} \right\} . \tag{5.1.6} $$

A more inspired choice of the iterative value for $y^{(k-1)}(x)$ might be

$$ y^{(k-1)}(x) = \tfrac{1}{2}[y_0 + y^{(k-1)}(x_0+h)] . \tag{5.1.7} $$

However, an even better approach would be to admit that the best polynomial fit to the solution that can be
achieved for two points is a straight line, which can be written as

$$ y(x) = y_0 + a(x - x_0) = \{[y^{(k-1)}(x_0+h)](x - x_0) + [y_0(x_0)](x_0 + h - x)\}/h . \tag{5.1.8} $$

While the right hand side of equation (5.1.8) can be used as the basis for a fixed point iteration scheme, the
iteration process can be completely avoided by taking advantage of the functional form of g(x,y). The linear
form of y can be substituted directly into g(x,y) to find the best value of a. The equation that constrains a is
then simply

$$ ah = \int_{x_0}^{x_0+h} g[x, y_0 + a(x - x_0)]\,dx . \tag{5.1.9} $$

This value of a may then be substituted directly into the center term of equation (5.1.8) which in turn is
evaluated at x = x0+h. Even should it be impossible to evaluate the right hand side of equation (5.1.9) in
closed form, any of the quadrature formulae of chapter 4 can be used to directly obtain a value for a.
However, one should use a formula with a degree of precision consistent with the linear approximation of
y.

To see how these various forms of Picard's method actually work, consider the differential equation

$$ y'(x) = xy , \tag{5.1.10} $$

subject to the initial conditions

$$ y(0) = 1 . \tag{5.1.11} $$

Direct integration yields the closed form solution

$$ y = e^{x^2/2} . \tag{5.1.12} $$

The rapidly varying nature of this solution will provide a formidable test of any integration scheme
particularly if the step size is large. But this is exactly what we want if we are to test the relative accuracy of
different methods.

In general, we can cast Picard's method as

$$ y(x) = 1 + \int_0^x z\,y(z)\,dz , \tag{5.1.13} $$

where equations (5.1.6) - (5.1.8) represent various methods of specifying the behavior of y(z) for purposes of
evaluating the integrand. For purposes of demonstration, let us choose h = 1 which we know is unreasonably
large. However, such a large choice will serve to demonstrate the relative accuracy of our various choices
quite clearly. Further, let us obtain the solution at x = 1, and 2. The naive choice of equation (5.1.6) yields an
iteration formula of the form

$$ y^{(k)}(x_0+h) = 1 + \int_{x_0}^{x_0+h} z\,y^{(k-1)}(x_0+h)\,dz = 1 + [h(x_0 + h/2)]\,y^{(k-1)}(x_0+h) . \tag{5.1.14} $$

This may be iterated directly to yield the results in column (a) of table 5.1, but the fixed point can be found
directly by simply solving equation (5.1.14) for $y^{(\infty)}(x_0+h)$ to get

$$ y^{(\infty)}(x_0+h) = (1 - hx_0 - h^2/2)^{-1} . \tag{5.1.15} $$

For the first step when x0 = 0, the limiting value for the solution is 2. However, as the solution proceeds, the
iteration scheme clearly becomes unstable.
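
The behavior of the naive iteration can be sketched in a few lines of Python. This is only an illustration for the test equation y' = xy; the function names and the argument order are assumptions. The starting values (1.0 for the first step and 4.0 for the second) are those listed in row 0 of column (a) of table 5.1.

```python
def picard_naive(x0, h, y_start, n_iter):
    """Iterate y_k = 1 + h (x0 + h/2) y_{k-1}, i.e. equation (5.1.14) for y' = x y."""
    factor = h * (x0 + h / 2)            # integral of z dz from x0 to x0 + h
    iterates = [y_start]
    for _ in range(n_iter):
        iterates.append(1 + factor * iterates[-1])
    return iterates

def picard_naive_limit(x0, h):
    """Fixed point (5.1.15); returns None when the iteration cannot converge."""
    factor = h * (x0 + h / 2)
    return 1 / (1 - factor) if factor < 1 else None

first_step = picard_naive(0.0, 1.0, 1.0, 5)    # x0 = 0, h = 1
second_step = picard_naive(1.0, 1.0, 4.0, 5)   # x0 = 1, h = 1
print(first_step, picard_naive_limit(0.0, 1.0))
print(second_step, picard_naive_limit(1.0, 1.0))
```

In the first step each iterate is multiplied by 1/2, so the sequence converges to 2. In the second step the multiplier is 3/2 and the iterates grow without bound.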

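Table 5.1

| i | (A) y(1) | (B) y(1) | (C) y(1) | (D) yc(1) |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | | |
| 1 | 1.5 | 1.5 | | |
| 2 | 1.75 | 1.625 | | |
| 3 | 1.875 | 1.6563 | | |
| 4 | 1.938 | 1.6641 | | |
| 5 | 1.969 | 1.6660 | | |

| i | (A) y(2) | (B) y(2) | (C) y(2) | (D) yc(2) |
|---|---|---|---|---|
| 0 | 4.0 | 1.6666 | | |
| 1 | 7.0 | 3.0000 | | |
| 2 | 11.5 | 4.5000 | | |
| 3 | 18.25 | 5.6250 | | |
| 4 | 28.375 | 6.4688 | | |
| 5 | 43.56 | 7.1015 | | |
| ∞ | ∞ | 9.0000 | 17.5 | 7.3891 |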

Estimating the appropriate value of y(x) by averaging the values at the limits of the integral as
indicated by equation (5.1.7) tends to stabilize the procedure yielding the iteration formula

$$ y^{(k)}(x_0+h) = 1 + \tfrac{1}{2}\int_{x_0}^{x_0+h} z[y(x_0) + y^{(k-1)}(x_0+h)]\,dz = 1 + [h(x_0+h/2)][y(x_0) + y^{(k-1)}(x_0+h)]/2 , \tag{5.1.16} $$

the application of which is contained in column (b) of Table 5.1. The limiting value of this iteration formula
can also be found analytically to be

$$ y^{(\infty)}(x_0+h) = \frac{1 + [h(x_0+h/2)\,y(x_0)]/2}{1 - h(x_0+h/2)/2} , \tag{5.1.17} $$
which clearly demonstrates the stabilizing influence of the averaging process for this rapidly increasing
solution.
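
A short Python sketch of the averaged iteration, again only for the test equation y' = xy, shows the stabilizing effect. The function names are assumptions, and exact rational arithmetic is used so that the iterates can be compared with column (b) of table 5.1 without round-off. For the second step the tabulated iterates are reproduced by taking y(x0) = 5/3 and a starting iterate of 1.

```python
from fractions import Fraction

def picard_averaged(x0, h, y_x0, y_start, n_iter):
    """Iterate equation (5.1.16): y_k = 1 + h (x0 + h/2) [y(x0) + y_{k-1}] / 2."""
    factor = h * (x0 + h / 2)
    iterates = [y_start]
    for _ in range(n_iter):
        iterates.append(1 + factor * (y_x0 + iterates[-1]) / 2)
    return iterates

def picard_averaged_limit(x0, h, y_x0):
    """Limiting value of the iteration, equation (5.1.17)."""
    factor = h * (x0 + h / 2)
    return (1 + factor * y_x0 / 2) / (1 - factor / 2)

one = Fraction(1)
first = picard_averaged(Fraction(0), one, one, one, 5)
limit_1 = picard_averaged_limit(Fraction(0), one, one)
second = picard_averaged(one, one, limit_1, one, 5)
limit_2 = picard_averaged_limit(one, one, limit_1)
print([float(v) for v in first], float(limit_1))
print([float(v) for v in second], float(limit_2))
```

The multiplier applied to the previous iterate is now h(x0+h/2)/2, which is 3/4 for the second step, so the iteration converges where the naive form diverged.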

Finally, we can investigate the impact of a linear approximation for y(x) as given by equation
(5.1.8). Let us assume that the solution behaves linearly as suggested by the center term of equation (5.1.8).
This can be substituted directly into the explicit form for the solution given by equation (5.1.13) and the
value for the slope, a, obtained as in equation (5.1.9). This process yields

$$ a = \frac{y(x_0)(x_0 + h/2)}{1 - (x_0 h/2) - (h^2/3)} , \tag{5.1.18} $$

which with the linear form for the solution gives the solution without iteration. The results are listed in table
5.1 in column (c). It is tempting to think that a combination of the right hand side of equation (5.1.7)
integrated in closed form in equation (5.1.13) would give a more exact answer than that obtained with the
help of equation (5.1.18), but such is not the case. An iteration formula developed in such a manner can be
iterated analytically as was done with equations (5.1.15) and (5.1.17) to yield exactly the results in column
(c) of table 5.1. Thus the best one can hope for with a linear Picard's method is given by equation (5.1.8)
with the slope, a, specified by equation (5.1.9).
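
Because the slope is available in closed form, the linear Picard step for the test equation y' = xy requires no iteration at all. The Python sketch below is an illustration under that assumption; the function name `picard_linear_step` is hypothetical.

```python
from fractions import Fraction
import math

def picard_linear_step(x0, y0, h):
    """Advance y' = x y by one step using the slope a of equation (5.1.18)."""
    a = y0 * (x0 + h / 2) / (1 - x0 * h / 2 - h * h / 3)
    return y0 + a * h                     # linear form (5.1.8) evaluated at x0 + h

h = Fraction(1)
y1 = picard_linear_step(Fraction(0), Fraction(1), h)   # solution at x = 1
y2 = picard_linear_step(Fraction(1), y1, h)            # solution at x = 2
print(float(y1), math.exp(0.5))
print(float(y2), math.exp(2.0))
```

With h = 1 the second step gives 17.5, the entry in column (c) of table 5.1, against an analytic value of about 7.389. The denominator of equation (5.1.18) has already shrunk to 1/6 for that step, which shows how close the unreasonably large step size brings the formula to breaking down.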

However, there is another approach to finding one-step methods. The differential equation (5.1.3)
has a full family of solutions depending on the initial value (i.e. the solution at the beginning of the step).
That family of solutions is restricted by the nature of g(x,y). The behavior of that family in the neighborhood
of x = x0+h can shed some light on the nature of the solution at x = x0+h. This is the fundamental basis for
one of the more successful and widely used one-step methods known as the Runge-Kutta method. The
Runge-Kutta method is also one of the few methods in numerical analysis that does not rely directly on
polynomial approximation for, while it is certainly correct for polynomials, the basic method assumes that
the solution can be represented by a Taylor series.

So let us begin our discussion of Runge-Kutta formulae by assuming that the solution can be
represented by a finite Taylor series of the form

$$ y_{n+1} = y_n + h y'_n + (h^2/2!)\,y''_n + \cdots + (h^k/k!)\,y_n^{(k)} . \tag{5.1.19} $$

Now assume that the solution can also be represented by a function of the form

$$ y_{n+1} = y_n + h\{\alpha_0 g(x_n,y_n) + \alpha_1 g[(x_n+\mu_1 h),(y_n+b_1 h)] + \alpha_2 g[(x_n+\mu_2 h),(y_n+b_2 h)] + \cdots + \alpha_k g[(x_n+\mu_k h),(y_n+b_k h)]\} . \tag{5.1.20} $$

This rather convoluted expression, while appearing to depend only on the value of y at the initial step
(i.e. yn ) involves evaluating the function g(x,y) all about the solution point xn, yn (see Figure 5.1).

By setting equations (5.1.19) and (5.1.20) equal to each other, we see that we can write the solution
in the form

$$ y_{n+1} = y_n + \alpha_0 t_0 + \alpha_1 t_1 + \cdots + \alpha_k t_k , \tag{5.1.21} $$

where the $t_i$'s can be expressed recursively by

$$ \left. \begin{aligned} t_0 &= h\,g(x_n, y_n) \\ t_1 &= h\,g[(x_n + \mu_1 h), (y_n + \lambda_{1,0} t_0)] \\ t_2 &= h\,g[(x_n + \mu_2 h), (y_n + \lambda_{2,0} t_0 + \lambda_{2,1} t_1)] \\ &\;\;\vdots \\ t_k &= h\,g[(x_n + \mu_k h), (y_n + \lambda_{k,0} t_0 + \lambda_{k,1} t_1 + \cdots + \lambda_{k,k-1} t_{k-1})] \end{aligned} \right\} . \tag{5.1.22} $$

Now we must determine k+1 values of α, k values of µ and k×(k+1)/2 values of $\lambda_{i,j}$. But we only have k+1
terms of the Taylor series to act as constraints. Thus, the problem is hopelessly under-determined. This
indeterminacy will give rise to entire families of Runge-Kutta formulae for any order k. In addition, the
algebra to eliminate as many of the unknowns as possible is quite formidable and not unique due to the
undetermined nature of the problem. Thus we will content ourselves with dealing only with low order
formulae which demonstrate the basic approach and nature of the problem. Let us consider the lowest order
that provides some insight into the general aspects of the Runge-Kutta method. That is k=1. With k=1
equations (5.1.21) and (5.1.22) become

$$ \left. \begin{aligned} y_{n+1} &= y_n + \alpha_0 t_0 + \alpha_1 t_1 \\ t_0 &= h\,g(x_n, y_n) \\ t_1 &= h\,g[(x_n + \mu h), (y_n + \lambda t_0)] \end{aligned} \right\} . \tag{5.1.23} $$

Here we have dropped the subscript on λ as there will only be one of them. However, there are still four free
parameters and we really only have three equations of constraint.

Figure 5.1 shows the solution space for the differential equation
y' = g(x,y). Since the initial value is different for different solutions, the
space surrounding the solution of choice can be viewed as being full of
alternate solutions. The two dimensional Taylor expansion of the Runge-
Kutta method explores this solution space to obtain a higher order value
for the specific solution in just one step.


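If we expand g(x,y) about xn, yn, in a two dimensional Taylor series, we can write

$$ \begin{aligned} g[(x_n+\mu h),(y_n+\lambda t_0)] ={}& g(x_n,y_n) + \mu h \frac{\partial g(x_n,y_n)}{\partial x} + \lambda t_0 \frac{\partial g(x_n,y_n)}{\partial y} + \tfrac{1}{2}\mu^2 h^2 \frac{\partial^2 g(x_n,y_n)}{\partial x^2} \\ &+ \tfrac{1}{2}\lambda^2 t_0^2 \frac{\partial^2 g(x_n,y_n)}{\partial y^2} + \mu\lambda h t_0 \frac{\partial^2 g(x_n,y_n)}{\partial x \partial y} + \cdots . \end{aligned} \tag{5.1.24} $$
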
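Making use of the third of equations (5.1.23), we can explicitly write t1 as

$$ \begin{aligned} t_1 ={}& h\,g(x_n,y_n) + h^2\left[\mu \frac{\partial g(x_n,y_n)}{\partial x} + \lambda g(x_n,y_n) \frac{\partial g(x_n,y_n)}{\partial y}\right] \\ &+ \tfrac{1}{2} h^3\left[\mu^2 \frac{\partial^2 g(x_n,y_n)}{\partial x^2} + \lambda^2 g^2(x_n,y_n) \frac{\partial^2 g(x_n,y_n)}{\partial y^2} + 2\mu\lambda g(x_n,y_n) \frac{\partial^2 g(x_n,y_n)}{\partial x \partial y}\right] + \cdots . \end{aligned} \tag{5.1.25} $$
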
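Direct substitution into the first of equations (5.1.23) gives

$$ \begin{aligned} y_{n+1} ={}& y_n + h(\alpha_0 + \alpha_1)\,g(x_n,y_n) + h^2 \alpha_1\left[\mu \frac{\partial g(x_n,y_n)}{\partial x} + \lambda g(x_n,y_n) \frac{\partial g(x_n,y_n)}{\partial y}\right] \\ &+ \tfrac{1}{2} h^3 \alpha_1\left[\mu^2 \frac{\partial^2 g(x_n,y_n)}{\partial x^2} + \lambda^2 g^2(x_n,y_n) \frac{\partial^2 g(x_n,y_n)}{\partial y^2} + 2\mu\lambda g(x_n,y_n) \frac{\partial^2 g(x_n,y_n)}{\partial x \partial y}\right] + \cdots . \end{aligned} \tag{5.1.26} $$
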
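We can also expand y' in a two dimensional Taylor series making use of the original differential equation
(5.1.3) to get

$$ \left. \begin{aligned} y' &= g(x,y) \\ y'' &= \frac{\partial g(x,y)}{\partial x} + y'\frac{\partial g(x,y)}{\partial y} = \frac{\partial g(x,y)}{\partial x} + g(x,y)\frac{\partial g(x,y)}{\partial y} \\ y''' &= \frac{\partial y''}{\partial x} + y'\frac{\partial y''}{\partial y} = \frac{\partial^2 g(x,y)}{\partial x^2} + \frac{\partial g(x,y)}{\partial x}\frac{\partial g(x,y)}{\partial y} + g(x,y)\frac{\partial^2 g(x,y)}{\partial x \partial y} \\ &\quad + g(x,y)\frac{\partial^2 g(x,y)}{\partial y \partial x} + g(x,y)\left[\frac{\partial g(x,y)}{\partial y}\right]^2 + g^2(x,y)\frac{\partial^2 g(x,y)}{\partial y^2} \end{aligned} \right\} . \tag{5.1.27} $$
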
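Substituting this into the standard form of the Taylor series as given by equation (5.1.19) yields

$$ \begin{aligned} y_{n+1} ={}& y_n + h\,g(x,y) + \frac{h^2}{2}\left[\frac{\partial g(x,y)}{\partial x} + g(x,y)\frac{\partial g(x,y)}{\partial y}\right] + \frac{h^3}{6}\Biggl(\frac{\partial^2 g(x,y)}{\partial x^2} + g^2(x,y)\frac{\partial^2 g(x,y)}{\partial y^2} \\ &+ 2g(x,y)\frac{\partial^2 g(x,y)}{\partial x \partial y} + \frac{\partial g(x,y)}{\partial y}\left[\frac{\partial g(x,y)}{\partial x} + g(x,y)\frac{\partial g(x,y)}{\partial y}\right]\Biggr) . \end{aligned} \tag{5.1.28} $$
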
Now by comparing this term by term with the expansion shown in equation (5.1.26) we can conclude that
the free parameters α0, α1, µ, and λ must be constrained by

$$ \left. \begin{aligned} (\alpha_0 + \alpha_1) &= 1 \\ \alpha_1 \mu &= \tfrac{1}{2} \\ \alpha_1 \lambda &= \tfrac{1}{2} \end{aligned} \right\} . \tag{5.1.29} $$

As we suggested earlier, the formula is under-determined by one constraint. However, we may use the
constraint equations as represented by equation (5.1.29) to express the free parameters in terms of a single
constant c. Thus the parameters are

$$ \left. \begin{aligned} \alpha_0 &= 1 - c \\ \alpha_1 &= c \\ \mu = \lambda &= \frac{1}{2c} \end{aligned} \right\} , \tag{5.1.30} $$

and the approximation formula becomes

$$ \begin{aligned} y_{n+1} ={}& y_n + h\,g(x,y) + \frac{h^2}{2}\left[\frac{\partial g(x,y)}{\partial x} + g(x,y)\frac{\partial g(x,y)}{\partial y}\right] \\ &+ \frac{h^3}{8c}\left[\frac{\partial^2 g(x,y)}{\partial x^2} + g^2(x,y)\frac{\partial^2 g(x,y)}{\partial y^2} + 2g(x,y)\frac{\partial^2 g(x,y)}{\partial x \partial y}\right] . \end{aligned} \tag{5.1.31} $$

We can match the first two terms of the Taylor series with any choice of c. The error term will then be of
order $O(h^3)$ and specifically has the form

$$ R_{n+1} = -\frac{h^3}{24c}\left([3 - 4c]\,y'''_n - 3\frac{\partial g(x_n,y_n)}{\partial y}\,y''_n\right) . \tag{5.1.32} $$

Clearly the most effective choice of c will depend on the solution so that there is no general "best" choice.
However, a number of authors recommend c = ½ as a general purpose value.

If we increase the number of terms in the series, the under-determination of the constants gets
rapidly worse. More and more parameters must be chosen arbitrarily. When these formulae are given, the
arbitrariness has often been removed by fiat. Thus one may find various Runge-Kutta formulae of the same
order. For example, a common such fourth order formula is

$$ \left. \begin{aligned} y_{n+1} &= y_n + (t_0 + 2t_1 + 2t_2 + t_3)/6 \\ t_0 &= h\,g(x_n, y_n) \\ t_1 &= h\,g[(x_n + \tfrac{1}{2}h), (y_n + \tfrac{1}{2}t_0)] \\ t_2 &= h\,g[(x_n + \tfrac{1}{2}h), (y_n + \tfrac{1}{2}t_1)] \\ t_3 &= h\,g[(x_n + h), (y_n + t_2)] \end{aligned} \right\} . \tag{5.1.33} $$

Here the "best" choice for the under-determined parameters has already been made largely on the basis of
experience.
If we apply these formulae to our test differential equation (5.1.10), we need first specify which
Runge-Kutta formula we plan to use. Let us try the second order (i.e. exact for quadratic polynomials)
formula given by equation (5.1.23) with the choice of constants given by equation (5.1.30) when c = ½. The
formula then becomes

$$ \left. \begin{aligned} y_{n+1} &= y_n + \tfrac{1}{2}t_0 + \tfrac{1}{2}t_1 \\ t_0 &= h\,g(x_n, y_n) \\ t_1 &= h\,g[(x_n + h), (y_n + t_0)] \end{aligned} \right\} . \tag{5.1.34} $$

So that we may readily compare to the first order Picard formula, we will take h = 1 and y(0) = 1. Then
taking g(x,y) from equation (5.1.10) we get for the first step that

$$ \left. \begin{aligned} t_0 &= h x_0 y_0 = (1)(0)(1) = 0 \\ t_1 &= h(x_0 + h)(y_0 + t_0) = (1)(0 + 1)(1 + 0) = 1 \\ y(x_0 + h) &= y_1 = (1) + (\tfrac{1}{2})(0) + (\tfrac{1}{2})(1) = \tfrac{3}{2} \end{aligned} \right\} . \tag{5.1.35} $$

The second step yields

$$ \left. \begin{aligned} t_0 &= h x_1 y_1 = (1)(1)(\tfrac{3}{2}) = \tfrac{3}{2} \\ t_1 &= h(x_1 + h)(y_1 + t_0) = (1)(1 + 1)(1 + \tfrac{3}{2}) = 5 \\ y(x_1 + h) &= y_2 = (\tfrac{3}{2}) + (\tfrac{1}{2})(\tfrac{3}{2}) + (\tfrac{1}{2})(5) = \tfrac{19}{4} \end{aligned} \right\} . \tag{5.1.36} $$
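
The one-parameter family of second order formulae can be sketched compactly in Python. The code below is an illustration, not a definitive implementation: the function name `rk2_step` and its argument order are assumptions, the constants are taken from the parameterization in terms of c, and exact rational arithmetic is used so that the first step can be compared with the fractions above.

```python
from fractions import Fraction

def rk2_step(g, x, y, h, c=Fraction(1, 2)):
    """One second order Runge-Kutta step with alpha0 = 1 - c, alpha1 = c, mu = lam = 1/(2c)."""
    alpha0, alpha1 = 1 - c, c
    mu = lam = 1 / (2 * c)
    t0 = h * g(x, y)
    t1 = h * g(x + mu * h, y + lam * t0)
    return y + alpha0 * t0 + alpha1 * t1

g = lambda x, y: x * y                       # test equation (5.1.10)

# First step from x = 0 to x = 1, taken once with h = 1 ...
y_full = rk2_step(g, Fraction(0), Fraction(1), Fraction(1))

# ... and again as two steps of h = 1/2.
half = Fraction(1, 2)
y_mid = rk2_step(g, Fraction(0), Fraction(1), half)
y_half = rk2_step(g, half, y_mid, half)

print(y_full, y_half, float(y_half))
```

The single step returns 3/2, and the two half steps return 207/128, or about 1.6172. Other members of the family are obtained by passing a different value of c.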

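Table 5.2
Sample Runge-Kutta Solutions

| | Second Order Solution, h=1 | Second Order Solution, h=1/2 | yc | Fourth Order Solution, h=1 |
|---|---|---|---|---|
| Step 1 | | | | |
| i | ti | ti | | ti |
| 0 | 0.0 | [0 , 9/32] | | 0.00000 |
| 1 | 1.0 | [1/4 , 45/64] | | 0.50000 |
| 2 | ----------- | ----------- | | 0.62500 |
| 3 | ----------- | ----------- | | 1.62500 |
| y1 | 1.5 | 1.6172 | 1.64587 | 1.65583 |
| δy1 | 0.1172 | | | |
| h'1 | 0.8532* | | | |
| Step 2 | | | | |
| i | ti | ti | | ti |
| 0 | 1.5 | [0.8086 , 2.1984] | | 1.64583 |
| 1 | 5.0 | [1.8193 , 5.1296] | | 3.70313 |
| 2 | ----------- | ----------- | | 5.24609 |
| 3 | ----------- | ----------- | | 13.78384 |
| y2 | 4.75 | 6.5951 | 7.38906 | 7.20051 |
| δy2 | 1.8451 | | | |
| h'2 | 0.0635 | | | |

\* This value assumes that δy0 = 0.1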

The Runge-Kutta formula tends to under-estimate the solution in a systematic fashion. If we reduce the step
size to h = ½ the agreement is much better as the error term in this formula is of $O(h^3)$. The results for h
= ½ are given in table 5.2 along with the results for h = 1. In addition we have tabulated the results for the
fourth order formula given by equation (5.1.33). For our example, the first step would require that equation
(5.1.33) take the form

$$ \left. \begin{aligned} t_0 &= h x_0 y_0 = (1)(0)(1) = 0 \\ t_1 &= h(x_0 + \tfrac{1}{2}h)(y_0 + \tfrac{1}{2}t_0) = (1)(0 + \tfrac{1}{2})(1 + 0) = \tfrac{1}{2} \\ t_2 &= h(x_0 + \tfrac{1}{2}h)(y_0 + \tfrac{1}{2}t_1) = (1)(0 + \tfrac{1}{2})[1 + (\tfrac{1}{2})(\tfrac{1}{2})] = \tfrac{5}{8} \\ t_3 &= h(x_0 + h)(y_0 + t_2) = (1)(0 + 1)[1 + \tfrac{5}{8}] = \tfrac{13}{8} \\ y(x_0 + h) &= y_1 = (1) + [(0) + 2(\tfrac{1}{2}) + 2(\tfrac{5}{8}) + (\tfrac{13}{8})]/6 = \tfrac{79}{48} \end{aligned} \right\} . \tag{5.1.37} $$

The error term for this formula is of $O(h^5)$ so we would expect it to be superior to the second order formula
for h = ½ and indeed it is. These results demonstrate that usually it is preferable to increase the accuracy of a
solution by increasing the accuracy of the integration formula rather than decreasing the step size. The
calculations leading to Table 5.2 were largely carried out using fractional arithmetic so as to eliminate the
round-off error. The effects of round-off error are usually such that they are more serious for a diminished
step size than for an integration formula yielding suitably increased accuracy to match the decreased step
size. This simply accentuates the necessity to improve solution accuracy by improving the approximation
accuracy of the integration formula.
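
The fourth order formula of equation (5.1.33) is equally short in code. The following Python sketch is illustrative; the function name `rk4_step` is an assumption, and rational arithmetic is used in the spirit of the fractional arithmetic mentioned above.

```python
from fractions import Fraction

def rk4_step(g, x, y, h):
    """One step of the fourth order Runge-Kutta formula of equation (5.1.33)."""
    t0 = h * g(x, y)
    t1 = h * g(x + h / 2, y + t0 / 2)
    t2 = h * g(x + h / 2, y + t1 / 2)
    t3 = h * g(x + h, y + t2)
    return y + (t0 + 2 * t1 + 2 * t2 + t3) / 6

g = lambda x, y: x * y                  # test equation (5.1.10)
h = Fraction(1)
y1 = rk4_step(g, Fraction(0), Fraction(1), h)   # first step, x = 0 to 1
y2 = rk4_step(g, Fraction(1), y1, h)            # second step, x = 1 to 2
print(y1, float(y1), float(y2))
```

The first step reproduces the value 79/48 of equation (5.1.37), and the second step gives approximately 7.2005, to be compared with the analytic value of about 7.3891.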

The Runge-Kutta type schemes enjoy great popularity as their application is quite straightforward
and they tend to be quite stable. Their greatest appeal comes from the fact that they are one-step methods.
Only the information about the function at the previous step is necessary to predict the solution at the next
step. Thus they are extremely useful in initiating a solution starting with the initial value at the boundary of
the range. The greatest drawback of the methods is their relative inefficiency. For example, the fourth order
scheme requires four evaluations of the function at each step. We shall see that there are other methods that
require far fewer evaluations of the function at each step and yet have a higher order.

A numerical solution to a differential equation is of little use if there is no estimate of its
accuracy. However, as is clear from equation (5.1.32), the formal estimate of the truncation error is often
more difficult than finding the solution. Unfortunately, the truncation error for most problems involving
differential equations tends to mimic the solution. That is, should the solution be monotonically increasing,
then the absolute truncation error will also increase. Even monotonically decreasing solutions will tend to
have truncation errors that keep the same sign and accumulate as the solution progresses. The common effect
of truncation errors on oscillatory solutions is to introduce a "phase shift" in the solution. Since the effect of
truncation error tends to be systematic, there must be some method for estimating its magnitude.

Although the formal expression of the truncation error [say equation (5.1.32)] is usually rather
formidable, such expressions always depend on the step size. Thus we may use the step size h itself to
estimate the magnitude of the error. We can then use this estimate and an a priori value of the largest
acceptable error to adjust the step size. Virtually all general algorithms for the solution of differential
equations contain a section for the estimate of the truncation error and the subsequent adjustment of the step
size h so that predetermined tolerances can be met. Unfortunately, these methods of error estimate will rely
on the variation of the step size at each step. This will generally triple the amount of time required to effect
the solution. However, the increase in time spent making a single step may be offset by being able to use
much larger steps resulting in an overall savings in time. The general accuracy cannot be arbitrarily
increased by decreasing the step size. While this will reduce the truncation error, it will increase the effects
of round-off error due to the increased amount of calculation required to cover the same range. Thus one
does not want to set the a priori error tolerance too low or the round-off error may destroy the validity of the
solution. Ideally, then, we would like our solution to proceed with rather large step sizes (i.e. values of h)
when the solution is slowly varying and automatically decrease the step size when the solution begins to
change rapidly. With this in mind, let us see how we may control the step size from tolerances set on the
truncation error.

Given either the one step methods discussed above or the multi-step methods that follow, assume
that we have determined the solution yn at some point xn. We are about to take the next step in the solution to
xn+1 by an amount h and wish to estimate the truncation error in yn+1. Calculate this value of the solution two
ways. First, arrive at xn+1 by taking a single step h; then repeat the calculation taking two steps of (h/2). Let
us call the first solution y1,n+1 and the second y2,n+1. Now the exact solution (neglecting earlier accumulated
error) at xn+1 could be written in each case as

$$ \left. \begin{aligned} y_e &= y_{1,n+1} + \alpha h^{k+1} + \cdots \\ y_e &= y_{2,n+1} + 2\alpha(\tfrac{1}{2}h)^{k+1} + \cdots \end{aligned} \right\} , \tag{5.1.38} $$

where k is the order of the approximation scheme. Now α can be regarded as a constant throughout the
interval h since it is just the coefficient of the Taylor series fit for the (k+1)th term. Now let us define δ as a
measure of the error so that

$$ \delta(y_{n+1}) \equiv y_{2,n+1} - y_{1,n+1} = \alpha h^{k+1}(1 - 2^{-k}) . \tag{5.1.39} $$

Clearly,

$$ \delta(y_{n+1}) \approx h^{k+1} , \tag{5.1.40} $$

so that the step size h can be adjusted at each step in order that the truncation error remains uniform by

$$ h_{n+1} = h_n \sqrt[k+1]{\delta(y_n)/\delta(y_{n+1})} . \tag{5.1.41} $$

Initially, one must set the tolerance at some pre-assigned level ε so that

$$ \delta y_0 \le \varepsilon . \tag{5.1.42} $$

If we use this procedure to investigate the step sizes used in our test of the Runge-Kutta method, we
see that we certainly chose the step size to be too large. We can verify this with the second order solution for
we carried out the calculation for step sizes of h=1 and h=½. Following the prescription of equations (5.1.39)
and (5.1.41) we have, for the results specified in Table 5.2,

$$ \left. \begin{aligned} \delta y_1 &= y_{2,1} - y_{1,1} = 1.6172 - 1.500 = 0.1172 \\ h_1 &= h_0 \frac{\delta y_0}{\delta y_1} = (1)(0.1/0.1172) = 0.8532 \end{aligned} \right\} . \tag{5.1.43} $$

Here we have tacitly assumed an initial tolerance of δy0 = 0.1. While this is arbitrary and rather large for a
tolerance on a solution, it is illustrative and consistent with the spirit of the solution. We see that to maintain
the accuracy of the solution within │0.1│ we should decrease the step size slightly for the initial step. The
error at the end of the first step is 0.16 for h = 1, while it is only about 0.04 for h = ½. By comparing the
numerical answers with the analytic answer, yc, we see that a factor of two change in the step size reduces the
error by about a factor of four. Our stated tolerance of 0.1 requires only a reduction in the error of about 33%
which implies a reduction of about 16% in the step size or a new step size h1' = 0.84h1. This is amazingly
close to the recommended change, which was determined without knowledge of the analytic solution.
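
The step-doubling estimate for the first step can be sketched as follows. This Python fragment is an illustration under stated assumptions: the names `doubling_estimate` and `adjust_step` are hypothetical, the stepping formula is the second order formula of equation (5.1.34), and the exponent applied to the ratio of errors is left as a parameter. Setting `power = 1` reproduces the arithmetic of equation (5.1.43), while `power = 1/(k+1)` corresponds to the scaling of δ with h^(k+1).

```python
from fractions import Fraction

def rk2_step(g, x, y, h):
    """Second order Runge-Kutta step of equation (5.1.34)."""
    t0 = h * g(x, y)
    t1 = h * g(x + h, y + t0)
    return y + (t0 + t1) / 2

def doubling_estimate(step, g, x, y, h):
    """Return (one-step value, two-half-step value, delta) for a step of size h."""
    y_one = step(g, x, y, h)
    y_mid = step(g, x, y, h / 2)
    y_two = step(g, x + h / 2, y_mid, h / 2)
    return y_one, y_two, abs(y_two - y_one)

def adjust_step(h, delta_prev, delta_new, power=1.0):
    """New step size from the ratio of successive error estimates (assumed form)."""
    return h * (delta_prev / delta_new) ** power

g = lambda x, y: x * y                                   # test equation (5.1.10)
y_one, y_two, delta = doubling_estimate(rk2_step, g, Fraction(0), Fraction(1), Fraction(1))
h_new = adjust_step(1.0, 0.1, float(delta))              # assumed tolerance delta_y0 = 0.1
h_root = adjust_step(1.0, 0.1, float(delta), power=1.0 / 3.0)   # k = 2
print(float(delta), h_new, h_root)
```

The estimate δ is 15/128, or about 0.1172, and the direct ratio gives a new step size of about 0.853. Applying the cube root appropriate to a second order formula gives a milder reduction, to roughly 0.95.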

The amount of the step size adjustment at the second step is made to maintain the accuracy that
exists at the end of the first step. Thus,
$$\left.\begin{aligned} \delta y_2 &= y_{2,2} - y_{1,2} = 6.5951 - 4.7500 = 1.8451 \\ h_2 &= h_1\,\frac{\delta y_1}{\delta y_2} = (1)(0.1172/1.8451) = 0.0635 \end{aligned}\right\} . \tag{5.1.44}$$
Normally these adjustments would be made cumulatively in order to maintain the initial tolerance. However,
the convenient values for the step sizes were useful for the earlier comparisons of integration methods. The
rapid increase of the solution after x = 1 causes the Runge-Kutta method to have an increasingly difficult
time maintaining accuracy. This is abundantly clear in the drastic reduction in the step size suggested at the
end of the second step. At the end of the first step, the relative errors were 9% and 2% for the h=1 and h=½
step size solutions respectively. At the end of the second step those errors, resulting from comparison with
the analytic solution, had jumped to 55% and 12% respectively (see Table 5.2). While a factor of two change
in the step size still produces about a factor of four change in the solution, to arrive at a relative error of 9%,
we will need more like a factor of 6 change in the solution. This would suggest a change in the step size of
about a factor of three, but the recommended change is more like a factor of 16. This difference can be
understood by noticing that equation (5.1.42) attempts to maintain the absolute error less than δyn. For our
problem this is about 0.11 at the end of step one. To keep the error within those tolerances, the accuracy at
step two would have to be within about 1.5% of the correct answer. To get there from 55% means a
reduction in the error of a factor of 36, which corresponds to a reduction in the step size of a factor of about
18, which is close to that given by the estimate.

Thus we see that the equation (5.1.42) is designed to maintain an absolute accuracy in the solution
by adjusting the step size. Should one wish to adjust the step size so as to maintain a relative or percentage
accuracy, then one could adjust the step size according to
$$h_{n+1} = h_n \sqrt[k+1]{\frac{\delta(y_n)\,y_{n+1}}{\delta(y_{n+1})\,y_n}} . \tag{5.1.45}$$

While these procedures vary the step size so as to maintain constant truncation error, a significant price in
the amount of computing must be paid at each step. However, the amount of extra effort need not be used
only to estimate the error and thereby control it. One can solve equations (5.1.38) (neglecting terms of order
greater than k) to provide an improved estimate of yn+1. Specifically


$$y_e \cong y_{2,n+1} + \frac{\delta(y_{n+1})}{2^k - 1} . \tag{5.1.46}$$

However, since one cannot simultaneously include this improvement directly in the error estimate, it is
advisable that it be regarded as a "safety factor" and that one proceed with the error estimate as if the
improvement had not been made. While this may seem unduly conservative, in the numerical solution of
differential equations conservatism is a virtue.
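
The step-doubling error estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scheme used to produce Table 5.2: it assumes Heun's second-order Runge-Kutta formula (k = 2) as the underlying integrator and the test equation y' = y, and the names `heun_step`, `step_doubling`, and `next_step_size` are hypothetical. The sketch forms δ from one full step and two half steps as in equation (5.1.39), applies the improvement of equation (5.1.46), and rescales the step on the assumption that δ varies as h^(k+1).

```python
import math


def heun_step(g, x, y, h):
    """One step of Heun's second-order Runge-Kutta formula (assumed integrator)."""
    t0 = h * g(x, y)
    t1 = h * g(x + h, y + t0)
    return y + 0.5 * (t0 + t1)


def step_doubling(g, x, y, h, k=2):
    """Return (y2, delta, y_improved) for the interval from x to x + h.

    y1 comes from a single step of size h and y2 from two steps of size h/2.
    delta = y2 - y1 is the error measure of equation (5.1.39), and
    y_improved applies the correction of equation (5.1.46).
    k is the order of the underlying formula (2 for Heun).
    """
    y1 = heun_step(g, x, y, h)
    y_half = heun_step(g, x, y, 0.5 * h)
    y2 = heun_step(g, x + 0.5 * h, y_half, 0.5 * h)
    delta = y2 - y1
    y_improved = y2 + delta / (2 ** k - 1)
    return y2, delta, y_improved


def next_step_size(h, delta, tol, k=2):
    """Rescale h toward the tolerance, assuming delta scales as h**(k+1)."""
    return h * (tol / abs(delta)) ** (1.0 / (k + 1))


if __name__ == "__main__":
    g = lambda x, y: y  # test equation y' = y with exact solution exp(x)
    y2, delta, y_imp = step_doubling(g, 0.0, 1.0, 0.1)
    exact = math.exp(0.1)
    print("error of two half steps :", abs(y2 - exact))
    print("error after improvement :", abs(y_imp - exact))
    print("suggested next step     :", next_step_size(0.1, delta, 1e-6))
```

Note that the hand calculation in equations (5.1.43) and (5.1.44) applies the ratio of the error measures directly, whereas this sketch takes the (k+1)th root of that ratio; the root gives a gentler adjustment of the step.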

## c. Multi-Step and Predictor-Corrector Methods

The high order one step methods achieve their accuracy by exploring the solution space in
the neighborhood of the specific solution. In principle, we could use prior information about the solution to
constrain our extrapolation to the next step. Since this information is the direct result of prior calculation, far
greater levels of efficiency can be achieved than by methods such as Runge-Kutta that explore the solution
space in the vicinity of the required solution. By using the solution at n points we could, in principle, fit an
(n-1) degree polynomial to the solution at those points and use it to obtain the solution at the (n+1)st point.
Such methods are called multi-step methods. However, one should remember the caveats at the end of
chapter 3 where we pointed out that polynomial extrapolation is extremely unstable. Thus such a procedure
by itself will generally not provide a suitable method for the solution of differential equations. But when
combined with algorithms that compensate for the instability such schemes can provide very stable solution
algorithms. Algorithms of this type are called predictor-corrector methods and there are numerous forms of
them. So rather than attempt to cover them all, we shall say a few things about the general theory of such
schemes and give some examples.

A predictor-corrector algorithm, as the name implies, consists of basically two parts. The predictor
extrapolates the solution over some finite range h based on the information at prior points and is inherently
unstable. The corrector allows for this local instability and makes a correction to the solution at the end of
the interval also based on prior information as well as the extrapolated solution. Conceptually, the notion of a
predictor is quite simple. In its simplest form, such a scheme is the one-step predictor where
$$y_{n+1} = y_n + h\,y'_n . \tag{5.1.47}$$

By using the value of the derivative at xn the scheme will systematically underestimate the proper
value required for extrapolation of any monotonically increasing solution (see Figure 5.2). The error will
build up cumulatively and hence it is unstable. A better strategy would be to use the value of the derivative
midway between the two solution points, or alternatively to use the information from the prior two points to
predict yn+1. Thus a two point predictor could take the form
$$y_{n+1} = y_{n-1} + 2h\,y'_n . \tag{5.1.48}$$

Although this is a two-point scheme, the extrapolating polynomial is still a straight line. We could
have used the value of yn directly to fit a parabola through the two points, but we didn't due to the
instabilities associated with a higher degree polynomial extrapolation. This deliberate rejection of
some of the informational constraints in favor of increased stability is what makes predictor-corrector
schemes non-trivial and effective. In the general case, we have great freedom to use the information we have
regarding yi and y'i. If we were to include all the available information, a general predictor would have the
form
$$y_{n+1} = \sum_{i=0}^{n} a_i y_i + h \sum_{i=0}^{n} b_i y'_i + R , \tag{5.1.49}$$

where the $a_i$'s and $b_i$'s are chosen by imposing the appropriate constraints at the points $x_i$ and R is an error
term.

When we have decided on the form of the predictor, we must implement some sort of corrector
scheme to reduce the truncation error introduced by the predictor. As with the predictor, let us take a simple
case of a corrector as an example. Having produced a solution at xn+1 we can calculate the value of the
derivative y'n+1 at xn+1. This represents new information and can be used to modify the results of the
prediction. For example, we could write a corrector as
$$y_{n+1}^{(k)} = y_n + \tfrac{1}{2} h \left[ y'^{(k-1)}_{n+1} + y'_n \right] . \tag{5.1.50}$$

Therefore, if we were to write a general expression for a corrector based on the available information we
would get
$$y_{n+1}^{(k)} = \sum_{i=0}^{n} \alpha_i y_i + h \sum_{i=0}^{n} \beta_i y'_i + h \beta_{n+1}\, y'^{(k-1)}_{n+1} . \tag{5.1.51}$$

Figure 5.2 shows the instability of a simple predictor scheme that systematically
underestimates the solution leading to a cumulative build up of truncation error.

Equations (5.1.50) and (5.1.51) both are written in the form of iteration formulae, but it is not at all clear that
the fixed-point for these formulae is any better representation of the solution than single iteration. So in order
to minimize the computational demands of the method, correctors are generally applied only once. Let us
now consider certain specific types of predictor corrector schemes that have been found to be successful.
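
The simplest predictor-corrector pair discussed so far, the one-step predictor of equation (5.1.47) followed by a single application of the corrector of equation (5.1.50), can be sketched as follows. The function names and the test equation y' = y are assumptions made for illustration. Running the integration with and without the corrector shows the systematic underestimate of the predictor alone and how much a single correction recovers.

```python
import math


def euler_predictor(g, x, y, h):
    """One-step predictor of equation (5.1.47)."""
    return y + h * g(x, y)


def trapezoid_corrector(g, x, y, y_pred, h):
    """Corrector of equation (5.1.50), applied once to the predicted value."""
    return y + 0.5 * h * (g(x + h, y_pred) + g(x, y))


def integrate(g, x0, y0, h, n_steps, correct=True):
    """Advance n_steps of size h, optionally applying the corrector once per step."""
    x, y = x0, y0
    for _ in range(n_steps):
        y_pred = euler_predictor(g, x, y, h)
        y = trapezoid_corrector(g, x, y, y_pred, h) if correct else y_pred
        x += h
    return y


if __name__ == "__main__":
    g = lambda x, y: y  # test equation y' = y, exact value at x = 1 is e
    print("predictor only     :", integrate(g, 0.0, 1.0, 0.1, 10, correct=False))
    print("predictor-corrector:", integrate(g, 0.0, 1.0, 0.1, 10))
    print("exact              :", math.e)
```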

Hamming¹ gives a number of popular predictor-corrector schemes, the best known of which is the
Adams-Bashforth-Moulton Predictor-Corrector. Predictor schemes of the Adams-Bashforth type emphasize
the information contained in prior values of the derivative as opposed to the function itself. This is
presumably because the derivative is usually a more slowly varying function than the solution and so can be
more accurately extrapolated. This philosophy is carried over to the Adams-Moulton Corrector. A classical
fourth-order formula of this type is
$$\left.\begin{aligned} y_{n+1}^{(1)} &= y_n + h\,(55 y'_n - 59 y'_{n-1} + 37 y'_{n-2} - 9 y'_{n-3})/24 + O(h^5) \\ y_{n+1} &= y_n + h\,(9 y'_{n+1} + 19 y'_n - 5 y'_{n-1} + y'_{n-2})/24 + O(h^5) \end{aligned}\right\} . \tag{5.1.52}$$
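
A minimal sketch of the fourth-order Adams-Bashforth-Moulton pair follows. It uses the standard coefficients of the Adams-Bashforth predictor and the Adams-Moulton corrector, whose last term is +y'(n-2). Two choices here are assumptions rather than part of the text: the three starting values are generated with a fourth-order Runge-Kutta step, and the corrector is applied only once per step. The names `rk4_step` and `abm4` are hypothetical.

```python
import math


def rk4_step(g, x, y, h):
    """Classical fourth-order Runge-Kutta step, used only to start the method."""
    t0 = h * g(x, y)
    t1 = h * g(x + 0.5 * h, y + 0.5 * t0)
    t2 = h * g(x + 0.5 * h, y + 0.5 * t1)
    t3 = h * g(x + h, y + t2)
    return y + (t0 + 2 * t1 + 2 * t2 + t3) / 6


def abm4(g, x0, y0, h, n_steps):
    """Adams-Bashforth predictor with one Adams-Moulton correction per step."""
    xs, ys = [x0], [y0]
    # Start-up: the multi-step formulae need four known points.
    for _ in range(min(3, n_steps)):
        ys.append(rk4_step(g, xs[-1], ys[-1], h))
        xs.append(xs[-1] + h)
    ds = [g(x, y) for x, y in zip(xs, ys)]  # stored derivative values
    for n in range(3, n_steps):
        # Predictor: extrapolate using the four most recent derivatives.
        y_pred = ys[n] + h * (55 * ds[n] - 59 * ds[n - 1]
                              + 37 * ds[n - 2] - 9 * ds[n - 3]) / 24
        x_next = xs[n] + h
        # Corrector: use the derivative evaluated at the predicted point.
        d_pred = g(x_next, y_pred)
        y_corr = ys[n] + h * (9 * d_pred + 19 * ds[n]
                              - 5 * ds[n - 1] + ds[n - 2]) / 24
        xs.append(x_next)
        ys.append(y_corr)
        ds.append(g(x_next, y_corr))
    return xs, ys


if __name__ == "__main__":
    xs, ys = abm4(lambda x, y: y, 0.0, 1.0, 0.1, 10)
    print("y(1) =", ys[-1], " error =", abs(ys[-1] - math.e))
```

Once started, each step costs two derivative evaluations, compared with four for the Runge-Kutta step used to start it.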
Lengthy study of predictor-corrector schemes has evolved some special forms such as this one
$$\left.\begin{aligned} z_{n+1} &= (2 y_{n-1} + y_{n-2})/3 + h\,(191 y'_n - 107 y'_{n-1} + 109 y'_{n-2} - 25 y'_{n-3})/75 \\ u_{n+1} &= z_{n+1} - 707 (z_n - c_n)/750 \\ c_{n+1} &= (2 y_{n-1} + y_{n-2})/3 + h\,(25 u'_{n+1} + 91 y'_n + 43 y'_{n-1} + 9 y'_{n-2})/72 \\ y_{n+1} &= c_{n+1} + 43 (z_{n+1} - c_{n+1})/750 + O(h^6) \end{aligned}\right\} , \tag{5.1.53}$$

where the extrapolation formula has been expressed in terms of some recursive parameters ui and ci. The
derivatives of these intermediate parameters are obtained by using the original differential equation so that

$$u' = g(x, u) . \tag{5.1.54}$$

By good chance, this formula [equation (5.1.53)] has an error term that varies as $O(h^6)$ and so is a fifth-order
formula. Finally a classical predictor-corrector scheme which combines Adams-Bashforth and Milne
predictors and is quite stable is parametrically (see Hamming, p. 206)
$$\left.\begin{aligned} z_{n+1} &= \tfrac{1}{2}(y_n + y_{n-1}) + h\,(119 y'_n - 99 y'_{n-1} + 69 y'_{n-2} - 17 y'_{n-3})/48 \\ u_{n+1} &= z_{n+1} - 161 (z_n - c_n)/170 \\ c_{n+1} &= \tfrac{1}{2}(y_n + y_{n-1}) + h\,(17 u'_{n+1} + 51 y'_n + 3 y'_{n-1} + y'_{n-2})/48 \\ y_{n+1} &= c_{n+1} + 9 (z_{n+1} - c_{n+1})/170 + O(h^6) \end{aligned}\right\} . \tag{5.1.55}$$

Press et al.² are of the opinion that predictor-corrector schemes have seen their day and are made
obsolete by the Bulirsch-Stoer method which they discuss at some length³. They quite properly point out that
the predictor-corrector schemes are somewhat inflexible when it comes to varying the step size. The step size
can be reduced by interpolating the necessary missing information from earlier steps and it can be expanded
in integral multiples by skipping earlier points and taking the required information from even earlier in the
solution. However, the Bulirsch-Stoer method, as described by Press et al., utilizes a predictor scheme with
some special properties. It may be parameterized as


$$\left.\begin{aligned} z_0 &= y(x_0) \\ z_1 &= z_0 + h z'_0 \\ z_{k+1} &= z_{k-1} + h z'_k , \qquad k = 1, 2, 3, \cdots, n-1 \\ y_n^{(1)} &= \tfrac{1}{2}(z_n + z_{n-1} + h z'_n) + O(h^5) \\ z' &= g(z, x) \end{aligned}\right\} . \tag{5.1.56}$$
It is an odd characteristic of the third of equations (5.1.56) that the error term only contains even
powers of the step size. Thus, we may use the same trick that was used in equation (5.1.46) of utilizing the
information generated in estimating the error term to improve the approximation order. But since only even
powers of h appear in the error term, this single step will gain us two powers of h resulting in a predictor of
order seven.
$$y_{nh} = \left\{ 4\, y_n^{(1)}(x + nh) - y_{n/2}^{(1)}[x + (n/2)(2h)] \right\}/3 + O(h^7) . \tag{5.1.57}$$

This yields a predictor that requires something on the order of 1½ evaluations of the function per step
compared to four for a Runge-Kutta formula of inferior order.
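
The predictor and the error-cancelling combination of equation (5.1.57) can be sketched as below. One assumption should be stated plainly: the sketch uses the modified midpoint rule in the form given by Press et al., in which the interval H is divided into n sub-steps of size h = H/n and the central recurrence is z(k+1) = z(k-1) + 2h z'(k), so the factor multiplying h may differ from the convention of equation (5.1.56). The function names and the test equation y' = y are also illustrative choices.

```python
import math


def modified_midpoint(g, x0, y0, H, n):
    """Advance y from x0 to x0 + H using n sub-steps of size h = H/n (n even)."""
    h = H / n
    z_prev = y0                    # z_0
    z = y0 + h * g(x0, y0)         # z_1
    for k in range(1, n):
        # z_{k+1} = z_{k-1} + 2 h z'_k  (form assumed from Press et al.)
        z_prev, z = z, z_prev + 2 * h * g(x0 + k * h, z)
    # Final smoothing step combining z_n, z_{n-1} and the derivative at the end.
    return 0.5 * (z + z_prev + h * g(x0 + H, z))


def extrapolated(g, x0, y0, H, n):
    """Combine the n and n/2 sub-step results so the leading error term cancels."""
    y_n = modified_midpoint(g, x0, y0, H, n)
    y_half = modified_midpoint(g, x0, y0, H, n // 2)
    return (4 * y_n - y_half) / 3


if __name__ == "__main__":
    g = lambda x, y: y  # test equation y' = y, exact value at x = 1 is e
    y8 = modified_midpoint(g, 0.0, 1.0, 1.0, 8)
    ye = extrapolated(g, 0.0, 1.0, 1.0, 8)
    print("8 sub-steps  error:", abs(y8 - math.e))
    print("extrapolated error:", abs(ye - math.e))
```

Because the error series of the midpoint result contains only even powers of h, the single combination removes the h² term and leaves an error of order h⁴ in this form.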

Now we come to the aspect of the Bulirsch-Stoer method that begins to differentiate it from classical
predictor-correctors. A predictor that operates over some finite interval can use a successively increasing
number of steps in order to make its prediction. Presumably the prediction will get better and better as the
step size decreases so that the number of steps to make the one prediction increases. Of course practical
aspects of the problem such as roundoff error and finite computing resources prevent us from using
arbitrarily small step sizes, but we can approximate what would happen in an ideal world without round-off
error and utilizing unlimited computers. Simply consider the prediction at the end of the finite interval H
where
$$H = \alpha h . \tag{5.1.58}$$

$$y_\alpha(x+H) = y(x+\alpha h) = f(h) . \tag{5.1.59}$$

Now we can phrase our problem to estimate the value of that function in the limit
$$\lim_{\substack{h \to 0 \\ \alpha \to \infty}} f(h) = Y_\infty(x+H) . \tag{5.1.60}$$

We can accomplish this by carrying out the calculation for successively smaller and smaller values of h and,
on the basis of these values, extrapolating the result to h=0. In spite of the admonitions raised in chapter 3
regarding extrapolation, the range here is small. But to produce a truly powerful numerical integration
algorithm, Bulirsch and Stoer carry out the extrapolation using rational functions in the manner described in
section 3.2 [equation (3.2.65)]. The superiority of rational functions to polynomials in representing most
analytic functions means that the step size can be quite large indeed and the conventional meaning of the
'order' of the approximation is irrelevant in describing the accuracy of the method.


In any case, remember that accuracy and order are not synonymous! Should the solution be
described by a slowly varying function and the numerical integration scheme operate by fitting high order
polynomials to prior information for the purposes of extrapolation, the high-order formula can give very
inaccurate results. This simply says that the integration scheme can be unstable even for well behaved
solutions.

Press et al.⁴ suggest that all one needs to solve ordinary differential equations is either a Runge-Kutta
or Bulirsch-Stoer method and it would seem that for most problems that may well be the case.
However, there are a large number of commercial differential equation solving algorithms and the majority
of them utilize predictor-corrector schemes. These schemes are generally very fast and the more
sophisticated ones carry out very involved error checking algorithms. They are generally quite stable and can
involve a very high order when required. In any event, the user should know how they work and be wary of
the results. It is far too easy to simply take the results of such programs at face value without ever
questioning the accuracy of the results. Certainly one should always ask the question "Are these results
reasonable?" at the end of a numerical integration. If one is genuinely skeptical, it is not a bad idea to take
the final value of the calculation as an initial value and integrate back over the range. Should one recover the
original initial value within the acceptable tolerances, one can be reasonably confident that the results are
accurate. If not, the difference between the beginning initial value and what is calculated by the reverse
integration over the range can be used to place limits on the accuracy of the initial integration.
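
The reverse-integration check just described is easy to automate. The sketch below is one possible arrangement, assuming a fourth-order Runge-Kutta integrator and the equation y' = 2xy with y(0) = 1, whose solution is exp(x²); the function names are hypothetical. The final value of the forward integration is used as the initial value for an integration back to the starting point, and the discrepancy in the recovered initial value serves as a rough indication of the accumulated error.

```python
import math


def rk4_step(g, x, y, h):
    """Classical fourth-order Runge-Kutta step (h may be negative)."""
    t0 = h * g(x, y)
    t1 = h * g(x + 0.5 * h, y + 0.5 * t0)
    t2 = h * g(x + 0.5 * h, y + 0.5 * t1)
    t3 = h * g(x + h, y + t2)
    return y + (t0 + 2 * t1 + 2 * t2 + t3) / 6


def integrate(g, x0, y0, x1, n_steps):
    """Integrate from x0 to x1 in n_steps equal steps; x1 < x0 runs backward."""
    h = (x1 - x0) / n_steps
    x, y = x0, y0
    for _ in range(n_steps):
        y = rk4_step(g, x, y, h)
        x += h
    return y


if __name__ == "__main__":
    g = lambda x, y: 2.0 * x * y
    y_end = integrate(g, 0.0, 1.0, 1.0, 100)     # forward: x = 0 to x = 1
    y_back = integrate(g, 1.0, y_end, 0.0, 100)  # backward: x = 1 to x = 0
    print("forward result        :", y_end)
    print("recovered initial value:", y_back)
    print("round-trip discrepancy :", abs(y_back - 1.0))
```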

## d. Systems of Differential Equations and Boundary Value Problems

All the methods we have developed for the solution of single first order differential
equations may be applied to the case where we have a coupled system of differential equations. We saw
earlier that such systems arose whenever we dealt with ordinary differential equations of order greater than
one. However, there are many scientific problems which are intrinsically described by coupled systems of
differential equations and so we should say something about their solution. The simplest way to see the
applicability of the single equation algorithms to a system of differential equations is to write a system like

$$\left.\begin{aligned} y'_1 &= g_1(x, y_1, y_2, \cdots, y_n) \\ y'_2 &= g_2(x, y_1, y_2, \cdots, y_n) \\ &\;\;\vdots \\ y'_n &= g_n(x, y_1, y_2, \cdots, y_n) \end{aligned}\right\} , \tag{5.1.61}$$

as a vector where each element represents one of the dependent variables or unknowns of the system. Then
the system becomes
$$\vec{y}\,' = \vec{g}(x, \vec{y}) , \tag{5.1.62}$$

which looks just like equation (5.1.3) so that everything applicable to that equation will apply to the system
of equations. Of course some care must be taken with the terminology. For example, equation (5.1.4) would
have to be understood as standing for an entire system of equations involving far more complicated integrals,
but in principle, the ideas carry over. Some care must also be extended to the error analysis in that the error
term is also a vector $\vec{R}(x)$. In general, one should worry about the magnitude of the error vector, but in
practice, it is usually the largest element that is taken as characterizing the accuracy of the solution.

To generate a numerical integration method for a specific algorithm, one simply applies it to each of
the equations that make up the system. By way of a specific example, let's consider a fourth order
Runge-Kutta algorithm as given by equation (5.1.33) and apply it to a system of two equations. We get

$$\left.\begin{aligned} y_{1,n+1} &= y_{1,n} + (t_0 + 2t_1 + 2t_2 + t_3)/6 \\ y_{2,n+1} &= y_{2,n} + (u_0 + 2u_1 + 2u_2 + u_3)/6 \\ t_0 &= h\,g_1(x_n, y_{1,n}, y_{2,n}) \\ t_1 &= h\,g_1[(x_n + \tfrac{1}{2}h), (y_{1,n} + \tfrac{1}{2}t_0), (y_{2,n} + \tfrac{1}{2}u_0)] \\ t_2 &= h\,g_1[(x_n + \tfrac{1}{2}h), (y_{1,n} + \tfrac{1}{2}t_1), (y_{2,n} + \tfrac{1}{2}u_1)] \\ t_3 &= h\,g_1[(x_n + h), (y_{1,n} + t_2), (y_{2,n} + u_2)] \\ u_0 &= h\,g_2(x_n, y_{1,n}, y_{2,n}) \\ u_1 &= h\,g_2[(x_n + \tfrac{1}{2}h), (y_{1,n} + \tfrac{1}{2}t_0), (y_{2,n} + \tfrac{1}{2}u_0)] \\ u_2 &= h\,g_2[(x_n + \tfrac{1}{2}h), (y_{1,n} + \tfrac{1}{2}t_1), (y_{2,n} + \tfrac{1}{2}u_1)] \\ u_3 &= h\,g_2[(x_n + h), (y_{1,n} + t_2), (y_{2,n} + u_2)] \end{aligned}\right\} . \tag{5.1.63}$$

We can generalize equation (5.1.63) to an arbitrary system of equations by writing it in vector form as
$$\vec{y}_{n+1} = \vec{A}(\vec{y}_n) . \tag{5.1.64}$$
The vector $\vec{A}(\vec{y}_n)$ consists of elements which are functions of dependent variables yi,n and xn, but which all
have the same general form varying only with $g_i(x, \vec{y})$. Since an nth order differential equation can always
be reduced to a system of n first order differential equations, an expression of the form of equation (5.1.63)
could be used to solve a second order differential equation.
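
Written for a vector of dependent variables, the fourth-order Runge-Kutta step looks exactly like the scalar one, with each of the parameters t and u of equation (5.1.63) becoming one component of a list. The sketch below is a plain-Python illustration with hypothetical function names. As a test case it assumes the second order equation y'' = -y, written as the system y1' = y2, y2' = -y1 with y1(0) = 0 and y2(0) = 1, so that the exact solution is (sin x, cos x).

```python
import math


def rk4_system_step(g, x, y, h):
    """One fourth-order Runge-Kutta step for y' = g(x, y), with y a list of floats."""
    def shifted(base, t, factor):
        # Return base + factor * t, component by component.
        return [b + factor * ti for b, ti in zip(base, t)]

    t0 = [h * gi for gi in g(x, y)]
    t1 = [h * gi for gi in g(x + 0.5 * h, shifted(y, t0, 0.5))]
    t2 = [h * gi for gi in g(x + 0.5 * h, shifted(y, t1, 0.5))]
    t3 = [h * gi for gi in g(x + h, shifted(y, t2, 1.0))]
    return [yi + (a + 2 * b + 2 * c + d) / 6
            for yi, a, b, c, d in zip(y, t0, t1, t2, t3)]


def integrate_system(g, x0, y0, x1, n_steps):
    """Integrate the system from x0 to x1 in n_steps equal steps."""
    h = (x1 - x0) / n_steps
    y = list(y0)
    for i in range(n_steps):
        y = rk4_system_step(g, x0 + i * h, y, h)
    return y


if __name__ == "__main__":
    oscillator = lambda x, y: [y[1], -y[0]]  # y'' = -y as a first order system
    y = integrate_system(oscillator, 0.0, [0.0, 1.0], 1.0, 100)
    print("y1(1) =", y[0], " sin(1) =", math.sin(1.0))
    print("y2(1) =", y[1], " cos(1) =", math.cos(1.0))
```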

The existence of coupled systems of differential equations admits the interesting possibility that the
constants of integration required to uniquely specify a solution are not all given at the same location. Thus
we do not have a full complement of yi,0's with which to begin the integration. Such problems are called
boundary value problems. A comprehensive discussion of boundary value problems is well beyond the
scope of this book, but we will examine the simpler problem of linear two point boundary value problems.
This subclass of boundary value problems is quite common in science and extremely well studied. It consists
of a system of linear differential equations (i.e. differential equations of the first degree only) where part of
the integration constants are specified at one location x0 and the remainder are specified at some other value
of the independent variable xn. These points are known as the boundaries of the problem and we seek a
solution to the problem within these boundaries. Clearly the solution can be extended beyond the boundaries
as the solution at the boundaries can serve as initial values for a standard numerical integration.

The general approach to such problems is to take advantage of the linearity of the equations, which
guarantees that any solution to the system can be expressed as a linear combination of a set of basis
solutions. A set of basis solutions is simply a set of solutions, which are linearly independent. Let us consider
a set of m linear first order differential equations where k values of the dependent variables are specified at
x0 and (m-k) values corresponding to the remaining dependent variables are specified at xn. We could solve
(m-k) initial value problems starting at x0 and specifying (m-k) independent sets of missing initial values so
that the initial value problems are uniquely determined. Let us denote the missing set of initial values at x0 by
$\vec{y}^{(0)}(x_0)$, which we know can be determined from initial sets of linearly independent trial initial values
${}^{j}\vec{y}^{(t)}(x_0)$ by
$$\vec{y}^{(0)}(x_0) = \mathbf{A}\,\mathbf{y}^{(t)}(x_0) . \tag{5.1.65}$$
The columns of $\mathbf{y}^{(t)}(x_0)$ are just the individual vectors ${}^{j}\vec{y}^{(t)}(x_0)$. Clearly the matrix A will have to be
diagonal to always produce $\vec{y}^{(0)}(x_0)$. Since the trial initial values are arbitrary, we will choose the elements
of the (m-k) sets to be
$${}^{j}y_i(x_0) = \delta_{ij} , \tag{5.1.66}$$
so that the missing initial values will be
$$\vec{y}^{(0)}(x_0)\,\mathbf{1} = \mathbf{1}\mathbf{A} = \mathbf{A} . \tag{5.1.67}$$
Integrating across the interval with these initial values will yield (m-k) solutions ${}^{j}\vec{y}^{(t)}(x_n)$ at the
other boundary. Since the equations are linear each trial solution will be related to the known boundary
values ${}^{j}\vec{y}^{(t)}(x_n)$ by
$$\vec{y}^{(0)}(x_n) = \mathbf{A}\,[\,{}^{j}\vec{y}^{(t)}(x_n)] , \tag{5.1.68}$$

so that for the complete set of trial solutions we may write
$$\vec{y}^{(0)}(x_n)\,\mathbf{1} = \mathbf{A}\,\mathbf{y}^{(t)}(x_n) , \tag{5.1.69}$$

where by analogy to equation (5.1.65), the column vectors of $\mathbf{y}^{(t)}(x_n)$ are ${}^{j}\vec{y}^{(t)}(x_n)$. We may solve these
equations for the unknown transformation matrix A so that the missing initial values are
$$\vec{y}^{(0)}(x_0)\,\mathbf{1} = \mathbf{A} = \mathbf{y}^{-1}\,\vec{y}^{(0)}(x_n) . \tag{5.1.70}$$

If one employs a one step method such as Runge-Kutta, it is possible to collapse this entire operation to the
point where one can represent the complete boundary conditions at one boundary in terms of the values at
the other boundary $\vec{y}_n$ by a system of linear algebraic equations such as
$$\vec{y}(x_0) = \mathbf{B}\,\vec{y}(x_n) . \tag{5.1.71}$$

The matrix B will depend only on the details of the integration scheme and the functional form of the
equations themselves, not on the boundary values. Therefore it may be calculated for any set of boundary
values and used repeatedly for problems differing only in the values at the boundary (see Day and Collins⁵).


To demonstrate methods of solution for systems of differential equations or boundary value
problems, we shall need more than the first order equation (5.1.10) that we used for earlier examples.
However, that equation was quite illustrative as it had a rapidly increasing solution that emphasized the
shortcomings of the various numerical methods. Thus we shall keep the solution, but change the equation.
Simply differentiate equation (5.1.10) so that
$$y'' = 2(1 + 2x^2)\,e^{x^2} = 2(1 + 2x^2)\,y . \tag{5.1.72}$$

Let us keep the same initial condition given by equation (5.1.11) and add a condition of the derivative at
x = 1 so that
$$\left.\begin{aligned} y(0) &= 1 \\ y'(1) &= 2e = 5.43656 \end{aligned}\right\} . \tag{5.1.73}$$
This ensures that the closed form solution is the same as equation (5.1.12) so that we will be able to compare
the results of solving this problem with earlier methods. We should not expect the solution to be as accurate
for we have made the problem more difficult by increasing the order of the differential equation in addition
to separating the location of the constants of integration. This is no longer an initial value problem since the
solution value is given at x = 0, while the other constraint on the derivative is specified at x = 1. This is
typical of the classical two-point boundary value problem.
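
Because the sample problem is linear, the superposition idea described earlier in this section can be applied to it directly. The sketch below is one possible arrangement rather than the procedure followed in the text: it assumes a fourth-order Runge-Kutta integrator with a small step, writes the equation as the system y1' = y2, y2' = 2(1 + 2x²)y1, and uses hypothetical function names. One integration starts from the known value y(0) = 1 with a trial slope of zero, a second starts from a unit value of the missing slope, and the boundary condition y'(1) = 2e fixes how much of the second solution to add.

```python
import math


def rk4_system_step(g, x, y, h):
    """One fourth-order Runge-Kutta step for y' = g(x, y), with y a list of floats."""
    def shifted(base, t, factor):
        return [b + factor * ti for b, ti in zip(base, t)]

    t0 = [h * gi for gi in g(x, y)]
    t1 = [h * gi for gi in g(x + 0.5 * h, shifted(y, t0, 0.5))]
    t2 = [h * gi for gi in g(x + 0.5 * h, shifted(y, t1, 0.5))]
    t3 = [h * gi for gi in g(x + h, shifted(y, t2, 1.0))]
    return [yi + (a + 2 * b + 2 * c + d) / 6
            for yi, a, b, c, d in zip(y, t0, t1, t2, t3)]


def shoot(initial, n_steps=200):
    """Integrate y1' = y2, y2' = 2(1 + 2x^2) y1 from x = 0 to x = 1."""
    def g(x, y):
        return [y[1], 2.0 * (1.0 + 2.0 * x * x) * y[0]]

    h = 1.0 / n_steps
    y = list(initial)
    for i in range(n_steps):
        y = rk4_system_step(g, i * h, y, h)
    return y


def solve_sample_bvp(n_steps=200):
    """Return the missing initial slope y'(0) and the resulting value y(1)."""
    target = 2.0 * math.e                  # boundary condition y'(1) = 2e
    base = shoot([1.0, 0.0], n_steps)      # known y(0) = 1, trial y'(0) = 0
    unit = shoot([0.0, 1.0], n_steps)      # response to a unit missing slope
    slope = (target - base[1]) / unit[1]   # linear combination meeting y'(1)
    y_at_1 = base[0] + slope * unit[0]
    return slope, y_at_1


if __name__ == "__main__":
    slope, y_at_1 = solve_sample_bvp()
    print("missing y'(0):", slope, "(analytic value 0)")
    print("y(1)         :", y_at_1, "(analytic value e)")
```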

We may also use this example to indicate the method for solving higher order differential equations
given at the start of this chapter by equations (5.1.1) and (5.1.2). With those equations in mind, let us replace
equation (5.1.72) by the system of first order equations
$$\left.\begin{aligned} y'_1(x) &= y_2(x) \\ y'_2(x) &= 2(1 + 2x^2)\,y_1(x) \end{aligned}\right\} , \tag{5.1.74}$$

which we can write in vector form as
$$\vec{y}\,' = \mathbf{A}(x)\,\vec{y} , \tag{5.1.75}$$
where
$$\mathbf{A}(x) = \begin{pmatrix} 0 & 1 \\ 2(1 + 2x^2) & 0 \end{pmatrix} . \tag{5.1.76}$$

The components of the solution vector $\vec{y}$ are just the solution we seek (i.e. y) and its derivative. However, the
form of equation (5.1.75) emphasizes its linear form and were it a scalar equation, we should know how to
proceed.

For purposes of illustration, let us apply the fourth order Runge-Kutta scheme given by equation
(5.1.63). Here we can take specific advantage of the linear nature of our problem and the fact that the
dependence on the independent variable factors out of the right hand side. To illustrate the utility of this fact,
let
g( x , y) = [f ( x )]y , (5.1.77)

in equation (5.1.63).


Then we can write the fourth order Runge-Kutta parameters as

$$\left.\begin{aligned} t_0 &= h f_0 y_n \\ t_1 &= h f_1 (y_n + \tfrac{1}{2} t_0) = h f_1 (y_n + \tfrac{1}{2} h f_0 y_n) = (h f_1 + \tfrac{1}{2} h^2 f_1 f_0)\, y_n \\ t_2 &= h f_1 (y_n + \tfrac{1}{2} t_1) = (h f_1 + \tfrac{1}{2} h^2 f_1^2 + \tfrac{1}{4} h^3 f_1^2 f_0)\, y_n \\ t_3 &= h f_2 (y_n + t_2) = (h f_2 + h^2 f_2 f_1 + \tfrac{1}{2} h^3 f_2 f_1^2 + \tfrac{1}{4} h^4 f_2 f_1^2 f_0)\, y_n \end{aligned}\right\} , \tag{5.1.78}$$

where
$$\left.\begin{aligned} f_0 &= f(x_n) \\ f_1 &= f(x_n + \tfrac{1}{2} h) \\ f_2 &= f(x_n + h) \end{aligned}\right\} , \tag{5.1.79}$$

so that the formula becomes
$$\begin{aligned} y_{n+1} &= y_n + (t_0 + 2t_1 + 2t_2 + t_3)/6 \\ &= \left[ 1 + \frac{h}{6}(f_0 + 4f_1 + f_2) + \frac{h^2}{6}(f_1 f_0 + f_1^2 + f_2 f_1) + \frac{h^3}{12}(f_1^2 f_0 + f_2 f_1^2) + \frac{h^4}{24} f_2 f_1^2 f_0 \right] y_n . \end{aligned} \tag{5.1.80}$$
Here we see that the linearity of the differential equation allows the solution at step n to be factored out of
the formula so that the solution at step n appears explicitly in the formula. Indeed, equation (5.1.80)
represents a power series in h for the solution at step (n+1) in terms of the solution at step n. Since we have
been careful about the order in which the functions fi multiplied each other, we may apply equation (5.1.80)
directly to equation (5.1.75) and obtain a similar formula for systems of linear first order differential
equations that has the form
$$\vec{y}_{n+1} = \left[ \mathbf{1} + \frac{h}{6}(A_0 + 4A_1 + A_2) + \frac{h^2}{6}(A_1 A_0 + A_1^2 + A_2 A_1) + \frac{h^3}{12}(A_1^2 A_0 + A_2 A_1^2) + \frac{h^4}{24} A_2 A_1^2 A_0 \right] \vec{y}_n . \tag{5.1.81}$$
Here the meaning of Ai is the same as fi in that the subscript indicates the value of the independent variable x
for which the matrix is to be evaluated. If we take h = 1, the matrices for our specific problem become
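$$\left.\begin{aligned} A_0 &= \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix} \\ A_1 &= \begin{pmatrix} 0 & 1 \\ 3 & 0 \end{pmatrix} \\ A_2 &= \begin{pmatrix} 0 & 1 \\ 6 & 0 \end{pmatrix} \end{aligned}\right\} . \tag{5.1.82}$$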
Keeping in mind that the order of matrix multiplication is important, the products appearing in the second
order term are


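$$\left.\begin{aligned} A_1 A_0 &= \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix} \\ A_1^2 &= \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix} \\ A_2 A_1 &= \begin{pmatrix} 3 & 0 \\ 0 & 6 \end{pmatrix} \end{aligned}\right\} . \tag{5.1.83}$$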

The two products appearing in the third order term can be easily generated from equations (5.1.82) and
(5.1.83) and are
$$\left.\begin{aligned} A_1^2 A_0 &= \begin{pmatrix} 0 & 3 \\ 9 & 0 \end{pmatrix} \\ A_2 A_1^2 &= \begin{pmatrix} 0 & 3 \\ 18 & 0 \end{pmatrix} \end{aligned}\right\} . \tag{5.1.84}$$

Finally the single matrix of the fourth order term can be obtained by successive multiplication using
equations (5.1.82) and (5.1.84) yielding
$$A_2 A_1^2 A_0 = \begin{pmatrix} 9 & 0 \\ 0 & 18 \end{pmatrix} . \tag{5.1.85}$$
Like equation (5.1.80), we can regard equation (5.1.81) as a series solution in h that yields a system of linear
equations for the solution at step n+1 in terms of the solution at step n. It is worth noting that the coefficients
of the various terms of order hk are similar to those developed for equal interval quadrature formulae in
chapter 4. For example the lead term being the unit matrix generates the coefficients of the trapezoid rule
while the h(+1, +4, +1)/6 coefficients of the second term are the familiar progression characteristic of
Simpson's rule. The higher order terms in the formula are less recognizable since they depend on the
parameters chosen in the under-determined Runge-Kutta formula.
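
The series form of the Runge-Kutta step for a linear system can be checked numerically. The sketch below builds the single matrix that carries the solution across one step and compares its action with an ordinary fourth-order Runge-Kutta step. It is an illustration with hypothetical helper names, written in plain Python, and it assumes the order of the matrix products that follows from the scalar derivation of equations (5.1.78) through (5.1.80), namely A1A0 + A1² + A2A1 in the second order term. The matrices are evaluated from A(x) for the sample problem, so the numbers it prints are computed directly and need not coincide with the hand-worked values.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_sum(*terms):
    """Sum of coefficient * matrix over a sequence of (coefficient, matrix) pairs."""
    n = len(terms[0][1])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(n)]
            for i in range(n)]


def mat_vec(m, v):
    """Matrix times vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def rk4_propagator(A, x, h):
    """Matrix P with y(x + h) = P y(x) for y' = A(x) y, in series form."""
    A0, A1, A2 = A(x), A(x + 0.5 * h), A(x + h)
    n = len(A0)
    one = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    A1sq = mat_mul(A1, A1)
    return mat_sum(
        (1.0, one),
        (h / 6, A0), (4 * h / 6, A1), (h / 6, A2),
        (h ** 2 / 6, mat_mul(A1, A0)), (h ** 2 / 6, A1sq),
        (h ** 2 / 6, mat_mul(A2, A1)),
        (h ** 3 / 12, mat_mul(A1sq, A0)), (h ** 3 / 12, mat_mul(A2, A1sq)),
        (h ** 4 / 24, mat_mul(A2, mat_mul(A1sq, A0))),
    )


def rk4_linear_step(A, x, y, h):
    """Ordinary RK4 step for y' = A(x) y, used here only as a cross-check."""
    def add(u, v, c):
        return [ui + c * vi for ui, vi in zip(u, v)]

    t0 = [h * gi for gi in mat_vec(A(x), y)]
    t1 = [h * gi for gi in mat_vec(A(x + 0.5 * h), add(y, t0, 0.5))]
    t2 = [h * gi for gi in mat_vec(A(x + 0.5 * h), add(y, t1, 0.5))]
    t3 = [h * gi for gi in mat_vec(A(x + h), add(y, t2, 1.0))]
    return [yi + (a + 2 * b + 2 * c + d) / 6
            for yi, a, b, c, d in zip(y, t0, t1, t2, t3)]


def sample_A(x):
    """Coefficient matrix of the sample problem, y1' = y2, y2' = 2(1 + 2x^2) y1."""
    return [[0.0, 1.0], [2.0 * (1.0 + 2.0 * x * x), 0.0]]


if __name__ == "__main__":
    P = rk4_propagator(sample_A, 0.0, 0.5)
    y0 = [1.0, 0.0]
    print("series form :", mat_vec(P, y0))
    print("direct RK4  :", rk4_linear_step(sample_A, 0.0, y0, 0.5))
```

Since the propagator depends only on the step and the coefficient matrix, it can be computed once and reused for any set of starting values.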

If we define a matrix $P(h^k)$ so that
$$\vec{y}_{n+1} = P(h^k)\,\vec{y}_n \equiv {}^{k}P\,\vec{y}_n , \tag{5.1.86}$$

the series nature of equation (5.1.81) can be explicitly represented in terms of the various values of ${}^{k}P$.


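For our problem they are:

$$\left.\begin{aligned} {}^{0}P &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \\ {}^{1}P &= \begin{pmatrix} 1 & 1 \\ \tfrac{11}{6} & 1 \end{pmatrix} \\ {}^{2}P &= \begin{pmatrix} \tfrac{7}{3} & 1 \\ \tfrac{11}{6} & 3 \end{pmatrix} \\ {}^{3}P &= \begin{pmatrix} \tfrac{7}{3} & \tfrac{3}{2} \\ \tfrac{49}{12} & 3 \end{pmatrix} \\ {}^{4}P &= \begin{pmatrix} \tfrac{65}{24} & \tfrac{3}{2} \\ \tfrac{49}{12} & \tfrac{15}{4} \end{pmatrix} \end{aligned}\right\} . \tag{5.1.87}$$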
The boundary value problem now is reduced to solving the linear system of equations specified by equation
(5.1.86) where the known values at the respective boundaries are specified. Using the values given in
equation (5.1.73) the linear equations for the missing boundary values become
$$\left.\begin{aligned} 5.43656 &= {}^{k}P_{21}\,(1) + {}^{k}P_{22}\,y_2(0) \\ y_1(1) &= {}^{k}P_{11}\,(1) + {}^{k}P_{12}\,y_2(0) \end{aligned}\right\} . \tag{5.1.88}$$
The first of these yields the missing solution value at x = 0 [i.e. y2(0)]. With that value the remaining value
can be obtained from the second equation. The results of these solutions including additional terms of order
hk are given in table 5.3. We have taken h to be unity, which is unreasonably large, but it serves to
demonstrate the relative accuracy of including higher order terms and simplifies the arithmetic. The results
for the missing values y2(0) and y1(1) (i.e. the center two rows) converge slowly, and not uniformly, toward
their analytic values given in the column labeled k = ∞.

Had we chosen the step size h to be smaller so that a number of steps were required to cross the
interval, then each step would have produced a matrix ${}^{k}_{i}P$ and the solution at each step would have been
related to the solution at the following step by equation (5.1.86). Repeated application of that equation would
yield the solution at one boundary in terms of the solution at the other so that

$$\vec{y}_n = \left( {}^{k}_{n-1}P\;{}^{k}_{n-2}P\;{}^{k}_{n-3}P \cdots {}^{k}_{0}P \right) \vec{y}_0 = {}^{k}Q\,\vec{y}_0 . \tag{5.1.89}$$

Table 5.3
Solutions of a Sample Boundary Value Problem
for Various Orders of Approximation

| k | 0 | 1 | 2 | 3 | 4 | ∞ |
|---|---|---|---|---|---|---|
| y1(0) | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| y2(0) | 5.437 | 3.60 | 1.200 | 0.4510 | 0.3609 | 0.0 |
| y1(1) | 1.0 | 4.60 | 3.53 | 3.01 | 3.25 | 2.71828 |
| y2(1) | 5.437 | 5.437 | 5.437 | 5.437 | 5.437 | 2e |

Thus one arrives at a similar set of linear equations to those implied by equation (5.1.86) and explicitly given
in equation (5.1.88) relating the solution at one boundary in terms of the solution at the other boundary.
These can be solved for the missing boundary values in the same manner as our example. Clearly the
decrease in the step size will improve the accuracy as dramatically as increasing the order k of the
approximation formula. Indeed the step size can be variable at each step allowing for the use of the error
correcting procedures described in section 5.1b.

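Table 5.4
Solutions of a Sample Boundary Value Problem

| k | 0 | 1 | 2 | 3 | 4 | ∞ |
|---|---|---|---|---|---|---|
| y1(0) | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| y2(0) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| y1(1) | 1.0 | 1.0 | 2.33 | 2.33 | 2.708 | 2.718 |
| y2(1) | 0.0 | 1.83 | 1.83 | 4.08 | 4.08 | 5.437 |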

Any set of boundary values could have been used with equations (5.1.81) to yield the solution
elsewhere. Thus, we could treat our sample problem as an initial value problem for comparison. If we take
the analytic values for y1(0) and y2(0) and solve the resulting linear equations, we get the results given in
Table 5.4. Here the final solution is more accurate and exhibits a convergence sequence more like we would
expect from Runge-Kutta. Namely, the solution systematically lies below the rapidly increasing analytic
solution. For the boundary value problem, the reverse was true and the final result less accurate. This is not
an uncommon result for two-point boundary value problems since the error of the approximation scheme is
directly reflected in the determination of the missing boundary values. In an initial value problem, there is
assumed to be no error in the initial values.


This simple example is not meant to provide a definitive discussion of even the restricted subset of
linear two-point boundary value problems, but simply to indicate a way to proceed with their solution.
Anyone wishing to pursue the subject of two-point boundary value problems further should begin with the
venerable text by Fox⁶.

## e. Partial Differential Equations

The subject of partial differential equations has a literature at least as large as that for
ordinary differential equations. It is beyond the scope of this book to provide a discussion of partial
differential equations even at the level chosen for ordinary differential equations. Indeed, many introductory
books on numerical analysis do not treat them at all. Thus we will only sketch a general approach to
problems involving such equations.

Partial differential equations form the basis for so many problems in science that it is difficult to limit the
choice of examples. Most of the fundamental laws of physical science are written in terms of partial differential
equations. Thus one finds them present in computer modeling from the hydrodynamic calculations needed
for airplane design, weather forecasting, and the flow of fluids in the human body to the dynamical
interactions of the elements that make up a model economy.

A partial derivative simply refers to the rate of change of a function of many variables, with respect
to just one of those variables. In terms of the familiar limiting process for defining differentials we would
write
$$
\frac{\partial F(x_1, x_2, \cdots, x_n)}{\partial x_j} = \lim_{\Delta x_j \to 0} \left[ \frac{F(x_1, x_2, \cdots, x_j + \Delta x_j, \cdots, x_n) - F(x_1, x_2, \cdots, x_j, \cdots, x_n)}{\Delta x_j} \right] . \tag{5.1.90}
$$

Partial differential equations usually relate derivatives of some function with respect to one variable to
derivatives of the same function with respect to another. The notions of order and degree are the same as with
ordinary differential equations.

Although a partial differential equation may be expressed in multiple dimensions, the smallest
number for illustration is two, one of which may be time. Many of these equations, which describe so many
aspects of the physical world, have the form
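$$
a(x,y)\frac{\partial^2 z(x,y)}{\partial x^2} + 2b(x,y)\frac{\partial^2 z(x,y)}{\partial x\,\partial y} + c(x,y)\frac{\partial^2 z(x,y)}{\partial y^2} = F\left[x, y, z, \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}\right] , \tag{5.1.91}
$$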
and as such can be classified into three distinct groups by the discriminant so that
$$
\left. \begin{array}{ll}
[b^2(x,y) - a(x,y)c(x,y)] < 0 & \text{Elliptic} \\
[b^2(x,y) - a(x,y)c(x,y)] = 0 & \text{Parabolic} \\
[b^2(x,y) - a(x,y)c(x,y)] > 0 & \text{Hyperbolic}
\end{array} \right\} . \tag{5.1.92}
$$


Should the equation of interest fall into one of these three categories, one should search for solution
algorithms designed to be effective for that class. Some methods that will be effective at solving equations of
one class will fail miserably for another.
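The classification of equation (5.1.92) is easily mechanized. The following is a minimal sketch in Python, assuming the coefficients a, b, and c have already been evaluated at the point (x, y) of interest; the function name `classify_pde` and the tolerance argument are illustrative choices and do not come from the text.

```python
def classify_pde(a: float, b: float, c: float, tol: float = 0.0) -> str:
    """Classify a*z_xx + 2*b*z_xy + c*z_yy = F at a single point (x, y).

    Uses the sign of the discriminant b**2 - a*c, as in equation (5.1.92).
    `tol` is a hypothetical guard band for coefficients known only approximately.
    """
    disc = b * b - a * c
    if disc < -tol:
        return "elliptic"
    if disc > tol:
        return "hyperbolic"
    return "parabolic"


# Laplace's equation  z_xx + z_yy = 0   ->  a = 1, b = 0, c = 1
laplace = classify_pde(1.0, 0.0, 1.0)
# Diffusion equation  z_xx = z_y        ->  a = 1, b = 0, c = 0
diffusion = classify_pde(1.0, 0.0, 0.0)
# Wave equation       z_xx - z_yy = 0   ->  a = 1, b = 0, c = -1
wave = classify_pde(1.0, 0.0, -1.0)
print(laplace, diffusion, wave)
```

Since a, b, and c are functions of x and y, the class of an equation may change from one region of the grid to another, so the test should be applied wherever the coefficients vary appreciably.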

While there are many different techniques for dealing with partial differential equations, the most
widespread method is to replace the differential operator by a finite difference operator, thereby turning the
differential equation into a finite difference equation in at least two variables. Just as a numerical integration
scheme finds the solution to a differential equation at discrete points xi along the real line, so a two
dimensional integration scheme will specify the solution at a set of discrete points xi, yj. These points can be
viewed as the intersections on a grid. Thus the solution in the x-y space is represented by the solution on a
finite grid. The location of the grid points will be specified by the finite difference operators for the two
coordinates. Unlike problems involving ordinary differential equations, the initial values for partial
differential equations are not simply constants. Specifying the partial derivative of a function at some
particular value of one of the independent variables still allows it to be a function of the remaining
independent variables of the problem. Thus the functional behavior of the solution is often specified at some
boundary and the solution proceeds from there. Usually the finite difference scheme will take advantage of
any symmetry that may result from the choice of the boundary. For example, as was pointed out in section 1.3
there are thirteen orthogonal coordinate systems in which Laplace's equation is separable. Should the
boundaries of a problem match one of those coordinate systems, then the finite difference scheme would be
totally separable in the independent variables greatly simplifying the numerical solution. In general, one
picks a coordinate system that will match the local boundaries and that will determine the geometry of the
grid. The solution can then proceed from the initial values at a particular boundary and move across the grid
until the entire space has been covered. Of course the solution should be independent of the path taken in
filling the grid and that can be used to estimate the accuracy of the finite difference scheme that is being
used. The details of setting up various types of schemes are beyond the scope of this book and could serve as
the subject of a book by themselves. For a further introduction to the solution of partial differential equations
the reader is referred to Sokolnikoff and Redheffer7 and for the numerical implementation of some methods
the student should consult Press et al.8. Let us now turn to the numerical solution of integral equations.
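Before leaving partial differential equations, the grid-based approach described above can be illustrated by a small sketch. It is not a scheme taken from this book: it assumes Laplace's equation on the unit square, a uniform grid with the same spacing in x and y, the standard five-point difference operator, and simple Jacobi iteration. The function name `solve_laplace` and its parameters are hypothetical.

```python
import numpy as np


def solve_laplace(boundary: np.ndarray, tol: float = 1e-10, max_iter: int = 20000) -> np.ndarray:
    """Jacobi iteration for the five-point Laplace difference equation.

    `boundary` carries the fixed boundary values on its outer rows and
    columns; its interior entries serve only as the initial guess.
    """
    u = boundary.astype(float).copy()
    for _ in range(max_iter):
        new = u.copy()
        # Each interior point becomes the average of its four neighbours.
        new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
        if np.max(np.abs(new - u)) < tol:
            return new
        u = new
    return u


# Test problem (an assumption for illustration): z = x**2 - y**2 satisfies
# Laplace's equation, so its boundary values should reproduce it on the grid.
n = 17
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")
exact = X**2 - Y**2

grid = np.zeros_like(exact)
grid[0, :], grid[-1, :] = exact[0, :], exact[-1, :]
grid[:, 0], grid[:, -1] = exact[:, 0], exact[:, -1]

u = solve_laplace(grid)
max_err = np.max(np.abs(u - exact))
print("largest grid error:", max_err)
```

The five-point operator happens to be exact for this quadratic test function, so the remaining error reflects only the convergence tolerance of the iteration. For a general boundary the error would also depend on the grid spacing.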

## 5.2 The Numerical Solution of Integral Equations

For reasons that I have never fully understood, the mathematical literature is crowded with books,
articles, and papers on the subject of differential equations. Most universities have several courses of study in
the subject, but little attention is paid to the subject of integral equations. The differential operator is linear
and so is the integral operator. Indeed, one is just the inverse of the other. Equations can be written where the
dependent variable appears under an integral as well as alone. Such equations are the analogue of the
differential equations and are called integral equations. It is often possible to turn a differential equation into
an integral equation which may make the problem easier to numerically solve. Indeed many physical
phenomena lend themselves to description by integral equations. So one would think that they might form as
large an area for analysis as do the differential equations. Such is not the case. Indeed, we will not be able to
devote as much time to the discussion of these interesting equations as we should, but we shall spend enough
time so that the student is at least familiar with some of their basic properties. Of necessity, we will restrict
our discussion to those integral equations where the unknown appears linearly. Such linear equations are
more tractable and yet describe much that is of interest in science.


## a. Types of Linear Integral Equations

We will follow the standard classification scheme for integral equations which, while not
exhaustive, does include most of the common types. There are basically two main classes known as
Fredholm and Volterra after the mathematicians who first studied them in detail. Fredholm equations involve
definite integrals, while Volterra equations have the independent variable as one of the limits. Each of these
categories can be further subdivided as to whether or not the dependent variable appears outside the integral
sign as well as under it. Thus the two types of Fredholm equations for the unknown φ are
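$$
\left. \begin{array}{ll}
F(x) = \displaystyle\int_a^b K(x,t)\,\phi(t)\,dt & \text{Fredholm Type I} \\[2ex]
\phi(x) = F(x) + \lambda \displaystyle\int_a^b K(x,t)\,\phi(t)\,dt & \text{Fredholm Type II}
\end{array} \right\} , \tag{5.2.1}
$$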
while the corresponding two types of Volterra equations for φ take the form
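$$
\left. \begin{array}{ll}
F(x) = \displaystyle\int_a^x K(x,t)\,\phi(t)\,dt & \text{Volterra Type I} \\[2ex]
\phi(x) = F(x) + \lambda \displaystyle\int_a^x K(x,t)\,\phi(t)\,dt & \text{Volterra Type II}
\end{array} \right\} . \tag{5.2.2}
$$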
The parameter K(x,t) appearing in the integrand is known as the kernel of the integral equation. Its form is
crucial in determining the nature of the solution. Certainly one can have homogeneous or inhomogeneous
integral equations depending on whether or not F(x) is zero. Of the two classes, the Fredholm are generally
easier to solve.

## b. The Numerical Solution of Fredholm Equations

Integral equations are often easier to solve than a corresponding differential equation. One
of the reasons is that the truncation errors of the solution tend to be averaged out by the process of
quadrature while they tend to accumulate during the process of numerical integration employed in the
solution of differential equations. The most straightforward approach is to simply replace the integral with a
quadrature sum. In the case of Fredholm equations of type one, this results in a functional equation for the
unknown φ(x) at a discrete set of points tj used by the quadrature scheme. Specifically
$$
F(x) = \sum_{j=0}^{n} K(x,t_j)\,\phi(t_j)\,W_j + R_n(x) . \tag{5.2.3}
$$

Since equation (5.2.3) must hold for all values of x, it must hold for values of x equal to those chosen for
the quadrature points so that
$$
x_j = t_j , \quad j = 0, 1, 2, \cdots, n . \tag{5.2.4}
$$
By picking those particular points we can generate a linear system of equations from the functional equation
(5.2.3) and, neglecting the quadrature error term, they are
$$
F(x_i) = \sum_{j=0}^{n} K(x_i,t_j)\,\phi(t_j)\,W_j = \sum_{j=0}^{n} A_{ij}\,\phi(x_j) , \quad i = 0, 1, 2, \cdots, n , \tag{5.2.5}
$$

which can be solved by any of the methods discussed in Chapter 2 yielding

$$
\phi(x_j) = \sum_{k=0}^{n} A^{-1}_{jk}\,F(x_k) , \quad j = 0, 1, 2, \cdots, n . \tag{5.2.6}
$$

The solution will be obtained at the quadrature points xj so that one might wish to be careful in the selection
of a quadrature scheme and pick one that contained the points of interest. However, one can use the solution
set φ(xj) to interpolate for missing points and maintain the same degree of precision that generated the
solution set. For Fredholm equations of type 2, one can perform the same trick of replacing the integral with
a quadrature scheme. Thus
$$
\phi(x) = F(x) + \lambda \sum_{j=0}^{n} K(x,t_j)\,\phi(t_j)\,W_j + R_n(x) . \tag{5.2.7}
$$

Here we must be a little careful as the unknown φ(x) appears outside the integral. Thus equation (5.2.7) is a
functional equation for φ(x) itself. However, by evaluating this functional equation as we did for Fredholm
equations of type 1 we get
$$
\phi(x_i) = F(x_i) + \lambda \sum_{j=0}^{n} K(x_i,t_j)\,\phi(t_j)\,W_j , \tag{5.2.8}
$$

which, after a little algebra, can be put in the standard form for linear equations
$$
F(x_i) = \sum_{j=0}^{n} \left[\delta_{ij} - \lambda K(x_i,t_j)\,W_j\right]\phi(t_j) = \sum_{j=0}^{n} B_{ij}\,\phi(x_j) , \quad i = 0, 1, 2, \cdots, n , \tag{5.2.9}
$$

that have a solution

$$
\phi(x_j) = \sum_{k=0}^{n} B^{-1}_{jk}\,F(x_k) , \quad j = 0, 1, 2, \cdots, n . \tag{5.2.10}
$$

Here the solution set φ(xj) can be substituted into equation (5.2.7) to directly obtain an interpolation formula
for φ(x) which will have the same degree of precision as the quadrature scheme and is valid for all values of
x. Such equations can be solved efficiently by using the appropriate Gaussian quadrature scheme that is
required by the limits. In addition, the form of the kernel K(x,t) may influence the choice of the quadrature
scheme and it is useful to include as much of the behavior of the kernel in the quadrature weight functions as
possible. We could also choose to break the interval a → b in several pieces depending on the nature of the
kernel and what can be guessed about the solution itself. The subsequent quadrature schemes for the sub-
intervals will not then depend on the continuity of polynomials from one sub-interval to another and may
allow for more accurate approximation in the sub-interval.
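The procedure of equations (5.2.7) through (5.2.10) can be sketched in a few lines of Python. The sketch below assumes Gauss-Legendre quadrature mapped onto the interval a → b and a kernel supplied as a vectorized function; the name `fredholm2_nystrom` and its argument list are illustrative and are not taken from the text. The returned function `phi` is the interpolation formula obtained by substituting the solution set back into equation (5.2.7) with the remainder term neglected.

```python
import numpy as np


def fredholm2_nystrom(kernel, F, a, b, lam, n):
    """Solve phi(x) = F(x) + lam * int_a^b K(x,t) phi(t) dt by quadrature.

    `kernel(x, t)` and `F(x)` must accept NumPy arrays (broadcasting).
    Returns the quadrature points, phi at those points, and an
    interpolating function for phi at arbitrary x.
    """
    # Gauss-Legendre points and weights on [-1, 1], mapped to [a, b].
    s, w = np.polynomial.legendre.leggauss(n)
    t = 0.5 * (b - a) * s + 0.5 * (b + a)
    W = 0.5 * (b - a) * w

    # B_ij = delta_ij - lam * K(x_i, t_j) * W_j, as in equation (5.2.9).
    B = np.eye(n) - lam * kernel(t[:, None], t[None, :]) * W[None, :]
    phi_nodes = np.linalg.solve(B, F(t))

    def phi(x):
        # Equation (5.2.7) without the remainder term.
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return F(x) + lam * (kernel(x[:, None], t[None, :]) * W[None, :]) @ phi_nodes

    return t, phi_nodes, phi


# Illustration with a separable kernel K(x,t) = x*t, F(x) = 1, lam = 1 on [0, 1].
t, phi_nodes, phi = fredholm2_nystrom(
    lambda x, t: x * t, lambda x: np.ones_like(x), 0.0, 1.0, 1.0, 3
)
print(phi([0.0, 0.5, 1.0]))
```

Solving the linear system directly, rather than forming the inverse matrix of equation (5.2.10), is a design choice made here for numerical economy; the two are algebraically equivalent.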

For a specific example of the solution to Fredholm equations, let us consider a simple equation of
the second type namely
$$
y(x) = 1 + x \int_0^1 t\,y\,dt . \tag{5.2.11}
$$

Comparing this to equation (5.2.7), we see that F(x) = 1, and that the kernel is separable which leads us
immediately to an analytic solution. Since the integral is a definite integral, it may be regarded as some
constant α and the solution will be linear of the form
$$
y(x) = 1 + \alpha x = 1 + x \int_0^1 t\,(1 + \alpha t)\,dt = 1 + x\left(\tfrac{1}{2} + \tfrac{\alpha}{3}\right) . \tag{5.2.12}
$$


This leads to a value for α of

$$
\alpha = 3/4 . \tag{5.2.13}
$$

However, had the equation required a numerical solution, then we would have proceeded by replacing the
integral by a quadrature sum and evaluating the resulting functional equation at the points of the quadrature.
Knowing that the solution is linear, let us choose the quadrature to be Simpson's rule which has a degree of
precision high enough to provide an exact answer. The linear equations for the solution become
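$$
\left. \begin{aligned}
y(0) &= 1 + (0)\left[(0)\,y(0) + 4(\tfrac{1}{2})\,y(\tfrac{1}{2}) + y(1)\right]/6 = 1 \\
y(\tfrac{1}{2}) &= 1 + (\tfrac{1}{2})\left[(0)\,y(0) + 4(\tfrac{1}{2})\,y(\tfrac{1}{2}) + y(1)\right]/6 = 1 + y(\tfrac{1}{2})/6 + y(1)/12 \\
y(1) &= 1 + (1)\left[(0)\,y(0) + 4(\tfrac{1}{2})\,y(\tfrac{1}{2}) + y(1)\right]/6 = 1 + y(\tfrac{1}{2})/3 + y(1)/6
\end{aligned} \right\} , \tag{5.2.14}
$$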
which have the immediate solution

$$
\left. \begin{aligned}
y(0) &= 1 \\
y(\tfrac{1}{2}) &= \tfrac{11}{8} \\
y(1) &= \tfrac{7}{4}
\end{aligned} \right\} . \tag{5.2.15}
$$
Clearly this solution is in exact agreement with the analytic form corresponding to α=3/4,
y(x) = 1 + 3x/4 . (5.2.16)
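The arithmetic of equations (5.2.14) and (5.2.15) can be checked with a short Python sketch. It assumes only what the example states: Simpson's rule on the interval 0 → 1, the kernel K(x,t) = xt, F(x) = 1, and λ = 1. The variable names are illustrative.

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])          # Simpson's rule points on [0, 1]
W = np.array([1.0, 4.0, 1.0]) / 6.0    # Simpson's rule weights for unit interval
K = np.outer(x, x)                     # K(x_i, t_j) = x_i * t_j

# B_ij = delta_ij - K(x_i, t_j) * W_j, the matrix of equation (5.2.9) with lam = 1.
B = np.eye(3) - K * W
y = np.linalg.solve(B, np.ones(3))

# Values expected from equation (5.2.15): y(0) = 1, y(1/2) = 11/8, y(1) = 7/4.
print(y)
```

Because the product of the kernel and the linear solution is only quadratic in t, Simpson's rule integrates it without error and the computed values agree with y(x) = 1 + 3x/4 to rounding.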

While there are variations on a theme for the solution of this type of equation, the basic approach
is nicely illustrated by this example. Now let us turn to the generally more formidable Volterra equations.

## c. The Numerical Solution of Volterra Equations

We may approach Volterra equations in much the same way as we did Fredholm equations,
but there is the problem that the upper limit of the integral is the independent variable of the equation. Thus
we must choose a quadrature scheme that utilizes the endpoints of the interval; otherwise we will not be able
to evaluate the functional equation at the relevant quadrature points. One could adopt the view that Volterra
equations are, in general, just special cases of Fredholm equations where the kernel is

$$
K(x,t) = 0 , \quad t > x , \tag{5.2.17}
$$

but this would usually require the kernel to be non-analytic. However, if we choose such a quadrature
formula then, for Volterra equations of type 1, we can write

$$
\left. \begin{aligned}
F(x_i) &= \sum_{j=0}^{i} K(x_i,x_j)\,\phi(x_j)\,W_j , \quad i = 0, 1, 2, \cdots, n \\
x_k &= a + kh
\end{aligned} \right\} . \tag{5.2.18}
$$

Not only must the quadrature scheme involve the endpoints, it must be an equal interval formula so that
successive evaluations of the functional equation involve the points where the function has been previously
evaluated. However, by doing that we obtain a system of n linear equations in (n+1) unknowns. The value of
φ(a) is not clearly specified by the equation and must be obtained from the functional behavior of F(x). One
constraint that supplies the missing value of φ(x) is
$$
\phi(a)\,K(a,a) = \left. \frac{dF(x)}{dx} \right|_{x=a} . \tag{5.2.19}
$$

The value of φ(a) reduces equations (5.2.18) to a triangular system that can be solved quickly by successive
substitution (see section 2.2). The same method can be used for Volterra equations of type 2 yielding
$$
\left. \begin{aligned}
F(x_i) &= \phi(x_i) - \lambda \sum_{j=0}^{i} K(x_i,x_j)\,\phi(x_j)\,W_j , \quad i = 0, 1, 2, \cdots, n \\
x_k &= a + kh
\end{aligned} \right\} . \tag{5.2.20}
$$

Here the difficulty with φ(a) is removed since in the limit as x → a
φ(a) = F(a) . (5.2.21)
Thus it would appear that type 2 equations are better behaved than type 1 equations. To the extent that
this is true, we may replace any type 1 equation with a type 2 equation of the form
$$
F'(x) = K(x,x)\,\phi(x) + \int_a^x \frac{\partial K(x,t)}{\partial x}\,\phi(t)\,dt . \tag{5.2.22}
$$
Unfortunately we must still obtain F'(x) which may have to be accomplished numerically.
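The triangular structure of the type 2 system lends itself to a simple marching scheme. The following Python sketch assumes the trapezoid rule as the equal interval quadrature formula for every upper limit x_i, which is one possible choice and not necessarily the mixture of formulae used in the worked example of this section. The function name `volterra2_trapezoid` is hypothetical, and the test equation φ(x) = 1 + ∫φ(t)dt, whose solution is e^x, is chosen here only because its answer is known.

```python
import numpy as np


def volterra2_trapezoid(kernel, F, a, b, lam, n):
    """March phi(x) = F(x) + lam * int_a^x K(x,t) phi(t) dt across [a, b].

    Uses n equal steps and trapezoid weights on each sub-range [a, x_i];
    each step is one successive substitution in the triangular system.
    """
    h = (b - a) / n
    x = a + h * np.arange(n + 1)
    phi = np.empty(n + 1)
    phi[0] = F(x[0])                       # equation (5.2.21)
    for i in range(1, n + 1):
        w = np.full(i + 1, h)
        w[0] = w[-1] = 0.5 * h             # trapezoid weights on [a, x_i]
        k = np.array([kernel(x[i], x[j]) for j in range(i + 1)])
        # Contribution of the points already solved for (j < i).
        known = np.dot(w[:i] * k[:i], phi[:i])
        # The j = i term contains the unknown phi_i; move it to the left side.
        phi[i] = (F(x[i]) + lam * known) / (1.0 - lam * w[i] * k[i])
    return x, phi


x, phi = volterra2_trapezoid(lambda x, t: 1.0, lambda x: 1.0, 0.0, 1.0, 1.0, 200)
err = abs(phi[-1] - np.e)
print("phi(1) =", phi[-1], " error =", err)
```

With 200 steps the error at x = 1 is governed by the second order truncation error of the trapezoid rule, which accumulates as the solution proceeds across the interval.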

Consider how these direct solution methods can be applied in practice. Let us choose equation
(5.1.10), which served so well as a test case for differential equations. In setting that equation up for Picard's
method, we turned it into a type 2 Volterra integral equation of the form
$$
y(x) = 1 + x \int_0^x t\,y\,dt . \tag{5.2.23}
$$

If we put this in the form suggested by equation (5.2.17) where the kernel vanishes for t > x, we could write
$$
1 = y(x) - x \int_0^x t\,y\,dt = y(x_i) - x_i \sum_{j=0}^{n} t_j\,y(t_j)\,W_j , \quad W_j = 0, \; j > i . \tag{5.2.24}
$$

Here we have ensured that the kernel vanishes for t>x by choosing the quadrature weights to be zero when
that condition is satisfied. The resulting linear equations for the solution become

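$$
\left. \begin{aligned}
1 &= y(0) - \left[(0)\,y(0) + 4(0)\,y(\tfrac{1}{2}) + (0)\,y(1)\right]/6 = y(0) , & i = 1 \\
1 &= y(\tfrac{1}{2}) - \left[(0)\,y(0) + 4(\tfrac{1}{2})\,y(\tfrac{1}{2}) + (0)\,y(1)\right]/6 = 2y(\tfrac{1}{2})/3 , & i = 2 \\
1 &= y(1) - \left[(0)\,y(0) + 4(\tfrac{1}{2})\,y(\tfrac{1}{2}) + y(1)\right]/6 = -y(\tfrac{1}{2})/3 + 5y(1)/6 , & i = 3
\end{aligned} \right\} . \tag{5.2.25}
$$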

The method of using equal interval quadrature formulae of varying degrees of precision as x increases is
expressed by equation (5.2.18), which for our example takes the form

$$
1 = y(x) - x \int_0^x t\,y\,dt = y(x_i) - \sum_{j=0}^{i} t_j\,y(t_j)\,W_j . \tag{5.2.26}
$$


This results in linear equations for the solution that are

$$
\left. \begin{aligned}
1 &= y(0) - (0) \\
1 &= y(\tfrac{1}{2}) - \left[(0)\,y(0) + (\tfrac{1}{2})\,y(\tfrac{1}{2})\right]/2 = 3y(\tfrac{1}{2})/4 \\
1 &= y(1) - \left[(0)\,y(0) + 4(\tfrac{1}{2})\,y(\tfrac{1}{2}) + y(1)\right]/6 = -y(\tfrac{1}{2})/3 + 5y(1)/6
\end{aligned} \right\} . \tag{5.2.27}
$$

The solutions to the two sets of linear equations (5.2.25) and (5.2.27) that represent these two different
approaches are given in Table 5.5.

Table 5.5
Sample Solutions for a Type 2 Volterra Equation

| | Fredholm Soln. | Triangular Soln. | Analytic Soln. |
|---|---|---|---|
| y(0) | 1.0 | 1.0 | 1.0 |
| % Error | 0.0% | 0.0% | --- |
| y(½) | 1.5 | 1.333 | 1.284 |
| % Error | 16.8% | 3.8% | --- |
| y(1) | 1.8 | 1.733 | 2.718 |
| % Error | -33.8% | -36.2% | --- |

As with the other examples, we have taken a large step size so as to emphasize the relative accuracy. With
the step size again being unity, we get a rather poor result for the rapidly increasing solution. While both
methods give answers that are slightly larger than the correct answer at x = ½, they rapidly fall behind the
exponentially increasing solution by x = 1. As was suggested, the triangular solution is overall slightly
better than the Fredholm solution with the discontinuous kernel.

When applying quadrature schemes directly to Volterra equations, we generate a solution with
variable accuracy. The quadrature scheme can initially have a degree of precision no greater than one. While
this improves as one crosses the interval the truncation error incurred in the first several points accumulates
in the solution. This was not a problem with Fredholm equations as the truncation error was spread across
the interval perhaps weighted to some degree by the behavior of the kernel. In addition, there is no
opportunity to use the highly efficient Gaussian schemes directly as the points of the quadrature must be
equally spaced. Thus we will consider an indirect application of quadrature schemes to the solution of both
types of integral equations.

By using a quadrature scheme, we are tacitly assuming that the integrand is well approximated by a
polynomial. Let us instead assume that the solution itself can be approximated by a polynomial of the form
$$
\phi(x) = \sum_{j=0}^{n} \alpha_j\,\xi_j(x) . \tag{5.2.28}
$$

Substitution of this polynomial into the integral of either Fredholm or Volterra equations yields
$$
\int K(x,t)\,\phi(t)\,dt = \sum_{j=0}^{n} \alpha_j \int K(x,t)\,\xi_j(t)\,dt + R = \sum_{j=0}^{n} \alpha_j\,H_j(x) + R . \tag{5.2.29}
$$


Now the entire integrand of the integral is known and may be evaluated to generate the functions Hj(x). It
should be noted that the function Hj(x) will depend on the limits for both classes of equations, but its
evaluation poses a separate problem from the solution of the integral equation. In some cases it may be
evaluated analytically and in others it will have to be computed numerically for any chosen value of x.
However, once that is done, type one equations of both classes can be written as
$$
F(x_i) = \sum_{j=0}^{n} \alpha_j\,H_j(x_i) , \quad i = 0, 1, 2, \cdots, n , \tag{5.2.30}
$$

which constitute a linear system of (n+1) algebraic equations in the (n+1) unknowns αj. These, and equation
(5.2.28) supply the desired solution φ(x). Solution for the type 2 equations is only slightly more complicated
as equation (5.2.28) must be directly inserted into the integral equation and evaluated at x=xi. However, the
resulting linear equations can still be put into standard form so that the αjs can be solved for to generate the
solution φ(x).
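As an illustration of the type 2 case, consider again the Fredholm equation (5.2.11). The Python sketch below assumes the simple basis ξ_j(x) = x^j with n = 1 and evaluates the functions H_j(x) analytically, which is possible here because the kernel is K(x,t) = xt. The helper names `xi` and `H` and the choice of collocation points are illustrative only.

```python
import numpy as np

lam = 1.0
degree = 1                                 # basis xi_j(x) = x**j, j = 0 .. degree
xs = np.linspace(0.0, 1.0, degree + 1)     # points x_i at which the equation is imposed


def xi(j, x):
    """Basis function xi_j(x) = x**j of equation (5.2.28)."""
    return x**j


def H(j, x):
    """H_j(x) = int_0^1 (x*t) * t**j dt = x / (j + 2), evaluated analytically."""
    return x / (j + 2)


# Inserting (5.2.28) into the type 2 equation and evaluating at x_i gives
#   sum_j alpha_j * [xi_j(x_i) - lam * H_j(x_i)] = F(x_i),  with F(x) = 1 here.
A = np.array([[xi(j, x) - lam * H(j, x) for j in range(degree + 1)] for x in xs])
alpha = np.linalg.solve(A, np.ones(degree + 1))
print(alpha)
```

The coefficients should reproduce the analytic solution y(x) = 1 + 3x/4 of equation (5.2.16), since that solution lies within the span of the chosen basis. For larger n the powers of x would be a poor choice for the reasons given in the next paragraph of the text.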

We have said nothing about the functions ξj(x) that appear in the approximation equation (5.2.28).
For nominal polynomial approximation these might be $x^j$, but for large n such a choice tends to develop
instabilities. Thus the same sort of care that was used in developing interpolation formulae should be
employed here. One might even wish to employ a rational function approach to approximating φ(x) as was
done in section 3.2. Such care is justified as we have introduced an additional source of truncation error with
this approach. Not only will there be truncation error resulting from the quadrature approximation for the
entire integral, but there will be truncation error from the approximation of the solution itself [i.e. equation
(5.2.28)]. While each of these truncation errors is separately subject to control, their combined effect is less
predictable.

Finally, we should consider the feasibility of iterative approaches in conjunction with quadrature
schemes for finding solutions to these equations. The type 2 equations immediately suggest an iterative
function of the form
$$
\phi^{(k)}(x) = F(x) + \lambda \int_a^b K(x,t)\,\phi^{(k-1)}(t)\,dt . \tag{5.2.31}
$$

Remembering that it is φ(x) that we are after, we can use equation (2.3.20) and the linearity of the
integral equations with respect to φ(x) to establish that the iterative function will converge to a fixed point as
long as
$$
\lambda \int_a^b K(x,t)\,dt < 1 . \tag{5.2.32}
$$

Equation (5.2.17) shows us that a Volterra equation is more likely to converge by iteration than a Fredholm
equation with a similar kernel. If λ is small, then not only is the iterative sequence likely to converge, but an
initial guess of
$$
\phi^{(0)}(x) = F(x) \tag{5.2.33}
$$

suggests itself. In all cases the integration required for the iteration can be accomplished by any desirable
quadrature scheme as the preliminary value for the solution φ(k-1)(x) is known.
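A minimal Python sketch of the iteration (5.2.31) follows. It assumes the Fredholm example of equation (5.2.11), for which λ∫K(x,t)dt = x/2 is less than unity on the interval, and a four-point Gauss-Legendre quadrature; the convergence tolerance and iteration limit are arbitrary choices made for the illustration.

```python
import numpy as np

# Four-point Gauss-Legendre rule mapped from [-1, 1] onto [0, 1].
s, w = np.polynomial.legendre.leggauss(4)
t = 0.5 * (s + 1.0)
W = 0.5 * w

lam = 1.0
K = np.outer(t, t)          # K(x_i, t_j) = x_i * t_j
F = np.ones_like(t)

phi = F.copy()              # initial guess, equation (5.2.33)
iterations = 0
for k in range(200):
    # Equation (5.2.31) with the integral replaced by the quadrature sum.
    new = F + lam * (K * W) @ phi
    iterations = k + 1
    converged = np.max(np.abs(new - phi)) < 1e-13
    phi = new
    if converged:
        break

print(iterations, phi)
```

For this kernel each pass reduces the error by roughly a factor of three, so a few dozen iterations suffice. The iterates approach the values of 1 + 3x/4 at the quadrature points.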


## d. The Influence of the Kernel on the Solution

Although the linearity of the integral operator and its inverse relationship to the differential
operator tends to make one think that integral equations are no more difficult than differential equations,
there are some subtle differences. For example, one would never attempt a numerical solution of a
differential equation that could be shown to have no solution, but that can happen with integral equations if
one is not careful. The presence of the kernel under the operator makes the behavior of these equations less
transparent than differential equations. Consider the apparently benign kernel

$$
K(x,t) = \cos(x)\cos(t) , \tag{5.2.34}
$$

and an associated Fredholm equation of the first type
$$
F(x) = \int_{-a}^{+a} \cos(x)\cos(t)\,\phi(t)\,dt = \cos(x)\,Z(a) . \tag{5.2.35}
$$

Clearly this equation has a solution if and only if F(x) has the form given by the right hand side. Indeed, any
kernel that is separable in the independent variables so as to have the form

$$
K(x,t) = P(x)\,Q(t) , \tag{5.2.36}
$$

places constraints on the form of F(x) for which the equation has a solution. Nevertheless, it is conceivable
that someone could try to solve equation (5.2.35) for functional forms of F(x) other than those which
allow for a value of φ(x) to exist. Undoubtedly the numerical method would provide some sort of answer.
This probably prompted Baker9, as reported by Craig and Brown10, to remark 'without care we may well find
ourselves computing approximate solutions to problems that have no true solutions'. Clearly the form of the
kernel is crucial to the nature of the solution, indeed, to its very existence. Should even the conditions imposed
on F(x) by equation (5.2.35) be met, any solution of the form

$$
\phi(x) = \phi_0(x) + \zeta(x) , \tag{5.2.37}
$$

where φ0(x) is the initial solution and ζ(x) is any anti-symmetric function will also satisfy the equation. Not
only are we not guaranteed existence, we are not even guaranteed uniqueness when existence can be shown.
Fortunately, these are often just mathematical concerns and equations that arise from scientific arguments
will generally have unique solutions if they are properly formulated. However, there is always the risk that
the formulation will insert the problem in a class with many solutions only one of which is physical. The
investigator is then faced with the added problem of finding all the solutions and deciding which ones are
physical. That may prove to be more difficult than the numerical problem of finding the solutions.
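The difficulty with the kernel of equation (5.2.34) shows up directly in the matrix of equation (5.2.5). The Python sketch below assumes a = 1 and a five-point Gauss-Legendre quadrature, both arbitrary choices for illustration. It checks the rank of the quadrature matrix and the effect of adding an anti-symmetric function to a trial solution.

```python
import numpy as np

a = 1.0
s, w = np.polynomial.legendre.leggauss(5)
t = a * s                    # quadrature points on [-a, +a]
W = a * w

# A_ij = K(x_i, t_j) * W_j with K(x,t) = cos(x) cos(t), as in equation (5.2.5).
A = np.outer(np.cos(t), np.cos(t)) * W
rank = np.linalg.matrix_rank(A)

# Adding the anti-symmetric function zeta(t) = sin(t) to a trial solution
# leaves the quadrature of the integral unchanged, as in equation (5.2.37).
phi0 = np.ones_like(t)
shift = A @ (phi0 + np.sin(t)) - A @ phi0

print("rank of A:", rank)
print("largest change from adding sin(t):", np.max(np.abs(shift)))
```

A separable kernel yields a matrix of rank one regardless of the number of quadrature points, so the linear system (5.2.5) has no unique solution. A linear equation solver applied to it would either fail or return an answer dominated by round-off error.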

There are other ways in which the kernel can influence the solution. Craig and Brown11 devote most
of their book to investigating the solution of a class of integral equations which represent inversion problems
in astronomy. They show repeatedly that the presence of an inappropriate kernel can cause the numerical
methods for the solution to become wildly unstable. Most of their attention is directed to the effects of
random error in F(x) on the subsequent solution. However, the truncation error in equation (5.2.3) can
combine with F(x) to simulate such errors. The implications are devastating. In Fredholm equations of Type
2, if λ is large and the kernel a weak function of t, then the solution is liable to be extremely unstable. The
reason for this can be seen in the role of the kernel in determining the solution φ(x). K(x,t) behaves like a
filter on the contribution of the solution at all points to the local value of the solution. If K(x,t) is large only
for x ≈ t then the contribution of the rest of the integral is reduced and φ(x) is largely determined by the local
value of x [i.e. F(x)]. If the kernel is broad then distant values of φ(t) play a major role in determining the
local value of φ(x). If λ is large, then the role of F(x) is reduced and the equation becomes more nearly
homogeneous. Under these conditions φ(x) will be poorly determined and the effect of the truncation error
on F(x) will be disproportionately large. Thus one should hope for non-separable kernels that are strongly
peaked at x = t.

What happens at the other extreme when the kernel is so strongly peaked at x=t that it exhibits a
singularity? Under many conditions this can be accommodated within the quadrature approaches we have
already developed. Consider the ultimately peaked kernel

$$
K(x,t) = \delta(x-t) , \tag{5.2.38}
$$

where δ(x) is the Dirac delta function. This reduces all of the integral equations discussed here to have
solutions
$$
\left. \begin{array}{ll}
\phi(x) = F(x) & \text{type 1} \\
\phi(x) = F(x)(1 - \lambda)^{-1} & \text{type 2}
\end{array} \right\} . \tag{5.2.39}
$$
Thus, even though the Dirac delta function is "undefined" for zero argument, the integrals are well defined
and the subsequent solutions simple. For kernels that have singularities at x = t, but are defined elsewhere we
can remove the singularity by the simple expedient of adding and subtracting the answer from the integrand
so that
$$
\phi^{(k)}(x) = F(x) + \lambda \int_a^b K(x,t)\left[\phi(t) - \phi(x)\right]dt + \lambda\,\phi(x) \int_a^b K(x,t)\,dt . \tag{5.2.40}
$$

We may use the standard quadrature techniques on this equation if the following conditions are met:
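$$
\left. \begin{aligned}
&\int_a^b K(x,t)\,dt < \infty , \quad \forall x \\
&\lim_{t \to x} \left\{ K(x,t)\left[\phi(t) - \phi(x)\right] \right\} = 0
\end{aligned} \right\} . \tag{5.2.41}
$$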
The first of these is a reasonable constraint of the kernel. If that is not met it is unlikely that the solution can
be finite. The second condition will be met if the kernel does not approach the singularity faster than linearly
and the solution satisfies a Lipschitz condition. Since this is true of all functions with bounded derivatives, it is likely to be
true for any equation that arises from modeling the real world. If this condition is met then the contribution
to the quadrature sum from the terms where (i = j) can be omitted (or assigned weight zero). With that slight
modification all the previously described schemes can be utilized to solve the resulting equation. Although
some additional algebra is required, the resulting linear algebraic equations can be put into standard form and
solved using the formalisms from Chapter 2.
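The bookkeeping involved in equation (5.2.40) can be sketched in Python. Everything specific in the sketch is an assumption made for illustration: the weakly singular kernel K(x,t) = |x−t|^(−1/2) on the interval 0 → 1, whose integral over t is known in closed form, the value λ = 0.1, and an F(x) constructed so that φ(x) = 1 is the exact solution. The test therefore checks only that the modified linear system is assembled consistently, not the accuracy of the scheme for a general solution.

```python
import numpy as np

lam, n = 0.1, 40
s, w = np.polynomial.legendre.leggauss(n)
x = 0.5 * (s + 1.0)          # quadrature points on [0, 1]
W = 0.5 * w


def kernel_integral(x):
    """int_0^1 |x - t|**(-1/2) dt, available in closed form for this kernel."""
    return 2.0 * (np.sqrt(x) + np.sqrt(1.0 - x))


# F(x) is constructed so that phi(x) = 1 satisfies the type 2 equation exactly.
F = 1.0 - lam * kernel_integral(x)

# Kernel matrix with the singular i = j terms given weight zero.
diff = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(diff, 1.0)  # placeholder so the diagonal does not divide by zero
K = diff ** -0.5
np.fill_diagonal(K, 0.0)
KW = K * W

# Equation (5.2.40) at x_i, rearranged into standard form:
#   phi_i * [1 + lam * sum_{j != i} K_ij W_j - lam * I(x_i)]
#       - lam * sum_{j != i} K_ij W_j phi_j = F_i
A = np.eye(n) - lam * KW
A[np.diag_indices(n)] += lam * KW.sum(axis=1) - lam * kernel_integral(x)
phi = np.linalg.solve(A, F)

print("largest deviation from phi = 1:", np.max(np.abs(phi - 1.0)))
```

The integral of the kernel alone is taken analytically here. When no closed form exists it would have to be obtained by a quadrature scheme suited to the singularity, which is a separate problem from the solution of the integral equation.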

In this chapter we have considered the solution of differential and integral equations that arise so
often in the world of science. What we have done is but a brief survey. One could devote his or her life to the
study of these subjects. However, these techniques will serve the student of science who wishes simply to
use them as tools to arrive at an answer. As problems become more difficult, algorithms may need to become
more sophisticated, but these fundamentals always provide a good beginning.


Chapter 5 Exercises
1. Find the solution to the following differential equation

y' = 3y ,

y(0) = 1.

a. a second order Runge-Kutta

b. a 3-point predictor-corrector.
c. Picard's method with 10 steps.
d. Compare your answer to the analytic solution.

2. Find the solution to the differential equation

$$
x^2 y'' + x y' + (x^2 - 6)\,y = 0 ,
$$

in the range 0→10 with initial values of y'(0)=y(0)=0. Use any method you like,but explain why it
was chosen.

3. Find the numerical solution to the integral equation

$$
y(x) = 2 + \int_0^1 y(t)\,(x^2 t + x t^3 + 5 t^5)\,dt , \quad 0 \le x \le 2 .
$$

Comment on the accuracy of your solution and the reason for using the numerical method you
chose.

4. Find a closed form solution to the equation in problem 3 of the form

$$
y(x) = ax^2 + bx + c ,
$$

and specifically obtain the values for a, b, and c.

5. How would you have numerically obtained the values for a, b, and c of problem 4 had you only
known the numerical solution to problem 3? How would they compare to the values obtained from
the closed form solution?


6. We wish to find an approximate solution to the following integral equation:

$$
y(x) = 1 + x + 2 \int_0^1 t\,x^2\,y(t)\,dt .
$$

a. First assume we shall use a quadrature formula with a degree of precision of two
where the points of evaluation are specified to be x1=0.25, x2=0.5, and x3=0.75. Use
Lagrange interpolation to find the weights for the quadrature formula and use the
results to find a system of linear algebraic equations that represent the solution at
the quadrature points.

b. Solve the resulting linear equations by means of Gauss-Jordan elimination and use
the results to find an interpolative solution for the integral equation. Comment on the
accuracy of the resulting solution over the range 0 → ∞.

7. Solve the following integral equation:

$$
B(x) = \tfrac{1}{2} \int_{0} B(t)\,E_1|t-x|\,dt ,
$$

where

$$
E_1(x) = \int_{0} e^{-xt}\,\frac{dt}{t} .
$$

a. First solve the equation by treating the integral as a Gaussian sum. Note that
$$
\lim_{x \to 0} E_1(x) = \infty .
$$

b. Solve the equation by expanding B(t) in a Taylor series about x and thereby
changing the integral equation into an nth order linear differential equation.
Convert this equation into a system of n first order linear differential equations and
solve the system subject to the boundary conditions

$$
B(0) = B_0 , \quad B'(\infty) = B''(\infty) = \cdots = B^{(n)}(\infty) .
$$

Note that the integral equation is a homogeneous equation. Discuss how that influenced your
approach to the problem.


## Chapter 5 References and Supplemental Reading

1. Hamming, R.W., "Numerical Methods for Scientists and Engineers" (1962) McGraw-Hill Book Co.,
Inc., New York, San Francisco, Toronto, London, pp. 204-207.

2. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art of
Scientific Computing" (1986), Cambridge University Press, Cambridge, p. 569.

3. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art of
Scientific Computing" (1986), Cambridge University Press, Cambridge, pp. 563-568.

4. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art of
Scientific Computing" (1986), Cambridge University Press, Cambridge, p. 563.

5. Day, J.T., and Collins, G.W.,II, "On the Numerical Solution of Boundary Value Problems for Linear
Ordinary Differential Equations", (1964), Comm. A.C.M. 7, pp 22-23.

6. Fox, L., "The Numerical Solution of Two-point Boundary Value Problems in Ordinary Differential
Equations", (1957), Oxford University Press, Oxford.

7. Sokolnikoff, I.S., and Redheffer, R.M., "Mathematics of Physics and Modern Engineering" (1958)
McGraw-Hill Book Co., Inc. New York, Toronto, London, pp. 425-521.

8. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art of
Scientific Computing" (1986), Cambridge University Press, Cambridge, pp. 615-657.

9. Baker, C.T.N., "The Numerical Treatment of Integral Equations", (1977), Oxford University Press,
Oxford.

10. Craig, I.J.D., and Brown, J.C., (1986), "Inverse Problems in Astronomy -A Guide to Inversion
Strategies for Remotely Sensed Data", Adam Hilger Ltd. Bristol and Boston, p. 51.

11. Craig, I.J.D., and Brown, J.C., (1986), "Inverse Problems in Astronomy -A Guide to Inversion
Strategies for Remotely Sensed Data", Adam Hilger Ltd. Bristol and Boston.

## 6 Least Squares, Fourier Analysis, and Related Approximation Norms

Up to this point we have required that any function we use to
represent our 'data' points pass through those points exactly. Indeed, except for the predictor-corrector
schemes for differential equations, we have used all the information available to determine the
approximating function. In the extreme case of the Runge-Kutta method, we even made demands that
exceeded the available information. This led to approximation formulae that were under-determined. Now
we will consider approaches for determining the approximating function where some of the information is
deliberately ignored. One might wonder why such a course would ever be followed. The answer can be
found by looking in two rather different directions.

Remember that in considering predictor-corrector schemes in the last chapter, we deliberately
ignored some of the functional values when determining the parameters that specified the function. That was
done to avoid the rapid fluctuations characteristic of high degree polynomials. In short, we felt that we knew
something about extrapolating our approximating function that transcended the known values of specific
points. One can imagine a number of situations where that might be true. Therefore we ask if there is a
general approach whereby some of the functional values can be deliberately ignored when determining the
parameters that represent the approximating function. Clearly, anytime the form of the function is known
this can be done. This leads directly to the second direction where such an approach will be useful. So far we
have treated the functional values that constrain the approximating function as if they were known with
absolute precision. What should we do if this is not the case? Consider the situation where the functional
values resulted from observation or experimentation and are characterized by a certain amount of error.
There would be no reason to demand exact agreement of the functional form at each of the data points.
Indeed, in such cases the functional form is generally considered to be known a priori and we wish to test
some hypothesis by seeing to what extent the imprecise data are represented by the theory. Thus the two
different cases for this approach to approximation can be summarized as:

a. the data is exact but we desire to represent it by an approximating function with fewer
parameters than the data.

b. the approximating function can be considered to be "exact" and the data which represents
that function is imprecise.

There is a third situation that occasionally arises wherein one wishes to approximate a table of
empirically determined numbers which are inherently imprecise and the form of the function must also be
assumed. The use of any method in this instance must be considered suspect as there is no way to separate
the errors of observation or experimentation from the failure of the assumed function to represent the data.

However, all three cases have one thing in common. They will generate systems that will be over-
determined since there will, in general, be more constraining data than there are free parameters in the
approximating function. We must then develop some criterion that will enable us to reduce the problem to
one that is exactly determined. Since the function is not required to match the data at every point, we must
specify by how much it should miss. That criterion is what is known as an approximation norm and we shall
consider two popular ones, but devote most of our effort to the one known as the Least Square Norm.

## 6.1 Legendre's Principle of Least Squares

Legendre suggested that an appropriate criterion for fitting data points with a function having fewer
parameters than the data would be to minimize the square of the amount by which the function misses the
data points. However, the notion of a "miss" must be quantified. For least squares, the "miss" will be
considered to result from an error in the dependent variable alone. Thus, we assume that there is no error in
the independent variable. In the event that each point is as important as any other point, we can do this by
minimizing the sum-square of those errors. The use of the square of the error is important for it eliminates
the influence of its sign. This is the lowest power dependence of the error ε between the data point and the
approximating function that neglects the sign. Of course one could appeal to the absolute value function of
the error, but that function does not have a continuous derivative and so may produce difficulties as one tries to develop an
algorithm for determining the adjustable free parameters of the approximating function.

Least Squares is a very broad principle and has special examples in many areas of mathematics. For
example, we shall see that if the approximating functions are sines and cosines, the Principle of Least
Squares leads to the determination of the coefficients of a Fourier series. Thus Fourier analysis is a special
case of Least Squares. The relationship between Least Squares and Fourier analysis suggests a broad
approximation algorithm involving orthogonal polynomials known as the Legendre Approximation that is
extremely stable and applicable to very large data bases. With this in mind, we shall consider the
development of the Principle of Least Squares from several different vantage points.

There are those who feel that there is something profound about mathematics that makes this the
"correct" criterion for approximation. Others feel that there is something about nature that makes this the
appropriate criterion for analyzing data. In the next two chapters we shall see that there are conditions where
the Principle of Least Squares does provide the most probable estimate of adjustable parameters of a
function. However, in general, least squares is just one of many possible approximation norms. As we shall
see, it is a particularly convenient one that leads to a straightforward determination of the adjustable free
parameters of the approximating function.

## a. The Normal Equations of Least Squares

Let us begin by considering a collection of N data points $(x_i, Y_i)$ which are to be represented
by an approximating function $f(a_j, x)$ so that

$$f(a_j, x_i) = Y_i . \tag{6.1.1}$$

Here the (n+1) $a_j$'s are the parameters to be determined so that the sum-square of the deviations from $Y_i$ is a
minimum. We can write the deviation as

$$\varepsilon_i = Y_i - f(a_j, x_i) . \tag{6.1.2}$$

The conditions that the sum-square error be a minimum are just

$$\frac{\partial \sum_{i=1}^{N} \varepsilon_i^2}{\partial a_j} = -2\sum_{i=1}^{N}\left[Y_i - f(a_j, x_i)\right]\frac{\partial f(a_j, x_i)}{\partial a_j} = 0 , \quad j = 0, 1, 2, \dots, n . \tag{6.1.3}$$

There is one of these equations for each of the adjustable parameters $a_j$ so that the resultant system is
uniquely determined as long as (n+1) ≤ N. These equations are known as the normal equations for the
problem. The nature of the normal equations will be determined by the nature of f(aj,x). That is, should
f(aj,x) be non-linear in the adjustable parameters aj, then the normal equations will be non-linear. However, if
f(aj,x) is linear in the aj's as is the case with polynomials, then the resultant equations will be linear in the aj's.
The ease of solution of such equations and the great body of literature relating to them make this a most
important aspect of least squares and one on which we shall spend some time.


## b. Linear Least Squares

Consider the approximating function to have the form of a general polynomial as described
in chapter 3 [equation (3.1.1)]. Namely

$$f(a_j, x) = \sum_{k=0}^{n} a_k \phi_k(x) . \tag{6.1.4}$$

Here the $\phi_k(x)$ are the basis functions which for common polynomials are just $x^k$. This function, while highly
non-linear in the independent variable x, is linear in the adjustable free parameters $a_k$. Thus the partial
derivative in equation (6.1.3) is just

$$\frac{\partial f(a_j, x_i)}{\partial a_j} = \phi_j(x_i) , \tag{6.1.5}$$
and the normal equations themselves become

$$\sum_{k=0}^{n} a_k \sum_{i=1}^{N} \phi_k(x_i)\phi_j(x_i) = \sum_{i=1}^{N} Y_i \phi_j(x_i) , \quad j = 0, 1, \dots, n . \tag{6.1.6}$$

These are a set of linear algebraic equations, which we can write in component or vector form as

$$\left.\begin{aligned} \sum_{k} a_k A_{kj} &= C_j \\ \vec{a}\cdot\mathbf{A} &= \vec{C} \end{aligned}\right\} . \tag{6.1.7}$$

Since the φj(x) are known, the matrix A(xi) is known and depends only on the specific values, xi, of the
independent variable. Thus the normal equations can be solved by any of the methods described in chapter 2
and the set of adjustable parameters can be determined.

There are a number of aspects of the linear normal equations that are worth noting. First, they form a
symmetric system of equations since the matrix elements are Σφkφj. Since φj(x) is presumed to be real, the
matrix will be a normal matrix (see section 1.2). This is the origin of the name normal equations for the
equations of condition for least squares. Second, if we write the approximating function f(aj,x) in vector form
as

$$f(\vec{a}, x) = \vec{a}\cdot\vec{\phi}(x) , \tag{6.1.8}$$

then the normal equations can be written as

$$\vec{a}\cdot\sum_{i=1}^{N} \vec{\phi}(x_i)\vec{\phi}(x_i) = \sum_{i=1}^{N} Y_i \vec{\phi}(x_i) . \tag{6.1.9}$$

Here we have defined a vector $\vec{\phi}(x)$ whose components are the basis functions φj(x). Thus the matrix
elements of the normal equations can be generated simply by taking the outer (tensor) product of the basis
vector with itself and summing over the values of the vector for each data point. A third way to develop the
normal equations is to define a non-square matrix from the basis functions evaluated at the N data points xi as

$$\phi_{ki} = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_n(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_n(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_n(x_N) \end{pmatrix} . \tag{6.1.10}$$

Now we could write an over-determined system of equations which we would like to hold as

$$\phi\,\vec{a} = \vec{Y} . \tag{6.1.11}$$

The normal equations can then be described by

$$[\phi^{T}\phi]\,\vec{a} = \phi^{T}\vec{Y} , \tag{6.1.12}$$

where we take advantage of the matrix product to perform the summation over the data points. Equations
(6.1.9) and (6.1.12) are simply different mathematical ways of expressing the same formalism and are useful
in developing a detailed program for the generation of the normal equations.
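
As a minimal sketch of such a program, assuming the common polynomial basis φk(x) = x^k, the normal equations can be accumulated by summing the outer product of equation (6.1.9) over the data points and then solved by Gaussian elimination. The function names (`solve_linear`, `normal_equations`, `polynomial_basis`) and the test data are illustrative assumptions, not part of the text.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data are untouched.
    M = [list(A[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):          # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def normal_equations(xs, ys, basis):
    """Accumulate A_kj = sum_i phi_k(x_i) phi_j(x_i) and C_j = sum_i Y_i phi_j(x_i)."""
    m = len(basis)
    A = [[0.0] * m for _ in range(m)]
    C = [0.0] * m
    for x, y in zip(xs, ys):
        phi = [f(x) for f in basis]          # the vector phi(x_i)
        for j in range(m):
            C[j] += y * phi[j]
            for k in range(m):
                A[k][j] += phi[k] * phi[j]   # outer product of phi with itself
    return A, C


def polynomial_basis(n):
    """Basis functions phi_k(x) = x**k for k = 0..n (an assumed choice)."""
    return [lambda x, k=k: x ** k for k in range(n + 1)]


# Illustrative data: five exact samples of 1 - 2x + 0.5x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 - 2.0 * x + 0.5 * x * x for x in xs]
A, C = normal_equations(xs, ys, polynomial_basis(2))
coeffs = solve_linear(A, C)
print(coeffs)
```

Because the data here are exact samples of a quadratic, the solution should reproduce the three generating coefficients to within round-off; with noisy data the same code returns the least squares estimate.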

So far we have regarded all of the data points to be of equal value in determining the solution for the
free parameters aj. Often this is not the case and we would like to count a specific point (xi,Yi) to be of more
or less value than the others. We could simply include it more than once in the summations that lead to the
normal equations (6.1.6) or add it to the list of observational points defining the matrix φ given by equation
(6.1.10). This simplistic approach only yields integral weights for the data points. A far more general
approach would simply assign the expression [equation (6.1.1) or equation (6.1.8)] representing the data
point a weight ϖi. Then equation (6.1.1) would have the form

$$f(\vec{a}, x_i) = \varpi_i\,\vec{a}\cdot\vec{\phi}(x_i) \approx \varpi_i Y_i . \tag{6.1.13}$$

However, the partial derivative of f will also contain the weight so that

$$\frac{\partial f(\vec{a}, x_i)}{\partial a_j} = \varpi_i\,\hat{j}\cdot\vec{\phi}(x_i) = \varpi_i\phi_j(x_i) . \tag{6.1.14}$$
Thus the weight will appear quadratically in the normal equations as

$$\sum_{k=0}^{n} a_k \sum_{i=1}^{N} \varpi_i^2 \phi_k(x_i)\phi_j(x_i) = \sum_{i=1}^{N} \varpi_i^2 Y_i \phi_j(x_i) , \quad j = 0, 1, \dots, n . \tag{6.1.15}$$

In order to continually express the weight as a quadratic form, many authors define

$$w_i \equiv \varpi_i^2 , \tag{6.1.16}$$

so that the normal equations are written as

$$\sum_{k=0}^{n} a_k \sum_{i=1}^{N} w_i \phi_k(x_i)\phi_j(x_i) = \sum_{i=1}^{N} w_i Y_i \phi_j(x_i) , \quad j = 0, 1, \dots, n . \tag{6.1.17}$$

This simple substitution is often a source of considerable confusion. The weight wi is the square of the
weight assigned to the observation and is of necessity a positive number. One cannot detract from the
importance of a data point by assigning a negative weight ϖi. The generation of the normal equations would
force the square-weight wi to be positive thereby enhancing the role of that point in determining the solution.
Throughout the remainder of this chapter we shall consistently use wi as the square-weight denoted by
equation (6.1.16). However, we shall also use ϖi as the individual weight of a given observation. The reader
should be careful not to confuse the two.
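
The distinction between ϖi and wi can be checked numerically. The following sketch, which assumes a straight-line fit with basis functions 1 and x and uses made-up data, solves the two weighted normal equations (6.1.17) in closed form. It illustrates that a negative ϖi gives the same solution as the positive one, and that ϖi = √2 is equivalent to including the point twice. The function name `weighted_line_fit` is hypothetical.

```python
import math


def weighted_line_fit(xs, ys, varpi):
    """Fit Y = a0 + a1*x from the normal equations (6.1.17) with w_i = varpi_i**2."""
    w = [p * p for p in varpi]                       # square-weights, never negative
    s = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = s * sxx - sx * sx                          # determinant of the 2x2 system
    a0 = (sy * sxx - sx * sxy) / det
    a1 = (s * sxy - sx * sy) / det
    return a0, a1


# Illustrative (invented) data.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

plus = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, 1.0])
minus = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, -1.0])      # negative weight on last point
doubled = weighted_line_fit(xs, ys, [1.0, 1.0, 1.0, math.sqrt(2.0)])
repeated = weighted_line_fit(xs + [3.0], ys + [2.8], [1.0] * 5)  # last point counted twice

print(plus, minus)
print(doubled, repeated)
```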


Once generated, these linear algebraic equations can be solved for the adjustable free parameters by
any of the techniques given in chapter 2. However, under some circumstances, it may be possible to produce
normal equations which are more stable than others.

## c. The Legendre Approximation

In the instance where we are approximating data, either tabular or experimental, with a
function of our choice, we can improve the numerical stability by choosing the basis functions φj(x) to be
members of an orthogonal set. Now the majority of orthogonal functions we have discussed have been
polynomials (see section 3.3) so we will base our discussion on orthogonal polynomials. But it should
remain clear that this is a convenience, not a requirement. Let φj(x) be an orthogonal polynomial relative to
the weight function w(x) over the range of the independent variable x. The elements of the normal equations
(6.1.17) then take the form

$$A_{kj} = \sum_{i=1}^{N} w_i \phi_k(x_i)\phi_j(x_i) . \tag{6.1.18}$$

If we weight the points in accordance with the weight function of the polynomial, then the weights are

$$w_i = w(x_i) . \tag{6.1.19}$$

If the data points are truly independent and randomly selected throughout the range of x, then as the number
of them increases, the sum will approach the value of the integral so that

$$A_{kj} = \lim_{N\to\infty}\left[\sum_{i=1}^{N} w(x_i)\phi_k(x_i)\phi_j(x_i)\right] = N\int w(x)\phi_k(x)\phi_j(x)\,dx = N\delta_{kj} . \tag{6.1.20}$$

This certainly simplifies the solution of the normal equations (6.1.17) as equation (6.1.20) states that the
off-diagonal elements will tend to vanish. If the basis functions φj(x) are chosen from an orthonormal set, then
the solution becomes

$$a_j \cong \frac{1}{N}\sum_{i=1}^{N} w(x_i)\phi_j(x_i)Y_i , \quad j = 0, 1, \dots, n . \tag{6.1.21}$$

Should they be merely orthogonal, then the solution will have to be normalized by the diagonal elements
leading to a solution of the form

$$a_j \cong \left[\sum_{i=1}^{N} w(x_i)\phi_j(x_i)Y_i\right] \times \left[\sum_{i=1}^{N} w(x_i)\phi_j^2(x_i)\right]^{-1} , \quad j = 0, 1, \dots, n . \tag{6.1.22}$$

The process of using an orthogonal set of functions φj(x) to describe the data so as to achieve the simple
result of equations (6.1.21) and (6.1.22) is known as the Legendre approximation. It is of considerable utility
when the amount of data is vast and the process of forming and solving the full set of normal equations
would be too time consuming. It is even possible that in some cases, the solution of a large system of normal
equations could introduce greater round-off error than is incurred in the use of the Legendre approximation.
Certainly the number of operations required for the evaluation of equations (6.1.21) or (6.1.22) is of the
order of $(n+1)N$, whereas for the formation and solution of the normal equations (6.1.17) themselves something
of the order of $(n+1)^2(N+n+1)$ operations are required.
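
A minimal sketch of equation (6.1.22) follows, assuming Legendre polynomials on -1 ≤ x ≤ +1 (for which the weight function is w(x) = 1) and, in place of randomly selected points, a dense uniform grid of interval midpoints. Both choices, together with the function names and the test data, are assumptions made for illustration.

```python
def legendre(j, x):
    """Legendre polynomial P_j(x) from the three-term recurrence relation."""
    p_prev, p = 1.0, x
    if j == 0:
        return p_prev
    for k in range(1, j):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p


def legendre_approximation(xs, ys, n):
    """Coefficients a_j from equation (6.1.22) with w(x) = 1; no linear system is solved."""
    coeffs = []
    for j in range(n + 1):
        numerator = sum(legendre(j, x) * y for x, y in zip(xs, ys))
        diagonal = sum(legendre(j, x) ** 2 for x in xs)   # the diagonal element A_jj
        coeffs.append(numerator / diagonal)
    return coeffs


# Illustrative data: N midpoints covering [-1, +1] and a known combination of P_0, P_1, P_2.
N = 2000
xs = [-1.0 + (2 * i + 1) / N for i in range(N)]
ys = [1.0 + 2.0 * legendre(1, x) + 3.0 * legendre(2, x) for x in xs]

coeffs = legendre_approximation(xs, ys, 2)
print(coeffs)
```

The recovered coefficients differ from the generating values only by the small off-diagonal sums that the approximation neglects, and each coefficient costs a single pass through the data.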


One should always be wary of the time required to carry out a Least Squares solution. It has the
habit of growing rapidly and getting out of hand for even the fastest computers. There are many problems
where n may be of the order of $10^2$ while N can easily reach $10^6$. Even the Legendre approximation would
imply $10^8$ operations for the completion of the solution, while for a full solution of the normal equations $10^{10}$
operations would need to be performed. For current megaflop machines the Legendre approximation would
only take several minutes, while the full solution would require several hours. There are problems that are
considerably larger than this example. Increasing either n or N by an order of magnitude could lead to
computationally prohibitive problems unless a faster approach can be used. To understand the origin of one
of the most efficient approximation algorithms, let us consider the relation of least squares to Fourier
analysis.
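
The arithmetic behind these estimates can be reproduced directly from the two operation counts given in the previous section. The sketch below assumes a machine rate of 10^6 operations per second for a "megaflop" machine; the variable names are illustrative.

```python
n, N = 10**2, 10**6

legendre_ops = (n + 1) * N                   # order of (n+1)N
full_ops = (n + 1) ** 2 * (N + n + 1)        # order of (n+1)^2 (N+n+1)

rate = 1.0e6                                 # assumed operations per second
legendre_minutes = legendre_ops / rate / 60.0
full_hours = full_ops / rate / 3600.0

print(f"Legendre approximation: {legendre_ops:.2e} operations, {legendre_minutes:.1f} minutes")
print(f"Full normal equations:  {full_ops:.2e} operations, {full_hours:.1f} hours")
```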

## 6.2 Least Squares, Fourier Series, and Fourier Transforms

In this section we shall explicitly explore the relationship between the Principle of Least Squares and Fourier
series. Then we extend the notion of Fourier series to the Fourier integral and finally to the Fourier transform
of a function. Lastly, we shall describe the basis for an extremely efficient algorithm for numerically
evaluating a discrete Fourier transform.

## a. Least Squares, the Legendre Approximation, and Fourier Series

In section 3.3e we noted that the trigonometric functions sine and cosine formed
orthonormal sets in the interval 0 → +1, not only for the continuous range of x but also for a discrete set of
values as long as the values were equally spaced. Equation (3.3.41) states that

$$\left.\begin{aligned} \sum_{i=0}^{N}\sin(k\pi x_i)\sin(j\pi x_i) &= \sum_{i=0}^{N}\cos(k\pi x_i)\cos(j\pi x_i) = N\delta_{kj} \\ x_i &= (2i - N)/N , \quad i = 0, 1, \dots, N \end{aligned}\right\} . \tag{6.2.1}$$

Here we have transformed x into the more familiar interval -1 ≤ x ≤ +1. Now consider the normal
equations that will be generated should the basis functions be either cos(jπx) or sin(jπx) and the data points
are spaced in accord with the second of equations (6.2.1). Since the functional sets are orthonormal we may
employ the Legendre approximation and go immediately to the solution given by equation (6.1.21) so that
the coefficients of the sine and cosine series are

$$\left.\begin{aligned} a_j &= \frac{1}{N+1}\sum_{i=1}^{N} f(x_i)\cos(j\pi x_i) \\ b_j &= \frac{1}{N+1}\sum_{i=1}^{N} f(x_i)\sin(j\pi x_i) \end{aligned}\right\} . \tag{6.2.2}$$

Since these trigonometric functions are strictly orthogonal in the interval, as long as the data points are
equally spaced, the Legendre approximation is not an approximation. Therefore the equal signs in equations
(6.2.2) are strictly correct. The orthogonality of the trigonometric functions with respect to equally spaced
data and the continuous variable means that we can replace the summations in equation (6.2.2) with integral
signs without passing to the limit given in equation (6.1.20) and write

$$\left.\begin{aligned} a_j &= \int_{-1}^{+1} f(x)\cos(j\pi x)\,dx \\ b_j &= \int_{-1}^{+1} f(x)\sin(j\pi x)\,dx \end{aligned}\right\} , \tag{6.2.3}$$

which are the coefficients of the Fourier series

$$f(x) = \tfrac{1}{2}a_0 + \sum_{k=1} a_k\cos(k\pi x) + b_k\sin(k\pi x) . \tag{6.2.4}$$

Let us pause for a moment to reflect on the meaning of the series given by equation (6.2.4). The
function f(x) is represented in terms of a linear combination of periodic functions. The coefficients of these
functions are themselves determined by the periodically weighted behavior of the function over the interval.
The coefficients ak and bk simply measure the periodic behavior of the function itself at the period 2/k.
Thus, a Fourier series represents a function in terms of its own periodic behavior. It is as if the function were
broken into pieces that exhibit a specific periodic behavior and then re-assembled as a linear combination of
the relative strength of each piece. The coefficients are then just the weights of their respective contribution.
This is all accomplished as a result of the orthogonality of the trigonometric functions for both the discrete
and continuous finite interval.

We have seen that Least Squares and the Legendre approximation lead directly to the coefficients of
a finite Fourier series. This result suggests an immediate solution for the series approximation when the data
is not equally spaced. Namely, do not use the Legendre approximation, but keep the off-diagonal terms of
the normal equations and solve the complete system. As long as N and n are not so large as to pose
computational limits, this is a perfectly acceptable and rigorous algorithm for dealing with the problem of
unequally spaced data. However, in the event that the amount of data (N) is large there is a further
development that can lead to efficient data analysis.
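
As a hedged illustration of the equally spaced case, the sketch below evaluates the coefficients by direct summation and re-assembles the series of equation (6.2.4). It assumes N sample points x_i = -1 + 2i/N, i = 0, ..., N-1 (the duplicate endpoint x = +1 is omitted so that the discrete orthogonality is exact), and it normalizes the sums by 2/N so that they match the integrals of equation (6.2.3). These conventions and the test function are assumptions of the example.

```python
import math


def fourier_coefficients(f, N, n):
    """Discrete estimates of a_j and b_j, j = 0..n, from N equally spaced samples."""
    xs = [-1.0 + 2.0 * i / N for i in range(N)]
    a = [(2.0 / N) * sum(f(x) * math.cos(j * math.pi * x) for x in xs)
         for j in range(n + 1)]
    b = [(2.0 / N) * sum(f(x) * math.sin(j * math.pi * x) for x in xs)
         for j in range(n + 1)]
    return a, b


def fourier_series(a, b, x):
    """Evaluate (1/2)a_0 + sum_k [a_k cos(k pi x) + b_k sin(k pi x)]."""
    total = 0.5 * a[0]
    for k in range(1, len(a)):
        total += a[k] * math.cos(k * math.pi * x) + b[k] * math.sin(k * math.pi * x)
    return total


def f(x):
    """Illustrative function whose coefficients are known by construction."""
    return 1.0 + 3.0 * math.cos(2.0 * math.pi * x) - 2.0 * math.sin(math.pi * x)


a, b = fourier_coefficients(f, 16, 3)
print(a)
print(b)
print(fourier_series(a, b, 0.3), f(0.3))
```

For this band-limited test function the sums are exact, which is the sense in which the Legendre approximation is "not an approximation" for equally spaced data.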

## b. The Fourier Integral

The functions that we discussed above were confined to the interval –1 → +1. However, if
the functions meet some fairly general conditions, then we can extend the series approximation beyond that
interval. Those conditions are known as the Dirichlet conditions which are that the function satisfy
Dirichlet's theorem. That theorem states:

Suppose that f(x) is well defined and bounded with a finite number of maxima, minima, and
discontinuities in the interval -π ≤ x ≤ +π. Let f(x) be defined beyond this region by f(x+2π) =
f(x). Then the Fourier series for f(x) converges absolutely for all x.

It should be noted that these are sufficient conditions, but not necessary conditions for the convergence of a
Fourier series. However, they are sufficiently general to include a very wide range of functions
which embrace virtually all the functions one would expect to arise in science. We may use these conditions
to extend the notion of a Fourier series beyond the interval –1 → +1.

Let us define

$$z \equiv x/\xi , \tag{6.2.5}$$

where

$$\xi > 1 . \tag{6.2.6}$$

Using Dirichlet's theorem we develop a Fourier series for f(x) in terms of z so that

$$f(z\xi) = \tfrac{1}{2}a_0 + \sum_{k=1} a_k\cos(k\pi z) + b_k\sin(k\pi z) , \tag{6.2.7}$$

which will have Fourier coefficients given by

$$\left.\begin{aligned} a_k &= \int_{-1}^{+1} f(z)\cos(k\pi z)\,dz = \frac{1}{\xi}\int_{-\xi}^{+\xi} f(x)\cos(k\pi x/\xi)\,dx \\ b_k &= \int_{-1}^{+1} f(z)\sin(k\pi z)\,dz = \frac{1}{\xi}\int_{-\xi}^{+\xi} f(x)\sin(k\pi x/\xi)\,dx \end{aligned}\right\} . \tag{6.2.8}$$

Making use of the addition formula for trigonometric functions
cos(α-β) = cosα cosβ + sinα sinβ , (6.2.9)
we can write the Fourier series as

$$f(x) = \frac{1}{2\xi}\int_{-\xi}^{+\xi} f(z)\,dz + \sum_{k=1}^{\infty}\frac{1}{\xi}\int_{-\xi}^{+\xi} f(z)\cos[k\pi(z-x)/\xi]\,dz . \tag{6.2.10}$$

Here we have done two things at once. First, we have passed from a finite Fourier series to an infinite series,
which is assumed to be convergent. (i.e. the Dirichlet conditions are satisfied). Second, we have explicitly
included the ak's and bk's in the series terms. Thus we have represented the function in terms of itself, or more
properly, in terms of its periodic behavior. Now we wish to let the infinite summation series pass to its
limiting form of an integral. But here we must be careful to remember what the terms of the series represent.
Each term in the Fourier series constitutes the contribution to the function of its periodic behavior at some
discrete period or frequency. Thus, when we pass to the integral limit for the series, the integrand will
measure the frequency dependence of the function. The integrand will itself contain an integral of the
function itself over space. Thus this process will transform the representation of the function from its
behavior in frequency to its behavior in space. Such a transformation is known as a Fourier Transformation.

## c. The Fourier Transform

Let us see explicitly how we can pass from the discrete summation of the Fourier series to
the integral limit. To do this, we will have to represent the frequency dependence in a continuous way. This
can be accomplished by allowing the range of the function (i.e. –ξ → +ξ) to be variable. Let

$$\delta\alpha = 1/\xi , \tag{6.2.11}$$

so that each term in the series becomes

$$\frac{1}{\xi}\int_{-\xi}^{+\xi} f(z)\cos[k\pi(z-x)/\xi]\,dz = \delta\alpha\int_{-\xi}^{+\xi} f(z)\cos[(k\delta\alpha)\pi(z-x)]\,dz . \tag{6.2.12}$$

Now as we pass to the limit of letting δα → 0, or ξ → ∞, each term in the series will be multiplied by an
infinitesimal dα, and the limits on the term will extend to infinity. The product kδα will approach the
variable of integration α so that

$$\lim_{\substack{\delta\alpha\to 0 \\ \xi\to\infty}} \sum_{k=1}^{\infty}\left[\int_{-\xi}^{+\xi} f(z)\cos[(k\delta\alpha)\pi(z-x)]\,dz\right]\delta\alpha = \int_{0}^{\infty}\left[\int_{-\infty}^{+\infty} f(z)\cos[\alpha\pi(z-x)]\,dz\right]d\alpha . \tag{6.2.13}$$

The right hand side of equation (6.2.13) is known as the Fourier integral which allows a function f(x) to be
expressed in terms of its frequency dependence f(z). If we use the trigonometric identity (6.2.9) to re-express
the Fourier integrals explicitly in terms of their sine and cosine dependence on z we get

$$\left.\begin{aligned} f(x) &= 2\int_{0}^{+\infty}\!\!\int_{0}^{+\infty} f(z)\sin(\alpha\pi z)\sin(\alpha\pi x)\,dz\,d\alpha \\ f(x) &= 2\int_{0}^{+\infty}\!\!\int_{0}^{+\infty} f(z)\cos(\alpha\pi z)\cos(\alpha\pi x)\,dz\,d\alpha \end{aligned}\right\} . \tag{6.2.14}$$

The separate forms of the integrals depend on the symmetry of f(x). Should f(x) be an odd function, then
it will cancel from all the cosine terms and produce only the first of equations (6.2.14). The second will
result when f(x) is even and the sine terms cancel.

Clearly to produce a representation of a general function f(x) we shall have to include both the
sine and cosine series. There is a notational form that will allow us to do that using complex numbers
known as Euler's formula

$$e^{ix} = \cos(x) + i\sin(x) . \tag{6.2.15}$$

This yields an infinite Fourier series of the form

$$\left.\begin{aligned} f(x) &= \sum_{k=-\infty}^{+\infty} C_k e^{i\pi kx} \\ C_k &= \tfrac{1}{2}\int_{-1}^{+1} f(t)e^{-i\pi kt}\,dt \end{aligned}\right\} , \tag{6.2.16}$$

where the complex constants Ck are related to the ak's and bk's of the cosine and sine series by

$$\left.\begin{aligned} C_0 &= a_0/2 \\ C_{+k} &= a_k/2 - ib_k/2 \\ C_{-k} &= a_k/2 + ib_k/2 \end{aligned}\right\} . \tag{6.2.17}$$

We can extend this representation beyond the interval –1 → +1 in the same way we did for the
Fourier Integral. Replacing the infinite summation by an integral allows us to pass to the limit and get

$$f(x) = \int_{-\infty}^{+\infty} e^{2\pi ixz}F(z)\,dz , \tag{6.2.18}$$

where

$$F(z) = \int_{-\infty}^{+\infty} f(t)e^{-2\pi izt}\,dt \equiv T(f) . \tag{6.2.19}$$

The integral T(f) is known as the Fourier Transform of the function f(x). It is worth considering the
transform of the function f(t) to simply be a different representation of the same function since

$$\left.\begin{aligned} F(z) &= \int_{-\infty}^{+\infty} f(t)e^{-2\pi izt}\,dt = T(f) \\ f(t) &= \int_{-\infty}^{+\infty} F(z)e^{+2\pi izt}\,dz = T(F) = T^{-1}(f) \end{aligned}\right\} . \tag{6.2.20}$$

The second of equations (6.2.20) reverses the effect of the first [i.e. $T(f)\times T^{-1}(f) = 1$], so the second equation
is known as the inverse Fourier transform.

The Fourier transform is only one of a large number of integrals that transform a function from one
space to another and whose repeated application regenerates the function. Any such integral is known as an
integral transform. Next to the Fourier transform, the best known and most widely used integral transform is
the Laplace transform L(f) which is defined as

$$L(f) = \int_{0}^{\infty} f(t)e^{-pt}\,dt . \tag{6.2.21}$$

For many forms of f(t) the integral transforms as defined in both equations (6.2.20) and (6.2.21) can be
expressed in closed form which greatly enhances their utility. That is, given an analytic closed-form
expression for f(t), one can find an analytic closed-form expression for T(f) or L(f). Unfortunately the
expression of such integrals is usually not obvious. Perhaps the largest collection of integral transforms, not
limited to just Fourier and Laplace transforms, can be found among the Bateman Manuscripts¹ where two
full volumes are devoted to the subject.
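
As a brief worked illustration of such closed forms (the choice of the functions $e^{-|t|}$ and $e^{-t}$ is ours), the definitions in equations (6.2.19) and (6.2.21) give

$$T\left(e^{-|t|}\right) = \int_{-\infty}^{+\infty} e^{-|t|}e^{-2\pi izt}\,dt = 2\int_{0}^{\infty} e^{-t}\cos(2\pi zt)\,dt = \frac{2}{1 + 4\pi^2 z^2} ,$$

$$L\left(e^{-t}\right) = \int_{0}^{\infty} e^{-t}e^{-pt}\,dt = \frac{1}{p+1} , \quad p > -1 .$$

In the first case the sine part of the exponential integrates to zero because $e^{-|t|}$ is an even function, so only the cosine part survives.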

Indeed, one must be careful to show that the transform actually exists. For example, one might
believe from the extremely generous conditions for the convergence of a Fourier series, that the Fourier
transform must always exist and there are those in the sciences that take its existence as an axiom. However,
in equation (6.2.13) we passed from a finite interval to the full open infinite interval. This may result in a
failure to satisfy the Dirichlet conditions. This is the case for the basis functions of the Fourier transform
themselves, the sines and cosines. Thus sin(x) or cos(x) will not have a discrete Fourier transform and that
should give the healthy skeptic pause for thought. However, in the event that a closed form representation of
the integral transform cannot be found, one must resort to a numerical approach which will yield a discrete
Fourier transform. After establishing the existence of the transform, one may use the very efficient method
for calculating it known as the Fast Fourier Transform Algorithm.

## d. The Fast Fourier Transform Algorithm

Because of the large number of functions that satisfy Dirichlet's conditions, the Fourier
transform is one of the most powerful analytic tools in science and considerable effort has been devoted to
its evaluation. Clearly the evaluation of the Fourier transform of a function f(t) will generally be
accomplished by approximating the function by a Fourier series that covers some finite interval. Therefore,
let us consider a finite interval of range t0 so that we can write the transform as

$$F(z_k) = \int_{-\infty}^{+\infty} f(t)e^{-2\pi iz_k t}\,dt = \int_{-t_0/2}^{+t_0/2} f(t)e^{2\pi iz_k t}\,dt = \sum_{j=0}^{N-1} f(t_j)e^{2\pi iz_k t_j}\,W_j . \tag{6.2.22}$$

In order to take advantage of the orthogonality of the sines and cosines over a discrete set of equally
spaced data the quadrature weights Wi in equation (6.2.22) will all be taken to be equal and to sum to the
range of the integral so that

$$W_i = t_0/N = t(N)/N \equiv \delta . \tag{6.2.23}$$

This means that our discrete Fourier transform can be written as

$$F(z_k) = \delta\sum_{j=0}^{N-1} f(t_j)e^{2\pi iz_k(j\delta)} . \tag{6.2.24}$$

In order for the units to yield a dimensionless exponent in equation (6.2.24), $z \sim t^{-1}$. Since we are determining
a discrete Fourier transform, we will choose a discrete set of points $z_k$ so that

$$z_k = \pm k/t(N) = \pm k/(N\delta) , \tag{6.2.25}$$

and the discrete transform becomes

$$F(z_k) = \delta F_k = \delta\sum_{j=0}^{N-1} f(t_j)e^{2\pi i(kj/N)} . \tag{6.2.26}$$

To determine the Fourier transform of f(x) is to find N values of $F_k$. If we write equation (6.2.26) in vector
notation so that

$$\left.\begin{aligned} \vec{F} &= \mathbf{E}\cdot\vec{f} \\ E_{kj} &= e^{2\pi i(kj/N)} \end{aligned}\right\} , \tag{6.2.27}$$

it would appear that to find the N components of the vector $\vec{F}$ we would have to evaluate a matrix E
having $N^2$ complex components. The resulting matrix multiplication would require $N^2$ operations. However,
there is an approach that yields a Fourier Transform in about $N\log_2 N$ steps known as the Fast Fourier
Transform algorithm or FFT for short. This tricky algorithm relies on noticing that we can write the discrete
Fourier transform of equation (6.2.26) as the sum of two smaller discrete transforms involving the even and
odd points of the summation. Thus

$$\begin{aligned} F_k &= \sum_{j=0}^{N-1} f(t_j)e^{2\pi i(kj/N)} = \sum_{j=0}^{N/2-1} f(t_{2j})e^{2\pi ik(2j)/N} + \sum_{j=0}^{N/2-1} f(t_{2j+1})e^{2\pi ik(2j+1)/N} \\ &= \sum_{j=0}^{N/2-1} f(t_{2j})e^{2\pi ikj/(N/2)} + e^{2\pi ik/N}\sum_{j=0}^{N/2-1} f(t_{2j+1})e^{2\pi ikj/(N/2)} = F_k^{(0)} + Q_k F_k^{(1)} , \end{aligned} \tag{6.2.28}$$

where $Q_k = e^{2\pi ik/N}$.

If we follow the argument of Press et al.², we note that each of the transforms involving half the
points can themselves be subdivided into two more. We can continue this process until we arrive at sub-
transforms containing but a single term. There is no summation for a one-point transform so that it is simply
equal to a particular value of f( tk ). One need only identify which sub-transform is to be associated with
which point. The answer, which is what makes the algorithm practical, is contained in the order in which a
sub-transform is generated. If we denote an even sub-transform at a given level of subdivision by a
superscript 0 and an odd one by a superscript of 1, the sequential generation of sub-transforms will generate
a series of binary digits unique to that sub-transform. The binary number represented by the reverse order of
those digits is the binary representation of i denoting the functional value f( ti). Now re-sort the points so that
they are ordered sequentially on this new binary subscript, say p. Each f(tp) represents a one-point
sub-transform which we can combine via equation (6.2.28) with its adjacent neighbor to form a two-point
sub-transform. There will of course be N/2 of these. These can be combined to form N/4 four-point sub-transforms
and so on until the N values of the final transform are generated. Each step of combining transforms will
take on the order of N operations. The process of breaking the original transform down to one-point
transforms will double the number of transforms at each division. Thus there will be m sub-divisions where

$$2^m = N , \tag{6.2.29}$$

so that

$$m = \log_2 N . \tag{6.2.30}$$

Therefore the total number of operations in this algorithm will be of the order of $N\log_2 N$. This clearly
suggests that N had better be a power of 2 even if it is necessary to interpolate some additional data. There
will be some additional computation involved in the calculation in order to obtain the $Q_k$'s, carry out the
additions implied by equation (6.1.46), and perform the sorting operation. However, it is worth noting that at
each subdivision, the values of $Q_k$ are simply related to their values $e^{2\pi i k/N}$ from the previous subdivision, for only the
length of the sub-transform, and hence N, has changed. With modern efficient sorting algorithms these
additional tasks can be regarded as negligible additions to the entire operation. When one compares $N^2$ to
$N\log_2 N$ for $N \sim 10^6$, then the saving is of the order of $5\times10^4$. Indeed, most of the algorithm can be regarded
as a bookkeeping exercise. There are extremely efficient packages that perform FFTs. The great speed of
FFTs has led to their widespread use in many areas of analysis and has focused a great deal of attention on
Fourier analysis. However, one should always remember the conditions for the validity of the discrete
Fourier analysis. The most important of these is the existence of equally spaced data.
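The subdivision argument can be made concrete with a short sketch. The Python below is an illustration only and not the author's program: the names `dft_direct` and `fft_recursive` are hypothetical, the sign convention $e^{+2\pi i kj/N}$ is taken from equation (6.2.28), and a recursion stands in for the explicit re-sorting of the points. It assumes that N is a power of 2.

```python
import cmath

def dft_direct(f):
    """Direct evaluation of F_k = sum_j f_j exp(+2*pi*i*k*j/N): about N*N operations."""
    n = len(f)
    return [sum(f[j] * cmath.exp(2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def fft_recursive(f):
    """Radix-2 subdivision F_k = F_k^(0) + Q_k F_k^(1) with Q_k = exp(+2*pi*i*k/N)."""
    n = len(f)
    if n == 1:
        return [complex(f[0])]          # a one-point transform is the point itself
    if n % 2:
        raise ValueError("the number of points must be a power of 2")
    even = fft_recursive(f[0::2])       # sub-transform of the even-numbered points
    odd = fft_recursive(f[1::2])        # sub-transform of the odd-numbered points
    half = n // 2
    out = [0j] * n
    for k in range(half):
        q = cmath.exp(2j * cmath.pi * k / n)
        out[k] = even[k] + q * odd[k]
        # The sub-transforms repeat with period N/2 and Q_(k+N/2) = -Q_k,
        # so the upper half of the transform needs no new sub-transform values.
        out[k + half] = even[k] - q * odd[k]
    return out

samples = [0.5, 1.0, -2.0, 3.0, 0.0, 1.5, 2.5, -1.0]
slow = dft_direct(samples)
fast = fft_recursive(samples)
```

Both routines return the same eight values to rounding error; only the number of operations differs, and that difference is what grows as $N^2$ against $N\log_2 N$.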

The speed of the FFT algorithm is largely derived from the repetitive nature of the Fourier
Transform. The function is assumed to be represented by a Fourier Series which contains only terms that
repeat outside the interval in which the function is defined. This is the essence of the Dirichlet conditions and
can be seen by inspecting equation (6.2.28) and noticing what happens when k increases beyond N. The
quantity $e^{2\pi i jk/N}$ simply revolves through another cycle yielding the periodic behavior of $F_k$. Thus when
values of a sub-transform $F_k^0$ are needed for values of k beyond N, they need not be recalculated.

Therefore the basis for the FFT algorithm is a systematic way of keeping track of the bookkeeping
associated with the generation of the shorter sub-transforms. By way of an example, let us consider the
discrete Fourier transform of the function
$$ f(t) = e^{-|t|} \, . \tag{6.2.31} $$

We shall consider representing the function over the finite range (-½t0 → +½t0) where t0 = 4. Since the FFT
algorithm requires that the calculation be carried out over a finite number of points, let us take $2^3$ points to
ensure a sufficient number of generations to adequately demonstrate the subdivision process. With these
constraints in mind the equation (6.2.22) defining the discrete Fourier Transform becomes
$$ F(z) = \int_{-t_0/2}^{+t_0/2} f(t)\,e^{+2\pi i t z}\,dt = \int_{-2}^{+2} e^{-|t|}\,e^{+2\pi i t z}\,dt = \sum_{j=0}^{7} e^{-|t_j|}\,e^{2\pi i t_j z}\,W_j \, . \tag{6.2.32} $$

We may compare the discrete transform with the Fourier Transform for the full infinite interval
(i.e. -∞ → +∞) as the integral in equation (6.2.32) may be expressed in closed form so that

$$ F[f(t)] = F(z) = \frac{2}{1+(2\pi|z|)^2} \, . \tag{6.2.33} $$

The results of both calculations are summarized in table 6.1. We have deliberately chosen an even function
of t as the Fourier transform will be real and even. This property is shared by both the discrete and
continuous transforms. However, there are some significant differences between the continuous transform
for the full infinite interval and the discrete transform. While the maximum amplitude is similar, the discrete
transform oscillates while the continuous transform is monotonic. The oscillation of the discrete transform
results from the truncation of the function at ±½t0. To properly describe this discontinuity in the function a
larger amplitude for the high frequency components will be required. The small number of points in the
transform exacerbates this. The absence of the higher frequency components that would be specified by a
larger number of points forces their influence into the lower order terms leading to the oscillation. In spite of
this, the magnitude of the transform is roughly in accord with the continuous transform. Figure 6.1 shows the
comparison of the discrete transform with the full interval continuous transform. We have included a dotted
line connecting the points of the discrete transform to emphasize the oscillatory nature of the transform, but
it should be remembered that the transform is only defined for the discrete set of points $z_k$.

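Table 6.1
Summary Results for a Sample Discrete Fourier Transform

| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| $t_i$ | -2.0000 | -1.5000 | -1.0000 | -0.5000 | 0.0000 | +0.5000 | +1.0000 | +1.5000 |
| $f(t_i)$ | 0.1353 | 0.2231 | 0.3678 | 0.6065 | 1.0000 | 0.6065 | 0.3678 | 0.2231 |

| k | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| $z_k$ | 0.0000 | +0.2500 | +0.5000 | +0.7500 | +1.0000 | -0.7500 | -0.5000 | -0.2500 |
| $F(z_k)$ | +1.7648 | -0.7010 | +0.2002 | -0.1613 | +0.1056 | -0.1613 | +0.2002 | -0.7010 |
| $F_c(z_k)$ | +2.0000 | +0.5768 | +0.1840 | +0.0863 | +0.0494 | +0.0863 | +0.1840 | +0.5768 |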
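The entries of table 6.1 can be checked with a few lines of code. The sketch below is an illustration under stated assumptions: the variable names are hypothetical, the sum is evaluated directly rather than by the FFT, and the phase is taken with the sample index j as in equation (6.2.28), which is the choice that reproduces the signs in the table. The computed values agree with the tabulated $F(z_k)$ to within a few units in the third decimal place.

```python
import cmath
import math

N = 8                    # 2**3 sample points
T0 = 4.0                 # full width t0 of the interval
DELTA = T0 / N           # weight W_j = delta = 1/2, equation (6.2.34)

t = [-T0 / 2 + m * DELTA for m in range(N)]          # -2.0, -1.5, ..., +1.5
samples = [math.exp(-abs(tm)) for tm in t]           # f(t) = exp(-|t|)

def discrete_transform(k):
    """delta * sum_m f(t_m) exp(+2*pi*i*k*m/N), evaluated directly."""
    total = sum(samples[m] * cmath.exp(2j * cmath.pi * k * m / N) for m in range(N))
    return DELTA * total

def continuous_transform(z):
    """Closed form of equation (6.2.33) for the full infinite interval."""
    return 2.0 / (1.0 + (2.0 * math.pi * z) ** 2)

# k beyond N/2 is identified with a negative frequency, as in the table.
z = [(k if k <= N // 2 else k - N) / T0 for k in range(N)]
F = [discrete_transform(k) for k in range(N)]
Fc = [continuous_transform(zk) for zk in z]

for k in range(N):
    print(f"{k}  {z[k]:+.4f}  {F[k].real:+.4f}  {Fc[k]:+.4f}")
```

The imaginary parts of `F` vanish to rounding error, as they must for an even function sampled in this way.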

While the function we have chosen is an even function of t, we have not chosen the points
representing that function symmetrically in the interval (-½ t0 → +½ t0). To do so would have included
each end point, but since the function is regarded to be periodic over the interval, the endpoints would not be
linearly independent and we would not have an additional distinct point. In addition, it is important to
include the point t = 0 in the calculation of the discrete transform and this would be impossible with $2^m$
points symmetrically spaced about zero.

Let us proceed with the detailed implementation of the FFT. First we must calculate the weights Wj
that appear in equation (6.2.22) by means of equation (6.2.23) so that

$$ W_j = \delta = 4/2^3 = 1/2 \, . \tag{6.2.34} $$

The first sub-division into sub-transforms involving the even and odd terms in the series specified
by equation (6.2.22) is
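$$ F_k = \delta\,(F_k^{0} + Q_k^{1} F_k^{1}) \, . \tag{6.2.35} $$

$$ \left. \begin{aligned} F_k^{0} &= F_k^{00} + Q_k^{2} F_k^{01} \\ F_k^{1} &= F_k^{10} + Q_k^{2} F_k^{11} \end{aligned} \right\} \, . \tag{6.2.36} $$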

Figure 6.1 compares the discrete Fourier transform of the function $e^{-|x|}$ with the
continuous transform for the full infinite interval. The oscillatory nature of the discrete
transform largely results from the small number of points used to represent the function and
the truncation of the function at t = ±2. The only points in the discrete transform that are
even defined are denoted by ×; the dashed line is only provided to guide the reader's eye to
the next point.

The final generation of sub-division yields

$$ \left. \begin{aligned} F_k^{00} &= F_k^{000} + Q_k^{3} F_k^{001} = f_0 + Q_k^{3} f_4 \\ F_k^{01} &= F_k^{010} + Q_k^{3} F_k^{011} = f_2 + Q_k^{3} f_6 \\ F_k^{10} &= F_k^{100} + Q_k^{3} F_k^{101} = f_1 + Q_k^{3} f_5 \\ F_k^{11} &= F_k^{110} + Q_k^{3} F_k^{111} = f_3 + Q_k^{3} f_7 \end{aligned} \right\} \, , \tag{6.2.37} $$
where
$$ \left. \begin{aligned} Q_k^{n} &= e^{2\pi i k/N_n} \\ N_n &= N/2^{(n-1)} \\ f_j &= f(t_j) \end{aligned} \right\} \, . \tag{6.2.38} $$

Here we have used the "bit-reversal" of the binary superscript of the final sub-transforms to identify which of
the data points $f(t_j)$ correspond to the respective one-point transforms. The numerical details of the
calculations specified by equations (6.2.35) - (6.2.38) are summarized in Table 6.2.
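The bit-reversal rule is easy to state in code. The following is a minimal sketch with hypothetical function names; it simply reverses the binary digits of each position p and, for N = 8, recovers the ordering $f_0, f_4, f_2, f_6, f_1, f_5, f_3, f_7$ that appears on the right-hand side of equations (6.2.37).

```python
def bit_reverse(index, bits):
    """Return the integer whose `bits`-digit binary form is that of `index` reversed."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest digit of index onto result
        index >>= 1
    return result

def bit_reversed_order(n):
    """Data index associated with each one-point sub-transform, n a power of 2."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of 2")
    return [bit_reverse(p, bits) for p in range(n)]

order = bit_reversed_order(8)
print(order)   # positions 000, 001, ..., 111 of the one-point sub-transforms
```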

Here we have allowed k to range from 0 → 8 generating an odd number of resultant answers.
However, the values for k = 0 and k = 8 are identical due to the periodicity of the function. While the
symmetry of the initial function $f(t_j)$ demands that the resultant transform be real and symmetric, some of the
sub-transforms may be complex. This can be seen in table 6.2 in the values of $F_k^1$ for k = 1, 3, 5, 7. They subsequently
cancel, as they must, in the final transform $F_k$, but their presence can affect the values for the real part of the
transform. Therefore, complex arithmetic must be used throughout the calculation. As was already
mentioned, the sub-transforms become more rapidly periodic as a function of k so that fewer and fewer
terms need be explicitly kept as the subdivision process proceeds. We have indicated this by highlighting the
numbers in table 6.2 that must be calculated. While the tabular numbers represent values that would be
required to evaluate equation (6.2.22) for any specific value of k, we may use the repetitive nature of the
sub-transforms when calculating the Fourier transform for all values of k. The highlighted numbers of table
6.2 are clearly far fewer than $N^2$, confirming the result implied by equation (6.2.30) that $N\log_2 N$ operations
will be required to calculate the discrete Fourier transform. While the saving is quite noticeable for N = 8, it
becomes monumental for large N.

The curious will have noticed that the sequence of values for $z_k$ does not correspond with the values
of $t_j$. The reason is that the particular values of k that are used are somewhat arbitrary as the Fourier
transform can always be shifted by $e^{2\pi i m/N}$ corresponding to a shift in k by +m. This simply moves on to a
different phase of the periodic function F(z). Thus, our tabular values begin with the center point z=0, and
move to the end value of +1 before starting over at the negative end value of -0.75 (note that -1 is to be
identified with +1 due to the periodicity of $F_k$). While this cyclical ranging of k seems to provide an endless
set of values of $F_k$, there are only N distinctly different values because of the periodic behavior of $F_k$. Thus
our original statement about the nature of the discrete Fourier transform - that it is defined only at a discrete
set of points - remains true.

As with most subjects in this book, there is much more to Fourier analysis than we have developed
here. We have not discussed the accuracy of such analysis and its dependence on the sampling or amount of
the initial data. The only suggestion for dealing with data missing from an equally spaced set was to
interpolate the data. Another popular approach is to add in a "fake" piece of data with f(tj) = 0 on the grounds
that it makes no direct contribution to the sums in equation (6.2.28). This is a deceptively dangerous
argument as there is an implicit assumption as to the form of the function at that point. Interpolation, as long
as it is not excessive, would appear to be a better approach.

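Table 6.2
Calculations for a Sample Fast Fourier Transform

| k | $f_k$ | $F_k^{000} = f_0$ | $F_k^{001} = f_4$ | $F_k^{010} = f_2$ | $F_k^{011} = f_6$ | $F_k^{100} = f_1$ | $F_k^{101} = f_5$ | $F_k^{110} = f_3$ | $F_k^{111} = f_7$ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 1 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 2 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 3 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 4 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 5 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 6 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 7 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |
| 8 | 0.1353 | 0.1353 | 1.0000 | 0.3678 | 0.3678 | 0.2231 | 0.6065 | 0.6065 | 0.2231 |

| k | $Q_k^3$ | $F_k^{00}$ | $F_k^{01}$ | $F_k^{10}$ | $F_k^{11}$ | $Q_k^2$ | $F_k^{0}$ | $F_k^{1}$ | $Q_k^1$ | $F_k$ | $z_k$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -1 | -.8647 | 0.0000 | -.3834 | +.3834 | +i | -.8647 + 0.0000i | -.3834 + .3834i | $(1+i)/\sqrt{2}$ | -.7010 | 0.25 |
| 2 | +1 | 1.1353 | 0.7350 | 0.8296 | 0.8296 | -1 | 0.4003 | 0.0000 | +i | 0.2002 | 0.50 |
| 3 | -1 | -.8647 | 0.0000 | -.3834 | +.3834 | -i | -.8647 + 0.0000i | -.3834 - .3834i | $(i-1)/\sqrt{2}$ | -.1613 | 0.75 |
| 4 | +1 | 1.1353 | 0.7350 | 0.8296 | 0.8296 | +1 | 1.8703 | 1.6592 | -1 | 0.1056 | 1.00 |
| 5 | -1 | -.8647 | 0.0000 | -.3834 | +.3834 | +i | -.8647 + 0.0000i | -.3834 + .3834i | $-(1+i)/\sqrt{2}$ | -.1613 | -0.75 |
| 6 | +1 | 1.1353 | 0.7350 | 0.8296 | 0.8296 | -1 | 0.4003 | 0.0000 | -i | 0.2002 | -0.50 |
| 7 | -1 | -.8647 | 0.0000 | -.3834 | +.3834 | -i | -.8647 + 0.0000i | -.3834 - .3834i | $-(i-1)/\sqrt{2}$ | -.7010 | -0.25 |
| 8 | +1 | 1.1353 | 0.7350 | 0.8296 | 0.8296 | +1 | 1.8703 | 1.6592 | +1 | 1.7648 | 0.00 |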

## 6.3 Error Analysis for Linear Least-Squares

While Fourier analysis can be used for basic numerical analysis, it is most often used for
observational data analysis. Indeed, the widest area of application of least squares is probably the analysis of
observational data. Such data is intrinsically flawed. All data, whether it results from direct observation of
the natural world or from the observation of a carefully controlled experiment, will contain errors of
observation. The equipment used to gather the information will have characteristics that limit the accuracy of
that information. This is not simply poor engineering, but at a very fundamental level, the observing
equipment is part of the phenomenon and will distort the experiment or observation. This, at least, is the
view of modern quantum theory. The inability to carry out precise observations is a limit imposed by the
very nature of the physical world. Since modern quantum theory is the most successful theory ever devised
by man, we should be mindful of the limits it imposes on observation. However, few experiments and
observational equipment approach the error limits set by quantum theory. They generally have their accuracy
set by more practical aspects of the research. Nevertheless observational and experimental errors are always
with us so we should understand their impact on the results of experiment and observation. Much of the
remaining chapters of the book will deal with this question in greater detail, but for now we shall estimate
the impact of observational errors on the parameters of least square analysis. We shall give this development
in some detail for it should be understood completely if the formalism of least squares is to be used at all.

## a. Errors of the Least Square Coefficients

Let us begin by assuming that the approximating function has the general linear form of
equation (6.1.4). Now we will assume that each observation $Y_i$ has an unspecified error $E_i$ associated with it
which, if known, could be corrected for, yielding a set of least square coefficients $a_j^0$. However, these are
unknown so that our least square analysis actually yields the set of coefficients $a_j$. If we knew both sets of
coefficients we could write
$$ \left. \begin{aligned} E_i &= Y_i - \sum_{j=0}^{n} a_j^0\,\phi_j(x_i) \\ \varepsilon_i &= Y_i - \sum_{j=0}^{n} a_j\,\phi_j(x_i) \end{aligned} \right\} \, . \tag{6.3.1} $$
Here $\varepsilon_i$ is the normal residual error resulting from the standard least square solution.

In performing the least square analysis we weighted the data by an amount $\omega_i$ so that
$$ \sum_{i=1}^{N} (\omega_i \varepsilon_i)^2 = \text{Minimum} \, . \tag{6.3.2} $$

We are interested in the error in $a_j$ resulting from the errors $E_i$ in $Y_i$ so let us define

$$ \delta a_j \equiv a_j - a_j^0 \, . \tag{6.3.3} $$

We can multiply the first of equations (6.3.1) by $\omega_i^2 \phi_k(x_i)$, sum over i, and get
$$ \sum_{j=0}^{n} a_j^0 \sum_{i=1}^{N} \omega_i^2\, \phi_j(x_i)\phi_k(x_i) = \sum_{i=1}^{N} \omega_i^2\, Y_i\, \phi_k(x_i) - \sum_{i=1}^{N} \omega_i^2\, \phi_k(x_i) E_i \, , \quad k = 0, 1, \ldots, n \, , \tag{6.3.4} $$

while the standard normal equations of the problem yield

$$ \sum_{j=0}^{n} a_j \sum_{i=1}^{N} \omega_i^2\, \phi_j(x_i)\phi_k(x_i) = \sum_{i=1}^{N} \omega_i^2\, Y_i\, \phi_k(x_i) \, , \quad k = 0, 1, \ldots, n \, . \tag{6.3.5} $$

If we subtract equation (6.3.4) from equation (6.3.5) we get an expression for $\delta a_j$.
$$ \sum_{j=0}^{n} \delta a_j \sum_{i=1}^{N} w_i\, \phi_j(x_i)\phi_k(x_i) = \sum_{j=0}^{n} \delta a_j A_{jk} = \sum_{i=1}^{N} w_i\, \phi_k(x_i) E_i \, , \quad k = 0, 1, \ldots, n \, . \tag{6.3.6} $$

Here we have replaced $\omega_i^2$ with $w_i$ as in section 1 [equation (6.1.16)]. These linear equations are basically the
normal equations where the errors of the coefficients $\delta a_j$ have replaced the least square coefficients $a_j$, and
the observational errors $E_i$ have replaced the dependent variable $Y_i$. If we knew the individual observational
errors $E_i$, we could solve them explicitly to get
$$ \delta a_j = \sum_{k=0}^{n} [A_{jk}]^{-1} \sum_{i=1}^{N} w_i\, \phi_k(x_i) E_i \, , \tag{6.3.7} $$

and we would know precisely how to correct our standard answers $a_j$ to get the "true" answers
$a_j^0$. Since we do not know the errors $E_i$, we shall have to estimate them in terms of $\varepsilon_i$, which at least is
knowable.

Unfortunately, in relating $E_i$ to $\varepsilon_i$ it will be necessary to lose the sign information on $\delta a_j$. This is a
small price to pay for determining the magnitude of the error. For simplicity let
$$ \mathbf{C} = \mathbf{A}^{-1} \, . \tag{6.3.8} $$
We can then square equation (6.3.7) and write
$$ \begin{aligned} (\delta a_j)^2 &= \left[ \sum_{k=0}^{n} C_{jk} \sum_{i=1}^{N} w_i\, \phi_k(x_i) E_i \right] \left[ \sum_{p=0}^{n} C_{jp} \sum_{q=1}^{N} w_q\, \phi_p(x_q) E_q \right] \\ &= \sum_{k=0}^{n} \sum_{p=0}^{n} C_{jk} C_{jp} \sum_{i=1}^{N} \sum_{q=1}^{N} w_i w_q\, \phi_k(x_i) \phi_p(x_q) E_i E_q \, . \end{aligned} \tag{6.3.9} $$

Here we have explicitly written out the product as we will endeavor to get rid of some of the terms by
making reasonable assumptions. For example, let us specify the manner in which the weights should be
chosen so that
$$ \omega_i E_i = \text{const.} \tag{6.3.10} $$

While we do not know the value of Ei, in practice, one usually knows something about the expected error
distribution. The value of the constant in equation (6.3.10) doesn't matter since it will drop out of the normal
equations. Only the distribution of Ei matters and the data should be weighted accordingly.

We shall further assume that the error distribution of $E_i$ is anti-symmetric about zero. This is a less
justifiable assumption and should be carefully examined in all cases where the error analysis for least squares
is used. However, note that the distribution need only be anti-symmetric about zero; it need not be
distributed like a Gaussian or normal error curve, since both the weights and the product $\phi(x_i)\phi(x_q)$ are
symmetric in i and q. Thus if we choose a negative error, say, $E_q$ to be paired with a positive error, say, $E_i$ we
get
$$ \sum_{i=1}^{N} \sum_{\substack{q=1 \\ i \ne q}}^{N} w_i w_q\, \phi_k(x_i) \phi_p(x_q) E_i E_q = 0 \, , \quad \forall\; k = 0, 1, \ldots, n \, , \; p = 0, 1, \ldots, n \, . \tag{6.3.11} $$

Therefore only terms where i=q survive in equation (6.3.9) and we may write it as
$$ (\delta a_j)^2 = (\omega E)^2 \sum_{k=0}^{n} C_{jk} \sum_{p=0}^{n} C_{jp} \sum_{i=1}^{N} w_i\, \phi_k(x_i) \phi_p(x_i) = (\omega E)^2 \sum_{k=0}^{n} C_{jk} \left[ \sum_{p=0}^{n} C_{jp} A_{pk} \right] . \tag{6.3.12} $$
Since $\mathbf{C} = \mathbf{A}^{-1}$ [i.e. equation (6.3.8)], the term in large brackets on the far right-hand-side is the Kronecker
delta $\delta_{jk}$ and the expression for $(\delta a_j)^2$ simplifies to
$$ (\delta a_j)^2 = (\omega E)^2 \sum_{k=0}^{n} C_{jk}\, \delta_{jk} = (\omega E)^2\, C_{jj} \, . \tag{6.3.13} $$

The elements $C_{jj}$ are just the diagonal elements of the inverse of the normal equation matrix and can be found
as a by-product of solving the normal equations. Thus the square error in $a_j$ is just the mean weighted square
error of the data multiplied by the appropriate diagonal element of the inverse of the normal equation matrix.
To produce a useful result, we must estimate $(\omega E)^2$.

## b. The Relation of the Weighted Mean Square Observational Error to the Weighted Mean Square Residual

If we subtract the second of equations (6.3.1) from the first, we get

$$ E_i - \varepsilon_i = \sum_{j=0}^{n} \delta a_j\, \phi_j(x_i) = \sum_{j=0}^{n} \phi_j(x_i) \sum_{k=0}^{n} C_{jk} \sum_{q=1}^{N} w_q\, \phi_k(x_q) E_q \, . \tag{6.3.14} $$

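Now multiply by $w_i \varepsilon_i$ and sum over all i. Re-arranging the summations we can write
$$ \sum_{i=1}^{N} w_i \varepsilon_i E_i - \sum_{i=1}^{N} w_i \varepsilon_i^2 = \sum_{i=1}^{N} w_i \varepsilon_i \sum_{j=0}^{n} \delta a_j\, \phi_j(x_i) = \sum_{j=0}^{n} \sum_{k=0}^{n} \sum_{q=1}^{N} C_{jk}\, w_q\, \phi_k(x_q) E_q \left[ \sum_{i=1}^{N} w_i \varepsilon_i\, \phi_j(x_i) \right] . \tag{6.3.15} $$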

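But the last term in brackets can be obtained from the definition of least squares to be
$$ \frac{\partial \sum_{i=1}^{N} w_i \varepsilon_i^2}{\partial a_j} = 2 \sum_{i=1}^{N} w_i \varepsilon_i \frac{\partial \varepsilon_i}{\partial a_j} = 2 \sum_{i=1}^{N} \phi_j(x_i)\, w_i \varepsilon_i = 0 \, , \tag{6.3.16} $$

so that
$$ \sum_{i=1}^{N} w_i E_i \varepsilon_i = \sum_{i=1}^{N} w_i \varepsilon_i^2 \, . \tag{6.3.17} $$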

Now multiply equation (6.3.14) by $w_i E_i$ and sum over all i. Again rearranging the order of summation we get
$$ \begin{aligned} \sum_{i=1}^{N} w_i E_i^2 - \sum_{i=1}^{N} w_i E_i \varepsilon_i &= \sum_{i=1}^{N} w_i E_i \sum_{j=0}^{n} \delta a_j\, \phi_j(x_i) \\ &= \sum_{j=0}^{n} \sum_{k=0}^{n} \sum_{q=1}^{N} \sum_{i=1}^{N} C_{jk}\, w_i w_q\, \phi_j(x_i) \phi_k(x_q) E_q E_i = \sum_{j=0}^{n} \sum_{k=0}^{n} C_{jk} \sum_{i=1}^{N} w_i^2 E_i^2\, \phi_j(x_i) \phi_k(x_i) \, , \end{aligned} \tag{6.3.18} $$

where we have used equation (6.3.11) to arrive at the last expression for the right hand side. Making use of
equation (6.3.10) we can further simplify equation (6.3.18) to get
$$ N(\omega E)^2 - \sum_{i=1}^{N} w_i E_i \varepsilon_i = (\omega E)^2 \sum_{j=0}^{n} \sum_{k=0}^{n} C_{jk} A_{jk} = n\,(\omega E)^2 \, . \tag{6.3.19} $$

Combining this with equation (6.3.17) we can write

$$ (\omega E)^2 = \frac{1}{N-n} \sum_{i=1}^{N} (\omega_i \varepsilon_i)^2 \, , \tag{6.3.20} $$

and finally express the error in $a_j$ [see equation (6.3.13)] as

$$ (\delta a_j)^2 = \left[ \frac{C_{jj}}{N-n} \right] \sum_{i=1}^{N} (\omega_i \varepsilon_i)^2 \, . \tag{6.3.21} $$
Here everything on the right hand side is known and is a product of the least square solution. However, to
obtain the εi's we would have to recalculate each residual after the solution has been found. For problems
involving large quantities of data, this would double the effort.
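The algebra linking equations (6.3.17), (6.3.19) and (6.3.20) is brief enough to write out. Substituting equation (6.3.17) into equation (6.3.19) and recalling that $w_i = \omega_i^2$ gives the following steps, which are offered as a reader's check on the text rather than as part of the original derivation.

```latex
\begin{aligned}
N(\omega E)^2 - \sum_{i=1}^{N} w_i \varepsilon_i^2 &= n\,(\omega E)^2 \\
(N-n)\,(\omega E)^2 &= \sum_{i=1}^{N} w_i \varepsilon_i^2 = \sum_{i=1}^{N} (\omega_i \varepsilon_i)^2 \\
(\omega E)^2 &= \frac{1}{N-n} \sum_{i=1}^{N} (\omega_i \varepsilon_i)^2
\end{aligned}
```

Inserting the last line into equation (6.3.13) reproduces equation (6.3.21).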

## c. Determining the Weighted Mean Square Residual

To express the weighted mean square residual in equation (6.3.21) in terms of parameters
generated during the initial solution, consider the following geometrical argument. The $\phi_j(x)$'s are all linearly
independent so they can form the basis of a vector space in which the $f(a_j,x_i)$'s can be expressed (see figure
6.2).

The values of $f(a_j,x_i)$ that result from the least square solution are a linear combination of the $\phi_j(x_i)$'s
where the constants of proportionality are the $a_j$'s. However, the values of the independent variable are also
independent of each other so that the length of any vector is totally uncorrelated with the length of any other
and its location in the vector space will be random [note: the space is linear in the $a_j$'s, but the component
lengths depend on $\phi_j(x)$]. Therefore the magnitude of the square of the vector sum of the $\vec{f}_i$'s will grow as
the square of the individual vectors. Thus, if $\vec{F}$ is the vector sum of all the individual vectors $\vec{f}_i$ then its
magnitude is just
$$ |\vec{F}|^2 = \sum_{i=1}^{N} f^2(a_j, x_i) \, . \tag{6.3.22} $$

The observed values for the dependent variable $Y_i$ are in general not equal to the corresponding $f(a_j,x_i)$ so
they cannot be embedded in the vector space formed by the $\phi_j(x_i)$'s. Therefore figure 6.2 depicts them lying
above (or out of) the vector space. Indeed the difference between them is just $\varepsilon_i$. Again, the $Y_i$'s are
independent so the magnitude of the vector sum of the $\vec{Y}_i$'s and the $\vec{\varepsilon}_i$'s is
$$ \left. \begin{aligned} |\vec{Y}|^2 &= \sum_{i=1}^{N} Y_i^2 \\ |\vec{\varepsilon}|^2 &= \sum_{i=1}^{N} \varepsilon_i^2 \end{aligned} \right\} \, . \tag{6.3.23} $$

Figure 6.2 shows the parameter space defined by the φj(x)'s. Each f(aj,xi) can be
represented as a linear combination of the φj(xi) where the aj are the coefficients of the basis
functions. Since the observed variables Yi cannot be expressed in terms of the φj(xi), they lie
out of the space.

Since least squares seeks to minimize $\sum \varepsilon_i^2$, that will be accomplished when the tip of $\vec{Y}$ lies over the tip of
$\vec{F}$ so that $\vec{\varepsilon}$ is perpendicular to the $\phi_j(x)$ vector space. Thus we may apply the theorem of Pythagoras (in
n-dimensions if necessary) to write
$$ \sum_{i=1}^{N} w_i \varepsilon_i^2 = \sum_{i=1}^{N} w_i Y_i^2 - \sum_{i=1}^{N} w_i f^2(a_j, x_i) \, . \tag{6.3.24} $$

Here we have included the square weights $w_i$ as their inclusion in no way changes the result. From the
definition of the mean square residual we have

$$ \sum_{i=1}^{N} (\omega_i \varepsilon_i)^2 = \sum_{i=1}^{N} w_i [Y_i - f(a_j, x_i)]^2 = \sum_{i=1}^{N} w_i Y_i^2 - 2 \sum_{i=1}^{N} w_i Y_i f(a_j, x_i) + \sum_{i=1}^{N} w_i f^2(a_j, x_i) \, , \tag{6.3.25} $$

which if we combine with equation (6.3.24) will allow us to eliminate the quadratic term in $f^2$ so that
equation (6.3.21) finally becomes
$$ (\delta a_j)^2 = \left[ \frac{C_{jj}}{N-n} \right] \left( \left[ \sum_{i=1}^{N} w_i Y_i^2 \right] - \sum_{k=0}^{n} a_k \left[ \sum_{i=1}^{N} w_i Y_i\, \phi_k(x_i) \right] \right) . \tag{6.3.26} $$

The term in the square brackets on the far right hand side is the constant vector of the normal equations.
Then the only unknown term in the expression for $\delta a_j$ is the scalar term $[\sum w_i Y_i^2]$, which can easily be
generated during the formation of the normal equations. Thus it is possible to estimate the effect of errors in
the data on the solution set of least square coefficients using nothing more than the constant vector of the
normal equations, the diagonal elements of the inverse matrix of the normal equations, the solution itself,
and the weighted sum squares of the dependent variables. This amounts to a trivial calculation compared to
the solution of the initial problem and should be part of any general least square program.
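The bookkeeping implied by equation (6.3.26) can be sketched as follows. This is an illustration under stated assumptions and not a general least square program: the data are invented for the example, the basis is the hypothetical choice $\phi_j(x) = x^j$ with n = 2, the weights are unity, the helper `invert` is a small Gauss-Jordan routine written for the occasion, and the divisor N - n is used exactly as printed in equation (6.3.21).

```python
import math

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, adequate for a small normal matrix."""
    size = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if r == c else 0.0 for c in range(size)]
           for r, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("normal equation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * u for v, u in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]     # phi_0, phi_1, phi_2
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [1.1, 1.9, 5.2, 9.8, 17.1, 26.0, 36.9]                # invented observations
w = [1.0] * len(x)                                        # w_i = omega_i**2
N = len(x)
n = len(basis) - 1

# Normal equation matrix A_jk and constant vector B_k
A = [[sum(wi * pj(xi) * pk(xi) for wi, xi in zip(w, x)) for pk in basis] for pj in basis]
B = [sum(wi * Yi * pk(xi) for wi, xi, Yi in zip(w, x, Y)) for pk in basis]
C = invert(A)
a = [sum(C[j][k] * B[k] for k in range(n + 1)) for j in range(n + 1)]

# Weighted sum square residual from quantities formed with the normal equations
sum_wY2 = sum(wi * Yi * Yi for wi, Yi in zip(w, Y))
chi2_shortcut = sum_wY2 - sum(a[k] * B[k] for k in range(n + 1))

# The same quantity obtained by recalculating every residual, for comparison only
chi2_direct = sum(wi * (Yi - sum(a[j] * basis[j](xi) for j in range(n + 1))) ** 2
                  for wi, xi, Yi in zip(w, x, Y))

# Equation (6.3.26): formal error of each coefficient
sigma_a = [math.sqrt(C[j][j] * chi2_shortcut / (N - n)) for j in range(n + 1)]
```

The two values of the residual sum agree to rounding error, so the second pass through the data is not needed. The subtraction in `chi2_shortcut` does involve two nearly equal numbers when the fit is good, so some loss of significant figures should be expected.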

## d. The Effects of Errors in the Independent Variable

Throughout the discussion in this section we have investigated the effects of errors in the
dependent variable. We have assumed that there is no error in the independent variable. Indeed the least
square norm itself makes that assumption. The "best" solution in the least square sense is that which
minimizes the sum square of the residuals. Knowledge of the independent variable is assumed to be precise.
If this is not true, then real problems emerge for the least square algorithm. The general problem of
uncorrelated and unknown errors in both x and Y has never been solved. There do exist algorithms that deal
with the problem where the ratio of the errors in Y to those in x is known to be a constant. They basically
involve a coordinate rotation through an angle α = tan(x/y) followed by the regular analysis. If the
approximating function is particularly simple (e.g. a straight line), it may be possible to invert the defining
equation and solve the problem with the role of independent and dependent variable interchanged. If the
solution is the same (allowing for the transformation of variables) within the formal errors of the solution,
then some confidence may be gained that a meaningful solution has been found. Should they differ by more
than the formal error then the analysis is inappropriate and no weight should be attached to the solution.
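For a straight line the interchange test can be sketched in a few lines. The example below is an illustration with invented data and hypothetical names. The slope error comes from equation (6.3.21) with n = 1. The decision rule, which compares the difference of the two slopes with their formal errors added in quadrature, is an assumption made for this sketch and is not prescribed by the text.

```python
import math

def fit_line(u, v):
    """Unweighted least square fit v = c0 + c1*u; returns (c0, c1, sigma_c1)."""
    N = len(u)
    su, sv = sum(u), sum(v)
    suu = sum(ui * ui for ui in u)
    suv = sum(ui * vi for ui, vi in zip(u, v))
    det = N * suu - su * su                  # determinant of the 2x2 normal matrix
    c1 = (N * suv - su * sv) / det
    c0 = (sv - c1 * su) / N
    resid2 = sum((vi - c0 - c1 * ui) ** 2 for ui, vi in zip(u, v))
    C11 = N / det                            # diagonal element of the inverse for the slope
    sigma_c1 = math.sqrt(C11 * resid2 / (N - 1))    # equation (6.3.21) with n = 1
    return c0, c1, sigma_c1

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]           # invented observations

a0, a1, sigma_a1 = fit_line(x, Y)            # Y regarded as the dependent variable
b0, b1, sigma_b1 = fit_line(Y, x)            # roles interchanged: x = b0 + b1*Y

slope_from_inverse = 1.0 / b1                # transform back to dY/dx
sigma_inverse = sigma_b1 / b1 ** 2           # first order propagation of the error in b1

difference = abs(a1 - slope_from_inverse)
combined = math.sqrt(sigma_a1 ** 2 + sigma_inverse ** 2)
consistent = difference <= combined          # assumed criterion for agreement
```

For these data the two slopes differ by far less than their formal errors, so the two solutions would be regarded as the same.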

Unfortunately, inversion of all but the simplest problems will generally result in a non-linear system
of equations if the inversion can be found at all. So in the next section we will discuss how one can approach
a least square problem where the normal equations are non-linear.

## 6.4 Non-linear Least Squares

In general, the problem of non-linear least squares is fraught with all the complications to be found
with any non-linear problem. One must be concerned with the uniqueness of the solution and the non-linear
propagation of errors. Both of these basic problems can cause great difficulty with any solution. The simplest
approach to the problem is to use the definition of least squares to generate the normal equations so that
$$ \sum_{i=1}^{N} w_i\, [Y_i - f(a_j, x_i)]\, \frac{\partial f(a_j, x_i)}{\partial a_j} = 0 \, , \quad j = 0, 1, \ldots, n \, . \tag{6.4.1} $$

These n+1 non-linear equations must then be solved by whatever means one can find for the solution of
non-linear systems of equations. Usually some sort of fixed-point iteration scheme, such as Newton-
Raphson, is used. However, the error analysis may become as big a problem as the initial least square
problem itself. Only when the basic equations of condition will give rise to stable equations should the direct
method be tried. Since one will probably have to resort to iterative schemes at some point in the solution, a
far more common approach is to linearize the non-linear equations of condition and solve them iteratively.
This is generally accomplished by linearizing the equations in the vicinity of the answer and then solving the
linear equations for a solution that is closer to the answer. The process is repeated until a sufficiently
accurate solution is achieved. This can be viewed as a special case of a fixed-point iteration scheme where
one is required to be relatively near the solution.

In order to find appropriate starting values it is useful to understand precisely what we are trying to
accomplish. Let us regard the sum square of the residuals as a function of the regression coefficients $a_j$ so
that
$$ \sum_{i=1}^{N} w_i\, [Y_i - f(a_j, x_i)]^2 = \sum_{i=1}^{N} w_i \varepsilon_i^2 = \chi^2(a_j) \, . \tag{6.4.2} $$

For the moment, we shall use the short hand notation of $\chi^2$ to represent the sum square of the residuals.
While the function $f(a_j,x)$ is no longer linear in the $a_j$'s they may be still regarded as independent and
therefore can serve to define a space in which $\chi^2$ is defined. Our non-linear least square problem can be
geometrically interpreted to be finding the minimum in the $\chi^2$ hypersurface (see figure 6.3). If one has no
prior knowledge of the location of the minima of the $\chi^2$ surface, it is best to search the space with a coarse
multidimensional grid. If the number of variables $a_j$ is large, this can be a costly search, for if one picks m
values of each variable $a_j$, one has $m^n$ functional evaluations of equation (6.4.2) to make. Such a search may
not locate all the minima and it is unlikely to definitively locate the deepest and therefore most desirable
minimum. However, it should identify a set(s) of parameters $a_k^0$ from which one of the following schemes
will find the true minimum.
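A coarse grid search of this kind might be sketched as below. Everything specific in the example is an assumption made for illustration: the two-parameter model $f = a_0 e^{-a_1 x}$, the noise-free synthetic data, the search ranges, and the function names. With m values for each of two parameters the search costs $m^2$ evaluations of equation (6.4.2).

```python
import itertools
import math

def model(a, xi):
    """Hypothetical non-linear model f(a, x) = a0 * exp(-a1 * x)."""
    return a[0] * math.exp(-a[1] * xi)

def chi_squared(a, x, Y, w):
    """Weighted sum square of the residuals, equation (6.4.2)."""
    return sum(wi * (Yi - model(a, xi)) ** 2 for wi, xi, Yi in zip(w, x, Y))

def coarse_grid_search(x, Y, w, ranges, m):
    """Evaluate chi-squared on an m-point grid in each parameter and keep the lowest."""
    axes = [[lo + (hi - lo) * s / (m - 1) for s in range(m)] for lo, hi in ranges]
    evaluations = 0
    best, best_chi = None, math.inf
    for trial in itertools.product(*axes):
        evaluations += 1
        chi = chi_squared(trial, x, Y, w)
        if chi < best_chi:
            best, best_chi = trial, chi
    return best, best_chi, evaluations

x = [0.5 * i for i in range(11)]
Y = [2.0 * math.exp(-0.5 * xi) for xi in x]      # synthetic data with a0 = 2, a1 = 0.5
w = [1.0] * len(x)

start, chi_start, count = coarse_grid_search(x, Y, w, [(0.0, 4.0), (0.0, 2.0)], 9)
```

Here the grid happens to contain the true parameters, so the search lands on them exactly. In practice the best grid point only supplies the starting values $a_k^0$ for one of the schemes that follow.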

We will consider two basic approaches to the problem of locating these minima. There are others,
but they are either logically equivalent to those given here or very closely related to them. Basically we shall
assume that we are near the true minimum so that first order changes to the solution set $a_k^0$ will lead us to
that minimum. The primary differences in the methods are the manner by which the equations are
formulated.

A reasonable way to approach the problem of finding a minimum in $\chi^2$-space would be to
change the values of $a_j$ so that one is moving in the direction which yields the largest change in the value of
$\chi^2$. This will occur in the direction of the gradient of the surface so that

$$ \left. \begin{aligned} \nabla\chi^2 &= \sum_{j=0}^{n} \frac{\partial\chi^2}{\partial a_j}\, \hat{a}_j \\ \frac{\partial\chi^2}{\partial a_j} &= \frac{\chi^2(a_j^0 + \Delta a_j) - \chi^2(a_j^0)}{\Delta a_j} \end{aligned} \right\} \, . \tag{6.4.3} $$

We can calculate this by making small changes $\Delta a_j$ in the parameters and evaluating the components of the
gradient in accordance with the second of equations (6.4.3). Alternately, we can use the definition of least
squares and calculate
$$ \nabla_j \chi^2 = \frac{\partial\chi^2}{\partial a_j} = 2 \sum_{i=1}^{N} w_i\, [Y_i - f(a_j, x_i)]\, \frac{\partial f(a_j, x_i)}{\partial a_j} \, . \tag{6.4.4} $$

If the function $f(a_j,x)$ is not too complicated and has closed form derivatives, this is by far the preferable
manner to obtain the components of $\nabla\chi^2$. However, we must exercise some care as the components of $\nabla\chi^2$
are not dimensionless. In general, one should formulate a numerical problem so that the units don't get in the
way. This means normalizing the components of the gradient in some fashion. For example we could define
$$ \xi_j = \frac{[a_j \nabla_j \chi^2 / \chi^2]}{\sum_{k=0}^{n} a_k \nabla_k \chi^2 / \chi^2} = \frac{a_j \nabla_j \chi^2}{\sum_{k=0}^{n} a_k \nabla_k \chi^2} \, , \tag{6.4.5} $$

which is a sort of normalized gradient with unit magnitude. The next problem is how far to apply the
gradient in obtaining the next guess. A conservative possibility is to use $\Delta a_j$ from equation (6.4.3) so that

$$ \delta a_j = \Delta a_j / \xi_j \, . \tag{6.4.6} $$

In order to minimize computational time, the direction of the gradient is usually maintained until $\chi^2$ begins
to increase. Then it is time to re-evaluate the gradient. One of the difficulties of the method of steepest
descent is that the values of the gradient of $\chi^2$ vanish as one approaches a minimum. Therefore the method
becomes unstable as one approaches the answer in the same manner and for the same reasons that
Newton-Raphson fixed-point iteration became unstable in the vicinity of multiple roots. Thus we shall have to find
another approach.
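Before leaving the gradient method, its mechanics can be sketched briefly. The example is an illustration under stated assumptions: the model $f = a_0 e^{-a_1 x}$, the synthetic data and all names are hypothetical, the gradient is estimated by the finite difference in the second of equations (6.4.3), the direction is normalized to a simple unit vector rather than by equation (6.4.5), and halving the step when no progress is made is a choice made only for this sketch.

```python
import math

x = [0.5 * i for i in range(11)]
Y = [2.0 * math.exp(-0.5 * xi) for xi in x]      # synthetic, noise-free data
w = [1.0] * len(x)

def model(a, xi):
    return a[0] * math.exp(-a[1] * xi)

def chi2(a):
    return sum(wi * (Yi - model(a, xi)) ** 2 for wi, xi, Yi in zip(w, x, Y))

def gradient(a, da=1.0e-6):
    """Finite-difference components of the gradient, second of equations (6.4.3)."""
    base = chi2(a)
    grad = []
    for j in range(len(a)):
        shifted = list(a)
        shifted[j] += da
        grad.append((chi2(shifted) - base) / da)
    return grad

def steepest_descent(a, step=0.1, min_step=1.0e-8, max_sweeps=500):
    a = list(a)
    chi = chi2(a)
    for _sweep in range(max_sweeps):
        g = gradient(a)
        norm = math.sqrt(sum(gj * gj for gj in g))
        if norm == 0.0 or step < min_step:
            break
        direction = [-gj / norm for gj in g]     # downhill unit vector
        moved = False
        for _march in range(1000):               # keep the direction until chi2 rises
            trial = [aj + step * dj for aj, dj in zip(a, direction)]
            chi_trial = chi2(trial)
            if chi_trial >= chi:
                break
            a, chi, moved = trial, chi_trial, True
        if not moved:
            step *= 0.5                          # no progress: shorten the step
    return a, chi

start = [1.5, 0.8]
chi_start = chi2(start)
a_final, chi_final = steepest_descent(start)
```

The value of $\chi^2$ can never increase from one sweep to the next, but progress slows as the gradient shrinks near the minimum, which is the weakness noted above.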

Figure 6.3 shows the $\chi^2$ hypersurface defined on the $a_j$ space. The non-linear least square
seeks the minimum regions of that hypersurface. The gradient method moves the iteration in
the direction of steepest descent based on local values of the derivative, while surface fitting
tries to locally approximate the function in some simple way and determines the local
analytic minimum as the next guess for the solution.

## b. Linear approximation of f(aj,x)

Let us consider approximating the non-linear function $f(a_j,x)$ by a Taylor series in $a_j$. To the
extent that we are near the solution, this should yield good results. A multi-variable expansion of $f(a_j,x)$
around the present values $a_j^0$ of the least square coefficients is
$$ f(a_j, x) = f(a_j^0, x) + \sum_{k=0}^{n} \frac{\partial f(a_j^0, x)}{\partial a_k}\, \delta a_k \, . \tag{6.4.7} $$
If we substitute this expression for $f(a_j,x)$ into the definition for the sum-square residual $\chi^2$, we get
$$ \chi^2 = \sum_{i=1}^{N} w_i\, [Y_i - f(a_j, x_i)]^2 = \sum_{i=1}^{N} w_i \left[ Y_i - f(a_j^0, x_i) - \sum_{k=0}^{n} \frac{\partial f(a_j^0, x_i)}{\partial a_k}\, \delta a_k \right]^2 . \tag{6.4.8} $$

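This expression is linear in $\delta a_j$ so we can use the regular methods of linear least squares to write the normal
equations as
$$ \frac{\partial\chi^2}{\partial\,\delta a_p} = 2 \sum_{i=1}^{N} w_i \left[ Y_i - f(a_j^0, x_i) - \sum_{k=0}^{n} \frac{\partial f(a_j^0, x_i)}{\partial a_k}\, \delta a_k \right] \frac{\partial f(a_j^0, x_i)}{\partial a_p} = 0 \, , \quad p = 0, 1, \ldots, n \, , \tag{6.4.9} $$
which can be put in the standard form of a set of linear algebraic equations for $\delta a_k$ so that

$$ \left. \begin{aligned} & \sum_{k=0}^{n} \delta a_k A_{kp} = B_p \, , \quad p = 0, 1, \ldots, n \\ & A_{kp} = \sum_{i=1}^{N} w_i\, \frac{\partial f(a_j^0, x_i)}{\partial a_k}\, \frac{\partial f(a_j^0, x_i)}{\partial a_p} \, , \quad k = 0, 1, \ldots, n \, , \; p = 0, 1, \ldots, n \\ & B_p = \sum_{i=1}^{N} w_i\, [Y_i - f(a_j^0, x_i)]\, \frac{\partial f(a_j^0, x_i)}{\partial a_p} \, , \quad p = 0, 1, \ldots, n \end{aligned} \right\} \, . \tag{6.4.10} $$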

The derivative of f(aj,x) that appears in equations (6.4.9) and (6.4.10) can either be found analytically or
numerically by finite differences where
$$\frac{\partial f(a_j,x_i)}{\partial a_p} = \frac{f[a_j^0,(a_p^0+\Delta a_p),x_i] - f(a_j^0,a_p^0,x_i)}{\Delta a_p} . \qquad (6.4.11)$$
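
The finite-difference derivative of equation (6.4.11) can be sketched in a few lines of Python. This is only an illustration: the function name `partial_derivative`, the exponential test model, and the step size of 10⁻⁶ are assumptions made here and do not come from the text.

```python
import math

def partial_derivative(f, a, p, x, step=1.0e-6):
    """Forward-difference estimate of df/da_p at the parameters a (eq. 6.4.11)."""
    shifted = list(a)
    shifted[p] += step              # a_p^0 + delta a_p, every other a_j unchanged
    return (f(shifted, x) - f(a, x)) / step

# Hypothetical model used only for illustration: f(a, x) = a0 * exp(a1 * x)
def model(a, x):
    return a[0] * math.exp(a[1] * x)

a_start = [2.0, 0.5]
x = 1.0
numeric = partial_derivative(model, a_start, 1, x)
analytic = a_start[0] * x * math.exp(a_start[1] * x)   # exact d/da1 of a0*exp(a1*x)
print(numeric, analytic)
```

The step must be small enough to keep the truncation error low yet large enough that round-off in the subtraction does not dominate, so a suitable value depends on the scale of the parameter.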
While the equations (6.4.10) are linear in δak, they can be viewed as being quadratic in ak. Consider any
expansion of ak in terms of χ2 such as
$$a_k = q_0 + q_1\chi^2 + q_2\chi^4 . \qquad (6.4.12)$$
The variation of ak will then have the form
$$\delta a_k = q_1 + 2q_2\chi^2 , \qquad (6.4.13)$$

which is clearly linear in χ2. This result therefore represents a parabolic fit to the hypersurface χ2 with the
condition that δak is zero at the minimum value of χ2. The solution of equations (6.4.10) provides the
location of the minimum of the χ2 hypersurface to the extent that the minimum can locally be well
approximated by a parabolic hypersurface. This will certainly be the case when we are near the solution
which is precisely where the method of steepest descent fails.

It is worth noting that the constant vector of the normal equations is just half of the components of
the gradient given in equation (6.4.4). Thus it seems reasonable that we could combine this approach with
the method of steepest descent. One approach to this is given by Marquardt³. Since we were somewhat
arbitrary about the distance we would follow the gradient in a single step, we could modify the diagonal
elements of equations (6.4.10) so that
$$\left. \begin{aligned}
A'_{kk} &= A_{kk}(1+\lambda) , & k &= 0, 1, \ldots, n \\
A'_{kp} &= A_{kp} , & k &\neq p
\end{aligned} \right\} . \qquad (6.4.14)$$

Clearly as λ increases, the solution approaches the method of steepest descent since
$$\lim_{\lambda\to\infty} \delta a_k = \frac{B_k}{\lambda A_{kk}} . \qquad (6.4.15)$$


All that remains is to find an algorithm for choosing λ. For small values of λ, the method approaches the first
order method for δak. Therefore we will choose λ small (say about 10⁻³) so that the δak's are given by the
solution to equations (6.4.10). We can use that solution to re-compute χ2. If
$$\chi^2(\vec{a}+\delta\vec{a}) > \chi^2(\vec{a}) , \qquad (6.4.16)$$
then increase λ by a factor of 10 and repeat the step. However, if condition (6.4.16) fails and the value of χ2
is decreasing, then decrease λ by a factor of 10, adopt the new values of ak and continue. This allows the
analytic fitting procedure to be employed where it works the best - near the solution, and utilizes the method
of steepest descent where it will give a more reliable answer - well away from the minimum. We still must
determine the accuracy of our solution.
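
The λ-adjustment scheme just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the exponential test model, the synthetic noise-free data, the stopping tolerance, and all function names (`marquardt_fit`, `solve_linear`, and so on) are hypothetical choices made here. The normal equations are those of (6.4.10), with the diagonal scaled as in (6.4.14).

```python
import math

def model(a, x):
    # Hypothetical test model: f(a, x) = a0 * exp(a1 * x)
    return a[0] * math.exp(a[1] * x)

def model_gradient(a, x):
    # Analytic partial derivatives of the model with respect to a0 and a1
    e = math.exp(a[1] * x)
    return [e, a[0] * x * e]

def chi_square(a, xs, ys, ws):
    return sum(w * (y - model(a, x)) ** 2 for x, y, w in zip(xs, ys, ws))

def solve_linear(A, B):
    # Gaussian elimination with partial pivoting, then back substitution
    n = len(B)
    M = [row[:] + [b] for row, b in zip(A, B)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (M[r][n] - known) / M[r][r]
    return solution

def marquardt_fit(a_start, xs, ys, ws, lam=1.0e-3, max_iter=200, tol=1.0e-12):
    a = list(a_start)
    n = len(a)
    current = chi_square(a, xs, ys, ws)
    for _ in range(max_iter):
        # Normal equations (6.4.10) evaluated at the present parameters
        A = [[0.0] * n for _ in range(n)]
        B = [0.0] * n
        for x, y, w in zip(xs, ys, ws):
            g = model_gradient(a, x)
            residual = y - model(a, x)
            for p in range(n):
                B[p] += w * residual * g[p]
                for k in range(n):
                    A[k][p] += w * g[k] * g[p]
        for k in range(n):
            A[k][k] *= 1.0 + lam        # diagonal scaling of equation (6.4.14)
        delta = solve_linear(A, B)
        trial = [ai + di for ai, di in zip(a, delta)]
        trial_chi = chi_square(trial, xs, ys, ws)
        if trial_chi > current:
            lam *= 10.0                 # condition (6.4.16) holds: reject the step
        else:
            lam /= 10.0                 # accept the step and relax toward (6.4.10)
            improvement = current - trial_chi
            a, current = trial, trial_chi
            if improvement < tol:
                break
    return a, current

# Synthetic, noise-free data generated from a0 = 2.0, a1 = -0.5
xs = [0.5 * i for i in range(9)]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]
ws = [1.0] * len(xs)
a_fit, chi_fit = marquardt_fit([1.0, -0.1], xs, ys, ws)
print(a_fit, chi_fit)
```

A rejected step leaves the parameters untouched and only raises λ, so the next attempt is shorter and lies closer to the direction of the gradient.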

## c. Errors of the Least Squares Coefficients

The error analysis for the non-linear case turns out to be incredibly simple. True, we will
have to make some additional assumptions to those we made in section 6.3, but they are reasonable
assumptions. First, we must assume that we have reached a minimum. Sometimes it is not clear what
constitutes a minimum. For example, if the minimum in χ2 hyperspace is described by a valley of uniform
depth, then the solution is not unique, as a wide range of one variable will minimize χ2. The error in this
variable is large and equal at least to the length of the valley. While the method we are suggesting will give
reliable answers to the formal errors for aj when the approximation accurately matches the χ2 hypersurface,
when it does not the errors will be unreliable. The error estimate relies on the linearity of the approximating
function in δaj.

In the vicinity of the χ2 minimum
$$\delta a_j = a_j - a_j^0 . \qquad (6.4.17)$$

For the purposes of the linear least squares solution that produces δaj, the initial value aj0 is a constant devoid
of any error. Thus when we arrive at the correct solution, the error estimates for δaj will provide the estimate
for the error in aj itself since
$$\Delta(\delta a_j) = \Delta a_j - \Delta[a_j^0] = \Delta a_j . \qquad (6.4.18)$$

Thus the error analysis we developed for linear least squares in section 6.3 will apply here to finding the
error estimates for δaj and hence for aj itself. This is one of the virtues of iterative approaches. All past sins
are forgotten at the end of each iteration. Any iteration scheme that converges to a fixed-point is in some real
sense a good one. To the extent that the approximating function at the last step is an accurate representation
of the χ2 hypersurface, the error analysis of the linear least squares is equivalent to doing a first order
perturbation analysis about the solution for the purposes of estimating the errors in the coefficients
representing the coordinates of the hyperspace function. As we saw in section 6.3, we can carry out that error
analysis for almost no additional computing cost.

One should keep in mind all the caveats that apply to the error estimates for non-linear least squares.
They are accurate only as long as the approximating function fits the hyperspace. The error distribution of
the independent variable is assumed to be anti-symmetric. In the event that all the conditions are met, the
errors are just what are known as the formal errors and should be taken to represent the minimum errors of
the parameters.


## 6.5 Other Approximation Norms

Up to this point we have used the Legendre Principle of Least Squares to approximate or "fit" our data
points. As long as this dealt with experimental data or other forms of data which contained intrinsic errors,
one could justify the Least Square norm on statistical grounds (as long as the error distribution met certain
criteria). However, consider the situation where one desires a computer algorithm to generate, say, sin(x)
over some range of x such as 0 ≤ x ≤ π/4. If one can manage this, then from multiple angle formulae, it is possible
to generate sin(x) for any value of x. Since at a very basic level, digital computers only carry out arithmetic,
one would need to find some approximating function that can be computed arithmetically to represent the
function sin(x) accurately over that interval. A criterion that required the average error of computation to be
less than ε is not acceptable. Instead, one would like to be able to guarantee that the computational error
would always be less than εmax. An approximating norm that will accomplish this is known as the
Chebyschev norm and is sometimes called the "mini-max" norm. Let us define the maximum value of a
function h(x) over some range of x to be
hmax ≡Max│h(x)│ ∀ allowed x . (6.5.1)

Now assume that we have a function Y(x) which we wish to approximate by f(aj,x) where aj represents a set
of free parameters that may be adjusted to provide the "best" approximation in some sense. Let h(x) be the
difference between those two functions so that
h(x) = ε(x) = Y(x) ─ f(aj,x) . (6.5.2)
The least square approximation norm would say that the "best" set of aj's is found from
$$\mathrm{Min} \int \varepsilon^2(x)\,dx . \qquad (6.5.3)$$
However, an approximating function that will be the best function for computational approximation will be
better given by
Min│hmax│ = Min│ε max│ = Min│Max│Y(x)-f(aj,x)││. (6.5.4)

A set of adjustable parameters aj that are obtained by applying this norm will guarantee that
$$|\varepsilon(x)| \le \varepsilon_{\max} \quad \forall x , \qquad (6.5.5)$$
and that εmax is the smallest possible value that can be found for the given function f(aj,x). This guarantees
the investigator that any numerical evaluation of f(x) will represent Y(x) within an amount εmax. Thus, by
minimizing the maximum error, one has obtained an approximation algorithm of known accuracy
throughout the entire range. Therefore this is the approximation norm used by those who generate high
quality functional subroutines for computers. Rational functions are usually employed for such computer
algorithms instead of ordinary polynomials. However, the detailed implementation of the norm for
determining the free parameters in approximating rational functions is well beyond the scope of this book.
Since we have emphasized polynomial approximation throughout this book, we will discuss the
implementation of this norm with polynomials.


## a. The Chebyschev Norm and Polynomial Approximation

Let our approximating function f(aj,x) be of the form given by equation (3.1.1) so that
$$f(a_j,x) = \sum_{j=0}^{n} a_j\,\phi_j(x) . \qquad (6.5.6)$$

The choice of f(aj,x) to be a polynomial means that the free parameters aj will appear linearly in any analysis.
So as to facilitate comparison with our earlier approaches to polynomial approximation and least squares, let
us choose φj to be x^j and we will attempt to minimize εmax(x) over a discrete set of points xi. Thus we wish to
find a set of aj so that
$$\mathrm{Min}\,(\varepsilon_i)_{\max} = \mathrm{Min} \left| Y_i - \sum_{j=0}^{n} a_j x_i^j \right|_{\max} \quad \forall x . \qquad (6.5.7)$$

Since we have (n+1) free parameters, aj, we will need at least N = n+1 points in our discrete set xi. Indeed, if
n+1 = N then we can fit the data exactly so that εmax will be zero and the aj's could be found by any of the
methods in chapter 3. Consider the more interesting case where N >> n+1. For the purposes of an example
let us consider the cases where n = 0 and n = 1. For n = 0 the approximating function is a constant, represented
by a horizontal line in figure 6.4.

Figure 6.4 shows the Chebyschev fit to a finite set of data points. In panel a the fit is with a
constant a0 while in panel b the fit is with a straight line of the form f(x) = a1x+a0. In both
cases, the adjustment of the parameters of the function can only produce (n+2) maximum
errors for the (n+1) free parameters.

By adjusting the horizontal line up or down in figure 6.4a we will be able to get two points to have
the same largest value of │εi│ with one change in sign between them. For the straight line in figure 6.4b, we
will be able to adjust both the slope and intercept of the line thereby making the three largest values of │εi│
the same. Among the extreme values of εi there will be at least two changes in sign. In general, as long as N
> (n+1), one can adjust the parameters aj so that there are n+2 extreme values of εi all equal to εmax and there
will be (n+1) changes of sign along the approximating function. In addition, it can be shown that the aj's will
be unique. All that remains is to find them.
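
For the n = 0 case the two equal and opposite extreme errors place the constant midway between the largest and smallest data values. A minimal Python sketch of that result follows. The function name `chebyshev_constant` and the sample data are assumptions made for illustration only.

```python
def chebyshev_constant(ys):
    """Return (a0, eps_max) for the minimax fit of a constant to the values ys."""
    highest, lowest = max(ys), min(ys)
    a0 = 0.5 * (highest + lowest)        # midway between the two extreme points
    eps_max = 0.5 * (highest - lowest)   # equal and opposite errors at the extremes
    return a0, eps_max

# Hypothetical data values, for illustration only
ys = [1.0, 2.0, 2.5, 7.0]
a0, eps_max = chebyshev_constant(ys)

# For comparison, the least square constant is the mean of the data
mean = sum(ys) / len(ys)
mean_worst_error = max(abs(y - mean) for y in ys)
print(a0, eps_max, mean, mean_worst_error)
```

For these values the least square constant has a largest error of 3.875, against 3.0 for the Chebyschev constant, which shows the price that the least square norm pays in maximum error.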


## b. The Chebyschev Norm, Linear Programming, and the Simplex Method

Let us begin our search for the "best" set of free-parameters aj by considering an example.
Since we will try to show graphically the constraints of the problem, consider an approximating function of
the first degree which is to approximate three points (see figure 6.4b). We then desire
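$$\left. \begin{aligned}
Y_1 - (a_0 + a_1 x_1) &\le \varepsilon_{\max} \\
Y_2 - (a_0 + a_1 x_2) &\le \varepsilon_{\max} \\
Y_3 - (a_0 + a_1 x_3) &\le \varepsilon_{\max} \\
\varepsilon_{\max} &= \mathrm{Min}\,\varepsilon_{\max}
\end{aligned} \right\} . \qquad (6.5.8)$$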

Figure 6.5 shows the parameter space for fitting three points with a straight line under the
Chebyschev norm. The equations of condition denote half-planes which satisfy the
constraint for one particular point.

These constraints constitute the basic minimum requirements of the problem. If they were to be plotted in
parameter space (see Figure 6.5), they would constitute semi-planes bounded by the line for ε = 0. The half
of the semi-plane that is permitted would be determined by the sign of ε. However, we have used the result
from above that there will be three extreme values for εi all equal to εmax and having opposite sign. Since the
value of εmax is unknown and the equation (in general) to which it is attached is also unknown, let us regard it
as a variable to be optimized as well. The semi-planes representing the constraints are now extended out of
the a0-a1 plane in the direction of increasing│εmax│ with the semi-planes of the constraints forming an
inverted irregular pyramid. The variation of the sign of εmax guarantees that the planes will intersect to form a
convex solid. The solution to our problem is trivial, as the lower vertex of the pyramid represents the
minimum value of the maximum error, which will be the same for each constraint. However, it is nice that
the method will tell us that without it being included in the specification of the problem. Since the number of
extrema for this problem is 1+2, this is an expected result. The inclusion of a new point produces an
additional semi-constraint plane which will intersect the pyramid producing a triangular upper base. The
minimum value of the maximum error will be found at one of the vertices of this triangle. However since the
vertex will be defined by the intersection of three lines, there will still be three extrema as is required by the
degree of the approximating polynomial. Additional points will increase the number of sides as they will cut
the initial pyramid forming a multi-sided polygon. The vertices of the polygon that is defined in parameter-
εmax space will still hold the optimal solution. In this instance the search is simple as we simply wish to know
which εmax is the smallest in magnitude. Thus we look for the vertex nearest the plane of the parameters. An
increase in the number of unknowns ai's will produce figures in higher dimensions, but the analysis remains
essentially the same.

The area of mathematics that deals with problems that can be formulated in terms of linear
constraints (including inequalities) is known as Linear Programming and it has nothing to do with computer
programming. It was the outgrowth of a group of mathematicians working in a broader area of mathematics
known as operations research. The inspiration for its development was the finding of solutions to certain
optimization problems such as the efficient allocation of scarce resources (see Bland⁴).

Like many of the subjects we have introduced in this book, linear programming is a large field of
study having many ramifications far beyond the scope of this book. However, a problem that is formulated
in terms of constraint inequalities will consist of a collection of semi-spaces that define a polytope (a figure
where each side is a polygon) in multidimensional parameter space. It can be shown that the optimum
solution lies at one of the vertices of the polytope. A method for sequentially testing each vertex so that the
optimal one will be found in a deterministic way is known as the simplex method. Starting at an arbitrary
vertex one investigates the adjacent vertices finding the one which best satisfies the optimal conditions. The
remaining vertices are ignored and one moves to the new "optimal" vertex and repeats the process.

When one can find no adjacent vertices that better satisfy the optimal condition, that vertex is the
optimal vertex of the entire polytope and represents the optimal solution to the problem. In practice, the
simplex method has been found to be far more efficient than general theoretical considerations would lead
one to expect. So, while there are other approaches to linear programming problems, the one that still attracts
most attention is the simplex method.
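
For the straight-line case the idea that the optimum lies at a vertex can be illustrated by brute force. The Python sketch below is not the simplex method: it simply tests every candidate vertex, where each candidate is the line giving three data points errors of equal magnitude and alternating sign, and keeps the one whose largest error over all points is smallest. The function name `chebyshev_line` and the sample data are assumptions made for illustration, and the xi are assumed to be distinct.

```python
from itertools import combinations

def chebyshev_line(xs, ys):
    """Minimax straight line a0 + a1*x through discrete points with distinct x."""
    points = sorted(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2), (x3, y3) in combinations(points, 3):
        # Solve Y1 - f(x1) = e, Y2 - f(x2) = -e, Y3 - f(x3) = e for a0, a1
        a1 = (y3 - y1) / (x3 - x1)
        a0 = 0.5 * (y1 + y2 - a1 * (x1 + x2))
        # Largest error of this candidate over every data point
        worst = max(abs(y - (a0 + a1 * x)) for x, y in points)
        if best is None or worst < best[2]:
            best = (a0, a1, worst)
    return best

# Hypothetical data, for illustration only
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, 0.0]
a0, a1, eps_max = chebyshev_line(xs, ys)
print(a0, a1, eps_max)
```

The cost of this enumeration grows as the cube of the number of points, which is why a systematic walk from vertex to adjacent vertex is preferred for problems of any size.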

## c. The Chebyschev Norm and Least Squares

At the beginning of this chapter, we justified the choice of the Least Square approximation
norm on the grounds that it yielded linear equations of condition and was the lowest power of the deviation ε
that was guaranteed to be positive. What about higher powers? The desire to keep the error constraints
positive should limit us to even powers of ε. Thus consider a norm of the form

$$\mathrm{Min} \sum_i \varepsilon_i^{2n} = \mathrm{Min} \sum_i [Y_i - f(a_j,x_i)]^{2n} , \qquad (6.5.9)$$


which leads to the non-linear equations
$$\frac{\partial}{\partial a_j} \left( \sum_i [Y_i - f(a_j,x_i)]^{2n} \right) = \sum_i 2n\,[Y_i - f(a_j,x_i)]^{2n-1}\,\frac{\partial f(a_j,x_i)}{\partial a_j} = 0 . \qquad (6.5.10)$$

Now one could solve these non-linear equations, but there is no reason to expect that the solution would be
"better" in any real sense than the least square solution. However, consider the limit of equation (6.5.9) as
n→∞.
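$$\lim_{n\to\infty} \left( \mathrm{Min} \sum_i \varepsilon_i^{2n} \right) = \mathrm{Min} \left( \lim_{n\to\infty} \sum_i \varepsilon_i^{2n} \right) = \mathrm{Min}\,|\varepsilon_{\max}|^{2n} . \qquad (6.5.11)$$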

The solution that is found subject to the constraint that $\varepsilon_{\max}^{2n}$ is a minimum will be the same solution that is
obtained when εmax is a minimum. Thus the limit of the 2nth norm as n goes to infinity is the Chebyschev
norm.
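
This limit can be checked numerically for the simplest case of fitting a constant. The Python sketch below minimizes the sum of the 2n-th powers of the deviations for increasing n and shows the result moving from the mean (n = 1) toward the midpoint of the extreme values, which is the Chebyschev solution for a constant. The function name `fit_constant`, the use of a ternary search, and the sample data are assumptions made for illustration.

```python
def fit_constant(ys, power, iterations=200):
    """Minimize sum((y - a)**power) over a by ternary search (power must be even)."""
    low, high = min(ys), max(ys)

    def cost(a):
        return sum((y - a) ** power for y in ys)

    # The cost is convex in a, so the minimum can be bracketed by thirds
    for _ in range(iterations):
        third = (high - low) / 3.0
        m1, m2 = low + third, high - third
        if cost(m1) < cost(m2):
            high = m2
        else:
            low = m1
    return 0.5 * (low + high)

# Hypothetical data values, for illustration only
ys = [1.0, 2.0, 2.5, 7.0]
for n in (1, 2, 5, 20, 50):
    print(n, fit_constant(ys, 2 * n))
```

Very large powers would eventually overflow floating point arithmetic for data of this size, so the sketch stops at 2n = 100.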

In this chapter we have made a transition from discussing numerical analysis where the basic inputs
to a problem are known with arbitrary accuracy to those where the basic data contained errors. In earlier
chapters the only errors that occur in the calculation result from round-off of arithmetic processes or
truncation of the approximation formulae. However, in section 6.3 we allowed for the introduction of
"flawed" inputs, with inherent errors resulting from experiment or observation. Since any interaction with
the real world will involve errors of observation, we shall spend most of the remainder of the book
discussing the implication of these errors and the manner by which they can be managed.


Chapter 6 Exercises
1. Develop normal equations for the functions:

a. f(x) = a0 e^(a1 x)

b. f(x) = a0 + a1 sin(a2πx + a3) .

Which expressions could be replaced with a linear function with no loss of accuracy? What would
the error analysis of that function fit to observational data say about the errors of the original
coefficients aj?

2. Using least squares find the "best" straight-line fit and the error estimates for the slope and intercept
of that line for the following set of data.

xi Yi
1 1.5
2 2.0
3 2.8
4 4.1
5 4.9
6 6.3
7 5.0
8 11.5

$f(a_j,x) = \sum_k \phi_k(x)$, where $\phi_k(x) = \cos(k\pi x)$

xi        f(aj,xi)
0.00000   0.00000
0.17453   0.17101
0.34907   0.32139
0.41888   0.37157
0.62839   0.47553
0.78540   0.49970
1.0123    0.44940
1.0821    0.41452
1.2915    0.26496
1.5010    0.06959

How many terms are required to fit the table accurately? Discuss what you mean by "accurately"
and why you have chosen that meaning.


x1,i   Y1,i   x2,i   Y2,i
1      9.1    1      0.5
2      8.5    2      3.2
3      7.6    3      2.5
4      3.5    4      4.6
5      4.2    5      5.1
6      2.1    6      6.9
7      0.2    7      6.8

find the "best" value for the intersection of the straight lines and an estimate for the error in Y. How
would you confirm the assumption that there is no error in x?

5. Determine the complex Fourier transform of

a. e^(−t²) , −∞ < t < +∞ .

b. e^(−t) cos(t) , 0 < t < +∞ .

6. Find the FFT for the functions in problem 5 where the function is sampled every .01 in t and the
total number of points is 1024. Calculate the inverse transform of the result and compare the
accuracy of the process.


## Chapter 6 References and Supplementary Reading

1. Bateman, H., "Tables of Integral Transforms" (1954) Ed. A. Erdélyi, Volumes 1,2, McGraw-Hill
Book Co., Inc. New York, Toronto, London.

2. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art
of Scientific Computing" (1986), Cambridge University Press, Cambridge, pp. 390-394.

3. Marquardt, D.W., "An Algorithm for Least-Squares Estimation of Nonlinear Parameters", (1963),
J. Soc. Ind. Appl. Math., Vol.11, No. 2, pp.431-441.

4. Bland, R.G., "The Allocation of Resources by Linear Programming", (1981) Sci. Amer. Vol. 244,
#6, pp.126-144.

Most books on numerical analysis contain some reference to least squares. Indeed most freshmen
calculus courses deal with the subject at some level. Unfortunately no single text contains a detailed
description of the subject and its ramifications.

1. Hildebrand, F.B., "Introduction to Numerical Analysis" (1956) McGraw-Hill Book Co., Inc.,
New York, Toronto, London, pp. 258-311,

This book presents a classical discussion and much of my discussion in section 6.3 is based on his
presentation. The error analysis for non-linear least squares in section 6.4 is dealt with in considerable
detail in

2. Bevington, P.R., "Data Reduction and Error Analysis for the Physical Sciences", (1969),
McGraw-Hill Book Co. Inc., New York, San Francisco, St. Louis, Toronto, London, Sydney, pp.
204-246.

Nearly any book that discusses Fourier series and transforms contains useful information elaborating on
the uses and extended theory of the subject. An example would be

3. Sokolnikoff, I.S., and Redheffer, R.M., "Mathematics of Physics and Modern Engineering",
(1958) McGraw-Hill Book Co., Inc. New York, Toronto, London, pp. 175-211.

Two books completely devoted to Fourier analysis and the transforms particularly are:

4. Brigham, E.O., "The Fast Fourier Transform", (1974) Prentice-Hall, Inc. Englewood Cliffs, N.J.,

and

5. Bracewell, R.N., "The Fourier Transform and its Applications", 2nd Ed., (1978),
McGraw-Hill Book Company, New York N.Y.


A very compressed discussion of Linear Programming, which covers much more than we can, is to be
found in

6. Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., "Numerical Recipes: The Art
of Scientific Computing" (1986), Cambridge University Press, Cambridge, pp. 274-334,

but a more basic discussion is given by

7. Gass, S.T., "Linear Programming" (1969), 3rd ed. McGraw-Hill, New York.

7

Statistics

• • •

In the last chapter we made the transition from discussing
information which is considered to be error free to dealing with data that contained intrinsic errors. In the
case of the former, uncertainties in the results of our analysis resulted from the failure of the approximation
formula to match the given data and from round-off error incurred during calculation. Uncertainties resulting
from these sources will always be present, but in addition, the basic data itself may also contain errors. Since
all data relating to the real world will have such errors, this is by far the more common situation. In this
chapter we will consider the implications of dealing with data from the real world in more detail.


Philosophers divide data into at least two different categories, observational, historical, or empirical
data and experimental data. Observational or historical data is, by its very nature, non-repeatable.
Experimental data results from processes that, in principle, can be repeated. Some¹ have introduced a third
type of data labeled hypothetical-observational data, which is based on a combination of observation and
information supplied by theory. An example of such data might be the distance to the Andromeda galaxy
since a direct measurement of that quantity has yet to be made and must be deduced from other aspects of
the physical world. However, in the last analysis, this is true of all observations of the world. Even the
determination of repeatable, experimental data relies on agreed conventions of measurement for its unique
interpretation. In addition, one may validly ask to what extent an experiment is precisely repeatable. Is there
a fundamental difference between an experiment which can be repeated and successive observations of a
phenomenon that apparently doesn't change? The only difference would appear to be that the scientist has
the option in the case of the former of repeating the experiment, while in the latter case he or she is at the
mercy of nature. Does this constitute a fundamental difference between the sciences? The hard sciences such
as physics and chemistry have the luxury of being able to repeat experiments holding important variables
constant, thereby lending a certain level of certainty to the outcome. Disciplines such as Sociology,
Economics and Politics that deal with the human condition generally preclude experiment and thus must rely
upon observation and "historical experiments" not generally designed to test scientific hypotheses. Between
these two extremes are sciences such as Geology and Astronomy which rely largely upon observation but
are founded directly upon the experimental sciences. However, all sciences have in common the gathering of
data about the real world. To the analyst, there is little difference in this data. Both experimental and
observational data contain intrinsic errors whose effect on the sought for description of the world must be
understood.

However, there is a major difference between the physical sciences and many of the social sciences
and that has to do with the notion of cause and effect. Perhaps the most important concept driving the
physical sciences is the notion of causality. That is, the physical, biological, and to some extent the behavioral
sciences have a clear notion that event A causes event B. Thus, in testing a hypothesis, it is always clear
which variables are to be regarded as the dependent variables and which are to be considered the
independent variables. However, there are many problems in the social sciences where this luxury is not
present. Indeed, it may often be the case that it is not clear which variables used to describe a complex
phenomenon are even related. We shall see in the final chapter that even here there are some analytical
techniques that can be useful in deciding which variables are possibly related. However, we shall also see
that these tests do not prove cause and effect, rather they simply suggest where the investigator should look
for causal relationships. In general, data analysis may guide an investigator, but cannot substitute for his or
her insight and understanding of the phenomena under investigation.

During the last two centuries a steadily increasing interest has developed in the treatment of large
quantities of data all representing or relating to a much smaller set of parameters. How should these data be
combined to yield the "best" value of the smaller set of parameters? In the twentieth century our ability to
collect data has grown enormously, to the point where collating and synthesizing that data has become a
scholarly discipline in itself. Many academic institutions now have an entire department or an academic unit
devoted to this study known as statistics. The term statistics has become almost generic in the language as it
can stand for a number of rather different concepts. Occasionally the collected data itself can be referred to
as statistics. Most have heard the reference to reckless operation of a motor vehicle leading to the operator
"becoming a statistic". As we shall see, some of the quantities that we will develop to represent large
amounts of data or characteristics of that data are also called statistics. Finally, the entire study of the
analysis of large quantities of data is referred to as the study of statistics. The discipline of statistics has
occasionally been defined as providing a basis for decision-making on the basis of incomplete or imperfect
data. The definition is not a bad one for it highlights the breadth of the discipline while emphasizing its
primary function. Nearly all scientific enterprises require the investigator to make some sort of decisions and
as any experimenter knows, the data is always less than perfect.

The subject has its origins in the late 18th and early 19th century in astronomical problems studied
by Gauss and Legendre. Now statistical analysis has spread to nearly every aspect of scholarly activity. The
developing tools of statistics are used in the experimental and observational sciences to combine and analyze
data to test theories of the physical world. The social and biological sciences have used statistics to collate
information about the inhabitants of the physical world with an eye to understanding their future behavior in
terms of their past performance. The sampling of public opinion has become a driving influence for public
policy in the country. While the market economies of the world are largely self-regulating, considerable
effort is employed to "guide" these economies based on economic theory and data concerning the
performance of the economies. The commercial world allocates resources and develops plans for growth
based on the statistical analysis of past sales and surveys of possible future demand. Modern medicine uses
statistics to ascertain the efficacy of drugs and other treatment procedures. Such methods have been used, not
without controversy, to indicate man-made hazards in our environment. Even in the study of language,
statistical analysis has been used to decide the authorship of documents based on the frequency of word use
as a characteristic of different authors.

The historical development of statistics has seen the use of statistical tools in many different fields
long before the basis of the subject was codified in the axiomatic foundations to which all science aspires.
The result is that similar mathematical techniques and methods took on different designations. The
multi-discipline development of statistics has led to an uncommonly large amount of jargon. This jargon has
actually become a major impediment to understanding. There seems to have been a predilection, certainly in
the nineteenth century, to dignify shaky concepts with grandiose labels. Thus the jargon in statistics tends to
have an excessively pretentious sound often stemming from the discipline where the particular form of
analysis was used. For example, during the latter quarter of the nineteenth century, Sir Francis Galton
analyzed the height of children in terms of the height of their parents². He found that if the average height of
the parents departed from the general average of the population by an amount x, then the average height of
the children would depart by, say, 2x/3 from the average for the population. While the specific value of the
fraction (2/3) may be disputed, all now agree that it is less than one. Thus we have the observation that
departures from the population average of any sub group will regress toward the population average in
subsequent generations. Sir Francis Galton used Legendre's Principle of Least Squares to analyze his data
and determine the coefficient of regression for his study. The use of least squares in this fashion has become
popularly known as regression analysis and the term is extended to problems where the term regression has
absolutely no applicability. However, so widespread has the use of the term become, that failure to use it
constitutes a barrier to effective communication.

Statistics and statistical analysis are ubiquitous in the modern world and no educated person should
venture into that world without some knowledge of the subject, its strengths and limitations. Again we touch
upon a subject that transcends even additional courses of inquiry to encompass a lifetime of study. Since we
may present only a bare review of some aspects of the subject, we shall not attempt a historical development.
Rather we will begin by giving some of the concepts upon which most of statistics rests and then developing
some of the tools which the analyst needs.

## 7.1 Basic Aspects of Probability Theory

We can find the conceptual origins of statistics in probability theory. While it is possible to place
probability theory on a secure mathematical axiomatic basis, we shall rely on the commonplace notion of
probability. Everyone has heard the phrase "the probability of snow for tomorrow is 50%". While this sounds
very quantitative, it is not immediately clear what the statement means. Generally it is interpreted to mean
that on days that have conditions like those expected for tomorrow, snow will fall on half of them. Consider
the case where student A attends a particular class about three quarters of the time. On any given day the
professor could claim that the probability of student A attending the class is 75%. However, the student
knows whether or not he is going to attend class so that he would state that the probability of his attending
class on any particular day is either 0% or 100%. Clearly the probability of the event happening is dependent
on the prior knowledge of the individual making the statement. There are those who define probability as a
measure of ignorance. Thus we can define two events to be equally likely if we have no reason to expect one
event over the other. In general we can say that if we have n equally likely cases and any m of them will
generate an event E, then the probability of E occurring is
$$ P(E) = m/n . \tag{7.1.1} $$

Consider the probability of selecting a diamond card from a deck of 52 playing cards. Since there
are 13 diamonds in the deck, the probability is just 13/52 = ¼. This result did not depend on there being 4
suits in the standard deck, but only on the ratio of 'correct' selections to the total number of possible
selections. It is always assumed that the event will take place if all cases are selected so that the probability
that an event E will not happen is just
$$ Q(E) = 1 - P(E) . \tag{7.1.2} $$
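
As a minimal sketch of equations (7.1.1) and (7.1.2), the card example can be checked by direct counting. The deck representation and the helper name `probability` are illustrative assumptions, not anything defined in the text.

```python
from fractions import Fraction

def probability(m, n):
    """Equation (7.1.1): m favourable cases out of n equally likely cases."""
    return Fraction(m, n)

# Hypothetical deck layout: 13 ranks in each of 4 suits.
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]

diamonds = [card for card in deck if card[1] == "diamonds"]
p_diamond = probability(len(diamonds), len(deck))  # 13/52 reduces to 1/4
q_diamond = 1 - p_diamond                          # equation (7.1.2)

print(p_diamond, q_diamond)
```

Exact fractions are used here rather than floating point so that the ratios can be compared directly with the values quoted in the text.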

In order to use equation (7.1.1) to calculate the probability of event E taking place, it is necessary that we
correctly enumerate all the possible cases that can give rise to the event. In the case of the deck of cards, this
seems fairly simple. However, consider the tossing of two coins where we wish to know the probability of
two 'heads' occurring. The different possibilities would appear to be both coins coming up 'heads', both coins
coming up 'tails', and one coin coming up 'heads' while the other is 'tails'. Thus naïvely one would think that
the probability of obtaining two 'heads' would be 1/3. However, since the two tosses are truly independent events,
each coin can be either 'heads' or 'tails'. Therefore there are two separate cases where one coin can be 'heads'
and the other 'tails', yielding four possible cases. Thus the correct probability of obtaining two 'heads' is 1/4.
The set of all possible cases is known as the sample set, or sample space, and in statistics is sometimes
referred to as the parent population.
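
The enumeration of the sample set for two coins can be sketched as follows. This is only an illustration under the assumption that each coin is labelled separately; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Each coin is tracked separately, so ('H', 'T') and ('T', 'H') are distinct cases.
sample_space = list(product("HT", repeat=2))

two_heads = [case for case in sample_space if case == ("H", "H")]
p_two_heads = Fraction(len(two_heads), len(sample_space))

# The naive count ignores which coin shows which face, merging HT and TH
# into one case and leaving only three (not equally likely) cases.
naive_cases = {tuple(sorted(case)) for case in sample_space}
p_naive = Fraction(1, len(naive_cases))

print(len(sample_space), p_two_heads, p_naive)
```

The naive value of 1/3 fails because the three merged cases are not equally likely, which equation (7.1.1) requires.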

## a. The Probability of Combinations of Events

It is possible to view our coin tossing event as two separate and independent events where
each coin is tossed separately. Clearly the probability of tossing each coin and obtaining a specific result is 1/2.
Thus the probability of tossing two coins and obtaining a specific result (two heads) will be 1/4, or (1/2)×(1/2). In
general, the probability of obtaining both event E and event F, [P(EF)], when the two are independent, will be

$$ P(EF) = P(E) \times P(F) . \tag{7.1.3} $$

Requiring the occurrence of both event E and event F constitutes the use of the logical 'and', which for
independent events always results in a multiplicative action. We can also ask what will be the total, or joint, probability of event E or event F
occurring. Should events E and F be mutually exclusive (i.e. there are no cases in the sample set that result in
both E and F), then P(EorF) is given by

$$ P(E\,\mathrm{or}\,F) = P(E) + P(F) . \tag{7.1.4} $$

This use of addition represents the logical 'or'. In our coin tossing exercise obtaining one 'head' and one 'tail'
could be expressed as the probability of the first coin being 'heads' and the second coin being 'tails' or the
first coin being 'tails' and the second coin being 'heads' so that

$$ P(HT) = P(H)P(T) + P(T)P(H) = (1/2)\times(1/2) + (1/2)\times(1/2) = 1/2 . \tag{7.1.5} $$

We could obtain this directly from consideration of the sample set itself and equation (7.1.1) since m = 2,
and n = 4. However, in more complicated situations the laws of combining probabilities enable one to
calculate the combined probability of events in a clear and unambiguous way.
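
The agreement between the combination laws and direct counting can be sketched in a few lines. This is a hedged illustration assuming fair, independent coins; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

p_heads = Fraction(1, 2)
p_tails = Fraction(1, 2)

# Logical 'and' for independent tosses multiplies, equation (7.1.3);
# logical 'or' for the two mutually exclusive orderings adds, equation (7.1.4).
p_one_of_each = p_heads * p_tails + p_tails * p_heads

# Cross-check by counting cases in the sample set, equation (7.1.1).
sample_space = list(product("HT", repeat=2))
m = sum(1 for case in sample_space if set(case) == {"H", "T"})
p_counted = Fraction(m, len(sample_space))

print(p_one_of_each, p_counted)
```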

In calculating P(EorF) we required that the events E and F be mutually exclusive and in the coin
exercise, we guaranteed this by using separate coins. What can be done if that is not the case? Consider the
situation where one rolls a die with the conventional six faces numbered 1 through 6. The probability of any
particular face coming up is 1/6. However, we can ask the question what is the probability of a number less
than three appearing or an even number appearing. The cases where the result is less than three are 1 and 2,
while the cases where the result is even are 2, 4, and 6. Naïvely one might think that the correct answer is 5/6.
However, these are not mutually exclusive cases, for the number 2 is both even and less
than three. Therefore we have counted 2 twice, for the only distinct cases are 1, 2, 4, and 6, so that the correct
result is 4/6. In general, this result can be expressed as

$$ P(E\,\mathrm{or}\,F) = P(E) + P(F) - P(EF) , \tag{7.1.6} $$

or in the case of the die

$$ P(< 3\ \mathrm{or\ even}) = [(1/6)+(1/6)] + [(1/6)+(1/6)+(1/6)] - [(1/3)\times(1/2)] = 2/3 . \tag{7.1.7} $$

We can express these laws graphically by means of a Venn diagram as in figure 7.1. The simple sum of the
probabilities of events that are not mutually exclusive counts the intersection on the Venn diagram twice, and
therefore the intersection must be subtracted once from the sum.
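
The die example can be checked with set operations, which mirror the Venn diagram directly. This is a minimal sketch assuming a fair six-sided die; the helper `p` and the set names are hypothetical.

```python
from fractions import Fraction

faces = set(range(1, 7))
less_than_three = {f for f in faces if f < 3}   # {1, 2}
even = {f for f in faces if f % 2 == 0}         # {2, 4, 6}

def p(event):
    """Equation (7.1.1) applied to a set of equally likely faces."""
    return Fraction(len(event), len(faces))

# Equation (7.1.6): subtract the intersection that the simple sum counts twice.
p_union_formula = p(less_than_three) + p(even) - p(less_than_three & even)

# Direct count of the distinct cases 1, 2, 4, and 6.
p_union_counted = p(less_than_three | even)

print(p_union_formula, p_union_counted)
```

The set intersection `less_than_three & even` contains only the face 2, which is the case the naive sum of 5/6 counts twice.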

Figure 7.1 shows a sample space giving rise to events E and F. In the case of the die, E is
the event that the result is less than three and F is the event that the result is
even. The i