
Numerical Methods Fall 2010

Lecturer: conf. dr. Viorel Bostan


Office: 6-417
Telephone: 50-99-38
E-mail address: viorel bostan@mail.utm.md
Course web page: moodle.fcim.utm.md
Office hours: TBA. I will also be available at other
times. Just drop by my office, talk to me after the
class or send me an e-mail to make an appointment.
Prerequisites: A basic course on mathematical anal-
ysis (single and multivariable calculus), ordinary dif-
ferential equations and some knowledge of computer
programming.
Course outline: This is a fast-paced course. This course
gives an in-depth introduction to the basic areas of nu-
merical analysis. The main objective will be to have a
clear understanding of the ideas and techniques underly-
ing the numerical methods, results, and algorithms that
will be presented, where error analysis plays an impor-
tant role. You will then be able to use this knowledge to
analyze the numerical methods and algorithms that you
will encounter, and also to program them effectively on
a computer. This knowledge will be useful in your future
not only to solve problems with a numerical component,
but also to develop numerical algorithms of your own.
Topics to be covered:
1. Computer representation of numbers. Errors: types,
sources, propagation.
2. Solution of nonlinear equations. Rootfinding.
3. Interpolation by polynomials and spline functions.
4. Approximation of functions.
5. Numerical integration. Automatic differentiation.
6. Matrix computations and systems of linear equations.
7. Numerical methods for ODE.
This course plan may be modified during the semester.
Such modifications will be announced in advance during
class period. The student is responsible for keeping abreast
of such changes.
Class procedure: The majority of each class period
will be lecture oriented. Some material will be handed
out during lectures, and some material will be sent by e-mail.
I strongly advise you to attend lectures, do your home-
work, work consistently, and ask questions. Lecture
time is at a premium; you cannot be taught everything
in class. It is your responsibility to learn the material;
the instructor's job is to guide you in your learning.
During the semester, 10 homeworks and 4 program-
ming projects will be assigned. As a general rule, you
will find it necessary to spend approximately 2-3 hours
of study for each lecture/lab meeting, and additional
time will be needed for exam preparation. It is strongly
advised that you start working on this course from the
very beginning. The importance of doing the assigned
homeworks and projects cannot be overemphasized.
Programming projects: The predominant programming
languages used in numerical analysis are Fortran and MAT-
LAB. We will focus on MATLAB. Programs in other lan-
guages are also sometimes acceptable, but no program-
ming assistance will be given in the use of such languages
(i.e. C, C++, Java, Pascal). For students unacquainted
with MATLAB, the following e-readings are suggested:
1. Ian Cavers, An Introductory Guide to MATLAB, 2nd Edition, Dept. of Computer Science, University of British Columbia, December 1998,
www.cs.ubc.ca/spider/cavers/MatlabGuide/guide.html
2. Paul Fackler, A MATLAB Primer, North Carolina State University,
www4.ncsu.edu/unity/users/p/pfackler/www/MPRIMER.htm
3. MATLAB Tutorials, Dept. of Mathematics, Southern Illinois University at Carbondale,
www.math.siu.edu/matlab/tutorials.html
4. Christian Roessler, MATLAB Basics, University of Melbourne, June 2004,
www.econphd.net/downloads/matlab.pdf
5. Kermit Sigmon, MATLAB Primer, 3rd edition, Dept. of Mathematics, University of Florida,
www.wiwi.uni-frankfurt.de/professoren/krueger/teaching/ws0506/macromh/matlabprimer.pdf
In your project report you should include:
1. The routines you have developed;
2. The results for your test cases in forms of tables,
graphs etc.;
3. Answers to all questions contained in the assign-
ment;
4. Comments.
You should report your results in a way that is easy to
read, communicates the problem and the results ef-
fectively, and can be reproduced by someone else who
has not seen the problem before, but is technically
knowledgeable. You should also give any justification
or other reasons to believe the correctness of your
results and code. Also, give conclusions on how effec-
tive your methods and routines appear to be, and report
and comment on any "unusual behavior" of your results.
Team work is allowed, but you should specify this
in your report, as well as the tasks executed by each
member of your team.
Grading policy: The final grade will be based on tests
and hw/projects, as follows:
1. There will be one 3-hour written exam given after
8 weeks of classes at a time arranged later (presumably
at the end of October). This midterm exam will count
for 25% of the course grade.
2. The final comprehensive exam will be given dur-
ing the scheduled examination time at the end of the
semester; it will cover all material, and it will count
for 35% of your final grade.
3. HW and lab projects will count for 20% of the grade
each. Late homework and projects will not be accepted!
4. You will need a scientific calculator during exams.
Sharing of calculators will not be allowed. Make sure
you have one.
The exams will be open notes, i.e. you will be allowed
to use your class notes and class slides (no other ma-
terial will be allowed).
Grading for homeworks and lab
projects
The HW will be graded on a scale from 0 to 4 with the
possibility of getting an extra bonus point on each home-
work. Grades will be given according to the following
guidelines:
0 — no homework turned in;
1 — poor, lousy job;
2 — incomplete job;
3 — good job;
4 — very good job;
+1 for optional problems and/or an excellent/outstanding
solution to one of the problems.
It is very important that you take the examinations at the
scheduled times. Alternate exams will be scheduled only
for those who have compelling and convincing enough
reasons.
Academic misconduct: No kind of academic mis-
conduct will be tolerated. If a situation arises
where you and your instructor disagree on some mat-
ter and cannot resolve the issue, you should see the
Dean. However, any problems concerning the course
should be rst discussed with your instructor.
Readings:
1. Kendall Atkinson, An Introduction to Numerical Analy-
sis, 2nd edition
2. Cleve Moler, Numerical Computing with MATLAB,
http://www.mathworks.com/moler/
3. Björck A., Dahlquist G., Numerical Mathematics and Scientific Computation
4. Steven E. Pav, Numerical Methods Course Notes, University of California at San Diego, 2005
5. Mathews J.H., Fink D.K., Numerical Methods Using MATLAB, 1999
6. Kincaid D., Cheney W., Numerical Analysis, 1991
7. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, 1991
8. Hoffman J.D., Numerical Methods for Engineers and Scientists, 2001
9. Johnston R.L., Numerical Methods: A Software Approach, 1982
10. Carothers N.L., A Short Course on Approximation Theory, Course notes, Bowling Green State University
11. George W. Collins, Fundamental Numerical Methods and Data Analysis
12. Shampine L.F., Allen R.C., Pruess S., Fundamentals of Numerical Computing, 1997
Also, you should check the university library for available
books.
Useful web-sites with on-line literature:
www.math.gatech.edu/~cain/textbooks/onlinebooks.html
www.econphd.net/notes.htm
Definition of Numerical Analysis by Kendall Atkinson,
Prof., University of Iowa
Numerical analysis is the area of mathematics and com-
puter science that creates, analyzes, and implements al-
gorithms for solving numerically the problems of contin-
uous mathematics.
Such problems originate generally from real-world appli-
cations of algebra, geometry and calculus, and they in-
volve variables which vary continuously; these problems
occur throughout the natural sciences, social sciences,
engineering, medicine, and business.
During the past half-century, the growth in power and
availability of digital computers has led to an increas-
ing use of realistic mathematical models in science and
engineering, and numerical analysis of increasing sophis-
tication has been needed to solve these more detailed
mathematical models of the world.
With the growth in importance of using computers to
carry out numerical procedures in solving mathematical
models of the world, an area known as scientific com-
puting or computational science has taken shape during
the 1980s and 1990s. This area looks at the use of nu-
merical analysis from a computer science perspective. It
is concerned with using the most powerful tools of nu-
merical analysis, computer graphics, symbolic mathemat-
ical computations, and graphical user interfaces to make
it easier for a user to set up, solve, and interpret compli-
cated mathematical models of the real world.
Definition of Numerical Analysis by Lloyd N. Trefethen,
Prof., Cornell University
Here is the wrong answer: Numerical analysis is the study
of rounding errors.
Some other wrong or incomplete answers:
Webster's New Collegiate Dictionary: "The study of quan-
titative approximations to the solutions of mathematical
problems including consideration of the errors and bounds
to the errors involved."
Chambers 20th Century Dictionary: "The study of
methods of approximation and their accuracy, etc."
The American Heritage Dictionary: "The study of ap-
proximate solutions to mathematical problems, taking into
account the extent of possible errors."
The correct answer is: Numerical analysis is the study of algo-
rithms for the problems of continuous mathematics.
NUMERICAL ANALYSIS: This refers to the analysis
of mathematical problems by numerical means, es-
pecially mathematical problems arising from models
based on calculus.
Effective numerical analysis requires several things:
An understanding of the computational tool being
used, be it a calculator or a computer.
An understanding of the problem to be solved.
Construction of an algorithm which will solve the
given mathematical problem to a given desired
accuracy and within the limits of the resources
(time, memory, etc) that are available.
This is a complex undertaking. Numerous people
make this their life's work, usually working on only
a limited variety of mathematical problems.
Within this course, we attempt to show the spirit of
the subject. Most of our time will be taken up with
looking at algorithms for solving basic problems such
as rootfinding and numerical integration; but we will
also look at the structure of computers and the impli-
cations of using them in numerical calculations.
We begin by looking at the relationship of numerical
analysis to the larger world of science and engineering.
SCIENCE
Traditionally, engineering and science had a two-sided
approach to understanding a subject: the theoretical
and the experimental. More recently, a third approach
has become equally important: the computational.
Traditionally we would build an understanding by build-
ing theoretical mathematical models, and we would
solve these for special cases. For example, we would
study the flow of an incompressible irrotational fluid
past a sphere, obtaining some idea of the nature of
fluid flow. But more practical situations could seldom
be handled by direct means, because the needed equa-
tions were too difficult to solve. Thus we also used
the experimental approach to obtain better informa-
tion about the flow of practical fluids. The theory
would suggest ideas to be tried in the laboratory, and
the experimental results would often suggest direc-
tions for a further development of theory.
[Diagram: Theoretical Science, Experimental Science, and Computational Science shown as three interconnected approaches.]
With the rapid advance in powerful computers, we
now can augment the study of fluid flow by directly
solving the theoretical models of fluid flow as applied
to more practical situations; and this area is often re-
ferred to as computational fluid dynamics. At the
heart of computational science is numerical analysis;
and to effectively carry out a computational science
approach to studying a physical problem, we must un-
derstand the numerical analysis being used, especially
if improvements are to be made to the computational
techniques being used.
MATHEMATICAL MODELS
A mathematical model is a mathematical description
of a physical situation. By means of studying the
model, we hope to understand more about the physical
situation. Such a model might be very simple. For example,
    A = 4·π·R_e^2,   R_e ≈ 6371 km
is a formula for the surface area of the earth. How
accurate is it? First, it assumes the earth is a sphere,
which is only an approximation. At the equator, the
radius is approximately 6378 km; and at the poles,
the radius is approximately 6357 km. Next, there is
experimental error in determining the radius; and in
addition, the earth is not perfectly smooth. Therefore,
there are limits on the accuracy of this model for the
surface area of the earth.
AN INFECTIOUS DISEASE MODEL
For rubella measles, we have the following model for
the spread of the infection in a population (subject to
certain assumptions):
    ds/dt = −a·s·i
    di/dt = a·s·i − b·i
    dr/dt = b·i
In this, s, i, and r refer, respectively, to the propor-
tions of a total population that are susceptible, infec-
tious, and removed (from the susceptible and infec-
tious pool of people). All variables are functions of
time t. The constants can be taken as
    a = 6.8/11,   b = 1/11.
The same model works for some other diseases (e.g.
flu), with a suitable change of the constants a and b.
Again, this is an approximation of reality (and a useful
one).
But it has its limits. Solving a bad model will not give
good results, no matter how accurately it is solved;
and the person solving this model and using the results
must know enough about the formation of the model
to be able to correctly interpret the numerical results.
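As an illustration of how such a model is solved numerically, here is a minimal MATLAB sketch (added here, not part of the original notes); the initial proportions s(0), i(0), r(0) and the time interval are assumed values chosen only for the example.

    % Sketch: solve ds/dt = -a*s*i, di/dt = a*s*i - b*i, dr/dt = b*i with ode45.
    a = 6.8/11;  b = 1/11;
    rhs = @(t,y) [ -a*y(1)*y(2);             % y(1) = s
                    a*y(1)*y(2) - b*y(2);    % y(2) = i
                    b*y(2) ];                % y(3) = r
    y0 = [0.95; 0.05; 0];                    % assumed initial proportions
    [t, y] = ode45(rhs, [0 100], y0);
    plot(t, y); legend('s','i','r'); xlabel('t');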
THE LOGISTIC EQUATION
This is the simplest model for population growth. Let
N(t) denote the number of individuals in a population
(rabbits, people, bacteria, etc). Then we model its
growth by
    N'(t) = c·N(t),   t ≥ 0,   N(t_0) = N_0.
The constant c is the growth constant, and it usually
must be determined empirically. Over short periods of
time, this is often an accurate model for population
growth. For example, it accurately models the growth
of the US population over the period 1790 to 1860, with
c = 0.2975.
THE PREDATOR-PREY MODEL
Let F(t) denote the number of foxes at time t; and
let R(t) denote the number of rabbits at time t. A
simple model for these populations is called the Lotka-
Volterra predator-prey model:
    dR/dt = a·[1 − b·F(t)]·R(t)
    dF/dt = c·[−1 + d·R(t)]·F(t)
with a, b, c, d positive constants. If one looks carefully
at this, then one can see how it is built from the logis-
tic equation. In some cases, this is a very useful model
and agrees with physical experiments. Of course, we
can substitute other interpretations, replacing foxes
and rabbits with other predator and prey. The model
will fail, however, when there are other populations
that affect the first two populations in a significant
way.
NEWTON'S SECOND LAW
Newton's second law states that the force acting on
an object is directly proportional to the product of its
mass and acceleration,
    F ∝ m·a.
With a suitable choice of physical units, we usually
write this in its scalar form as
    F = m·a.
Newton's law of gravitation for a two-body situation,
say the earth and an object moving about the earth, is
then
    m·d²r(t)/dt² = −(G·m·m_e / |r(t)|²) · (r(t)/|r(t)|)
with r(t) the vector from the center of the earth to
the center of the object moving about the earth. The
constant G is the gravitational constant, not depen-
dent on the earth; and m and m_e are the masses,
respectively, of the object and the earth.
This is an accurate model for many purposes. But
what are some physical situations under which it will
fail?
When the object is very close to the surface of the
earth and does not move far from one spot, we take
|r(t)| to be the radius of the earth. We obtain the
new model
    m·d²r(t)/dt² = −m·g·k
with k the unit vector directly upward from the earth's
surface at the location of the object. The gravitational
constant
    g ≈ 9.8 meters/second².
Again, this is a model; it is not physical reality.
The Patriot Missile Failure
On February 25, 1991, during the Gulf War, an Amer-
ican Patriot Missile battery in Dhahran, Saudi Arabia,
failed to intercept an incoming Iraqi Scud missile. The
Scud struck an American Army barracks and killed 28
soldiers.
A report of the General Accounting Office, GAO/IMTEC-
92-26, entitled "Patriot Missile Defense: Software Prob-
lem Led to System Failure at Dhahran, Saudi Arabia",
reported on the cause of the failure.
It turns out that the cause was an inaccurate calcula-
tion of the time since boot due to computer arithmetic
errors.
Specifically, the time in tenths of a second as measured
by the system's internal clock was multiplied by 1/10
to produce the time in seconds. This calculation was
performed using a 24-bit fixed point register. In par-
ticular, the value 1/10, which has a non-terminating
binary expansion, was chopped at 24 bits after the
radix point. The small chopping error, when multi-
plied by the large number giving the time in tenths of
a second, led to a significant error. Indeed, the Patriot battery had been up
around 100 hours, and an easy calculation shows that the
resulting time error due to the magnified chopping error
was about 0.34 seconds.
The number 1/10 equals
    1/10 = 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + ...
         = (0.0001100110011001100110011001100...)_2
Now the 24-bit register in the Patriot stored instead
    (0.00011001100110011001100)_2
introducing an error of
    (0.0000000000000000000000011001100...)_2
which, converted to decimal, is about
    0.000000095.
Multiplying by the number of tenths of a second in 100
hours gives
    0.000000095 × 100 × 60 × 60 × 10 ≈ 0.34 seconds.
A Scud travels at about 1676 meters per second, and
so travels more than half a kilometer in this time. This
was far enough that the incoming Scud was outside the
"range gate" that the Patriot tracked. Ironically, the fact
that the bad time calculation had been improved in some
parts of the code, but not all, contributed to the problem,
since it meant that the inaccuracies did not cancel.
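A quick MATLAB check of this arithmetic (a sketch added here, not part of the GAO report); the truncation below keeps 23 fractional bits, consistent with the stored value quoted above:

    % Sketch: reproduce the Patriot chopping-error estimate.
    exact  = 1/10;
    stored = floor((1/10) * 2^23) / 2^23;   % 1/10 chopped to 23 fractional bits
    err    = exact - stored;                % about 9.5e-8
    tenths = 100 * 60 * 60 * 10;            % tenths of a second in 100 hours
    drift  = err * tenths                   % about 0.34 seconds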
The following paragraph is excerpted from the GAO re-
port.
"The range gate's prediction of where the Scud will next
appear is a function of the Scud's known velocity and the
time of the last radar detection. Velocity is a real number
that can be expressed as a whole number and a decimal
(e.g., 3750.2563... miles per hour). Time is kept continu-
ously by the system's internal clock in tenths of seconds
but is expressed as an integer or whole number (e.g., 32,
33, 34...). The longer the system has been running, the
larger the number representing time. To predict where
the Scud will next appear, both time and velocity must
be expressed as real numbers. Because of the way the
Patriot computer performs its calculations and the fact
that its registers are only 24 bits long, the conversion of
time from an integer to a real number cannot be any
more precise than 24 bits. This conversion results in a
loss of precision causing a less accurate time calculation.
The effect of this inaccuracy on the range gate's calcu-
lation is directly proportional to the target's velocity and
the length of time the system has been running. Conse-
quently, performing the conversion after the Patriot has
been running continuously for extended periods causes
the range gate to shift away from the center of the tar-
get, making it less likely that the target, in this case a
Scud, will be successfully intercepted."
CALCULATION OF FUNCTIONS
Using hand calculations, a hand calculator, or a com-
puter, what are the basic operations of which we are
capable? In essence, they are addition, subtraction,
multiplication, and division (and even this will usually
require a truncation of the quotient at some point).
In addition, we can make logical decisions, such as
deciding which of the following are true for two real
numbers a and b:
    a > b,   a = b,   a < b.
Furthermore, we can carry out only a finite number
of such operations. If we limit ourselves to just addi-
tion, subtraction, and multiplication, then in evaluat-
ing functions f(x) we are limited to the evaluation of
polynomials:
    p(x) = a_0 + a_1·x + ... + a_n·x^n.
In this, n is the degree (provided a_n ≠ 0) and {a_0, ..., a_n}
are the coefficients of the polynomial. Later we will
discuss the efficient evaluation of polynomials; but for
now, we ask how we are to evaluate other functions
such as e^x, cos x, log x, and others.
TAYLOR POLYNOMIAL APPROXIMATIONS
We begin with an example, that of f(x) = e^x from
the text. Consider evaluating it for x near 0. We
look for a polynomial p(x) whose values will be the
same as those of e^x to within acceptable accuracy.
Begin with a linear polynomial p(x) = a_0 + a_1·x. Then
to make its graph look like that of e^x, we ask that the
graph of y = p(x) be tangent to that of y = e^x at
x = 0. Doing so leads to the formula
    p(x) = 1 + x.
Continue in this manner, looking next for a quadratic
polynomial
    p(x) = a_0 + a_1·x + a_2·x^2.
We again make it tangent; and to determine a_2, we
also ask that p(x) and e^x have the same curvature
at the origin. Combining these requirements, we have
for f(x) = e^x that
    p(0) = f(0),   p'(0) = f'(0),   p''(0) = f''(0).
This yields the approximation
    p(x) = 1 + x + (1/2)x^2.
We continue this pattern, looking for a polynomial
    p(x) = a_0 + a_1·x + a_2·x^2 + ... + a_n·x^n.
We now require that
    p(0) = f(0),   p'(0) = f'(0),   ...,   p^(n)(0) = f^(n)(0).
This leads to the formula
    p(x) = 1 + x + (1/2)x^2 + ... + (1/n!)x^n.
What are the problems when evaluating points x that
are far from 0?
TAYLOR'S APPROXIMATION FORMULA
Let f(x) be a given function, and assume it has deriv-
atives around some point x = a (with as many deriv-
atives as we find necessary). We seek a polynomial
p(x) of degree at most n, for some non-negative inte-
ger n, which will approximate f(x) by satisfying the
following conditions:
    p(a) = f(a)
    p'(a) = f'(a)
    p''(a) = f''(a)
      ...
    p^(n)(a) = f^(n)(a).
The general formula for this polynomial is
    p_n(x) = f(a) + (x − a)·f'(a) + (1/2!)(x − a)^2·f''(a) + ... + (1/n!)(x − a)^n·f^(n)(a).
Then f(x) ≈ p_n(x) for x close to a.
TAYLOR POLYNOMIALS FOR f(x) = log x
In this case, we expand about the point x = 1, making
the polynomial tangent to the graph of f(x) = log x
at the point x = 1. For a general degree n ≥ 1, this
results in the polynomial
    p_n(x) = (x − 1) − (1/2)(x − 1)^2 + (1/3)(x − 1)^3 − ... + (−1)^(n−1)·(1/n)(x − 1)^n.
Note the graphs of these polynomials for varying n.
THE TAYLOR POLYNOMIAL ERROR FORMULA
Let f(x) be a given function, and assume it has deriv-
atives around some point x = a (with as many deriva-
tives as we find necessary). For the error in the Taylor
polynomial p_n(x), we have the formulas
    f(x) − p_n(x) = (1/(n + 1)!)·(x − a)^(n+1)·f^(n+1)(c_x)
                  = (1/n!)·∫_a^x (x − t)^n·f^(n+1)(t) dt.
The point c_x is restricted to the interval bounded by x
and a, and otherwise c_x is unknown. We will use the
first form of this error formula, although the second
is more precise in that you do not need to deal with
the unknown point c_x.
Consider the special case of n = 0. Then the Taylor
polynomial is the constant function:
    f(x) ≈ p_0(x) = f(a).
The first form of the error formula becomes
    f(x) − p_0(x) = f(x) − f(a) = (x − a)·f'(c_x)
with c_x between a and x. You have seen this in
your beginning calculus course, and it is called the
mean-value theorem. The error formula
    f(x) − p_n(x) = (1/(n + 1)!)·(x − a)^(n+1)·f^(n+1)(c_x)
can be considered a generalization of the mean-value
theorem.
EXAMPLE: f(x) = e^x
For general n ≥ 0, and expanding e^x about x = 0, we
have that the degree n Taylor polynomial approxima-
tion is given by
    p_n(x) = 1 + x + (1/2!)x^2 + (1/3!)x^3 + ... + (1/n!)x^n.
For the derivatives of f(x) = e^x, we have
    f^(k)(x) = e^x,   f^(k)(0) = 1,   k = 0, 1, 2, ...
For the error,
    e^x − p_n(x) = (1/(n + 1)!)·x^(n+1)·e^(c_x)
with c_x located between 0 and x. Note that for x ≥ 0,
we must have c_x ≥ 0 and
    e^x − p_n(x) ≥ (1/(n + 1)!)·x^(n+1).
This last term is also the final term in p_(n+1)(x), and
thus
    e^x − p_n(x) ≥ p_(n+1)(x) − p_n(x).
Consider calculating an approximation to e. Then let
x = 1 in the earlier formulas to get
    p_n(1) = 1 + 1 + 1/2! + 1/3! + ... + 1/n!.
For the error,
    e − p_n(1) = (1/(n + 1)!)·e^(c_x),   0 ≤ c_x ≤ 1.
To bound the error, we have e^0 ≤ e^(c_x) ≤ e^1, so
    1/(n + 1)! ≤ e − p_n(1) ≤ e/(n + 1)!.
To have an approximation accurate to within 10^(−5),
we choose n large enough to have
    e/(n + 1)! ≤ 10^(−5)
which is true if n ≥ 8. In fact,
    e − p_8(1) ≤ e/9! ≈ 7.5 × 10^(−6).
Then calculate p_8(1) ≈ 2.71827877, and e − p_8(1) ≈ 3.06 × 10^(−6).
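This is easy to check numerically; a small MATLAB sketch (added here, not from the original notes):

    % Sketch: approximate e by the degree-8 Taylor polynomial at x = 1.
    n = 8;
    p = sum(1 ./ factorial(0:n));   % p_8(1) = 1 + 1 + 1/2! + ... + 1/8!
    err = exp(1) - p                % about 3.06e-6, within the bound e/9! ~ 7.5e-6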
FORMULAS OF STANDARD FUNCTIONS
    1/(1 − x) = 1 + x + x^2 + ... + x^n + x^(n+1)/(1 − x)
    cos x = 1 − x^2/2! + x^4/4! − ... + (−1)^m·x^(2m)/(2m)! + (−1)^(m+1)·x^(2m+2)/(2m + 2)!·cos c_x
    sin x = x − x^3/3! + x^5/5! − ... + (−1)^(m−1)·x^(2m−1)/(2m − 1)! + (−1)^m·x^(2m+1)/(2m + 1)!·cos c_x
with c_x between 0 and x.
OBTAINING TAYLOR FORMULAS
Most Taylor polynomials have been obtained by means
other than using the formula
    p_n(x) = f(a) + (x − a)·f'(a) + (1/2!)(x − a)^2·f''(a) + ... + (1/n!)(x − a)^n·f^(n)(a)
because of the difficulty of obtaining the derivatives
f^(k)(x) for larger values of k. Actually, this is now
much easier, as we can use Maple or Mathematica.
Nonetheless, most formulas have been obtained by
manipulating standard formulas; and examples of this
are given in the text.
For example, use
    e^t = 1 + t + (1/2!)t^2 + (1/3!)t^3 + ... + (1/n!)t^n + (1/(n + 1)!)·t^(n+1)·e^(c_t)
in which c_t is between 0 and t. Let t = −x^2 to obtain
    e^(−x^2) = 1 − x^2 + (1/2!)x^4 − (1/3!)x^6 + ... + ((−1)^n/n!)·x^(2n) + ((−1)^(n+1)/(n + 1)!)·x^(2n+2)·e^(−ξ_x).
Because c_t must be between 0 and −x^2, we have that it
must be negative. Thus we let c_t = −ξ_x in the error
term, with 0 ≤ ξ_x ≤ x^2.
EVALUATING A POLYNOMIAL
Consider having a polynomial
    p(x) = a_0 + a_1·x + a_2·x^2 + ... + a_n·x^n
which you need to evaluate for many values of x. How
do you evaluate it? This may seem a strange question,
but the answer is not as obvious as you might think.
The standard way, written in a loose algorithmic for-
mat:
    poly = a_0
    for j = 1 : n
        poly = poly + a_j * x^j
    end
To compare the costs of different numerical meth-
ods, we do an operations count, and then we compare
these for the competing methods. Above, the counts
are as follows:
    additions: n
    multiplications: 1 + 2 + 3 + ... + n = n(n + 1)/2
This assumes each term a_j·x^j is computed indepen-
dently of the remaining terms in the polynomial.
Next, compute the powers x^j recursively:
    x^j = x · x^(j−1)
Then to compute {x^2, x^3, ..., x^n} will cost n − 1 mul-
tiplications. Our algorithm becomes
    poly = a_0 + a_1 * x
    power = x
    for j = 2 : n
        power = x * power
        poly = poly + a_j * power
    end
The total operations cost is
    additions: n
    multiplications: n + (n − 1) = 2n − 1
When n is even moderately large, this is much less
than for the first method of evaluating p(x). For ex-
ample, with n = 20, the first method has 210 multi-
plications, whereas the second has 39 multiplications.
We now consider nested multiplication. As exam-
ples of particular degrees, write
    n = 2:  p(x) = a_0 + x(a_1 + a_2·x)
    n = 3:  p(x) = a_0 + x(a_1 + x(a_2 + a_3·x))
    n = 4:  p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + a_4·x)))
These contain, respectively, 2, 3, and 4 multiplica-
tions. This is less than the preceding method, which
would have needed 3, 5, and 7 multiplications, respec-
tively.
For the general case, write
    p(x) = a_0 + x(a_1 + x(a_2 + ... + x(a_(n−1) + a_n·x) ... )).
This requires n multiplications, which is only about
half that for the preceding method. For an algorithm,
write
    poly = a_n
    for j = n − 1 : −1 : 0
        poly = a_j + x * poly
    end
With all three methods, the number of additions is n;
but the number of multiplications can be dramatically
different for large values of n.
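A MATLAB version of the nested (Horner) scheme, added here as a sketch; the coefficient vector a is assumed to hold a_0, ..., a_n in positions 1 to n+1.

    function p = hornereval(a, x)
    % HORNEREVAL  Evaluate a_0 + a_1*x + ... + a_n*x^n by nested multiplication.
    % a(1) = a_0, ..., a(n+1) = a_n;  x may be a scalar or an array.
    n = length(a) - 1;
    p = a(n+1) * ones(size(x));
    for j = n:-1:1
        p = a(j) + x .* p;      % one multiplication and one addition per coefficient
    end
    end

For example, hornereval([1 2 3], 2) returns 1 + 2·2 + 3·4 = 17.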
NESTED MULTIPLICATION
Imagine we are evaluating the polynomial
    p(x) = a_0 + a_1·x + a_2·x^2 + ... + a_n·x^n
at a point x = z. Thus with nested multiplication
    p(z) = a_0 + z(a_1 + z(a_2 + ... + z(a_(n−1) + a_n·z) ... )).
We can write this as the following sequence of oper-
ations:
    b_n = a_n
    b_(n−1) = a_(n−1) + z·b_n
    b_(n−2) = a_(n−2) + z·b_(n−1)
      ...
    b_0 = a_0 + z·b_1
The quantities b_(n−1), ..., b_0 are simply the quantities in
parentheses, starting from the innermost and working
outward.
Introduce
    q(x) = b_1 + b_2·x + b_3·x^2 + ... + b_n·x^(n−1).
Claim:
    p(x) = b_0 + (x − z)·q(x)        (*)
Proof: Simply expand
    b_0 + (x − z)(b_1 + b_2·x + b_3·x^2 + ... + b_n·x^(n−1))
and use the fact that
    a_(j−1) = b_(j−1) − z·b_j,   j = 1, ..., n.
With this result (*), we have
    p(x)/(x − z) = b_0/(x − z) + q(x).
Thus q(x) is the quotient when dividing p(x) by x − z,
and b_0 is the remainder.
If z is a zero of p(x), then b_0 = 0; and then
    p(x) = (x − z)·q(x).
For the remaining roots of p(x), we can concentrate
on finding those of q(x). In rootfinding for polynomi-
als, this process of reducing the size of the problem is
called deflation.
Another consequence of (*) is the following. Form
the derivative of (*) with respect to x, obtaining
    p'(x) = (x − z)·q'(x) + q(x)
    p'(z) = q(z).
Thus to evaluate p(x) and p'(x) simultaneously at x =
z, we can use nested multiplication for p(z) and we
can use the intermediate steps of this to also evaluate
p'(z). This is useful when doing rootfinding problems
for polynomials by means of Newton's method.
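A sketch of this idea in MATLAB (added here, not from the notes); the coefficients are again assumed to be stored as a(1) = a_0, ..., a(n+1) = a_n.

    function [pz, dpz] = hornerderiv(a, z)
    % HORNERDERIV  Nested multiplication giving p(z) and p'(z) = q(z).
    % The b_j of the text are produced implicitly in pz; dpz accumulates q(z).
    n   = length(a) - 1;
    pz  = a(n+1);      % b_n
    dpz = 0;           % q(z), built from the same recurrence
    for j = n:-1:1
        dpz = pz + z * dpz;     % q(z) uses the previously computed b_j
        pz  = a(j) + z * pz;    % b_(j-1) = a_(j-1) + z*b_j
    end
    end

For instance, with p(x) = 1 + 2x + 3x^2 and z = 2, hornerderiv([1 2 3], 2) returns pz = 17 and dpz = 14, matching p(2) and p'(2).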
APPROXIMATING SF(x)
Define
    SF(x) = (1/x)·∫_0^x (sin t)/t dt,   x ≠ 0.
We use Taylor polynomials to approximate this func-
tion, to obtain a way to compute it with accuracy and
simplicity.
[Figure: graph of y = SF(x) for −8 ≤ x ≤ 8; vertical axis marked at 0.5 and 1.0.]
As an example, begin with the degree 3 Taylor ap-
proximation to sin t, expanded about t = 0:
    sin t = t − (1/6)t^3 + (1/120)t^5·cos c_t
with c_t between 0 and t. Then
    (sin t)/t = 1 − (1/6)t^2 + (1/120)t^4·cos c_t
    ∫_0^x (sin t)/t dt = ∫_0^x [1 − (1/6)t^2 + (1/120)t^4·cos c_t] dt
                       = x − (1/18)x^3 + (1/120)·∫_0^x t^4·cos c_t dt
    (1/x)·∫_0^x (sin t)/t dt = 1 − (1/18)x^2 + R_2(x)
    R_2(x) = (1/120)·(1/x)·∫_0^x t^4·cos c_t dt.
How large is the error in the approximation
    SF(x) ≈ 1 − (1/18)x^2
on the interval [−1, 1]? Since |cos c_t| ≤ 1, we have
for x > 0 that
    0 ≤ R_2(x) ≤ (1/120)·(1/x)·∫_0^x t^4 dt = (1/600)x^4
and the same result can be shown for x < 0. Then
for |x| ≤ 1, we have
    0 ≤ R_2(x) ≤ 1/600.
To obtain a more accurate approximation, we can pro-
ceed exactly as above, but simply use a higher degree
approximation to sin t.
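The bound can be checked numerically; the MATLAB sketch below (added here, and using an assumed test point x = 0.8) compares the approximation against adaptive quadrature.

    % Sketch: check SF(x) ~ 1 - x^2/18 against numerical integration.
    sintt  = @(t) sin(t) ./ max(t, realmin);   % sin(t)/t, guarded at t = 0 (limit is 1)
    x      = 0.8;
    SFnum  = integral(sintt, 0, x) / x;
    SFapp  = 1 - x^2/18;
    errR2  = SFnum - SFapp        % should satisfy 0 <= errR2 <= x^4/600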
BINARY INTEGERS
A binary integer x is a finite sequence of the digits 0
and 1, which we write symbolically as
    x = (a_m a_(m−1) ... a_2 a_1 a_0)_2
where I insert the parentheses with subscript ( )_2 in
order to make clear that the number is binary. The
above has the decimal equivalent
    x = a_m·2^m + a_(m−1)·2^(m−1) + ... + a_1·2 + a_0.
For example, the binary integer x = (110101)_2 has
the decimal value
    x = 2^5 + 2^4 + 2^2 + 2^0 = 53.
The binary integer x = (111...1)_2 with m ones has
the decimal value
    x = 2^(m−1) + ... + 2 + 1 = 2^m − 1.
DECIMAL TO BINARY INTEGER CONVERSION
Given a decimal integer x, we write
    x = (a_m a_(m−1) ... a_2 a_1 a_0)_2 = a_m·2^m + a_(m−1)·2^(m−1) + ... + a_1·2 + a_0.
Divide x by 2, calling the quotient x_1. The remainder
is a_0, and
    x_1 = a_m·2^(m−1) + a_(m−1)·2^(m−2) + ... + a_1·2^0.
Continue the process. Divide x_1 by 2, calling the quo-
tient x_2. The remainder is a_1, and
    x_2 = a_m·2^(m−2) + a_(m−1)·2^(m−3) + ... + a_2·2^0.
After a finite number of such steps, we will obtain all
of the coefficients a_i, and the final quotient will be
zero.
Try this with a few decimal integers.
EXAMPLE
The following shortened form of the above method is
convenient for hand computation. Convert (11)_10 to
binary.
    ⌊11/2⌋ = 5 = x_1,   a_0 = 1
    ⌊5/2⌋  = 2 = x_2,   a_1 = 1
    ⌊2/2⌋  = 1 = x_3,   a_2 = 0
    ⌊1/2⌋  = 0 = x_4,   a_3 = 1
In this, the notation ⌊b⌋ denotes the largest integer
≤ b, and ⌊n/2⌋ denotes the quotient resulting from
dividing 2 into n. From the above calculation,
(11)_10 = (1011)_2.
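The same repeated-division process is easy to automate; a MATLAB sketch (added here, not from the notes):

    % Sketch: decimal-to-binary conversion by repeated division, as in the example.
    x = 11;
    bits = [];                       % will hold a_0, a_1, a_2, ... in that order
    while x > 0
        bits(end+1) = mod(x, 2);     % remainder is the next binary digit
        x = floor(x / 2);            % quotient becomes the new x
    end
    fliplr(bits)                     % prints 1 0 1 1, i.e. (1011)_2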
BINARY FRACTIONS
A binary fraction x is a sequence (possibly infinite) of
the digits 0 and 1:
    x = (.a_1 a_2 a_3 ... a_m ...)_2 = a_1·2^(−1) + a_2·2^(−2) + a_3·2^(−3) + ...
For example, x = (.1101)_2 has the decimal value
    x = 2^(−1) + 2^(−2) + 2^(−4) = .5 + .25 + .0625 = 0.8125.
Recall the formula for the geometric series
    Σ_{i=0}^{n} r^i = (1 − r^(n+1))/(1 − r),   r ≠ 1.
Letting n → ∞ with |r| < 1, we obtain the formula
    Σ_{i=0}^{∞} r^i = 1/(1 − r),   |r| < 1.
Using this,
    (.0101010101010...)_2 = 2^(−2) + 2^(−4) + 2^(−6) + ... = 2^(−2)·(1 + 2^(−2) + 2^(−4) + ...)
which sums to the fraction 1/3.
Also,
    (.11001100110011...)_2 = 2^(−1) + 2^(−2) + 2^(−5) + 2^(−6) + ...
and this sums to the decimal fraction 0.8 = 8/10.
DECIMAL TO BINARY FRACTION CONVERSION
In
    x_1 = (.a_1 a_2 a_3 ... a_m ...)_2 = a_1·2^(−1) + a_2·2^(−2) + a_3·2^(−3) + ...
we multiply by 2. The integer part will be a_1; and
after it is removed we have the binary fraction
    x_2 = (.a_2 a_3 ... a_m ...)_2 = a_2·2^(−1) + a_3·2^(−2) + a_4·2^(−3) + ...
Again multiply by 2, obtaining a_2 as the integer part of
2·x_2. After removing a_2, let x_3 denote the remaining
number. Continue this process as far as needed.
For example, with x = 1/5, we have
    x_1 = .2;   2·x_1 = .4;   x_2 = .4 and a_1 = 0
    2·x_2 = .8;   x_3 = .8 and a_2 = 0
    2·x_3 = 1.6;   x_4 = .6 and a_3 = 1
Continue this to get the pattern
    (.2)_10 = (.00110011001100...)_2
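The doubling process is also easy to automate; a MATLAB sketch (added here, not from the notes):

    % Sketch: generate the first binary digits of the fraction 0.2 by repeated doubling.
    x = 0.2;
    digits = zeros(1, 12);
    for k = 1:12
        x = 2 * x;
        digits(k) = floor(x);        % integer part is the next binary digit a_k
        x = x - digits(k);           % keep only the fractional part
    end
    digits                            % prints 0 0 1 1 0 0 1 1 0 0 1 1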
DECIMAL FLOATING-POINT NUMBERS
Floating point notation is akin to what is called scien-
tific notation. For a nonzero number x, we can write
it in the form
    x = σ · x̄ · 10^e
where σ = ±1, e is an integer, and 1 ≤ x̄ < 10. The num-
ber σ is called the sign, e is the exponent, and x̄ is the significand
or mantissa.
For example,
    345.78 = +3.4578 × 10^2
where σ = +1, e = 2, x̄ = 3.4578.
On a decimal computer or calculator, we store x by
instead storing σ, e and x̄. We must restrict the num-
ber of digits in x̄ and the size of the exponent e. The
number of digits in x̄ is called the precision.
For example, on an HP-15C calculator, the precision is 10,
and the exponent is restricted to −99 ≤ e ≤ 99.
BINARY FLOATING-POINT NUMBERS
We now do something similar with the binary repre-
sentation of a number x. Write
    x = σ · x̄ · 2^e
with 1 ≤ x̄ < (10)_2 = (2)_10 and e an integer.
For example,
    (0.1)_10 = (.000110011001100...)_2 = + (1.10011001100...)_2 × 2^(−4)
with σ = +1, x̄ = (1.10011001100...)_2, and e = −4.
The number x is stored in the computer by storing the
σ, x̄, and e. On all computers, there are restrictions
on the number of digits in x̄ and the size of e.
FLOATING POINT NUMBERS
When a number x outside a computer or calculator
is converted into a machine number, we denote it by
fl(x). On an HP calculator,
    fl(.3333...) = (3.333333333)_10 × 10^(−1).
The decimal fraction of infinite length will not fit in
the registers of the calculator, but the latter 10-digit
number will fit. Some calculators actually carry more
digits internally than they allow to be displayed. On
a binary computer, we use a similar notation.
We will concentrate on a particular form of computer
floating point number, called the IEEE floating
point standard.
Example 1. Consider a binary floating point represen-
tation with precision 3, and e_min = −2 ≤ e ≤ 2 =
e_max. All the numbers admitted by this representation
are presented in the table (decimal values):

    x̄ \ e      −2        −1        0        1       2
    (1.00)_2   0.25      0.5       1        2       4
    (1.01)_2   0.3125    0.625     1.25     2.5     5
    (1.10)_2   0.375     0.75      1.5      3       6
    (1.11)_2   0.4375    0.875     1.75     3.5     7

[Figure: these representable numbers plotted on a number line from 0 to 7.]
This representation can be extended to include smaller
numbers called denormalized numbers. These num-
bers are obtained if e = e_min and the first digit of the
significand is 0.
Example 2. Previous example plus denormalized num-
bers:
    (0.01)_2 × 2^(−2) = 1/16 = (0.0625)_10
    (0.10)_2 × 2^(−2) = 2/16 = (0.125)_10
    (0.11)_2 × 2^(−2) = 3/16 = (0.1875)_10
[Figure: the same number line from 0 to 7, with the denormalized values added between 0 and 0.25.]
IEEE SINGLE PRECISION STANDARD
In IEEE single precision, 32 bits are used to store num-
bers. A number is written as
    x = σ · (1.a_1 a_2 ... a_23)_2 · 2^e.
The significand x̄ = (1.a_1 a_2 ... a_23)_2 immediately sat-
isfies 1 ≤ x̄ < 2.
What are the limits on e? To understand the limits
on e and the number of binary digits chosen for x̄, we
must look roughly at how the number x will be stored
in the computer.
Basically, we store σ as a single bit, the significand x̄
as 24 bits (only 23 need be stored), and the exponent
fills out 8 bits, including both negative and positive
integers.
Roughly speaking, we have that e must satisfy
    −(1111111)_2 ≤ e ≤ (1111111)_2
or in decimal
    −127 ≤ e ≤ 127.
In actuality, the limits are
    −126 ≤ e ≤ 127
for reasons related to the storage of 0 and other num-
bers such as ±∞. In order to avoid a sign for the ex-
ponent, denote E = e + 127.
Obviously, 1 ≤ E ≤ 254, with two additional values:
0 and 255.
The bit layout is
    σ   E           x̄
    b_1 b_2 ... b_9 b_10 ... b_32
The number x = 0 is stored in the following way: E = 0,
σ = 0 and b_10 b_11 ... b_32 = (00...0)_2.

    E = (b_2 ... b_9)_2        e      x
    (00000000)_2 = 0        −127    ±(0.b_10 ... b_32)_2 · 2^(−126)
    (00000001)_2 = 1        −126    ±(1.b_10 ... b_32)_2 · 2^(−126)
    (00000010)_2 = 2        −125    ±(1.b_10 ... b_32)_2 · 2^(−125)
     ...                     ...     ...
    (01111111)_2 = 127         0    ±(1.b_10 ... b_32)_2 · 2^0
    (10000000)_2 = 128         1    ±(1.b_10 ... b_32)_2 · 2^1
     ...                     ...     ...
    (11111101)_2 = 253       126    ±(1.b_10 ... b_32)_2 · 2^126
    (11111110)_2 = 254       127    ±(1.b_10 ... b_32)_2 · 2^127
    (11111111)_2 = 255       128    ±∞ if all b_i = 0; NaN otherwise
IEEE DOUBLE PRECISION STANDARD
    x = σ · (1.a_1 a_2 ... a_52)_2 · 2^e.
The bit layout is
    σ   E            x̄
    b_1 b_2 ... b_12 b_13 ... b_64
where E = e + 1023.

    E = (b_2 ... b_12)_2          e       x
    (00000000000)_2 = 0       −1023    ±(0.b_13 ... b_64)_2 · 2^(−1022)
    (00000000001)_2 = 1       −1022    ±(1.b_13 ... b_64)_2 · 2^(−1022)
    (00000000010)_2 = 2       −1021    ±(1.b_13 ... b_64)_2 · 2^(−1021)
     ...                       ...      ...
    (01111111111)_2 = 1023        0    ±(1.b_13 ... b_64)_2 · 2^0
    (10000000000)_2 = 1024        1    ±(1.b_13 ... b_64)_2 · 2^1
     ...                       ...      ...
    (11111111101)_2 = 2045     1022    ±(1.b_13 ... b_64)_2 · 2^1022
    (11111111110)_2 = 2046     1023    ±(1.b_13 ... b_64)_2 · 2^1023
    (11111111111)_2 = 2047     1024    ±∞ if all b_i = 0; NaN otherwise
What is the connection of the 24 bits in the significand
x̄ to the number of decimal digits in the storage of
a number x into floating point form? One way of
answering this is to find the integer M for which
1. 0 < x ≤ M and x an integer implies fl(x) = x; and
2. fl(M + 1) ≠ M + 1.
This integer M is at least as big as
    (11...1)_2  (24 ones)  =  (1.11...1)_2 · 2^23  =  2^23 + 2^22 + ... + 2^0  =  2^24 − 1.
Also, 2^24 = (1.00...0)_2 · 2^24 will be stored exactly.
The next integer, 2^24 + 1, cannot be stored exactly, since its
significand would contain 24 + 1 binary digits:
    2^24 + 1 = (1.00...01)_2 · 2^24   (with 23 zeros between the leading 1 and the trailing 1).
Therefore for single precision M = 2^24. Any integer
less than or equal to M will be stored exactly. So
    M = 2^24 = 16777216.
For the IEEE double precision standard we have
    M = 2^53 ≈ 9.0 × 10^15.
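This threshold can be seen directly in MATLAB, which works in IEEE double precision (a sketch added here, not from the notes):

    % Sketch: every integer up to M = 2^53 is exact in double precision,
    % but 2^53 + 1 cannot be represented and rounds back to 2^53.
    M = 2^53;
    M == M + 1          % returns true: the +1 is lost
    M - 1 == M          % returns false: integers below M are exact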
THE MACHINE EPSILON
Let y be the smallest number representable in the ma-
chine arithmetic that is greater than 1 in the machine.
The machine epsilon is ε = y − 1. It is a widely used
measure of the accuracy possible in representing num-
bers in the machine.
The number 1 has the simple floating point represen-
tation
    1 = (1.00...0)_2 · 2^0.
What is the smallest machine number that is greater than 1?
It is
    1 + 2^(−23) = (1.00...01)_2 · 2^0 > 1
and the machine epsilon in IEEE single precision float-
ing point format is ε = 2^(−23) ≈ 1.19 × 10^(−7).
THE UNIT ROUND
Consider the smallest number δ > 0 that is repre-
sentable in the machine and for which
    1 + δ > 1
in the arithmetic of the machine.
For any number 0 < α < δ, the result of 1 + α is
exactly 1 in the machine's arithmetic. Thus α drops
off the end of the floating point representation in the
machine. The size of δ is another way of describing
the accuracy attainable in the floating point represen-
tation of the machine. The machine epsilon has been
replacing it in recent years.
It is not too difficult to derive δ. The number 1 has
the simple floating point representation
    1 = (1.00...0)_2 · 2^0.
What is the smallest number which can be added to
this without disappearing? Certainly we can write
    1 + 2^(−23) = (1.00...01)_2 · 2^0 > 1.
Past this point, we need to know whether we are us-
ing chopped arithmetic or rounded arithmetic. We
will shortly look at both of these. With chopped
arithmetic, δ = 2^(−23); and with rounded arithmetic,
δ = 2^(−24).
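In MATLAB (IEEE double precision, rounded arithmetic) the same ideas can be observed directly; a sketch added here, not from the notes:

    % Sketch: machine epsilon and the unit round in double precision.
    eps                      % 2^-52, the double precision machine epsilon
    1 + eps   > 1            % true: eps is the gap from 1 to the next machine number
    1 + eps/4 > 1            % false: a perturbation below the unit round drops off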
ROUNDING AND CHOPPING
Let us first consider these concepts with decimal arith-
metic. We write a computer floating point number z
as
    z = σ · (a_1.a_2 ... a_n)_10 · 10^e
with a_1 ≠ 0, so that there are n decimal digits in the
significand (a_1.a_2 ... a_n)_10.
Given a general number
    x = σ · (a_1.a_2 ... a_n a_(n+1) ...)_10 · 10^e,   a_1 ≠ 0,
we must shorten it to fit within the computer. This
is done by either chopping or rounding. The floating
point chopped version of x is given by
    fl(x) = σ · (a_1.a_2 ... a_n)_10 · 10^e
where we assume that e fits within the bounds re-
quired by the computer or calculator.
For the rounded version, we must decide whether to
round up or round down. A simplified formula is
    fl(x) = σ · (a_1.a_2 ... a_n)_10 · 10^e                     if a_(n+1) < 5
    fl(x) = σ · [(a_1.a_2 ... a_n)_10 + (0.0...1)_10] · 10^e    if a_(n+1) ≥ 5
The term (0.0...1)_10 denotes 10^(−n+1), giving the or-
dinary sense of rounding with which you are familiar.
In the single case
    (0.0...0 a_(n+1) a_(n+2) ...)_10 = (0.0...0500...)_10
a more elaborate procedure is used so as to assure an
unbiased rounding.
CHOPPING/ROUNDING IN BINARY
Let
    x = σ · (1.a_2 ... a_n a_(n+1) ...)_2 · 2^e
with all a_i equal to 0 or 1. Then for a chopped floating
point representation, we have
    fl(x) = σ · (1.a_2 ... a_n)_2 · 2^e.
For a rounded floating point representation, we have
    fl(x) = σ · (1.a_2 ... a_n)_2 · 2^e                    if a_(n+1) = 0
    fl(x) = σ · [(1.a_2 ... a_n)_2 + (0.0...1)_2] · 2^e    if a_(n+1) = 1
ERRORS
The error x − fl(x) = 0 when x needs no change to be
put into the computer or calculator. Of more interest
is the case when the error is nonzero. Consider first
the case x > 0 (meaning σ = +1). The case with
x < 0 is the same, except for the sign being opposite.
With x ≠ fl(x), and using chopping, we have
    fl(x) < x
and the error x − fl(x) is always positive. This later
has major consequences in extended numerical com-
putations. With x ≠ fl(x) and rounding, the error
x − fl(x) is negative for half the values of x, and it is
positive for the other half of possible values of x.
We often write the relative error as
    (x − fl(x))/x = −ε.
This can be expanded to obtain
    fl(x) = (1 + ε)·x.
Thus fl(x) can be considered as a perturbed value
of x. This is used in many analyses of the effects of
chopping and rounding errors in numerical computa-
tions.
For bounds on ε, we have
    −2^(−n) ≤ ε ≤ 2^(−n)      with rounding
    −2^(−n+1) ≤ ε ≤ 0         with chopping
IEEE ARITHMETIC
We are only giving the minimal characteristics of IEEE
arithmetic. There are many options available on the
types of arithmetic and the chopping/rounding. The
default arithmetic uses rounding.
Single precision arithmetic:
    n = 24,   −126 ≤ e ≤ 127.
This results in
    M = 2^24 = 16777216,   ε = 2^(−23) ≈ 1.19 × 10^(−7).
Double precision arithmetic:
    n = 53,   −1022 ≤ e ≤ 1023.
What are M and ε?
There is also an extended representation, having n =
69 digits in its significand.
MATLAB can be used to generate the binary floating
point representation of a number.
Execute in MATLAB the command:
    format hex
This will cause all subsequent numerical output to the
screen to be given in hexadecimal format (base 16).
For example, listing the number 7.125 results in an
output of
    401c800000000000
The 16 hexadecimal digits are
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f}
To obtain the binary representation, convert each hex-
adecimal digit to a four digit binary number according
to the table below:

    hex  binary     hex  binary
     0   0000        8   1000
     1   0001        9   1001
     2   0010        a   1010
     3   0011        b   1011
     4   0100        c   1100
     5   0101        d   1101
     6   0110        e   1110
     7   0111        f   1111
For the above number, we obtain the binary expansion
     4    0    1    c    8    0        0
    0100 0000 0001 1100 1000 0000 ... 0000
Grouping the bits according to the double precision format gives
    σ = 0,   E = (10000000001)_2,   fraction b_13 b_14 ... b_64 = 1100100000...0
which provides us with the IEEE double precision rep-
resentation of 7.125.
SOME DEFINITIONS
Let x_T denote the true value of some number, usually
unknown in practice; and let x_A denote an approxi-
mation of x_T.
The error in x_A is
    error(x_A) = x_T − x_A.
The relative error in x_A is
    rel(x_A) = error(x_A)/x_T = (x_T − x_A)/x_T.
Example: x_T = e, x_A = 19/7. Then
    error(x_A) = e − 19/7 ≈ 0.003996
    rel(x_A) = 0.003996/e ≈ 0.00147.
Relative error is more telling in representing the dif-
ference between the true value and the approximate one.
Example: Suppose the distance between two cities is
D_T = 100 km and let this distance be approximated
by D_A = 99 km. In this case,
    Err(D_A) = D_T − D_A = 1 km,
    Rel(D_A) = Err(D_A)/D_T = 0.01 = 1%.
Now, suppose that a distance is d_T = 2 km and we esti-
mate it with d_A = 1 km. Then
    Err(d_A) = d_T − d_A = 1 km,
    Rel(d_A) = Err(d_A)/d_T = 0.5 = 50%.
In both cases the error is the same. But, obviously,
D_A is a better approximation of D_T than d_A is of d_T.
Numerical Analysis
conf. dr. Viorel Bostan
Fall 2010, Lecture 3
Sources of Error
The sources of error in the computation of the solution of a
mathematical model for some physical situation can be roughly
characterised as follows:
1. Modelling Error.
Consider the example of a projectile of mass m that is travelling
through the earth's atmosphere. A simple and often-used
description of projectile motion is given by
    m·d²r(t)/dt² = −m·g·k − b·dr(t)/dt
with b ≥ 0. In this, r(t) is the vector position of the projectile,
and the final term in the equation represents the friction force in the air.
If this model of the physical situation is in error, then the
numerical solution of the equation is not going to improve the
results.
Sources of Error
2. Physical / Observational / Measurement Error.
The radius of an electron is given by
    (2.81777 + ε) × 10^(−13) cm,   |ε| ≤ 0.00011.
This error cannot be removed, and it must affect the accuracy of
any computation in which it is used.
We need to be aware of these effects and to arrange the
computation so as to minimize them.
Sources of Error
3. Approximation Error.
This is also called "discretization error" and "truncation error",
and it is the main source of error with which we deal in this course.
Such errors generally occur when we replace a computationally
unsolvable problem with a nearby problem that is more tractable
computationally.
For example, the Taylor polynomial approximation
    e^x ≈ 1 + x + (1/2)x^2
contains an "approximation error".
The numerical integration
    ∫_0^1 f(x) dx ≈ (1/N)·Σ_{j=1}^{N} f(j/N)
contains an approximation error.
Sources of Error
4. Finiteness of Algorithm Error.
This is an error due to stopping an algorithm after a finite number
of iterations.
Even if theoretically an algorithm can run for an indefinite time, after
a finite (usually specified) number of iterations the algorithm will
be stopped.
Sources of Error
5. Blunders.
In the pre-computer era, blunders were mostly arithmetic errors. In
the earlier years of the computer era, the typical blunder was a
programming bug. Present day "blunders" are still often
programming errors. But now they are often much more difficult to
find, as they are often embedded in very large codes which may
mask their effect.
Some simple rules to decrease the risk of having a bug in the code:
Break programs into small, testable subprograms;
Run test cases for which you know the outcome;
When running the full code, maintain a skeptical eye on the
output, checking whether the output is reasonable or not.
Sources of Error
6. Rounding/Chopping Error.
This is the main source of many problems, especially problems in
solving systems of linear equations. We will look later at the effects of
such errors.
Sources of Error
7. Finiteness of Precision Errors.
All the numbers stored in computer memory are subject to the
finiteness of the space allocated for their storage.
Pendulum Example
Original problem in engineering or in science to be solved:
[Figure: a pendulum displaced by an angle θ, with string tension T and weight mg acting on the bob.]
Model this physical problem mathematically.
Newton's second law provides us with:
    θ'' = −(g/l)·sin θ
or, written as a first-order system,
    θ' = ω
    ω' = −(g/l)·sin θ
Pendulum Example
Problem of continuous mathematics:
[Figure: the same pendulum, with tension T and weight mg.]
    θ' = ω
    ω' = −(g/l)·sin θ
Errors introduced at this stage:
Modeling Errors
Physical Errors
Pendulum Example
Mathematical Algorithms:
[Figure: the same pendulum, with tension T and weight mg.]
    θ_(n+1) = θ_n + h·ω_(n+1)
    ω_(n+1) = ω_n − h·(g/l)·sin(θ_n)
Errors introduced at this stage:
Discretisation Errors
Finiteness of Algorithm Errors
Pendulum Example
Computer Implementation:
[Figure: the same pendulum, with tension T and weight mg.]
    for i = 1:Nmax
        Omega = Omega - H*g/L*sin(Theta);
        Theta = Theta + H*Omega;
    end
Errors introduced at this stage:
Rounding / Chopping Errors
Bugs in the Code
Finite Precision Errors
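For completeness, here is a runnable version of the loop above (a sketch added here; the length L, step H, number of steps Nmax and the initial angle are assumed values, not taken from the notes):

    % Sketch: a complete, runnable version of the pendulum time-stepping loop.
    g = 9.8;  L = 1.0;            % gravitational constant and assumed pendulum length
    H = 0.01; Nmax = 1000;        % assumed time step and number of steps
    Theta = pi/4;  Omega = 0;     % assumed initial angle and angular velocity
    theta_hist = zeros(1, Nmax);
    for i = 1:Nmax
        Omega = Omega - H*g/L*sin(Theta);
        Theta = Theta + H*Omega;
        theta_hist(i) = Theta;    % record the angle for plotting
    end
    plot(H*(1:Nmax), theta_hist); xlabel('t'); ylabel('\theta(t)');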
Loss of significance errors
This can be considered a source of error or a consequence of the
finiteness of calculator and computer arithmetic.
Example. Define
    f(x) = x·(√(x + 1) − √x)
and consider evaluating it on a 6-digit decimal calculator which
uses rounded arithmetic.

    x        Computed f(x)   True f(x)   Error
    1        0.414221        0.414214    7.0000e-006
    10       1.54340         1.54347     7.0000e-005
    100      4.99000         4.98756     0.0024
    1000     15.8000         15.8074     0.0074
    10000    50.0000         49.9988     0.0012
    100000   100.000         158.113     58.1130
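The 6-digit calculator is not available in MATLAB, but single precision shows the same cancellation; a sketch added here (the test point x = 1e5 and the rationalized form in the last line are illustrative additions, not part of the notes):

    % Sketch: loss of significance in f(x) = x*(sqrt(x+1) - sqrt(x)).
    x = 1e5;
    f_single = double( single(x) .* (sqrt(single(x+1)) - sqrt(single(x))) )  % badly wrong
    f_double = x .* (sqrt(x+1) - sqrt(x))          % still accurate at this x
    f_stable = x ./ (sqrt(x+1) + sqrt(x))          % algebraically equivalent form, no cancellation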
Example. Define

    g(x) = (1 − cos x) / x²

and consider evaluating it on a 10-digit decimal calculator which uses rounded arithmetic.

          x    Computed g(x)    True g(x)       Error
        0.1    0.4995834700     0.4995834722    2.2000e−09
       0.01    0.4999960000     0.4999958333    1.6670e−07
      0.001    0.5000000000     0.4999999583    4.1700e−08
     0.0001    0.5000000000     0.4999999996    4.0000e−10
    0.00001    0.0              0.5000000000    0.5
Consider one case, that of x = 0.001. Then on the calculator:

    cos(0.001) = 0.9999994999
    1 − cos(0.001) = 5.001 × 10⁻⁷
    (1 − cos(0.001)) / (0.001)² = 0.5001000000

The true answer is g(0.001) = 0.4999999583. The relative error in our answer is

    (0.4999999583 − 0.5001) / 0.4999999583 = −0.0001000417 / 0.4999999583 ≈ −0.0002.

There are only 3 significant digits in the answer. How can such a straightforward and short calculation lead to such a large error (relative to the accuracy of the calculator)?
When two numbers are nearly equal and we subtract them, we suffer a "loss of significance error" in the calculation. In some cases these errors can be quite subtle and difficult to detect, and even after they are detected, they may be difficult to fix.

The last example, fortunately, can be fixed in a number of ways. The easiest is to use the trigonometric identity

    cos(2θ) = 2 cos²(θ) − 1 = 1 − 2 sin²(θ).

Let x = 2θ. Then

    g(x) = (1 − cos x) / x² = 2 sin²(x/2) / x² = (1/2) (sin(x/2) / (x/2))².

This latter formula, with x = 0.001, yields a computed value of 0.4999999584, nearly the true answer. We could also have used a Taylor polynomial for cos(x) around x = 0 to obtain a better approximation to g(x) for small values of x.
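The same comparison is easy to run in MATLAB, with single precision playing the role of the 10-digit calculator; this is only an illustrative sketch, not part of the original slides.

    % Cancellation in (1 - cos x)/x^2 versus the reformulated expression
    x = single(1e-4);                              % single precision stands in for the short calculator
    naive  = (1 - cos(x)) / x^2;                   % suffers loss of significance (gives 0 here)
    stable = 0.5 * (sin(x/2) / (x/2))^2;           % algebraically identical, no cancellation
    exact  = (1 - cos(double(x))) / double(x)^2;   % double-precision reference
    fprintf('naive = %.7f   stable = %.7f   exact = %.7f\n', naive, stable, exact)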
Another example

Evaluate e⁻⁵ using a Taylor polynomial approximation:

    e⁻⁵ = 1 + (−5)/1! + (−5)²/2! + (−5)³/3! + (−5)⁴/4! + (−5)⁵/5! + (−5)⁶/6! + ...

With n = 25, the error is

    |(−5)²⁶ / 26! · e^c| ≤ 10⁻⁸.

Imagine calculating this polynomial using a computer with 4-digit decimal arithmetic and rounding. To make the point about cancellation more strongly, imagine that each term of the polynomial is calculated exactly and then rounded to the arithmetic of the computer; the rounded terms are then added exactly, and the final sum is rounded to four digits.
    Degree    Term          Sum          Degree    Term           Sum
      0        1.000         1.000         13      −0.1960       −0.04230
      1       −5.000        −4.000         14       0.7001e−1     0.02771
      2       12.50          8.500         15      −0.2334e−1     0.004370
      3      −20.83        −12.33          16       0.7293e−2     0.01166
      4       26.04         13.71          17      −0.2145e−2     0.009518
      5      −26.04        −12.33          18       0.5958e−3     0.01011
      6       21.70          9.370         19      −0.1568e−3     0.009957
      7      −15.50         −6.130         20       0.3920e−4     0.009996
      8        9.688         3.558         21      −0.9333e−5     0.009987
      9       −5.382        −1.824         22       0.2121e−5     0.009989
     10        2.691         0.8670        23      −0.4611e−6     0.009989
     11       −1.223        −0.3560        24       0.9670e−7     0.009989
     12        0.5097        0.1537        25      −0.1921e−7     0.009989

The true answer is e⁻⁵ = 0.006738.
To understand more fully the source of the error, look at the numbers being added and their accuracy. For example,

    (−5)³ / 3! = −125/6 ≈ −20.83

in the 4-digit decimal calculation, with an error of magnitude 0.00333. Note that this error in an intermediate step is of the same magnitude as the true answer 0.006738 being sought. Other similar errors are present in calculating the other terms, and together they cause a major error in the final computed answer.

General principle
Whenever a sum is being formed in which the final answer is much smaller than some of the terms being combined, a loss of significance error is occurring.
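A rough MATLAB reproduction, with single precision standing in for the short 4-digit arithmetic of the slide; the reformulation as 1/e⁵ sums positive terms only and therefore avoids the cancellation. The details are illustrative, not taken from the slides.

    % Alternating Taylor series for exp(-5) versus the reformulation 1/exp(5)
    k = 0:25;
    terms  = (-5).^k ./ factorial(k);                   % exact terms (double precision)
    naive  = sum(single(terms));                        % terms rounded to single, summed in single
    stable = 1 / sum(single(5.^k ./ factorial(k)));     % positive terms only, then take the reciprocal
    fprintf('naive = %.6e   stable = %.6e   exact = %.6e\n', naive, stable, exp(-5))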
Noise in function evaluation

Consider plotting the function

    f(x) = (x − 1)³ = x³ − 3x² + 3x − 1 = −1 + x(3 + x(−3 + x)).

[Figure: graph of f(x) = (x − 1)³ on 0 ≤ x ≤ 2.]

[Figure: the same function evaluated through the expanded (nested) form and plotted on a tiny interval around x = 1 (roughly 0.99998 ≤ x ≤ 1.00002); the computed values scatter between about −8×10⁻¹⁵ and 8×10⁻¹⁵.]
Whenever a function f(x) is evaluated, there are arithmetic operations carried out which involve rounding or chopping errors. This means that what the computer eventually returns as an answer contains noise. This noise is generally "random" and small. But it can affect the accuracy of other calculations which depend on f(x).
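The noise in the second figure can be reproduced with a few lines of MATLAB; the plotting window is chosen to match that figure.

    % Noise in evaluating f(x) = (x-1)^3 through its expanded form near x = 1
    x = linspace(1 - 2e-5, 1 + 2e-5, 401);
    y = -1 + x.*(3 + x.*(-3 + x));     % expanded form: rounding errors are visible at this magnification
    plot(x, y, '.'), xlabel('x'), ylabel('y')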
Underflow errors

Consider evaluating f(x) = x¹⁰ for x near 0. When using IEEE single precision arithmetic, the smallest nonzero positive number expressible in normalized floating-point format is

    m = 2⁻¹²⁶ ≈ 1.18 × 10⁻³⁸.

Thus f(x) will be set to zero if

    x¹⁰ < m
    |x| < m^(1/10)
    |x| < 1.61 × 10⁻⁴
    −0.000161 < x < 0.000161.
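A quick check in MATLAB. Note that actual IEEE arithmetic underflows gradually through subnormal numbers, so the complete flush to zero happens somewhat below the bound above; this refinement is not on the slide.

    % Underflow of f(x) = x^10 in IEEE single precision
    for x = single([1e-3 2e-4 1e-4 1e-5])
        fprintf('x = %8.1e   x^10 = %12.5e\n', x, x^10)
    end
    % x = 1e-4 gives a result below realmin('single') ~ 1.18e-38, stored only as a
    % (less accurate) subnormal number; x = 1e-5 underflows all the way to zero.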
Overflow errors

Attempts to use numbers that are too large for the floating-point format lead to overflow errors. These are generally fatal errors on most computers. With the IEEE floating-point format, overflow errors can instead be carried along as having the value ±∞ or NaN, depending on the context. Usually an overflow error is an indication of a more significant problem or error in the program, and the user needs to be aware of such errors.

When using IEEE single precision arithmetic, the largest positive number expressible in normalized floating-point format is

    M = 2¹²⁸ (1 − 2⁻²⁴) ≈ 3.40 × 10³⁸.

Thus, f(x) = x¹⁰ will overflow if

    x¹⁰ > M
    |x| > M^(1/10)
    |x| > 7131.6.
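Unlike underflow, there is no gradual regime here: once x¹⁰ exceeds the single-precision maximum, the result is Inf. A short check, with values chosen around the threshold for illustration:

    % Overflow of f(x) = x^10 in IEEE single precision
    for x = single([7000 7131 7132 10000])
        fprintf('x = %7.0f   x^10 = %12.5e\n', x, x^10)
    end
    % values of |x| above roughly 7131.6 produce x^10 = Inf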
Numerical Analysis
conf. dr. Bostan Viorel
Fall 2010, Lecture 5
Loss of significance errors

Recall the example from the previous lecture: evaluating

    f(x) = x (√(x+1) − √x)

on a 6-digit decimal calculator with rounded arithmetic gives results that lose accuracy as x grows (for instance, the computed value f(100) = 4.99000 versus the true value 4.98756).
In order to localise the error, consider the case x = 100. The calculator with 6 decimal digits provides the values

    √100 = 10,    √101 = 10.0499.

Then

    √(x+1) − √x = √101 − √100 = 0.0499000,

while the exact value is 0.0498756. Three significant digits have been lost in the subtraction √101 − √100. The loss of precision is due to the form of the function f(x) and the finite precision of the 6-digit calculator.
In this particular case, we can avoid the loss of precision by rewriting the function:

    f(x) = x (√(x+1) − √x) · (√(x+1) + √x) / (√(x+1) + √x) = x / (√(x+1) + √x).

In this form we avoid the subtraction of nearly equal quantities. Doing so gives

    f(100) = 4.98756,

a value correct to 6 significant digits.
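The effect can also be seen on a computer, with single precision playing the role of the 6-digit calculator; an illustrative sketch, not part of the slides.

    % f(x) = x*(sqrt(x+1) - sqrt(x)) : direct form versus rationalized form
    x = single(100);
    direct       = x * (sqrt(x + 1) - sqrt(x));    % subtraction of nearly equal numbers
    rationalized = x / (sqrt(x + 1) + sqrt(x));    % no cancellation
    exact        = 100 * (sqrt(101) - sqrt(100));  % double-precision reference
    fprintf('direct = %.7f   rationalized = %.7f   exact = %.7f\n', direct, rationalized, exact)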
Propagation of errors
Propagation in arithmetic operations

Let ω denote one of the arithmetic operations +, −, ×, /, and let ω̂ denote the same operation as it is actually carried out in the computer, including its rounding or chopping error. Let x_A ≈ x_T and y_A ≈ y_T. We want to obtain x_T ω y_T, but we actually obtain x_A ω̂ y_A. The error in x_A ω̂ y_A is given by

    (x_T ω y_T) − (x_A ω̂ y_A).
The error in x_A ω̂ y_A can be rewritten as

    (x_T ω y_T) − (x_A ω̂ y_A) = (x_T ω y_T − x_A ω y_A) + (x_A ω y_A − x_A ω̂ y_A).

The final term is the error introduced by the inexactness of the machine arithmetic. For it, we usually assume

    x_A ω̂ y_A = fl(x_A ω y_A).

This means that the quantity x_A ω y_A is computed exactly and is then rounded or chopped to fit the answer into the floating-point representation of the machine.
The formula x_A ω̂ y_A = fl(x_A ω y_A) implies

    x_A ω̂ y_A = (x_A ω y_A)(1 + ε),

since fl(x) = x(1 + ε), where the limits for ε were given earlier. Then

    Rel(x_A ω̂ y_A) = [(x_A ω y_A) − (x_A ω̂ y_A)] / (x_A ω y_A)
                   = [(x_A ω y_A) − (x_A ω y_A)(1 + ε)] / (x_A ω y_A) = −ε.
With rounded binary arithmetic having n digits in the mantissa,

    −2⁻ⁿ ≤ ε ≤ 2⁻ⁿ.

Coming back to the error formula,

    (x_T ω y_T) − (x_A ω̂ y_A) = (x_T ω y_T − x_A ω y_A) + (x_A ω y_A − x_A ω̂ y_A),

the relative error of the second term is −ε, as shown above. The remaining term, x_T ω y_T − x_A ω y_A, is the propagated error. In what follows we examine it for particular cases.
Propagation in multiplication

Let ω = ×, and write

    x_T = x_A + ε,    y_T = y_A + η.

Then for the relative error in x_A·y_A:

    Rel(x_A·y_A) = (x_T·y_T − x_A·y_A) / (x_T·y_T)
                 = [x_T·y_T − (x_T − ε)(y_T − η)] / (x_T·y_T)
                 = [x_T·y_T − x_T·y_T + x_T·η + y_T·ε − ε·η] / (x_T·y_T)
                 = (x_T·η + y_T·ε − ε·η) / (x_T·y_T)
                 = ε/x_T + η/y_T − (ε/x_T)(η/y_T)
                 = Rel(x_A) + Rel(y_A) − Rel(x_A)·Rel(y_A).
Usually we have

    |Rel(x_A)| ≪ 1,    |Rel(y_A)| ≪ 1,

so we can neglect the last term Rel(x_A)·Rel(y_A), since it is much smaller than the previous two:

    Rel(x_A·y_A) = Rel(x_A) + Rel(y_A) − Rel(x_A)·Rel(y_A) ≈ Rel(x_A) + Rel(y_A).

Thus small relative errors in the arguments x_A and y_A lead to a small relative error in the product x_A·y_A. Also, note that there is some cancellation if these relative errors are of opposite sign.
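A two-line numerical check of this approximation; the values of x and y are arbitrary illustrative choices.

    % Check: Rel(xA*yA) is approximately Rel(xA) + Rel(yA)
    xT = 3.141592653589793; xA = 3.1416;     % Rel(xA) ~ -2.3e-6
    yT = 2.718281828459045; yA = 2.7183;     % Rel(yA) ~ -6.7e-6
    relx = (xT - xA)/xT;  rely = (yT - yA)/yT;
    relxy = (xT*yT - xA*yA)/(xT*yT);
    fprintf('Rel(x)+Rel(y) = %.3e,  Rel(xy) = %.3e\n', relx + rely, relxy)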
Propagation in division

There is a similar result for division:

    Rel(x_A / y_A) ≈ Rel(x_A) − Rel(y_A),

provided |Rel(y_A)| ≪ 1.
Propagation in addition and subtraction

For ω equal to + or −, we have

    (x_T ± y_T) − (x_A ± y_A) = (x_T − x_A) ± (y_T − y_A).

Thus the error in a sum is the sum of the errors in the original arguments, and similarly for subtraction. However, there is a more subtle error occurring here.
Example

Suppose you are solving

    x² − 26x + 1 = 0.

Using the quadratic formula, we have the true answers

    r_{1,T} = 13 + √168,    r_{2,T} = 13 − √168.

From a table of square roots, we take √168 ≈ 12.961. Since this is correctly rounded to 5 digits, we have

    |√168 − 12.961| ≤ 0.0005.

Then define

    r_{1,A} = 13 + 12.961 = 25.961,    r_{2,A} = 13 − 12.961 = 0.039.
Then for both roots,

    |r_T − r_A| ≤ 0.0005.

For the relative errors, however,

    |Rel(r_{1,A})| = |r_{1,T} − r_{1,A}| / r_{1,T} ≤ 0.0005 / 25.9605 ≈ 1.93 × 10⁻⁵,
    |Rel(r_{2,A})| = |r_{2,T} − r_{2,A}| / r_{2,T} ≤ 0.0005 / 0.0385 ≈ 0.0130.

Why does r_{2,A} have such poor accuracy in comparison to r_{1,A}?
The answer is the loss of significance error involved in the formula used to calculate r_{2,A}. Instead, use the mathematically equivalent formula

    r_{2,A} = 1 / (13 + √168) ≈ 1 / 25.961

(valid since (13 − √168)(13 + √168) = 169 − 168 = 1). This results in a much more accurate answer, at the expense of an additional division.
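Reproducing the example in MATLAB: round √168 to five significant digits, mimicking the table of square roots, and compare the two formulas for the small root.

    % Small root of x^2 - 26*x + 1 = 0 : subtraction versus the rationalized formula
    s = round(sqrt(168), 3);         % 12.961, i.e. sqrt(168) correctly rounded to 5 digits
    r2_subtract = 13 - s;            % loss of significance: 0.039
    r2_divide   = 1/(13 + s);        % rationalized formula
    r2_true     = 13 - sqrt(168);
    fprintf('subtract: %.6f   divide: %.6f   true: %.6f\n', r2_subtract, r2_divide, r2_true)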
Errors in function evaluation

Suppose we are evaluating a function f(x) in the machine. The result is generally not f(x), but rather an approximation of it, which we denote by f̃(x). Now suppose that we have a number x_A ≈ x_T. We want to calculate f(x_T), but instead we evaluate f̃(x_A). What can we say about the error in this computed quantity,

    f(x_T) − f̃(x_A)?
Write

    f(x_T) − f̃(x_A) = [f(x_T) − f(x_A)] + [f(x_A) − f̃(x_A)].

The quantity f(x_A) − f̃(x_A) is the "noise" in the evaluation of f(x_A) in the computer, and we will return later to some discussion of it. The quantity f(x_T) − f(x_A) is called the propagated error. It is the error that results even when perfect arithmetic is used in the evaluation of the function. If the function f(x) is differentiable, then we can use the mean-value theorem to write

    f(x_T) − f(x_A) = f′(ξ)(x_T − x_A)

for some ξ between x_T and x_A.
Since usually x_T and x_A are close together, we can say that ξ is close to either of them, and

    f(x_T) − f(x_A) = f′(ξ)(x_T − x_A) ≈ f′(x_T)(x_T − x_A) ≈ f′(x_A)(x_T − x_A).
Propagation of errors
Example
Define f(x) = b^x, where b is a positive real number. Then the last formula yields

    b^(x_T) - b^(x_A) ≈ (ln b) b^(x_T) (x_T - x_A)

    Rel(b^(x_A)) ≈ (ln b) b^(x_T) (x_T - x_A) / b^(x_T)
                 = (ln b)(x_T - x_A) · x_T / x_T
                 = x_T ln b · Rel(x_A)
                 = K · Rel(x_A)

Note that if K = 10^4 and Rel(x_A) = 10^(-7), then Rel(b^(x_A)) ≈ 10^(-3). This is a large decrease in accuracy, and it is independent of how we actually calculate b^x. The number K is called a condition number for the computation.
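A quick numerical check of this effect in MATLAB (a minimal sketch; the values of b, x_T and the perturbation below are chosen only for illustration):

    b  = 10;                      % base of the exponential
    xT = 4;                       % "true" argument, so K = xT*log(b) is about 9.2
    xA = xT*(1 + 1e-7);           % perturbed argument, Rel(xA) = -1e-7
    relx = (xT - xA)/xT;          % relative error in the argument
    relf = (b^xT - b^xA)/b^xT;    % relative error in b^x
    K = xT*log(b);
    fprintf('Rel(xA) = %.2e, Rel(b^xA) = %.2e, K*Rel(xA) = %.2e\n', relx, relf, K*relx);

The last two printed values should agree to two or three digits, confirming Rel(b^(x_A)) ≈ K·Rel(x_A).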
Summation
Let S be a sum with a relatively large number of terms

    S = a_1 + a_2 + ... + a_n     (1)

where a_j, j = 1, ..., n, are floating point numbers. The summation process consists of n - 1 consecutive additions

    S = (((...(a_1 + a_2) + a_3) + ... + a_(n-1)) + a_n.

Define

    S_2 = fl(a_1 + a_2)
    S_3 = fl(S_2 + a_3)
    S_4 = fl(S_3 + a_4)
    ...
    S_n = fl(S_(n-1) + a_n)

Recall the formula

    fl(x) = x(1 + ε)
Summation

    S_2 = (a_1 + a_2)(1 + ε_2)
    S_3 = (S_2 + a_3)(1 + ε_3)
    S_4 = (S_3 + a_4)(1 + ε_4)
    ...
    S_n = (S_(n-1) + a_n)(1 + ε_n)

Then

    S_3 = (S_2 + a_3)(1 + ε_3)
        = ((a_1 + a_2)(1 + ε_2) + a_3)(1 + ε_3)
        ≈ (a_1 + a_2 + a_3) + a_1(ε_2 + ε_3) + a_2(ε_2 + ε_3) + a_3 ε_3,
Summation
Similarly,

    S_4 ≈ (a_1 + a_2 + a_3 + a_4) + a_1(ε_2 + ε_3 + ε_4) + a_2(ε_2 + ε_3 + ε_4)
          + a_3(ε_3 + ε_4) + a_4 ε_4

Finally,

    S_n ≈ (a_1 + a_2 + ... + a_n) + a_1(ε_2 + ... + ε_n) + a_2(ε_2 + ... + ε_n)
          + a_3(ε_3 + ... + ε_n) + a_4(ε_4 + ... + ε_n) + ... + a_n ε_n
Summation
We are interested in the error S - S_n:

    S - S_n ≈ -a_1(ε_2 + ... + ε_n) - a_2(ε_2 + ... + ε_n) - a_3(ε_3 + ... + ε_n)
              - a_4(ε_4 + ... + ε_n) - ... - a_n ε_n

From the last relation we can establish a strategy for summation that minimizes the error S - S_n: initially rearrange the terms in increasing order,

    |a_1| ≤ |a_2| ≤ |a_3| ≤ ... ≤ |a_n|

In this case the smaller numbers a_1 and a_2 are multiplied by the larger factors ε_2 + ... + ε_n, while the larger number a_n is multiplied only by the smaller factor ε_n.
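A small MATLAB experiment along these lines (a sketch only; single precision is used to make the rounding visible, and the terms a_j = 1/j are an illustrative choice, not the sum tabulated below):

    n = 10000;
    a = single(1./(1:n));                    % terms 1/j, already sorted largest to smallest
    sLS = single(0);  sSL = single(0);
    for j = 1:n,    sLS = sLS + a(j); end    % largest-to-smallest order
    for j = n:-1:1, sSL = sSL + a(j); end    % smallest-to-largest order
    exact = sum(1./(1:n));                   % double-precision reference value
    fprintf('SL error = %.3e,  LS error = %.3e\n', abs(exact - double(sSL)), abs(exact - double(sLS)));

Summing from the smallest term upward should give the noticeably smaller error.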
Summation with chopping
(SL: terms summed from smallest to largest; LS: from largest to smallest)

    Number of terms, n   Exact value   SL      Error   LS      Error
    10                   2.929         2.928   0.001   2.927   0.002
    25                   3.816         3.813   0.003   3.806   0.010
    50                   4.499         4.491   0.008   4.470   0.020
    100                  5.187         5.170   0.017   5.142   0.045
    200                  5.878         5.841   0.037   5.786   0.092
    500                  6.793         6.692   0.101   6.569   0.224
    1000                 7.486         7.284   0.202   7.069   0.417
Summation with rounding

    Number of terms, n   Exact value   SL      Error   LS      Error
    10                   2.929         2.929   0       2.929   0
    25                   3.816         3.816   0       3.817   0.001
    50                   4.499         4.500   0.001   4.498   0.001
    100                  5.187         5.187   0       5.187   0
    200                  5.878         5.878   0       5.876   0.002
    500                  6.793         6.794   0.001   6.783   0.010
    1000                 7.486         7.486   0       7.449   0.037
Numerical Analysis
conf. dr. Bostan Viorel
Fall 2010, Lecture 6
Rootfinding
We want to find the numbers x for which

    f(x) = 0

with f : [a, b] → R a given real-valued function. Here, we denote such roots or zeroes by the Greek letter α. So

    f(α) = 0

Rootfinding problems occur in many contexts. Sometimes they are a direct formulation of some physical situation, but more often they are an intermediate step in solving a much larger problem.
Bisection method
Most methods for solving f(x) = 0 are iterative methods. This means that such a method, given an initial guess x_0, will provide us with a sequence of consecutively computed solutions x_1, x_2, x_3, ..., x_n, ... such that x_n → α.
We begin with the simplest of such methods, one which most people use at some time.
Suppose we are given a function f(x) and we assume we have an interval [a, b] containing the root, on which the function is continuous.
We also assume we are given an error tolerance ε > 0, and we want an approximate root x̃ ∈ [a, b] for which

    |α - x̃| < ε
Bisection method
The bisection method is based on the following theorem:

Theorem
If f : [a, b] → R is a continuous function on the closed and bounded interval [a, b] and

    f(a) · f(b) < 0

then there exists α ∈ [a, b] such that f(α) = 0.

Therefore, we further assume that the function f(x) changes sign on [a, b].
Bisection method
Bisection Algorithm: Bisect(f, a, b, ε)
Step 1: Define

    c = (a + b)/2

Step 2: If b - c ≤ ε, accept c as our root, and then stop.
Step 3: If b - c > ε, then compare the sign of f(c) to that of f(a) and f(b). If

    sign(f(b)) · sign(f(c)) ≤ 0

then replace a with c; otherwise, replace b with c. Return to Step 1.

Note that we prefer checking the sign using the condition sign(f(b)) · sign(f(c)) ≤ 0 instead of using f(b) · f(c) ≤ 0, since the product of the function values can underflow or overflow.
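A minimal MATLAB sketch of this algorithm (the function name bisect and its interface are my own choices, not a library routine):

    function c = bisect(f, a, b, tol)
    % BISECT  root of f in [a,b] by bisection; assumes f(a)*f(b) < 0.
    while true
        c = (a + b)/2;
        if b - c <= tol, return; end        % Step 2: accept c as the root
        if sign(f(b))*sign(f(c)) <= 0       % Step 3: the root lies in [c,b]
            a = c;
        else                                % otherwise the root lies in [a,c]
            b = c;
        end
    end
    end

For the example that follows, bisect(@(x) x.^6 - x - 1, 1, 2, 1e-3) should return c ≈ 1.1338 after ten halvings.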
Bisection method
[Figure: the bisection iterations on y = f(x); the interval [a_1, b_1] is halved at c_1, giving [a_2, b_2] with a_2 = c_1 and b_2 = b_1, and then halved again at c_2, closing in on the root α.]
Bisection method
Example
Consider the function

    f(x) = x^6 - x - 1

We want to find the largest root with an accuracy of ε = 0.001. It can be seen from the graph of the function that this root is located in [1, 2]. Also, note that the function is continuous. Let a = 1 and b = 2; then f(a) = -1 and f(b) = 61, consequently the function changes its sign and thus all conditions are satisfied.
Bisection method

    n    a_n       b_n       c_n        f(c_n)       b_n - c_n
    1    1.00000   2.00000   1.50000     8.891e+00   5.000e-01
    2    1.00000   1.50000   1.25000     1.565e+00   2.500e-01
    3    1.00000   1.25000   1.12500    -9.771e-02   1.250e-01
    4    1.12500   1.25000   1.18750     6.167e-01   6.250e-02
    5    1.12500   1.18750   1.15625     2.333e-01   3.125e-02
    6    1.12500   1.15625   1.14063     6.158e-02   1.563e-02
    7    1.12500   1.14063   1.13281    -1.958e-02   7.813e-03
    8    1.13281   1.14063   1.13672     2.062e-02   3.906e-03
    9    1.13281   1.13672   1.13477     4.268e-04   1.953e-03
    10   1.13281   1.13477   1.13379    -9.598e-03   9.766e-04
Error analysis for bisection method
Let a_n, b_n and c_n be the values provided by the bisection method at iteration n. Evidently,

    b_(n+1) - a_(n+1) = (1/2)(b_n - a_n)

    b_n - a_n = (1/2)(b_(n-1) - a_(n-1))
              = (1/2^2)(b_(n-2) - a_(n-2))
              = ...
              = (1/2^(n-1))(b - a)

Since either α ∈ [a_n, c_n] or α ∈ [c_n, b_n], we have

    |α - c_n| ≤ c_n - a_n = b_n - c_n = (1/2)(b_n - a_n) = (1/2^n)(b - a)
Error analysis for bisection method

    |α - c_n| ≤ (1/2^n)(b - a)

This relation provides us with a stopping criterion for the bisection method. Moreover, it follows that c_n → α as n → ∞.
Suppose we want to estimate the number of iterations of the bisection method necessary to find the root with an error tolerance ε:

    (1/2^n)(b - a) ≤ ε   ⟹   n ≥ ln((b - a)/ε) / ln 2

For the previous example we get

    n ≥ ln(1/0.001) / ln 2 ≈ 9.97
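The same estimate can be evaluated directly in MATLAB (a one-line check of the bound above):

    a = 1; b = 2; tol = 1e-3;
    n = log((b - a)/tol)/log(2)      % gives about 9.97, so 10 bisections suffice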
Advantages and Disadvantages of Bisection method
Advantages:
1. It always converges.
2. You have a guaranteed error bound, and it decreases with each successive iteration.
3. You have a guaranteed rate of convergence. The error bound decreases by 1/2 with each iteration.
Disadvantages:
1. It is relatively slow when compared with other rootfinding methods we will study, especially when the function f(x) has several continuous derivatives about the root α.
2. The algorithm has no check to see whether the ε is too small for the computer arithmetic being used.
We also assume the function f(x) is continuous on the given interval [a, b]; but there is no way for the computer to confirm this.
Rootfinding
We want to find the root α of a given function f(x). Thus we want to find the point x at which the graph of y = f(x) intersects the x-axis. One of the principles of numerical analysis is the following.

Numerical Analysis Principle
If you cannot solve the given problem, then solve a "nearby problem".

How do we obtain a nearby problem for f(x) = 0?
Begin first by asking for types of problems which we can solve easily. At the top of the list should be that of finding where a straight line intersects the x-axis.
Thus we seek to replace f(x) = 0 by the problem of solving p(x) = 0 for some linear polynomial p(x) that approximates f(x) in the vicinity of the root α.
Rootfinding
[Figure: the tangent line to y = f(x) at (x_0, f(x_0)) crosses the x-axis at x_1, which lies closer to the root α than x_0.]
Newton's method
Let x_0 be an initial guess, sufficiently close to the root α. Consider the tangent line to the graph of f(x) at (x_0, f(x_0)). The tangent intersects the x-axis at x_1, a point closer to α. The tangent has equation

    p_1(x) = f(x_0) + f'(x_0)(x - x_0)

Since p_1(x_1) = 0 we get

    f(x_0) + f'(x_0)(x_1 - x_0) = 0
    x_1 = x_0 - f(x_0)/f'(x_0)

Similarly, we get x_2:

    x_2 = x_1 - f(x_1)/f'(x_1)
Newton's method
Repeat this process to obtain the sequence x_1, x_2, x_3, ... that hopefully will converge to α.
The general scheme of Newton's method: starting with the initial guess x_0, compute iteratively

    x_(n+1) = x_n - f(x_n)/f'(x_n),   n = 0, 1, 2, ...
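A minimal MATLAB sketch of this scheme (newton is a hypothetical helper written for these notes, not a built-in; it uses the step size |x_(n+1) - x_n| as a simple stopping test):

    function x = newton(f, df, x0, tol, nmax)
    % NEWTON  iterate x = x - f(x)/df(x) starting from x0.
    x = x0;
    for n = 1:nmax
        dx = f(x)/df(x);
        x  = x - dx;
        if abs(dx) <= tol, return; end   % step size used as an error estimate
    end
    end

For the example below, newton(@(x) x.^6 - x - 1, @(x) 6*x.^5 - 1, 1.5, 1e-10, 20) should reproduce the iterates in the table.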
Newton's method
Example
Apply Newton's method to

    f(x) = x^6 - x - 1,   f'(x) = 6x^5 - 1

to get

    x_(n+1) = x_n - (x_n^6 - x_n - 1)/(6 x_n^5 - 1),   n ≥ 0

Use the initial guess x_0 = 1.5.
Newton's method

    n   x_n          f(x_n)      x_n - x_(n-1)   α - x_(n-1)
    0   1.50000000   8.89e+0
    1   1.30049088   2.54e+0     -2.00e-1        -3.65e-1
    2   1.18148042   5.38e-1     -1.19e-1        -1.66e-1
    3   1.13945559   4.92e-2     -4.20e-2        -4.68e-2
    4   1.13477763   5.50e-4     -4.68e-3        -4.73e-3
    5   1.13472415   7.11e-8     -5.35e-5        -5.35e-5
    6   1.13472414   1.55e-15    -6.91e-9        -6.91e-9

The true solution is α = 1.134724138.
Newton's method. Division example
Here we consider a division algorithm (based on Newton's method) implemented in some computers in the past. Say, we are interested in computing a/b = a · (1/b), where 1/b is computed using Newton's method. Consider

    f(x) = b - 1/x = 0,

with b positive. The root of this equation is α = 1/b. Since

    f'(x) = 1/x^2,

Newton's method for this problem becomes

    x_(n+1) = x_n - (b - 1/x_n)/(1/x_n^2)

Simplifying,

    x_(n+1) = x_n (2 - b x_n),   n ≥ 0
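A short MATLAB sketch of this division-free reciprocal iteration (purely illustrative; b and x_0 are arbitrary choices satisfying the convergence condition discussed next):

    b = 7;                    % we approximate 1/b = 0.142857...
    x = 0.1;                  % initial guess, must satisfy 0 < x0 < 2/b
    for n = 1:6
        x = x*(2 - b*x);      % x_{n+1} = x_n(2 - b*x_n): only * and - are used
    end
    fprintf('x = %.16f,  1/b = %.16f\n', x, 1/b);

Six iterations already give full double-precision accuracy here, reflecting the quadratic convergence derived below.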
Newton's method. Division example
The initial guess x_0 must be close enough to the true solution, and of course x_0 > 0. Consider the error

    α - x_(n+1) = 1/b - x_(n+1) = (1 - b x_(n+1))/b
                = (1 - b x_n (2 - b x_n))/b
                = (1 - b x_n)^2 / b

On the other hand,

    Rel(x_(n+1)) = (α - x_(n+1))/α = 1 - b x_(n+1)
Newton's method. Division example
It can be shown (try it!) that

    Rel(x_(n+1)) = (Rel(x_n))^2

In order to guarantee convergence x_n → α, we need

    |Rel(x_0)| < 1,   or equivalently   0 < x_0 < 2/b

For example, suppose that |Rel(x_0)| = 0.1. Then

    Rel(x_1) = 10^(-2),  Rel(x_2) = 10^(-4),  Rel(x_3) = 10^(-8),  Rel(x_4) = 10^(-16)
Newton's method. Division example
[Figure: the graph of y = b - 1/x, with root 1/b; the tangent at (x_0, f(x_0)) gives x_1, and convergence requires 0 < x_0 < 2/b.]
Error analysis for Newton's method
Let f(x) ∈ C^2[a, b] and α ∈ [a, b]. Also let f'(α) ≠ 0. Consider the Taylor formula for f(x) about x_n:

    f(x) = f(x_n) + (x - x_n) f'(x_n) + ((x - x_n)^2 / 2) f''(ξ_n),

where ξ_n is between x and x_n. Take x = α to get

    f(α) = f(x_n) + (α - x_n) f'(x_n) + ((α - x_n)^2 / 2) f''(ξ_n),

with ξ_n between α and x_n. Since f(α) = 0 we have

    0 = f(x_n)/f'(x_n) + (α - x_n) + (α - x_n)^2 · f''(ξ_n)/(2 f'(x_n))

    α - x_(n+1) = -(α - x_n)^2 · f''(ξ_n)/(2 f'(x_n))
Error analysis for Newton's method
For the previous example, f''(x) = 30x^4. We have

    f''(ξ_n)/(2 f'(x_n)) ≈ f''(α)/(2 f'(α)) = 30α^4 / (2(6α^5 - 1)) ≈ 2.42

Therefore

    α - x_(n+1) ≈ -2.42 (α - x_n)^2

For example, if n = 3 we get α - x_3 ≈ -4.73e-03 and

    α - x_4 ≈ -2.42 (α - x_3)^2 ≈ -5.42e-05,

a result in accordance with the value presented in the table: α - x_4 ≈ -5.35e-05.
Error analysis for Newton's method
If the iterate x_n is close to α we have

    f''(ξ_n)/(2 f'(x_n)) ≈ f''(α)/(2 f'(α)),   so with   M = -f''(α)/(2 f'(α)):

    α - x_(n+1) ≈ M (α - x_n)^2
    M(α - x_(n+1)) ≈ (M(α - x_n))^2

Inductively,

    M(α - x_n) ≈ (M(α - x_0))^(2^n),   n ≥ 0

In other words, in order to guarantee the convergence of Newton's method we should have

    |M(α - x_0)| < 1,   i.e.   |α - x_0| < 1/|M| = |2 f'(α)/f''(α)|
For x_n close to α, and therefore the intermediate point ξ_n also close to α, we have

    α - x_(n+1) ≈ -(f''(α)/(2 f'(α))) (α - x_n)^2

Thus Newton's method is quadratically convergent, provided f'(α) ≠ 0 and f(x) is twice differentiable in the vicinity of the root α.
We can also use this to explore the interval of convergence of Newton's method. Write the above as

    α - x_(n+1) ≈ M (α - x_n)^2,   M = -f''(α)/(2 f'(α))

Multiply both sides by M to get

    M(α - x_(n+1)) ≈ [M(α - x_n)]^2

Then we want these quantities to decrease; and this suggests choosing x_0 so that

    |M(α - x_0)| < 1,   i.e.   |α - x_0| < 1/|M| = |2 f'(α)/f''(α)|

If |M| is very large, then we may need to have a very good initial guess in order to have the iterates x_n converge to α.
ADVANTAGES & DISADVANTAGES
Advantages: 1. It is rapidly convergent in most cases.
2. It is simple in its formulation, and therefore relatively easy to apply and program.
3. It is intuitive in its construction. This means it is easier to understand its behaviour, when it is likely to behave well and when it may behave poorly.
Disadvantages: 1. It may not converge.
2. It is likely to have difficulty if f'(α) = 0. This condition means the x-axis is tangent to the graph of y = f(x) at x = α.
3. It needs to know both f(x) and f'(x). Contrast this with the bisection method, which requires only f(x).
THE SECANT METHOD
Newton's method was based on using the line tangent to the curve of y = f(x), with the point of tangency (x_0, f(x_0)). When x_0 ≈ α, the graph of the tangent line is approximately the same as the graph of y = f(x) around x = α. We then used the root of the tangent line to approximate α.
Consider using an approximating line based on interpolation. We assume we have two estimates of the root α, say x_0 and x_1. Then we produce a linear function

    q(x) = a_0 + a_1 x

with

    q(x_0) = f(x_0),   q(x_1) = f(x_1)     (*)

This line is sometimes called a secant line. Its equation is given by

    q(x) = [(x_1 - x) f(x_0) + (x - x_0) f(x_1)] / (x_1 - x_0)
[Figure: the secant line through (x_0, f(x_0)) and (x_1, f(x_1)) on y = f(x) crosses the x-axis at x_2, near the root α.]
    q(x) = [(x_1 - x) f(x_0) + (x - x_0) f(x_1)] / (x_1 - x_0)

This is linear in x; and by direct evaluation, it satisfies the interpolation conditions of (*). We now solve the equation q(x) = 0, denoting the root by x_2. This yields

    x_2 = x_1 - f(x_1) · (x_1 - x_0) / (f(x_1) - f(x_0))

We can now repeat the process. Use x_1 and x_2 to produce another secant line, and then use its root to approximate α. This yields the general iteration formula

    x_(n+1) = x_n - f(x_n) · (x_n - x_(n-1)) / (f(x_n) - f(x_(n-1))),   n = 1, 2, 3, ...

This is called the secant method for solving f(x) = 0.
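A minimal MATLAB sketch of the secant iteration (the name secant and the stopping rule are my own choices):

    function x1 = secant(f, x0, x1, tol, nmax)
    % SECANT  x_{n+1} = x_n - f(x_n)*(x_n - x_{n-1})/(f(x_n) - f(x_{n-1}))
    f0 = f(x0);  f1 = f(x1);
    for n = 1:nmax
        x2 = x1 - f1*(x1 - x0)/(f1 - f0);
        x0 = x1;  f0 = f1;               % shift the pair of points forward
        x1 = x2;  f1 = f(x1);
        if abs(x1 - x0) <= tol, return; end
    end
    end

Calling secant(@(x) x.^6 - x - 1, 2, 1, 1e-9, 50) should reproduce the iterates of the example below.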
Example. We solve the equation

    f(x) ≡ x^6 - x - 1 = 0

which was used previously as an example for both the bisection and Newton methods. The quantity x_n - x_(n-1) is used as an estimate of α - x_(n-1). The iterate x_8 equals α rounded to nine significant digits. As with Newton's method for this equation, the initial iterates do not converge rapidly. But as the iterates become closer to α, the speed of convergence increases.

    n   x_n          f(x_n)      x_n - x_(n-1)   α - x_(n-1)
    0   2.0          61.0
    1   1.0          -1.0        -1.0
    2   1.01612903   -9.15E-1     1.61E-2         1.35E-1
    3   1.19057777    6.57E-1     1.74E-1         1.19E-1
    4   1.11765583   -1.68E-1    -7.29E-2        -5.59E-2
    5   1.13253155   -2.24E-2     1.49E-2         1.71E-2
    6   1.13481681    9.54E-4     2.29E-3         2.19E-3
    7   1.13472365   -5.07E-6    -9.32E-5        -9.27E-5
    8   1.13472414    1.13E-9     4.92E-7         4.92E-7
It is clear from the numerical results that the secant method requires more iterates than the Newton method. But note that the secant method does not require a knowledge of f'(x), whereas Newton's method requires both f(x) and f'(x).
Note also that the secant method can be considered an approximation of the Newton method

    x_(n+1) = x_n - f(x_n)/f'(x_n)

obtained by using the approximation

    f'(x_n) ≈ (f(x_n) - f(x_(n-1))) / (x_n - x_(n-1))
CONVERGENCE ANALYSIS
With a combination of algebraic manipulation and the mean-value theorem from calculus, we can show

    α - x_(n+1) = (α - x_n)(α - x_(n-1)) · [-f''(ζ_n)/(2 f'(ξ_n))],     (**)

with ξ_n and ζ_n unknown points. The point ζ_n is located between the minimum and maximum of x_(n-1), x_n, and α; and ξ_n is located between the minimum and maximum of x_(n-1) and x_n. Recall for Newton's method that the Newton iterates satisfied

    α - x_(n+1) = (α - x_n)^2 · [-f''(ξ_n)/(2 f'(x_n))]

which closely resembles (**) above.
Using (**), it can be shown that x_n converges to α, and moreover,

    lim_(n→∞) |α - x_(n+1)| / |α - x_n|^r = |f''(α)/(2 f'(α))|^(r-1) ≡ c

where r = (1 + √5)/2 ≈ 1.62. This assumes that x_0 and x_1 are chosen sufficiently close to α; and how close this is will vary with the function f. In addition, the above result assumes f(x) has two continuous derivatives for all x in some interval about α.
The above says that when we are close to α,

    |α - x_(n+1)| ≈ c |α - x_n|^r

This looks very much like the Newton result

    α - x_(n+1) ≈ M (α - x_n)^2,   M = -f''(α)/(2 f'(α))

and c = |M|^(r-1). Both the secant and Newton methods converge at faster than a linear rate, and they are called superlinear methods.
The secant method converges slower than Newton's method; but it is still quite rapid. It is rapid enough that we can prove

    lim_(n→∞) |x_(n+1) - x_n| / |α - x_n| = 1

and therefore,

    |α - x_n| ≈ |x_(n+1) - x_n|

is a good error estimator.
A note of warning: Do not combine the secant formula and write it in the form

    x_(n+1) = (f(x_n) x_(n-1) - f(x_(n-1)) x_n) / (f(x_n) - f(x_(n-1)))

This has enormous loss-of-significance errors as compared with the earlier formulation.
COSTS OF SECANT & NEWTON METHODS
The Newton method

    x_(n+1) = x_n - f(x_n)/f'(x_n),   n = 0, 1, 2, ...

requires two function evaluations per iteration, that of f(x_n) and f'(x_n). The secant method

    x_(n+1) = x_n - f(x_n) · (x_n - x_(n-1)) / (f(x_n) - f(x_(n-1))),   n = 1, 2, 3, ...

requires one function evaluation per iteration, following the initial step.
For this reason, the secant method is often faster in time, even though more iterates are needed with it than with Newton's method to attain a similar accuracy.
ADVANTAGES & DISADVANTAGES
Advantages of the secant method: 1. It converges at faster than a linear rate, so that it is more rapidly convergent than the bisection method.
2. It does not require use of the derivative of the function, something that is not available in a number of applications.
3. It requires only one function evaluation per iteration, as compared with Newton's method which requires two.
Disadvantages of the secant method:
1. It may not converge.
2. There is no guaranteed error bound for the computed iterates.
3. It is likely to have difficulty if f'(α) = 0. This means the x-axis is tangent to the graph of y = f(x) at x = α.
4. Newton's method generalizes more easily to new methods for solving simultaneous systems of nonlinear equations.
BRENT'S METHOD
Richard Brent devised a method combining the advantages of the bisection method and the secant method.
1. It is guaranteed to converge.
2. It has an error bound which will converge to zero in practice.
3. For most problems f(x) = 0, with f(x) differentiable about the root α, the method behaves like the secant method.
4. In the worst case, it is not too much worse in its convergence than the bisection method.
In MATLAB, it is implemented as fzero; and it is present in most Fortran numerical analysis libraries.
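For instance, the earlier example can be handed directly to this routine:

    alpha = fzero(@(x) x.^6 - x - 1, 1.5)    % returns approximately 1.13472414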
FIXED POINT ITERATION
We begin with a computational example. Consider solving the two equations

    E1:  x = 1 + 0.5 sin x
    E2:  x = 3 + 2 sin x

Graphs of these two equations are shown on the accompanying figures, with the solutions being

    E1:  α = 1.49870113351785
    E2:  α = 3.09438341304928

We are going to use a numerical scheme called fixed point iteration. It amounts to making an initial guess x_0 and substituting this into the right side of the equation. The resulting value is denoted by x_1; and then the process is repeated, this time substituting x_1 into the right side. This is repeated until convergence occurs or until the iteration is terminated.
In the above cases, we show the results of the first 10 iterations in the accompanying table. Clearly convergence is occurring with E1, but not with E2. Why?
[Figures: the curves y = 1 + 0.5 sin x and y = 3 + 2 sin x plotted together with y = x; each intersection with y = x marks the corresponding root α.]
    E1: x = 1 + 0.5 sin x        E2: x = 3 + 2 sin x

    n    x_n (E1)            x_n (E2)
    0    0.00000000000000    3.00000000000000
    1    1.00000000000000    3.28224001611973
    2    1.42073549240395    2.71963177181556
    3    1.49438099256432    3.81910025488514
    4    1.49854088439917    1.74629389651652
    5    1.49869535552190    4.96927957214762
    6    1.49870092540704    1.06563065299216
    7    1.49870112602244    4.75018861639465
    8    1.49870113324789    1.00142864236516
    9    1.49870113350813    4.68448404916097
    10   1.49870113351750    1.00077863465869
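The two columns of the table can be reproduced with a few lines of MATLAB (a sketch):

    g1 = @(x) 1 + 0.5*sin(x);    % iteration E1
    g2 = @(x) 3 + 2*sin(x);      % iteration E2
    x1 = 0;  x2 = 3;             % initial guesses used in the table
    for n = 1:10
        x1 = g1(x1);             % settles down towards 1.49870...
        x2 = g2(x2);             % keeps wandering, no convergence
        fprintf('%2d  %.14f  %.14f\n', n, x1, x2);
    end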
The above iterations can be written symbolically as

    E1:  x_(n+1) = 1 + 0.5 sin x_n
    E2:  x_(n+1) = 3 + 2 sin x_n

for n = 0, 1, 2, ... Why does one of these iterations converge, but not the other? The graphs show similar behaviour, so why the difference? Consider one more example.
Suppose we are solving the equation

    x^2 - 5 = 0

with exact root α = √5 ≈ 2.2361, using iterates of the form

    x_(n+1) = g(x_n).
Consider four different iterations:

    I1:  x_(n+1) = 5 + x_n - x_n^2
    I2:  x_(n+1) = 5/x_n
    I3:  x_(n+1) = 1 + x_n - (1/5) x_n^2
    I4:  x_(n+1) = (1/2)(x_n + 5/x_n)

All of them, in case they are convergent, will converge to α = √5 (just take the limit as n → ∞ of each relation).
    n    I1              I2    I3       I4
    0     1.0e+00        1.0   1.0      1.0
    1     5.0000e+00     5.0   1.8000   3.0000
    2    -1.5000e+01     1.0   2.1520   2.3333
    3    -2.3500e+02     5.0   2.2258   2.2381
    4    -5.5455e+04     1.0   2.2350   2.2361
    5    -3.0753e+09     5.0   2.2360   2.2361
    6    -9.4575e+18     1.0   2.2361   2.2361
    7    -8.9445e+37     5.0   2.2361   2.2361
    8    -8.0004e+75     1.0   2.2361   2.2361
As another example, note that the Newton method

    x_(n+1) = x_n - f(x_n)/f'(x_n)

is also a fixed point iteration, for the equation

    x = x - f(x)/f'(x)

In general, we are interested in solving equations

    x = g(x)

by means of fixed point iteration:

    x_(n+1) = g(x_n),   n = 0, 1, 2, ...

It is called fixed point iteration because the root α is a fixed point of the function g(x), meaning that α is a number for which

    g(α) = α
EXISTENCE THEOREM
We begin by asking whether the equation

    x = g(x)

has a solution. For this to occur, the graphs of y = x and y = g(x) must intersect, as seen on the earlier graphs. There are several lemmas and theorems that give conditions under which we are guaranteed there is a fixed point α.

Lemma 1. Let g(x) be a continuous function on the interval [a, b], and suppose it satisfies the property

    a ≤ x ≤ b   ⟹   a ≤ g(x) ≤ b

Then the equation x = g(x) has at least one solution α in the interval [a, b].

The proof of this is fairly intuitive. Look at the function f(x) = x - g(x), a ≤ x ≤ b. Evaluating at the endpoints, f(a) ≤ 0 and f(b) ≥ 0. The function f(x) is continuous on [a, b]; and therefore it contains a zero in the interval.
Theorem: Assume g(x) and g'(x) exist and are continuous on the interval [a, b]; and further, assume

    a ≤ x ≤ b   ⟹   a ≤ g(x) ≤ b      (#)

    λ ≡ max_(a≤x≤b) |g'(x)| < 1

Then:
S1. The equation x = g(x) has a unique solution α in [a, b].
S2. For any initial guess x_0 in [a, b], the iteration

    x_(n+1) = g(x_n),   n = 0, 1, 2, ...

will converge to α.
S3.

    |α - x_n| ≤ (λ^n / (1 - λ)) |x_1 - x_0|,   n ≥ 0

S4.

    lim_(n→∞) (α - x_(n+1)) / (α - x_n) = g'(α)

Thus for x_n close to α,

    α - x_(n+1) ≈ g'(α)(α - x_n)
The proof is given in the text, and I go over only a portion of it here. For S2, note that from (#), if x_0 is in [a, b], then

    x_1 = g(x_0)

is also in [a, b]. Repeat the argument to show that

    x_2 = g(x_1)

belongs to [a, b]. This can be continued by induction to show that every x_n belongs to [a, b].
We need the following general result. For any two points w and z in [a, b],

    g(w) - g(z) = g'(c)(w - z)

for some unknown point c between w and z. Therefore,

    |g(w) - g(z)| ≤ λ |w - z|

for any a ≤ w, z ≤ b.
For S3, subtract x_(n+1) = g(x_n) from α = g(α) to get

    α - x_(n+1) = g(α) - g(x_n) = g'(c_n)(α - x_n)      ($)

    |α - x_(n+1)| ≤ λ |α - x_n|      (*)

with c_n between α and x_n. From (*), we have that the error is guaranteed to decrease by a factor of λ with each iteration. This leads to

    |α - x_n| ≤ λ^n |α - x_0|,   n ≥ 0

With some extra manipulation, we can obtain the error bound in S3.
For S4, use ($) to write
(α − x_{n+1}) / (α − x_n) = g'(c_n).
Since x_n → α and c_n is between α and x_n, we have g'(c_n) → g'(α).
The statement
α − x_{n+1} ≈ g'(α)(α − x_n)
tells us that when near to the root α, the errors will decrease by a constant factor of g'(α). If this is negative, then the errors will oscillate between positive and negative, and the iterates will be approaching α from both sides. When g'(α) is positive, the iterates will approach α from only one side.

The statements
α − x_{n+1} = g'(c_n)(α − x_n)
α − x_{n+1} ≈ g'(α)(α − x_n)
also tell us a bit more of what happens when
|g'(α)| > 1.
Then the errors will increase as we approach the root rather than decrease in size.
Look at the earlier examples
E1: x = 1 + 0.5 sin x
E2: x = 3 + 2 sin x
In the first case E1,
g(x) = 1 + 0.5 sin x,  g'(x) = 0.5 cos x,  |g'(α)| ≤ 1/2.
Therefore the fixed point iteration
x_{n+1} = 1 + 0.5 sin x_n
will converge for E1.

For the second case E2,
g(x) = 3 + 2 sin x,  g'(x) = 2 cos x,
g'(α) = 2 cos(3.09438341304928) ≐ −1.998.
Therefore the fixed point iteration
x_{n+1} = 3 + 2 sin x_n
will diverge for E2.
Consider the example x² − 5 = 0.

(I1) g(x) = 5 + x − x²;  g'(x) = 1 − 2x;  g'(α) = 1 − 2√5 ≈ −3.47, so |g'(α)| > 1. Thus x_n = g(x_{n−1}) does not converge to √5.

(I2) g(x) = 5/x;  g'(x) = −5/x²;  g'(α) = −1. This test alone cannot decide between convergence and divergence; the numerical results show the iterates do not converge (they simply cycle).

(I3) g(x) = 1 + x − (1/5)x²;  g'(x) = 1 − (2/5)x;  g'(α) = 1 − (2/5)√5 ≈ 0.106. Thus x_n = g(x_{n−1}) converges to √5. Moreover, we have
|α − x_{n+1}| ≈ 0.106 |α − x_n|
if x_n is sufficiently close to α. The errors are decreasing with a linear rate of 0.106.

(I4) g(x) = (1/2)(x + 5/x);  g'(x) = (1/2)(1 − 5/x²);  g'(α) = 0. The sequence x_n = g(x_{n−1}) will converge to √5 with an order of convergence bigger than 1.
Sometimes it is difficult to express the equation f(x) = 0 in a form x = g(x) such that the resulting iterates will converge. Such a process is presented in the following examples.

Example 1. Let x⁴ − x − 1 = 0, rewritten as
x = (1 + x)^{1/4},
which provides us with the iterations
x_0 = 1,  x_{n+1} = (1 + x_n)^{1/4},  n ≥ 0.
This sequence will converge to α ≈ 1.2207.
Example 2. Let x³ + x − 1 = 0, rewritten as
x = 1/(1 + x²),
and its fixed point iteration
x_0 = 1,  x_{n+1} = 1/(1 + x_n²),  n ≥ 0,
which will converge to α ≈ 0.6823. The iterations are represented graphically in the following figure.
[Figure: cobweb plot of the iterates x_0, x_1, x_2, x_3 of Example 2, converging to α = 0.6823 at the intersection of y = x and y = g(x).]

[Figures: four cobweb diagrams of y = x and y = g(x) illustrating the behaviour of the iterates in the cases 0 < g'(α) < 1, −1 < g'(α) < 0, g'(α) > 1, and g'(α) < −1: the iterates converge from one side, converge with oscillation, diverge from one side, and diverge with oscillation, respectively.]
Besides the convergence, we would like to know how fast the sequence x_n = g(x_{n−1}) is converging to the solution, in other words how fast the error α − x_n is decreasing.

We say that the sequence {x_n}_{n=0}^∞ converges to α with order of convergence p ≥ 1 if
|α − x_{n+1}| ≤ c |α − x_n|^p,  n ≥ 0,
where c ≥ 0 is a constant. The cases p = 1, p = 2 and p = 3 are called linear, quadratic and cubic convergence. In the case of linear convergence, the constant c is called the rate of linear convergence, and we require additionally that c < 1; otherwise the sequence of errors α − x_n can fail to converge to zero. Also, for linear convergence we can use the relation
|α − x_n| ≤ c^n |α − x_0|,  n ≥ 0.
Thus the bisection method is linearly convergent with rate 1/2, Newton's method is quadratically convergent, and the secant method has order of convergence p = (1 + √5)/2.
If |g'(α)| < 1, from the last theorem we have that the iterates x_n are at least linearly convergent. If, in addition, g'(α) ≠ 0, then we have exactly linear convergence with rate |g'(α)|. In practice, the last theorem is rarely used, since it is quite difficult to find an interval [a, b] such that g([a, b]) ⊆ [a, b]. To simplify the usage of the theorem we consider the following corollary.
Corollary. Assume x = g(x) has a solution α, and further assume that both g(x) and g'(x) are continuous for all x in some interval about α. In addition, assume
|g'(α)| < 1.     (**)
Then for any sufficiently small number ε > 0, the interval [a, b] = [α − ε, α + ε] will satisfy the hypotheses of the preceding theorem.

This means that if (**) is true, and if we choose x_0 sufficiently close to α, then the fixed point iteration x_{n+1} = g(x_n) will converge and the earlier results S1–S4 will all hold. The corollary does not tell us how close we need to be to α in order to have convergence.
NEWTON'S METHOD
For Newton's method
x_{n+1} = x_n − f(x_n)/f'(x_n)
we have that it is a fixed point iteration with
g(x) = x − f(x)/f'(x).
Check its convergence by checking the condition (**):
g'(x) = 1 − f'(x)/f'(x) + f(x)f''(x)/[f'(x)]² = f(x)f''(x)/[f'(x)]²
g'(α) = 0.
Therefore the Newton method will converge if x_0 is chosen sufficiently close to α.
HIGHER ORDER METHODS
What happens when g'(α) = 0? We use Taylor's theorem to answer this question. Begin by writing
g(x) = g(α) + g'(α)(x − α) + (1/2) g''(c)(x − α)²
with c between x and α. Substitute x = x_n, and recall that g(x_n) = x_{n+1} and g(α) = α. Also assume g'(α) = 0. Then
x_{n+1} = α + (1/2) g''(c_n)(x_n − α)²
x_{n+1} − α = (1/2) g''(c_n)(x_n − α)²
with c_n between α and x_n. Thus if g'(α) = 0, the fixed point iteration is quadratically convergent or better. In fact, if g''(α) ≠ 0, then the iteration is exactly quadratically convergent.
ANOTHER RAPID ITERATION
Newton's method is rapid, but requires use of the derivative f'(x). Can we get by without this? The answer is yes! Consider the method
D_n = [f(x_n + f(x_n)) − f(x_n)] / f(x_n)
x_{n+1} = x_n − f(x_n)/D_n.
This is an approximation to Newton's method, with f'(x_n) ≈ D_n. To analyze its convergence, regard it as a fixed point iteration with
D(x) = [f(x + f(x)) − f(x)] / f(x)
g(x) = x − f(x)/D(x).
Then we can, with some difficulty, show g'(α) = 0 and g''(α) ≠ 0. This will prove this new iteration is quadratically convergent.
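A minimal MATLAB sketch of this derivative-free iteration follows (it is usually attributed to Steffensen; the stopping test, tolerance and iteration cap are assumptions, not part of the notes):

function [x, n] = steffensen_like(f, x0, tol, nmax)
% Iteration x_{n+1} = x_n - f(x_n)/D_n, where
% D_n = (f(x_n + f(x_n)) - f(x_n)) / f(x_n) approximates f'(x_n).
x = x0;
for n = 1:nmax
    fx = f(x);
    if fx == 0, return, end
    D = (f(x + fx) - fx) / fx;
    xnew = x - fx / D;
    if abs(xnew - x) <= tol
        x = xnew; return
    end
    x = xnew;
end
end

For example, steffensen_like(@(x) x.^2 - 5, 2, 1e-12, 50) converges rapidly to √5.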
FIXED POINT ITERATION: ERROR
Recall the result
lim_{n→∞} (α − x_n)/(α − x_{n−1}) = g'(α)
for the iteration
x_n = g(x_{n−1}),  n = 1, 2, ...
Thus
α − x_n ≈ λ(α − x_{n−1})     (***)
with λ = g'(α) and |λ| < 1.

If we were to know λ, then we could solve (***) for α:
α ≈ (x_n − λ x_{n−1}) / (1 − λ).
Usually, we write this as a modification of the currently computed iterate x_n:
α ≈ (x_n − λ x_{n−1}) / (1 − λ)
  = (x_n − λ x_n + λ x_n − λ x_{n−1}) / (1 − λ)
  = x_n + (λ / (1 − λ)) [x_n − x_{n−1}].
The formula
x_n + (λ / (1 − λ)) [x_n − x_{n−1}]
is said to be an extrapolation of the numbers x_{n−1} and x_n. But what is λ?
From
lim_{n→∞} (α − x_n)/(α − x_{n−1}) = g'(α)
we have
λ ≈ (α − x_n)/(α − x_{n−1}).
Unfortunately this also involves the unknown root α which we seek, and we must find some other way of estimating λ.
To calculate λ, consider the ratio
λ_n = (x_n − x_{n−1}) / (x_{n−1} − x_{n−2}).
To see this is approximately λ as x_n approaches α, write
(x_n − x_{n−1}) / (x_{n−1} − x_{n−2}) = [g(x_{n−1}) − g(x_{n−2})] / (x_{n−1} − x_{n−2}) = g'(c_n)
with c_n between x_{n−1} and x_{n−2}. As the iterates approach α, the number c_n must also approach α. Thus λ_n approaches λ as x_n → α.
We combine these results to obtain the estimation
x̂_n = x_n + (λ_n / (1 − λ_n)) [x_n − x_{n−1}],   λ_n = (x_n − x_{n−1}) / (x_{n−1} − x_{n−2}).
We call x̂_n the Aitken extrapolate of {x_{n−2}, x_{n−1}, x_n}; and α ≈ x̂_n.
We can also rewrite this as
α − x_n ≈ x̂_n − x_n = (λ_n / (1 − λ_n)) [x_n − x_{n−1}].
This is called Aitken's error estimation formula.

The accuracy of these procedures is tied directly to the accuracy of the formulas
α − x_n ≈ λ(α − x_{n−1}),   α − x_{n−1} ≈ λ(α − x_{n−2}).
If these are accurate, then so are the above extrapolation and error estimation formulas.
EXAMPLE
Consider the iteration
x_{n+1} = 6.28 + sin(x_n),  n = 0, 1, 2, ...
for solving
x = 6.28 + sin x.
Iterates are shown on the accompanying sheet, including calculations of λ_n and the error estimate
α − x_n ≈ x̂_n − x_n = (λ_n / (1 − λ_n)) [x_n − x_{n−1}].     (Estimate)
The latter is called "Estimate" in the table. In this instance,
g'(α) ≐ 0.9644
and therefore the convergence is very slow. This is apparent in the table.
AITKEN'S ALGORITHM
Step 1: Select x_0.
Step 2: Calculate
x_1 = g(x_0),  x_2 = g(x_1).
Step 3: Calculate
x_3 = x_2 + (λ_2 / (1 − λ_2)) [x_2 − x_1],   λ_2 = (x_2 − x_1) / (x_1 − x_0).
Step 4: Calculate
x_4 = g(x_3),  x_5 = g(x_4)
and calculate x_6 as the extrapolate of {x_3, x_4, x_5}. Continue this procedure, ad infinitum.

Of course in practice we will have some kind of error test to stop this procedure when we believe we have sufficient accuracy.
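A minimal MATLAB sketch of this algorithm (the stopping test based on the Aitken error estimate, the tolerance and the iteration cap are assumptions for illustration):

function x = aitken(g, x0, tol, nmax)
% Aitken's algorithm: two fixed point steps, then one extrapolation.
x = x0;
for k = 1:nmax
    x1 = g(x);  x2 = g(x1);
    lam  = (x2 - x1) / (x1 - x);           % estimate of g'(alpha)
    xhat = x2 + lam/(1 - lam)*(x2 - x1);   % Aitken extrapolate
    if abs(xhat - x2) <= tol               % error estimate as stopping test
        x = xhat; return
    end
    x = xhat;
end
end

For example, aitken(@(x) 6.28 + sin(x), 6, 1e-10, 100) solves the slowly convergent problem of the next example in a handful of sweeps.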
EXAMPLE
Consider again the iteration
x_{n+1} = 6.28 + sin(x_n),  n = 0, 1, 2, ...
for solving x = 6.28 + sin x. Now we use the Aitken method, and the results are shown in the accompanying table. With this we have
α − x_3 = 7.98 × 10⁻⁴,  α − x_6 = 2.27 × 10⁻⁶.
In comparison, the original iteration had
α − x_6 = 1.23 × 10⁻².
GENERAL COMMENTS
Aitken extrapolation can greatly accelerate the convergence of a linearly convergent iteration
x_{n+1} = g(x_n).
This shows the power of understanding the behaviour of the error in a numerical process. From that understanding, we can often improve the accuracy, through extrapolation or some other procedure. This is a justification for using mathematical analysis to understand numerical methods. We will see this repeated at later points in the course, and it holds for many different types of problems and numerical methods for their solution.
MULTIPLE ROOTS
We study two classes of functions for which there is additional difficulty in calculating their roots. The first of these are functions in which the desired root has a multiplicity greater than 1. What does this mean?

Let α be a root of the function f(x), and imagine writing it in the factored form
f(x) = (x − α)^m h(x)
with some integer m ≥ 1 and some continuous function h(x) for which h(α) ≠ 0. Then we say that α is a root of f(x) of multiplicity m. For example, the function
f(x) = e^{x²} − 1
has x = 0 as a root of multiplicity m = 2. In particular, define
h(x) = (e^{x²} − 1)/x²  for x ≠ 0.
Using Taylor polynomial approximations, we can show for x ≠ 0 that
h(x) ≈ 1 + (1/2)x² + (1/6)x⁴,   lim_{x→0} h(x) = 1.
This leads us to extend the definition of h(x) to
h(x) = (e^{x²} − 1)/x² for x ≠ 0,   h(0) = 1.
Thus
f(x) = x² h(x)
as asserted, and x = 0 is a root of f(x) of multiplicity m = 2.
Roots for which m = 1 are called simple roots, and the methods studied to this point were intended for such roots. We now consider the case of m > 1. If the function f(x) is m-times differentiable around α, then we can differentiate
f(x) = (x − α)^m h(x)
m times to obtain an equivalent formulation of what it means for the root to have multiplicity m.

For an example, consider the case
f(x) = (x − α)³ h(x).
Then
f'(x) = 3(x − α)² h(x) + (x − α)³ h'(x) ≡ (x − α)² h₂(x)
h₂(x) = 3h(x) + (x − α) h'(x)
h₂(α) = 3h(α) ≠ 0.
This shows α is a root of f'(x) of multiplicity 2. Differentiating a second time, we can show
f''(x) = (x − α) h₃(x)
for a suitably defined h₃(x) with h₃(α) ≠ 0, and α is a simple root of f''(x). Differentiating a third time, we have
f'''(α) = h₃(α) ≠ 0.
We can use this as part of a proof of the following: α is a root of f(x) of multiplicity m = 3 if and only if
f(α) = f'(α) = f''(α) = 0,  f'''(α) ≠ 0.
In general, α is a root of f(x) of multiplicity m if and only if
f(α) = ⋯ = f^{(m−1)}(α) = 0,  f^{(m)}(α) ≠ 0.
DIFFICULTIES OF MULTIPLE ROOTS
There are two main difficulties with the numerical calculation of multiple roots (by which we mean m > 1 in the definition).
1. Methods such as Newton's method and the secant method converge more slowly than for the case of a simple root.
2. There is a large interval of uncertainty in the precise location of a multiple root on a computer or calculator.
The second of these is the more difficult to deal with, but we begin with the first for the case of Newton's method.

Recall that we can regard Newton's method as a fixed point method:
x_{n+1} = g(x_n),   g(x) = x − f(x)/f'(x).
Then we substitute
f(x) = (x − α)^m h(x)
to obtain
g(x) = x − (x − α)^m h(x) / [m(x − α)^{m−1} h(x) + (x − α)^m h'(x)]
     = x − (x − α) h(x) / [m h(x) + (x − α) h'(x)].
Then we can use this to show
g'(α) = 1 − 1/m = (m − 1)/m.
For m > 1, this is nonzero, and therefore Newton's method is only linearly convergent:
α − x_{n+1} ≈ λ(α − x_n),   λ = (m − 1)/m.
Similar results hold for the secant method.
There are ways of improving the speed of convergence of Newton's method, creating a modified method that is again quadratically convergent. In particular, consider the fixed point iteration formula
x_{n+1} = g(x_n),   g(x) = x − m f(x)/f'(x)
in which we assume to know the multiplicity m of the root α being sought. Then, modifying the above argument on the convergence of Newton's method, we obtain
g'(α) = 1 − m · (1/m) = 0
and the iteration method will be quadratically convergent.
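A minimal MATLAB sketch of this modified iteration (the function name, arguments and stopping test are assumptions chosen for illustration):

function [x, n] = newton_multiple(f, df, m, x0, tol, nmax)
% Modified Newton iteration x_{n+1} = x_n - m*f(x_n)/f'(x_n)
% for a root of known multiplicity m.
x = x0;
for n = 1:nmax
    dx = m * f(x) / df(x);
    x = x - dx;
    if abs(dx) <= tol, return, end
end
end

For the later example f(x) = (x − 1.1)³(x − 2.1), one would call it with m = 3, e.g.
newton_multiple(@(x)(x-1.1).^3.*(x-2.1), @(x)3*(x-1.1).^2.*(x-2.1)+(x-1.1).^3, 3, 0.8, 1e-10, 50)
although, as discussed below, noise in evaluating f limits the attainable accuracy.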
But this is not the fundamental problem posed by
multiple roots.
NOISE IN FUNCTION EVALUATION
Recall the discussion of noise in evaluating a function f(x), and in our case consider the evaluation for values of x near to α. In the following figures, the noise as measured by vertical distance is the same in both graphs.
[Figures: the band of computed values of f(x) near a simple root and near a double root; the same vertical noise produces a much wider set of apparent zeros in the double root case.]
Noise was discussed earlier, and as an example we used the function
f(x) = x³ − 3x² + 3x − 1 ≡ (x − 1)³.
Because of the noise in evaluating f(x), it appears from the graph that f(x) has many zeros around x = 1, whereas the exact function outside of the computer has only the root α = 1, of multiplicity 3. Any rootfinding method to find a multiple root that uses evaluation of f(x) is doomed to having a large interval of uncertainty as to the location of the root. If high accuracy is desired, then the only satisfactory solution is to reformulate the problem as a new problem F(x) = 0 in which α is a simple root of F. Then use a standard rootfinding method to calculate α. It is important that the evaluation of F(x) not involve f(x) directly, as that is the source of the noise and the uncertainty.
EXAMPLE
Consider finding the roots of
f(x) = (x − 1.1)³ (x − 2.1) = 2.7951 − 8.954x + 10.56x² − 5.4x³ + x⁴.
This has a root at α = 1.1 of multiplicity 3. Newton's method gives:

 n   x_n        f(x_n)    α − x_n    Rate
 0   0.800000   0.03510   0.300000
 1   0.892857   0.01073   0.207143   0.690
 2   0.958176   0.00325   0.141824   0.685
 3   1.00344    0.00099   0.09656    0.681
 4   1.03486    0.00029   0.06514    0.675
 5   1.05581    0.00009   0.04419    0.678
 6   1.07028    0.00003   0.02972    0.673
 7   1.08092    0.0       0.01908    0.642

From an examination of the rate of linear convergence of Newton's method applied to this function, one can guess with high probability that the multiplicity is m = 3. Then form exactly the second derivative
f''(x) = 21.12 − 32.4x + 12x².
Applying Newton's method to this with a guess of x = 1 will lead to rapid convergence to α = 1.1.
In general, if we know the root α has multiplicity m > 1, then replace the problem by that of solving
f^{(m−1)}(x) = 0,
since α is a simple root of this equation.
STABILITY
Generally we expect the world to be stable. By this, we mean that if we make a small change in something, then we expect this to lead to other correspondingly small changes. In fact, if we think about this carefully, then we know this need not be true. We now illustrate this for the case of rootfinding.

Consider the polynomial
f(x) = x⁷ − 28x⁶ + 322x⁵ − 1960x⁴ + 6769x³ − 13132x² + 13068x − 5040.
This has the exact roots {1, 2, 3, 4, 5, 6, 7}. Now consider the perturbed polynomial
F(x) = x⁷ − 28.002x⁶ + 322x⁵ − 1960x⁴ + 6769x³ − 13132x² + 13068x − 5040.
This is a relatively small change in one coefficient, of relative error
0.002/28 ≈ 7.14 × 10⁻⁵.
What are the roots of F(x)?

 Root of f(x)   Root of F(x)              Error
 1              1.0000028                 −2.8E−6
 2              1.9989382                  1.1E−3
 3              3.0331253                 −0.033
 4              3.8195692                  0.180
 5              5.4586758 + 0.54012578i   −0.46 − 0.54i
 6              5.4586758 − 0.54012578i   −0.46 + 0.54i
 7              7.2330128                 −0.233

Why have some of the roots departed so radically from the original values? This phenomenon goes under a variety of names. We sometimes say this is an example of an unstable or ill-conditioned rootfinding problem. These words are often used in a casual manner, but they also have a very precise meaning in many areas of numerical analysis (and more generally, in all of mathematics).
A PERTURBATION ANALYSIS
We want to study what happens to the root of a function f(x) when it is perturbed by a small amount. For some function g(x) and for all small ε, define a perturbed function
F_ε(x) = f(x) + ε g(x).
The polynomial example would fit this if we use
g(x) = x⁶,  ε = −0.002.
Let α_0 be a simple root of f(x). It can be shown (using the implicit differentiation theorem from calculus) that if f(x) and g(x) are differentiable for x ≈ α_0, and if f'(α_0) ≠ 0, then F_ε(x) has a unique simple root α(ε) near to α_0 = α(0) for all small values of ε. Moreover, α(ε) will be a differentiable function of ε. We use this to estimate α(ε).

The linear Taylor polynomial approximation of α(ε) is given by
α(ε) ≈ α(0) + ε α'(0).
We need to find a formula for α'(0). Recall that
F_ε(α(ε)) = 0
for all small values of ε. Differentiate this as a function of ε, using the chain rule. Then we obtain
f'(α(ε)) α'(ε) + g(α(ε)) + ε g'(α(ε)) α'(ε) = 0
for all small ε. Substitute ε = 0, recall α(0) = α_0, and solve for α'(0) to obtain
f'(α_0) α'(0) + g(α_0) = 0
α'(0) = − g(α_0) / f'(α_0).
This then leads to
α(ε) ≈ α(0) + ε α'(0) = α_0 − ε g(α_0)/f'(α_0).     (*)
Example. In our earlier polynomial example, consider the simple root α_0 = 3. Then
α(ε) ≈ 3 − ε · 3⁶/48 ≐ 3 − 15.2ε.
With ε = −0.002, we obtain
α(−0.002) ≈ 3 + 15.2 × 0.002 ≐ 3.0304.
This is close to the actual root 3.0331253 of F(x).

However, the approximation (*) is not good at estimating the change in the roots 5 and 6. By observation, the perturbation in those roots is a complex number, whereas the formula (*) predicts only a perturbation that is real. The value of ε is too large to have (*) be accurate for the roots 5 and 6.
DISCUSSION
Looking again at the formula
α(ε) ≈ α_0 − ε g(α_0)/f'(α_0)
we have that the size of
ε g(α_0)/f'(α_0)
is an indication of the stability of the solution α_0. If this quantity is large, then potentially we will have difficulty. Of course, not all functions g(x) are equally possible, and we need to look only at functions g(x) that will possibly occur in practice. One quantity of interest is the size of f'(α_0). If it is very small relative to g(α_0), then we are likely to have difficulty in finding α_0 accurately.
INTERPOLATION
Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).

As an example, consider defining
x_0 = 0,  x_1 = π/4,  x_2 = π/2
and
y_i = cos x_i,  i = 0, 1, 2.
This gives us the three points
(0, 1),  (π/4, 1/√2),  (π/2, 0).
Now find a quadratic polynomial
p(x) = a_0 + a_1 x + a_2 x²
for which
p(x_i) = y_i,  i = 0, 1, 2.
The graph of this polynomial is shown on the accompanying graph. We later give an explicit formula.
[Figure: quadratic interpolation of cos(x) — the curves y = cos(x) and y = p_2(x) on [0, π/2], with nodes at 0, π/4, π/2.]
PURPOSES OF INTERPOLATION
1. Replace a set of data points {(x_i, y_i)} with a function given analytically.
2. Approximate functions with simpler ones, usually polynomials or piecewise polynomials.

Purpose #1 has several aspects.

The data may be from a known class of functions. Interpolation is then used to find the member of this class of functions that agrees with the given data. For example, data may be generated from functions of the form
p(x) = a_0 + a_1 e^x + a_2 e^{2x} + ⋯ + a_n e^{nx}.
Then we need to find the coefficients {a_j} based on the given data values.

We may want to take function values f(x) given in a table for selected values of x, often equally spaced, and extend the function to values of x not in the table. For example, given numbers from a table of logarithms, estimate the logarithm of a number x not in the table.

Given a set of data points {(x_i, y_i)}, find a curve passing through these points that is pleasing to the eye. In fact, this is what is done continually with computer graphics. How do we connect a set of points to make a smooth curve? Connecting them with straight line segments will often give a curve with many corners, whereas what was intended was a smooth curve.

Purpose #2 for interpolation is to approximate functions f(x) by simpler functions p(x), perhaps to make it easier to integrate or differentiate f(x). That will be the primary reason for studying interpolation in this course.
As an example of why this is important, consider the problem of evaluating
I = ∫₀¹ dx/(1 + x¹⁰).
This is very difficult to do analytically. But we will look at producing polynomial interpolants of the integrand; and polynomials are easily integrated exactly.

We begin by using polynomials as our means of doing interpolation. Later in the chapter, we consider more complex piecewise polynomial functions, often called spline functions.
LINEAR INTERPOLATION
The simplest form of interpolation is probably the straight line, connecting two points by a straight line. Let two data points (x_0, y_0) and (x_1, y_1) be given. There is a unique straight line passing through these points. We can write the formula for a straight line as
P_1(x) = a_0 + a_1 x.
In fact, there are other more convenient ways to write it, and we give several of them below:
P_1(x) = [(x − x_1)/(x_0 − x_1)] y_0 + [(x − x_0)/(x_1 − x_0)] y_1
       = [(x_1 − x) y_0 + (x − x_0) y_1] / (x_1 − x_0)
       = y_0 + [(x − x_0)/(x_1 − x_0)] [y_1 − y_0]
       = y_0 + [(y_1 − y_0)/(x_1 − x_0)] (x − x_0).
Check each of these by evaluating them at x = x_0 and x_1 to see if the respective values are y_0 and y_1. (A short MATLAB check follows below.)
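A minimal MATLAB check of the last form of P_1 (the tan x data repeats the table used in the example that follows):

x0 = 1.1;  y0 = 1.9648;      % tan(1.1)
x1 = 1.2;  y1 = 2.5722;      % tan(1.2)
P1 = @(x) y0 + (y1 - y0)/(x1 - x0) .* (x - x0);
P1(x0), P1(x1)               % reproduces y0 and y1
P1(1.15)                     % approx 2.2685, vs tan(1.15) = 2.2345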
Example. Following is a table of values for f(x) = tan x for a few values of x.
 x       1        1.1      1.2      1.3
 tan x   1.5574   1.9648   2.5722   3.6021
Use linear interpolation to estimate tan(1.15). Then use
x_0 = 1.1,  x_1 = 1.2
with corresponding values for y_0 and y_1. Then
tan x ≈ y_0 + [(x − x_0)/(x_1 − x_0)] [y_1 − y_0]
tan(1.15) ≈ 1.9648 + [(1.15 − 1.1)/(1.2 − 1.1)] [2.5722 − 1.9648] = 2.2685.
The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
[Figures: y = tan(x) on [1, 1.3]; and a close-up on [1.1, 1.2] showing y = tan(x) together with the linear interpolant y = p_1(x).]
QUADRATIC INTERPOLATION
We want to find a polynomial
P_2(x) = a_0 + a_1 x + a_2 x²
which satisfies
P_2(x_i) = y_i,  i = 0, 1, 2
for given data points (x_0, y_0), (x_1, y_1), (x_2, y_2). One formula for such a polynomial follows:
P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)
with
L_0(x) = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)]
L_1(x) = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)]
L_2(x) = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)].
This formula is called Lagrange's form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
L_0(x) = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)]
L_1(x) = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)]
L_2(x) = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)]
are called Lagrange basis functions for quadratic interpolation. They have the properties
L_i(x_j) = 1 if i = j,  0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.

As a consequence of each L_i(x) being of degree 2, we have that the interpolant
P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)
must have degree ≤ 2.
UNIQUENESS
Can there be another polynomial, call it Q(x), for which
deg(Q) ≤ 2,   Q(x_i) = y_i,  i = 0, 1, 2?
Thus, is the Lagrange formula P_2(x) unique?

Introduce
R(x) = P_2(x) − Q(x).
From the properties of P_2 and Q, we have deg(R) ≤ 2. Moreover,
R(x_i) = P_2(x_i) − Q(x_i) = y_i − y_i = 0
for all three node points x_0, x_1, and x_2. How many polynomials R(x) are there of degree at most 2 having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore
R(x) = 0 for all x,   Q(x) = P_2(x) for all x.
SPECIAL CASES
Consider the data points
(x_0, 1), (x_1, 1), (x_2, 1).
What is the polynomial P_2(x) in this case?
Answer: We must have that the polynomial interpolant is
P_2(x) ≡ 1,
meaning that P_2(x) is the constant function. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore, by the uniqueness of quadratic interpolation, P_2(x) must be the constant function 1.

Consider now the data points
(x_0, m x_0), (x_1, m x_1), (x_2, m x_2)
for some constant m. What is P_2(x) in this case? By an argument similar to that above,
P_2(x) = m x for all x.
Thus the degree of P_2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial P_n(x) for which
deg(P_n) ≤ n
P_n(x_i) = y_i,  i = 0, 1, ..., n     (△)
with given data points
(x_0, y_0), (x_1, y_1), ..., (x_n, y_n).
The solution is given by Lagrange's formula
P_n(x) = y_0 L_0(x) + y_1 L_1(x) + ⋯ + y_n L_n(x).
The Lagrange basis functions are given by
L_k(x) = [(x − x_0) ⋯ (x − x_{k−1})(x − x_{k+1}) ⋯ (x − x_n)] / [(x_k − x_0) ⋯ (x_k − x_{k−1})(x_k − x_{k+1}) ⋯ (x_k − x_n)]
for k = 0, 1, 2, ..., n. The quadratic case was covered earlier. In a manner analogous to the quadratic case, we can show that the above P_n(x) is the only solution to the problem (△).
In the formula
L_k(x) = [(x − x_0) ⋯ (x − x_{k−1})(x − x_{k+1}) ⋯ (x − x_n)] / [(x_k − x_0) ⋯ (x_k − x_{k−1})(x_k − x_{k+1}) ⋯ (x_k − x_n)]
we can see that each such function is a polynomial of degree n. In addition,
L_k(x_i) = 1 if k = i,  0 if k ≠ i.
Using these properties, it follows that the formula
P_n(x) = y_0 L_0(x) + y_1 L_1(x) + ⋯ + y_n L_n(x)
satisfies the interpolation problem of finding a solution to
deg(P_n) ≤ n,   P_n(x_i) = y_i,  i = 0, 1, ..., n.
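A minimal MATLAB sketch of evaluating the Lagrange form (the function name and arguments are invented for illustration; note MATLAB indices start at 1, so xnode(k) plays the role of x_{k−1}):

function p = lagrange_eval(xnode, ynode, x)
% Evaluate the Lagrange form P_n at the points in x, given nodes
% xnode(1..n+1) and values ynode(1..n+1).
p = zeros(size(x));
n1 = length(xnode);
for k = 1:n1
    Lk = ones(size(x));                  % build L_k(x)
    for i = [1:k-1, k+1:n1]
        Lk = Lk .* (x - xnode(i)) / (xnode(k) - xnode(i));
    end
    p = p + ynode(k) * Lk;
end
end

For instance, lagrange_eval([1 1.1 1.2 1.3], [1.5574 1.9648 2.5722 3.6021], 1.15) gives about 2.2296, matching the n = 3 entry in the example that follows.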
EXAMPLE
Recall the table
 x       1        1.1      1.2      1.3
 tan x   1.5574   1.9648   2.5722   3.6021
We now interpolate this table with the nodes
x_0 = 1, x_1 = 1.1, x_2 = 1.2, x_3 = 1.3.
Without giving the details of the evaluation process, we have the following results for interpolation with degrees n = 1, 2, 3.
 n            1       2       3
 P_n(1.15)    2.2685  2.2435  2.2296
 Error        .0340   .0090   .0049
It improves with increasing degree n, but not at a very rapid rate. In fact, the error becomes worse when n is increased further. Later we will see that interpolation of a much higher degree, say n ≥ 10, is often poorly behaved when the node points {x_i} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x_0 and x_1, define
f[x_0, x_1] = [f(x_1) − f(x_0)] / (x_1 − x_0).
This is called a first order divided difference of f(x). By the mean value theorem,
f(x_1) − f(x_0) = f'(c)(x_1 − x_0)
for some c between x_0 and x_1. Thus
f[x_0, x_1] = f'(c)
and the divided difference is very much like the derivative, especially if x_0 and x_1 are quite close together. In fact,
f'((x_1 + x_0)/2) ≈ f[x_0, x_1]
is quite an accurate approximation of the derivative.
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x_0, x_1, and x_2, define
f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1]) / (x_2 − x_0).
This is called the second order divided difference of f(x). By a fairly complicated argument, we can show
f[x_0, x_1, x_2] = (1/2) f''(c)
for some c intermediate to x_0, x_1, and x_2. In fact, as we will investigate,
f''(x_1) ≈ 2 f[x_0, x_1, x_2]
in the case the nodes are evenly spaced,
x_1 − x_0 = x_2 − x_1.
EXAMPLE
Consider the table
 x       1        1.1      1.2      1.3      1.4
 cos x   .54030   .45360   .36236   .26750   .16997
Let x_0 = 1, x_1 = 1.1, and x_2 = 1.2. Then
f[x_0, x_1] = (.45360 − .54030)/(1.1 − 1) = −.86700
f[x_1, x_2] = (.36236 − .45360)/(1.2 − 1.1) = −.91240
f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1])/(x_2 − x_0) = (−.91240 − (−.86700))/(1.2 − 1.0) = −.22700.
For comparison,
f'((x_1 + x_0)/2) = −sin(1.05) = −.86742
(1/2) f''(x_1) = −(1/2) cos(1.1) = −.22680.
GENERAL DIVIDED DIFFERENCES
Given n + 1 distinct points x_0, ..., x_n, with n ≥ 2, define
f[x_0, ..., x_n] = (f[x_1, ..., x_n] − f[x_0, ..., x_{n−1}]) / (x_n − x_0).
This is a recursive definition of the n-th order divided difference of f(x), using divided differences of order n − 1. Its relation to the derivative is as follows:
f[x_0, ..., x_n] = (1/n!) f^{(n)}(c)
for some c intermediate to the points {x_0, ..., x_n}. Let I denote the interval
I = [min{x_0, ..., x_n}, max{x_0, ..., x_n}].
Then c ∈ I, and the above result is based on the assumption that f(x) is n-times continuously differentiable on the interval I.
EXAMPLE
The following table gives divided differences for the data in
 x       1        1.1      1.2      1.3      1.4
 cos x   .54030   .45360   .36236   .26750   .16997
For the column headings, we use
D^k f(x_i) = f[x_i, ..., x_{i+k}].

 i   x_i   f(x_i)   Df(x_i)   D²f(x_i)   D³f(x_i)   D⁴f(x_i)
 0   1.0   .54030   -.8670    -.2270      .1533      .0125
 1   1.1   .45360   -.9124    -.1810      .1583
 2   1.2   .36236   -.9486    -.1335
 3   1.3   .26750   -.9753
 4   1.4   .16997

These were computed using the recursive definition
f[x_0, ..., x_n] = (f[x_1, ..., x_n] − f[x_0, ..., x_{n−1}]) / (x_n − x_0).
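A sketch of building this divided difference table in MATLAB (column k+1 of D holds f[x_i, ..., x_{i+k}]; the variable names are illustrative):

x = [1.0 1.1 1.2 1.3 1.4];
f = [.54030 .45360 .36236 .26750 .16997];
n = length(x);
D = nan(n);  D(:,1) = f(:);
for k = 2:n
    for i = 1:n-k+1
        D(i,k) = (D(i+1,k-1) - D(i,k-1)) / (x(i+k-1) - x(i));
    end
end
disp(D)     % first row: .54030  -.8670  -.2270  .1533  .0125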
ORDER OF THE NODES
Looking at f[x_0, x_1], we have
f[x_0, x_1] = (f(x_1) − f(x_0))/(x_1 − x_0) = (f(x_0) − f(x_1))/(x_0 − x_1) = f[x_1, x_0].
The order of x_0 and x_1 does not matter. Looking at
f[x_0, x_1, x_2] = (f[x_1, x_2] − f[x_0, x_1])/(x_2 − x_0),
we can expand it to get
f[x_0, x_1, x_2] = f(x_0)/[(x_0 − x_1)(x_0 − x_2)] + f(x_1)/[(x_1 − x_0)(x_1 − x_2)] + f(x_2)/[(x_2 − x_0)(x_2 − x_1)].
With this formula, we can show that the order of the arguments x_0, x_1, x_2 does not matter in the final value of f[x_0, x_1, x_2] we obtain. Mathematically,
f[x_0, x_1, x_2] = f[x_{i_0}, x_{i_1}, x_{i_2}]
for any permutation (i_0, i_1, i_2) of (0, 1, 2).

We can show in general that the value of f[x_0, ..., x_n] is independent of the order of the arguments {x_0, ..., x_n}, even though the intermediate steps in its calculation using
f[x_0, ..., x_n] = (f[x_1, ..., x_n] − f[x_0, ..., x_{n−1}])/(x_n − x_0)
are order dependent. We can show
f[x_0, ..., x_n] = f[x_{i_0}, ..., x_{i_n}]
for any permutation (i_0, i_1, ..., i_n) of (0, 1, ..., n).
COINCIDENT NODES
What happens when some of the nodes {x_0, ..., x_n} are not distinct? Begin by investigating what happens when they all come together as a single point x_0.

For first order divided differences, we have
lim_{x_1→x_0} f[x_0, x_1] = lim_{x_1→x_0} (f(x_1) − f(x_0))/(x_1 − x_0) = f'(x_0).
We extend the definition of f[x_0, x_1] to coincident nodes using
f[x_0, x_0] = f'(x_0).

For second order divided differences, recall
f[x_0, x_1, x_2] = (1/2) f''(c)
with c intermediate to x_0, x_1, and x_2. Then as x_1 → x_0 and x_2 → x_0, we must also have that c → x_0. Therefore,
lim_{x_1→x_0, x_2→x_0} f[x_0, x_1, x_2] = (1/2) f''(x_0)
and we therefore define
f[x_0, x_0, x_0] = (1/2) f''(x_0).

For the case of general f[x_0, ..., x_n], recall that
f[x_0, ..., x_n] = (1/n!) f^{(n)}(c)
for some c intermediate to {x_0, ..., x_n}. Then
lim_{{x_1,...,x_n}→x_0} f[x_0, ..., x_n] = (1/n!) f^{(n)}(x_0)
and we define
f[x_0, ..., x_0]  (x_0 repeated n + 1 times)  = (1/n!) f^{(n)}(x_0).

What do we do when only some of the nodes are coincident? This too can be dealt with, although we do so here only by examples:
f[x_0, x_1, x_1] = (f[x_1, x_1] − f[x_0, x_1])/(x_1 − x_0) = (f'(x_1) − f[x_0, x_1])/(x_1 − x_0).
The recursion formula can be used in general in this way to allow all possible combinations of possibly coincident nodes.
LAGRANGE'S FORMULA FOR THE INTERPOLATION POLYNOMIAL
Recall the general interpolation problem: find a polynomial P_n(x) for which
deg(P_n) ≤ n
P_n(x_i) = y_i,  i = 0, 1, ..., n
with given data points
(x_0, y_0), (x_1, y_1), ..., (x_n, y_n)
and with {x_0, ..., x_n} distinct points. The solution to this problem is given by Lagrange's formula
P_n(x) = y_0 L_0(x) + y_1 L_1(x) + ⋯ + y_n L_n(x)
with {L_0(x), ..., L_n(x)} the Lagrange basis polynomials. Each L_j is of degree ≤ n and it satisfies
L_j(x_i) = 1 if j = i,  0 if j ≠ i
for i = 0, 1, ..., n.
THE NEWTON DIVIDED DIFFERENCE FORM OF THE INTERPOLATION POLYNOMIAL
Let the data values for the problem
deg(P_n) ≤ n,   P_n(x_i) = y_i,  i = 0, 1, ..., n
be generated from a function f(x):
y_i = f(x_i),  i = 0, 1, ..., n.
Using the divided differences
f[x_0, x_1], f[x_0, x_1, x_2], ..., f[x_0, ..., x_n]
we can write the interpolation polynomials
P_1(x), P_2(x), ..., P_n(x)
in a way that is simple to compute:
P_1(x) = f(x_0) + f[x_0, x_1](x − x_0)
P_2(x) = f(x_0) + f[x_0, x_1](x − x_0) + f[x_0, x_1, x_2](x − x_0)(x − x_1)
       = P_1(x) + f[x_0, x_1, x_2](x − x_0)(x − x_1).
For the case of the general problem, we have
P_n(x) = f(x_0) + f[x_0, x_1](x − x_0)
       + f[x_0, x_1, x_2](x − x_0)(x − x_1)
       + f[x_0, x_1, x_2, x_3](x − x_0)(x − x_1)(x − x_2)
       + ⋯
       + f[x_0, ..., x_n](x − x_0) ⋯ (x − x_{n−1}).
From this we have the recursion relation
P_n(x) = P_{n−1}(x) + f[x_0, ..., x_n](x − x_0) ⋯ (x − x_{n−1})
in which P_{n−1}(x) interpolates f(x) at the points in {x_0, ..., x_{n−1}}.
Example. Recall the table
 i   x_i   f(x_i)   Df(x_i)   D²f(x_i)   D³f(x_i)   D⁴f(x_i)
 0   1.0   .54030   -.8670    -.2270      .1533      .0125
 1   1.1   .45360   -.9124    -.1810      .1583
 2   1.2   .36236   -.9486    -.1335
 3   1.3   .26750   -.9753
 4   1.4   .16997
with D^k f(x_i) = f[x_i, ..., x_{i+k}], k = 1, 2, 3, 4. Then
P_1(x) = .5403 − .8670(x − 1)
P_2(x) = P_1(x) − .2270(x − 1)(x − 1.1)
P_3(x) = P_2(x) + .1533(x − 1)(x − 1.1)(x − 1.2)
P_4(x) = P_3(x) + .0125(x − 1)(x − 1.1)(x − 1.2)(x − 1.3).
Using this table and these formulas, we have the following table of interpolants for the value x = 1.05. The true value is cos(1.05) = .49757105.
 n            1         2         3          4
 P_n(1.05)    .49695    .49752    .49758     .49757
 Error        6.20E−4   5.00E−5   −1.00E−5   0.0
EVALUATION OF THE DIVIDED DIFFERENCE INTERPOLATION POLYNOMIAL
Let
d_1 = f[x_0, x_1],  d_2 = f[x_0, x_1, x_2],  ...,  d_n = f[x_0, ..., x_n].
Then the formula
P_n(x) = f(x_0) + f[x_0, x_1](x − x_0) + f[x_0, x_1, x_2](x − x_0)(x − x_1) + ⋯ + f[x_0, ..., x_n](x − x_0) ⋯ (x − x_{n−1})
can be written as
P_n(x) = f(x_0) + (x − x_0)(d_1 + (x − x_1)(d_2 + ⋯ + (x − x_{n−2})(d_{n−1} + (x − x_{n−1}) d_n) ⋯ )).
Thus we have a nested polynomial evaluation, and this is quite efficient in computational cost.
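A minimal MATLAB sketch of this nested evaluation (the coefficients are the top row of the divided difference table for cos x above; the evaluation point t = 1.05 is taken from that example):

xnode = [1.0 1.1 1.2 1.3 1.4];
d = [.54030 -.8670 -.2270 .1533 .0125];   % f(x0), f[x0,x1], ..., f[x0,...,x4]
t = 1.05;                                  % evaluation point
p = d(end);
for k = length(d)-1:-1:1
    p = d(k) + (t - xnode(k)) .* p;        % Horner-like nesting
end
p                                          % approx cos(1.05) = .49757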
ERROR IN LINEAR INTERPOLATION
Let P_1(x) denote the linear polynomial interpolating f(x) at x_0 and x_1, with f(x) a given function (e.g. f(x) = cos x). What is the error f(x) − P_1(x)?

Let f(x) be twice continuously differentiable on an interval [a, b] which contains the points {x_0, x_1}. Then for a ≤ x ≤ b,
f(x) − P_1(x) = [(x − x_0)(x − x_1)/2] f''(c_x)
for some c_x between the minimum and maximum of x_0, x_1, and x. If x_1 and x are close to x_0, then
f(x) − P_1(x) ≈ [(x − x_0)(x − x_1)/2] f''(x_0).
Thus the error acts like a quadratic polynomial, with zeros at x_0 and x_1.
EXAMPLE
Let f(x) = log₁₀ x; and in line with typical tables of log₁₀ x, we take 1 ≤ x, x_0, x_1 ≤ 10. For definiteness, let x_0 < x_1 with h = x_1 − x_0. Then
f''(x) = −(log₁₀ e)/x²
log₁₀ x − P_1(x) = [(x − x_0)(x − x_1)/2][−(log₁₀ e)/c_x²] = (x − x_0)(x_1 − x)[(log₁₀ e)/(2 c_x²)].
We usually are interpolating with x_0 ≤ x ≤ x_1; and in that case, we have
(x − x_0)(x_1 − x) ≥ 0,   x_0 ≤ c_x ≤ x_1
and therefore
(x − x_0)(x_1 − x)[(log₁₀ e)/(2 x_1²)] ≤ log₁₀ x − P_1(x) ≤ (x − x_0)(x_1 − x)[(log₁₀ e)/(2 x_0²)].
For h = x_1 − x_0 small, we have for x_0 ≤ x ≤ x_1
log₁₀ x − P_1(x) ≈ (x − x_0)(x_1 − x)[(log₁₀ e)/(2 x_0²)].
Typical high school algebra textbooks contain tables of log₁₀ x with a spacing of h = .01. What is the error in this case? To look at this, we use
0 ≤ log₁₀ x − P_1(x) ≤ (x − x_0)(x_1 − x)[(log₁₀ e)/(2 x_0²)].
By simple geometry or calculus,
max_{x_0 ≤ x ≤ x_1} (x − x_0)(x_1 − x) = h²/4.
Therefore,
0 ≤ log₁₀ x − P_1(x) ≤ (h²/4)[(log₁₀ e)/(2 x_0²)] ≐ .0543 h²/x_0².
If we want a uniform bound for all points 1 ≤ x_0 ≤ 10, we have
0 ≤ log₁₀ x − P_1(x) ≤ h² (log₁₀ e)/8 ≐ .0543 h².
For h = .01, as is typical of the high school textbook tables of log₁₀ x,
0 ≤ log₁₀ x − P_1(x) ≤ 5.43 × 10⁻⁶.
If you look at most tables, a typical entry is given to only four decimal places to the right of the decimal point, e.g.
log₁₀ 5.41 ≐ .7332.
Therefore the entries are in error by as much as .00005. Comparing this with the interpolation error, we see the latter is less important than the rounding errors in the table entries.
From the bound
0 ≤ log₁₀ x − P_1(x) ≤ h² (log₁₀ e)/(8 x_0²) ≐ .0543 h²/x_0²
we see the error decreases as x_0 increases, and it is about 100 times smaller for points near 10 than for points near 1.
AN ERROR FORMULA: THE GENERAL CASE
Recall the general interpolation problem: find a polynomial P_n(x) for which
deg(P_n) ≤ n,   P_n(x_i) = f(x_i),  i = 0, 1, ..., n
with distinct node points {x_0, ..., x_n} and a given function f(x). Let [a, b] be a given interval on which f(x) is (n + 1)-times continuously differentiable; and assume the points x_0, ..., x_n, and x are contained in [a, b]. Then
f(x) − P_n(x) = [(x − x_0)(x − x_1) ⋯ (x − x_n)/(n + 1)!] f^{(n+1)}(c_x)
with c_x some point between the minimum and maximum of the points in {x, x_0, ..., x_n}.

As shorthand, introduce
Ψ_n(x) = (x − x_0)(x − x_1) ⋯ (x − x_n),
a polynomial of degree n + 1 with roots {x_0, ..., x_n}. Then
f(x) − P_n(x) = [Ψ_n(x)/(n + 1)!] f^{(n+1)}(c_x).
THE QUADRATIC CASE
For n = 2, we have
f(x) − P_2(x) = [(x − x_0)(x − x_1)(x − x_2)/3!] f^{(3)}(c_x)     (*)
with c_x some point between the minimum and maximum of the points in {x, x_0, x_1, x_2}.

To illustrate the use of this formula, consider the case of evenly spaced nodes:
x_1 = x_0 + h,  x_2 = x_1 + h.
Further suppose we have x_0 ≤ x ≤ x_2, as we would usually have when interpolating in a table of given function values (e.g. log₁₀ x). The quantity
Ψ_2(x) = (x − x_0)(x − x_1)(x − x_2)
can be evaluated directly for a particular x.
[Figure: graph of Ψ_2(x) = (x + h)x(x − h) for the nodes (x_0, x_1, x_2) = (−h, 0, h).]

In the formula (*), however, we do not know c_x, and therefore we replace |f^{(3)}(c_x)| with a maximum of |f^{(3)}(x)| as x varies over x_0 ≤ x ≤ x_2. This yields
|f(x) − P_2(x)| ≤ [|Ψ_2(x)|/3!] max_{x_0 ≤ x ≤ x_2} |f^{(3)}(x)|.     (**)
If we want a uniform bound for x_0 ≤ x ≤ x_2, we must compute
max_{x_0 ≤ x ≤ x_2} |Ψ_2(x)| = max_{x_0 ≤ x ≤ x_2} |(x − x_0)(x − x_1)(x − x_2)|.
Using calculus,
max_{x_0 ≤ x ≤ x_2} |Ψ_2(x)| = 2h³/(3√3),  attained at x = x_1 ± h/√3.
Combined with (**), this yields
|f(x) − P_2(x)| ≤ [h³/(9√3)] max_{x_0 ≤ x ≤ x_2} |f^{(3)}(x)|
for x_0 ≤ x ≤ x_2. For f(x) = log₁₀ x, with 1 ≤ x_0 ≤ x ≤ x_2 ≤ 10, this leads to
|log₁₀ x − P_2(x)| ≤ [h³/(9√3)] max_{x_0 ≤ x ≤ x_2} (2 log₁₀ e)/x³ = .05572 h³/x_0³.
For the case of h = .01, we have
|log₁₀ x − P_2(x)| ≤ 5.57 × 10⁻⁸/x_0³ ≤ 5.57 × 10⁻⁸.

Question: How much larger could we make h so that quadratic interpolation would have an error comparable to that of linear interpolation of log₁₀ x with h = .01? The error bound for the linear interpolation was 5.43 × 10⁻⁶, and therefore we want the same to be true of quadratic interpolation. Using a simpler bound, we want to find h so that
|log₁₀ x − P_2(x)| ≤ .05572 h³ ≤ 5 × 10⁻⁶.
This is true if h = .04477. Therefore a spacing of h = .04 would be sufficient. A table with this spacing and quadratic interpolation would have an error comparable to a table with h = .01 and linear interpolation.
For the case of general n,
f(x) − P_n(x) = [(x − x_0) ⋯ (x − x_n)/(n + 1)!] f^{(n+1)}(c_x) = [Ψ_n(x)/(n + 1)!] f^{(n+1)}(c_x)
Ψ_n(x) = (x − x_0)(x − x_1) ⋯ (x − x_n)
with c_x some point between the minimum and maximum of the points in {x, x_0, ..., x_n}. When bounding the error we replace f^{(n+1)}(c_x) with its maximum over the interval containing {x, x_0, ..., x_n}, as we have illustrated earlier in the linear and quadratic cases.

Consider now the function
Ψ_n(x)/(n + 1)!
over the interval determined by the minimum and maximum of the points in {x, x_0, ..., x_n}. For evenly spaced node points on [0, 1], with x_0 = 0 and x_n = 1, we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9 on accompanying pages.
DISCUSSION OF ERROR
Consider the error
f(x) − P_n(x) = [Ψ_n(x)/(n + 1)!] f^{(n+1)}(c_x),   Ψ_n(x) = (x − x_0)(x − x_1) ⋯ (x − x_n)
as n increases and as x varies. As noted previously, we cannot do much with f^{(n+1)}(c_x) except to replace it with a maximum value of |f^{(n+1)}(x)| over a suitable interval. Thus we concentrate on understanding the size of
Ψ_n(x)/(n + 1)!
ERROR FOR EVENLY SPACED NODES
We consider first the case in which the node points are evenly spaced, as this seems the natural way to define the points at which interpolation is carried out. Moreover, using evenly spaced nodes is the case to consider for table interpolation. What can we learn from the given graphs?

The interpolation nodes are determined by using
h = 1/n,  x_0 = 0, x_1 = h, x_2 = 2h, ..., x_n = nh = 1.
For this case,
Ψ_n(x) = x(x − h)(x − 2h) ⋯ (x − 1).
Our graphs are the cases of n = 2, ..., 9.
[Figures: graphs of Ψ_n(x) on [0, 1] for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9; the oscillations are much larger near the endpoints than near the middle of the interval.]
[Figure: graph of Ψ_6(x) = (x − x_0)(x − x_1) ⋯ (x − x_6) with evenly spaced nodes x_0, ..., x_6.]
Using the following table,
 n   M_n        n    M_n
 1   1.25E−1    6    4.76E−7
 2   2.41E−2    7    2.20E−8
 3   2.06E−3    8    9.11E−10
 4   1.48E−4    9    3.39E−11
 5   9.01E−6    10   1.15E−12
we can observe that the maximum
M_n ≡ max_{x_0 ≤ x ≤ x_n} |Ψ_n(x)|/(n + 1)!
becomes smaller with increasing n.

From the graphs, there is enormous variation in the size of Ψ_n(x) as x varies over [0, 1]; and thus there is also enormous variation in the error as x so varies. For example, in the n = 9 case,
max_{x_0 ≤ x ≤ x_1} |Ψ_n(x)|/(n + 1)! = 3.39 × 10⁻¹¹
max_{x_4 ≤ x ≤ x_5} |Ψ_n(x)|/(n + 1)! = 6.89 × 10⁻¹³
and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when x_0 ≤ x ≤ x_1 as compared to the case when x_4 ≤ x ≤ x_5. When doing table interpolation, the point x at which you are interpolating should be centrally located with respect to the interpolation nodes {x_0, ..., x_n} being used to define the interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation polynomial to approximate a given function f(x) on a given interval [a, b]. In particular, take interpolation nodes
a ≤ x_0 < x_1 < ⋯ < x_{n−1} < x_n ≤ b
and produce the interpolation polynomial P_n(x) that interpolates f(x) at the given node points. We would like to have
max_{a ≤ x ≤ b} |f(x) − P_n(x)| → 0 as n → ∞.
Does it happen?

Recall the error bound
max_{a ≤ x ≤ b} |f(x) − P_n(x)| ≤ [max_{a ≤ x ≤ b} |Ψ_n(x)|/(n + 1)!] · max_{a ≤ x ≤ b} |f^{(n+1)}(x)|.
We begin with an example using evenly spaced node points.
RUNGE'S EXAMPLE
Use evenly spaced node points:
h = (b − a)/n,   x_i = a + ih for i = 0, ..., n.
For some functions, such as f(x) = e^x, the maximum error goes to zero quite rapidly. But the size of the derivative term f^{(n+1)}(x) in
max_{a ≤ x ≤ b} |f(x) − P_n(x)| ≤ [1/(n + 1)!] max_{a ≤ x ≤ b} |Ψ_n(x)| · max_{a ≤ x ≤ b} |f^{(n+1)}(x)|
can badly hurt or destroy the convergence of other cases.

In particular, we show the graph of
f(x) = 1/(1 + x²)
and P_n(x) on [−5, 5] for the case n = 10. It can be proven that for this function, the maximum error on [−5, 5] does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function f(x) by interpolation.
[Figure: Runge's example with n = 10 — the curves y = 1/(1 + x²) and y = P₁₀(x) on [−5, 5], with large oscillations of P₁₀ near the endpoints.]
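A minimal MATLAB sketch reproducing this example (polyfit/polyval are used purely for illustration; the fine evaluation grid of 1001 points is an arbitrary choice):

% Runge's example with n = 10 evenly spaced nodes
f  = @(x) 1./(1 + x.^2);
n  = 10;
xi = linspace(-5, 5, n+1);              % interpolation nodes
c  = polyfit(xi, f(xi), n);             % coefficients of P_10
xx = linspace(-5, 5, 1001);
err = max(abs(f(xx) - polyval(c, xx)))  % large, and it grows as n increases
plot(xx, f(xx), xx, polyval(c, xx), xi, f(xi), 'o')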
OTHER CHOICES OF NODES
Recall the general error bound
max_{a ≤ x ≤ b} |f(x) − P_n(x)| ≤ [max_{a ≤ x ≤ b} |Ψ_n(x)|/(n + 1)!] · max_{a ≤ x ≤ b} |f^{(n+1)}(x)|.
There is nothing we really can do with the derivative term for f; but we can examine the way of defining the nodes {x_0, ..., x_n} within the interval [a, b]. We ask how these nodes can be chosen so that the maximum of |Ψ_n(x)| over [a, b] is made as small as possible.

This problem has quite an elegant solution, and it will be considered in the next lecture. The node points {x_0, ..., x_n} turn out to be the zeros of a particular polynomial T_{n+1}(x) of degree n + 1, called a Chebyshev polynomial. These zeros are known explicitly, and with them
max_{a ≤ x ≤ b} |Ψ_n(x)| = [(b − a)/2]^{n+1} / 2^n.
This turns out to be smaller than for evenly spaced cases; and although this polynomial interpolation does not work for all functions f(x), it works for all differentiable functions and more.
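For comparison, a sketch of the same Runge function interpolated at Chebyshev nodes; the node formula below is the standard one for [a, b], stated here as an assumption since its derivation is deferred to the next lecture:

% Chebyshev nodes: x_i = (a+b)/2 + ((b-a)/2)*cos((2i+1)*pi/(2n+2))
f = @(x) 1./(1 + x.^2);
a = -5;  b = 5;  n = 10;
i  = 0:n;
xi = (a+b)/2 + (b-a)/2 * cos((2*i+1)*pi/(2*n+2));
c  = polyfit(xi, f(xi), n);
xx = linspace(a, b, 1001);
err = max(abs(f(xx) - polyval(c, xx)))   % much smaller than with even spacing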
ANOTHER ERROR FORMULA
Recall the error formula
f(x) − P_n(x) = [Ψ_n(x)/(n + 1)!] f^{(n+1)}(c),   Ψ_n(x) = (x − x_0)(x − x_1) ⋯ (x − x_n)
with c between the minimum and maximum of {x_0, ..., x_n, x}. A second formula is given by
f(x) − P_n(x) = Ψ_n(x) f[x_0, ..., x_n, x].
To show this is a simple, but somewhat subtle argument.

Let P_{n+1}(x) denote the polynomial of degree ≤ n + 1 which interpolates f(x) at the points {x_0, ..., x_n, x_{n+1}}. Then
P_{n+1}(x) = P_n(x) + f[x_0, ..., x_n, x_{n+1}](x − x_0) ⋯ (x − x_n).
Substituting x = x_{n+1}, and using the fact that P_{n+1}(x) interpolates f(x) at x_{n+1}, we have
f(x_{n+1}) = P_n(x_{n+1}) + f[x_0, ..., x_n, x_{n+1}](x_{n+1} − x_0) ⋯ (x_{n+1} − x_n).
In this formula, the number x_{n+1} is completely arbitrary, other than being distinct from the points in {x_0, ..., x_n}. To emphasize this fact, replace x_{n+1} by x throughout the formula, obtaining
f(x) = P_n(x) + f[x_0, ..., x_n, x](x − x_0) ⋯ (x − x_n) = P_n(x) + Ψ_n(x) f[x_0, ..., x_n, x],
provided x ≠ x_0, ..., x_n.
The formula
f(x) = P_n(x) + f[x_0, ..., x_n, x](x − x_0) ⋯ (x − x_n) = P_n(x) + Ψ_n(x) f[x_0, ..., x_n, x]
is easily true for x a node point, since then Ψ_n(x) = 0 and P_n(x) = f(x). Provided f(x) is differentiable, the divided difference f[x_0, ..., x_n, x] is also defined when x coincides with a node point. This shows
f(x) − P_n(x) = Ψ_n(x) f[x_0, ..., x_n, x].
Compare the two error formulas
f(x) − P_n(x) = Ψ_n(x) f[x_0, ..., x_n, x]
f(x) − P_n(x) = [Ψ_n(x)/(n + 1)!] f^{(n+1)}(c).
Then
Ψ_n(x) f[x_0, ..., x_n, x] = [Ψ_n(x)/(n + 1)!] f^{(n+1)}(c)
f[x_0, ..., x_n, x] = f^{(n+1)}(c)/(n + 1)!
for some c between the smallest and largest of the numbers in {x_0, ..., x_n, x}.
To make this somewhat symmetric in its arguments, let m = n + 1, x = x_m. Then
f[x_0, ..., x_{m−1}, x_m] = f^{(m)}(c)/m!
with c an unknown number between the smallest and largest of the numbers in {x_0, ..., x_m}. This was given in an earlier lecture where divided differences were introduced.
PIECEWISE POLYNOMIAL INTERPOLATION
Recall the examples of higher degree polynomial interpolation of the function f(x) = (1 + x²)⁻¹ on [−5, 5]. The interpolants P_n(x) oscillated a great deal, whereas the function f(x) was nonoscillatory. To obtain interpolants that are better behaved, we look at other forms of interpolating functions.

Consider the data
 x   0     1     2     2.5   3     3.5    4
 y   2.5   0.5   0.5   1.5   1.5   1.125  0
What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text are the graphs of the degree 6 polynomial interpolant, along with those of piecewise linear and piecewise quadratic interpolating functions. Since we only have the data to consider, we would generally want to use an interpolant that had somewhat the shape of that of the piecewise linear interpolant.
[Figures: the data points; the piecewise linear interpolant; the degree 6 polynomial interpolant; and the piecewise quadratic interpolant, all on [0, 4].]
PIECEWISE POLYNOMIAL FUNCTIONS
Consider being given a set of data points (x_1, y_1), ..., (x_n, y_n), with
x_1 < x_2 < ⋯ < x_n.
Then the simplest way to connect the points (x_j, y_j) is by straight line segments. This is called a piecewise linear interpolant of the data {(x_j, y_j)}. This graph has corners, and often we expect the interpolant to have a smooth graph.

To obtain a somewhat smoother graph, consider using piecewise quadratic interpolation. Begin by constructing the quadratic polynomial that interpolates
{(x_1, y_1), (x_2, y_2), (x_3, y_3)}.
Then construct the quadratic polynomial that interpolates
{(x_3, y_3), (x_4, y_4), (x_5, y_5)}.
Continue this process of constructing quadratic interpolants on the subintervals
[x_1, x_3], [x_3, x_5], [x_5, x_7], ...
If the number of subintervals is even (and therefore n is odd), then this process comes out fine, with the last interval being [x_{n−2}, x_n]. This was illustrated on the graph for the preceding data. If, however, n is even, then the approximation on the last interval must be handled by some modification of this procedure. Suggest such a modification!

With piecewise quadratic interpolants, however, there are corners on the graph of the interpolating function. With our preceding example, they are at x_3 and x_5. How do we avoid this?

Piecewise polynomial interpolants are used in many applications. We will consider them later, to obtain numerical integration formulas.
SMOOTH NON-OSCILLATORY INTERPOLATION
Let data points (x_1, y_1), ..., (x_n, y_n) be given, and let
x_1 < x_2 < ⋯ < x_n.
Consider finding functions s(x) for which the following properties hold:
(1) s(x_i) = y_i, i = 1, ..., n;
(2) s(x), s'(x), s''(x) are continuous on [x_1, x_n].
Then among such functions s(x) satisfying these properties, find the one which minimizes the integral
∫_{x_1}^{x_n} [s''(x)]² dx.
The idea of minimizing the integral is to obtain an interpolating function for which the first derivative does not change rapidly. It turns out there is a unique solution to this problem, and it is called a natural cubic spline function.
SPLINE FUNCTIONS
Let a set of node points {x_i} be given, satisfying
a ≤ x_1 < x_2 < ⋯ < x_n ≤ b
for some numbers a and b. Often we use [a, b] = [x_1, x_n]. A cubic spline function s(x) on [a, b] with breakpoints or knots {x_i} has the following properties:
1. On each of the intervals
[a, x_1], [x_1, x_2], ..., [x_{n−1}, x_n], [x_n, b]
s(x) is a polynomial of degree ≤ 3.
2. s(x), s'(x), s''(x) are continuous on [a, b].
In the case that we have given data points (x_1, y_1), ..., (x_n, y_n), we say s(x) is a cubic interpolating spline function for this data if
3. s(x_i) = y_i, i = 1, ..., n.
EXAMPLE
Define
(x − α)₊³ = (x − α)³ for x ≥ α,  and 0 for x ≤ α.
This is a cubic spline function on (−∞, ∞) with the single breakpoint x_1 = α.

Combinations of these form more complicated cubic spline functions. For example,
s(x) = 3(x − 1)₊³ − 2(x − 3)₊³
is a cubic spline function on (−∞, ∞) with the breakpoints x_1 = 1, x_2 = 3.

Define
s(x) = p_3(x) + Σ_{j=1}^{n} a_j (x − x_j)₊³
with p_3(x) some cubic polynomial. Then s(x) is a cubic spline function on (−∞, ∞) with breakpoints {x_1, ..., x_n}.
Return to the earlier problem of choosing an interpolating function s(x) to minimize the integral
∫_{x_1}^{x_n} [s''(x)]² dx.
There is a unique solution to this problem. The solution s(x) is a cubic interpolating spline function, and moreover, it satisfies
s''(x_1) = s''(x_n) = 0.
Spline functions satisfying these boundary conditions are called natural cubic spline functions, and the solution to our minimization problem is a natural cubic interpolatory spline function. We will show a method to construct this function from the interpolation data.

Motivation for these boundary conditions can be given by looking at the physics of bending thin beams of flexible materials to pass through the given data. To the left of x_1 and to the right of x_n, the beam is straight and therefore the second derivatives are zero at the transition points x_1 and x_n.
CONSTRUCTION OF THE INTERPOLATING SPLINE FUNCTION
To make the presentation more specific, suppose we have data
(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)
with x_1 < x_2 < x_3 < x_4. Then on each of the intervals
[x_1, x_2], [x_2, x_3], [x_3, x_4]
s(x) is a cubic polynomial. Taking the first interval, s(x) is a cubic polynomial and s''(x) is a linear polynomial. Let
M_i = s''(x_i),  i = 1, 2, 3, 4.
Then on [x_1, x_2],
s''(x) = [(x_2 − x)M_1 + (x − x_1)M_2]/(x_2 − x_1),   x_1 ≤ x ≤ x_2.
We can find s(x) by integrating twice:
s(x) = [(x_2 − x)³M_1 + (x − x_1)³M_2] / [6(x_2 − x_1)] + c_1 x + c_2.
We determine the constants of integration by using
s(x_1) = y_1,  s(x_2) = y_2.     (*)
Then
s(x) = [(x_2 − x)³M_1 + (x − x_1)³M_2] / [6(x_2 − x_1)]
     + [(x_2 − x)y_1 + (x − x_1)y_2] / (x_2 − x_1)
     − [(x_2 − x_1)/6] [(x_2 − x)M_1 + (x − x_1)M_2]
for x_1 ≤ x ≤ x_2.
Check that this formula satisfies the given interpolation condition (*)!
We can repeat this on the intervals [x_2, x_3] and [x_3, x_4], obtaining similar formulas.
For x_2 ≤ x ≤ x_3,
s(x) = [(x_3 − x)³M_2 + (x − x_2)³M_3] / [6(x_3 − x_2)]
     + [(x_3 − x)y_2 + (x − x_2)y_3] / (x_3 − x_2)
     − [(x_3 − x_2)/6] [(x_3 − x)M_2 + (x − x_2)M_3].
For x_3 ≤ x ≤ x_4,
s(x) = [(x_4 − x)³M_3 + (x − x_3)³M_4] / [6(x_4 − x_3)]
     + [(x_4 − x)y_3 + (x − x_3)y_4] / (x_4 − x_3)
     − [(x_4 − x_3)/6] [(x_4 − x)M_3 + (x − x_3)M_4].
We still do not know the values of the second derivatives {M_1, M_2, M_3, M_4}. The above formulas guarantee that s(x) and s''(x) are continuous for x_1 ≤ x ≤ x_4. For example, the formula on [x_1, x_2] yields
s(x_2) = y_2,  s''(x_2) = M_2.
The formula on [x_2, x_3] also yields
s(x_2) = y_2,  s''(x_2) = M_2.
All that is lacking is to make s'(x) continuous at x_2 and x_3. Thus we require
s'(x_2 + 0) = s'(x_2 − 0)
s'(x_3 + 0) = s'(x_3 − 0).     (**)
This means
lim_{x↓x_2} s'(x) = lim_{x↑x_2} s'(x)
and similarly for x_3.
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
x_2 = x_1 + h,  x_3 = x_1 + 2h,  x_4 = x_1 + 3h.
Then our earlier formulas simplify to
s(x) = [(x_2 − x)³M_1 + (x − x_1)³M_2] / (6h)
     + [(x_2 − x)y_1 + (x − x_1)y_2] / h
     − (h/6) [(x_2 − x)M_1 + (x − x_1)M_2]
for x_1 ≤ x ≤ x_2, with similar formulas on [x_2, x_3] and [x_3, x_4].
Without going through all of the algebra, the conditions (**) lead to the following pair of equations:
(h/6)M_1 + (2h/3)M_2 + (h/6)M_3 = (y_3 − y_2)/h − (y_2 − y_1)/h
(h/6)M_2 + (2h/3)M_3 + (h/6)M_4 = (y_4 − y_3)/h − (y_3 − y_2)/h.
This gives us two equations in four unknowns. The earlier boundary conditions on s''(x) give us immediately
M_1 = M_4 = 0.
Then we can solve the linear system for M_2 and M_3.
EXAMPLE
Consider the interpolation data points
 x   1   2     3     4
 y   1   1/2   1/3   1/4
In this case, h = 1, and the linear system becomes
(2/3)M_2 + (1/6)M_3 = y_3 − 2y_2 + y_1 = 1/3
(1/6)M_2 + (2/3)M_3 = y_4 − 2y_3 + y_2 = 1/12.
This has the solution
M_2 = 1/2,  M_3 = 0.
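As a quick check, this small system can be solved in MATLAB; the matrix and right-hand side simply restate the two equations above:

A = [2/3 1/6; 1/6 2/3];
b = [1/3; 1/12];
M = A\b          % returns [0.5; 0], i.e. M2 = 1/2, M3 = 0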
This leads to the spline function formula on each subinterval.
On [1, 2],
s(x) = [(x_2 − x)³M_1 + (x − x_1)³M_2] / (6h)
     + [(x_2 − x)y_1 + (x − x_1)y_2] / h
     − (h/6) [(x_2 − x)M_1 + (x − x_1)M_2]
     = [(2 − x)³·0 + (x − 1)³·(1/2)] / 6 + [(2 − x)·1 + (x − 1)·(1/2)] − (1/6)[(2 − x)·0 + (x − 1)·(1/2)]
     = (1/12)(x − 1)³ − (7/12)(x − 1) + 1.
Similarly, for 2 ≤ x ≤ 3,
s(x) = −(1/12)(x − 2)³ + (1/4)(x − 2)² − (1/3)(x − 2) + 1/2,
and for 3 ≤ x ≤ 4,
s(x) = −(1/12)(x − 4) + 1/4.
[Figure: graph of the natural cubic spline interpolant to the data x = 1, 2, 3, 4, y = 1, 1/2, 1/3, 1/4, plotted on [1, 4] together with y = 1/x.]

[Figure: the interpolating natural cubic spline function for the earlier data
 x   0     1     2     2.5   3     3.5    4
 y   2.5   0.5   0.5   1.5   1.5   1.125  0
on [0, 4].]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
(h/6)M_1 + (2h/3)M_2 + (h/6)M_3 = (y_3 − y_2)/h − (y_2 − y_1)/h
(h/6)M_2 + (2h/3)M_3 + (h/6)M_4 = (y_4 − y_3)/h − (y_3 − y_2)/h.
Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M_1 and M_4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f''(x) = 2/x³, and thus we could use
M_1 = 2,  M_4 = 1/32.
With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
s(x_i) = y_i,  i = 1, 2, 3, 4
with the boundary conditions
s'(x_1) = y'_1,  s'(x_4) = y'_4.     (#)
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
(h/3)M_1 + (h/6)M_2 = (y_2 − y_1)/h − y'_1
(h/6)M_1 + (2h/3)M_2 + (h/6)M_3 = (y_3 − y_2)/h − (y_2 − y_1)/h
(h/6)M_2 + (2h/3)M_3 + (h/6)M_4 = (y_4 − y_3)/h − (y_3 − y_2)/h
(h/6)M_3 + (h/3)M_4 = y'_4 − (y_4 − y_3)/h.
For our numerical example, it is natural to obtain these derivative values from f'(x) = −1/x²:
y'_1 = −1,  y'_4 = −1/16.
When combined with our earlier equations, we have the system
(1/3)M_1 + (1/6)M_2 = 1/2
(1/6)M_1 + (2/3)M_2 + (1/6)M_3 = 1/3
(1/6)M_2 + (2/3)M_3 + (1/6)M_4 = 1/12
(1/6)M_3 + (1/3)M_4 = 1/48.
This has the solution
[M_1, M_2, M_3, M_4] = [173/120, 7/60, 11/120, 1/60].
We can now write the function s(x) for each of the subintervals [x_1, x_2], [x_2, x_3], and [x_3, x_4]. Recall for x_1 ≤ x ≤ x_2,
s(x) = [(x_2 − x)³M_1 + (x − x_1)³M_2] / (6h)
     + [(x_2 − x)y_1 + (x − x_1)y_2] / h
     − (h/6) [(x_2 − x)M_1 + (x − x_1)M_2].
We can substitute in from the data
 x   1   2     3     4
 y   1   1/2   1/3   1/4
and the solutions {M_i}. Doing so, consider the error f(x) − s(x). As an example,
f(x) = 1/x,  f(3/2) = 2/3,  s(3/2) = .65260.
This is quite a decent approximation.
THE GENERAL PROBLEM
Consider the spline interpolation problem with n nodes
(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
and assume the node points {x_i} are evenly spaced,
x_j = x_1 + (j − 1)h,  j = 1, ..., n.
We have that the interpolating spline s(x) on x_j ≤ x ≤ x_{j+1} is given by
s(x) = [(x_{j+1} − x)³M_j + (x − x_j)³M_{j+1}] / (6h)
     + [(x_{j+1} − x)y_j + (x − x_j)y_{j+1}] / h
     − (h/6) [(x_{j+1} − x)M_j + (x − x_j)M_{j+1}]
for j = 1, ..., n − 1.
To enforce continuity of s'(x) at the interior node points x_2, ..., x_{n−1}, the second derivatives {M_j} must satisfy the linear equations
(h/6)M_{j−1} + (2h/3)M_j + (h/6)M_{j+1} = (y_{j−1} − 2y_j + y_{j+1})/h
for j = 2, ..., n − 1. Writing them out,
(h/6)M_1 + (2h/3)M_2 + (h/6)M_3 = (y_1 − 2y_2 + y_3)/h
(h/6)M_2 + (2h/3)M_3 + (h/6)M_4 = (y_2 − 2y_3 + y_4)/h
  ⋮
(h/6)M_{n−2} + (2h/3)M_{n−1} + (h/6)M_n = (y_{n−2} − 2y_{n−1} + y_n)/h.
This is a system of n − 2 equations in the n unknowns {M_1, ..., M_n}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added boundary conditions, this form of linear system can be solved very efficiently.
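A minimal MATLAB sketch of assembling and solving this system under the natural boundary conditions M_1 = M_n = 0 (introduced just below); the evenly spaced nodes and the function name are assumptions for illustration:

function M = natural_spline_moments(y, h)
% Second derivatives M_j of the natural cubic spline (M_1 = M_n = 0),
% assuming evenly spaced nodes with spacing h; y holds y_1,...,y_n.
n = length(y);
y = y(:);
rhs = (y(1:n-2) - 2*y(2:n-1) + y(3:n)) / h;                 % interior equations
A = (2*h/3)*eye(n-2) ...
  + (h/6)*(diag(ones(n-3,1),1) + diag(ones(n-3,1),-1));     % tridiagonal matrix
M = [0; A\rhs; 0];
end

For the earlier example, natural_spline_moments([1 1/2 1/3 1/4], 1) returns [0; 0.5; 0; 0], in agreement with M_2 = 1/2, M_3 = 0.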
BOUNDARY CONDITIONS
Natural boundary conditions:
s''(x_1) = s''(x_n) = 0.
Spline functions satisfying these conditions are called natural cubic splines. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.

Clamped boundary conditions: We add the conditions
s'(x_1) = y'_1,  s'(x_n) = y'_n
with y'_1, y'_n given slopes for the endpoints of s(x) on [x_1, x_n]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.

Not-a-knot boundary conditions: This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in MATLAB.
THE NOT-A-KNOT CONDITIONS

As before, let the interpolation nodes be

(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points

(x_1, y_1), (x_3, y_3), ..., (x_{n−2}, y_{n−2}), (x_n, y_n)

thus deleting two of the points. We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals

[x_1, x_3], [x_3, x_4], ..., [x_{n−3}, x_{n−2}], [x_{n−2}, x_n]

This leads to n − 4 equations in the n − 2 unknowns M_1, M_3, ..., M_{n−2}, M_n. The two additional boundary conditions are

s(x_2) = y_2,   s(x_{n−1}) = y_{n−1}

These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M_1, M_3, ..., M_{n−2}, M_n.
x   0    1    2    2.5   3    3.5    4
y   2.5  0.5  0.5  1.5   1.5  1.125  0

Figure: the interpolating cubic spline function with not-a-knot boundary conditions for this data.
MATLAB SPLINE FUNCTION LIBRARY

Given data points

(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)

type arrays containing the x and y coordinates:

x = [x_1 x_2 ... x_n]
y = [y_1 y_2 ... y_n]
plot(x, y, 'o')

The last statement will draw a plot of the data points, marking them with the letter 'o'. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say

h = (x_n − x_1)/(10*n);  xx = x_1 : h : x_n;

use

yy = spline(x, y, xx)
plot(x, y, 'o', xx, yy)

The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a continuous curve.
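As a concrete illustration, here is a minimal sketch that applies these commands to the 1/x data from the earlier example (the data values are taken from that example; the variable names are just illustrative):

x  = 1:4;
y  = 1 ./ x;                     % data from the example f(x) = 1/x
h  = (x(end) - x(1)) / (10*length(x));
xx = x(1) : h : x(end);          % fine grid for plotting
yy = spline(x, y, xx);           % MATLAB's not-a-knot cubic spline
plot(x, y, 'o', xx, yy, '-')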
ERROR IN CUBIC SPLINE INTERPOLATION

Let an interval [a, b] be given, and then define

h = (b − a)/(n − 1),   x_j = a + (j − 1)h,   j = 1, ..., n

Suppose we want to approximate a given function f(x) on the interval [a, b] using cubic spline interpolation. Define

y_j = f(x_j),   j = 1, ..., n

Let s_n(x) denote the cubic spline interpolating this data and satisfying the not-a-knot boundary conditions. Then it can be shown that for a suitable constant c,

E_n ≡ max_{a≤x≤b} |f(x) − s_n(x)| ≤ c h⁴

The corresponding bound for natural cubic spline interpolation contains only a term of h² rather than h⁴; it does not converge to zero as rapidly.
EXAMPLE

Take f(x) = arctan x on [0, 5]. The following table gives values of the maximum error E_n for various values of n. The values of h are successively halved.

n     E_n        Ratio
7     7.09E−3
13    3.24E−4    21.9
25    3.06E−5    10.6
49    1.48E−6    20.7
97    9.04E−8    16.4
BEST APPROXIMATION

Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce

E(p) = max_{a≤x≤b} |f(x) − p(x)|

This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].

With an eye towards efficiency, we want to find the best possible approximation of a given degree n. With this in mind, introduce the following:

ρ_n(f) = min_{deg(p)≤n} E(p) = min_{deg(p)≤n} [ max_{a≤x≤b} |f(x) − p(x)| ]

The number ρ_n(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by m_n(x); thus E(m_n) = ρ_n(f).

Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(t_n), with t_n(x) the Taylor polynomial of degree n for e^x about x = 0, and E(m_n).

Maximum error in:
n    t_n(x)      m_n(x)
1    7.18E−1     2.79E−1
2    2.18E−1     4.50E−2
3    5.16E−2     5.53E−3
4    9.95E−3     5.47E−4
5    1.62E−3     4.52E−5
6    2.26E−4     3.21E−6
7    2.79E−5     2.00E−7
8    3.06E−6     1.11E−8
9    3.01E−7     5.52E−10
Consider graphically how we can improve on the Taylor polynomial

t_1(x) = 1 + x

as a uniform approximation to e^x on the interval [−1, 1]. The linear minimax approximation is

m_1(x) = 1.2643 + 1.1752x

Figure: the linear Taylor and minimax approximations to e^x on [−1, 1].
Figure: the error in the cubic Taylor approximation to e^x (maximum about 0.0516).
Figure: the error in the cubic minimax approximation to e^x (oscillating between −0.00553 and 0.00553).

Accuracy of the minimax approximation.

ρ_n(f) ≤ [(b − a)/2]^{n+1} / [(n + 1)! 2^n] · max_{a≤x≤b} |f^{(n+1)}(x)|

This error bound does not always become smaller with increasing n, but it will give a fairly accurate bound for many common functions f(x).

Example. Let f(x) = e^x for −1 ≤ x ≤ 1. Then

ρ_n(e^x) ≤ e / [(n + 1)! 2^n]     (*)

n   Bound (*)   ρ_n(f)
1   6.80E−1     2.79E−1
2   1.13E−1     4.50E−2
3   1.42E−2     5.53E−3
4   1.42E−3     5.47E−4
5   1.18E−4     4.52E−5
6   8.43E−6     3.21E−6
7   5.27E−7     2.00E−7
CHEBYSHEV POLYNOMIALS

Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function

T_n(x) = cos(n cos⁻¹ x),   −1 ≤ x ≤ 1     (1)

This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce

θ = cos⁻¹(x)   or   x = cos(θ),   0 ≤ θ ≤ π     (2)

Then

T_n(x) = cos(nθ)     (3)

Example.
n = 0:  T_0(x) = cos(0·θ) = 1
n = 1:  T_1(x) = cos(θ) = x
n = 2:  T_2(x) = cos(2θ) = 2cos²(θ) − 1 = 2x² − 1

Figure: graphs of T_0(x), T_1(x), T_2(x) on [−1, 1].
Figure: graphs of T_3(x), T_4(x) on [−1, 1].
The triple recursion relation. Recall the trigonometric addition formulas,

cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)

Let n ≥ 1, and apply these identities to get

T_{n+1}(x) = cos[(n + 1)θ] = cos(nθ + θ) = cos(nθ) cos(θ) − sin(nθ) sin(θ)
T_{n−1}(x) = cos[(n − 1)θ] = cos(nθ − θ) = cos(nθ) cos(θ) + sin(nθ) sin(θ)

Add these two equations, and then use (1) and (3) to obtain

T_{n+1}(x) + T_{n−1}(x) = 2 cos(nθ) cos(θ) = 2x T_n(x)
T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x),   n ≥ 1     (4)

This is called the triple recursion relation for the Chebyshev polynomials. It is often used in evaluating them, rather than using the explicit formula (1).
Example. Recall

T_0(x) = 1,   T_1(x) = x
T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x),   n ≥ 1

Let n = 2. Then

T_3(x) = 2x T_2(x) − T_1(x) = 2x(2x² − 1) − x = 4x³ − 3x

Let n = 3. Then

T_4(x) = 2x T_3(x) − T_2(x) = 2x(4x³ − 3x) − (2x² − 1) = 8x⁴ − 8x² + 1
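The recursion (4) translates directly into a short loop. The sketch below (the function name chebT is an illustrative choice, not something from the text) evaluates T_n at a vector of points x:

function T = chebT(n, x)
% Evaluate the Chebyshev polynomial T_n at the points in x
% using the triple recursion T_{k+1} = 2x*T_k - T_{k-1}.
Tprev = ones(size(x));            % T_0(x) = 1
if n == 0, T = Tprev; return, end
T = x;                            % T_1(x) = x
for k = 1:n-1
    Tnext = 2 .* x .* T - Tprev;  % T_{k+1}(x)
    Tprev = T;
    T     = Tnext;
end
end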
The minimum size property. Note that

|T_n(x)| ≤ 1,   −1 ≤ x ≤ 1     (5)

for all n ≥ 0. Also, note that

T_n(x) = 2^{n−1} x^n + lower degree terms,   n ≥ 1     (6)

This can be proven using the triple recursion relation and mathematical induction.

Introduce a modified version of T_n(x),

T̃_n(x) = (1/2^{n−1}) T_n(x) = x^n + lower degree terms     (7)

From (5) and (6),

|T̃_n(x)| ≤ 1/2^{n−1},   −1 ≤ x ≤ 1,   n ≥ 1     (8)

Example.

T̃_4(x) = (1/8)(8x⁴ − 8x² + 1) = x⁴ − x² + 1/8

A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial T̃_n(x) has size 1/2^{n−1} on −1 ≤ x ≤ 1, and this becomes smaller as the degree n increases. In comparison,

max_{−1≤x≤1} |x^n| = 1

Thus x^n is a monic polynomial whose size does not change with increasing n.

Theorem. Let n ≥ 1 be an integer, and consider all possible monic polynomials of degree n. Then the degree n monic polynomial with the smallest maximum on [−1, 1] is the modified Chebyshev polynomial T̃_n(x), and its maximum value on [−1, 1] is 1/2^{n−1}.

This result is used in devising applications of Chebyshev polynomials. We apply it to obtain an improved interpolation scheme.
A NEAR-MINIMAX APPROXIMATION METHOD

Let f(x) be continuous on [a, b] = [−1, 1]. Consider approximating f by an interpolatory polynomial of degree at most n = 3. Let x_0, x_1, x_2, x_3 be interpolation node points in [−1, 1]; let c_3(x) be of degree ≤ 3 and interpolate f(x) at {x_0, x_1, x_2, x_3}. The interpolation error is

f(x) − c_3(x) = [ω(x)/4!] f^{(4)}(ξ_x),   −1 ≤ x ≤ 1     (1)
ω(x) = (x − x_0)(x − x_1)(x − x_2)(x − x_3)     (2)

with ξ_x in [−1, 1]. We want to choose the nodes {x_0, x_1, x_2, x_3} so as to minimize the maximum value of |f(x) − c_3(x)| on [−1, 1].

From (1), the only general quantity, independent of f, is ω(x). Thus we choose {x_0, x_1, x_2, x_3} to minimize

max_{−1≤x≤1} |ω(x)|     (3)

Expand to get

ω(x) = x⁴ + lower degree terms

This is a monic polynomial of degree 4. From the theorem in the preceding section, the smallest possible value for (3) is obtained with

ω(x) = T̃_4(x) = T_4(x)/2³ = (1/8)(8x⁴ − 8x² + 1)     (4)

and the smallest value of (3) is 1/2³ in this case. Equation (4) defines implicitly the nodes {x_0, x_1, x_2, x_3}: they are the roots of T_4(x).
In our case this means solving

T_4(x) = cos(4θ) = 0,   x = cos(θ)

4θ = π/2, 3π/2, 5π/2, 7π/2, ...
θ = π/8, 3π/8, 5π/8, 7π/8, ...
x = cos(π/8), cos(3π/8), cos(5π/8), cos(7π/8), ...     (5)

using cos(−θ) = cos(θ). The first four values are distinct; the following ones are repetitive. For example,

cos(9π/8) = cos(7π/8)

The first four values are

{x_0, x_1, x_2, x_3} = {±0.382683, ±0.923880}     (6)
Example. Let f(x) = e^x on [−1, 1]. Use these nodes to produce the interpolating polynomial c_3(x) of degree ≤ 3. From the interpolation error formula and the bound of 1/2³ for |ω(x)| on [−1, 1], we have

max_{−1≤x≤1} |f(x) − c_3(x)| ≤ [1/2³ / 4!] max_{−1≤x≤1} e^x ≤ e/192 ≈ 0.014158

By direct calculation,

max_{−1≤x≤1} |e^x − c_3(x)| ≈ 0.00666

Interpolation data for f(x) = e^x:

i    x_i         f(x_i)      f[x_0, ..., x_i]
0    0.923880    2.5190442   2.5190442
1    0.382683    1.4662138   1.9453769
2   −0.382683    0.6820288   0.7047420
3   −0.923880    0.3969760   0.1751757

Figure: the error e^x − c_3(x) on [−1, 1], ranging between about −0.00624 and 0.00666.

For comparison, E(t_3) ≈ 0.0142 and ρ_3(e^x) ≈ 0.00553.
THE GENERAL CASE

Consider interpolating f(x) on [−1, 1] by a polynomial of degree ≤ n, with the interpolation nodes {x_0, ..., x_n} in [−1, 1]. Denote the interpolation polynomial by c_n(x). The interpolation error on [−1, 1] is given by

f(x) − c_n(x) = [ω(x)/(n + 1)!] f^{(n+1)}(ξ_x)     (7)
ω(x) = (x − x_0) ⋯ (x − x_n)

with ξ_x an unknown point in [−1, 1]. In order to minimize the interpolation error, we seek to minimize

max_{−1≤x≤1} |ω(x)|     (8)

The polynomial being minimized is monic of degree n + 1,

ω(x) = x^{n+1} + lower degree terms

From the theorem of the preceding section, this minimum is attained by the monic polynomial

T̃_{n+1}(x) = (1/2^n) T_{n+1}(x)

Thus the interpolation nodes are the zeros of T_{n+1}(x); and by the procedure that led to (5), they are given by

x_j = cos( (2j + 1)π / (2n + 2) ),   j = 0, 1, ..., n     (9)

The near-minimax approximation c_n(x) of degree n is obtained by interpolating to f(x) at these n + 1 nodes on [−1, 1]. The polynomial c_n(x) is sometimes called a Chebyshev approximation.
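A small sketch of this construction in MATLAB, using polyfit/polyval as a convenient way to build and evaluate the interpolant (the variable names are illustrative):

% Near-minimax (Chebyshev) interpolation of f(x) = exp(x) on [-1,1].
f  = @(x) exp(x);
n  = 3;
j  = 0:n;
xj = cos((2*j + 1)*pi/(2*n + 2));   % zeros of T_{n+1}, formula (9)
p  = polyfit(xj, f(xj), n);         % interpolating polynomial c_n(x)
xx = linspace(-1, 1, 1001);
err = max(abs(f(xx) - polyval(p, xx)))
% err comes out near 0.00666, matching the table entry for n = 3.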
Example. Let f(x) = e^x. The following table contains the maximum errors in c_n(x) on [−1, 1] for varying n. For comparison, we also include the corresponding minimax errors. These figures illustrate that for practical purposes, c_n(x) is a satisfactory replacement for the minimax approximation m_n(x).

n   max |e^x − c_n(x)|   ρ_n(e^x)
1   3.72E−1              2.79E−1
2   5.65E−2              4.50E−2
3   6.66E−3              5.53E−3
4   6.40E−4              5.47E−4
5   5.18E−5              4.52E−5
6   3.80E−6              3.21E−6
THEORETICAL INTERPOLATION ERROR

For the error

f(x) − c_n(x) = [ω(x)/(n + 1)!] f^{(n+1)}(ξ_x)

we have

max_{−1≤x≤1} |f(x) − c_n(x)| ≤ [max_{−1≤x≤1} |ω(x)| / (n + 1)!] · max_{−1≤ξ≤1} |f^{(n+1)}(ξ)|

From the theorem of the preceding section,

max_{−1≤x≤1} |T̃_{n+1}(x)| = max_{−1≤x≤1} |ω(x)| = 1/2^n

in this case. Thus

max_{−1≤x≤1} |f(x) − c_n(x)| ≤ [1 / ((n + 1)! 2^n)] max_{−1≤ξ≤1} |f^{(n+1)}(ξ)|
OTHER INTERVALS

Consider approximating f(x) on the finite interval [a, b]. Introduce the linear change of variables

x = ½[(1 − t)a + (1 + t)b]     (10)
t = [2/(b − a)] (x − (b + a)/2)     (11)

Introduce

F(t) = f(½[(1 − t)a + (1 + t)b]),   −1 ≤ t ≤ 1

The function F(t) on [−1, 1] is equivalent to f(x) on [a, b], and we can move between them via (10)-(11). We can now proceed to approximate f(x) on [a, b] by instead approximating F(t) on [−1, 1].

Example. Approximating f(x) = cos x on [0, π/2] is equivalent to approximating

F(t) = cos( π(1 + t)/4 ),   −1 ≤ t ≤ 1
NUMERICAL DIFFERENTIATION

There are two major reasons for considering numerical approximations of the differentiation process.

1. Approximation of derivatives in ordinary differential equations and partial differential equations. This is done in order to reduce the differential equation to a form that can be solved more easily than the original differential equation.

2. Forming the derivative of a function f(x) which is known only as empirical data {(x_i, y_i) | i = 1, ..., m}. The data generally is known only approximately, so that y_i ≈ f(x_i), i = 1, ..., m.

Recall the definition

f'(x) = lim_{h→0} [f(x + h) − f(x)] / h

This justifies using

f'(x) ≈ [f(x + h) − f(x)] / h ≡ D_h f(x)     (1)

for small values of h. The approximation D_h f(x) is called a numerical derivative of f(x) with stepsize h.

Example. Use D_h f(x) to approximate the derivative of f(x) = cos(x) at x = π/6. In the table, the error is almost halved when h is halved.

h          D_h f       Error     Ratio
0.1       −0.54243     0.04243
0.05      −0.52144     0.02144   1.98
0.025     −0.51077     0.01077   1.99
0.0125    −0.50540     0.00540   1.99
0.00625   −0.50270     0.00270   2.00
0.003125  −0.50135     0.00135   2.00
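A quick way to reproduce this table (a sketch; the loop and variable names are illustrative):

% Forward-difference approximation of f'(x) for f(x) = cos(x) at x = pi/6.
f      = @(x) cos(x);
fprime = -sin(pi/6);                 % exact derivative, -0.5
x      = pi/6;
h      = 0.1;
for k = 1:6
    Dhf = (f(x + h) - f(x)) / h;     % D_h f(x), formula (1)
    fprintf('%10.6f  %10.5f  %10.5f\n', h, Dhf, fprime - Dhf);
    h = h / 2;                       % halve h; the error roughly halves too
end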
Error behaviour. Using Taylor's theorem,

f(x + h) = f(x) + h f'(x) + ½h² f''(c)

with c between x and x + h. Evaluating (1),

D_h f(x) = (1/h){[f(x) + h f'(x) + ½h² f''(c)] − f(x)} = f'(x) + ½h f''(c)

f'(x) − D_h f(x) = −½h f''(c)     (2)

Using a higher order Taylor expansion,

f'(x) − D_h f(x) = −½h f''(x) − (1/6)h² f'''(c),
f'(x) − D_h f(x) ≈ −½h f''(x)     (3)

for small values of h.

For f(x) = cos x,

f'(x) − D_h f(x) = ½h cos c,   c ∈ [π/6, π/6 + h]

In the preceding table, check the accuracy of the approximation (3) with x = π/6.
The formula (1),

f'(x) ≈ [f(x + h) − f(x)] / h ≡ D_h f(x)

is called a forward difference formula for approximating f'(x). In contrast, the approximation

f'(x) ≈ [f(x) − f(x − h)] / h,   h > 0     (4)

is called a backward difference formula for approximating f'(x). A similar derivation leads to

f'(x) − [f(x) − f(x − h)] / h = (h/2) f''(c)     (5)

for some c between x and x − h. The accuracy of the backward difference formula (4) is essentially the same as that of the forward difference formula (1). The motivation for this formula is in applications to solving differential equations.
DIFFERENTIATION USING INTERPOLATION

Let P_n(x) be the degree ≤ n polynomial that interpolates f(x) at n + 1 node points x_0, x_1, ..., x_n. To calculate f'(x) at some point x = t, use

f'(t) ≈ P'_n(t)     (6)

Many different formulas can be obtained by varying n and by varying the placement of the nodes x_0, ..., x_n relative to the point t of interest.

Example. Take n = 2, and use evenly spaced nodes x_0, x_1 = x_0 + h, x_2 = x_1 + h. Then

P_2(x) = f(x_0)L_0(x) + f(x_1)L_1(x) + f(x_2)L_2(x)
P'_2(x) = f(x_0)L'_0(x) + f(x_1)L'_1(x) + f(x_2)L'_2(x)

with

L_0(x) = (x − x_1)(x − x_2) / [(x_0 − x_1)(x_0 − x_2)]
L_1(x) = (x − x_0)(x − x_2) / [(x_1 − x_0)(x_1 − x_2)]
L_2(x) = (x − x_0)(x − x_1) / [(x_2 − x_0)(x_2 − x_1)]

Forming the derivatives of these Lagrange basis functions and evaluating them at x = x_1,

f'(x_1) ≈ P'_2(x_1) = [f(x_1 + h) − f(x_1 − h)] / (2h) ≡ D_h f(x_1)     (7)

For the error,

f'(x_1) − [f(x_1 + h) − f(x_1 − h)] / (2h) = −(h²/6) f'''(c_2)     (8)

with x_1 − h ≤ c_2 ≤ x_1 + h.
A proof of this begins with the interpolation error formula

f(x) − P_2(x) = Ψ_2(x) f[x_0, x_1, x_2, x]
Ψ_2(x) = (x − x_0)(x − x_1)(x − x_2)

Differentiate to get

f'(x) − P'_2(x) = Ψ_2(x) (d/dx) f[x_0, x_1, x_2, x] + Ψ'_2(x) f[x_0, x_1, x_2, x]

With properties of the divided difference, we can show

f'(x) − P'_2(x) = (1/24) Ψ_2(x) f^{(4)}(c_{1,x}) + (1/6) Ψ'_2(x) f^{(3)}(c_{2,x})

with c_{1,x} and c_{2,x} between the smallest and largest of the values {x_0, x_1, x_2, x}. Letting x = x_1 and noting that Ψ_2(x_1) = 0, we obtain (8).
Example. Take f(x) = cos(x) and x_1 = π/6. Then (7) is illustrated as follows.

h         D_h f          Error          Ratio
0.1      −0.49916708     −0.0008329
0.05     −0.49979169     −0.0002083     4.00
0.025    −0.49994792     −0.00005208    4.00
0.0125   −0.49998698     −0.00001302    4.00
0.00625  −0.49999674     −0.000003255   4.00

Note the smaller errors and faster convergence as compared to the forward difference formula (1).
UNDETERMINED COEFFICIENTS

Derive an approximation for f''(x) at x = t. Write

f''(t) ≈ D_h^(2) f(t) ≡ A f(t + h) + B f(t) + C f(t − h)     (9)

with A, B, and C unspecified constants. Use Taylor polynomial approximations

f(t − h) ≈ f(t) − h f'(t) + (h²/2) f''(t) − (h³/6) f'''(t) + (h⁴/24) f^(4)(t)
f(t + h) ≈ f(t) + h f'(t) + (h²/2) f''(t) + (h³/6) f'''(t) + (h⁴/24) f^(4)(t)     (10)

Substitute into (9) and rearrange:

D_h^(2) f(t) ≈ (A + B + C) f(t) + h(A − C) f'(t) + (h²/2)(A + C) f''(t)
              + (h³/6)(A − C) f'''(t) + (h⁴/24)(A + C) f^(4)(t)     (11)

To have

D_h^(2) f(t) ≈ f''(t)     (12)

for arbitrary functions f(x), require

A + B + C = 0 :        coefficient of f(t)
h(A − C) = 0 :         coefficient of f'(t)
(h²/2)(A + C) = 1 :    coefficient of f''(t)

Solution:

A = C = 1/h²,   B = −2/h²     (13)

This determines

D_h^(2) f(t) = [f(t + h) − 2f(t) + f(t − h)] / h²     (14)

For the error, substitute (13) into (11):

D_h^(2) f(t) ≈ f''(t) + (h²/12) f^(4)(t)

Thus

f''(t) − [f(t + h) − 2f(t) + f(t − h)] / h² ≈ −(h²/12) f^(4)(t)     (15)

Example. Let f(x) = cos(x), t = π/6; use (14) to calculate f''(t) = −cos(π/6).

h        D_h^(2) f       Error       Ratio
0.5     −0.84813289      1.789E−2
0.25    −0.86152424      4.501E−3    3.97
0.125   −0.86489835      1.127E−3    3.99
0.0625  −0.86574353      2.819E−4    4.00
0.03125 −0.86595493      7.048E−5    4.00
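The central second-difference (14) is easy to test numerically; a minimal sketch (variable names are illustrative):

% Central second difference D_h^(2) f(t) from (14), f(x) = cos(x), t = pi/6.
f      = @(x) cos(x);
t      = pi/6;
fexact = -cos(t);                           % f''(t)
h = 0.5;
for k = 1:5
    D2 = (f(t + h) - 2*f(t) + f(t - h)) / h^2;
    fprintf('%9.5f  %12.8f  %10.3e\n', h, D2, abs(fexact - D2));
    h = h / 2;                              % error decreases like h^2 (ratio 4)
end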
EFFECTS OF ERROR IN FUNCTION VALUES

Recall

D_h^(2) f(x_1) = [f(x_2) − 2f(x_1) + f(x_0)] / h² ≈ f''(x_1)

with x_2 = x_1 + h, x_0 = x_1 − h. Assume the actual function values used in the computation contain data error, and denote these values by f̂_0, f̂_1, and f̂_2. Introduce the data errors:

ε_i = f(x_i) − f̂_i,   i = 0, 1, 2     (16)

The actual quantity calculated is

D̂_h^(2) f(x_1) = [f̂_2 − 2f̂_1 + f̂_0] / h²     (17)

For the error in this quantity, replace f̂_j by f(x_j) − ε_j, j = 0, 1, 2, to obtain the following:

f''(x_1) − D̂_h^(2) f(x_1)
  = f''(x_1) − { [f(x_2) − ε_2] − 2[f(x_1) − ε_1] + [f(x_0) − ε_0] } / h²
  = [ f''(x_1) − (f(x_2) − 2f(x_1) + f(x_0))/h² ] + (ε_2 − 2ε_1 + ε_0)/h²
  ≈ −(1/12)h² f^(4)(x_1) + (ε_2 − 2ε_1 + ε_0)/h²     (18)

The last line uses (15).
The errors {ε_0, ε_1, ε_2} are generally random in some interval [−δ, δ]. If {f̂_0, f̂_1, f̂_2} are experimental data, then δ is a bound on the experimental error. If {f̂_j} are obtained from computing f(x) in a computer, then the errors ε_j are the combination of rounding or chopping errors and δ is a bound on these errors.

In either case, (18) yields the approximate inequality

|f''(x_1) − D̂_h^(2) f(x_1)| ≤ (h²/12)|f^(4)(x_1)| + 4δ/h²     (19)

This suggests that as h → 0, the error will eventually increase, because of the final term 4δ/h².
Example. Calculate D̂_h^(2) f(x_1) for f(x) = cos(x) at x_1 = π/6. To show the effect of rounding errors, the values f̂_i are obtained by rounding f(x_i) to six significant digits; and the errors satisfy

|ε_i| ≤ 5.0 × 10⁻⁷ ≡ δ,   i = 0, 1, 2

Other than these rounding errors, the formula D̂_h^(2) f(x_1) is calculated exactly. In this example, the bound (19) becomes

|f''(x_1) − D̂_h^(2) f(x_1)| ≤ (h²/12) cos(π/6) + (4/h²)(5 × 10⁻⁷)
                             ≈ 0.0722 h² + (2 × 10⁻⁶)/h² ≡ E(h)

For h = 0.125, the bound E(h) ≈ 0.00126, which is not too far off from the actual error given in the table.

h            D̂_h^(2) f(x_1)   Error
0.5         −0.848128          0.017897
0.25        −0.861504          0.004521
0.125       −0.864832          0.001193
0.0625      −0.865536          0.000489
0.03125     −0.865280          0.000745
0.015625    −0.860160          0.005865
0.0078125   −0.851968          0.014057
0.00390625  −0.786432          0.079593

The bound E(h) indicates that there is a smallest value of h, call it h*, below which the error bound will begin to increase. To find it, set E'(h) = 0, with its root being h*. This leads to h* ≈ 0.0726, which is consistent with the behavior of the errors in the table.
LINEAR SYSTEMS

Consider the following example of a linear system:

 x_1 + 2x_2 + 3x_3 = 5
−x_1 + x_3 = 3
 3x_1 + x_2 + 3x_3 = 3

Its unique solution is

x_1 = −1,   x_2 = 0,   x_3 = 2

In general we want to solve n equations in n unknowns. For this, we need some simplifying notation. In particular we introduce arrays. We can think of these as means for storing information about the linear system in a computer. In the above case, we introduce

A =
  1  2  3
 −1  0  1
  3  1  3

b = [5, 3, 3]^T,   x = [−1, 0, 2]^T

These arrays completely specify the linear system and its solution. We also know that we can give meaning to multiplication and addition of these quantities, calling them matrices and vectors. The linear system is then written as

Ax = b

with Ax denoting a matrix-vector multiplication.

The general system is written as

a_{1,1} x_1 + ⋯ + a_{1,n} x_n = b_1
⋮
a_{n,1} x_1 + ⋯ + a_{n,n} x_n = b_n

This is a system of n linear equations in the n unknowns x_1, ..., x_n. It can be written in matrix-vector notation as

Ax = b

A =
  a_{1,1} ⋯ a_{1,n}
     ⋮         ⋮
  a_{n,1} ⋯ a_{n,n}

b = [b_1, ..., b_n]^T,   x = [x_1, ..., x_n]^T
A TRIDIAGONAL SYSTEM

Consider the tridiagonal linear system

 3x_1 − x_2 = 2
−x_1 + 3x_2 − x_3 = 1
⋮
−x_{n−2} + 3x_{n−1} − x_n = 1
−x_{n−1} + 3x_n = 2

The solution is

x_1 = ⋯ = x_n = 1

This has the associated arrays

A =
  3 −1   0   ⋯   0
 −1  3  −1
      ⋱  ⋱   ⋱
        −1   3  −1
  0   ⋯     −1   3

b = [2, 1, ..., 1, 2]^T,   x = [1, 1, ..., 1]^T
SOLVING LINEAR SYSTEMS

Linear systems Ax = b occur widely in applied mathematics. They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis of some other problem. As examples of the latter, we have the construction of spline functions, the numerical solution of systems of nonlinear equations, ordinary and partial differential equations, integral equations, and the solution of optimization problems.

There are many ways of classifying linear systems.

Size: small, moderate, and large. This of course varies with the machine you are using. For a matrix A of order n × n, it takes 8n² bytes to store it in double precision. Thus a matrix of order 8000 needs around 512 MB of storage. That would be too large for most present-day PCs if the matrix were to be stored in the computer's memory, although one can easily expand a PC to contain much more memory than this.

Sparse vs. dense: Many linear systems have a matrix A in which almost all the elements are zero. These matrices are said to be sparse. For example, it is quite common to work with tridiagonal matrices

A =
  a_1  c_1   0   ⋯   0
  b_2  a_2  c_2
  0    b_3  a_3  c_3
  ⋮          ⋱    ⋱
  0     ⋯        b_n  a_n

in which the order is 10⁴ or much more. For such matrices, it does not make sense to store the zero elements; and the sparsity should be taken into account when solving the linear system Ax = b. Also, the sparsity need not be as regular as in this example.
BASIC DEFINITIONS AND THEORY

A homogeneous linear system Ax = b is one for which the right-hand constants are all zero. Using vector notation, we say b is the zero vector for a homogeneous system. Otherwise the linear system is called non-homogeneous.

Theorem. The following are equivalent statements.
(1) For each b, there is exactly one solution x.
(2) For each b, there is a solution x.
(3) The homogeneous system Ax = 0 has only the solution x = 0.
(4) det(A) ≠ 0.
(5) The inverse matrix A⁻¹ exists.
EXAMPLE. Consider again the tridiagonal system

 3x_1 − x_2 = 2
−x_1 + 3x_2 − x_3 = 1
⋮
−x_{n−2} + 3x_{n−1} − x_n = 1
−x_{n−1} + 3x_n = 2

The homogeneous version is simply

 3x_1 − x_2 = 0
−x_1 + 3x_2 − x_3 = 0
⋮
−x_{n−2} + 3x_{n−1} − x_n = 0
−x_{n−1} + 3x_n = 0

Assume x ≠ 0, and therefore that x has nonzero components. Let x_k denote a component of maximum size:

|x_k| = max_{1≤j≤n} |x_j|

Consider now equation k, and assume 1 < k < n. Then

−x_{k−1} + 3x_k − x_{k+1} = 0
x_k = (1/3)(x_{k−1} + x_{k+1})
|x_k| ≤ (1/3)(|x_{k−1}| + |x_{k+1}|) ≤ (1/3)(|x_k| + |x_k|) = (2/3)|x_k|

This implies x_k = 0, and therefore x = 0. A similar proof is valid if k = 1 or k = n, using the first or the last equation, respectively.

Thus the original tridiagonal linear system Ax = b has a unique solution x for each right side b.
METHODS OF SOLUTION

There are two general categories of numerical methods for solving Ax = b.

Direct methods: These are methods with a finite number of steps; and they end with the exact solution x, provided that all arithmetic operations are exact. The most used of these methods is Gaussian elimination, which we begin with. There are other direct methods, but we do not study them here.

Iteration methods: These are used in solving all types of linear systems, but they are most commonly used with large sparse systems, especially those produced by discretizing partial differential equations. This is an extremely active area of research.
MATRICES in MATLAB

Consider the matrices

A =
  1 2 3
  2 2 3
  3 3 3

b = [1, 1, 1]^T

In MATLAB, A can be created as follows:

A = [1 2 3; 2 2 3; 3 3 3];
A = [1, 2, 3; 2, 2, 3; 3, 3, 3];
A = [1 2 3
     2 2 3
     3 3 3];

Commas can be used to replace the spaces. The vector b can be created by

b = ones(3, 1);

Consider setting up the matrices for the system Ax = b with

A_{i,j} = max{i, j},   b_i = 1,   1 ≤ i, j ≤ n

One way to set up the matrix A is as follows:

A = zeros(n, n);
for i = 1:n
    A(i, 1:i) = i;
    A(i, i+1:n) = i+1:n;
end

and set up the vector b by

b = ones(n, 1);
MATRIX ADDITION

Let A = [a_{i,j}] and B = [b_{i,j}] be matrices of order m × n. Then

C = A + B

is another matrix of order m × n, with

c_{i,j} = a_{i,j} + b_{i,j}

EXAMPLE.

  1 2      1 −1     2 1
  3 4  +  −1  1  =  2 5
  5 6      1 −1     6 5
MULTIPLICATION BY A CONSTANT

c ·
  a_{1,1} ⋯ a_{1,n}        c·a_{1,1} ⋯ c·a_{1,n}
     ⋮         ⋮      =        ⋮           ⋮
  a_{m,1} ⋯ a_{m,n}        c·a_{m,1} ⋯ c·a_{m,n}

EXAMPLE.

     1 2      5 10
5 ·  3 4  =  15 20
     5 6     25 30

(−1) · [a b; c d] = [−a −b; −c −d]
THE ZERO MATRIX 0

Define the zero matrix of order m × n as the matrix of that order having all zero entries. It is sometimes written as 0_{m×n}, but more commonly as simply 0. Then for any matrix A of order m × n,

A + 0 = 0 + A = A

The zero matrix 0_{m×n} acts in the same role as does the number zero when doing arithmetic with real and complex numbers.

EXAMPLE.

[1 2; 3 4] + [0 0; 0 0] = [1 2; 3 4]

We denote by −A the solution of the equation

A + B = 0

It is the matrix obtained by taking the negative of all of the entries in A. For example,

[a b; c d] + [−a −b; −c −d] = [0 0; 0 0]

−[a b; c d] = [−a −b; −c −d] = (−1)·[a b; c d]

−[a_{1,1} a_{1,2}; a_{2,1} a_{2,2}] = [−a_{1,1} −a_{1,2}; −a_{2,1} −a_{2,2}]
MATRIX MULTIPLICATION

Let A = [a_{i,j}] have order m × n and B = [b_{i,j}] have order n × p. Then

C = AB

is a matrix of order m × p and

c_{i,j} = A_{i,*} B_{*,j} = a_{i,1} b_{1,j} + a_{i,2} b_{2,j} + ⋯ + a_{i,n} b_{n,j}

or equivalently

c_{i,j} = [a_{i,1} a_{i,2} ⋯ a_{i,n}] · [b_{1,j}; b_{2,j}; ⋮; b_{n,j}]
        = a_{i,1} b_{1,j} + a_{i,2} b_{2,j} + ⋯ + a_{i,n} b_{n,j}

EXAMPLES.

                   1 2
  1 2 3            3 4      22 28
  4 5 6    ·       5 6  =   49 64

  1 2                        9 12 15
  3 4   ·   1 2 3       =   19 26 33
  5 6       4 5 6           29 40 51
  a_{1,1} ⋯ a_{1,n}       x_1       a_{1,1} x_1 + ⋯ + a_{1,n} x_n
     ⋮         ⋮      ·    ⋮    =              ⋮
  a_{n,1} ⋯ a_{n,n}       x_n       a_{n,1} x_1 + ⋯ + a_{n,n} x_n

Thus we write the linear system

a_{1,1} x_1 + ⋯ + a_{1,n} x_n = b_1
⋮
a_{n,1} x_1 + ⋯ + a_{n,n} x_n = b_n

as

Ax = b
THE IDENTITY MATRIX I

For a given integer n ≥ 1, define I_n to be the matrix of order n × n with 1's in all diagonal positions and zeros elsewhere:

I_n =
  1 0 ⋯ 0
  0 1    0
  ⋮   ⋱  ⋮
  0 ⋯    1

More commonly it is denoted by simply I.

Let A be a matrix of order m × n. Then

A I_n = A,   I_m A = A

The identity matrix I acts in the same role as does the number 1 when doing arithmetic with real and complex numbers.
THE MATRIX INVERSE

Let A be a matrix of order n × n for some n ≥ 1. We say a matrix B is an inverse for A if

AB = BA = I

It can be shown that if an inverse exists for A, then it is unique.

EXAMPLES. If ad − bc ≠ 0, then

[a b; c d]⁻¹ = 1/(ad − bc) · [d −b; −c a]

[1 2; 2 2]⁻¹ = [−1 1; 1 −1/2]

  1    1/2  1/3  ⁻¹       9   −36    30
  1/2  1/3  1/4      =  −36   192  −180
  1/3  1/4  1/5          30  −180   180

Recall the earlier theorem on the solution of linear systems Ax = b with A a square matrix.

Theorem. The following are equivalent statements.
1. For each b, there is exactly one solution x.
2. For each b, there is a solution x.
3. The homogeneous system Ax = 0 has only the solution x = 0.
4. det(A) ≠ 0.
5. A⁻¹ exists.
EXAMPLE

det
  1 2 3
  4 5 6   = 0
  7 8 9

Therefore, the linear system

  1 2 3     x_1     b_1
  4 5 6  ·  x_2  =  b_2
  7 8 9     x_3     b_3

is not always solvable, the coefficient matrix does not have an inverse, and the homogeneous system Ax = 0 has a solution other than the zero vector, namely

  1 2 3      1       0
  4 5 6  ·  −2   =   0
  7 8 9      1       0
PARTITIONED MATRICES

Matrices can be built up from smaller matrices; or conversely, we can decompose a large matrix into a matrix of smaller matrices. For example, consider

A =
  1 2 0
  2 1 1   =  [B c; d e]
  0 1 5

B = [1 2; 2 1],   c = [0; 1],   d = [0 1],   e = 5

Matlab allows you to build up larger matrices out of smaller matrices in exactly this manner; and smaller matrices can be defined as portions of larger matrices.

We will often write an n × n square matrix in terms of its columns:

A = [A_{*,1}, ..., A_{*,n}]

For the n × n identity matrix I, we write

I = [e_1, ..., e_n]

with e_j denoting a column vector with a 1 in position j and zeros elsewhere.
ARITHMETIC OF PARTITIONED MATRICES

As with matrices, we can do addition and multiplication with partitioned matrices provided the individual constituent parts have the proper orders.

For example, let A, B, C, D be n × n matrices. Then

[I A; B I] · [I C; D I] = [I + AD, C + A; B + D, I + BC]

Let A be n × n and x be a column vector of length n. Then

Ax = [A_{*,1}, ..., A_{*,n}] · [x_1; ⋮; x_n] = x_1 A_{*,1} + ⋯ + x_n A_{*,n}

Compare this to

  a_{1,1} ⋯ a_{1,n}       x_1       a_{1,1} x_1 + ⋯ + a_{1,n} x_n
     ⋮         ⋮      ·    ⋮    =              ⋮
  a_{n,1} ⋯ a_{n,n}       x_n       a_{n,1} x_1 + ⋯ + a_{n,n} x_n
PARTITIONED MATRICES IN MATLAB

In MATLAB, matrices can be constructed using smaller matrices. For example, let

A = [1, 2; 3, 4];  x = [5, 6];  y = [7, 8]';

Then

B = [A, y; x, 9];

forms the matrix

B =
  1 2 7
  3 4 8
  5 6 9
SOLVING LINEAR SYSTEMS

We want to solve the linear system

a_{1,1} x_1 + ⋯ + a_{1,n} x_n = b_1
⋮
a_{n,1} x_1 + ⋯ + a_{n,n} x_n = b_n

This will be done by the method used in beginning algebra, by successively eliminating unknowns from equations, until eventually we have only one equation in one unknown. This process is known as Gaussian elimination. To put it onto a computer, however, we must be more precise than is generally the case in high school algebra.

We begin with the linear system

 3x_1 − 2x_2 − x_3 = 0      (E1)
 6x_1 − 2x_2 + 2x_3 = 6     (E2)
−9x_1 + 7x_2 + x_3 = −1     (E3)

[1] Eliminate x_1 from equations (E2) and (E3). Subtract 2 times (E1) from (E2); and subtract −3 times (E1) from (E3). This yields

3x_1 − 2x_2 − x_3 = 0       (E1)
       2x_2 + 4x_3 = 6      (E2)
        x_2 − 2x_3 = −1     (E3)

[2] Eliminate x_2 from equation (E3). Subtract 1/2 times (E2) from (E3). This yields

3x_1 − 2x_2 − x_3 = 0       (E1)
       2x_2 + 4x_3 = 6      (E2)
             −4x_3 = −4     (E3)

Using back substitution, solve for x_3, x_2, and x_1, obtaining

x_3 = x_2 = x_1 = 1
In the computer, we work on the arrays rather than on the equations. To illustrate this, we repeat the preceding example using array notation. The original system is Ax = b, with

A =
  3 −2 −1
  6 −2  2
 −9  7  1

b = [0, 6, −1]^T

We often write these in combined form as an augmented matrix:

[A | b] =
  3 −2 −1 |  0
  6 −2  2 |  6
 −9  7  1 | −1

In step 1, we eliminate x_1 from equations 2 and 3. We multiply row 1 by 2 and subtract it from row 2; and we multiply row 1 by −3 and subtract it from row 3. This yields

  3 −2 −1 |  0
  0  2  4 |  6
  0  1 −2 | −1

In step 2, we eliminate x_2 from equation 3. We multiply row 2 by 1/2 and subtract from row 3. This yields

  3 −2 −1 |  0
  0  2  4 |  6
  0  0 −4 | −4

Then we proceed with back substitution as previously.
For the general case, we reduce

[A | b] =
  a^(1)_{1,1} ⋯ a^(1)_{1,n} | b^(1)_1
       ⋮            ⋮       |    ⋮
  a^(1)_{n,1} ⋯ a^(1)_{n,n} | b^(1)_n

in n − 1 steps to the form

  a^(1)_{1,1} ⋯ a^(1)_{1,n} | b^(1)_1
  0    ⋱            ⋮       |    ⋮
  0  ⋯  0     a^(n)_{n,n}   | b^(n)_n

More simply, and introducing new notation, this is equivalent to the matrix-vector equation Ux = g:

  u_{1,1} ⋯ u_{1,n}       x_1       g_1
  0    ⋱       ⋮      ·    ⋮    =    ⋮
  0  ⋯ 0   u_{n,n}        x_n       g_n

This is the linear system

u_{1,1} x_1 + u_{1,2} x_2 + ⋯ + u_{1,n−1} x_{n−1} + u_{1,n} x_n = g_1
⋮
u_{n−1,n−1} x_{n−1} + u_{n−1,n} x_n = g_{n−1}
u_{n,n} x_n = g_n

We solve for x_n, then x_{n−1}, and backwards to x_1. This process is called back substitution:

x_n = g_n / u_{n,n}
x_k = [ g_k − (u_{k,k+1} x_{k+1} + ⋯ + u_{k,n} x_n) ] / u_{k,k}

for k = n − 1, ..., 1. What we have done here is simply a more carefully defined and methodical version of what you have done in high school algebra.
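A minimal MATLAB sketch of back substitution (assuming U is upper triangular with nonzero diagonal entries and g is a column vector; the function name is illustrative):

function x = backsub(U, g)
% Solve Ux = g, U upper triangular, by back substitution.
n = length(g);
x = zeros(n, 1);
x(n) = g(n) / U(n, n);
for k = n-1:-1:1
    x(k) = (g(k) - U(k, k+1:n) * x(k+1:n)) / U(k, k);
end
end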
How do we carry out the conversion of the original augmented matrix to this upper triangular form? To help us keep track of the steps of this process, we will denote the initial system by

[A^(1) | b^(1)] =
  a^(1)_{1,1} ⋯ a^(1)_{1,n} | b^(1)_1
       ⋮            ⋮       |    ⋮
  a^(1)_{n,1} ⋯ a^(1)_{n,n} | b^(1)_n

Initially we will make the assumption that every pivot element will be nonzero; later we remove this assumption.

Step 1. We will eliminate x_1 from equations 2 thru n. Begin by defining the multipliers

m_{i,1} = a^(1)_{i,1} / a^(1)_{1,1},   i = 2, ..., n

Here we are assuming the pivot element a^(1)_{1,1} ≠ 0. Then in succession, multiply m_{i,1} times row 1 (called the pivot row) and subtract the result from row i. This yields new matrix elements

a^(2)_{i,j} = a^(1)_{i,j} − m_{i,1} a^(1)_{1,j},   j = 2, ..., n
b^(2)_i = b^(1)_i − m_{i,1} b^(1)_1

for i = 2, ..., n.

Note that the index j does not include j = 1. The reason is that with the definition of the multiplier m_{i,1}, it is automatic that

a^(2)_{i,1} = a^(1)_{i,1} − m_{i,1} a^(1)_{1,1} = 0,   i = 2, ..., n

The augmented matrix now is

[A^(2) | b^(2)] =
  a^(1)_{1,1}  a^(1)_{1,2} ⋯ a^(1)_{1,n} | b^(1)_1
  0            a^(2)_{2,2} ⋯ a^(2)_{2,n} | b^(2)_2
  ⋮                ⋮              ⋮      |    ⋮
  0            a^(2)_{n,2} ⋯ a^(2)_{n,n} | b^(2)_n
Step k: Assume that for i = 1, ..., k − 1 the unknown x_i has been eliminated from equations i + 1 thru n. We have the augmented matrix

[A^(k) | b^(k)] =
  a^(1)_{1,1}  a^(1)_{1,2}  ⋯             ⋯  a^(1)_{1,n} | b^(1)_1
  0            a^(2)_{2,2}  ⋯             ⋯  a^(2)_{2,n} | b^(2)_2
  ⋮                 ⋱                            ⋮       |    ⋮
  0        ⋯   0   a^(k)_{k,k}  ⋯            a^(k)_{k,n} | b^(k)_k
  ⋮                     ⋮                        ⋮       |    ⋮
  0        ⋯   0   a^(k)_{n,k}  ⋯            a^(k)_{n,n} | b^(k)_n

We want to eliminate unknown x_k from equations k + 1 thru n. Begin by defining the multipliers

m_{i,k} = a^(k)_{i,k} / a^(k)_{k,k},   i = k + 1, ..., n

The pivot element is a^(k)_{k,k}, and we assume it is nonzero. Using these multipliers, we eliminate x_k from equations k + 1 thru n. Multiply m_{i,k} times row k (the pivot row) and subtract from row i, for i = k + 1 thru n:

a^(k+1)_{i,j} = a^(k)_{i,j} − m_{i,k} a^(k)_{k,j},   j = k + 1, ..., n
b^(k+1)_i = b^(k)_i − m_{i,k} b^(k)_k

for i = k + 1, ..., n. This yields the augmented matrix [A^(k+1) | b^(k+1)], in which column k now has zeros below the diagonal.

Doing this for k = 1, 2, ..., n − 1 leads to the upper triangular system with the augmented matrix

  a^(1)_{1,1} ⋯ a^(1)_{1,n} | b^(1)_1
  0    ⋱            ⋮       |    ⋮
  0  ⋯  0     a^(n)_{n,n}   | b^(n)_n

We later remove the assumption

a^(k)_{k,k} ≠ 0,   k = 1, 2, ..., n
QUESTIONS

How do we remove the assumption on the pivot elements?
How many operations are involved in this procedure?
How much error is there in the computed solution due to rounding errors in the calculations?
How does the machine architecture affect the implementation of this algorithm?
PARTIAL PIVOTING

Recall the reduction of the augmented matrix [A^(1) | b^(1)] to [A^(2) | b^(2)], in which x_1 is eliminated from equations 2 thru n.

What if a^(1)_{1,1} = 0? In that case we look for an equation in which x_1 is present. To do this in such a way as to avoid zero pivots to the maximum extent possible, we do the following.

Look at all the elements in the first column,

a^(1)_{1,1}, a^(1)_{2,1}, ..., a^(1)_{n,1}

and pick the largest in size. Say it is

|a^(1)_{k,1}| = max_{j=1,...,n} |a^(1)_{j,1}|

Then interchange equations 1 and k, which means interchanging rows 1 and k in the augmented matrix [A^(1) | b^(1)]. Then proceed with the elimination of x_1 from equations 2 thru n as before.

Having obtained [A^(2) | b^(2)], what if a^(2)_{2,2} = 0? Then we proceed as before. Among the elements

a^(2)_{2,2}, a^(2)_{3,2}, ..., a^(2)_{n,2}

pick the one of largest size:

|a^(2)_{k,2}| = max_{j=2,...,n} |a^(2)_{j,2}|

Interchange rows 2 and k. Then proceed as before to eliminate x_2 from equations 3 thru n, thus obtaining [A^(3) | b^(3)].

This is done at every stage of the elimination process. This technique is called partial pivoting, and it is a part of most Gaussian elimination programs (including the one in the text).
Consequences of partial pivoting. Recall the definition of the elements obtained in the process of eliminating x_1 from equations 2 thru n:

m_{i,1} = a^(1)_{i,1} / a^(1)_{1,1},   i = 2, ..., n
a^(2)_{i,j} = a^(1)_{i,j} − m_{i,1} a^(1)_{1,j},   j = 2, ..., n
b^(2)_i = b^(1)_i − m_{i,1} b^(1)_1

for i = 2, ..., n. By our definition of the pivot element a^(1)_{1,1}, we have

|m_{i,1}| ≤ 1,   i = 2, ..., n

Thus in the calculation of a^(2)_{i,j} and b^(2)_i, the elements do not grow rapidly in size. This is in comparison to what might happen otherwise, in which the multipliers m_{i,1} might have been very large. This property is true of the multipliers at every step of the elimination process:

|m_{i,k}| ≤ 1,   i = k + 1, ..., n,   k = 1, ..., n − 1

This property leads to good error propagation properties in Gaussian elimination with partial pivoting. The only error in Gaussian elimination is that derived from the rounding errors in the arithmetic operations. For example, at the first elimination step (eliminating x_1 from equations 2 thru n),

a^(2)_{i,j} = a^(1)_{i,j} − m_{i,1} a^(1)_{1,j},   j = 2, ..., n
b^(2)_i = b^(1)_i − m_{i,1} b^(1)_1

The above property on the size of the multipliers prevents these numbers and the errors in their calculation from growing as rapidly as they might if no partial pivoting were used.

As an example of the improvement in accuracy obtained with partial pivoting, see the example on pages 262-263.
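A minimal MATLAB sketch of Gaussian elimination with partial pivoting followed by back substitution (illustrative code, not the program from the text; A is n × n and b a column vector):

function x = gausselim(A, b)
% Gaussian elimination with partial pivoting, then back substitution.
n = length(b);
for k = 1:n-1
    [~, p] = max(abs(A(k:n, k)));      % row with the largest pivot candidate
    p = p + k - 1;
    A([k p], :) = A([p k], :);         % interchange rows k and p
    b([k p])    = b([p k]);
    for i = k+1:n
        m = A(i, k) / A(k, k);         % multiplier m_{i,k}, |m| <= 1
        A(i, k:n) = A(i, k:n) - m * A(k, k:n);
        b(i)      = b(i) - m * b(k);
    end
end
x = zeros(n, 1);
x(n) = b(n) / A(n, n);
for k = n-1:-1:1
    x(k) = (b(k) - A(k, k+1:n) * x(k+1:n)) / A(k, k);
end
end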
OPERATION COUNTS

One of the major ways in which we compare the efficiency of different numerical methods is to count the number of needed arithmetic operations. For solving the linear system

a_{1,1} x_1 + ⋯ + a_{1,n} x_n = b_1
⋮
a_{n,1} x_1 + ⋯ + a_{n,n} x_n = b_n

using Gaussian elimination, we have the following operation counts.

1. A → U, where we are converting Ax = b to Ux = g:
   Divisions: n(n − 1)/2
   Additions: n(n − 1)(2n − 1)/6
   Multiplications: n(n − 1)(2n − 1)/6

2. b → g:
   Additions: n(n − 1)/2
   Multiplications: n(n − 1)/2

3. Solving Ux = g:
   Divisions: n
   Additions: n(n − 1)/2
   Multiplications: n(n − 1)/2

On some machines, the cost of a division is much more than that of a multiplication; whereas on others there is not any important difference. We assume the latter; and then the operation costs are as follows.

MD(A → U) = n(n² − 1)/3
MD(b → g) = n(n − 1)/2
MD(Find x) = n(n + 1)/2
AS(A → U) = n(n − 1)(2n − 1)/6
AS(b → g) = n(n − 1)/2
AS(Find x) = n(n − 1)/2

Thus the total number of operations is

Additions: (2n³ + 3n² − 5n)/6
Multiplications and divisions: (n³ + 3n² − n)/3

Both are around n³/3, and thus the total operations count is approximately

(2/3) n³

What happens to the cost when n is doubled?

Solving Ax = b and Ax = c. What is the cost? Only the modification of the right side is different in these two cases. Thus the additional cost is

MD(b → g) + MD(Find x) = n²
AS(b → g) + AS(Find x) = n(n − 1)

The total is around 2n² operations, which is quite a bit smaller than (2/3)n³ when n is even moderately large, say n = 100.

Thus one can solve the linear system Ax = c at little additional cost to that for solving Ax = b. This has important consequences when it comes to estimation of the error in computed solutions.
CALCULATING THE MATRIX INVERSE

Consider finding the inverse of a 3 × 3 matrix

A =
  a_{1,1} a_{1,2} a_{1,3}
  a_{2,1} a_{2,2} a_{2,3}   =  [A_{*,1}, A_{*,2}, A_{*,3}]
  a_{3,1} a_{3,2} a_{3,3}

We want to find a matrix

X = [X_{*,1}, X_{*,2}, X_{*,3}]

for which AX = I:

A [X_{*,1}, X_{*,2}, X_{*,3}] = [e_1, e_2, e_3]
[A X_{*,1}, A X_{*,2}, A X_{*,3}] = [e_1, e_2, e_3]

This means we want to solve

A X_{*,1} = e_1,   A X_{*,2} = e_2,   A X_{*,3} = e_3

that is, three linear systems, all with the same matrix of coefficients A.
MATRIX INVERSE EXAMPLE

A =
  1  1 −2
  1  1  1
  1 −1  0

Form the augmented matrix [A | I]:

  1  1 −2 |  1 0 0
  1  1  1 |  0 1 0
  1 −1  0 |  0 0 1

With m_{2,1} = 1 and m_{3,1} = 1, eliminating x_1 gives

  1  1 −2 |  1 0 0
  0  0  3 | −1 1 0
  0 −2  2 | −1 0 1

Interchange rows 2 and 3 (the pivot position is zero):

  1  1 −2 |  1 0 0
  0 −2  2 | −1 0 1
  0  0  3 | −1 1 0

Then by using back substitution to solve for each column of the inverse, we obtain

A⁻¹ =
  1/6   1/3   1/2
  1/6   1/3  −1/2
 −1/3   1/3    0
COST OF MATRIX INVERSION

In calculating A⁻¹, we are solving for the matrix X = [X_{*,1}, X_{*,2}, ..., X_{*,n}] where

A [X_{*,1}, X_{*,2}, ..., X_{*,n}] = [e_1, e_2, ..., e_n]

and e_j is column j of the identity matrix. Thus we are solving n linear systems

A X_{*,1} = e_1,   A X_{*,2} = e_2,   ...,   A X_{*,n} = e_n     (1)

all with the same coefficient matrix. Returning to the earlier operation counts for solving a single linear system, we have the following.

Cost of triangulating A: approximately (2/3)n³ operations
Cost of solving Ax = b: 2n² operations

Thus solving the n linear systems in (1) costs approximately

(2/3)n³ + n·(2n²) = (8/3)n³ operations

It costs approximately four times as many operations to invert A as to solve a single system. With attention to the form of the right-hand sides in (1) this can be reduced to 2n³ operations.
MATLAB MATRIX OPERATIONS
To solve the linear system Ax = b in Matlab, use
x = A\ b
In Matlab, the command
inv (A)
will calculate the inverse of A.
There are many matrix operations built into Matlab,
both for general matrices and for special classes of
matrices. We do not discuss those here, but recom-
mend the student to investigate these thru the Matlab
help options.
GAUSSIAN ELIMINATION - REVISITED

Consider solving the linear system

 2x_1 +  x_2 −  x_3 + 2x_4 = 5
 4x_1 + 5x_2 − 3x_3 + 6x_4 = 9
−2x_1 + 5x_2 − 2x_3 + 6x_4 = 4
 4x_1 + 11x_2 − 4x_3 + 8x_4 = 2

by Gaussian elimination without pivoting. We denote this linear system by Ax = b. The augmented matrix for this system is

[A | b] =
  2   1 −1  2 | 5
  4   5 −3  6 | 9
 −2   5 −2  6 | 4
  4  11 −4  8 | 2

To eliminate x_1 from equations 2, 3, and 4, use the multipliers

m_{2,1} = 2,   m_{3,1} = −1,   m_{4,1} = 2

This will introduce zeros into the positions below the diagonal in column 1, yielding

  2  1 −1  2 |  5
  0  3 −1  2 | −1
  0  6 −3  8 |  9
  0  9 −2  4 | −8

To eliminate x_2 from equations 3 and 4, use the multipliers

m_{3,2} = 2,   m_{4,2} = 3

This reduces the augmented matrix to

  2  1 −1  2 |  5
  0  3 −1  2 | −1
  0  0 −1  4 | 11
  0  0  1 −2 | −5

To eliminate x_3 from equation 4, use the multiplier

m_{4,3} = −1

This reduces the augmented matrix to

  2  1 −1  2 |  5
  0  3 −1  2 | −1
  0  0 −1  4 | 11
  0  0  0  2 |  6

Return this to the familiar linear system

2x_1 + x_2 − x_3 + 2x_4 = 5
       3x_2 − x_3 + 2x_4 = −1
             −x_3 + 4x_4 = 11
                    2x_4 = 6

Solving by back substitution, we obtain

x_4 = 3,   x_3 = 1,   x_2 = −2,   x_1 = 1

There is a surprising result involving matrices associated with this elimination process. Introduce the upper triangular matrix

U =
  2  1 −1  2
  0  3 −1  2
  0  0 −1  4
  0  0  0  2

which resulted from the elimination process. Then introduce the lower triangular matrix

L =
  1        0        0        0        1  0  0  0
  m_{2,1}  1        0        0        2  1  0  0
  m_{3,1}  m_{3,2}  1        0   =   −1  2  1  0
  m_{4,1}  m_{4,2}  m_{4,3}  1        2  3 −1  1

This uses the multipliers introduced in the elimination process. Then A = LU:

  2   1 −1  2      1  0  0  0      2  1 −1  2
  4   5 −3  6      2  1  0  0      0  3 −1  2
 −2   5 −2  6  =  −1  2  1  0  ·   0  0 −1  4
  4  11 −4  8      2  3 −1  1      0  0  0  2
In general, when the process of Gaussian elimination without pivoting is applied to solving a linear system Ax = b, we obtain A = LU with L and U constructed as above.

For the case in which partial pivoting is used, we obtain the slightly modified result

LU = PA

where L and U are constructed as before and P is a permutation matrix. For example, consider

P =
  0 0 1 0
  1 0 0 0
  0 0 0 1
  0 1 0 0

Then

PA =
  0 0 1 0     a_{1,1} a_{1,2} a_{1,3} a_{1,4}     A_{3,*}
  1 0 0 0     a_{2,1} a_{2,2} a_{2,3} a_{2,4}     A_{1,*}
  0 0 0 1  ·  a_{3,1} a_{3,2} a_{3,3} a_{3,4}  =  A_{4,*}
  0 1 0 0     a_{4,1} a_{4,2} a_{4,3} a_{4,4}     A_{2,*}

The matrix PA is obtained from A by switching around rows of A. The result LU = PA means that the LU factorization is valid for the matrix A with its rows suitably permuted.
Consequences: If we have a factorization

A = LU

with L lower triangular and U upper triangular, then we can solve the linear system Ax = b in a relatively straightforward way.

The linear system can be written as

LUx = b

Write this as a two stage process:

Lg = b,   Ux = g

The system Lg = b is a lower triangular system:

g_1 = b_1
ℓ_{2,1} g_1 + g_2 = b_2
ℓ_{3,1} g_1 + ℓ_{3,2} g_2 + g_3 = b_3
⋮
ℓ_{n,1} g_1 + ⋯ + ℓ_{n,n−1} g_{n−1} + g_n = b_n

We solve it by forward substitution. Then we solve the upper triangular system Ux = g by back substitution.
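In MATLAB, this two-stage solve can be written directly with the built-in lu factorization (a short sketch; A and b are whatever system is at hand):

[L, U, P] = lu(A);
g = L \ (P * b);   % forward substitution on the lower triangular system
x = U \ g;         % back substitution on the upper triangular system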
VARIANTS OF GAUSSIAN ELIMINATION

If no partial pivoting is needed, then we can look for a factorization

A = LU

without going thru the Gaussian elimination process. For example, suppose A is 4 × 4. We write

  a_{1,1} a_{1,2} a_{1,3} a_{1,4}     1       0       0       0     u_{1,1} u_{1,2} u_{1,3} u_{1,4}
  a_{2,1} a_{2,2} a_{2,3} a_{2,4}     ℓ_{2,1} 1       0       0     0       u_{2,2} u_{2,3} u_{2,4}
  a_{3,1} a_{3,2} a_{3,3} a_{3,4}  =  ℓ_{3,1} ℓ_{3,2} 1       0  ·  0       0       u_{3,3} u_{3,4}
  a_{4,1} a_{4,2} a_{4,3} a_{4,4}     ℓ_{4,1} ℓ_{4,2} ℓ_{4,3} 1     0       0       0       u_{4,4}

To find the elements {ℓ_{i,j}} and {u_{i,j}}, we multiply the right side matrices L and U and match the results with the corresponding elements in A.

Multiplying the first row of L times all of the columns of U leads to

u_{1,j} = a_{1,j},   j = 1, 2, 3, 4

Then multiplying rows 2, 3, 4 times the first column of U yields

ℓ_{i,1} u_{1,1} = a_{i,1},   i = 2, 3, 4

and we can solve for {ℓ_{2,1}, ℓ_{3,1}, ℓ_{4,1}}. We can continue this process, finding the second row of U and then the second column of L, and so on. For example, to solve for ℓ_{4,3}, we need to solve for it in

ℓ_{4,1} u_{1,3} + ℓ_{4,2} u_{2,3} + ℓ_{4,3} u_{3,3} = a_{4,3}

Why do this? A hint of an answer is given by this last equation. If we had an n × n matrix A, then we would find ℓ_{n,n−1} by solving for it in the equation

ℓ_{n,1} u_{1,n−1} + ℓ_{n,2} u_{2,n−1} + ⋯ + ℓ_{n,n−1} u_{n−1,n−1} = a_{n,n−1}

ℓ_{n,n−1} = [ a_{n,n−1} − (ℓ_{n,1} u_{1,n−1} + ⋯ + ℓ_{n,n−2} u_{n−2,n−1}) ] / u_{n−1,n−1}

Embedded in this formula we have a dot product. This is in fact typical of this process, with the length of the inner products varying from one position to another. Recalling the discussion of dot products, we can evaluate this last formula using higher precision arithmetic and thus avoid many rounding errors.

This leads to a variant of Gaussian elimination in which there are far fewer rounding errors. With ordinary Gaussian elimination, the number of rounding errors is proportional to n³. This variant reduces the number of rounding errors, with the number now being proportional to only n². This can lead to major increases in accuracy, especially for matrices which are very sensitive to small changes.
TRIDIAGONAL MATRICES

A =
  b_1  c_1   0   ⋯        0
  a_2  b_2  c_2
  0    a_3  b_3  c_3
  ⋮          ⋱    ⋱      c_{n−1}
  0     ⋯        a_n      b_n

These occur very commonly in the numerical solution of partial differential equations, as well as in other applications (e.g. computing interpolating cubic spline functions).

We factor A = LU, as before. But now L and U take very simple forms. Before proceeding, we note with an example that the same may not be true of the matrix inverse.

EXAMPLE

Define an n × n tridiagonal matrix

A =
 −1   1   0   ⋯          0
  1  −2   1
  0   1  −2   1
  ⋮        ⋱   ⋱          1
  0    ⋯       1   −(n−1)/n

Then A⁻¹ is given by

(A⁻¹)_{i,j} = max{i, j}

Thus the sparse matrix A can (and usually does) have a dense inverse.
We factor A = LU, with

L =
  1    0    0   ⋯    0
  α_2  1    0
  0    α_3  1
  ⋮          ⋱   ⋱
  0     ⋯       α_n  1

U =
  β_1  c_1  0    ⋯      0
  0    β_2  c_2
  0    0    β_3  c_3
  ⋮           ⋱    ⋱    c_{n−1}
  0     ⋯           0   β_n

Multiply these and match coefficients with A to find {α_i, β_i}.

To solve the linear system

Ax = f,   or   LUx = f,

instead solve the two triangular systems

Lg = f,   Ux = g

Solving Lg = f:

g_1 = f_1
g_j = f_j − α_j g_{j−1},   j = 2, ..., n

Solving Ux = g:

x_n = g_n / β_n
x_j = (g_j − c_j x_{j+1}) / β_j,   j = n − 1, ..., 1

By doing a few multiplications of rows of L times columns of U, we obtain the general pattern as follows:

β_1 = b_1                                        : row 1 of LU
α_2 β_1 = a_2,   α_2 c_1 + β_2 = b_2             : row 2 of LU
⋮
α_n β_{n−1} = a_n,   α_n c_{n−1} + β_n = b_n     : row n of LU

These are straightforward to solve:

β_1 = b_1
α_j = a_j / β_{j−1},   β_j = b_j − α_j c_{j−1},   j = 2, ..., n
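The factor-and-solve loops above fit in a few lines of MATLAB (a sketch; a, b, c, f are vectors holding the sub-diagonal a(2:n), the diagonal b(1:n), the super-diagonal c(1:n-1), and the right side; the function name is illustrative):

function x = tridiagsolve(a, b, c, f)
% Solve a tridiagonal system Ax = f via the LU factorization above
% (no pivoting), then forward and back substitution.
n = length(b);
beta = zeros(n,1); g = zeros(n,1); x = zeros(n,1);
beta(1) = b(1);
g(1)    = f(1);
for j = 2:n
    alpha   = a(j) / beta(j-1);
    beta(j) = b(j) - alpha * c(j-1);
    g(j)    = f(j) - alpha * g(j-1);
end
x(n) = g(n) / beta(n);
for j = n-1:-1:1
    x(j) = (g(j) - c(j) * x(j+1)) / beta(j);
end
end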
OPERATIONS COUNT

Factoring A = LU:
Additions: n − 1
Multiplications: n − 1
Divisions: n − 1

Solving Lg = f and Ux = g:
Additions: 2n − 2
Multiplications: 2n − 2
Divisions: n

Thus the total number of arithmetic operations is approximately 3n to factor A; and it takes about 5n to solve the linear system using the factorization of A.

If we had A⁻¹ at no cost, what would it cost to compute x = A⁻¹ f?

x_i = Σ_{j=1}^{n} (A⁻¹)_{i,j} f_j,   i = 1, ..., n
MATLAB MATRIX OPERATIONS
To obtain the LU-factorization of a matrix, including
the use of partial pivoting, use the Matlab command
lu. In particular,
[L, U, P] = lu(X)
returns the lower triangular matrix L, upper triangular
matrix U, and permutation matrix P so that
PX = LU
NUMERICAL INTEGRATION

How do you evaluate

I = ∫_a^b f(x) dx

From calculus, if F(x) is an antiderivative of f(x), then

I = ∫_a^b f(x) dx = F(x)|_a^b = F(b) − F(a)

However, in practice most integrals cannot be evaluated by this means. And even when this can work, an approximate numerical method may be much simpler and easier to use. For example, the integrand in

∫_0^1 dx / (1 + x⁵)

has an extremely complicated antiderivative; and it is easier to evaluate the integral by approximate means. Try evaluating this integral with Maple or Mathematica.
NUMERICAL INTEGRATION
A GENERAL FRAMEWORK

Returning to a lesson used earlier with rootfinding: if you cannot solve a problem, then replace it with a near-by problem that you can solve. In our case, we want to evaluate

I = ∫_a^b f(x) dx

To do so, many of the numerical schemes are based on choosing approximants of f(x). Calling one such approximant f̃(x), use

I ≈ ∫_a^b f̃(x) dx ≡ Ĩ

What is the error?

E = I − Ĩ = ∫_a^b [f(x) − f̃(x)] dx

|E| ≤ ∫_a^b |f(x) − f̃(x)| dx ≤ (b − a) ‖f − f̃‖_∞

‖f − f̃‖_∞ ≡ max_{a≤x≤b} |f(x) − f̃(x)|

We also want to choose the approximants f̃(x) of a form we can integrate directly and easily. Examples are polynomials, trig functions, piecewise polynomials, and others.

If we use polynomial approximations, then how do we choose them? At this point, we have two choices:

1. Taylor polynomials approximating f(x)
2. Interpolatory polynomials approximating f(x)
EXAMPLE

Consider evaluating

I = ∫_0^1 e^{x²} dx

Use

e^t = 1 + t + (1/2!)t² + ⋯ + (1/n!)t^n + (1/(n+1)!) t^{n+1} e^{c_t}
e^{x²} = 1 + x² + (1/2!)x⁴ + ⋯ + (1/n!)x^{2n} + (1/(n+1)!) x^{2n+2} e^{d_x}

with 0 ≤ d_x ≤ x². Then

I = ∫_0^1 [1 + x² + (1/2!)x⁴ + ⋯ + (1/n!)x^{2n}] dx
    + (1/(n+1)!) ∫_0^1 x^{2n+2} e^{d_x} dx

Taking n = 3, we have

I = 1 + 1/3 + 1/10 + 1/42 + E = 1.4571 + E

0 < E ≤ (e/24) ∫_0^1 x⁸ dx = e/216 ≈ 0.0126
USING INTERPOLATORY POLYNOMIALS

In spite of the simplicity of the above example, it is generally more difficult to do numerical integration by constructing Taylor polynomial approximations than by constructing polynomial interpolates. We therefore construct the function f̃ in

∫_a^b f(x) dx ≈ ∫_a^b f̃(x) dx

by means of interpolation. Initially, we consider only the case in which the interpolation is based on evenly spaced node points.
LINEAR INTERPOLATION

The linear interpolant to f(x), interpolating at a and b, is given by

P_1(x) = [(b − x) f(a) + (x − a) f(b)] / (b − a)

Using the linear interpolant P_1(x), we obtain the approximation

∫_a^b f(x) dx ≈ ∫_a^b P_1(x) dx = ½(b − a)[f(a) + f(b)] ≡ T_1(f)

The rule

∫_a^b f(x) dx ≈ T_1(f)

is called the trapezoidal rule.

Figure: y = f(x) on [a, b] and the trapezoid under y = P_1(x), illustrating I ≈ T_1(f).

Example.

∫_0^{π/2} sin x dx ≈ (π/4)[sin 0 + sin(π/2)] = π/4 ≈ 0.785398
Error = 0.215
HOW TO OBTAIN GREATER ACCURACY?

How do we improve our estimate of the integral

I = ∫_a^b f(x) dx

One direction is to increase the degree of the approximation, moving next to a quadratic interpolating polynomial for f(x). We first look at an alternative. Instead of using the trapezoidal rule on the original interval [a, b], apply it to integrals of f(x) over smaller subintervals. For example:

I = ∫_a^c f(x) dx + ∫_c^b f(x) dx,   c = (b + a)/2
  ≈ [(c − a)/2][f(a) + f(c)] + [(b − c)/2][f(c) + f(b)]
  = (h/2)[f(a) + 2f(c) + f(b)] ≡ T_2(f),   h = (b − a)/2

Example.

∫_0^{π/2} sin x dx ≈ (π/8)[sin 0 + 2 sin(π/4) + sin(π/2)] ≈ 0.948059
Error = 0.0519

Figure: piecewise linear interpolation of f(x) at the nodes x_0 = a, x_1, x_2, x_3 = b, illustrating I ≈ T_3(f).
THE TRAPEZOIDAL RULE

We can continue as above by dividing [a, b] into even smaller subintervals and applying

∫_α^β f(x) dx ≈ [(β − α)/2][f(α) + f(β)]     (*)

on each of the smaller subintervals. Begin by introducing a positive integer n ≥ 1,

h = (b − a)/n,   x_j = a + j h,   j = 0, 1, ..., n

Then

I = ∫_{x_0}^{x_n} f(x) dx
  = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ⋯ + ∫_{x_{n−1}}^{x_n} f(x) dx

Use [α, β] = [x_0, x_1], [x_1, x_2], ..., [x_{n−1}, x_n], for each of which the subinterval has length h. Then applying (*), we have

I ≈ (h/2)[f(x_0) + f(x_1)] + (h/2)[f(x_1) + f(x_2)] + ⋯
    + (h/2)[f(x_{n−2}) + f(x_{n−1})] + (h/2)[f(x_{n−1}) + f(x_n)]

Simplifying,

I ≈ h [ ½ f(a) + f(x_1) + ⋯ + f(x_{n−1}) + ½ f(b) ] ≡ T_n(f)

This is called the composite trapezoidal rule, or more simply, the trapezoidal rule.
Example. Again integrate sin x over [0, π/2]. Then we have

n     T_n(f)        Error      Ratio
1     0.785398163   2.15E−1
2     0.948059449   5.19E−2    4.13
4     0.987115801   1.29E−2    4.03
8     0.996785172   3.21E−3    4.01
16    0.999196680   8.03E−4    4.00
32    0.999799194   2.01E−4    4.00
64    0.999949800   5.02E−5    4.00
128   0.999987450   1.26E−5    4.00
256   0.999996863   3.14E−6    4.00

Note that the errors are decreasing by a constant factor of 4. Why do we always double n?
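A short sketch reproducing the T_n(f) column for f(x) = sin x on [0, π/2] (illustrative code):

% Composite trapezoidal rule T_n(f) for f(x) = sin(x) on [0, pi/2].
f = @(x) sin(x);
a = 0;  b = pi/2;
for n = [1 2 4 8 16 32]
    h  = (b - a) / n;
    x  = a + (0:n) * h;
    Tn = h * (sum(f(x)) - 0.5*(f(a) + f(b)));    % trapezoidal sum
    fprintf('%4d  %.9f  %.2e\n', n, Tn, 1 - Tn); % exact value is 1
end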
USING QUADRATIC INTERPOLATION

We want to approximate I = ∫_a^b f(x) dx using quadratic interpolation of f(x). Interpolate f(x) at the points {a, c, b}, with c = ½(a + b). Also let h = ½(b − a). The quadratic interpolating polynomial is given by

P_2(x) = [(x − c)(x − b)/(2h²)] f(a) − [(x − a)(x − b)/h²] f(c) + [(x − a)(x − c)/(2h²)] f(b)

Replacing f(x) by P_2(x), we obtain the approximation

∫_a^b f(x) dx ≈ ∫_a^b P_2(x) dx = (h/3)[f(a) + 4f(c) + f(b)] ≡ S_2(f)

This is called Simpson's rule.

Figure: y = f(x) on [a, b] and the interpolating parabola through x = a, (a + b)/2, b, illustrating I ≈ S_2(f).

Example.

∫_0^{π/2} sin x dx ≈ (π/12)[sin 0 + 4 sin(π/4) + sin(π/2)] ≈ 1.00227987749221
Error = 0.00228
SIMPSON'S RULE

As with the trapezoidal rule, we can apply Simpson's rule on smaller subdivisions in order to obtain better accuracy in approximating

I = ∫_a^b f(x) dx

Again, Simpson's rule is given by

∫_α^β f(x) dx ≈ (h/3)[f(α) + 4f(γ) + f(β)],   γ = (α + β)/2

and h = ½(β − α). Let n be a positive even integer, and

h = (b − a)/n,   x_j = a + j h,   j = 0, 1, ..., n

Then write

I = ∫_{x_0}^{x_n} f(x) dx
  = ∫_{x_0}^{x_2} f(x) dx + ∫_{x_2}^{x_4} f(x) dx + ⋯ + ∫_{x_{n−2}}^{x_n} f(x) dx

Apply

∫_α^β f(x) dx ≈ (h/3)[f(α) + 4f(γ) + f(β)],   γ = (α + β)/2

to each of these subintegrals, with

[α, β] = [x_0, x_2], [x_2, x_4], ..., [x_{n−2}, x_n]

In all cases, ½(β − α) = h. Then

I ≈ (h/3)[f(x_0) + 4f(x_1) + f(x_2)]
    + (h/3)[f(x_2) + 4f(x_3) + f(x_4)]
    + ⋯
    + (h/3)[f(x_{n−4}) + 4f(x_{n−3}) + f(x_{n−2})]
    + (h/3)[f(x_{n−2}) + 4f(x_{n−1}) + f(x_n)]

This can be simplified to

∫_a^b f(x) dx ≈ S_n(f) ≡ (h/3)[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4)
                              + ⋯ + 2f(x_{n−2}) + 4f(x_{n−1}) + f(x_n)]

This is called the composite Simpson's rule or, more simply, Simpson's rule.
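A sketch of S_n(f) for the same test integral (illustrative code; n must be even):

% Composite Simpson's rule S_n(f) for f(x) = sin(x) on [0, pi/2].
f = @(x) sin(x);
a = 0;  b = pi/2;
for n = [2 4 8 16 32]
    h = (b - a) / n;
    x = a + (0:n) * h;
    w = 2 * ones(1, n+1);  w(2:2:n) = 4;  w([1 n+1]) = 1;    % weights 1 4 2 4 ... 4 1
    Sn = (h/3) * (w * f(x)');
    fprintf('%4d  %.14f  %.2e\n', n, Sn, abs(1 - Sn));       % exact value is 1
end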
EXAMPLE

Approximate ∫_0^{π/2} sin x dx. The Simpson rule results are as follows.

n     S_n(f)              Error      Ratio
2     1.00227987749221    2.28E−3
4     1.00013458497419    1.35E−4    16.94
8     1.00000829552397    8.30E−6    16.22
16    1.00000051668471    5.17E−7    16.06
32    1.00000003226500    3.23E−8    16.01
64    1.00000000201613    2.02E−9    16.00
128   1.00000000012600    1.26E−10   16.00
256   1.00000000000788    7.88E−12   16.00
512   1.00000000000049    4.92E−13   15.99

Note that the ratios of successive errors have converged to 16. Why? Also compare this table with that for the trapezoidal rule. For example,

I − T_4 = 1.29E−2
I − S_4 = 1.35E−4
Example 1.

I^(1) = ∫_0^1 e^{−x²} dx ≈ 0.746824132812427
I^(2) = ∫_0^4 dx/(1 + x²) = arctan(4) ≈ 1.32581766366803
I^(3) = ∫_0^{2π} dx/(2 + cos x) = 2π/√3 ≈ 3.62759872846844

Table 1. Trapezoidal rule applied to Example 1.

        I^(1)            I^(2)            I^(3)
n       Error    R       Error    R       Error     R
2       1.6E−2           1.3E−1           5.6E−1
4       3.8E−3   4.02    3.6E−3   37.0    3.8E−2    14.9
8       9.6E−4   4.01    5.6E−4   6.4     1.9E−4    195.0
16      2.4E−4   4.00    1.4E−4   3.9     5.2E−9    37600
32      6.0E−5   4.00    3.6E−5   4.00
64      1.5E−5   4.00    9.0E−6   4.00
128     3.7E−6   4.00    2.3E−6   4.00

Table 2. Simpson rule applied to Example 1.

        I^(1)            I^(2)            I^(3)
n       Error    R       Error    R       Error     R
2       3.6E−4           8.7E−2           1.26
4       3.1E−5   11.4    3.9E−2   2.2     1.4E−1    9.2
8       2.0E−6   15.7    2.0E−3   20      1.2E−2    11.2
16      1.3E−7   15.9    4.0E−6   485     6.4E−5    191
32      7.8E−9   16.0    2.3E−8   172     1.7E−9    37600
64      4.9E−10  16.0    1.5E−9   16
128     3.0E−11  16.0    9.2E−11  16
TRAPEZOIDAL METHOD
ERROR FORMULA

Theorem. Let f(x) have two continuous derivatives on the interval a ≤ x ≤ b. Then

E_n^T(f) ≡ ∫_a^b f(x) dx − T_n(f) = −[h²(b − a)/12] f''(c_n)

for some c_n in the interval [a, b].

Later I will say something about the proof of this result, as it leads to some other useful formulas for the error.

The above formula says that the error decreases in a manner that is roughly proportional to h². Thus doubling n (and halving h) should cause the error to decrease by a factor of approximately 4. This is what we observed with a past example from the preceding section.
Example. Consider evaluating

I = ∫_0^2 dx/(1 + x^2)

using the trapezoidal method T_n(f). How large should n be chosen in order to ensure that

|E_n^T(f)| ≤ 5 × 10^{-6}?

We begin by calculating the derivatives:

f'(x) = -2x/(1 + x^2)^2,   f''(x) = (-2 + 6x^2)/(1 + x^2)^3

From a graph of f''(x),

max_{0≤x≤2} |f''(x)| = 2

Recall that b - a = 2. Therefore,

E_n^T(f) = -[h^2 (b - a)/12] f''(c_n)
|E_n^T(f)| ≤ [h^2 · 2/12] · 2 = h^2/3

We bound |f''(c_n)| since we do not know c_n, and therefore we must assume the worst possible case, that which makes the error formula largest. That is what has been done above.
When do we have

|E_n^T(f)| ≤ 5 × 10^{-6}     (1)

To ensure this, we choose h so small that

h^2/3 ≤ 5 × 10^{-6}

This is equivalent to choosing h and n to satisfy

h ≤ .003873,   n = 2/h ≥ 516.4

Thus n ≥ 517 will imply (1).
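As a quick numerical check of this bound (a sketch, reusing the trapcomp helper introduced earlier):

f = @(x) 1./(1 + x.^2);
I = atan(2);                           % exact value of the integral
err = I - trapcomp(f, 0, 2, 517);      % actual error, well within 5e-6
bound = (2/517)^2 / 3;                 % the bound h^2/3, about 5.0e-6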
DERIVING THE ERROR FORMULA

There are two stages in deriving the error:
(1) Obtain the error formula for the case of a single subinterval (n = 1);
(2) Use this to obtain the general error formula given earlier.

For the trapezoidal method with only a single subinterval, we have

∫_α^{α+h} f(x) dx - (h/2)[f(α) + f(α + h)] = -(h^3/12) f''(c)

for some c in the interval [α, α + h].
A sketch of the derivation of this error formula is given in the problems.
Recall that the general trapezoidal rule T_n(f) was obtained by applying the simple trapezoidal rule to a subdivision of the original interval of integration. Recall defining and writing

h = (b - a)/n,   x_j = a + j h,   j = 0, 1, ..., n

I = ∫_{x_0}^{x_n} f(x) dx = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ··· + ∫_{x_{n-1}}^{x_n} f(x) dx

I ≈ (h/2)[f(x_0) + f(x_1)] + (h/2)[f(x_1) + f(x_2)] + ··· + (h/2)[f(x_{n-2}) + f(x_{n-1})] + (h/2)[f(x_{n-1}) + f(x_n)]
Then the error

E_n^T(f) ≡ ∫_a^b f(x) dx - T_n(f)

can be analyzed by adding together the errors over the subintervals [x_0, x_1], [x_1, x_2], ..., [x_{n-1}, x_n]. Recall

∫_α^{α+h} f(x) dx - (h/2)[f(α) + f(α + h)] = -(h^3/12) f''(c)

Then on [x_{j-1}, x_j],

∫_{x_{j-1}}^{x_j} f(x) dx - (h/2)[f(x_{j-1}) + f(x_j)] = -(h^3/12) f''(γ_j)

with x_{j-1} ≤ γ_j ≤ x_j, but otherwise γ_j unknown.
Then combining these errors, we obtain

E_n^T(f) = -(h^3/12) f''(γ_1) - ··· - (h^3/12) f''(γ_n)

This formula can be further simplified, and we will do so in two ways.
Rewrite this error as

E_n^T(f) = -(h^3 n/12) · ( [f''(γ_1) + ··· + f''(γ_n)] / n )

Denote the quantity inside the parentheses by ζ_n. This number satisfies

min_{a≤x≤b} f''(x) ≤ ζ_n ≤ max_{a≤x≤b} f''(x)

Since f''(x) is a continuous function (by original assumption), there must be some number c_n in [a, b] for which

f''(c_n) = ζ_n

Recall also that h n = b - a. Then

E_n^T(f) = -(h^3 n/12) · ( [f''(γ_1) + ··· + f''(γ_n)] / n ) = -(h^2 (b - a)/12) f''(c_n)

This is the error formula given on the first slide.
AN ERROR ESTIMATE

We now obtain a way to estimate the error E_n^T(f). Return to the formula

E_n^T(f) = -(h^3/12) f''(γ_1) - ··· - (h^3/12) f''(γ_n)

and rewrite it as

E_n^T(f) = -(h^2/12) [f''(γ_1) h + ··· + f''(γ_n) h]

The quantity

f''(γ_1) h + ··· + f''(γ_n) h

is a Riemann sum for the integral

∫_a^b f''(x) dx = f'(b) - f'(a)

By this we mean

lim_{n→∞} [f''(γ_1) h + ··· + f''(γ_n) h] = ∫_a^b f''(x) dx

Thus

f''(γ_1) h + ··· + f''(γ_n) h ≈ f'(b) - f'(a)

for larger values of n. Combining this with the earlier error formula, we have

E_n^T(f) ≈ -(h^2/12) [f'(b) - f'(a)] ≡ Ẽ_n^T(f)

This is a computable estimate of the error in the numerical integration. It is called an asymptotic error estimate.
Example. Consider evaluating

I(f) = ∫_0^π e^x cos x dx = -(e^π + 1)/2 ≐ -12.070346

In this case,

f'(x) = e^x [cos x - sin x]
f''(x) = -2 e^x sin x
max_{0≤x≤π} |f''(x)| = |f''(0.75π)| ≐ 14.921

Then

E_n^T(f) = -(h^2 (b - a)/12) f''(c_n)
|E_n^T(f)| ≤ (h^2 π/12) · 14.921 = 3.906 h^2

Also

Ẽ_n^T(f) = -(h^2/12) [f'(π) - f'(0)] = (h^2/12) [e^π + 1] ≐ 2.012 h^2

In addition,

I(f) - T_n(f) ≈ -(h^2/12) [f'(b) - f'(a)]
I(f) ≈ T_n(f) - (h^2/12) [f'(b) - f'(a)] ≡ CT_n(f)

This is the corrected trapezoidal rule. It is easy to obtain from the trapezoidal rule, and in most cases it converges more rapidly than the trapezoidal rule.
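A MATLAB sketch of the corrected trapezoidal rule, reusing trapcomp from earlier (the name ctrap and the use of a derivative handle df are our own choices):

function CT = ctrap(f, df, a, b, n)
% Corrected trapezoidal rule: CT_n(f) = T_n(f) - (h^2/12)*(f'(b) - f'(a)).
h = (b - a)/n;
CT = trapcomp(f, a, b, n) - (h^2/12)*(df(b) - df(a));
end

% Example (the integrand of I^(1) in Example 1):
%   f = @(x) exp(-x.^2);  df = @(x) -2*x.*exp(-x.^2);
%   ctrap(f, df, 0, 1, 8)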
Table 3. Asymptotic error estimate and corrected trapezoidal rule applied to integral I^(1) from Example 1.

n     I - T_n(f)   R    Ẽ_n(f)    I - CT_n(f)   R
2     1.6E-2            1.5E-2    1.3E-4
4     3.8E-3       4    3.8E-3    7.9E-6        15.8
8     9.6E-4       4    9.6E-4    4.9E-7        16
16    2.4E-4       4    2.4E-4    3.1E-8        16
32    5.9E-5       4    5.9E-5    2.0E-9        16
64    1.5E-5       4    1.5E-5    2.2E-10       16
SIMPSON'S RULE ERROR FORMULA

Recall the general Simpson's rule

∫_a^b f(x) dx ≈ S_n(f) ≡ (h/3)[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4) + ··· + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n)]

For its error, we have

E_n^S(f) ≡ ∫_a^b f(x) dx - S_n(f) = -(h^4 (b - a)/180) f^(4)(c_n)

for some a ≤ c_n ≤ b, with c_n otherwise unknown. For an asymptotic error estimate,

∫_a^b f(x) dx - S_n(f) ≈ Ẽ_n^S(f) ≡ -(h^4/180) [f'''(b) - f'''(a)]

DISCUSSION

For Simpson's error formula, both formulas assume that the integrand f(x) has four continuous derivatives on the interval [a, b]. What happens when this is not valid? We return later to this question.
Both formulas also say the error should decrease by a factor of around 16 when n is doubled.
Compare these results with those for the trapezoidal rule error formulas:

E_n^T(f) ≡ ∫_a^b f(x) dx - T_n(f) = -(h^2 (b - a)/12) f''(c_n)
E_n^T(f) ≈ -(h^2/12) [f'(b) - f'(a)] ≡ Ẽ_n^T(f)
EXAMPLE

Consider evaluating

I = ∫_0^2 dx/(1 + x^2)

using Simpson's rule S_n(f). How large should n be chosen in order to ensure that

|E_n^S(f)| ≤ 5 × 10^{-6}?

Begin by noting that

f^(4)(x) = 24 (5x^4 - 10x^2 + 1)/(1 + x^2)^5
max_{0≤x≤2} |f^(4)(x)| = f^(4)(0) = 24

Then

E_n^S(f) = -(h^4 (b - a)/180) f^(4)(c_n)
|E_n^S(f)| ≤ (h^4 · 2/180) · 24 = 4h^4/15

Then |E_n^S(f)| ≤ 5 × 10^{-6} is true if

4h^4/15 ≤ 5 × 10^{-6},   i.e.   h ≤ .0658,   n ≥ 30.39

Therefore, choosing n ≥ 32 (n must be even) will give the desired error bound. Compare this with the earlier trapezoidal example in which n ≥ 517 was needed.
For the asymptotic error estimate, we have

f'''(x) = -24x (x^2 - 1)/(1 + x^2)^4
Ẽ_n^S(f) ≡ -(h^4/180) [f'''(2) - f'''(0)] = -(h^4/180) (-144/625) = (4/3125) h^4
INTEGRATING √x

Consider the numerical approximation of

∫_0^1 √x dx = 2/3

In the following table, we give the errors when using both the trapezoidal and Simpson rules.

n     E_n^T       Ratio   E_n^S       Ratio
2     6.311E-2            2.860E-2
4     2.338E-2    2.70    1.012E-2    2.82
8     8.536E-3    2.74    3.587E-3    2.83
16    3.085E-3    2.77    1.268E-3    2.83
32    1.108E-3    2.78    4.485E-4    2.83
64    3.959E-4    2.80    1.586E-4    2.83
128   1.410E-4    2.81    5.606E-5    2.83

The rate of convergence is slower because the function f(x) = √x is not sufficiently differentiable on [0, 1]. Both methods converge with a rate proportional to h^{1.5}.
ASYMPTOTIC ERROR FORMULAS

If we have a numerical integration formula

∫_a^b f(x) dx ≈ Σ_{j=0}^{n} w_j f(x_j)

let E_n(f) denote its error,

E_n(f) = ∫_a^b f(x) dx - Σ_{j=0}^{n} w_j f(x_j)

We say another formula Ẽ_n(f) is an asymptotic error formula for this numerical integration if it satisfies

lim_{n→∞} Ẽ_n(f)/E_n(f) = 1

Equivalently,

lim_{n→∞} [E_n(f) - Ẽ_n(f)]/E_n(f) = 0

These conditions say that Ẽ_n(f) looks increasingly like E_n(f) as n increases, and thus

E_n(f) ≈ Ẽ_n(f)
Example. For the trapezoidal rule,

E_n^T(f) ≈ Ẽ_n^T(f) ≡ -(h^2/12) [f'(b) - f'(a)]

This assumes f(x) has two continuous derivatives on the interval [a, b].
Example. For Simpson's rule,

E_n^S(f) ≈ Ẽ_n^S(f) ≡ -(h^4/180) [f'''(b) - f'''(a)]

This assumes f(x) has four continuous derivatives on the interval [a, b].
Note that both of these formulas can be written in an equivalent form as

Ẽ_n(f) = c/n^p

for an appropriate constant c and exponent p. With the trapezoidal rule, p = 2 and

c = -((b - a)^2/12) [f'(b) - f'(a)]

and for Simpson's rule, p = 4 with a suitable c.
The formula

Ẽ_n(f) = c/n^p     (2)

occurs for many other numerical integration formulas that we have not yet defined or studied. In addition, if we use the trapezoidal or Simpson rules with an integrand f(x) which is not sufficiently differentiable, then (2) may hold with an exponent p that is less than the ideal one.
Example. Consider

I = ∫_0^1 x^β dx

in which -1 < β < 1, β ≠ 0. Then the convergence of the trapezoidal rule can be shown to have the asymptotic error formula

E_n ≈ Ẽ_n = c/n^{β+1}     (3)

for some constant c dependent on β. A similar result holds for Simpson's rule, with -1 < β < 3, β not an integer. We can actually specify a formula for c; but the formula is often less important than knowing that (2) is valid for some c.
APPLICATION OF ASYMPTOTIC
ERROR FORMULAS

Assume we know that an asymptotic error formula

I - I_n ≈ c/n^p

is valid for some numerical integration rule denoted by I_n. Initially, assume we know the exponent p. Then imagine calculating both I_n and I_{2n}. With I_{2n}, we have

I - I_{2n} ≈ c/(2^p n^p)

This leads to

I - I_n ≈ 2^p [I - I_{2n}]
I ≈ (2^p I_{2n} - I_n)/(2^p - 1) = I_{2n} + (I_{2n} - I_n)/(2^p - 1)

The formula

I ≈ I_{2n} + (I_{2n} - I_n)/(2^p - 1)     (4)

is called Richardson's extrapolation formula.
Example. With the trapezoidal rule and with the integrand f(x) having two continuous derivatives,

I ≈ T_{2n} + (1/3) [T_{2n} - T_n]

Example. With Simpson's rule and with the integrand f(x) having four continuous derivatives,

I ≈ S_{2n} + (1/15) [S_{2n} - S_n]

We can also use the formula (2) to obtain error estimation formulas:

I - I_{2n} ≈ (I_{2n} - I_n)/(2^p - 1)     (5)

This is called Richardson's error estimate. For example, with the trapezoidal rule,

I - T_{2n} ≈ (1/3) [T_{2n} - T_n]

These formulas are illustrated for the trapezoidal rule in an accompanying table, for

∫_0^π e^x cos x dx = -(e^π + 1)/2 ≐ -12.07034632
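In MATLAB, Richardson extrapolation and the Richardson error estimate for the trapezoidal rule (p = 2) can be sketched as follows, again using the trapcomp helper; the variable names are our own.

f = @(x) exp(x).*cos(x);
a = 0;  b = pi;  n = 8;
Tn   = trapcomp(f, a, b, n);
T2n  = trapcomp(f, a, b, 2*n);
Iext = T2n + (T2n - Tn)/3;             % Richardson extrapolation (p = 2)
Eest = (T2n - Tn)/3;                   % Richardson error estimate for T_2n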
AITKEN EXTRAPOLATION

In this case, we again assume

I - I_n ≈ c/n^p

But in contrast to previously, we do not know either c or p. Imagine computing I_n, I_{2n}, and I_{4n}. Then

I - I_n ≈ c/n^p,   I - I_{2n} ≈ c/(2^p n^p),   I - I_{4n} ≈ c/(4^p n^p)

We can directly try to estimate I. Dividing,

(I - I_n)/(I - I_{2n}) ≈ 2^p ≈ (I - I_{2n})/(I - I_{4n})

Solving for I, we obtain

(I - I_{2n})^2 ≈ (I - I_n)(I - I_{4n})
I (I_n + I_{4n} - 2I_{2n}) ≈ I_n I_{4n} - I_{2n}^2
I ≈ (I_n I_{4n} - I_{2n}^2)/(I_n + I_{4n} - 2I_{2n})

This can be improved computationally, to avoid loss-of-significance errors:

I ≈ I_{4n} + [(I_n I_{4n} - I_{2n}^2)/(I_n + I_{4n} - 2I_{2n}) - I_{4n}]
  = I_{4n} - (I_{4n} - I_{2n})^2 / [(I_{4n} - I_{2n}) - (I_{2n} - I_n)]

This is called Aitken's extrapolation formula.
To estimate p, we use

(I_{2n} - I_n)/(I_{4n} - I_{2n}) ≈ 2^p

To see this, write

(I_{2n} - I_n)/(I_{4n} - I_{2n}) = [(I - I_n) - (I - I_{2n})]/[(I - I_{2n}) - (I - I_{4n})]

Then substitute from the following and simplify:

I - I_n ≈ c/n^p,   I - I_{2n} ≈ c/(2^p n^p),   I - I_{4n} ≈ c/(4^p n^p)
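Given three computed values I_n, I_2n, I_4n, Aitken's formula and the estimate of p take only two lines of MATLAB (a sketch; the variable names are our own):

% In, I2n, I4n: integrals computed with n, 2n and 4n subintervals.
p_est  = log2( (I2n - In)/(I4n - I2n) );                        % estimate of p
I_aitk = I4n - (I4n - I2n)^2 / ((I4n - I2n) - (I2n - In));      % Aitken estimate of I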
Example. Consider the following table of numerical integrals. What is its order of convergence?

n     I_n             I_n - I_{n/2}   Ratio
2     .28451779686
4     .28559254576    1.075E-3
8     .28570248748    1.099E-4        9.78
16    .28571317731    1.069E-5        10.28
32    .28571418363    1.006E-6        10.62
64    .28571427643    9.280E-8        10.84

It appears that 2^p ≐ 10.84, so p ≐ log_2(10.84) = 3.44.
We could now combine this with Richardson's error formula to estimate the error:

I - I_n ≈ [1/(2^p - 1)] (I_n - I_{n/2})

For example,

I - I_64 ≈ (1/9.84)(9.280E-8) = 9.43E-9
PERIODIC FUNCTIONS

A function f(x) is periodic if the following condition is satisfied. There is a smallest real number τ > 0 for which

f(x + τ) = f(x),   -∞ < x < ∞     (6)

The number τ is called the period of the function f(x). The constant function f(x) ≡ 1 is also considered periodic, but it satisfies this condition with any τ > 0. Basically, a periodic function is one which repeats itself over intervals of length τ.
The condition (6) implies

f^(m)(x + τ) = f^(m)(x),   -∞ < x < ∞     (7)

for the m-th derivative of f(x), provided there is such a derivative. Thus the derivatives are also periodic.
Periodic functions occur very frequently in applications of mathematics, reflecting the periodicity of many phenomena in the physical world.
PERIODIC INTEGRANDS

Consider the special class of integrals

I(f) = ∫_a^b f(x) dx

in which f(x) is periodic, with b - a an integer multiple of the period of f(x). In this case, the performance of the trapezoidal rule and other numerical integration rules is much better than that predicted by the earlier error formulas.
To hint at this improved performance, recall

∫_a^b f(x) dx - T_n(f) ≈ Ẽ_n(f) ≡ -(h^2/12) [f'(b) - f'(a)]

With our assumption on the periodicity of f(x), we have

f(a) = f(b),   f'(a) = f'(b)

Therefore,

Ẽ_n(f) = 0

and we should expect improved performance in the convergence behaviour of the trapezoidal sums T_n(f).
If in addition to being periodic on [a, b], the integrand f(x) also has m continuous derivatives, then it can be shown that

I(f) - T_n(f) = c/n^m + smaller terms

By "smaller terms" we mean terms which decrease to zero more rapidly than n^{-m}.
Thus if f(x) is periodic with b - a an integer multiple of the period of f(x), and if f(x) is infinitely differentiable, then the error I - T_n decreases to zero more rapidly than n^{-m} for any m > 0. For periodic integrands, the trapezoidal rule is an optimal numerical integration method.
Example. Consider evaluating

I = ∫_0^{2π} sin x / (1 + e^{sin x}) dx

Using the trapezoidal rule, we have the results in the following table. In this case, the formulas based on Richardson extrapolation are no longer valid.

n     T_n                  T_n - T_{n/2}
2     0.0
4     0.72589193317292     7.259E-1
8     0.74006131211583     1.417E-2
16    0.74006942337672     8.111E-6
32    0.74006942337946     2.746E-12
64    0.74006942337946     0.0
NUMERICAL INTEGRATION:
ANOTHER APPROACH

We look for numerical integration formulas

∫_{-1}^{1} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

which are to be exact for polynomials of as large a degree as possible. There are no restrictions placed on the nodes {x_j} nor the weights {w_j} in working towards that goal. The motivation is that if the formula is exact for high degree polynomials, then perhaps it will be very accurate when integrating functions that are well approximated by polynomials.
There is no guarantee that such an approach will work. In fact, it turns out to be a bad idea when the node points {x_j} are required to be evenly spaced over the interval of integration. But without this restriction on {x_j} we are able to develop a very accurate set of quadrature formulas.

The case n = 1. We want a formula

w_1 f(x_1) ≈ ∫_{-1}^{1} f(x) dx

The weight w_1 and the node x_1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. To do this we substitute f(x) = 1 and f(x) = x. The first choice leads to

w_1 · 1 = ∫_{-1}^{1} 1 dx,   so   w_1 = 2

The choice f(x) = x leads to

w_1 x_1 = ∫_{-1}^{1} x dx = 0,   so   x_1 = 0

The desired formula is

∫_{-1}^{1} f(x) dx ≈ 2 f(0)

It is called the midpoint rule.
The case n = 2. We want a formula

w_1 f(x_1) + w_2 f(x_2) ≈ ∫_{-1}^{1} f(x) dx

The weights w_1, w_2 and the nodes x_1, x_2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We substitute and force equality for

f(x) = 1, x, x^2, x^3

This leads to the system

w_1 + w_2 = ∫_{-1}^{1} 1 dx = 2
w_1 x_1 + w_2 x_2 = ∫_{-1}^{1} x dx = 0
w_1 x_1^2 + w_2 x_2^2 = ∫_{-1}^{1} x^2 dx = 2/3
w_1 x_1^3 + w_2 x_2^3 = ∫_{-1}^{1} x^3 dx = 0

The solution is given by

w_1 = w_2 = 1,   x_1 = -1/√3,   x_2 = 1/√3

This yields the formula

∫_{-1}^{1} f(x) dx ≈ f(-1/√3) + f(1/√3)     (1)

We say it has degree of precision equal to 3 since it integrates exactly all polynomials of degree ≤ 3. We can verify directly that it does not integrate exactly f(x) = x^4:

∫_{-1}^{1} x^4 dx = 2/5,   whereas   f(-1/√3) + f(1/√3) = 2/9

Thus (1) has degree of precision exactly 3.

EXAMPLE. Integrate

∫_{-1}^{1} dx/(3 + x) = log 2 ≐ 0.69314718

The formula (1) yields

1/(3 + x_1) + 1/(3 + x_2) = 0.69230769

Error ≐ .000839
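A short MATLAB sketch of applying the two-point formula (1) to this example:

f   = @(x) 1./(3 + x);
G2  = f(-1/sqrt(3)) + f(1/sqrt(3));    % both weights equal 1
err = log(2) - G2;                     % about 8.4e-4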
THE GENERAL CASE

We want to find the weights {w_i} and nodes {x_i} so as to have

∫_{-1}^{1} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

be exact for polynomials f(x) of as large a degree as possible. As unknowns, there are n weights w_i and n nodes x_i. Thus it makes sense to initially impose 2n conditions so as to obtain 2n equations for the 2n unknowns. We require the quadrature formula to be exact for the cases

f(x) = x^i,   i = 0, 1, 2, ..., 2n - 1

Then we obtain the system of equations

w_1 x_1^i + w_2 x_2^i + ··· + w_n x_n^i = ∫_{-1}^{1} x^i dx

for i = 0, 1, 2, ..., 2n - 1. For the right sides,

∫_{-1}^{1} x^i dx = 2/(i + 1) for i = 0, 2, ..., 2n - 2,   and 0 for i = 1, 3, ..., 2n - 1

This system of equations has a solution, and the solution is unique except for re-ordering the unknowns. The resulting numerical integration rule is called Gaussian quadrature.
In fact, the nodes and weights are not found by solving this system. Rather, the nodes and weights have other properties which enable them to be found more easily by other methods. There are programs to produce them; and most subroutine libraries have either a program to produce them or tables of them for commonly used cases.
CHANGE OF INTERVAL
OF INTEGRATION

Integrals on other finite intervals [a, b] can be converted to integrals over [-1, 1], as follows:

∫_a^b F(x) dx = [(b - a)/2] ∫_{-1}^{1} F( (b + a + t(b - a))/2 ) dt

based on the change of integration variable

x = (b + a + t(b - a))/2,   -1 ≤ t ≤ 1

EXAMPLE. Over the interval [0, π], use

x = (1 + t) π/2

Then

∫_0^π F(x) dx = (π/2) ∫_{-1}^{1} F( (1 + t) π/2 ) dt
AN ERROR FORMULA

The usual error formula for the Gaussian quadrature formula,

E_n(f) = ∫_{-1}^{1} f(x) dx - Σ_{j=1}^{n} w_j f(x_j)

is not particularly intuitive. It is given by

E_n(f) = e_n f^(2n)(c_n)/(2n)!
e_n = 2^{2n+1} (n!)^4 / [(2n + 1) ((2n)!)^2] ≈ π/4^n

for some -1 ≤ c_n ≤ 1.
To help in understanding the implications of this error formula, introduce

M_k = max_{-1≤x≤1} |f^(k)(x)| / k!

With many integrands f(x), this sequence {M_k} is bounded or even decreases to zero. For example,

f(x) = cos x  gives  M_k ≤ 1/k!,        f(x) = 1/(2 + x)  gives  M_k ≤ 1

Then for our error formula,

E_n(f) = e_n f^(2n)(c_n)/(2n)!,   so   |E_n(f)| ≤ e_n M_{2n}     (2)

By other methods, we can show

e_n ≈ π/4^n

When combined with (2) and an assumption of uniform boundedness for {M_k}, we have that the error decreases by a factor of at least 4 with each increase of n to n + 1. Compare this to the convergence of the trapezoidal and Simpson rules for such functions, to help explain the very rapid convergence of Gaussian quadrature.
A SECOND ERROR FORMULA

Let f(x) be continuous for a ≤ x ≤ b; let n ≥ 1. Then, for the Gaussian numerical integration formula

I ≡ ∫_a^b f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j) ≡ I_n

on [a, b], the error in I_n satisfies

|I(f) - I_n(f)| ≤ 2 (b - a) ρ_{2n-1}(f)     (3)

Here ρ_{2n-1}(f) is the minimax error of degree 2n - 1 for f(x) on [a, b]:

ρ_m(f) = min_{deg(p)≤m} [ max_{a≤x≤b} |f(x) - p(x)| ],   m ≥ 0

EXAMPLE. Let f(x) = e^{-x^2}. Then the minimax errors ρ_m(f) are given in the following table.

m    ρ_m(f)      m     ρ_m(f)
1    5.30E-2     6     7.82E-6
2    1.79E-2     7     4.62E-7
3    6.63E-4     8     9.64E-8
4    4.63E-4     9     8.05E-9
5    1.62E-5     10    9.16E-10

Using this table, apply (3) to

I = ∫_0^1 e^{-x^2} dx

For n = 3, (3) implies

|I - I_3| ≤ 2 ρ_5(e^{-x^2}) ≐ 3.24 × 10^{-5}

The actual error is 9.55E-6.
INTEGRATING
A NON-SMOOTH INTEGRAND

Consider using Gaussian quadrature to evaluate

I = ∫_0^1 √x dx = 2/3

n     I - I_n    Ratio
2     7.22E-3
4     1.16E-3    6.2
8     1.69E-4    6.9
16    2.30E-5    7.4
32    3.00E-6    7.6
64    3.84E-7    7.8

The column labeled Ratio is defined by

(I - I_{n/2}) / (I - I_n)

It is consistent with I - I_n ≈ c/n^3, which can be proven theoretically. In comparison, for the trapezoidal and Simpson rules, I - I_n ≈ c/n^{1.5}.
WEIGHTED GAUSSIAN QUADRATURE

Consider needing to evaluate integrals such as

∫_0^1 f(x) log x dx,     ∫_0^1 x^{1/3} f(x) dx

How do we proceed? Consider numerical integration formulas

∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

in which f(x) is considered a "nice" function (one with several continuous derivatives). The function w(x) is allowed to be singular, but it must be integrable. We assume here that [a, b] is a finite interval. The function w(x) is called a weight function, and it is implicitly absorbed into the definition of the quadrature weights {w_i}. We again determine the nodes {x_i} and weights {w_i} so as to make the integration formula exact for f(x) a polynomial of as large a degree as possible.
The resulting numerical integration formula

∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

is called a Gaussian quadrature formula with weight function w(x). We determine the nodes {x_i} and weights {w_i} by requiring exactness in the above formula for

f(x) = x^i,   i = 0, 1, 2, ..., 2n - 1

To make the derivation more understandable, we consider the particular case

∫_0^1 x^{1/3} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)

We follow the same pattern as used earlier.
The case n = 1. We want a formula

w_1 f(x_1) ≈ ∫_0^1 x^{1/3} f(x) dx

The weight w_1 and the node x_1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. Choosing f(x) = 1, we have

w_1 = ∫_0^1 x^{1/3} dx = 3/4

Choosing f(x) = x, we have

w_1 x_1 = ∫_0^1 x^{1/3} · x dx = 3/7,   so   x_1 = 4/7

Thus

∫_0^1 x^{1/3} f(x) dx ≈ (3/4) f(4/7)

has degree of precision 1.
The case n = 2. We want a formula

w_1 f(x_1) + w_2 f(x_2) ≈ ∫_0^1 x^{1/3} f(x) dx

The weights w_1, w_2 and the nodes x_1, x_2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We determine them by requiring equality for

f(x) = 1, x, x^2, x^3

This leads to the system

w_1 + w_2 = ∫_0^1 x^{1/3} dx = 3/4
w_1 x_1 + w_2 x_2 = ∫_0^1 x · x^{1/3} dx = 3/7
w_1 x_1^2 + w_2 x_2^2 = ∫_0^1 x^2 · x^{1/3} dx = 3/10
w_1 x_1^3 + w_2 x_2^3 = ∫_0^1 x^3 · x^{1/3} dx = 3/13

The solution is

x_1 = 7/13 - (3/65)√35,   x_2 = 7/13 + (3/65)√35
w_1 = 3/8 - (3/392)√35,   w_2 = 3/8 + (3/392)√35

Numerically,

x_1 = .2654117024,   x_2 = .8115113746
w_1 = .3297238792,   w_2 = .4202761208

The formula

∫_0^1 x^{1/3} f(x) dx ≈ w_1 f(x_1) + w_2 f(x_2)     (4)

has degree of precision 3.
EXAMPLE. Consider evaluating the integral

∫_0^1 x^{1/3} cos x dx     (5)

In applying (4), we take f(x) = cos x. Then

w_1 f(x_1) + w_2 f(x_2) = 0.6074977951

The true answer is

∫_0^1 x^{1/3} cos x dx ≐ 0.6076257393

and our numerical answer is in error by E_2 ≐ .000128. This is quite a good answer involving very little computational effort (once the formula has been determined). In contrast, the trapezoidal and Simpson rules applied to (5) would converge very slowly because the first derivative of the integrand is singular at the origin.

CHANGE OF VARIABLES

As a side note to the preceding example, we observe that the change of variables x = t^3 transforms the integral (5) to

3 ∫_0^1 t^3 cos(t^3) dt

and both the trapezoidal and Simpson rules will perform better with this formula, although still not as well as our weighted Gaussian quadrature.
A change of the integration variable can often improve the performance of a standard method, usually by increasing the differentiability of the integrand.
EXAMPLE. Using x = t^r for some r > 1, we have

∫_0^1 g(x) log x dx = r^2 ∫_0^1 t^{r-1} g(t^r) log t dt

The new integrand is generally smoother than the original one.
INTERPOLATION

Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).
As an example, consider defining

x_0 = 0,   x_1 = π/4,   x_2 = π/2

and

y_i = cos x_i,   i = 0, 1, 2

This gives us the three points

(0, 1),   (π/4, 1/√2),   (π/2, 0)

Now find a quadratic polynomial

p(x) = a_0 + a_1 x + a_2 x^2

for which

p(x_i) = y_i,   i = 0, 1, 2

The graph of this polynomial is shown in the accompanying figure. We later give an explicit formula.

[Figure: quadratic interpolation of cos(x) on [0, π/2], showing y = cos(x) and y = p_2(x) through the nodes 0, π/4, π/2.]
PURPOSES OF INTERPOLATION
1. Replace a set of data points {(x
i
, y
i
)} with a func-
tion given analytically.
2. Approximate functions with simpler ones, usually
polynomials or piecewise polynomials.
Purpose #1 has several aspects.
The data may be from a known class of functions.
Interpolation is then used to nd the member of
this class of functions that agrees with the given
data. For example, data may be generated from
functions of the form
p(x) = a
0
+ a
1
e
x
+ a
2
e
2x
+ + a
n
e
nx
Then we need to nd the coecients
n
a
j
o
based
on the given data values.
We may want to take function values f(x) given
in a table for selected values of x, often equally
spaced, and extend the function to values of x
not in the table.
For example, given numbers from a table of loga-
rithms, estimate the logarithm of a number x not
in the table.
Given a set of data points {(x
i
, y
i
)}, nd a curve
passing thru these points that is pleasing to the
eye. In fact, this is what is done continually with
computer graphics. How do we connect a set of
points to make a smooth curve? Connecting them
with straight line segments will often give a curve
with many corners, whereas what was intended
was a smooth curve.
Purpose #2 for interpolation is to approximate func-
tions f(x) by simpler functions p(x), perhaps to make
it easier to integrate or dierentiate f(x). That will
be the primary reason for studying interpolation in this
course.
As as example of why this is important, consider the
problem of evaluating
I =
Z
1
0
dx
1 + x
10
This is very dicult to do analytically. But we will
look at producing polynomial interpolants of the inte-
grand; and polynomials are easily integrated exactly.
We begin by using polynomials as our means of doing
interpolation. Later in the chapter, we consider more
complex piecewise polynomial functions, often called
spline functions.
LINEAR INTERPOLATION
The simplest form of interpolation is probably the
straight line, connecting two points by a straight line.
Let two data points (x
0
, y
0
) and (x
1
, y
1
) be given.
There is a unique straight line passing through these
points. We can write the formula for a straight line
as
P
1
(x) = a
0
+ a
1
x
In fact, there are other more convenient ways to write
it, and we give several of them below.
P
1
(x) =
x x
1
x
0
x
1
y
0
+
x x
0
x
1
x
0
y
1
=
(x
1
x) y
0
+ (x x
0
) y
1
x
1
x
0
= y
0
+
x x
0
x
1
x
0
[y
1
y
0
]
= y
0
+

y
1
y
0
x
1
x
0
!
(x x
0
)
Check each of these by evaluating them at x = x
0
and x
1
to see if the respective values are y
0
and y
1
.
Example. Following is a table of values for f(x) =
tan x for a few values of x.
x 1 1.1 1.2 1.3
tan x 1.5574 1.9648 2.5722 3.6021
Use linear interpolation to estimate tan(1.15). Then
use
x
0
= 1.1, x
1
= 1.2
with corresponding values for y
0
and y
1
. Then
tan x y
0
+
x x
0
x
1
x
0
[y
1
y
0
]
tan x y
0
+
x x
0
x
1
x
0
[y
1
y
0
]
tan (1.15) 1.9648 +
1.15 1.1
1.2 1.1
[2.5722 1.9648]
= 2.2685
The true value is tan 1.15 = 2.2345. We will want
to examine formulas for the error in interpolation, to
know when we have sucient accuracy in our inter-
polant.
x
y
1 1.3
y=tan(x)
x
y
1.1 1.2
y = tan(x)
y = p
1
(x)
QUADRATIC INTERPOLATION

We want to find a polynomial

P_2(x) = a_0 + a_1 x + a_2 x^2

which satisfies

P_2(x_i) = y_i,   i = 0, 1, 2

for given data points (x_0, y_0), (x_1, y_1), (x_2, y_2). One formula for such a polynomial follows:

P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)     (∗)

with

L_0(x) = (x - x_1)(x - x_2) / [(x_0 - x_1)(x_0 - x_2)]
L_1(x) = (x - x_0)(x - x_2) / [(x_1 - x_0)(x_1 - x_2)]
L_2(x) = (x - x_0)(x - x_1) / [(x_2 - x_0)(x_2 - x_1)]

The formula (∗) is called Lagrange's form of the interpolation polynomial.
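A short MATLAB sketch of evaluating P_2(x) from Lagrange's form for arbitrary data (the function name lagrange2 is our own):

function p = lagrange2(x, xn, yn)
% Quadratic Lagrange interpolant evaluated at the points x, given
% nodes xn = [x0 x1 x2] and data values yn = [y0 y1 y2].
L0 = (x - xn(2)).*(x - xn(3)) / ((xn(1) - xn(2))*(xn(1) - xn(3)));
L1 = (x - xn(1)).*(x - xn(3)) / ((xn(2) - xn(1))*(xn(2) - xn(3)));
L2 = (x - xn(1)).*(x - xn(2)) / ((xn(3) - xn(1))*(xn(3) - xn(2)));
p = yn(1)*L0 + yn(2)*L1 + yn(3)*L2;
end

% Example: interpolate cos(x) at 0, pi/4, pi/2 and evaluate at x = 1:
%   lagrange2(1, [0 pi/4 pi/2], cos([0 pi/4 pi/2]))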
LAGRANGE BASIS FUNCTIONS
The functions
L
0
(x) =
(xx
1
)(xx
2
)
(x
0
x
1
)(x
0
x
2
)
, L
1
(x) =
(xx
0
)(xx
2
)
(x
1
x
0
)(x
1
x
2
)
L
2
(x) =
(xx
0
)(xx
1
)
(x
2
x
0
)(x
2
x
1
)
are called Lagrange basis functions for quadratic in-
terpolation. They have the properties
L
i
(x
j
) =
(
1, i = j
0, i 6= j
for i, j = 0, 1, 2. Also, they all have degree 2. Their
graphs are on an accompanying page.
As a consequence of each L
i
(x) being of degree 2, we
have that the interpolant
P
2
(x) = y
0
L
0
(x) + y
1
L
1
(x) + y
2
L
2
(x)
must have degree 2.
UNIQUENESS
Can there be another polynomial, call it Q(x), for
which
deg(Q) 2
Q(x
i
) = y
i
, i = 0, 1, 2
Thus, is the Lagrange formula P
2
(x) unique?
Introduce
R(x) = P
2
(x) Q(x)
From the properties of P
2
and Q, we have deg(R)
2. Moreover,
R(x
i
) = P
2
(x
i
) Q(x
i
) = y
i
y
i
= 0
for all three node points x
0
, x
1
, and x
2
. How many
polynomials R(x) are there of degree at most 2 and
having three distinct zeros? The answer is that only
the zero polynomial satises these properties, and there-
fore
R(x) = 0 for all x
Q(x) = P
2
(x) for all x
SPECIAL CASES
Consider the data points
(x
0
, 1), (x
1
, 1), (x
2
, 1)
What is the polynomial P
2
(x) in this case?
Answer: We must have the polynomial interpolant is
P
2
(x) 1
meaning that P
2
(x) is the constant function. Why?
First, the constant function satises the property of
being of degree 2. Next, it clearly interpolates the
given data. Therefore by the uniqueness of quadratic
interpolation, P
2
(x) must be the constant function 1.
Consider now the data points
(x
0
, mx
0
), (x
1
, mx
1
), (x
2
, mx
2
)
for some constant m. What is P
2
(x) in this case? By
an argument similar to that above,
P
2
(x) = mx for all x
Thus the degree of P
2
(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by poly-
nomials of a general degree n. We want to nd a
polynomial P
n
(x) for which
deg(P
n
) n
P
n
(x
i
) = y
i
, i = 0, 1, , n
()
with given data points
(x
0
, y
0
) , (x
1
, y
1
) , , (x
n
, y
n
)
The solution is given by Lagranges formula
P
n
(x) = y
0
L
0
(x) + y
1
L
1
(x) + + y
n
L
n
(x)
The Lagrange basis functions are given by
L
k
(x) =
(x x
0
) ..(x x
k1
)(x x
k+1
).. (x x
n
)
(x
k
x
0
) ..(x
k
x
k1
)(x
k
x
k+1
).. (x
k
x
n
)
for k = 0, 1, 2, ..., n. The quadratic case was covered
earlier.
In a manner analogous to the quadratic case, we can
show that the above P
n
(x) is the only solution to the
problem ().
In the formula
L
k
(x) =
(x x
0
) ..(x x
k1
)(x x
k+1
).. (x x
n
)
(x
k
x
0
) ..(x
k
x
k1
)(x
k
x
k+1
).. (x
k
x
n
)
we can see that each such function is a polynomial of
degree n. In addition,
L
k
(x
i
) =
(
1, k = i
0, k 6= i
Using these properties, it follows that the formula
P
n
(x) = y
0
L
0
(x) + y
1
L
1
(x) + + y
n
L
n
(x)
satises the interpolation problem of nding a solution
to
deg(P
n
) n
P
n
(x
i
) = y
i
, i = 0, 1, , n
EXAMPLE
Recall the table
x 1 1.1 1.2 1.3
tan x 1.5574 1.9648 2.5722 3.6021
We now interpolate this table with the nodes
x
0
= 1, x
1
= 1.1, x
2
= 1.2, x
3
= 1.3
Without giving the details of the evaluation process,
we have the following results for interpolation with
degrees n = 1, 2, 3.
n 1 2 3
P
n
(1.15) 2.2685 2.2435 2.2296
Error .0340 .0090 .0049
It improves with increasing degree n, but not at a very
rapid rate. In fact, the error becomes worse when n is
increased further. Later we will see that interpolation
of a much higher degree, say n 10, is often poorly
behaved when the node points {x
i
} are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x
0
and x
1
, dene
f[x
0
, x
1
] =
f(x
1
) f(x
0
)
x
1
x
0
This is called a rst order divided dierence of f(x).
By the Mean-value theorem,
f(x
1
) f(x
0
) = f
0
(c) (x
1
x
0
)
for some c between x
0
and x
1
. Thus
f[x
0
, x
1
] = f
0
(c)
and the divided dierence in very much like the deriv-
ative, especially if x
0
and x
1
are quite close together.
In fact,
f
0

x
1
+ x
0
2

f[x
0
, x
1
]
is quite an accurate approximation of the derivative
(see 5.4).
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x
0
, x
1
, and x
2
, dene
f[x
0
, x
1
, x
2
] =
f[x
1
, x
2
] f[x
0
, x
1
]
x
2
x
0
This is called the second order divided dierence of
f(x).
By a fairly complicated argument, we can show
f[x
0
, x
1
, x
2
] =
1
2
f
00
(c)
for some c intermediate to x
0
, x
1
, and x
2
. In fact, as
we investigate in 5.4,
f
00
(x
1
) 2f[x
0
, x
1
, x
2
]
in the case the nodes are evenly spaced,
x
1
x
0
= x
2
x
1
EXAMPLE
Consider the table
x 1 1.1 1.2 1.3 1.4
cos x .54030 .45360 .36236 .26750 .16997
Let x
0
= 1, x
1
= 1.1, and x
2
= 1.2. Then
f[x
0
, x
1
] =
.45360 .54030
1.1 1
= .86700
f[x
1
, x
2
] =
.36236 .45360
1.1 1
= .91240
f[x
0
, x
1
, x
2
] =
f[x
1
, x
2
] f[x
0
, x
1
]
x
2
x
0
=
.91240 (.86700)
1.2 1.0
= .22700
For comparison,
f
0

x
1
+ x
0
2

= sin (1.05) = .86742


1
2
f
00
(x
1
) =
1
2
cos (1.1) = .22680
GENERAL DIVIDED DIFFERENCES
Given n + 1 distinct points x
0
, ..., x
n
, with n 2,
dene
f[x
0
, ..., x
n
] =
f[x
1
, ..., x
n
] f[x
0
, ..., x
n1
]
x
n
x
0
This is a recursive denition of the n
th
-order divided
dierence of f(x), using divided dierences of order
n. Its relation to the derivative is as follows:
f[x
0
, ..., x
n
] =
1
n!
f
(n)
(c)
for some c intermediate to the points {x
0
, ..., x
n
}. Let
I denote the interval
I = [min {x
0
, ..., x
n
} , max {x
0
, ..., x
n
}]
Then c I, and the above result is based on the
assumption that f(x) is n-times continuously dier-
entiable on the interval I.
EXAMPLE
The following table gives divided dierences for the
data in
x 1 1.1 1.2 1.3 1.4
cos x .54030 .45360 .36236 .26750 .16997
For the column headings, we use
D
k
f(x
i
) = f[x
i
, ..., x
i+k
]
i x
i
f(x
i
) Df(x
i
) D
2
f(x
i
) D
3
f(x
i
) D
4
f(x
i
)
0 1.0 .54030 -.8670 -.2270 .1533 .0125
1 1.1 .45360 -.9124 -.1810 .1583
2 1.2 .36236 -.9486 -.1335
3 1.3 .26750 -.9753
4 1.4 .16997
These were computed using the recursive denition
f[x
0
, ..., x
n
] =
f[x
1
, ..., x
n
] f[x
0
, ..., x
n1
]
x
n
x
0
ORDER OF THE NODES
Looking at f[x
0
, x
1
], we have
f[x
0
, x
1
] =
f(x
1
) f(x
0
)
x
1
x
0
=
f(x
0
) f(x
1
)
x
0
x
1
= f[x
1
, x
0
]
The order of x
0
and x
1
does not matter. Looking at
f[x
0
, x
1
, x
2
] =
f[x
1
, x
2
] f[x
0
, x
1
]
x
2
x
0
we can expand it to get
f[x
0
, x
1
, x
2
] =
f(x
0
)
(x
0
x
1
) (x
0
x
2
)
+
f(x
1
)
(x
1
x
0
) (x
1
x
2
)
+
f(x
2
)
(x
2
x
0
) (x
2
x
1
)
With this formula, we can show that the order of the
arguments x
0
, x
1
, x
2
does not matter in the nal value
of f[x
0
, x
1
, x
2
] we obtain. Mathematically,
f[x
0
, x
1
, x
2
] = f[x
i
0
, x
i
1
, x
i
2
]
for any permutation (i
0
, i
1
, i
2
) of (0, 1, 2).
We can show in general that the value of f[x
0
, ..., x
n
]
is independent of the order of the arguments {x
0
, ..., x
n
},
even though the intermediate steps in its calculations
using
f[x
0
, ..., x
n
] =
f[x
1
, ..., x
n
] f[x
0
, ..., x
n1
]
x
n
x
0
are order dependent.
We can show
f[x
0
, ..., x
n
] = f[x
i
0
, ..., x
i
n
]
for any permutation (i
0
, i
1
, ..., i
n
) of (0, 1, ..., n).
COINCIDENT NODES
What happens when some of the nodes {x
0
, ..., x
n
}
are not distinct. Begin by investigating what happens
when they all come together as a single point x
0
.
For rst order divided dierences, we have
lim
x
1
x
0
f[x
0
, x
1
] = lim
x
1
x
0
f(x
1
) f(x
0
)
x
1
x
0
= f
0
(x
0
)
We extend the denition of f[x
0
, x
1
] to coincident
nodes using
f[x
0
, x
0
] = f
0
(x
0
)
For second order divided dierences, recall
f[x
0
, x
1
, x
2
] =
1
2
f
00
(c)
with c intermediate to x
0
, x
1
, and x
2
.
Then as x
1
x
0
and x
2
x
0
, we must also have
that c x
0
. Therefore,
lim
x
1
x
0
x
2
x
0
f[x
0
, x
1
, x
2
] =
1
2
f
00
(x
0
)
We therefore dene
f[x
0
, x
0
, x
0
] =
1
2
f
00
(x
0
)
For the case of general f[x
0
, ..., x
n
], recall that
f[x
0
, ..., x
n
] =
1
n!
f
(n)
(c)
for some c intermediate to {x
0
, ..., x
n
}. Then
lim
{x1,...,x
n
}x
0
f[x
0
, ..., x
n
] =
1
n!
f
(n)
(x
0
)
and we dene
f[x
0
, ..., x
0
| {z }
]
n+1 times
=
1
n!
f
(n)
(x
0
)
What do we do when only some of the nodes are
coincident. This too can be dealt with, although we
do so here only by examples.
f[x
0
, x
1
, x
1
] =
f[x
1
, x
1
] f[x
0
, x
1
]
x
1
x
0
=
f
0
(x
1
) f[x
0
, x
1
]
x
1
x
0
The recursion formula can be used in general in this
way to allow all possible combinations of possibly co-
incident nodes.
LAGRANGES FORMULA FOR
THE INTERPOLATION POLYNOMIAL
Recall the general interpolation problem: nd a poly-
nomial P
n
(x) for which
deg(P
n
) n
P
n
(x
i
) = y
i
, i = 0, 1, , n
with given data points
(x
0
, y
0
) , (x
1
, y
1
) , , (x
n
, y
n
)
and with {x
0
, ..., x
n
} distinct points.
In 5.1, we gave the solution as Lagranges formula
P
n
(x) = y
0
L
0
(x) + y
1
L
1
(x) + + y
n
L
n
(x)
with {L
0
(x), ..., L
n
(x)} the Lagrange basis polynomi-
als. Each L
j
is of degree n and it satises
L
j
(x
i
) =
(
1, j = i
0, j 6= i
for i = 0, 1, ..., n.
THE NEWTON DIVIDED DIFFERENCE FORM
OF THE INTERPOLATION POLYNOMIAL

Let the data values for the problem

deg(P_n) ≤ n,   P_n(x_i) = y_i,   i = 0, 1, ..., n

be generated from a function f(x):

y_i = f(x_i),   i = 0, 1, ..., n

Using the divided differences

f[x_0, x_1], f[x_0, x_1, x_2], ..., f[x_0, ..., x_n]

we can write the interpolation polynomials

P_1(x), P_2(x), ..., P_n(x)

in a way that is simple to compute.

P_1(x) = f(x_0) + f[x_0, x_1] (x - x_0)
P_2(x) = f(x_0) + f[x_0, x_1] (x - x_0) + f[x_0, x_1, x_2] (x - x_0)(x - x_1)
       = P_1(x) + f[x_0, x_1, x_2] (x - x_0)(x - x_1)

For the case of the general problem, we have

P_n(x) = f(x_0) + f[x_0, x_1] (x - x_0)
       + f[x_0, x_1, x_2] (x - x_0)(x - x_1)
       + f[x_0, x_1, x_2, x_3] (x - x_0)(x - x_1)(x - x_2)
       + ···
       + f[x_0, ..., x_n] (x - x_0) ··· (x - x_{n-1})

From this we have the recursion relation

P_n(x) = P_{n-1}(x) + f[x_0, ..., x_n] (x - x_0) ··· (x - x_{n-1})

in which P_{n-1}(x) interpolates f(x) at the points in {x_0, ..., x_{n-1}}.
Example: Recall the table

i    x_i    f(x_i)    Df(x_i)    D^2 f(x_i)   D^3 f(x_i)   D^4 f(x_i)
0    1.0    .54030    -.8670     -.2270       .1533        .0125
1    1.1    .45360    -.9124     -.1810       .1583
2    1.2    .36236    -.9486     -.1335
3    1.3    .26750    -.9753
4    1.4    .16997

with D^k f(x_i) = f[x_i, ..., x_{i+k}], k = 1, 2, 3, 4. Then

P_1(x) = .5403 - .8670 (x - 1)
P_2(x) = P_1(x) - .2270 (x - 1)(x - 1.1)
P_3(x) = P_2(x) + .1533 (x - 1)(x - 1.1)(x - 1.2)
P_4(x) = P_3(x) + .0125 (x - 1)(x - 1.1)(x - 1.2)(x - 1.3)

Using this table and these formulas, we have the following table of interpolants for the value x = 1.05. The true value is cos(1.05) = .49757105.

n            1         2         3          4
P_n(1.05)    .49695    .49752    .49758     .49757
Error        6.20E-4   5.00E-5   -1.00E-5   0.0
EVALUATION OF THE DIVIDED DIFFERENCE
INTERPOLATION POLYNOMIAL

Let

d_1 = f[x_0, x_1]
d_2 = f[x_0, x_1, x_2]
...
d_n = f[x_0, ..., x_n]

Then the formula

P_n(x) = f(x_0) + f[x_0, x_1] (x - x_0) + f[x_0, x_1, x_2] (x - x_0)(x - x_1) + ··· + f[x_0, ..., x_n] (x - x_0) ··· (x - x_{n-1})

can be written as

P_n(x) = f(x_0) + (x - x_0) ( d_1 + (x - x_1) ( d_2 + ··· + (x - x_{n-2}) ( d_{n-1} + (x - x_{n-1}) d_n ) ··· ) )

Thus we have a nested polynomial evaluation, and this is quite efficient in computational cost.
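The divided-difference coefficients and the nested evaluation above can be sketched in MATLAB as follows (the names newtondd and newtoneval are our own):

function coef = newtondd(x, y)
% coef(k) is the divided difference f[x_0, ..., x_{k-1}] (MATLAB index k).
x = x(:).';  coef = y(:).';
n = length(x);
for k = 2:n
    coef(k:n) = (coef(k:n) - coef(k-1:n-1)) ./ (x(k:n) - x(1:n-k+1));
end
end

function p = newtoneval(t, x, coef)
% Nested evaluation of the Newton form of P_n at the points t.
n = length(x);
p = coef(n)*ones(size(t));
for k = n-1:-1:1
    p = coef(k) + (t - x(k)).*p;
end
end

% Example (the cos x table above):
%   x = 1:0.1:1.4;  y = cos(x);
%   newtoneval(1.05, x, newtondd(x, y))    % approximately cos(1.05)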
ERROR IN LINEAR INTERPOLATION
Let P
1
(x) denote the linear polynomial interpolating
f(x) at x
0
and x
1
, with f(x) a given function (e.g.
f(x) = cos x). What is the error f(x) P
1
(x)?
Let f(x) be twice continuously dierentiable on an in-
terval [a, b] which contains the points {x
0
, x
1
}. Then
for a x b,
f(x) P
1
(x) =
(x x
0
) (x x
1
)
2
f
00
(c
x
)
for some c
x
between the minimum and maximum of
x
0
, x
1
, and x.
If x
1
and x are close to x
0
, then
f(x) P
1
(x)
(x x
0
) (x x
1
)
2
f
00
(x
0
)
Thus the error acts like a quadratic polynomial, with
zeros at x
0
and x
1
.
EXAMPLE
Let f(x) = log
10
x; and in line with typical tables of
log
10
x, we take 1 x, x
0
, x
1
10. For deniteness,
let x
0
< x
1
with h = x
1
x
0
. Then
f
00
(x) =
log
10
e
x
2
log
10
x P
1
(x) =
(x x
0
) (x x
1
)
2
"

log
10
e
c
2
x
#
= (x x
0
) (x
1
x)
"
log
10
e
2c
2
x
#
We usually are interpolating with x
0
x x
1
; and
in that case, we have
(x x
0
) (x
1
x) 0, x
0
c
x
x
1
(x x
0
) (x
1
x) 0, x
0
c
x
x
1
and therefore
(x x
0
) (x
1
x)
"
log
10
e
2x
2
1
#
log
10
x P
1
(x)
(x x
0
) (x
1
x)
"
log
10
e
2x
2
0
#
For h = x
1
x
0
small, we have for x
0
x x
1
log
10
x P
1
(x) (x x
0
) (x
1
x)
"
log
10
e
2x
2
0
#
Typical high school algebra textbooks contain tables
of log
10
x with a spacing of h = .01. What is the
error in this case? To look at this, we use
0 log
10
x P
1
(x) (x x
0
) (x
1
x)
"
log
10
e
2x
2
0
#
By simple geometry or calculus,
max
x
0
xx
1
(x x
0
) (x
1
x)
h
2
4
Therefore,
0 log
10
x P
1
(x)
h
2
4
"
log
10
e
2x
2
0
#
.
= .0543
h
2
x
2
0
If we want a uniform bound for all points 1 x
0
10,
we have
0 log
10
x P
1
(x)
h
2
log
10
e
8
.
= .0543h
2
0 log
10
x P
1
(x) .0543h
2
For h = .01, as is typical of the high school text book
tables of log
10
x,
0 log
10
x P
1
(x) 5.43 10
6
If you look at most tables, a typical entry is given to
only four decimal places to the right of the decimal
point, e.g.
log 5.41
.
= .7332
Therefore the entries are in error by as much as .00005.
Comparing this with the interpolation error, we see the
latter is less important than the rounding errors in the
table entries.
From the bound
0 log
10
x P
1
(x)
h
2
log
10
e
8x
2
0
.
= .0543
h
2
x
2
0
we see the error decreases as x
0
increases, and it is
about 100 times smaller for points near 10 than for
points near 1.
AN ERROR FORMULA:
THE GENERAL CASE
Recall the general interpolation problem: nd a poly-
nomial P
n
(x) for which deg(P
n
) n
P
n
(x
i
) = f(x
i
), i = 0, 1, , n
with distinct node points {x
0
, ..., x
n
} and a given
function f(x). Let [a, b] be a given interval on which
f(x) is (n + 1)-times continuously dierentiable; and
assume the points x
0
, ..., x
n
, and x are contained in
[a, b]. Then
f(x)P
n
(x) =
(x x
0
) (x x
1
) (x x
n
)
(n + 1)!
f
(n+1)
(c
x
)
with c
x
some point between the minimum and maxi-
mum of the points in {x, x
0
, ..., x
n
}.
f(x)P
n
(x) =
(x x
0
) (x x
1
) (x x
n
)
(n + 1)!
f
(n+1)
(c
x
)
As shorthand, introduce

n
(x) = (x x
0
) (x x
1
) (x x
n
)
a polynomial of degree n + 1 with roots {x
0
, ..., x
n
}.
Then
f(x) P
n
(x) =

n
(x)
(n + 1)!
f
(n+1)
(c
x
)
THE QUADRATIC CASE
For n = 2, we have
f(x) P
2
(x) =
(x x
0
) (x x
1
) (x x
2
)
3!
f
(3)
(c
x
)
(*)
with c
x
some point between the minimum and maxi-
mum of the points in {x, x
0
, x
1
, x
2
}.
To illustrate the use of this formula, consider the case
of evenly spaced nodes:
x
1
= x
0
+ h, x
2
= x
1
+ h
Further suppose we have x
0
x x
2
, as we would
usually have when interpolating in a table of given
function values (e.g. log
10
x). The quantity

2
(x) = (x x
0
) (x x
1
) (x x
2
)
can be evaluated directly for a particular x.
Graph of

2
(x) = (x + h) x(x h)
using (x
0
, x
1
, x
2
) = (h, 0, h):
x
y
h
-h
In the formula (), however, we do not know c
x
, and
therefore we replace

f
(3)
(c
x
)

with a maximum of

f
(3)
(x)

as x varies over x
0
x x
2
. This yields
|f(x) P
2
(x)|
|
2
(x)|
3!
max
x
0
xx
2

f
(3)
(x)

(**)
If we want a uniform bound for x
0
x x
2
, we must
compute
max
x
0
xx
2
|
2
(x)| = max
x
0
xx
2
|(x x
0
) (x x
1
) (x x
2
)|
Using calculus,
max
x
0
xx
2
|
2
(x)| =
2h
3
3 sqrt(3)
, at x = x
1

h
sqrt(3)
Combined with (), this yields
|f(x) P
2
(x)|
h
3
9 sqrt(3)
max
x
0
xx
2

f
(3)
(x)

for x
0
x x
2
.
For f(x) = log
10
x, with 1 x
0
x x
2
10, this
leads to
|log
10
x P
2
(x)|
h
3
9 sqrt(3)
max
x
0
xx
2
2 log
10
e
x
3
=
.05572 h
3
x
3
0
For the case of h = .01, we have
|log
10
x P
2
(x)|
5.57 10
8
x
3
0
5.57 10
8
Question: How much larger could we make h so that
quadratic interpolation would have an error compa-
rable to that of linear interpolation of log
10
x with
h = .01? The error bound for the linear interpolation
was 5.43 10
6
, and therefore we want the same to
be true of quadratic interpolation. Using a simpler
bound, we want to nd h so that
|log
10
x P
2
(x)| .05572 h
3
5 10
6
This is true if h = .04477. Therefore a spacing of
h = .04 would be sucient. A table with this spac-
ing and quadratic interpolation would have an error
comparable to a table with h = .01 and linear inter-
polation.
For the case of general n,
f(x) P
n
(x) =
(x x
0
) (x x
n
)
(n + 1)!
f
(n+1)
(c
x
)
=

n
(x)
(n + 1)!
f
(n+1)
(c
x
)

n
(x) = (x x
0
) (x x
1
) (x x
n
)
with c
x
some point between the minimum and max-
imum of the points in {x, x
0
, ..., x
n
}. When bound-
ing the error we replace f
(n+1)
(c
x
) with its maximum
over the interval containing {x, x
0
, ..., x
n
}, as we have
illustrated earlier in the linear and quadratic cases.
Consider now the function

n
(x)
(n + 1)!
over the interval determined by the minimum and
maximum of the points in {x, x
0
, ..., x
n
}. For evenly
spaced node points on [0, 1], with x
0
= 0 and x
n
= 1,
we give graphs for n = 2, 3, 4, 5 and for n = 6, 7, 8, 9
on accompanying pages.
DISCUSSION OF ERROR
Consider the error
f(x) P
n
(x) =
(x x
0
) (x x
n
)
(n + 1)!
f
(n+1)
(c
x
)
=

n
(x)
(n + 1)!
f
(n+1)
(c
x
)

n
(x) = (x x
0
) (x x
1
) (x x
n
)
as n increases and as x varies. As noted previously, we
cannot do much with f
(n+1)
(c
x
) except to replace it
with a maximum value of

f
(n+1)
(x)

over a suitable
interval. Thus we concentrate on understanding the
size of

n
(x)
(n + 1)!
ERROR FOR EVENLY SPACED NODES
We consider rst the case in which the node points
are evenly spaced, as this seems the natural way to
dene the points at which interpolation is carried out.
Moreover, using evenly spaced nodes is the case to
consider for table interpolation. What can we learn
from the given graphs?
The interpolation nodes are determined by using
h =
1
n
, x
0
= 0, x
1
= h, x
2
= 2h, ..., x
n
= nh = 1
For this case,

n
(x) = x(x h) (x 2h) (x 1)
Our graphs are the cases of n = 2, ..., 9.
x
y
n = 2
1
x
y
n = 3
1
x
y
n = 4
1
x
y
n = 5
1
Graphs of
n
(x) on [0, 1] for n = 2, 3, 4, 5
x
y
n = 6
1
x
y
n = 7
1
x
y
n = 8
1
x
y
n = 9
1
Graphs of
n
(x) on [0, 1] for n = 6, 7, 8, 9
Graph of

6
(x) = (x x
0
) (x x
1
) (x x
6
)
with evenly spaced nodes:
x
x
0
x
1
x
2
x
3
x
4
x
5
x
6
Using the following table
,
n M
n
n M
n
1 1.25E1 6 4.76E7
2 2.41E2 7 2.20E8
3 2.06E3 8 9.11E10
4 1.48E4 9 3.39E11
5 9.01E6 10 1.15E12
we can observe that the maximum
M
n
max
x
0
xx
n
|
n
(x)|
(n + 1)!
becomes smaller with increasing n.
From the graphs, there is enormous variation in the
size of
n
(x) as x varies over [0, 1]; and thus there
is also enormous variation in the error as x so varies.
For example, in the n = 9 case,
max
x
0
xx
1
|
n
(x)|
(n + 1)!
= 3.39 10
11
max
x
4
xx
5
|
n
(x)|
(n + 1)!
= 6.89 10
13
and the ratio of these two errors is approximately 49.
Thus the interpolation error is likely to be around 49
times larger when x
0
x x
1
as compared to the
case when x
4
x x
5
. When doing table inter-
polation, the point x at which you are interpolating
should be centrally located with respect to the inter-
polation nodes m{x
0
, ..., x
n
} being used to dene the
interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation
polynomial to approximate a given function f(x) on
a given interval [a, b]. In particular, take interpolation
nodes
a x
0
< x
1
< < x
n1
< x
n
b
and produce the interpolation polynomial P
n
(x) that
interpolates f(x) at the given node points. We would
like to have
max
axb
|f(x) P
n
(x)| 0 as n
Does it happen?
Recall the error bound
max
axb
|f(x) P
n
(x)|
max
axb
|
n
(x)|
(n + 1)!
max
axb

f
(n+1)
(x)

We begin with an example using evenly spaced node


points.
RUNGES EXAMPLE
Use evenly spaced node points:
h =
b a
n
, x
i
= a + ih for i = 0, ..., n
For some functions, such as f(x) = e
x
, the maximum
error goes to zero quite rapidly. But the size of the
derivative term f
(n+1)
(x) in
max
axb
|f(x) P
n
(x)|
max
axb
|
n
(x)|
(n + 1)!
max
axb

f
(n+1)
(x)

can badly hurt or destroy the convergence of other


cases.
In particular, we show the graph of f(x) = 1/

1 + x
2

and P
n
(x) on [5, 5] for the cases n = 8 and n = 12.
The case n = 10 is in the text on page 127. It can
be proven that for this function, the maximum er-
ror on [5, 5] does not converge to zero. Thus the
use of evenly spaced nodes is not necessarily a good
approach to approximating a function f(x) by inter-
polation.
Runges example with n = 10:
x
y
y=P
10
(x)
y=1/(1+x
2
)
OTHER CHOICES OF NODES
Recall the general error bound
max
axb
|f(x) P
n
(x)| max
axb
|
n
(x)|
(n + 1)!
max
axb

f
(n+1)
(x)

There is nothing we really do with the derivative term


for f; but we can examine the way of dening the
nodes {x
0
, ..., x
n
} within the interval [a, b]. We ask
how these nodes can be chosen so that the maximum
of |
n
(x)| over [a, b] is made as small as possible.
This problem has quite an elegant solution, and it is
taken up in 4.6. The node points {x
0
, ..., x
n
} turn
out to be the zeros of a particular polynomial T
n+1
(x)
of degree n+1, called a Chebyshev polynomial. These
zeros are known explicitly, and with them
max
axb
|
n
(x)| =

b a
2

n+1
2
n
This turns out to be smaller than for evenly spaced
cases; and although this polynomial interpolation does
not work for all functions f(x), it works for all dier-
entiable functions and more.
ANOTHER ERROR FORMULA
Recall the error formula
f(x) P
n
(x) =

n
(x)
(n + 1)!
f
(n+1)
(c)

n
(x) = (x x
0
) (x x
1
) (x x
n
)
with c between the minimum and maximum of {x
0
, ..., x
n
, x}.
A second formula is given by
f(x) P
n
(x) =
n
(x) f[x
0
, ..., x
n
, x]
To show this is a simple, but somewhat subtle argu-
ment.
Let P
n+1
(x) denote the polynomial of degree n+1
which interpolates f(x) at the points {x
0
, ..., x
n
, x
n+1
}.
Then
P
n+1
(x) = P
n
(x)
+f[x
0
, ..., x
n
, x
n+1
] (x x
0
) (x x
n
)
Substituting x = x
n+1
, and using the fact that P
n+1
(x)
interpolates f(x) at x
n+1
, we have
f(x
n+1
) = P
n
(x
n+1
)
+f[x
0
, ..., x
n
, x
n+1
] (x
n+1
x
0
) (x
n+1
x
n
)
f(x
n+1
) = P
n
(x
n+1
)
+f[x
0
, ..., x
n
, x
n+1
] (x
n+1
x
0
) (x
n+1
x
n
)
In this formula, the number x
n+1
is completely ar-
bitrary, other than being distinct from the points in
{x
0
, ..., x
n
}. To emphasize this fact, replace x
n+1
by
x throughout the formula, obtaining
f(x) = P
n
(x) + f[x
0
, ..., x
n
, x] (x x
0
) (x x
n
)
= P
n
(x) +
n
(x) f[x
0
, ..., x
n
, x]
provided x 6= x
0
, ..., x
n
.
The formula
f(x) = P
n
(x) + f[x
0
, ..., x
n
, x] (x x
0
) (x x
n
)
= P
n
(x) +
n
(x) f[x
0
, ..., x
n
, x]
is easily true for x a node point. Provided f(x) is
dierentiable, the formula is also true for x a node
point.
This shows
f(x) P
n
(x) =
n
(x) f[x
0
, ..., x
n
, x]
Compare the two error formulas
f(x) P
n
(x) =
n
(x) f[x
0
, ..., x
n
, x]
f(x) P
n
(x) =

n
(x)
(n + 1)!
f
(n+1)
(c)
Then

n
(x) f[x
0
, ..., x
n
, x] =

n
(x)
(n + 1)!
f
(n+1)
(c)
f[x
0
, ..., x
n
, x] =
f
(n+1)
(c)
(n + 1)!
for some c between the smallest and largest of the
numbers in {x
0
, ..., x
n
, x}.
To make this somewhat symmetric in its arguments,
let m = n + 1, x = x
n+1
. Then
f[x
0
, ..., x
m1
, x
m
] =
f
(m)
(c)
m!
with c an unknown number between the smallest and
largest of the numbers in {x
0
, ..., x
m
}. This was given
in an earlier lecture where divided dierences were in-
troduced.
PIECEWISE POLYNOMIAL INTERPOLATION
Recall the examples of higher degree polynomial in-
terpolation of the function f(x) =

1 + x
2

1
on
[5, 5]. The interpolants P
n
(x) oscillated a great
deal, whereas the function f(x) was nonoscillatory.
To obtain interpolants that are better behaved, we
look at other forms of interpolating functions.
Consider the data
x 0 1 2 2.5 3 3.5 4
y 2.5 0.5 0.5 1.5 1.5 1.125 0
What are methods of interpolating this data, other
than using a degree 6 polynomial. Shown in the text
are the graphs of the degree 6 polynomial interpolant,
along with those of piecewise linear and a piecewise
quadratic interpolating functions.
Since we only have the data to consider, we would gen-
erally want to use an interpolant that had somewhat
the shape of that of the piecewise linear interpolant.
x
y
1 2 3 4
1
2
The data points
x
y
1 2 3 4
1
2
Piecewise linear interpolation
x
y
1 2 3 4
1
2
3
4
Polynomial Interpolation
x
y
1 2 3 4
1
2
Piecewise quadratic interpolation
PIECEWISE POLYNOMIAL FUNCTIONS
Consider being given a set of data points (x
1
, y
1
), ...,
(x
n
, y
n
), with
x
1
< x
2
< < x
n
Then the simplest way to connect the points (x
j
, y
j
)
is by straight line segments. This is called a piecewise
linear interpolant of the data
n
(x
j
, y
j
)
o
. This graph
has corners, and often we expect the interpolant to
have a smooth graph.
To obtain a somewhat smoother graph, consider using
piecewise quadratic interpolation. Begin by construct-
ing the quadratic polynomial that interpolates
{(x
1
, y
1
), (x
2
, y
2
), (x
3
, y
3
)}
Then construct the quadratic polynomial that inter-
polates
{(x
3
, y
3
), (x
4
, y
4
), (x
5
, y
5
)}
Continue this process of constructing quadratic inter-
polants on the subintervals
[x
1
, x
3
], [x
3
, x
5
], [x
5
, x
7
], ...
If the number of subintervals is even (and therefore
n is odd), then this process comes out ne, with the
last interval being [x
n2
, x
n
]. This was illustrated
on the graph for the preceding data. If, however, n is
even, then the approximation on the last interval must
be handled by some modication of this procedure.
Suggest such!
With piecewise quadratic interpolants, however, there
are corners on the graph of the interpolating func-
tion. With our preceding example, they are at x
3
and
x
5
. How do we avoid this?
Piecewise polynomial interpolants are used in many
applications. We will consider them later, to obtain
numerical integration formulas.
SMOOTH NON-OSCILLATORY
INTERPOLATION
Let data points (x
1
, y
1
), ..., (x
n
, y
n
) be given, as let
x
1
< x
2
< < x
n
Consider nding functions s(x) for which the follow-
ing properties hold:
(1) s(x
i
) = y
i
, i = 1, ..., n
(2) s(x), s
0
(x), s
00
(x) are continuous on [x
1
, x
n
].
Then among such functions s(x) satisfying these prop-
erties, nd the one which minimizes the integral
Z
x
n
x
1

s
00
(x)

2
dx
The idea of minimizing the integral is to obtain an in-
terpolating function for which the rst derivative does
not change rapidly. It turns out there is a unique so-
lution to this problem, and it is called a natural cubic
spline function.
SPLINE FUNCTIONS
Let a set of node points {x
i
} be given, satisfying
a x
1
< x
2
< < x
n
b
for some numbers a and b. Often we use [a, b] =
[x
1
, x
n
]. A cubic spline function s(x) on [a, b] with
breakpoints or knots {x
i
} has the following prop-
erties:
1. On each of the intervals
[a, x
1
], [x
1
, x
2
], ..., [x
n1
, x
n
], [x
n
, b]
s(x) is a polynomial of degree 3.
2. s(x), s
0
(x), s
00
(x) are continuous on [a, b].
In the case that we have given data points (x
1
, y
1
),...,
(x
n
, y
n
), we say s(x) is a cubic interpolating spline
function for this data if
3. s(x
i
) = y
i
, i = 1, ..., n.
EXAMPLE
Dene
(x )
3
+
=
(
(x )
3
, x
0, x
This is a cubic spline function on (, ) with the
single breakpoint x
1
= .
Combinations of these form more complicated cubic
spline functions. For example,
s(x) = 3 (x 1)
3
+
2 (x 3)
3
+
is a cubic spline function on (, ) with the break-
points x
1
= 1, x
2
= 3.
Dene
s(x) = p
3
(x) +
n
X
j=1
a
j

x x
j

3
+
with p
3
(x) some cubic polynomial. Then s(x) is a
cubic spline function on (, ) with breakpoints
{x
1
, ..., x
n
}.
Return to the earlier problem of choosing an interpo-
lating function s(x) to minimize the integral
Z
x
n
x
1

s
00
(x)

2
dx
There is a unique solution to problem. The solution
s(x) is a cubic interpolating spline function, and more-
over, it satises
s
00
(x
1
) = s
00
(x
n
) = 0
Spline functions satisfying these boundary conditions
are called natural cubic spline functions, and the so-
lution to our minimization problem is a natural cubic
interpolatory spline function. We will show a method
to construct this function from the interpolation data.
Motivation for these boundary conditions can be given
by looking at the physics of bending thin beams of
exible materials to pass thru the given data. To the
left of x
1
and to the right of x
n
, the beam is straight
and therefore the second derivatives are zero at the
transition points x
1
and x
n
.
CONSTRUCTION OF THE
INTERPOLATING SPLINE FUNCTION
To make the presentation more specic, suppose we
have data
(x
1
, y
1
) , (x
2
, y
2
) , (x
3
, y
3
) , (x
4
, y
4
)
with x
1
< x
2
< x
3
< x
4
. Then on each of the
intervals
[x
1
, x
2
] , [x
2
, x
3
] , [x
3
, x
4
]
s(x) is a cubic polynomial. Taking the rst interval,
s(x) is a cubic polynomial and s
00
(x) is a linear poly-
nomial. Let
M
i
= s
00
(x
i
), i = 1, 2, 3, 4
Then on [x
1
, x
2
],
s
00
(x) =
(x
2
x) M
1
+ (x x
1
) M
2
x
2
x
1
, x
1
x x
2
We can nd s(x) by integrating twice:
s(x) =
(x
2
x)
3
M
1
+ (x x
1
)
3
M
2
6 (x
2
x
1
)
+ c
1
x + c
2
We determine the constants of integration by using
s(x
1
) = y
1
, s(x
2
) = y
2
(*)
Then
s(x) =
(x
2
x)
3
M
1
+ (x x
1
)
3
M
2
6 (x
2
x
1
)
+
(x
2
x) y
1
+ (x x
1
) y
2
x
2
x
1

x
2
x
1
6
[(x
2
x) M
1
+ (x x
1
) M
2
]
for x
1
x x
2
.
Check that this formula satises the given interpola-
tion condition (*)!
We can repeat this on the intervals [x
2
, x
3
] and [x
3
, x
4
],
obtaining similar formulas.
For x
2
x x
3
,
s(x) =
(x
3
x)
3
M
2
+ (x x
2
)
3
M
3
6 (x
3
x
2
)
+
(x
3
x) y
2
+ (x x
2
) y
3
x
3
x
2

x
3
x
2
6
[(x
3
x) M
2
+ (x x
2
) M
3
]
For x
3
x x
4
,
s(x) =
(x
4
x)
3
M
3
+ (x x
3
)
3
M
4
6 (x
4
x
3
)
+
(x
4
x) y
3
+ (x x
3
) y
4
x
4
x
3

x
4
x
3
6
[(x
4
x) M
3
+ (x x
3
) M
4
]
We still do not know the values of the second deriv-
atives {M
1
, M
2
, M
3
, M
4
}. The above formulas guar-
antee that s(x) and s
00
(x) are continuous for
x
1
x x
4
. For example, the formula on [x
1
, x
2
]
yields
s(x
2
) = y
2
, s
00
(x
2
) = M
2
The formula on [x
2
, x
3
] also yields
s(x
2
) = y
2
, s
00
(x
2
) = M
2
All that is lacking is to make s'(x) continuous at x_2 and x_3. Thus we require
s'(x_2 + 0) = s'(x_2 − 0)
s'(x_3 + 0) = s'(x_3 − 0)     (**)
This means
lim_{x → x_2^+} s'(x) = lim_{x → x_2^−} s'(x)
and similarly for x_3.
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
x_2 = x_1 + h,   x_3 = x_1 + 2h,   x_4 = x_1 + 3h
Then our earlier formulas simplify to
s(x) = [(x_2 − x)^3 M_1 + (x − x_1)^3 M_2] / (6h)
     + [(x_2 − x) y_1 + (x − x_1) y_2] / h
     − (h/6) [(x_2 − x) M_1 + (x − x_1) M_2]
for x_1 ≤ x ≤ x_2, with similar formulas on [x_2, x_3] and [x_3, x_4].
Without going through all of the algebra, the conditions (**) lead to the following pair of equations:
(h/6) M_1 + (2h/3) M_2 + (h/6) M_3 = (y_3 − y_2)/h − (y_2 − y_1)/h
(h/6) M_2 + (2h/3) M_3 + (h/6) M_4 = (y_4 − y_3)/h − (y_3 − y_2)/h
This gives us two equations in four unknowns. The earlier boundary conditions on s''(x) give us immediately
M_1 = M_4 = 0
Then we can solve the linear system for M_2 and M_3.
EXAMPLE
Consider the interpolation data points
x :  1    2    3    4
y :  1   1/2  1/3  1/4
In this case, h = 1, and the linear system becomes
(2/3) M_2 + (1/6) M_3 = y_3 − 2y_2 + y_1 = 1/3
(1/6) M_2 + (2/3) M_3 = y_4 − 2y_3 + y_2 = 1/12
This has the solution
M_2 = 1/2,   M_3 = 0
This leads to the spline function formula on each subinterval.
On [1, 2],
s(x) = [(x_2 − x)^3 M_1 + (x − x_1)^3 M_2] / (6h)
     + [(x_2 − x) y_1 + (x − x_1) y_2] / h
     − (h/6) [(x_2 − x) M_1 + (x − x_1) M_2]
     = [(2 − x)^3 · 0 + (x − 1)^3 · (1/2)] / 6
     + [(2 − x) · 1 + (x − 1) · (1/2)] / 1
     − (1/6) [(2 − x) · 0 + (x − 1) · (1/2)]
     = (1/12)(x − 1)^3 − (7/12)(x − 1) + 1
Similarly, for 2 ≤ x ≤ 3,
s(x) = −(1/12)(x − 2)^3 + (1/4)(x − 2)^2 − (1/3)(x − 2) + 1/2
and for 3 ≤ x ≤ 4,
s(x) = −(1/12)(x − 4) + 1/4
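To make this calculation easy to reproduce, here is a small MATLAB sketch (ours, not part of the notes) that solves the 2×2 system for M_2, M_3 and evaluates the natural spline from the second-derivative form; the variable names are placeholders of our own choosing.

    % Natural cubic spline for the data x = 1:4, y = 1./x from the example above.
    x = 1:4;  y = 1 ./ x;  h = 1;
    % Natural boundary conditions: M1 = M4 = 0.  Solve for M2 and M3.
    A = [2/3, 1/6; 1/6, 2/3];
    b = [(y(3) - 2*y(2) + y(1))/h; (y(4) - 2*y(3) + y(2))/h];
    M = [0; A\b; 0];                      % M = [M1; M2; M3; M4] = [0; 1/2; 0; 0]
    % Evaluate s on [x_j, x_{j+1}] using the second-derivative formula.
    s = @(t, j) ((x(j+1)-t).^3*M(j) + (t-x(j)).^3*M(j+1))/(6*h) ...
              + ((x(j+1)-t)*y(j) + (t-x(j))*y(j+1))/h ...
              - (h/6)*((x(j+1)-t)*M(j) + (t-x(j))*M(j+1));
    % For instance, s(1.5, 1) evaluates the spline on [1, 2], s(2.5, 2) on [2, 3].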
[Figure: graph of y = 1/x and the interpolating natural cubic spline y = s(x) for the data above. Caption: Graph of example of natural cubic spline interpolation.]
A second data set:
x :  0    1    2    2.5  3    3.5    4
y :  2.5  0.5  0.5  1.5  1.5  1.125  0
[Figure: interpolating natural cubic spline function for this data.]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
(h/6) M_1 + (2h/3) M_2 + (h/6) M_3 = (y_3 − y_2)/h − (y_2 − y_1)/h
(h/6) M_2 + (2h/3) M_3 + (h/6) M_4 = (y_4 − y_3)/h − (y_3 − y_2)/h
Sometimes other boundary conditions are imposed on s(x) to help in determining the values of M_1 and M_4. For example, the data in our numerical example were generated from the function f(x) = 1/x. With it, f''(x) = 2/x^3, and thus we could use
M_1 = f''(1) = 2,   M_4 = f''(4) = 1/32
With this we are led to a new formula for s(x), one that approximates f(x) = 1/x more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
s(x_i) = y_i,   i = 1, 2, 3, 4
with the boundary conditions
s'(x_1) = y'_1,   s'(x_4) = y'_4     (#)
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
(h/3) M_1 + (h/6) M_2 = (y_2 − y_1)/h − y'_1
(h/6) M_1 + (2h/3) M_2 + (h/6) M_3 = (y_3 − y_2)/h − (y_2 − y_1)/h
(h/6) M_2 + (2h/3) M_3 + (h/6) M_4 = (y_4 − y_3)/h − (y_3 − y_2)/h
(h/6) M_3 + (h/3) M_4 = y'_4 − (y_4 − y_3)/h
For our numerical example, it is natural to obtain these derivative values from f'(x) = −1/x^2:
y'_1 = f'(1) = −1,   y'_4 = f'(4) = −1/16
When combined with our earlier equations, we have the system
(1/3) M_1 + (1/6) M_2 = 1/2
(1/6) M_1 + (2/3) M_2 + (1/6) M_3 = 1/3
(1/6) M_2 + (2/3) M_3 + (1/6) M_4 = 1/12
(1/6) M_3 + (1/3) M_4 = 1/48
This has the solution
[M_1, M_2, M_3, M_4] = [173/120, 7/60, 11/120, 1/60]
We can now write the function s(x) on each of the subintervals [x_1, x_2], [x_2, x_3], and [x_3, x_4]. Recall, for x_1 ≤ x ≤ x_2,
s(x) = [(x_2 − x)^3 M_1 + (x − x_1)^3 M_2] / (6h)
     + [(x_2 − x) y_1 + (x − x_1) y_2] / h
     − (h/6) [(x_2 − x) M_1 + (x − x_1) M_2]
We can substitute in from the data
x :  1    2    3    4
y :  1   1/2  1/3  1/4
and the solutions {M_i}. Doing so, consider the error f(x) − s(x). As an example,
f(x) = 1/x,   f(3/2) = 2/3 ≈ 0.66667,   s(3/2) = 0.65260
This is quite a decent approximation.
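For completeness, a minimal MATLAB sketch (our own, with our variable names) that assembles and solves the 4×4 clamped-spline system above:

    % Clamped cubic spline for x = 1:4, y = 1./x with s'(1) = -1, s'(4) = -1/16.
    x = 1:4;  y = 1 ./ x;  h = 1;
    dy1 = -1;  dy4 = -1/16;               % endpoint slopes from f'(x) = -1/x^2
    A = [h/3,  h/6,   0,     0;
         h/6,  2*h/3, h/6,   0;
         0,    h/6,   2*h/3, h/6;
         0,    0,     h/6,   h/3];
    b = [(y(2)-y(1))/h - dy1;
         (y(3)-y(2))/h - (y(2)-y(1))/h;
         (y(4)-y(3))/h - (y(3)-y(2))/h;
         dy4 - (y(4)-y(3))/h];
    M = A\b;
    disp(M')   % approximately [1.4417 0.1167 0.0917 0.0167] = [173/120 7/60 11/120 1/60]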
THE GENERAL PROBLEM
Consider the spline interpolation problem with n nodes
(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
and assume the node points {x_i} are evenly spaced,
x_j = x_1 + (j − 1) h,   j = 1, ..., n
We have that the interpolating spline s(x) on x_j ≤ x ≤ x_{j+1} is given by
s(x) = [(x_{j+1} − x)^3 M_j + (x − x_j)^3 M_{j+1}] / (6h)
     + [(x_{j+1} − x) y_j + (x − x_j) y_{j+1}] / h
     − (h/6) [(x_{j+1} − x) M_j + (x − x_j) M_{j+1}]
for j = 1, ..., n − 1.
To enforce continuity of s'(x) at the interior node points x_2, ..., x_{n−1}, the second derivatives {M_j} must satisfy the linear equations
(h/6) M_{j−1} + (2h/3) M_j + (h/6) M_{j+1} = (y_{j−1} − 2y_j + y_{j+1}) / h
for j = 2, ..., n − 1. Writing them out,
(h/6) M_1 + (2h/3) M_2 + (h/6) M_3 = (y_1 − 2y_2 + y_3) / h
(h/6) M_2 + (2h/3) M_3 + (h/6) M_4 = (y_2 − 2y_3 + y_4) / h
   ...
(h/6) M_{n−2} + (2h/3) M_{n−1} + (h/6) M_n = (y_{n−2} − 2y_{n−1} + y_n) / h
This is a system of n − 2 equations in the n unknowns {M_1, ..., M_n}. Two more conditions must be imposed on s(x) in order to have the number of equations equal the number of unknowns, namely n. With the added boundary conditions, this form of linear system can be solved very efficiently.
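The coefficient matrix of this system is tridiagonal. As an illustration of how the natural-spline case (M_1 = M_n = 0) might be assembled and solved in MATLAB, here is a sketch of our own; the function name natural_spline_moments is hypothetical, and the code assumes evenly spaced nodes as above.

    function M = natural_spline_moments(x, y)
    % Second derivatives M_j of the natural cubic spline on evenly spaced nodes.
      n = length(x);
      h = x(2) - x(1);
      m = n - 2;                         % number of interior moments M_2,...,M_{n-1}
      A = diag((2*h/3)*ones(m,1)) + diag((h/6)*ones(m-1,1),1) + diag((h/6)*ones(m-1,1),-1);
      rhs = y(1:n-2) - 2*y(2:n-1) + y(3:n);
      rhs = rhs(:) / h;
      M = [0; A\rhs; 0];                 % natural conditions: M_1 = M_n = 0
    end

For large n one would store only the three diagonals (for example with spdiags) instead of a full matrix, since the system is tridiagonal.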
BOUNDARY CONDITIONS
Natural boundary conditions:
s''(x_1) = s''(x_n) = 0
Spline functions satisfying these conditions are called natural cubic splines. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.
Clamped boundary conditions: We add the conditions
s'(x_1) = y'_1,   s'(x_n) = y'_n
with y'_1, y'_n given slopes for the endpoints of s(x) on [x_1, x_n]. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.
Not-a-knot boundary conditions: This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in MATLAB.
THE NOT-A-KNOT CONDITIONS
As before, let the interpolation nodes be
(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points
(x_1, y_1), (x_3, y_3), ..., (x_{n−2}, y_{n−2}), (x_n, y_n)
thus deleting two of the points. We now have n − 2 points, and the interpolating spline s(x) can be determined on the intervals
[x_1, x_3], [x_3, x_4], ..., [x_{n−3}, x_{n−2}], [x_{n−2}, x_n]
This leads to n − 4 equations in the n − 2 unknowns M_1, M_3, ..., M_{n−2}, M_n. The two additional boundary conditions are
s(x_2) = y_2,   s(x_{n−1}) = y_{n−1}
These translate into two additional equations, and we obtain a system of n − 2 linear simultaneous equations in the n − 2 unknowns M_1, M_3, ..., M_{n−2}, M_n.
For the same data as before,
x :  0    1    2    2.5  3    3.5    4
y :  2.5  0.5  0.5  1.5  1.5  1.125  0
[Figure: interpolating cubic spline function with not-a-knot boundary conditions.]
MATLAB SPLINE FUNCTION LIBRARY
Given data points
(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)
type arrays containing the x and y coordinates:
    x = [x_1 x_2 ... x_n];
    y = [y_1 y_2 ... y_n];
    plot(x, y, 'o')
The last statement will draw a plot of the data points, marking them with the letter 'o'. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say
    n = length(x);
    h = (x(n) - x(1)) / (10*n);  xx = x(1):h:x(n);
use
    yy = spline(x, y, xx);
    plot(x, y, 'o', xx, yy)
The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a continuous curve.
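For instance, with the data from the earlier example of this lecture, a short session might look as follows (our own illustrative snippet):

    % Interpolate y = 1/x at x = 1,2,3,4 with MATLAB's spline (not-a-knot conditions).
    x = 1:4;  y = 1 ./ x;
    xx = linspace(1, 4, 301);
    yy = spline(x, y, xx);
    plot(x, y, 'o', xx, yy, '-')
    % spline(x, y) alone returns a piecewise-polynomial structure,
    % which can be evaluated later with ppval:
    pp = spline(x, y);
    ppval(pp, 1.5)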
ERROR IN CUBIC SPLINE INTERPOLATION
Let an interval [a, b] be given, and then define
h = (b − a)/(n − 1),   x_j = a + (j − 1)h,   j = 1, ..., n
Suppose we want to approximate a given function f(x) on the interval [a, b] using cubic spline interpolation. Define
y_j = f(x_j),   j = 1, ..., n
Let s_n(x) denote the cubic spline interpolating this data and satisfying the not-a-knot boundary conditions. Then it can be shown that for a suitable constant c,
E_n ≡ max_{a≤x≤b} |f(x) − s_n(x)| ≤ c h^4
The corresponding bound for natural cubic spline interpolation contains only a term of h^2 rather than h^4; it does not converge to zero as rapidly.
EXAMPLE
Take f(x) = arctan x on [0, 5]. The following table gives values of the maximum error E_n for various values of n. The values of h are being successively halved; the last column gives the ratio of the error at the preceding n to the current error E_n.

n    E_n        Ratio
7    7.09E−3
13   3.24E−4    21.9
25   3.06E−5    10.6
49   1.48E−6    20.7
97   9.04E−8    16.4
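A rough way to reproduce such a table is sketched below (our own code, not part of the notes); the exact error values depend on how finely the error is sampled, so the numbers will only approximate those above.

    % Empirical check of the O(h^4) behaviour for f(x) = arctan(x) on [0, 5].
    f = @atan;  a = 0;  b = 5;
    for n = [7 13 25 49 97]
        x  = linspace(a, b, n);
        s  = spline(x, f(x));               % not-a-knot spline through the data
        xx = linspace(a, b, 10001);
        En = max(abs(f(xx) - ppval(s, xx)));
        fprintf('n = %3d   E_n = %.2e\n', n, En);
    end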
BEST APPROXIMATION
Given a function f(x) that is continuous on a given interval [a, b], consider approximating it by some polynomial p(x). To measure the error in p(x) as an approximation, introduce
E(p) = max_{a≤x≤b} |f(x) − p(x)|
This is called the maximum error or uniform error of approximation of f(x) by p(x) on [a, b].
With an eye towards efficiency, we want to find the best possible approximation of a given degree n. With this in mind, introduce the following:
ρ_n(f) = min_{deg(p)≤n} E(p) = min_{deg(p)≤n} [ max_{a≤x≤b} |f(x) − p(x)| ]
The number ρ_n(f) will be the smallest possible uniform error, or minimax error, when approximating f(x) by polynomials of degree at most n. If there is a polynomial giving this smallest error, we denote it by m_n(x); thus E(m_n) = ρ_n(f).
Example. Let f(x) = e^x on [−1, 1]. In the following table, we give the values of E(t_n), with t_n(x) the Taylor polynomial of degree n for e^x about x = 0, and of E(m_n).

Maximum error in:
n    t_n(x)      m_n(x)
1    7.18E−1     2.79E−1
2    2.18E−1     4.50E−2
3    5.16E−2     5.53E−3
4    9.95E−3     5.47E−4
5    1.62E−3     4.52E−5
6    2.26E−4     3.21E−6
7    2.79E−5     2.00E−7
8    3.06E−6     1.11E−8
9    3.01E−7     5.52E−10
Consider graphically how we can improve on the Taylor polynomial
t_1(x) = 1 + x
as a uniform approximation to e^x on the interval [−1, 1]. The linear minimax approximation is
m_1(x) = 1.2643 + 1.1752x
[Figure: linear Taylor and minimax approximations to e^x on [−1, 1].]
[Figure: error in the cubic Taylor approximation to e^x; the maximum error is about 0.0516.]
[Figure: error in the cubic minimax approximation to e^x; the error oscillates between about −0.00553 and 0.00553.]
Accuracy of the minimax approximation.
ρ_n(f) ≤ [(b − a)/2]^{n+1} / [(n + 1)! 2^n] · max_{a≤x≤b} |f^{(n+1)}(x)|
This error bound does not always become smaller with increasing n, but it will give a fairly accurate bound for many common functions f(x).
Example. Let f(x) = e^x for −1 ≤ x ≤ 1. Then
ρ_n(e^x) ≤ e / [(n + 1)! 2^n]     (*)

n    Bound (*)    ρ_n(f)
1    6.80E−1      2.79E−1
2    1.13E−1      4.50E−2
3    1.42E−2      5.53E−3
4    1.42E−3      5.47E−4
5    1.18E−4      4.52E−5
6    8.43E−6      3.21E−6
7    5.27E−7      2.00E−7
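The bound (*) is easy to tabulate; for example (our own snippet):

    % Bound (*) for rho_n(e^x) on [-1, 1]:  e / ((n+1)! * 2^n).
    n = 1:7;
    bound = exp(1) ./ (factorial(n + 1) .* 2.^n);
    fprintf('%d   %.2e\n', [n; bound])   % reproduces the 'Bound (*)' column above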
CHEBYSHEV POLYNOMIALS
Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer n ≥ 0, define the function
T_n(x) = cos(n cos^{−1} x),   −1 ≤ x ≤ 1     (1)
This may not appear to be a polynomial, but we will show it is a polynomial of degree n. To simplify the manipulation of (1), we introduce
θ = cos^{−1}(x)   or   x = cos(θ),   0 ≤ θ ≤ π     (2)
Then
T_n(x) = cos(nθ)     (3)
Example. For n = 0,
T_0(x) = cos(0·θ) = 1
For n = 1,
T_1(x) = cos(θ) = x
For n = 2,
T_2(x) = cos(2θ) = 2 cos^2(θ) − 1 = 2x^2 − 1
[Figure: graphs of T_0(x), T_1(x), T_2(x) on [−1, 1].]
[Figure: graphs of T_3(x), T_4(x) on [−1, 1].]
The triple recursion relation. Recall the trigonometric addition formulas,
cos(α ± β) = cos(α) cos(β) ∓ sin(α) sin(β)
Let n ≥ 1, and apply these identities to get
T_{n+1}(x) = cos[(n + 1)θ] = cos(nθ + θ) = cos(nθ) cos(θ) − sin(nθ) sin(θ)
T_{n−1}(x) = cos[(n − 1)θ] = cos(nθ − θ) = cos(nθ) cos(θ) + sin(nθ) sin(θ)
Add these two equations, and then use (2) and (3) to obtain
T_{n+1}(x) + T_{n−1}(x) = 2 cos(nθ) cos(θ) = 2x T_n(x)
T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x),   n ≥ 1     (4)
This is called the triple recursion relation for the Chebyshev polynomials. It is often used in evaluating them, rather than using the explicit formula (1).
Example. Recall
T_0(x) = 1,   T_1(x) = x
T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x),   n ≥ 1
Let n = 2. Then
T_3(x) = 2x T_2(x) − T_1(x) = 2x(2x^2 − 1) − x = 4x^3 − 3x
Let n = 3. Then
T_4(x) = 2x T_3(x) − T_2(x) = 2x(4x^3 − 3x) − (2x^2 − 1) = 8x^4 − 8x^2 + 1
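The recursion (4) translates directly into code. A small MATLAB sketch of our own (the function name cheb_eval is hypothetical):

    function T = cheb_eval(n, x)
    % Evaluate the Chebyshev polynomial T_n at the points in x via the recursion (4).
      Tprev = ones(size(x));            % T_0(x) = 1
      if n == 0
          T = Tprev;  return
      end
      T = x;                            % T_1(x) = x
      for k = 1:n-1
          Tnext = 2*x.*T - Tprev;       % T_{k+1} = 2x T_k - T_{k-1}
          Tprev = T;
          T = Tnext;
      end
    end

    % Check against the explicit formula (1), e.g. for n = 4:
    % xx = linspace(-1, 1, 5);
    % max(abs(cheb_eval(4, xx) - cos(4*acos(xx))))   % should be at rounding-error level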
The minimum size property. Note that
|T_n(x)| ≤ 1,   −1 ≤ x ≤ 1     (5)
for all n ≥ 0. Also, note that
T_n(x) = 2^{n−1} x^n + lower degree terms,   n ≥ 1     (6)
This can be proven using the triple recursion relation and mathematical induction.
Introduce a modified version of T_n(x),
T̃_n(x) = (1/2^{n−1}) T_n(x) = x^n + lower degree terms     (7)
From (5) and (6),
|T̃_n(x)| ≤ 1/2^{n−1},   −1 ≤ x ≤ 1,   n ≥ 1     (8)
Example.
T̃_4(x) = (1/8)(8x^4 − 8x^2 + 1) = x^4 − x^2 + 1/8
A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial T̃_n(x) has size 1/2^{n−1} on −1 ≤ x ≤ 1, and this becomes smaller as the degree n increases. In comparison,
max_{−1≤x≤1} |x^n| = 1
Thus x^n is a monic polynomial whose size does not change with increasing n.
Theorem. Let n ≥ 1 be an integer, and consider all possible monic polynomials of degree n. Then the degree n monic polynomial with the smallest maximum on [−1, 1] is the modified Chebyshev polynomial T̃_n(x), and its maximum value on [−1, 1] is 1/2^{n−1}.
This result is used in devising applications of Chebyshev polynomials. We apply it to obtain an improved interpolation scheme.
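A quick numerical illustration of the theorem (our own check), comparing the monic polynomials x^4 and T̃_4(x) = x^4 − x^2 + 1/8 on [−1, 1]:

    % The modified Chebyshev polynomial has maximum 1/2^3 = 0.125 on [-1, 1],
    % while the monic polynomial x^4 has maximum 1.
    xx = linspace(-1, 1, 10001);
    max(abs(xx.^4 - xx.^2 + 1/8))      % approximately 0.125
    max(abs(xx.^4))                    % 1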