e^{-x^2} = 1 - x^2 + (1/2!) x^4 - (1/3!) x^6 + ... + ((-1)^n / n!) x^{2n} + ((-1)^{n+1} / (n+1)!) x^{2n+2} e^c

Because c must be between 0 and -x^2, it must be negative. Thus we let c = -ξ in the error term, with 0 ≤ ξ ≤ x^2.
EVALUATING A POLYNOMIAL
Consider a polynomial
p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n
which you need to evaluate for many values of x. How do you evaluate it? This may seem a strange question, but the answer is not as obvious as you might think. The standard way, written in a loose algorithmic format:

    poly = a_0
    for j = 1 : n
        poly = poly + a_j * x^j
    end
To compare the costs of different numerical methods, we do an operations count, and then we compare these counts for the competing methods. For the algorithm above, the counts are as follows:

    additions: n
    multiplications: 1 + 2 + 3 + ... + n = n(n + 1)/2

This assumes each term a_j x^j is computed independently of the remaining terms in the polynomial.
Next, compute the powers x^j recursively:
x^j = x * x^{j-1}
Then computing {x^2, x^3, ..., x^n} will cost n - 1 multiplications. Our algorithm becomes

    poly = a_0 + a_1 * x
    power = x
    for j = 2 : n
        power = x * power
        poly = poly + a_j * power
    end
The total operations cost is

    additions: n
    multiplications: n + (n - 1) = 2n - 1

When n is even moderately large, this is much less than for the first method of evaluating p(x). For example, with n = 20, the first method has 210 multiplications, whereas the second has 39 multiplications.
We now consider nested multiplication. As examples for particular degrees, write

    n = 2 : p(x) = a_0 + x(a_1 + a_2 x)
    n = 3 : p(x) = a_0 + x(a_1 + x(a_2 + a_3 x))
    n = 4 : p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + a_4 x)))

These contain, respectively, 2, 3, and 4 multiplications. This is less than for the preceding method, which would have needed 3, 5, and 7 multiplications, respectively.
For the general case, write

p(x) = a_0 + x(a_1 + x(a_2 + ... + x(a_{n-1} + a_n x) ... ))

This requires n multiplications, which is only about half that for the preceding method. As an algorithm:

    poly = a_n
    for j = n-1 : -1 : 0
        poly = a_j + x * poly
    end
With all three methods, the number of additions is n; but the number of multiplications can be dramatically different for large values of n.
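To make the comparison concrete, here is a short Python sketch (our own addition, not part of the original notes) implementing all three evaluation schemes; the names naive_eval, power_eval, and horner_eval are ours:

    import random

    def naive_eval(a, x):
        # Method 1: each term a_j * x**j computed independently.
        return sum(a[j] * x**j for j in range(len(a)))

    def power_eval(a, x):
        # Method 2: powers of x built up recursively (2n - 1 multiplications).
        poly = a[0] + a[1] * x
        power = x
        for j in range(2, len(a)):
            power = x * power
            poly = poly + a[j] * power
        return poly

    def horner_eval(a, x):
        # Method 3: nested multiplication (n multiplications, n additions).
        poly = a[-1]
        for j in range(len(a) - 2, -1, -1):
            poly = a[j] + x * poly
        return poly

    a = [random.uniform(-1, 1) for _ in range(21)]   # degree n = 20
    x = 0.7
    print(naive_eval(a, x), power_eval(a, x), horner_eval(a, x))  # all three agree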
NESTED MULTIPLICATION
Imagine we are evaluating the polynomial
p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n
at a point x = z. Thus with nested multiplication
p(z) = a_0 + z(a_1 + z(a_2 + ... + z(a_{n-1} + a_n z) ... ))
We can write this as the following sequence of operations:

    b_n = a_n
    b_{n-1} = a_{n-1} + z b_n
    b_{n-2} = a_{n-2} + z b_{n-1}
    ...
    b_0 = a_0 + z b_1

The quantities b_{n-1}, ..., b_0 are simply the quantities in parentheses, starting from the innermost and working outward.
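A minimal Python sketch of this recursion (our own illustration; horner_div is a hypothetical name), which returns b_0 = p(z) together with the coefficients b_1, ..., b_n of the quotient q(x) introduced below:

    def horner_div(a, z):
        # a = [a_0, a_1, ..., a_n]; returns (b_0, [b_1, ..., b_n]) with
        # p(x) = b_0 + (x - z) * q(x),  q(x) = b_1 + b_2 x + ... + b_n x^{n-1}.
        n = len(a) - 1
        b = [0.0] * (n + 1)
        b[n] = a[n]
        for j in range(n - 1, -1, -1):
            b[j] = a[j] + z * b[j + 1]
        return b[0], b[1:]

    p_z, q = horner_div([1.0, -2.0, 0.0, 4.0], 2.0)   # p(x) = 1 - 2x + 4x^3
    print(p_z)   # p(2) = 29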
Introduce
q(x) = b_1 + b_2 x + b_3 x^2 + ... + b_n x^{n-1}
Claim:
p(x) = b_0 + (x - z) q(x)   (*)
Proof: Simply expand
b_0 + (x - z)(b_1 + b_2 x + b_3 x^2 + ... + b_n x^{n-1})
and collect powers of x: the coefficient of x^j is b_j - z b_{j+1} = a_j for j < n, and b_n = a_n, so the result is p(x).
∫_0^x (sin t)/t dt = ∫_0^x [ 1 - (1/6) t^2 + (1/120) t^4 cos c_t ] dt
= x - (1/18) x^3 + (1/120) ∫_0^x t^4 cos c_t dt

(1/x) ∫_0^x (sin t)/t dt = 1 - (1/18) x^2 + R_2(x)

R_2(x) = (1/120)(1/x) ∫_0^x t^4 cos c_t dt
How large is the error in the approximation
SF(x) ≈ 1 - (1/18) x^2
on the interval [-1, 1]? Since |cos c_t| ≤ 1, we have for x > 0 that

0 ≤ R_2(x) ≤ (1/120)(1/x) ∫_0^x t^4 dt = (1/600) x^4

and the same result can be shown for x < 0. Then for |x| ≤ 1, we have

0 ≤ R_2(x) ≤ 1/600

To obtain a more accurate approximation, we can proceed exactly as above, but simply use a higher-degree approximation to sin t.
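As a quick numerical check (our own addition, assuming SciPy is available for its quad integrator), one can compare SF(x) against 1 - x^2/18 and confirm the error stays below 1/600 on [-1, 1]:

    import numpy as np
    from scipy.integrate import quad

    def SF(x):
        # SF(x) = (1/x) * integral of sin(t)/t from 0 to x.
        val, _ = quad(lambda t: np.sinc(t / np.pi), 0.0, x)  # np.sinc(y) = sin(pi*y)/(pi*y)
        return val / x

    for x in [0.25, 0.5, 1.0]:
        approx = 1.0 - x**2 / 18.0
        print(x, SF(x) - approx)   # each difference is nonnegative and below 1/600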
BINARY INTEGERS
A binary integer x is a finite sequence of the digits 0 and 1, which we write symbolically as
x = (a_m a_{m-1} ... a_2 a_1 a_0)_2
where I insert the parentheses with subscript ( )_2 in order to make clear that the number is binary. The above has the decimal equivalent
x = a_m 2^m + a_{m-1} 2^{m-1} + ... + a_1 2^1 + a_0
For example, the binary integer x = (110101)_2 has the decimal value
x = 2^5 + 2^4 + 2^2 + 2^0 = 53
The binary integer x = (111...1)_2 with m ones has the decimal value
x = 2^{m-1} + ... + 2^1 + 1 = 2^m - 1
DECIMAL TO BINARY INTEGER CONVERSION
Given a decimal integer x, we write
x = (a_m a_{m-1} ... a_2 a_1 a_0)_2 = a_m 2^m + a_{m-1} 2^{m-1} + ... + a_1 2^1 + a_0
Divide x by 2, calling the quotient x_1. The remainder is a_0, and
x_1 = a_m 2^{m-1} + a_{m-1} 2^{m-2} + ... + a_1 2^0
Continue the process. Divide x_1 by 2, calling the quotient x_2. The remainder is a_1, and
x_2 = a_m 2^{m-2} + a_{m-1} 2^{m-3} + ... + a_2 2^0
After a finite number of such steps, we will obtain all of the coefficients a_i, and the final quotient will be zero.
Try this with a few decimal integers.
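A small Python sketch of this repeated-division procedure (our own illustration; to_binary is a hypothetical name):

    def to_binary(x):
        # Repeatedly divide by 2; the remainders are a_0, a_1, ..., a_m.
        digits = []
        while x > 0:
            x, r = divmod(x, 2)   # quotient x_{k+1} and remainder a_k
            digits.append(str(r))
        return "".join(reversed(digits)) or "0"

    print(to_binary(53))      # '110101'
    print(int("110101", 2))   # 53, checking the decimal value from above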
EXAMPLE
The following shortened form of the above method is convenient for hand computation. Convert (11)_10 to binary.

    ⌊11/2⌋ = 5 = x_1,   a_0 = 1
    ⌊5/2⌋ = 2 = x_2,    a_1 = 1
    ⌊2/2⌋ = 1 = x_3,    a_2 = 0
    ⌊1/2⌋ = 0 = x_4,    a_3 = 1

Thus (11)_10 = (1011)_2.
In this, the notation ⌊b⌋ denotes the largest integer ≤ b.

Recall the geometric series

∑_{i=0}^∞ r^i = 1/(1 - r),   |r| < 1

Using this,

(.0101010101...)_2 = 2^{-2} + 2^{-4} + 2^{-6} + ...
= 2^{-2} (1 + 2^{-2} + 2^{-4} + ...)
= (1/4) * 1/(1 - 1/4) = 1/3
The exponent field E = 10000000001 together with the significand bits
b_13 b_14 ... b_64 = 1100100000000000...0000
provides us with the IEEE double precision representation of 7.125.
SOME DEFINITIONS
Let x_T denote the true value of some number, usually unknown in practice; and let x_A denote an approximation of x_T.
The error in x_A is
error(x_A) = x_T - x_A
The relative error in x_A is
rel(x_A) = error(x_A) / x_T = (x_T - x_A) / x_T
Example: x_T = e, x_A = 19/7. Then
error(x_A) = e - 19/7 = 0.003996
rel(x_A) = 0.003996 / e = 0.00147
Relative error is more informative than absolute error in representing the difference between the true value and the approximate one, as the next example shows.
Example: Suppose the distance between two cities is D_T = 100 km, and let this distance be approximated by D_A = 99 km. In this case,
Err(D_A) = D_T - D_A = 1 km
Rel(D_A) = Err(D_A) / D_T = 0.01 = 1%
Now suppose the distance is d_T = 2 km and we estimate it by d_A = 1 km. Then
Err(d_A) = d_T - d_A = 1 km
Rel(d_A) = Err(d_A) / d_T = 0.5 = 50%
In both cases the absolute error is the same. But obviously D_A is a better approximation of D_T than d_A is of d_T.
Numerical Analysis
conf.dr. Bostan Viorel
Fall 2010, Lecture 3
Sources of Error
The sources of error in the computation of the solution of a mathematical model for some physical situation can be roughly characterised as follows:

1. Modelling Error.
Consider the example of a projectile of mass m that is travelling through the earth's atmosphere. A simple and often used description of the projectile motion is given by

m (d^2 r / dt^2)(t) = -mg k - b (dr/dt)

with b ≥ 0. In this, r(t) is the vector position of the projectile, k is the vertical unit vector, and the final term in the equation represents the friction force of the air. If there is an error in this model of the physical situation, then the numerical solution of the equation is not going to improve the results.
2. Physical / Observational / Measurement Error.
The radius of an electron is given by
(2.81777 + ε) × 10^{-13} cm,   |ε| ≤ 0.00011
This error cannot be removed, and it must affect the accuracy of any computation in which the value is used. We need to be aware of these effects and arrange the computation so as to minimize them.
3. Approximation Error.
This is also called "discretization error" and "truncation error", and it is the main source of error with which we deal in this course. Such errors generally occur when we replace a computationally unsolvable problem with a nearby problem that is more tractable computationally.
For example, the Taylor polynomial approximation
e^x ≈ 1 + x + (1/2) x^2
contains an "approximation error".
The numerical integration
∫_0^1 f(x) dx ≈ (1/N) ∑_{j=1}^N f(j/N)
contains an approximation error.
4. Finiteness of Algorithm Error.
This is an error due to stopping an algorithm after a finite number of iterations. Even if theoretically an algorithm could run indefinitely, after a finite (usually specified) number of iterations it will be stopped.
5. Blunders.
In the pre-computer era, blunders were mostly arithmetic errors. In the earlier years of the computer era, the typical blunder was a programming bug. Present-day "blunders" are still often programming errors, but now they are much more difficult to find, as they are often embedded in very large codes which may mask their effect.
Some simple rules to decrease the risk of having a bug in the code:
Break programs into small, testable subprograms;
Run test cases for which you know the outcome;
When running the full code, maintain a skeptical eye on the output, checking whether the output is reasonable or not.
6. Rounding/Chopping Error.
This is the main source of many problems, especially in solving systems of linear equations. We look at the effects of such errors later.
7. Finiteness of Precision Errors.
All numbers stored in computer memory are subject to the finiteness of the space allocated for their storage.
Pendulum Example
Original problem in engineering or in science to be solved:
[Figure: a pendulum of length l, with string tension T and weight mg acting on the bob.]
Model this physical problem mathematically. Newton's second law provides us with
θ'' = -(g/l) sin θ
or, written as a first-order system,
θ' = ω
ω' = -(g/l) sin θ
Pendulum Example
Problem of continuous mathematics:
θ' = ω
ω' = -(g/l) sin θ
Error sources at this stage:
Modeling Errors
Physical Errors
Pendulum Example
Mathematical algorithm (discretize with step size h):
ω_{n+1} = ω_n - h (g/l) sin(θ_n)
θ_{n+1} = θ_n + h ω_{n+1}
Error sources at this stage:
Discretisation Errors
Finiteness of Algorithm Errors
Pendulum Example
Computer Implementation:

    for i=1:Nmax
        Omega = Omega - H*g/L*sin(Theta);
        Theta = Theta + H*Omega;
    end

Error sources at this stage:
Rounding / Chopping Errors
Bugs in the Code
Finite Precision Errors
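For readers who want to run this scheme, here is a self-contained Python transcription of the MATLAB loop above (the step size H, length L, and initial values are our own choices for illustration):

    import math

    g, L = 9.81, 1.0          # gravity (m/s^2) and pendulum length (m), assumed values
    H, Nmax = 0.001, 5000     # step size and number of steps, assumed values
    theta, omega = 0.5, 0.0   # initial angle (rad) and angular velocity

    for i in range(Nmax):
        omega = omega - H * g / L * math.sin(theta)  # update angular velocity first
        theta = theta + H * omega                    # then the angle, using the new omega
    print(theta, omega)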
Loss of significance errors
This can be considered a source of error or a consequence of the finiteness of calculator and computer arithmetic.
Example. Define
f(x) = x (√(x + 1) - √x)
and consider evaluating it on a 6-digit decimal calculator which uses rounded arithmetic.

    x        Computed f(x)   True f(x)   Error
    1        0.4142210       0.414214    -7.0000e-006
    10       1.54340         1.54347      7.0000e-005
    100      4.99000         4.98756     -0.0024
    1000     15.8000         15.8074      0.0074
    10000    50.0000         49.9988     -0.0012
    100000   100.000         158.113      58.1130
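The standard cure here (not spelled out at this point in the notes) is to rationalize: f(x) = x / (√(x + 1) + √x), which is algebraically identical but avoids subtracting nearly equal numbers. A short Python check (our own):

    import math

    def f_naive(x):
        return x * (math.sqrt(x + 1) - math.sqrt(x))   # cancellation for large x

    def f_stable(x):
        return x / (math.sqrt(x + 1) + math.sqrt(x))   # same value, no subtraction

    print(f_naive(1e15), f_stable(1e15))   # the naive form loses most of its digits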
Loss of significance errors
Example. Define
g(x) = (1 - cos x) / x^2
and consider evaluating it on a 10-digit decimal calculator which uses rounded arithmetic.

    x         Computed g(x)    True g(x)       Error
    0.1       0.4995834700     0.4995834722     2.2000e-009
    0.01      0.4999960000     0.4999958333    -1.6670e-007
    0.001     0.5000000000     0.4999999583    -4.1700e-008
    0.0001    0.5000000000     0.4999999996    -4.0000e-010
    0.00001   0.0              0.5000000000     0.5
Consider one case, that of x = 0.001. Then on the calculator:
cos(0.001) = 0.9999994999
1 - cos(0.001) = 5.001 × 10^{-7}
(1 - cos(0.001)) / (0.001)^2 = 0.5001000000
The true answer is
g(0.001) = 0.4999999583
The relative error in our answer is
(0.4999999583 - 0.5001) / 0.4999999583 = -0.0001000417 / 0.4999999583 ≈ -0.0002
There are only 3 significant digits in the answer. How can such a straightforward and short calculation lead to such a large error (relative to the accuracy of the calculator)?
When two numbers are nearly equal and we subtract them, we suffer a "loss of significance error" in the calculation. In some cases, these errors can be quite subtle and difficult to detect; and even after they are detected, they may be difficult to fix.
The last example, fortunately, can be fixed in a number of ways. Easiest is to use the trigonometric identity
cos(2θ) = 2 cos^2(θ) - 1 = 1 - 2 sin^2(θ)
Let x = 2θ. Then
g(x) = (1 - cos x) / x^2 = 2 sin^2(x/2) / x^2 = (1/2) [ sin(x/2) / (x/2) ]^2
This latter formula, with x = 0.001, yields a computed value of 0.4999999584, nearly the true answer. We could also have used a Taylor polynomial for cos(x) around x = 0 to obtain a better approximation to g(x) for small values of x.
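A short Python comparison of the two formulas (our own illustration); in double precision the same cancellation appears, just for smaller x than in the 10-digit example above:

    import math

    def g_naive(x):
        return (1.0 - math.cos(x)) / x**2   # subtracts nearly equal numbers

    def g_stable(x):
        s = math.sin(x / 2.0) / (x / 2.0)
        return 0.5 * s * s                  # rewritten form, no cancellation

    for x in [1e-4, 1e-6, 1e-8]:
        print(x, g_naive(x), g_stable(x))   # g_stable stays near 0.5; g_naive degrades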
Another example
Evaluate e^{-5} using a Taylor polynomial approximation:
e^{-5} = 1 + (-5)/1! + (-5)^2/2! + (-5)^3/3! + (-5)^4/4! + (-5)^5/5! + (-5)^6/6! + ...
With n = 25, the error is
| ((-5)^{26} / 26!) e^c | ≤ 10^{-8}
Imagine calculating this polynomial using a computer with 4-digit decimal arithmetic and rounding. To make the point about cancellation more strongly, imagine that each of the terms in the above polynomial is calculated exactly and then rounded to the arithmetic of the computer. We add the terms exactly and then we round to four digits.
    Degree   Term        Sum       Degree   Term          Sum
    0         1.000       1.000    13       -0.1960       -0.04230
    1        -5.000      -4.000    14        0.7001e-1     0.02771
    2        12.50        8.500    15       -0.2334e-1     0.004370
    3       -20.83      -12.33     16        0.7293e-2     0.01166
    4        26.04       13.71     17       -0.2145e-2     0.009518
    5       -26.04      -12.33     18        0.5958e-3     0.01011
    6        21.70        9.370    19       -0.1568e-3     0.009957
    7       -15.50       -6.130    20        0.3920e-4     0.009996
    8         9.688       3.558    21       -0.9333e-5     0.009987
    9        -5.382      -1.824    22        0.2121e-5     0.009989
    10        2.691       0.8670   23       -0.4611e-6     0.009989
    11       -1.223      -0.3560   24        0.9670e-7     0.009989
    12        0.5097      0.1537   25       -0.1921e-7     0.009989

The true answer is 0.006738.
To understand more fully the source of the error, look at the numbers being added and their accuracy. For example,
(-5)^3 / 3! = -125/6 = -20.83
in the 4-digit decimal calculation, with an error of magnitude 0.00333. Note that this error in an intermediate step is of the same magnitude as the true answer 0.006738 being sought. Other similar errors are present in calculating the other terms, and together they cause a major error in the final answer.

General principle
Whenever a sum is being formed in which the final answer is much smaller than some of the terms being combined, a loss of significance error is occurring.
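The cancellation is easy to reproduce. The sketch below (our own, simulating 4-significant-digit rounding of each term) contrasts the alternating series with the cancellation-free alternative of computing e^{-5} as 1/e^5 from the all-positive series:

    import math

    def round4(t):
        # Round t to 4 significant decimal digits, mimicking the 4-digit machine.
        if t == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(t)))
        return round(t, 3 - e)

    terms = [(-5.0)**j / math.factorial(j) for j in range(26)]
    s = sum(round4(t) for t in terms)
    print(round4(s))     # about 0.009989, badly wrong
    print(math.exp(-5))  # 0.006738, the true value
    print(1.0 / sum(5.0**j / math.factorial(j) for j in range(26)))  # accurate: no cancellation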
Noise in function evaluation
Consider plotting the function
f(x) = (x - 1)^3 = x^3 - 3x^2 + 3x - 1 = -1 + x(3 + x(-3 + x))
[Figure: plot of f(x) for 0 ≤ x ≤ 2; the graph is a smooth cubic rising from -1 to 1.]
[Figure: the same function evaluated in floating point on a fine grid near x = 1; the computed values oscillate irregularly at roughly the 10^{-15} level instead of tracing a smooth curve.]
Whenever a function f(x) is evaluated, there are arithmetic operations carried out which involve rounding or chopping errors. This means that what the computer eventually returns as an answer contains noise. This noise is generally "random" and small, but it can affect the accuracy of other calculations which depend on f(x).
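This noise is easy to see for yourself; a small Python sketch (our own) evaluates the nested form of (x - 1)^3 on a fine grid near x = 1:

    # Evaluate p(x) = -1 + x*(3 + x*(-3 + x)) near x = 1; exactly this is (x - 1)^3,
    # but in floating point the computed values wobble at roughly the 1e-15 level.
    for k in range(-5, 6):
        x = 1.0 + k * 1e-7
        p = -1.0 + x * (3.0 + x * (-3.0 + x))
        print(f"{x:.7f}  {p: .3e}  (exact {(x - 1.0)**3: .3e})")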
Underflow errors
Consider evaluating
f(x) = x^{10}
for x near 0. When using IEEE single precision arithmetic, the smallest nonzero positive number expressible in normalized floating-point format is
m = 2^{-126} ≈ 1.18 × 10^{-38}
Thus f(x) will be set to zero if
x^{10} < m
|x| < m^{1/10}
|x| < 1.61 × 10^{-4}
-0.000161 < x < 0.000161
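A quick check with NumPy's float32 type (our own sketch; note that IEEE gradual underflow via subnormal numbers lets values shrink somewhat below 2^{-126} before they finally flush to zero):

    import numpy as np

    for x in [2e-4, 1.6e-4, 1e-5]:
        y = np.float32(x) ** 10   # computed in single precision
        print(x, y)               # (1e-5)**10 = 1e-50 underflows all the way to 0.0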
Overflow errors
Attempts to use numbers that are too large for the floating-point format will lead to overflow errors. These are generally fatal errors on most computers. With the IEEE floating-point format, overflow errors can be carried along as having a value of ±∞ or NaN, depending on the context. Usually an overflow error is an indication of a more significant problem or error in the program, and the user needs to be aware of such errors.
When using IEEE single precision arithmetic, the largest positive number expressible in normalized floating-point format is
M = 2^{128} (1 - 2^{-24}) ≈ 3.40 × 10^{38}
Thus, f(x) = x^{10} will overflow if
x^{10} > M
|x| > M^{1/10}
|x| > 7131.6
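And the overflow side, again with float32 (our own sketch):

    import numpy as np

    for x in [7000.0, 7200.0]:
        y = np.float32(x) ** 10   # single precision power
        print(x, y)               # 7200**10 exceeds 3.40e38 and becomes inf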
Numerical Analysis
conf.dr. Bostan Viorel
Fall 2010, Lecture 5
Propagation of errors
Example
Suppose you are solving
x^2 - 26x + 1 = 0
Using the quadratic formula, we have the true answers
r_{1,T} = 13 + √168,   r_{2,T} = 13 - √168
From a table of square roots, we take √168 ≈ 12.961. Since this is correctly rounded to 5 digits, we have
|√168 - 12.961| ≤ 0.0005
Then define
r_{1,A} = 13 + 12.961 = 25.961,   r_{2,A} = 13 - 12.961 = 0.039
Then for both roots,
|r_T - r_A| ≤ 0.0005
For the relative errors, however,
Rel(r_{1,A}) = (r_{1,T} - r_{1,A}) / r_{1,T} ≤ 0.0005 / 25.9605 ≈ 1.93 × 10^{-5}
Rel(r_{2,A}) = (r_{2,T} - r_{2,A}) / r_{2,T} ≤ 0.0005 / 0.0385 ≈ 0.0130
Why does r_{2,A} have such poor accuracy in comparison to r_{1,A}?
The answer is the loss of significance error involved in the subtraction 13 - 12.961 used for calculating r_{2,A}. Instead, use the mathematically equivalent formula
r_{2,A} = 1 / (13 + √168) ≈ 1 / 25.961 = 0.038519
This results in a much more accurate answer, at the expense of an additional division.
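A Python sketch of the stable approach (ours; the idea generalizes to any quadratic, since the product of the roots of x^2 + bx + c equals c):

    import math

    # Roots of x^2 - 26x + 1 = 0.  The product of the roots is the constant term 1,
    # so the small root can be computed as 1 / r1, avoiding the subtraction 13 - sqrt(168).
    r1 = 13.0 + math.sqrt(168.0)
    r2_naive = 13.0 - math.sqrt(168.0)   # loses digits to cancellation
    r2_stable = 1.0 / r1                 # mathematically equal, fully accurate
    print(r1, r2_naive, r2_stable)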
Propagation of errors
Errors in function evaluation
Suppose we are evaluating a function f(x) in the machine. Then the result is generally not f(x), but rather an approximation of it which we denote by f̃(x). Now suppose that we have a number x_A ≈ x_T. We want to calculate f(x_T), but instead we evaluate f̃(x_A). What can we say about the error in this latter computed quantity,
f(x_T) - f̃(x_A) ?
f(x_T) - f̃(x_A) = [f(x_T) - f(x_A)] + [f(x_A) - f̃(x_A)]
The quantity f(x_A) - f̃(x_A) is the "noise" in the evaluation of f(x_A) in the computer, and we will return later to some discussion of it.
The quantity f(x_T) - f(x_A) is called the propagated error. It is the error that results from using perfect arithmetic in the evaluation of the function.
If the function f(x) is differentiable, then we can use the mean-value theorem to write
f(x_T) - f(x_A) = f'(ξ)(x_T - x_A)
for some ξ between x_T and x_A.
Since usually x_T and x_A are close together, ξ is close to either of them, and
f(x_T) - f(x_A) = f'(ξ)(x_T - x_A) ≈ f'(x_T)(x_T - x_A) ≈ f'(x_A)(x_T - x_A)
Example
Define f(x) = b^x, where b is a positive real number. Then the last formula yields
b^{x_T} - b^{x_A} ≈ (ln b) b^{x_T} (x_T - x_A)
Rel(b^{x_A}) ≈ (ln b) b^{x_T} (x_T - x_A) / b^{x_T} = (ln b)(x_T - x_A) · x_T / x_T = x_T ln b · Rel(x_A) = K · Rel(x_A)
Note that if K = 10^4 and Rel(x_A) = 10^{-7}, then Rel(b^{x_A}) ≈ 10^{-3}. This is a large decrease in accuracy, and it is independent of how we actually calculate b^x. The number K = x_T ln b is called a condition number for the computation.
Summation
Let S be a sum with a relatively large number of terms

S = a_1 + a_2 + . . . + a_n     (1)

where the a_j, j = 1, . . . , n, are floating point numbers. The
summation process consists of n − 1 consecutive additions

S = (. . . ((a_1 + a_2) + a_3) + . . . + a_(n−1)) + a_n

Define

S_2 = fl(a_1 + a_2)
S_3 = fl(S_2 + a_3)
S_4 = fl(S_3 + a_4)
. . .
S_n = fl(S_(n−1) + a_n)

Recall the formula

fl(x) = x(1 + ε)
Summation

S_2 = (a_1 + a_2)(1 + ε_2)
S_3 = (S_2 + a_3)(1 + ε_3)
S_4 = (S_3 + a_4)(1 + ε_4)
. . .
S_n = (S_(n−1) + a_n)(1 + ε_n)

Then

S_3 = (S_2 + a_3)(1 + ε_3)
    = ((a_1 + a_2)(1 + ε_2) + a_3)(1 + ε_3)
    ≈ (a_1 + a_2 + a_3) + a_1(ε_2 + ε_3) + a_2(ε_2 + ε_3) + a_3 ε_3
Summation
Similarly,

S_4 ≈ (a_1 + a_2 + a_3 + a_4) + a_1(ε_2 + ε_3 + ε_4)
    + a_2(ε_2 + ε_3 + ε_4) + a_3(ε_3 + ε_4) + a_4 ε_4

Finally,

S_n ≈ (a_1 + a_2 + . . . + a_n) + a_1(ε_2 + . . . + ε_n)
    + a_2(ε_2 + . . . + ε_n) + a_3(ε_3 + . . . + ε_n)
    + a_4(ε_4 + . . . + ε_n) + . . . + a_n ε_n
Summation
We are interested in the error S − S_n:

S − S_n ≈ −a_1(ε_2 + . . . + ε_n) − a_2(ε_2 + . . . + ε_n) − a_3(ε_3 + . . . + ε_n)
        − a_4(ε_4 + . . . + ε_n) − . . . − a_n ε_n

From the last relation we can establish a strategy for summation that
minimizes the error S − S_n: initially rearrange the terms in
increasing order,

|a_1| ≤ |a_2| ≤ |a_3| ≤ . . . ≤ |a_n|

In this case the smaller numbers a_1 and a_2 are multiplied by the
larger factors ε_2 + . . . + ε_n, and the larger number a_n is
multiplied by the smaller factor ε_n.
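The effect of the ordering is easy to observe in single precision. The
following Python experiment (our own illustration, not code from the
lecture) sums the same terms smallest-to-largest and largest-to-smallest:

import numpy as np

# n terms of a slowly decaying series, stored in single precision
a = (1.0 / np.arange(1, 100001, dtype=np.float64)).astype(np.float32)
exact = float(np.sum(a.astype(np.float64)))    # double-precision reference

asc = np.sort(a)                               # increasing magnitude
s_sl = np.float32(0.0)
for t in asc:                                  # smallest-to-largest
    s_sl += t
s_ls = np.float32(0.0)
for t in asc[::-1]:                            # largest-to-smallest
    s_ls += t

print(abs(exact - s_sl), abs(exact - s_ls))    # SL error is typically smaller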
Summation with chopping
(SL: terms summed smallest-to-largest; LS: largest-to-smallest)

Number of terms, n   Exact value   SL      Error   LS      Error
  10                 2.929         2.928   0.001   2.927   0.002
  25                 3.816         3.813   0.003   3.806   0.010
  50                 4.499         4.491   0.008   4.470   0.020
 100                 5.187         5.170   0.017   5.142   0.045
 200                 5.878         5.841   0.037   5.786   0.092
 500                 6.793         6.692   0.101   6.569   0.224
1000                 7.486         7.284   0.202   7.069   0.417
Summation with rounding
(SL: terms summed smallest-to-largest; LS: largest-to-smallest)

Number of terms, n   Exact value   SL      Error   LS      Error
  10                 2.929         2.929   0       2.929   0
  25                 3.816         3.816   0       3.817   0.001
  50                 4.499         4.500   0.001   4.498   0.001
 100                 5.187         5.187   0       5.187   0
 200                 5.878         5.878   0       5.876   0.002
 500                 6.793         6.794   0.001   6.783   0.010
1000                 7.486         7.486   0       7.449   0.037
Numerical Analysis
conf.dr. Bostan Viorel
Fall 2010 Lecture 6
Rootfinding
We want to find the numbers x for which

f(x) = 0

with f : [a, b] → R a given real-valued function. Here, we denote
such roots or zeroes by the Greek letter α. So

f(α) = 0

Rootfinding problems occur in many contexts. Sometimes they are
a direct formulation of some physical situation, but more often,
they are an intermediate step in solving a much larger problem.
Bisection method
Most methods for solving f(x) = 0 are iterative methods. This
means that such a method, given an initial guess x_0, will provide us
with a sequence of consecutively computed solutions
x_1, x_2, x_3, . . . , x_n, . . . such that x_n → α.

We begin with the simplest of such methods, one which most
people use at some time.

Suppose we are given a function f(x) and we assume we have an
interval [a, b] containing the root, on which the function is
continuous.

We also assume we are given an error tolerance ε > 0, and we
want an approximate root α̃ ∈ [a, b] for which

|α − α̃| < ε
Bisection method
The bisection method is based on the following theorem:

Theorem
If f : [a, b] → R is a continuous function on the closed and bounded
interval [a, b] and

f(a) · f(b) < 0

then there exists α ∈ [a, b] such that f(α) = 0.

Therefore, further assume that the function f(x) changes sign on
[a, b].
Bisection method
Bisection Algorithm: Bisect(f, a, b, ε)

Step 1: Define
    c = (a + b)/2
Step 2: If b − c ≤ ε, accept c as our root, and then stop.
Step 3: If b − c > ε, then compare the sign of f(c) to that of
    f(a) and f(b). If
        sign(f(b)) · sign(f(c)) ≤ 0
    then replace a with c; and otherwise, replace b with c.
    Return to Step 1.

Note that we prefer checking the sign using the condition
sign(f(b)) · sign(f(c)) ≤ 0 instead of using f(b) · f(c) ≤ 0, since the
latter product can underflow or overflow.
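A direct transcription of the three steps into Python (a sketch; the
function name and its signature are ours):

import math

def bisect(f, a, b, eps):
    fb = f(b)
    while True:
        c = (a + b) / 2.0                        # Step 1
        if b - c <= eps:                         # Step 2: |alpha - c| <= eps
            return c
        fc = f(c)                                # Step 3: keep the half with a sign change
        if math.copysign(1.0, fb) * math.copysign(1.0, fc) <= 0:
            a = c                                # sign change on [c, b]
        else:
            b, fb = c, fc                        # sign change on [a, c]

For the example used below: bisect(lambda x: x**6 - x - 1, 1, 2, 0.001).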
Bisection method

[Figure: two steps of bisection on y = f(x): the first midpoint c_1 of
[a_1, b_1] becomes the new left endpoint a_2 (with b_2 = b_1), and c_2
is the midpoint of [a_2, b_2].]
Bisection method
Example
Consider the function

f(x) = x^6 − x − 1

We want to find the largest root with an accuracy of ε = 0.001. It can
be seen from the graph of the function that the root is located in
[1, 2]. Also, note that the function is continuous. Let a = 1 and
b = 2; then f(a) = −1 and f(b) = 61, consequently the function
changes its sign and thus all conditions are satisfied.
Bisection method

 n    a_n       b_n       c_n        f(c_n)        b_n − c_n
 1    1.00000   2.00000   1.50000     8.891e+00    5.000e−01
 2    1.00000   1.50000   1.25000     1.565e+00    2.500e−01
 3    1.00000   1.25000   1.12500    −9.771e−02    1.250e−01
 4    1.12500   1.25000   1.18750     6.167e−01    6.250e−02
 5    1.12500   1.18750   1.15625     2.333e−01    3.125e−02
 6    1.12500   1.15625   1.14063     6.158e−02    1.563e−02
 7    1.12500   1.14063   1.13281    −1.958e−02    7.813e−03
 8    1.13281   1.14063   1.13672     2.062e−02    3.906e−03
 9    1.13281   1.13672   1.13477     4.268e−04    1.953e−03
10    1.13281   1.13477   1.13379    −9.598e−03    9.766e−04
Error analysis for bisection method
Let a_n, b_n and c_n be the values provided by the bisection method at
iteration n. Evidently,

b_(n+1) − a_(n+1) = (1/2)(b_n − a_n)

b_n − a_n = (1/2)(b_(n−1) − a_(n−1))
          = (1/2^2)(b_(n−2) − a_(n−2))
          = . . .
          = (1/2^(n−1))(b − a)

Since either α ∈ [a_n, c_n] or α ∈ [c_n, b_n] we have

|α − c_n| ≤ c_n − a_n = b_n − c_n = (1/2)(b_n − a_n) = (1/2^n)(b − a)
Error analysis for bisection method

|α − c_n| ≤ (1/2^n)(b − a)

This relation provides us with a stopping criterion for the bisection
method. Moreover, it follows that c_n → α as n → ∞.

Suppose we want to estimate the number of iterations in the bisection
method necessary to find the root with an error tolerance ε:

|α − c_n| ≤ (1/2^n)(b − a) ≤ ε  ⟺  n ≥ ln((b − a)/ε) / ln 2

For the previous example we get

n ≥ ln(1/0.001) / ln 2 ≈ 9.97
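The same estimate in Python (our own check):

import math
a, b, eps = 1.0, 2.0, 0.001
n = math.log((b - a) / eps) / math.log(2.0)
print(n, math.ceil(n))    # 9.97 -> 10 iterations, as in the table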
Advantages and Disadvantages of Bisection method
Advantages:
1. It always converges.
2. You have a guaranteed error bound, and it decreases with
each successive iteration.
3. You have a guaranteed rate of convergence. The error bound
decreases by 1/2 with each iteration.
Disadvantages:
1. It is relatively slow when compared with other rootfinding
methods we will study, especially when the function f(x) has
several continuous derivatives about the root α.
2. The algorithm has no check to see whether the ε is too small
for the computer arithmetic being used.
We also assume the function f(x) is continuous on the given
interval [a, b]; but there is no way for the computer to confirm this.
Rootfinding
We want to find the root α of a given function f(x). Thus we want
to find the point x at which the graph of y = f(x) intersects the
x-axis. One of the principles of numerical analysis is the following.

Numerical Analysis Principle
If you cannot solve the given problem, then solve a "nearby
problem".

How do we obtain a nearby problem for f(x) = 0?
Begin first by asking for types of problems which we can solve
easily. At the top of the list should be that of finding where a
straight line intersects the x-axis.
Thus we seek to replace f(x) = 0 by that of solving p(x) = 0 for
some linear polynomial p(x) that approximates f(x) in the vicinity
of the root α.
Rootfinding

[Figure: the tangent line to y = f(x) at (x_0, f(x_0)) crosses the
x-axis at x_1.]
Newton's method
Let x_0 be an initial guess, sufficiently close to the root α.
Consider the tangent line to the graph of f(x) at (x_0, f(x_0)).
The tangent intersects the x-axis at x_1, a point closer to α. The
tangent has equation

p_1(x) = f(x_0) + f′(x_0)(x − x_0)

Since p_1(x_1) = 0 we get

f(x_0) + f′(x_0)(x_1 − x_0) = 0

x_1 = x_0 − f(x_0)/f′(x_0)

Similarly, we get x_2,

x_2 = x_1 − f(x_1)/f′(x_1)
Newton's method
Repeat this process to obtain the sequence x_1, x_2, x_3, . . . that
hopefully will converge to α.

The general scheme for Newton's method consists in:
Starting with the initial guess x_0, compute iteratively

x_(n+1) = x_n − f(x_n)/f′(x_n),   n = 0, 1, 2, . . .
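As a short Python sketch (the function name is ours):

def newton(f, df, x0, n_iter):
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / df(x)        # x_{n+1} = x_n - f(x_n)/f'(x_n)
    return x

Applied to the example that follows (f(x) = x^6 - x - 1, x0 = 1.5),
newton(lambda x: x**6 - x - 1, lambda x: 6*x**5 - 1, 1.5, 6) returns
about 1.13472414.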
Newton's method
Example
Apply Newton's method to

f(x) = x^6 − x − 1,   f′(x) = 6x^5 − 1

to get

x_(n+1) = x_n − (x_n^6 − x_n − 1)/(6x_n^5 − 1),   n ≥ 0

Use initial guess x_0 = 1.5.
Newton's method

 n   x_n          f(x_n)      x_n − x_(n−1)   α − x_(n−1)
 0   1.50000000   8.89e+00
 1   1.30049088   2.54e+00    −2.00e−01       −3.65e−01
 2   1.18148042   5.38e−01    −1.19e−01       −1.66e−01
 3   1.13945559   4.92e−02    −4.20e−02       −4.68e−02
 4   1.13477763   5.50e−04    −4.68e−03       −4.73e−03
 5   1.13472415   7.11e−08    −5.35e−05       −5.35e−05
 6   1.13472414   1.55e−15    −6.91e−09       −6.91e−09

True solution is α = 1.134724138.
Newton's method. Division example
Here we consider a division algorithm (based on Newton's method)
implemented in some computers in the past. Say, we are interested
in computing a/b = a · (1/b), where 1/b is computed using Newton's
method. Take

f(x) = b − 1/x = 0,

with b positive. The root of this equation is α = 1/b. Then

f′(x) = 1/x^2

and Newton's method for this problem becomes

x_(n+1) = x_n − (b − 1/x_n)/(1/x_n^2)

Simplifying,

x_(n+1) = x_n(2 − b x_n),   n ≥ 0
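In Python the iteration looks as follows (our illustration); note that it
uses only multiplications and subtractions, which is the point of the
algorithm:

def reciprocal(b, x0, n_iter=6):
    x = x0
    for _ in range(n_iter):
        x = x * (2.0 - b * x)    # no division anywhere
    return x

print(reciprocal(3.0, 0.3))      # ~0.333333..., since 0 < x0 < 2/b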
Newton's method. Division example
The initial guess x_0 must be close enough to the true solution, and of
course x_0 > 0. Consider the error

α − x_(n+1) = 1/b − x_(n+1) = (1 − b x_(n+1))/b
            = (1 − b x_n(2 − b x_n))/b
            = (1 − b x_n)^2 / b

On the other hand

Rel(x_(n+1)) = (α − x_(n+1))/α = 1 − b x_(n+1)
Newton's method. Division example
It can be shown (try it!) that

Rel(x_(n+1)) = (Rel(x_n))^2

In order to guarantee convergence x_n → α, we need

|Rel(x_0)| < 1

or

0 < x_0 < 2/b

For example, suppose that |Rel(x_0)| = 0.1. Then

Rel(x_1) = 10^(−2),  Rel(x_2) = 10^(−4)
Rel(x_3) = 10^(−8),  Rel(x_4) = 10^(−16)
Newton's method. Division example

[Figure: the graph of y = b − 1/x with root 1/b; the tangent at
(x_0, f(x_0)) crosses the x-axis at x_1, with the convergence interval
(0, 2/b) marked.]
Error analysis for Newton's method
Let f(x) ∈ C^2[a, b] and α ∈ [a, b]. Also let f′(α) ≠ 0. Consider the
Taylor formula for f(x) about x_n:

f(x) = f(x_n) + (x − x_n)f′(x_n) + ((x − x_n)^2 / 2) f″(ξ_n),

where ξ_n is between x and x_n. Take x = α to get

f(α) = f(x_n) + (α − x_n)f′(x_n) + ((α − x_n)^2 / 2) f″(ξ_n),

with ξ_n between α and x_n. Since f(α) = 0 we have

0 = f(x_n)/f′(x_n) + (α − x_n) + (α − x_n)^2 f″(ξ_n) / (2f′(x_n))

α − x_(n+1) = −(α − x_n)^2 f″(ξ_n) / (2f′(x_n))
Error analysis for Newton's method
For the previous example, f″(x) = 30x^4. We have

f″(ξ_n)/(2f′(x_n)) ≈ f″(α)/(2f′(α)) = 30α^4 / (2(6α^5 − 1)) ≈ 2.42

Therefore

α − x_(n+1) ≈ −2.42(α − x_n)^2

For example, if n = 3, we get α − x_3 ≈ −4.73e−03 and

α − x_4 ≈ −2.42(α − x_3)^2 ≈ −5.42e−05,

a result in accordance with the result presented in the table:
α − x_4 ≈ −5.35e−05.
Error analysis for Newton's method
If the iterate x_n is close to α we have

f″(ξ_n)/(2f′(x_n)) ≈ f″(α)/(2f′(α)) = M

α − x_(n+1) ≈ −M(α − x_n)^2

|M(α − x_(n+1))| ≈ |M(α − x_n)|^2

Inductively,

|M(α − x_n)| ≈ |M(α − x_0)|^(2^n),   n ≥ 0

In other words, in order to guarantee the convergence of Newton's
method we should have

|M(α − x_0)| < 1

|α − x_0| < 1/|M| = |2f′(α)/f″(α)|
For x_n close to α, and therefore ξ_n also close to α,
we have

α − x_(n+1) ≈ −(f″(α) / (2f′(α))) (α − x_n)^2

Thus Newton's method is quadratically convergent,
provided f′(α) ≠ 0 and f(x) is twice differentiable in
the vicinity of the root α.
We can also use this to explore the interval of convergence
of Newton's method. Write the above as

α − x_(n+1) ≈ −M (α − x_n)^2,   M = f″(α) / (2f′(α))

Multiply both sides by −M to get

−M (α − x_(n+1)) ≈ [−M (α − x_n)]^2

|M (α − x_(n+1))| ≈ |M (α − x_n)|^2

Then we want these quantities to decrease; and this
suggests choosing x_0 so that

|M (α − x_0)| < 1

|α − x_0| < 1/|M| = |2f′(α) / f″(α)|
[Figure: the secant line through (x_0, f(x_0)) and (x_1, f(x_1)) on
y = f(x) crosses the x-axis at x_2.]

[Figure: y = f(x) together with the linear interpolant]

q(x) = [(x_1 − x) f(x_0) + (x − x_0) f(x_1)] / (x_1 − x_0)
This is linear in x; and by direct evaluation, it satisfies
the interpolation conditions of (*). We now solve
the equation q(x) = 0, denoting the root by x_2. This
yields

x_2 = x_1 − f(x_1) · (x_1 − x_0) / (f(x_1) − f(x_0))
We can now repeat the process. Use x_1 and x_2 to
produce another secant line, and then use its root
to approximate α. This yields the general iteration
formula

x_(n+1) = x_n − f(x_n) · (x_n − x_(n−1)) / (f(x_n) − f(x_(n−1))),   n = 1, 2, 3, . . .

This is called the secant method for solving f(x) = 0.
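In Python (a sketch; the function name is ours). Note that only one
new function value is computed per pass, a point we return to below:

def secant(f, x0, x1, n_iter):
    f0, f1 = f(x0), f(x1)
    for _ in range(n_iter):
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, f(x1)       # one new evaluation per iteration
    return x1

Here secant(lambda x: x**6 - x - 1, 2.0, 1.0, 8) gives about
1.13472414, reproducing the table below.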
Example. We solve the equation

f(x) ≡ x^6 − x − 1 = 0

which was used previously as an example for both the
bisection and Newton methods. The quantity x_n − x_(n−1)
is used as an estimate of α − x_(n−1). The iterate
x_8 equals α rounded to nine significant digits. As with
Newton's method for this equation, the initial iterates
do not converge rapidly. But as the iterates become
closer to α, the speed of convergence increases.
n   x_n           f(x_n)      x_n − x_(n−1)   α − x_(n−1)
0   2.0           61.0
1   1.0           −1.0        −1.0            −8.65e−1
2   1.01612903    −9.15e−1     1.61e−2         1.35e−1
3   1.19057777     6.57e−1     1.74e−1         1.19e−1
4   1.11765583    −1.68e−1    −7.29e−2        −5.59e−2
5   1.13253155    −2.24e−2     1.49e−2         1.71e−2
6   1.13481681     9.54e−4     2.29e−3         2.19e−3
7   1.13472365    −5.07e−6    −9.32e−5        −9.27e−5
8   1.13472414     1.13e−9     4.92e−7         4.92e−7
It is clear from the numerical results that the secant
method requires more iterates than the Newton
method. But note that the secant method does
not require a knowledge of f′(x), whereas Newton's
method requires both f(x) and f′(x).
Note also that the secant method can be considered
an approximation of the Newton method

x_(n+1) = x_n − f(x_n)/f′(x_n)

by using the approximation

f′(x_n) ≈ (f(x_n) − f(x_(n−1))) / (x_n − x_(n−1))
CONVERGENCE ANALYSIS

With a combination of algebraic manipulation and the
mean-value theorem from calculus, we can show

α − x_(n+1) = −(α − x_n)(α − x_(n−1)) [ f″(ξ_n) / (2f′(ζ_n)) ],   (**)

with ξ_n and ζ_n unknown points. The point ξ_n is
located between the minimum and maximum of x_(n−1), x_n,
and α; and ζ_n is located between the minimum and
maximum of x_(n−1) and x_n. Recall for Newton's method
that the Newton iterates satisfied

α − x_(n+1) = −(α − x_n)^2 [ f″(ξ_n) / (2f′(x_n)) ]

which closely resembles (**) above.
Using (**), it can be shown that x_n converges to α,
and moreover,

lim_(n→∞) |α − x_(n+1)| / |α − x_n|^r = | f″(α) / (2f′(α)) |^(r−1) ≡ c

where r = (1 + √5)/2 ≈ 1.62. This assumes that x_0
and x_1 are chosen sufficiently close to α; and how
close this is will vary with the function f. In addition,
the above result assumes f(x) has two continuous
derivatives for all x in some interval about α.
The above says that when we are close to α,

|α − x_(n+1)| ≈ c |α − x_n|^r

This looks very much like the Newton result

α − x_(n+1) ≈ −M (α − x_n)^2,   M = f″(α) / (2f′(α))

and c = |M|^(r−1). Both the secant and Newton methods
converge at faster than a linear rate, and they are
called superlinear methods.
The secant method converges more slowly than Newton's
method; but it is still quite rapid. It is rapid enough
that we can prove

lim_(n→∞) |x_(n+1) − x_n| / |α − x_n| = 1

and therefore,

|α − x_n| ≈ |x_(n+1) − x_n|

is a good error estimator.
A note of warning: Do not combine the secant formula
and write it in the form

x_(n+1) = ( f(x_n) x_(n−1) − f(x_(n−1)) x_n ) / ( f(x_n) − f(x_(n−1)) )

This has enormous loss of significance errors as
compared with the earlier formulation.
COSTS OF SECANT & NEWTON METHODS

The Newton method

x_(n+1) = x_n − f(x_n)/f′(x_n),   n = 0, 1, 2, . . .

requires two function evaluations per iteration, that
of f(x_n) and f′(x_n). The secant method

x_(n+1) = x_n − f(x_n) · (x_n − x_(n−1)) / (f(x_n) − f(x_(n−1))),   n = 1, 2, 3, . . .

requires one function evaluation per iteration, following
the initial step.

For this reason, the secant method is often faster in
time, even though more iterates are needed with it
than with Newton's method to attain a similar accuracy.
ADVANTAGES & DISADVANTAGES

Advantages of the secant method:
1. It converges at faster than a linear rate, so that it is
more rapidly convergent than the bisection method.
2. It does not require use of the derivative of the function,
something that is not available in a number of applications.
3. It requires only one function evaluation per iteration,
as compared with Newton's method which requires two.

Disadvantages of the secant method:
1. It may not converge.
2. There is no guaranteed error bound for the computed iterates.
3. It is likely to have difficulty if f′(α) = 0. This means
the x-axis is tangent to the graph of y = f(x) at x = α.
4. Newton's method generalizes more easily to new methods
for solving simultaneous systems of nonlinear equations.
BRENT'S METHOD

Richard Brent devised a method combining the advantages
of the bisection method and the secant method.
1. It is guaranteed to converge.
2. It has an error bound which will converge to zero in practice.
3. For most problems f(x) = 0, with f(x) differentiable about
the root α, the method behaves like the secant method.
4. In the worst case, its convergence is not too much worse
than that of the bisection method.
In Matlab, it is implemented as fzero; and it is present
in most Fortran numerical analysis libraries.
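For instance, in Python the SciPy library exposes Brent's method
(assuming SciPy is installed; this plays the role of Matlab's fzero):

from scipy.optimize import brentq

root = brentq(lambda x: x**6 - x - 1, 1.0, 2.0)
print(root)    # about 1.13472414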
FIXED POINT ITERATION

We begin with a computational example. Consider
solving the two equations

E1: x = 1 + 0.5 sin x
E2: x = 3 + 2 sin x

Graphs of these two equations are shown on accompanying
graphs, with the solutions being

E1: α = 1.49870113351785
E2: α = 3.09438341304928

We are going to use a numerical scheme called fixed
point iteration. It amounts to making an initial guess
of x_0 and substituting this into the right side of the
equation. The resulting value is denoted by x_1; and
then the process is repeated, this time substituting x_1
into the right side. This is repeated until convergence
occurs or until the iteration is terminated.

In the above cases, we show the results of the first 10
iterations in the accompanying table. Clearly convergence
is occurring with E1, but not with E2. Why?
[Figures: graphs of y = x together with y = 1 + 0.5 sin x (E1) and
with y = 3 + 2 sin x (E2); the solutions are the intersection points.]

E1: x = 1 + 0.5 sin x
E2: x = 3 + 2 sin x
      E1                  E2
 n    x_n                 x_n
 0    0.00000000000000    3.00000000000000
 1    1.00000000000000    3.28224001611973
 2    1.42073549240395    2.71963177181556
 3    1.49438099256432    3.81910025488514
 4    1.49854088439917    1.74629389651652
 5    1.49869535552190    4.96927957214762
 6    1.49870092540704    1.06563065299216
 7    1.49870112602244    4.75018861639465
 8    1.49870113324789    1.00142864236516
 9    1.49870113350813    4.68448404916097
10    1.49870113351750    1.00077863465869
The above iterations can be written symbolically as

E1: x_(n+1) = 1 + 0.5 sin x_n
E2: x_(n+1) = 3 + 2 sin x_n

for n = 0, 1, 2, . . . Why does one of these iterations
converge, but not the other? The graphs show similar
behaviour, so why the difference? Consider one more
example:
Suppose we are solving the equation

x^2 − 5 = 0

with exact root α = √5 ≈ 2.2361, using iterates of the form

x_(n+1) = g(x_n).

Consider four different iterations:

I1: x_(n+1) = 5 + x_n − x_n^2
I2: x_(n+1) = 5 / x_n
I3: x_(n+1) = 1 + x_n − (1/5) x_n^2
I4: x_(n+1) = (1/2) (x_n + 5/x_n)

All of them, in case they are convergent, will converge
to α = √5 (just take the limit as n → ∞ of each relation).
       I1             I2    I3       I4
 n     x_n            x_n   x_n      x_n
 0     1.0e+00        1.0   1.0      1.0
 1     5.0000e+00     5.0   1.8000   3.0000
 2    −1.5000e+01     1.0   2.1520   2.3333
 3    −2.3500e+02     5.0   2.2258   2.2381
 4    −5.5455e+04     1.0   2.2350   2.2361
 5    −3.0753e+09     5.0   2.2360   2.2361
 6    −9.4575e+18     1.0   2.2361   2.2361
 7    −8.9445e+37     5.0   2.2361   2.2361
 8    −8.0004e+75     1.0   2.2361   2.2361
As another example, note that the Newton method

x_(n+1) = x_n − f(x_n)/f′(x_n)

is also a fixed point iteration, for the equation

x = x − f(x)/f′(x)

In general, we are interested in solving equations

x = g(x)

by means of fixed point iteration:

x_(n+1) = g(x_n),   n = 0, 1, 2, . . .

It is called fixed point iteration because the root α is
a fixed point of the function g(x), meaning that α is a
number for which

g(α) = α
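The scheme is only a few lines of Python (our own demo, reproducing
the behaviour of E1 and E2 above):

import math

def fixed_point(g, x0, n_iter):
    x = x0
    for _ in range(n_iter):
        x = g(x)                  # x_{n+1} = g(x_n)
    return x

print(fixed_point(lambda x: 1 + 0.5 * math.sin(x), 0.0, 50))  # -> 1.49870113...
print(fixed_point(lambda x: 3 + 2.0 * math.sin(x), 3.0, 50))  # does not settle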
EXISTENCE THEOREM

We begin by asking whether the equation

x = g(x)

has a solution. For this to occur, the graphs of y = x
and y = g(x) must intersect, as seen on the earlier
graphs. There are several lemmas and theorems that give
conditions under which we are guaranteed there is a fixed
point α.

Lemma 1. Let g(x) be a continuous function on the
interval [a, b], and suppose it satisfies the property

a ≤ x ≤ b  ⟹  a ≤ g(x) ≤ b

Then the equation x = g(x) has at least one solution α
in the interval [a, b].

The proof of this is fairly intuitive. Look at the function
f(x) = x − g(x), a ≤ x ≤ b. Evaluating at the endpoints,
f(a) ≤ 0 and f(b) ≥ 0. The function f(x) is
continuous on [a, b]; and therefore it contains a zero in
the interval.
Theorem: Assume g(x) and g′(x) exist and are continuous
on the interval [a, b]; and further, assume

a ≤ x ≤ b  ⟹  a ≤ g(x) ≤ b

λ ≡ max_(a≤x≤b) |g′(x)| < 1

Then:

S1. The equation x = g(x) has a unique solution α in [a, b].

S2. For any initial guess x_0 in [a, b], the iteration

x_(n+1) = g(x_n),   n = 0, 1, 2, . . .

will converge to α.

S3.
|α − x_n| ≤ (λ^n / (1 − λ)) |x_1 − x_0|,   n ≥ 0

S4.
lim_(n→∞) (α − x_(n+1)) / (α − x_n) = g′(α)

Thus for x_n close to α,

α − x_(n+1) ≈ g′(α)(α − x_n)
The proof is given in the text, and I go over only a
portion of it here. For S2, note that from (#), if x_0
is in [a, b], then

x_1 = g(x_0)

is also in [a, b]. Repeat the argument to show that

x_2 = g(x_1)

belongs to [a, b]. This can be continued by induction
to show that every x_n belongs to [a, b].

We need the following general result. For any two
points w and z in [a, b],

g(w) − g(z) = g′(c)(w − z)

for some unknown point c between w and z. Therefore,

|g(w) − g(z)| ≤ λ |w − z|

for any a ≤ w, z ≤ b.
For S3, subtract x_(n+1) = g(x_n) from α = g(α) to get

α − x_(n+1) = g(α) − g(x_n)
            = g′(c_n)(α − x_n)     ($)

|α − x_(n+1)| ≤ λ |α − x_n|     (*)

with c_n between α and x_n. From (*), we have that
the error is guaranteed to decrease by a factor of λ
with each iteration. This leads to

|α − x_n| ≤ λ^n |α − x_0|,   n ≥ 0

With some extra manipulation, we can obtain the error
bound in S3.
For S4, use ($) to write

(α − x_(n+1)) / (α − x_n) = g′(c_n)

Since x_n → α and c_n is between α and x_n, we have
g′(c_n) → g′(α).
The statement

α − x_(n+1) ≈ g′(α)(α − x_n)

tells us that when near to the root α, the errors will
decrease by a constant factor of g′(α). If this is negative,
then the errors will oscillate between positive and
negative, and the iterates will be approaching α from
both sides. When g′(α) is positive, the iterates will
approach α from only one side.
The statements

α − x_(n+1) = g′(c_n)(α − x_n)
α − x_(n+1) ≈ g′(α)(α − x_n)

also tell us a bit more of what happens when

|g′(α)| > 1

Then the errors will increase as we approach the root
rather than decrease in size.
Look at the earlier examples

E1: x = 1 + 0.5 sin x
E2: x = 3 + 2 sin x

In the first case E1,

g(x) = 1 + 0.5 sin x
g′(x) = 0.5 cos x
|g′(α)| ≤ 1/2

Therefore the fixed point iteration

x_(n+1) = 1 + 0.5 sin x_n

will converge for E1.

For the second case E2,

g(x) = 3 + 2 sin x
g′(x) = 2 cos x
g′(α) = 2 cos(3.09438341304928) ≈ −1.998

Therefore the fixed point iteration

x_(n+1) = 3 + 2 sin x_n

will diverge for E2.
Consider the example x^2 − 5 = 0.

(I1) g(x) = 5 + x − x^2; g′(x) = 1 − 2x; g′(α) = 1 − 2√5 ≈ −3.47,
so |g′(α)| > 1. Thus, x_n = g(x_(n−1)) does not converge to √5.

(I2) g(x) = 5/x; g′(x) = −5/x^2; g′(α) = −1. Therefore,
x_n = g(x_(n−1)) can be either convergent or divergent, but the
numerical results show it divergent.

(I3) g(x) = 1 + x − (1/5)x^2; g′(x) = 1 − (2/5)x;
g′(α) = 1 − (2/5)√5 ≈ 0.106. Thus, x_n = g(x_(n−1)) converges
to √5. Moreover, we have

|α − x_(n+1)| ≈ 0.106 |α − x_n|,

if x_n is sufficiently close to α. The errors are decreasing
with a linear rate of 0.106.

(I4) g(x) = (1/2)(x + 5/x); g′(x) = (1/2)(1 − 5/x^2); g′(α) = 0.
The sequence x_n = g(x_(n−1)) will converge to √5, with
an order of convergence bigger than 1.
Sometimes it is difficult to express the equation f(x) = 0 in
the form x = g(x) such that the resulting iterates will
converge. Such a process is presented in the following
examples.
Example 1. Let x^4 − x − 1 = 0, rewritten as

x = (1 + x)^(1/4),

which provides us with the iterations

x_0 = 1,   x_(n+1) = (1 + x_n)^(1/4),   n ≥ 0

This sequence will converge to α ≈ 1.2207.
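A quick numerical check in Python (our own snippet):

x = 1.0
for _ in range(20):
    x = (1.0 + x) ** 0.25    # x_{n+1} = (1 + x_n)^(1/4)
print(x)                      # about 1.2207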
Example 2. Let x^3 + x − 1 = 0, rewritten as

x = 1 / (1 + x^2)

and its fixed point iterations

x_0 = 1,   x_(n+1) = 1 / (1 + x_n^2),   n ≥ 0

that will converge to α ≈ 0.6823. The iterations are represented
graphically in the following figure.
[Figure: cobweb plot of the iterates $x_0, x_1, x_2, x_3, \ldots$ for Example 2, converging to $\alpha = 0.6823$.]

[Figures: cobweb plots of the fixed point iteration against $y = x$ and $y = g(x)$ for the four cases $0 < g'(\alpha) < 1$, $-1 < g'(\alpha) < 0$, $g'(\alpha) > 1$, and $g'(\alpha) < -1$, showing one-sided convergence, oscillating convergence, and the two divergent cases.]
Besides convergence, we would like to know how fast the sequence $x_n = g(x_{n-1})$ converges to the solution $\alpha$; in other words, how fast the error $\alpha - x_n$ is decreasing.

We say that the sequence $\{x_n\}_{n=0}^{\infty}$ converges to $\alpha$ with order of convergence $p \ge 1$ if
$$|\alpha - x_{n+1}| \le c\,|\alpha - x_n|^p, \quad n \ge 0,$$
where $c \ge 0$ is a constant. The cases $p = 1$, $p = 2$, and $p = 3$ are called linear, quadratic, and cubic convergence. In the case of linear convergence, the constant $c$ is called the rate of linear convergence, and we require additionally that $c < 1$; otherwise the sequence of errors $\alpha - x_n$ can fail to converge to zero. Also, for linear convergence we can use the relation
$$|\alpha - x_n| \le c^n\,|\alpha - x_0|, \quad n \ge 0.$$
Thus the bisection method is linearly convergent with rate $\frac{1}{2}$, Newton's method is quadratically convergent, and the secant method has order of convergence $p = \frac{1 + \sqrt{5}}{2}$.
If
$$|g'(\alpha)| < 1 \quad (**)$$
then for any sufficiently small number $\varepsilon > 0$, the interval $[a, b] = [\alpha - \varepsilon, \alpha + \varepsilon]$ will satisfy the hypotheses of the preceding theorem.

This means that if (**) is true, and if we choose $x_0$ sufficiently close to $\alpha$, then the fixed point iteration $x_{n+1} = g(x_n)$ will converge and the earlier results S1-S4 will all hold. The corollary does not tell us how close we need to be to $\alpha$ in order to have convergence.
NEWTON'S METHOD
For Newton's method
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
we have that it is a fixed point iteration with
$$g(x) = x - \frac{f(x)}{f'(x)}$$
Check its convergence by checking the condition (**):
$$g'(x) = 1 - \frac{f'(x)}{f'(x)} + \frac{f(x)f''(x)}{[f'(x)]^2} = \frac{f(x)f''(x)}{[f'(x)]^2}$$
$$g'(\alpha) = 0$$
Therefore Newton's method will converge if $x_0$ is chosen sufficiently close to $\alpha$.
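A minimal sketch of Newton's method in this fixed point form, using the earlier example $f(x) = x^2 - 5$; the starting guess and iteration count are illustrative:

    % Newton's method: x_{n+1} = g(x_n) with g(x) = x - f(x)/f'(x).
    f  = @(x) x.^2 - 5;
    df = @(x) 2*x;
    x = 2.0;                    % initial guess (illustrative)
    for n = 1:8
        x = x - f(x)/df(x);     % one Newton step
    end
    x                           % converges rapidly to sqrt(5) = 2.23607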
HIGHER ORDER METHODS
What happens when $g'(\alpha) = 0$? We use Taylor's theorem to answer this question.

Begin by writing
$$g(x) = g(\alpha) + g'(\alpha)(x - \alpha) + \frac{1}{2}g''(c)(x - \alpha)^2$$
with $c$ between $x$ and $\alpha$. Substitute $x = x_n$ and recall that $g(x_n) = x_{n+1}$ and $g(\alpha) = \alpha$. Also assume $g'(\alpha) = 0$. Then
$$x_{n+1} = \alpha + \frac{1}{2}g''(c_n)(x_n - \alpha)^2$$
$$x_{n+1} - \alpha = \frac{1}{2}g''(c_n)(x_n - \alpha)^2$$
with $c_n$ between $\alpha$ and $x_n$. Thus if $g'(\alpha) = 0$, the fixed point iteration is quadratically convergent or better. In fact, if $g''(\alpha) \neq 0$, then the iteration is exactly quadratically convergent.
ANOTHER RAPID ITERATION
Newton's method is rapid, but requires use of the derivative $f'(x)$. Can we get by without this? The answer is yes! Consider the method
$$D_n = \frac{f(x_n + f(x_n)) - f(x_n)}{f(x_n)}$$
$$x_{n+1} = x_n - \frac{f(x_n)}{D_n}$$
This is an approximation to Newton's method, with $f'(x_n) \approx D_n$. To analyze its convergence, regard it as a fixed point iteration with
$$D(x) = \frac{f(x + f(x)) - f(x)}{f(x)}$$
$$g(x) = x - \frac{f(x)}{D(x)}$$
Then we can, with some difficulty, show $g'(\alpha) = 0$ and $g''(\alpha) \neq 0$. This proves the new iteration is quadratically convergent.
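This derivative-free scheme is known as Steffensen's method. A minimal sketch, again with the illustrative function $f(x) = x^2 - 5$ and an illustrative starting guess:

    % Steffensen's method: approximate f'(x_n) by D_n, then take a Newton-like step.
    f = @(x) x.^2 - 5;
    x = 2.0;                           % initial guess (illustrative)
    for n = 1:10
        fx = f(x);
        if fx == 0, break, end         % landed exactly on a root
        D  = (f(x + fx) - fx) / fx;    % D_n, a difference approximation to f'(x_n)
        x  = x - fx / D;               % Newton-like step using D_n
    end
    x                                  % converges quadratically to sqrt(5)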
FIXED POINT ITERATION: ERROR
Recall the result
$$\lim_{n \to \infty} \frac{\alpha - x_n}{\alpha - x_{n-1}} = g'(\alpha)$$
for the iteration
$$x_n = g(x_{n-1}), \quad n = 1, 2, \ldots$$
Thus
$$\alpha - x_n \approx \lambda\,(\alpha - x_{n-1}) \quad (***)$$
with $\lambda = g'(\alpha)$ and $|\lambda| < 1$.

If we were to know $\lambda$, then we could solve (***) for $\alpha$:
$$\alpha \approx \frac{x_n - \lambda x_{n-1}}{1 - \lambda}$$
Usually, we write this as a modification of the currently computed iterate $x_n$:
$$\alpha \approx \frac{x_n - \lambda x_{n-1}}{1 - \lambda} = \frac{x_n(1 - \lambda) + \lambda x_n - \lambda x_{n-1}}{1 - \lambda} = x_n + \frac{\lambda}{1 - \lambda}\left[x_n - x_{n-1}\right]$$
The formula
$$x_n + \frac{\lambda}{1 - \lambda}\left[x_n - x_{n-1}\right]$$
is said to be an extrapolation of the numbers $x_{n-1}$ and $x_n$. But what is $\lambda$?

From
$$\lim_{n \to \infty} \frac{\alpha - x_n}{\alpha - x_{n-1}} = g'(\alpha)$$
we have
$$\lambda \approx \frac{\alpha - x_n}{\alpha - x_{n-1}}$$
Unfortunately this also involves the unknown root $\alpha$ which we seek; and we must find some other way of estimating $\lambda$.
To calculate $\lambda$, consider the ratio
$$\lambda_n = \frac{x_n - x_{n-1}}{x_{n-1} - x_{n-2}}$$
To see that this is approximately $\lambda$ as $x_n$ approaches $\alpha$, write
$$\frac{x_n - x_{n-1}}{x_{n-1} - x_{n-2}} = \frac{g(x_{n-1}) - g(x_{n-2})}{x_{n-1} - x_{n-2}} = g'(c_n)$$
with $c_n$ between $x_{n-1}$ and $x_{n-2}$. As the iterates approach $\alpha$, the number $c_n$ must also approach $\alpha$. Thus $\lambda_n$ approaches $\lambda$ as $x_n \to \alpha$.
We combine these results to obtain the estimate
$$\widehat{x}_n = x_n + \frac{\lambda_n}{1 - \lambda_n}\left[x_n - x_{n-1}\right], \qquad \lambda_n = \frac{x_n - x_{n-1}}{x_{n-1} - x_{n-2}}$$
We call $\widehat{x}_n$ the Aitken extrapolate of $\{x_{n-2}, x_{n-1}, x_n\}$; and $\alpha \approx \widehat{x}_n$.

We can also rewrite this as
$$\alpha - x_n \approx \widehat{x}_n - x_n = \frac{\lambda_n}{1 - \lambda_n}\left[x_n - x_{n-1}\right]$$
This is called Aitken's error estimation formula.

The accuracy of these procedures is tied directly to the accuracy of the formulas
$$\alpha - x_n \approx \lambda\,(\alpha - x_{n-1}), \qquad \alpha - x_{n-1} \approx \lambda\,(\alpha - x_{n-2})$$
If these are accurate, then so are the above extrapolation and error estimation formulas.
EXAMPLE
Consider the iteration
$$x_{n+1} = 6.28 + \sin(x_n), \quad n = 0, 1, 2, \ldots$$
for solving
$$x = 6.28 + \sin x$$
Iterates are shown on the accompanying sheet, including calculations of $\lambda_n$ and the error estimate
$$\alpha - x_n \approx \widehat{x}_n - x_n = \frac{\lambda_n}{1 - \lambda_n}\left[x_n - x_{n-1}\right] \quad \text{(Estimate)}$$
The latter is called "Estimate" in the table. In this instance,
$$g'(\alpha) \doteq .9644$$
and therefore the convergence is very slow. This is apparent in the table.
AITKEN'S ALGORITHM
Step 1: Select $x_0$.
Step 2: Calculate
$$x_1 = g(x_0), \qquad x_2 = g(x_1)$$
Step 3: Calculate
$$x_3 = x_2 + \frac{\lambda_2}{1 - \lambda_2}\left[x_2 - x_1\right], \qquad \lambda_2 = \frac{x_2 - x_1}{x_1 - x_0}$$
Step 4: Calculate
$$x_4 = g(x_3), \qquad x_5 = g(x_4)$$
and calculate $x_6$ as the extrapolate of $\{x_3, x_4, x_5\}$. Continue this procedure, ad infinitum.

Of course, in practice we will have some kind of error test to stop this procedure when we believe we have sufficient accuracy.
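A minimal MATLAB/Octave sketch of this algorithm, applied to the example $x = 6.28 + \sin x$ treated below; the initial guess, tolerance, and iteration cap are illustrative choices:

    % Aitken's algorithm for x = g(x).
    g = @(x) 6.28 + sin(x);                 % example iteration function
    x = 6.0;                                % x0 (illustrative)
    for k = 1:10
        x1 = g(x);  x2 = g(x1);             % two fixed point steps
        if abs(x1 - x) < 1e-14, break, end  % guard: iterates have converged
        lam = (x2 - x1) / (x1 - x);         % lambda_n, estimates g'(alpha)
        x = x2 + lam/(1 - lam)*(x2 - x1);   % Aitken extrapolate
    end
    x                                       % about 6.0155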
EXAMPLE
Consider again the iteration
$$x_{n+1} = 6.28 + \sin(x_n), \quad n = 0, 1, 2, \ldots$$
for solving $x = 6.28 + \sin x$. Now we use the Aitken method, and the results are shown in the accompanying table. With this we have
$$\alpha - x_3 = 7.98 \times 10^{-4}, \qquad \alpha - x_6 = 2.27 \times 10^{-6}$$
In comparison, the original iteration had
$$\alpha - x_6 = 1.23 \times 10^{-2}$$
GENERAL COMMENTS
Aitken extrapolation can greatly accelerate the convergence of a linearly convergent iteration
$$x_{n+1} = g(x_n)$$
This shows the power of understanding the behaviour of the error in a numerical process. From that understanding, we can often improve the accuracy, through extrapolation or some other procedure.

This is a justification for using mathematical analysis to understand numerical methods. We will see this repeated at later points in the course, and it holds with many different types of problems and numerical methods for their solution.
MULTIPLE ROOTS
We study two classes of functions for which there is additional difficulty in calculating their roots. The first of these are functions in which the desired root has a multiplicity greater than 1. What does this mean?

Let $\alpha$ be a root of the function $f(x)$, and imagine writing it in the factored form
$$f(x) = (x - \alpha)^m h(x)$$
with some integer $m \ge 1$ and some continuous function $h(x)$ for which $h(\alpha) \neq 0$. Then we say that $\alpha$ is a root of $f(x)$ of multiplicity $m$. For example, the function
$$f(x) = e^{x^2} - 1$$
has $x = 0$ as a root of multiplicity $m = 2$. In particular, define
$$h(x) = \frac{e^{x^2} - 1}{x^2}$$
for $x \neq 0$. Using Taylor polynomial approximations, we can show for $x \neq 0$ that
$$h(x) \approx 1 + \frac{1}{2}x^2 + \frac{1}{6}x^4, \qquad \lim_{x \to 0} h(x) = 1$$
This leads us to extend the definition of $h(x)$ to
$$h(x) = \frac{e^{x^2} - 1}{x^2}, \quad x \neq 0; \qquad h(0) = 1$$
Thus
$$f(x) = x^2 h(x)$$
as asserted, and $x = 0$ is a root of $f(x)$ of multiplicity $m = 2$.
Roots for which $m = 1$ are called simple roots, and the methods studied to this point were intended for such roots. We now consider the case of $m > 1$.

If the function $f(x)$ is $m$-times differentiable around $\alpha$, then we can differentiate
$$f(x) = (x - \alpha)^m h(x)$$
$m$ times to obtain an equivalent formulation of what it means for the root to have multiplicity $m$.

For an example, consider the case
$$f(x) = (x - \alpha)^3 h(x)$$
Then
$$f'(x) = 3(x - \alpha)^2 h(x) + (x - \alpha)^3 h'(x) \equiv (x - \alpha)^2 h_2(x)$$
$$h_2(x) = 3h(x) + (x - \alpha)h'(x), \qquad h_2(\alpha) = 3h(\alpha) \neq 0$$
This shows $\alpha$ is a root of $f'(x)$ of multiplicity 2. Differentiating a second time, we can show
$$f''(x) = (x - \alpha)\,h_3(x)$$
for a suitably defined $h_3(x)$ with $h_3(\alpha) \neq 0$, and $\alpha$ is a simple root of $f''(x)$. Differentiating a third time, we have
$$f'''(\alpha) = h_3(\alpha) \neq 0$$
We can use this as part of a proof of the following: $\alpha$ is a root of $f(x)$ of multiplicity $m = 3$ if and only if
$$f(\alpha) = f'(\alpha) = f''(\alpha) = 0, \qquad f'''(\alpha) \neq 0$$
In general, $\alpha$ is a root of $f(x)$ of multiplicity $m$ if and only if
$$f(\alpha) = \cdots = f^{(m-1)}(\alpha) = 0, \qquad f^{(m)}(\alpha) \neq 0$$
DIFFICULTIES OF MULTIPLE ROOTS
There are two main difficulties with the numerical calculation of multiple roots (by which we mean $m > 1$ in the definition).

1. Methods such as Newton's method and the secant method converge more slowly than for the case of a simple root.

2. There is a large interval of uncertainty in the precise location of a multiple root on a computer or calculator.

The second of these is the more difficult to deal with, but we begin with the first for the case of Newton's method.
Recall that we can regard Newton's method as a fixed point method:
$$x_{n+1} = g(x_n), \qquad g(x) = x - \frac{f(x)}{f'(x)}$$
Then we substitute
$$f(x) = (x - \alpha)^m h(x)$$
to obtain
$$g(x) = x - \frac{(x - \alpha)^m h(x)}{m(x - \alpha)^{m-1}h(x) + (x - \alpha)^m h'(x)} = x - \frac{(x - \alpha)\,h(x)}{m\,h(x) + (x - \alpha)\,h'(x)}$$
Then we can use this to show
$$g'(\alpha) = 1 - \frac{1}{m} = \frac{m - 1}{m}$$
For $m > 1$, this is nonzero, and therefore Newton's method is only linearly convergent:
$$\alpha - x_{n+1} \approx \lambda\,(\alpha - x_n), \qquad \lambda = \frac{m - 1}{m}$$
Similar results hold for the secant method.
There are ways of improving the speed of convergence of Newton's method, creating a modified method that is again quadratically convergent. In particular, consider the fixed point iteration formula
$$x_{n+1} = g(x_n), \qquad g(x) = x - m\,\frac{f(x)}{f'(x)}$$
in which we assume we know the multiplicity $m$ of the root being sought. Then, modifying the above argument on the convergence of Newton's method, we obtain
$$g'(\alpha) = 1 - m \cdot \frac{1}{m} = 0$$
and the iteration method will be quadratically convergent.
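A minimal sketch of this modified Newton iteration; the test function $f(x) = (x-1)^2(x+2)$, which has a double root at $x = 1$, and the starting guess are illustrative choices:

    % Modified Newton: x_{n+1} = x_n - m f(x_n)/f'(x_n), m = known multiplicity.
    f  = @(x) (x-1).^2 .* (x+2);
    df = @(x) 2*(x-1).*(x+2) + (x-1).^2;
    m  = 2;                          % multiplicity of the root at x = 1
    x  = 1.5;                        % initial guess (illustrative)
    for n = 1:10
        fx = f(x);
        if fx == 0, break, end       % landed exactly on the root
        x = x - m * fx/df(x);        % modified Newton step
    end
    x                                % converges quadratically to 1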
But this is not the fundamental problem posed by
multiple roots.
NOISE IN FUNCTION EVALUATION
Recall the discussion of noise in evaluating a function $f(x)$, and in our case consider the evaluation for values of $x$ near to $\alpha$. In the following figures, the noise as measured by vertical distance is the same in both graphs.

[Figures: noisy evaluation of $f(x)$ near a simple root and near a double root.]

Noise was discussed earlier, and as an example we used the function
$$f(x) = x^3 - 3x^2 + 3x - 1 \equiv (x - 1)^3$$
Because of the noise in evaluating $f(x)$, it appears from the graph that $f(x)$ has many zeros around $x = 1$, whereas the exact function outside of the computer has only the root $\alpha = 1$, of multiplicity 3. Any rootfinding method to find a multiple root that uses evaluation of $f(x)$ is doomed to having a large interval of uncertainty as to the location of the root. If high accuracy is desired, then the only satisfactory solution is to reformulate the problem as a new problem $F(x) = 0$ in which $\alpha$ is a simple root of $F$. Then use a standard rootfinding method to calculate $\alpha$. It is important that the evaluation of $F(x)$ not involve $f(x)$ directly, as that is the source of the noise and the uncertainty.
EXAMPLE
Consider finding the roots of
$$f(x) = (x - 1.1)^3(x - 2.1) = 2.7951 - 8.954x + 10.56x^2 - 5.4x^3 + x^4$$
This has a root of multiplicity 3 at $x = 1.1$. Newton's method gives the iterates:

    n    x_n        f(x_n)    alpha - x_n   Rate
    0    0.800000   0.03510   0.300000
    1    0.892857   0.01073   0.207143      0.690
    2    0.958176   0.00325   0.141824      0.685
    3    1.00344    0.00099   0.09656       0.681
    4    1.03486    0.00029   0.06514       0.675
    5    1.05581    0.00009   0.04419       0.678
    6    1.07028    0.00003   0.02972       0.673
    7    1.08092    0.0       0.01908       0.642

From an examination of the rate of linear convergence of Newton's method applied to this function, one can guess with high probability that the multiplicity is $m = 3$. Then form exactly the second derivative
$$f''(x) = 21.12 - 32.4x + 12x^2$$
Applying Newton's method to this with a guess of $x = 1$ will lead to rapid convergence to $\alpha = 1.1$.

In general, if we know the root $\alpha$ has multiplicity $m > 1$, then replace the problem by that of solving
$$f^{(m-1)}(x) = 0$$
since $\alpha$ is a simple root of this equation.
STABILITY
Generally we expect the world to be stable. By this, we mean that if we make a small change in something, then we expect this to lead to other correspondingly small changes. In fact, if we think about this carefully, then we know this need not be true. We now illustrate this for the case of rootfinding.

Consider the polynomial
$$f(x) = x^7 - 28x^6 + 322x^5 - 1960x^4 + 6769x^3 - 13132x^2 + 13068x - 5040$$
This has the exact roots $\{1, 2, 3, 4, 5, 6, 7\}$. Now consider the perturbed polynomial
$$F(x) = x^7 - 28.002x^6 + 322x^5 - 1960x^4 + 6769x^3 - 13132x^2 + 13068x - 5040$$
This is a relatively small change in one coefficient, of relative error
$$\frac{.002}{28} \doteq 7.14 \times 10^{-5}$$
What are the roots of F(x)?
    Root of f(x)   Root of F(x)                 Error
    1              1.0000028                    -2.8E-6
    2              1.9989382                     1.1E-3
    3              3.0331253                    -0.033
    4              3.8195692                     0.180
    5              5.4586758 + .54012578i       -.46 - .54i
    6              5.4586758 - .54012578i        .54 + .54i
    7              7.2330128                    -0.233
Why have some of the roots departed so radically from the original values? This phenomenon goes under a variety of names. We sometimes say this is an example of an unstable or ill-conditioned rootfinding problem. These words are often used in a casual manner, but they also have a very precise meaning in many areas of numerical analysis (and more generally, in all of mathematics).
A PERTURBATION ANALYSIS
We want to study what happens to the root of a function $f(x)$ when it is perturbed by a small amount. For some function $g(x)$ and for all small $\varepsilon$, define the perturbed function
$$F_\varepsilon(x) = f(x) + \varepsilon\,g(x)$$
and let $\alpha(\varepsilon)$ denote a root of it:
$$F_\varepsilon(\alpha(\varepsilon)) = 0$$
for all small values of $\varepsilon$. Differentiate this as a function of $\varepsilon$ using the chain rule. Then we obtain
$$f'(\alpha(\varepsilon))\,\alpha'(\varepsilon) + g(\alpha(\varepsilon)) + \varepsilon\,g'(\alpha(\varepsilon))\,\alpha'(\varepsilon) = 0$$
for all small $\varepsilon$. Substitute $\varepsilon = 0$, recall $\alpha(0) = \alpha_0$, and solve for $\alpha'(0)$ to obtain
$$f'(\alpha_0)\,\alpha'(0) + g(\alpha_0) = 0, \qquad \alpha'(0) = -\frac{g(\alpha_0)}{f'(\alpha_0)}$$
This then leads to
$$\alpha(\varepsilon) \approx \alpha(0) + \varepsilon\,\alpha'(0) = \alpha_0 - \varepsilon\,\frac{g(\alpha_0)}{f'(\alpha_0)} \quad (*)$$
Example: In our earlier polynomial example, the perturbation was $F_\varepsilon(x) = f(x) - \varepsilon x^6$ with $\varepsilon = .002$. Consider the simple root $\alpha_0 = 3$. Then
$$\alpha(\varepsilon) \approx 3 + \frac{3^6}{48}\,\varepsilon \doteq 3 + 15.2\,\varepsilon$$
With $\varepsilon = .002$, we obtain
$$\alpha(.002) \approx 3 + 15.2(.002) \doteq 3.0304$$
This is close to the actual root of 3.0331253.

However, the approximation (*) is not good at estimating the change in the roots 5 and 6. By observation, the perturbation in the root is a complex number, whereas the formula (*) predicts only a perturbation that is real. The value of $\varepsilon$ is too large to have (*) be accurate for the roots 5 and 6.
DISCUSSION
Looking again at the formula
$$\alpha(\varepsilon) \approx \alpha_0 - \varepsilon\,\frac{g(\alpha_0)}{f'(\alpha_0)}$$
we have that the size of
$$\frac{g(\alpha_0)}{f'(\alpha_0)}$$
is an indication of the stability of the solution $\alpha_0$. If this quantity is large, then potentially we will have difficulty. Of course, not all functions $g(x)$ are equally possible, and we need to look only at functions $g(x)$ that will possibly occur in practice.

One quantity of interest is the size of $f'(\alpha_0)$. If it is very small relative to $g(\alpha_0)$, then we are likely to have difficulty in finding $\alpha_0$ accurately.
INTERPOLATION
Interpolation is a process of finding a formula (often a polynomial) whose graph will pass through a given set of points $(x, y)$.

As an example, consider defining
$$x_0 = 0, \qquad x_1 = \frac{\pi}{4}, \qquad x_2 = \frac{\pi}{2}$$
and
$$y_i = \cos x_i, \quad i = 0, 1, 2$$
This gives us the three points
$$(0, 1), \qquad \left(\frac{\pi}{4}, \frac{1}{\sqrt{2}}\right), \qquad \left(\frac{\pi}{2}, 0\right)$$
For linear interpolation through two points $(x_0, y_0)$ and $(x_1, y_1)$, the interpolant can be written in either of the equivalent forms
$$P_1(x) = \frac{(x_1 - x)\,y_0 + (x - x_0)\,y_1}{x_1 - x_0} = y_0 + \left(\frac{y_1 - y_0}{x_1 - x_0}\right)(x - x_0)$$
Check each of these by evaluating them at $x = x_0$ and $x_1$ to see that the respective values are $y_0$ and $y_1$.
Example. Following is a table of values for $f(x) = \tan x$ for a few values of $x$.

    x      1       1.1     1.2     1.3
    tan x  1.5574  1.9648  2.5722  3.6021

Use linear interpolation to estimate $\tan(1.15)$. Then use
$$x_0 = 1.1, \qquad x_1 = 1.2$$
with corresponding values for $y_0$ and $y_1$. Then
$$\tan x \approx y_0 + \frac{x - x_0}{x_1 - x_0}\left[y_1 - y_0\right]$$
$$\tan(1.15) \approx 1.9648 + \frac{1.15 - 1.1}{1.2 - 1.1}\left[2.5722 - 1.9648\right] = 2.2685$$
The true value is $\tan 1.15 = 2.2345$. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
[Figures: $y = \tan x$ on $[1, 1.3]$; and $y = \tan x$ together with the linear interpolant $y = p_1(x)$ on $[1.1, 1.2]$.]
QUADRATIC INTERPOLATION
We want to find a polynomial
$$P_2(x) = a_0 + a_1 x + a_2 x^2$$
which satisfies
$$P_2(x_i) = y_i, \quad i = 0, 1, 2$$
for given data points $(x_0, y_0), (x_1, y_1), (x_2, y_2)$. One formula for such a polynomial follows:
$$P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x) \quad (*)$$
with
$$L_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}, \qquad L_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}, \qquad L_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}$$
The formula (*) is called Lagrange's form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
$$L_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}, \qquad L_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}, \qquad L_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}$$
are called Lagrange basis functions for quadratic interpolation. They have the properties
$$L_i(x_j) = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$$
for $i, j = 0, 1, 2$. Also, they all have degree 2. Their graphs are on an accompanying page.

As a consequence of each $L_i(x)$ being of degree 2, we have that the interpolant
$$P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)$$
must have degree $\le 2$.
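A minimal sketch of this formula in MATLAB/Octave, using the node and data values from the $\cos x$ example above; the evaluation point is an illustrative choice:

    % Quadratic Lagrange interpolation of cos(x) at 0, pi/4, pi/2.
    x0 = 0; x1 = pi/4; x2 = pi/2;
    y0 = cos(x0); y1 = cos(x1); y2 = cos(x2);
    L0 = @(x) (x - x1).*(x - x2) / ((x0 - x1)*(x0 - x2));
    L1 = @(x) (x - x0).*(x - x2) / ((x1 - x0)*(x1 - x2));
    L2 = @(x) (x - x0).*(x - x1) / ((x2 - x0)*(x2 - x1));
    P2 = @(x) y0*L0(x) + y1*L1(x) + y2*L2(x);
    P2(pi/6)          % about 0.851, versus cos(pi/6) = 0.866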
UNIQUENESS
Can there be another polynomial, call it $Q(x)$, for which
$$\deg(Q) \le 2, \qquad Q(x_i) = y_i, \quad i = 0, 1, 2\,?$$
Thus, is the Lagrange formula $P_2(x)$ unique?

Introduce
$$R(x) = P_2(x) - Q(x)$$
From the properties of $P_2$ and $Q$, we have $\deg(R) \le 2$. Moreover,
$$R(x_i) = P_2(x_i) - Q(x_i) = y_i - y_i = 0$$
for all three node points $x_0$, $x_1$, and $x_2$. How many polynomials $R(x)$ are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore
$$R(x) = 0 \text{ for all } x, \qquad Q(x) = P_2(x) \text{ for all } x$$
SPECIAL CASES
Consider the data points
$$(x_0, 1), \quad (x_1, 1), \quad (x_2, 1)$$
What is the polynomial $P_2(x)$ in this case?

Answer: We must have that the polynomial interpolant is
$$P_2(x) \equiv 1$$
meaning that $P_2(x)$ is the constant function 1. Why? First, the constant function satisfies the property of being of degree $\le 2$. Next, it clearly interpolates the given data. Therefore by the uniqueness of quadratic interpolation, $P_2(x)$ must be the constant function 1.

Consider now the data points
$$(x_0, mx_0), \quad (x_1, mx_1), \quad (x_2, mx_2)$$
for some constant $m$. What is $P_2(x)$ in this case? By an argument similar to that above,
$$P_2(x) = mx \text{ for all } x$$
Thus the degree of $P_2(x)$ can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials of a general degree $n$. We want to find a polynomial $P_n(x)$ for which
$$\deg(P_n) \le n, \qquad P_n(x_i) = y_i, \quad i = 0, 1, \ldots, n \quad (**)$$
with given data points
$$(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$$
The solution is given by Lagrange's formula
$$P_n(x) = y_0 L_0(x) + y_1 L_1(x) + \cdots + y_n L_n(x)$$
The Lagrange basis functions are given by
$$L_k(x) = \frac{(x - x_0)\cdots(x - x_{k-1})(x - x_{k+1})\cdots(x - x_n)}{(x_k - x_0)\cdots(x_k - x_{k-1})(x_k - x_{k+1})\cdots(x_k - x_n)}$$
for $k = 0, 1, 2, \ldots, n$. The quadratic case was covered earlier.

In a manner analogous to the quadratic case, we can show that the above $P_n(x)$ is the only solution to the problem (**).
In the formula
$$L_k(x) = \frac{(x - x_0)\cdots(x - x_{k-1})(x - x_{k+1})\cdots(x - x_n)}{(x_k - x_0)\cdots(x_k - x_{k-1})(x_k - x_{k+1})\cdots(x_k - x_n)}$$
we can see that each such function is a polynomial of degree $n$. In addition,
$$L_k(x_i) = \begin{cases} 1, & k = i \\ 0, & k \neq i \end{cases}$$
Using these properties, it follows that the formula
$$P_n(x) = y_0 L_0(x) + y_1 L_1(x) + \cdots + y_n L_n(x)$$
satisfies the interpolation problem of finding a solution to
$$\deg(P_n) \le n, \qquad P_n(x_i) = y_i, \quad i = 0, 1, \ldots, n$$
EXAMPLE
Recall the table

    x      1       1.1     1.2     1.3
    tan x  1.5574  1.9648  2.5722  3.6021

We now interpolate this table with the nodes
$$x_0 = 1, \quad x_1 = 1.1, \quad x_2 = 1.2, \quad x_3 = 1.3$$
Without giving the details of the evaluation process, we have the following results for interpolation with degrees $n = 1, 2, 3$.

    n           1       2       3
    P_n(1.15)   2.2685  2.2435  2.2296
    Error       -.0340  -.0090  .0049

It improves with increasing degree $n$, but not at a very rapid rate. In fact, the error becomes worse when $n$ is increased further. Later we will see that interpolation of a much higher degree, say $n \ge 10$, is often poorly behaved when the node points $\{x_i\}$ are evenly spaced.
A FIRST ORDER DIVIDED DIFFERENCE
For a given function $f(x)$ and two distinct points $x_0$ and $x_1$, define
$$f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0}$$
This is called a first order divided difference of $f(x)$. By the mean value theorem,
$$f(x_1) - f(x_0) = f'(c)\,(x_1 - x_0)$$
for some $c$ between $x_0$ and $x_1$. Thus
$$f[x_0, x_1] = f'(c)$$
and the divided difference is very much like the derivative, especially if $x_0$ and $x_1$ are quite close together. In fact,
$$f'\!\left(\frac{x_1 + x_0}{2}\right) \approx f[x_0, x_1]$$
is quite an accurate approximation of the derivative.
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points $x_0$, $x_1$, and $x_2$, define
$$f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0}$$
This is called the second order divided difference of $f(x)$. By a fairly complicated argument, we can show
$$f[x_0, x_1, x_2] = \frac{1}{2}f''(c)$$
for some $c$ intermediate to $x_0$, $x_1$, and $x_2$. In fact, as we will investigate,
$$f''(x_1) \approx 2\,f[x_0, x_1, x_2]$$
in the case the nodes are evenly spaced, $x_1 - x_0 = x_2 - x_1$.
EXAMPLE
Consider the table

    x      1       1.1     1.2     1.3     1.4
    cos x  .54030  .45360  .36236  .26750  .16997

Let $x_0 = 1$, $x_1 = 1.1$, and $x_2 = 1.2$. Then
$$f[x_0, x_1] = \frac{.45360 - .54030}{1.1 - 1} = -.86700$$
$$f[x_1, x_2] = \frac{.36236 - .45360}{1.2 - 1.1} = -.91240$$
$$f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{-.91240 - (-.86700)}{1.2 - 1.0} = -.22700$$
For comparison,
$$f'\!\left(\frac{x_1 + x_0}{2}\right) = -\sin(1.05) = -.86742, \qquad 2\,f[x_0, x_1, x_2] = -.45400 \approx f''(x_1) = -\cos(1.1) = -.45360$$
ERROR IN LINEAR INTERPOLATION
For $f(x) = \log_{10} x$ we have $f''(x) = -\dfrac{\log_{10} e}{x^2}$, and the error formula for linear interpolation gives
$$\log_{10} x - P_1(x) = \frac{(x - x_0)(x - x_1)}{2}\,f''(c_x) = (x - x_0)(x_1 - x)\left[\frac{\log_{10} e}{2c_x^2}\right]$$
We usually are interpolating with $x_0 \le x \le x_1$; and in that case, we have
$$(x - x_0)(x_1 - x) \ge 0, \qquad x_0 \le c_x \le x_1$$
and therefore
$$(x - x_0)(x_1 - x)\left[\frac{\log_{10} e}{2x_1^2}\right] \le \log_{10} x - P_1(x) \le (x - x_0)(x_1 - x)\left[\frac{\log_{10} e}{2x_0^2}\right]$$
For $h = x_1 - x_0$ small, we have for $x_0 \le x \le x_1$
$$\log_{10} x - P_1(x) \approx (x - x_0)(x_1 - x)\left[\frac{\log_{10} e}{2x_0^2}\right]$$
Typical high school algebra textbooks contain tables of $\log_{10} x$ with a spacing of $h = .01$. What is the error in this case? To look at this, we use
$$0 \le \log_{10} x - P_1(x) \le (x - x_0)(x_1 - x)\left[\frac{\log_{10} e}{2x_0^2}\right]$$
By simple geometry or calculus,
$$\max_{x_0 \le x \le x_1}\,(x - x_0)(x_1 - x) \le \frac{h^2}{4}$$
Therefore,
$$0 \le \log_{10} x - P_1(x) \le \frac{h^2}{4}\left[\frac{\log_{10} e}{2x_0^2}\right] \doteq .0543\,\frac{h^2}{x_0^2}$$
If we want a uniform bound for all points $1 \le x_0 \le 10$, we have
$$0 \le \log_{10} x - P_1(x) \le \frac{h^2\,\log_{10} e}{8} \doteq .0543\,h^2$$
For $h = .01$, as is typical of the high school textbook tables of $\log_{10} x$,
$$0 \le \log_{10} x - P_1(x) \le 5.43 \times 10^{-6}$$
If you look at most tables, a typical entry is given to only four decimal places to the right of the decimal point, e.g.
$$\log 5.41 \doteq .7332$$
Therefore the entries are in error by as much as .00005. Comparing this with the interpolation error, we see the latter is less important than the rounding errors in the table entries.
From the bound
$$0 \le \log_{10} x - P_1(x) \le \frac{h^2\,\log_{10} e}{8x_0^2} \doteq .0543\,\frac{h^2}{x_0^2}$$
we see the error decreases as $x_0$ increases, and it is about 100 times smaller for points near 10 than for points near 1.
AN ERROR FORMULA: THE GENERAL CASE
Recall the general interpolation problem: find a polynomial $P_n(x)$ for which
$$\deg(P_n) \le n, \qquad P_n(x_i) = f(x_i), \quad i = 0, 1, \ldots, n$$
with distinct node points $\{x_0, \ldots, x_n\}$ and a given function $f(x)$. Let $[a, b]$ be a given interval on which $f(x)$ is $(n+1)$-times continuously differentiable; and assume the points $x_0, \ldots, x_n$, and $x$ are contained in $[a, b]$. Then
$$f(x) - P_n(x) = \frac{(x - x_0)(x - x_1)\cdots(x - x_n)}{(n + 1)!}\,f^{(n+1)}(c_x)$$
with $c_x$ some point between the minimum and maximum of the points in $\{x, x_0, \ldots, x_n\}$.

As shorthand, introduce
$$\Psi_n(x) = (x - x_0)(x - x_1)\cdots(x - x_n)$$
a polynomial of degree $n + 1$ with roots $\{x_0, \ldots, x_n\}$. Then
$$f(x) - P_n(x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c_x)$$
THE QUADRATIC CASE
For $n = 2$, we have
$$f(x) - P_2(x) = \frac{(x - x_0)(x - x_1)(x - x_2)}{3!}\,f^{(3)}(c_x) \quad (*)$$
with $c_x$ some point between the minimum and maximum of the points in $\{x, x_0, x_1, x_2\}$.

To illustrate the use of this formula, consider the case of evenly spaced nodes:
$$x_1 = x_0 + h, \qquad x_2 = x_1 + h$$
Further suppose we have $x_0 \le x \le x_2$, as we would usually have when interpolating in a table of given function values (e.g. $\log_{10} x$). The quantity
$$\Psi_2(x) = (x - x_0)(x - x_1)(x - x_2)$$
can be evaluated directly for a particular $x$.

[Figure: graph of $\Psi_2(x) = (x + h)\,x\,(x - h)$ using $(x_0, x_1, x_2) = (-h, 0, h)$.]
In the formula (*), however, we do not know $c_x$, and therefore we replace $f^{(3)}(c_x)$ with a maximum of $\left|f^{(3)}(x)\right|$ as $x$ varies over $x_0 \le x \le x_2$. This yields
$$|f(x) - P_2(x)| \le \frac{|\Psi_2(x)|}{3!}\,\max_{x_0 \le x \le x_2}\left|f^{(3)}(x)\right| \quad (**)$$
If we want a uniform bound for $x_0 \le x \le x_2$, we must compute
$$\max_{x_0 \le x \le x_2} |\Psi_2(x)| = \max_{x_0 \le x \le x_2} |(x - x_0)(x - x_1)(x - x_2)|$$
Using calculus,
$$\max_{x_0 \le x \le x_2} |\Psi_2(x)| = \frac{2h^3}{3\sqrt{3}}, \quad \text{attained at } x = x_1 \pm \frac{h}{\sqrt{3}}$$
Combined with (**), this yields
$$|f(x) - P_2(x)| \le \frac{h^3}{9\sqrt{3}}\,\max_{x_0 \le x \le x_2}\left|f^{(3)}(x)\right|$$
for $x_0 \le x \le x_2$.
For $f(x) = \log_{10} x$, with $1 \le x_0 \le x \le x_2 \le 10$, this leads to
$$|\log_{10} x - P_2(x)| \le \frac{h^3}{9\sqrt{3}}\,\max_{x_0 \le x \le x_2}\left|\frac{2\log_{10} e}{x^3}\right| = \frac{.05572\,h^3}{x_0^3}$$
For the case of $h = .01$, we have
$$|\log_{10} x - P_2(x)| \le \frac{5.57 \times 10^{-8}}{x_0^3} \le 5.57 \times 10^{-8}$$
Question: How much larger could we make $h$ so that quadratic interpolation would have an error comparable to that of linear interpolation of $\log_{10} x$ with $h = .01$? The error bound for the linear interpolation was $5.43 \times 10^{-6}$, and therefore we want the same to be true of quadratic interpolation. Using a simpler bound, we want to find $h$ so that
$$|\log_{10} x - P_2(x)| \le .05572\,h^3 \le 5 \times 10^{-6}$$
This is true if $h \le .04477$. Therefore a spacing of $h = .04$ would be sufficient. A table with this spacing and quadratic interpolation would have an error comparable to a table with $h = .01$ and linear interpolation.
For the case of general $n$,
$$f(x) - P_n(x) = \frac{(x - x_0)\cdots(x - x_n)}{(n + 1)!}\,f^{(n+1)}(c_x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c_x)$$
$$\Psi_n(x) = (x - x_0)(x - x_1)\cdots(x - x_n)$$
with $c_x$ some point between the minimum and maximum of the points in $\{x, x_0, \ldots, x_n\}$. When bounding the error we replace $f^{(n+1)}(c_x)$ with its maximum over the interval containing $\{x, x_0, \ldots, x_n\}$, as we have illustrated earlier in the linear and quadratic cases.

Consider now the function
$$\frac{\Psi_n(x)}{(n + 1)!}$$
over the interval determined by the minimum and maximum of the points in $\{x, x_0, \ldots, x_n\}$. For evenly spaced node points on $[0, 1]$, with $x_0 = 0$ and $x_n = 1$, we give graphs for $n = 2, 3, 4, 5$ and for $n = 6, 7, 8, 9$ on accompanying pages.
DISCUSSION OF ERROR
Consider the error
$$f(x) - P_n(x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c_x), \qquad \Psi_n(x) = (x - x_0)(x - x_1)\cdots(x - x_n)$$
as $n$ increases and as $x$ varies. As noted previously, we cannot do much with $f^{(n+1)}(c_x)$ except to replace it with a maximum value of $\left|f^{(n+1)}(x)\right|$ over a suitable interval. Thus we concentrate on understanding the size of
$$\frac{\Psi_n(x)}{(n + 1)!}$$
ERROR FOR EVENLY SPACED NODES
We consider first the case in which the node points are evenly spaced, as this seems the natural way to define the points at which interpolation is carried out. Moreover, using evenly spaced nodes is the case to consider for table interpolation. What can we learn from the given graphs?

The interpolation nodes are determined by using
$$h = \frac{1}{n}, \qquad x_0 = 0,\ x_1 = h,\ x_2 = 2h,\ \ldots,\ x_n = nh = 1$$
For this case,
$$\Psi_n(x) = x(x - h)(x - 2h)\cdots(x - 1)$$
Our graphs are the cases of $n = 2, \ldots, 9$.
[Figures: graphs of $\Psi_n(x)$ on $[0, 1]$ for $n = 2, 3, 4, 5$ and for $n = 6, 7, 8, 9$; also a graph of $\Psi_6(x) = (x - x_0)(x - x_1)\cdots(x - x_6)$ with evenly spaced nodes, showing the large oscillations near the endpoints.]
Using the following table,

    n   M_n       n    M_n
    1   1.25E-1   6    4.76E-7
    2   2.41E-2   7    2.20E-8
    3   2.06E-3   8    9.11E-10
    4   1.48E-4   9    3.39E-11
    5   9.01E-6   10   1.15E-12

we can observe that the maximum
$$M_n \equiv \max_{x_0 \le x \le x_n} \frac{|\Psi_n(x)|}{(n + 1)!}$$
becomes smaller with increasing $n$.
From the graphs, there is enormous variation in the size of $\Psi_n(x)$ as $x$ varies over $[0, 1]$; and thus there is also enormous variation in the error as $x$ so varies. For example, in the $n = 9$ case,
$$\max_{x_0 \le x \le x_1} \frac{|\Psi_n(x)|}{(n + 1)!} = 3.39 \times 10^{-11}$$
$$\max_{x_4 \le x \le x_5} \frac{|\Psi_n(x)|}{(n + 1)!} = 6.89 \times 10^{-13}$$
and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when $x_0 \le x \le x_1$ as compared to the case when $x_4 \le x \le x_5$. When doing table interpolation, the point $x$ at which you are interpolating should be centrally located with respect to the interpolation nodes $\{x_0, \ldots, x_n\}$ being used to define the interpolation, if possible.
AN APPROXIMATION PROBLEM
Consider now the problem of using an interpolation polynomial to approximate a given function $f(x)$ on a given interval $[a, b]$. In particular, take interpolation nodes
$$a \le x_0 < x_1 < \cdots < x_{n-1} < x_n \le b$$
and produce the interpolation polynomial $P_n(x)$ that interpolates $f(x)$ at the given node points. We would like to have
$$\max_{a \le x \le b} |f(x) - P_n(x)| \to 0 \quad \text{as } n \to \infty$$
Does it happen?

Recall the error bound
$$\max_{a \le x \le b} |f(x) - P_n(x)| \le \frac{1}{(n + 1)!}\,\max_{a \le x \le b} |\Psi_n(x)|\,\max_{a \le x \le b}\left|f^{(n+1)}(x)\right|$$
If the nodes are chosen as the zeros of the Chebyshev polynomial $T_{n+1}(x)$ (taken up later in these notes), then on $[-1, 1]$ this becomes
$$\max_{-1 \le x \le 1} |f(x) - P_n(x)| \le \frac{1}{(n + 1)!\,2^n}\,\max_{-1 \le x \le 1}\left|f^{(n+1)}(x)\right|$$
This turns out to be smaller than for the evenly spaced case; and although this polynomial interpolation does not work for all functions $f(x)$, it works for all differentiable functions and more.
ANOTHER ERROR FORMULA
Recall the error formula
$$f(x) - P_n(x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c), \qquad \Psi_n(x) = (x - x_0)(x - x_1)\cdots(x - x_n)$$
with $c$ between the minimum and maximum of $\{x_0, \ldots, x_n, x\}$. A second formula is given by
$$f(x) - P_n(x) = \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
To show this is a simple, but somewhat subtle, argument.
Let $P_{n+1}(x)$ denote the polynomial of degree $\le n + 1$ which interpolates $f(x)$ at the points $\{x_0, \ldots, x_n, x_{n+1}\}$. Then
$$P_{n+1}(x) = P_n(x) + f[x_0, \ldots, x_n, x_{n+1}]\,(x - x_0)\cdots(x - x_n)$$
Substituting $x = x_{n+1}$, and using the fact that $P_{n+1}(x)$ interpolates $f(x)$ at $x_{n+1}$, we have
$$f(x_{n+1}) = P_n(x_{n+1}) + f[x_0, \ldots, x_n, x_{n+1}]\,(x_{n+1} - x_0)\cdots(x_{n+1} - x_n)$$
In this formula, the number $x_{n+1}$ is completely arbitrary, other than being distinct from the points in $\{x_0, \ldots, x_n\}$. To emphasize this fact, replace $x_{n+1}$ by $x$ throughout the formula, obtaining
$$f(x) = P_n(x) + f[x_0, \ldots, x_n, x]\,(x - x_0)\cdots(x - x_n) = P_n(x) + \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
provided $x \neq x_0, \ldots, x_n$.
The formula
$$f(x) = P_n(x) + f[x_0, \ldots, x_n, x]\,(x - x_0)\cdots(x - x_n) = P_n(x) + \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
holds trivially when $x$ is a node point, since then $\Psi_n(x) = 0$ and $f(x) = P_n(x)$; and provided $f(x)$ is differentiable, the divided difference is also defined by continuity in that case. This shows
$$f(x) - P_n(x) = \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
Compare the two error formulas
$$f(x) - P_n(x) = \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
$$f(x) - P_n(x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c)$$
Then
$$\Psi_n(x)\,f[x_0, \ldots, x_n, x] = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c)$$
$$f[x_0, \ldots, x_n, x] = \frac{f^{(n+1)}(c)}{(n + 1)!}$$
for some $c$ between the smallest and largest of the numbers in $\{x_0, \ldots, x_n, x\}$.

To make this somewhat symmetric in its arguments, let $m = n + 1$ and $x = x_m$. Then
$$f[x_0, \ldots, x_{m-1}, x_m] = \frac{f^{(m)}(c)}{m!}$$
with $c$ an unknown number between the smallest and largest of the numbers in $\{x_0, \ldots, x_m\}$. This was given in an earlier lecture where divided differences were introduced.
PIECEWISE POLYNOMIAL INTERPOLATION
Recall the examples of higher degree polynomial interpolation of the function $f(x) = \left(1 + x^2\right)^{-1}$ on $[-5, 5]$. The interpolants $P_n(x)$ oscillated a great deal, whereas the function $f(x)$ was nonoscillatory. To obtain interpolants that are better behaved, we look at other forms of interpolating functions.

Consider the data

    x  0    1    2    2.5  3    3.5    4
    y  2.5  0.5  0.5  1.5  1.5  1.125  0

What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text are the graphs of the degree 6 polynomial interpolant, along with those of piecewise linear and piecewise quadratic interpolating functions.

Since we only have the data to consider, we would generally want to use an interpolant that had somewhat the shape of that of the piecewise linear interpolant.
[Figures: the data points; the piecewise linear interpolant; the degree 6 polynomial interpolant; and the piecewise quadratic interpolant, all on $0 \le x \le 4$.]
PIECEWISE POLYNOMIAL FUNCTIONS
Consider being given a set of data points $(x_1, y_1), \ldots, (x_n, y_n)$, with
$$x_1 < x_2 < \cdots < x_n$$
Then the simplest way to connect the points $(x_j, y_j)$ is by straight line segments. This is called a piecewise linear interpolant of the data $\{(x_j, y_j)\}$. This graph has corners, and often we expect the interpolant to have a smooth graph.
To obtain a somewhat smoother graph, consider using piecewise quadratic interpolation. Begin by constructing the quadratic polynomial that interpolates
$$\{(x_1, y_1), (x_2, y_2), (x_3, y_3)\}$$
Then construct the quadratic polynomial that interpolates
$$\{(x_3, y_3), (x_4, y_4), (x_5, y_5)\}$$
Continue this process of constructing quadratic interpolants on the subintervals
$$[x_1, x_3],\ [x_3, x_5],\ [x_5, x_7],\ \ldots$$
If the number of subintervals is even (and therefore $n$ is odd), then this process comes out fine, with the last interval being $[x_{n-2}, x_n]$. This was illustrated on the graph for the preceding data. If, however, $n$ is even, then the approximation on the last interval must be handled by some modification of this procedure. Suggest such!
With piecewise quadratic interpolants, however, there are corners on the graph of the interpolating function. With our preceding example, they are at $x_3$ and $x_5$. How do we avoid this?

Piecewise polynomial interpolants are used in many applications. We will consider them later, to obtain numerical integration formulas.
SMOOTH NON-OSCILLATORY INTERPOLATION
Let data points $(x_1, y_1), \ldots, (x_n, y_n)$ be given, and let
$$x_1 < x_2 < \cdots < x_n$$
Consider finding functions $s(x)$ for which the following properties hold:

(1) $s(x_i) = y_i$, $i = 1, \ldots, n$

(2) $s(x)$, $s'(x)$, $s''(x)$ are continuous on $[x_1, x_n]$.

Then among such functions $s(x)$ satisfying these properties, find the one which minimizes the integral
$$\int_{x_1}^{x_n} \left[s''(x)\right]^2 dx$$
The idea of minimizing the integral is to obtain an interpolating function for which the first derivative does not change rapidly. It turns out there is a unique solution to this problem, and it is called a natural cubic spline function.
SPLINE FUNCTIONS
Let a set of node points $\{x_i\}$ be given, satisfying
$$a \le x_1 < x_2 < \cdots < x_n \le b$$
for some numbers $a$ and $b$. Often we use $[a, b] = [x_1, x_n]$. A cubic spline function $s(x)$ on $[a, b]$ with breakpoints or knots $\{x_i\}$ has the following properties:

1. On each of the intervals
$$[a, x_1],\ [x_1, x_2],\ \ldots,\ [x_{n-1}, x_n],\ [x_n, b]$$
$s(x)$ is a polynomial of degree $\le 3$.

2. $s(x)$, $s'(x)$, $s''(x)$ are continuous on $[a, b]$.

In the case that we have given data points $(x_1, y_1), \ldots, (x_n, y_n)$, we say $s(x)$ is a cubic interpolating spline function for this data if

3. $s(x_i) = y_i$, $i = 1, \ldots, n$.
EXAMPLE
Define
$$(x - \alpha)^3_+ = \begin{cases} (x - \alpha)^3, & x \ge \alpha \\ 0, & x \le \alpha \end{cases}$$
This is a cubic spline function on $(-\infty, \infty)$ with the single breakpoint $x_1 = \alpha$.

Combinations of these form more complicated cubic spline functions. For example,
$$s(x) = 3(x - 1)^3_+ - 2(x - 3)^3_+$$
is a cubic spline function on $(-\infty, \infty)$ with the breakpoints $x_1 = 1$, $x_2 = 3$.

Define
$$s(x) = p_3(x) + \sum_{j=1}^{n} a_j\left(x - x_j\right)^3_+$$
with $p_3(x)$ some cubic polynomial. Then $s(x)$ is a cubic spline function on $(-\infty, \infty)$ with breakpoints $\{x_1, \ldots, x_n\}$.
Return to the earlier problem of choosing an interpolating function $s(x)$ to minimize the integral
$$\int_{x_1}^{x_n} \left[s''(x)\right]^2 dx$$
There is a unique solution to this problem. The solution $s(x)$ is a cubic interpolating spline function, and moreover, it satisfies
$$s''(x_1) = s''(x_n) = 0$$
Spline functions satisfying these boundary conditions are called natural cubic spline functions, and the solution to our minimization problem is a natural cubic interpolatory spline function. We will show a method to construct this function from the interpolation data.

Motivation for these boundary conditions can be given by looking at the physics of bending thin beams of flexible materials to pass through the given data. To the left of $x_1$ and to the right of $x_n$, the beam is straight, and therefore the second derivatives are zero at the transition points $x_1$ and $x_n$.
CONSTRUCTION OF THE INTERPOLATING SPLINE FUNCTION
To make the presentation more specific, suppose we have data
$$(x_1, y_1),\ (x_2, y_2),\ (x_3, y_3),\ (x_4, y_4)$$
with $x_1 < x_2 < x_3 < x_4$. Then on each of the intervals
$$[x_1, x_2],\ [x_2, x_3],\ [x_3, x_4]$$
$s(x)$ is a cubic polynomial. Taking the first interval, $s(x)$ is a cubic polynomial and $s''(x)$ is a linear polynomial. Let
$$M_i = s''(x_i), \quad i = 1, 2, 3, 4$$
Then on $[x_1, x_2]$,
$$s''(x) = \frac{(x_2 - x)M_1 + (x - x_1)M_2}{x_2 - x_1}, \quad x_1 \le x \le x_2$$
We can find $s(x)$ by integrating twice:
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6(x_2 - x_1)} + c_1 x + c_2$$
We determine the constants of integration by using
$$s(x_1) = y_1, \qquad s(x_2) = y_2 \quad (*)$$
Then
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6(x_2 - x_1)} + \frac{(x_2 - x)y_1 + (x - x_1)y_2}{x_2 - x_1} - \frac{x_2 - x_1}{6}\left[(x_2 - x)M_1 + (x - x_1)M_2\right]$$
for $x_1 \le x \le x_2$.

Check that this formula satisfies the given interpolation condition (*)!
We can repeat this on the intervals $[x_2, x_3]$ and $[x_3, x_4]$, obtaining similar formulas.

For $x_2 \le x \le x_3$,
$$s(x) = \frac{(x_3 - x)^3 M_2 + (x - x_2)^3 M_3}{6(x_3 - x_2)} + \frac{(x_3 - x)y_2 + (x - x_2)y_3}{x_3 - x_2} - \frac{x_3 - x_2}{6}\left[(x_3 - x)M_2 + (x - x_2)M_3\right]$$
For $x_3 \le x \le x_4$,
$$s(x) = \frac{(x_4 - x)^3 M_3 + (x - x_3)^3 M_4}{6(x_4 - x_3)} + \frac{(x_4 - x)y_3 + (x - x_3)y_4}{x_4 - x_3} - \frac{x_4 - x_3}{6}\left[(x_4 - x)M_3 + (x - x_3)M_4\right]$$
We still do not know the values of the second derivatives $\{M_1, M_2, M_3, M_4\}$. The above formulas guarantee that $s(x)$ and $s''(x)$ are continuous for $x_1 \le x \le x_4$. For example, the formula on $[x_1, x_2]$ yields
$$s(x_2) = y_2, \qquad s''(x_2) = M_2$$
The formula on $[x_2, x_3]$ also yields
$$s(x_2) = y_2, \qquad s''(x_2) = M_2$$
All that is lacking is to make $s'(x)$ continuous at $x_2$ and $x_3$. Thus we require
$$s'(x_2 + 0) = s'(x_2 - 0), \qquad s'(x_3 + 0) = s'(x_3 - 0) \quad (**)$$
This means
$$\lim_{x \searrow x_2} s'(x) = \lim_{x \nearrow x_2} s'(x)$$
and similarly for $x_3$.
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
$$x_2 = x_1 + h, \qquad x_3 = x_1 + 2h, \qquad x_4 = x_1 + 3h$$
Then our earlier formulas simplify to
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6h} + \frac{(x_2 - x)y_1 + (x - x_1)y_2}{h} - \frac{h}{6}\left[(x_2 - x)M_1 + (x - x_1)M_2\right]$$
for $x_1 \le x \le x_2$, with similar formulas on $[x_2, x_3]$ and $[x_3, x_4]$.
Without going through all of the algebra, the conditions (**) lead to the following pair of equations:
$$\frac{h}{6}M_1 + \frac{2h}{3}M_2 + \frac{h}{6}M_3 = \frac{y_3 - y_2}{h} - \frac{y_2 - y_1}{h}$$
$$\frac{h}{6}M_2 + \frac{2h}{3}M_3 + \frac{h}{6}M_4 = \frac{y_4 - y_3}{h} - \frac{y_3 - y_2}{h}$$
This gives us two equations in four unknowns. The earlier boundary conditions on $s''(x)$ give us immediately
$$M_1 = M_4 = 0$$
Then we can solve the linear system for $M_2$ and $M_3$.
EXAMPLE
Consider the interpolation data points

    x  1  2    3    4
    y  1  1/2  1/3  1/4

In this case, $h = 1$, and the linear system becomes
$$\frac{2}{3}M_2 + \frac{1}{6}M_3 = y_3 - 2y_2 + y_1 = \frac{1}{3}$$
$$\frac{1}{6}M_2 + \frac{2}{3}M_3 = y_4 - 2y_3 + y_2 = \frac{1}{12}$$
This has the solution
$$M_2 = \frac{1}{2}, \qquad M_3 = 0$$
This leads to the spline function formula on each subinterval.
On $[1, 2]$,
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6h} + \frac{(x_2 - x)y_1 + (x - x_1)y_2}{h} - \frac{h}{6}\left[(x_2 - x)M_1 + (x - x_1)M_2\right]$$
$$= \frac{(2 - x)^3 \cdot 0 + (x - 1)^3 \cdot \frac{1}{2}}{6} + \frac{(2 - x)\cdot 1 + (x - 1)\cdot\frac{1}{2}}{1} - \frac{1}{6}\left[(2 - x)\cdot 0 + (x - 1)\cdot\frac{1}{2}\right]$$
$$= \frac{1}{12}(x - 1)^3 - \frac{7}{12}(x - 1) + 1$$
Similarly, for $2 \le x \le 3$,
$$s(x) = -\frac{1}{12}(x - 2)^3 + \frac{1}{4}(x - 2)^2 - \frac{1}{3}(x - 2) + \frac{1}{2}$$
and for $3 \le x \le 4$,
$$s(x) = -\frac{1}{12}(x - 4) + \frac{1}{4}$$
    x  1  2    3    4
    y  1  1/2  1/3  1/4

[Figure: graph of $y = 1/x$ and the natural cubic spline interpolant $y = s(x)$ on $[1, 4]$.]

    x  0    1    2    2.5  3    3.5    4
    y  2.5  0.5  0.5  1.5  1.5  1.125  0

[Figure: the interpolating natural cubic spline function for this data.]
ALTERNATIVE BOUNDARY CONDITIONS
Return to the equations
$$\frac{h}{6}M_1 + \frac{2h}{3}M_2 + \frac{h}{6}M_3 = \frac{y_3 - y_2}{h} - \frac{y_2 - y_1}{h}$$
$$\frac{h}{6}M_2 + \frac{2h}{3}M_3 + \frac{h}{6}M_4 = \frac{y_4 - y_3}{h} - \frac{y_3 - y_2}{h}$$
Sometimes other boundary conditions are imposed on $s(x)$ to help in determining the values of $M_1$ and $M_4$. For example, the data in our numerical example were generated from the function $f(x) = \frac{1}{x}$. With it, $f''(x) = \frac{2}{x^3}$, and thus we could use
$$M_1 = 2, \qquad M_4 = \frac{1}{32}$$
With this we are led to a new formula for $s(x)$, one that approximates $f(x) = \frac{1}{x}$ more closely.
THE CLAMPED SPLINE
In this case, we augment the interpolation conditions
$$s(x_i) = y_i, \quad i = 1, 2, 3, 4$$
with the boundary conditions
$$s'(x_1) = y'_1, \qquad s'(x_4) = y'_4 \quad (\#)$$
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
$$\frac{h}{3}M_1 + \frac{h}{6}M_2 = \frac{y_2 - y_1}{h} - y'_1$$
$$\frac{h}{6}M_1 + \frac{2h}{3}M_2 + \frac{h}{6}M_3 = \frac{y_3 - y_2}{h} - \frac{y_2 - y_1}{h}$$
$$\frac{h}{6}M_2 + \frac{2h}{3}M_3 + \frac{h}{6}M_4 = \frac{y_4 - y_3}{h} - \frac{y_3 - y_2}{h}$$
$$\frac{h}{6}M_3 + \frac{h}{3}M_4 = y'_4 - \frac{y_4 - y_3}{h}$$
For our numerical example, it is natural to obtain these derivative values from $f'(x) = -\frac{1}{x^2}$:
$$y'_1 = -1, \qquad y'_4 = -\frac{1}{16}$$
When combined with our earlier equations, we have the system
$$\frac{1}{3}M_1 + \frac{1}{6}M_2 = \frac{1}{2}$$
$$\frac{1}{6}M_1 + \frac{2}{3}M_2 + \frac{1}{6}M_3 = \frac{1}{3}$$
$$\frac{1}{6}M_2 + \frac{2}{3}M_3 + \frac{1}{6}M_4 = \frac{1}{12}$$
$$\frac{1}{6}M_3 + \frac{1}{3}M_4 = \frac{1}{48}$$
This has the solution
$$[M_1, M_2, M_3, M_4] = \left[\frac{173}{120},\ \frac{7}{60},\ \frac{11}{120},\ \frac{1}{60}\right]$$
These values can be substituted into the earlier formulas for $s(x)$ on each subinterval.
We can substitute in from the data

    x  1  2    3    4
    y  1  1/2  1/3  1/4

and the solutions $\{M_i\}$. Doing so, consider the error $f(x) - s(x)$. As an example,
$$f(x) = \frac{1}{x}, \qquad f\!\left(\frac{3}{2}\right) = \frac{2}{3}, \qquad s\!\left(\frac{3}{2}\right) = .65260$$
This is quite a decent approximation.
THE GENERAL PROBLEM
Consider the spline interpolation problem with $n$ nodes
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$$
and assume the node points $\{x_i\}$ are evenly spaced,
$$x_j = x_1 + (j - 1)h, \quad j = 1, \ldots, n$$
We have that the interpolating spline $s(x)$ on $x_j \le x \le x_{j+1}$ is given by
$$s(x) = \frac{(x_{j+1} - x)^3 M_j + (x - x_j)^3 M_{j+1}}{6h} + \frac{(x_{j+1} - x)y_j + (x - x_j)y_{j+1}}{h} - \frac{h}{6}\left[(x_{j+1} - x)M_j + (x - x_j)M_{j+1}\right]$$
for $j = 1, \ldots, n - 1$.
To enforce continuity of $s'(x)$ at the interior node points $x_2, \ldots, x_{n-1}$, the second derivatives $\{M_j\}$ must satisfy the linear equations
$$\frac{h}{6}M_{j-1} + \frac{2h}{3}M_j + \frac{h}{6}M_{j+1} = \frac{y_{j-1} - 2y_j + y_{j+1}}{h}$$
for $j = 2, \ldots, n - 1$. Writing them out,
$$\frac{h}{6}M_1 + \frac{2h}{3}M_2 + \frac{h}{6}M_3 = \frac{y_1 - 2y_2 + y_3}{h}$$
$$\frac{h}{6}M_2 + \frac{2h}{3}M_3 + \frac{h}{6}M_4 = \frac{y_2 - 2y_3 + y_4}{h}$$
$$\vdots$$
$$\frac{h}{6}M_{n-2} + \frac{2h}{3}M_{n-1} + \frac{h}{6}M_n = \frac{y_{n-2} - 2y_{n-1} + y_n}{h}$$
This is a system of $n - 2$ equations in the $n$ unknowns $\{M_1, \ldots, M_n\}$. Two more conditions must be imposed on $s(x)$ in order to have the number of equations equal the number of unknowns, namely $n$. With the added boundary conditions, this form of linear system can be solved very efficiently.
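A minimal sketch of assembling and solving this system, using the natural boundary conditions $M_1 = M_n = 0$ as the two added conditions and the earlier $y = 1/x$ data as illustration; a real implementation would use a tridiagonal solver rather than a dense matrix:

    % Second derivatives M_j of the natural cubic spline, evenly spaced data.
    x = 1:4;  y = 1./x;  h = 1;  n = length(x);
    A = zeros(n);  b = zeros(n,1);
    A(1,1) = 1;  A(n,n) = 1;                  % natural conditions: M_1 = M_n = 0
    for j = 2:n-1
        A(j, j-1:j+1) = [h/6, 2*h/3, h/6];    % continuity of s'(x) at x_j
        b(j) = (y(j-1) - 2*y(j) + y(j+1)) / h;
    end
    M = A \ b                                  % here M = [0; 1/2; 0; 0]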
BOUNDARY CONDITIONS
Natural boundary conditions:
$$s''(x_1) = s''(x_n) = 0$$
Spline functions satisfying these conditions are called natural cubic splines. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.

Clamped boundary conditions: We add the conditions
$$s'(x_1) = y'_1, \qquad s'(x_n) = y'_n$$
with $y'_1, y'_n$ given slopes for the endpoints of $s(x)$ on $[x_1, x_n]$. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.

Not-a-knot boundary conditions: This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in Matlab.
THE NOT-A-KNOT CONDITIONS
As before, let the interpolation nodes be
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$$
We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points
$$(x_1, y_1),\ (x_3, y_3),\ \ldots,\ (x_{n-2}, y_{n-2}),\ (x_n, y_n)$$
thus deleting two of the points. We now have $n - 2$ points, and the interpolating spline $s(x)$ can be determined on the intervals
$$[x_1, x_3],\ [x_3, x_4],\ \ldots,\ [x_{n-3}, x_{n-2}],\ [x_{n-2}, x_n]$$
This leads to $n - 4$ equations in the $n - 2$ unknowns $M_1, M_3, \ldots, M_{n-2}, M_n$. The two additional boundary conditions are
$$s(x_2) = y_2, \qquad s(x_{n-1}) = y_{n-1}$$
These translate into two additional equations, and we obtain a system of $n - 2$ linear simultaneous equations in the $n - 2$ unknowns $M_1, M_3, \ldots, M_{n-2}, M_n$.
    x  0    1    2    2.5  3    3.5    4
    y  2.5  0.5  0.5  1.5  1.5  1.125  0

[Figure: interpolating cubic spline function with not-a-knot boundary conditions for this data.]
MATLAB SPLINE FUNCTION LIBRARY
Given data points
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$$
type arrays containing the x and y coordinates:

    x = [x1 x2 ... xn]
    y = [y1 y2 ... yn]
    plot(x, y, 'o')

The last statement will draw a plot of the data points, marking them with the letter 'o'. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say

    h = (x(n) - x(1)) / (10*n);  xx = x(1) : h : x(n);

use

    yy = spline(x, y, xx);
    plot(x, y, 'o', xx, yy)

The last statement will plot the data points, as before, and it will plot the interpolating spline $s(x)$ as a continuous curve.
ERROR IN CUBIC SPLINE INTERPOLATION
Let an interval $[a, b]$ be given, and then define
$$h = \frac{b - a}{n - 1}, \qquad x_j = a + (j - 1)h, \quad j = 1, \ldots, n$$
Suppose we want to approximate a given function $f(x)$ on the interval $[a, b]$ using cubic spline interpolation. Define
$$y_j = f(x_j), \quad j = 1, \ldots, n$$
Let $s_n(x)$ denote the cubic spline interpolating this data and satisfying the not-a-knot boundary conditions. Then it can be shown that for a suitable constant $c$,
$$E_n \equiv \max_{a \le x \le b} |f(x) - s_n(x)| \le c\,h^4$$
The corresponding bound for natural cubic spline interpolation contains only a term of $h^2$ rather than $h^4$; it does not converge to zero as rapidly.
EXAMPLE
Take $f(x) = \arctan x$ on $[0, 5]$. The following table gives values of the maximum error $E_n$ for various values of $n$. The values of $h$ are being successively halved.

    n    E_n       E_{(n+1)/2} / E_n
    7    7.09E-3
    13   3.24E-4   21.9
    25   3.06E-5   10.6
    49   1.48E-6   20.7
    97   9.04E-8   16.4
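A minimal sketch reproducing this experiment with MATLAB's spline (which uses the not-a-knot conditions); the fine evaluation grid is an illustrative choice, and the printed errors should come out close to the table values:

    % Maximum error of not-a-knot cubic spline interpolation of arctan on [0,5].
    f = @(x) atan(x);
    for n = [7 13 25 49 97]
        xk = linspace(0, 5, n);                  % interpolation nodes
        xx = linspace(0, 5, 2001);               % fine grid for the max error
        En = max(abs(f(xx) - spline(xk, f(xk), xx)));
        fprintf('n = %3d   E_n = %.2e\n', n, En);
    end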
BEST APPROXIMATION
Given a function $f(x)$ that is continuous on a given interval $[a, b]$, consider approximating it by some polynomial $p(x)$. To measure the error in $p(x)$ as an approximation, introduce
$$E(p) = \max_{a \le x \le b} |f(x) - p(x)|$$
This is called the maximum error or uniform error of approximation of $f(x)$ by $p(x)$ on $[a, b]$.

With an eye towards efficiency, we want to find the best possible approximation of a given degree $n$. With this in mind, introduce the following:
$$\rho_n(f) = \min_{\deg(p) \le n} E(p) = \min_{\deg(p) \le n}\left[\max_{a \le x \le b} |f(x) - p(x)|\right]$$
The number $\rho_n(f)$ will be the smallest possible uniform error, or minimax error, when approximating $f(x)$ by polynomials of degree at most $n$. If there is a polynomial giving this smallest error, we denote it by $m_n(x)$; thus $E(m_n) = \rho_n(f)$.
Example. Let $f(x) = e^x$ on $[-1, 1]$. In the following table, we give the values of $E(t_n)$, with $t_n(x)$ the Taylor polynomial of degree $n$ for $e^x$ about $x = 0$, and $E(m_n)$.

          Maximum Error in:
    n    t_n(x)     m_n(x)
    1    7.18E-1    2.79E-1
    2    2.18E-1    4.50E-2
    3    5.16E-2    5.53E-3
    4    9.95E-3    5.47E-4
    5    1.62E-3    4.52E-5
    6    2.26E-4    3.21E-6
    7    2.79E-5    2.00E-7
    8    3.06E-6    1.11E-8
    9    3.01E-7    5.52E-10
Consider graphically how we can improve on the Taylor polynomial
$$t_1(x) = 1 + x$$
as a uniform approximation to $e^x$ on the interval $[-1, 1]$. The linear minimax approximation is
$$m_1(x) = 1.2643 + 1.1752x$$

[Figures: the linear Taylor and minimax approximations to $e^x$; the error in the cubic Taylor approximation to $e^x$ (maximum about 0.0516); and the error in the cubic minimax approximation to $e^x$ (oscillating between about $\pm 0.00553$).]
Accuracy of the minimax approximation. It can be shown that
$$\rho_n(f) \le \frac{\left[(b - a)/2\right]^{n+1}}{(n + 1)!\,2^n}\,\max_{a \le x \le b}\left|f^{(n+1)}(x)\right|$$
For $f(x) = e^x$ on $[-1, 1]$, this bound becomes
$$\rho_n(e^x) \le \frac{e}{(n + 1)!\,2^n} \quad (*)$$

    n    Bound (*)   rho_n(f)
    1    6.80E-1     2.79E-1
    2    1.13E-1     4.50E-2
    3    1.42E-2     5.53E-3
    4    1.42E-3     5.47E-4
    5    1.18E-4     4.52E-5
    6    8.43E-6     3.21E-6
    7    5.27E-7     2.00E-7
CHEBYSHEV POLYNOMIALS
Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer $n \ge 0$, define the function
$$T_n(x) = \cos\left(n\cos^{-1}x\right), \quad -1 \le x \le 1 \quad (1)$$
This may not appear to be a polynomial, but we will show it is a polynomial of degree $n$. To simplify the manipulation of (1), we introduce
$$\theta = \cos^{-1}(x) \quad \text{or} \quad x = \cos(\theta), \quad 0 \le \theta \le \pi \quad (2)$$
Then
$$T_n(x) = \cos(n\theta) \quad (3)$$
Example. $n = 0$:
$$T_0(x) = \cos(0 \cdot \theta) = 1$$
$n = 1$:
$$T_1(x) = \cos(\theta) = x$$
$n = 2$:
$$T_2(x) = \cos(2\theta) = 2\cos^2(\theta) - 1 = 2x^2 - 1$$

[Figures: graphs of $T_0(x)$, $T_1(x)$, $T_2(x)$, and of $T_3(x)$, $T_4(x)$, on $[-1, 1]$.]
The triple recursion relation. Recall the trigonometric addition formulas,
$$\cos(\alpha \pm \beta) = \cos(\alpha)\cos(\beta) \mp \sin(\alpha)\sin(\beta)$$
Let $n \ge 1$, and apply these identities to get
$$T_{n+1}(x) = \cos[(n + 1)\theta] = \cos(n\theta + \theta) = \cos(n\theta)\cos(\theta) - \sin(n\theta)\sin(\theta)$$
$$T_{n-1}(x) = \cos[(n - 1)\theta] = \cos(n\theta - \theta) = \cos(n\theta)\cos(\theta) + \sin(n\theta)\sin(\theta)$$
Add these two equations, and then use (1) and (3) to obtain
$$T_{n+1}(x) + T_{n-1}(x) = 2\cos(n\theta)\cos(\theta) = 2xT_n(x)$$
$$T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x), \quad n \ge 1 \quad (4)$$
This is called the triple recursion relation for the Chebyshev polynomials. It is often used in evaluating them, rather than using the explicit formula (1).
Example. Recall
$$T_0(x) = 1, \qquad T_1(x) = x$$
$$T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x), \quad n \ge 1$$
Let $n = 2$. Then
$$T_3(x) = 2xT_2(x) - T_1(x) = 2x(2x^2 - 1) - x = 4x^3 - 3x$$
Let $n = 3$. Then
$$T_4(x) = 2xT_3(x) - T_2(x) = 2x(4x^3 - 3x) - (2x^2 - 1) = 8x^4 - 8x^2 + 1$$
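A minimal sketch of evaluating $T_n(x)$ by the triple recursion (4), vectorized over x; assume it is saved as cheb_eval.m:

    % Evaluate the Chebyshev polynomial T_n at the points in x via (4).
    function T = cheb_eval(n, x)
        Tprev = ones(size(x));       % T_0(x) = 1
        if n == 0, T = Tprev; return; end
        T = x;                       % T_1(x) = x
        for k = 1:n-1
            Tnext = 2*x.*T - Tprev;  % T_{k+1} = 2x T_k - T_{k-1}
            Tprev = T;
            T = Tnext;
        end
    end

For example, cheb_eval(4, 0.5) returns -0.5, matching $T_4(0.5) = 8(0.5)^4 - 8(0.5)^2 + 1$.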
The minimum size property. Note that
$$|T_n(x)| \le 1, \quad -1 \le x \le 1 \quad (5)$$
for all $n \ge 0$. Also, note that
$$T_n(x) = 2^{n-1}x^n + \text{lower degree terms}, \quad n \ge 1 \quad (6)$$
This can be proven using the triple recursion relation and mathematical induction.

Introduce a modified version of $T_n(x)$,
$$\widetilde{T}_n(x) = \frac{1}{2^{n-1}}T_n(x) = x^n + \text{lower degree terms} \quad (7)$$
From (5) and (6),
$$\left|\widetilde{T}_n(x)\right| \le \frac{1}{2^{n-1}}, \quad -1 \le x \le 1, \quad n \ge 1 \quad (8)$$
Example.
$$\widetilde{T}_4(x) = \frac{1}{8}\left(8x^4 - 8x^2 + 1\right) = x^4 - x^2 + \frac{1}{8}$$
A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial $\widetilde{T}_n(x)$ has size $1/2^{n-1}$ on $-1 \le x \le 1$, and this becomes smaller as the degree $n$ increases. In comparison,
$$\max_{-1 \le x \le 1} |x^n| = 1$$
Thus $x^n$ is a monic polynomial whose size does not change with increasing $n$.

Theorem. Let $n \ge 1$ be an integer, and consider all possible monic polynomials of degree $n$. Then the degree $n$ monic polynomial with the smallest maximum on $[-1, 1]$ is the modified Chebyshev polynomial $\widetilde{T}_n(x)$, and its maximum value on $[-1, 1]$ is $1/2^{n-1}$.

This result is used in devising applications of Chebyshev polynomials. We apply it to obtain an improved interpolation scheme.
A NEAR-MINIMAX APPROXIMATION METHOD
Let $f(x)$ be continuous on $[a, b] = [-1, 1]$. Consider approximating $f$ by an interpolatory polynomial of degree at most $n = 3$. Let $x_0, x_1, x_2, x_3$ be interpolation node points in $[-1, 1]$; let $c_3(x)$ be of degree $\le 3$ and interpolate $f(x)$ at $\{x_0, x_1, x_2, x_3\}$. The interpolation error is
$$f(x) - c_3(x) = \frac{\omega(x)}{4!}\,f^{(4)}(\xi_x), \quad -1 \le x \le 1 \quad (1)$$
$$\omega(x) = (x - x_0)(x - x_1)(x - x_2)(x - x_3) \quad (2)$$
with $\xi_x$ in $[-1, 1]$. We want to choose the nodes $\{x_0, x_1, x_2, x_3\}$ so as to minimize the maximum value of $|f(x) - c_3(x)|$ on $[-1, 1]$.

From (1), the only general quantity, independent of $f$, is $\omega(x)$. Thus we choose $\{x_0, x_1, x_2, x_3\}$ to minimize
$$\max_{-1 \le x \le 1} |\omega(x)| \quad (3)$$
Expand to get
$$\omega(x) = x^4 + \text{lower degree terms}$$
This is a monic polynomial of degree 4. From the theorem in the preceding section, the smallest possible value for (3) is obtained with
$$\omega(x) = \widetilde{T}_4(x) = \frac{T_4(x)}{2^3} = \frac{1}{8}\left(8x^4 - 8x^2 + 1\right) \quad (4)$$
and the smallest value of (3) is $1/2^3$ in this case. The equation (4) defines implicitly the nodes $\{x_0, x_1, x_2, x_3\}$: they are the roots of $T_4(x)$.
In our case this means solving
$$T_4(x) = \cos(4\theta) = 0, \qquad x = \cos(\theta)$$
$$4\theta = \pm\frac{\pi}{2},\ \pm\frac{3\pi}{2},\ \pm\frac{5\pi}{2},\ \pm\frac{7\pi}{2},\ \ldots$$
$$\theta = \pm\frac{\pi}{8},\ \pm\frac{3\pi}{8},\ \pm\frac{5\pi}{8},\ \pm\frac{7\pi}{8},\ \ldots$$
$$x = \cos\frac{\pi}{8},\ \cos\frac{3\pi}{8},\ \cos\frac{5\pi}{8},\ \cos\frac{7\pi}{8},\ \ldots \quad (5)$$
using $\cos(-\theta) = \cos(\theta)$. The first four values are distinct; the following ones are repetitive. For example,
$$\cos\frac{9\pi}{8} = \cos\frac{7\pi}{8}$$
These four values are the interpolation nodes. For $f(x) = e^x$, the theoretical error bound derived below gives
$$\max_{-1 \le x \le 1} |e^x - c_3(x)| \le \frac{e}{4!\,2^3} = \frac{e}{192} \doteq 0.014158$$
By direct calculation,
$$\max_{-1 \le x \le 1} |e^x - c_3(x)| \doteq 0.00666$$
Interpolation Data: $f(x) = e^x$

    i    x_i          f(x_i)       f[x_0, ..., x_i]
    0     0.923880    2.5190442    2.5190442
    1     0.382683    1.4662138    1.9453769
    2    -0.382683    0.6820288    0.7047420
    3    -0.923880    0.3969760    0.1751757

[Figure: the error $e^x - c_3(x)$ on $[-1, 1]$, oscillating between roughly $0.00666$ and $-0.00624$.]

For comparison, $E(t_3) \doteq 0.0142$ and $\rho_3(e^x) \doteq 0.00553$.
THE GENERAL CASE
Consider interpolating $f(x)$ on $[-1, 1]$ by a polynomial of degree $\le n$, with the interpolation nodes $\{x_0, \ldots, x_n\}$ in $[-1, 1]$. Denote the interpolation polynomial by $c_n(x)$. The interpolation error on $[-1, 1]$ is given by
$$f(x) - c_n(x) = \frac{\omega(x)}{(n + 1)!}\,f^{(n+1)}(\xi_x) \quad (7)$$
$$\omega(x) = (x - x_0)\cdots(x - x_n)$$
with $\xi_x$ an unknown point in $[-1, 1]$. In order to minimize the interpolation error, we seek to minimize
$$\max_{-1 \le x \le 1} |\omega(x)| \quad (8)$$
The polynomial being minimized is monic of degree $n + 1$,
$$\omega(x) = x^{n+1} + \text{lower degree terms}$$
From the theorem of the preceding section, this minimum is attained by the monic polynomial
$$\widetilde{T}_{n+1}(x) = \frac{1}{2^n}\,T_{n+1}(x)$$
Thus the interpolation nodes are the zeros of $T_{n+1}(x)$; and by the procedure that led to (5), they are given by
$$x_j = \cos\left(\frac{2j + 1}{2n + 2}\,\pi\right), \quad j = 0, 1, \ldots, n \quad (9)$$
The near-minimax approximation $c_n(x)$ of degree $n$ is obtained by interpolating to $f(x)$ at these $n + 1$ nodes on $[-1, 1]$. The polynomial $c_n(x)$ is sometimes called a Chebyshev approximation.
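A minimal sketch of building $c_n(x)$ from the nodes (9); polyfit and polyval are used here purely for brevity (a production code would evaluate the interpolant more stably, e.g. in barycentric form):

    % Near-minimax (Chebyshev) interpolation of f on [-1,1].
    f  = @(x) exp(x);
    n  = 3;
    j  = 0:n;
    xj = cos((2*j + 1)/(2*n + 2) * pi);   % Chebyshev nodes, formula (9)
    c  = polyfit(xj, f(xj), n);           % coefficients of c_n(x)
    xx = linspace(-1, 1, 2001);
    max(abs(f(xx) - polyval(c, xx)))      % about 6.66e-3 for n = 3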
Example. Let $f(x) = e^x$. The following table contains the maximum errors in $c_n(x)$ on $[-1, 1]$ for varying $n$. For comparison, we also include the corresponding minimax errors. These figures illustrate that for practical purposes, $c_n(x)$ is a satisfactory replacement for the minimax approximation $m_n(x)$.

    n    max |e^x - c_n(x)|    rho_n(e^x)
    1    3.72E-1               2.79E-1
    2    5.65E-2               4.50E-2
    3    6.66E-3               5.53E-3
    4    6.40E-4               5.47E-4
    5    5.18E-5               4.52E-5
    6    3.80E-6               3.21E-6
THEORETICAL INTERPOLATION ERROR
For the error
$$f(x) - c_n(x) = \frac{\omega(x)}{(n + 1)!}\,f^{(n+1)}(\xi_x)$$
we have
$$\max_{-1 \le x \le 1} |f(x) - c_n(x)| \le \frac{\max_{-1 \le x \le 1} |\omega(x)|}{(n + 1)!}\,\max_{-1 \le \xi \le 1}\left|f^{(n+1)}(\xi)\right|$$
From the theorem of the preceding section,
$$\max_{-1 \le x \le 1} |\omega(x)| = \max_{-1 \le x \le 1}\left|\widetilde{T}_{n+1}(x)\right| = \frac{1}{2^n}$$
in this case. Thus
$$\max_{-1 \le x \le 1} |f(x) - c_n(x)| \le \frac{1}{(n + 1)!\,2^n}\,\max_{-1 \le \xi \le 1}\left|f^{(n+1)}(\xi)\right|$$
OTHER INTERVALS
Consider approximating $f(x)$ on the finite interval $[a, b]$. Introduce the linear change of variables
$$x = \frac{1}{2}\left[(1 - t)\,a + (1 + t)\,b\right] \quad (10)$$
$$t = \frac{2}{b - a}\left[x - \frac{b + a}{2}\right] \quad (11)$$
Introduce
$$F(t) = f\left(\frac{1}{2}\left[(1 - t)\,a + (1 + t)\,b\right]\right), \quad -1 \le t \le 1$$
The function $F(t)$ on $[-1, 1]$ is equivalent to $f(x)$ on $[a, b]$, and we can move between them via (10)-(11). We can now proceed to approximate $f(x)$ on $[a, b]$ by instead approximating $F(t)$ on $[-1, 1]$.

Example. Approximating $f(x) = \cos x$ on $[0, \pi/2]$ is equivalent to approximating
$$F(t) = \cos\left(\frac{\pi(1 + t)}{4}\right), \quad -1 \le t \le 1$$
NUMERICAL DIFFERENTIATION
There are two major reasons for considering numerical approximations of the differentiation process.

1. Approximation of derivatives in ordinary differential equations and partial differential equations. This is done in order to reduce the differential equation to a form that can be solved more easily than the original differential equation.

2. Forming the derivative of a function $f(x)$ which is known only as empirical data $\{(x_i, y_i) \mid i = 1, \ldots, m\}$. The data generally is known only approximately, so that $y_i \approx f(x_i)$, $i = 1, \ldots, m$.

Recall the definition
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
This justifies using
$$f'(x) \approx \frac{f(x + h) - f(x)}{h} \equiv D_h f(x) \quad (1)$$
for small values of $h$. The approximation $D_h f(x)$ is called a numerical derivative of $f(x)$ with stepsize $h$.

Example. Use $D_h f(x)$ to approximate the derivative of $f(x) = \cos(x)$ at $x = \pi/6$. In the table, the error is almost halved when $h$ is halved.

    h         D_h f      Error      Ratio
    0.1       -0.54243   0.04243
    0.05      -0.52144   0.02144    1.98
    0.025     -0.51077   0.01077    1.99
    0.0125    -0.50540   0.00540    1.99
    0.00625   -0.50270   0.00270    2.00
    0.003125  -0.50135   0.00135    2.00
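A minimal sketch reproducing this forward difference table:

    % Forward difference approximation (1) to f'(pi/6) for f(x) = cos(x).
    f = @(x) cos(x);  x = pi/6;
    exact = -sin(x);                          % true derivative, -0.5
    for h = 0.1 ./ 2.^(0:5)
        Dh = (f(x + h) - f(x)) / h;           % D_h f(x)
        fprintf('h = %.6f   Dhf = %.5f   error = %.5f\n', h, Dh, exact - Dh);
    end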
Error behaviour. Using Taylor's theorem,
$$f(x + h) = f(x) + hf'(x) + \frac{1}{2}h^2 f''(c)$$
with $c$ between $x$ and $x + h$. Evaluating (1),
$$D_h f(x) = \frac{1}{h}\left\{\left[f(x) + hf'(x) + \frac{1}{2}h^2 f''(c)\right] - f(x)\right\} = f'(x) + \frac{1}{2}h f''(c)$$
$$f'(x) - D_h f(x) = -\frac{1}{2}h f''(c) \quad (2)$$
Using a higher order Taylor expansion,
$$f'(x) - D_h f(x) = -\frac{1}{2}h f''(x) - \frac{1}{6}h^2 f'''(c),$$
$$f'(x) - D_h f(x) \approx -\frac{1}{2}h f''(x) \quad (3)$$
for small values of $h$.

For $f(x) = \cos x$,
$$f'(x) - D_h f(x) = \frac{1}{2}h\cos c, \qquad c \in \left[\frac{\pi}{6}, \frac{\pi}{6} + h\right]$$
In the preceding table, check the accuracy of the approximation (3) with $x = \frac{\pi}{6}$.
The formula (1),
$$f'(x) \approx \frac{f(x + h) - f(x)}{h} \equiv D_h f(x)$$
is called a forward difference formula for approximating $f'(x)$. In contrast, the approximation
$$f'(x) \approx \frac{f(x) - f(x - h)}{h}, \quad h > 0 \quad (4)$$
is called a backward difference formula for approximating $f'(x)$. A similar derivation leads to
$$f'(x) - \frac{f(x) - f(x - h)}{h} = \frac{h}{2}f''(c) \quad (5)$$
for some $c$ between $x$ and $x - h$. The accuracy of the backward difference formula (4) is essentially the same as that of the forward difference formula (1). The motivation for this formula is in applications to solving differential equations.
DIFFERENTIATION USING INTERPOLATION
Let $P_n(x)$ be the degree $n$ polynomial that interpolates $f(x)$ at $n + 1$ node points $x_0, x_1, \ldots, x_n$. To calculate $f'(x)$ at some point $x = t$, use
$$f'(t) \approx P_n'(t) \quad (6)$$
Many different formulas can be obtained by varying $n$ and by varying the placement of the nodes $x_0, \ldots, x_n$ relative to the point $t$ of interest.
Example. Take $n = 2$, and use evenly spaced nodes $x_0$, $x_1 = x_0 + h$, $x_2 = x_1 + h$. Then
$$P_2(x) = f(x_0)L_0(x) + f(x_1)L_1(x) + f(x_2)L_2(x)$$
$$P_2'(x) = f(x_0)L_0'(x) + f(x_1)L_1'(x) + f(x_2)L_2'(x)$$
with
$$L_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}, \qquad L_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}, \qquad L_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}$$
Forming the derivatives of these Lagrange basis functions and evaluating them at $x = x_1$, we obtain
$$f'(x_1) \approx P_2'(x_1) = \frac{f(x_1 + h) - f(x_1 - h)}{2h} \equiv D_h f(x_1) \quad (7)$$
For the error,
$$f'(x_1) - \frac{f(x_1 + h) - f(x_1 - h)}{2h} = -\frac{h^2}{6}f'''(c_2) \quad (8)$$
with $x_1 - h \le c_2 \le x_1 + h$.
A proof of this begins with the interpolation error formula
$$f(x) - P_2(x) = \Psi_2(x)\,f[x_0, x_1, x_2, x], \qquad \Psi_2(x) = (x - x_0)(x - x_1)(x - x_2)$$
Differentiate to get
$$f'(x) - P_2'(x) = \Psi_2(x)\,\frac{d}{dx}f[x_0, x_1, x_2, x] + \Psi_2'(x)\,f[x_0, x_1, x_2, x]$$
With properties of the divided difference, we can show
$$f'(x) - P_2'(x) = \frac{1}{24}\Psi_2(x)\,f^{(4)}\!\left(c_{1,x}\right) + \frac{1}{6}\Psi_2'(x)\,f^{(3)}\!\left(c_{2,x}\right)$$
with $c_{1,x}$ and $c_{2,x}$ between the smallest and largest of the values $\{x_0, x_1, x_2, x\}$. Letting $x = x_1$ and noting that $\Psi_2(x_1) = 0$, we obtain (8).
Example. Take f(x) = cos(x) and x_1 = π/6. Then (7) is illustrated as follows.

    h         D_h f          Error          Ratio
    0.1       −0.49916708    −0.0008329
    0.05      −0.49979169    −0.0002083    4.00
    0.025     −0.49994792    −0.00005208   4.00
    0.0125    −0.49998698    −0.00001302   4.00
    0.00625   −0.49999674    −0.000003255  4.00

Note the smaller errors and faster convergence as compared to the forward difference formula (1).
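As a rough illustration (again, a sketch rather than part of the original notes), the central difference table comes from:

    f = @(x) cos(x);  x1 = pi/6;  dtrue = -sin(x1);
    for h = 0.1 ./ 2.^(0:4)
        Dh = (f(x1+h) - f(x1-h))/(2*h);    % central difference (7)
        fprintf('%9.6f  %12.8f  %12.4e\n', h, Dh, dtrue - Dh);
    end

Here the error behaves like −(h²/6) f'''(c), so halving h divides the error by about 4.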
UNDETERMINED COEFFICIENTS
Derive an approximation for f''(x) at x = t. Write
    f''(t) ≈ D_h^(2) f(t) ≡ A f(t+h) + B f(t) + C f(t−h)        (9)
with A, B, and C unspecified constants. Use Taylor polynomial approximations
    f(t−h) ≈ f(t) − h f'(t) + (h²/2) f''(t) − (h³/6) f'''(t) + (h⁴/24) f^(4)(t)
    f(t+h) ≈ f(t) + h f'(t) + (h²/2) f''(t) + (h³/6) f'''(t) + (h⁴/24) f^(4)(t)        (10)
Substitute into (9) and rearrange:
    D_h^(2) f(t) ≈ (A + B + C) f(t) + h(A − C) f'(t) + (h²/2)(A + C) f''(t)
                   + (h³/6)(A − C) f'''(t) + (h⁴/24)(A + C) f^(4)(t)        (11)
To have
    D_h^(2) f(t) ≈ f''(t)        (12)
for arbitrary functions f(x), require
    A + B + C = 0        : coefficient of f(t)
    h(A − C) = 0         : coefficient of f'(t)
    (h²/2)(A + C) = 1    : coefficient of f''(t)
Solution:
    A = C = 1/h²,  B = −2/h²        (13)
This determines
    D_h^(2) f(t) = [f(t+h) − 2f(t) + f(t−h)] / h²        (14)
For the error, substitute (13) into (11):
    D_h^(2) f(t) ≈ f''(t) + (h²/12) f^(4)(t)
Thus
    f''(t) − [f(t+h) − 2f(t) + f(t−h)] / h² ≈ −(h²/12) f^(4)(t)        (15)
Example. Let f(x) = cos(x), t = π/6; use (14) to calculate f''(t) = −cos(π/6) ≐ −0.8660254.

    h         D_h^(2) f      Error        Ratio
    0.5       −0.84813289    −1.789E−2
    0.25      −0.86152424    −4.501E−3    3.97
    0.125     −0.86489835    −1.127E−3    3.99
    0.0625    −0.86574353    −2.819E−4    4.00
    0.03125   −0.86595493    −7.048E−5    4.00
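A short Matlab sketch of formula (14), again illustrative rather than from the original notes:

    f = @(x) cos(x);  t = pi/6;  d2true = -cos(t);
    for h = 0.5 ./ 2.^(0:4)
        D2 = (f(t+h) - 2*f(t) + f(t-h))/h^2;   % second difference (14)
        fprintf('%9.5f  %12.8f  %11.3e\n', h, D2, d2true - D2);
    end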
EFFECTS OF ERROR IN FUNCTION VALUES
Recall
    D_h^(2) f(x_1) = [f(x_2) − 2f(x_1) + f(x_0)] / h² ≈ f''(x_1)
with x_2 = x_1 + h, x_0 = x_1 − h. Assume the actual function values used in the computation contain data error, and denote these values by f̂_0, f̂_1, and f̂_2. Introduce the data errors:
    ε_i = f(x_i) − f̂_i,  i = 0, 1, 2        (16)
The actual quantity calculated is
    D̂_h^(2) f(x_1) = [f̂_2 − 2 f̂_1 + f̂_0] / h²        (17)
For the error in this quantity, replace f̂_j by f(x_j) − ε_j, j = 0, 1, 2, to obtain the following:
    f''(x_1) − D̂_h^(2) f(x_1)
      = f''(x_1) − { [f(x_2) − ε_2] − 2[f(x_1) − ε_1] + [f(x_0) − ε_0] } / h²
      = { f''(x_1) − [f(x_2) − 2f(x_1) + f(x_0)] / h² } + (ε_2 − 2ε_1 + ε_0) / h²
      ≈ −(h²/12) f^(4)(x_1) + (ε_2 − 2ε_1 + ε_0) / h²        (18)
The last line uses (15).
The errors {ε_0, ε_1, ε_2} are generally random in some interval [−δ, δ]. If {f̂_0, f̂_1, f̂_2} are experimental data, then δ is a bound on the experimental error. If the values {f̂_j} are obtained from computing f(x) in a computer, then the errors ε_j are the combination of rounding or chopping errors and δ is a bound on these errors.
In either case, (18) yields the approximate inequality
    |f''(x_1) − D̂_h^(2) f(x_1)| ≤ (h²/12) |f^(4)(x_1)| + 4δ/h²        (19)
This suggests that as h → 0, the error will eventually increase, because of the final term 4δ/h².
Example. Calculate D̂_h^(2) f(x_1) for f(x) = cos(x) at x_1 = π/6. To show the effect of rounding errors, the values f̂_i are obtained by rounding f(x_i) to six significant digits; and the errors satisfy
    |ε_i| ≤ 5.0 × 10⁻⁷ ≡ δ,  i = 0, 1, 2
Other than these rounding errors, the formula D̂_h^(2) f(x_1) is calculated exactly. In this example, the bound (19) becomes
    |f''(x_1) − D̂_h^(2) f(x_1)| ≤ (h²/12) cos(π/6) + (4/h²)(5 × 10⁻⁷)
      ≐ 0.0722 h² + (2 × 10⁻⁶)/h² ≡ E(h)
For h = 0.125, the bound E(h) ≐ 0.00126, which is not too far off from the actual error given in the table.

    h            D̂_h^(2) f(x_1)   Error
    0.5          −0.848128        −0.017897
    0.25         −0.861504        −0.004521
    0.125        −0.864832        −0.001193
    0.0625       −0.865536        −0.000489
    0.03125      −0.865280        −0.000745
    0.015625     −0.860160        −0.005865
    0.0078125    −0.851968        −0.014057
    0.00390625   −0.786432        −0.079593

The bound E(h) indicates that there is an optimal value of h, call it h*, below which decreasing h further increases the bound (and, eventually, the actual error). Minimizing E(h) with respect to h leads to h* ≐ 0.0726, which is consistent with the behavior of the errors in the table.
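The effect of the rounding can be simulated directly; the following sketch assumes Matlab's round(x, 6, 'significant') (available in R2014b and later) to mimic six-significant-digit table values.

    f6 = @(x) round(cos(x), 6, 'significant');  % cos x rounded to 6 digits
    x1 = pi/6;  d2true = -cos(x1);
    for h = 0.5 ./ 2.^(0:8)
        D2 = (f6(x1+h) - 2*f6(x1) + f6(x1-h))/h^2;   % noisy version of (14)
        fprintf('%11.7f  %11.6f  %11.3e\n', h, D2, d2true - D2);
    end

The printed errors first decrease like h² and then grow like 1/h², as predicted by (19).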
LINEAR SYSTEMS
Consider the following example of a linear system:
    x_1 + 2x_2 − 3x_3 = −5
    x_1 + x_3 = 3
    3x_1 + x_2 − 3x_3 = −3
Its unique solution is
    x_1 = 1,  x_2 = 0,  x_3 = 2
In general we want to solve n equations in n unknowns. For this, we need some simplifying notation. In particular we introduce arrays. We can think of these as a means for storing information about the linear system in a computer. In the above case, we introduce
    A = [ 1 2 −3; 1 0 1; 3 1 −3 ],  b = [ −5; 3; −3 ],  x = [ 1; 0; 2 ]
For the general system of n equations we write
    A = [ a_{1,1} ... a_{1,n}; ...; a_{n,1} ... a_{n,n} ],  b = [ b_1; ...; b_n ],  x = [ x_1; ...; x_n ]
and the linear system is then written compactly as Ax = b.
A TRIDIAGONAL SYSTEM
Consider the tridiagonal linear system
    3x_1 − x_2 = 2
    −x_1 + 3x_2 − x_3 = 1
        ...
    −x_{n−2} + 3x_{n−1} − x_n = 1
    −x_{n−1} + 3x_n = 2
The solution is
    x_1 = ... = x_n = 1
This has the associated arrays
    A = [ 3 −1 0 ... 0; −1 3 −1; ...; −1 3 −1; 0 ... −1 3 ],  b = [ 2; 1; ...; 1; 2 ],  x = [ 1; 1; ...; 1; 1 ]
More generally, a tridiagonal matrix has nonzero entries only on its diagonal and the two adjacent diagonals, say
    [ a_1 c_1 0 ... 0; b_2 a_2 c_2 0; 0 b_3 a_3 c_3; ...; 0 ... b_n a_n ]
To see that the system above is uniquely solvable, suppose Ax = 0 and let |x_k| = max_{1≤j≤n} |x_j|. If 1 < k < n, then equation k gives 3x_k = x_{k−1} + x_{k+1}, and so
    |x_k| ≤ (1/3)( |x_{k−1}| + |x_{k+1}| ) ≤ (1/3)( |x_k| + |x_k| ) = (2/3) |x_k|
This implies x_k = 0, and therefore x = 0. A similar proof is valid if k = 1 or k = n, using the first or the last equation, respectively.
Thus the original tridiagonal linear system Ax = b has a unique solution x for each right side b.
METHODS OF SOLUTION
There are two general categories of numerical methods for solving Ax = b.
Direct Methods: These are methods with a finite number of steps; and they end with the exact solution x, provided that all arithmetic operations are exact. The most used of these methods is Gaussian elimination, which we begin with. There are other direct methods, but we do not study them here.
Iteration Methods: These are used in solving all types of linear systems, but they are most commonly used with large sparse systems, especially those produced by discretizing partial differential equations. This is an extremely active area of research.
MATRICES IN MATLAB
Consider the matrices
    A = [ 1 2 3; 2 2 3; 3 3 3 ],  b = [ 1; 1; 1 ]
In Matlab these are entered row by row, e.g. A = [1 2 3; 2 2 3; 3 3 3]. Matrices of the same order are added entry by entry; for example,
    [ 1 2; 3 4; 5 6 ] + [ 1 −1; −1 1; 1 −1 ] = [ 2 1; 2 5; 6 5 ]
MULTIPLICATION BY A CONSTANT
    c [ a_{1,1} ... a_{1,n}; ...; a_{m,1} ... a_{m,n} ] = [ c a_{1,1} ... c a_{1,n}; ...; c a_{m,1} ... c a_{m,n} ]
EXAMPLE.
    5 [ 1 2; 3 4; 5 6 ] = [ 5 10; 15 20; 25 30 ]
    (−1) [ a b; c d ] = [ −a −b; −c −d ]
THE ZERO MATRIX 0
Define the zero matrix of order m × n as the matrix of that order having all zero entries. It is sometimes written as 0_{m×n}, but more commonly as simply 0. Then for any matrix A of order m × n,
    A + 0 = 0 + A = A
The zero matrix 0_{m×n} acts in the same role as does the number zero when doing arithmetic with real and complex numbers.
EXAMPLE.
    [ 1 2; 3 4 ] + [ 0 0; 0 0 ] = [ 1 2; 3 4 ]
We denote by −A the solution of the equation
    A + B = 0
It is the matrix obtained by taking the negative of all of the entries in A. For example,
    [ a b; c d ] + [ −a −b; −c −d ] = [ 0 0; 0 0 ]
    −[ a b; c d ] = [ −a −b; −c −d ] = (−1) [ a b; c d ]
    −[ a_{1,1} a_{1,2}; a_{2,1} a_{2,2} ] = [ −a_{1,1} −a_{1,2}; −a_{2,1} −a_{2,2} ]
MATRIX MULTIPLICATION
Let A = [a_{i,j}] have order m × n and B = [b_{i,j}] have order n × p. Then
    C = AB
is a matrix of order m × p with
    c_{i,j} = A_{i,*} B_{*,j} = a_{i,1} b_{1,j} + a_{i,2} b_{2,j} + ... + a_{i,n} b_{n,j}
or equivalently, row i of A times column j of B:
    c_{i,j} = [ a_{i,1} a_{i,2} ... a_{i,n} ] [ b_{1,j}; b_{2,j}; ...; b_{n,j} ]
EXAMPLES
    [ 1 2 3; 4 5 6 ] [ 1 2; 3 4; 5 6 ] = [ 22 28; 49 64 ]
    [ 1 2; 3 4; 5 6 ] [ 1 2 3; 4 5 6 ] = [ 9 12 15; 19 26 33; 29 40 51 ]
The matrix-vector product is the special case p = 1:
    [ a_{1,1} ... a_{1,n}; ...; a_{n,1} ... a_{n,n} ] [ x_1; ...; x_n ]
        = [ a_{1,1} x_1 + ... + a_{1,n} x_n; ...; a_{n,1} x_1 + ... + a_{n,n} x_n ]
The identity matrix of order n,
    I = [ 1 0 ... 0; 0 1 ... 0; ...; 0 ... 0 1 ]
satisfies AI = IA = A for every n × n matrix A. A square matrix A may have an inverse A⁻¹, satisfying A A⁻¹ = A⁻¹ A = I. For example,
    [ 1 1/2 1/3; 1/2 1/3 1/4; 1/3 1/4 1/5 ]⁻¹ = [ 9 −36 30; −36 192 −180; 30 −180 180 ]
Not every square matrix has an inverse. For example,
    [ 1 2 3; 4 5 6; 7 8 9 ] [ 1; −2; 1 ] = [ 0; 0; 0 ]
Therefore, the linear system
    [ 1 2 3; 4 5 6; 7 8 9 ] [ x_1; x_2; x_3 ] = [ b_1; b_2; b_3 ]
cannot have a unique solution for any right side b, and the coefficient matrix has no inverse.
PARTITIONED MATRICES
Matrices can be built up from smaller matrices; or conversely, we can decompose a large matrix into a matrix of smaller matrices. For example, consider
    A = [ 1 2 0; 2 1 1; 0 1 5 ] = [ B c; d e ]
    B = [ 1 2; 2 1 ],  c = [ 0; 1 ],  d = [ 0 1 ],  e = 5
Matlab allows you to build up larger matrices out of smaller matrices in exactly this manner; and smaller matrices can be defined as portions of larger matrices.
We will often write an n × n square matrix in terms of its columns:
    A = [ A_{*,1}, ..., A_{*,n} ]
For the n × n identity matrix I, we write
    I = [ e_1, ..., e_n ]
with e_j denoting a column vector with a 1 in position j and zeros elsewhere.
ARITHMETIC OF PARTITIONED MATRICES
As with matrices, we can do addition and multiplication with partitioned matrices provided the individual constituent parts have the proper orders.
For example, let A, B, C, D be n × n matrices. Then
    [ I A; B I ] [ I C; D I ] = [ I + AD, C + A; B + D, I + BC ]
Let A be n × n and x be a column vector of length n. Then
    Ax = [ A_{*,1}, ..., A_{*,n} ] [ x_1; ...; x_n ] = x_1 A_{*,1} + ... + x_n A_{*,n}
Compare this to
    [ a_{1,1} ... a_{1,n}; ...; a_{n,1} ... a_{n,n} ] [ x_1; ...; x_n ]
        = [ a_{1,1} x_1 + ... + a_{1,n} x_n; ...; a_{n,1} x_1 + ... + a_{n,n} x_n ]
For example, a matrix can be enlarged by appending a column: with X = [ 1 2; 3 4; 5 6 ] and y = [ 7; 8; 9 ], the partitioned matrix [X, y] = [ 1 2 7; 3 4 8; 5 6 9 ].
GAUSSIAN ELIMINATION
Consider solving the linear system with augmented matrix
    [ 3 2  1 | 0 ]
    [ 6 2 −2 | 6 ]
    [ 9 7  1 | 1 ]
In step 1, we eliminate x_1 from equations 2 and 3. We multiply row 1 by 2 and subtract it from row 2; and we multiply row 1 by 3 and subtract it from row 3. This yields
    [ 3  2  1 | 0 ]
    [ 0 −2 −4 | 6 ]
    [ 0  1 −2 | 1 ]
In step 2, we eliminate x_2 from equation 3. We multiply row 2 by −1/2 and subtract it from row 3. This yields
    [ 3  2  1 | 0 ]
    [ 0 −2 −4 | 6 ]
    [ 0  0 −4 | 4 ]
Back substitution then gives x_3 = −1, x_2 = −1, x_1 = 1.
In general, the same process converts the augmented matrix
    [A^(1) | b^(1)] =
    [ a^(1)_{1,1} ... a^(1)_{1,n} | b^(1)_1 ]
    [ ...                         | ...     ]
    [ a^(1)_{n,1} ... a^(1)_{n,n} | b^(1)_n ]
(where A^(1) = A and b^(1) = b) into an upper triangular system
    [ a^(1)_{1,1} ... a^(1)_{1,n} | b^(1)_1 ]
    [ 0   ...                     | ...     ]
    [ 0 ... 0     a^(n)_{n,n}     | b^(n)_n ]
which we denote by Ux = g:
    [ u_{1,1} ... u_{1,n} ] [ x_1 ]   [ g_1 ]
    [ 0   ...             ] [ ... ] = [ ... ]
    [ 0 ... 0  u_{n,n}    ] [ x_n ]   [ g_n ]
This is solved by back substitution:
    x_n = g_n / u_{n,n}
    x_k = ( g_k − { u_{k,k+1} x_{k+1} + ... + u_{k,n} x_n } ) / u_{k,k}
for k = n−1, ..., 1. What we have done here is simply a more carefully defined and methodical version of what you have done in high school algebra.
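Back substitution is only a few lines of Matlab. The following is a minimal sketch of the formulas above (it assumes U is upper triangular with nonzero diagonal):

    function x = backsub(U, g)
    % Solve the upper triangular system U*x = g by back substitution.
    n = length(g);
    x = zeros(n,1);
    x(n) = g(n)/U(n,n);
    for k = n-1:-1:1
        x(k) = (g(k) - U(k,k+1:n)*x(k+1:n)) / U(k,k);
    end
    end

The product U(k,k+1:n)*x(k+1:n) computes the bracketed sum u_{k,k+1} x_{k+1} + ... + u_{k,n} x_n in one step.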
How do we carry out the conversion of [A^(1) | b^(1)] to the upper triangular form?
In step 1 we eliminate x_1 from equations 2, ..., n. Define the multipliers
    m_{i,1} = a^(1)_{i,1} / a^(1)_{1,1},  i = 2, ..., n
and subtract m_{i,1} times row 1 from row i. This converts [A^(1) | b^(1)] to
    [A^(2) | b^(2)] =
    [ a^(1)_{1,1}  a^(1)_{1,2} ... a^(1)_{1,n} | b^(1)_1 ]
    [ 0            a^(2)_{2,2} ... a^(2)_{2,n} | b^(2)_2 ]
    [ ...                                      | ...     ]
    [ 0            a^(2)_{n,2} ... a^(2)_{n,n} | b^(2)_n ]
Continuing in this manner, after k−1 steps the first k−1 columns are zero below the diagonal:
    [ a^(1)_{1,1} a^(1)_{1,2} ...              ... a^(1)_{1,n} | b^(1)_1 ]
    [ 0           a^(2)_{2,2} ...              ... a^(2)_{2,n} | b^(2)_2 ]
    [ ...             0       a^(k)_{k,k}      ... a^(k)_{k,n} | b^(k)_k ]
    [ ...                     ...                              | ...     ]
    [ 0    ...    0           a^(k)_{n,k}      ... a^(k)_{n,n} | b^(k)_n ]
Step k eliminates x_k from equations k+1, ..., n, using the multipliers
    m_{i,k} = a^(k)_{i,k} / a^(k)_{k,k},  i = k+1, ..., n
and produces [A^(k+1) | b^(k+1)], in which column k is also zero below the diagonal:
    [ a^(1)_{1,1} ...                ...                  a^(1)_{1,n}     | b^(1)_1     ]
    [ 0  ...      a^(k)_{k,k}        a^(k)_{k,k+1} ...    a^(k)_{k,n}     | b^(k)_k     ]
    [ ...         0                  a^(k+1)_{k+1,k+1} .. a^(k+1)_{k+1,n} | b^(k+1)_{k+1} ]
    [ ...                            ...                                  | ...         ]
    [ 0  ...      0                  a^(k+1)_{n,k+1} ...  a^(k+1)_{n,n}   | b^(k+1)_n   ]
After n−1 such steps we obtain the desired upper triangular system [U | g].
What if a^(1)_{1,1} = 0? In that case we look for an equation in which x_1 is present. To do this in such a way as to avoid zero pivots to the maximum extent possible, we do the following.
Look at all the elements in the first column,
    a^(1)_{1,1}, a^(1)_{2,1}, ..., a^(1)_{n,1}
and pick the largest in size. Say it is
    |a^(1)_{k,1}| = max_{j=1,...,n} |a^(1)_{j,1}|
Interchange rows 1 and k, and then eliminate x_1 from equations 2, ..., n as before. This again yields [A^(2) | b^(2)], with zeros below the diagonal in column 1.
What if a^(2)_{2,2} = 0? Then we proceed as before. Among the elements
    a^(2)_{2,2}, a^(2)_{3,2}, ..., a^(2)_{n,2}
pick the one of largest size:
    |a^(2)_{k,2}| = max_{j=2,...,n} |a^(2)_{j,2}|
Interchange rows 2 and k, and then eliminate x_2 from equations 3, ..., n. This yields
    [A^(3) | b^(3)] =
    [ a^(1)_{1,1} a^(1)_{1,2} a^(1)_{1,3} ... a^(1)_{1,n} | b^(1)_1 ]
    [ 0           a^(2)_{2,2} a^(2)_{2,3} ... a^(2)_{2,n} | b^(2)_2 ]
    [ 0           0           a^(3)_{3,3} ... a^(3)_{3,n} | b^(3)_3 ]
    [ ...                                                 | ...     ]
    [ 0           0           a^(3)_{n,3} ... a^(3)_{n,n} | b^(3)_n ]
Choosing the pivot element in this way at every step is called partial pivoting. With it, the multipliers satisfy
    |m_{i,1}| ≤ 1,  i = 2, ..., n
Thus in the calculation of a^(2)_{i,j} and b^(2)_i, we have that the elements do not grow rapidly in size. This is in comparison to what might happen otherwise, in which the multipliers m_{i,1} might have been very large. This property is true of the multipliers at every step of the elimination process:
    |m_{i,k}| ≤ 1,  i = k+1, ..., n,  k = 1, ..., n−1
The property
    |m_{i,k}| ≤ 1,  i = k+1, ..., n
leads to good error propagation properties in Gaussian elimination with partial pivoting. The only error in Gaussian elimination is that derived from the rounding errors in the arithmetic operations. For example, at the first elimination step (eliminating x_1 from equations 2 thru n),
    a^(2)_{i,j} = a^(1)_{i,j} − m_{i,1} a^(1)_{1,j},  j = 2, ..., n
    b^(2)_i = b^(1)_i − m_{i,1} b^(1)_1
The above property on the size of the multipliers prevents these numbers and the errors in their calculation from growing as rapidly as they might if no partial pivoting was used.
As an example of the improvement in accuracy obtained with partial pivoting, see the example on pages 262-263.
OPERATION COUNTS
One of the major ways in which we compare the efficiency of different numerical methods is to count the number of needed arithmetic operations. For solving the linear system
    a_{1,1} x_1 + ... + a_{1,n} x_n = b_1
        ...
    a_{n,1} x_1 + ... + a_{n,n} x_n = b_n
using Gaussian elimination, we have the following operation counts.
1. A → U, where we are converting Ax = b to Ux = g:
    Divisions:        n(n−1)/2
    Additions:        n(n−1)(2n−1)/6
    Multiplications:  n(n−1)(2n−1)/6
2. b → g:
    Additions:        n(n−1)/2
    Multiplications:  n(n−1)/2
3. Solving Ux = g:
    Divisions:        n
    Additions:        n(n−1)/2
    Multiplications:  n(n−1)/2
On some machines, the cost of a division is much more than that of a multiplication; whereas on others there is not any important difference. We assume the latter; and then the operation costs are as follows (MD = multiplications and divisions, AS = additions and subtractions):
    MD(A → U)   = n(n² − 1)/3
    MD(b → g)   = n(n−1)/2
    MD(Find x)  = n(n+1)/2
    AS(A → U)   = n(n−1)(2n−1)/6
    AS(b → g)   = n(n−1)/2
    AS(Find x)  = n(n−1)/2
Thus the total number of operations is
    Additions:                     (2n³ + 3n² − 5n)/6
    Multiplications and divisions: (n³ + 3n² − n)/3
Both are around n³/3, and thus the total operations count is approximately
    (2/3) n³
What happens to the cost when n is doubled? Since the cost is proportional to n³, it increases by a factor of approximately 8.
Solving Ax = b and Ax = c. What is the cost? Only the modification of the right side is different in these two cases. Thus the additional cost is
    MD(b → g) + MD(Find x) = n²
    AS(b → g) + AS(Find x) = n(n−1)
The total is around 2n² operations, which is quite a bit smaller than (2/3)n³ when n is even moderately large, say n = 100.
Thus one can solve the linear system Ax = c at little additional cost to that for solving Ax = b. This has important consequences when it comes to estimation of the error in computed solutions.
CALCULATING THE MATRIX INVERSE
Consider finding the inverse of a 3 × 3 matrix
    A = [ a_{1,1} a_{1,2} a_{1,3}; a_{2,1} a_{2,2} a_{2,3}; a_{3,1} a_{3,2} a_{3,3} ] = [ A_{*,1}, A_{*,2}, A_{*,3} ]
We want to find a matrix
    X = [ X_{*,1}, X_{*,2}, X_{*,3} ]
for which AX = I:
    A [ X_{*,1}, X_{*,2}, X_{*,3} ] = [ e_1, e_2, e_3 ]
    [ A X_{*,1}, A X_{*,2}, A X_{*,3} ] = [ e_1, e_2, e_3 ]
This means we want to solve
    A X_{*,1} = e_1,  A X_{*,2} = e_2,  A X_{*,3} = e_3        (1)
i.e. three linear systems, all with the same matrix of coefficients A.
MATRIX INVERSE EXAMPLE
    A = [ 1 1 −2; 1 1 1; 1 −1 0 ]
Augment A with the identity matrix and eliminate:
    [ 1  1 −2 | 1 0 0 ]
    [ 1  1  1 | 0 1 0 ]
    [ 1 −1  0 | 0 0 1 ]
Using m_{2,1} = 1 and m_{3,1} = 1:
    [ 1  1 −2 |  1 0 0 ]
    [ 0  0  3 | −1 1 0 ]
    [ 0 −2  2 | −1 0 1 ]
Since the (2,2) pivot is zero, interchange rows 2 and 3:
    [ 1  1 −2 |  1 0 0 ]
    [ 0 −2  2 | −1 0 1 ]
    [ 0  0  3 | −1 1 0 ]
Back substitution for each of the three right-hand columns then yields
    A⁻¹ = [ 1/6 1/3 1/2; 1/6 1/3 −1/2; −1/3 1/3 0 ]
The cost of computing A⁻¹ by solving the n systems in (1) is approximately (2/3)n³ operations for the elimination plus n × 2n² for the n right sides:
    (2/3)n³ + 2n³ = (8/3)n³ operations, approximately
It costs approximately four times as many operations to invert A as to solve a single system. With attention to the form of the right-hand sides in (1) this can be reduced to 2n³ operations.
MATLAB MATRIX OPERATIONS
To solve the linear system Ax = b in Matlab, use
x = A\b
In Matlab, the command
inv (A)
will calculate the inverse of A.
There are many matrix operations built into Matlab,
both for general matrices and for special classes of
matrices. We do not discuss those here, but recom-
mend the student to investigate these thru the Matlab
help options.
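For instance (an illustrative usage, based on the 3 × 3 elimination example worked above):

    A = [3 2 1; 6 2 -2; 9 7 1];  b = [0; 6; 1];
    x = A \ b                 % solves Ax = b; should give (1, -1, -1)
    r = norm(A*x - b)         % residual check; should be near zero
    Ainv = inv(A);            % explicit inverse; rarely needed in practice

In practice A\b is preferred over inv(A)*b: it is cheaper and propagates rounding errors less.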
GAUSSIAN ELIMINATION - REVISITED
Consider solving the linear system
    2x_1 + x_2 − x_3 + 2x_4 = 5
    4x_1 + 5x_2 − 3x_3 + 6x_4 = 9
    −2x_1 + 5x_2 − 2x_3 + 6x_4 = 4
    4x_1 + 11x_2 − 4x_3 + 8x_4 = 2
by Gaussian elimination without pivoting. We denote this linear system by Ax = b. The augmented matrix for this system is
    [A | b] =
    [  2  1 −1 2 | 5 ]
    [  4  5 −3 6 | 9 ]
    [ −2  5 −2 6 | 4 ]
    [  4 11 −4 8 | 2 ]
To eliminate x_1 from equations 2, 3, and 4, use multipliers
    m_{2,1} = 2,  m_{3,1} = −1,  m_{4,1} = 2
This will introduce zeros into the positions below the diagonal in column 1, yielding
    [ 2 1 −1 2 |  5 ]
    [ 0 3 −1 2 | −1 ]
    [ 0 6 −3 8 |  9 ]
    [ 0 9 −2 4 | −8 ]
To eliminate x_2 from equations 3 and 4, use multipliers
    m_{3,2} = 2,  m_{4,2} = 3
This reduces the augmented matrix to
    [ 2 1 −1  2 |  5 ]
    [ 0 3 −1  2 | −1 ]
    [ 0 0 −1  4 | 11 ]
    [ 0 0  1 −2 | −5 ]
To eliminate x_3 from equation 4, use the multiplier
    m_{4,3} = −1
This reduces the augmented matrix to
    [ 2 1 −1 2 |  5 ]
    [ 0 3 −1 2 | −1 ]
    [ 0 0 −1 4 | 11 ]
    [ 0 0  0 2 |  6 ]
Back substitution now gives
    x_4 = 3,  x_3 = 1,  x_2 = −2,  x_1 = 1
Introduce the lower triangular matrix of multipliers
    L = [ 1 0 0 0; m_{2,1} 1 0 0; m_{3,1} m_{3,2} 1 0; m_{4,1} m_{4,2} m_{4,3} 1 ]
      = [ 1 0 0 0; 2 1 0 0; −1 2 1 0; 2 3 −1 1 ]
and let U be the final upper triangular coefficient matrix. Then direct multiplication verifies A = LU:
    [  1 0  0 0 ] [ 2 1 −1 2 ]   [  2  1 −1 2 ]
    [  2 1  0 0 ] [ 0 3 −1 2 ]   [  4  5 −3 6 ]
    [ −1 2  1 0 ] [ 0 0 −1 4 ] = [ −2  5 −2 6 ]
    [  2 3 −1 1 ] [ 0 0  0 2 ]   [  4 11 −4 8 ]
Row interchanges can be expressed with permutation matrices. Consider
    P = [ 0 0 1 0; 1 0 0 0; 0 0 0 1; 0 1 0 0 ]
Then
    PA = P [ a_{1,1} a_{1,2} a_{1,3} a_{1,4}; a_{2,1} a_{2,2} a_{2,3} a_{2,4}; a_{3,1} a_{3,2} a_{3,3} a_{3,4}; a_{4,1} a_{4,2} a_{4,3} a_{4,4} ]
       = [ A_{3,*}; A_{1,*}; A_{4,*}; A_{2,*} ]
that is, multiplying A on the left by P reorders the rows of A (here: rows 3, 1, 4, 2). In this notation, Gaussian elimination with partial pivoting produces a factorization of the form PA = LU (compare the Matlab command lu below).
Given A = LU, the system Ax = b becomes LUx = b. Write Ux = g; then g satisfies the lower triangular system Lg = b:
    g_1 = b_1
    ℓ_{2,1} g_1 + g_2 = b_2
    ℓ_{3,1} g_1 + ℓ_{3,2} g_2 + g_3 = b_3
        ...
    ℓ_{n,1} g_1 + ... + ℓ_{n,n−1} g_{n−1} + g_n = b_n
We solve it by forward substitution. Then we solve the upper triangular system Ux = g by back substitution.
VARIANTS OF GAUSSIAN ELIMINATION
If no partial pivoting is needed, then we can look for a factorization
    A = LU
without going thru the Gaussian elimination process. For example, suppose A is 4 × 4. We write
    [ a_{1,1} a_{1,2} a_{1,3} a_{1,4} ]   [ 1       0       0       0 ] [ u_{1,1} u_{1,2} u_{1,3} u_{1,4} ]
    [ a_{2,1} a_{2,2} a_{2,3} a_{2,4} ] = [ ℓ_{2,1} 1       0       0 ] [ 0       u_{2,2} u_{2,3} u_{2,4} ]
    [ a_{3,1} a_{3,2} a_{3,3} a_{3,4} ]   [ ℓ_{3,1} ℓ_{3,2} 1       0 ] [ 0       0       u_{3,3} u_{3,4} ]
    [ a_{4,1} a_{4,2} a_{4,3} a_{4,4} ]   [ ℓ_{4,1} ℓ_{4,2} ℓ_{4,3} 1 ] [ 0       0       0       u_{4,4} ]
To find the elements {ℓ_{i,j}} and {u_{i,j}}, we multiply the right side matrices L and U and match the results with the corresponding elements in A.
Multiplying the first row of L times all of the columns of U leads to
    u_{1,j} = a_{1,j},  j = 1, 2, 3, 4
Then multiplying rows 2, 3, 4 times the first column of U yields
    ℓ_{i,1} u_{1,1} = a_{i,1},  i = 2, 3, 4
and we can solve for {ℓ_{2,1}, ℓ_{3,1}, ℓ_{4,1}}. We can continue this process, finding the second row of U and then the second column of L, and so on. For example, to solve for ℓ_{4,3}, we need to solve for it in
    ℓ_{4,1} u_{1,3} + ℓ_{4,2} u_{2,3} + ℓ_{4,3} u_{3,3} = a_{4,3}
Why do this? A hint of an answer is given by this last equation. If we had an n × n matrix A, then we would find ℓ_{n,n−1} by solving for it in the equation
    ℓ_{n,1} u_{1,n−1} + ℓ_{n,2} u_{2,n−1} + ... + ℓ_{n,n−1} u_{n−1,n−1} = a_{n,n−1}
    ℓ_{n,n−1} = ( a_{n,n−1} − [ ℓ_{n,1} u_{1,n−1} + ... + ℓ_{n,n−2} u_{n−2,n−1} ] ) / u_{n−1,n−1}
Embedded in this formula we have a dot product. This is in fact typical of this process, with the length of the inner products varying from one position to another. Recalling the discussion of dot products, we can evaluate this last formula by using a higher precision arithmetic and thus avoid many rounding errors.
This leads to a variant of Gaussian elimination in which there are far fewer rounding errors.
With ordinary Gaussian elimination, the number of rounding errors is proportional to n³. This variant reduces the number of rounding errors, with the number now being proportional to only n². This can lead to major increases in accuracy, especially for matrices which are very sensitive to small changes.
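A compact-factorization sketch in Matlab, following the row-of-U / column-of-L pattern just described (an illustration only; it assumes no pivoting is needed and no zero pivots occur):

    function [L, U] = lu_compact(A)
    % Compact (Doolittle-style) LU factorization without pivoting.
    n = size(A,1);
    L = eye(n);  U = zeros(n);
    for k = 1:n
        % row k of U: u(k,j) = a(k,j) - sum_{s<k} l(k,s)*u(s,j)
        U(k,k:n) = A(k,k:n) - L(k,1:k-1)*U(1:k-1,k:n);
        % column k of L: l(i,k) = (a(i,k) - sum_{s<k} l(i,s)*u(s,k)) / u(k,k)
        L(k+1:n,k) = (A(k+1:n,k) - L(k+1:n,1:k-1)*U(1:k-1,k)) / U(k,k);
    end
    end

Each of the two assignments inside the loop is built around inner products, which is where extended-precision accumulation could be applied as described above.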
TRIDIAGONAL MATRICES
    A = [ b_1 c_1 0   ...  0
          a_2 b_2 c_2
          0   a_3 b_3 c_3
                  ...  ...
              a_{n−1} b_{n−1} c_{n−1}
          0   ...     a_n     b_n ]
For example, the tridiagonal matrix
    A = [ −1 1 0 ... 0
           1 −2 1
             ... ...
             1 −2 1
          0 ... 1 −(n−1)/n ]
has the inverse given by
    (A⁻¹)_{i,j} = max{i, j}
Thus the sparse matrix A can (and usually does) have a dense inverse.
We factor A = LU, with
    L = [ 1 0 0 ... 0; α_2 1 0; 0 α_3 1; ...; 0 ... α_n 1 ]
    U = [ β_1 c_1 0 ... 0; 0 β_2 c_2; ...; 0 ... β_{n−1} c_{n−1}; 0 ... 0 β_n ]
Given this factorization, Ax = f is solved in two stages. Solving Lg = f:
    g_1 = f_1;  g_j = f_j − α_j g_{j−1},  j = 2, ..., n
Solving Ux = g:
    x_n = g_n / β_n;  x_j = ( g_j − c_j x_{j+1} ) / β_j,  j = n−1, ..., 1
By doing a few multiplications of rows of L times columns of U, we obtain the general pattern as follows:
    β_1 = b_1                              : row 1 of LU
    α_2 β_1 = a_2,  α_2 c_1 + β_2 = b_2    : row 2 of LU
        ...
    α_n β_{n−1} = a_n,  α_n c_{n−1} + β_n = b_n    : row n of LU
These are straightforward to solve:
    β_1 = b_1
    α_j = a_j / β_{j−1},  β_j = b_j − α_j c_{j−1},  j = 2, ..., n
OPERATIONS COUNT
Factoring A = LU:
    Additions: n−1;  Multiplications: n−1;  Divisions: n−1
Solving Lg = f and Ux = g:
    Additions: 2n−2;  Multiplications: 2n−2;  Divisions: n
Thus the total number of arithmetic operations is approximately 3n to factor A; and it takes about 5n to solve the linear system using the factorization of A.
If we had A⁻¹ at no cost, what would it cost to compute x = A⁻¹ f?
    x_i = Σ_{j=1}^{n} (A⁻¹)_{i,j} f_j,  i = 1, ..., n
Since A⁻¹ is generally dense, this costs about 2n² operations, far more than the roughly 8n operations needed using the factorization.
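The factor-and-solve recursions above fit in one short Matlab function (a sketch; it assumes no zero pivots β_j are encountered):

    function x = trisolve(a, b, c, f)
    % Solve a tridiagonal system: subdiagonal a(2:n), diagonal b(1:n),
    % superdiagonal c(1:n-1), right side f(1:n).
    n = length(b);
    alpha = zeros(n,1);  beta = zeros(n,1);  g = zeros(n,1);  x = zeros(n,1);
    beta(1) = b(1);
    for j = 2:n                      % factor A = LU: about 3n operations
        alpha(j) = a(j)/beta(j-1);
        beta(j)  = b(j) - alpha(j)*c(j-1);
    end
    g(1) = f(1);
    for j = 2:n                      % forward substitution: Lg = f
        g(j) = f(j) - alpha(j)*g(j-1);
    end
    x(n) = g(n)/beta(n);
    for j = n-1:-1:1                 % back substitution: Ux = g
        x(j) = (g(j) - c(j)*x(j+1))/beta(j);
    end
    end

For instance, with b_j = 3, a_j = c_j = −1 and right side (2, 1, ..., 1, 2), this returns the vector of all ones, as in the tridiagonal example earlier in these notes.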
MATLAB MATRIX OPERATIONS
To obtain the LU-factorization of a matrix, including
the use of partial pivoting, use the Matlab command
lu. In particular,
[L, U, P] = lu(X)
returns the lower triangular matrix L, upper triangular
matrix U, and permutation matrix P so that
PX = LU
NUMERICAL INTEGRATION
How do you evaluate
    I = ∫_a^b f(x) dx
From calculus, if F(x) is an antiderivative of f(x), then
    I = ∫_a^b f(x) dx = F(x)|_a^b = F(b) − F(a)
However, in practice most integrals cannot be evaluated by this means. And even when this can work, an approximate numerical method may be much simpler and easier to use. For example, the integrand in
    ∫_0^1 dx / (1 + x⁵)
has an extremely complicated antiderivative; and it is easier to evaluate the integral by approximate means. Try evaluating this integral with Maple or Mathematica.
NUMERICAL INTEGRATION: A GENERAL FRAMEWORK
Returning to a lesson used earlier with rootfinding: if you cannot solve a problem, then replace it with a near-by problem that you can solve.
In our case, we want to evaluate
    I = ∫_a^b f(x) dx
To do so, many of the numerical schemes are based on choosing approximations of f(x). Calling one such f̃(x), use
    I ≈ ∫_a^b f̃(x) dx ≡ Ĩ
What is the error?
    E = I − Ĩ = ∫_a^b [ f(x) − f̃(x) ] dx
    |E| ≤ ∫_a^b | f(x) − f̃(x) | dx ≤ (b − a) ‖f − f̃‖_∞
    ‖f − f̃‖_∞ ≡ max_{a≤x≤b} | f(x) − f̃(x) |
Choosing f̃(x) to be the linear function interpolating f(x) at a and b leads to the (simple) trapezoidal rule
    ∫_a^b f(x) dx ≈ (b−a)/2 [ f(a) + f(b) ] ≡ T_1(f)
Example. For ∫_0^{π/2} sin x dx = 1,
    T_1(f) = (π/4) [ sin 0 + sin(π/2) ] = π/4 ≐ .785398,  Error = .215
HOW TO OBTAIN GREATER ACCURACY?
How do we improve our estimate of the integral
    I = ∫_a^b f(x) dx
One direction is to increase the degree of the approximation, moving next to a quadratic interpolating polynomial for f(x). We first look at an alternative.
Instead of using the trapezoidal rule on the original interval [a, b], apply it to integrals of f(x) over smaller subintervals. For example:
    I = ∫_a^c f(x) dx + ∫_c^b f(x) dx,  c = (b+a)/2
    I ≈ (c−a)/2 [ f(a) + f(c) ] + (b−c)/2 [ f(c) + f(b) ]
      = (h/2) [ f(a) + 2f(c) + f(b) ] ≡ T_2(f),  h = (b−a)/2
Example.
    ∫_0^{π/2} sin x dx ≈ (π/8) [ sin 0 + 2 sin(π/4) + sin(π/2) ] ≐ .948059,  Error = .0519
[Figure: illustrating I ≈ T_3(f), with y = f(x) and nodes a = x_0, x_1, x_2, b = x_3.]
THE TRAPEZOIDAL RULE
We can continue as above by dividing [a, b] into even smaller subintervals and applying
    ∫_α^β f(x) dx ≈ (β−α)/2 [ f(α) + f(β) ]        (∗)
on each of the smaller subintervals. Begin by introducing a positive integer n ≥ 1,
    h = (b−a)/n,  x_j = a + j h,  j = 0, 1, ..., n
Then
    I = ∫_{x_0}^{x_n} f(x) dx
      = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ... + ∫_{x_{n−1}}^{x_n} f(x) dx
Use [α, β] = [x_0, x_1], [x_1, x_2], ..., [x_{n−1}, x_n], for each of which the subinterval has length h. Then applying (∗), we have
    I ≈ (h/2)[f(x_0) + f(x_1)] + (h/2)[f(x_1) + f(x_2)] + ... + (h/2)[f(x_{n−1}) + f(x_n)]
Simplifying,
    I ≈ h [ (1/2) f(a) + f(x_1) + ... + f(x_{n−1}) + (1/2) f(b) ] ≡ T_n(f)
This is called the "composite trapezoidal rule", or more simply, the trapezoidal rule.
Example. Again integrate sin x over [0, π/2]. Then we have

    n     T_n(f)         Error      Ratio
    1     .785398163     2.15E−1
    2     .948059449     5.19E−2    4.13
    4     .987115801     1.29E−2    4.03
    8     .996785172     3.21E−3    4.01
    16    .999196680     8.03E−4    4.00
    32    .999799194     2.01E−4    4.00
    64    .999949800     5.02E−5    4.00
    128   .999987450     1.26E−5    4.00
    256   .999996863     3.14E−6    4.00

Note that the errors are decreasing by a constant factor of 4. Why do we always double n?
USING QUADRATIC INTERPOLATION
We want to approximate I = ∫_a^b f(x) dx using quadratic interpolation of f(x). Interpolate f(x) at the points {a, c, b}, with c = (a+b)/2. Also let h = (b−a)/2. The quadratic interpolating polynomial is given by
    P_2(x) = [(x−c)(x−b)/(2h²)] f(a) − [(x−a)(x−b)/h²] f(c) + [(x−a)(x−c)/(2h²)] f(b)
Replacing f(x) by P_2(x), we obtain the approximation
    ∫_a^b f(x) dx ≈ ∫_a^b P_2(x) dx = (h/3) [ f(a) + 4f(c) + f(b) ] ≡ S_2(f)
This is called Simpson's rule.
[Figure: illustration of I ≈ S_2(f), interpolating y = f(x) at a, (a+b)/2, b.]
Example.
    ∫_0^{π/2} sin x dx ≈ (π/12) [ sin 0 + 4 sin(π/4) + sin(π/2) ]
      ≐ 1.00227987749221,  Error = −0.00228
SIMPSON'S RULE
As with the trapezoidal rule, we can apply Simpson's rule on smaller subdivisions in order to obtain better accuracy in approximating
    I = ∫_a^b f(x) dx
Again, Simpson's rule is given by
    ∫_α^β f(x) dx ≈ (h/3) [ f(α) + 4f(γ) + f(β) ],  γ = (α+β)/2,  h = (β−α)/2
Let n be a positive even integer, and
    h = (b−a)/n,  x_j = a + j h,  j = 0, 1, ..., n
Then write
    I = ∫_{x_0}^{x_n} f(x) dx
      = ∫_{x_0}^{x_2} f(x) dx + ∫_{x_2}^{x_4} f(x) dx + ... + ∫_{x_{n−2}}^{x_n} f(x) dx
Apply the simple Simpson rule to each of these subintegrals, with
    [α, β] = [x_0, x_2], [x_2, x_4], ..., [x_{n−2}, x_n]
In all cases, (β−α)/2 = h. Then
    I ≈ (h/3)[f(x_0) + 4f(x_1) + f(x_2)]
      + (h/3)[f(x_2) + 4f(x_3) + f(x_4)]
      + ...
      + (h/3)[f(x_{n−4}) + 4f(x_{n−3}) + f(x_{n−2})]
      + (h/3)[f(x_{n−2}) + 4f(x_{n−1}) + f(x_n)]
This can be simplified to
    ∫_a^b f(x) dx ≈ S_n(f) ≡ (h/3) [ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4)
                                     + ... + 2f(x_{n−2}) + 4f(x_{n−1}) + f(x_n) ]
This is called the "composite Simpson's rule", or more simply, Simpson's rule.
EXAMPLE
Approximate ∫_0^{π/2} sin x dx. The Simpson rule results are as follows.

    n     S_n(f)                Error         Ratio
    2     1.00227987749221      −2.28E−3
    4     1.00013458497419      −1.35E−4      16.94
    8     1.00000829552397      −8.30E−6      16.22
    16    1.00000051668471      −5.17E−7      16.06
    32    1.00000003226500      −3.23E−8      16.01
    64    1.00000000201613      −2.02E−9      16.00
    128   1.00000000012600      −1.26E−10     16.00
    256   1.00000000000788      −7.88E−12     16.00
    512   1.00000000000049      −4.92E−13     15.99

Note that the ratios of successive errors have converged to 16. Why? Also compare this table with that for the trapezoidal rule. For example,
    I − T_4 = 1.29E−2,  I − S_4 = −1.35E−4
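A matching Matlab sketch of the composite Simpson rule (illustrative; n must be even):

    function S = simpson(f, a, b, n)
    % Composite Simpson's rule S_n(f) on [a,b]; n must be even.
    h = (b-a)/n;
    x = a + h*(0:n);
    y = f(x);
    S = (h/3)*( y(1) + 4*sum(y(2:2:n)) + 2*sum(y(3:2:n-1)) + y(n+1) );
    end

The indices y(2:2:n) are the odd-numbered nodes x_1, x_3, ..., x_{n−1} (weight 4) and y(3:2:n-1) the even interior nodes (weight 2). For example, simpson(@sin, 0, pi/2, 2) returns about 1.00227987749221.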
Example 1
    I^(1) = ∫_0^1 e^{−x²} dx ≐ 0.746824132812427
    I^(2) = ∫_0^4 dx/(1 + x²) = arctan(4) ≐ 1.32581766366803
    I^(3) = ∫_0^{2π} dx/(2 + cos x) = 2π/√3 ≐ 3.62759872846844

Table 1. Trapezoidal rule applied to Example 1.

    n     I^(1) Error   R      I^(2) Error   R      I^(3) Error   R
    2     1.6E−2               1.3E−1               5.6E−1
    4     3.8E−3        4.02   3.6E−3        37.0   3.8E−2        14.9
    8     9.6E−4        4.01   5.6E−4        6.4    1.9E−4        195.0
    16    2.4E−4        4.00   1.4E−4        3.9    5.2E−9        37600
    32    6.0E−5        4.00   3.6E−5        4.00
    64    1.5E−5        4.00   9.0E−6        4.00
    128   3.7E−6        4.00   2.3E−6        4.00

Table 2. Simpson rule applied to Example 1.

    n     I^(1) Error   R      I^(2) Error   R      I^(3) Error   R
    2     3.6E−4               8.7E−2               1.26
    4     3.1E−5        11.4   3.9E−2        2.2    1.4E−1        9.2
    8     2.0E−6        15.7   2.0E−3        20     1.2E−2        11.2
    16    1.3E−7        15.9   4.0E−6        485    6.4E−5        191
    32    7.8E−9        16.0   2.3E−8        172    1.7E−9        37600
    64    4.9E−10       16.0   1.5E−9        16
    128   3.0E−11       16.0   9.2E−11       16
TRAPEZOIDAL METHOD ERROR FORMULA
Theorem. Let f(x) have two continuous derivatives on the interval a ≤ x ≤ b. Then
    E_n^T(f) ≡ ∫_a^b f(x) dx − T_n(f) = −( h²(b−a)/12 ) f''(c_n)
for some c_n in the interval [a, b].
Later I will say something about the proof of this result, as it leads to some other useful formulas for the error.
The above formula says that the error decreases in a manner that is roughly proportional to h². Thus doubling n (and halving h) should cause the error to decrease by a factor of approximately 4. This is what we observed with a past example from the preceding section.
Example. Consider evaluating
    I = ∫_0^2 dx / (1 + x²)
using the trapezoidal method T_n(f). How large should n be chosen in order to ensure that
    |E_n^T(f)| ≤ 5 × 10⁻⁶        (1)
We begin by calculating the derivatives:
    f'(x) = −2x / (1 + x²)²,  f''(x) = (−2 + 6x²) / (1 + x²)³
From a graph of f''(x),
    max_{0≤x≤2} |f''(x)| = 2
Recall that b − a = 2. Therefore, bounding |f''(c_n)|,
    E_n^T(f) = −( h²(b−a)/12 ) f''(c_n)
    |E_n^T(f)| ≤ ( h² · 2 / 12 ) · 2 = h²/3
To ensure (1), we choose h so small that
    h²/3 ≤ 5 × 10⁻⁶
This is equivalent to choosing h and n to satisfy
    h ≤ .003873,  n = 2/h ≥ 516.4
Thus n ≥ 517 will imply (1).
DERIVING THE ERROR FORMULA
There are two stages in deriving the error:
(1) Obtain the error formula for the case of a single subinterval (n = 1);
(2) Use this to obtain the general error formula given earlier.
For the trapezoidal method with only a single subinterval, we have
    ∫_α^{α+h} f(x) dx − (h/2) [ f(α) + f(α+h) ] = −(h³/12) f''(c)
for some c in the interval [α, α+h]. A sketch of the derivation of this error formula is given in the problems.
Recall that the general trapezoidal rule T_n(f) was obtained by applying the simple trapezoidal rule to a subdivision of the original interval of integration. Recall defining and writing
    h = (b−a)/n,  x_j = a + j h,  j = 0, 1, ..., n
    I = ∫_{x_0}^{x_n} f(x) dx = ∫_{x_0}^{x_1} f(x) dx + ∫_{x_1}^{x_2} f(x) dx + ... + ∫_{x_{n−1}}^{x_n} f(x) dx
    I ≈ (h/2)[f(x_0) + f(x_1)] + (h/2)[f(x_1) + f(x_2)] + ... + (h/2)[f(x_{n−1}) + f(x_n)]
Then the error
    E_n^T(f) ≡ ∫_a^b f(x) dx − T_n(f)
can be analyzed by adding together the errors over the subintervals [x_0, x_1], [x_1, x_2], ..., [x_{n−1}, x_n]. Recall
    ∫_α^{α+h} f(x) dx − (h/2) [ f(α) + f(α+h) ] = −(h³/12) f''(c)
Then on [x_{j−1}, x_j],
    ∫_{x_{j−1}}^{x_j} f(x) dx − (h/2) [ f(x_{j−1}) + f(x_j) ] = −(h³/12) f''(γ_j)
with x_{j−1} ≤ γ_j ≤ x_j, but otherwise γ_j unknown. Then combining these errors, we obtain
    E_n^T(f) = −(h³/12) f''(γ_1) − ... − (h³/12) f''(γ_n)
This formula can be further simplified, and we will do so in two ways.
Rewrite this error as
    E_n^T(f) = −(h³ n / 12) [ ( f''(γ_1) + ... + f''(γ_n) ) / n ]
Denote the quantity inside the brackets by σ_n. This number satisfies
    min_{a≤x≤b} f''(x) ≤ σ_n ≤ max_{a≤x≤b} f''(x)
Since f''(x) is a continuous function (by original assumption), we have that there must be some number c_n in [a, b] for which
    f''(c_n) = σ_n
Recall also that hn = b − a. Then
    E_n^T(f) = −(h³ n / 12) [ ( f''(γ_1) + ... + f''(γ_n) ) / n ] = −( h²(b−a)/12 ) f''(c_n)
This is the error formula given on the first slide.
AN ERROR ESTIMATE
We now obtain a way to estimate the error E_n^T(f). Return to the formula
    E_n^T(f) = −(h³/12) f''(γ_1) − ... − (h³/12) f''(γ_n)
and rewrite it as
    E_n^T(f) = −(h²/12) [ f''(γ_1) h + ... + f''(γ_n) h ]
The quantity
    f''(γ_1) h + ... + f''(γ_n) h
is a Riemann sum for the integral
    ∫_a^b f''(x) dx = f'(b) − f'(a)
By this we mean
    lim_{n→∞} [ f''(γ_1) h + ... + f''(γ_n) h ] = ∫_a^b f''(x) dx
Thus
    f''(γ_1) h + ... + f''(γ_n) h ≈ f'(b) − f'(a)
for larger values of n. Combining this with the earlier error formula, we have
    E_n^T(f) ≈ −(h²/12) [ f'(b) − f'(a) ] ≡ Ẽ_n^T(f)
This is a computable estimate of the error in the numerical integration. It is called an asymptotic error estimate.
Example. Consider evaluating
    I(f) = ∫_0^π e^x cos x dx = −(e^π + 1)/2 ≐ −12.070346
In this case,
    f'(x) = e^x [ cos x − sin x ],  f''(x) = −2 e^x sin x
    max_{0≤x≤π} |f''(x)| = |f''(0.75π)| ≐ 14.921
Then
    E_n^T(f) = −( h²(b−a)/12 ) f''(c_n)
    |E_n^T(f)| ≤ ( h² π / 12 )(14.921) ≐ 3.906 h²
Also
    Ẽ_n^T(f) = −(h²/12) [ f'(π) − f'(0) ] = (h²/12) [ e^π + 1 ] ≐ 2.012 h²
The estimate also suggests the corrected trapezoidal rule
    CT_n(f) ≡ T_n(f) + Ẽ_n^T(f) = T_n(f) − (h²/12) [ f'(b) − f'(a) ]
which is generally much more accurate than T_n(f) itself.
EXAMPLE
Consider evaluating
    I = ∫_0^2 dx / (1 + x²)
using Simpson's rule S_n(f). How large should n be chosen in order to ensure that
    |E_n^S(f)| ≤ 5 × 10⁻⁶
Begin by noting that
    f^(4)(x) = 24 (5x⁴ − 10x² + 1) / (1 + x²)⁵
    max_{0≤x≤2} |f^(4)(x)| = f^(4)(0) = 24
Then
    E_n^S(f) = −( h⁴(b−a)/180 ) f^(4)(c_n)
    |E_n^S(f)| ≤ ( h⁴ · 2 / 180 ) · 24 = (4/15) h⁴
Then
    |E_n^S(f)| ≤ 5 × 10⁻⁶
is true if
    (4/15) h⁴ ≤ 5 × 10⁻⁶,  h ≤ .0658,  n ≥ 30.39
Therefore, choosing n ≥ 32 will give the desired error bound. Compare this with the earlier trapezoidal example in which n ≥ 517 was needed.
For the asymptotic error estimate, we have
    f'''(x) = −24x (x² − 1) / (1 + x²)⁴
    Ẽ_n^S(f) ≡ −(h⁴/180) [ f'''(2) − f'''(0) ] = (h⁴/180)(144/625) = (4/3125) h⁴
INTEGRATING √x
Consider the numerical approximation of
    ∫_0^1 √x dx = 2/3
In the following table, we give the errors when using both the trapezoidal and Simpson rules.

    n     E_n^T       Ratio    E_n^S       Ratio
    2     6.311E−2             2.860E−2
    4     2.338E−2    2.70     1.012E−2    2.82
    8     8.536E−3    2.74     3.587E−3    2.83
    16    3.085E−3    2.77     1.268E−3    2.83
    32    1.108E−3    2.78     4.485E−4    2.83
    64    3.959E−4    2.80     1.586E−4    2.83
    128   1.410E−4    2.81     5.606E−5    2.83

The rate of convergence is slower because the function f(x) = √x is not sufficiently differentiable on [0, 1]. Both methods converge with a rate proportional to h^{1.5}.
ASYMPTOTIC ERROR FORMULAS
If we have a numerical integration formula,
    ∫_a^b f(x) dx ≈ Σ_{j=0}^{n} w_j f(x_j)
let E_n(f) denote its error,
    E_n(f) = ∫_a^b f(x) dx − Σ_{j=0}^{n} w_j f(x_j)
We say another formula Ẽ_n(f) is an asymptotic error formula for this numerical integration if it satisfies
    lim_{n→∞} Ẽ_n(f) / E_n(f) = 1
Equivalently,
    lim_{n→∞} [ E_n(f) − Ẽ_n(f) ] / E_n(f) = 0
These conditions say that Ẽ_n(f) looks increasingly like E_n(f) as n increases, and thus
    E_n(f) ≈ Ẽ_n(f)
Example. For the trapezoidal rule,
    E_n^T(f) ≈ Ẽ_n^T(f) ≡ −(h²/12) [ f'(b) − f'(a) ]
This assumes f(x) has two continuous derivatives on the interval [a, b].
Example. For Simpson's rule,
    E_n^S(f) ≈ Ẽ_n^S(f) ≡ −(h⁴/180) [ f'''(b) − f'''(a) ]
This assumes f(x) has four continuous derivatives on the interval [a, b].
Note that both of these formulas can be written in an equivalent form as
    Ẽ_n(f) = c / n^p
for appropriate constant c and exponent p. With the trapezoidal rule, p = 2 and
    c = −( (b−a)² / 12 ) [ f'(b) − f'(a) ]
and for Simpson's rule, p = 4 with a suitable c.
The formula
    Ẽ_n(f) = c / n^p        (2)
occurs for many other numerical integration formulas that we have not yet defined or studied. In addition, if we use the trapezoidal or Simpson rules with an integrand f(x) which is not sufficiently differentiable, then (2) may hold with an exponent p that is less than the ideal.
Example. Consider
    I = ∫_0^1 x^β dx
in which −1 < β < 1, β ≠ 0. Then the convergence of the trapezoidal rule can be shown to have an asymptotic error formula
    E_n ≈ Ẽ_n = c / n^{β+1}        (3)
for some constant c dependent on β. A similar result holds for Simpson's rule, with −1 < β < 3, β not an integer. We can actually specify a formula for c; but the formula is often less important than knowing that (2) is valid for some c.
APPLICATION OF ASYMPTOTIC ERROR FORMULAS
Assume we know that an asymptotic error formula
    I − I_n ≈ c / n^p
is valid for some numerical integration rule denoted by I_n. Initially, assume we know the exponent p. Then imagine calculating both I_n and I_{2n}. With I_{2n}, we have
    I − I_{2n} ≈ c / (2^p n^p)
This leads to
    I − I_n ≈ 2^p [ I − I_{2n} ]
    I ≈ ( 2^p I_{2n} − I_n ) / ( 2^p − 1 ) = I_{2n} + ( I_{2n} − I_n ) / ( 2^p − 1 )
The formula
    I ≈ I_{2n} + ( I_{2n} − I_n ) / ( 2^p − 1 )        (4)
is called Richardson's extrapolation formula.
Example. With the trapezoidal rule and with the integrand f(x) having two continuous derivatives,
    I ≈ T_{2n} + (1/3) [ T_{2n} − T_n ]
Example. With Simpson's rule and with the integrand f(x) having four continuous derivatives,
    I ≈ S_{2n} + (1/15) [ S_{2n} − S_n ]
We can also use the formula (2) to obtain error estimation formulas:
    I − I_{2n} ≈ ( I_{2n} − I_n ) / ( 2^p − 1 )        (5)
This is called Richardson's error estimate. For example, with the trapezoidal rule,
    I − T_{2n} ≈ (1/3) [ T_{2n} − T_n ]
These formulas are illustrated for the trapezoidal rule in an accompanying table, for
    ∫_0^π e^x cos x dx = −(e^π + 1)/2 ≐ −12.07034632
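A short Matlab illustration of (4) and (5), reusing the trap() sketch given earlier (again an illustration, not from the original notes):

    f = @(x) exp(x).*cos(x);  a = 0;  b = pi;
    I = -(exp(pi) + 1)/2;              % exact value, for comparison
    n = 8;
    Tn  = trap(f, a, b, n);
    T2n = trap(f, a, b, 2*n);
    R   = T2n + (T2n - Tn)/3;          % Richardson extrapolation (4), p = 2
    est = (T2n - Tn)/3;                % Richardson error estimate (5)
    fprintf('I - T2n = %.3e, estimate = %.3e, I - R = %.3e\n', I-T2n, est, I-R);

The printed estimate should track I − T_{2n} closely, while the extrapolated value R is considerably more accurate than T_{2n}.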
AITKEN EXTRAPOLATION
In this case, we again assume
    I − I_n ≈ c / n^p
But in contrast to previously, we do not know either c or p. Imagine computing I_n, I_{2n}, and I_{4n}. Then
    I − I_n ≈ c / n^p
    I − I_{2n} ≈ c / (2^p n^p)
    I − I_{4n} ≈ c / (4^p n^p)
We can directly try to estimate I. Dividing,
    ( I − I_n ) / ( I − I_{2n} ) ≈ 2^p ≈ ( I − I_{2n} ) / ( I − I_{4n} )
Solving for I, we obtain
    ( I − I_{2n} )² ≈ ( I − I_n )( I − I_{4n} )
    I ( I_n + I_{4n} − 2 I_{2n} ) ≈ I_n I_{4n} − I_{2n}²
    I ≈ ( I_n I_{4n} − I_{2n}² ) / ( I_n + I_{4n} − 2 I_{2n} )
This can be improved computationally, to avoid loss of significance errors:
    I ≈ I_{4n} + [ ( I_n I_{4n} − I_{2n}² ) / ( I_n + I_{4n} − 2 I_{2n} ) − I_{4n} ]
      = I_{4n} − ( I_{4n} − I_{2n} )² / [ ( I_{4n} − I_{2n} ) − ( I_{2n} − I_n ) ]
This is called Aitken's extrapolation formula.
To estimate p, we use
    ( I_{2n} − I_n ) / ( I_{4n} − I_{2n} ) ≈ 2^p
To see this, write
    ( I_{2n} − I_n ) / ( I_{4n} − I_{2n} ) = [ ( I − I_n ) − ( I − I_{2n} ) ] / [ ( I − I_{2n} ) − ( I − I_{4n} ) ]
Then substitute from the following and simplify:
    I − I_n ≈ c / n^p,  I − I_{2n} ≈ c / (2^p n^p),  I − I_{4n} ≈ c / (4^p n^p)
Example. Consider the following table of numerical integrals. What is its order of convergence?

    n     I_n             I_n − I_{n/2}   Ratio
    2     .28451779686
    4     .28559254576    1.075E−3
    8     .28570248748    1.099E−4        9.78
    16    .28571317731    1.069E−5        10.28
    32    .28571418363    1.006E−6        10.62
    64    .28571427643    9.280E−8        10.84

It appears
    2^p ≐ 10.84,  p ≐ log₂ 10.84 = 3.44
We could now combine this with Richardson's error formula to estimate the error:
    I − I_n ≈ ( 1 / (2^p − 1) ) [ I_n − I_{n/2} ]
For example,
    I − I_64 ≈ (1/9.84) [ 9.280E−8 ] = 9.43E−9
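The same order-estimation idea in Matlab, again reusing the trap() sketch from earlier (an illustration; here the exact order is 1.5 because of the √x singularity discussed above):

    f = @(x) sqrt(x);  I = 2/3;
    In  = trap(f, 0, 1, 16);
    I2n = trap(f, 0, 1, 32);
    I4n = trap(f, 0, 1, 64);
    p = log2( (I2n - In)/(I4n - I2n) )         % estimated order, near 1.5
    Aitken = I4n - (I4n - I2n)^2 / ((I4n - I2n) - (I2n - In))
    fprintf('error of I4n: %.3e, of Aitken: %.3e\n', I - I4n, I - Aitken);

Aitken's formula needs no knowledge of p, which is exactly why it is useful when the convergence order is unknown or degraded.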
PERIODIC FUNCTIONS
A function f(x) is periodic if the following condition is satisfied. There is a smallest real number τ > 0 for which
    f(x + τ) = f(x),  −∞ < x < ∞        (6)
The number τ is called the period of the function f(x). The constant function f(x) ≡ 1 is also considered periodic, but it satisfies this condition with any τ > 0. Basically, a periodic function is one which repeats itself over intervals of length τ.
The condition (6) implies
    f^(m)(x + τ) = f^(m)(x),  −∞ < x < ∞        (7)
for the m-th derivative of f(x), provided there is such a derivative. Thus the derivatives are also periodic.
Periodic functions occur very frequently in applications of mathematics, reflecting the periodicity of many phenomena in the physical world.
PERIODIC INTEGRANDS
Consider the special class of integrals
    I(f) = ∫_a^b f(x) dx
in which f(x) is periodic, with b − a an integer multiple of the period τ for f(x). In this case, the performance of the trapezoidal rule and other numerical integration rules is much better than that predicted by earlier error formulas.
To hint at this improved performance, recall
    ∫_a^b f(x) dx − T_n(f) ≈ Ẽ_n(f) ≡ −(h²/12) [ f'(b) − f'(a) ]
With our assumption on the periodicity of f(x), we have
    f(a) = f(b),  f'(a) = f'(b)
Therefore,
    Ẽ_n(f) = 0
and we should expect improved performance in the convergence behaviour of the trapezoidal sums T_n(f).
If in addition to being periodic on [a, b], the integrand f(x) also has m continuous derivatives, then it can be shown that
    I(f) − T_n(f) = c / n^m + smaller terms
By "smaller terms", we mean terms which decrease to zero more rapidly than n^{−m}.
Thus if f(x) is periodic with b − a an integer multiple of the period for f(x), and if f(x) is infinitely differentiable, then the error I − T_n decreases to zero more rapidly than n^{−m} for any m > 0. For periodic integrands, the trapezoidal rule is an optimal numerical integration method.
Example. Consider evaluating
    I = ∫_0^{2π} sin x dx / (1 + e^{sin x})
Using the trapezoidal rule, we have the results in the following table. In this case, the formulas based on Richardson extrapolation are no longer valid.

    n     T_n                  T_n − T_{n/2}
    2     0.0
    4     0.72589193317292     7.259E−1
    8     0.74006131211583     1.417E−2
    16    0.74006942337672     8.111E−6
    32    0.74006942337946     2.746E−12
    64    0.74006942337946     0.0
NUMERICAL INTEGRATION: ANOTHER APPROACH
We look for numerical integration formulas
    ∫_{−1}^{1} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)
which are to be exact for polynomials of as large a degree as possible. There are no restrictions placed on the nodes {x_j} nor the weights {w_j} in working towards that goal. The motivation is that if it is exact for high degree polynomials, then perhaps it will be very accurate when integrating functions that are well approximated by polynomials.
There is no guarantee that such an approach will work. In fact, it turns out to be a bad idea when the node points {x_j} are required to be evenly spaced over the interval of integration. But without this restriction on {x_j} we are able to develop a very accurate set of quadrature formulas.
The case n = 1. We want a formula
    w_1 f(x_1) ≈ ∫_{−1}^{1} f(x) dx
The weight w_1 and the node x_1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. To do this we substitute f(x) = 1 and f(x) = x. The first choice leads to
    w_1 · 1 = ∫_{−1}^{1} 1 dx = 2,  so  w_1 = 2
The choice f(x) = x leads to
    w_1 x_1 = ∫_{−1}^{1} x dx = 0,  so  x_1 = 0
The desired formula is
    ∫_{−1}^{1} f(x) dx ≈ 2 f(0)
It is called the midpoint rule.
The case n = 2. We want a formula
    w_1 f(x_1) + w_2 f(x_2) ≈ ∫_{−1}^{1} f(x) dx
The weights w_1, w_2 and the nodes x_1, x_2 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. We substitute and force equality for
    f(x) = 1, x, x², x³
This leads to the system
    w_1 + w_2 = ∫_{−1}^{1} 1 dx = 2
    w_1 x_1 + w_2 x_2 = ∫_{−1}^{1} x dx = 0
    w_1 x_1² + w_2 x_2² = ∫_{−1}^{1} x² dx = 2/3
    w_1 x_1³ + w_2 x_2³ = ∫_{−1}^{1} x³ dx = 0
The solution is given by
    w_1 = w_2 = 1,  x_1 = −1/√3,  x_2 = 1/√3
This yields the formula
    ∫_{−1}^{1} f(x) dx ≈ f(−1/√3) + f(1/√3)        (1)
We say it has degree of precision equal to 3 since it integrates exactly all polynomials of degree ≤ 3. We can verify directly that it does not integrate exactly f(x) = x⁴:
    ∫_{−1}^{1} x⁴ dx = 2/5,  f(−1/√3) + f(1/√3) = 2/9
Thus (1) has degree of precision exactly 3.
EXAMPLE Integrate
    ∫_{−1}^{1} dx / (3 + x) = log 2 ≐ 0.69314718
The formula (1) yields
    1/(3 + x_1) + 1/(3 + x_2) = 0.69230769
    Error = .000839
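The two-point rule (1) in Matlab (a minimal sketch of the example just computed):

    f  = @(x) 1./(3 + x);
    x1 = -1/sqrt(3);  x2 = 1/sqrt(3);   % Gauss nodes; both weights are 1
    G2 = f(x1) + f(x2)                  % approx 0.69230769
    err = log(2) - G2                   % about 8.4e-4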
THE GENERAL CASE
We want to find the weights {w_i} and nodes {x_i} so as to have
    ∫_{−1}^{1} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)
be exact for polynomials f(x) of as large a degree as possible. As unknowns, there are n weights w_i and n nodes x_i. Thus it makes sense to initially impose 2n conditions so as to obtain 2n equations for the 2n unknowns. We require the quadrature formula to be exact for the cases
    f(x) = x^i,  i = 0, 1, 2, ..., 2n−1
Then we obtain the system of equations
    w_1 x_1^i + w_2 x_2^i + ... + w_n x_n^i = ∫_{−1}^{1} x^i dx
for i = 0, 1, 2, ..., 2n−1. For the right sides,
    ∫_{−1}^{1} x^i dx = 2/(i+1) for i = 0, 2, ..., 2n−2;  = 0 for i = 1, 3, ..., 2n−1
The system of equations
    w_1 x_1^i + ... + w_n x_n^i = ∫_{−1}^{1} x^i dx,  i = 0, ..., 2n−1
has a solution, and the solution is unique except for re-ordering the unknowns. The resulting numerical integration rule is called Gaussian quadrature.
In fact, the nodes and weights are not found by solving this system. Rather, the nodes and weights have other properties which enable them to be found more easily by other methods. There are programs to produce them; and most subroutine libraries have either a program to produce them or tables of them for commonly used cases.
CHANGE OF INTERVAL OF INTEGRATION
Integrals on other finite intervals [a, b] can be converted to integrals over [−1, 1], as follows:
    ∫_a^b F(x) dx = (b−a)/2 ∫_{−1}^{1} F( ( b + a + t(b−a) ) / 2 ) dt
based on the change of integration variables
    x = ( b + a + t(b−a) ) / 2,  −1 ≤ t ≤ 1
EXAMPLE Over the interval [0, π], use
    x = π(1 + t)/2
Then
    ∫_0^π F(x) dx = (π/2) ∫_{−1}^{1} F( π(1 + t)/2 ) dt
AN ERROR FORMULA
The usual error formula for the Gaussian quadrature formula,
    E_n(f) = ∫_{−1}^{1} f(x) dx − Σ_{j=1}^{n} w_j f(x_j)
is not particularly intuitive. It is given by
    E_n(f) = e_n f^(2n)(c_n) / (2n)!
    e_n = 2^{2n+1} (n!)⁴ / { (2n+1) [(2n)!]² }
for some −1 ≤ c_n ≤ 1.
To help in understanding the implications of this error formula, introduce
    M_k = max_{−1≤x≤1} | f^(k)(x) | / k!
With many integrands f(x), this sequence {M_k} is bounded or even decreases to zero. For example,
    f(x) = cos x    ⇒  M_k ≤ 1/k!
    f(x) = 1/(2+x)  ⇒  M_k ≤ 1
Then for our error formula,
    E_n(f) = e_n f^(2n)(c_n) / (2n)!,  |E_n(f)| ≤ e_n M_{2n}        (2)
By other methods, we can show
    e_n ≈ π / 4^n
When combined with (2) and an assumption of uniform boundedness for {M_k}, we have that the error decreases by a factor of at least 4 with each increase of n to n+1. Compare this to the convergence of the trapezoidal and Simpson rules for such functions, to help explain the very rapid convergence of Gaussian quadrature.
A SECOND ERROR FORMULA
Let f(x) be continuous for a ≤ x ≤ b; let n ≥ 1. Then, for the Gaussian numerical integration formula
    I ≡ ∫_a^b f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j) ≡ I_n
on [a, b], the error in I_n satisfies
    | I(f) − I_n(f) | ≤ 2 (b−a) ρ_{2n−1}(f)        (3)
Here ρ_{2n−1}(f) is the minimax error of degree 2n−1 for f(x) on [a, b]:
    ρ_m(f) = min_{deg(p)≤m} [ max_{a≤x≤b} | f(x) − p(x) | ],  m ≥ 0
EXAMPLE Let f(x) = e^{−x²}. Then the minimax errors ρ_m(f) are given in the following table.

    m    ρ_m(f)     m     ρ_m(f)
    1    5.30E−2    6     7.82E−6
    2    1.79E−2    7     4.62E−7
    3    6.63E−4    8     9.64E−8
    4    4.63E−4    9     8.05E−9
    5    1.62E−5    10    9.16E−10

Using this table, apply (3) to
    I = ∫_0^1 e^{−x²} dx
For n = 3, (3) implies
    | I − I_3 | ≤ 2 ρ_5( e^{−x²} ) ≐ 3.24 × 10⁻⁵
The actual error is 9.55E−6.
INTEGRATING A NON-SMOOTH INTEGRAND
Consider using Gaussian quadrature to evaluate
    I = ∫_0^1 √x dx = 2/3

    n     I − I_n     Ratio
    2     7.22E−3
    4     1.16E−3     6.2
    8     1.69E−4     6.9
    16    2.30E−5     7.4
    32    3.00E−6     7.6
    64    3.84E−7     7.8

The column labeled Ratio is defined by
    ( I − I_{n/2} ) / ( I − I_n )
It is consistent with I − I_n ≈ c/n³, which can be proven theoretically. In comparison, for the trapezoidal and Simpson rules, I − I_n ≈ c/n^{1.5}.
WEIGHTED GAUSSIAN QUADRATURE
Consider needing to evaluate integrals such as
    ∫_0^1 f(x) log x dx,  ∫_0^1 x^{1/3} f(x) dx
How do we proceed? Consider numerical integration formulas
    ∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)
in which f(x) is considered a "nice" function (one with several continuous derivatives). The function w(x) is allowed to be singular, but must be integrable. We assume here that [a, b] is a finite interval. The function w(x) is called a "weight function", and it is implicitly absorbed into the definition of the quadrature weights {w_i}. We again determine the nodes {x_i} and weights {w_i} so as to make the integration formula exact for f(x) a polynomial of as large a degree as possible.
The resulting numerical integration formula
    ∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)
is called a Gaussian quadrature formula with weight function w(x). We determine the nodes {x_i} and weights {w_i} by requiring exactness in the above formula for
    f(x) = x^i,  i = 0, 1, 2, ..., 2n−1
To make the derivation more understandable, we consider the particular case
    ∫_0^1 x^{1/3} f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j)
We follow the same pattern as used earlier.
The case n = 1. We want a formula
    w_1 f(x_1) ≈ ∫_0^1 x^{1/3} f(x) dx
The weight w_1 and the node x_1 are to be so chosen that the formula is exact for polynomials of as large a degree as possible. Choosing f(x) = 1, we have
    w_1 = ∫_0^1 x^{1/3} dx = 3/4
Choosing f(x) = x, we have
    w_1 x_1 = ∫_0^1 x^{1/3} x dx = 3/7,  so  x_1 = 4/7
Thus
    ∫_0^1 x^{1/3} f(x) dx ≈ (3/4) f(4/7)
The case n = 2. Proceeding in the same way, requiring exactness for f(x) = 1, x, x², x³ leads, after some algebra, to
    x_1 = 7/13 − (3/65)√35,  x_2 = 7/13 + (3/65)√35
    w_1 = 3/8 − (3/392)√35,  w_2 = 3/8 + (3/392)√35
Numerically,
    x_1 = .2654117024,  x_2 = .8115113746
    w_1 = .3297238792,  w_2 = .4202761208
The formula
    ∫_0^1 x^{1/3} f(x) dx ≈ w_1 f(x_1) + w_2 f(x_2)        (4)
has degree of precision 3.
EXAMPLE Consider evaluating the integral
    ∫_0^1 x^{1/3} cos x dx        (5)
In applying (4), we take f(x) = cos x. Then
    w_1 f(x_1) + w_2 f(x_2) = 0.6074977951
The true answer is
    ∫_0^1 x^{1/3} cos x dx ≐ 0.6076257393
and our numerical answer is in error by E_2 ≐ .000128. This is quite a good answer involving very little computational effort (once the formula has been determined). In contrast, the trapezoidal and Simpson rules applied to (5) would converge very slowly because the first derivative of the integrand is singular at the origin.
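The weighted rule (4) in Matlab, with the nodes and weights from the derivation above (a minimal sketch of this example):

    x1 = 0.2654117024;  x2 = 0.8115113746;   % nodes for w(x) = x^(1/3) on [0,1]
    w1 = 0.3297238792;  w2 = 0.4202761208;   % corresponding weights
    f  = @(x) cos(x);
    I2 = w1*f(x1) + w2*f(x2)                 % approx 0.6074977951
    err = 0.6076257393 - I2                  % about 1.28e-4

Note that the singular factor x^{1/3} never appears in the code; it is built into the weights.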
CHANGE OF VARIABLES
As a side note to the preceding example, we observe that the change of variables x = t³ transforms the integral (5) to
    3 ∫_0^1 t³ cos(t³) dt
and both the trapezoidal and Simpson rules will perform better with this formula, although still not as good as our weighted Gaussian quadrature.
A change of the integration variable can often improve the performance of a standard method, usually by increasing the differentiability of the integrand.
EXAMPLE Using x = t^r for some r > 1, we have
    ∫_0^1 g(x) log x dx = r² ∫_0^1 t^{r−1} g(t^r) log t dt
The new integrand is generally smoother than the original one.
INTERPOLATION
Interpolation is the process of finding a formula (often a polynomial) whose graph will pass through a given set of points (x, y).
As an example, consider defining
    x_0 = 0,  x_1 = π/4,  x_2 = π/2
and
    y_i = cos x_i,  i = 0, 1, 2
This gives us the three points
    (0, 1),  (π/4, 1/√2),  (π/2, 0)
LINEAR INTERPOLATION
Given two points (x_0, y_0) and (x_1, y_1) with x_0 ≠ x_1, the linear interpolant is the straight line through them. It can be written in either of the equivalent forms
    P_1(x) = [ (x_1 − x) y_0 + (x − x_0) y_1 ] / (x_1 − x_0)
    P_1(x) = y_0 + ( (y_1 − y_0) / (x_1 − x_0) ) (x − x_0)
Check each of these by evaluating them at x = x_0 and x_1 to see if the respective values are y_0 and y_1.
Example. Following is a table of values for f(x) = tan x for a few values of x.

    x       1        1.1      1.2      1.3
    tan x   1.5574   1.9648   2.5722   3.6021

Use linear interpolation to estimate tan(1.15). Then use
    x_0 = 1.1,  x_1 = 1.2
with corresponding values for y_0 and y_1. Then
    tan x ≈ y_0 + ( (x − x_0) / (x_1 − x_0) ) [ y_1 − y_0 ]
    tan(1.15) ≈ 1.9648 + ( (1.15 − 1.1) / (1.2 − 1.1) ) [ 2.5722 − 1.9648 ] = 2.2685
The true value is tan 1.15 = 2.2345. We will want to examine formulas for the error in interpolation, to know when we have sufficient accuracy in our interpolant.
[Figures: y = tan(x) on [1, 1.3]; and y = tan(x) with the linear interpolant y = p_1(x) on [1.1, 1.2].]
QUADRATIC INTERPOLATION
We want to find a polynomial
    P_2(x) = a_0 + a_1 x + a_2 x²
which satisfies
    P_2(x_i) = y_i,  i = 0, 1, 2
for given data points (x_0, y_0), (x_1, y_1), (x_2, y_2). One formula for such a polynomial follows:
    P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)        (∗)
with
    L_0(x) = (x − x_1)(x − x_2) / [ (x_0 − x_1)(x_0 − x_2) ]
    L_1(x) = (x − x_0)(x − x_2) / [ (x_1 − x_0)(x_1 − x_2) ]
    L_2(x) = (x − x_0)(x − x_1) / [ (x_2 − x_0)(x_2 − x_1) ]
The formula (∗) is called Lagrange's form of the interpolation polynomial.
LAGRANGE BASIS FUNCTIONS
The functions
    L_0(x) = (x − x_1)(x − x_2) / [ (x_0 − x_1)(x_0 − x_2) ]
    L_1(x) = (x − x_0)(x − x_2) / [ (x_1 − x_0)(x_1 − x_2) ]
    L_2(x) = (x − x_0)(x − x_1) / [ (x_2 − x_0)(x_2 − x_1) ]
are called "Lagrange basis functions" for quadratic interpolation. They have the properties
    L_i(x_j) = 1 if i = j,  0 if i ≠ j
for i, j = 0, 1, 2. Also, they all have degree 2. Their graphs are on an accompanying page.
As a consequence of each L_i(x) being of degree 2, we have that the interpolant
    P_2(x) = y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x)
must have degree ≤ 2.
UNIQUENESS
Can there be another polynomial, call it Q(x), for which
    deg(Q) ≤ 2,  Q(x_i) = y_i,  i = 0, 1, 2
Thus, is the Lagrange formula P_2(x) unique?
Introduce
    R(x) = P_2(x) − Q(x)
From the properties of P_2 and Q, we have deg(R) ≤ 2. Moreover,
    R(x_i) = P_2(x_i) − Q(x_i) = y_i − y_i = 0,  i = 0, 1, 2
for all three node points x_0, x_1, and x_2. How many polynomials R(x) are there of degree at most 2 and having three distinct zeros? The answer is that only the zero polynomial satisfies these properties, and therefore
    R(x) = 0 for all x,  Q(x) = P_2(x) for all x
SPECIAL CASES
Consider the data points
    (x_0, 1), (x_1, 1), (x_2, 1)
What is the polynomial P_2(x) in this case?
Answer: We must have
    P_2(x) ≡ 1
meaning that P_2(x) is the constant function 1. Why? First, the constant function satisfies the property of being of degree ≤ 2. Next, it clearly interpolates the given data. Therefore by the uniqueness of quadratic interpolation, P_2(x) must be the constant function 1.
Consider now the data points
    (x_0, m x_0), (x_1, m x_1), (x_2, m x_2)
for some constant m. What is P_2(x) in this case? By an argument similar to that above,
    P_2(x) = m x for all x
Thus the degree of P_2(x) can be less than 2.
HIGHER DEGREE INTERPOLATION
We consider now the case of interpolation by polynomials of a general degree n. We want to find a polynomial P_n(x) for which
    deg(P_n) ≤ n,  P_n(x_i) = y_i,  i = 0, 1, ..., n        (∗∗)
with given data points
    (x_0, y_0), (x_1, y_1), ..., (x_n, y_n)
The solution is given by Lagrange's formula
    P_n(x) = y_0 L_0(x) + y_1 L_1(x) + ... + y_n L_n(x)
The Lagrange basis functions are given by
    L_k(x) = [ (x − x_0) ... (x − x_{k−1})(x − x_{k+1}) ... (x − x_n) ]
             / [ (x_k − x_0) ... (x_k − x_{k−1})(x_k − x_{k+1}) ... (x_k − x_n) ]
for k = 0, 1, 2, ..., n. The quadratic case was covered earlier.
In a manner analogous to the quadratic case, we can show that the above P_n(x) is the only solution to the problem (∗∗).
In the formula for L_k(x) we can see that each such function is a polynomial of degree n. In addition,
    L_k(x_i) = 1 if k = i,  0 if k ≠ i
Using these properties, it follows that the formula
    P_n(x) = y_0 L_0(x) + y_1 L_1(x) + ... + y_n L_n(x)
satisfies the interpolation problem (∗∗).
EXAMPLE
Recall the table
    x       1        1.1      1.2      1.3
    tan x   1.5574   1.9648   2.5722   3.6021
We now interpolate this table with the nodes
    x_0 = 1,  x_1 = 1.1,  x_2 = 1.2,  x_3 = 1.3
Without giving the details of the evaluation process, we have the following results for interpolation with degrees n = 1, 2, 3.

    n           1        2        3
    P_n(1.15)   2.2685   2.2435   2.2296
    Error       −.0340   −.0090   .0049

It improves with increasing degree n, but not at a very rapid rate. In fact, the error becomes worse when n is increased further. Later we will see that interpolation of a much higher degree, say n ≥ 10, is often poorly behaved when the node points {x_i} are evenly spaced.
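Evaluating the Lagrange form directly is straightforward; the following Matlab sketch (illustrative, not from the original notes) reproduces the n = 3 entry above:

    xs = [1 1.1 1.2 1.3];                      % nodes
    ys = [1.5574 1.9648 2.5722 3.6021];        % tan x at the nodes
    t = 1.15;  p = 0;
    for k = 1:length(xs)
        Lk = 1;
        for i = [1:k-1, k+1:length(xs)]        % product defining L_k(t)
            Lk = Lk * (t - xs(i)) / (xs(k) - xs(i));
        end
        p = p + ys(k) * Lk;
    end
    fprintf('P_3(1.15) = %.4f, error = %.4f\n', p, tan(t) - p);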
A FIRST ORDER DIVIDED DIFFERENCE
For a given function f(x) and two distinct points x_0 and x_1, define
    f[x_0, x_1] = ( f(x_1) − f(x_0) ) / ( x_1 − x_0 )
This is called a first order divided difference of f(x).
By the Mean-value theorem,
    f(x_1) − f(x_0) = f'(c) (x_1 − x_0)
for some c between x_0 and x_1. Thus
    f[x_0, x_1] = f'(c)
and the divided difference is very much like the derivative, especially if x_0 and x_1 are quite close together. In fact,
    f'( (x_1 + x_0)/2 ) ≈ f[x_0, x_1]
is quite an accurate approximation of the derivative (see §5.4).
SECOND ORDER DIVIDED DIFFERENCES
Given three distinct points x_0, x_1, and x_2, define
    f[x_0, x_1, x_2] = ( f[x_1, x_2] − f[x_0, x_1] ) / ( x_2 − x_0 )
This is called the second order divided difference of f(x).
By a fairly complicated argument, we can show
    f[x_0, x_1, x_2] = (1/2) f''(c)
for some c intermediate to x_0, x_1, and x_2. In fact, as we investigate in §5.4,
    f''(x_1) ≈ 2 f[x_0, x_1, x_2]
in the case the nodes are evenly spaced,
    x_1 − x_0 = x_2 − x_1
EXAMPLE
Consider the table

    x       1        1.1      1.2      1.3      1.4
    cos x   .54030   .45360   .36236   .26750   .16997

Let x_0 = 1, x_1 = 1.1, and x_2 = 1.2. Then
    f[x_0, x_1] = ( .45360 − .54030 ) / ( 1.1 − 1 ) = −.86700
    f[x_1, x_2] = ( .36236 − .45360 ) / ( 1.2 − 1.1 ) = −.91240
    f[x_0, x_1, x_2] = ( f[x_1, x_2] − f[x_0, x_1] ) / ( x_2 − x_0 )
                     = ( −.91240 − (−.86700) ) / ( 1.2 − 1.0 ) = −.22700
For comparison,
    f'( (x_1 + x_0)/2 ) = −sin(1.05) = −.86742,  (1/2) f''(x_1) = −(1/2) cos(1.1) = −.22680
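The same computation in Matlab (a brief sketch of the example above):

    x = [1 1.1 1.2];
    y = [.54030 .45360 .36236];                % cos x at the nodes
    d1a = (y(2)-y(1))/(x(2)-x(1));             % f[x0,x1] = -.86700
    d1b = (y(3)-y(2))/(x(3)-x(2));             % f[x1,x2] = -.91240
    d2  = (d1b - d1a)/(x(3)-x(1));             % f[x0,x1,x2] = -.22700
    fprintf('f[x0,x1] = %.5f,  f''(1.05) = %.5f\n', d1a, -sin(1.05));
    fprintf('f[x0,x1,x2] = %.5f,  f''''(1.1)/2 = %.5f\n', d2, -cos(1.1)/2);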
AN ERROR FORMULA: THE LINEAR CASE
For linear interpolation it can be shown that
    f(x) − P_1(x) = ( (x − x_0)(x − x_1) / 2 ) f''(c_x)
with c_x between the smallest and largest of x_0, x_1, and x. For f(x) = log₁₀ x we have f''(x) = −log₁₀ e / x², and therefore
    log₁₀ x − P_1(x) = (x − x_0)(x_1 − x) [ log₁₀ e / (2 c_x²) ]
We usually are interpolating with x_0 ≤ x ≤ x_1; and in that case, we have
    (x − x_0)(x_1 − x) ≥ 0,  x_0 ≤ c_x ≤ x_1
and therefore
    (x − x_0)(x_1 − x) [ log₁₀ e / (2x_1²) ] ≤ log₁₀ x − P_1(x) ≤ (x − x_0)(x_1 − x) [ log₁₀ e / (2x_0²) ]
For h = x_1 − x_0 small, we have for x_0 ≤ x ≤ x_1
    log₁₀ x − P_1(x) ≈ (x − x_0)(x_1 − x) [ log₁₀ e / (2x_0²) ]
Typical high school algebra textbooks contain tables of log₁₀ x with a spacing of h = .01. What is the error in this case? To look at this, we use
    0 ≤ log₁₀ x − P_1(x) ≤ (x − x_0)(x_1 − x) [ log₁₀ e / (2x_0²) ]
By simple geometry or calculus,
    max_{x_0≤x≤x_1} (x − x_0)(x_1 − x) ≤ h²/4
Therefore,
    0 ≤ log₁₀ x − P_1(x) ≤ (h²/4) [ log₁₀ e / (2x_0²) ] ≐ .0543 h² / x_0²
If we want a uniform bound for all points 1 ≤ x_0 ≤ 10, we have
    0 ≤ log₁₀ x − P_1(x) ≤ h² log₁₀ e / 8 ≐ .0543 h²
For h = .01, as is typical of the high school textbook tables of log₁₀ x,
    0 ≤ log₁₀ x − P_1(x) ≤ 5.43 × 10⁻⁶
If you look at most tables, a typical entry is given to only four decimal places to the right of the decimal point, e.g.
    log 5.41 ≐ .7332
Therefore the entries are in error by as much as .00005. Comparing this with the interpolation error, we see the latter is less important than the rounding errors in the table entries.
From the bound
    0 ≤ log₁₀ x − P_1(x) ≤ (h²/8) ( log₁₀ e / x_0² ) ≐ .0543 h² / x_0²
we see the error decreases as x_0 increases, and it is about 100 times smaller for points near 10 than for points near 1.
AN ERROR FORMULA: THE GENERAL CASE
Recall the general interpolation problem: find a polynomial P_n(x) for which
    deg(P_n) ≤ n,  P_n(x_i) = f(x_i),  i = 0, 1, ..., n
with distinct node points {x_0, ..., x_n} and a given function f(x). Let [a, b] be a given interval on which f(x) is (n+1)-times continuously differentiable; and assume the points x_0, ..., x_n, and x are contained in [a, b]. Then
    f(x) − P_n(x) = [ (x − x_0)(x − x_1) ... (x − x_n) / (n+1)! ] f^(n+1)(c_x)
with c_x some point between the minimum and maximum of the points in {x, x_0, ..., x_n}.
As shorthand, introduce
    Ψ_n(x) = (x − x_0)(x − x_1) ... (x − x_n)
a polynomial of degree n+1 with roots {x_0, ..., x_n}. Then
    f(x) − P_n(x) = [ Ψ_n(x) / (n+1)! ] f^(n+1)(c_x)
THE QUADRATIC CASE
For n = 2, we have
f(x) P
2
(x) =
(x x
0
) (x x
1
) (x x
2
)
3!
f
(3)
(c
x
)
(*)
with c
x
some point between the minimum and maxi-
mum of the points in {x, x
0
, x
1
, x
2
}.
To illustrate the use of this formula, consider the case
of evenly spaced nodes:
x
1
= x
0
+ h, x
2
= x
1
+ h
Further suppose we have x
0
x x
2
, as we would
usually have when interpolating in a table of given
function values (e.g. log
10
x). The quantity
2
(x) = (x x
0
) (x x
1
) (x x
2
)
can be evaluated directly for a particular x.
Graph of
2
(x) = (x + h) x(x h)
using (x
0
, x
1
, x
2
) = (h, 0, h):
x
y
h
-h
In the formula (*), however, we do not know $c_x$, and therefore we replace $f^{(3)}(c_x)$ with a maximum of $\left|f^{(3)}(x)\right|$ as $x$ varies over $x_0 \le x \le x_2$. This yields
$$|f(x) - P_2(x)| \le \frac{|\Psi_2(x)|}{3!} \max_{x_0 \le x \le x_2} \left|f^{(3)}(x)\right| \qquad (**)$$
If we want a uniform bound for $x_0 \le x \le x_2$, we must compute
$$\max_{x_0 \le x \le x_2} |\Psi_2(x)| = \max_{x_0 \le x \le x_2} |(x - x_0)(x - x_1)(x - x_2)|$$
Using calculus,
$$\max_{x_0 \le x \le x_2} |\Psi_2(x)| = \frac{2 h^3}{3\sqrt{3}}, \qquad \text{at } x = x_1 \pm \frac{h}{\sqrt{3}}$$
Combined with (**), this yields
$$|f(x) - P_2(x)| \le \frac{h^3}{9\sqrt{3}} \max_{x_0 \le x \le x_2} \left|f^{(3)}(x)\right|$$
for $x_0 \le x \le x_2$.
For $f(x) = \log_{10} x$, with $1 \le x_0 \le x \le x_2 \le 10$, this leads to
$$|\log_{10} x - P_2(x)| \le \frac{h^3}{9\sqrt{3}} \max_{x_0 \le x \le x_2} \left|\frac{2 \log_{10} e}{x^3}\right| = \frac{.05572\,h^3}{x_0^3}$$
For the case of $h = .01$, we have
$$|\log_{10} x - P_2(x)| \le \frac{5.57 \times 10^{-8}}{x_0^3} \le 5.57 \times 10^{-8}$$
Question: How much larger could we make $h$ so that quadratic interpolation would have an error comparable to that of linear interpolation of $\log_{10} x$ with $h = .01$? The error bound for the linear interpolation was $5.43 \times 10^{-6}$, and therefore we want the same to be true of quadratic interpolation. Using a simpler bound, we want to find $h$ so that
$$|\log_{10} x - P_2(x)| \le .05572\,h^3 \le 5 \times 10^{-6}$$
This is true if $h \doteq .04477$. Therefore a spacing of $h = .04$ would be sufficient. A table with this spacing and quadratic interpolation would have an error comparable to a table with $h = .01$ and linear interpolation.
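A quick numerical check of this claim, as a MATLAB sketch (taking the worst case $x_0 = 1$, where the bound is largest):

    h = 0.04;  x = [1, 1+h, 1+2*h];
    xx = linspace(x(1), x(3), 2001);
    P2 = polyval(polyfit(x, log10(x), 2), xx);  % quadratic through the 3 nodes
    maxerr = max(abs(log10(xx) - P2));          % about 3.3e-6, below the 5e-6 target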
For the case of general $n$,
$$f(x) - P_n(x) = \frac{(x - x_0) \cdots (x - x_n)}{(n + 1)!}\,f^{(n+1)}(c_x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c_x)$$
$$\Psi_n(x) = (x - x_0)(x - x_1) \cdots (x - x_n)$$
with $c_x$ some point between the minimum and maximum of the points in $\{x, x_0, \ldots, x_n\}$. When bounding the error we replace $f^{(n+1)}(c_x)$ with its maximum over the interval containing $\{x, x_0, \ldots, x_n\}$, as we have illustrated earlier in the linear and quadratic cases.
Consider now the function
$$\frac{\Psi_n(x)}{(n + 1)!}$$
over the interval determined by the minimum and maximum of the points in $\{x, x_0, \ldots, x_n\}$. For evenly spaced node points on $[0, 1]$, with $x_0 = 0$ and $x_n = 1$, we give graphs for $n = 2, 3, 4, 5$ and for $n = 6, 7, 8, 9$ on accompanying pages.
DISCUSSION OF ERROR

Consider the error
$$f(x) - P_n(x) = \frac{(x - x_0) \cdots (x - x_n)}{(n + 1)!}\,f^{(n+1)}(c_x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c_x)$$
$$\Psi_n(x) = (x - x_0)(x - x_1) \cdots (x - x_n)$$
as $n$ increases and as $x$ varies. As noted previously, we cannot do much with $f^{(n+1)}(c_x)$ except to replace it with a maximum value of $\left|f^{(n+1)}(x)\right|$ over a suitable interval. Thus we concentrate on understanding the size of
$$\frac{\Psi_n(x)}{(n + 1)!}$$
ERROR FOR EVENLY SPACED NODES

We consider first the case in which the node points are evenly spaced, as this seems the natural way to define the points at which interpolation is carried out. Moreover, using evenly spaced nodes is the case to consider for table interpolation. What can we learn from the given graphs?

The interpolation nodes are determined by using
$$h = \frac{1}{n}, \qquad x_0 = 0,\; x_1 = h,\; x_2 = 2h,\; \ldots,\; x_n = nh = 1$$
For this case,
$$\Psi_n(x) = x\,(x - h)(x - 2h) \cdots (x - 1)$$
Our graphs are the cases of $n = 2, \ldots, 9$.
[Graphs of $\Psi_n(x)$ on $[0, 1]$ for $n = 2, 3, 4, 5$ and for $n = 6, 7, 8, 9$.]
[Graph of $\Psi_6(x) = (x - x_0)(x - x_1) \cdots (x - x_6)$ with evenly spaced nodes $x_0, \ldots, x_6$.]
Using the following table,

n   M_n       n    M_n
1   1.25E-1   6    4.76E-7
2   2.41E-2   7    2.20E-8
3   2.06E-3   8    9.11E-10
4   1.48E-4   9    3.39E-11
5   9.01E-6   10   1.15E-12

we can observe that the maximum
$$M_n \equiv \max_{x_0 \le x \le x_n} \frac{|\Psi_n(x)|}{(n + 1)!}$$
becomes smaller with increasing $n$.
From the graphs, there is enormous variation in the size of $\Psi_n(x)$ as $x$ varies over $[0, 1]$; and thus there is also enormous variation in the error as $x$ so varies. For example, in the $n = 9$ case,
$$\max_{x_0 \le x \le x_1} \frac{|\Psi_n(x)|}{(n + 1)!} = 3.39 \times 10^{-11}, \qquad \max_{x_4 \le x \le x_5} \frac{|\Psi_n(x)|}{(n + 1)!} = 6.89 \times 10^{-13}$$
and the ratio of these two errors is approximately 49. Thus the interpolation error is likely to be around 49 times larger when $x_0 \le x \le x_1$ as compared to the case when $x_4 \le x \le x_5$. When doing table interpolation, the point $x$ at which you are interpolating should be centrally located with respect to the interpolation nodes $\{x_0, \ldots, x_n\}$ being used to define the interpolation, if possible.
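The roughly 49-to-1 variation can be confirmed directly; here is a minimal MATLAB sketch for $n = 9$ (note that the ratio is independent of the factorial normalization):

    n = 9;  h = 1/n;
    Psi = @(x) prod(x - (0:n)*h);            % Psi_n(x) at a scalar x
    m1 = max(abs(arrayfun(Psi, linspace(0, h, 10001))));      % on [x0, x1]
    m2 = max(abs(arrayfun(Psi, linspace(4*h, 5*h, 10001))));  % on [x4, x5]
    ratio = m1 / m2                          % approximately 49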
AN APPROXIMATION PROBLEM

Consider now the problem of using an interpolation polynomial to approximate a given function $f(x)$ on a given interval $[a, b]$. In particular, take interpolation nodes
$$a \le x_0 < x_1 < \cdots < x_{n-1} < x_n \le b$$
and produce the interpolation polynomial $P_n(x)$ that interpolates $f(x)$ at the given node points. We would like to have
$$\max_{a \le x \le b} |f(x) - P_n(x)| \to 0 \quad \text{as } n \to \infty$$
Does it happen?
Recall the error bound
$$\max_{a \le x \le b} |f(x) - P_n(x)| \le \max_{a \le x \le b} \frac{|\Psi_n(x)|}{(n + 1)!} \cdot \max_{a \le x \le b} \left|f^{(n+1)}(x)\right|$$
Convergence need not happen. Consider the well-known example of Runge,
$$f(x) = \frac{1}{1 + x^2}$$
Accompanying graphs show $f(x)$ and $P_n(x)$ on $[-5, 5]$ for the cases $n = 8$ and $n = 12$; the case $n = 10$ is in the text on page 127. It can be proven that for this function, the maximum error on $[-5, 5]$ does not converge to zero. Thus the use of evenly spaced nodes is not necessarily a good approach to approximating a function $f(x)$ by interpolation.
[Runge's example with $n = 10$: graph of $y = P_{10}(x)$ and $y = 1/(1 + x^2)$.]
OTHER CHOICES OF NODES

Recall the general error bound
$$\max_{a \le x \le b} |f(x) - P_n(x)| \le \max_{a \le x \le b} \frac{|\Psi_n(x)|}{(n + 1)!} \cdot \max_{a \le x \le b} \left|f^{(n+1)}(x)\right|$$
We can attempt to choose the nodes $\{x_0, \ldots, x_n\}$ so as to make $\max |\Psi_n(x)|$ as small as possible. Since $\Psi_n(x)$ is a monic polynomial of degree $n + 1$, it follows from a result given later (in the discussion of Chebyshev polynomials) that its minimum possible maximum on $[-1, 1]$ is $1/2^n$. This turns out to be smaller than for evenly spaced cases; and although this polynomial interpolation does not work for all functions $f(x)$, it works for all differentiable functions and more.
ANOTHER ERROR FORMULA

Recall the error formula
$$f(x) - P_n(x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c), \qquad \Psi_n(x) = (x - x_0)(x - x_1) \cdots (x - x_n)$$
with $c$ between the minimum and maximum of $\{x_0, \ldots, x_n, x\}$. A second formula is given by
$$f(x) - P_n(x) = \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
Showing this involves a simple, but somewhat subtle, argument.
Let $P_{n+1}(x)$ denote the polynomial of degree $\le n + 1$ which interpolates $f(x)$ at the points $\{x_0, \ldots, x_n, x_{n+1}\}$. Then
$$P_{n+1}(x) = P_n(x) + f[x_0, \ldots, x_n, x_{n+1}]\,(x - x_0) \cdots (x - x_n)$$
Substituting $x = x_{n+1}$, and using the fact that $P_{n+1}(x)$ interpolates $f(x)$ at $x_{n+1}$, we have
$$f(x_{n+1}) = P_n(x_{n+1}) + f[x_0, \ldots, x_n, x_{n+1}]\,(x_{n+1} - x_0) \cdots (x_{n+1} - x_n)$$
In this formula, the number $x_{n+1}$ is completely arbitrary, other than being distinct from the points in $\{x_0, \ldots, x_n\}$. To emphasize this fact, replace $x_{n+1}$ by $x$ throughout the formula, obtaining
$$f(x) = P_n(x) + f[x_0, \ldots, x_n, x]\,(x - x_0) \cdots (x - x_n) = P_n(x) + \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
provided $x \ne x_0, \ldots, x_n$.
The formula
$$f(x) = P_n(x) + \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
was derived for $x$ distinct from the nodes. Provided $f(x)$ is differentiable, the divided difference $f[x_0, \ldots, x_n, x]$ remains defined when $x$ is a node point; and since then $\Psi_n(x) = 0$ and $f(x) = P_n(x)$, the formula is easily true for $x$ a node point as well. This shows
$$f(x) - P_n(x) = \Psi_n(x)\,f[x_0, \ldots, x_n, x]$$
Compare the two error formulas
$$f(x) - P_n(x) = \Psi_n(x)\,f[x_0, \ldots, x_n, x], \qquad f(x) - P_n(x) = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c)$$
Then
$$\Psi_n(x)\,f[x_0, \ldots, x_n, x] = \frac{\Psi_n(x)}{(n + 1)!}\,f^{(n+1)}(c)$$
$$f[x_0, \ldots, x_n, x] = \frac{f^{(n+1)}(c)}{(n + 1)!}$$
for some $c$ between the smallest and largest of the numbers in $\{x_0, \ldots, x_n, x\}$.
To make this somewhat symmetric in its arguments, let $m = n + 1$, $x = x_{n+1}$. Then
$$f[x_0, \ldots, x_{m-1}, x_m] = \frac{f^{(m)}(c)}{m!}$$
with $c$ an unknown number between the smallest and largest of the numbers in $\{x_0, \ldots, x_m\}$. This was given in an earlier lecture where divided differences were introduced.
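The relation is easy to observe numerically. A minimal MATLAB sketch for $f(x) = \cos x$ and $m = 2$, building the divided differences in place:

    x = [1 1.1 1.2];
    d = cos(x);
    for k = 1:2    % after pass k, d(1:3-k) holds the order-k divided differences
        d(1:end-k) = (d(2:end-k+1) - d(1:end-k)) ./ (x(1+k:end) - x(1:end-k));
    end
    dd = d(1);     % f[x0,x1,x2] = -0.2270
    % f''(c)/2! = -cos(c)/2 ranges over about [-0.271, -0.181] on [1, 1.2],
    % and dd indeed falls in that range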
PIECEWISE POLYNOMIAL INTERPOLATION

Recall the examples of higher degree polynomial interpolation of the function $f(x) = (1 + x^2)^{-1}$ on $[-5, 5]$. The interpolants $P_n(x)$ oscillated a great deal, whereas the function $f(x)$ was nonoscillatory. To obtain interpolants that are better behaved, we look at other forms of interpolating functions.

Consider the data

x  0    1    2    2.5  3    3.5    4
y  2.5  0.5  0.5  1.5  1.5  1.125  0

What are methods of interpolating this data, other than using a degree 6 polynomial? Shown in the text are the graphs of the degree 6 polynomial interpolant, along with those of piecewise linear and piecewise quadratic interpolating functions. Since we only have the data to consider, we would generally want to use an interpolant that had somewhat the shape of that of the piecewise linear interpolant.
[Figures: the data points; piecewise linear interpolation; polynomial interpolation; piecewise quadratic interpolation.]
PIECEWISE POLYNOMIAL FUNCTIONS

Consider being given a set of data points $(x_1, y_1), \ldots, (x_n, y_n)$, with
$$x_1 < x_2 < \cdots < x_n$$
Then the simplest way to connect the points $(x_j, y_j)$ is by straight line segments. This is called a piecewise linear interpolant of the data $\{(x_j, y_j)\}$. This graph has corners, and often we expect the interpolant to have a smooth graph.
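In MATLAB, a piecewise linear interpolant of the earlier data can be produced with the built-in interp1 (linear interpolation is its default method):

    x = [0 1 2 2.5 3 3.5 4];
    y = [2.5 0.5 0.5 1.5 1.5 1.125 0];
    xx = linspace(0, 4, 401);
    yy = interp1(x, y, xx);        % piecewise linear interpolation
    plot(x, y, 'o', xx, yy)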
To obtain a somewhat smoother graph, consider using piecewise quadratic interpolation. Begin by constructing the quadratic polynomial that interpolates
$$\{(x_1, y_1), (x_2, y_2), (x_3, y_3)\}$$
Then construct the quadratic polynomial that interpolates
$$\{(x_3, y_3), (x_4, y_4), (x_5, y_5)\}$$
Continue this process of constructing quadratic interpolants on the subintervals
$$[x_1, x_3],\ [x_3, x_5],\ [x_5, x_7],\ \ldots$$
If the number of subintervals is even (and therefore $n$ is odd), then this process comes out fine, with the last interval being $[x_{n-2}, x_n]$. This was illustrated on the graph for the preceding data. If, however, $n$ is even, then the approximation on the last interval must be handled by some modification of this procedure. Suggest such! A sketch of the odd case is given below.
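Here is a minimal MATLAB sketch of the construction for $n$ odd, fitting one quadratic per node triple with polyfit:

    x = [0 1 2 2.5 3 3.5 4];                 % n = 7 nodes, so 3 quadratic pieces
    y = [2.5 0.5 0.5 1.5 1.5 1.125 0];
    xx = linspace(x(1), x(end), 401);
    yy = zeros(size(xx));
    for i = 1:2:length(x)-2                  % node triples (i, i+1, i+2)
        c = polyfit(x(i:i+2), y(i:i+2), 2);  % quadratic through three points
        idx = xx >= x(i) & xx <= x(i+2);
        yy(idx) = polyval(c, xx(idx));
    end
    plot(x, y, 'o', xx, yy)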
With piecewise quadratic interpolants, however, there are corners on the graph of the interpolating function. With our preceding example, they are at $x_3$ and $x_5$. How do we avoid this?
Piecewise polynomial interpolants are used in many
applications. We will consider them later, to obtain
numerical integration formulas.
SMOOTH NON-OSCILLATORY INTERPOLATION

Let data points $(x_1, y_1), \ldots, (x_n, y_n)$ be given, and let
$$x_1 < x_2 < \cdots < x_n$$
Consider finding functions $s(x)$ for which the following properties hold:
(1) $s(x_i) = y_i$, $i = 1, \ldots, n$
(2) $s(x)$, $s'(x)$, $s''(x)$ are continuous on $[x_1, x_n]$.
Then among such functions $s(x)$ satisfying these properties, find the one which minimizes the integral
$$\int_{x_1}^{x_n} \left[s''(x)\right]^2 dx$$
The idea of minimizing the integral is to obtain an interpolating function for which the first derivative does not change rapidly. It turns out there is a unique solution to this problem, and it is called a natural cubic spline function.
SPLINE FUNCTIONS

Let a set of node points $\{x_i\}$ be given, satisfying
$$a \le x_1 < x_2 < \cdots < x_n \le b$$
for some numbers $a$ and $b$. Often we use $[a, b] = [x_1, x_n]$. A cubic spline function $s(x)$ on $[a, b]$ with breakpoints or knots $\{x_i\}$ has the following properties:
1. On each of the intervals
$$[a, x_1],\ [x_1, x_2],\ \ldots,\ [x_{n-1}, x_n],\ [x_n, b]$$
$s(x)$ is a polynomial of degree $\le 3$.
2. $s(x)$, $s'(x)$, $s''(x)$ are continuous on $[a, b]$.
In the case that we have given data points $(x_1, y_1), \ldots, (x_n, y_n)$, we say $s(x)$ is a cubic interpolating spline function for this data if
3. $s(x_i) = y_i$, $i = 1, \ldots, n$.
EXAMPLE

Define
$$(x - \alpha)_+^3 = \begin{cases} (x - \alpha)^3, & x \ge \alpha \\ 0, & x \le \alpha \end{cases}$$
This is a cubic spline function on $(-\infty, \infty)$ with the single breakpoint $x_1 = \alpha$.

Combinations of these form more complicated cubic spline functions. For example,
$$s(x) = 3\,(x - 1)_+^3 - 2\,(x - 3)_+^3$$
is a cubic spline function on $(-\infty, \infty)$ with the breakpoints $x_1 = 1$, $x_2 = 3$.
Define
$$s(x) = p_3(x) + \sum_{j=1}^{n} a_j \left(x - x_j\right)_+^3$$
with $p_3(x)$ some cubic polynomial. Then $s(x)$ is a cubic spline function on $(-\infty, \infty)$ with breakpoints $\{x_1, \ldots, x_n\}$.
Return to the earlier problem of choosing an interpolating function $s(x)$ to minimize the integral
$$\int_{x_1}^{x_n} \left[s''(x)\right]^2 dx$$
There is a unique solution to this problem. The solution $s(x)$ is a cubic interpolating spline function, and moreover, it satisfies
$$s''(x_1) = s''(x_n) = 0$$
Spline functions satisfying these boundary conditions are called natural cubic spline functions, and the solution to our minimization problem is a natural cubic interpolatory spline function. We will show a method to construct this function from the interpolation data.

Motivation for these boundary conditions can be given by looking at the physics of bending thin beams of flexible material to pass through the given data. To the left of $x_1$ and to the right of $x_n$, the beam is straight, and therefore the second derivative is zero at the transition points $x_1$ and $x_n$.
CONSTRUCTION OF THE INTERPOLATING SPLINE FUNCTION

To make the presentation more specific, suppose we have data
$$(x_1, y_1),\ (x_2, y_2),\ (x_3, y_3),\ (x_4, y_4)$$
with $x_1 < x_2 < x_3 < x_4$. Then on each of the intervals
$$[x_1, x_2],\ [x_2, x_3],\ [x_3, x_4]$$
$s(x)$ is a cubic polynomial. Taking the first interval, $s(x)$ is a cubic polynomial and $s''(x)$ is a linear polynomial. Let
$$M_i = s''(x_i), \qquad i = 1, 2, 3, 4$$
Then on $[x_1, x_2]$,
$$s''(x) = \frac{(x_2 - x) M_1 + (x - x_1) M_2}{x_2 - x_1}, \qquad x_1 \le x \le x_2$$
We can find $s(x)$ by integrating twice:
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6 (x_2 - x_1)} + c_1 x + c_2$$
We determine the constants of integration by using
$$s(x_1) = y_1, \qquad s(x_2) = y_2 \qquad (*)$$
Then
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6 (x_2 - x_1)} + \frac{(x_2 - x) y_1 + (x - x_1) y_2}{x_2 - x_1} - \frac{x_2 - x_1}{6}\left[(x_2 - x) M_1 + (x - x_1) M_2\right]$$
for $x_1 \le x \le x_2$.

Check that this formula satisfies the given interpolation condition (*)! A check of the first condition is carried out below.
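For the first condition: setting $x = x_1$, the middle term reduces to $y_1$ and the outer terms cancel,
$$s(x_1) = \frac{(x_2 - x_1)^3 M_1}{6 (x_2 - x_1)} + y_1 - \frac{(x_2 - x_1)^2 M_1}{6} = \frac{(x_2 - x_1)^2 M_1}{6} + y_1 - \frac{(x_2 - x_1)^2 M_1}{6} = y_1$$
The check of $s(x_2) = y_2$ is entirely analogous.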
We can repeat this on the intervals $[x_2, x_3]$ and $[x_3, x_4]$, obtaining similar formulas. For $x_2 \le x \le x_3$,
$$s(x) = \frac{(x_3 - x)^3 M_2 + (x - x_2)^3 M_3}{6 (x_3 - x_2)} + \frac{(x_3 - x) y_2 + (x - x_2) y_3}{x_3 - x_2} - \frac{x_3 - x_2}{6}\left[(x_3 - x) M_2 + (x - x_2) M_3\right]$$
For $x_3 \le x \le x_4$,
$$s(x) = \frac{(x_4 - x)^3 M_3 + (x - x_3)^3 M_4}{6 (x_4 - x_3)} + \frac{(x_4 - x) y_3 + (x - x_3) y_4}{x_4 - x_3} - \frac{x_4 - x_3}{6}\left[(x_4 - x) M_3 + (x - x_3) M_4\right]$$
We still do not know the values of the second derivatives $\{M_1, M_2, M_3, M_4\}$. The above formulas guarantee that $s(x)$ and $s''(x)$ are continuous for $x_1 \le x \le x_4$. For example, the formula on $[x_1, x_2]$ yields
$$s(x_2) = y_2, \qquad s''(x_2) = M_2$$
The formula on $[x_2, x_3]$ also yields
$$s(x_2) = y_2, \qquad s''(x_2) = M_2$$
All that is lacking is to make $s'(x)$ continuous at $x_2$ and $x_3$. Thus we require
$$s'(x_2 + 0) = s'(x_2 - 0), \qquad s'(x_3 + 0) = s'(x_3 - 0) \qquad (**)$$
This means
$$\lim_{x \searrow x_2} s'(x) = \lim_{x \nearrow x_2} s'(x)$$
and similarly for $x_3$.
To simplify the presentation somewhat, I assume in the following that our node points are evenly spaced:
$$x_2 = x_1 + h, \qquad x_3 = x_1 + 2h, \qquad x_4 = x_1 + 3h$$
Then our earlier formulas simplify to
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6h} + \frac{(x_2 - x) y_1 + (x - x_1) y_2}{h} - \frac{h}{6}\left[(x_2 - x) M_1 + (x - x_1) M_2\right]$$
for $x_1 \le x \le x_2$, with similar formulas on $[x_2, x_3]$ and $[x_3, x_4]$.
Without going through all of the algebra, the conditions (**) lead to the following pair of equations:
$$\frac{h}{6} M_1 + \frac{2h}{3} M_2 + \frac{h}{6} M_3 = \frac{y_3 - y_2}{h} - \frac{y_2 - y_1}{h}$$
$$\frac{h}{6} M_2 + \frac{2h}{3} M_3 + \frac{h}{6} M_4 = \frac{y_4 - y_3}{h} - \frac{y_3 - y_2}{h}$$
This gives us two equations in four unknowns. The earlier boundary conditions on $s''(x)$ give us immediately
$$M_1 = M_4 = 0$$
Then we can solve the linear system for $M_2$ and $M_3$.
EXAMPLE

Consider the interpolation data points

x  1  2    3    4
y  1  1/2  1/3  1/4

In this case, $h = 1$, and the linear system becomes
$$\frac{2}{3} M_2 + \frac{1}{6} M_3 = y_3 - 2y_2 + y_1 = \frac{1}{3}$$
$$\frac{1}{6} M_2 + \frac{2}{3} M_3 = y_4 - 2y_3 + y_2 = \frac{1}{12}$$
This has the solution
$$M_2 = \frac{1}{2}, \qquad M_3 = 0$$
This leads to the spline function formula on each subinterval.
On $[1, 2]$,
$$s(x) = \frac{(x_2 - x)^3 M_1 + (x - x_1)^3 M_2}{6h} + \frac{(x_2 - x) y_1 + (x - x_1) y_2}{h} - \frac{h}{6}\left[(x_2 - x) M_1 + (x - x_1) M_2\right]$$
$$= \frac{(2 - x)^3 \cdot 0 + (x - 1)^3 \cdot \frac{1}{2}}{6} + \frac{(2 - x) \cdot 1 + (x - 1) \cdot \frac{1}{2}}{1} - \frac{1}{6}\left[(2 - x) \cdot 0 + (x - 1) \cdot \frac{1}{2}\right]$$
$$= \frac{1}{12}(x - 1)^3 - \frac{7}{12}(x - 1) + 1$$
Similarly, for $2 \le x \le 3$,
$$s(x) = -\frac{1}{12}(x - 2)^3 + \frac{1}{4}(x - 2)^2 - \frac{1}{3}(x - 2) + \frac{1}{2}$$
and for $3 \le x \le 4$,
$$s(x) = -\frac{1}{12}(x - 4) + \frac{1}{4}$$
[Graph of the natural cubic spline interpolant $y = s(x)$ to the data $(j, 1/j)$, $j = 1, \ldots, 4$, together with $y = 1/x$.]
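The example is easy to reproduce; here is a minimal MATLAB sketch that evaluates the natural spline piece by piece from the $M_i$ found above:

    x = 1:4;  y = 1 ./ x;  h = 1;
    M = [0 1/2 0 0];                 % M1 = M4 = 0 (natural), M2 = 1/2, M3 = 0
    xx = linspace(1, 4, 301);  ss = zeros(size(xx));
    for j = 1:3                      % evaluate s(x) on [x_j, x_{j+1}]
        idx = xx >= x(j) & xx <= x(j+1);  t = xx(idx);
        ss(idx) = ((x(j+1)-t).^3*M(j) + (t-x(j)).^3*M(j+1))/(6*h) ...
                + ((x(j+1)-t)*y(j) + (t-x(j))*y(j+1))/h ...
                - (h/6)*((x(j+1)-t)*M(j) + (t-x(j))*M(j+1));
    end
    plot(x, y, 'o', xx, ss, xx, 1./xx, '--')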
[Graph of the interpolating natural cubic spline function for the earlier seven-point data set.]
ALTERNATIVE BOUNDARY CONDITIONS

Return to the equations
$$\frac{h}{6} M_1 + \frac{2h}{3} M_2 + \frac{h}{6} M_3 = \frac{y_3 - y_2}{h} - \frac{y_2 - y_1}{h}$$
$$\frac{h}{6} M_2 + \frac{2h}{3} M_3 + \frac{h}{6} M_4 = \frac{y_4 - y_3}{h} - \frac{y_3 - y_2}{h}$$
Sometimes other boundary conditions are imposed on $s(x)$ to help in determining the values of $M_1$ and $M_4$. For example, the data in our numerical example were generated from the function $f(x) = \frac{1}{x}$. With it, $f''(x) = \frac{2}{x^3}$, and thus we could use
$$M_1 = 2, \qquad M_4 = \frac{1}{32}$$
With this we are led to a new formula for $s(x)$, one that approximates $f(x) = \frac{1}{x}$ more closely.
THE CLAMPED SPLINE

In this case, we augment the interpolation conditions
$$s(x_i) = y_i, \qquad i = 1, 2, 3, 4$$
with the boundary conditions
$$s'(x_1) = y_1', \qquad s'(x_4) = y_4' \qquad (\#)$$
The conditions (#) lead to another pair of equations, augmenting the earlier ones. Combined, these equations are
$$\frac{h}{3} M_1 + \frac{h}{6} M_2 = \frac{y_2 - y_1}{h} - y_1'$$
$$\frac{h}{6} M_1 + \frac{2h}{3} M_2 + \frac{h}{6} M_3 = \frac{y_3 - y_2}{h} - \frac{y_2 - y_1}{h}$$
$$\frac{h}{6} M_2 + \frac{2h}{3} M_3 + \frac{h}{6} M_4 = \frac{y_4 - y_3}{h} - \frac{y_3 - y_2}{h}$$
$$\frac{h}{6} M_3 + \frac{h}{3} M_4 = y_4' - \frac{y_4 - y_3}{h}$$
For our numerical example, it is natural to obtain these derivative values from $f'(x) = -\frac{1}{x^2}$:
$$y_1' = -1, \qquad y_4' = -\frac{1}{16}$$
When combined with our earlier equations, we have the system
$$\frac{1}{3} M_1 + \frac{1}{6} M_2 = \frac{1}{2}$$
$$\frac{1}{6} M_1 + \frac{2}{3} M_2 + \frac{1}{6} M_3 = \frac{1}{3}$$
$$\frac{1}{6} M_2 + \frac{2}{3} M_3 + \frac{1}{6} M_4 = \frac{1}{12}$$
$$\frac{1}{6} M_3 + \frac{1}{3} M_4 = \frac{1}{48}$$
This has the solution
$$[M_1, M_2, M_3, M_4] = \left[\frac{173}{120},\ \frac{7}{60},\ \frac{11}{120},\ \frac{1}{60}\right]$$
We can substitute these values, together with the data

x  1  2    3    4
y  1  1/2  1/3  1/4

into the formula for $s(x)$ on each subinterval. Doing so, consider the error $f(x) - s(x)$. As an example,
$$f(x) = \frac{1}{x}, \qquad f\left(\frac{3}{2}\right) = \frac{2}{3} \doteq .66667, \qquad s\left(\frac{3}{2}\right) = .65260$$
This is quite a decent approximation; the evaluation is checked below.
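Indeed, with $h = 1$ and $x = 3/2$ on $[x_1, x_2]$, we have $x_2 - x = x - x_1 = 1/2$, so the formula gives
$$s\left(\tfrac{3}{2}\right) = \frac{(1/2)^3 (M_1 + M_2)}{6} + \frac{y_1 + y_2}{2} - \frac{1}{6} \cdot \frac{M_1 + M_2}{2} = \frac{3}{4} - \frac{1}{16}(M_1 + M_2) = \frac{3}{4} - \frac{1}{16} \cdot \frac{187}{120} \doteq .65260$$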
THE GENERAL PROBLEM

Consider the spline interpolation problem with $n$ nodes
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$$
and assume the node points $\{x_i\}$ are evenly spaced,
$$x_j = x_1 + (j - 1) h, \qquad j = 1, \ldots, n$$
We have that the interpolating spline $s(x)$ on $x_j \le x \le x_{j+1}$ is given by
$$s(x) = \frac{(x_{j+1} - x)^3 M_j + (x - x_j)^3 M_{j+1}}{6h} + \frac{(x_{j+1} - x) y_j + (x - x_j) y_{j+1}}{h} - \frac{h}{6}\left[(x_{j+1} - x) M_j + (x - x_j) M_{j+1}\right]$$
for $j = 1, \ldots, n - 1$.
To enforce continuity of $s'(x)$ at the interior node points $x_2, \ldots, x_{n-1}$, the second derivatives $\{M_j\}$ must satisfy the linear equations
$$\frac{h}{6} M_{j-1} + \frac{2h}{3} M_j + \frac{h}{6} M_{j+1} = \frac{y_{j-1} - 2y_j + y_{j+1}}{h}$$
for $j = 2, \ldots, n - 1$. Writing them out,
$$\frac{h}{6} M_1 + \frac{2h}{3} M_2 + \frac{h}{6} M_3 = \frac{y_1 - 2y_2 + y_3}{h}$$
$$\frac{h}{6} M_2 + \frac{2h}{3} M_3 + \frac{h}{6} M_4 = \frac{y_2 - 2y_3 + y_4}{h}$$
$$\vdots$$
$$\frac{h}{6} M_{n-2} + \frac{2h}{3} M_{n-1} + \frac{h}{6} M_n = \frac{y_{n-2} - 2y_{n-1} + y_n}{h}$$
This is a system of $n - 2$ equations in the $n$ unknowns $\{M_1, \ldots, M_n\}$. Two more conditions must be imposed on $s(x)$ in order to have the number of equations equal the number of unknowns, namely $n$. With the added boundary conditions, the linear system is tridiagonal and can be solved very efficiently; a sketch is given below.
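A minimal MATLAB sketch of the system for natural boundary conditions ($M_1 = M_n = 0$, so only the interior values are unknown); the sample data $y = e^x$ is our own choice:

    n = 11;  a = 0;  b = 1;  h = (b - a)/(n - 1);
    x = a + (0:n-1)'*h;  y = exp(x);             % sample data to interpolate
    rhs = (y(1:n-2) - 2*y(2:n-1) + y(3:n)) / h;  % right-hand sides, j = 2..n-1
    A = (2*h/3)*eye(n-2) ...                     % tridiagonal coefficient matrix
      + (h/6)*(diag(ones(n-3,1),1) + diag(ones(n-3,1),-1));
    M = [0; A \ rhs; 0];                         % interior M's; natural end values

In practice one would exploit the tridiagonal structure rather than forming the full matrix, but the backslash solve keeps the sketch short.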
BOUNDARY CONDITIONS

Natural boundary conditions:
$$s''(x_1) = s''(x_n) = 0$$
Spline functions satisfying these conditions are called natural cubic splines. They arise out of the minimization problem stated earlier. But generally they are not considered as good as some other cubic interpolating splines.
Clamped boundary conditions: We add the conditions
$$s'(x_1) = y_1', \qquad s'(x_n) = y_n'$$
with $y_1'$, $y_n'$ given slopes for the endpoints of $s(x)$ on $[x_1, x_n]$. This has many quite good properties when compared with the natural cubic interpolating spline; but it does require knowing the derivatives at the endpoints.
Not-a-knot boundary conditions: This is more complicated to explain, but it is the version of cubic spline interpolation that is implemented in MATLAB.
THE NOT-A-KNOT CONDITIONS

As before, let the interpolation nodes be
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$$
We separate these points into two categories. For constructing the interpolating cubic spline function, we use the points
$$(x_1, y_1),\ (x_3, y_3),\ \ldots,\ (x_{n-2}, y_{n-2}),\ (x_n, y_n)$$
thus deleting two of the points, $x_2$ and $x_{n-1}$. We now have $n - 2$ points, and the interpolating spline $s(x)$ can be determined on the intervals
$$[x_1, x_3],\ [x_3, x_4],\ \ldots,\ [x_{n-3}, x_{n-2}],\ [x_{n-2}, x_n]$$
This leads to $n - 4$ equations in the $n - 2$ unknowns $M_1, M_3, \ldots, M_{n-2}, M_n$. The two additional boundary conditions are
$$s(x_2) = y_2, \qquad s(x_{n-1}) = y_{n-1}$$
These translate into two additional equations, and we obtain a system of $n - 2$ linear simultaneous equations in the $n - 2$ unknowns $M_1, M_3, \ldots, M_{n-2}, M_n$.
[Graph of the interpolating cubic spline function with not-a-knot boundary conditions for the earlier seven-point data set.]
MATLAB SPLINE FUNCTION LIBRARY

Given data points
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n)$$
type arrays containing the x and y coordinates:

    x = [x_1 x_2 ... x_n]
    y = [y_1 y_2 ... y_n]
    plot(x, y, 'o')

The last statement will draw a plot of the data points, marking them with the letter 'o'. To find the interpolating cubic spline function and evaluate it at the points of another array xx, say

    h = (x_n - x_1)/(10*n);  xx = x_1 : h : x_n;

use

    yy = spline(x, y, xx)
    plot(x, y, 'o', xx, yy)

The last statement will plot the data points, as before, and it will plot the interpolating spline s(x) as a continuous curve.
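By default, spline imposes the not-a-knot conditions. Supplying the vector of values with two extra entries imposes clamped boundary conditions instead, the first and last entries being taken as the endpoint slopes. For the earlier $f(x) = 1/x$ example:

    x = 1:4;  y = 1 ./ x;
    xx = linspace(1, 4, 301);
    yy1 = spline(x, y, xx);                   % not-a-knot
    yy2 = spline(x, [-1, y, -1/16], xx);      % clamped: f'(1) = -1, f'(4) = -1/16
    plot(x, y, 'o', xx, yy1, xx, yy2, '--')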
ERROR IN CUBIC SPLINE INTERPOLATION

Let an interval $[a, b]$ be given, and then define
$$h = \frac{b - a}{n - 1}, \qquad x_j = a + (j - 1) h, \quad j = 1, \ldots, n$$
Suppose we want to approximate a given function $f(x)$ on the interval $[a, b]$ using cubic spline interpolation. Define
$$y_j = f(x_j), \qquad j = 1, \ldots, n$$
Let $s_n(x)$ denote the cubic spline interpolating this data and satisfying the not-a-knot boundary conditions. Then it can be shown that for a suitable constant $c$,
$$E_n \equiv \max_{a \le x \le b} |f(x) - s_n(x)| \le c\,h^4$$
The corresponding bound for natural cubic spline interpolation contains only a term of $h^2$ rather than $h^4$; it does not converge to zero as rapidly.
EXAMPLE

Take $f(x) = \arctan x$ on $[0, 5]$. The following table gives values of the maximum error $E_n$ for various values of $n$. The values of $h$ are being successively halved.

n    E_n       E_{(n+1)/2} / E_n
7    7.09E-3
13   3.24E-4   21.9
25   3.06E-5   10.6
49   1.48E-6   20.7
97   9.04E-8   16.4
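The table can be reproduced with a few lines of MATLAB (our own check; spline supplies the not-a-knot conditions):

    f = @atan;  xx = linspace(0, 5, 5001);
    ns = [7 13 25 49 97];  E = zeros(size(ns));
    for k = 1:length(ns)
        x = linspace(0, 5, ns(k));
        E(k) = max(abs(f(xx) - spline(x, f(x), xx)));   % max error on a fine grid
    end
    ratios = E(1:end-1) ./ E(2:end)   % roughly 16 once h is small, as O(h^4) predicts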
BEST APPROXIMATION

Given a function $f(x)$ that is continuous on a given interval $[a, b]$, consider approximating it by some polynomial $p(x)$. To measure the error in $p(x)$ as an approximation, introduce
$$E(p) = \max_{a \le x \le b} |f(x) - p(x)|$$
This is called the maximum error or uniform error of approximation of $f(x)$ by $p(x)$ on $[a, b]$.

With an eye towards efficiency, we want to find the best possible approximation of a given degree $n$. With this in mind, introduce the following:
$$\rho_n(f) = \min_{\deg(p) \le n} E(p) = \min_{\deg(p) \le n} \left[\max_{a \le x \le b} |f(x) - p(x)|\right]$$
The number $\rho_n(f)$ will be the smallest possible uniform error, or minimax error, when approximating $f(x)$ by polynomials of degree at most $n$. If there is a polynomial giving this smallest error, we denote it by $m_n(x)$; thus $E(m_n) = \rho_n(f)$.
Example. Let $f(x) = e^x$ on $[-1, 1]$. In the following table, we give the values of $E(t_n)$, with $t_n(x)$ the Taylor polynomial of degree $n$ for $e^x$ about $x = 0$, and $E(m_n)$.

Maximum error in:
n   t_n(x)    m_n(x)
1   7.18E-1   2.79E-1
2   2.18E-1   4.50E-2
3   5.16E-2   5.53E-3
4   9.95E-3   5.47E-4
5   1.62E-3   4.52E-5
6   2.26E-4   3.21E-6
7   2.79E-5   2.00E-7
8   3.06E-6   1.11E-8
9   3.01E-7   5.52E-10
Consider graphically how we can improve on the Taylor polynomial
$$t_1(x) = 1 + x$$
as a uniform approximation to $e^x$ on the interval $[-1, 1]$. The linear minimax approximation is
$$m_1(x) = 1.2643 + 1.1752\,x$$
[Figures: linear Taylor and minimax approximations to $e^x$; error in the cubic Taylor approximation to $e^x$ (maximum $.0516$); error in the cubic minimax approximation to $e^x$ (oscillating between $\pm .00553$).]
Accuracy of the minimax approximation. It can be shown that
$$\rho_n(f) \le \frac{[(b - a)/2]^{n+1}}{(n + 1)!\,2^n} \max_{a \le x \le b} \left|f^{(n+1)}(x)\right|$$
For $f(x) = e^x$ on $[-1, 1]$, this bound becomes
$$\rho_n(e^x) \le \frac{e}{(n + 1)!\,2^n} \qquad (*)$$

n   Bound (*)   rho_n(f)
1   6.80E-1     2.79E-1
2   1.13E-1     4.50E-2
3   1.42E-2     5.53E-3
4   1.42E-3     5.47E-4
5   1.18E-4     4.52E-5
6   8.43E-6     3.21E-6
7   5.27E-7     2.00E-7
CHEBYSHEV POLYNOMIALS

Chebyshev polynomials are used in many parts of numerical analysis, and more generally, in applications of mathematics. For an integer $n \ge 0$, define the function
$$T_n(x) = \cos\left(n \cos^{-1} x\right), \qquad -1 \le x \le 1 \qquad (1)$$
This may not appear to be a polynomial, but we will show it is a polynomial of degree $n$. To simplify the manipulation of (1), we introduce
$$\theta = \cos^{-1}(x) \quad \text{or} \quad x = \cos(\theta), \qquad 0 \le \theta \le \pi \qquad (2)$$
Then
$$T_n(x) = \cos(n\theta) \qquad (3)$$

Example. $n = 0$:
$$T_0(x) = \cos(0 \cdot \theta) = 1$$
$n = 1$:
$$T_1(x) = \cos(\theta) = x$$
$n = 2$:
$$T_2(x) = \cos(2\theta) = 2\cos^2(\theta) - 1 = 2x^2 - 1$$
[Graphs of $T_0(x)$, $T_1(x)$, $T_2(x)$, and of $T_3(x)$, $T_4(x)$, on $[-1, 1]$.]
The triple recursion relation. Recall the trigonometric addition formulas,
$$\cos(\alpha \pm \beta) = \cos(\alpha)\cos(\beta) \mp \sin(\alpha)\sin(\beta)$$
Let $n \ge 1$, and apply these identities to get
$$T_{n+1}(x) = \cos[(n + 1)\theta] = \cos(n\theta + \theta) = \cos(n\theta)\cos(\theta) - \sin(n\theta)\sin(\theta)$$
$$T_{n-1}(x) = \cos[(n - 1)\theta] = \cos(n\theta - \theta) = \cos(n\theta)\cos(\theta) + \sin(n\theta)\sin(\theta)$$
Add these two equations, and then use (1) and (3) to obtain
$$T_{n+1}(x) + T_{n-1}(x) = 2\cos(n\theta)\cos(\theta) = 2x\,T_n(x)$$
$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \qquad n \ge 1 \qquad (4)$$
This is called the triple recursion relation for the Chebyshev polynomials. It is often used in evaluating them, rather than using the explicit formula (1).
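A minimal MATLAB function implementing the recursion (the name chebT is ours):

    function T = chebT(n, x)
    % Evaluate the Chebyshev polynomial T_n at the array x
    % using the triple recursion T_{k+1} = 2x T_k - T_{k-1}
        Tprev = ones(size(x));               % T_0
        if n == 0, T = Tprev; return, end
        T = x;                               % T_1
        for k = 1:n-1
            Tnew  = 2*x.*T - Tprev;
            Tprev = T;
            T     = Tnew;
        end
    end

For example, chebT(4, 0.5) returns -0.5, matching cos(4 acos(0.5)) = cos(4*pi/3) = -1/2.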
Example. Recall
$$T_0(x) = 1, \qquad T_1(x) = x$$
$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \qquad n \ge 1$$
Let $n = 2$. Then
$$T_3(x) = 2x\,T_2(x) - T_1(x) = 2x(2x^2 - 1) - x = 4x^3 - 3x$$
Let $n = 3$. Then
$$T_4(x) = 2x\,T_3(x) - T_2(x) = 2x(4x^3 - 3x) - (2x^2 - 1) = 8x^4 - 8x^2 + 1$$
The minimum size property. Note that
$$|T_n(x)| \le 1, \qquad -1 \le x \le 1 \qquad (5)$$
for all $n \ge 0$. Also, note that
$$T_n(x) = 2^{n-1} x^n + \text{lower degree terms}, \qquad n \ge 1 \qquad (6)$$
This can be proven using the triple recursion relation and mathematical induction.

Introduce a modified version of $T_n(x)$,
$$\widetilde{T}_n(x) = \frac{1}{2^{n-1}}\,T_n(x) = x^n + \text{lower degree terms} \qquad (7)$$
From (5) and (6),
$$\left|\widetilde{T}_n(x)\right| \le \frac{1}{2^{n-1}}, \qquad -1 \le x \le 1, \quad n \ge 1 \qquad (8)$$
Example.
$$\widetilde{T}_4(x) = \frac{1}{8}\left(8x^4 - 8x^2 + 1\right) = x^4 - x^2 + \frac{1}{8}$$
A polynomial whose highest degree term has a coefficient of 1 is called a monic polynomial. Formula (8) says the monic polynomial $\widetilde{T}_n(x)$ has size $1/2^{n-1}$ on $-1 \le x \le 1$, and this becomes smaller as the degree $n$ increases. In comparison,
$$\max_{-1 \le x \le 1} |x^n| = 1$$
Thus $x^n$ is a monic polynomial whose size does not change with increasing $n$.
Theorem. Let $n \ge 1$ be an integer, and consider all possible monic polynomials of degree $n$. Then the degree $n$ monic polynomial with the smallest maximum absolute value on $[-1, 1]$ is the modified Chebyshev polynomial $\widetilde{T}_n(x)$, and its maximum value on $[-1, 1]$ is $1/2^{n-1}$.

This result is used in devising applications of Chebyshev polynomials. We apply it to obtain an improved interpolation scheme.