
I.

Introduction

1.1 Why Numerical Methods?


Example 1. Steady state heat conduction

Governing equation:  ∇²T = 0

(Figure omitted: a square cavity with corner A between walls AB and AD.
Boundary conditions: one wall insulated with ∂T/∂y = 0, another with
∂T/∂x = 0, plus walls at T = T1 and T = T2 > T1. Heat flows through
the cavity from the hot wall to the cold wall.)

How do we determine the heat flow from wall AB to wall AD?

Possible solutions:

1. Experiment
2. Analytical solution
3. Numerical solution
Example 2. Steady state heat conduction in a non-simple geometry

Governing equation:  ∇²T = 0

(Figure omitted: the same cavity, now containing an internal block.
Boundary conditions as in Example 1: insulated walls with ∂T/∂y = 0 and
∂T/∂x = 0, plus walls at T = T1 and T = T2 > T1.)

How do we determine the heat flow from wall AB to wall AD?
Example 3. Unsteady state heat conduction in a non-simple geometry:

∂T/∂t = ∇²T

Experimental approach:
Design the experiment.
Set up a facility to satisfy the boundary conditions (insulation and
constant temperature).
Prepare instrumentation.
Perform the experiment & collect data
(measure the heat flux on wall AB, for example).
Analyze and present the data.
Develop a model (say, to describe the effect of L_AB/L_CD on
heat transfer).
O.K. for all three examples, but it yields no information on T(x, y) inside
the cavity, and it is relatively time consuming & tedious.
Analytical approach:
Solve the mathematical equation (∇²T = 0 or ∂T/∂t = ∇²T) based on
a physical model.
Use the method of separation of variables? Green's function?
O.K. for the simple geometry in Example 1;
unlikely for Examples 2 & 3.

Numerical approach:
Solve equations that can be much more complicated than
∂T/∂t = ∇²T using a computer.
The solution is discrete and approximate, but it can be close to exact.

(Figure omitted: a discrete solution, with values y1, y2, ..., yn, yn+1
at times t1, t2, ..., tn, tn+1.)

A computer program can handle more complicated geometries
(as in Examples 2 & 3 above).
We gain insight into the temperature distribution inside the cavity.
Cautionary Remarks:
NO numerical method is completely trouble free in all
situations.
NO numerical method is completely error free.
NO numerical method is optimal for all situations.
Be careful about:
ACCURACY, EFFICIENCY, & STABILITY.

* Example 4.  Solve a simple ODE:

dy/dt = -10y,    y(0) = 1

First note: the exact solution is y_exact(t) = e^(-10t).

Approximating the LHS dy/dt as (y_(n+1) - y_n)/Δt
& treating the RHS -10y as -10 y_n (n = 0, 1, 2, 3, ...), i.e.

(y_(n+1) - y_n)/Δt = -10 y_n

=>  y_(n+1) = y_n - 10Δt y_n = y_n (1 - 10Δt),    with y_0 = 1.

y_(n+1) = y_n (1 - 10Δt) = y_(n-1) (1 - 10Δt)²
        = y_(n-2) (1 - 10Δt)³ = ... = y_0 (1 - 10Δt)^(n+1)

Choose Δt = 0.05, 0.1, 0.2, and 0.5,
=> 10Δt = 0.5, 1, 2, and 5.
See what happens!
n    y_n(Δt=0.05)   y_n(Δt=0.1)   y_n(Δt=0.2)   y_n(Δt=0.5)
1    0.5            0             -1            -4
2    0.25           0             1             16
3    0.125          0             -1            -64
4    0.0625         0             1             256
5    0.03125        0             -1            -1024
6    0.0156250      0             1             4096
7    0.0078125      0             -1            -16384

Comments:   ok    inaccurate    oscillates    blows up
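The behavior above can be reproduced in a few lines; this is a sketch in Python (the notes' own programs are in Fortran) iterating y_(n+1) = y_n (1 - 10Δt):

```python
# Explicit Euler for dy/dt = -10y, y(0) = 1:  y_{n+1} = y_n * (1 - 10*dt)
def euler(dt, nsteps):
    y = 1.0
    history = [y]
    for _ in range(nsteps):
        y = y * (1.0 - 10.0 * dt)   # amplification factor q = 1 - 10*dt
        history.append(y)
    return history

for dt in (0.05, 0.1, 0.2, 0.5):
    print(dt, euler(dt, 7)[1:])
# |1 - 10*dt| <= 1 is required for the iterates to stay bounded, so
# dt = 0.5 (q = -4) blows up while dt = 0.05 (q = 0.5) decays smoothly.
```

The amplification factor view makes the table immediate: each column is just the powers of q = 1 - 10Δt.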

(Figure omitted: this graph compares the exact solution y_exact(t) = e^(-10t)
with the stable numerical solution for Δt = 0.05, over 0 ≤ t ≤ 1.5.)

Questions:
Why does the solution blow up for Δt = 0.5?
How do we detect/prevent numerical instability (blowing up) in general?
How do we improve accuracy (c.f. the case with Δt = 0.05)?
How do we get the solution efficiently when a large system is solved?

Graphs based on numerical solutions to heat transfer problems (Examples 1 & 2):

(Figures omitted: "Steady State Temperature Contour" and "Steady
Temperature with a Block" -- contour plots of T over the unit square
0 ≤ x, y ≤ 1, with contour levels from 0.1 to 0.9.)

1.2 Mathematical Preliminaries


1.2.1 Intermediate Value Theorem
Let f(x) be a continuous function on the finite interval
a ≤ x ≤ b, and define
m = Infimum f(x),    M = Supremum f(x)    (a ≤ x ≤ b)
Then for any number z in the interval [m, M], there is at least
one point ξ in [a, b] for which
f(ξ) = z.
In particular, there are points x̲ and x̄ in [a, b] for which
m = f(x̲) and M = f(x̄).

(Figure omitted: f(x) on [a, b] attaining its bounds m and M.)

1.2.2 Mean Value Theorem

Let f(x) be a continuous function on the finite interval a ≤ x ≤ b,
and let it be differentiable for a < x < b. Then there is at least
one point ξ in [a, b] for which

f(b) - f(a) = f'(ξ)(b - a).    (1)

The graphical interpretation of this theorem is shown below.

(Figure omitted: the chord from (a, f(a)) to (b, f(b)), of slope
[f(b) - f(a)]/(b - a), is parallel to the tangent line of slope f'(ξ).)

1.2.3 Integral Mean Value Theorem (IMVT)

Let w(x) be nonnegative and integrable on [a, b] and let f(x) be
continuous on [a, b]. Then

∫_a^b w(x) f(x) dx = f(ξ) ∫_a^b w(x) dx    (2)

for some ξ in [a, b].

(Proof: see Example 1 in Supplemental Reading)

1.2.4 Taylor series expansion

* Let f(x) have n+1 continuous derivatives on [a, b] for some
n ≥ 0 and let x, x0 ∈ [a, b]. Then

f(x) = P_n(x) + R_(n+1)(x)    (3)

where

P_n(x) = f(x0) + (x-x0)/1! f'(x0) + (x-x0)²/2! f''(x0)
       + (x-x0)³/3! f'''(x0) + ... + (x-x0)^n/n! f^(n)(x0)    (4)

R_(n+1)(x) = (1/n!) ∫_x0^x (x-t)^n f^(n+1)(t) dt
     (IMVT) = (x-x0)^(n+1)/(n+1)! f^(n+1)(ξ)    (ξ between x0 & x)    (5)

           = truncation error of the expansion = T.E.

(Figure omitted: the approximation of f(x) using 0th, 1st, & 2nd
order Taylor series expansions.)

* Examples: (In Eq. (4), replace x - x0 by h, and x0 by x)

f(x+h) = f(x) + h f'(x) + h²/2! f''(x) + h³/3! f'''(x) + ...    (6)

f(x-h) = f(x) - h f'(x) + h²/2! f''(x) - h³/3! f'''(x) + ...    (7)

* List of Taylor series for commonly encountered functions (with x0 = 0):

(1+x)^α = 1 + αx + α(α-1)/2! x² + α(α-1)(α-2)/3! x³ + ...,      |x| < 1

1/(1-x) = 1 + x + x² + x³ + x⁴ + ...,                           |x| < 1

√(1+x) = 1 + x/2 - x²/8 + x³/16 - 5x⁴/128 + 7x⁵/256 - ...,      |x| < 1

sin(x) = x - x³/3! + x⁵/5! - x⁷/7! + ...

cos(x) = 1 - x²/2! + x⁴/4! - x⁶/6! + ...

tan(x) = x + x³/3 + 2x⁵/15 + 17x⁷/315 + 62x⁹/2835 + ...,        |x| < π/2

sec(x) = 1 + x²/2 + 5x⁴/24 + 61x⁶/720 + 277x⁸/8064 + ...,       |x| < π/2

arcsin(x) = x + x³/6 + 3x⁵/40 + 5x⁷/112 + 35x⁹/1152 + ...,      |x| < 1

arccos(x) = π/2 - x - x³/6 - 3x⁵/40 - 5x⁷/112 - ...,            |x| < 1

arctan(x) = x - x³/3 + x⁵/5 - x⁷/7 + x⁹/9 - ...,                |x| < 1

exp(x) = 1 + x + x²/2! + x³/3! + x⁴/4! + x⁵/5! + ...

sinh(x) = x + x³/3! + x⁵/5! + x⁷/7! + ...

cosh(x) = 1 + x²/2! + x⁴/4! + x⁶/6! + ...

sech(x) = 1/cosh(x) = 1 - x²/2 + 5x⁴/24 - 61x⁶/720 + 277x⁸/8064 - ...,  |x| < π/2

tanh(x) = x - x³/3 + 2x⁵/15 - 17x⁷/315 + 62x⁹/2835 - ...,       |x| < π/2

sinh⁻¹(x) = x - x³/6 + 3x⁵/40 - 5x⁷/112 + 35x⁹/1152 - ...,      |x| < 1

tanh⁻¹(x) = x + x³/3 + x⁵/5 + x⁷/7 + x⁹/9 + ...,                |x| < 1

ln(1+x) = x - x²/2 + x³/3 - ...,                                |x| < 1

ln[(1+x)/(1-x)] = 2x + 2x³/3 + 2x⁵/5 + 2x⁷/7 + ...,             |x| < 1

* Challenging problem:
Can you use Taylor series expansion to evaluate the following limit?

lim_(x→∞) x² [ (1 + 1/(x+1))^(x+1) - (1 + 1/x)^x ] = ?

(See Example 2 in Supplemental Reading Material for details)

* Example:  What are the errors, or remainders, R4, in the
Taylor series expansion of sin(x)?

Soln.  For f(x) = sin(x):  f'(x) = cos(x),  f''(x) = -sin(x),
f'''(x) = -cos(x),  f''''(x) = sin(x)

=>  sin(x) = x - x³/3! + ... = P3(x) + R4(x)

with P3(x) = x - x³/3!

& R4(x) = (1/3!) ∫_0^x (x-t)³ f''''(t) dt = (1/3!) ∫_0^x (x-t)³ sin(t) dt

 = -(1/3!)(1/4) ∫_0^x sin(t) d(x-t)⁴

(integration by parts =>)

 = -(1/3!)(1/4) [ sin(t)(x-t)⁴ |_0^x - ∫_0^x (x-t)⁴ cos(t) dt ]

 = (1/4!) ∫_0^x cos(t)(x-t)⁴ dt    (IMVT =>)

 = (1/4!) cos(ξ) ∫_0^x (x-t)⁴ dt = (x⁵/5!) cos(ξ).

Hence sin(x) = x - x³/3! + (x⁵/5!) cos(ξ) for ξ between 0 and x.

Note: R4(x) may also be expressed, applying the IMVT directly, as

(1/3!) sin(ξ) ∫_0^x (x-t)³ dt = (x⁴/4!) sin(ξ).

However, since ξ is between 0 & x with |x| ≤ 1, this estimate,
R4 ~ (x⁴/4!) sin(ξ), is not as useful.
Since cos(ξ) ~ 1 for small |x|, R4 ~ x⁵/5! is more useful!

* In practice, use the next term in the TS expansion to represent R_(n+1)(x).
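The bound just derived, |R4(x)| = |sin(x) - P3(x)| ≤ |x|⁵/5!, is easy to check numerically; a sketch in Python:

```python
import math

def P3(x):
    # 3rd-order Taylor polynomial of sin(x) about x0 = 0
    return x - x**3 / math.factorial(3)

# The remainder R4(x) = (x**5/5!) * cos(xi) satisfies |R4| <= |x|**5/120.
for x in (0.1, 0.5, 1.0):
    r4 = math.sin(x) - P3(x)
    bound = abs(x)**5 / math.factorial(5)
    print(x, r4, bound)
    assert abs(r4) <= bound
```

For x = 0.1 the actual remainder is within about 0.03% of the bound, confirming that cos(ξ) ≈ 1 for small |x|.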

1.2.5 Taylor series expansion in two dimensions

Let f(x, y) be n+1 times continuously differentiable for all (x, y)
in some neighborhood of (x0, y0). Then,

f(x0+ξ, y0+η) = f(x0, y0) + Σ_(m=1)^n (1/m!) D^m f(x,y) |_(x0, y0)
              + 1/(n+1)! D^(n+1) f(x,y) |_(x0+θξ, y0+θη)    (8)

where D = ξ ∂/∂x + η ∂/∂y, and 0 ≤ θ ≤ 1.

Example: Consider f(x, y) = ln[1 + 2x + x² + xy + y³]^(1/2).

Find its Taylor series expansion near x = y = 0.


Solution: Method 1.

f(x, y) = (1/2) ln[1 + 2x + x² + xy + y³],    f(0, 0) = 0

∂f/∂x = (2 + 2x + y) / (2[1 + 2x + x² + xy + y³]),    ∂f/∂x (0,0) = 1;

∂f/∂y = (x + 3y²) / (2[1 + 2x + x² + xy + y³]),    ∂f/∂y (0,0) = 0;

∂²f/∂x² = {2[1 + 2x + x² + xy + y³] - (2 + 2x + y)(2 + 2x + y)}
          / (2[1 + 2x + x² + xy + y³]²),    ∂²f/∂x² (0,0) = -1;

∂²f/∂y² = {6y[1 + 2x + x² + xy + y³] - (x + 3y²)(x + 3y²)}
          / (2[1 + 2x + x² + xy + y³]²),    ∂²f/∂y² (0,0) = 0;

∂²f/∂x∂y = {[1 + 2x + x² + xy + y³] - (x + 3y²)(2 + 2x + y)}
           / (2[1 + 2x + x² + xy + y³]²),    ∂²f/∂x∂y (0,0) = 1/2.

Thus, f(x, y) ~ 0 + x + 0·y + (1/2)[(-1)x² + 2·(1/2)xy + 0·y²]

             = x - (1/2)x² + (1/2)xy + ...
Method II:
Let u = 2x + x² + xy + y³.

Note: ln(1+u) = u - u²/2 + u³/3 - ...

f = (1/2) ln[1 + 2x + x² + xy + y³]
  = (1/2) {[2x + x² + xy + y³] - (1/2)[2x + x² + xy + y³]² + ...}
  = x + (1/2)x² + (1/2)xy - (1/4)(4x²) + ... = x - (1/2)x² + (1/2)xy + ...

(This approach is much simpler!)

1.2.6 Fourier transform F:

i. Periodic functions

(Figure omitted: a periodic function f(x) plotted over several periods,
from -3L to 5L; p = 2L in the graph.)

We note: f(x + 2L) = f(x).
f(x) is a periodic function if f(x + p) = f(x),
and p is the period of this periodic function.

ii. Fourier series expansion

f(x) = a0 + Σ_(n=1)^∞ [a_n cos(nπx/L) + b_n sin(nπx/L)]

where

a0 = (1/(2L)) ∫_(-L)^L f(x) dx

a_k = (1/L) ∫_(-L)^L f(x) cos(kπx/L) dx

b_k = (1/L) ∫_(-L)^L f(x) sin(kπx/L) dx

Example of Fourier series applications:
In many situations, we are faced with a more complicated signal
that may not have an obvious single period:

(Figure omitted: the signal; the x-axis is time in seconds.)

If we sample this signal at a rate of 4096 Hz, collect 1024 samples,
and find the Fourier series coefficients, we get the spectrum:

(Figure omitted: the spectrum.)

The spectrum tells us that the original signal is a composite of 3 pure
tones. In fact, these tones are at 60 Hz, 300 Hz and 1200 Hz.
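The spectrum computation above can be sketched with NumPy's FFT. The sampling rate, sample count, and tone frequencies come from the text; the tone amplitudes are made-up for illustration:

```python
import numpy as np

fs, N = 4096, 1024                  # sampling rate (Hz) and number of samples
t = np.arange(N) / fs
# Composite of 3 pure tones; amplitudes chosen arbitrarily
x = (np.sin(2*np.pi*60*t) + 0.5*np.sin(2*np.pi*300*t)
     + 0.25*np.sin(2*np.pi*1200*t))

X = np.fft.rfft(x)                  # one-sided Fourier coefficients
freqs = np.fft.rfftfreq(N, d=1/fs)  # frequency (Hz) of each coefficient

# The 3 largest spectral peaks recover the tone frequencies
peaks = freqs[np.argsort(np.abs(X))[-3:]]
print(sorted(peaks))                # -> [60.0, 300.0, 1200.0]
```

With fs/N = 4 Hz resolution, all three tones fall exactly on FFT bins, so the peaks are sharp with no spectral leakage.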

iii. Fourier transform F

Consider f(x), -∞ < x < ∞, with ∫_(-∞)^∞ |f(x)| dx < ∞.

The Fourier transform of f(x) is:

f̂(w) = (1/√(2π)) ∫_(-∞)^∞ f(x) e^(-iwx) dx.    (-∞ < w < ∞)

The Fourier inverse transform:

f(x) = (1/√(2π)) ∫_(-∞)^∞ f̂(w) e^(iwx) dw.

The Fourier transform is a continuous version of the Fourier series.
Or, the Fourier series is a discrete version of the Fourier transform.

Physical interpretation of the Fourier transformation:

|f̂(w)|² dw is the energy in the frequency range (w, w + dw);
|f̂(w)|² is called the spectral density or energy spectrum.

Some properties of the Fourier transform (write f(x) ↔ f̂(w)):

a) Linearity of the Fourier transform

F{a f(x) + b g(x)} = (1/√(2π)) ∫_(-∞)^∞ [a f(x) + b g(x)] e^(-iwx) dx

 = a (1/√(2π)) ∫_(-∞)^∞ f(x) e^(-iwx) dx + b (1/√(2π)) ∫_(-∞)^∞ g(x) e^(-iwx) dx

 = a F{f(x)} + b F{g(x)}
b) Fourier transform of the derivative of f(x):

F{f'(x)} = iw F{f(x)}

Why? Note:

F{f'(x)} = (1/√(2π)) ∫_(-∞)^∞ f'(x) e^(-iwx) dx

 = (1/√(2π)) { f(x) e^(-iwx) |_(-∞)^(+∞) + iw ∫_(-∞)^∞ f(x) e^(-iwx) dx }

 = iw F{f(x)}    (using f(±∞) = 0)

And similarly

F{f''(x)} = (iw)² F{f(x)} = -w² F{f(x)}

An application of the Fourier transform

Apply the Fourier transform to a constant coefficient ODE:

a y'' + b y' + c y = g(t)

and an algebraic equation is obtained:

F{a y'' + b y' + c y = g(t)}   =>   -w² a ŷ(w) + iwb ŷ(w) + c ŷ(w) = ĝ(w)

Thus it is easier to solve:

ŷ(w) = ĝ(w) / [c + ibw - aw²],    y(t) = F⁻¹{ŷ(w)}

We will use the Fourier transform to study the solution behavior of:

∂u/∂t + c ∂u/∂x = ν ∂²u/∂x² + D ∂³u/∂x³

in order to understand the roles of advection (c), diffusion (ν), and
dispersion (D).

1.3 Sources of Errors in Computations:


1.3.1 Absolute and relative errors:

True value (T.V.) xT = Approximate value (A.V.) xA + Error    (9)

Absolute error = T.V. - A.V.

Relative error:  Rel.(xA) = (T.V. - A.V.) / T.V. = (xT - xA) / xT    (10)

1.3.2 Types of errors

Modeling error -- e.g. neglecting friction in computing
a bullet trajectory
Empirical measurements -- g (gravitational acceleration),
h (Planck constant), ...
Blunders
Inexact input data -- e.g. weather prediction based on collected data
Round-off error -- e.g. 3.1415927 instead of
3.1415926535897932384...
Truncation error -- e.g. e^x ≈ 1 + x + x²/2! + x³/3! + x⁴/4! for small x,
or dy/dt ≈ (y_(n+1) - y_n)/Δt for small Δt.

* Example: The surface area of the earth may be approximated as
A = 4πr².
Errors enter the various approximations:
The earth is modeled as a sphere (an idealization).
The earth radius r ≈ 6378.14 km comes from measurements.
π ≈ 3.14159265.
A calculator or computer has finite word length; the result is rounded.

* Example of Truncation Error in a Taylor series:

f(x) = f(x0) + (x-x0)/1! f'(x0) + (x-x0)²/2! f''(x0) + (x-x0)³/3! f'''(x0)
     + ... + (x-x0)^n/n! f^(n)(x0) + R_(n+1)(x)    (4)

R_(n+1)(x) = Remainder or Truncation Error (E_T)

R_(n+1)(x) or E_T can be estimated as

R_(n+1)(x) = (1/n!) ∫_x0^x (x-t)^n f^(n+1)(t) dt
           = (x-x0)^(n+1)/(n+1)! f^(n+1)(ξ)    (5)

with ξ between x and x0.

To understand the roundoff error, we must first look into floating
point arithmetic.

1.4 Floating Point Arithmetic

1.4.1 Anatomy of a floating-point number
There are three fields in a 32-bit IEEE 754 float: sign, exponent, and fraction.

Example: representation of (0.15625)10 in a binary 32-bit float:
(0.15625)10 = 0.125 + 0.03125 = 1/2³ + 1/2⁵ = (0.00101)2

Example: how do we represent 1/10 in binary?
Solution: 1/10 = 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹² + 1/2¹³ + ...
= (0.0001100110011001100110011001100...)2

The pattern repeats, never ending; the stored number is inexact.

Is such inexactness important?
Yes! Very important! You need to know your weapon well in order to
use it effectively.
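The inexactness of 1/10 is directly visible from Python, whose floats are IEEE 754 doubles; a sketch:

```python
from decimal import Decimal
from fractions import Fraction

# The double nearest to 1/10 is slightly LARGER than 1/10:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Its exact value is an integer over a power of two, as Eq. (11) requires:
print(Fraction(0.1))     # 3602879701896397/36028797018963968  (= m / 2**55)

# Consequence: accumulated 0.1's drift away from the exact answer
s = sum([0.1] * 10)
print(s == 1.0, s)       # False 0.9999999999999999
```

Every binary floating-point number is some integer times a power of two, and 1/10 is not, so the repeating pattern above is unavoidable.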

Real-life Examples
--Disasters Caused by Computer Arithmetic Error

The Patriot Missile Failure
On February 25, 1991, during the Gulf
War, an American Patriot Missile
battery in Dharan, Saudi Arabia, failed
to track and intercept an incoming
Iraqi Scud missile. The Scud struck an American Army barracks,
killing 28 soldiers and injuring around 100 other people.
The Patriot missile had an on-board timer that incremented
every tenth of a second.
The software accumulated a floating point time value by adding 0.1
seconds per tick.
The problem is that 0.1 in floating point is not exactly 0.1. With a 23-bit
representation it is really only 0.0999999046326.
Thus, after 100 hours (3,600,000 ticks), the software timer was
off by 0.3433 seconds.
A Scud missile travels at 1676 m/s. In 0.3433 seconds, the Scud
was 573 meters away from where the Patriot thought it was.
This was far enough that the incoming Scud was outside the
"range gate" that the Patriot tracked.
See government investigation report:
http://www.fas.org/spp/starwars/gao/im92026.htm


General computer representation of a floating point number x:

x:  σ (.d1d2d3...dt) × B^e    (11)

(e.g. -.110101 × 2^10 in binary)

σ = sign (±)
B = number base: 2 or 10 or 16
d1d2d3...dt = mantissa, or fractional part of the significand;
    d1 ≠ 0;  0 ≤ di ≤ B-1, i = 1, 2, ..., t
t = number of significant digits, e.g. t = 24;
    it gives the PRECISION of x
e = exponent or characteristic, L ≤ e ≤ U
    (e.g. L = -126, U = 127);
    it determines the RANGE of x.

In reality, Eq. (11) represents

x = σ (d1/B + d2/B² + d3/B³ + ... + dt/B^t) × B^e    (12)

e.g.  -.110101 × 2^11
    = -(1/2 + 1/2² + 0/2³ + 1/2⁴ + 0/2⁵ + 1/2⁶) × 2^11    (in base 2)
    = -0.828125 × 2^11 = -1696    (in base 10)

1.4.2 IEEE Standard for single precision (for base 2 only)

* 32-bit IEEE 754 float:

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
bit 0: S (sign);  bits 1-8: E (exponent);  bits 9-31: F (fraction)

The value V represented by the word may be determined as follows:

If E=255 and F is nonzero, then V = NaN ("Not a Number").

If E=255 and F is zero and S is 1, then V = -Infinity; (-1)^S = -1.

If E=255 and F is zero and S is 0, then V = Infinity; (-1)^S = 1.

If 0<E<255, then V = (-1)^S * 2^(E-127) * (1.F)
where "1.F" is intended to represent the binary number created
by prefixing F with an implicit leading 1 (d0 = 1) and a binary point.

In the above, the exponent is stored with 127 added to it,
also called "biased with 127".
Thus, none of the 8 bits is used to store the sign of the exponent E.
But the actual exponent e is equal to E - 127.

Since E=255 is reserved (for NaN and ±Infinity), the largest usable E is 254
=> U = 254 - 127 = 127.

If E=0 and F is nonzero, then V = (-1)^S * 2^(-126) * (0.F).
These are "unnormalized" (denormalized) values.
That is why L = -126.

If E=0 and F is zero and S is 1, then V = -0.

If E=0 and F is zero and S is 0, then V = 0.

The reason for having |L| < U is so that the reciprocal of the
smallest normalized number, 2^(-L), will not overflow. Although it is
true that the reciprocal of the largest number will underflow, underflow
is usually less serious than overflow.

1.4.3 IEEE Standard for double precision

Double precision refers to a type of floating-point number that
has more precision (that is, more digits to the right of the
decimal point) than a single precision number. The term double
precision is something of a misnomer, because the precision is
not really double.
The word double derives from the fact that a double precision
number uses twice as many bits as a regular floating-point
number. For example, if a single precision number requires 32
bits, its double precision counterpart will be 64 bits long (see
the partition of three fields shown above).
The extra bits increase not only the precision (t) but also the range
(e) of magnitudes that can be represented. The exact amount
by which the precision and range of magnitudes are increased
depends on what format the program is using to represent
floating-point values. Most computers use a standard format
known as the IEEE floating-point format.

Brief summary:

            t (mantissa)    L        U       Total length
IEEE SP     23              -126     127     32
IEEE DP     52              -1022    1023    64

SP:  23 bits go to the PRECISION (t)
     8 bits go to the RANGE (L or U)
     1 bit goes to the SIGN
     Adding up: 32 bits for the single precision representation

DP:  64 = 52 (t) + 11 (e) + 1 (S)

1.4.4 Total number of floating point numbers

A floating point system CANNOT represent arbitrary real
numbers, even those of modest magnitude.
The total number of floating point numbers that can be
produced by the system

x = σ (.d1d2d3...dt) × B^e

is

2 (B-1) B^(t-1) (U - L + 1) + 1    (13)

2:        accounts for the sign ±
(B-1):    number of possibilities in choosing d1 ≠ 0
B^(t-1):  number of possibilities in choosing d2, d3, ..., dt
U-L+1:    number of possibilities in choosing e
1:        for representing the number x = 0

That is why, for example, 0.1 in decimal cannot be represented
exactly in a binary number representation:

1/10 = 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹² + 1/2¹³ + ...
     = (0.0001100110011001100110011001100...)2
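Formula (13) can be verified by brute force on a toy system; a sketch with B = 2, t = 3, L = -1, U = 2:

```python
# Toy system: x = +/- (0.d1 d2 d3)_2 * 2^e, with d1 != 0 and -1 <= e <= 2
B, t, L, U = 2, 3, -1, 2

values = {0.0}
for sign in (+1, -1):
    for d1 in range(1, B):                  # d1 != 0 (normalized)
        for d2 in range(B):
            for d3 in range(B):
                for e in range(L, U + 1):
                    m = d1/B + d2/B**2 + d3/B**3
                    values.add(sign * m * B**e)

count_formula = 2 * (B - 1) * B**(t - 1) * (U - L + 1) + 1
print(len(values), count_formula)           # -> 33 33
```

Because the normalized mantissa lies in [1/2, 1), values with different exponents occupy disjoint ranges, so all 32 nonzero combinations are distinct and the enumeration matches the formula exactly.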

1.4.5 Smallest and largest positive numbers

Smallest positive (normalized) number xL and underflow:

xL = (.100...0)B × B^L = B^(L-1)    (14)

(= 2^(-127) = 5.877×10^(-39) for IEEE single precision)
If x < xL: underflow, i.e. the computer may treat x as 0.

Largest positive number xU and overflow:

xU = (.βββ...β)B × B^U = (1 - B^(-t)) B^U,   where β = B-1    (15)

(≈ 2^127 = 1.7×10^38 for IEEE single precision)
If x > xU: overflow;
computers treat x as "∞", Inf, or NaN ("Not a Number").

Is it important to know the conditions for overflow and underflow?
Absolutely!

Real-life Example
--Disasters Caused by Computer Arithmetic Error
Ariane Rocket
On June 4, 1996 an unmanned Ariane 5 rocket was launched.
The rocket was on course for 36 seconds and then veered off
and crashed.
The internal reference system was trying to convert a 64-bit
floating point number to a 16-bit integer.
This caused an overflow which was sent to the onboard
computer.
The on-board computer interpreted the overflow as real flight
data and bad things happened.
The destroyed rocket and its cargo were valued at $500 million.
The rocket was on its first voyage, after a decade of
development costing $7 billion.


Overflow and underflow experiment:

Write a computer program, using x = 2^k with k = 1, 2, ..., to
determine the largest floating point number your computer can
handle before overflow occurs.
Then use y = 2^(-k), k = 1, 2, ..., to determine the smallest floating
point number your computer can handle before underflow occurs.

Results of the experiment on an Alpha workstation:

single precision:
k_up = 127,     x_up = 1.7014118E+38
k_low = -126,   y_low = 1.1754944E-38

double precision:
k_up = 1023,    x_up = 8.988465674311580E+307
at k > 1023, x blows up: overflow
k_low = -1022,  y = 2.225073858507201E-308
at k < -1022, y = 0.0: underflow
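The same experiment can be run in Python, whose floats are IEEE double precision; a sketch:

```python
import math

# Grow x = 2**k until the next doubling overflows to inf
x, k_up = 1.0, 0
while not math.isinf(x * 2.0):
    x, k_up = x * 2.0, k_up + 1
print("k_up =", k_up, " x_up =", x)     # k_up = 1023, x_up ~ 8.99e+307

# Shrink y = 2**-k until it underflows to 0.0
y, k_low = 1.0, 0
while y / 2.0 > 0.0:
    y, k_low = y / 2.0, k_low + 1
print("k_low =", -k_low, " y_low =", y)
# Note: k_low comes out as 1074, not 1022 -- IEEE "gradual underflow"
# (the denormalized numbers of Sec. 1.4.2, E = 0) extends the range
# below the smallest normalized number 2**-1022.
```

The overflow limit matches the Alpha result; the underflow limit differs because modern IEEE arithmetic fills in denormalized values between 2^-1074 and 2^-1022 instead of flushing to zero.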

1.4.6 Round-off Error and machine precision

Rounding:
e.g. x = 2/3 = 0.666666666666...
To keep 7 decimal digits,
rounding to nearest gives x = 0.6666667 (the 8th digit 6 > 5, so
1 is added to the 7th digit).

* Example: T = 3.1415926535897932...
If A = 3.14159,
then |roundoff error| = |A - T| = 0.00000265358979... < 0.000005

Note:
A         |roundoff error|
3.14      0.00159...  < 0.005
3.142     0.00040...  < 0.0005
3.1416    0.0000073   < 0.00005

x = 5.2468 ~ 5.247, or x ~ 5.25, but x ≠ 5.3.
If you want to keep one decimal, then x ~ 5.2;
i.e. rounding is not transitive.

Chopping:
e.g. x = 2/3 = 0.666666666666...
To keep 7 decimal digits, chopping gives x = 0.6666666
(the 8th digit 6 is simply chopped).
e.g. A = 3.1415 after chopping:
|roundoff error| = 0.000092... < 0.0005
It is larger than rounding to nearest.

Is it important to care about chopping vs. rounding?

Major difference between chopping and rounding:
The error in chopping is always non-negative, since the chopped
number is never larger in magnitude than the original number.
This can cause a systematic skew in the summation Σ_(j=1)^M x_j !!!
The error in rounding can be either positive or negative.
Thus the round-off error in computing Σ_(j=1)^M x_j will be smaller,
since some of the round-off errors will cancel out.

Real-life Example
--Disasters Caused by Computer Arithmetic Error

Vancouver Stock Exchange

In 1982, the index was initiated with a starting value of
1000.000, computed with three digits of precision and truncation.
After 22 months, the index was at 524.881.
The index should have been at 1009.811.
Successive additions and subtractions introduced truncation
error that caused the index to be off by that much.
(A detailed two-page explanation is omitted from this excerpt.)

Machine precision or machine epsilon ε_mach
-- a measure of accuracy or precision:

chopping:  ε_mach = B^(1-t)        (= 2^(1-23) = 2.384×10^(-7) for B=2, t=23)
rounding:  ε_mach = (1/2) B^(1-t)  (= 2^(-23)  = 1.192×10^(-7) for B=2, t=23)

Note:
if x < ε_mach then 1 + x = 1 in machine computation.

The unit round δ of a computer is the number that satisfies the
following:
i)  it is a positive floating-point number;
ii) it is the smallest such number for which

    fl(1 + δ) > 1    (16)

where "fl" means the "floating-point" representation of the number.

* Thus, for any x < δ, we have fl(1 + x) = 1.
=> δ is a precise measure of how many digits of accuracy are
possible in representing a number.

Machine precision experiment:

Write a computer program, using δ = 2^(-k), with k = 1 to 34 for
single precision and k = 1 to 60 for double precision, to
determine δ for the machine you are using, for both single
precision and double precision operations.
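A Python version of this experiment, as a sketch (Python floats are double precision, so the loop stops near k = 52-53):

```python
import sys

del_, k = 1.0, 0
while 1.0 + del_ / 2.0 > 1.0:    # halve del until fl(1 + del/2) == 1
    del_, k = del_ / 2.0, k + 1

print(k, del_)                   # 52  2.220446049250313e-16  (= 2**-52)
print(sys.float_info.epsilon)    # the library agrees: 2.220446049250313e-16
```

The loop stops once del/2 = 2^-53: adding that to 1 rounds back to 1 (round-to-nearest-even), which is exactly the fl(1 + δ) > 1 criterion of Eq. (16).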

* Example: On an Alpha workstation or a DEC 5000 machine,

Single precision:
k = 23    del = 0.000000119       still truthful
k = 24    del = 5.9604645E-08     no longer truthful

Program:

      del=1.0
      do k=1,35
        del=del/2.0
        f=1+del
        z=f-1
        write(6,12) k, del, f, z
      enddo
 12   format(1x,i3,2x,f15.11,2x,f13.10,2x,f13.9)
      stop
      end

Results (Single precision):

K    del              f               z
1    0.50000000000    1.5000000000    0.500000000
2    0.25000000000    1.2500000000    0.250000000
3    0.12500000000    1.1250000000    0.125000000
4    0.06250000000    1.0625000000    0.062500000
5    0.03125000000    1.0312500000    0.031250000
6    0.01562500000    1.0156250000    0.015625000
7    0.00781250000    1.0078125000    0.007812500
8    0.00390625000    1.0039062500    0.003906250
9    0.00195312500    1.0019531250    0.001953125
10   0.00097656250    1.0009765625    0.000976562
11   0.00048828125    1.0004882812    0.000488281
12   0.00024414062    1.0002441406    0.000244141
13   0.00012207031    1.0001220703    0.000122070
14   0.00006103516    1.0000610352    0.000061035
15   0.00003051758    1.0000305176    0.000030518
16   0.00001525879    1.0000152588    0.000015259
17   0.00000762939    1.0000076294    0.000007629
18   0.00000381470    1.0000038147    0.000003815
19   0.00000190735    1.0000019073    0.000001907
20   0.00000095367    1.0000009537    0.000000954
21   0.00000047684    1.0000004768    0.000000477
22   0.00000023842    1.0000002384    0.000000238
23   0.00000011921    1.0000001192    0.000000119
24   0.00000005960    1.0000000000    0.000000000
25   0.00000002980    1.0000000000    0.000000000
26   0.00000001490    1.0000000000    0.000000000
27   0.00000000745    1.0000000000    0.000000000
28   0.00000000373    1.0000000000    0.000000000
29   0.00000000186    1.0000000000    0.000000000
30   0.00000000093    1.0000000000    0.000000000
31   0.00000000047    1.0000000000    0.000000000
32   0.00000000023    1.0000000000    0.000000000
33   0.00000000012    1.0000000000    0.000000000
34   0.00000000006    1.0000000000    0.000000000

Double precision:
k = 53    del = 1.1102230E-016    still truthful
k = 54    del = 5.55E-017         no longer truthful

Program:

      implicit double precision (a-h,o-z)
      del=1.0
      do k=1,64
        del=del/2.0
        f=1+del
        z=f-1
        write(8,12) k,del,f,z
      enddo
 12   format(1x,i4,2x,e21.13,2x,f22.17,2x,e16.9)
      stop
      end
Result (double precision):

K    del                    f                      z
30   0.9313225746155E-09    1.00000000093132257    0.93132E-09
31   0.4656612873077E-09    1.00000000046566129    0.46566E-09
32   0.2328306436539E-09    1.00000000023283064    0.23283E-09
33   0.1164153218269E-09    1.00000000011641532    0.11642E-09
34   0.5820766091347E-10    1.00000000005820766    0.58208E-10
35   0.2910383045673E-10    1.00000000002910383    0.29104E-10
36   0.1455191522837E-10    1.00000000001455192    0.14552E-10
37   0.7275957614183E-11    1.00000000000727596    0.72760E-11
38   0.3637978807092E-11    1.00000000000363798    0.36380E-11
39   0.1818989403546E-11    1.00000000000181899    0.18190E-11
40   0.9094947017729E-12    1.00000000000090949    0.90949E-12
41   0.4547473508865E-12    1.00000000000045475    0.45475E-12
42   0.2273736754432E-12    1.00000000000022737    0.22737E-12
43   0.1136868377216E-12    1.00000000000011369    0.11369E-12
44   0.5684341886081E-13    1.00000000000005684    0.56843E-13
45   0.2842170943040E-13    1.00000000000002842    0.28422E-13
46   0.1421085471520E-13    1.00000000000001421    0.14211E-13
47   0.7105427357601E-14    1.00000000000000711    0.71054E-14
48   0.3552713678801E-14    1.00000000000000355    0.35527E-14
49   0.1776356839400E-14    1.00000000000000178    0.17764E-14
50   0.8881784197001E-15    1.00000000000000089    0.88818E-15
51   0.4440892098501E-15    1.00000000000000044    0.44409E-15
52   0.2220446049250E-15    1.00000000000000022    0.22204E-15
53   0.1110223024625E-15    1.00000000000000000    0.00000E+00
54   0.5551115123126E-16    1.00000000000000000    0.00000E+00
55   0.2775557561563E-16    1.00000000000000000    0.00000E+00
56   0.1387778780781E-16    1.00000000000000000    0.00000E+00
57   0.6938893903907E-17    1.00000000000000000    0.00000E+00
58   0.3469446951954E-17    1.00000000000000000    0.00000E+00
59   0.1734723475977E-17    1.00000000000000000    0.00000E+00
60   0.8673617379884E-18    1.00000000000000000    0.00000E+00
61   0.4336808689942E-18    1.00000000000000000    0.00000E+00
62   0.2168404344971E-18    1.00000000000000000    0.00000E+00
63   0.1084202172486E-18    1.00000000000000000    0.00000E+00
64   0.5421010862428E-19    1.00000000000000000    0.00000E+00

1.5 Significant Digits

Definition:
XA has m significant digits w.r.t. XT if the error |XT - XA| has
magnitude ≤ 5 in the (m+1)th digit, counting from the first
non-zero digit in XT.

Examples:

Example 1.    XT = 3.17286    (digit positions 1 2 3 4 5 6)

If XA = 3.17,  then |XT - XA| = 0.00286 < 0.005   (m+1 = 4 => m = 3)
If XA = 3.172, then |XT - XA| = 0.00086 < 0.005   (m+1 = 4 => m = 3)
If XA = 3.173, then |XT - XA| = 0.00014 < 0.0005  (m+1 = 5 => m = 4)

Example 2.    XT = 389.674    (digit positions 1 2 3 4 5 6)

If XA = 389.78, then |XT - XA| = 0.106 < 0.5      (m+1 = 4 => m = 3)
If XA = 389.7,  then |XT - XA| = 0.026 < 0.05     (m+1 = 5 => m = 4)

1.6 Interaction of Roundoff Error with Truncation Error

Consider f(x) = e^x, whose EXACT derivative is f'(x) = e^x.
At x = 0, the EXACT value is f'(0) = 1.

* Finite difference method 1: forward difference

TS expansion to O(h) =>

f'(x) ≈ [f(x+h) - f(x)]/h + O(h)    (17)

That is, numerically, f'(0) ≈ Δf/Δx = (e^h - 1)/h    (method 1)

* Error(x, h) = |f'(x) - Δf/Δx| = |e^x - [f(x+h) - f(x)]/h|:

(Figure omitted: log-log plot of |f'(0) - 1| vs. h for f'(0) ≈ [exp(h)-1]/h.
As h decreases, the error first decreases (decreasing truncation error),
then increases again (growing roundoff error).)

* Why does the error behave in such a manner?

* The roundoff error, measured using Excel, in computing the difference
between two O(1) numbers, [f(x+h) - f(x)], is roughly around
1.11×10^(-16).
* Thus the roundoff error (R.E.) for f' is

R.E. ~ 1.11×10^(-16) / h    (18)

R.E. is small for larger h, but it increases as h decreases.

* Truncation error based on Taylor series expansion:

f(x+h) = f(x) + f'(x) h + f''(x) h²/2 + f'''(x) h³/6 + ...

[f(x+h) - f(x)]/h = f'(x) + f''(x) h/2 + f'''(x) h²/6 + ...    (19)

Thus in approximating f'(x) by [f(x+h) - f(x)]/h, we commit
an error of f''(x) h/2 to the leading order.
Hence the T.E. in this example (x = 0, f''(0) = e^0 = 1) is:

T.E. = h/2    (20)

=> Total error = R.E. + T.E. ~ 1.11×10^(-16)/h + h/2    (21)
(The same figure as above: the measured error decreases with the
truncation error, then increases due to roundoff error as h shrinks,
consistent with Eq. (21).)

Finite difference method 2: central difference

If we use the central difference scheme to compute f'(x):

f'(x) ≈ [f(x+h) - f(x-h)]/(2h)    (22)

the truncation error is smaller, as shown below.

Truncation error (T.E.):
Taylor series expansion:

f(x-h) = f(x) - f'(x) h + f''(x) h²/2 - f'''(x) h³/6 + ...

[f(x+h) - f(x-h)]/(2h) = f'(x) + f'''(x) h²/6 + ...    (23)

Thus in approximating f'(x) by [f(x+h) - f(x-h)]/(2h), we
commit an error of f'''(x) h²/6 to the leading order.
Hence the truncation error in this example is

T.E. = f'''(x) h²/6 + ... = h²/6 at x = 0.    (24)

(Figure omitted: log-log plot of |f2'(0) - 1| vs. h for
f2'(0) = [exp(h) - exp(-h)]/(2h); the truncation error decreases as
h²/6 until the roundoff error, increasing like 1/h, takes over.)
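Both error curves can be reproduced in a few lines; a sketch in Python for f(x) = e^x at x = 0 (the notes' computations were done in Excel):

```python
import math

def fwd(h):
    # forward difference, Eq. (17): error ~ h/2 + roundoff/h
    return (math.exp(h) - 1.0) / h

def ctr(h):
    # central difference, Eq. (22): error ~ h**2/6 + roundoff/h
    return (math.exp(h) - math.exp(-h)) / (2.0 * h)

for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-12):
    print(f"{h:8.0e}  fwd err {abs(fwd(h)-1.0):.2e}  ctr err {abs(ctr(h)-1.0):.2e}")
# For moderate h the errors track h/2 and h**2/6; for tiny h both are
# swamped by roundoff, which grows like 1e-16/h.
```

Printing the two columns side by side shows the central scheme reaching a much smaller minimum error, at a larger optimal h, exactly as the figures indicate.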

(Figure omitted: predicted roundoff error 1.11E-16/h and the truncation
errors TE1 = h/2 (forward) and TE2 = h²/6 (central) vs. h.)

(Figure omitted: comparison of the predicted (roundoff + truncation)
errors, R+TE1 and R+TE2, with the actual errors for both schemes.)

Error = Truncation Error + Round-off Error

1.7 Propagation of Errors

Consider zT = xT * yT,  where * is an algebraic operation: +, -, ×, ÷.

First, the computer actually uses xA instead of xT, due to rounding
or because the data itself contains error.
Second, after xA * yA is computed, the computer rounds the result:

zA = fl(xA * yA).    (25)

Thus, the error in the operation * is

zT - zA = xT * yT - fl(xA * yA).    (26)

Let

xT = xA + ε,    yT = yA + η.    (27)

The error is

zT - zA = xT*yT - xA*yA + [xA*yA - fl(xA*yA)]    (28)

The second part, in [ ], is simply due to machine rounding.
It can be easily estimated as

(xA * yA) ε_mach = (xA * yA) (1/2) B^(1-p)

The first part, xT*yT - xA*yA, is the propagated error.

Now consider the propagated error in various operations.

1.7.1 Error in multiplication

Absolute error in multiplication:

xT yT - xA yA = xT yT - (xT - ε)(yT - η)
              = ε yT + η xT - ε η.    (29)

Relative error:

Rel.(xA yA) = (xT yT - xA yA)/(xT yT) = ε/xT + η/yT - (ε/xT)(η/yT).    (30)

Assuming |ε/xT| << 1 and |η/yT| << 1, we obtain

Rel.(xA yA) ≈ ε/xT + η/yT = Rel.(xA) + Rel.(yA).    (31)

1.7.2 Error in division

Absolute error in division:

xT/yT - xA/yA = xT/yT - (xT - ε)/(yT - η).    (32)

Relative error in division:

Rel.(xA/yA) = (xT/yT - xA/yA)/(xT/yT) = 1 - (xA/xT)(yT/yA)

 = 1 - [1 - Rel.(xA)]/[1 - Rel.(yA)]

 ≈ 1 - [1 - Rel.(xA)][1 + Rel.(yA) + ...]    (TS expansion)

 ≈ Rel.(xA) - Rel.(yA) = ε/xT - η/yT.    (33)

1.7.3 Error in addition:

Absolute error:  xT + yT - (xA + yA) = ε + η    (34)

Relative error:  Rel.(xA + yA) = (ε + η)/(xT + yT)    (35)

1.7.4 Error in subtraction:

Absolute error:  xT - yT - (xA - yA) = ε - η    (36)

Relative error:  Rel.(xA - yA) = (ε - η)/(xT - yT)    (37)

Note:
xT - yT may be small due to cancellation
=> large Rel.(xA - yA),
i.e. loss of significance due to subtraction of nearly
equal quantities --- a very important practical issue!

* Example: Error in subtraction:

Compute r = 13 - √168 (= x - y).
Using 5-digit decimals, y = √168 => yA = 12.961 => rA = 0.039.
Exact number: rT = 0.038518603... =>

Error(rA) = 0.038518603 - 0.039 = -0.00048,
or Rel.(rA) = -1.25×10^(-2), which is not small.

Reason: x = 13 and y = √168 are quite close =>
rA has only 2 significant digits after the subtraction.

Improvement:

rA = (13² - 168)/(13 + √168) = 1/(13 + √168) = 1/(13 + 12.961)
   = 0.038519 with 5 significant digits.

=> Rel.(rA) = (0.038518603... - 0.038519)/0.038518603... = -1.03×10^(-5)

=> the magnitude of this error is much smaller than the
previous one (1.25×10^(-2)).

Lesson: avoid subtraction of two close numbers!
Whenever you can, use double precision.
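The 5-digit arithmetic above can be replayed with Python's decimal module; a sketch:

```python
from decimal import Decimal, getcontext

getcontext().prec = 5                 # 5 significant decimal digits

y = Decimal(168).sqrt()               # 12.961
r_bad = Decimal(13) - y               # 0.039 -- only 2 significant digits survive
r_good = 1 / (Decimal(13) + y)        # 0.038519 -- rationalized form
print(y, r_bad, r_good)

r_true = 0.038518603
print(abs(float(r_bad) - r_true) / r_true)    # ~1.25e-2
print(abs(float(r_good) - r_true) / r_true)   # ~1.0e-5
```

Rationalizing replaces the dangerous subtraction by an addition of two close numbers, which propagates almost no error.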

1.7.5 Induced error in evaluating functions

With one variable:

If f(x) has a continuous first order derivative in [a, b],
and xT and xA are in [a, b], then

f(xT) - f(xA) ≈ f'(xA)(xT - xA) + o(xT - xA)    (38)

With two variables:

f(xT, yT) - f(xA, yA) ≈ f'_x(xA, yA)(xT - xA) + f'_y(xA, yA)(yT - yA)
 + o(xT - xA, yT - yA)    (39)

Example:

f(x, y) = x^y  =>  f'_x = y x^(y-1),  f'_y = x^y log x

=> Error(fA) ≈ yA (xA)^(yA - 1) Error(xA) + (xA)^(yA) log(xA) Error(yA)

=> Rel.(fA) ≈ yA Rel.(xA) + yA log(xA) Rel.(yA)

1.7.6 Error in summation

Consider

s = Σ_(j=1)^M x_j.    (40)

In a Fortran program, we write:

      S=0
      DO J = 1, M
        S = S + X(J)
      ENDDO

Equivalently, in the above code we are doing the following:

s2 = fl(x1 + x2) = (x1 + x2)(1 + ε2),    (41a)

where ε2 = machine error due to rounding;

s3 = fl(s2 + x3) = (s2 + x3)(1 + ε3)    (41b)
   = [(x1 + x2)(1 + ε2) + x3](1 + ε3)
   ≈ (x1 + x2 + x3) + ε2(x1 + x2) + ε3(x1 + x2 + x3)    (41c)

s_(k+1) = (s_k + x_(k+1))(1 + ε_(k+1))
   ≈ (x1 + x2 + ... + x_(k+1)) + ε2(x1 + x2) + ε3(x1 + x2 + x3) + ...
   + ε_(k+1)(x1 + x2 + ... + x_(k+1))    (41d)

Error = s_M - (x1 + x2 + x3 + ... + x_M)

 ≈ ε2(x1 + x2) + ε3(x1 + x2 + x3) + ... + ε_M(x1 + x2 + ... + x_M)

 = (x1 + x2)(ε2 + ε3 + ... + ε_M) + x3(ε3 + ... + ε_M) + ... + x_M ε_M    (42)

Since all the εi's are of the same magnitude:

=> the term x1 contributes the most, while x_M contributes the least;

=> we should order the terms so that we add from the smallest values
(x1) to the largest (x_M), to reduce the overall machine error accumulation.
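The effect can be demonstrated by accumulating the harmonic sum in single precision in both orders; a sketch using NumPy's float32 to emulate the notes' single precision runs:

```python
import math
import numpy as np

M = 262144
exact = math.fsum(1.0 / k for k in range(1, M + 1))   # double precision reference

s_big_first = np.float32(0.0)
for k in range(1, M + 1):             # large terms first: 1/1, 1/2, ...
    s_big_first += np.float32(1.0 / k)

s_small_first = np.float32(0.0)
for k in range(M, 0, -1):             # small terms first: 1/M, ..., 1/2, 1/1
    s_small_first += np.float32(1.0 / k)

print(exact, s_big_first, s_small_first)
# Summing small-to-large loses far less accuracy, as the analysis predicts:
# once the running sum is large, each tiny 1/k added to it is rounded away.
```

This reproduces the pattern of the table that follows: at M = 262144 the large-to-small single precision sum is off in the third decimal, while the small-to-large sum agrees with the double precision value to several more digits.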

Example:  Compute S(M) = Σ_(k=1)^M 1/k for M < 10⁸:

i)   summing from k=1 to M using single precision (single: large to small)
ii)  summing from k=M to 1 using single precision (single: small to large)
iii) summing from k=1 to M using double precision (double: large to small)
iv)  summing from k=M to 1 using double precision (double: small to large)

asymptote = ln(M) + 0.5772156649015328

M          single:         single:         double:         double:         asymptote
           large to small  small to large  large to small  small to large
16384      10.2813063      10.28131294     10.2813068      10.28130678     10.2812767
32768      10.9744091      10.97444344     10.9744387      10.9744387      10.9744225
65536      11.667428       11.66758823     11.6675783      11.66757825     11.6675701
131072     12.3600855      12.36073208     12.3607216      12.36072161     12.3607178
262144     13.0513039      13.05388069     13.0538669      13.05386689     13.0538654
524288     13.7370176      13.74705601     13.7470131      13.74701311     13.7470112
1048576    14.4036837      14.44023132     14.4401598      14.44015982     14.4401588
2097152    15.4036827      15.13289833     15.1333068      15.13330676     15.1333065
4194304    15.4036827      15.82960701     15.8264538      15.82645382     15.8264542
8388608    15.4036827      16.51415253     16.5196009      16.51960094     16.5195999
16777216   15.4036827      17.23270798     17.2127481      17.21274809     17.2127476

(Figure omitted: Sum vs. M for the four summation orderings and the
asymptote; the "single: large to small" curve flattens out near 15.4.)

Clearly, the result based on single precision summation, adding from
large to small values, is the most unsatisfactory: the sum stagnates
prematurely, because once 1/k falls below the machine precision times
the running sum, adding further terms no longer changes the sum.
