
I.

Introduction

1.1 Why Numerical Methods?


Example 1. Steady state heat conduction

Governing equation:  ∇²T = 0

(Figure omitted: a square cavity with corner A between walls AB and AD.
Boundary conditions: one wall insulated with ∂T/∂y = 0, another with
∂T/∂x = 0, plus walls at T = T1 and T = T2 > T1. Heat flows through
the cavity from the hot wall to the cold wall.)

How do we determine the heat flow from wall AB to wall AD?

Possible solutions:

1. Experiment
2. Analytical solution
3. Numerical solution
Example 2. Steady state heat conduction in a non-simple geometry

Governing equation:  ∇²T = 0

(Figure omitted: the same cavity, now containing an internal block.
Boundary conditions as in Example 1: insulated walls with ∂T/∂y = 0 and
∂T/∂x = 0, plus walls at T = T1 and T = T2 > T1.)

How do we determine the heat flow from wall AB to wall AD?
Example 3. Unsteady state heat conduction in a non-simple geometry:

∂T/∂t = ∇²T

Experimental approach:
Design the experiment.
Set up a facility to satisfy the boundary conditions (insulation and
constant temperature).
Prepare instrumentation.
Perform the experiment & collect data
(measure the heat flux on wall AB, for example).
Analyze and present the data.
Develop a model (say, to describe the effect of L_AB/L_CD on
heat transfer).
O.K. for all three examples, but it yields no information on T(x, y) inside
the cavity, and it is relatively time consuming & tedious.
Analytical approach:
Solve the mathematical equation (∇²T = 0 or ∂T/∂t = ∇²T) based on
a physical model.
Use the method of separation of variables? Green's function?
O.K. for the simple geometry in Example 1;
unlikely for Examples 2 & 3.

Numerical approach:
Solve equations that can be much more complicated than
∂T/∂t = ∇²T using a computer.
The solution is discrete and approximate, but it can be close to exact.

(Figure omitted: a discrete solution, with values y1, y2, ..., yn, yn+1
at times t1, t2, ..., tn, tn+1.)

A computer program can handle more complicated geometries
(as in Examples 2 & 3 above).
We gain insight into the temperature distribution inside the cavity.
Cautionary Remarks:
NO numerical method is completely trouble free in all
situations.
NO numerical method is completely error free.
NO numerical method is optimal for all situations.
Be careful about:
ACCURACY, EFFICIENCY, & STABILITY.

* Example 4.  Solve a simple ODE:

dy/dt = -10y,    y(0) = 1

First note: the exact solution is y_exact(t) = e^(-10t).

Approximating the LHS dy/dt as (y_(n+1) - y_n)/Δt
& treating the RHS -10y as -10 y_n (n = 0, 1, 2, 3, ...), i.e.

(y_(n+1) - y_n)/Δt = -10 y_n

=>  y_(n+1) = y_n - 10Δt y_n = y_n (1 - 10Δt),    with y_0 = 1.

y_(n+1) = y_n (1 - 10Δt) = y_(n-1) (1 - 10Δt)²
        = y_(n-2) (1 - 10Δt)³ = ... = y_0 (1 - 10Δt)^(n+1)

Choose Δt = 0.05, 0.1, 0.2, and 0.5,
=> 10Δt = 0.5, 1, 2, and 5.
See what happens!
n    y_n(Δt=0.05)   y_n(Δt=0.1)   y_n(Δt=0.2)   y_n(Δt=0.5)
1    0.5            0             -1            -4
2    0.25           0             1             16
3    0.125          0             -1            -64
4    0.0625         0             1             256
5    0.03125        0             -1            -1024
6    0.0156250      0             1             4096
7    0.0078125      0             -1            -16384

Comments:   ok    inaccurate    oscillates    blows up
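The behavior above can be reproduced in a few lines; this is a sketch in Python (the notes' own programs are in Fortran) iterating y_(n+1) = y_n (1 - 10Δt):

```python
# Explicit Euler for dy/dt = -10y, y(0) = 1:  y_{n+1} = y_n * (1 - 10*dt)
def euler(dt, nsteps):
    y = 1.0
    history = [y]
    for _ in range(nsteps):
        y = y * (1.0 - 10.0 * dt)   # amplification factor q = 1 - 10*dt
        history.append(y)
    return history

for dt in (0.05, 0.1, 0.2, 0.5):
    print(dt, euler(dt, 7)[1:])
# |1 - 10*dt| <= 1 is required for the iterates to stay bounded, so
# dt = 0.5 (q = -4) blows up while dt = 0.05 (q = 0.5) decays smoothly.
```

The amplification factor view makes the table immediate: each column is just the powers of q = 1 - 10Δt.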

(Figure omitted: this graph compares the exact solution y_exact(t) = e^(-10t)
with the stable numerical solution for Δt = 0.05, over 0 ≤ t ≤ 1.5.)

Questions:
Why does the solution blow up for Δt = 0.5?
How do we detect/prevent numerical instability (blowing up) in general?
How do we improve accuracy (c.f. the case with Δt = 0.05)?
How do we get the solution efficiently when a large system is solved?

Graphs based on numerical solutions to heat transfer problems (Examples 1 & 2):

(Figures omitted: "Steady State Temperature Contour" and "Steady
Temperature with a Block" -- contour plots of T over the unit square
0 ≤ x, y ≤ 1, with contour levels from 0.1 to 0.9.)

1.2 Mathematical Preliminaries


1.2.1 Intermediate Value Theorem
Let f(x) be a continuous function on the finite interval
a ≤ x ≤ b, and define
m = Infimum f(x),    M = Supremum f(x)    (a ≤ x ≤ b)
Then for any number z in the interval [m, M], there is at least
one point ξ in [a, b] for which
f(ξ) = z.
In particular, there are points x̲ and x̄ in [a, b] for which
m = f(x̲) and M = f(x̄).

(Figure omitted: f(x) on [a, b] attaining its bounds m and M.)

1.2.2 Mean Value Theorem

Let f(x) be a continuous function on the finite interval a ≤ x ≤ b,
and let it be differentiable for a < x < b. Then there is at least
one point ξ in [a, b] for which

f(b) - f(a) = f'(ξ)(b - a).    (1)

The graphical interpretation of this theorem is shown below.

(Figure omitted: the chord from (a, f(a)) to (b, f(b)), of slope
[f(b) - f(a)]/(b - a), is parallel to the tangent line of slope f'(ξ).)

1.2.3 Integral Mean Value Theorem (IMVT)

Let w(x) be nonnegative and integrable on [a, b] and let f(x) be
continuous on [a, b]. Then

∫_a^b w(x) f(x) dx = f(ξ) ∫_a^b w(x) dx    (2)

for some ξ in [a, b].

(Proof: see Example 1 in Supplemental Reading)

1.2.4 Taylor series expansion

* Let f(x) have n+1 continuous derivatives on [a, b] for some
n ≥ 0 and let x, x0 ∈ [a, b]. Then

f(x) = P_n(x) + R_(n+1)(x)    (3)

where

P_n(x) = f(x0) + (x-x0)/1! f'(x0) + (x-x0)²/2! f''(x0)
       + (x-x0)³/3! f'''(x0) + ... + (x-x0)^n/n! f^(n)(x0)    (4)

R_(n+1)(x) = (1/n!) ∫_x0^x (x-t)^n f^(n+1)(t) dt
     (IMVT) = (x-x0)^(n+1)/(n+1)! f^(n+1)(ξ)    (ξ between x0 & x)    (5)

           = truncation error of the expansion = T.E.

(Figure omitted: the approximation of f(x) using 0th, 1st, & 2nd
order Taylor series expansions.)

* Examples: (In Eq. (4), replace x - x0 by h, and x0 by x)

f(x+h) = f(x) + h f'(x) + h²/2! f''(x) + h³/3! f'''(x) + ...    (6)

f(x-h) = f(x) - h f'(x) + h²/2! f''(x) - h³/3! f'''(x) + ...    (7)

* List of Taylor series for commonly encountered functions (with x0 = 0):

(1+x)^α = 1 + αx + α(α-1)/2! x² + α(α-1)(α-2)/3! x³ + ...,      |x| < 1

1/(1-x) = 1 + x + x² + x³ + x⁴ + ...,                           |x| < 1

√(1+x) = 1 + x/2 - x²/8 + x³/16 - 5x⁴/128 + 7x⁵/256 - ...,      |x| < 1

sin(x) = x - x³/3! + x⁵/5! - x⁷/7! + ...

cos(x) = 1 - x²/2! + x⁴/4! - x⁶/6! + ...

tan(x) = x + x³/3 + 2x⁵/15 + 17x⁷/315 + 62x⁹/2835 + ...,        |x| < π/2

sec(x) = 1 + x²/2 + 5x⁴/24 + 61x⁶/720 + 277x⁸/8064 + ...,       |x| < π/2

arcsin(x) = x + x³/6 + 3x⁵/40 + 5x⁷/112 + 35x⁹/1152 + ...,      |x| < 1

arccos(x) = π/2 - x - x³/6 - 3x⁵/40 - 5x⁷/112 - ...,            |x| < 1

arctan(x) = x - x³/3 + x⁵/5 - x⁷/7 + x⁹/9 - ...,                |x| < 1

exp(x) = 1 + x + x²/2! + x³/3! + x⁴/4! + x⁵/5! + ...

sinh(x) = x + x³/3! + x⁵/5! + x⁷/7! + ...

cosh(x) = 1 + x²/2! + x⁴/4! + x⁶/6! + ...

sech(x) = 1/cosh(x) = 1 - x²/2 + 5x⁴/24 - 61x⁶/720 + 277x⁸/8064 - ...,  |x| < π/2

tanh(x) = x - x³/3 + 2x⁵/15 - 17x⁷/315 + 62x⁹/2835 - ...,       |x| < π/2

sinh⁻¹(x) = x - x³/6 + 3x⁵/40 - 5x⁷/112 + 35x⁹/1152 - ...,      |x| < 1

tanh⁻¹(x) = x + x³/3 + x⁵/5 + x⁷/7 + x⁹/9 + ...,                |x| < 1

ln(1+x) = x - x²/2 + x³/3 - ...,                                |x| < 1

ln[(1+x)/(1-x)] = 2x + 2x³/3 + 2x⁵/5 + 2x⁷/7 + ...,             |x| < 1

* Challenging problem:
Can you use Taylor series expansion to evaluate the following limit?

lim_(x→∞) x² [ (1 + 1/(x+1))^(x+1) - (1 + 1/x)^x ] = ?

(See Example 2 in Supplemental Reading Material for details)

* Example:  What are the errors, or remainders, R4, in the
Taylor series expansion of sin(x)?

Soln.  For f(x) = sin(x):  f'(x) = cos(x),  f''(x) = -sin(x),
f'''(x) = -cos(x),  f''''(x) = sin(x)

=>  sin(x) = x - x³/3! + ... = P3(x) + R4(x)

with P3(x) = x - x³/3!

& R4(x) = (1/3!) ∫_0^x (x-t)³ f''''(t) dt = (1/3!) ∫_0^x (x-t)³ sin(t) dt

 = -(1/3!)(1/4) ∫_0^x sin(t) d(x-t)⁴

(integration by parts =>)

 = -(1/3!)(1/4) [ sin(t)(x-t)⁴ |_0^x - ∫_0^x (x-t)⁴ cos(t) dt ]

 = (1/4!) ∫_0^x cos(t)(x-t)⁴ dt    (IMVT =>)

 = (1/4!) cos(ξ) ∫_0^x (x-t)⁴ dt = (x⁵/5!) cos(ξ).

Hence sin(x) = x - x³/3! + (x⁵/5!) cos(ξ) for ξ between 0 and x.

Note: R4(x) may also be expressed, applying the IMVT directly, as

(1/3!) sin(ξ) ∫_0^x (x-t)³ dt = (x⁴/4!) sin(ξ).

However, since ξ is between 0 & x with |x| ≤ 1, this estimate,
R4 ~ (x⁴/4!) sin(ξ), is not as useful.
Since cos(ξ) ~ 1 for small |x|, R4 ~ x⁵/5! is more useful!

* In practice, use the next term in the TS expansion to represent R_(n+1)(x).
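The bound just derived, |R4(x)| = |sin(x) - P3(x)| ≤ |x|⁵/5!, is easy to check numerically; a sketch in Python:

```python
import math

def P3(x):
    # 3rd-order Taylor polynomial of sin(x) about x0 = 0
    return x - x**3 / math.factorial(3)

# The remainder R4(x) = (x**5/5!) * cos(xi) satisfies |R4| <= |x|**5/120.
for x in (0.1, 0.5, 1.0):
    r4 = math.sin(x) - P3(x)
    bound = abs(x)**5 / math.factorial(5)
    print(x, r4, bound)
    assert abs(r4) <= bound
```

For x = 0.1 the actual remainder is within about 0.03% of the bound, confirming that cos(ξ) ≈ 1 for small |x|.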

1.2.5 Taylor series expansion in two dimensions

Let f(x, y) be n+1 times continuously differentiable for all (x, y)
in some neighborhood of (x0, y0). Then,

f(x0+ξ, y0+η) = f(x0, y0) + Σ_(m=1)^n (1/m!) D^m f(x,y) |_(x0, y0)
              + 1/(n+1)! D^(n+1) f(x,y) |_(x0+θξ, y0+θη)    (8)

where D = ξ ∂/∂x + η ∂/∂y, and 0 ≤ θ ≤ 1.

Example: Consider f(x, y) = ln[1 + 2x + x² + xy + y³]^(1/2).

Find its Taylor series expansion near x = y = 0.


Solution: Method 1.

f(x, y) = (1/2) ln[1 + 2x + x² + xy + y³],    f(0, 0) = 0

∂f/∂x = (2 + 2x + y) / (2[1 + 2x + x² + xy + y³]),    ∂f/∂x (0,0) = 1;

∂f/∂y = (x + 3y²) / (2[1 + 2x + x² + xy + y³]),    ∂f/∂y (0,0) = 0;

∂²f/∂x² = {2[1 + 2x + x² + xy + y³] - (2 + 2x + y)(2 + 2x + y)}
          / (2[1 + 2x + x² + xy + y³]²),    ∂²f/∂x² (0,0) = -1;

∂²f/∂y² = {6y[1 + 2x + x² + xy + y³] - (x + 3y²)(x + 3y²)}
          / (2[1 + 2x + x² + xy + y³]²),    ∂²f/∂y² (0,0) = 0;

∂²f/∂x∂y = {[1 + 2x + x² + xy + y³] - (x + 3y²)(2 + 2x + y)}
           / (2[1 + 2x + x² + xy + y³]²),    ∂²f/∂x∂y (0,0) = 1/2.

Thus, f(x, y) ~ 0 + x + 0·y + (1/2)[(-1)x² + 2·(1/2)xy + 0·y²]

             = x - (1/2)x² + (1/2)xy + ...
Method II:
Let u = 2x + x² + xy + y³.

Note: ln(1+u) = u - u²/2 + u³/3 - ...

f = (1/2) ln[1 + 2x + x² + xy + y³]
  = (1/2) {[2x + x² + xy + y³] - (1/2)[2x + x² + xy + y³]² + ...}
  = x + (1/2)x² + (1/2)xy - (1/4)(4x²) + ... = x - (1/2)x² + (1/2)xy + ...

(This approach is much simpler!)

1.2.6 Fourier transform F:

i. Periodic functions

(Figure omitted: a periodic function f(x) plotted over several periods,
from -3L to 5L; p = 2L in the graph.)

We note: f(x + 2L) = f(x).
f(x) is a periodic function if f(x + p) = f(x),
and p is the period of this periodic function.

ii. Fourier series expansion

f(x) = a0 + Σ_(n=1)^∞ [a_n cos(nπx/L) + b_n sin(nπx/L)]

where

a0 = (1/(2L)) ∫_(-L)^L f(x) dx

a_k = (1/L) ∫_(-L)^L f(x) cos(kπx/L) dx

b_k = (1/L) ∫_(-L)^L f(x) sin(kπx/L) dx

Example of Fourier series applications:
In many situations, we are faced with a more complicated signal
that may not have an obvious single period:

(Figure omitted: the signal; the x-axis is time in seconds.)

If we sample this signal at a rate of 4096 Hz, collect 1024 samples,
and find the Fourier series coefficients, we get the spectrum:

(Figure omitted: the spectrum.)

The spectrum tells us that the original signal is a composite of 3 pure
tones. In fact, these tones are at 60 Hz, 300 Hz and 1200 Hz.
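The spectrum computation above can be sketched with NumPy's FFT. The sampling rate, sample count, and tone frequencies come from the text; the tone amplitudes are made-up for illustration:

```python
import numpy as np

fs, N = 4096, 1024                  # sampling rate (Hz) and number of samples
t = np.arange(N) / fs
# Composite of 3 pure tones; amplitudes chosen arbitrarily
x = (np.sin(2*np.pi*60*t) + 0.5*np.sin(2*np.pi*300*t)
     + 0.25*np.sin(2*np.pi*1200*t))

X = np.fft.rfft(x)                  # one-sided Fourier coefficients
freqs = np.fft.rfftfreq(N, d=1/fs)  # frequency (Hz) of each coefficient

# The 3 largest spectral peaks recover the tone frequencies
peaks = freqs[np.argsort(np.abs(X))[-3:]]
print(sorted(peaks))                # -> [60.0, 300.0, 1200.0]
```

With fs/N = 4 Hz resolution, all three tones fall exactly on FFT bins, so the peaks are sharp with no spectral leakage.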

iii. Fourier transform F

Consider f(x), -∞ < x < ∞, with ∫_(-∞)^∞ |f(x)| dx < ∞.

The Fourier transform of f(x) is:

f̂(w) = (1/√(2π)) ∫_(-∞)^∞ f(x) e^(-iwx) dx.    (-∞ < w < ∞)

The Fourier inverse transform:

f(x) = (1/√(2π)) ∫_(-∞)^∞ f̂(w) e^(iwx) dw.

The Fourier transform is a continuous version of the Fourier series.
Or, the Fourier series is a discrete version of the Fourier transform.

Physical interpretation of the Fourier transformation:

|f̂(w)|² dw is the energy in the frequency range (w, w + dw);
|f̂(w)|² is called the spectral density or energy spectrum.

Some properties of the Fourier transform (write f(x) ↔ f̂(w)):

a) Linearity of the Fourier transform

F{a f(x) + b g(x)} = (1/√(2π)) ∫_(-∞)^∞ [a f(x) + b g(x)] e^(-iwx) dx

 = a (1/√(2π)) ∫_(-∞)^∞ f(x) e^(-iwx) dx + b (1/√(2π)) ∫_(-∞)^∞ g(x) e^(-iwx) dx

 = a F{f(x)} + b F{g(x)}
b) Fourier transform of the derivative of f(x):

F{f'(x)} = iw F{f(x)}

Why? Note:

F{f'(x)} = (1/√(2π)) ∫_(-∞)^∞ f'(x) e^(-iwx) dx

 = (1/√(2π)) { f(x) e^(-iwx) |_(-∞)^(+∞) + iw ∫_(-∞)^∞ f(x) e^(-iwx) dx }

 = iw F{f(x)}    (using f(±∞) = 0)

And similarly

F{f''(x)} = (iw)² F{f(x)} = -w² F{f(x)}

An application of the Fourier transform

Apply the Fourier transform to a constant coefficient ODE:

a y'' + b y' + c y = g(t)

and an algebraic equation is obtained:

F{a y'' + b y' + c y = g(t)}   =>   -w² a ŷ(w) + iwb ŷ(w) + c ŷ(w) = ĝ(w)

Thus it is easier to solve:

ŷ(w) = ĝ(w) / [c + ibw - aw²],    y(t) = F⁻¹{ŷ(w)}

We will use the Fourier transform to study the solution behavior of:

∂u/∂t + c ∂u/∂x = ν ∂²u/∂x² + D ∂³u/∂x³

in order to understand the roles of advection (c), diffusion (ν), and
dispersion (D).

1.3 Sources of Errors in Computations:


1.3.1 Absolute and relative errors:

True value (T.V.) xT = Approximate value (A.V.) xA + Error    (9)

Absolute error = T.V. - A.V.

Relative error:  Rel.(xA) = (T.V. - A.V.) / T.V. = (xT - xA) / xT    (10)

1.3.2 Types of errors

Modeling error -- e.g. neglecting friction in computing
a bullet trajectory
Empirical measurements -- g (gravitational acceleration),
h (Planck constant), ...
Blunders
Inexact input data -- e.g. weather prediction based on collected data
Round-off error -- e.g. 3.1415927 instead of
3.1415926535897932384...
Truncation error -- e.g. e^x ≈ 1 + x + x²/2! + x³/3! + x⁴/4! for small x,
or dy/dt ≈ (y_(n+1) - y_n)/Δt for small Δt.

* Example: The surface area of the earth may be approximated as
A = 4πr².
Errors enter the various approximations:
The earth is modeled as a sphere (an idealization).
The earth radius r ≈ 6378.14 km comes from measurements.
π ≈ 3.14159265.
A calculator or computer has finite word length; the result is rounded.

* Example of Truncation Error in a Taylor series:

f(x) = f(x0) + (x-x0)/1! f'(x0) + (x-x0)²/2! f''(x0) + (x-x0)³/3! f'''(x0)
     + ... + (x-x0)^n/n! f^(n)(x0) + R_(n+1)(x)    (4)

R_(n+1)(x) = Remainder or Truncation Error (E_T)

R_(n+1)(x) or E_T can be estimated as

R_(n+1)(x) = (1/n!) ∫_x0^x (x-t)^n f^(n+1)(t) dt
           = (x-x0)^(n+1)/(n+1)! f^(n+1)(ξ)    (5)

with ξ between x and x0.

To understand the roundoff error, we must first look into floating
point arithmetic.

1.4 Floating Point Arithmetic

1.4.1 Anatomy of a floating-point number
There are three fields in a 32-bit IEEE 754 float: sign, exponent, and fraction.

Example: representation of (0.15625)10 in a binary 32-bit float:
(0.15625)10 = 0.125 + 0.03125 = 1/2³ + 1/2⁵ = (0.00101)2

Example: how do we represent 1/10 in binary?
Solution: 1/10 = 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹² + 1/2¹³ + ...
= (0.0001100110011001100110011001100...)2

The pattern repeats, never ending; the stored number is inexact.

Is such inexactness important?
Yes! Very important! You need to know your weapon well in order to
use it effectively.
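The inexactness of 1/10 is directly visible from Python, whose floats are IEEE 754 doubles; a sketch:

```python
from decimal import Decimal
from fractions import Fraction

# The double nearest to 1/10 is slightly LARGER than 1/10:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Its exact value is an integer over a power of two, as Eq. (11) requires:
print(Fraction(0.1))     # 3602879701896397/36028797018963968  (= m / 2**55)

# Consequence: accumulated 0.1's drift away from the exact answer
s = sum([0.1] * 10)
print(s == 1.0, s)       # False 0.9999999999999999
```

Every binary floating-point number is some integer times a power of two, and 1/10 is not, so the repeating pattern above is unavoidable.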

Real-life Examples
--Disasters Caused by Computer Arithmetic Error

The Patriot Missile Failure
On February 25, 1991, during the Gulf
War, an American Patriot Missile
battery in Dharan, Saudi Arabia, failed
to track and intercept an incoming
Iraqi Scud missile. The Scud struck an American Army barracks,
killing 28 soldiers and injuring around 100 other people.
The Patriot missile had an on-board timer that incremented
every tenth of a second.
The software accumulated a floating point time value by adding 0.1
seconds per tick.
The problem is that 0.1 in floating point is not exactly 0.1. With a 23-bit
representation it is really only 0.0999999046326.
Thus, after 100 hours (3,600,000 ticks), the software timer was
off by 0.3433 seconds.
A Scud missile travels at 1676 m/s. In 0.3433 seconds, the Scud
was 573 meters away from where the Patriot thought it was.
This was far enough that the incoming Scud was outside the
"range gate" that the Patriot tracked.
See government investigation report:
http://www.fas.org/spp/starwars/gao/im92026.htm


General computer representation of a floating point number x:

x:  σ (.d1d2d3...dt) × B^e    (11)

(e.g. -.110101 × 2^10 in binary)

σ = sign (±)
B = number base: 2 or 10 or 16
d1d2d3...dt = mantissa, or fractional part of the significand;
    d1 ≠ 0;  0 ≤ di ≤ B-1, i = 1, 2, ..., t
t = number of significant digits, e.g. t = 24;
    it gives the PRECISION of x
e = exponent or characteristic, L ≤ e ≤ U
    (e.g. L = -126, U = 127);
    it determines the RANGE of x.

In reality, Eq. (11) represents

x = σ (d1/B + d2/B² + d3/B³ + ... + dt/B^t) × B^e    (12)

e.g.  -.110101 × 2^11
    = -(1/2 + 1/2² + 0/2³ + 1/2⁴ + 0/2⁵ + 1/2⁶) × 2^11    (in base 2)
    = -0.828125 × 2^11 = -1696    (in base 10)

1.4.2 IEEE Standard for single precision (for base 2 only)

* 32-bit IEEE 754 float:

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
bit 0: S (sign);  bits 1-8: E (exponent);  bits 9-31: F (fraction)

The value V represented by the word may be determined as follows:

If E=255 and F is nonzero, then V = NaN ("Not a Number").

If E=255 and F is zero and S is 1, then V = -Infinity; (-1)^S = -1.

If E=255 and F is zero and S is 0, then V = Infinity; (-1)^S = 1.

If 0<E<255, then V = (-1)^S * 2^(E-127) * (1.F)
where "1.F" is intended to represent the binary number created
by prefixing F with an implicit leading 1 (d0 = 1) and a binary point.

In the above, the exponent is stored with 127 added to it,
also called "biased with 127".
Thus, none of the 8 bits is used to store the sign of the exponent E.
But the actual exponent e is equal to E - 127.

Since E=255 is reserved (for NaN and ±Infinity), the largest usable E is 254
=> U = 254 - 127 = 127.

If E=0 and F is nonzero, then V = (-1)^S * 2^(-126) * (0.F).
These are "unnormalized" (denormalized) values.
That is why L = -126.

If E=0 and F is zero and S is 1, then V = -0.

If E=0 and F is zero and S is 0, then V = 0.

The reason for having |L| < U is so that the reciprocal of the
smallest normalized number, 2^(-L), will not overflow. Although it is
true that the reciprocal of the largest number will underflow, underflow
is usually less serious than overflow.

1.4.3 IEEE Standard for double precision

Double precision refers to a type of floating-point number that
has more precision (that is, more digits to the right of the
decimal point) than a single precision number. The term double
precision is something of a misnomer, because the precision is
not really double.
The word double derives from the fact that a double precision
number uses twice as many bits as a regular floating-point
number. For example, if a single precision number requires 32
bits, its double precision counterpart will be 64 bits long (see
the partition of three fields shown above).
The extra bits increase not only the precision (t) but also the range
(e) of magnitudes that can be represented. The exact amount
by which the precision and range of magnitudes are increased
depends on what format the program is using to represent
floating-point values. Most computers use a standard format
known as the IEEE floating-point format.

Brief summary:

            t (mantissa)    L        U       Total length
IEEE SP     23              -126     127     32
IEEE DP     52              -1022    1023    64

SP:  23 bits go to the PRECISION (t)
     8 bits go to the RANGE (L or U)
     1 bit goes to the SIGN
     Adding up: 32 bits for the single precision representation

DP:  64 = 52 (t) + 11 (e) + 1 (S)

1.4.4 Total number of floating point numbers

A floating point system CANNOT represent arbitrary real
numbers, even those of modest magnitude.
The total number of floating point numbers that can be
produced by the system

x = σ (.d1d2d3...dt) × B^e

is

2 (B-1) B^(t-1) (U - L + 1) + 1    (13)

2:        accounts for the sign ±
(B-1):    number of possibilities in choosing d1 ≠ 0
B^(t-1):  number of possibilities in choosing d2, d3, ..., dt
U-L+1:    number of possibilities in choosing e
1:        for representing the number x = 0

That is why, for example, 0.1 in decimal cannot be represented
exactly in a binary number representation:

1/10 = 1/2⁴ + 1/2⁵ + 1/2⁸ + 1/2⁹ + 1/2¹² + 1/2¹³ + ...
     = (0.0001100110011001100110011001100...)2
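Formula (13) can be verified by brute force on a toy system; a sketch with B = 2, t = 3, L = -1, U = 2:

```python
# Toy system: x = +/- (0.d1 d2 d3)_2 * 2^e, with d1 != 0 and -1 <= e <= 2
B, t, L, U = 2, 3, -1, 2

values = {0.0}
for sign in (+1, -1):
    for d1 in range(1, B):                  # d1 != 0 (normalized)
        for d2 in range(B):
            for d3 in range(B):
                for e in range(L, U + 1):
                    m = d1/B + d2/B**2 + d3/B**3
                    values.add(sign * m * B**e)

count_formula = 2 * (B - 1) * B**(t - 1) * (U - L + 1) + 1
print(len(values), count_formula)           # -> 33 33
```

Because the normalized mantissa lies in [1/2, 1), values with different exponents occupy disjoint ranges, so all 32 nonzero combinations are distinct and the enumeration matches the formula exactly.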

1.4.5 Smallest and largest positive numbers

Smallest positive (normalized) number xL and underflow:

xL = (.100...0)B × B^L = B^(L-1)    (14)

(= 2^(-127) = 5.877×10^(-39) for IEEE single precision)
If x < xL: underflow, i.e. the computer may treat x as 0.

Largest positive number xU and overflow:

xU = (.βββ...β)B × B^U = (1 - B^(-t)) B^U,   where β = B-1    (15)

(≈ 2^127 = 1.7×10^38 for IEEE single precision)
If x > xU: overflow;
computers treat x as "∞", Inf, or NaN ("Not a Number").

Is it important to know the conditions for overflow and underflow?
Absolutely!

Real-life Example
--Disasters Caused by Computer Arithmetic Error
Ariane Rocket
On June 4, 1996 an unmanned Ariane 5 rocket was launched.
The rocket was on course for 36 seconds and then veered off
and crashed.
The internal reference system was trying to convert a 64-bit
floating point number to a 16-bit integer.
This caused an overflow which was sent to the onboard
computer.
The on-board computer interpreted the overflow as real flight
data and bad things happened.
The destroyed rocket and its cargo were valued at $500 million.
The rocket was on its first voyage, after a decade of
development costing $7 billion.


Overflow and underflow experiment:

Write a computer program, using x = 2^k with k = 1, 2, ..., to
determine the largest floating point number your computer can
handle before overflow occurs.
Then use y = 2^(-k), k = 1, 2, ..., to determine the smallest floating
point number your computer can handle before underflow occurs.

Results of the experiment on an Alpha workstation:

single precision:
k_up = 127,     x_up = 1.7014118E+38
k_low = -126,   y_low = 1.1754944E-38

double precision:
k_up = 1023,    x_up = 8.988465674311580E+307
at k > 1023, x blows up: overflow
k_low = -1022,  y = 2.225073858507201E-308
at k < -1022, y = 0.0: underflow
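The same experiment can be run in Python, whose floats are IEEE double precision; a sketch:

```python
import math

# Grow x = 2**k until the next doubling overflows to inf
x, k_up = 1.0, 0
while not math.isinf(x * 2.0):
    x, k_up = x * 2.0, k_up + 1
print("k_up =", k_up, " x_up =", x)     # k_up = 1023, x_up ~ 8.99e+307

# Shrink y = 2**-k until it underflows to 0.0
y, k_low = 1.0, 0
while y / 2.0 > 0.0:
    y, k_low = y / 2.0, k_low + 1
print("k_low =", -k_low, " y_low =", y)
# Note: k_low comes out as 1074, not 1022 -- IEEE "gradual underflow"
# (the denormalized numbers of Sec. 1.4.2, E = 0) extends the range
# below the smallest normalized number 2**-1022.
```

The overflow limit matches the Alpha result; the underflow limit differs because modern IEEE arithmetic fills in denormalized values between 2^-1074 and 2^-1022 instead of flushing to zero.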

1.4.6 Round-off Error and machine precision

Rounding:
e.g. x = 2/3 = 0.666666666666...
To keep 7 decimal digits,
rounding to nearest gives x = 0.6666667 (the 8th digit 6 > 5, so
1 is added to the 7th digit).

* Example: T = 3.1415926535897932...
If A = 3.14159,
then |roundoff error| = |A - T| = 0.00000265358979... < 0.000005

Note:
A         |roundoff error|
3.14      0.00159...  < 0.005
3.142     0.00040...  < 0.0005
3.1416    0.0000073   < 0.00005

x = 5.2468 ~ 5.247, or x ~ 5.25, but x ≠ 5.3.
If you want to keep one decimal, then x ~ 5.2;
i.e. rounding is not transitive.

Chopping:
e.g. x = 2/3 = 0.666666666666...
To keep 7 decimal digits, chopping gives x = 0.6666666
(the 8th digit 6 is simply chopped).
e.g. A = 3.1415 after chopping:
|roundoff error| = 0.000092... < 0.0005
It is larger than rounding to nearest.

Is it important to care about chopping vs. rounding?

Major difference between chopping and rounding:
The error in chopping is always non-negative, since the chopped
number is never larger in magnitude than the original number.
This can cause a systematic skew in the summation Σ_(j=1)^M x_j !!!
The error in rounding can be either positive or negative.
Thus the round-off error in computing Σ_(j=1)^M x_j will be smaller,
since some of the round-off errors will cancel out.

Real-life Example
--Disasters Caused by Computer Arithmetic Error

Vancouver Stock Exchange

In 1982, the index was initiated with a starting value of
1000.000, computed with three digits of precision and truncation.
After 22 months, the index was at 524.881.
The index should have been at 1009.811.
Successive additions and subtractions introduced truncation
error that caused the index to be off by that much.
(A detailed two-page explanation is omitted from this excerpt.)

Machine precision or machine epsilon ε_mach
-- a measure of accuracy or precision:

chopping:  ε_mach = B^(1-t)        (= 2^(1-23) = 2.384×10^(-7) for B=2, t=23)
rounding:  ε_mach = (1/2) B^(1-t)  (= 2^(-23)  = 1.192×10^(-7) for B=2, t=23)

Note:
if x < ε_mach then 1 + x = 1 in machine computation.

The unit round δ of a computer is the number that satisfies the
following:
i)  it is a positive floating-point number;
ii) it is the smallest such number for which

    fl(1 + δ) > 1    (16)

where "fl" means the "floating-point" representation of the number.

* Thus, for any x < δ, we have fl(1 + x) = 1.
=> δ is a precise measure of how many digits of accuracy are
possible in representing a number.

Machine precision experiment:

Write a computer program, using δ = 2^(-k), with k = 1 to 34 for
single precision and k = 1 to 60 for double precision, to
determine δ for the machine you are using, for both single
precision and double precision operations.
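A Python version of this experiment, as a sketch (Python floats are double precision, so the loop stops near k = 52-53):

```python
import sys

del_, k = 1.0, 0
while 1.0 + del_ / 2.0 > 1.0:    # halve del until fl(1 + del/2) == 1
    del_, k = del_ / 2.0, k + 1

print(k, del_)                   # 52  2.220446049250313e-16  (= 2**-52)
print(sys.float_info.epsilon)    # the library agrees: 2.220446049250313e-16
```

The loop stops once del/2 = 2^-53: adding that to 1 rounds back to 1 (round-to-nearest-even), which is exactly the fl(1 + δ) > 1 criterion of Eq. (16).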

* Example: On an Alpha workstation or a DEC 5000 machine,

Single precision:
k = 23    del = 0.000000119       still truthful
k = 24    del = 5.9604645E-08     no longer truthful

Program:

      del=1.0
      do k=1,35
        del=del/2.0
        f=1+del
        z=f-1
        write(6,12) k, del, f, z
      enddo
 12   format(1x,i3,2x,f15.11,2x,f13.10,2x,f13.9)
      stop
      end

Results (Single precision):

K    del              f               z
1    0.50000000000    1.5000000000    0.500000000
2    0.25000000000    1.2500000000    0.250000000
3    0.12500000000    1.1250000000    0.125000000
4    0.06250000000    1.0625000000    0.062500000
5    0.03125000000    1.0312500000    0.031250000
6    0.01562500000    1.0156250000    0.015625000
7    0.00781250000    1.0078125000    0.007812500
8    0.00390625000    1.0039062500    0.003906250
9    0.00195312500    1.0019531250    0.001953125
10   0.00097656250    1.0009765625    0.000976562
11   0.00048828125    1.0004882812    0.000488281
12   0.00024414062    1.0002441406    0.000244141
13   0.00012207031    1.0001220703    0.000122070
14   0.00006103516    1.0000610352    0.000061035
15   0.00003051758    1.0000305176    0.000030518
16   0.00001525879    1.0000152588    0.000015259
17   0.00000762939    1.0000076294    0.000007629
18   0.00000381470    1.0000038147    0.000003815
19   0.00000190735    1.0000019073    0.000001907
20   0.00000095367    1.0000009537    0.000000954
21   0.00000047684    1.0000004768    0.000000477
22   0.00000023842    1.0000002384    0.000000238
23   0.00000011921    1.0000001192    0.000000119
24   0.00000005960    1.0000000000    0.000000000
25   0.00000002980    1.0000000000    0.000000000
26   0.00000001490    1.0000000000    0.000000000
27   0.00000000745    1.0000000000    0.000000000
28   0.00000000373    1.0000000000    0.000000000
29   0.00000000186    1.0000000000    0.000000000
30   0.00000000093    1.0000000000    0.000000000
31   0.00000000047    1.0000000000    0.000000000
32   0.00000000023    1.0000000000    0.000000000
33   0.00000000012    1.0000000000    0.000000000
34   0.00000000006    1.0000000000    0.000000000

Double precision:
k = 53    del = 1.1102230E-016    still truthful
k = 54    del = 5.55E-017         no longer truthful

Program:

      implicit double precision (a-h,o-z)
      del=1.0
      do k=1,64
        del=del/2.0
        f=1+del
        z=f-1
        write(8,12) k,del,f,z
      enddo
 12   format(1x,i4,2x,e21.13,2x,f22.17,2x,e16.9)
      stop
      end
Result (double precision):

K    del                    f                      z
30   0.9313225746155E-09    1.00000000093132257    0.93132E-09
31   0.4656612873077E-09    1.00000000046566129    0.46566E-09
32   0.2328306436539E-09    1.00000000023283064    0.23283E-09
33   0.1164153218269E-09    1.00000000011641532    0.11642E-09
34   0.5820766091347E-10    1.00000000005820766    0.58208E-10
35   0.2910383045673E-10    1.00000000002910383    0.29104E-10
36   0.1455191522837E-10    1.00000000001455192    0.14552E-10
37   0.7275957614183E-11    1.00000000000727596    0.72760E-11
38   0.3637978807092E-11    1.00000000000363798    0.36380E-11
39   0.1818989403546E-11    1.00000000000181899    0.18190E-11
40   0.9094947017729E-12    1.00000000000090949    0.90949E-12
41   0.4547473508865E-12    1.00000000000045475    0.45475E-12
42   0.2273736754432E-12    1.00000000000022737    0.22737E-12
43   0.1136868377216E-12    1.00000000000011369    0.11369E-12
44   0.5684341886081E-13    1.00000000000005684    0.56843E-13
45   0.2842170943040E-13    1.00000000000002842    0.28422E-13
46   0.1421085471520E-13    1.00000000000001421    0.14211E-13
47   0.7105427357601E-14    1.00000000000000711    0.71054E-14
48   0.3552713678801E-14    1.00000000000000355    0.35527E-14
49   0.1776356839400E-14    1.00000000000000178    0.17764E-14
50   0.8881784197001E-15    1.00000000000000089    0.88818E-15
51   0.4440892098501E-15    1.00000000000000044    0.44409E-15
52   0.2220446049250E-15    1.00000000000000022    0.22204E-15
53   0.1110223024625E-15    1.00000000000000000    0.00000E+00
54   0.5551115123126E-16    1.00000000000000000    0.00000E+00
55   0.2775557561563E-16    1.00000000000000000    0.00000E+00
56   0.1387778780781E-16    1.00000000000000000    0.00000E+00
57   0.6938893903907E-17    1.00000000000000000    0.00000E+00
58   0.3469446951954E-17    1.00000000000000000    0.00000E+00
59   0.1734723475977E-17    1.00000000000000000    0.00000E+00
60   0.8673617379884E-18    1.00000000000000000    0.00000E+00
61   0.4336808689942E-18    1.00000000000000000    0.00000E+00
62   0.2168404344971E-18    1.00000000000000000    0.00000E+00
63   0.1084202172486E-18    1.00000000000000000    0.00000E+00
64   0.5421010862428E-19    1.00000000000000000    0.00000E+00

1.5 Significant Digits

Definition:
XA has m significant digits w.r.t. XT if the error |XT - XA| has
magnitude ≤ 5 in the (m+1)th digit, counting from the first
non-zero digit in XT.

Examples:

Example 1.    XT = 3.17286    (digit positions 1 2 3 4 5 6)

If XA = 3.17,  then |XT - XA| = 0.00286 < 0.005   (m+1 = 4 => m = 3)
If XA = 3.172, then |XT - XA| = 0.00086 < 0.005   (m+1 = 4 => m = 3)
If XA = 3.173, then |XT - XA| = 0.00014 < 0.0005  (m+1 = 5 => m = 4)

Example 2.    XT = 389.674    (digit positions 1 2 3 4 5 6)

If XA = 389.78, then |XT - XA| = 0.106 < 0.5      (m+1 = 4 => m = 3)
If XA = 389.7,  then |XT - XA| = 0.026 < 0.05     (m+1 = 5 => m = 4)

1.6 Interaction of Roundoff Error with Truncation Error

Consider f(x) = e^x, whose EXACT derivative is f'(x) = e^x.
At x = 0, the EXACT value is f'(0) = 1.

* Finite difference method 1: forward difference

TS expansion to O(h) =>

f'(x) ≈ [f(x+h) - f(x)]/h + O(h)    (17)

That is, numerically, f'(0) ≈ Δf/Δx = (e^h - 1)/h    (method 1)

* Error(x, h) = |f'(x) - Δf/Δx| = |e^x - [f(x+h) - f(x)]/h|:

(Figure omitted: log-log plot of |f'(0) - 1| vs. h for f'(0) ≈ [exp(h)-1]/h.
As h decreases, the error first decreases (decreasing truncation error),
then increases again (growing roundoff error).)

* Why does the error behave in such a manner?

* The roundoff error, measured using Excel, in computing the difference
between two O(1) numbers, [f(x+h) - f(x)], is roughly around
1.11×10^(-16).
* Thus the roundoff error (R.E.) for f' is

R.E. ~ 1.11×10^(-16) / h    (18)

R.E. is small for larger h, but it increases as h decreases.

* Truncation error based on Taylor series expansion:

f(x+h) = f(x) + f'(x) h + f''(x) h²/2 + f'''(x) h³/6 + ...

[f(x+h) - f(x)]/h = f'(x) + f''(x) h/2 + f'''(x) h²/6 + ...    (19)

Thus in approximating f'(x) by [f(x+h) - f(x)]/h, we commit
an error of f''(x) h/2 to the leading order.
Hence the T.E. in this example (x = 0, f''(0) = e^0 = 1) is:

T.E. = h/2    (20)

=> Total error = R.E. + T.E. ~ 1.11×10^(-16)/h + h/2    (21)
(The same figure as above: the measured error decreases with the
truncation error, then increases due to roundoff error as h shrinks,
consistent with Eq. (21).)

Finite difference method 2: central difference

If we use the central difference scheme to compute f'(x):

f'(x) ≈ [f(x+h) - f(x-h)]/(2h)    (22)

the truncation error is smaller, as shown below.

Truncation error (T.E.):
Taylor series expansion:

f(x-h) = f(x) - f'(x) h + f''(x) h²/2 - f'''(x) h³/6 + ...

[f(x+h) - f(x-h)]/(2h) = f'(x) + f'''(x) h²/6 + ...    (23)

Thus in approximating f'(x) by [f(x+h) - f(x-h)]/(2h), we
commit an error of f'''(x) h²/6 to the leading order.
Hence the truncation error in this example is

T.E. = f'''(x) h²/6 + ... = h²/6 at x = 0.    (24)

(Figure omitted: log-log plot of |f2'(0) - 1| vs. h for
f2'(0) = [exp(h) - exp(-h)]/(2h); the truncation error decreases as
h²/6 until the roundoff error, increasing like 1/h, takes over.)
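Both error curves can be reproduced in a few lines; a sketch in Python for f(x) = e^x at x = 0 (the notes' computations were done in Excel):

```python
import math

def fwd(h):
    # forward difference, Eq. (17): error ~ h/2 + roundoff/h
    return (math.exp(h) - 1.0) / h

def ctr(h):
    # central difference, Eq. (22): error ~ h**2/6 + roundoff/h
    return (math.exp(h) - math.exp(-h)) / (2.0 * h)

for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-12):
    print(f"{h:8.0e}  fwd err {abs(fwd(h)-1.0):.2e}  ctr err {abs(ctr(h)-1.0):.2e}")
# For moderate h the errors track h/2 and h**2/6; for tiny h both are
# swamped by roundoff, which grows like 1e-16/h.
```

Printing the two columns side by side shows the central scheme reaching a much smaller minimum error, at a larger optimal h, exactly as the figures indicate.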

(Figure omitted: predicted roundoff error 1.11E-16/h and the truncation
errors TE1 = h/2 (forward) and TE2 = h²/6 (central) vs. h.)

(Figure omitted: comparison of the predicted (roundoff + truncation)
errors, R+TE1 and R+TE2, with the actual errors for both schemes.)

Error = Truncation Error + Round-off Error

1.7 Propagation of Errors

Consider zT = xT * yT,  where * is an algebraic operation: +, -, ×, ÷.

First, the computer actually uses xA instead of xT, due to rounding
or because the data itself contains error.
Second, after xA * yA is computed, the computer rounds the result:

zA = fl(xA * yA).    (25)

Thus, the error in the operation * is

zT - zA = xT * yT - fl(xA * yA).    (26)

Let

xT = xA + ε,    yT = yA + η.    (27)

The error is

zT - zA = xT*yT - xA*yA + [xA*yA - fl(xA*yA)]    (28)

The second part, in [ ], is simply due to machine rounding.
It can be easily estimated as

(xA * yA) ε_mach = (xA * yA) (1/2) B^(1-p)

The first part, xT*yT - xA*yA, is the propagated error.

Now consider the propagated error in various operations.

1.7.1 Error in multiplication

Absolute error in multiplication:

xT yT - xA yA = xT yT - (xT - ε)(yT - η)
              = ε yT + η xT - ε η.    (29)

Relative error:

Rel.(xA yA) = (xT yT - xA yA)/(xT yT) = ε/xT + η/yT - (ε/xT)(η/yT).    (30)

Assuming |ε/xT| << 1 and |η/yT| << 1, we obtain

Rel.(xA yA) ≈ ε/xT + η/yT = Rel.(xA) + Rel.(yA).    (31)

1.7.2 Error in division

Absolute error in division:

xT/yT - xA/yA = xT/yT - (xT - ε)/(yT - η).    (32)

Relative error in division:

Rel.(xA/yA) = (xT/yT - xA/yA)/(xT/yT) = 1 - (xA/xT)(yT/yA)

 = 1 - [1 - Rel.(xA)]/[1 - Rel.(yA)]

 ≈ 1 - [1 - Rel.(xA)][1 + Rel.(yA) + ...]    (TS expansion)

 ≈ Rel.(xA) - Rel.(yA) = ε/xT - η/yT.    (33)

1.7.3 Error in addition:

Absolute error:  xT + yT - (xA + yA) = ε + η    (34)

Relative error:  Rel.(xA + yA) = (ε + η)/(xT + yT)    (35)

1.7.4 Error in subtraction:

Absolute error:  xT - yT - (xA - yA) = ε - η    (36)

Relative error:  Rel.(xA - yA) = (ε - η)/(xT - yT)    (37)

Note:
xT - yT may be small due to cancellation
=> large Rel.(xA - yA),
i.e. loss of significance due to subtraction of nearly
equal quantities --- a very important practical issue!

* Example: Error in subtraction:

Compute r = 13 - √168 (= x - y).
Using 5-digit decimals, y = √168 => yA = 12.961 => rA = 0.039.
Exact number: rT = 0.038518603... =>

Error(rA) = 0.038518603 - 0.039 = -0.00048,
or Rel.(rA) = -1.25×10^(-2), which is not small.

Reason: x = 13 and y = √168 are quite close =>
rA has only 2 significant digits after the subtraction.

Improvement:

rA = (13² - 168)/(13 + √168) = 1/(13 + √168) = 1/(13 + 12.961)
   = 0.038519 with 5 significant digits.

=> Rel.(rA) = (0.038518603... - 0.038519)/0.038518603... = -1.03×10^(-5)

=> the magnitude of this error is much smaller than the
previous one (1.25×10^(-2)).

Lesson: avoid subtraction of two close numbers!
Whenever you can, use double precision.
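The 5-digit arithmetic above can be replayed with Python's decimal module; a sketch:

```python
from decimal import Decimal, getcontext

getcontext().prec = 5                 # 5 significant decimal digits

y = Decimal(168).sqrt()               # 12.961
r_bad = Decimal(13) - y               # 0.039 -- only 2 significant digits survive
r_good = 1 / (Decimal(13) + y)        # 0.038519 -- rationalized form
print(y, r_bad, r_good)

r_true = 0.038518603
print(abs(float(r_bad) - r_true) / r_true)    # ~1.25e-2
print(abs(float(r_good) - r_true) / r_true)   # ~1.0e-5
```

Rationalizing replaces the dangerous subtraction by an addition of two close numbers, which propagates almost no error.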

1.7.5 Induced error in evaluating functions

With one variable:

If f(x) has a continuous first order derivative in [a, b],
and xT and xA are in [a, b], then

f(xT) - f(xA) ≈ f'(xA)(xT - xA) + o(xT - xA)    (38)

With two variables:

f(xT, yT) - f(xA, yA) ≈ f'_x(xA, yA)(xT - xA) + f'_y(xA, yA)(yT - yA)
 + o(xT - xA, yT - yA)    (39)

Example:

f(x, y) = x^y  =>  f'_x = y x^(y-1),  f'_y = x^y log x

=> Error(fA) ≈ yA (xA)^(yA - 1) Error(xA) + (xA)^(yA) log(xA) Error(yA)

=> Rel.(fA) ≈ yA Rel.(xA) + yA log(xA) Rel.(yA)

1.7.6 Error in summation

Consider

s = Σ_(j=1)^M x_j.    (40)

In a Fortran program, we write:

      S=0
      DO J = 1, M
        S = S + X(J)
      ENDDO

Equivalently, in the above code we are doing the following:

s2 = fl(x1 + x2) = (x1 + x2)(1 + ε2),    (41a)

where ε2 = machine error due to rounding;

s3 = fl(s2 + x3) = (s2 + x3)(1 + ε3)    (41b)
   = [(x1 + x2)(1 + ε2) + x3](1 + ε3)
   ≈ (x1 + x2 + x3) + ε2(x1 + x2) + ε3(x1 + x2 + x3)    (41c)

s_(k+1) = (s_k + x_(k+1))(1 + ε_(k+1))
   ≈ (x1 + x2 + ... + x_(k+1)) + ε2(x1 + x2) + ε3(x1 + x2 + x3) + ...
   + ε_(k+1)(x1 + x2 + ... + x_(k+1))    (41d)

Error = s_M - (x1 + x2 + x3 + ... + x_M)

 ≈ ε2(x1 + x2) + ε3(x1 + x2 + x3) + ... + ε_M(x1 + x2 + ... + x_M)

 = (x1 + x2)(ε2 + ε3 + ... + ε_M) + x3(ε3 + ... + ε_M) + ... + x_M ε_M    (42)

Since all the εi's are of the same magnitude:

=> the term x1 contributes the most, while x_M contributes the least;

=> we should order the terms so that we add from the smallest values
(x1) to the largest (x_M), to reduce the overall machine error accumulation.
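The effect can be demonstrated by accumulating the harmonic sum in single precision in both orders; a sketch using NumPy's float32 to emulate the notes' single precision runs:

```python
import math
import numpy as np

M = 262144
exact = math.fsum(1.0 / k for k in range(1, M + 1))   # double precision reference

s_big_first = np.float32(0.0)
for k in range(1, M + 1):             # large terms first: 1/1, 1/2, ...
    s_big_first += np.float32(1.0 / k)

s_small_first = np.float32(0.0)
for k in range(M, 0, -1):             # small terms first: 1/M, ..., 1/2, 1/1
    s_small_first += np.float32(1.0 / k)

print(exact, s_big_first, s_small_first)
# Summing small-to-large loses far less accuracy, as the analysis predicts:
# once the running sum is large, each tiny 1/k added to it is rounded away.
```

This reproduces the pattern of the table that follows: at M = 262144 the large-to-small single precision sum is off in the third decimal, while the small-to-large sum agrees with the double precision value to several more digits.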

Example:  Compute S(M) = Σ_(k=1)^M 1/k for M < 10⁸:

i)   summing from k=1 to M using single precision (single: large to small)
ii)  summing from k=M to 1 using single precision (single: small to large)
iii) summing from k=1 to M using double precision (double: large to small)
iv)  summing from k=M to 1 using double precision (double: small to large)

asymptote = ln(M) + 0.5772156649015328

M          single:         single:         double:         double:         asymptote
           large to small  small to large  large to small  small to large
16384      10.2813063      10.28131294     10.2813068      10.28130678     10.2812767
32768      10.9744091      10.97444344     10.9744387      10.9744387      10.9744225
65536      11.667428       11.66758823     11.6675783      11.66757825     11.6675701
131072     12.3600855      12.36073208     12.3607216      12.36072161     12.3607178
262144     13.0513039      13.05388069     13.0538669      13.05386689     13.0538654
524288     13.7370176      13.74705601     13.7470131      13.74701311     13.7470112
1048576    14.4036837      14.44023132     14.4401598      14.44015982     14.4401588
2097152    15.4036827      15.13289833     15.1333068      15.13330676     15.1333065
4194304    15.4036827      15.82960701     15.8264538      15.82645382     15.8264542
8388608    15.4036827      16.51415253     16.5196009      16.51960094     16.5195999
16777216   15.4036827      17.23270798     17.2127481      17.21274809     17.2127476

(Figure omitted: Sum vs. M for the four summation orderings and the
asymptote; the "single: large to small" curve flattens out near 15.4.)

Clearly, the result based on single precision summation, adding from
large to small values, is the most unsatisfactory: the sum stagnates
prematurely, because once 1/k falls below the machine precision times
the running sum, adding further terms no longer changes the sum.
