
IEEE floating point Accuracy problems

Ovidiu Constantin Novac1, Remus Gabriel Suciu2

University of Oradea, Romania,


1Department of Computers and Information Technology, Faculty of Electrical Engineering and Information Technology,
2Master Student at Faculty of Electrical Engineering and Information Technology

1, Universităţii Str., 410087 Oradea, Romania, E-Mail: ovnovac@uoradea.ro

Abstract – The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point computation established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating point implementations that made them difficult to use reliably and portably. Many hardware floating point units now use the IEEE 754 standard. This standard is commonly used in software that is not sensitive to calculation errors (such as stock market applications or industry applications).

Keywords: IEEE 754; PHP; Java; floating point; float

I. INTRODUCTION TO FLOATING POINT ARITHMETIC

In computing, floating-point arithmetic is arithmetic using a formulaic representation of real numbers as an approximation, so as to support a trade-off between range and precision. A number is, in general, represented approximately to a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:

significand × base^exponent

Figure 1 – Formula for floating point representation

where significand is an integer (i.e., in Z), base is an integer greater than or equal to two, and exponent is also an integer. For example:

Figure 2 – Floating point for base 10

The term floating point refers to the fact that a number's radix point (decimal point or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated by the exponent component, and thus the floating-point representation can be thought of as a kind of scientific notation [1].

A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length. The result of this dynamic range is that the numbers that can be represented are not uniformly spaced; the difference between two consecutive representable numbers grows with the chosen scale [2].

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point computation established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). Over the years, a variety of floating-point representations have been used in computers. However, since the 1990s, the most commonly encountered representation is the one defined by the IEEE 754 Standard. The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

II. IEEE FLOATING POINT

The Java language supports two primitive floating point types, float and double, and their wrapper class counterparts, Float and Double. These are based on the IEEE 754 standard, which defines a binary standard for 32-bit single precision and 64-bit double precision floating point numbers [4].
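The sign, exponent and mantissa fields prescribed by the standard can be read back directly in Java via the standard library method Float.floatToIntBits. The following is a minimal sketch for single precision (1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits, as described in this paper); for the value 3.5, worked through in Section III, it recovers exactly the mantissa bit pattern 11000000000000000000000:

```java
public class BitFields {
    public static void main(String[] args) {
        // Raw IEEE 754 bits of a 32-bit float for 3.5 = +1.75 * 2^1
        int bits = Float.floatToIntBits(3.5f);

        int sign     = (bits >>> 31) & 0x1;     // 1 sign bit
        int exponent = (bits >>> 23) & 0xFF;    // 8 exponent bits, biased by 127
        int mantissa = bits & 0x7FFFFF;         // 23 mantissa bits; hidden leading 1 not stored

        System.out.println(sign + " " + (exponent - 127) + " "
                + Integer.toBinaryString(mantissa));
        // prints: 0 1 11000000000000000000000
    }
}
```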
Figure 3 – IEEE floating point representation

Figure 4 – IEEE double precision floating point representation

IEEE 754 represents floating point numbers as base 2 decimal numbers in scientific notation. An IEEE single precision floating point number dedicates 1 bit to the sign of the number, 8 bits to the exponent, and 23 bits to the mantissa, or fractional part. The exponent is interpreted as a signed integer, allowing negative as well as positive exponents. The fraction is represented as a binary (base 2) decimal, meaning the highest-order bit corresponds to a value of 1/2 (2^-1), the second bit to 1/4 (2^-2), and so on. For double precision floating point, 11 bits are dedicated to the exponent and 52 bits to the mantissa [4].

Table 1 – IEEE 754 parameters

Abbreviations as follows:
SGL – Single
DBL – Double
XTD – Extended

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base. Another advantage of using β = 2 is that there is a way to gain an extra bit of significance. Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of storage representing it. Formats that use this trick are said to have a hidden bit. As pointed out in the section Floating-point Formats of [5], this requires a special convention for 0: an exponent of emin - 1 and a significand of all zeros represents not 1.0 × 2^(emin - 1), but rather 0.

III. REPRESENTATION EXAMPLE

Decimal numbers can be represented exactly, if you have enough space, just not by floating binary point numbers. The following is a simple demonstration of the floating point representation of the number 3.5:

sign * 2^exponent * mantissa
3.5 = +1 * 2^1 * 1.75

Mantissa:
1 (hidden bit) + 11000000000000000000000
-> 1/(2^0) + 1/(2^1) + 1/(2^2) + 0/(2^3) + … + 0/(2^23)
-> 1 + 0.5 + 0.25
-> 1.75

Some numbers cannot be represented exactly (such as 0.33 or 0.9).

IV. ROUNDING ERRORS

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps of [5] describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. The section Guard Digits of [5] discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given there to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and
square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in the section Exactly Rounded Operations of [5].

Floating point arithmetic is rarely exact. While some numbers, such as 0.5, can be exactly represented as a binary (base 2) decimal (since 0.5 equals 2^-1), other numbers, such as 0.1, cannot be. As a result, floating point operations may result in rounding errors, yielding a result that is close to, but not equal to, the result you might expect. For example, a calculation as simple as summing a few decimal fractions can produce 2.600000000000001, rather than 2.6.

The number 61.0 has an exact binary representation because the integral portion of any number is always exact. But the number 6.10 is not exact. Mathematically, there should be no intrinsic difference between the two numbers; they're just numbers.

By contrast, if the decimal point is moved one place in the other direction to produce the number 610, the value is still exact, and the same holds for 6100, 610000000 and 610000000000000; only at some point do such values stop being exact as well.

The mathematically "pure" way to see this is through positional values. In base 10, the positional values are:

Figure 5 – Decimal scale

While in binary the positional values are:

Figure 6 – Binary scale (represented in decimal)

Consider base 10, in which 1/3 cannot be expressed exactly: it is 0.3333333... (recurring). The reason 0.1 can't be represented as a binary floating point number is exactly the same. The numbers 3, 9 and 27 can be represented exactly, but not 1/3, 1/9 or 1/27.

The problem is that 3 is a prime number which isn't a factor of 10. That's not an issue when you want to multiply a number by 3: you can always multiply by an integer without running into problems. But when you divide by a number which is prime and isn't a factor of your base, you can run into trouble (and will do so if you try to divide 1 by that number).

Although 0.1 is usually used as the simplest example of an exact decimal number which can't be represented exactly in binary floating point, arguably 0.2 is a simpler example, as it is 1/5, and 5 is the prime that causes problems between decimal and binary.

In single precision, 0.7 is actually represented as 0.699999988079071 and 0.1 as 0.10000000149011612. If we add these two values we obtain 0.7999999895691871, and if we multiply that by 10 we obtain 7.999999895691871, which when cast to int is 7, the same way 3.5 is 3 when cast to int. The other example still shows 7 because 0.6 is actually represented as 0.6000000238418579, and (0.6000000238418579 + 0.10000000149011612) * 10 is 7.00000025331974.

V. RECOMMENDATIONS

It would be best to avoid floating point comparison entirely. This is, of course, not always possible, but you should be aware of the limitations of floating point comparison. If you must compare floating point numbers to see if they are the same, you should instead compare the absolute value of their difference with some pre-chosen epsilon value, so you are testing whether they are "close enough". (If you don't know the scale of the underlying measurements, using the test abs(a/b - 1) < epsilon is likely to be more robust than simply comparing the difference.) Even testing a value to see if it is greater than or less than zero is risky: calculations that are "supposed to" result in a value slightly greater than zero may in fact result in numbers that are slightly less than zero due to accumulated rounding errors.

Some non-integral values, like dollars-and-cents decimals, require exactness. Floating point numbers are not exact, and manipulating them will result in rounding errors. As a result, it is a bad idea to use floating point to try to represent exact quantities like monetary amounts. Using floating point for dollars-and-cents calculations is a recipe for disaster. Floating point numbers are best reserved for values such as measurements, whose values are fundamentally inexact to begin with.

For arbitrary precision mathematics PHP offers the Binary Calculator (BC Math extension), which supports numbers of any size and precision up to 2147483647-1 (or 0x7FFFFFFF-1) decimals, represented as strings.
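The epsilon comparison recommended above can be sketched in Java. The tolerance 1e-9 below is an arbitrary choice for illustration, not a universally correct value; a suitable epsilon depends on the scale of the data:

```java
public class EpsilonCompare {
    public static void main(String[] args) {
        double a = 0.1 + 0.2;                  // 0.30000000000000004, not 0.3

        System.out.println(a == 0.3);          // exact comparison fails: false

        double epsilon = 1e-9;                 // pre-chosen tolerance (illustrative)
        System.out.println(Math.abs(a - 0.3) < epsilon);      // absolute test: true
        System.out.println(Math.abs(a / 0.3 - 1) < epsilon);  // relative test: true
    }
}
```

For exact decimal quantities such as monetary amounts, Java offers the BigDecimal class, which plays a role similar to PHP's BC Math extension.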
VI. CONCLUSIONS

Number values can only be represented exactly if they can be written in the form sign * 2^exponent * mantissa, with a mantissa that fits in the available bits.

IEEE Short Real: 32 bits; 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. Also called single precision.

IEEE Long Real: 64 bits; 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. Also called double precision.

In single precision, 0.7 is actually represented as 0.699999988079071 and 0.1 as 0.10000000149011612. If we add these two values we obtain 0.7999999895691871.

If the numbers are business-critical, using number rounding functions is recommended before operating with them.

For arbitrary precision mathematics in PHP use the Binary Calculator (BC Math extension), which supports numbers of any size and precision up to 2147483647-1 (or 0x7FFFFFFF-1) decimals, represented as strings.
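The rounding recommendation above can be illustrated in Java. The float literals 0.7f and 0.1f, once widened to double, reproduce the single precision values quoted in the text, and rounding before the integer conversion repairs the truncated result:

```java
public class RoundBeforeUse {
    public static void main(String[] args) {
        double a = 0.7f;                  // widens to 0.699999988079071...
        double b = 0.1f;                  // widens to 0.10000000149011612...
        double scaled = (a + b) * 10;     // slightly below 8

        System.out.println((int) scaled);        // cast truncates: 7
        System.out.println(Math.round(scaled));  // round to nearest first: 8
    }
}
```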

REFERENCES

[1] https://en.wikipedia.org/wiki/Floating-point_arithmetic
[2] S. W. Smith, "Chapter 28: Fixed versus Floating Point", The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, 1997, p. 514, ISBN 0-9660176-3-3. Retrieved 2015-07-21.
[3] https://en.wikipedia.org/wiki/IEEE_floating_point
[4] https://www.ibm.com/developerworks/library/j-jtp0114/
[5] http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
[6] http://grouper.ieee.org/groups/754/
