Real Number Representation and Floating Point Arithmetic

Lecture 10

Number Representation
Integer: The bit pattern represents the numeric value exactly.
Real: The bit pattern is an approximation of the numeric value.
Fixed point: m bits for the integer part and n bits for the fractional part, with the point in a fixed position.
Floating point: m+n bits to use wherever they are needed. The point moves left or right according to an exponent value.
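To make the fixed point idea concrete, here is a minimal sketch in Python. It assumes the value is stored as an integer scaled by 2**frac_bits; the helper names to_fixed and from_fixed are illustrative and do not come from the lecture.

```python
def to_fixed(x, frac_bits):
    """Encode x as the integer nearest to x * 2**frac_bits."""
    return round(x * (1 << frac_bits))


def from_fixed(raw, frac_bits):
    """Decode the stored integer back into a real value."""
    return raw / (1 << frac_bits)


raw = to_fixed(5.75, 4)        # 5.75 is 101.1100 in binary
print(bin(raw))                # 0b1011100
print(from_fixed(raw, 4))      # 5.75

# With only 4 fractional bits, 0.1 is stored as the nearest multiple of 1/16.
print(from_fixed(to_fixed(0.1, 4), 4))   # 0.125
```

The point never moves here: every value uses the same 4 fractional bits, whether it needs them or not.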
Real Numbers in Binary

Fractions with powers of two in the denominator are easy:
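$\frac{1}{2} = \frac{1}{2^1} = 0.1_2$
$\frac{1}{4} = \frac{1}{2^2} = 0.01_2$
$\frac{1}{8} = \frac{1}{2^3} = 0.001_2$
$\frac{1}{16} = \frac{1}{2^4} = 0.0001_2$
$\frac{3}{4} = \frac{1}{2} + \frac{1}{4} = 0.11_2$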
In fact, to represent a real number with a fractional part in binary, we need to approximate the fractional part by a sum of fractions with powers of two in the denominator.
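The sum of power-of-two fractions can be checked with a short Python sketch. The function name binary_fraction_value is an illustrative choice; it uses exact rational arithmetic from the standard fractions module.

```python
from fractions import Fraction


def binary_fraction_value(bits):
    """Value of the binary fraction 0.<bits>; bit i contributes bit / 2**i."""
    return sum(Fraction(int(b), 2 ** i) for i, b in enumerate(bits, start=1))


print(binary_fraction_value("11"))     # 3/4
print(binary_fraction_value("0001"))   # 1/16
```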
Reals: From Decimal to Binary

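Example: Convert $0.09375_{10}$ to binary.
Integer conversion called for successive divisions by 2; reals with a fractional part call for a method based on successive multiplications by 2. The integer part produced by each multiplication is the next bit of the result.
$0.09375 \times 2 = 0.1875$, bit 0: most significant bit (right after the binary point)
$0.1875 \times 2 = 0.375$, bit 0
$0.375 \times 2 = 0.75$, bit 0
$0.75 \times 2 = 1.5$, bit 1
$0.5 \times 2 = 1.0$, bit 1: least significant bit
$0.09375_{10} = 0.00011_2$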
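The multiply-by-2 method can be sketched in Python as follows. The function name fraction_to_binary and the max_bits cutoff are assumptions made for this illustration; the decimal value is passed as a string so that Fraction holds it exactly.

```python
from fractions import Fraction


def fraction_to_binary(x, max_bits=32):
    """Convert a decimal fraction 0 < x < 1 (given as a string) to binary."""
    frac = Fraction(x)
    bits = []
    while frac and len(bits) < max_bits:
        frac *= 2
        bit = int(frac >= 1)     # the integer part is the next bit
        bits.append(str(bit))
        frac -= bit              # keep only the fractional part
    return "0." + "".join(bits)


print(fraction_to_binary("0.09375"))   # 0.00011
```

The loop stops as soon as the remaining fraction is zero, or after max_bits bits if the expansion does not terminate.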
First Problem
A number accurately represented in base 10 by a fixed number of digits may lead to a binary representation that requires an infinite number of digits (non-terminating fraction).
$0.1 \times 2 = 0.2$
$0.2 \times 2 = 0.4$
$0.4 \times 2 = 0.8$
$0.8 \times 2 = 1.6$
$0.6 \times 2 = 1.2$
$0.2 \times 2 = 0.4$
$0.4 \times 2 = 0.8$
$0.8 \times 2 = 1.6$
$0.1_{10} = 0.00011\ldots_2$
Since the space to store the number will always be limited, what we end up with is an approximation.
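This approximation is easy to observe in Python, whose float type is an IEEE 754 double on common platforms. The sketch below uses the standard decimal module to print the value that is actually stored when 0.1 is written; the variable name stored is illustrative.

```python
from decimal import Decimal

stored = Decimal(0.1)      # exact decimal expansion of the double nearest to 0.1
print(stored)              # slightly more than 0.1
print(stored == Decimal("0.1"))   # False

# The approximation errors surface in ordinary arithmetic.
print(0.1 + 0.2 == 0.3)    # False
```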
Second Problem
A finite number of digits leads to finite precision. The error in the representation is bounded by the distance between two adjacent representable numbers. Moreover, there is an infinite number of real values between, say, 0 and 1, while only a finite number of binary values is representable with n bits.
[Figure: number line with ticks at powers of two (1/16, 1/8, 1/4, 1/2, 1), showing the gap between neighbouring representable values.]
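For floating point numbers, the gap between neighbouring representable values depends on the magnitude. A small sketch, assuming Python 3.9 or later (for math.ulp) and IEEE 754 doubles; the name gaps is illustrative.

```python
import math

# Distance from x to the next larger representable double.
gaps = {x: math.ulp(x) for x in (1 / 16, 1 / 8, 1 / 4, 1 / 2, 1.0)}

for x, gap in gaps.items():
    print(f"x = {x:<7} gap = {gap}")
```

Each time x doubles, the gap doubles as well, so the relative precision stays the same while the absolute precision decreases.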
Floating Point Representation

X = | S (1 bit) | E (exponent) | M (mantissa) |
$X = (-1)^S \times M \times 2^E$
The sign of the exponent goes into its representation. The exponent defines the range of the number.
The mantissa defines the precision of the number. The exponent can be adjusted so that all the information contained in X ends up in the mantissa bits.
Normalization: $0.0000010111_2 = 1.0111_2 \times 2^{-6}$

Why use a bit to represent a value which is always known? After normalization, the bit to the left of the binary point is always 1.
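The normalization example can be reproduced with a short Python sketch. The helper normalize is an illustrative name; it relies on math.frexp, which returns a mantissa in [0.5, 1), and rescales it to the 1.xxx form used here. It assumes x > 0.

```python
import math


def normalize(x):
    """Return (m, e) with x == m * 2**e and 1 <= m < 2, for x > 0."""
    m, e = math.frexp(x)       # frexp gives 0.5 <= m < 1
    return m * 2, e - 1


x = 0b10111 / 2 ** 10          # 0.0000010111 in binary
print(normalize(x))            # (1.4375, -6); 1.4375 is 1.0111 in binary
```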
The IEEE 754 Standard (1985)

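single = | S: 1 bit | E (exponent): 8 bits | M (mantissa): 23 bits |
double = | S: 1 bit | E (exponent): 11 bits | M (mantissa): 52 bits |
The exponent is represented in excess-B notation, with B = 127 for single and B = 1023 for double precision. The leading 1 of the normalized mantissa is not stored.
$X = (-1)^S \times (1 + M) \times 2^{E - B}$
Here E is the stored (biased) exponent and M is the stored mantissa read as a binary fraction.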
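The field layout can be inspected with Python's standard struct module. The sketch below handles normalized single precision values only (stored exponent neither all zeros nor all ones); the function name decode_single is an assumption for this illustration.

```python
import struct


def decode_single(x):
    """Split x, stored as an IEEE 754 single, into S, E, M and rebuild it."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                 # 1 sign bit
    e = (bits >> 23) & 0xFF        # 8 exponent bits, excess 127
    m = bits & 0x7FFFFF            # 23 mantissa bits
    value = (-1) ** s * (1 + m / 2 ** 23) * 2.0 ** (e - 127)
    return s, e, m, value


print(decode_single(0.09375))      # (0, 123, 4194304, 0.09375)
print(decode_single(-6.5))         # (1, 129, 5242880, -6.5)
```

For 0.09375 the stored exponent is 123, that is 123 - 127 = -4, and the mantissa field 4194304 / 2**23 = 0.5, giving 1.5 x 2^-4.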

Special Cases
Zero:     0 00000000 00000000000000000000000
          1 00000000 00000000000000000000000
Infinity: 0 11111111 00000000000000000000000
          1 11111111 00000000000000000000000
NaN:      0 11111111 non-zero mantissa
          1 11111111 non-zero mantissa
Infinity results from operations such as division by 0; NaN results from invalid operations such as sqrt(-1).
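These bit patterns can be printed with a short Python sketch using the standard struct module. The helper single_bits is an illustrative name. Which non-zero mantissa a NaN carries depends on the platform, so it is not shown as a fixed value.

```python
import math
import struct


def single_bits(x):
    """Format x as 'S EEEEEEEE MMM...M' (single precision fields)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"


for x in (0.0, -0.0, math.inf, -math.inf, math.nan):
    print(f"{x!r:>5}: {single_bits(x)}")
```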
Floating Point Addition
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent matches the larger exponent.
2. Add the significands.
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent.
   Overflow or underflow? If yes, raise an exception.
4. Round the significand to the appropriate number of bits.
   Still normalized? If no, go back to step 3. If yes, done.
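The addition steps can be sketched on a toy format in Python. This is a simplified illustration, not the hardware algorithm: it assumes positive operands only (so normalization only ever shifts right), an unbounded exponent (so the overflow check is omitted), truncation during alignment, and round-half-up instead of the IEEE rounding modes. The names fp_add and value and the 4-bit significand are choices made for this sketch.

```python
def fp_add(a, b, precision=4):
    """Add two positive toy floats given as (significand, exponent) pairs.

    A pair stands for significand * 2**exponent; a normalized significand
    has exactly `precision` bits.
    """
    (sig_a, exp_a), (sig_b, exp_b) = a, b

    # Step 1: make `a` the operand with the larger exponent, then shift
    # the smaller operand right until the exponents match.
    if exp_a < exp_b:
        (sig_a, exp_a), (sig_b, exp_b) = (sig_b, exp_b), (sig_a, exp_a)
    sig_b >>= exp_a - exp_b        # shifted-out bits are simply dropped

    # Step 2: add the significands.
    sig, exp = sig_a + sig_b, exp_a

    # Steps 3 and 4: while the sum has too many bits, shift right,
    # rounding half up on the dropped bit, and increment the exponent.
    while sig >= 1 << precision:
        sig = (sig + 1) >> 1
        exp += 1
    return sig, exp


def value(pair):
    """Real value of a (significand, exponent) pair."""
    sig, exp = pair
    return sig * 2.0 ** exp


print(fp_add((0b1000, 0), (0b1000, -1)))   # (12, 0): 8 + 4 = 12
print(fp_add((0b1111, 0), (0b1111, 0)))    # (15, 1): 15 + 15 = 30
```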
[Figure: block diagram of the floating point adder. A small ALU compares the exponents and produces the exponent difference; control logic uses it to shift the significand of the smaller number right; the big ALU adds the significands; the sum is normalized (shift left or right, with the exponent incremented or decremented) and passed to the rounding hardware, which yields the sign, exponent, and significand of the result.]
Floating Point Multiplication
Start
1. Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent.
2. Multiply the significands.
3. Normalize the product if necessary, shifting it right and incrementing the exponent.
   Overflow or underflow? If yes, raise an exception.
4. Round the significand to the appropriate number of bits.
   Still normalized? If no, go back to step 3.
5. Set the sign of the product to positive if the signs of the original operands are the same; if they differ, make the sign negative.
Done
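The multiplication steps can be sketched on a toy format in Python. The assumptions are loud ones: a 4-bit significand (including the leading 1), an excess-127 exponent limited to the single precision range, round-half-up instead of the IEEE rounding modes, and a single range check at the end rather than after each normalization. The names fp_mul, to_float, BIAS, and P are illustrative.

```python
BIAS = 127      # excess-127 exponent, as in single precision
P = 4           # significand bits including the leading 1 (toy assumption)


def fp_mul(a, b):
    """Multiply two toy floats given as (sign, biased_exponent, significand).

    The significand is a P-bit integer standing for significand / 2**(P-1),
    so a normalized value lies in [1, 2).
    """
    (sign_a, exp_a, sig_a), (sign_b, exp_b, sig_b) = a, b

    # Step 1: add the biased exponents and subtract the bias once.
    exp = exp_a + exp_b - BIAS

    # Step 2: multiply the significands; the product stands for a value in [1, 4).
    prod = sig_a * sig_b
    drop = P - 1                   # low-order bits to discard when rounding

    # Step 3: if the product is 2 or more, shift right by one more bit
    # and increment the exponent.
    if prod >> (2 * P - 1):
        drop += 1
        exp += 1

    # Step 4: round half up to P bits; renormalize if rounding carried out.
    sig = (prod + (1 << (drop - 1))) >> drop
    if sig >> P:
        sig >>= 1
        exp += 1

    # Overflow or underflow? (checked once, after all exponent updates)
    if exp >= 255:
        raise OverflowError("exponent overflow")
    if exp <= 0:
        raise ArithmeticError("exponent underflow")

    # Step 5: the sign is positive only if both operand signs agree.
    return sign_a ^ sign_b, exp, sig


def to_float(x):
    """Real value of a (sign, biased_exponent, significand) triple."""
    sign, exp, sig = x
    return (-1) ** sign * (sig / 2 ** (P - 1)) * 2.0 ** (exp - BIAS)


print(to_float(fp_mul((0, 127, 0b1100), (0, 127, 0b1100))))   # 1.5 * 1.5 = 2.25
print(to_float(fp_mul((1, 127, 0b1000), (0, 127, 0b1100))))   # -1.0 * 1.5 = -1.5
```

The product 1.125 x 1.75 = 1.96875 exercises the "still normalized?" branch: rounding to 4 bits carries out, so the result is renormalized to 1.000 x 2^1.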
