1/14/15, 1:52 PM
Example 1
Let's add the following two numbers:
Variable  sign  exponent  fraction
X         0     1001      110
Y         0     0111      000
rewriting Y. This will result in Y no longer being normalized, but its value is equivalent to the normalized
Y.
Add x - y to Y's exponent. Shift the radix point of Y's mantissa (significand) left by x - y to
compensate for the change in exponent.
The difference between the exponents is 2. So, add 2 to Y's exponent, and shift the radix point left by 2. This
results in 0.0100 x 2^2. This is still equivalent to the old value of Y. Call this readjusted value Y'.
3. Add the two mantissas of X and the adjusted Y' together.
We add 1.110_two to 0.01_two. The sum is 10.0_two. The exponent is still the exponent of X, which is 2.
4. If the sum in the previous step does not have a single bit of value 1, left of the radix point, then
adjust the radix point and exponent until it does.
In this case, the sum, 10.0two, has two bits left of the radix point. We need to move the radix point left
by 1, and increase the exponent by 1 to compensate.
This results in: 1.000 x 2^3.
5. Convert back to the one byte floating point representation.
Sum       sign  exponent  fraction
X+Y       0     1010      000
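The five steps above can be sketched in Python for non-negative, normalized operands. This is only a minimal sketch, not a model of real hardware. It assumes a 4-bit exponent stored with a bias of 7 (inferred from the example, where the field 1001 stands for an exponent of 2) and a 3-bit fraction with a hidden leading 1. The names add_positive and to_float are hypothetical, and overflow, underflow, and zero are ignored.

```python
BIAS = 7        # assumption: exponent field 1001 (9) represents 2 in the example
FRAC_BITS = 3   # fraction bits stored in the one-byte format


def to_float(exp_field, frac_field):
    """Decode a non-negative normalized value, for checking results only."""
    return (1 + frac_field / 2 ** FRAC_BITS) * 2.0 ** (exp_field - BIAS)


def add_positive(exp_x, frac_x, exp_y, frac_y):
    """Add two non-negative normalized values given as (exponent, fraction) fields."""
    # Steps 1-2: make X the operand with the larger exponent and find the shift.
    if exp_y > exp_x:
        exp_x, frac_x, exp_y, frac_y = exp_y, frac_y, exp_x, frac_x
    shift = exp_x - exp_y
    # Restore the hidden 1. Scaling X up by `shift` bits is equivalent to
    # moving Y's radix point left by `shift` without losing any bits yet.
    man_x = ((1 << FRAC_BITS) | frac_x) << shift
    man_y = (1 << FRAC_BITS) | frac_y
    # Step 3: add the aligned mantissas.
    total = man_x + man_y
    exp = exp_x
    drop = shift  # low-order bits to truncate in step 5
    # Step 4: if the sum is 10.xxx, move the radix point left by 1.
    if total >> (FRAC_BITS + shift + 1):
        drop += 1
        exp += 1
    # Step 5: truncate to 3 fraction bits and discard the hidden 1.
    frac = (total >> drop) & ((1 << FRAC_BITS) - 1)
    return exp, frac
```

With the fields of X and Y from Example 1, add_positive(0b1001, 0b110, 0b0111, 0b000) returns (0b1010, 0b000), which matches the Sum row.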
Example 2
Let's add the following two numbers:
Variable  sign  exponent  fraction
X         0     1001      110
Y         0     0110      110
Y.
Add x - y to Y's exponent. Shift the radix point of Y's mantissa (significand) left by x - y to
compensate for the change in exponent.
The difference between the exponents is 3. So, add 3 to Y's exponent, and shift the radix point of Y left by 3.
This results in 0.00111 x 2^2. This is still equivalent to the old value of Y. Call this readjusted value Y'.
3. Add the two mantissas of X and the adjusted Y' together.
We add 1.110_two to 0.00111_two. The sum is 1.11111_two. The exponent is still the exponent of X, which
is 2.
4. If the sum in the previous step does not have a single bit of value 1, left of the radix point, then
adjust the radix point and exponent until it does.
In this case, the sum, 1.11111_two, has a single 1 left of the radix point. So, the sum is normalized. We
do not need to adjust anything yet.
So the result is the same as before: 1.11111 x 2^2.
5. Convert back to the one byte floating point representation.
We only have 3 bits to represent the fraction. However, there are 5 fraction bits in our answer. We
should round, and real floating point hardware would do so.
However, for simplicity, we're going to truncate the additional two bits. After truncating, we get 1.111 x
2^2. We convert this back to floating point.
Sum       sign  exponent  fraction
X+Y       0     1001      111
This example illustrates what happens when the exponents are far apart: the low-order bits of the smaller
number are lost. In fact, if the exponents differ by 4 or more, then with truncation you are effectively
adding 0 to the larger of the two numbers.
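The cutoff of 4 can be checked by brute force. The sketch below assumes the 3-bit fraction with a hidden leading 1 used in the examples, and truncation rather than rounding. The helper name aligned_sum_truncated is hypothetical. It aligns Y to X, adds the mantissas, and truncates back to X's 3 fraction bits, returning the mantissa as an integer in units of 2^-3.

```python
FRAC_BITS = 3  # fraction bits in the one-byte format used in the examples


def aligned_sum_truncated(frac_x, frac_y, shift):
    """Add Y (shifted right by `shift` places) to X, then truncate to 3 fraction bits."""
    man_x = ((1 << FRAC_BITS) | frac_x) << shift  # X with extra low bits for Y
    man_y = (1 << FRAC_BITS) | frac_y             # Y with its hidden 1 restored
    return (man_x + man_y) >> shift               # drop the bits below X's fraction


# With a gap of 4 or more, every possible Y leaves X's mantissa unchanged.
unchanged = all(
    aligned_sum_truncated(fx, fy, shift) == ((1 << FRAC_BITS) | fx)
    for fx in range(8) for fy in range(8) for shift in range(4, 9)
)

# With a gap of 3, Y's hidden 1 still lands on X's last fraction bit.
changed = all(
    aligned_sum_truncated(fx, fy, 3) > ((1 << FRAC_BITS) | fx)
    for fx in range(8) for fy in range(8)
)
```

Both unchanged and changed evaluate to True: a shift of 3 still affects the last kept bit, while a shift of 4 pushes all of Y below the bits that survive truncation.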
Negative Values
So far, we've only considered adding two non-negative numbers. What happens with negative values?
If you're doing it on paper, then you proceed with the sum as usual. Just do normal addition or subtraction.
If it's in hardware, you would probably convert the mantissas to two's complement and perform the addition
while keeping track of the radix point (read about fixed point representation).
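As a rough illustration of the signed case, the sketch below uses Python's signed integers in place of two's complement mantissas. It assumes the same one-byte layout as the examples (sign, 4-bit exponent, 3-bit fraction with a hidden 1) and truncates the magnitude. The function name add_signed is hypothetical. Exact cancellation returns None because the encoding of zero is not covered here, and exponent range checks are omitted.

```python
FRAC_BITS = 3  # fraction bits in the one-byte format used in the examples


def add_signed(x, y):
    """Add two normalized values given as (sign, exponent field, fraction field)."""
    (sx, ex, fx), (sy, ey, fy) = x, y
    # Make X the operand with the larger exponent.
    if ey > ex:
        (sx, ex, fx), (sy, ey, fy) = (sy, ey, fy), (sx, ex, fx)
    shift = ex - ey
    mx = ((1 << FRAC_BITS) | fx) << shift   # X's mantissa, widened for alignment
    my = (1 << FRAC_BITS) | fy              # Y's mantissa, aligned to X's exponent
    # Signed addition; Python ints stand in for two's complement registers.
    total = (-mx if sx else mx) + (-my if sy else my)
    if total == 0:
        return None                         # zero encoding is outside this sketch
    sign = 1 if total < 0 else 0
    mag = abs(total)
    point = FRAC_BITS + shift               # number of fraction bits in `mag`
    lead = mag.bit_length() - 1             # position of the leading 1
    exp = ex + (lead - point)               # renormalize left or right as needed
    if lead >= FRAC_BITS:
        frac = (mag >> (lead - FRAC_BITS)) & 0b111   # truncate extra bits
    else:
        frac = (mag << (FRAC_BITS - lead)) & 0b111   # pad with zeros
    return sign, exp, frac
```

For instance, add_signed((0, 0b1001, 0b110), (1, 0b0111, 0b000)) computes 7 + (-1) and returns (0, 0b1001, 0b100), which is 1.100 x 2^2 = 6. Unlike the non-negative case, subtraction can cancel leading bits, so the result may need to be shifted left during renormalization.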
Bias
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BinMath/addFloat.html
Does the bias representation help us in floating point addition? The main difficulty lies in computing the
difference between the exponents. Still, that's not so bad: both exponents carry the same bias, so it cancels
when we do unsigned subtraction on the stored fields. For the most part, the bias doesn't pose too many problems.
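A small check of this point, assuming a bias of 7 (inferred from the examples, where the field 1001 stands for an exponent of 2). The helper names are hypothetical. Subtracting the stored fields directly gives the same shift amount as subtracting the true exponents.

```python
BIAS = 7  # assumption inferred from the worked examples


def true_exponent(field):
    """Decode a 4-bit exponent field such as '1001' into its signed value."""
    return int(field, 2) - BIAS


def shift_amount(field_x, field_y):
    """Exponent difference computed directly on the biased fields."""
    return int(field_x, 2) - int(field_y, 2)


# Example 1: fields 1001 and 0111 differ by 2, with or without the bias.
diff_fields = shift_amount("1001", "0111")
diff_true = true_exponent("1001") - true_exponent("0111")
```

Here diff_fields and diff_true are both 2, because (x + bias) - (y + bias) = x - y.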
Overflow/Underflow
It's possible for a result to overflow (a result that's too large to be represented) or underflow (smaller in
magnitude than the smallest denormal, but not zero). Real hardware has rules to handle this. We won't worry
about it much, except to acknowledge that it can happen.
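One way to at least detect these cases is to check the exponent after renormalizing. The sketch below is an assumption-heavy illustration: it supposes, as in IEEE 754, that the all-zeros and all-ones exponent fields are reserved, so normalized values use fields 1 through 14 of the 4-bit exponent. These notes do not state that, and the function name classify_exponent is hypothetical.

```python
MIN_NORMAL_FIELD = 1    # assumption: field 0000 is reserved (zero and denormals)
MAX_NORMAL_FIELD = 14   # assumption: field 1111 is reserved, as in IEEE 754


def classify_exponent(exp_field):
    """Report whether a renormalized result's exponent field is representable."""
    if exp_field > MAX_NORMAL_FIELD:
        return "overflow"
    if exp_field < MIN_NORMAL_FIELD:
        # Too small for a normalized value; it may still fit as a denormal,
        # or underflow entirely, depending on rules not covered here.
        return "below normal range"
    return "ok"
```

For the sum in Example 1, classify_exponent(0b1010) returns "ok".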
Summary
Adding two floating point values isn't so difficult. It basically consists of adjusting the exponent of the
number with the smaller exponent (call this Y) to match that of the larger (call it X), and shifting the radix
point of Y's mantissa left to compensate.
Once the addition is done, we may have to renormalize and truncate bits if there are too many bits to be
represented.
If the difference between the exponents is too great, then adding X + Y effectively results in X.
Real floating point hardware uses more sophisticated means to round the summed result. We make the
simplification of truncating bits if there are more bits than can be represented.
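To show how truncation and rounding can differ, the sketch below applies both to the sum 1.11111_two from Example 2, held as an integer with 5 fraction bits. The round-half-up rule is only a simple illustrative choice, not the rule real hardware uses (IEEE 754 hardware defaults to round to nearest with ties to even). Both function names are hypothetical.

```python
def truncate(mantissa, extra_bits):
    """Drop the lowest `extra_bits` bits of a non-negative mantissa."""
    return mantissa >> extra_bits


def round_half_up(mantissa, extra_bits):
    """Round to the nearest kept bit, with ties going up (illustrative rule)."""
    if extra_bits == 0:
        return mantissa
    return (mantissa + (1 << (extra_bits - 1))) >> extra_bits


total = 0b111111                         # 1.11111 with 5 fraction bits
truncated = truncate(total, 2)           # 0b1111, that is 1.111
rounded = round_half_up(total, 2)        # 0b10000, that is 10.000
```

Truncation keeps 1.111 x 2^2, while rounding gives 10.000 x 2^2, which would have to be renormalized to 1.000 x 2^3. So rounding can trigger a second normalization step that truncation never needs.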