int and float data types

Pawan Aurora
September 19, 2016
In the C programming language one declares a variable to be of a particular type. The type of a variable determines (among other things) the
amount of memory that is reserved for storing a value that is assigned to
that variable. The amount of memory determines the range of values that
can be assigned to that variable. Hence it is important to know what values
are permitted and more importantly what values are NOT permitted for a
given data type.
In this article we will study the int and float data types via some
interesting examples where an uninformed programmer may get unexpected
output.
The memory of a computer is measured in bits, short for binary
digits. Recall that a computer works in the base 2 (binary) number system,
where every digit is either 0 or 1. In our day-to-day activities we use the
base 10 (decimal) number system, and we expect the computer to process
our decimal input and give decimal output. The computer tries its best
to do so, but some precision gets lost in translation. For example, 0.1 can be
written exactly with finitely many digits in the decimal system but not in binary.
The int data type typically reserves 32 bits of memory for storing the value
of a variable that is declared of type int. Since each bit can be either 0 or
1, that gives a total of $2^{32}$ possibilities. Since the int type also allows negative
values, half the space must go to storing negative numbers. Since zero is
neither positive nor negative, we have $-2^{31}$ as the smallest value that can be
stored in a variable of type int and $2^{31} - 1$ as the largest value. Note the
$-1$ to make space for 0. These values are defined in the header limits.h as the
constants INT_MIN and INT_MAX, respectively.
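If during the execution of a C program, a variable of type int gets assigned
a value outside the allowed range, the user may get unexpected output. For
example, the result of multiplying 123450 by 67890 is 8381020500. However,
the output of the program
int a = 123450;
int b = 67890;
int c = a*b;
printf("%d",c);
is typically -208914092. The true product does not fit in 32 bits, so only its
low 32 bits survive: $8381020500 - 2 \times 2^{32} = -208914092$. Strictly speaking,
signed integer overflow is undefined behavior in C, so even this wrapped
value is not guaranteed.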
The good thing about int type is that any value in the allowed range
can be precisely stored and retrieved. This is not true with the float type
discussed next.
The float data type typically reserves 32 bits of memory for storing
the value of a variable that is declared of type float. Since each bit can
be either 0 or 1, that gives a total of $2^{32}$ possibilities, which is the same as
that for the int data type. But the float data type is meant to store real
numbers which are uncountably infinite. So it is impossible to represent all
possible real values using the float data type.
The basic idea for encoding floating point numbers is the same as used
in scientific notation, where a mantissa is multiplied by ten raised to some
exponent. For instance, in $5.4321 \times 10^{6}$, 5.4321 is the mantissa and 6
is the exponent. Scientific notation is exceptional at representing very large
and very small numbers, for example $1.2 \times 10^{50}$, the number of atoms in
the earth, or $1.66 \times 10^{-27}$ kg, the mass of one atomic mass unit (a.m.u.).
Notice that numbers represented in scientific notation are normalized so that
there is only a single nonzero digit left of the decimal point. This is achieved
by adjusting the exponent as needed. Observe that although scientific
notation allows one to succinctly represent very large or very small quantities, it
sacrifices precision as a result. For example, we need just the four digits 1, 2, 5 and 0 to
represent the very large quantity $1.2 \times 10^{50}$ (the base 10 is implicitly understood),
but stating the exact value could take 51 digits.
Floating point representation is similar to scientific notation, except that
everything is carried out in base two rather than base ten. The most common
format for representing floating point numbers is given by the ANSI/IEEE
Std. 754-1985. This standard defines the format for 32 bit numbers, called
single precision, as well as 64 bit numbers, called double precision. The 32
bits used in single precision are divided into three separate groups: bits 0
through 22 form the mantissa, bits 23 through 30 form the exponent, and
bit 31 (the leftmost bit) is the sign bit. These bits represent a floating point
number given as $(-1)^{S} \times M \times 2^{E-127}$, where S stands for the sign bit, E is a
number between 0 and 255 represented by the eight exponent bits and M is
the mantissa, formed from the 23 bits as a binary fraction. Subtracting 127
from the value of E allows the exponent term to run from $-127$ to 128.
Let us consider the decimal fraction 3.1416. This is the same as 3 + 1/10 +
4/100 + 1/1000 + 6/10000. Similarly, the binary fraction 1.101010 is equivalent
to 1 + 1/2 + 0/4 + 1/8 + 0/16 + 1/32 + 0/64. Note that in scientific
notation the single digit to the left of the decimal point can take 9 nonzero
values. However, in the floating point binary representation the only nonzero
value is 1. This leading 1 therefore need not be stored, and all the 23 bits
of the mantissa can be used to represent the fractional part.
Now let us see what range of values can be represented using
the method described above. The smallest positive value will have all the 31 bits as
zero, thus forming the number $1.0 \times 2^{-127}$, which is about $5.9 \times 10^{-39}$.
Likewise the largest value will have all the 31 bits as one, thus forming
the number $(1 + y) \times 2^{128}$, where y is the sum of the geometric
series $1/2 + 1/2^{2} + 1/2^{3} + \ldots + 1/2^{23} = 1 - 2^{-23}$. That reduces to
$(2 - 2^{-23}) \times 2^{128}$, or about $6.8 \times 10^{38}$. (The actual standard reserves
E = 0 and E = 255 for special values such as zero, infinity and NaN, so the
normalized range is slightly narrower: from $2^{-126} \approx 1.2 \times 10^{-38}$ to
$(2 - 2^{-23}) \times 2^{127} \approx 3.4 \times 10^{38}$.) Clearly, using a 32 bit floating point
variable allows one to represent numbers that are much larger and much
smaller in magnitude than those allowed by a 32 bit integer variable. However,
what we have gained in the range of values we must lose in precision,
since we can represent at most $2^{32}$ unique
values using a single precision floating point variable.
As a first example, let us try to store an integer value within the range
of INT_MIN and INT_MAX in the place reserved for a variable of type float.
float y = 100000009;
printf("Value of y is %f", y);
The above program outputs a value of 100000008.000000, which is probably not what
you expected. As an integer, one can precisely store this value since it lies
within the allowed range. In fact, the binary representation of 100000009 is
101111101011110000100001001, which requires only 27 bits when stored in a
variable of type int. The remaining 5 leftmost bits are set to 0. Let us see
how we can store this value in a float data type using the method described
above. Since the number is positive, we can set the S value to 0. Note that we
can write 101111101011110000100001001 as
$1.01111101011110000100001001 \times 2^{26}$, which means that we could store this
number precisely if we were allowed 26 bits of mantissa. The exponent of 26
can easily be represented in the allowed 8 bits. However, a float data type allows only
23 bits of mantissa, which implies that only 23 of the 26 bits can be stored.
In such a scenario the best approximation of the given number is saved in
the memory reserved for the float variable. Such a number is one that is
closest to the given number and can be represented using the allowed
number of bits. Let us consider two numbers, one that is the largest number
smaller than the given number and the other that is the smallest number
larger than the given number, such that both these numbers can be
represented using 23 bits of mantissa and 8 bits of exponent. Clearly, one of these
numbers is the best approximation to the given number. In this case the
largest number smaller than the given number corresponds to a mantissa of
1.01111101011110000100001 and the smallest number larger than the given
number corresponds to a mantissa of 1.01111101011110000100010. So the
two numbers are 100000008 and 100000016. Clearly, 100000008 is the closer
of the two and hence the best approximation.
Next we look at another interesting example where the value stored in a
variable of type float does not change as expected.
float x = INT_MAX;
printf("Value of x is %f\n", x);
printf("Value of x-64 is %f\n", x-64);
printf("Value of x-65 is %f", x-65);
The above program produces the following output:
Value of x is 2147483648.000000
Value of x-64 is 2147483648.000000
Value of x-65 is 2147483520.000000
We have seen that INT_MAX corresponds to a value of $2^{31} - 1$, which in
binary is 1111111111111111111111111111111, or 31 ones. This can also be
written as $1.111111111111111111111111111111 \times 2^{30}$, which has a mantissa of
30 bits. Since this cannot fit in the allowed 23 bits, an approximate value
is stored instead. Using the same idea as above, the closest value turns
out to be one more than the given value. So the variable x actually stores
the value $1.0 \times 2^{31}$, which is 2147483648. One would then expect x-64 to
be 2147483584, but the output remains unchanged as 2147483648. This
again we can explain using the ideas discussed above. 2147483584 in binary is
1111111111111111111111111000000, which is equal to
$1.111111111111111111111111 \times 2^{30}$. This needs 24 bits of mantissa and lies
exactly halfway between the two nearest representable values,
$1.11111111111111111111111 \times 2^{30}$ (which is 2147483520) and $1.0 \times 2^{31}$.
Under the default IEEE 754 rounding mode such a tie goes to the value whose
last mantissa bit is 0, so the result remains
$1.0 \times 2^{31}$. However, when we subtract 65 from 2147483648 we get 2147483583,
which in binary is 1111111111111111111111110111111 or
$1.111111111111111111111110111111 \times 2^{30}$, which comes closer to
$1.11111111111111111111111 \times 2^{30}$ than to $1.0 \times 2^{31}$. Alternatively, one could
look at the gap between 2147483648 and the largest number smaller than
it that can be represented as a single precision floating point number. This
gap turns out to be 128 (note that it is the gap between 2147483648 and
2147483520). If the amount subtracted from 2147483648 is less than half
of this gap, the result is closer to the larger number; if it is more than half,
the result is closer to the smaller number.
