last updated 11/18/01
Floating point numbers are a numeric data type provided by most machines (hardware) and programming languages. Floating point provides a type that can represent very large numbers and very small numbers at the expense of exactness: only a limited number of significant digits is kept.
Integer positional notation can be extended with a radix point followed by more digits to the right, just as we do with decimal numbers.
The following sequence extends the positional notation of an integer to include fractional binary places:
... 2^3 2^2 2^1 2^0 . 2^-1 2^-2 2^-3 ...
So the binary value 1 0 0 1 . 1 0 1 = 8 + 1 + 1/2 + 1/8 = 9.625 (base 10)
What is 1 1 0 1 . 0 1 1 0 in decimal? ___________ (base 10)
In general (any base), the radix point indicates where the ones position is located: the digit immediately to its left has weight base^0 = 1.
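The positional weights can be checked with a few lines of Python. This is only a minimal sketch; the function name binary_to_decimal is made up for illustration and is not part of any library.

```python
def binary_to_decimal(bits):
    """Convert a binary string such as '1001.101' to its decimal value."""
    whole, _, frac = bits.replace(" ", "").partition(".")
    value = 0.0
    # Digits left of the radix point have weights 2^0, 2^1, 2^2, ...
    for position, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** position
    # Digits right of the radix point have weights 2^-1, 2^-2, 2^-3, ...
    for position, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -position
    return value

print(binary_to_decimal("1 0 0 1 . 1 0 1"))  # 9.625
```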
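To convert a decimal number to binary, subtract out decreasing powers of two, writing a 1 for each power that fits and a 0 for each that does not. For example,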
10.4 = 8 + 2 + 1/4 + 1/8 + 1/64 + 1/128 + .....
10.4 = 1 0 1 0 . 0 1 1 0 0 1 1 0 0 1 1...
Notice that a non-repeating (terminating) decimal number may not have a finite representation in binary. This is similar to trying to represent 1/3 in decimal (.3333333333.....).
1/10 and any power-of-2 multiple of 1/10 (2/10, 4/10, 8/10, ...) are such numbers.
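The subtraction method can be sketched in Python as follows. The name to_binary and the frac_bits parameter are assumptions made for this example. Exact fractions are used so that the decimal input is not already rounded before the conversion starts.

```python
from fractions import Fraction

def to_binary(value, frac_bits=12):
    """Binary digits of a non-negative value, keeping frac_bits fractional places."""
    # Find the largest power of two that is not greater than the value.
    power = 0
    while Fraction(2) ** (power + 1) <= value:
        power += 1
    digits = []
    # Walk down through the powers of two, subtracting each one that fits.
    for p in range(power, -frac_bits - 1, -1):
        if p == -1:
            digits.append(".")
        weight = Fraction(2) ** p
        if weight <= value:
            digits.append("1")
            value -= weight
        else:
            digits.append("0")
    return "".join(digits)

print(to_binary(Fraction("10.4")))     # 1010.011001100110
print(to_binary(Fraction(1, 10), 8))   # 0.00011001
```

Asking for more fractional bits only extends the repeating 0110 pattern; the remainder never reaches zero.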
For very large numbers we are really only interested in the first few digits of the number and its magnitude. Here we are using magnitude in the sense of the number of digits in the number. More precisely, we consider the digit position of the most significant (non-zero) digit. The same holds for very small numbers.
In fact most measurements are approximations and therefore carrying precision beyond a few digits is not useful.
Any number can be expressed as a value M, with 1 <= M < 10, multiplied by 10^e, where e represents the location of the decimal point
12345.67 = 1.234567 x 10^4
That is, moving the radix point to the left in the mantissa increases the exponent to counter the decreasing value of the mantissa.
0.009876 = 9.876 x 10^-3
That is, moving the radix point to the right in the mantissa decreases the exponent to counter the increasing value of the mantissa.
We want a standard form to represent numbers in scientific notation: always adjust the mantissa so that there is exactly one nonzero digit to the left of the decimal point, with all other digits to its right. This assures that M, the mantissa, satisfies 1 <= M < 10.
Calculators use this notation for displaying large and small values when digits wouldn't normally fit into the display window.
Instead of normalizing the mantissa to a value between 1 and 10, represent the original number as a fraction between one tenth and 1 or 0.1<= M <1
Notice the role of the base in the normalization process: the fraction range is 1/base <= M < 1. In traditional scientific notation, the base more generally defines the range as 1 <= M < base.
So in binary (base two), using the variation on the normalization process, 1001.101 can be written as .1001101 (base 2) x 2^4
Now writing all parts of the normalized number in base two: .1001101 x 10^100 (every number written in base 2)
0.000111 = .111 x 2^-3
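The binary normalization to 1/2 <= F < 1 can be sketched by moving the radix point one place at a time. The function name normalize is an assumption for this example, and it handles positive values only. Python's standard math.frexp performs the same split and is used here as a cross-check.

```python
import math

def normalize(value):
    """Return (F, e) with value == F * 2**e and 1/2 <= F < 1, for value > 0."""
    fraction, exponent = value, 0
    while fraction >= 1:       # radix point moves left, exponent goes up
        fraction /= 2
        exponent += 1
    while fraction < 0.5:      # radix point moves right, exponent goes down
        fraction *= 2
        exponent -= 1
    return fraction, exponent

print(normalize(9.625))      # (0.6015625, 4)   .1001101 x 2^4
print(normalize(0.109375))   # (0.875, -3)      .111 x 2^-3
print(math.frexp(9.625))     # (0.6015625, 4)
```

Here 0.6015625 is the decimal value of .1001101 in binary, and 0.109375 is the decimal value of 0.000111 in binary.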
Of course with a negative exponent we will need to represent it in something like 2's complement, requiring knowledge of the number of bits used to store the exponent.
Floating point numbers are an approximation of the real number system. As soon as we stop writing or tracking ALL digits, we have an approximation.
We keep a few (roughly 24) bits of the number converted to a fraction which is called the mantissa.
If we have a number of the form .ffffffffffmmmm the m bits or digits are less significant. If they're omitted the fraction is still fairly accurate. The more bits you can retain, the more "accuracy" you will keep in the number.
We store the exponent that tracks the position of the radix point.
We store the sign of the number and the sign of the exponent.
We don't store the base (as often displayed on a calculator).
32 bits are used in the following format
SEEEEEEEEMMM....M
S is the sign bit (0=positive; 1=negative) of the number as stored in the mantissa
The next 8 bits represent the exponent EEEEEEEE, where the first bit acts as the sign bit (1=positive or zero; 0=negative). Notice the inversion compared with the sign bit S. Excess-128 notation is used instead of 2's complement. 8 bits allow only 256 different values to be represented, so the exponent values stored range from -128 to 127.
Excess 128 is found by taking the exponent value and simply adding 128 to it. This results in the exponent looking like an unsigned number. For instance 0 becomes 128, 3 becomes 131, -5 becomes 123. If you look at the most significant bit then 1=positive and 0=negative.
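As a small sketch, the excess-128 encoding of the three exponents above can be printed as 8 bit patterns. The function name to_excess_128 is made up for this example.

```python
def to_excess_128(exponent):
    """Return the 8 bit excess-128 pattern for an exponent in -128..127."""
    if not -128 <= exponent <= 127:
        raise ValueError("exponent does not fit in 8 bits")
    # Adding 128 turns the signed exponent into an unsigned value 0..255.
    return format(exponent + 128, "08b")

for e in (0, 3, -5):
    print(e, "->", e + 128, "->", to_excess_128(e))
# 0 -> 128 -> 10000000
# 3 -> 131 -> 10000011
# -5 -> 123 -> 01111011
```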
Why would such a notation be useful?
The remaining 23 bits are used for the mantissa (fraction).
Notice that every normalized fraction has a one as its first bit unless the mantissa is zero, so that bit is not stored. This allows a 24 bit normalized mantissa to be stored in 23 bits.
9.625
= 1001.101 (binary conversion)
= .1001101 x 2^4 (normalization)
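in 32 bits using the format above = 0 10000100 00110100...000
(Note that IEEE 754 single precision differs slightly: it normalizes to 1.f and uses an excess-127 exponent, so its bit pattern for the same value is not the one shown here.)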
with bits regrouped 0100 0010 0001 1010 0...000 we get the value in hex = 4 2 1 A 0 0 0 0
Try -0.75 = _____________________________
Try 13.8 = ______________________________
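The packing steps can be sketched in Python for checking hand conversions. This follows the format as described in these notes (sign bit, excess-128 exponent, 23 stored fraction bits with the leading 1 dropped). The names encode and decode are made up, and two choices are assumptions of this sketch: extra fraction bits are truncated rather than rounded, and zero is stored as all zero bits.

```python
import math

def encode(value):
    """Pack value as S EEEEEEEE M...M (excess-128 exponent, hidden leading 1)."""
    if value == 0:
        return 0  # assumption: zero is stored as all zero bits
    sign = 1 if value < 0 else 0
    fraction, exponent = math.frexp(abs(value))  # 1/2 <= fraction < 1
    if not -128 <= exponent <= 127:
        raise OverflowError("exponent does not fit in 8 bits")
    # Keep 24 fraction bits, truncating the rest (assumption: no rounding).
    mantissa24 = int(fraction * (1 << 24))
    stored = mantissa24 & 0x7FFFFF  # drop the leading 1, keep 23 bits
    return (sign << 31) | ((exponent + 128) << 23) | stored

def decode(word):
    """Unpack a 32 bit word produced by encode back into a float."""
    if word == 0:
        return 0.0
    sign = -1.0 if word >> 31 else 1.0
    exponent = ((word >> 23) & 0xFF) - 128
    mantissa24 = (word & 0x7FFFFF) | 0x800000  # restore the hidden 1 bit
    return sign * math.ldexp(mantissa24 / (1 << 24), exponent)

print(format(encode(9.625), "08X"))  # 421A0000
print(decode(0x421A0000))            # 9.625
```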
Consider the 24 bit fraction of a floating point number. Recall that 3-4 bits (3.322 to be precise) are needed for each decimal digit. So 24 bits ~= 6-7 decimal digits.
Consider the 8 bit exponent. Its range is -128 to 127, and placed into the base we get 2^-128 to 2^127. Converting to decimal this translates to roughly 10^+/-38.
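These estimates can be reproduced with logarithms. The sketch below computes the bits needed per decimal digit, the decimal digits carried by a 24 bit fraction, and the decimal exponent at each end of the 2^-128 to 2^127 range. The variable names are made up for this example.

```python
import math

bits_per_digit = math.log2(10)               # bits needed per decimal digit
digits_in_24_bits = 24 / bits_per_digit      # decimal digits in a 24 bit fraction
largest_decimal_exp = 127 * math.log10(2)    # 2^127  = 10^largest_decimal_exp
smallest_decimal_exp = -128 * math.log10(2)  # 2^-128 = 10^smallest_decimal_exp

print(round(bits_per_digit, 3))        # 3.322
print(round(digits_in_24_bits, 2))     # 7.22
print(round(largest_decimal_exp, 2))   # 38.23
print(round(smallest_decimal_exp, 2))  # -38.53
```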
This is useful when you divide by zero or one of those very small numbers. It can be used for other situations.
Consider in general the binary operation M1 op M2
M1 = F1 * 2^e1
M2 = F2 * 2^e2
M1 * M2 = (F1 * 2^e1) * (F2 * 2^e2) = (F1 * F2) * 2^(e1+e2)
M1 / M2 = (F1 * 2^e1) / (F2 * 2^e2) = (F1 / F2) * 2^(e1-e2)
Addition and subtraction are more difficult in the hardware because they require that e1 = e2: the operand with the smaller exponent must first be shifted so that the radix points line up.
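The rules above can be sketched on (F, e) pairs in Python. The function names are made up for this example, and the results are deliberately left unnormalized so the exponent arithmetic stays visible; real hardware would renormalize and round afterwards.

```python
def multiply(f1, e1, f2, e2):
    # Multiply the fractions, add the exponents.
    return f1 * f2, e1 + e2

def divide(f1, e1, f2, e2):
    # Divide the fractions, subtract the exponents.
    return f1 / f2, e1 - e2

def add(f1, e1, f2, e2):
    # Make the first operand the one with the larger exponent.
    if e1 < e2:
        f1, e1, f2, e2 = f2, e2, f1, e1
    # Align: shift the smaller operand right until both exponents match.
    f2_aligned = f2 / 2 ** (e1 - e2)
    return f1 + f2_aligned, e1

def value(f, e):
    return f * 2 ** e

# 9.625 = 0.6015625 x 2^4 and 0.75 = 0.75 x 2^0
print(multiply(0.6015625, 4, 0.75, 0))    # (0.451171875, 4)
print(value(0.451171875, 4))              # 7.21875
print(add(0.75, 0, 0.6015625, 4))         # (0.6484375, 4)
print(value(0.6484375, 4))                # 10.375
```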
CLRF dest
ADDF src,dest
SUBF src,dest
MULF src,dest
DIVF src,dest
MNEGF src,dest
CMPF src,dest
TSTF dest
CVTLF src,dest (convert to float)
CVTFL src,dest (trunc)
CVTRL src,dest (round)