FLOATING POINT VISUALLY EXPLAINED

HOW FLOATING POINT IS USUALLY EXPLAINED
In the C language, floats are 32-bit containers following the IEEE 754 standard. Their purpose is to store and allow operations on approximations of real numbers. The way I have seen them explained so far is as follows. The 32 bits are divided into three sections:
- 1 bit S for the sign
- 8 bits E for the exponent
- 23 bits M for the mantissa
Figure: the three sections of a floating-point number.
So far, so good. Now, how numbers are interpreted is usually explained with the formula:

\(value = (-1)^{S} \times 1.M \times 2^{(E - 127)}\)

This is how everybody hates floating point to be explained to them.
This is usually where I flip the table. Maybe I am allergic to mathematical notation, but something just doesn’t click when I read it.
Floating-point arithmetic is considered an esoteric subject by many people. - David Goldberg
A DIFFERENT WAY TO EXPLAIN…
Although correct, this way of explaining floating point leaves some of us completely clueless. Fortunately, there is a different way to explain it. Instead of an Exponent, think of a Window between two consecutive powers of two. Instead of a Mantissa, think of an Offset within that window.
The window tells within which two consecutive powers of two the number will fall: [0.5,1], [1,2], [2,4], [4,8] and so on (up to \([2^{127}, 2^{128}]\)).
The offset divides the window into \(2^{23} = 8388608\) buckets. With the window and the offset you can approximate a number. The window is an excellent mechanism to protect against overflow. Once you have reached the maximum in a window (e.g., [2,4]), you can “float” it right and represent the number within the next window (e.g., [4,8]). It only costs a little bit of precision, since the window becomes twice as large.
The next figure illustrates how the number 6.1 would be encoded. The window must start at 4 and span to the next power of two, 8. The offset is about halfway through the window.
PRECISION
How much precision is lost when the window covers a wider range? Let’s take the window [1,2], where the 8388608 offsets cover a range of 1, which gives a precision of \((2 - 1) / 8388608 = 0.00000011920929\). In the window [2048,4096], the 8388608 offsets cover a range of \(4096 - 2048 = 2048\), which gives a precision of \((4096 - 2048) / 8388608 = 0.000244\).
ANOTHER EXAMPLE

Let’s take another example with the detailed calculation of the floating-point representation of a number we all know well: 3.14.
- The number \(3.14\) is positive \(→ S = 0\).
- The number \(3.14\) is between the powers of two 2 and 4, so the floating window must start at \(2^{1} → E = 128\) (see the formula, where the window starts at \(2^{(E−127)}\)).
- Finally, there are \(2^{23}\) offsets available to express where \(3.14\) falls within the interval [2,4]. It lies at \((3.14 - 2) / (4 - 2) = 0.57\) of the way through the interval, which makes the offset \(M = 2^{23} \times 0.57 = 4781507\).
Which in binary translates to:
- S = 0 = 0b
- E = 128 = 10000000b
- M = 4781507 = 10010001111010111000011b
The value \(3.14\) is therefore approximated as \(3.1400001049041748046875\). The corresponding value with the ugly formula:

\((-1)^{0} \times \left(1 + \frac{4781507}{2^{23}}\right) \times 2^{(128 - 127)} = 3.1400001049041748046875\)
And finally the graphic representation with window and offset:
CREDIT

All credit for this brilliant breakdown goes to Fabien Sanglard.