FLOATING POINT VISUALLY EXPLAINED

HOW FLOATING POINT IS USUALLY EXPLAINED
In the C language, floats are 32-bit containers following the IEEE 754 standard. Their purpose is to store and allow operations on approximations of real numbers. The way I have seen them explained so far is as follows. The 32 bits are divided into three sections:
- 1 bit S for the sign
- 8 bits E for the exponent
- 23 bits M for the mantissa
Figure: the three sections of a floating-point number.
So far, so good. Now, how numbers are interpreted is usually explained with the formula:

\(value = (-1)^{S} \times 1.M \times 2^{(E - 127)}\)

This is how everybody hates floating point to be explained to them.
This is usually where I flip the table. Maybe I am allergic to mathematical notation, but something just doesn’t click when I read it.
Floating-point arithmetic is considered an esoteric subject by many people. - David Goldberg
A DIFFERENT WAY TO EXPLAIN…
Although correct, this way of explaining floating point leaves some of us completely clueless. Fortunately, there is a different way to explain it. Instead of an Exponent, think of a Window between two consecutive powers of two. Instead of a Mantissa, think of an Offset within that window.
The window tells within which two consecutive powers of two the number will fall: [0.5,1], [1,2], [2,4], [4,8] and so on (up to \([2^{127}, 2^{128}]\)).
The offset divides the window into \(2^{23} = 8388608\) buckets. With the window and the offset you can approximate a number. The window is an excellent mechanism to protect against overflow. Once you have reached the maximum in a window (e.g., [2,4]), you can “float” it right and represent the number within the next window (e.g., [4,8]). It only costs a little bit of precision, since the window becomes twice as large.
The next figure illustrates how the number 6.1 would be encoded. The window must start at 4 and span to the next power of two, 8. The offset is about halfway through the window.
PRECISION
How much precision is lost when the window covers a wider range? Let’s take the window [1,2], where the 8388608 offsets cover a range of 1, which gives a precision of \((2 - 1) / 8388608 = 0.00000011920929\). In the window [2048,4096], the 8388608 offsets cover a range of \(4096 - 2048 = 2048\), which gives a precision of \((4096 - 2048) / 8388608 = 0.000244\).
ANOTHER EXAMPLE

Let’s take another example with the detailed calculation of the floating-point representation of a number we all know well: 3.14.
- The number \(3.14\) is positive \(→ S = 0\).
- The number \(3.14\) is between the powers of two 2 and 4, so the floating window must start at \(2^{1} → E = 128\) (see the formula, where the window starts at \(2^{(E−127)}\)).
- Finally, there are \(2^{23}\) offsets available to express where \(3.14\) falls within the interval [2,4]. It lies at \((3.14 - 2) / (4 - 2) = 0.57\) of the way through the interval, which makes the offset \(M = 2^{23} \times 0.57 = 4781507\).
Which in binary translates to:
- S = 0 = 0b
- E = 128 = 10000000b
- M = 4781507 = 10010001111010111000011b
The value \(3.14\) is therefore approximated as \(3.1400001049041748046875\). The corresponding value with the ugly formula:

\((-1)^{0} \times \left(1 + \frac{4781507}{2^{23}}\right) \times 2^{(128 - 127)} = 3.1400001049041748046875\)
And finally the graphic representation with window and offset:
CREDIT

All credit for this brilliant breakdown goes to Fabien Sanglard.