Floating Point
Like other languages you’ve used before, C has a float
type that works for numbers with a decimal point in them:
#include <stdio.h>

int main() {
    float n = 8.4f;
    printf("%f\n", n * 5.0f);
    return 0;
}
But how does float
actually work?
How do we represent fractional numbers like this at the level of bits?
The answers have profound implications for the performance and accuracy of any software that does serious numerical computation.
For example, see if you can predict what the last line of this example will print:
#include <stdio.h>

int main() {
    float x = 0.00000001f;
    float y = 0.00000002f;
    printf("x = %e\n", x);
    printf("y = %e\n", y);
    printf("y - x = %e\n", y - x);
    printf("1+x = %e\n", 1.0f + x);
    printf("1+y = %e\n", 1.0f + y);
    printf("(1+y) - (1+x) = %e\n", (1.0f + y) - (1.0f + x));
    return 0;
}
Understanding how float
actually works is the key to avoiding surprising pitfalls like this.
Real Numbers in Binary
Before we get to computer representations, let’s think about binary numbers “on paper.” We’ve seen plenty of integers in binary notation; we can extend the same thinking to numbers with fractional parts.
Let’s return to elementary school again and think about how to read the decimal number 19.64. The digits to the right of the decimal point have place values too: those are the “tenths” and “hundredths” places. So here’s the value that decimal notation represents:
\[ 19.64_{10} = 1 \times 10^1 + 9 \times 10^0 + 6 \times 10^{-1} + 4 \times 10^{-2} \]
Beyond the decimal point, the place values are negative powers of ten. We can use exactly the same strategy in binary notation, with negative powers of two. For example, let’s read the binary number 10.01:
\[ 10.01_2 = 1 \times 2^1 + 0 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} \]
So that’s \( 2 + \frac{1}{4} \), or 2.25 in decimal.
The moral of this section is: binary numbers can have points too! But I suppose you call it the “binary point,” not the “decimal point.”
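If you want to check this kind of arithmetic mechanically, here’s a small sketch that evaluates a string of binary digits the same way we just did by hand, by summing digit × 2^place. (The binary_to_double helper is made up for this illustration; it only handles the digits 0 and 1 and an optional point.)
#include <stdio.h>
#include <math.h>

// Evaluate a string of binary digits (with an optional binary point) by
// summing digit * 2^place, exactly like the place-value expansion above.
double binary_to_double(const char *s) {
    // Find the binary point (if any) so we know each digit's place value.
    int len = 0, point = -1;
    while (s[len]) {
        if (s[len] == '.') point = len;
        len++;
    }
    if (point == -1) point = len;  // no point: it's a plain integer

    double value = 0.0;
    int place = point - 1;  // power of two for the leftmost digit
    for (int i = 0; i < len; i++) {
        if (s[i] == '.') continue;       // the point itself has no place value
        if (s[i] == '1')
            value += ldexp(1.0, place);  // add 2^place for this digit
        place--;
    }
    return value;
}

int main() {
    printf("%f\n", binary_to_double("10.01"));  // prints 2.250000
    printf("%f\n", binary_to_double("0.101"));  // prints 0.625000
    return 0;
}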
Fixed-Point Numbers
Next, computers need a way to encode numbers with binary points in bits. One way, called a fixed-point representation, relies on some sort of bookkeeping on the side to record the position of the binary point. To use fixed-point numbers, you (the programmer) have to decide two things:
- How many bits are we going to use to represent our numbers? Call this bit count \(n\).
- Where will the binary point go? Call this position \(e\) for exponent. By convention, \(e=0\) means the binary point goes at the very end (so it’s just a normal integer), and \(e=-1\) means there is one bit after the binary point.
The idea is that, if you read your \(n\) bits as an integer \(i\), then the number those bits represent is \(i \times 2^{e}\). (This should look a little like scientific notation, where you might be accustomed to writing numbers like \(34.10 \times 10^{-5}\). It’s sort of like that, but with a base of 2 instead of 10.)
For example, let’s decide we’re going to use a fixed-point number system with 4 bits and a binary point right in the middle.
In other words, \(n = 4\) and \(e = -2\).
In this number system, the bit pattern 1001
represents the value \(10.01_2\) or \(2.25_{10}\).
It’s also possible to have positive exponents.
If we pick a number system with \(n = 4\) and \(e = 2\),
then the same bit pattern 1001
represents the value \(1001_2 \times 2^2 = 100100_2\), or \(36_{10}\).
So positive exponents have the effect of tacking \(e\) zeroes onto the end of the binary number.
(Sort of like how, in scientific notation, \(\times 10^e\) tacks \(e\) zeroes onto the end.)
Let’s stick with 4 bits and try it out.
If \(e = -3\), what is the value represented by 1111?
If \(e = 1\), what is the value represented by 0101?
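Since the decoding rule is just \(i \times 2^e\), it’s easy to check fixed-point values by machine too. Here’s a minimal sketch (fixed_to_double is a made-up name, not a standard function) that decodes a bit pattern under whatever exponent you choose; you can adapt it to check your answers to the two questions above:
#include <stdio.h>
#include <math.h>

// Interpret the integer `i` (read from the n bits) as a fixed-point number
// with exponent `e`: the represented value is i * 2^e.
double fixed_to_double(unsigned i, int e) {
    return ldexp((double)i, e);  // ldexp(x, e) computes x * 2^e
}

int main() {
    // The examples from the text: the bit pattern 1001 is the integer 9.
    printf("%f\n", fixed_to_double(0x9, -2));  // 9 * 2^-2 = 2.25
    printf("%f\n", fixed_to_double(0x9, 2));   // 9 * 2^2  = 36.0
    return 0;
}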
The best and worst thing about fixed-point numbers is that the exponent \(e\) is metadata and not part of the actual data that the computer stores. It’s in the eye of the beholder: the same bit pattern can represent many different numbers, depending on the exponent that the programmer has in mind. That means the programmer has to be able to predict the values of \(e\) that they will need for any run of the program.
That’s a serious limitation, and it means that this strategy is not what powers the float
type.
On the other hand, if programs can afford the complexity to deal with this limitation, fixed-point numbers can be extremely efficient—so they’re popular in resource-constrained application domains like machine learning and digital signal processing.
Most software, however, ends up using a different strategy that makes the exponent part of the data itself.
Floating-Point Numbers
The float
type gets its name because, unlike a fixed-point representation, it lets the binary point float around.
It does that by putting the point position right into the value itself.
This way, every float
can have a different \(e\) value, so different float
s can exist on very different scales:
#include <stdio.h>

int main() {
    float n = 34.10f;
    float big = n * 123456789.0f;
    float small = n / 123456789.0f;
    printf("big = %e\nsmall = %e\n", big, small);
    return 0;
}
The %e
format specifier makes printf
use scientific notation, so we can see that these values have very different magnitudes.
The key idea is that every float
actually consists of three separate unsigned integers, packed together into one bit pattern:
- A sign, \(s\), which is a single bit.
- The exponent, an unsigned integer \(e\).
- The significand (also called the mantissa), another unsigned integer \(g\).
Together, a given \(s\), \(e\), and \(g\) represent this number:
\[(-1)^s \times 1.g \times 2^{e-127}\]
…where \(1.g\) is some funky notation we’ll get to in a moment. Let’s break it down into the three terms:
- \((-1)^s\) makes \(s\) work as a sign bit: 0 for positive, 1 for negative. (Yes, floating point numbers use a sign–magnitude strategy: this means that +0.0 and -0.0 are distinct float values!)
- \(1.g\) means “take the bits from \(g\) and put them all after the binary point, with a 1 in the ones place.” The significand is the “main” part of the number, so (in the normal case) it always represents a number between 1.0 and 2.0.
- \(2^{e-127}\) is a scaling term, i.e., it determines where the binary point goes. The \(-127\) in there is a bias: this way, the unsigned exponent value \(e\) can work to represent a wide range of both positive and negative binary-point position choices.
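To see the formula in action, here’s a small sketch that plugs concrete \(s\), \(e\), and \(g\) values into \((-1)^s \times 1.g \times 2^{e-127}\). (The decode_normal function is made up for illustration, and it ignores the special cases we’ll get to later in this chapter.)
#include <stdio.h>
#include <math.h>

// Compute (-1)^s * 1.g * 2^(e - 127) for a "normal" float, given the three
// unsigned fields. (The special cases where e is all zeroes or all ones are
// covered later in the text.)
double decode_normal(unsigned s, unsigned e, unsigned g) {
    // 1.g: put the 23 significand bits after the binary point and add the
    // implicit leading 1.
    double significand = 1.0 + (double)g / (1 << 23);
    // Scale by 2^(e - 127) and apply the sign.
    double value = ldexp(significand, (int)e - 127);
    return s ? -value : value;
}

int main() {
    printf("%f\n", decode_normal(0, 127, 0));        // 1.0 * 2^0  =  1.0
    printf("%f\n", decode_normal(1, 128, 1 << 22));  // -1.5 * 2^1 = -3.0
    return 0;
}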
The float
type is actually an international standard (IEEE 754), universally implemented across programming languages and hardware platforms.
So it behaves the same way regardless of the language you’re programming in and the CPU or GPU you run your code on.
It works by packing the three essential values into 32 bits.
From left to right:
- 1 sign bit
- 8 exponent bits
- 23 significand bits
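Given that layout, you can split any 32-bit pattern into its three fields with shifts and masks. Here’s a sketch that assumes you already have the bits in a uint32_t; we’ll see one way to get the bits out of an actual float a little later:
#include <stdio.h>
#include <stdint.h>

int main() {
    // Any 32-bit pattern will do; this one is just an example value.
    uint32_t bits = 0x41020000;

    // From left to right: 1 sign bit, 8 exponent bits, 23 significand bits.
    unsigned s = bits >> 31;           // top bit
    unsigned e = (bits >> 23) & 0xff;  // next 8 bits
    unsigned g = bits & 0x7fffff;      // bottom 23 bits

    printf("s = %u, e = %u, g = %u\n", s, e, g);  // s = 0, e = 130, g = 131072
    return 0;
}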
To get more of a sense of how float
works at the level of bits, now would be a great time to check out the amazing tool at float.exposed.
You can click the bits to flip them and make any value you want.
Conversion Examples
As an exercise, we can try converting decimal numbers to floating-point representations by hand and using float.exposed to check our work.
Let’s try representing the value 8.25 as a float:
- First, let’s convert it to binary: \(1000.01_2\)
- Next, normalize the number by shifting the binary point and multiplying by \(2^{\text{something}}\): \(1.00001 \times 2^3\)
- Finally, break down the three components of the float:
- \(s = 0\), because it’s a positive number.
- \(g\) is the bit pattern starting with 00001 and then a bunch of zeroes, i.e., we just read the bits after the “1.” in the binary number.
- \(e = 3 + 127\), where the 3 comes from the power of two in our normalized number, and we need to add 127 to account for the bias in the float representation.
Try entering these values (0, 00001000…, and 130) into float.exposed to see if it worked.
It’s easiest to enter the exponent in the little text box and the significand by clicking bits in the bit pattern.
Can you convert -5.125 in the same way?
Checking In with C
To prove that float.exposed agrees with C, we can use a little program that reinterprets the bits it produces as a float
and prints it out:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
    uint32_t bits = 0x41020000;

    // Copy the bits to a variable with a different type.
    float val;
    memcpy(&val, &bits, sizeof(val));

    // Print the bits as a floating-point number.
    printf("%f\n", val);
    return 0;
}
The memcpy
function just copies bits from one location to another.
Don’t worry about the details of how to invoke it yet; we’ll cover that later in 3410.
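Here’s a variation on that program, going the other direction: we start from 8.25f, copy its bits into a uint32_t, and print them in hex so you can compare against the hand conversion above (sign 0, exponent 10000010, significand 00001 followed by zeroes):
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
    // Start from a float and inspect its bits.
    float val = 8.25f;
    uint32_t bits;
    memcpy(&bits, &val, sizeof(bits));

    // Print the raw bit pattern so we can compare it with the hand
    // conversion: sign 0, exponent 10000010, significand 00001000...
    printf("0x%08x\n", (unsigned)bits);  // prints 0x41040000
    return 0;
}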
Special Cases
Annoyingly, we haven’t yet seen the full story for floating-point representations.
The above rules apply to most float
values, but there are a few special cases:
- To represent +0.0 and -0.0, you have to set both \(e = 0\) and \(g = 0\). (That is, use all zeroes for all the bits in both of those ranges.) We need this special case to “override” the significand’s implicit 1 that would otherwise make it impossible to represent zero. And requiring that \(e=0\) ensures that there are only two zero values, not many different zeroes with different exponents.
- When \(e = 0\) but \(g \neq 0\), that’s a denormalized number. The rule is that denormalized numbers represent the value \((-1)^s \times 0.g \times 2^{-126}\). The important difference is that we now use \(0.g\) instead of \(1.g\). These values are useful to eke out the last drops of precision for extremely small numbers.
- When \(e\) is “all ones” and \(g = 0\), that represents infinity. (Yes, we have both +∞ and -∞.)
- When \(e\) is “all ones” and \(g \neq 0\), the value is called “not a number” or NaN for short. NaNs arise to represent the results of erroneous computations.
The rules around infinity and NaN can be a little confusing. For example, dividing zero by zero is NaN, but dividing other numbers by zero is infinity:
#include <stdio.h>

int main() {
    printf("%f\n", 0.0f / 0.0f); // NaN
    printf("%f\n", 5.0f / 0.0f); // infinity
    return 0;
}
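If you need to detect these special values in a program, the standard <math.h> header provides the isnan and isinf macros. One handy consequence of the rules is that NaN is the one float value that isn’t even equal to itself:
#include <stdio.h>
#include <math.h>

int main() {
    float nan_val = 0.0f / 0.0f;  // NaN
    float inf_val = 5.0f / 0.0f;  // infinity

    // <math.h> provides isnan() and isinf() for classifying these values.
    printf("%d %d\n", isnan(nan_val), isinf(inf_val));  // both nonzero (true)

    // NaN compares unequal to everything, including itself.
    printf("%d\n", nan_val == nan_val);  // prints 0
    return 0;
}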
Other Floating-Point Formats
All of this so far has been about one (very popular) floating-point format:
float
, also known as “single precision” or “32-bit float” or just f32.
But there are many other formats that work using the same principles but with different details.
A few to be aware of are:
- double, a.k.a. “double precision” or f64, is a 64-bit format. It offers even more accuracy and dynamic range than 32-bit floats, at the cost of taking up twice as much space. There is still only one sign bit, but you get 11 exponent bits and 52 significand bits.
- Half-precision floating point goes in the other direction: it’s only 16 bits in total (5 exponent bits, 10 significand bits).
- The bfloat16 or “brain floating point” format is a different 16-bit floating-point format that was invented recently specifically for machine learning. It is just a small twist on “normal” half-precision floats that reallocates a few bits from the significand to the exponent (8 exponent bits, 7 significand bits). It turns out that having extra dynamic range, at the cost of precision, is exactly what lots of deep learning models need. So it has very quickly become implemented in lots of hardware.
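To get a feel for what the extra significand bits in a double buy you, here’s a tiny comparison sketch; the exact digits may vary a little by platform, but the float typically goes wrong after about 7 significant decimal digits and the double after about 16:
#include <stdio.h>

int main() {
    // The same fraction stored with 23 vs. 52 significand bits.
    float f = 1.0f / 3.0f;
    double d = 1.0 / 3.0;

    // Print lots of digits to see where each one runs out of precision.
    printf("float:  %.20f\n", f);
    printf("double: %.20f\n", d);
    return 0;
}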
Some General Guidelines
Now that you know how floating-point numbers work, we can justify a few common pieces of advice that programmers often get about using them:
- Floating-point numbers are not real numbers. Expect to accumulate some error when you use them.
- Never use floating-point numbers to represent currency. When people say $123.45, they want that exact number of cents, not $123.40000152. Use an integer number of cents: i.e., a fixed-point representation with a fixed decimal point.
- If you ever end up comparing two floating-point numbers for equality, with f1 == f2, be suspicious. For example, try 0.1 + 0.2 == 0.3 to be disappointed. Consider using an “error tolerance” in comparisons, like abs(f1 - f2) < epsilon, as in the sketch after this list.
- Floating-point arithmetic is slower and costs more energy than integer or fixed-point arithmetic. You get what you pay for: the flexibility of floating-point operations means that they are fundamentally more complex for the hardware to execute. That’s why many practical machine learning systems convert (quantize) models to a fixed-point representation so they can run efficiently.
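Here’s the sketch promised above: a minimal “close enough” comparison using a tolerance. (The nearly_equal helper and the 1e-9 tolerance are made up for illustration; choosing a good tolerance is an application-specific decision, and in C the absolute value of a double is fabs from <math.h>.)
#include <stdio.h>
#include <math.h>

// A hypothetical helper: treat two doubles as "equal enough" if they differ
// by less than some tolerance. Picking a good epsilon is itself an
// application-specific decision; 1e-9 below is just a placeholder.
int nearly_equal(double a, double b, double epsilon) {
    return fabs(a - b) < epsilon;
}

int main() {
    printf("%d\n", 0.1 + 0.2 == 0.3);                    // 0: exact equality fails
    printf("%d\n", nearly_equal(0.1 + 0.2, 0.3, 1e-9));  // 1: tolerance comparison passes
    return 0;
}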
For many more details and much more advice, I recommend “What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg.