Floating point
2026-02-02
Base \(b\) notation for numbers:
Decimal and binary are base 10 and base 2, respectively.
Scientific notation works in any base: \[ m \times b^e, \quad 1 \leq m < b \]
Consider an example decimal fraction: \[ \frac{1}{3} = 0.\overline{3} = 3.333\ldots \times 10^{-1} \] The equivalent binary fraction is \[ \frac{(1)_2}{(11)_2} = (0.\overline{01})_2 = (1.0101\ldots)_2 \times 2^{-2} \]
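The normalized form can be checked numerically. This is a minimal Julia sketch: `significand`, `exponent`, and `ldexp` are standard Julia functions, while `binary_digits` is a hypothetical helper written only for this illustration.

```julia
# Normalized binary scientific notation: x = m * 2^e with 1 <= m < 2
x = 1 / 3
m, e = significand(x), exponent(x)
println("m = ", m, ", e = ", e)   # e is -2, matching (1.0101...)_2 * 2^-2

# Hypothetical helper: first n binary digits after the point of 0 <= r < 1,
# computed with exact rational arithmetic (no rounding involved).
function binary_digits(r::Rational, n::Integer)
    ds = Int[]
    for _ in 1:n
        r *= 2
        d = floor(Int, r)   # next digit is the integer part
        push!(ds, d)
        r -= d              # keep only the fractional part
    end
    return ds
end

println(binary_digits(1 // 3, 8))   # [0, 1, 0, 1, 0, 1, 0, 1]
```

Using `Rational` inputs keeps the digit extraction exact, so the repeating pattern is not disturbed by rounding.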
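Consider \(r = p/q\) with \(p,q\) relatively prime. The base \(b\) expansion of \(r\) terminates only if every prime factor of \(q\) divides \(b\).
For example, \(1/5\) is \(0.2\) in decimal, but in binary it is the repeating fraction \[ \frac{(1)_2}{(101)_2} = (0.\overline{0011})_2. \] This mismatch sometimes causes great confusion.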
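A fraction \(p/q\) in lowest terms has a finite base \(b\) expansion exactly when every prime factor of \(q\) divides \(b\). The sketch below tests that condition; `terminates` is a hypothetical helper name, not something from the slides.

```julia
# Hypothetical helper: does r = p/q have a finite expansion in base b?
# Strip from q every factor it shares with b; r terminates iff nothing is left.
function terminates(r::Rational, b::Integer)
    q = denominator(r)      # Rational values are stored in lowest terms
    g = gcd(q, b)
    while g > 1
        q ÷= g
        g = gcd(q, b)
    end
    return q == 1
end

println(terminates(1 // 5, 10))   # true:  0.2 in decimal
println(terminates(1 // 5, 2))    # false: repeating in binary
println(terminates(1 // 3, 10))   # false: repeating in decimal too
println(0.1 + 0.2 == 0.3)         # false: none of these decimals is exact in binary
```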
Our standard representations: \[ (-1)^s \times (1.b_1 b_2 \ldots b_{p-1})_2 \times 2^{e-\mathrm{bias}} \]
Consider the 32-bit number `0.2f0`: \[\begin{align}
& \color{red}{0}\color{gray}{01111100}\color{blue}{10011001100110011001101} \\
\rightarrow & (-1)^{\color{red}0}
\times (1.{\color{blue}{10011001100110011001101}})_2
\times 2^{{\color{gray}{124}}-127} \\
=&({\color{red}+} 1) \times ({\color{blue}1.6 + \delta}) \times 2^{\color{gray}-3}
\end{align}\] where \(\delta\) is the rounding error from keeping only \(p = 24\) significand bits.
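The three fields can be pulled apart in Julia with `bitstring`. This is a sketch for illustration only; the variable names (`sbit`, `E`, `frac`) are my own.

```julia
x = 0.2f0
bits = bitstring(x)                  # 32 characters: sign, exponent, fraction
s, e, f = bits[1:1], bits[2:9], bits[10:32]

sbit = parse(Int, s)                 # 0
E = parse(Int, e; base = 2)          # 124
frac = parse(Int, f; base = 2)       # the 23 stored fraction bits as an integer

m = 1 + frac / 2^23                  # significand 1.f, exact in Float64
value = (-1)^sbit * ldexp(m, E - 127)

println(value == Float64(x))         # true: the decoding is exact
println(m - 1.6)                     # delta, roughly 2.4e-8
```

The decoded `value` matches `Float64(x)` exactly because every Float32 is also a Float64.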
Consider a toy system with \(p = 3\). Normalized numbers alone leave a gap near zero!
Consider a toy system with \(p = 3\). Fill the gap with subnormals.
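The same gap filling can be seen in Float32 rather than the toy system. This is a small sketch using the standard Julia functions `floatmin`, `nextfloat`, and `issubnormal`.

```julia
tiny = floatmin(Float32)           # smallest positive normalized Float32 (2^-126)
sub = nextfloat(0.0f0)             # smallest positive subnormal (2^-149)

println(issubnormal(tiny))         # false
println(issubnormal(tiny / 2))     # true
println(bitstring(tiny / 2)[2:9])  # 00000000: subnormals use the all-zero exponent field

# With subnormals, distinct nearby numbers have a nonzero difference.
x, y = 1.5f0 * tiny, tiny
println(x - y)                     # nonzero (a subnormal), not 0.0
```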
Zero(s) are subnormal number(s)!
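- `+0.0` and `-0.0` are both representable
- `-0.0 == 0.0`
- `1.0/0.0` and `1.0/-0.0` are different!

Floating point representations include:
- Normalized numbers
- Subnormal numbers (including zeros)
- Infinities
- Not-a-Number (NaN) values

Infinities and NaNs close the system: every floating point operation can yield some floating point result.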
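A few one-liners illustrate signed zeros and how infinities and NaN keep every operation inside the system. This is a sketch using only Base Julia.

```julia
println(0.0 == -0.0)          # true: the two zeros compare equal
println(isequal(0.0, -0.0))   # false: isequal distinguishes them
println(signbit(-0.0))        # true
println(1.0 / 0.0)            # Inf
println(1.0 / -0.0)           # -Inf
println(0.0 / 0.0)            # NaN
println(Inf - Inf)            # NaN
```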
Standard formats per IEEE 754:
| Format | Julia name | \(p\) | \(e\) bits | \(\epsilon_{\mathrm{mach}}\) | bias |
|---|---|---|---|---|---|
| Double | Float64 | 53 | 11 | \(\approx 1.11 \times 10^{-16}\) | 1023 |
| Single | Float32 | 24 | 8 | \(\approx 5.96 \times 10^{-8}\) | 127 |
| Half | Float16 | 11 | 5 | \(\approx 4.88 \times 10^{-4}\) | 15 |
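The Julia rows of the table can be recomputed from Base functions. In this sketch I assume \(\epsilon_{\mathrm{mach}} = 2^{-p}\), which is what the tabulated values correspond to; note that Julia's `eps(T)` returns the spacing at 1, which is twice that.

```julia
for T in (Float64, Float32, Float16)
    p = precision(T)               # significand bits, including the implicit leading 1
    ebits = 8 * sizeof(T) - p      # total = 1 sign bit + (p - 1) stored bits + ebits
    bias = 2^(ebits - 1) - 1
    u = Float64(eps(T) / 2)        # assumed convention: eps_mach = 2^-p
    println(T, ": p = ", p, ", e bits = ", ebits, ", eps_mach = ", u, ", bias = ", bias)
end
```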
| Format | bits | Significand bits | Exponent bits | bias |
|---|---|---|---|---|
| Quad | 128 | 113 | 15 | 16383 |
| Double | 64 | 53 | 11 | 1023 |
| Single | 32 | 24 | 8 | 127 |
| Half | 16 | 11 | 5 | 15 |
| TF32 | 19 | 11 | 8 | 127 |
| BFloat16 | 16 | 8 | 8 | 127 |
Write \(\operatorname{fl}(\cdot)\) for rounding; so 0.2f0 gives \(\operatorname{fl}(0.2)\).
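Rounding has different modes of operation: to nearest (ties to even), toward zero, toward \(+\infty\), and toward \(-\infty\).
The default is round to nearest even; the directed modes are used mainly for interval arithmetic.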
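Julia's `Float32` constructor accepts a rounding mode, which makes the modes easy to compare. This is a sketch; note that the input `0.2` is already a rounded Float64, sitting slightly above \(1/5\).

```julia
# Directed and nearest rounding of the Float64 value 0.2 into Float32
down = Float32(0.2, RoundDown)
up = Float32(0.2, RoundUp)
near = Float32(0.2, RoundNearest)
println(down < near, " ", near == up)   # here nearest coincides with rounding up
println(down == prevfloat(up))          # true: adjacent Float32 values

# Ties go to the neighbor whose last significand bit is even.
tie1 = 1 + ldexp(1.0, -24)       # halfway between 1 and nextfloat(1.0f0)
tie2 = 1 + 3 * ldexp(1.0, -24)   # halfway between the next two Float32 values
println(Float32(tie1) == 1.0f0)                          # true
println(Float32(tie2) == nextfloat(nextfloat(1.0f0)))    # true
```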
For simple operations \(\{ +, -, \times, \div, \sqrt{\cdot} \}\), return the exact result, correctly rounded, e.g.
- `x + y` \(\mapsto \operatorname{fl}(x+y)\)
- `sqrt(x)` \(\mapsto \operatorname{fl}(\sqrt{x})\)

Transcendental functions are hard to round correctly (“table maker’s dilemma”), so implementations often allow a little more error.
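Correct rounding can be spot-checked by computing in higher precision and rounding once. This is a sketch, assuming Julia's default `BigFloat` precision of 256 bits, which is enough to hold the exact sum and product of these Float32 inputs.

```julia
x, y = 0.1f0, 0.2f0

# big(x) + big(y) and big(x) * big(y) are exact here, so converting
# back to Float32 rounds the exact result a single time.
println(x + y == Float32(big(x) + big(y)))          # true
println(x * y == Float32(big(x) * big(y)))          # true

# sqrt(2) is irrational, so the 256-bit value is only a very accurate stand-in.
println(sqrt(2.0f0) == Float32(sqrt(big(2.0f0))))   # true
```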
Comparisons are the usual greater, less, equal, except:
- `-0.0 == +0.0`

Floating point numbers aren’t “fuzzy.”
(Careful) equality tests are allowed!
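The comparisons below show the signed-zero case, plus NaN, which IEEE 754 treats as unordered (an exception the text above does not list). They also show that equality is exact when the operands are exactly representable. This is a sketch in Base Julia.

```julia
println(-0.0 == 0.0)                   # true
println(NaN == NaN)                    # false
println(NaN < 1.0, " ", NaN >= 1.0)    # false false: NaN is unordered
println(0.5 + 0.25 == 0.75)            # true: all three values are exact in binary
println(0.1 + 0.2 == 0.3)              # false: each literal is rounded first
println(isapprox(0.1 + 0.2, 0.3))      # true: tolerance-based comparison
```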
An exception is signaled when a floating point operation violates some expected behavior
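IEEE 754 names five exceptions: invalid, division by zero, overflow, underflow, and inexact. The sketch below shows only the default results Julia returns in each case, not the status flags.

```julia
println(0.0 / 0.0)                  # invalid operation: NaN
println(1.0 / 0.0)                  # division by zero: Inf
println(2 * floatmax(Float64))      # overflow: Inf
println(floatmin(Float64) / 2^60)   # underflow: 0.0 (below the smallest subnormal)
println(0.1 + 0.2)                  # inexact: 0.30000000000000004
```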