CS 4220: Numerical Analysis

Floating point

David Bindel

2026-02-02

Intro

These slides are narrated. If you want to hear the narration for a particular slide, click near the bottom of the slide. We do not autoplay by default.

(You can press a to start autoplaying.)

Floating point fundamentals

Number systems revisited

Base \(b\) notation for numbers:

  • \(b\) symbols (digits) per place
  • \(k\)th place left of the point is \(b^k\) (start \(k = 0\))
  • \(k\)th place right of the point is \(b^{-k}\)

Decimal and binary are base 10 and base 2, respectively.

Scientific notation works in any base: \[ m \times b^e, \quad 1 \leq m < b \]

Scientific notation

Consider an example decimal fraction: \[ \frac{1}{3} = 0.\overline{3} = 3.333\ldots \times 10^{-1} \] The equivalent binary fraction is: \[ \frac{(1)_2}{(11)_2} = (0.\overline{01})_2 = (1.0101\ldots)_2 \times 2^{-2} \]
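
The course uses Julia, but any IEEE language shows the same structure; as a sketch in Python (CPython doubles), `float.hex()` prints the normalized binary scientific form directly:

```python
# float.hex() shows an IEEE double's significand in hex: each hex
# digit 5 is the bit pattern 0101, so this is (1.0101...)_2 * 2^{-2}.
x = 1.0 / 3.0
print(x.hex())  # 0x1.5555555555555p-2
```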

An aside: decimal vs binary

Consider \(r = p/q\) with \(p,q\) relatively prime

  • Always have a (maybe repeating) base \(b\) fraction
  • Finite iff every prime factor of \(q\) also divides \(b\)
  • If it repeats, the period is less than \(q\) digits

For \(1/5\) (for example), have decimal \(0.2\), but binary is \[ \frac{1_2}{(101)_2} = (0.\overline{0011})_2. \] This mismatch causes great confusion sometimes.
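
The classic symptom of this mismatch, sketched in Python (Julia and C behave the same way): \(1/10\) and \(1/5\) repeat in binary, so they are rounded on input, and the rounding errors surface in arithmetic.

```python
# 0.1 and 0.2 are rounded on entry; their computed sum is not fl(0.3).
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False
print((0.2).hex())        # 0x1.999999999999ap-3: (1001)_2 repeats, then rounds up
```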

Binary number formats

  • Floating point: \(m \times 2^e\) for variable \(e\) (\(1 \leq m < 2\))
    • Mostly, with some exceptional representations (upcoming)
    • The standard formats for most of scientific computing
    • Standardized by IEEE 754 (and 854) committees
  • Fixed point: \(m \times 2^e\) for fixed (implicit) \(e\)
    • Common in graphics, signal processing applications
    • Can mostly just manipulate with integer operations
    • Range is very restricted
  • Different error models (relative vs absolute) – to discuss

Normalized floating point numbers

Our standard representations: \[ (-1)^s \times (1.b_1 b_2 \ldots b_{p-1})_2 \times 2^{e-\mathrm{bias}} \]

  • Three parts: sign, significand (aka mantissa), and exponent
  • Get one “free” bit in significand from normalization
  • Machine epsilon (unit roundoff) is \(\epsilon_{\mathrm{mach}}= 2^{-p}\)
  • All zeros and all ones bit patterns for exceptional values

Example

Consider 32-bit number 0.2f0: \[\begin{align} & \color{red}{0}\color{gray}{01111100}\color{blue}{10011001100110011001101} \\ \rightarrow & (-1)^{\color{red}0} \times (1.{\color{blue}{10011001100110011001101}})_2 \times 2^{{\color{gray}{124}}-127} \\ =&({\color{red}+} 1) \times ({\color{blue}1.6 + \delta}) \times 2^{\color{gray}-3} \end{align}\] where \(\delta\) represents the rounding error from having \(p = 24\).
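
The same decoding can be done programmatically. A sketch in Python using struct (Python has no native 32-bit float type, so we pack the value into one):

```python
import struct

# Pack 0.2 into IEEE single precision and split the three fields.
bits = struct.unpack('>I', struct.pack('>f', 0.2))[0]
sign = bits >> 31                 # 1 sign bit
exponent = (bits >> 23) & 0xFF    # 8 stored (biased) exponent bits
fraction = bits & 0x7FFFFF        # 23 explicit significand bits

print(f'{bits:032b}')             # 0 01111100 10011001100110011001101
value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)
print(exponent - 127, value)      # -3 0.20000000298023224
```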

Subnormals

Consider toy system with \(p = 3\). Have a gap near zero!

  • Can’t represent zero (seems like a problem…)
  • Smallest positive / negative are \(x_{\pm} = \pm 2^{1-\mathrm{bias}}\)
  • Next out are \(y_{\pm} = \pm (1+2^{1-p}) \times 2^{1-\mathrm{bias}}\)
  • The gap \(y_+ - x_+ = 2^{1-p} \times 2^{1-\mathrm{bias}}\) is closer to zero than to any normalized number, so computing it must round to zero even though \(x_+ \neq y_+\)!

Subnormals

Consider toy system with \(p = 3\). Fill gap with subnormals

  • Indicate with all zeros exponent field
  • Encoding interpreted as \[ (-1)^s \times (0.b_1 b_2 \ldots b_{p-1})_2 \times 2^{1-\mathrm{bias}} \]
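
For IEEE doubles (\(p = 53\), bias 1023), a Python sketch of the gap-filling:

```python
import sys

x_min = sys.float_info.min        # smallest normalized double: 2^{1-1023}
print(x_min == 2.0 ** -1022)      # True

tiny = 2.0 ** -1074               # smallest subnormal: 2^{-(p-1)} * 2^{1-bias}
print(tiny)                       # 5e-324
print(tiny / 2)                   # 0.0: nothing smaller left to round to

# Subnormals guarantee that x != y implies x - y != 0:
y = x_min * (1 + 2.0 ** -52)      # the next double after x_min
print(y - x_min)                  # 5e-324: a subnormal, not zero
```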

Zero

Zero(s) are subnormal number(s)!

  • The number with all zero bits is +0.0
  • There is also -0.0
  • Floating point compare doesn’t care: -0.0 == 0.0
  • But 1.0/0.0 and 1.0/-0.0 are different!
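
A Python sketch (Python raises ZeroDivisionError on 1.0/0.0, so the signed-zero distinction is shown here via copysign and a branch cut in atan2 instead):

```python
import math

print(-0.0 == 0.0)               # True: comparison ignores the sign of zero
print(math.copysign(1.0, -0.0))  # -1.0: but the sign bit is really there

# Branch cuts see the difference:
print(math.atan2(0.0, -1.0))     # pi
print(math.atan2(-0.0, -1.0))    # -pi
```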

Infinities

  • Also want to represent \(\pm \infty\) (exact or overflow)
    • Note: Signed infinities go with signed zeros!
  • Representation:
    • All ones in the exponent field
    • All zeros in the significand field
    • Sign bit interpreted as normal

Not-a-Number

  • Final class of representations is NaN
  • Output of \(0/0\), \(\sqrt{-1}\), etc
    • May have a meaningful interpretation with more context
  • Representation
    • All ones in the exponent field
    • Anything nonzero in the significand field (often all ones)
    • Sign bit is ignored
  • Arithmetic with NaN yields NaN; any comparison with NaN is false (except !=, which is true)
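
A Python sketch of NaN generation and comparison behavior:

```python
import math

nan = float('nan')
inf = float('inf')

print(math.isnan(inf - inf))     # True: inf - inf is invalid, gives NaN
print(math.isnan(nan + 1.0))     # True: arithmetic with NaN gives NaN
print(nan == nan)                # False
print(nan < 1.0, nan >= 1.0)     # False False
print(nan != nan)                # True: the one comparison that holds
```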

Representation summary

Floating point representations include:

  • Normalized numbers
  • Subnormal numbers
  • Infinities
  • NaNs

These close the system – every floating point operation can yield some floating point result.

Floating point formats

Standard formats per IEEE 754:

| Format | Julia name | \(p\) | \(e\) bits | \(\epsilon_{\mathrm{mach}}\) | bias |
|--------|------------|------:|-----------:|------------------------------|-----:|
| Double | Float64 | 53 | 11 | \(\approx 1.11 \times 10^{-16}\) | 1023 |
| Single | Float32 | 24 | 8 | \(\approx 5.96 \times 10^{-8}\) | 127 |
| Half | Float16 | 11 | 5 | \(\approx 4.88 \times 10^{-4}\) | 15 |
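
We can cross-check \(\epsilon_{\mathrm{mach}} = 2^{-p}\) for doubles in Python. One caveat: many libraries (including Python's sys.float_info.epsilon) instead report the spacing between 1 and the next double, which is \(2^{1-p}\), twice the unit roundoff used on these slides. Both conventions are common.

```python
import sys

# Spacing of doubles near 1 is 2^{-52}; the slides' eps_mach = 2^{-53}.
print(sys.float_info.epsilon == 2.0 ** -52)   # True
print(2.0 ** -53)                             # approx 1.11e-16, matching the table
```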

Even more formats!

| Format | Bits | Significand bits | Exponent bits | bias |
|--------|-----:|-----------------:|--------------:|-----:|
| Quad | 128 | 113 | 15 | 16383 |
| Double | 64 | 53 | 11 | 1023 |
| Single | 32 | 24 | 8 | 127 |
| Half | 16 | 11 | 5 | 15 |
| TF32 | 19 | 11 | 8 | 127 |
| BFloat16 | 16 | 8 | 8 | 127 |

Operations

Rounding

Write \(\operatorname{fl}(\cdot)\) for rounding; so 0.2f0 gives \(\operatorname{fl}(0.2)\).

Rounding has different modes of operation:

  • Round to nearest even (round to nearest; on a tie, choose the result whose last significand bit is 0)
  • Round toward 0
  • Round toward \(+\infty\)
  • Round toward \(-\infty\)

Default is round to nearest even (others for interval arithmetic)
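
Binary rounding modes are awkward to switch from Python, but the standard decimal module exposes the same four modes in base 10, which is enough to see the tie-breaking rule (a sketch):

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR

one = Decimal('1')
for x in (Decimal('2.5'), Decimal('3.5')):
    # Round x to an integer under each of the four modes.
    print([str(x.quantize(one, rounding=m))
           for m in (ROUND_HALF_EVEN, ROUND_DOWN, ROUND_CEILING, ROUND_FLOOR)])
# 2.5 -> ['2', '2', '3', '2']   (tie goes to even: 2)
# 3.5 -> ['4', '3', '4', '3']   (tie goes to even: 4)
```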

Basic arithmetic

For simple operations \(\{ +, -, \times, \div, \sqrt{\cdot} \}\), return the exact result, correctly rounded, e.g.

  • x + y \(\mapsto \operatorname{fl}(x+y)\),
  • sqrt(x) \(\mapsto \operatorname{fl}(\sqrt{x})\)
  • and so forth

Transcendental functions are hard to round correctly (the “table maker’s dilemma”), so implementations are often allowed slightly more error.
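
We can check the “exact result, correctly rounded” guarantee directly with exact rational arithmetic, a Python sketch (math.nextafter requires Python 3.9+):

```python
import math
from fractions import Fraction

x, y = 0.1, 0.2
s = x + y                            # the computed (rounded) sum
exact = Fraction(x) + Fraction(y)    # exact sum of the two doubles

# Correct rounding: no neighboring double is closer to the exact sum.
err = abs(Fraction(s) - exact)
for nbr in (math.nextafter(s, 0.0), math.nextafter(s, math.inf)):
    assert err <= abs(Fraction(nbr) - exact)
print(s)   # 0.30000000000000004
```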

Comparisons

Usual greater, less, equal, except:

  • -0.0 == +0.0
  • NaN in any comparison yields false

Floating point numbers aren’t “fuzzy.”
(Careful) equality tests are allowed!

Exceptions

Exception when floating point violates some expected behavior

  • Inexact: Result had to be rounded
  • Underflow: Result too small, had to round to a subnormal (or to zero)
  • Overflow: Result too big, had to round to \(\pm \infty\)
  • Divide by zero: Produced an exact \(\pm \infty\) (e.g. \(1/0\) or \(\log(0)\))
  • Invalid: Produced NaN

Reasoning about floating point

  • “Exact result, correctly rounded” is still hard to analyze!
    • True even if results are all normalized numbers
    • Gets trickier with exceptions beyond “inexact”
  • We want to do linear algebra, not track bits in the computer
  • Want a model that captures important bits of floating point