Floating point and error analysis basics
2025-09-08
For an approximation \(\hat{x}\) to a true value \(x\), we measure error in absolute, relative, or mixed terms:
\[e_{\mathrm{abs}} = |\hat{x}-x|\]
\[e_{\mathrm{rel}} = \frac{|\hat{x}-x|}{|x|}\]
\[e_{\mathrm{mix}} = \frac{|\hat{x}-x|}{|x| + \tau}\]
Can do all the above with norms: \[\begin{aligned} e_{\mathrm{abs}} &= \|\hat{x}-x\| \\ e_{\mathrm{rel}} &= \frac{\|\hat{x}-x\|}{\|x\|} \\ e_{\mathrm{mix}} &= \frac{\|\hat{x}-x\|}{\|x\| + \tau} \end{aligned}\]
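As a minimal Julia sketch (the helper names here are ours, not standard):

```julia
using LinearAlgebra

# Illustrative helpers for the three error measures (hypothetical names).
err_abs(xhat, x)      = norm(xhat - x)
err_rel(xhat, x)      = norm(xhat - x) / norm(x)
err_mix(xhat, x, tau) = norm(xhat - x) / (norm(x) + tau)

x, xhat = [1.0, 2.0], [1.0, 2.0 + 1e-8]
err_abs(xhat, x), err_rel(xhat, x), err_mix(xhat, x, 1e-3)
```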
Consider \(y = f(x)\). The forward error for a computed \(\hat{y}\) is \[ \hat{y}-y. \] We can also consider the backward error \(\hat{x}-x\), where \(\hat{x}\) is an input that exactly explains the output: \[ \hat{y} = f(\hat{x}). \] This treats the error as a perturbation to the input rather than to the output.
Suppose \(y = f(x)\) and a perturbed version \[ \hat{y} = f(\hat{x}). \] Taylor expansion about \(x\) gives the first-order sensitivity \[ \|\hat{y}-y\| \leq \|f'(x)\| \|\hat{x}-x\| + O(\|\hat{x}-x\|^2), \] assuming \(f'\) is differentiable near \(x\).
But this is about absolute error.
First-order bound on relation between relative changes in input and output: \[ \frac{\|\hat{y}-y\|}{\|y\|} \lesssim \kappa_f(x) \frac{\|\hat{x}-x\|}{\|x\|}. \] Q: How to get (tight) constant \(\kappa_f(x)\)?
In general, have \[ \kappa_f(x) = \frac{\|f'(x)\| \|x\|}{\|f(x)\|} \]
Run into trouble if \(x = 0\) or \(f(x) = 0\): the relative errors (and hence \(\kappa_f(x)\)) are undefined there, and behave badly nearby.
Consider \(\hat{y} = (A+E)x\) vs \(y = Ax\) (\(A\) invertible). \[ \frac{\|\hat{y}-y\|}{\|y\|} = \frac{\|Ex\|}{\|y\|} \leq \kappa(A) \frac{\|E\|}{\|A\|}. \] What should \(\kappa(A)\) be?
Write \(x = A^{-1} y\); then \[ \frac{\|Ex\|}{\|y\|} = \frac{\|EA^{-1} y\|}{\|y\|} \leq \|EA^{-1}\| \leq \|E\| \|A^{-1}\|. \] So \(\kappa(A) = \|A\| \|A^{-1}\|\).
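A quick numerical sanity check of this bound in Julia (a sketch; the particular \(A\), \(E\), and \(x\) are arbitrary):

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]        # any invertible matrix
E = 1e-8 * randn(2, 2)        # small perturbation
x = randn(2)
y, yhat = A*x, (A + E)*x

κ = opnorm(A) * opnorm(inv(A))                     # κ(A) in the operator 2-norm
norm(yhat - y)/norm(y) ≤ κ * opnorm(E)/opnorm(A)   # true: the bound holds
```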
When \(\|A\| \leq \|\,|A|\,\|\) (true for all our favorite norms) and \(|E| \leq \epsilon |A|\) (elementwise): \[ \frac{\|Ex\|}{\|y\|} = \frac{\|EA^{-1} y\|}{\|y\|} \leq \|\,|A|\,|A^{-1}|\,\| \epsilon. \] Here \(\|\,|A|\,|A^{-1}|\,\|\) is the componentwise relative condition number.
Q: What about \(\hat{y} = A \hat{x}\)? Same \(\kappa(A)\)…
Base \(b\) notation for numbers:
Decimal and binary are base 10 and base 2, respectively.
Scientific notation works in any base: \[ m \times b^e, \quad 1 \leq m < b \]
Consider example decimal fraction: \[ \frac{1}{3} = 0.\overline{3} = 3.333\ldots \times 10^{-1} \] Equivalent binary fraction: \[ \frac{(1)_2}{(11)_2} = (0.\overline{01})_2 = (1.0101\ldots)_2 \times 2^{-2} \]
Consider \(r = p/q\) with \(p,q\) relatively prime: the base-\(b\) expansion of \(r\) terminates iff every prime factor of \(q\) divides \(b\); otherwise it eventually repeats.
For \(1/5\) (for example), the decimal expansion is \(0.2\), but the binary expansion is \[ \frac{(1)_2}{(101)_2} = (0.\overline{0011})_2. \] This mismatch is a recurring source of confusion, as the snippet below illustrates.
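A small Julia illustration of the mismatch (`big` shows the value actually stored):

```julia
0.1 + 0.2 == 0.3    # false: none of the three is exactly representable in binary
big(0.2)            # ≈ 0.20000000000000001110...: the double closest to 1/5
bitstring(0.2f0)    # the 32 bits unpacked in the example below
```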
Our standard representations: \[ (-1)^s \times (1.b_1 b_2 \ldots b_{p-1})_2 \times 2^{e-\mathrm{bias}} \]
Consider the 32-bit number `0.2f0`:
\[\begin{aligned}
& \color{red}{0}\,\color{gray}{01111100}\,\color{blue}{10011001100110011001101} \\
\rightarrow\; & (-1)^{\color{red}{0}}
\times (1.{\color{blue}{10011001100110011001101}})_2
\times 2^{{\color{gray}{124}}-127} \\
=\; & ({\color{red}{+}}1) \times ({\color{blue}{1.6 + \delta}}) \times 2^{\color{gray}{-3}}
\end{aligned}\]
where \(\delta\) represents the rounding error from having \(p = 24\).
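We can check this decoding directly in Julia (a sketch built on `bitstring`):

```julia
bits = bitstring(0.2f0)             # "00111110010011001100110011001101"
s = bits[1]                         # sign bit: '0'
e = parse(Int, bits[2:9]; base=2)   # biased exponent: 124
f = parse(Int, bits[10:32]; base=2) # 23 stored fraction bits
(-1)^(s == '1') * (1 + f/2^23) * 2.0^(e - 127)   # 0.20000000298023224 == 0.2f0
```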
Consider a toy system with \(p = 3\): normalized numbers alone leave a gap near zero!
We fill the gap with subnormals: numbers with the minimum exponent and a leading digit of \(0\).
The zero(s) are themselves subnormal number(s)!
There are two zeros, `+0.0` and `-0.0`. They compare as equal (`-0.0 == 0.0` is `true`), but `1.0/0.0` and `1.0/-0.0` are different (\(+\infty\) and \(-\infty\))!
Floating point representations also include \(\pm\infty\) and NaN (not-a-number).
These close the system – every floating point operation can yield some floating point result.
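The special values are easy to poke at in Julia:

```julia
-0.0 == 0.0        # true: the two zeros compare equal
1.0/0.0, 1.0/-0.0  # (Inf, -Inf): but signed zeros behave differently
0.0/0.0            # NaN: an invalid operation still yields a result
Inf - Inf          # NaN
NaN == NaN         # false: NaN compares unequal to everything, itself included
```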
Standard formats per IEEE 754:
| Format | Julia name | \(p\) | \(e\) bits | \(\epsilon_{\mathrm{mach}}\) | bias |
|---|---|---|---|---|---|
| Double | `Float64` | 53 | 11 | \(\approx 1.11 \times 10^{-16}\) | 1023 |
| Single | `Float32` | 24 | 8 | \(\approx 5.96 \times 10^{-8}\) | 127 |
| Half | `Float16` | 11 | 5 | \(\approx 4.88 \times 10^{-4}\) | 15 |
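Note a convention mismatch worth knowing: the \(\epsilon_{\mathrm{mach}} = 2^{-p}\) in the table is the unit roundoff, while Julia's `eps` returns the gap \(2^{1-p}\) between 1 and the next float, i.e. twice the table value:

```julia
eps(Float64), eps(Float32), eps(Float16)
# (2.220446049250313e-16, 1.1920929f-7, Float16(0.000977))

precision(Float64), precision(Float32), precision(Float16)
# (53, 24, 11): the p column of the table
```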
Write \(\operatorname{fl}(\cdot)\) for rounding; so `0.2f0` gives \(\operatorname{fl}(0.2)\).
Rounding has different modes of operation: to nearest even, toward zero, toward \(+\infty\), and toward \(-\infty\). The default is round to nearest even; the directed modes are mainly for interval arithmetic.
For simple operations \(\{ +, -, \times, \div, \sqrt{\cdot} \}\), IEEE arithmetic returns the exact result, correctly rounded; e.g. `x + y` \(\mapsto \operatorname{fl}(x+y)\) and `sqrt(x)` \(\mapsto \operatorname{fl}(\sqrt{x})\). Transcendental functions are hard to round correctly (the “table maker’s dilemma”), so they are often allowed a little more error.
Usual greater, less, equal comparisons apply, except: `-0.0 == +0.0` is `true`, and `NaN` compares unequal to everything, including itself.
Floating point numbers aren’t “fuzzy.”
(Careful) equality tests are allowed!
An exception is raised when floating point violates some expected behavior; IEEE 754 defines five: invalid operation, division by zero, overflow, underflow, and inexact.
If \(\operatorname{fl}(x)\) is a normalized number, then \[ \operatorname{fl}(x) = x(1+\delta) \] for some \(|\delta| \leq \epsilon_{\mathrm{mach}}\).
Example: compute \(z = 1 - x^2\). Under this model, \[\begin{aligned} t_1 &= x^2 (1+\delta_1), & \hat{z} &= (1-t_1)(1+\delta_2). \end{aligned}\]
\[\begin{aligned} \hat{z} &= z \left( 1 - \frac{\delta_1 x^2}{1-x^2} \right)(1+\delta_2) \\ & \approx z \left( 1 - \frac{\delta_1 x^2}{1-x^2} +\delta_2 \right) \end{aligned}\]
Seems OK if \(x\) is not too near \(\pm 1\).
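A quick Julia experiment bears this out: near \(x = 1\), the relative error of \(1-x^2\) computed in single precision blows up (double precision serves as the reference):

```julia
for x in Float32[0.5, 0.99, 0.999999]
    z32 = 1f0 - x^2                       # single-precision computation
    z64 = 1 - Float64(x)^2                # double-precision reference
    println(abs(z32 - z64) / abs(z64))    # relative error grows as x → 1
end
```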
For tiny \(\delta\), use Taylor approximations to simplify: \[\begin{aligned} (1+\delta_1)(1+\delta_2) & \approx 1+\delta_1 + \delta_2 \\ 1/(1+\delta) & \approx 1-\delta \\ \log(1+\delta) &\approx \delta \\ \exp(\delta) &\approx 1+\delta \end{aligned}\] We usually just do this without comment.
Example (Sterbenz subtraction): If \(x/2 \leq y \leq 2x\) then \[ \operatorname{fl}(x-y) = x-y. \]
Why? Exact result fits into the format!
But this says nothing about error in computed \(x\) and \(y\)!
“All models are wrong, but some models are useful”
\[\begin{aligned} \hat{x} &= x(1+\delta_1) \\ \hat{y} &= y(1+\delta_2) \\ \operatorname{fl}(\hat{x}-\hat{y}) &= (x-y)\left(1 + \frac{\delta_1 x - \delta_2 y}{x-y} \right) \end{aligned}\]
Possible bad news: a tight bound \[ \frac{|\delta_1 x - \delta_2 y|}{|x-y|} \leq \frac{|x|+|y|}{|x-y|} \epsilon_{\mathrm{mach}}, \] which blows up when \(x \approx y\) (cancellation).
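A sketch of this amplification, simulating relative input errors of about \(10^{-12}\) on nearly equal \(x\) and \(y\):

```julia
x, y = 1.0 + 1e-8, 1.0        # nearly equal: (|x|+|y|)/|x-y| ≈ 2e8
xhat = x * (1 + 1e-12)        # simulated relative input errors δ₁, δ₂
yhat = y * (1 - 1e-12)
abs((xhat - yhat) - (x - y)) / abs(x - y)   # ≈ 2e-4: massive amplification
```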
Want: smaller root of \(p(x) = x^2 - 2x + z\) \[ r_- = 1-\sqrt{1-z} \] In code:
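A direct Julia translation might look like this (a minimal sketch; the function name is ours):

```julia
# Naive formula: cancellation looms, since sqrt(1 - z) ≈ 1 for small z.
r_minus_naive(z) = 1 - sqrt(1 - z)

r_minus_naive(1e-20)   # 0.0, but the true root is ≈ 5e-21
```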
Questions: What happens for small \(z\)? How accurate is the computed \(r_-\)?
Product of roots is \(r_+ r_- = z\), so can compute via \(r_+\): \[ r_- = \frac{z}{r_+} = \frac{z}{1+\sqrt{1-z}} \]
If we didn’t know this, could try an expansion: \[ r_- = \frac{z}{2-z/2} + O(z^3) \] and switch formulas based on \(z\).
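In Julia, the rewritten formula recovers full accuracy for small \(z\) (again a sketch with our own names):

```julia
# Stable formula: the denominator 1 + sqrt(1 - z) ≈ 2 involves no cancellation.
r_minus_stable(z) = z / (1 + sqrt(1 - z))

r_minus_stable(1e-20)   # ≈ 5.0e-21, correct to full precision
r_minus_stable(0.5)     # ≈ 0.2928932..., matches 1 - sqrt(0.5) in the benign regime
```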
Consider \(h = f \circ g\): condition numbers multiply along a composition. To first order, \[ \kappa_h(x) \leq \kappa_f(g(x)) \, \kappa_g(x), \] with equality in the scalar case, so an ill-conditioned intermediate step can spoil the whole computation.
Consider computation of \[ E_n = \int_0^1 x^n e^{x-1} \, dx \] For large \(n\), have the estimate \[ E_n \approx \int_0^1 x^{n+1} \, dx = \frac{1}{n+2}, \] which is always a lower bound (\(x \leq \exp(x-1)\)).
Integration by parts gives: \[\begin{aligned} E_0 &= 1-1/e \\ E_n &= 1-nE_{n-1}, \quad n \geq 1 \end{aligned}\] What goes wrong if we implement this way?
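What goes wrong: any error in \(E_0\) (including its initial rounding) is multiplied by \(n\) at step \(n\), so it grows like \(n!\). A Julia sketch:

```julia
# Forward recurrence E₀ = 1 - 1/e, Eₙ = 1 - n·Eₙ₋₁.
# The rounding error in E₀ is amplified by n!, about 2.4e18 at n = 20.
function E_forward(n)
    E = 1 - 1/exp(1)
    for k in 1:n
        E = 1 - k*E
    end
    return E
end

E_forward(20)   # garbage of order hundreds; the true E₂₀ lies in (1/22, 1/21)
```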
Consider a stupid determinant: \[ \det(10^{-2} I) = 10^{-2n} \] for the \(n \times n\) identity. The matrix is perfectly conditioned, yet the determinant underflows for modest \(n\); determinants are a poor gauge of near-singularity.
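In Julia (a sketch): the matrix is perfectly conditioned, yet its determinant underflows:

```julia
using LinearAlgebra

A = Matrix(1e-2 * I, 200, 200)   # 200×200, κ(A) = 1: perfectly conditioned
det(A)                           # (1e-2)^200 = 1e-400 underflows to 0.0
```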
"NaN is ... uh..."