Floating point and error analysis basics
2025-09-08
For an approximation \(\hat{x}\) to a true value \(x\), we measure error in absolute, relative, or mixed terms:
\[e_{\mathrm{abs}} = |\hat{x}-x|\]
\[e_{\mathrm{rel}} = \frac{|\hat{x}-x|}{|x|}\]
\[e_{\mathrm{mix}} = \frac{|\hat{x}-x|}{|x| + \tau}\]
Can do all the above with norms: \[\begin{aligned} e_{\mathrm{abs}} &= \|\hat{x}-x\| \\ e_{\mathrm{rel}} &= \frac{\|\hat{x}-x\|}{\|x\|} \\ e_{\mathrm{mix}} &= \frac{\|\hat{x}-x\|}{\|x\| + \tau} \end{aligned}\]
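As a minimal Julia sketch (the helper names here are ours, not standard):

```julia
using LinearAlgebra

# Illustrative helpers for the three error measures (hypothetical names).
err_abs(xhat, x)      = norm(xhat - x)
err_rel(xhat, x)      = norm(xhat - x) / norm(x)
err_mix(xhat, x, tau) = norm(xhat - x) / (norm(x) + tau)

x, xhat = [1.0, 2.0], [1.0, 2.0 + 1e-8]
err_abs(xhat, x), err_rel(xhat, x), err_mix(xhat, x, 1e-3)
```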
Consider \(y = f(x)\). The forward error for a computed \(\hat{y}\) is \[ \hat{y}-y. \] We can also consider the backward error \(\hat{x}-x\), where \(\hat{x}\) is an input that exactly explains the output: \[ \hat{y} = f(\hat{x}). \] This treats the error as a perturbation to the input rather than to the output.
Suppose \(y = f(x)\) and a perturbed version \[ \hat{y} = f(\hat{x}). \] Taylor expansion about \(x\) gives the first-order sensitivity \[ \|\hat{y}-y\| \leq \|f'(x)\| \|\hat{x}-x\| + O(\|\hat{x}-x\|^2), \] assuming \(f'\) is differentiable near \(x\).
But this is about absolute error.
First-order bound on relation between relative changes in input and output: \[ \frac{\|\hat{y}-y\|}{\|y\|} \lesssim \kappa_f(x) \frac{\|\hat{x}-x\|}{\|x\|}. \] Q: How to get (tight) constant \(\kappa_f(x)\)?
In general, have \[ \kappa_f(x) = \frac{\|f'(x)\| \|x\|}{\|f(x)\|} \]
Run into trouble if \(x = 0\) or \(f(x) = 0\): the relative errors (and hence \(\kappa_f(x)\)) are undefined there, and behave badly nearby.
Consider \(\hat{y} = (A+E)x\) vs \(y = Ax\) (\(A\) invertible). \[ \frac{\|\hat{y}-y\|}{\|y\|} = \frac{\|Ex\|}{\|y\|} \leq \kappa(A) \frac{\|E\|}{\|A\|}. \] What should \(\kappa(A)\) be?
Write \(x = A^{-1} y\); then \[ \frac{\|Ex\|}{\|y\|} = \frac{\|EA^{-1} y\|}{\|y\|} \leq \|EA^{-1}\| \leq \|E\| \|A^{-1}\|. \] So \(\kappa(A) = \|A\| \|A^{-1}\|\).
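A quick numerical sanity check of this bound in Julia (a sketch; the particular \(A\), \(E\), and \(x\) are arbitrary):

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]        # any invertible matrix
E = 1e-8 * randn(2, 2)        # small perturbation
x = randn(2)
y, yhat = A*x, (A + E)*x

κ = opnorm(A) * opnorm(inv(A))                     # κ(A) in the operator 2-norm
norm(yhat - y)/norm(y) ≤ κ * opnorm(E)/opnorm(A)   # true: the bound holds
```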
When \(\|A\| \leq \|\,|A|\,\|\) (true for all our favorite norms) and \(|E| \leq \epsilon |A|\) (elementwise): \[ \frac{\|Ex\|}{\|y\|} = \frac{\|EA^{-1} y\|}{\|y\|} \leq \|\,|A|\,|A^{-1}|\,\| \epsilon. \] Here \(\|\,|A|\,|A^{-1}|\,\|\) is the componentwise relative condition number.
Q: What about \(\hat{y} = A \hat{x}\)? Same \(\kappa(A)\)…
Base \(b\) notation for numbers:
Decimal and binary are base 10 and base 2, respectively.
Scientific notation works in any base: \[ m \times b^e, \quad 1 \leq m < b \]
Consider example decimal fraction: \[ \frac{1}{3} = 0.\overline{3} = 3.333\ldots \times 10^{-1} \] Equivalent binary fraction: \[ \frac{(1)_2}{(11)_2} = (0.\overline{01})_2 = (1.0101\ldots)_2 \times 2^{-2} \]
Consider \(r = p/q\) with \(p,q\) relatively prime: the base-\(b\) expansion of \(r\) terminates iff every prime factor of \(q\) divides \(b\); otherwise it eventually repeats.
For \(1/5\) (for example), the decimal expansion is \(0.2\), but the binary expansion is \[ \frac{(1)_2}{(101)_2} = (0.\overline{0011})_2. \] This mismatch is a recurring source of confusion, as the snippet below illustrates.
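A small Julia illustration of the mismatch (`big` shows the value actually stored):

```julia
0.1 + 0.2 == 0.3    # false: none of the three is exactly representable in binary
big(0.2)            # ≈ 0.20000000000000001110...: the double closest to 1/5
bitstring(0.2f0)    # the 32 bits unpacked in the example below
```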
Our standard representations: \[ (-1)^s \times (1.b_1 b_2 \ldots b_{p-1})_2 \times 2^{e-\mathrm{bias}} \]
Consider the 32-bit number `0.2f0`:
\[\begin{aligned}
& \color{red}{0}\,\color{gray}{01111100}\,\color{blue}{10011001100110011001101} \\
\rightarrow\; & (-1)^{\color{red}{0}}
\times (1.{\color{blue}{10011001100110011001101}})_2
\times 2^{{\color{gray}{124}}-127} \\
=\; & ({\color{red}{+}}1) \times ({\color{blue}{1.6 + \delta}}) \times 2^{\color{gray}{-3}}
\end{aligned}\]
where \(\delta\) represents the rounding error from having \(p = 24\).
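We can check this decoding directly in Julia (a sketch built on `bitstring`):

```julia
bits = bitstring(0.2f0)             # "00111110010011001100110011001101"
s = bits[1]                         # sign bit: '0'
e = parse(Int, bits[2:9]; base=2)   # biased exponent: 124
f = parse(Int, bits[10:32]; base=2) # 23 stored fraction bits
(-1)^(s == '1') * (1 + f/2^23) * 2.0^(e - 127)   # 0.20000000298023224 == 0.2f0
```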
Consider a toy system with \(p = 3\): normalized numbers alone leave a gap near zero!
We fill the gap with subnormals: numbers with the minimum exponent and a leading digit of \(0\).
The zero(s) are themselves subnormal number(s)!
There are two zeros, `+0.0` and `-0.0`. They compare as equal (`-0.0 == 0.0` is `true`), but `1.0/0.0` and `1.0/-0.0` are different (\(+\infty\) and \(-\infty\))!
Floating point representations also include \(\pm\infty\) and NaN (not-a-number).
These close the system – every floating point operation can yield some floating point result.
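The special values are easy to poke at in Julia:

```julia
-0.0 == 0.0        # true: the two zeros compare equal
1.0/0.0, 1.0/-0.0  # (Inf, -Inf): but signed zeros behave differently
0.0/0.0            # NaN: an invalid operation still yields a result
Inf - Inf          # NaN
NaN == NaN         # false: NaN compares unequal to everything, itself included
```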
Standard formats per IEEE 754:
| Format | Julia name | \(p\) | \(e\) bits | \(\epsilon_{\mathrm{mach}}\) | bias |
|---|---|---|---|---|---|
| Double | `Float64` | 53 | 11 | \(\approx 1.11 \times 10^{-16}\) | 1023 |
| Single | `Float32` | 24 | 8 | \(\approx 5.96 \times 10^{-8}\) | 127 |
| Half | `Float16` | 11 | 5 | \(\approx 4.88 \times 10^{-4}\) | 15 |
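Note a convention mismatch worth knowing: the \(\epsilon_{\mathrm{mach}} = 2^{-p}\) in the table is the unit roundoff, while Julia's `eps` returns the gap \(2^{1-p}\) between 1 and the next float, i.e. twice the table value:

```julia
eps(Float64), eps(Float32), eps(Float16)
# (2.220446049250313e-16, 1.1920929f-7, Float16(0.000977))

precision(Float64), precision(Float32), precision(Float16)
# (53, 24, 11): the p column of the table
```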
Write \(\operatorname{fl}(\cdot)\) for rounding; so `0.2f0` gives \(\operatorname{fl}(0.2)\).
Rounding has different modes of operation: to nearest even, toward zero, toward \(+\infty\), and toward \(-\infty\). The default is round to nearest even; the directed modes are mainly for interval arithmetic.
For simple operations \(\{ +, -, \times, \div, \sqrt{\cdot} \}\), IEEE arithmetic returns the exact result, correctly rounded; e.g. `x + y` \(\mapsto \operatorname{fl}(x+y)\) and `sqrt(x)` \(\mapsto \operatorname{fl}(\sqrt{x})\). Transcendental functions are hard to round correctly (the “table maker’s dilemma”), so they are often allowed a little more error.
Usual greater, less, equal comparisons apply, except: `-0.0 == +0.0` is `true`, and `NaN` compares unequal to everything, including itself.
Floating point numbers aren’t “fuzzy.”
(Careful) equality tests are allowed!
An exception is raised when floating point violates some expected behavior; IEEE 754 defines five: invalid operation, division by zero, overflow, underflow, and inexact.
If \(\operatorname{fl}(x)\) is a normalized number, then \[ \operatorname{fl}(x) = x(1+\delta) \] for some \(|\delta| \leq \epsilon_{\mathrm{mach}}\).
Example: compute \(z = 1 - x^2\). Under this model, \[\begin{aligned} t_1 &= x^2 (1+\delta_1), & \hat{z} &= (1-t_1)(1+\delta_2). \end{aligned}\]
\[\begin{aligned} \hat{z} &= z \left( 1 - \frac{\delta_1 x^2}{1-x^2} \right)(1+\delta_2) \\ & \approx z \left( 1 - \frac{\delta_1 x^2}{1-x^2} +\delta_2 \right) \end{aligned}\]
Seems OK if \(x\) is not too near \(\pm 1\).
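A quick Julia experiment bears this out: near \(x = 1\), the relative error of \(1-x^2\) computed in single precision blows up (double precision serves as the reference):

```julia
for x in Float32[0.5, 0.99, 0.999999]
    z32 = 1f0 - x^2                       # single-precision computation
    z64 = 1 - Float64(x)^2                # double-precision reference
    println(abs(z32 - z64) / abs(z64))    # relative error grows as x → 1
end
```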
For tiny \(\delta\), use Taylor approximations to simplify: \[\begin{aligned} (1+\delta_1)(1+\delta_2) & \approx 1+\delta_1 + \delta_2 \\ 1/(1+\delta) & \approx 1-\delta \\ \log(1+\delta) &\approx \delta \\ \exp(\delta) &\approx 1+\delta \end{aligned}\] We usually just do this without comment.
Example (Sterbenz subtraction): If \(x/2 \leq y \leq 2x\) then \[ \operatorname{fl}(x-y) = x-y. \]
Why? Exact result fits into the format!
But this says nothing about error in computed \(x\) and \(y\)!
“All models are wrong, but some models are useful”
\[\begin{aligned} \hat{x} &= x(1+\delta_1) \\ \hat{y} &= y(1+\delta_2) \\ \operatorname{fl}(\hat{x}-\hat{y}) &= (x-y)\left(1 + \frac{\delta_1 x - \delta_2 y}{x-y} \right) \end{aligned}\]
Possible bad news: a tight bound \[ \frac{|\delta_1 x - \delta_2 y|}{|x-y|} \leq \frac{|x|+|y|}{|x-y|} \epsilon_{\mathrm{mach}}, \] which blows up when \(x \approx y\) (cancellation).
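A sketch of this amplification, simulating relative input errors of about \(10^{-12}\) on nearly equal \(x\) and \(y\):

```julia
x, y = 1.0 + 1e-8, 1.0        # nearly equal: (|x|+|y|)/|x-y| ≈ 2e8
xhat = x * (1 + 1e-12)        # simulated relative input errors δ₁, δ₂
yhat = y * (1 - 1e-12)
abs((xhat - yhat) - (x - y)) / abs(x - y)   # ≈ 2e-4: massive amplification
```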
Want: smaller root of \(p(x) = x^2 - 2x + z\) \[ r_- = 1-\sqrt{1-z} \] In code:
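A direct Julia translation might look like this (a minimal sketch; the function name is ours):

```julia
# Naive formula: cancellation looms, since sqrt(1 - z) ≈ 1 for small z.
r_minus_naive(z) = 1 - sqrt(1 - z)

r_minus_naive(1e-20)   # 0.0, but the true root is ≈ 5e-21
```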
Questions: What happens for small \(z\)? How accurate is the computed \(r_-\)?
Product of roots is \(r_+ r_- = z\), so can compute via \(r_+\): \[ r_- = \frac{z}{r_+} = \frac{z}{1+\sqrt{1-z}} \]
If we didn’t know this, could try an expansion: \[ r_- = \frac{z}{2-z/2} + O(z^3) \] and switch formulas based on \(z\).
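In Julia, the rewritten formula recovers full accuracy for small \(z\) (again a sketch with our own names):

```julia
# Stable formula: the denominator 1 + sqrt(1 - z) ≈ 2 involves no cancellation.
r_minus_stable(z) = z / (1 + sqrt(1 - z))

r_minus_stable(1e-20)   # ≈ 5.0e-21, correct to full precision
r_minus_stable(0.5)     # ≈ 0.2928932..., matches 1 - sqrt(0.5) in the benign regime
```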
Consider \(h = f \circ g\): condition numbers multiply along a composition. To first order, \[ \kappa_h(x) \leq \kappa_f(g(x)) \, \kappa_g(x), \] with equality in the scalar case, so an ill-conditioned intermediate step can spoil the whole computation.
Consider computation of \[ E_n = \int_0^1 x^n e^{x-1} \, dx \] For large \(n\), have the estimate \[ E_n \approx \int_0^1 x^{n+1} \, dx = \frac{1}{n+2}, \] which is always a lower bound (\(x \leq \exp(x-1)\)).
Integration by parts gives: \[\begin{aligned} E_0 &= 1-1/e \\ E_n &= 1-nE_{n-1}, \quad n \geq 1 \end{aligned}\] What goes wrong if we implement this way?
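What goes wrong: any error in \(E_0\) (including its initial rounding) is multiplied by \(n\) at step \(n\), so it grows like \(n!\). A Julia sketch:

```julia
# Forward recurrence E₀ = 1 - 1/e, Eₙ = 1 - n·Eₙ₋₁.
# The rounding error in E₀ is amplified by n!, about 2.4e18 at n = 20.
function E_forward(n)
    E = 1 - 1/exp(1)
    for k in 1:n
        E = 1 - k*E
    end
    return E
end

E_forward(20)   # garbage of order hundreds; the true E₂₀ lies in (1/22, 1/21)
```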
Consider a stupid determinant: \[ \det(10^{-2} I) = 10^{-2n} \] for the \(n \times n\) identity. The matrix is perfectly conditioned, yet the determinant underflows for modest \(n\); determinants are a poor gauge of near-singularity.
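In Julia (a sketch): the matrix is perfectly conditioned, yet its determinant underflows:

```julia
using LinearAlgebra

A = Matrix(1e-2 * I, 200, 200)   # 200×200, κ(A) = 1: perfectly conditioned
det(A)                           # (1e-2)^200 = 1e-400 underflows to 0.0
```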
"NaN is ... uh..."