Introduction to least squares
2025-09-24
Goal: \[ \mbox{minimize } \frac{1}{2} \|r\|^2 \mbox{ s.t. } r \in \Omega. \] What is the derivative of this objective?
Derivative: \[ \delta \left( \frac{1}{2} \|r\|^2 \right) = \Re \langle \delta r, r \rangle; \] (Get rid of the \(\Re\) part for real problems.)
Optimization: \[ \mbox{minimize } \frac{1}{2} \|r\|^2 \mbox{ s.t. } r(x) = b-Ax \] Stationarity condition: for any \(\delta x\), \[ \langle \delta r, r \rangle = -\langle A \delta x, r \rangle = 0 \]
Can write \(A^T r = 0\) as \[ A^T A x = A^T b \] Matrix \(M = A^T A\) is the Gram matrix
Unique solution if \(A\) is full column rank: \[ x = (A^T A)^{-1} A^T b = A^\dagger b \] where \(A^\dagger\) is the Moore-Penrose pseudoinverse.
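A quick numerical sanity check (a minimal NumPy sketch with made-up \(A\) and \(b\)): the residual should be orthogonal to the columns of \(A\), and the normal equations, the pseudoinverse, and a library least squares solve should all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # tall matrix, full column rank (w.h.p.)
b = rng.standard_normal(10)

# Solve the normal equations A^T A x = A^T b directly...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
# ...and via the pseudoinverse / a backward-stable least squares solver.
x_pinv = np.linalg.pinv(A) @ b
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

r = b - A @ x_lstsq
print(np.linalg.norm(A.T @ r))            # ~ 0: residual orthogonal to range(A)
print(np.linalg.norm(x_normal - x_pinv))  # all three solutions agree
print(np.linalg.norm(x_normal - x_lstsq))
```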
Suppose \(b = Ax\). Then \[ A^\dagger b = (A^T A)^{-1} A^T b = (A^T A)^{-1} (A^T A) x = x \] That is: \(A^\dagger A = I\).
Now consider \[ \Pi = A A^\dagger = A (A^T A)^{-1} A^T \]
If \(\Pi\) is a projector, so is \(I-\Pi\). Why?
For the orthogonal projector, \(I-\Pi\) is the residual projector: \[ (I-\Pi) b = b-Ax_* = r, \quad x_* = A^\dagger b \] Exact recovery: \(\mathcal{R}(A) = \mathcal{N}(I-\Pi)\)
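A minimal sketch (NumPy, made-up data) verifying the projector facts numerically: \(\Pi^2 = \Pi\), \(\Pi = \Pi^T\), \((I-\Pi)b = r\), and \(\mathcal{R}(A) \subseteq \mathcal{N}(I-\Pi)\).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

Pi = A @ np.linalg.solve(A.T @ A, A.T)          # Pi = A (A^T A)^{-1} A^T
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # x_* = A^dagger b

print(np.linalg.norm(Pi @ Pi - Pi))   # idempotent: Pi^2 = Pi
print(np.linalg.norm(Pi - Pi.T))      # symmetric: orthogonal projector
print(np.linalg.norm((np.eye(8) - Pi) @ b - (b - A @ x_star)))   # (I-Pi) b = r
print(np.linalg.norm((np.eye(8) - Pi) @ (A @ rng.standard_normal(3))))  # kills range(A)
```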
More generally, pseudoinverses generalize inverses when \(\mathcal{A} : \mathcal{V} \rightarrow \mathcal{W}\) is not both 1-1 and onto.
Suppose \(x = A^\dagger b + z\). Then \[ \|Ax-b\|^2 = \|Az\|^2 + \|(I-\Pi)b\|^2 \] Why is this true? And why does it imply \(A^\dagger b\) is the minimizer?
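One way to see it (a short derivation added here, using \(AA^\dagger b = \Pi b\) and \(A^T(I-\Pi) = 0\)):
\[\begin{aligned}
\|Ax-b\|^2 &= \|A(A^\dagger b + z) - b\|^2 = \|Az - (I-\Pi)b\|^2 \\
&= \|Az\|^2 - 2\langle Az, (I-\Pi)b \rangle + \|(I-\Pi)b\|^2 \\
&= \|Az\|^2 + \|(I-\Pi)b\|^2,
\end{aligned}\]
since \(\langle Az, (I-\Pi)b \rangle = z^T A^T (I-\Pi) b = 0\). The second term is fixed, so the objective is minimized by making \(Az = 0\); in particular \(z = 0\) works, and it is the unique choice when \(A\) has full column rank.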
Approach 1: \(A^T r = 0\) and \(r = b-Ax\), so \[ A^T A x = A^T b \] Approach 2: Keep \(r\) as an explicit variable \[ \begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} \] Can get from 2 to 1 by Gaussian elimination…
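A minimal sketch (NumPy, made-up data) showing the two approaches agree; eliminating \(r\) from the block system is exactly the substitution \(r = b - Ax\) into \(A^T r = 0\).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Approach 1: normal equations
x1 = np.linalg.solve(A.T @ A, A.T @ b)

# Approach 2: augmented system [[I, A], [A^T, 0]] [r; x] = [b; 0]
K = np.block([[np.eye(m), A],
              [A.T, np.zeros((n, n))]])
rx = np.linalg.solve(K, np.concatenate([b, np.zeros(n)]))
r2, x2 = rx[:m], rx[m:]

print(np.linalg.norm(x1 - x2))            # same least squares solution
print(np.linalg.norm(r2 - (b - A @ x2)))  # recovered residual matches b - A x
```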
Several slides ago, we switched from \[ \forall \delta x \in \mathcal{X}, \langle A \delta x, r \rangle = 0 \] to \[ A^T r = 0 \]
But the standard inner product is not the only inner product!
Consider a polynomial approximation problem: \[ \mbox{minimize } \|r(p)\|^2 \mbox{ s.t. } r(p) = p-f, p \in \mathcal{P}_d \] where we use the \(L^2([-1,1])\) norm. Steps: choose a basis \(\{\phi_j\}_{j=0}^d\) for \(\mathcal{P}_d\) and write \(p = \sum_j c_j \phi_j\); impose the stationarity condition \(\langle \delta p, p-f \rangle = 0\) for all \(\delta p \in \mathcal{P}_d\); this gives normal equations \(Mc = g\) with Gram matrix \(M_{ij} = \int_{-1}^1 \phi_i(x)\phi_j(x)\,dx\) and right-hand side \(g_i = \int_{-1}^1 \phi_i(x) f(x)\,dx\).
What does this look like?
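A minimal sketch (NumPy/SciPy) with the monomial basis \(1, x, \dots, x^d\) and the made-up target \(f(x) = e^x\): the Gram matrix entries are \(\int_{-1}^1 x^{i+j}\,dx\), the right-hand side entries \(\int_{-1}^1 f(x)\,x^i\,dx\) are computed by quadrature.

```python
import numpy as np
from scipy.integrate import quad

d = 4
f = np.exp   # example target function on [-1, 1]

# Gram matrix M_ij = int_{-1}^1 x^{i+j} dx (zero when i+j is odd)
M = np.array([[(1 + (-1)**(i + j)) / (i + j + 1) for j in range(d + 1)]
              for i in range(d + 1)], dtype=float)
# Right-hand side g_i = int_{-1}^1 f(x) x^i dx, via numerical quadrature
g = np.array([quad(lambda x, i=i: f(x) * x**i, -1, 1)[0] for i in range(d + 1)])

c = np.linalg.solve(M, g)    # coefficients of the best L^2 approximation
xs = np.linspace(-1, 1, 5)
p = np.polynomial.polynomial.polyval(xs, c)
print(np.max(np.abs(p - f(xs))))   # small, but not zero: L^2-best, not interpolation
```

Note the design issue with this basis choice: the monomial Gram matrix is Hilbert-like and becomes badly conditioned as \(d\) grows, which is why one usually prefers an orthogonal basis such as the Legendre polynomials.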
Consider \(A \in \mathbb{R}^{m \times n}\) with \(m < n\) (and full rank); solve \[ \mbox{minimize } \frac{1}{2} \|x\|^2 \mbox{ s.t. } Ax = b. \]
Solution: \(x = A^T (AA^T)^{-1} b = A^\dagger b\).
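A minimal sketch (NumPy, made-up wide matrix) checking that \(A^T(AA^T)^{-1}b\) satisfies the constraint, matches the pseudoinverse, and has smaller norm than other feasible points.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 10                     # wide matrix: m < n, full row rank (w.h.p.)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_min = A.T @ np.linalg.solve(A @ A.T, b)   # x = A^T (A A^T)^{-1} b
x_pinv = np.linalg.pinv(A) @ b

print(np.linalg.norm(A @ x_min - b))        # constraint A x = b satisfied
print(np.linalg.norm(x_min - x_pinv))       # agrees with the pseudoinverse
# Any other solution x_min + z with A z = 0 has strictly larger norm:
z = (np.eye(n) - np.linalg.pinv(A) @ A) @ rng.standard_normal(n)
print(np.linalg.norm(x_min), np.linalg.norm(x_min + z))
```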
Suppose \[ y = Xc + \epsilon \] where \(\epsilon\) is a mean-zero random error (noise) term.
Given \[ y = Xc + \epsilon \] the linear estimator \[ \hat{c} = My \] is unbiased if \(\mathbb{E}[\hat{c}] = MXc = c\) for all \(c\), i.e. \(MX = I\), so \(M\) is a pseudoinverse of \(X\).
NB: If \(\operatorname{Var}[\epsilon] = \sigma^2 I\), then \(\operatorname{Var}[\hat{c}] = M (\sigma^2 I) M^T\).
Gauss-Markov: \(\hat{c} = X^\dagger y\) is the best linear unbiased estimator
A: Want to show \(\operatorname{Var}[M y] \succeq \operatorname{Var}[X^\dagger y]\) whenever \(MX=I\)
Argument: Write \(M = X^\dagger + D\) where \(DX = 0\) and expand.
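Carrying out the expansion (a short derivation added for completeness, with \(\operatorname{Var}[\epsilon] = \sigma^2 I\) as above; note \(MX = I\) and \(X^\dagger X = I\) force \(DX = 0\)):
\[\begin{aligned}
\operatorname{Var}[My] &= \sigma^2 M M^T = \sigma^2 (X^\dagger + D)(X^\dagger + D)^T \\
&= \sigma^2 \left( X^\dagger (X^\dagger)^T + X^\dagger D^T + D (X^\dagger)^T + D D^T \right) \\
&= \operatorname{Var}[X^\dagger y] + \sigma^2 D D^T \succeq \operatorname{Var}[X^\dagger y],
\end{aligned}\]
where the cross terms vanish because \(X^\dagger D^T = (X^T X)^{-1} X^T D^T = (X^T X)^{-1} (DX)^T = 0\), and \(DD^T \succeq 0\).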
NB: If \(\operatorname{Var}[\epsilon] = C\), need \(X^T C^{-1} (X\hat{c}-y) = 0\).
Suppose \(\epsilon \sim N(0,C)\). Likelihood of \(y\) is \[ \ell(c|y) = \frac{1}{\sqrt{\det(2\pi C)}} \exp\left( -\frac{1}{2} (y-Xc)^T C^{-1} (y-Xc) \right) \] Fixed \(C\): \(\ell\) is maximized at \(X^T C^{-1} (X\hat{c}-y) = 0\).
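A minimal sketch (NumPy/SciPy, made-up heteroscedastic data) of the resulting generalized least squares estimate: with a Cholesky factorization \(C = LL^T\), "whitening" by \(L^{-1}\) reduces the problem to ordinary least squares.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(4)
m, n = 50, 3
X = rng.standard_normal((m, n))
c_true = np.array([1.0, -2.0, 0.5])
C = np.diag(rng.uniform(0.1, 2.0, m))   # made-up diagonal covariance
y = X @ c_true + rng.multivariate_normal(np.zeros(m), C)

L = cholesky(C, lower=True)               # C = L L^T
Xw = solve_triangular(L, X, lower=True)   # whitened data: L^{-1} X
yw = solve_triangular(L, y, lower=True)   # whitened observations: L^{-1} y
c_gls = np.linalg.lstsq(Xw, yw, rcond=None)[0]

# Same stationarity condition as on the slide: X^T C^{-1} (X c - y) = 0
print(np.linalg.norm(X.T @ np.linalg.solve(C, X @ c_gls - y)))
print(c_gls)
```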
Non-Gaussian noise generally gives nonlinear (maximum likelihood) estimators.
If \(\epsilon\) is iid, this should be reflected in the residual: the computed \(r\) should itself look like an iid sample, with no systematic structure.
Point: For statistics, there’s more than just “is it small” to \(r\)!
(But we’re mostly doing linear algebra here, not statistics…)
Want to solve normal equations \[ A^T A x = A^T b. \] How should we do it? Factorization!
Simple to compute Cholesky factorization \[ A^T A = R^T R. \] Then solve \[ R^T R x = A^T b \] by forward and back substitution.
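A minimal sketch of this Cholesky approach (NumPy/SciPy, made-up data; `scipy.linalg.solve_triangular` handles the forward and back substitutions):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(5)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)

R = cholesky(A.T @ A)                          # upper triangular, A^T A = R^T R
y = solve_triangular(R, A.T @ b, trans='T')    # forward substitution: R^T y = A^T b
x = solve_triangular(R, y)                     # back substitution:    R x = y

print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))
```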
Define \(\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A) = \|A\| \|A^\dagger\|\). Note \(\kappa(A^T A) = \kappa(A)^2\): forming the Gram matrix squares the condition number, which is the price of the normal equations approach.
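A minimal sketch (NumPy, with a deliberately ill-conditioned, made-up matrix) illustrating the squaring:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 30, 5
# Build a matrix with controlled singular values via a synthetic SVD
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.logspace(0, -6, n)           # kappa(A) = 1e6
A = U @ np.diag(sigma) @ V.T

print(np.linalg.cond(A))        # ~ 1e6
print(np.linalg.cond(A.T @ A))  # ~ 1e12: kappa(A)^2, near the limit of double precision
```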
Suppose \(A^T A = R^T R\). Fwd substitution on \(R^T R x = A^T b\): \[ R x = R^{-T} A^T b. \] Let \(Q = A R^{-1}\). Then \[\begin{aligned} Q^T Q &= R^{-T} A^T A R^{-1} \\ &= R^{-T} R^T R R^{-1} = I \end{aligned}\] That is, \(Q\) has orthonormal columns.
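In other words, \(A = QR\) is the economy QR factorization, and \(Rx = Q^T b\). A minimal sketch (NumPy/SciPy, made-up data) of least squares via QR:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)               # economy QR: A = Q R, Q^T Q = I
x = solve_triangular(R, Q.T @ b)     # back substitution on R x = Q^T b

print(np.linalg.norm(Q.T @ Q - np.eye(4)))                       # orthonormal columns
print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))  # least squares solution
```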
Suppose \(A = U \Sigma V^T\) an economy SVD. Then \[ A^\dagger = V \Sigma^{-1} U^T \]
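A minimal sketch (NumPy, made-up data) of the SVD route; `np.linalg.svd` with `full_matrices=False` gives the economy factorization:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # economy SVD: A = U diag(s) V^T
x = Vt.T @ ((U.T @ b) / s)                         # x = V Sigma^{-1} U^T b

A_dag = Vt.T @ np.diag(1.0 / s) @ U.T              # A^dagger = V Sigma^{-1} U^T
print(np.linalg.norm(A_dag - np.linalg.pinv(A)))   # matches NumPy's pseudoinverse
print(np.linalg.norm(x - np.linalg.lstsq(A, b, rcond=None)[0]))
```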
\[\begin{aligned} A^\dagger &= (A^T A)^{-1} A^T & \Pi &= A (A^T A)^{-1} A^T \\ &= (R^T R)^{-1} A^T & &= A (R^T R)^{-1} A^T \\ &= R^{-1} Q^T & &= QQ^T \\ &= V \Sigma^{-1} U^T & &= UU^T \end{aligned}\]