CS 6210: Matrix Computations

Introduction to least squares

David Bindel

2025-09-24

Basics

Least squares problem

Goal: \[ \mbox{minimize } \frac{1}{2} \|r\|^2 \mbox{ s.t. } r \in \Omega. \] What is the derivative of this objective?

Calculus aside

Derivative: \[ \delta \left( \frac{1}{2} \|r\|^2 \right) = \Re \langle \delta r, r \rangle; \] (Get rid of the \(\Re\) part for real problems.)
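A quick numerical sanity check of this formula (a minimal sketch with a random real \(r\) and a small perturbation \(\delta r\), both made up for illustration):

```python
import numpy as np

# Check: (1/2)||r + dr||^2 - (1/2)||r||^2 ~ <dr, r> to first order in ||dr||.
rng = np.random.default_rng(0)
r = rng.standard_normal(6)
dr = 1e-6 * rng.standard_normal(6)

lhs = 0.5 * np.linalg.norm(r + dr) ** 2 - 0.5 * np.linalg.norm(r) ** 2
rhs = np.dot(dr, r)
print(lhs, rhs)   # agree to several digits; the error is O(||dr||^2)
```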

Linear least squares

Optimization: \[ \mbox{minimize } \frac{1}{2} \|r\|^2 \mbox{ s.t. } r(x) = b-Ax \] Stationarity condition: for any \(\delta x\), \[ \langle \delta r, r \rangle = -\langle A \delta x, r \rangle = 0 \]

  • Residual is orthogonal to anything in \(\mathcal{R}(A)\)
  • Orthogonal is also known as normal; hence, the normal equations

Linear least squares in pictures

Solving the normal equations

Can write \(A^T r = 0\) as \[ A^T A x = A^T b \] Matrix \(M = A^T A\) is the Gram matrix

  • \(m_{ij}\) is the dot product of columns \(a_{:,i}\) and \(a_{:,j}\)
  • SPD if \(A\) is full column rank (which we’ll assume all of today)
  • We’ve seen this before (matrix rep’ns of inner products)

Solving the normal equations

Unique solution if \(A\) is full column rank: \[ x = (A^T A)^{-1} A^T b = A^\dagger b \] where \(A^\dagger\) is the Moore-Penrose pseudoinverse.
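A minimal NumPy sketch (random data, purely for illustration) comparing the normal-equations formula to the library least squares and pseudoinverse routines:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # tall, full column rank with probability 1
b = rng.standard_normal(100)

# Normal-equations solution x = (A^T A)^{-1} A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Reference solutions from library routines
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
x_pi = np.linalg.pinv(A) @ b

print(np.allclose(x_ne, x_ls), np.allclose(x_ne, x_pi))
```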

Exact recovery

Suppose \(b = Ax\). Then \[ A^\dagger b = (A^T A)^{-1} A^T b = (A^T A)^{-1} (A^T A) x = x \] That is: \(A^\dagger A = I\).
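A small numerical check of exact recovery (again with a random full-column-rank \(A\)):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 4))   # full column rank with probability 1
x = rng.standard_normal(4)

# If b = Ax exactly, the least squares solution recovers x: A^+ A = I.
print(np.allclose(np.linalg.pinv(A) @ (A @ x), x))
print(np.allclose(np.linalg.pinv(A) @ A, np.eye(4)))
```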

Pertinent projectors

Now consider \[ \Pi = A A^\dagger = A (A^T A)^{-1} A^T \]

  • Generally not an identity
  • It is a projector, i.e. \(\Pi^2 = \Pi\) (why?)
  • This is the orthogonal projector onto \(\mathcal{R}(A)\)
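A sketch verifying these properties numerically; forming \(\Pi\) explicitly is fine for a small demo (with random data), though not something we would do in practice:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

Pi = A @ np.linalg.solve(A.T @ A, A.T)     # Pi = A (A^T A)^{-1} A^T

x_star = np.linalg.lstsq(A, b, rcond=None)[0]

print(np.allclose(Pi @ Pi, Pi))            # projector: Pi^2 = Pi
print(np.allclose(Pi, Pi.T))               # orthogonal projector: symmetric
print(np.allclose(Pi @ b, A @ x_star))     # Pi b is the least squares fit A x_*
```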

Residual projectors

If \(\Pi\) is a projector, so is \(I-\Pi\). Why?

For the orthogonal projector, \(I-\Pi\) is the residual projector: \[ (I-\Pi) b = b-Ax_* = r, \quad x_* = A^\dagger b \] Exact recovery: \(\mathcal{R}(A) = \mathcal{N}(I-\Pi)\)

Aside on pseudoinverses

More generally, a pseudoinverse \(\mathcal{B} : \mathcal{W} \rightarrow \mathcal{V}\) generalizes the inverse when \(\mathcal{A} : \mathcal{V} \rightarrow \mathcal{W}\) is not both 1-1 and onto.

  • Case 1: 1-1 and not onto (e.g. least squares)
    • \(\mathcal{B} \mathcal{A} : \mathcal{V} \rightarrow \mathcal{V}\) is an identity
    • \(\Pi = \mathcal{A} \mathcal{B} : \mathcal{W} \rightarrow \mathcal{W}\) is a projector (\(\Pi^2 = \Pi\))
  • Case 2: not 1-1, but onto (e.g. minimal norm)
    • \(\Pi = \mathcal{B} \mathcal{A} : \mathcal{V} \rightarrow \mathcal{V}\) is a projector
    • \(\mathcal{A} \mathcal{B} : \mathcal{W} \rightarrow \mathcal{W}\) is an identity
  • Case 3: neither 1-1 nor onto – projectors either way!
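A sketch of the three cases, using NumPy's Moore-Penrose pseudoinverse as \(\mathcal{B}\) (the particular random matrices are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A1 = rng.standard_normal((10, 3))   # case 1: 1-1 but not onto
A2 = rng.standard_normal((3, 10))   # case 2: onto but not 1-1
A3 = A1 @ A2                        # case 3: 10-by-10 of rank 3, neither

for A in (A1, A2, A3):
    B = np.linalg.pinv(A)
    BA, AB = B @ A, A @ B
    print(np.allclose(BA, np.eye(A.shape[1])),                    # B A = I?
          np.allclose(AB, np.eye(A.shape[0])),                    # A B = I?
          np.allclose(BA @ BA, BA) and np.allclose(AB @ AB, AB))  # projectors?
```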

A different argument

Suppose \(x = A^\dagger b + z\). Then \[ \|Ax-b\|^2 = \|Az\|^2 + \|(I-\Pi)b\|^2 \] Why is this true? And why does it imply \(A^\dagger b\) is the minimizer?
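A quick numerical check of the identity on random data (a minimal sketch; here \(\Pi\) is the orthogonal projector onto \(\mathcal{R}(A)\), so \((I-\Pi)b = b - Ax_*\)):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
z = rng.standard_normal(3)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # x_* = A^+ b
x = x_star + z

lhs = np.linalg.norm(A @ x - b) ** 2
rhs = np.linalg.norm(A @ z) ** 2 + np.linalg.norm(b - A @ x_star) ** 2
print(np.isclose(lhs, rhs))
```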

Normal equations two ways

Approach 1: \(A^T r = 0\) and \(r = b-Ax\), so \[ A^T A x = A^T b \] Approach 2: Keep \(r\) as an explicit variable \[ \begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix} \] Can get from 2 to 1 by Gaussian elimination…
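A sketch comparing the two approaches numerically (the bordered system is assembled densely here just to check that it gives the same answer):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 30, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Approach 1: normal equations
x1 = np.linalg.solve(A.T @ A, A.T @ b)

# Approach 2: keep r as an explicit variable in an (m+n)-by-(m+n) system
K = np.block([[np.eye(m), A],
              [A.T, np.zeros((n, n))]])
rx = np.linalg.solve(K, np.concatenate([b, np.zeros(n)]))
r2, x2 = rx[:m], rx[m:]

print(np.allclose(x1, x2), np.allclose(r2, b - A @ x2))
```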

Other inner products

Several slides ago, we switched from \[ \forall \delta x \in \mathcal{X}, \langle A \delta x, r \rangle = 0 \] to \[ A^T r = 0 \]

But the standard inner product is not the only inner product!

Polynomial picture

Consider a polynomial approximation problem: \[ \mbox{minimize } \|r(p)\|^2 \mbox{ s.t. } r(p) = p-f, p \in \mathcal{P}_d \] where we use the \(L^2([-1,1])\) norm. Steps:

  • Write \(p(x) = \sum_{j=0}^d q_j(x) c_j\) for some basis polynomials
  • Should satisfy \(\langle q_i, r \rangle = 0\) for each \(q_i\) in the basis

What does this look like?
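A minimal sketch, assuming \(f(x) = e^x\) as an example function and the monomial basis \(q_j(x) = x^j\) (both choices are just for illustration):

```python
import numpy as np

# Best L^2([-1,1]) approximation of f(x) = exp(x) by a degree-d polynomial,
# using the monomial basis q_j(x) = x^j.
d = 4
f = np.exp

# Gram matrix M_ij = <x^i, x^j> = int_{-1}^{1} x^{i+j} dx
i, j = np.meshgrid(np.arange(d + 1), np.arange(d + 1), indexing="ij")
M = (1 + (-1.0) ** (i + j)) / (i + j + 1)

# Right-hand side <x^i, f> via Gauss-Legendre quadrature
nodes, weights = np.polynomial.legendre.leggauss(20)
rhs = np.array([np.sum(weights * nodes**k * f(nodes)) for k in range(d + 1)])

# Normal equations in the L^2 inner product: M c = rhs
c = np.linalg.solve(M, rhs)
print(c)   # coefficients c_j of the least squares polynomial sum_j c_j x^j
```

For larger \(d\), this monomial Gram matrix becomes badly ill-conditioned, which is one reason to prefer an orthogonal basis (e.g. Legendre polynomials) in practice.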

Minimum norm problems

Consider \(A \in \mathbb{R}^{m \times n}\) with \(m < n\) (and full rank); solve \[ \mbox{minimize } \frac{1}{2} \|x\|^2 \mbox{ s.t. } Ax = b. \]

Solution: \(x = A^T (AA^T)^{-1} b = A^\dagger b\).

  • Approach 1: Calculus (Lagrange multipliers)
  • Approach 2: Consider \(A^\dagger b + z\) and show \(z\) must be zero
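A quick NumPy check of the minimum-norm formula against the pseudoinverse (random wide matrix for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 10                     # wide: more unknowns than equations
A = rng.standard_normal((m, n))  # full row rank with probability 1
b = rng.standard_normal(m)

# Minimum-norm solution x = A^T (A A^T)^{-1} b
x_mn = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x_mn, b))                    # it solves Ax = b
print(np.allclose(x_mn, np.linalg.pinv(A) @ b))    # and matches A^+ b
```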

Least squares and statistics

Regression model

Suppose \[ y = Xc + \epsilon \]

  • Covariates (factors) \(X\) are known
  • Observations \(y\) are contaminated with noise \(\epsilon\)
    • Usually the \(\epsilon_i\) are iid normal with mean 0 and variance \(\sigma^2\)
  • Goal: recover \(c\) as best we can, maybe predict at new points

Linear unbiased estimators

Given \[ y = Xc + \epsilon \] the linear estimator \[ \hat{c} = My \] is unbiased if \(\mathbb{E}[\hat{c}] = MXc = c\) for all \(c\), i.e. \(M\) is a pseudoinverse of \(X\).

NB: If \(\operatorname{Var}[\epsilon] = \sigma^2 I\), then \(\operatorname{Var}[\hat{c}] = M (\sigma^2 I) M^T\).
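A Monte Carlo sketch (toy problem with made-up \(X\), \(c\), and \(\sigma\)) checking that \(\hat{c} = X^\dagger y\) is unbiased and that its covariance matches \(\sigma^2 (X^T X)^{-1}\):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, sigma = 200, 3, 0.5
X = rng.standard_normal((m, n))
c = np.array([1.0, -2.0, 0.5])

chats = []
for _ in range(5000):
    y = X @ c + sigma * rng.standard_normal(m)
    chats.append(np.linalg.lstsq(X, y, rcond=None)[0])
chats = np.array(chats)

print(chats.mean(axis=0))                 # close to c (unbiased)
print(np.cov(chats.T))                    # close to sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```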

Gauss-Markov

Gauss-Markov: \(\hat{c} = X^\dagger y\) is the best linear unbiased estimator

  • “Best” means “minimal variance”
  • But what does this mean for matrices?

A: Want to show \(\operatorname{Var}[M y] \succeq \operatorname{Var}[X^\dagger y]\) whenever \(MX=I\)

Argument: Write \(M = X^\dagger + D\) where \(DX = 0\) and expand.

NB: If \(\operatorname{Var}[\epsilon] = C\), need \(X^T C^{-1} (X\hat{c}-y) = 0\).
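A numerical version of the \(M = X^\dagger + D\) argument for the \(\operatorname{Var}[\epsilon] = \sigma^2 I\) case (the particular \(D\) with \(DX = 0\) is built from a random matrix purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, sigma = 60, 3, 1.0
X = rng.standard_normal((m, n))
Xdag = np.linalg.pinv(X)

# Any unbiased linear estimator is M = X^+ + D with D X = 0; here we build
# such a D by projecting a random matrix's rows onto N(X^T) = R(X)^perp.
Pi = X @ Xdag
D = rng.standard_normal((n, m)) @ (np.eye(m) - Pi)

M = Xdag + D
V_ols = sigma**2 * Xdag @ Xdag.T          # Var[X^+ y]
V_alt = sigma**2 * M @ M.T                # Var[M y]

# Expanding: the cross terms vanish, so Var[M y] - Var[X^+ y] = sigma^2 D D^T,
# which is positive semidefinite.
print(np.allclose(V_alt - V_ols, sigma**2 * D @ D.T))
print(np.min(np.linalg.eigvalsh(V_alt - V_ols)) >= -1e-10)
```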

Maximum likelihood

Suppose \(\epsilon \sim N(0,C)\). Likelihood of \(y\) is \[ \ell(c|y) = \frac{1}{\sqrt{\det(2\pi C)}} \exp\left( -\frac{1}{2} (y-Xc)^T C^{-1} (y-Xc) \right) \] Fixed \(C\): \(\ell\) is maximized at \(X^T C^{-1} (X\hat{c}-y) = 0\).

Non-Gaussian noise gives nonlinear estimators.
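One way to realize this when \(C\) is known is to whiten: factor \(C = LL^T\) and do ordinary least squares on \(L^{-1}X\), \(L^{-1}y\). A sketch with a made-up SPD covariance:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 50, 3
X = rng.standard_normal((m, n))
c_true = np.array([2.0, 0.0, -1.0])

L = np.tril(rng.standard_normal((m, m))) + m * np.eye(m)  # C = L L^T is SPD
C = L @ L.T
y = X @ c_true + L @ rng.standard_normal(m)               # noise with Var = C

# Whiten and solve the ordinary least squares problem L^{-1} X c ~ L^{-1} y
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
c_hat = np.linalg.lstsq(Xw, yw, rcond=None)[0]

# Stationarity condition for the likelihood: X^T C^{-1} (X c_hat - y) = 0
print(np.allclose(X.T @ np.linalg.solve(C, X @ c_hat - y), 0, atol=1e-8))
```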

Interpreting the residual

If \(\epsilon\) is iid, this should be reflected in the residual:

  • About same number of positive and negative \(r_i\)?
  • Auto-correlation between consecutive \(r_i\)?
  • Any concentration on particular “spatial” frequencies?

Point: For statistics, there’s more than just “is it small” to \(r\)!
(But we’re mostly doing linear algebra here, not statistics…)
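A sketch of what such diagnostics might look like (here \(r\) is just a stand-in random vector; in practice it would be the computed residual \(b - Ax_*\)):

```python
import numpy as np

rng = np.random.default_rng(10)
r = rng.standard_normal(100)    # stand-in for a computed residual b - A x_*

print(np.sum(r > 0), np.sum(r < 0))            # roughly balanced signs?
print(np.corrcoef(r[:-1], r[1:])[0, 1])        # lag-1 autocorrelation near 0?
spec = np.abs(np.fft.rfft(r))
print(spec.max() / np.median(spec))            # no single frequency dominates?
```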

Factorizations for least squares

Goal

Want to solve normal equations \[ A^T A x = A^T b. \] How should we do it? Factorization!

Cholesky

Simple to compute Cholesky factorization \[ A^T A = R^T R. \] Then solve \[ R^T R x = A^T b \] by forward and back substitution.
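A sketch of this pipeline using SciPy's triangular solvers (random data for illustration):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(11)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# Cholesky of the Gram matrix: A^T A = R^T R with R upper triangular
R = cholesky(A.T @ A, lower=False)

# Solve R^T R x = A^T b by forward, then back substitution
y = solve_triangular(R, A.T @ b, trans="T", lower=False)  # R^T y = A^T b
x = solve_triangular(R, y, lower=False)                   # R x = y

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```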

Cholesky

Define \(\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A) = \|A\| \|A^\dagger\|\)

  • Note that \(\kappa(A^T A) = \kappa(A)^2\) (why?)
  • Maybe Cholesky isn’t great if \(A\) is somewhat ill-conditioned…
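A small demonstration of the squaring of the condition number (the singular values here are chosen arbitrarily to make \(A\) ill-conditioned):

```python
import numpy as np

rng = np.random.default_rng(12)
U, _ = np.linalg.qr(rng.standard_normal((100, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sing = np.logspace(0, -6, 5)          # singular values from 1 down to 1e-6
A = (U * sing) @ V.T                  # A with kappa(A) = 1e6

print(np.linalg.cond(A))              # about 1e6
print(np.linalg.cond(A.T @ A))        # about 1e12 = kappa(A)^2
```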

QR approach

Suppose \(A^T A = R^T R\). Fwd substitution on \(R^T R x = A^T b\): \[ R x = R^{-T} A^T b. \] Let \(Q = A R^{-1}\). Then \[\begin{aligned} Q^T Q &= R^{-T} A^T A R^{-1} \\ &= R^{-T} R^T R R^{-1} = I \end{aligned}\] That is, \(Q\) has orthonormal columns.
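A sketch of least squares via a library economy QR (random data again; better ways to compute the factorization itself come next time):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(13)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# Economy QR: A = QR with Q 100-by-5 orthonormal, R 5-by-5 upper triangular
Q, R = np.linalg.qr(A, mode="reduced")

# Least squares via QR: solve R x = Q^T b by back substitution
x = solve_triangular(R, Q.T @ b, lower=False)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```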

QR factorization

  • Note that \(\kappa(A) = \kappa(R)\)!
  • But forming \(Q = AR^{-1}\) requires Cholesky on \(A^T A\)?
  • No! Will talk about better ways to compute QR next time

SVD

Suppose \(A = U \Sigma V^T\) an economy SVD. Then \[ A^\dagger = V \Sigma^{-1} U^T \]

  • About as numerically stable as we can hope
  • But computing SVD is relatively expensive
  • Will mostly use SVD for ill-conditioned case
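A sketch of the SVD-based solve (random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(14)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# Economy SVD: A = U Sigma V^T with U 100-by-5, Sigma 5-by-5 diagonal, V 5-by-5
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Pseudoinverse solve: x = V Sigma^{-1} U^T b
x = Vt.T @ ((U.T @ b) / sigma)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))
```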

Factorization summary

\[\begin{aligned} A^\dagger &= (A^T A)^{-1} A^T & \Pi &= A (A^T A)^{-1} A^T \\ &= (R^T R)^{-1} A^T & &= A (R^T R)^{-1} A^T \\ &= R^{-1} Q^T & &= QQ^T \\ &= V \Sigma^{-1} U^T & &= UU^T \end{aligned}\]

  • QR will be our favorite “workhorse” factorization
  • Now we just need to know how to compute it!
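As a closing check, a sketch verifying that the expressions above for \(A^\dagger\) and \(\Pi\) all agree on a random example:

```python
import numpy as np

rng = np.random.default_rng(15)
A = rng.standard_normal((40, 4))

Q, R = np.linalg.qr(A, mode="reduced")            # A = QR
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Sigma V^T

Adag_ne = np.linalg.solve(A.T @ A, A.T)           # (A^T A)^{-1} A^T
Adag_qr = np.linalg.solve(R, Q.T)                 # R^{-1} Q^T
Adag_svd = Vt.T @ (U / s).T                       # V Sigma^{-1} U^T

print(np.allclose(Adag_ne, Adag_qr), np.allclose(Adag_ne, Adag_svd))
print(np.allclose(A @ Adag_ne, Q @ Q.T), np.allclose(Q @ Q.T, U @ U.T))
```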