Prof David Bindel


Solve \[ Ax = b, \] where \(A\) is sparse (or data sparse).

What if we only know how to multiply by \(A\)?

About all you can do is keep multiplying! \[
\mathcal{K}_k(A,b) = \operatorname{span}\left\{
b, A b, A^2 b, \ldots, A^{k-1} b \right\}.
\] Gives surprisingly useful information!
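As a concrete sketch of "keep multiplying": the only operation needed to build the Krylov vectors is a matvec. A minimal pure-Python illustration (the `matvec` below uses a small dense tridiagonal matrix purely as a stand-in for whatever fast product we actually have):

```python
def krylov_vectors(matvec, b, k):
    """Return [b, A b, A^2 b, ..., A^(k-1) b] using only matvec."""
    vs = [b]
    for _ in range(k - 1):
        vs.append(matvec(vs[-1]))
    return vs

def matvec(x):
    # Stand-in A = tridiag(-1, 2, -1), 3-by-3 (illustration only)
    A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]
```

For example, `krylov_vectors(matvec, [1, 0, 0], 3)` produces the first three Krylov vectors without ever forming powers of \(A\) explicitly.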

If \(A\) is symmetric and positive definite,

\(Ax = b\) solves a minimization: \[\begin{aligned}
\phi(x) &= \frac{1}{2} x^T A x - x^T b\\
\nabla \phi(x) &= Ax - b.
\end{aligned}\] Idea: Minimize \(\phi(x)\) over \(\mathcal{K}_k(A,b)\).
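The identity \(\nabla \phi(x) = Ax - b\) is easy to check numerically with a central finite difference, a useful sanity check when debugging. A small sketch (dense \(A\), function names are just for illustration):

```python
def phi(A, b, x):
    """phi(x) = 1/2 x^T A x - x^T b for a small dense A (illustration only)."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return 0.5 * sum(xi * yi for xi, yi in zip(x, Ax)) \
        - sum(xi * bi for xi, bi in zip(x, b))

def grad_phi_fd(A, b, x, h=1e-6):
    """Central-difference gradient of phi; should match A x - b."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((phi(A, b, xp) - phi(A, b, xm)) / (2 * h))
    return g
```

Since \(\phi\) is quadratic, the central difference is exact up to roundoff, so the finite-difference gradient matches \(Ax - b\) to high accuracy.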

Basis for the *method of conjugate gradients*

Idea: Minimize \(\|Ax-b\|^2\) over \(\mathcal{K}_k(A,b)\).

Yields *Generalized Minimum RESidual* (GMRES)
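The least-squares flavor of GMRES is easiest to see for \(k = 1\), where \(\mathcal{K}_1(A,b) = \operatorname{span}\{b\}\) and the minimization over \(\alpha\) of \(\|\alpha Ab - b\|^2\) has a closed form. A sketch (function name hypothetical, not the full Arnoldi-based algorithm):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gmres_step1(matvec, b):
    """Minimize ||alpha*(A b) - b|| over alpha: GMRES on the k = 1 space span{b}."""
    Ab = matvec(b)
    alpha = dot(Ab, b) / dot(Ab, Ab)
    return [alpha * bi for bi in b]
```

When \(A = 2I\), the exact solution \(b/2\) already lies in \(\operatorname{span}\{b\}\), so this one step solves the system. The real algorithm minimizes over all of \(\mathcal{K}_k\) using an orthonormal Arnoldi basis.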

- KSPs are *not* stationary (no constant fixed-point iteration)
- Convergence is surprisingly subtle!
- CG convergence upper bound via *condition number*
- Large condition number iff level surfaces of \(\phi(x)\) are narrow
- True for Poisson and company

*Preconditioned* problem: \(M^{-1} A x = M^{-1} b\). Whence \(M\)?

- From a stationary method?
- From a simpler/coarser discretization?
- From approximate factorization?
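The simplest concrete choice is Jacobi: take \(M = \operatorname{diag}(A)\), so applying \(M^{-1}\) is just elementwise division. A sketch (function name hypothetical), producing an `Msolve`-style callable of the kind the CG loop below expects:

```python
def jacobi_msolve(diag):
    """M = diag(A): applying M^{-1} to r is elementwise division by the diagonal."""
    return lambda r: [ri / di for ri, di in zip(r, diag)]
```

For example, `jacobi_msolve([2.0, 4.0])` returns a function mapping `[2.0, 8.0]` to `[1.0, 2.0]`.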

```
r = b - A*x;
p = 0; beta = 0;
z = Msolve(r);           % apply preconditioner: z = M \ r
rho = dot(r, z);
for i = 1:nsteps
  p = z + beta*p;        % update search direction
  q = A*p;
  alpha = rho / dot(p, q);
  x = x + alpha*p;
  r = r - alpha*q;
  if norm(r) < tol, break; end
  z = Msolve(r);         % apply preconditioner
  rho_prev = rho;
  rho = dot(r, z);
  beta = rho / rho_prev;
end
```

- Solve with \(M\)
- Product with \(A\)
- Dot products and axpys
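The loop uses exactly these three kinds of kernels. A minimal pure-Python transcription of the PCG loop above (dense matvec and identity/Jacobi `msolve` are only for illustration; any matvec and preconditioner solve with matching shapes would do):

```python
def pcg(matvec, msolve, b, x, nsteps, tol):
    """Preconditioned CG, a direct transcription of the loop above."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    def axpy(a, u, v):                    # a*u + v
        return [a * ui + vi for ui, vi in zip(u, v)]
    r = axpy(-1.0, matvec(x), b)          # r = b - A*x
    p, beta = [0.0] * len(b), 0.0
    z = msolve(r)
    rho = dot(r, z)
    for _ in range(nsteps):
        p = axpy(beta, p, z)              # p = z + beta*p
        q = matvec(p)
        alpha = rho / dot(p, q)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, q, r)
        if dot(r, r) ** 0.5 < tol:
            break
        z = msolve(r)
        rho_prev, rho = rho, dot(r, z)
        beta = rho / rho_prev
    return x
```

On an s.p.d. \(n \times n\) system, exact arithmetic CG finishes in at most \(n\) steps; e.g. the 3-by-3 tridiagonal \((-1, 2, -1)\) system with \(b = (1,1,1)\) converges to \((1.5, 2, 1.5)\) in three iterations.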

- Rearrange if \(M = LL^T\) is available
- Or build around “powers kernel”
  - Old “s-step” approach of Chronopoulos and Gear
  - CA-Krylov of Hoemmen and Demmel
  - Hard to keep stable

Two real application levers:

- Better preconditioning
- Faster matvecs

Key: fast solve with \(M\), product with \(A\)

- Some preconditioners parallelize better! (Jacobi vs Gauss-Seidel)
- Balance cost against effectiveness.
  - Speed to set up \(M\)?
  - Speed to apply \(M\) after setup?

- Cheaper to do two multiplies/solves at once...
- Can’t exploit in obvious way — lose stability
- Variants allow multiple products (CA-Krylov)

- Lots of fiddling possible with \(M\); what about the matvec with \(A\)?

Consider 5-point stencil on an \(n \times n\) mesh.

- Information moves one grid cell per matvec.
- Cost per matvec is \(O(n^2)\).
- At least \(O(n^3)\) work to get information across mesh!

- Time to converge \(\geq\) time to move info across
- For a 2D mesh: \(O(n)\) matvecs, \(O(n^3) = O(N^{3/2})\) cost
- For a 3D mesh: \(O(n)\) matvecs, \(O(n^4) = O(N^{4/3})\) cost
- “Long” meshes yield slow convergence
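The cost claims above are just arithmetic: \(N = n^2\) unknowns in 2D (\(N = n^3\) in 3D), \(O(n)\) matvecs at \(O(N)\) work each. A quick numeric sanity check:

```python
# Counting argument for unpreconditioned CG on a 5-point/7-point stencil:
# O(n) matvecs to move information across the mesh, O(N) work per matvec.
n = 64
N2, N3 = n**2, n**3     # unknowns on n x n and n x n x n meshes
cost2d = n * N2         # total work, 2D: O(n^3) = O(N^{3/2})
cost3d = n * N3         # total work, 3D: O(n^4) = O(N^{4/3})
```

Both identities hold exactly for any \(n\), since \(n \cdot n^2 = (n^2)^{3/2}\) and \(n \cdot n^3 = (n^3)^{4/3}\).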

3D beats 2D because everything is closer!

- Advice: sparse direct for 2D, CG for 3D.
- Better advice: use a preconditioner!

Define the *condition number* \(\kappa(L)\) for s.p.d. \(L\): \[\kappa(L) = \frac{\lambda_{\max}(L)}{\lambda_{\min}(L)}\] Describes how elongated the level surfaces of \(\phi\) are.

- For Poisson, \(\kappa(L) = O(h^{-2})\)
- Steps to halve error: \(O(\sqrt{\kappa}) = O(h^{-1})\).
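The \(O(h^{-2})\) scaling is easy to verify in 1D, where the eigenvalues of the tridiagonal \((-1, 2, -1)\) Poisson matrix are known in closed form: \(\lambda_k = 2 - 2\cos(k \pi h)\) with \(h = 1/(n+1)\). A small sketch:

```python
import math

def kappa_poisson_1d(n):
    """Condition number of tridiag(-1, 2, -1): eigenvalues 2 - 2 cos(k*pi*h)."""
    h = 1.0 / (n + 1)
    lam = [2 - 2 * math.cos(k * math.pi * h) for k in range(1, n + 1)]
    return max(lam) / min(lam)
```

Halving \(h\) (doubling the mesh resolution) multiplies \(\kappa\) by about 4, consistent with \(\kappa = O(h^{-2})\).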

Similar back-of-the-envelope estimates for some other PDEs. But these are not always that useful... can be pessimistic if there are only a few extreme eigenvalues.

Error \(e_k\) after \(k\) steps of CG gets smoother!

- CG already handles high-frequency error
- Want something to deal with lower frequency!
- Jacobi is useless here
  - With a constant diagonal, it doesn’t even change the Krylov subspace!

Better idea: block Jacobi?

- Q: How should things split up?
- A: Minimize blocks across domain.
- Compatible with minimizing communication!
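A sketch of what "block Jacobi" means as a preconditioner solve: partition the unknowns into contiguous blocks and solve each diagonal block of \(A\) independently (the blocks could be solved in parallel). All names here are illustrative, and the tiny dense Gaussian elimination is only a stand-in for a real local solver:

```python
def solve_dense(A, b):
    """Tiny Gaussian elimination with partial pivoting (illustration only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def block_jacobi_msolve(A, bs):
    """M^{-1} r: solve the diagonal blocks of A independently (block size bs)."""
    n = len(A)
    blocks = [(lo, min(lo + bs, n)) for lo in range(0, n, bs)]
    def msolve(r):
        z = [0.0] * n
        for lo, hi in blocks:
            sub = [row[lo:hi] for row in A[lo:hi]]   # diagonal block of A
            for k, zk in zip(range(lo, hi), solve_dense(sub, r[lo:hi])):
                z[k] = zk
        return z
    return msolve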

Generalizes block Gauss-Seidel

- Get ghost cell data (green)
- Solve *everything* local (including neighbor data)
- Update local values for next step (local)
- Default strategy in PETSc

- RAS moves info one processor per step
- For scalability, still need to get around this!
- Basic idea: use multiple grids
- Fine grid gives lots of work, kills high-freq error
- Coarse grid cheaply gets info across mesh, kills low freq

Can also tune matrix multiply

- Represented implicitly (regular grids)
- Example: Optimizing stencil operations (Datta)

- Or explicitly (e.g. compressed sparse column)
- Sparse matrix blocking and reordering
- Packages: Sparsity (Im), OSKI (Vuduc)
- Available as PETSc extension

Or further rearrange algorithm (Hoemmen, Demmel).

```
/* y = A*x, with A in compressed sparse row (CSR) format */
for (int i = 0; i < n; ++i) {
    y[i] = 0;
    for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
        y[i] += A[jj]*x[col[jj]];
}
```

Problem: in `y[i] += A[jj]*x[col[jj]];`, the access **`x[col[jj]]`** is indirect.

Memory access patterns:

- Elements of \(y\) accessed sequentially
- Elements of \(A\) accessed sequentially
- Accesses to \(x\) are all over!

Can help by switching to block CSR.
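A sketch of the block CSR (BCSR) idea with 2-by-2 blocks, in Python for brevity (function name and layout conventions are illustrative): each stored block contributes four multiplies touching two *consecutive* entries of `x`, improving register and cache reuse relative to scalar CSR.

```python
def bcsr_matvec_2x2(bptr, bcol, vals, x):
    """y = A*x with A in 2x2 block CSR: vals holds 4 entries per block
    (row-major within the block), bcol the block column, bptr block-row extents."""
    nbrows = len(bptr) - 1
    y = [0.0] * (2 * nbrows)
    for ib in range(nbrows):
        y0 = y1 = 0.0
        for jj in range(bptr[ib], bptr[ib + 1]):
            j = 2 * bcol[jj]              # two consecutive entries of x per block
            a = vals[4 * jj : 4 * jj + 4]
            y0 += a[0] * x[j] + a[1] * x[j + 1]
            y1 += a[2] * x[j] + a[3] * x[j + 1]
        y[2 * ib], y[2 * ib + 1] = y0, y1
    return y
```

The trade-off: explicit zeros get stored (and multiplied) when the true sparsity pattern does not fill whole blocks, which is why packages like OSKI tune the block size per matrix.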

Switching to single precision and short indices can help memory traffic, too!

- Each processor gets a piece
- Many partitioning strategies
- Idea: re-order so one of these strategies is “good”

SpMV performance goals:

- Balance load?
- Balance storage?
- Minimize communication?
- Good cache re-use?

Reordering also comes up for GE!