# CS 5220

## Applications of Parallel Computers

### Krylov subspace methods

Prof David Bindel


### Goal

Solve $$Ax = b$$, where $$A$$ is sparse (or data sparse).

### Krylov Subspace Methods

What if we only know how to multiply by $$A$$?
About all you can do is keep multiplying!
$$\mathcal{K}_k(A,b) = \operatorname{span}\left\{ b, Ab, A^2 b, \ldots, A^{k-1} b \right\}.$$
This gives surprisingly useful information!
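A minimal sketch in C of building the (unorthogonalized) Krylov basis by repeated matvec; `krylov_basis` and the `matvec` callback are illustrative names, not from the slides. In practice one orthogonalizes as one goes (Arnoldi/Lanczos), since the raw power vectors quickly become numerically dependent:

```c
#include <string.h>

// Build [b, A*b, ..., A^(k-1)*b] as columns of the n-by-k array V
// (column j stored at V + j*n).  matvec computes y = A*v.
void krylov_basis(void (*matvec)(const double* v, double* y),
                  const double* b, double* V, int n, int k)
{
    memcpy(V, b, n * sizeof(double));   // first basis vector is b
    for (int j = 1; j < k; ++j)
        matvec(V + (j-1)*n, V + j*n);   // next vector: A * previous
}
```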

If $$A$$ is symmetric and positive definite, solving
$$Ax = b$$ is equivalent to a minimization problem:
$$\begin{aligned} \phi(x) &= \frac{1}{2} x^T A x - x^T b,\\ \nabla \phi(x) &= Ax - b, \end{aligned}$$
and $$\nabla \phi(x) = 0$$ exactly when $$Ax = b$$.
Idea: Minimize $$\phi(x)$$ over $$\mathcal{K}_k(A,b)$$.
This is the basis for the method of conjugate gradients (CG).

### Example: GMRES

Idea: Minimize $$\|Ax-b\|^2$$ over $$\mathcal{K}_k(A,b)$$.
This yields the Generalized Minimum RESidual method (GMRES).

### Convergence of Krylov Subspace Methods

• KSPs are not stationary
(no constant fixed-point iteration)
• Convergence is surprisingly subtle!
• CG convergence upper bound via condition number
• Large condition number iff the level surfaces of $$\phi$$ are long and narrow
• True for Poisson and company

### Convergence of Krylov Subspace Methods

• Preconditioned problem $$M^{-1} A x = M^{-1} b$$
• Whence $$M$$?
• From a stationary method?
• From a simpler/coarser discretization?
• From approximate factorization?

### PCG

```matlab
% Preconditioned CG: solve A*x = b with preconditioner solve Msolve
r = b - A*x;                 % initial residual
p = 0; beta = 0;
z = Msolve(r);               % preconditioned residual z = M\r
rho = dot(r, z);
for i = 1:nsteps
    p = z + beta*p;          % new search direction
    q = A*p;
    alpha = rho / dot(p, q); % step length
    x = x + alpha*p;         % update solution ...
    r = r - alpha*q;         % ... and residual
    if norm(r) < tol, break; end
    z = Msolve(r);
    rho_prev = rho;
    rho = dot(r, z);
    beta = rho / rho_prev;
end
```

### PCG parallel work

• Solve with $$M$$
• Product with $$A$$
• Dot products and axpys
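Of these, the dot products are the global synchronization points: each one is an all-reduce across processors. A minimal MPI sketch (assuming each rank owns a slice of length `nlocal`; the function name is illustrative):

```c
#include <mpi.h>

// Distributed dot product: sum the local slice, then combine the
// partial sums across all ranks with an all-reduce.
double pdot(const double* u, const double* v, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < nlocal; ++i)
        local += u[i] * v[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```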

### Pushing PCG

• Rearrange if $$M = LL^T$$ is available
• Or build around “powers kernel”
• Old “s-step” approach of Chronopoulos and Gear
• CA-Krylov of Hoemmen and Demmel
• Hard to keep stable

### Pushing PCG

Two real application levers:

• Better preconditioning
• Faster matvecs

### PCG bottlenecks

Key: fast solve with $$M$$, fast product with $$A$$

• Some preconditioners parallelize better than others!
(Jacobi vs Gauss-Seidel; see the sketch below)
• Balance parallel speed against preconditioner quality.
• Cost to set up $$M$$?
• Cost to apply $$M$$ after setup?
• Cheaper to do two multiplies/solves at once...
• Can't exploit this in the obvious way without losing stability
• Variants allow multiple products per step (CA-Krylov)
• Lots of fiddling possible with $$M$$; what about the matvec with $$A$$?
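To see the Jacobi vs Gauss-Seidel contrast concretely, compare the two preconditioner applications below (a sketch assuming CSR arrays `A`, `ptr`, `col` as in the CSR code later in these slides, plus an extracted diagonal `diag`):

```c
// Jacobi: z = D^{-1} r.  Every entry is independent: trivially parallel.
for (int i = 0; i < n; ++i)
    z[i] = r[i] / diag[i];

// Gauss-Seidel (forward sweep): z = (D+L)^{-1} r.  Each z[i] depends on
// z[j] for earlier rows j < i, a loop-carried dependence that serializes
// the sweep (absent coloring or reordering tricks).
for (int i = 0; i < n; ++i) {
    double s = r[i];
    for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
        if (col[jj] < i)
            s -= A[jj] * z[col[jj]];
    z[i] = s / diag[i];
}
```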

### Thinking on (basic) CG convergence

Consider 5-point stencil on an $$n \times n$$ mesh.

• Information moves one grid cell per matvec.
• So crossing the mesh takes $$O(n)$$ matvecs.
• Cost per matvec is $$O(n^2)$$.
• At least $$O(n^3)$$ work to get information across the mesh!

### Convergence by counting

• Time to converge $$\geq$$ time to move info across
• For a 2D mesh: $$O(n)$$ matvecs, $$O(n^3) = O(N^{3/2})$$ cost
• For a 3D mesh: $$O(n)$$ matvecs, $$O(n^4) = O(N^{4/3})$$ cost
• “Long” meshes yield slow convergence

### Convergence by counting

3D beats 2D because everything is closer!

• Advice: sparse direct for 2D, CG for 3D.
• Better advice: use a preconditioner!

### Eigenvalue approach

For s.p.d. $$L$$, define the condition number
$$\kappa(L) = \frac{\lambda_{\max}(L)}{\lambda_{\min}(L)}.$$
This describes how elongated the level surfaces of $$\phi$$ are.

### Eigenvalue approach

• For Poisson, $$\kappa(L) = O(h^{-2})$$
• Steps to halve error: $$O(\sqrt{\kappa}) = O(h^{-1})$$.
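As a sanity check on the first bullet, the 1D model problem gives this scaling explicitly (a standard computation, not specific to these slides): for $$L = h^{-2} \operatorname{tridiag}(-1, 2, -1)$$ with $$h = 1/(n+1)$$, the eigenvalues are
$$\lambda_j = \frac{4}{h^2} \sin^2\left(\frac{j \pi h}{2}\right), \qquad j = 1, \ldots, n,$$
so $$\lambda_{\min} \approx \pi^2$$, $$\lambda_{\max} \approx 4 h^{-2}$$, and $$\kappa(L) \approx 4/(\pi^2 h^2) = O(h^{-2})$$.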

Similar back-of-the-envelope estimates for some other PDEs. But these are not always that useful... can be pessimistic if there are only a few extreme eigenvalues.

### Frequency-domain approach

Error $$e_k$$ after $$k$$ steps of CG gets smoother!

### Preconditioning Poisson

• CG already handles high-frequency error
• Want something to deal with lower frequency!
• Jacobi is useless here
• The Poisson diagonal is constant, so $$M \propto I$$: it doesn't even change the Krylov subspace!

### Preconditioning Poisson

Better idea: block Jacobi?

• Q: How should things split up?
• A: Minimize the coupling across block boundaries.
• Compatible with minimizing communication!
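A minimal sketch of applying such a block Jacobi preconditioner, with hypothetical helpers: `start[j]` marks the first row of block `j`, and `block_solve` applies a prefactored solve with that diagonal block. The block solves are mutually independent, so blocks map directly onto processors:

```c
// z = M^{-1} r, where M is the block diagonal of A.  Each block solve
// is independent, so this loop parallelizes with no communication.
for (int j = 0; j < nblocks; ++j) {
    int lo = start[j], len = start[j+1] - start[j];
    block_solve(j, &r[lo], &z[lo], len);  // hypothetical local solve
}
```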

### Schwarz preconditioners

Multiplicative Schwarz generalizes block Gauss-Seidel; its additive variants generalize block Jacobi. Restricted Additive Schwarz (RAS):

• Get ghost cell data from neighbors
• Solve everything local (including neighbor data)
• Update local values for next step (local)
• Default strategy in PETSc

### Multilevel Ideas

• RAS moves info one processor per step
• For scalability, still need to get around this!
• Basic idea: use multiple grids
• Fine grid gives lots of work, kills high-frequency error
• Coarse grid cheaply moves information across the mesh, kills low-frequency error
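A two-grid correction cycle makes the division of labor concrete (a sketch with hypothetical helpers; real multigrid recurses on the coarse solve):

```c
// Hypothetical helpers: smoother, residual, grid transfers, coarse solve.
void smooth(double* x, const double* b, int n);
void residual(double* r, const double* x, const double* b, int n);
void restrict_(double* rc, const double* r, int n);
void prolong(double* e, const double* ec, int n);
void coarse_solve(double* ec, const double* rc, int nc);

// One two-grid cycle for A*x = b on a fine grid of size n.
void two_grid(double* x, const double* b, int n)
{
    smooth(x, b, n);              // fine grid: damp high-frequency error
    double r[n], e[n], rc[n/2], ec[n/2];
    residual(r, x, b, n);         // r = b - A*x
    restrict_(rc, r, n);          // move residual to the coarse grid
    coarse_solve(ec, rc, n/2);    // cheap: handles low-frequency error
    prolong(e, ec, n);            // interpolate correction back
    for (int i = 0; i < n; ++i)
        x[i] += e[i];             // apply coarse-grid correction
    smooth(x, b, n);              // post-smooth
}
```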

### Tuning matmul

Can also tune the matrix-vector multiply

• Represented implicitly (regular grids)
• Example: Optimizing stencil operations (Datta)
• Or explicitly (e.g. compressed sparse column)
• Sparse matrix blocking and reordering
• Packages: Sparsity (Im), OSKI (Vuduc)
• Available as PETSc extension

Or further rearrange algorithm (Hoemmen, Demmel).

### Reminder: Compressed sparse row

```c
// Compute y = A*x for A in compressed sparse row (CSR) format:
// row i's nonzeros are A[ptr[i]] .. A[ptr[i+1]-1], columns in col.
for (int i = 0; i < n; ++i) {
    y[i] = 0;
    for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
        y[i] += A[jj]*x[col[jj]];
}
```

Problem: `y[i] += A[jj]*x[col[jj]];`

### Memory traffic in CSR multiply

Memory access patterns:

• Elements of $$y$$ accessed sequentially
• Elements of $$A$$ accessed sequentially
• Accesses to $$x$$ are all over the place!

Switching to block CSR can help (see the sketch below).
Switching to single precision or shorter indices can reduce memory traffic, too!
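A sketch of the matvec for 2-by-2 block CSR (the names `nb`, `bptr`, `bcol` are illustrative: block row `I` owns blocks `bptr[I]` to `bptr[I+1]-1`, each stored as four contiguous doubles). One column index now serves four nonzeros, and two entries of `x` are reused from registers:

```c
// y = A*x with A in 2x2 block CSR format.
for (int I = 0; I < nb; ++I) {           // loop over block rows
    double y0 = 0, y1 = 0;
    for (int jj = bptr[I]; jj < bptr[I+1]; ++jj) {
        const double* Ab = &A[4*jj];     // 2x2 block, row-major
        const double* xb = &x[2*bcol[jj]];
        y0 += Ab[0]*xb[0] + Ab[1]*xb[1];
        y1 += Ab[2]*xb[0] + Ab[3]*xb[1];
    }
    y[2*I]     = y0;
    y[2*I + 1] = y1;
}
```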

### Parallelizing matvec

• Each processor gets a piece
• Many partitioning strategies
• Idea: re-order so one of these strategies is “good”
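With a row partition, for example, the shared-memory version is just a parallel loop over rows (a sketch using OpenMP with the same CSR arrays as before):

```c
// Row-partitioned CSR matvec: each thread handles a contiguous
// band of rows; all threads read the shared vector x.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) {
    double yi = 0;
    for (int jj = ptr[i]; jj < ptr[i+1]; ++jj)
        yi += A[jj]*x[col[jj]];
    y[i] = yi;
}
```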

### Reordering for matvec

SpMV performance goals: