CS 5220

Applications of Parallel Computers

Intro to Dense LA

Prof David Bindel


Where we are

• This week: dense linear algebra
• Next week: sparse linear algebra

Nutshell NLA

• Linear systems: $$Ax = b$$
• Least squares: minimize $$\|Ax-b\|_2^2$$
• Eigenvalues: $$Ax = \lambda x$$

Nutshell NLA

• Basic paradigm: matrix factorization
• $$A = LU$$, $$A = LL^T$$
• $$A = QR$$
• $$A = V \Lambda V^{-1}$$, $$A = Q T Q^T$$
• $$A = U \Sigma V^T$$
• Factorization $$\equiv$$ switch to basis that makes problem easy (see the sketch below)
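
To make the factor-then-solve paradigm concrete, here is a minimal sketch in C using the LAPACKE interface (LAPACK itself shows up later in these slides): factor $$A = PLU$$ once, and solving $$Ax = b$$ then costs only two cheap triangular solves. The matrix and right-hand side are made-up illustration data.

#include <lapacke.h>
#include <stdio.h>

int main(void) {
    double A[2*2] = {4, 3,
                     6, 3};      /* 2x2 matrix, row-major */
    double b[2]   = {10, 12};    /* right-hand side; true solution is (1, 2) */
    lapack_int ipiv[2];

    /* Factor A = P*L*U in place (L and U overwrite A) */
    LAPACKE_dgetrf(LAPACK_ROW_MAJOR, 2, 2, A, 2, ipiv);
    /* Solve A*x = b using the factorization; x overwrites b */
    LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', 2, 1, A, 2, ipiv, b, 1);
    printf("x = (%g, %g)\n", b[0], b[1]);
    return 0;
}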

Dense and sparse

Two flavors: dense and sparse

Dense

Common structures, no complicated indexing

• General dense (all entries nonzero)
• Banded (zero below/above some diagonal)
• Symmetric/Hermitian
• Standard, robust algorithms (LAPACK)

Sparse

Stuff not stored in dense form!

• Maybe few nonzeros
• e.g. compressed sparse row (CSR) formats (sketched below)
• May be implicit (e.g. via finite differencing)
• May be “dense”, but with a compact representation (e.g. via FFT)
• Mostly iterations; varied and subtle
• Build on dense ideas
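
As a preview of “not stored in dense form,” here is a minimal compressed sparse row (CSR) matrix-vector product in C; the struct and field names are illustrative, not from any particular library.

/* CSR: store only the nonzero entries of each row */
typedef struct {
    int n;          /* number of rows */
    int *rowptr;    /* nonzeros of row i live at positions rowptr[i]..rowptr[i+1]-1 */
    int *colind;    /* column index of each stored nonzero */
    double *val;    /* value of each stored nonzero */
} csr_t;

/* y = A*x for a CSR matrix A */
void csr_matvec(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->n; ++i) {
        double yi = 0.0;
        for (int jj = A->rowptr[i]; jj < A->rowptr[i+1]; ++jj)
            yi += A->val[jj] * x[A->colind[jj]];
        y[i] = yi;
    }
}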

History: BLAS 1 (1973–1977)

15 ops (mostly) on vectors

• BLAS 1 == $$O(n^1)$$ ops on $$O(n^1)$$ data
• Up to four versions of each: S/D/C/Z
• Example: DAXPY
• Double precision (real)
• Computes $$y \leftarrow \alpha x + y$$ for scalar $$\alpha$$ (sketched below)
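
A minimal DAXPY call through the CBLAS C interface might look like this (assuming a CBLAS installation; the data is made up for illustration):

#include <cblas.h>

int main(void) {
    double x[3] = {1, 2, 3};
    double y[3] = {4, 5, 6};
    /* DAXPY: y <- alpha*x + y */
    cblas_daxpy(3, 2.0, x, 1, y, 1);   /* y is now {6, 9, 12} */
    return 0;
}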

BLAS 1 goals

• Raise level of programming abstraction
• Robust implementation (e.g. avoid over/underflow)
• Portable interface, efficient machine-specific implementation
• Used in LINPACK (and EISPACK?)

History: BLAS 2 (1984–1986)

25 ops (mostly) on matrix/vector pairs

• BLAS2 == $$O(n^2)$$ ops on $$O(n^2)$$ data
• Different data types and matrix types
• Example: DGEMV
• Double precision
• GEneral matrix
• Matrix-Vector product (sketched below)
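
A minimal DGEMV call through the same CBLAS interface (illustrative data; row-major storage assumed):

#include <cblas.h>

int main(void) {
    double A[2*3] = {1, 2, 3,
                     4, 5, 6};    /* 2x3 matrix, row-major */
    double x[3]   = {1, 1, 1};
    double y[2]   = {0, 0};
    /* DGEMV: y <- alpha*A*x + beta*y */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0, A, 3, x, 1, 0.0, y, 1);   /* y is now {6, 15} */
    return 0;
}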

BLAS 2 goals

• BLAS1 insufficient
• BLAS2 for better vectorization (when vector machines roamed)

History: BLAS 3 (1987–1988)

9 ops (mostly) on matrix/matrix

• BLAS3 == $$O(n^3)$$ ops on $$O(n^2)$$ data
• Different data types and matrix types
• Example: DGEMM
• Double precision
• GEneral matrix
• Matrix-Matrix product (sketched below)
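
And a minimal DGEMM call through CBLAS (illustrative data; row-major storage assumed):

#include <cblas.h>

int main(void) {
    double A[2*2] = {1, 2, 3, 4};
    double B[2*2] = {5, 6, 7, 8};
    double C[2*2] = {0, 0, 0, 0};
    /* DGEMM: C <- alpha*A*B + beta*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 2, 2, 2,
                1.0, A, 2, B, 2, 0.0, C, 2);   /* C is now {19, 22, 43, 50} */
    return 0;
}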

BLAS 3 goals

Efficient cache utilization!

Why BLAS?

LU for $$2 \times 2$$: $\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ c/a & 1 \end{bmatrix} \begin{bmatrix} a & b \\ 0 & d-bc/a \end{bmatrix}$

Why BLAS?

Block elimination $\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ 0 & D-CA^{-1}B \end{bmatrix}$

Why BLAS?

Block LU $\begin{split} \begin{bmatrix} A & B \\ C & D \end{bmatrix} &= \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \\ &= \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \end{split}$

Why BLAS?

Think of $$A$$ as $$k \times k$$, $$k$$ moderate:

[L11,U11] = small_lu(A);   % Small block LU
U12 = L11\B;               % Triangular solve
L21 = C/U11;               % Triangular solve
S   = D-L21*U12;           % Rank k update
[L22,U22] = lu(S);         % Finish factoring

Three level-3 BLAS calls!

• Two triangular solves (TRSM)
• One rank-$$k$$ update (GEMM); see the sketch below
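
One way those three calls might map onto CBLAS, for an $$n \times n$$ row-major matrix whose leading $$k \times k$$ block already holds $$L_{11}$$ and $$U_{11}$$ from the small LU. This is only a sketch: pivoting and error checks are omitted, and the in-place block layout is an assumption of the example.

#include <cblas.h>

/* One step of blocked LU on an n x n row-major matrix M (leading dimension n),
 * whose leading k x k block already holds L11 (unit lower) and U11 (upper). */
void block_lu_step(int n, int k, double *M) {
    double *A = M;                    /* k x k block with L11 and U11 packed together */
    double *B = M + k;                /* k x (n-k) block; overwritten by U12 */
    double *C = M + (size_t)k*n;      /* (n-k) x k block; overwritten by L21 */
    double *D = M + (size_t)k*n + k;  /* (n-k) x (n-k) block; overwritten by S */

    /* U12 = L11 \ B: triangular solve (DTRSM) */
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans, CblasUnit,
                k, n-k, 1.0, A, n, B, n);
    /* L21 = C / U11: triangular solve (DTRSM) */
    cblas_dtrsm(CblasRowMajor, CblasRight, CblasUpper, CblasNoTrans, CblasNonUnit,
                n-k, k, 1.0, A, n, C, n);
    /* S = D - L21*U12: rank-k update (DGEMM) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n-k, n-k, k, -1.0, C, n, B, n, 1.0, D, n);
}

Factoring the Schur complement $$S$$ (the last line of the pseudocode above) would then proceed recursively.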

LAPACK (1989–present)

• Supersedes earlier LINPACK and EISPACK
• High performance through BLAS
• Parallel to the extent BLAS are parallel (on SMP)
• Linear systems, least squares: near 100% BLAS 3
• Eigenproblems, SVD: only about 50% BLAS 3
• Careful error bounds on everything
• Lots of variants for different structures

ScaLAPACK (1995–present)

• MPI implementations
• Only a small subset of LAPACK functionality

PLASMA (2008–present)

Parallel Linear Algebra Software for Multicore Architectures

• Target: Shared memory multiprocessors
• Stacks on LAPACK/BLAS interfaces
• Tile algorithms, tile data layout, dynamic scheduling
• Other algorithmic ideas, too (randomization, etc)

MAGMA (2008–present)

Matrix Algebra on GPU and Multicore Architectures

• Target: CUDA, OpenCL, Xeon Phi
• Still stacks (e.g. on CUDA BLAS)
• Again: tile algorithms + data, dynamic scheduling
• Mixed precision algorithms (+ iterative refinement)

And beyond!

SLATE???

Much is housed at UTK ICL