CS 5220

Applications of Parallel Computers

Intro to Dense LA

Prof David Bindel

Where we are

  • This week: dense linear algebra
  • Next week: sparse linear algebra

Nutshell NLA

  • Linear systems: \(Ax = b\)
  • Least squares: minimize \(\|Ax-b\|_2^2\)
  • Eigenvalues: \(Ax = \lambda x\)

Nutshell NLA

  • Basic paradigm: matrix factorization
    • \(A = LU\), \(A = LL^T\)
    • \(A = QR\)
    • \(A = V \Lambda V^{-1}\), \(A = Q T Q^T\)
    • \(A = U \Sigma V^T\)
  • Factorization \(\equiv\) switch to basis that makes problem easy
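
For example, once \(A = LU\) is in hand, \(Ax = b\) splits into two easy triangular solves: \[ Ly = b, \qquad Ux = y. \]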

Dense and sparse

Two flavors: dense and sparse

Dense

Common structures, no complicated indexing

  • General dense (all entries nonzero)
  • Banded (zero below/above some diagonal)
  • Symmetric/Hermitian
  • Standard, robust algorithms (LAPACK)

Sparse

Stuff not stored in dense form!

  • May have few nonzeros
    • e.g. compressed sparse row format (sketched below)
  • May be implicit (e.g. via finite differencing)
  • May be “dense”, but with a compact representation (e.g. via FFT)
  • Mostly iterative algorithms; varied and subtle
  • Build on dense ideas
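
To make the contrast concrete, here is a minimal sketch of a matrix-vector product in compressed sparse row (CSR) form; the array names (rowptr, colind, vals) are illustrative conventions, not a fixed library API.

/* y = A*x for an n-by-n matrix A in CSR form:
   vals[rowptr[i] .. rowptr[i+1]) holds the nonzeros of row i,
   colind[] gives their column indices. */
void csr_matvec(int n, const int* rowptr, const int* colind,
                const double* vals, const double* x, double* y)
{
    for (int i = 0; i < n; ++i) {
        double yi = 0.0;
        for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
            yi += vals[j] * x[colind[j]];
        y[i] = yi;
    }
}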

History: BLAS 1 (1973–1977)

15 ops (mostly) on vectors

  • BLAS 1 == \(O(n^1)\) ops on \(O(n^1)\) data
  • Up to four versions of each: S/D/C/Z
  • Example: DAXPY
    • Double precision (real)
    • Computes \(y \leftarrow \alpha x + y\) for scalar \(\alpha\)
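
A minimal sketch of a DAXPY call through the CBLAS interface (the C binding for the Fortran BLAS); link against any BLAS implementation:

#include <cblas.h>

int main(void)
{
    double x[3] = {1, 2, 3};
    double y[3] = {4, 5, 6};
    /* y <- 2*x + y, unit stride in both vectors */
    cblas_daxpy(3, 2.0, x, 1, y, 1);
    /* y is now {6, 9, 12} */
    return 0;
}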

BLAS 1 goals

  • Raise level of programming abstraction
  • Robust implementation (e.g. avoid over/underflow)
  • Portable interface, efficient machine-specific implementation
  • Used in LINPACK (and EISPACK?)

History: BLAS 2 (1984–1986)

25 ops (mostly) on matrix/vector pairs

  • BLAS 2 == \(O(n^2)\) ops on \(O(n^2)\) data
  • Different data types and matrix types
  • Example: DGEMV
    • Double precision
    • GEneral matrix
    • Matrix-Vector product
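
A matching CBLAS sketch of DGEMV, computing \(y \leftarrow \alpha A x + \beta y\) (row-major layout here; the underlying Fortran convention is column-major):

#include <cblas.h>

int main(void)
{
    double A[2*3] = {1, 2, 3,
                     4, 5, 6};   /* 2 x 3, row-major */
    double x[3] = {1, 1, 1};
    double y[2] = {0, 0};
    /* y <- 1.0*A*x + 0.0*y */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0, A, 3, x, 1, 0.0, y, 1);
    /* y is now {6, 15} */
    return 0;
}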

BLAS 2 goals

  • BLAS 1 insufficient
  • BLAS 2 for better vectorization (when vector machines roamed)

History: BLAS 3 (1987–1988)

9 ops (mostly) on matrix/matrix

  • BLAS 3 == \(O(n^3)\) ops on \(O(n^2)\) data
  • Different data types and matrix types
  • Example: DGEMM
    • Double precision
    • GEneral matrix
    • Matrix-Matrix product
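
And the level-3 workhorse, DGEMM, computing \(C \leftarrow \alpha AB + \beta C\), again sketched via CBLAS:

#include <cblas.h>

int main(void)
{
    double A[2*3] = {1, 0, 2,
                     0, 1, 3};   /* 2 x 3, row-major */
    double B[3*2] = {1, 4,
                     2, 5,
                     3, 6};      /* 3 x 2 */
    double C[2*2] = {0, 0,
                     0, 0};      /* 2 x 2 */
    /* C <- 1.0*A*B + 0.0*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3, 1.0, A, 3, B, 2, 0.0, C, 2);
    /* C is now {7, 16, 11, 23} */
    return 0;
}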

BLAS 3 goals

Efficient cache utilization! With \(O(n^3)\) flops on \(O(n^2)\) data, each matrix entry can be reused \(O(n)\) times once it is in cache.

Why BLAS?

LU for \(2 \times 2\): \[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ c/a & 1 \end{bmatrix} \begin{bmatrix} a & b \\ 0 & d-bc/a \end{bmatrix} \]

Why BLAS?

Block elimination \[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ 0 & D-CA^{-1}B \end{bmatrix} \] The trailing block \(S = D - CA^{-1}B\) is the Schur complement.

Why BLAS?

Block LU \[\begin{split} \begin{bmatrix} A & B \\ C & D \end{bmatrix} &= \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \\ &= \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \end{split}\]

Why BLAS?

Think of \(A\) as \(k \times k\), \(k\) moderate:

[L11,U11] = small_lu(A);   % Small block LU
U12 = L11\B;               % Triangular solve
L21 = C/U11;               % Triangular solve
S   = D-L21*U12;           % Rank-k update
[L22,U22] = lu(S);         % Finish factoring

Three level-3 BLAS calls!

  • Two triangular solves
  • One rank-\(k\) update
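
A C sketch of that step through CBLAS, assuming the whole matrix sits in one row-major array updated in place; small_lu stands in for a hypothetical unblocked LU of the leading \(k \times k\) block:

#include <stddef.h>
#include <cblas.h>

/* Hypothetical unblocked LU of a k-by-k block (row stride lda). */
void small_lu(int k, double *A, int lda);

/* One block step of LU on an n-by-n row-major matrix M with block
   size k: A = M[0:k,0:k], B = M[0:k,k:n], C = M[k:n,0:k],
   D = M[k:n,k:n]; factors and Schur complement overwrite M. */
void block_lu_step(int n, int k, double *M)
{
    double *A = M, *B = M + k;
    double *C = M + (size_t)k * n, *D = M + (size_t)k * n + k;
    small_lu(k, A, n);                    /* A holds L11 and U11 */
    /* Two triangular solves: U12 = L11\B and L21 = C/U11 ... */
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasUnit, k, n - k, 1.0, A, n, B, n);    /* B <- U12 */
    cblas_dtrsm(CblasRowMajor, CblasRight, CblasUpper, CblasNoTrans,
                CblasNonUnit, n - k, k, 1.0, A, n, C, n); /* C <- L21 */
    /* ... and one rank-k update: S = D - L21*U12 */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n - k, n - k, k, -1.0, C, n, B, n, 1.0, D, n);
    /* Recurse (or call an LU) on the Schur complement in D. */
}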

LAPACK (1989–present)

  • Supersedes earlier LINPACK and EISPACK
  • High performance through BLAS
    • Parallel to the extent BLAS are parallel (on SMP)
    • Linear systems, least squares: near 100% BLAS 3
    • Eigenproblems, SVD: only about 50% BLAS 3
  • Careful error bounds on everything
  • Lots of variants for different structures
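
For instance, solving a linear system with LAPACK's DGESV driver (shown via the LAPACKE C interface, which wraps the Fortran routine):

#include <lapacke.h>

int main(void)
{
    double A[2*2] = {2, 1,
                     1, 3};      /* 2 x 2, row-major */
    double b[2]   = {3, 5};      /* right-hand side */
    lapack_int ipiv[2];          /* pivot indices from the LU */
    /* Solve A*x = b; A is overwritten by its LU factors, b by x */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                    A, 2, ipiv, b, 1);
    return (int)info;            /* 0 on success; b now holds x */
}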

ScaLAPACK (1995–present)

  • MPI implementations
  • Only a small subset of LAPACK functionality

PLASMA (2008–present)

Parallel Linear Algebra Software for Multicore Architectures

  • Target: Shared memory multiprocessors
  • Stacks on LAPACK/BLAS interfaces
  • Tile algorithms, tile data layout, dynamic scheduling
  • Other algorithmic ideas, too (randomization, etc)

MAGMA (2008–present)

Matrix Algebra on GPU and Multicore Architectures

  • Target: CUDA, OpenCL, Xeon Phi
  • Still stacks (e.g. on CUDA BLAS)
  • Again: tile algorithms + data, dynamic scheduling
  • Mixed precision algorithms (+ iterative refinement)

And beyond!

SLATE??? (Software for Linear Algebra Targeting Exascale)

Much of this work is housed at UTK's Innovative Computing Laboratory (ICL)