CS 5220

Applications of Parallel Computers

Intro to Dense LA

Prof David Bindel

Where we are

  • This week: dense linear algebra
  • Next week: sparse linear algebra

Nutshell NLA

  • Linear systems: \(Ax = b\)
  • Least squares: minimize \(\|Ax-b\|_2^2\)
  • Eigenvalues: \(Ax = \lambda x\)

Nutshell NLA

  • Basic paradigm: matrix factorization
    • \(A = LU\), \(A = LL^T\)
    • \(A = QR\)
    • \(A = V \Lambda V^{-1}\), \(A = Q T Q^T\)
    • \(A = U \Sigma V^T\)
  • Factorization \(\equiv\) switch to basis that makes problem easy
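
For example, once \(A = LU\) is in hand, \(Ax = b\) splits into two easy triangular solves: \[ Ly = b, \qquad Ux = y. \]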

Dense and sparse

Two flavors: dense and sparse

Dense

Common structures, no complicated indexing

  • General dense (all entries nonzero)
  • Banded (zero below/above some diagonal)
  • Symmetric/Hermitian
  • Standard, robust algorithms (LAPACK)

Sparse

Stuff not stored in dense form!

  • May have few nonzeros
    • e.g. compressed sparse row format (sketched below)
  • May be implicit (e.g. via finite differencing)
  • May be “dense”, but with a compact representation (e.g. via FFT)
  • Mostly iterative algorithms; varied and subtle
  • Build on dense ideas
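
To make the contrast concrete, here is a minimal sketch of a matrix-vector product in compressed sparse row (CSR) form; the array names (rowptr, colind, vals) are illustrative conventions, not a fixed library API.

/* y = A*x for an n-by-n matrix A in CSR form:
   vals[rowptr[i] .. rowptr[i+1]) holds the nonzeros of row i,
   colind[] gives their column indices. */
void csr_matvec(int n, const int* rowptr, const int* colind,
                const double* vals, const double* x, double* y)
{
    for (int i = 0; i < n; ++i) {
        double yi = 0.0;
        for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
            yi += vals[j] * x[colind[j]];
        y[i] = yi;
    }
}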

History: BLAS 1 (1973–1977)

15 ops (mostly) on vectors

  • BLAS 1 == \(O(n^1)\) ops on \(O(n^1)\) data
  • Up to four versions of each: S/D/C/Z
  • Example: DAXPY
    • Double precision (real)
    • Computes \(y \leftarrow \alpha x + y\) for scalar \(\alpha\)
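
A minimal sketch of a DAXPY call through the CBLAS interface (the C binding for the Fortran BLAS); link against any BLAS implementation:

#include <cblas.h>

int main(void)
{
    double x[3] = {1, 2, 3};
    double y[3] = {4, 5, 6};
    /* y <- 2*x + y, unit stride in both vectors */
    cblas_daxpy(3, 2.0, x, 1, y, 1);
    /* y is now {6, 9, 12} */
    return 0;
}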

BLAS 1 goals

  • Raise level of programming abstraction
  • Robust implementation (e.g. avoid over/underflow)
  • Portable interface, efficient machine-specific implementation
  • Used in LINPACK (and EISPACK?)

History: BLAS 2 (1984–1986)

25 ops (mostly) on matrix/vector pairs

  • BLAS 2 == \(O(n^2)\) ops on \(O(n^2)\) data
  • Different data types and matrix types
  • Example: DGEMV
    • Double precision
    • GEneral matrix
    • Matrix-Vector product
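
A matching CBLAS sketch of DGEMV, computing \(y \leftarrow \alpha A x + \beta y\) (row-major layout here; the underlying Fortran convention is column-major):

#include <cblas.h>

int main(void)
{
    double A[2*3] = {1, 2, 3,
                     4, 5, 6};   /* 2 x 3, row-major */
    double x[3] = {1, 1, 1};
    double y[2] = {0, 0};
    /* y <- 1.0*A*x + 0.0*y */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0, A, 3, x, 1, 0.0, y, 1);
    /* y is now {6, 15} */
    return 0;
}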

BLAS 2 goals

  • BLAS 1 insufficient
  • BLAS 2 for better vectorization (when vector machines roamed)

History: BLAS 3 (1987–1988)

9 ops (mostly) on matrix/matrix

  • BLAS 3 == \(O(n^3)\) ops on \(O(n^2)\) data
  • Different data types and matrix types
  • Example: DGEMM
    • Double precision
    • GEneral matrix
    • Matrix-Matrix product
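
And the level-3 workhorse, DGEMM, computing \(C \leftarrow \alpha AB + \beta C\), again sketched via CBLAS:

#include <cblas.h>

int main(void)
{
    double A[2*3] = {1, 0, 2,
                     0, 1, 3};   /* 2 x 3, row-major */
    double B[3*2] = {1, 4,
                     2, 5,
                     3, 6};      /* 3 x 2 */
    double C[2*2] = {0, 0,
                     0, 0};      /* 2 x 2 */
    /* C <- 1.0*A*B + 0.0*C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3, 1.0, A, 3, B, 2, 0.0, C, 2);
    /* C is now {7, 16, 11, 23} */
    return 0;
}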

BLAS 3 goals

Efficient cache utilization! With \(O(n^3)\) flops on \(O(n^2)\) data, each matrix entry can be reused \(O(n)\) times once it is in cache.

Why BLAS?

LU for \(2 \times 2\): \[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ c/a & 1 \end{bmatrix} \begin{bmatrix} a & b \\ 0 & d-bc/a \end{bmatrix} \]

Why BLAS?

Block elimination \[ \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I & 0 \\ CA^{-1} & I \end{bmatrix} \begin{bmatrix} A & B \\ 0 & D-CA^{-1}B \end{bmatrix} \] The trailing block \(S = D - CA^{-1}B\) is the Schur complement.

Why BLAS?

Block LU \[\begin{split} \begin{bmatrix} A & B \\ C & D \end{bmatrix} &= \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \\ &= \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \end{split}\]

Why BLAS?

Think of \(A\) as \(k \times k\), \(k\) moderate:

[L11,U11] = small_lu(A);   % Small block LU
U12 = L11\B;               % Triangular solve
L21 = C/U11;               % Triangular solve
S   = D-L21*U12;           % Rank-k update
[L22,U22] = lu(S);         % Finish factoring

Three level-3 BLAS calls!

  • Two triangular solves
  • One rank-\(k\) update
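
A C sketch of that step through CBLAS, assuming the whole matrix sits in one row-major array updated in place; small_lu stands in for a hypothetical unblocked LU of the leading \(k \times k\) block:

#include <stddef.h>
#include <cblas.h>

/* Hypothetical unblocked LU of a k-by-k block (row stride lda). */
void small_lu(int k, double *A, int lda);

/* One block step of LU on an n-by-n row-major matrix M with block
   size k: A = M[0:k,0:k], B = M[0:k,k:n], C = M[k:n,0:k],
   D = M[k:n,k:n]; factors and Schur complement overwrite M. */
void block_lu_step(int n, int k, double *M)
{
    double *A = M, *B = M + k;
    double *C = M + (size_t)k * n, *D = M + (size_t)k * n + k;
    small_lu(k, A, n);                    /* A holds L11 and U11 */
    /* Two triangular solves: U12 = L11\B and L21 = C/U11 ... */
    cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasUnit, k, n - k, 1.0, A, n, B, n);    /* B <- U12 */
    cblas_dtrsm(CblasRowMajor, CblasRight, CblasUpper, CblasNoTrans,
                CblasNonUnit, n - k, k, 1.0, A, n, C, n); /* C <- L21 */
    /* ... and one rank-k update: S = D - L21*U12 */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n - k, n - k, k, -1.0, C, n, B, n, 1.0, D, n);
    /* Recurse (or call an LU) on the Schur complement in D. */
}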

LAPACK (1989–present)

  • Supersedes earlier LINPACK and EISPACK
  • High performance through BLAS
    • Parallel to the extent BLAS are parallel (on SMP)
    • Linear systems, least squares: near 100% BLAS 3
    • Eigenproblems, SVD: only about 50% BLAS 3
  • Careful error bounds on everything
  • Lots of variants for different structures
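
For instance, solving a linear system with LAPACK's DGESV driver (shown via the LAPACKE C interface, which wraps the Fortran routine):

#include <lapacke.h>

int main(void)
{
    double A[2*2] = {2, 1,
                     1, 3};      /* 2 x 2, row-major */
    double b[2]   = {3, 5};      /* right-hand side */
    lapack_int ipiv[2];          /* pivot indices from the LU */
    /* Solve A*x = b; A is overwritten by its LU factors, b by x */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                    A, 2, ipiv, b, 1);
    return (int)info;            /* 0 on success; b now holds x */
}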

ScaLAPACK (1995–present)

  • MPI implementations
  • Only a small subset of LAPACK functionality

PLASMA (2008–present)

Parallel Linear Algebra Software for Multicore Architectures

  • Target: Shared memory multiprocessors
  • Stacks on LAPACK/BLAS interfaces
  • Tile algorithms, tile data layout, dynamic scheduling
  • Other algorithmic ideas, too (randomization, etc)

MAGMA (2008–present)

Matrix Algebra on GPU and Multicore Architectures

  • Target: CUDA, OpenCL, Xeon Phi
  • Still stacks (e.g. on CUDA BLAS)
  • Again: tile algorithms + data, dynamic scheduling
  • Mixed precision algorithms (+ iterative refinement)

And beyond!

SLATE??? (Software for Linear Algebra Targeting Exascale)

Much of this work is housed at UTK's Innovative Computing Laboratory (ICL)