Prof David Bindel


On board... or not.

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \]

\[ \begin{bmatrix} 1 & 0 & 0 \\ -4 & 1 & 0 \\ -7 & 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \right) \]

\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \]

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \right) \]

\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ 0 \end{bmatrix} \]

\[ \begin{bmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 7 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \]
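As a numerical sanity check, the worked example above can be verified directly (a Python/NumPy sketch, not from the slides; the MATLAB code below is the real point):

```python
import numpy as np

# Matrices from the worked example above: A = L U, and A x = b
A = np.array([[1., 2, 3], [4, 5, 6], [7, 8, 10]])
L = np.array([[1., 0, 0], [4, 1, 0], [7, 2, 1]])
U = np.array([[1., 2, 3], [0, -3, -6], [0, 0, 1]])
b = np.array([4., 13, 22])

assert np.allclose(L @ U, A)   # the factors reproduce A
y = np.linalg.solve(L, b)      # forward substitution: L y = b
x = np.linalg.solve(U, y)      # back substitution: U x = y
assert np.allclose(A @ x, b)
print(x)                       # [2. 1. 0.]
```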

Overwrite \(A\) with \(L\) and \(U\)

```
for j = 1:n-1
  for i = j+1:n
    A(i,j) = A(i,j) / A(j,j);    % Compute multiplier
    for k = j+1:n
      A(i,k) -= A(i,j) * A(j,k); % Update row
    end
  end
end
```

Same idea, vectorized: overwrite \(A\) with \(L\) and \(U\)

```
for j = 1:n-1
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);             % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n); % Trailing update
end
```
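The vectorized loop translates almost line for line into NumPy (a Python sketch for checking, not from the slides; `lu_inplace` is an illustrative name):

```python
import numpy as np

def lu_inplace(A):
    """Unpivoted LU, overwriting A with L (strictly below diagonal) and U."""
    n = A.shape[0]
    for j in range(n - 1):
        A[j+1:, j] /= A[j, j]                              # compute multipliers
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])  # trailing update
    return A

A = np.array([[1., 2, 3], [4, 5, 6], [7, 8, 10]])
LU = lu_inplace(A.copy())
L = np.tril(LU, -1) + np.eye(3)
U = np.triu(LU)
assert np.allclose(L @ U, A)
```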

Stability is a problem! Compute \(PA = LU\)

```
p = 1:n;
for j = 1:n-1
  [~,jpiv] = max(abs(A(j:n,j)));              % Find pivot in rows j:n
  A([j, j+jpiv-1],:) = A([j+jpiv-1, j],:);    % Swap pivot row
  p([j, j+jpiv-1]) = p([j+jpiv-1, j]);        % Save pivot info
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);             % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n); % Trailing update
end
```
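The same partial-pivoting scheme in NumPy, checking the \(PA = LU\) property (a sketch, not from the slides; `plu` is an illustrative name):

```python
import numpy as np

def plu(A):
    """Partial-pivoted LU: returns p and packed LU with A[p] = L @ U."""
    A = A.copy()
    n = A.shape[0]
    p = np.arange(n)
    for j in range(n - 1):
        jpiv = j + np.argmax(np.abs(A[j:, j]))  # pivot over rows j..n-1
        A[[j, jpiv]] = A[[jpiv, j]]             # swap pivot row
        p[[j, jpiv]] = p[[jpiv, j]]             # record permutation
        A[j+1:, j] /= A[j, j]                   # multipliers
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])  # trailing update
    return p, A

A = np.array([[1., 2, 3], [4, 5, 6], [7, 8, 10]])
p, LU = plu(A)
L = np.tril(LU, -1) + np.eye(3)
U = np.triu(LU)
assert np.allclose(A[p], L @ U)   # PA = LU
```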

Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \]

\[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \]

\[\begin{aligned} L_{11} U_{11} &= A_{11} \\ U_{12} &= L_{11}^{-1} A_{12} \\ L_{21} &= A_{21} U_{11}^{-1} \\ L_{22} U_{22} &= A_{22} - L_{21} U_{12} \end{aligned}\]
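The block equations turn into a small-kernel factorization, two triangular solves, and one GEMM for the delayed update (a NumPy sketch, not from the slides; `lu_blocked` and the block size `nb` are illustrative):

```python
import numpy as np

def lu_blocked(A, nb=2):
    """Blocked unpivoted LU, following the block equations above (a sketch)."""
    A = A.copy()
    n = A.shape[0]
    for j in range(0, n, nb):
        e = min(j + nb, n)
        # L11 U11 = A11: unblocked elimination of the diagonal block
        for k in range(j, e - 1):
            A[k+1:e, k] /= A[k, k]
            A[k+1:e, k+1:e] -= np.outer(A[k+1:e, k], A[k, k+1:e])
        if e < n:
            L11 = np.tril(A[j:e, j:e], -1) + np.eye(e - j)
            U11 = np.triu(A[j:e, j:e])
            A[j:e, e:] = np.linalg.solve(L11, A[j:e, e:])        # U12 = L11^{-1} A12
            A[e:, j:e] = np.linalg.solve(U11.T, A[e:, j:e].T).T  # L21 = A21 U11^{-1}
            A[e:, e:] -= A[e:, j:e] @ A[j:e, e:]                 # delayed update (GEMM)
    return A

A = np.array([[4., 3, 2, 1], [3, 4, 3, 2], [2, 3, 4, 3], [1, 2, 3, 4]])
LU = lu_blocked(A)
L = np.tril(LU, -1) + np.eye(4)
U = np.triu(LU)
assert np.allclose(L @ U, A)
```

The test matrix is symmetric positive definite, so unpivoted elimination is safe here.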

- Still haven’t shown how to do pivoting!
- Easier to draw diagrams from here
- Take 6210 or 4220 if you want more on LU!

- Find pivot
- Swap pivot row
- Update within block column
- Delayed update (at end of block)

*Delayed update* strategy lets us do LU fast

- Could have also delayed application of pivots

- Same idea with other one-sided factorizations (QR)
- Decent multi-core speedup with parallel BLAS!

... assuming \(n\) sufficiently large.

Issues left over (block size?)...

What to do:

- *Decompose* into work chunks
- *Assign* work to threads in a balanced way
- *Orchestrate* communication + synchronization
- *Map* which processors execute which threads

How should we share the matrix across ranks?

\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \end{bmatrix}\]

\[\begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \end{bmatrix}\]

\[\begin{bmatrix} 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \end{bmatrix}\]

\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \end{bmatrix}\]

\[\begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ \end{bmatrix}\]

- 1D col blocked: bad load balance
- 1D col cyclic: hard to use BLAS2/3
- 1D col block cyclic: factoring col a bottleneck
- Block skewed (a la Cannon): just complicated
- 2D row/col block: bad load balance
- 2D row/col block cyclic: win!
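For concreteness, the owning rank of entry \((i,j)\) in a 2D block-cyclic layout is a simple formula (a sketch; `owner`, block size `nb`, and the \(p_r \times p_c\) grid shape are illustrative names, chosen to match the \(8 \times 8\) picture above with a \(2 \times 2\) grid):

```python
def owner(i, j, nb=2, pr=2, pc=2):
    """Rank holding entry (i, j) in a 2D block-cyclic layout (a sketch)."""
    return ((i // nb) % pr) * pc + ((j // nb) % pc)

# Reproduces the 8x8 pattern above (ranks 0..3, 2x2 blocks, 2x2 grid)
grid = [[owner(i, j) for j in range(8)] for i in range(8)]
assert grid[0][:4] == [0, 0, 1, 1]
assert grid[2][:4] == [2, 2, 3, 3]
```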

- Find pivot (column broadcast)
- Swap pivot row within block column + broadcast pivot
- Update within block column
- At end of block, broadcast swap info along rows
- Apply all row swaps to other columns
- Broadcast block \(L_{II}\) right
- Update remainder of block row
- Broadcast rest of block row down
- Broadcast rest of block col right
- Update of trailing submatrix

Communication costs:

- Lower bound: \(O(n^2/\sqrt{P})\) words, \(O(\sqrt{P})\) messages
- ScaLAPACK:
  - \(O(n^2 \log P / \sqrt{P})\) words sent
  - \(O(n \log P)\) messages
  - Problem: reduction to find pivot in each column

- Tournaments for stability without partial pivoting

If you don’t care about dense LU?

Let’s review some ideas in a different setting...

Goal: All pairs shortest path lengths.

Idea: Dynamic programming! Define \[d_{ij}^{(k)} =
\mbox{shortest path $i$ to $j$ with intermediates in $\{1, \ldots, k\}$}.\] Then \[d_{ij}^{(k)} =
\min\left( d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)\] and \(d_{ij}^{(n)}\) is the desired shortest path length.

Floyd’s algorithm for all-pairs shortest paths:

```
for k = 1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min(D(i,j), D(i,k)+D(k,j));
    end
  end
end
```
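The same triple loop in Python, checked on a tiny directed cycle (a sketch, not from the slides; `inf` marks absent edges):

```python
import math

def floyd(D):
    """All-pairs shortest path lengths, updating D in place (a sketch)."""
    n = len(D)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

inf = math.inf
D = [[0, 1, inf], [inf, 0, 1], [1, inf, 0]]  # directed 3-cycle 0 -> 1 -> 2 -> 0
floyd(D)
assert D[0][2] == 2 and D[2][1] == 2 and D[1][0] == 2
```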

Unpivoted Gaussian elimination (overwriting \(A\)):

```
for k = 1:n
  for i = k+1:n
    A(i,k) = A(i,k) / A(k,k);
    for j = k+1:n
      A(i,j) = A(i,j) - A(i,k)*A(k,j);
    end
  end
end
```

- The same: \(O(n^3)\) time, \(O(n^2)\) space
- The same: can’t move \(k\) loop (data dependencies)
- ... at least, can’t without care!
- Different from matrix multiplication

- The same: \(x_{ij}^{(k)} = f\left(x_{ij}^{(k-1)}, g\left(x_{ik}^{(k-1)}, x_{kj}^{(k-1)}\right)\right)\)
- Same basic dependency pattern in updates!
- Similar algebraic relations satisfied
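The shared pattern can be made explicit by parameterizing the update functions (a sketch, not from the slides; `sweep` is an illustrative name — Floyd is the \((\min, +)\) instance, while elimination applies subtract-multiply only to the trailing submatrix):

```python
def sweep(X, f, g):
    """Generic k-i-j sweep: X[i][j] = f(X[i][j], g(X[i][k], X[k][j]))."""
    n = len(X)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                X[i][j] = f(X[i][j], g(X[i][k], X[k][j]))
    return X

# Floyd's algorithm as the (min, +) instance on a small weighted digraph
D = sweep([[0, 3, 8], [2, 0, 5], [1, 9, 0]], min, lambda a, b: a + b)
assert D[2][1] == 4   # path 2 -> 0 -> 1 costs 1 + 3
```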

- Different: Update to full matrix vs trailing submatrix

How would we write

- Cache-efficient (blocked) *serial* implementation?
- Message-passing *parallel* implementation?

The full picture could make a fun class project...

Next up: Sparse linear algebra and iterative solvers!