# CS 5220

## Applications of Parallel Computers

### Parallel LU

Prof David Bindel

### Reminder: Evolution of LU

On board... or not.

### LU by example

$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix}$

### LU by example

$\begin{bmatrix} 1 & 0 & 0 \\ -4 & 1 & 0 \\ -7 & 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \right)$

### LU by example

$\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix}$

### LU by example

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \right)$

### LU by example

$\begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ 0 \end{bmatrix}$

### LU by example

$\begin{bmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 7 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix}$
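A quick numerical check of the factorization above, as a minimal pure-Python sketch (the `matmul` helper is invented here for illustration):

```python
# Verify the worked example: L * U should reproduce A exactly.
L = [[1, 0, 0],
     [4, 1, 0],
     [7, 2, 1]]
U = [[1, 2, 3],
     [0, -3, -6],
     [0, 0, 1]]
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 10]]

def matmul(X, Y):
    """Plain triple-loop matrix product for small square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

assert matmul(L, U) == A
```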

### Simple LU

Overwrite $$A$$ with $$L$$ and $$U$$

```
for j = 1:n-1
  for i = j+1:n
    A(i,j) = A(i,j) / A(j,j);      % Compute multiplier
    for k = j+1:n
      A(i,k) -= A(i,j) * A(j,k);   % Update row
    end
  end
end
```
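As a sanity check, the loop above translates almost line-for-line into Python (a sketch; indices shifted to 0-based):

```python
def lu_inplace(A):
    """Overwrite A with L (strict lower triangle) and U (upper); no pivoting."""
    n = len(A)
    for j in range(n - 1):
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                    # compute multiplier l_ij
            for k in range(j + 1, n):
                A[i][k] -= A[i][j] * A[j][k]      # update row i
    return A

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
lu_inplace(A)
# Multipliers match the worked example: column 1 of L is [4, 7], l_32 = 2.
```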

### Simple LU

Overwrite $$A$$ with $$L$$ and $$U$$

```
for j = 1:n-1
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);              % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n);  % Trailing update
end
```

### Pivoting

Stability is a problem! Compute $$PA = LU$$

```
p = 1:n;
for j = 1:n-1
  [~,jpiv] = max(abs(A(j:n,j)));               % Find pivot in column j
  A([j, j+jpiv-1],:) = A([j+jpiv-1, j],:);     % Swap pivot row
  p([j, j+jpiv-1]) = p([j+jpiv-1, j]);         % Save pivot info
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);              % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n);  % Trailing update
end
```
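In the same spirit, a minimal Python sketch of the pivoted loop (the names `gepp` and `piv` are invented here); it returns the row permutation `p` so that the rows `p` of the original matrix satisfy $PA = LU$:

```python
def gepp(A):
    """LU with partial pivoting, overwriting A; returns permutation p."""
    n = len(A)
    p = list(range(n))
    for j in range(n - 1):
        piv = max(range(j, n), key=lambda i: abs(A[i][j]))  # find pivot
        A[j], A[piv] = A[piv], A[j]                         # swap pivot row
        p[j], p[piv] = p[piv], p[j]                         # save pivot info
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                              # multipliers
            for k in range(j + 1, n):
                A[i][k] -= A[i][j] * A[j][k]                # trailing update
    return p

A = [[0.0, 1.0], [2.0, 3.0]]   # A[0][0] == 0: unpivoted LU would break
p = gepp(A)
# After the swap, elimination succeeds: A holds L\U, p records the swap.
```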

### Blocking

Think in a way that uses level 3 BLAS $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}$

### Blocking

Think in a way that uses level 3 BLAS $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix}$

### Blocking

Think in a way that uses level 3 BLAS $\begin{aligned} L_{11} U_{11} &= A_{11} \\ U_{12} &= L_{11}^{-1} A_{12} \\ L_{21} &= A_{21} U_{11}^{-1} \\ L_{22} U_{22} &= A_{22} - L_{21} U_{12} \end{aligned}$
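The four block equations give a right-looking blocked LU. A pure-Python sketch, no pivoting (the function name and block-size argument are invented; in practice the last loop nest is one level-3 BLAS call):

```python
def lu_blocked(A, b):
    """Blocked LU (no pivoting), block size b, overwriting A with L\\U."""
    n = len(A)
    for J in range(0, n, b):
        e = min(J + b, n)
        # Panel: unblocked LU on A[J:n, J:e] gives L11, U11, and L21
        for j in range(J, e):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for k in range(j + 1, e):
                    A[i][k] -= A[i][j] * A[j][k]
        # U12 = inv(L11) * A12: unit lower-triangular solve
        for j in range(J, e):
            for i in range(j + 1, e):
                for k in range(e, n):
                    A[i][k] -= A[i][j] * A[j][k]
        # Delayed trailing update: A22 -= L21 * U12 (the level-3 part)
        for i in range(e, n):
            for j in range(J, e):
                for k in range(e, n):
                    A[i][k] -= A[i][j] * A[j][k]
    return A
```

With `b = n` this degenerates to the unblocked algorithm, which makes it easy to check that the blocked and unblocked factors agree.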

### Enter pictures

• Still haven’t shown how to do pivoting!
• Easier to draw diagrams from here
• Take 6210 or 4220 if you want more on LU!

Find pivot

Swap pivot row

### Blocked GEPP

Update within block column

### Blocked GEPP

Delayed update (at end of block)

### Big idea

• Delayed update strategy lets us do LU fast
• Could have also delayed application of pivots
• Same idea with other one-sided factorizations (QR)
• Decent multi-core speedup with parallel BLAS!
... assuming $$n$$ sufficiently large.

Issues left over (block size?)...

### Explicit parallelization of GE

What to do:

• Decompose into work chunks
• Assign work to threads in a balanced way
• Orchestrate communication + synchronization
• Map which processors execute which threads

### Possible matrix layouts

How should we share the matrix across ranks?

### 1D col blocked

$\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \end{bmatrix}$

### 1D col cyclic

$\begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \end{bmatrix}$

### 1D col block cyclic

$\begin{bmatrix} 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \end{bmatrix}$

### Block skewed

$\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \end{bmatrix}$

### 2D block cyclic

$\begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ \end{bmatrix}$

### Possible matrix layouts

• 1D col cyclic: hard to use BLAS2/3
• 1D col block cyclic: factoring col a bottleneck
• Block skewed (a la Cannon): just complicated
• 2D row/col block cyclic: win!
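One way to make the 2D block-cyclic layout concrete is to map each entry to its owning rank (a sketch; the `owner` helper and its parameter names are invented here):

```python
def owner(i, j, b, pr, pc):
    """Rank owning entry (i, j) under a 2D block-cyclic layout with
    b-by-b blocks on a pr-by-pc process grid (row-major rank order)."""
    return ((i // b) % pr) * pc + (j // b) % pc

# b = 2 blocks on a 2x2 grid reproduces the 8x8 picture above.
grid = [[owner(i, j, 2, 2, 2) for j in range(8)] for i in range(8)]
```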

### Distributed GEPP

Swap pivot row within block column + broadcast pivot

### Distributed GEPP

Update within block column

### Distributed GEPP

At end of block, broadcast swap info along rows

### Distributed GEPP

Apply all row swaps to other columns

### Distributed GEPP

Broadcast block $$L_{II}$$ right

### Distributed GEPP

Update remainder of block row

### Distributed GEPP

Broadcast rest of block row down

### Distributed GEPP

Broadcast rest of block col right

### Distributed GEPP

Update of trailing submatrix

### Cost of ScaLAPACK GEPP

Communication costs:

• Lower bound: $$O(n^2/\sqrt{P})$$ words, $$O(\sqrt{P})$$ messages
• ScaLAPACK:
  • $$O(n^2 \log P / \sqrt{P})$$ words sent
  • $$O(n \log P)$$ messages
• Problem: reduction to find pivot in each column
• Tournament pivoting: stability without partial pivoting

If you don’t care about dense LU?
Let’s review some ideas in a different setting...

### Floyd-Warshall

Goal: All pairs shortest path lengths.
Idea: Dynamic programming! Define $d_{ij}^{(k)} = \mbox{shortest path i to j with intermediates in \{1, \ldots, k\}}.$ Then $d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)$ and $$d_{ij}^{(n)}$$ is the desired shortest path length.

### The same and different

Floyd’s algorithm for all-pairs shortest paths:

```
for k = 1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min(D(i,j), D(i,k)+D(k,j));
    end
  end
end
```

Unpivoted Gaussian elimination (overwriting $$A$$):

```
for k = 1:n
  for i = k+1:n
    A(i,k) = A(i,k) / A(k,k);
    for j = k+1:n
      A(i,j) = A(i,j) - A(i,k)*A(k,j);
    end
  end
end
```
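The Floyd loop runs essentially unchanged in any language; a small runnable Python check (the 3-node graph is invented for illustration):

```python
INF = float('inf')

def floyd(D):
    """All-pairs shortest path lengths, updating D in place."""
    n = len(D)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

# Edge weights; INF marks a missing edge.
D = [[0,   3,   INF],
     [INF, 0,   1],
     [2,   INF, 0]]
floyd(D)
# D[0][2] becomes 4 via the path 0 -> 1 -> 2.
```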

### The same and different

• The same: $$O(n^3)$$ time, $$O(n^2)$$ space
• The same: can’t move $$k$$ loop (data dependencies)
• ... at least, can’t without care!
• Different from matrix multiplication
• The same: $$x_{ij}^{(k)} = f\left(x_{ij}^{(k-1)}, g\left(x_{ik}^{(k-1)}, x_{kj}^{(k-1)}\right)\right)$$
• Same basic dependency pattern in updates!
• Similar algebraic relations satisfied
• Different: Update to full matrix vs trailing submatrix

### How far can we get?

How would we write

• Cache-efficient (blocked) serial implementation?
• Message-passing parallel implementation?

The full picture could make a fun class project...
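One answer for the serial question: Floyd-Warshall tiles much like blocked LU, with a diagonal-block phase, a block-row/column phase, and a delayed update of the remaining blocks. A sketch (function names are invented; assumes the block size `b` divides `n`):

```python
def fw_tile(D, ks, ke, i0, i1, j0, j1):
    """FW inner kernel: relax D[i0:i1, j0:j1] through intermediates ks:ke."""
    for k in range(ks, ke):
        for i in range(i0, i1):
            for j in range(j0, j1):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]

def fw_blocked(D, b):
    """Tiled all-pairs shortest paths, block size b (b must divide n)."""
    n = len(D)
    for K in range(0, n, b):
        Ke = K + b
        fw_tile(D, K, Ke, K, Ke, K, Ke)              # diagonal block
        for J in range(0, n, b):                     # block row K
            if J != K:
                fw_tile(D, K, Ke, K, Ke, J, J + b)
        for I in range(0, n, b):                     # block column K
            if I != K:
                fw_tile(D, K, Ke, I, I + b, K, Ke)
        for I in range(0, n, b):                     # delayed update: rest
            for J in range(0, n, b):
                if I != K and J != K:
                    fw_tile(D, K, Ke, I, I + b, J, J + b)
    return D
```

The phase order mirrors blocked GEPP: the diagonal block plays the role of the panel, and the final doubly nested loop is the trailing update.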

### Onward!

Next up: Sparse linear algebra and iterative solvers!