CS 5220

Applications of Parallel Computers

Parallel LU

Prof David Bindel

Reminder: Evolution of LU

On board... or not.

LU by example

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \]

LU by example

\[ \begin{bmatrix} 1 & 0 & 0 \\ -4 & 1 & 0 \\ -7 & 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ 13 \\ 22 \end{bmatrix} \right) \]

LU by example

\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \]

LU by example

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -2 & 1 \\ \end{bmatrix} \left( \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ -6 \end{bmatrix} \right) \]

LU by example

\[ \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -3 \\ 0 \end{bmatrix} \]

LU by example

\[ \begin{bmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 7 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 10 \end{bmatrix} \]
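
For a concrete check, the factors above reproduce the system; a minimal Octave/MATLAB verification (the solve step uses forward and back substitution via the triangular solvers):

A = [1 2 3; 4 5 6; 7 8 10];
b = [4; 13; 22];
L = [1 0 0; 4 1 0; 7 2 1];
U = [1 2 3; 0 -3 -6; 0 0 1];
norm(L*U - A)     % 0: the factors reproduce A
x = U \ (L \ b)   % [2; 1; 0]: forward substitution, then back substitution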

Simple LU

Overwrite \(A\) with \(L\) and \(U\)

for j = 1:n-1
  for i = j+1:n
    A(i,j) = A(i,j) / A(j,j);    % Compute multiplier
    for k = j+1:n
      A(i,k) -= A(i,j) * A(j,k); % Update row
    end
  end
end

Simple LU

Overwrite \(A\) with \(L\) and \(U\)

for j = 1:n-1
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);              % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n);  % Trailing update
end

Pivoting

Stability is a problem! Compute \(PA = LU\)

p = 1:n;
for j = 1:n-1
  [~,jpiv] = max(abs(A(j:n,j)));               % Find pivot
  A([j, j+jpiv-1],:) = A([j+jpiv-1, j],:);     % Swap pivot row
  p([j, j+jpiv-1]) = p([j+jpiv-1, j]);         % Save pivot info
  A(j+1:n,j) = A(j+1:n,j)/A(j,j);              % Compute multipliers
  A(j+1:n,j+1:n) -= A(j+1:n,j) * A(j, j+1:n);  % Trailing update
end
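
As a sanity check (a hedged usage sketch, assuming a copy A0 of the original matrix was saved before the loop), the permutation vector p and the packed factors should satisfy P*A = L*U:

L = tril(A,-1) + eye(n);   % Unit lower triangular factor (multipliers)
U = triu(A);               % Upper triangular factor
norm(A0(p,:) - L*U)        % Small: row-permuted original equals L*U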

Blocking

Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} \]

Blocking

Think in a way that uses level 3 BLAS \[ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{bmatrix} \]

Blocking

Think in a way that uses level 3 BLAS \[\begin{aligned} L_{11} U_{11} &= A_{11} \\ U_{12} &= L_{11}^{-1} A_{12} \\ L_{21} &= A_{21} U_{11}^{-1} \\ L_{22} U_{22} &= A_{22} - L_{21} U_{12} \end{aligned}\]

Enter pictures

  • Still haven’t shown how to do pivoting!
  • Easier to draw diagrams from here
  • Take 6210 or 4220 if you want more on LU!

Blocked GEPP

Find pivot

Blocked GEPP

Swap pivot row

Blocked GEPP

Update within block column

Blocked GEPP

Delayed update (at end of block)
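
Putting the pictured steps together: a sketch of blocked GEPP (assuming mod(n,b) == 0; for simplicity this version swaps full pivot rows immediately rather than delaying the swaps outside the block column as the pictures suggest):

p = 1:n;
for J = 1:b:n
  jend = J+b-1;
  for j = J:min(jend,n-1)
    [~,ipiv] = max(abs(A(j:n,j)));           % Find pivot
    ipiv = ipiv + j - 1;
    A([j ipiv],:) = A([ipiv j],:);           % Swap pivot row
    p([j ipiv]) = p([ipiv j]);               % Save pivot info
    A(j+1:n,j) = A(j+1:n,j)/A(j,j);          % Compute multipliers
    A(j+1:n,j+1:jend) = A(j+1:n,j+1:jend) ...
      - A(j+1:n,j)*A(j,j+1:jend);            % Update within block column
  end
  if jend < n                                % Delayed update at end of block
    LJJ = tril(A(J:jend,J:jend),-1) + eye(b);
    A(J:jend,jend+1:n) = LJJ \ A(J:jend,jend+1:n);   % Block row of U
    A(jend+1:n,jend+1:n) = A(jend+1:n,jend+1:n) ...
      - A(jend+1:n,J:jend)*A(J:jend,jend+1:n);       % Trailing update (GEMM)
  end
end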

Big idea

  • Delayed update strategy lets us do LU fast
    • Could have also delayed application of pivots
  • Same idea with other one-sided factorizations (QR)
  • Decent multi-core speedup with parallel BLAS!
    ... assuming \(n\) sufficiently large.

Issues left over (block size?)...

Explicit parallelization of GE

What to do:

  • Decompose into work chunks
  • Assign work to threads in a balanced way
  • Orchestrate communication + synchronization
  • Map which processors execute which threads

Possible matrix layouts

How should we share the matrix across ranks?

1D col blocked

\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \end{bmatrix}\]

1D col cyclic

\[\begin{bmatrix} 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \\ 0 & 1 & 2 & 0 & 1 & 2 & 0 & 1 & 2 \end{bmatrix}\]

1D col block cyclic

\[\begin{bmatrix} 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 2 & 2 & 0 & 0 & 1 & 1 \end{bmatrix}\]

Block skewed

\[\begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 0 & 0 & 0 & 1 & 1 & 1 & 2 & 2 & 2 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 2 & 2 & 2 & 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \\ 1 & 1 & 1 & 2 & 2 & 2 & 0 & 0 & 0 \end{bmatrix}\]

2D block cyclic

\[\begin{bmatrix} 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ 2 & 2 & 3 & 3 & 2 & 2 & 3 & 3 \\ \end{bmatrix}\]

Possible matrix layouts

  • 1D col blocked: bad load balance
  • 1D col cyclic: hard to use BLAS2/3
  • 1D col block cyclic: factoring a block column is a bottleneck
  • Block skewed (a la Cannon): just complicated
  • 2D row/col block: bad load balance
  • 2D row/col block cyclic: win!
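
For concreteness, the winning layout is easy to describe in code. A small sketch (owner is a hypothetical helper; 1-based matrix indices, 0-based ranks) of the 2D block-cyclic map on a Pr-by-Pc process grid with b-by-b blocks:

owner = @(i,j,b,Pr,Pc) mod(floor((i-1)/b), Pr)*Pc + mod(floor((j-1)/b), Pc);

% Reproduces the 8-by-8 "2D block cyclic" picture above (b = 2, 2-by-2 grid)
M = zeros(8);
for i = 1:8
  for j = 1:8
    M(i,j) = owner(i,j,2,2,2);
  end
end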

Distributed GEPP

Find pivot (column broadcast)

Distributed GEPP

Swap pivot row within block column + broadcast pivot

Distributed GEPP

Update within block column

Distributed GEPP

At end of block, broadcast swap info along rows

Distributed GEPP

Apply all row swaps to other columns

Distributed GEPP

Broadcast block \(L_{II}\) right

Distributed GEPP

Update remainder of block row

Distributed GEPP

Broadcast rest of block row down

Distributed GEPP

Broadcast rest of block col right

Distributed GEPP

Update of trailing submatrix

Cost of ScaLAPACK GEPP

Communication costs:

  • Lower bound: \(O(n^2/\sqrt{P})\) words, \(O(\sqrt{P})\) messages
  • ScaLAPACK:
    • \(O(n^2 \log P / \sqrt{P})\) words sent
    • \(O(n \log P)\) messages
    • Problem: reduction to find pivot in each column
  • Tournaments for stability without partial pivoting

What if you don’t care about dense LU?
Let’s review some ideas in a different setting...

Floyd-Warshall

Goal: All pairs shortest path lengths.
Idea: Dynamic programming! Define \[d_{ij}^{(k)} = \mbox{shortest path $i$ to $j$ with intermediates in $\{1, \ldots, k\}$}.\] Then \[d_{ij}^{(k)} = \min\left( d_{ij}^{(k-1)}, d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \right)\] and \(d_{ij}^{(n)}\) is the desired shortest path length.

The same and different

Floyd’s algorithm for all-pairs shortest paths:

for k=1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min(D(i,j), D(i,k)+D(k,j));
    end
  end
end

Unpivoted Gaussian elimination (overwriting \(A\)):

for k=1:n
  for i = k+1:n
    A(i,k) = A(i,k) / A(k,k);
    for j = k+1:n
      A(i,j) = A(i,j) - A(i,k)*A(k,j);
    end
  end
end

The same and different

  • The same: \(O(n^3)\) time, \(O(n^2)\) space
  • The same: can’t move \(k\) loop (data dependencies)
    • ... at least, can’t without care!
    • Different from matrix multiplication
  • The same: \(x_{ij}^{(k)} = f\left(x_{ij}^{(k-1)}, g\left(x_{ik}^{(k-1)}, x_{kj}^{(k-1)}\right)\right)\)
    • Same basic dependency pattern in updates!
    • Similar algebraic relations satisfied
  • Different: Update to full matrix vs trailing submatrix

How far can we get?

How would we write

  • Cache-efficient (blocked) serial implementation?
  • Message-passing parallel implementation?

The full picture could make a fun class project...
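
As a starting point for the first question, here is a sketch of a blocked serial Floyd-Warshall (assuming mod(n,b) == 0, a zero diagonal, and implicit broadcasting as in Octave or MATLAB R2016b+; blocked_fw and minplus are hypothetical names). The diagonal block is processed in place, then its block row and column, then a delayed min-plus update of everything else: the same shape as blocked LU.

function D = blocked_fw(D, b)
  n = size(D,1);
  for K = 1:b:n
    kk = K:K+b-1;
    for k = kk                               % Diagonal block, updated in place
      D(kk,kk) = min(D(kk,kk), D(kk,k) + D(k,kk));
    end
    for J = 1:b:n
      if J ~= K
        jj = J:J+b-1;
        D(kk,jj) = minplus(D(kk,jj), D(kk,kk), D(kk,jj));  % Block row
        D(jj,kk) = minplus(D(jj,kk), D(jj,kk), D(kk,kk));  % Block column
      end
    end
    for I = 1:b:n                            % Delayed update of remaining blocks
      for J = 1:b:n
        if I ~= K && J ~= K
          ii = I:I+b-1;  jj = J:J+b-1;
          D(ii,jj) = minplus(D(ii,jj), D(ii,kk), D(kk,jj));
        end
      end
    end
  end
end

function C = minplus(C, A, B)
  % C(i,j) = min(C(i,j), min_k A(i,k)+B(k,j)): GEMM in the (min,+) semiring
  for k = 1:size(A,2)
    C = min(C, A(:,k) + B(k,:));             % Column + row broadcasts to a matrix
  end
end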

Onward!

Next up: Sparse linear algebra and iterative solvers!