Prof David Bindel


Simple \(y = Ax\) involves two indices: \[ y_i = \sum_{j} A_{ij} x_j \] Sums can go in any order!

Organize \(y = Ax\) around rows or columns:

```
% Row-oriented: y(i) is the dot product of row i with x
for i = 1:n
  y(i) = A(i,:)*x;
end

% Column-oriented: accumulate scaled columns of A
y = zeros(n,1);
for j = 1:n
  y = y + A(:,j)*x(j);
end
```
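As a quick sanity check (a NumPy sketch, not part of the original MATLAB), both orderings compute the same \(y\):

```python
import numpy as np

def matvec_row(A, x):
    # Row-oriented: y(i) is a dot product of row i with x.
    n = A.shape[0]
    y = np.empty(n)
    for i in range(n):
        y[i] = A[i, :] @ x
    return y

def matvec_col(A, x):
    # Column-oriented: accumulate scaled columns of A.
    y = np.zeros(A.shape[0])
    for j in range(A.shape[1]):
        y += A[:, j] * x[j]
    return y

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x = rng.standard_normal(5)
assert np.allclose(matvec_row(A, x), matvec_col(A, x))
assert np.allclose(matvec_row(A, x), A @ x)
```

The orderings differ only in how they traverse memory (dot products by rows vs scaled-column updates), not in the result.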

... or deal with index space in other ways!

Broadcast \(x\), compute rows independently.

```
Allgather(xlocal, xall)   % everyone gets all of x
ylocal = Alocal * xall    % local row block times gathered x
```
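A serial NumPy sketch of this row-block scheme, simulating the \(p\) processors and the Allgather in one process (the function name and the block partition are illustrative, not from the original):

```python
import numpy as np

def matvec_allgather(A, x, p):
    # Simulate p processors, each owning a block of rows of A and the
    # matching block of x.  The Allgather reconstructs the full x;
    # each processor then forms its slice of y independently.
    n = len(x)
    rows = np.array_split(np.arange(n), p)
    xall = np.concatenate([x[r] for r in rows])   # the Allgather
    ylocal = [A[r, :] @ xall for r in rows]       # independent row blocks
    return np.concatenate(ylocal)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
x = rng.standard_normal(6)
assert np.allclose(matvec_allgather(A, x, 3), A @ x)
```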

Compute partial matvecs and reduce.

```
z = Alocal * xlocal            % partial matvec from my block column
for i = 1:p
  Reduce(sum, z[i], ylocal at proc i)
end
```
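The column-block version in the same serial NumPy style (again illustrative; a reduce-scatter would leave each processor with just its block of \(y\)):

```python
import numpy as np

def matvec_reduce(A, x, p):
    # Simulate p processors, each owning a block column of A and the
    # matching block of x.  Each computes a full-length partial matvec;
    # the sum-reduction combines them into y.
    n = len(x)
    cols = np.array_split(np.arange(n), p)
    partials = [A[:, c] @ x[c] for c in cols]  # z on each processor
    return np.sum(partials, axis=0)            # Reduce(sum, ...)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
x = rng.standard_normal(6)
assert np.allclose(matvec_reduce(A, x, 3), A @ x)
```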

- Involves broadcast *and* reduction
- ... but with subsets of processors

- Basic operation: \(C = C+AB\)
- Computation: \(2n^3\) flops
- Goal: \(2n^3/p\) flops per processor, minimal communication
- Two main contenders: SUMMA and Cannon

- Block MATLAB notation: \(A(:,j)\) means \(j\)th block column
- Processor \(j\) owns \(A(:,j)\), \(B(:,j)\), \(C(:,j)\)
- \(C(:,j)\) depends on *all* of \(A\), but only \(B(:,j)\)
- How do we communicate pieces of \(A\)?

- Every processor \(j\) can send data to \(j+1 \bmod p\) simultaneously
- Pass slices of \(A\) around the ring until everyone sees the whole matrix (\(p-1\) phases).

```
tmp = A(:,myproc)                 % my block column of A
C(myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
  sendrecv tmp to myproc+1 mod p,
           from myproc-1 mod p
  C(myproc) += tmp*B(myproc-j mod p, myproc)
end
```
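One way to check the indexing is a serial NumPy simulation of the ring (the function name and block partition are hypothetical; the `sendrecv` becomes a cyclic relabeling of the local `tmp` buffers):

```python
import numpy as np

def ring_matmul(A, B, p):
    # Serial simulation of the 1D ring algorithm: processor r owns block
    # columns A(:,r), B(:,r), C(:,r).  Block columns of A are passed
    # around the ring; after step j, processor r holds A(:, r-j mod p)
    # and multiplies it by B(r-j mod p, r).
    n = A.shape[0]
    s = n // p                       # block size (assume p divides n)
    blk = lambda k: slice(k * s, (k + 1) * s)
    C = np.zeros((n, n))
    tmp = [A[:, blk(r)].copy() for r in range(p)]   # tmp on each proc
    for j in range(p):
        for r in range(p):
            k = (r - j) % p          # which block column of A we hold
            C[:, blk(r)] += tmp[r] @ B[blk(k), blk(r)]
        tmp = [tmp[(r - 1) % p] for r in range(p)]  # sendrecv shift
    return C

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
assert np.allclose(ring_matmul(A, B, 4), A @ B)
```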

Performance model?

In a simple \(\alpha-\beta\) model, at each processor:

- \(p-1\) message sends (and simultaneous receives)
- Each message involves \(n^2/p\) data
- Communication cost: \((p-1) \alpha + (1-1/p) n^2 \beta\)

Recall outer product organization:

```
for k = 0:s-1
  C += A(:,k)*B(k,:);
end
```
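A quick NumPy check (illustrative, with scalar blocks) that the sum of outer products reproduces the matrix product:

```python
import numpy as np

# The matrix product as a sum of outer products: each term
# A(:,k)*B(k,:) is a rank-one update of C (rank-b for block columns).
rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = np.zeros((4, 4))
for k in range(4):
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B)
```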

Parallel: Assume \(p = s^2\) processors, block \(s \times s\) matrices.

For a \(2 \times 2\) example: \[\begin{bmatrix}
C_{00} & C_{01} \\
C_{10} & C_{11}
\end{bmatrix} =
\begin{bmatrix}
A_{00} B_{00} & A_{00} B_{01} \\
A_{10} B_{00} & A_{10} B_{01}
\end{bmatrix} +
\begin{bmatrix}
A_{01} B_{10} & A_{01} B_{11} \\
A_{11} B_{10} & A_{11} B_{11}
\end{bmatrix}
\]

- Processor for each \((i,j)\) \(\implies\) parallel work for each \(k\)!
- Note everyone in row \(i\) uses \(A(i,k)\) at once, and everyone in column \(j\) uses \(B(k,j)\) at once.

```
for k = 0:s-1
  for each i in parallel
    broadcast A(i,k) to row
  for each j in parallel
    broadcast B(k,j) to col
  On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end
```
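A serial NumPy simulation of the SUMMA loop structure (the row/column broadcasts become block reads; `summa` and the block partition are illustrative names):

```python
import numpy as np

def summa(A, B, s):
    # Serial simulation of SUMMA on an s x s processor grid.  At step k,
    # A(i,k) is broadcast along processor row i and B(k,j) along
    # processor column j; processor (i,j) accumulates A(i,k)*B(k,j).
    n = A.shape[0]
    b = n // s                       # block size (assume s divides n)
    blk = lambda t: slice(t * b, (t + 1) * b)
    C = np.zeros((n, n))
    for k in range(s):
        for i in range(s):
            Aik = A[blk(i), blk(k)]          # row broadcast
            for j in range(s):
                Bkj = B[blk(k), blk(j)]      # column broadcast
                C[blk(i), blk(j)] += Aik @ Bkj
    return C

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
assert np.allclose(summa(A, B, 3), A @ B)
```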

If we have tree along each row/column, then

- \(\log(s)\) messages per broadcast
- \(\alpha + \beta n^2/s^2\) per message
- \(2 \log(s) (\alpha s + \beta n^2/s)\) total communication
- Compare to 1D ring: \((p-1) \alpha + (1-1/p) n^2 \beta\)

Note: Same ideas work with block size \(b < n/s\)
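Plugging concrete numbers into the two communication cost models above (the machine parameters here are illustrative, not measurements):

```python
from math import log2, sqrt

def ring_cost(n, p, alpha, beta):
    # 1D ring: (p-1) alpha + (1 - 1/p) n^2 beta
    return (p - 1) * alpha + (1 - 1 / p) * n**2 * beta

def summa_cost(n, p, alpha, beta):
    # SUMMA with tree broadcasts on a sqrt(p) x sqrt(p) grid:
    # 2 log(s) (alpha s + beta n^2 / s), s = sqrt(p)
    s = sqrt(p)
    return 2 * log2(s) * (alpha * s + beta * n**2 / s)

# Hypothetical latency/bandwidth parameters
n, alpha, beta = 4096, 1e-6, 1e-9
for p in (16, 64, 256):
    print(p, ring_cost(n, p, alpha, beta), summa_cost(n, p, alpha, beta))
```

For large \(p\), the \(\beta\) term dominates and SUMMA's \(n^2 \log(s)/s\) bandwidth cost beats the ring's \(n^2\).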

Cannon's algorithm: SUMMA + “pass it around”?

Idea: Reindex products in block matrix multiply \[\begin{aligned}
C(i,j) &= \sum_{k = 0}^{p-1} A(i,k) B(k,j) \\
&= \sum_{k = 0}^{p-1} A(i,\, k+i+j \mod p) \; B(k+i+j \mod p, j)
\end{aligned}\] For a fixed \(k\), a given block of \(A\) (or \(B\)) is needed for contribution to *exactly one* \(C(i,j)\).

\[\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{01} B_{11} \\ A_{11} B_{10} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{00} B_{01} \\ A_{10} B_{00} & A_{11} B_{11} \end{bmatrix}\]

```
% Skew A: position (i,j) ends up with A(i, i+j mod s)
for i = 0 to s-1
  cycle A(i,:) left by i
% Skew B: position (i,j) ends up with B(i+j mod s, j)
for j = 0 to s-1
  cycle B(:,j) up by j
for k = 0 to s-1
  in parallel:
    C(i,j) = C(i,j) + A(i,j)*B(i,j);
  cycle A(i,:) left by 1
  cycle B(:,j) up by 1
```
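A serial NumPy simulation of the skew-and-shift structure (block bookkeeping only, with illustrative names; the cyclic shifts become list relabelings):

```python
import numpy as np

def cannon(A, B, s):
    # Serial simulation of Cannon's algorithm on an s x s grid.
    n = A.shape[0]
    b = n // s                       # block size (assume s divides n)
    blk = lambda t: slice(t * b, (t + 1) * b)
    # Local block copies, indexed by grid position (i, j).
    Ablk = [[A[blk(i), blk(j)].copy() for j in range(s)] for i in range(s)]
    Bblk = [[B[blk(i), blk(j)].copy() for j in range(s)] for i in range(s)]
    # Initial skew: position (i,j) gets A(i, i+j mod s) and B(i+j mod s, j).
    Ablk = [[Ablk[i][(j + i) % s] for j in range(s)] for i in range(s)]
    Bblk = [[Bblk[(i + j) % s][j] for j in range(s)] for i in range(s)]
    C = np.zeros((n, n))
    for _ in range(s):
        for i in range(s):
            for j in range(s):
                C[blk(i), blk(j)] += Ablk[i][j] @ Bblk[i][j]
        # Cycle A left by 1 in each row, B up by 1 in each column.
        Ablk = [[Ablk[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        Bblk = [[Bblk[(i + 1) % s][j] for j in range(s)] for i in range(s)]
    return C

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
assert np.allclose(cannon(A, B, 3), A @ B)
```

At phase \(t\), position \((i,j)\) multiplies \(A(i,\, i+j+t)\) by \(B(i+j+t,\, j)\), matching the reindexed sum above.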

- Assume 2D torus topology
- Initial cyclic shifts: \(\leq s\) messages each (\(\leq 2s\) total)
- For each phase: \(2\) messages each (\(2s\) total)
- Each message is size \(n^2/s^2\)
- Communication cost: \(4s(\alpha + \beta n^2/s^2) = 4(\alpha s + \beta n^2/s)\)
- This communication cost is optimal!

... but SUMMA is simpler, more flexible, almost as good

Recall \[\begin{aligned} \mathrm{Speedup} & := t_{\mathrm{serial}} / t_{\mathrm{parallel}} \\ \mathrm{Efficiency} & := \mathrm{Speedup}/p \end{aligned}\]

| Layout | Efficiency |
|---|---|
| 1D layout | \(\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}\) |
| SUMMA | \(\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}\) |
| Cannon | \(\left( 1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}\) |

Assuming no overlap of communication and computation.