# CS 5220

## Applications of Parallel Computers

### Parallel matrix multiply

Prof David Bindel


### Matrix vector product

The simple matrix-vector product $$y = Ax$$ involves two indices: $y_i = \sum_{j} A_{ij} x_j$. The sums can go in any order!

### Matrix vector product

Organize $$y = Ax$$ around rows or columns:

```matlab
% Row-oriented
for i = 1:n
  y(i) = A(i,:)*x;
end

% Col-oriented
y = 0;
for j = 1:n
  y = y + A(:,j)*x(j);
end
```

... or deal with index space in other ways!
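
Both orderings compute the same $$y$$; a plain-Python sketch of the two loops (illustrative, not the course code):

```python
# Row-oriented: one dot product per row, y(i) = A(i,:)*x
def matvec_rows(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

# Column-oriented: accumulate scaled columns, y += A(:,j)*x(j)
def matvec_cols(A, x):
    n = len(A)
    y = [0.0] * n
    for j, xj in enumerate(x):
        for i in range(n):
            y[i] += A[i][j] * xj
    return y

A = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
print(matvec_rows(A, x))  # -> [17.0, 39.0]
print(matvec_cols(A, x))  # -> [17.0, 39.0]
```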

### Parallel matvec: 1D row-blocked

Broadcast $$x$$, compute rows independently.

### Parallel matvec: 1D row-blocked

```
Allgather(xlocal, xall)
ylocal = Alocal * xall
```
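
A serial simulation of the row-blocked scheme (assuming $$n$$ divisible by $$p$$; the Allgather is simulated by concatenating the local pieces of $$x$$):

```python
def rowblocked_matvec(A, x, p):
    n = len(A)
    nb = n // p                       # rows per "processor"
    xall = []                         # Allgather(xlocal, xall)
    for r in range(p):
        xall += x[r*nb:(r+1)*nb]
    y = []
    for r in range(p):                # ylocal = Alocal * xall on each proc
        for row in A[r*nb:(r+1)*nb]:
            y.append(sum(a*b for a, b in zip(row, xall)))
    return y
```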

### Parallel matvec: 1D col-blocked

Compute partial matvecs and reduce.

### Parallel matvec: 1D col-blocked

```
z = Alocal * xlocal
for i = 1:p
  Reduce(sum, z[i], ylocal at proc i)
end
```
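
A serial simulation of the column-blocked scheme: each "processor" owns a block of columns of $$A$$ and the matching slice of $$x$$, computes a full-length partial product, and the Reduce sums them (processor $$i$$ keeps slice $$i$$):

```python
def colblocked_matvec(A, x, p):
    n = len(A)
    nb = n // p                       # columns per "processor"
    partials = []
    for r in range(p):                # z = Alocal * xlocal on processor r
        z = [0.0] * n
        for jl in range(nb):
            j = r*nb + jl
            for i in range(n):
                z[i] += A[i][j] * x[j]
        partials.append(z)
    # Reduce(sum, ...): elementwise sum of the partial vectors
    return [sum(z[i] for z in partials) for i in range(n)]
```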

### Parallel matvec: 2D blocked

• Combines the two 1D ideas: broadcast $$x$$ within processor columns, reduce partial products within processor rows
• ... but with subsets of processors
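
A serial sketch of one standard 2D-blocked scheme (assumed details: an $$s \times s$$ grid, $$x$$ block $$J$$ broadcast down processor column $$J$$, partial products summed across processor row $$I$$):

```python
def matvec_2d(A, x, s):
    n = len(A)
    nb = n // s                       # block size, assuming s divides n
    y = [0.0] * n
    for I in range(s):                # processor row
        for J in range(s):            # processor column: owns A(I,J), sees x(J)
            for il in range(nb):
                i = I*nb + il
                # reduction across row I accumulates into y block I
                y[i] += sum(A[i][J*nb + jl] * x[J*nb + jl]
                            for jl in range(nb))
    return y
```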

### Parallel matmul

• Basic operation: $$C = C+AB$$
• Computation: $$2n^3$$ flops
• Goal: $$2n^3/p$$ flops per processor, minimal communication
• Two main contenders: SUMMA and Cannon

### 1D layout

• Block MATLAB notation: $$A(:,j)$$ means $$j$$th block column
• Processor $$j$$ owns $$A(:,j)$$, $$B(:,j)$$, $$C(:,j)$$
• $$C(:,j)$$ depends on all of $$A$$, but only $$B(:,j)$$
• How do we communicate pieces of $$A$$?

### 1D layout on ring

• Every process $$j$$ can send data to $$j+1$$ simultaneously
• Pass slices of $$A$$ around the ring until everyone sees the whole matrix ($$p-1$$ phases).

### 1D layout on ring

```
tmp = A(:,myproc)
C(myproc) += tmp*B(myproc,myproc)
for j = 1 to p-1
  sendrecv tmp to myproc+1 mod p,
           from myproc-1 mod p
  C(myproc) += tmp*B(myproc-j mod p, myproc)
```
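
A serial simulation of the ring algorithm: $$A$$ is stored as $$p$$ block columns, $$B$$ as a $$p \times p$$ grid of blocks, and the sendrecv is simulated by rotating the list of `tmp` buffers:

```python
def mm(X, Y):  # dense multiply of small blocks (lists of lists)
    return [[sum(X[i][k]*Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def ring_matmul(Acols, Bblocks, p):
    n = len(Acols[0])
    nb = n // p
    Ccols = [[[0.0]*nb for _ in range(n)] for _ in range(p)]
    tmp = [Acols[q] for q in range(p)]          # tmp = A(:,myproc)
    for j in range(p):
        for q in range(p):
            # after j shifts, processor q holds A(:, q-j mod p)
            Ccols[q] = madd(Ccols[q], mm(tmp[q], Bblocks[(q - j) % p][q]))
        tmp = [tmp[(q - 1) % p] for q in range(p)]  # sendrecv around the ring
    return Ccols
```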

Performance model?

### 1D layout on ring

In a simple $$\alpha$$-$$\beta$$ (latency-bandwidth) model, at each processor:

• $$p-1$$ message sends (and simultaneous receives)
• Each message involves $$n^2/p$$ data
• Communication cost: $$(p-1) \alpha + (1-1/p) n^2 \beta$$

### Outer product algorithm

Recall outer product organization:

```matlab
for k = 0:s-1
  C += A(:,k)*B(k,:);
end
```
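
The same ordering in a serial Python sketch: $$C$$ accumulates one rank-1 (or rank-$$b$$) update per value of $$k$$:

```python
def matmul_outer(A, B):
    n, m, q = len(A), len(B[0]), len(B)
    C = [[0.0]*m for _ in range(n)]
    for k in range(q):                # outermost loop over k
        for i in range(n):            # C += A(:,k) * B(k,:)
            for j in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C
```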

### Outer product algorithm

Parallel: Assume $$p = s^2$$ processors, block $$s \times s$$ matrices.
For a $$2 \times 2$$ example: $\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{00} B_{01} \\ A_{10} B_{00} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{01} B_{11} \\ A_{11} B_{10} & A_{11} B_{11} \end{bmatrix}$

• Processor for each $$(i,j)$$ $$\implies$$ parallel work for each $$k$$!
• Note everyone in row $$i$$ uses $$A(i,k)$$ at once,
and everyone in column $$j$$ uses $$B(k,j)$$ at once.

### Parallel outer product (SUMMA)

```
for k = 0:s-1
  for each i in parallel
    for each j in parallel
      On processor (i,j), C(i,j) += A(i,k)*B(k,j);
end
```
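
A serial simulation of the SUMMA loop: at step $$k$$, block $$A(i,k)$$ is (conceptually) broadcast along processor row $$i$$ and $$B(k,j)$$ along processor column $$j$$, and every grid position does one local block multiply-add:

```python
def summa(Ab, Bb, s):
    # Ab, Bb: s-by-s grids of nb-by-nb blocks (nested lists)
    nb = len(Ab[0][0])
    Cb = [[[[0.0]*nb for _ in range(nb)] for _ in range(s)] for _ in range(s)]
    for k in range(s):
        for i in range(s):            # "in parallel" over the grid
            for j in range(s):
                for ii in range(nb):  # local C(i,j) += A(i,k)*B(k,j)
                    for jj in range(nb):
                        Cb[i][j][ii][jj] += sum(
                            Ab[i][k][ii][kk] * Bb[k][j][kk][jj]
                            for kk in range(nb))
    return Cb
```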

### Parallel outer product (SUMMA)

If we have a broadcast tree along each row/column, then

• $$\log(s)$$ messages per broadcast
• $$\alpha + \beta n^2/s^2$$ per message
• $$2 \log(s) (\alpha s + \beta n^2/s)$$ total communication
• Compare to 1D ring: $$(p-1) \alpha + (1-1/p) n^2 \beta$$

Note: Same ideas work with block size $$b < n/s$$

### Cannon’s algorithm

SUMMA + “pass it around?”

### Cannon’s algorithm

Idea: Reindex products in the block matrix multiply \begin{aligned} C(i,j) &= \sum_{k = 0}^{s-1} A(i,k) B(k,j) \\ &= \sum_{k = 0}^{s-1} A(i,\, k+i+j \mod s) \; B(k+i+j \mod s, j) \end{aligned} For a fixed $$k$$, a given block of $$A$$ (or $$B$$) is needed for the contribution to exactly one $$C(i,j)$$.
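
A quick check of the reindexing: for each phase $$k$$, the map $$(i,j) \mapsto (i+j+k) \bmod s$$ runs through every block index exactly once along each processor row, so no block of $$A$$ is needed twice in the same phase:

```python
def cannon_indices(s, k):
    # which block column of A processor (i,j) uses in phase k
    return {(i, j): (i + j + k) % s for i in range(s) for j in range(s)}

s = 3
for k in range(s):
    use = cannon_indices(s, k)
    for i in range(s):
        # along processor row i, the s processors touch s distinct blocks
        assert sorted(use[(i, j)] for j in range(s)) == list(range(s))
```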

### Cannon’s algorithm

$\begin{bmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{bmatrix} = \begin{bmatrix} A_{00} B_{00} & A_{01} B_{11} \\ A_{11} B_{10} & A_{10} B_{01} \end{bmatrix} + \begin{bmatrix} A_{01} B_{10} & A_{00} B_{01} \\ A_{10} B_{00} & A_{11} B_{11} \end{bmatrix}$

### Cannon’s algorithm

```
% Move A(i,j) to A(i,i+j)
for i = 0 to s-1
  cycle A(i,:) left by i

% Move B(i,j) to B(i+j,j)
for j = 0 to s-1
  cycle B(:,j) up by j

for k = 0 to s-1
  in parallel:
    C(i,j) = C(i,j) + A(i,j)*B(i,j);
  cycle A(i,:) left by 1
  cycle B(:,j) up by 1
```
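
A serial simulation of Cannon's algorithm on an $$s \times s$$ grid with scalar blocks (cycling is simulated by list rotation):

```python
def cannon(A, B, s):
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    C = [[0.0]*s for _ in range(s)]
    for i in range(s):                # skew: cycle A(i,:) left by i
        A[i] = A[i][i:] + A[i][:i]
    for j in range(s):                # skew: cycle B(:,j) up by j
        col = [B[i][j] for i in range(s)]
        col = col[j:] + col[:j]
        for i in range(s):
            B[i][j] = col[i]
    for k in range(s):
        for i in range(s):
            for j in range(s):        # in parallel: local multiply-add
                C[i][j] += A[i][j] * B[i][j]
        A = [row[1:] + row[:1] for row in A]   # cycle each row left by 1
        B = B[1:] + B[:1]                      # cycle each column up by 1
    return C
```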

### Cost of Cannon

• Assume 2D torus topology
• Initial cyclic shifts: $$\leq s$$ messages each ($$\leq 2s$$ total)
• For each phase: $$2$$ messages each ($$2s$$ total)
• Each message is size $$n^2/s^2$$
• Communication cost: $$4s(\alpha + \beta n^2/s^2) = 4(\alpha s + \beta n^2/s)$$
• This communication cost is optimal!
... but SUMMA is simpler, more flexible, almost as good
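
Plugging the three communication models into sample machine numbers (illustrative values, not measurements) shows the ranking:

```python
from math import log2, sqrt

alpha, beta = 1e-6, 1e-9   # assumed per-message latency, per-word cost
n, p = 4096, 64            # assumed problem size and processor count
s = int(sqrt(p))

ring   = (p - 1)*alpha + (1 - 1/p)*n*n*beta
summa  = 2*log2(s)*(alpha*s + beta*n*n/s)
cannon = 4*(alpha*s + beta*n*n/s)

# Cannon has the smallest communication cost at these parameters
print(f"ring   {ring:.3e}")
print(f"summa  {summa:.3e}")
print(f"cannon {cannon:.3e}")
```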

### Speedup and efficiency

Recall \begin{aligned} \mathrm{Speedup} & := t_{\mathrm{serial}} / t_{\mathrm{parallel}} \\ \mathrm{Efficiency} & := \mathrm{Speedup}/p \end{aligned}
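
A tiny worked example of the definitions (the timings are made up):

```python
t_serial, t_parallel, p = 10.0, 0.25, 64   # hypothetical timings

speedup = t_serial / t_parallel            # 40.0
efficiency = speedup / p                   # 0.625
print(speedup, efficiency)
```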

### Speedup and efficiency

| Algorithm | Efficiency |
|---|---|
| 1D layout | $$\left( 1+O\left( \frac{p}{n} \right) \right)^{-1}$$ |
| SUMMA | $$\left( 1+O\left( \frac{\sqrt{p} \log p}{n} \right) \right)^{-1}$$ |
| Cannon | $$\left( 1+O\left( \frac{\sqrt{p}}{n} \right) \right)^{-1}$$ |

Assuming no overlap of communication and computation.