Instruction level parallelism (ILP): run multiple instructions simultaneously.

Single-instruction multiple-data parallelism (SIMD): a single special instruction operates on a vector of numbers.

Multi-thread parallelism: multiple independent worker threads collaborate over a shared-memory abstraction.

Distributed computing: multiple independent worker machines collaborate over a network.

...and programs need to be written to take advantage of these parallel features of the hardware.

Throughput: how many examples can we process in a fixed amount of time

- More generally, how much data can we process in fixed time

Latency: how quickly can we finish processing a single example

- More generally, how much time passes between when we input data and when we get the output for that data
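As a small worked example of how the two metrics can pull in different directions, the sketch below assumes a hypothetical cost model in which processing a batch of `b` examples takes `fixed_cost + per_example_cost * b` seconds. The model and its constants are made up for illustration, not measured.

```python
def batch_time(batch_size, fixed_cost=0.010, per_example_cost=0.001):
    """Seconds to process one batch under a hypothetical linear cost model."""
    return fixed_cost + per_example_cost * batch_size

def throughput(batch_size):
    """Examples processed per second when batches run back to back."""
    return batch_size / batch_time(batch_size)

def latency(batch_size):
    """Seconds between submitting a batch and getting its outputs
    (ignores any time spent waiting for the batch to fill up)."""
    return batch_time(batch_size)

for b in (1, 10, 100):
    print("batch=%3d  throughput=%7.1f examples/s  latency=%.3f s"
          % (b, throughput(b), latency(b)))
```

Under these assumptions, larger batches amortize the fixed cost and raise throughput, but every example in the batch waits longer for its output.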

When thinking about making a component of an algorithm parallel, it's important to keep in mind what metrics we care about.

In order to optimize the ML pipeline, we need to reason about how we can best use parallelism at each stage. Since we've been talking about training for most of the class, let's look at how we can use these types of parallelism to accelerate training an ML model.

We can get a major boost in performance by building linear algebra kernels (e.g. matrix multiplies, vector adds, et cetera) that use parallelism to run fast.

Since matrix multiplies represent most of the computation of training a deep neural network, this can result in a major end-to-end speedup of the training pipeline.

This mostly involves ILP and SIMD parallelism, and (for larger matrices) it can also use multi-threading.

Use BLAS and other highly optimized matrix/tensor libraries for arithmetic.

Write algorithms in terms of matrix operations rather than vector operations as much as possible.

Use broadcast operations rather than loops in languages like Python, where the compiler won't automatically vectorize. In languages like C, the compiler will (try to) automatically vectorize loops, so broadcast operations are neither necessary nor supported.

ML frameworks also contain explicitly parallelized implementations of some special common operations (such as convolutions for CNNs).

You already saw this in the demo last time, but being able to take advantage of a highly-optimized mathematical kernel library gives a huge performance boost: potentially an order of magnitude above even `-O3` optimized C.

What types of computations does BLAS support? Here's a non-exhaustive list:

- axpy: $y = ax + y$ for vectors $x$ and $y$ and scalar $a$.
- dot: $x^T y$
- nrm2 (Euclidean norm): $\| x \|_2 = \sqrt{x^T x}$
- matrix-vector multiply-add: $y = \beta y + \alpha A x$
- matrix-matrix multiply-add: $C = \beta C + \alpha A B$

...there are more operations supported too, such as specialized ops for symmetric and structured matrices. More fast linear algebra operations appear in more advanced libraries such as LAPACK.
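To make the semantics of the operations listed above concrete, here is a minimal pure-Python sketch of four of them. It is a readable reference only, not how BLAS implements them: the function names mirror the BLAS routine names, but the signatures are simplified assumptions, and each function returns a new list instead of updating its argument in place. `nrm2` returns the Euclidean norm, i.e. the square root of $x^T x$.

```python
import math

def axpy(a, x, y):
    """Return a*x + y for scalar a and equal-length vectors x, y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    """Return the inner product x^T y."""
    return sum(xi * yi for xi, yi in zip(x, y))

def nrm2(x):
    """Return the Euclidean norm of x, the square root of x^T x."""
    return math.sqrt(dot(x, x))

def gemv(alpha, A, x, beta, y):
    """Return beta*y + alpha*A x, with A given as a list of rows."""
    return [beta * yi + alpha * dot(row, x) for row, yi in zip(A, y)]

print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(gemv(1.0, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 0.5, [2.0, 4.0]))
```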

One important thing to know about parallel programs is that they need some sort of parallelism in the underlying algorithm to be able to work.

Usually, this parallelism comes in the form of a **parallel loop**, where multiple iterations of a loop can be run (at least partially) in parallel.

For example, look at this non-parallel `C` implementation of a matrix-vector multiply-add $y = \beta y + \alpha A x$ for $A \in \R^{m \times n}$, $x \in \R^n$ and $y \in \R^m$.

```
void mv_multiply_add(int m, int n, float alpha, float beta,
                     float* y, const float* A, const float* x) {
    for (int i = 0; i < m; i++) {
        float acc = 0.0f; // accumulates the dot product of row i of A with x
        for (int j = 0; j < n; j++) {
            acc += A[i*n+j] * x[j]; // A is stored in row-major order
        }
        y[i] = beta * y[i] + alpha * acc;
    }
}
```

What operations within these loops could be run in parallel?

We can think about this as determining how much parallelism is "available" for a program to use.
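One way to see the answer for the matrix-vector case: each iteration of the outer loop over `i` reads only row `i` of `A`, the vector `x`, and `y[i]`, and writes only `y[i]`, so different rows never interfere. The sketch below makes that independence explicit by handing each row to a thread pool. The function names are hypothetical, and in standard CPython the global interpreter lock means this pure-Python version shows the structure of the parallel loop rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def row_update(alpha, beta, row, x, y_i):
    """One outer-loop iteration: depends only on row i of A, x, and y[i]."""
    acc = 0.0
    for a_ij, x_j in zip(row, x):
        acc += a_ij * x_j  # the inner loop is a reduction into acc
    return beta * y_i + alpha * acc

def mv_multiply_add_parallel(alpha, beta, A, x, y, workers=4):
    """Return beta*y + alpha*A x, submitting each row as a separate task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(row_update, alpha, beta, row, x, y_i)
                   for row, y_i in zip(A, y)]
        return [f.result() for f in futures]  # results stay in row order

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mv_multiply_add_parallel(2.0, 0.5, A, [1.0, 1.0], [2.0, 4.0, 6.0]))
```

The inner loop over `j` is different: every iteration updates the same accumulator, so parallelizing it requires a reduction rather than simply running iterations side by side.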

**Simple heuristic:** The more parallelism is available for use in an algorithm, the more an implementation can utilize the parallel resources of the hardware, and the faster it will run.

For example, compare the following matrix-matrix multiply-add $Y = \beta Y + \alpha A X$ for $A \in \R^{m \times n}$, $X \in \R^{n \times p}$ and $Y \in \R^{m \times p}$.

```
void mm_multiply_add(int m, int n, int p, float alpha, float beta,
                     float* Y, const float* A, const float* X) {
    for (int k = 0; k < p; k++) {
        for (int i = 0; i < m; i++) {
            float acc = 0.0f; // accumulates entry (i, k) of the product A X
            for (int j = 0; j < n; j++) {
                acc += A[i*n+j] * X[j*p+k];
            }
            Y[i*p+k] = beta * Y[i*p+k] + alpha * acc;
        }
    }
}
```

The addition of an extra outer loop (over $k \in \{1, \ldots, p\}$) adds more operations to the algorithm that could be parallelized by an efficient implementation.

Do we expect this to benefit more or less from parallelism when compared to the matrix-vector multiply-add?

You've probably observed this effect while working on the programming assignments in this course, and parallelism is part of the reason why this happens.
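A rough way to put numbers on this comparison is to count how many output entries each routine produces, since each entry has its own accumulator and can be computed independently of the others. The helper names below are hypothetical, and the count is only a crude proxy for available parallelism: it says nothing about memory traffic or how well a given implementation exploits it.

```python
def independent_outputs(m, p=1):
    """Number of output entries; each one can be computed independently."""
    return m * p

def multiply_adds(m, n, p=1):
    """Total scalar multiply-add operations across the innermost loop."""
    return m * n * p

n = 1024 * 16  # square A, the size used in this section's numpy timing cells
p = 32
print("matrix-vector: %d independent outputs, %d multiply-adds"
      % (independent_outputs(n), multiply_adds(n, n)))
print("matrix-matrix: %d independent outputs, %d multiply-adds"
      % (independent_outputs(n, p), multiply_adds(n, n, p)))
```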

You're all familiar with the broadcasting capabilities of numpy.

But did you know that one way that these broadcasting operations can be faster...is because of parallelism?

Compare the python code to implement the elementwise product of two vectors $x \odot y$:

```
# imports we'll need
import os
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"
import numpy
import time
```

```
def elementwise_product(x, y):
    assert(len(x) == len(y))
    z = numpy.zeros(len(x))
    for i in range(len(x)):
        z[i] = x[i] * y[i]
    return z
```

with the python code to do this using numpy broadcasting

```
def broadcast_elementwise_product(x, y):
    return x * y
```

The latter is certainly simpler. But is it faster?

```
n = 1024 * 1024 * 1
x = numpy.random.rand(n)
y = numpy.random.rand(n)
```

```
t = time.time()
elementwise_product(x, y)
elapsed = time.time() - t
print("time elapsed: %f" % elapsed)
```

```
t = time.time()
for i in range(10):  # note: 10 calls are timed here, versus 1 call of the loop version
    broadcast_elementwise_product(x, y)
elapsed = time.time() - t
print("time elapsed: %f" % elapsed)
```

Most of the speedup here is coming from getting rid of the python interpreter overhead...but some of it also comes from parallelism. Although numpy does not use multithreading for broadcast operations by default, it is benefitting from vectorization and ILP.

Nevertheless you should always broadcast whenever possible.
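The two timing cells above are not directly comparable as written, since the loop version is timed over one call and the broadcast version over ten. Here is a minimal sketch of a helper that reports average time per call instead. The name `time_per_call` is hypothetical, and using `time.perf_counter` rather than `time.time` is a choice made here, not something the notebook does.

```python
import time

def time_per_call(fn, *args, repeats=10):
    """Average wall-clock seconds per call of fn(*args) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Stand-in workload so this block runs on its own. In the notebook you would
# pass elementwise_product or broadcast_elementwise_product together with x, y.
data = list(range(100000))
print("seconds per call: %f" % time_per_call(sum, data, repeats=5))
```

With this helper, a call such as `time_per_call(elementwise_product, x, y, repeats=1)` can be compared directly against `time_per_call(broadcast_elementwise_product, x, y)`.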

```
n = 1024 * 16
p = 32
A = numpy.random.rand(n, n)
X = numpy.random.rand(n, p)
X1 = numpy.random.rand(n, 1)
# for A * X, need O(n * n * p)
# for A * X1, need O(n * n)...naively should be 32x faster
```

```
t = time.time()
for i in range(10):
    numpy.dot(A, X)
elapsed_4t_mm = time.time() - t
print("time elapsed: %f" % elapsed_4t_mm)
```

```
t = time.time()
for i in range(10):
    numpy.dot(A, X1)
elapsed_4t_mv = time.time() - t
print("time elapsed: %f" % elapsed_4t_mv)
```

Restart the kernel now, so that the single-thread settings below are in place before numpy is imported.

```
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import numpy
import time
```

```
n = 1024 * 16
p = 32
A = numpy.random.rand(n, n)
X = numpy.random.rand(n, p)
X1 = numpy.random.rand(n, 1)
```

```
t = time.time()
for i in range(10):
    numpy.dot(A, X)
elapsed_1t_mm = time.time() - t
print("time elapsed: %f" % elapsed_1t_mm)
```

```
t = time.time()
for i in range(10):
    numpy.dot(A, X1)
elapsed_1t_mv = time.time() - t
print("time elapsed: %f" % elapsed_1t_mv)
```
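To interpret the four timings, it helps to turn them into ratios. The sketch below is one way to do that. The function name `summarize` is hypothetical, and because the 4-thread numbers are lost when the kernel restarts, they have to be copied in by hand. The values in the usage lines are placeholders chosen only to exercise the function, not measurements.

```python
def summarize(t_4t_mm, t_4t_mv, t_1t_mm, t_1t_mv, p):
    """Turn the four elapsed times into speedup and cost ratios."""
    return {
        # how much faster each operation gets when going from 1 to 4 threads
        "mm_speedup_4_threads": t_1t_mm / t_4t_mm,
        "mv_speedup_4_threads": t_1t_mv / t_4t_mv,
        # how much more expensive A*X is than A*X1 on a single thread
        "mm_over_mv_1_thread": t_1t_mm / t_1t_mv,
        # the naive operation-count prediction for that ratio
        "naive_mm_over_mv": float(p),
    }

# PLACEHOLDER inputs: replace with the elapsed_* values printed by the cells.
for name, value in summarize(2.0, 1.0, 8.0, 2.0, p=32).items():
    print("%s: %.2f" % (name, value))
```

If the measured `mm_over_mv_1_thread` comes out well below `p`, the matrix-matrix multiply is doing more useful work per unit time than the matrix-vector multiply.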

Most ML frameworks include BLAS-like hand-tuned parallel implementations of operations that are common in deep learning.

- For example, the linear convolution layer of a CNN.

These implementations should be used whenever possible, and can mean big speedups for commonly used DNN architectures.

What can we parallelize within a single iteration of SGD?

- Parallelize across examples in the minibatch.
- Parallelize the processing of examples, such as on-the-fly feature extraction or kernel computations.
- Parallelize across the layers of a neural network (model parallelism) while computing gradients.
  - Assign each layer to a parallel worker.
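The first option, parallelizing across examples in the minibatch, can be sketched as follows: split the minibatch into chunks, let each worker sum the per-example gradients of its chunk, then combine the partial sums. The sketch assumes a least-squares loss $\frac{1}{2}(w^T x - y)^2$ purely for illustration, and all function names are hypothetical. As with any pure-Python threading in standard CPython, it shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def example_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def chunk_gradient_sum(w, chunk):
    """Sum of per-example gradients over one worker's share of the minibatch."""
    total = [0.0] * len(w)
    for x, y in chunk:
        g = example_gradient(w, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    return total

def minibatch_gradient(w, batch, workers=4):
    """Average gradient over a non-empty batch, computed chunk by chunk."""
    chunks = [batch[i::workers] for i in range(workers)]
    chunks = [c for c in chunks if c]  # drop empty chunks for small batches
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(lambda c: chunk_gradient_sum(w, c), chunks))
    total = [sum(column) for column in zip(*partial_sums)]
    return [t / len(batch) for t in total]

batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
print(minibatch_gradient([0.0, 0.0], batch))
```

Each chunk's work is independent because the gradient computation only reads `w`. The only communication is the final sum of the partial results.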

Is the outer loop of SGD a parallel loop?

**No,** there are inherent sequential dependencies.

One idea: just let a bunch of worker threads run it in parallel as if it were a parallel loop, and use synchronization constructs like locks to prevent race conditions caused by the threads getting in each others' way.

- This can work, but is also slow because of the overhead of the synchronization.

An even crazier idea (called **asynchronous SGD** or **Hogwild!**): just **have multiple worker threads run SGD in parallel without any synchronization**.

Now race conditions could occur!

But it turns out that in many cases this is fine, and the algorithm converges just as well (sometimes we can even prove this).

**Intuition:** the noise/error from the race conditions just adds a small amount to the noise/error already present in the algorithm because of the random gradient sampling.
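The two ideas above (locking versus no synchronization) can be sketched side by side. The toy below assumes a noise-free least-squares problem with made-up true weights, and every name in it is hypothetical. Worker threads share one weight list and either take a lock around each update or, in the Hogwild!-style variant, just read and write it freely. In standard CPython the global interpreter lock makes races rare and prevents any real speedup, so treat this as an illustration of the structure only.

```python
import contextlib
import random
import threading

def make_data(n, w_true, seed=0):
    """Noise-free examples (x, y) with y = w_true . x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        data.append((x, sum(wi * xi for wi, xi in zip(w_true, x))))
    return data

def sgd_worker(w, data, steps, lr, seed, lock):
    """Run SGD steps on the shared list w; lock=None means no synchronization."""
    rng = random.Random(seed)
    guard = lock if lock is not None else contextlib.nullcontext()
    for _ in range(steps):
        x, y = data[rng.randrange(len(data))]
        with guard:
            # Without a lock, other threads may change w between these lines.
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                w[j] -= lr * err * xj

def run_sgd(data, dim, threads=4, steps=2000, lr=0.1, use_lock=False):
    w = [0.0] * dim  # shared by all worker threads
    lock = threading.Lock() if use_lock else None
    workers = [threading.Thread(target=sgd_worker,
                                args=(w, data, steps, lr, i, lock))
               for i in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return w

w_true = [1.0, -2.0, 0.5]  # made-up weights for the toy problem
data = make_data(200, w_true)
print("no synchronization:", run_sgd(data, len(w_true)))
print("with a lock:       ", run_sgd(data, len(w_true), use_lock=True))
```

On this toy problem both variants should end up close to `w_true`, which matches the intuition that occasional races act like a little extra noise.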

What opportunities are there to use parallelism to speed up hyperparameter optimization? What types of parallelism in the hardware are possible to use for these opportunities?

Parallelizing across multiple settings of the hyperparameters.

- Grid search
- Random search
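Both search strategies parallelize in the same way, because every hyperparameter setting can be evaluated independently of the others. The sketch below assumes a hypothetical `validation_loss` function standing in for "train a model and report its validation loss"; the hyperparameter names and ranges are also made up. Threads keep the example short, whereas in practice each trial would more likely run in its own process or on its own machine.

```python
import itertools
import random
from concurrent.futures import ThreadPoolExecutor

def validation_loss(lr, reg):
    """Hypothetical stand-in for training a model and scoring it."""
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def parallel_search(candidates, workers=4):
    """Evaluate every (lr, reg) candidate in parallel; return the best one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(lambda c: validation_loss(*c), candidates))
    best = min(range(len(candidates)), key=lambda i: losses[i])
    return candidates[best], losses[best]

# Grid search: every combination of a few hand-picked values.
grid = list(itertools.product([0.001, 0.01, 0.1, 1.0], [0.0, 0.01, 0.1]))

# Random search: sample each hyperparameter log-uniformly.
rng = random.Random(0)
random_candidates = [(10 ** rng.uniform(-3, 0), 10 ** rng.uniform(-3, -1))
                     for _ in range(12)]

print("grid search best:  ", parallel_search(grid))
print("random search best:", parallel_search(random_candidates))
```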

Preprocessing examples

Parallel inference

- Easy to use parallelism to increase throughput by just having more machines run the same inference algorithm
- A little more tricky, but still possible, to decrease latency

We can reason about parallelism by reasoning about the sources of parallelism in the hardware, the availability of parallelism in the algorithm, and the use of parallel resources in the implementation.

Existing ML frameworks do a lot of this automatically for common tasks like deep learning training.