# CS 5220

## Applications of Parallel Computers

### Performance Basics

Prof David Bindel

## Starting on the Soap Box

### Starting on the Soap Box

The goal is right enough, fast enough — not flop/s.

### Starting on the Soap Box

Performance is not all that matters.

• Portability, readability, debuggability matter too!
• Want to make intelligent trade-offs.

### Starting on the Soap Box

Parallel performance starts with a single core.

• Even single-core performance is hard.
• Helps to build on well-engineered libraries.

### Starting on the Soap Box

Parallel efficiency is hard!

• $$p$$ processors $$\neq$$ speedup of $$p$$
• Different algorithms parallelize differently.
• Speed vs untuned serial code is cheating!

## Peak performance

### Whence Rmax?

Top 500 benchmark reports:

• Rmax: Linpack flop/s
• Rpeak: Theoretical peak flop/s

Measure the first; how do we know the second?

### What is a float?

• (Binary) scientific notation
• Extras: inf, NaN, de-normalized numbers
• IEEE 754 standard: encodings, arithmetic rules
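
For example, a normalized IEEE 754 double-precision number is stored as $x = \pm (1.b_1 b_2 \ldots b_{52})_2 \times 2^{e-1023},$ with one sign bit, an 11-bit exponent field $$e$$, and a 52-bit fraction.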

### What is a float?

Common floating point formats

• 64-bit double precision (DP)
• 32-bit single precision (SP)

Linpack results are double precision
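
A quick way to check the widths and rounding granularity of these formats on your own machine (a small sketch using the standard `<float.h>` constants):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* Size and machine epsilon of the two common IEEE 754 formats */
    printf("double: %zu bytes, eps = %g\n", sizeof(double), DBL_EPSILON);
    printf("float:  %zu bytes, eps = %g\n", sizeof(float),  FLT_EPSILON);
    return 0;
}
```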

### What is a float?

Less common

• Extended precisions (often 80 bits)

• 16-bit half precision (multiple)

• Decimal formats

Lots of interest in 16-bit formats for ML

### What is a flop?

• Basic floating point operations: $+, -, \times, /, \sqrt{\cdot}$

• FMA (fused multiply-add): $d = ab + c$

• Costs depend on precision and op

• Often focus on add, multiply, FMA (flams)
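
For instance, C99's `fma` in `<math.h>` computes $$ab + c$$ with a single rounding step. A minimal sketch (on some systems, link with `-lm`):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 1.5, b = 2.0, c = 0.25;
    double d1 = a*b + c;        /* separate multiply and add: two roundings */
    double d2 = fma(a, b, c);   /* fused multiply-add: one rounding */
    printf("%g %g\n", d1, d2);
    return 0;
}
```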

### Flops / cycle / core

Processor does more than one thing at a time. On my laptop (2018 MacBook Air):

• Two vector FMAs can start in one cycle

• Vector FMA does four DP FMAs at once

• Often count an FMA as two flops
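
As a concrete illustration of one vector FMA, here is a sketch assuming an x86 core with AVX2 and FMA support (the slide does not name an instruction set): the `_mm256_fmadd_pd` intrinsic performs four double-precision FMAs at once.

```c
#include <stdio.h>
#include <immintrin.h>   /* compile with e.g. -march=native or -mavx2 -mfma */

int main(void) {
    __m256d a = _mm256_set1_pd(2.0);
    __m256d b = _mm256_set1_pd(3.0);
    __m256d c = _mm256_set1_pd(1.0);
    /* One vector FMA: four double-precision a*b + c results in one instruction */
    __m256d d = _mm256_fmadd_pd(a, b, c);
    double out[4];
    _mm256_storeu_pd(out, d);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```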

### Flops / cycle (one core)

$2 \frac{\mbox{flops}}{\mbox{FMA}} \times 4 \frac{\mbox{FMA}}{\mbox{vector FMA}} \times 2 \frac{\mbox{vector FMA}}{\mbox{cycle}} = 16 \frac{\mbox{flops}}{\mbox{cycle}}$

### Flops / sec (one core)

\begin{aligned} 16 \frac{\mbox{flops}}{\mbox{cycle}} \times (1.6 \times 10^9) \frac{\mbox{cycle}}{\mbox{sec}} &= 25.6 \times 10^9 \frac{\mbox{flops}}{\mbox{sec}} \\ &= 25.6~\mbox{Gflop/s} \end{aligned}

### Flops / sec

$25.6 \frac{\mbox{Gflop/s}}{\mbox{core}} \times 2~\mbox{cores} = 51.2~\mbox{Gflop/s}$

Things get more complicated if there are different core types (e.g. CPU cores and GPU cores)

### Some historical context

Note: CM-5/1024 peak was 131 Gflop/s.

This was the top machine on the first Linpack benchmark list (July 1993).

## The Cost of Computing

### The Cost of Computing

Consider a simple serial code:


```c
// Accumulate C += A*B for n-by-n matrices
for (i = 0; i < n; ++i)
    for (j = 0; j < n; ++j)
        for (k = 0; k < n; ++k)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```


### The Cost of Computing

Simplest model:

1. Dominant cost is $$2n^3$$ flops (adds and multiplies)
2. One flop per clock cycle
3. Expected time is $\mbox{Time (s)} \approx \frac{2n^3 \mbox{ flops}} {25.6 \cdot 10^9 \mbox{ flop/s}}$
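
For example, with $$n = 1000$$ (a size chosen here only for illustration), the model predicts $\mbox{Time} \approx \frac{2 \times 10^9 \mbox{ flops}}{25.6 \times 10^9 \mbox{ flop/s}} \approx 0.08 \mbox{ s}.$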

Problem: Model assumptions are wrong!

```
dbindel@MacBook-Air-5 codes % gcc naive-matmul.c
dbindel@MacBook-Air-5 codes % time ./a.out
./a.out  112.04s user 0.43s system 99% cpu 1:53.25 total
```


### The Cost of Computing

Dominant cost is $$2n^3$$ flops (adds and multiplies)?

• Dominant cost is often memory traffic!

• Special case of a communication cost

### The Cost of Computing

Two pieces to cost of fetching data

• Latency: Time from operation start to first result (s)

• Bandwidth: Rate at which data arrives (bytes/s)
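
A common first approximation combines the two: moving $$m$$ bytes takes $\mbox{Time} \approx \alpha + \frac{m}{\beta}$ where $$\alpha$$ is the latency (s) and $$\beta$$ is the bandwidth (bytes/s).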

### The Cost of Computing

• Usually latency $$\gg$$ bandwidth$$^{-1}$$ $$\gg$$ time per flop

• Latency to L3 cache is 10s of ns

• DRAM is $$3$$–$$4\times$$ slower

• Partial solution: caches (to discuss next time)

### The Cost of Computing

Measured against the peak flop rate, DRAM latency ($$\sim 100$$ ns) looks even worse: $100 \mbox{ ns} \times 25.6 \mbox{ Gflop/s} = 2560 \mbox{ flops}$

### The Cost of Computing

• Lose orders of magnitude if too many memory refs

• And getting full vectorization is also not easy!

• We’ll talk more about (single-core) arch next time

### The Cost of Computing

What to take away from this example?

### The Cost of Computing

• Simplest: asymptotic complexity (e.g. $$O(n^3)$$ flops)

• Counting every detail just complicates life

• But we want enough detail to predict something

### The Cost of Computing

Watch out for hidden costs

• Flops are not the only cost!

• Memory/communication costs are often killers

• Integer computation may play a role as well

### The Cost of Computing

Haven’t even talked about > 1 core yet!

## The Cost of Computing

### The Cost of (Parallel) Computing

Simple model:

• Serial task takes time $$T$$ (or $$T(n)$$)

• Deploy $$p$$ processors

• Parallel time is $$T(n)/p$$

... and you should be suspicious by now!

### The Cost of (Parallel) Computing

Why is parallel time not $$T/p$$?

• Intrinsically serial work

• Idle time due to synchronization

• Contention for resources

### Quantifying Parallel Performance

• Starting point: good serial performance

• Scaling study: compare parallel to serial time as a function of number of processors ($$p$$) \begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm] \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}

• Ideally, speedup = $$p$$. Usually, speedup $$< p$$.

• Barriers to perfect speedup
- Serial work (Amdahl’s law)

### Amdahl’s Law

Parallel scaling study where some serial code remains: \begin{aligned} p = & \mbox{ number of processors} \\ s = & \mbox{ fraction of work that is serial} \\ t_s = & \mbox{ serial time} \\ t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p \end{aligned}

$\mbox{Speedup} = \frac{t_s}{t_p} \leq \frac{1}{s + (1-s) / p} < \frac{1}{s}$

### Amdahl’s Law

$\mbox{Speedup} < \frac{1}{s}$

So $$1\%$$ serial work $$\implies$$ max speedup < $$100 \times$$, regardless of $$p$$.

### Strong and weak scaling

Amdahl looks bad! But two types of scaling studies:

• Strong scaling: Fix problem size, vary $$p$$

• Weak scaling: Fix work per processor, vary $$p$$

### Strong and weak scaling

For weak scaling, study scaled speedup $S(p) = \frac{T_{\mbox{serial}}(n(p))}{T_{\mbox{parallel}}(n(p), p)}$ Gustafson’s Law: $S(p) \leq p - \alpha(p-1)$ where $$\alpha$$ is the fraction of work that is serial.
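
To see where the bound comes from (assuming the non-serial work parallelizes perfectly): normalize the parallel time on $$p$$ processors to 1, with a fraction $$\alpha$$ of it serial. The same work run serially takes $$\alpha + (1-\alpha)p$$, so $S(p) = \alpha + (1-\alpha)p = p - \alpha(p-1).$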

### Imperfect parallelism

The problem is not just purely serial work, but also:

• Work that offers limited parallelism

### Dependencies

Main pain point: dependency between computations


```
a = f(x)
b = g(x)
c = h(a,b)
```


Compute a and b in parallel, but finish both before c!
Limits amount of parallel work available.

This is a true dependency (read-after-write). Also have false dependencies (write-after-read and write-after-write) that can be dealt with more easily.
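
One way to express this pattern is with OpenMP tasks; the sketch below is illustrative only (the slide does not prescribe a framework, and `f`, `g`, `h` are placeholder stand-ins). Compile with `-fopenmp`.

```c
#include <stdio.h>

/* Placeholder work functions standing in for f, g, h above */
double f(double x) { return x + 1.0; }
double g(double x) { return 2.0 * x; }
double h(double a, double b) { return a * b; }

int main(void) {
    double x = 3.0, a, b, c;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a)
        a = f(x);            /* a and b do not depend on each other... */
        #pragma omp task shared(b)
        b = g(x);            /* ...so these two tasks may run in parallel */
        #pragma omp taskwait /* true dependency: both must finish before c */
        c = h(a, b);
    }
    printf("c = %g\n", c);
    return 0;
}
```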

### Granularity

• Coordination is expensive
- including parallel start/stop!
• Need to do enough work to amortize parallel costs
• Not enough to have parallel work, need big chunks!
• Chunk size depends on the machine.
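
A back-of-the-envelope illustration (my own, not from the slides): if each parallel region pays a fixed coordination cost $$t_c$$ on top of its $$T/p$$ share of the work, then $\mbox{Speedup} = \frac{T}{T/p + t_c},$ which only approaches $$p$$ when the chunk of work $$T/p$$ is much larger than $$t_c$$.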

## Patterns and Benchmarks

### Pleasing Parallelism

“Pleasingly parallel” (aka “embarrassingly parallel”) tasks require very little coordination, e.g.:

• Monte Carlo computations with independent trials
• Mapping many data items independently

Result is “high-throughput” computing – easy to get impressive speedups!
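
As one sketch of the Monte Carlo case (the specific example, estimating $$\pi$$, and all details below are my own illustration using OpenMP), each trial touches no shared state and a single reduction collects the results:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Pleasingly parallel Monte Carlo: estimate pi by counting random points
   that land inside the unit quarter-circle. Compile with -fopenmp -O2.
   rand_r is POSIX rather than standard C; any per-thread generator works. */
int main(void) {
    long n = 100000000, hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        /* One independent stream per thread, so trials share no state */
        unsigned int seed = 1234u + 7919u * (unsigned int) omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; ++i) {
            double x = rand_r(&seed) / (double) RAND_MAX;
            double y = rand_r(&seed) / (double) RAND_MAX;
            if (x*x + y*y <= 1.0)
                ++hits;
        }
    }
    printf("pi is approximately %g\n", 4.0 * hits / (double) n);
    return 0;
}
```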

### Patterns and Benchmarks

• What is the best performance I reasonably expect?
• How do I get that performance?

### Patterns and Benchmarks

Matrix-matrix multiply:

• Is not pleasingly parallel.