Prof David Bindel


Climate modeling

CAD tools (computers, buildings, airplanes, ...)

Computational biology

Computational finance

Machine learning and statistical models

Game physics and movie special effects

Medical imaging

Take a minute to Google "HPC X" where X is your favorite application. What comes up?

If you have no favorite applications, you might poke through the front page of HPCWire to see some things that others care about!

Scientific computing went parallel long ago

Want an answer that is right enough, fast enough

Either of those might imply a lot of work!

We like to ask for more as machines get bigger

We have a lot of data, too

Today: Hard to get a non-parallel computer!

How many cores are in your laptop?

How many in NVIDIA's latest accelerator?

What's the biggest single node EC2 instance?

**Basics:** architecture, parallel concepts, locality and parallelism in scientific codes

**Technology:** OpenMP, MPI, CUDA/OpenCL, cloud systems, compilers and tools

**Patterns:** Monte Carlo, dense and sparse linear algebra and PDEs, graph partitioning and load balancing, fast multipole, fast transforms

Reason about code performance

Many factors: HW, SW, algorithms

Want simple “good enough” models

Learn about high-performance computing (HPC)

Learn parallel concepts and vocabulary

Experience parallel platforms (HW and SW)

Read/judge HPC literature

Apply model numerical HPC patterns

Tune existing codes for modern HW

Apply good software practices

Speed records for the Linpack benchmark:

Speed measured in flop/s (floating point ops / second):

Giga (\(10^9\)) – a single core

Tera (\(10^{12}\)) – a big machine

Peta (\(10^{15}\)) – current top 10 machines

Exa (\(10^{18}\)) – favorite of funding agencies

Look at the report. What does it say about:

Peak flop rate, Linpack rate, HPCG rate?

Energy use and cooling?

Individual processor architecture?

Network organization?

Software stack?

Graph processing benchmark (data-intensive)

Metric: traversed edges per second (TEPS)

What is Fujitsu Fugaku in GTEPS?

How do the top machines compare between Top 500 and Graph 500?

Some high-end machines look like high-end clusters

Except for custom networks.

And then some machines look very different.

Achievable performance is

\(\ll\) peak performance

Application-dependent

Peak is hard to reach on more modest platforms, too!

So how fast can I make my computation?

Peak \(>\) Linpack \(>\) Gordon Bell \(>\) Typical

Measuring performance of real applications is hard

What figure of merit (flops, TEPS, ...?)

Typically a few bottlenecks slow things down

Why they slow down can be tricky!

*Really* care about time-to-solution

Sophisticated methods get answer in fewer flops

... but may look bad in benchmarks (lower flop rates!)

See also David Bailey’s comments:

Starting point: good *serial* performance.

Strong scaling: compare parallel to serial time on the same problem instance as a function of number of processors (\(p\)) \[\begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm] \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}\]

Ideally, speedup = \(p\). Usually, speedup \(< p\), because:

Serial work (Amdahl’s law)

Parallel overheads (communication, synchronization)
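The speedup and efficiency definitions above are easy to sketch in code. This is a minimal illustration in Python; the timing numbers are made up for the example, not measurements:

```python
def speedup(t_serial, t_parallel):
    """Speedup = serial time / parallel time (same problem instance)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Efficiency = speedup / number of processors."""
    return speedup(t_serial, t_parallel) / p

# Illustrative timings: 100 s serial, 30 s on p = 4 processors.
t_s, t_p, p = 100.0, 30.0, 4
print(speedup(t_s, t_p))        # about 3.33, short of the ideal 4
print(efficiency(t_s, t_p, p))  # about 0.83
```

Ideal efficiency is 1; serial work and parallel overheads push it below that.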

\[\begin{aligned} p = & \mbox{ number of processors} \\ s = & \mbox{ fraction of work that is serial} \\ t_s = & \mbox{ serial time} \\ t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p \end{aligned}\]

\[\mbox{Speedup} = \frac{t_s}{t_p} = \frac{1}{s + (1-s) / p} < \frac{1}{s}\]

So \(1\%\) serial work \(\implies\) max speedup < \(100 \times\), regardless of \(p\).
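Amdahl's bound is easy to explore numerically. A short Python sketch of the formula above, showing that 1% serial work caps speedup below \(100\times\):

```python
def amdahl_speedup(s, p):
    """Amdahl's law: speedup with serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# With s = 0.01, speedup approaches but never reaches 1/s = 100:
for p in [10, 100, 1000, 10**6]:
    print(p, amdahl_speedup(0.01, p))
```

Throwing more processors at the problem gives rapidly diminishing returns once the serial fraction dominates.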

Let’s try a simple parallel attendance count:

**Parallel computation:**Rightmost person in each row counts number in row.**Synchronization:**Raise your hand when you have a count**Communication:**When all hands are raised, each row representative adds their count to a tally and says the sum (going front to back).

Parameters: \[\begin{aligned} n = & \mbox{ number of students (80)} \\ r = & \mbox{ number of rows} \\ t_c = & \mbox{ time to count one student (0.3)} \\ t_t = & \mbox{ time to say tally (1)} \\ t_s \approx & ~n t_c \\ t_p \approx & ~n t_c / r + r t_t \end{aligned}\]

How much could I possibly speed up?

\[\mathrm{speedup} < \frac{1}{2} \sqrt{\frac{n t_c}{t_t}}\]
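The bound comes from choosing the number of rows \(r\) to minimize \(t_p = n t_c / r + r t_t\), which happens at \(r = \sqrt{n t_c / t_t}\). A quick numerical check of the model with the slide's parameters:

```python
import math

n, t_c, t_t = 80, 0.3, 1.0   # students, count time, tally time (from slide)

def t_parallel(r):
    """Model: counting within rows in parallel, then r serial tallies."""
    return n * t_c / r + r * t_t

t_serial = n * t_c
best_r = min(range(1, n + 1), key=t_parallel)
best_speedup = t_serial / t_parallel(best_r)
bound = 0.5 * math.sqrt(n * t_c / t_t)
print(best_r, best_speedup, bound)
```

With these numbers the best integer choice is \(r = 5\) rows, and the achieved speedup of about 2.45 sits just under the bound.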

The problem size \(n\) is small

The communication cost is relatively large

The serial computation cost is relatively large

Common suspects for parallel performance problems!

We have (arguably) *exaflop* machines

But codes rarely get peak performance

Better comparison: tuned serial performance

Common measures:

*speedup* and *efficiency*

Strong scaling: study speedup with increasing \(p\)

Weak scaling: increase both \(p\) and \(n\)

Serial overheads, communication kill speedup

Simple models help us understand scaling