CS 5220

Applications of Parallel Computers

Prof David Bindel


The Computational Science & Engineering Picture

[Figure: Application, Analysis, and Computation]

Applications Everywhere!

  • Climate modeling

  • CAD tools (computers, buildings, airplanes, ...)

  • Computational biology

  • Computational finance

  • Machine learning and statistical models

  • Game physics and movie special effects

  • Medical imaging

Question for Discussion

Take a minute to Google "HPC X" where X is your favorite application. What comes up?

If you have no favorite applications, you might poke through the front page of HPCWire to see some things that others care about!

Why Parallel Computing?

Scientific computing went parallel long ago

  • Want an answer that is right enough, fast enough

  • Either of those might imply a lot of work!

  • We like to ask for more as machines get bigger

  • We have a lot of data, too

Why Parallel Computing?

Today: Hard to get a non-parallel computer!

  • How many cores are in your laptop?

  • How many in NVIDIA's latest accelerator?

  • What's the biggest single node EC2 instance?

Lecture Plan

  1. Basics: architecture, parallel concepts, locality and parallelism in scientific codes

  2. Technology: OpenMP, MPI, CUDA/OpenCL, cloud systems, compilers and tools

  3. Patterns: Monte Carlo, dense and sparse linear algebra and PDEs, graph partitioning and load balancing, fast multipole, fast transforms

Objectives

Reason about code performance

  • Many factors: HW, SW, algorithms

  • Want simple “good enough” models

Objectives

Learn about high-performance computing (HPC)

  • Learn parallel concepts and vocabulary

  • Experience parallel platforms (HW and SW)

  • Read/judge HPC literature

  • Apply model numerical HPC patterns

  • Tune existing codes for modern HW

Objectives

Apply good software practices

How Fast Can We Go?

Speed records for the Linpack benchmark:

http://www.top500.org

Speed measured in flop/s (floating point ops / second):

  • Giga (\(10^9\)) – a single core

  • Tera (\(10^{12}\)) – a big machine

  • Peta (\(10^{15}\)) – current top 10 machines

  • Exa (\(10^{18}\)) – favorite of funding agencies
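To make the units concrete, here is a rough single-core measurement sketch in C (my own illustration, not a course benchmark; the loop is a dependent multiply-add chain, so it reports latency-bound performance well below peak):

    /* flops_sketch.c: rough single-core flop/s estimate (illustrative only).
       Build with, e.g.:  cc -O2 -o flops_sketch flops_sketch.c  */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        const long n = 100000000L;          /* multiply-add iterations */
        double x = 0.0, a = 1.000000001, b = 1e-9;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; ++i)
            x = a * x + b;                  /* 2 flops per iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double elapsed = (t1.tv_sec - t0.tv_sec)
                       + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("x = %g (printed so the loop is not optimized away)\n", x);
        printf("%.2f Gflop/s on this dependent chain\n",
               2.0 * n / elapsed / 1e9);
        return 0;
    }

Compare the number it prints against your processor's advertised peak; the gap previews the "achievable \(\ll\) peak" punchline below.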

Fujitsu Fugaku

Look at the report. What does it say about:

  • Peak flop rate, Linpack rate, HPCG rate?

  • Energy use and cooling?

  • Individual processor architecture?

  • Network organization?

  • Software stack?

Alternate: Graph 500

Graph processing benchmark (data-intensive)

  • Metric: traversed edges per second (TEPS)

  • What is Fujitsu Fugaku in GTEPS?

  • How do the top machines compare between Top 500 and Graph 500?

Punchline

  • Some high-end machines look like high-end clusters

    • Except for the custom networks.

  • And some machines look very different.

  • Achievable performance is

    • \(\ll\) peak performance

    • Application-dependent

  • Reaching peak is hard on more modest platforms, too!

Practical Performance

So how fast can I make my computation?

Peak \(>\) Linpack \(>\) Gordon Bell \(>\) Typical

Practical Performance

Measuring performance of real applications is hard

  • What figure of merit (flops, TEPS, ...?)

  • Typically a few bottlenecks slow things down

  • Why they slow down can be tricky!

Practical Performance

Really care about time-to-solution

  • Sophisticated methods get answer in fewer flops

  • ... but may look bad in benchmarks (lower flop rates!)

Practical Performance

See also David Bailey’s comments.

Quantifying Performance

Starting point: good serial performance.

Quantifying Performance

Strong scaling: compare parallel to serial time on the same problem instance as a function of number of processors (\(p\)) \[\begin{aligned} \mbox{Speedup} &= \frac{\mbox{Serial time}}{\mbox{Parallel time}} \\[2mm] \mbox{Efficiency} &= \frac{\mbox{Speedup}}{p} \end{aligned}\]
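A quick worked example (with made-up numbers): if the tuned serial code runs in 100 seconds and the parallel code on \(p = 8\) processors runs in 16 seconds, then

\[\mbox{Speedup} = \frac{100}{16} = 6.25, \qquad \mbox{Efficiency} = \frac{6.25}{8} \approx 78\%.\]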

Quantifying Performance

Ideally, speedup = \(p\). Usually, speedup \(< p\), because:

  • Serial work (Amdahl’s law)

  • Parallel overheads (communication, synchronization)

Amdahl’s Law

\[\begin{aligned} p = & \mbox{ number of processors} \\ s = & \mbox{ fraction of work that is serial} \\ t_s = & \mbox{ serial time} \\ t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s / p \end{aligned}\]

\[\mbox{Speedup} = \frac{t_s}{t_p} \leq \frac{1}{s + (1-s) / p} < \frac{1}{s}\]

So \(1\%\) serial work \(\implies\) max speedup < \(100 \times\), regardless of \(p\).
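Plugging \(s = 0.01\) into the formula makes the ceiling concrete:

\[\mbox{Speedup}(10) \approx 9.2, \quad \mbox{Speedup}(100) \approx 50.3, \quad \mbox{Speedup}(1000) \approx 91.0, \quad \lim_{p \to \infty} \mbox{Speedup}(p) = 100.\]

Even with many thousands of processors, one percent serial work caps the speedup just below \(100\times\).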

A Thought Experiment

Let’s try a simple parallel attendance count:

  • Parallel computation: The rightmost person in each row counts the people in that row.

  • Synchronization: Raise your hand when you have a count.

  • Communication: When all hands are raised, each row representative adds their count to a running tally and says the new sum (going front to back).
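The same count / synchronize / combine pattern, written as code: below is a minimal OpenMP sketch of my own (an analogy, not part of the slides), where each thread plays a row representative and the reduction plays the tally.

    /* attendance.c: the attendance count as an OpenMP reduction (an analogy,
       not from the slides).  Build with, e.g.:  cc -fopenmp attendance.c  */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 80;          /* number of students, as in the toy analysis */
        int present[80];
        for (int i = 0; i < n; ++i)
            present[i] = 1;        /* assume everyone shows up */

        int total = 0;
        /* each thread counts its own chunk ("row"); the reduction
           combines the per-thread tallies into a class total */
    #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < n; ++i)
            total += present[i];

        printf("threads (rows): %d\n", omp_get_max_threads());
        printf("attendance: %d\n", total);
        return 0;
    }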

A Toy Analysis

Parameters: \[\begin{aligned} n = & \mbox{ number of students (80)} \\ r = & \mbox{ number of rows} \\ t_c = & \mbox{ time to count one student (0.3)} \\ t_t = & \mbox{ time to say tally (1)} \\ t_s \approx & ~n t_c \\ t_p \approx & ~n t_c / r + r t_t \end{aligned}\]

How much could I possibly speed up?


Modeling Speedup

\[\mathrm{speedup} < \frac{1}{2} \sqrt{\frac{n t_c}{t_t}}\]
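Filling in the step behind this bound: the speedup is \(t_s / t_p\), and the AM–GM inequality bounds the denominator from below, whatever the number of rows \(r\):

\[\mathrm{speedup} = \frac{n t_c}{n t_c / r + r t_t} \leq \frac{n t_c}{2 \sqrt{n t_c \, t_t}} = \frac{1}{2} \sqrt{\frac{n t_c}{t_t}}\]

With the numbers above (\(n = 80\), \(t_c = 0.3\), \(t_t = 1\)), that is about \(\frac{1}{2}\sqrt{24} \approx 2.4\), attained near \(r = \sqrt{n t_c / t_t} \approx 5\) rows.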

  • The problem size \(n\) is small

  • The communication cost is relatively large

  • The serial computation cost is relatively large

Common suspects for parallel performance problems!

Summary: Thinking about Parallel Performance

  • We have (arguably) exaflop machines

  • But codes rarely get peak performance

  • Better comparison: tuned serial performance

  • Common measures: speedup and efficiency

Summary: Thinking about Parallel Performance

  • Strong scaling: study speedup with increasing \(p\)

  • Weak scaling: increase both \(p\) and \(n\)

  • Serial overheads, communication kill speedup

  • Simple models help us understand scaling