CS 5220 performance intro

# CS 5220: Applications of Parallel Computers ## Intro to Performance Analysis ## 27 Aug 2015

## Reading Note This is a supplement to [the notes](/2015/08/10/performance.html). Go read them, too!

## How Fast Can We Go? - Speed in flop/s for Linpack: [top500](http://www.top500.org) - Giga ($10^9$) -- a single core - Tera ($10^{12}$) -- a big machine - Peta ($10^{15}$) -- current top 10 machines (5 in US) - Exa ($10^{18}$) -- favorite of funding agencies - Current record-holder: China's Tianhe-2 - 33.9 Petaflop/s (54.9 theoretical peak) - 17.8 MW + cooling

## Tianhe-2 environment Commodity nodes, custom interconnect: - Xeon E5-2692 nodes with Phi accelerators - Intel compilers + Intel math kernel libraries - MPICH2 MPI with customized channel - Kylin Linux - *TH Express-2 interconnect*

## A US Contender [Sequoia at LLNL (3 of 500)](http://www.top500.org/system/177556) - 20.1 Petaflop/s theoretical peak - 17.2 Petaflop/s Linpack benchmark (86%) - 14.4 Petaflop/s in a bubble-cloud sim (72%) - 2013 Gordon Bell Prize - 2010 Prize was 30% peak on ORNL Jaguar - Performance on more standard code? - 10% is probably very good!

## Parallel Performance in Practice - Peak > Linpack > Gordon Bell > Typical - Measuring performance of real applications is hard - Typically a few bottlenecks slow things down - Figuring out why can be tricky! - And we *really* care about time-to-solution - Sophisticated methods get answers in fewer flops - ... but may look bad in flop rate benchmarks - Lots of delusion and deception in performance analysis - Read [the notes]((/2015/08/10/performance.html)!

## Quantifying Parallel Performance - Starting point: good *serial* performance - Strong scaling: compare parallel to serial time (fixed size) - Speedup = Serial Time / Parallel Time - Efficiency = Speedup / p - Ideally, speedup = p; usually lower - Barriers to perfect speedup - Serial work (Amdahl's law) - Parallel overheads (communication, synchronization)

Amdahl

\begin{align} p = & \mbox{ number of processors} \\ s = & \mbox{ fraction of work that is serial} \\ t_s = & \mbox{ serial time} \\ t_p = & \mbox{ parallel time} \geq s t_s + (1-s) t_s/p \end{align}
$$ \mbox{Speedup} = \frac{t_s}{t_p} = \frac{1}{s + (1-s)/p} < \frac{1}{s} $$

Things look better if $n$ grows with $p$ (a weak scaling study)

## Summary - We're approaching exaflop *peak* rates - Codes rarely get peak performance - Better: Compare to tuned serial performance - Measure *speedup* and *efficiency* - Strong scaling: increase $p$, fix $n$ - Weak scaling: increase both $p$ and $n$ - Serial overheads and communication kill speedup - Simple analytical models help understand scaling