Introduction and Performance Basics
2024-08-27
Title: Applied High-Performance and Parallel Computing
Web: https://www.cs.cornell.edu/courses/cs5220/2024fa
When: TR 1:25-2:40
Where: Gates G01
Who: David Bindel, Caroline Sun, Evan Vera
Basic logistical constraints:
Fine if you’re not a numerical C hacker!
Reason about code performance
Learn about high-performance computing (HPC)
Apply good software practices
Introduce yourself to a neighbor:
Jot down answers (part of HW0).
Scientific computing went parallel long ago:
Today: Hard to get non-parallel hardware!
Speed records for Linpack benchmark
Speed measured in flop/s (floating point ops / second):
What do these machines look like?
An alternate benchmark: Graph 500
What do these machines look like?
So how fast can I make my computation?
See also David Bailey’s comments:
How can we speed up summing an array of length $n$ with $p \le n$ processors?
$$\text{Speedup} = \frac{\text{Serial time}}{\text{Parallel time}}, \qquad \text{Efficiency} = \frac{\text{Speedup}}{p}$$
Ideally, speedup $= p$. Usually, speedup $< p$.
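One possible answer to the array-summing question can be sketched as follows: split the array into p contiguous chunks, sum each chunk independently, then combine the p partial sums. This is only an illustration of the decomposition. The JavaScript below runs the chunks one after another, so it shows the structure rather than any real parallel speedup, and the names `chunkedSum`, `measuredSpeedup`, and `efficiency` are made up for this sketch.

```javascript
// Sum x by splitting it into p contiguous chunks (assumes 1 <= p <= x.length).
// Each chunk's partial sum is independent work that p processors could do
// at the same time; here the chunks simply run one after another.
function chunkedSum(x, p) {
  const n = x.length;
  const partial = [];
  for (let k = 0; k < p; k++) {
    const lo = Math.floor((k * n) / p);        // first index of chunk k
    const hi = Math.floor(((k + 1) * n) / p);  // one past the last index
    let s = 0;
    for (let i = lo; i < hi; i++) s += x[i];
    partial.push(s);
  }
  // Combine step: adding up p partial sums is extra work the serial code never does.
  return partial.reduce((a, b) => a + b, 0);
}

// Speedup and efficiency computed from measured serial and parallel times.
function measuredSpeedup(tSerial, tParallel) {
  return tSerial / tParallel;
}

function efficiency(tSerial, tParallel, p) {
  return measuredSpeedup(tSerial, tParallel) / p;
}
```

The combine step at the end is one reason the speedup usually falls short of p: it is work that grows with p and does not exist in the serial version.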
Barriers to perfect speedup:
$p$ = number of processors
$s$ = fraction of work that is serial
$t_s$ = serial time
$t_p$ = parallel time $\ge s t_s + (1-s) t_s / p$
Amdahl’s law: $$\text{Speedup} = \frac{t_s}{t_p} \le \frac{1}{s + (1-s)/p} < \frac{1}{s}$$
So 1% serial work ⟹ max speedup < 100×, regardless of p.
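To make the 1% example concrete, here is a minimal sketch that evaluates the Amdahl bound 1 / (s + (1 − s)/p) for a few processor counts. The function name `amdahlBound` and the sample values of p are chosen for illustration only.

```javascript
// Upper bound on speedup from Amdahl's law.
// s = fraction of the work that is serial (0 <= s <= 1), p = number of processors.
function amdahlBound(s, p) {
  return 1 / (s + (1 - s) / p);
}

// With 1% serial work the bound creeps toward, but never reaches, 1/s = 100.
for (const p of [10, 100, 1000, 1e6]) {
  console.log(`p = ${p}: speedup bound = ${amdahlBound(0.01, p).toFixed(2)}`);
}
```

The bound rises quickly for small p and then flattens out, which is why adding processors eventually stops paying off for a fixed problem.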
Let’s try a simple parallel attendance count:
Parallel computation: Rightmost person in each row counts number in row.
Synchronization: Raise your hand when you have a count
Communication: When all hands are raised, each row representative adds their count to a tally and says the sum (going front to back).
(Somebody please time this.)
Parameters:
$n$ = number of students
$r$ = number of rows
$t_c$ = time to count one student
$t_t$ = time to say tally
$t_s \approx n t_c$
$t_p \approx n t_c / r + r t_t$
How much could I possibly speed up?
Student count:
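// Model times for the attendance count, with t_c = 0.3 and t_t = 1.0 hard-coded.
function tserial(n, r) { return n * 0.3; }  // t_s ≈ n t_c (r is unused)
function tparallel(n, r) { return n * 0.3 / r + r * 1.0; }  // t_p ≈ n t_c / r + r t_t
function speedup(n, r) { return tserial(n,r) / tparallel(n,r); }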
rows = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20];
data = ({
"rows" : rows,
"speedup" : rows.map((r) => speedup(nstudents,r))
})
(Parameters: $t_c = 0.3$, $t_t = 1$.)
Mostly-tight bound: $\text{speedup} < \frac{1}{2} \sqrt{n t_c / t_t}$
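The bound appears to come from minimizing the model parallel time n·tc/r + r·tt over r: treating r as continuous, the minimum is at r = √(n·tc/tt), which gives a speedup of ½·√(n·tc/tt). Since r must be a whole number of rows, the bound is only "mostly" tight. The sketch below checks this numerically; the helper names `classSpeedup`, `speedupBound`, and `bestRows` and the class size of 100 are assumptions for illustration, not part of the lecture code.

```javascript
// Modeled speedup for the attendance count with n students in r rows.
function classSpeedup(n, r, tc, tt) {
  const ts = n * tc;                  // one person counts everyone
  const tp = (n * tc) / r + r * tt;   // rows count in parallel, then r tallies in sequence
  return ts / tp;
}

// Continuous-r bound: (1/2) * sqrt(n * tc / tt).
function speedupBound(n, tc, tt) {
  return 0.5 * Math.sqrt((n * tc) / tt);
}

// Brute-force search over integer row counts 1..n for the best modeled speedup.
function bestRows(n, tc, tt) {
  let best = { r: 1, speedup: classSpeedup(n, 1, tc, tt) };
  for (let r = 2; r <= n; r++) {
    const s = classSpeedup(n, r, tc, tt);
    if (s > best.speedup) best = { r, speedup: s };
  }
  return best;
}

// Hypothetical class of 100 students with the slide's parameters tc = 0.3, tt = 1.
const best = bestRows(100, 0.3, 1.0);
console.log(best.r, best.speedup, speedupBound(100, 0.3, 1.0));
```

Comparing the best integer-row speedup against the bound shows how close the two are for a given class size.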
Poor speed-up occurs because:
Some of the usual suspects for parallel performance problems!
Things would look better if I allowed both $n$ and $r$ to grow; that would be a weak scaling study.
This probably does not make sense for a classroom setting…
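Purely as a thought experiment, the weak-scaling variant of the same model can be sketched by fixing the number of students per row, m, so that n = m·r grows with the number of rows. This reuses the slide's time model; the function names `weakSpeedup` and `weakEfficiency` and the choice m = 10 are assumptions for illustration only.

```javascript
// Weak scaling under the attendance model: m students per row, r rows, n = m * r.
//   serial time   ts = m * r * tc
//   parallel time tp = m * tc + r * tt   (each row counts m students, then r tallies)
function weakSpeedup(m, r, tc, tt) {
  const ts = m * r * tc;
  const tp = m * tc + r * tt;
  return ts / tp;
}

// Efficiency = speedup per row.
function weakEfficiency(m, r, tc, tt) {
  return weakSpeedup(m, r, tc, tt) / r;
}

// Hypothetical rows of 10 students, with the slide's parameters tc = 0.3, tt = 1.
for (const r of [1, 5, 10, 20, 40]) {
  console.log(r, weakSpeedup(10, r, 0.3, 1.0), weakEfficiency(10, r, 0.3, 1.0));
}
```

In this model the speedup keeps growing with r, unlike the fixed-n case, but it still levels off toward m·tc/tt because the tally step remains sequential in the number of rows.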
Today:
http://www.cs.cornell.edu/courses/cs5220/2024fa/
... and please enroll and submit HW0!