Parallel HW and Models
2024-09-05
c4-standard-2 (vs e2)
Log-log plot showing memory/compute bottlenecks.
Roofline: An Insightful Visual Performance Model for Multicore Architectures, Communications of the ACM, 2009, 52(4).
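A minimal sketch of the roofline bound, assuming the usual form: attainable flop/s is the smaller of peak compute and memory bandwidth times arithmetic intensity. The machine numbers (PEAK, BANDWIDTH) are hypothetical, chosen only for illustration and not measured on any real instance.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Attainable flop/s under the roofline model.

    peak_flops:    peak compute rate (flop/s)
    mem_bandwidth: peak memory bandwidth (bytes/s)
    intensity:     arithmetic intensity (flop/byte)
    """
    return min(peak_flops, mem_bandwidth * intensity)

# Hypothetical machine, numbers chosen only for illustration
PEAK = 100e9       # 100 Gflop/s
BANDWIDTH = 20e9   # 20 GB/s

# Double-precision dot product: 2 flops (multiply, add) per 16 bytes loaded
dot_intensity = 2 / 16

dot_bound = roofline(PEAK, BANDWIDTH, dot_intensity)  # memory-bound regime
ridge = PEAK / BANDWIDTH  # intensity (flop/byte) where the two roofs meet
```

On these assumed numbers the dot product sits far to the left of the ridge point, so memory bandwidth, not peak flop/s, sets the bound.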
Basic components: processors, memory, interconnect.
Programming model through languages, libraries.
For performance, need cost models (involves HW)!
How can we parallelize dot product?
Program consists of threads of control.
Consider pdot on p ≪ n processors:
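One possible sketch of pdot, not necessarily the version intended here: split the index range into p contiguous chunks, compute a partial sum per chunk, then combine. The helper name partial_dot and the use of a thread pool are assumptions for illustration. In CPython the global interpreter lock means this shows the structure, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_dot(x, y, lo, hi):
    """Dot product of the slices x[lo:hi] and y[lo:hi]."""
    return sum(x[i] * y[i] for i in range(lo, hi))

def pdot(x, y, p):
    """Split the index range into p contiguous chunks, one per worker."""
    n = len(x)
    bounds = [(k * n // p, (k + 1) * n // p) for k in range(p)]
    with ThreadPoolExecutor(max_workers=p) as pool:
        futures = [pool.submit(partial_dot, x, y, lo, hi) for lo, hi in bounds]
        # Combine the partial sums on the calling thread
        return sum(f.result() for f in futures)

x = list(range(1000))
y = [2] * 1000
result = pdot(x, y, 4)
```

This version returns each partial sum to the caller instead of updating a shared accumulator, which is exactly the part that gets harder once the sum is shared.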
Of course, it can’t be that simple…
A race condition is when:
Consider s += partial on two CPUs (s shared).
Processor 1    Processor 2
load S         …
add partial    …
…              load S
store S        …
…              add partial
…              store S
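The interleaving can be replayed deterministically, with no real threads, to see the lost update. The register names r1, r2 and the partial values 3 and 4 are made up for illustration.

```python
# Replay the interleaving step by step (no real threads involved).
s = 0                    # shared variable
partial1, partial2 = 3, 4

r1 = s                   # Processor 1: load S
r1 += partial1           # Processor 1: add partial
r2 = s                   # Processor 2: load S (still sees the old value)
s = r1                   # Processor 1: store S
r2 += partial2           # Processor 2: add partial
s = r2                   # Processor 2: store S (overwrites Processor 1's update)

lost_update = s          # Processor 1's contribution is gone
```

Both processors loaded the same old value of s, so the second store discards the first update.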
Implicitly assumed sequential consistency:
Can consider s += partial a critical section
Dot product with mutex:
1. Create global mutex l
2. Compute partial
3. Lock l
4. s += partial
5. Unlock l
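A sketch of the mutex-protected update using Python's threading.Lock. The function name pdot_mutex and the dict used to hold the shared sum are assumptions for illustration, not the course's reference code.

```python
import threading

def pdot_mutex(x, y, p):
    n = len(x)
    state = {"s": 0}              # shared accumulator
    l = threading.Lock()          # create the global mutex l

    def worker(lo, hi):
        # Compute partial without holding the lock
        partial = sum(x[i] * y[i] for i in range(lo, hi))
        with l:                   # lock l
            state["s"] += partial # critical section: s += partial
        # l is unlocked on leaving the with block

    threads = [
        threading.Thread(target=worker, args=(k * n // p, (k + 1) * n // p))
        for k in range(p)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                  # wait for every worker before reading s
    return state["s"]

result = pdot_mutex(list(range(1000)), [2] * 1000, 4)
```

Only the short update is inside the lock; the expensive partial sum runs outside it so workers do not serialize on the mutex.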
Still need to synchronize on return…
Processor 1    Processor 2
What if both processors execute step 1 simultaneously?
Shared memory correctness is hard
And this is before we talk performance!
Shared memory is expensive!
Processor 1    Processor 2
What could go wrong?
Processor 1    Processor 2
Better, but what if more than two processors?
MPI_Sendrecv
MPI_Allreduce
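A rough simulation of the message-passing version, assuming nothing about a real MPI installation: threads stand in for ranks and queue.Queue objects stand in for mailboxes. This is not MPI and makes no MPI calls; it only mimics the result an all-reduce delivers, where every rank ends up holding the full sum. The name simulated_allreduce_dot is hypothetical.

```python
import queue
import threading

def simulated_allreduce_dot(x, y, p):
    n = len(x)
    inbox = [queue.Queue() for _ in range(p)]   # one mailbox per rank
    totals = [None] * p                         # each rank's final answer

    def rank_main(rank):
        lo, hi = rank * n // p, (rank + 1) * n // p
        partial = sum(x[i] * y[i] for i in range(lo, hi))
        # Send first: put() on an unbounded queue never waits for a receiver,
        # so every rank can send before receiving without getting stuck
        for other in range(p):
            if other != rank:
                inbox[other].put(partial)
        # Then receive one partial sum from every other rank
        total = partial
        for _ in range(p - 1):
            total += inbox[rank].get()
        totals[rank] = total

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

totals = simulated_allreduce_dot(list(range(100)), [1] * 100, 4)
```

The send-then-receive order is safe here only because the simulated send is buffered; with sends that wait for a matching receive, the same order on every rank could stall.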
Parallel performance is limited by:
Overcome these limits by understanding common patterns of parallelism and locality in applications.
Can get more parallelism / locality through modeling
Often get parallelism at multiple levels
More about parallelism and locality in simulations!