Applications of Parallel Computers
Parallel Machines and Models
Prof David Bindel
Please click the play button below.
Welcome to another edition of 5220. For the last few slide decks,
we've been talking about single core performance, and maybe you
thought "when do we get to the parallel case"? Well, we start
today. This is an overview lecture to set the stage; we'll get
into much more detail on the topics presented today in later
slide decks.
Parallel computer hardware
Have processors, memory, interconnect.
Where is memory physically?
Is it attached to processors?
What is the network connectivity?
So far, our discussions of parallel machines have mostly focused
on hardware. We've talked about peak flop rates and power usage,
and of course we've mentioned what types of processors these
machines use. But after our discussion of memory, maybe it won't
surprise you to learn that the positioning of memory and the way
we communicate with memory -- and with other processors -- plays
just as big a role in performance. In general, parallel
architecture is about dealing with these three components: the
individual processors, the memory, and the networks used to
connect everything. I'll sometimes say network fabric or
interconnect when referring to the network; these terms
basically all mean the same thing. The network is often the thing
that most distinguishes a supercomputer from a small cluster.
Parallel programming model
Programming model through languages, libraries.
What are the control mechanisms?
What data semantics? Private, shared?
What synchronization constructs?
For performance, need cost models!
This class is pretty low-level, but it's still worth
distinguishing between the hardware and the programming model --
that is, the abstractions we use for writing our parallel codes.
The model tells us how we initiate and control parallel jobs,
share data between processors, and synchronize the efforts of
different processors. The parallel programming models we'll
discuss are pretty close to the way that we think about certain
types of hardware, but they aren't identical. We can implement
shared memory programming abstractions even if we only have
hardware support for passing messages around, and we can implement
message passing on top of shared memory hardware. Indeed, these
can be really useful things to do! So it is worthwhile keeping
the programming abstraction distinct from the hardware in our
minds. Of course, if we want to think about performance as well
as correctness of our parallel codes, we need to have some
understanding of the hardware, too. At the very least, we need to
know enough to build cost models that will help us predict
performance and guide us toward good implementations.
Simple example
double dot(int n, double* x, double* y)
{
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
Examples always help to make things concrete. For this lecture,
our running example will be dot products. From our centroid
example, we know this naïve implementation of the dot product
might not be optimal on a single core; but this doesn't really
matter for our discussion today. What matters is that it is
pretty obvious how to split the dot product into independent
pieces of work that we could assign to different processors.
Simple example
double pdot(int n, double* x, double* y)
{
    double s = 0;
    for (int p = 0; p < NUM_PROC; ++p) { // Loop to parallelize
        int i = p*n/NUM_PROC;
        int inext = (p+1)*n/NUM_PROC;
        double partial = dot(inext-i, x+i, y+i);
        s += partial;
    }
    return s;
}
Well, let's be a little more explicit about how we might partition
the work. The idea is that we are going to split the big dot product
into smaller dot products, each about size n/p. Then we take the
partial dot products, and accumulate them into the total sum.
I've summarized the logic in the pdot code above, while leaving
vague how we would actually parallelize the main loop.
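To make that concrete, here is one possible way to parallelize the main
loop, previewing the OpenMP interface we'll cover in detail later. This
is a minimal sketch under a couple of assumptions: it reuses the serial
dot routine and the NUM_PROC constant from the slide, and it lets
OpenMP's reduction clause handle the final accumulation.

double pdot_omp(int n, double* x, double* y)
{
    double s = 0;
    // Each thread gets a private copy of s; OpenMP combines the
    // per-thread copies into the shared s at the end of the loop.
    #pragma omp parallel for reduction(+:s)
    for (int p = 0; p < NUM_PROC; ++p) {
        int i = p*n/NUM_PROC;
        int inext = (p+1)*n/NUM_PROC;
        s += dot(inext-i, x+i, y+i);
    }
    return s;
}

The reduction clause quietly takes care of exactly the questions we're
about to raise: who owns which data, and how the partial results get
combined.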
Simple example
How can we parallelize?
Where do arrays \(x\) and \(y\) live? One CPU? Partitioned?
Who does what work?
How do we combine to get a single final result?
Actually, there is a lot that goes into thinking about parallel
implementation for even this simple example. We've referred to
the arrays x and y, but in a parallel setting, there is a question
of where they live. Are they on the memory of a particular
processor? Does that concept even make sense? Then, of course,
there's the issue of who does what work. We suggested one way for
splitting up the sum, but there are other ways to do the split-up
as well. Finally, once each processor has done its work, we need
to combine the partial sums, which means some type of
communication.
OK. Let's turn now to a couple of different ways we might do this.
Shared memory model
The first programming model we'll discuss is shared memory.
Shared memory model
Program consists of threads of control.
Can be created dynamically
Each has private variables (e.g. local)
Each has shared variables (e.g. heap)
Communication through shared variables
Coordinate by synchronizing on variables
Examples: OpenMP, pthreads
In shared memory programming systems, like OpenMP or pthreads,
there are independent "threads" of execution that communicate
through a common memory space. A thread has its own program
counter and call stack, so typically would have some private
(stack) variables as well as accessing shared space. Threads
may correspond to physical processors, but they don't strictly
need to. In some systems, threads can be dynamically added or
removed; in others, there is a fixed pool of threads. The
tricky part of shared memory programming is synchronizing the
access to the shared space.
Shared memory dot product
Dot product of two \(n\) vectors on \(p \ll n\) processors:
Each CPU: partial sum (\(n/p\) elements, local)
Everyone tallies partial sums
Can we go home now?
In words, one shared memory or thread-based approach to dot
products involves each CPU taking a partial dot product, and
then adding that partial sum into a shared accumulator.
Of course, it can't be that easy...
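To see why, here is a deliberately naive sketch in an OpenMP flavor
(the thread-count logic and the pdot_racy name are just for
illustration). Each thread dots its own slice and then updates the
shared sum with no synchronization at all; that unprotected update is
exactly what the next slides pick apart.

#include <omp.h>

double pdot_racy(int n, double* x, double* y)
{
    double s = 0;   // shared accumulator
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int i = p*n/nthreads;
        int inext = (p+1)*n/nthreads;
        double partial = dot(inext-i, x+i, y+i);  // private partial sum
        s += partial;   // unsynchronized update of shared s: a race!
    }
    return s;
}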
Race condition
A race condition:
Two threads access same variable
At least one write.
Accesses are concurrent – no ordering guarantees
Could happen simultaneously!
The problem here is that each thread is reading and writing to
the same shared sum variable, but so far we haven't said
anything about synchronization. That's dangerous because it
sets us up for what is called a race condition. This is when
two threads are accessing the same variable, with at least one
write, concurrent access, and no ordering guarantees. Race
conditions lead to unpredictable variations in results.
Race to the dot
Consider S += partial on two CPUs
Let's consider what could go wrong with two processors trying to
simultaneously update the shared sum without synchronization.
Race to the dot
P1                P2
load S
add partial
                  load S
store S
                  add partial
                  store S
Let's consider what happens during an update, written out in
pseudo-assembly language. An update operation consists of three parts: first, we
load the current sum into a register; then we update with the
partial sum; and then we store back. If both processors read
before either one writes back the update, then part of the sum
will be lost. The thing that's really awful about this problem
is that it won't always happen! Whether we get the full sum, or
one part, or the other depends on how the reads and writes
interleave with each other, which is pretty unpredictable and
will vary from run to run.
Sequential consistency
Idea: Looks like processors take turns, in order
Convenient for thinking through correctness
Really hard for performance!
Will talk about memory models later
You might already find the explanation of the race condition on the
previous slide unintuitive. Unfortunately, it's even worse than
that. The explanation we gave implicitly assumes the idea of
sequential consistency: that is, the state of memory is consistent
with some serial interleaving of the instructions from the
different threads. In reality, though,
it's really hard to have sequential consistency and still get
good performance on modern machines. Some computer architects
still try, but for the most part we have to live with weaker
models of consistency, where even stranger things can happen
than what we described. We'll talk about these alternate models
of memory later in the class.
Shared memory dot with locks
Solution: consider S += partial_sum a critical section
Only one CPU at a time allowed in critical section
Can violate invariants locally
Enforce via a lock or mutex
How do we avoid this type of race condition? The problem we saw
really had to do with the fact that load, add, and store from
one thread could interleave with the same operations from
another thread. We can avoid that by requiring that each thread
get a lock, also called a mutual exclusion variable (or mutex),
before applying the update. This establishes what is called a
critical section - critical sections are parts of the program
that only one thread can enter at a time. We will talk about
locks and critical sections in more detail later.
Shared memory dot with locks
Dot product with mutex:
Create global mutex l
Compute partial_sum
Lock l
S += partial_sum
Unlock l
OK, let's sketch how this works. We start by creating a lock or
mutex variable. Only one thread can "hold" or "acquire" the
lock at a time. So for the dot product, we compute our partial
sum, acquire the lock, update the global sum, and release the
lock. This means each update is applied in sequence, without
interleaving. Of course, we don't know in what order the
partial sums will be accumulated, and that matters in floating
point, though only at the level of roundoff. So this code,
though correct, doesn't provide bitwise reproducibility of
results. As I might have already said, parallel execution is
subtle stuff!
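Here is a sketch of how that might look with pthreads, one of the
shared memory systems mentioned earlier. It assumes the partial sums
are computed elsewhere and shows only the protected update; the names
s, l, and accumulate are just for illustration.

#include <pthread.h>

static double s = 0;                                  // shared sum
static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER; // global mutex

void accumulate(double partial_sum)
{
    pthread_mutex_lock(&l);    // only one thread gets past this at a time
    s += partial_sum;          // critical section
    pthread_mutex_unlock(&l);  // release so the next thread can enter
}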
A problem
Processor 1:
Acquire lock 1
Acquire lock 2
Do something
Release locks
Processor 2:
Acquire lock 2
Acquire lock 1
Do something
Release locks
What if both processors execute step 1 simultaneously?
In the dot product example, we only need one lock, which we use
to protect accesses to the global sum. But what happens if we
need more than one lock because we want to compute more than one
thing? Let's consider the example above. If both threads are
able to execute the first step simultaneously, then we run into
trouble at the second step. The first thread holds lock 1 and
wants lock 2; and the second thread holds lock 2 and wants lock 1.
Nobody can make progress! This situation is called deadlock.
We'll talk about this more later; it turns out that there are
ways that we can ensure that we avoid deadlock, which those of
you who took an OS class probably studied already (and maybe
forgot!). But let's now briefly mention a synchronization approach
that will definitely not deadlock, and is really useful for lots
of scientific codes.
Shared memory with barriers
Lots of sci codes have phases (e.g. time steps)
Communication only needed at end of phases
Idea: synchronize on end of phase with barrier
More restrictive (less efficient?) than small locks
But easier to think through! (e.g. less chance of deadlocks)
Sometimes called bulk synchronous programming
In many scientific codes, and really in many codes in general,
the computation has natural phases. Within each phase, we can
do independent work; for correctness, we just need to ensure
that one phase is completely done before the next can start. A
barrier is a synchronization construct that does exactly this:
every computation on every thread before the barrier has to
finish before we start computations after the barrier. This is
not as flexible as fine-grain locking, but it's also less
difficult to reason about correctness of code written with
barriers. This style of programming, where we have phases
separated by barriers, is sometimes called bulk synchronous
programming (or BSP for short).
Dot with barriers
partial[threadid] = local partial sum
barrier
sum = sum(partial)
What does the dot product look like with barriers? A typical
organization might involve each thread writing a partial sum
into an array, then a barrier, and then each thread summing the
array to get a sum (in a private variable). If we wanted the sum to go
into a global space, we might have a designated thread copy the
result out.
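A minimal OpenMP-flavored sketch of this organization follows. The
MAX_THREADS bound and the routine name are assumptions for the sake of
illustration, and we again reuse the serial dot routine.

#include <omp.h>

#define MAX_THREADS 64  // assumed upper bound on the thread count

double pdot_barrier(int n, double* x, double* y)
{
    double partial[MAX_THREADS];  // shared array of partial sums
    double s = 0;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int i = tid*n/nthreads;
        int inext = (tid+1)*n/nthreads;
        partial[tid] = dot(inext-i, x+i, y+i);  // phase 1: local work

        #pragma omp barrier  // wait until every partial sum is written

        double local = 0;    // phase 2: each thread tallies privately
        for (int p = 0; p < nthreads; ++p)
            local += partial[p];

        #pragma omp single
        s = local;           // one designated thread copies the result out
    }
    return s;
}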
Punchline
Shared memory correctness is hard
Too little synchronization: races
Too much synchronization: deadlock
And this is before we talk performance!
So far, we've talked mostly about correctness in the shared
memory model. And it's not that simple! We have to
synchronize to avoid data races, but too much synchronization
(done without care) might lead to deadlock. And none of this
has even touched on performance yet! But to say anything about
performance, we need to say a bit more about hardware.
Shared memory machines
So let's talk now about shared memory hardware.
Uniform shared memory
Processors and memories talk through a bus
Symmetric Multiprocessor (SMP)
Hard to scale to lots of processors (think \(\leq 32\))
Bus becomes bottleneck
Cache coherence via snooping
One of the most straightforward approaches to shared memory is
to attach processors and main memory to a common bus. The
caches live with the processors and not with the memory, so we
need a way to keep them consistent (or coherent) with the main
memory in the face of memory writes that other processors might
commit. A bus is a shared broadcast medium; only one core or
memory can send on the bus at a time, but anyone can hear what
is being sent. On the down side, the more processors and
memories are on the bus, the harder it is to share that
resource. On the up side, because everyone can see the bus, it
is possible to keep the caches consistent by "snooping" on
memory traffic sent across the bus.
Multithreaded processor machine
Maybe threads > processors!
Idea: Switch threads on long latency ops.
Called hyperthreading by Intel
Cray MTA was an extreme example
The shared memory interface also sometimes makes sense when
threads don't exactly correspond to processors. Usually, that
means having more threads than processors. When a thread has to
wait for the results of a memory read, disk write, or other
long-latency task, it can yield the processor for use by other
threads. Sometimes this is a purely software setup, and
sometimes it involves hardware support. On modern Intel chips,
this is called hyper-threading, and we mentioned it briefly in
our discussion of single-core architecture. But there are some
architectures that have gone way more extreme than what Intel
did. The Cray MTA machine took a really long time to access
memory (though that time was pretty uniform), and tried to hide
that latency with a *lot* of threads. Needless to say, it was
not easy to program. I had an amusing semester in graduate
school listening to one of my officemates swear at that
machine. Good times! And better him than me.
Distributed shared memory
Non-Uniform Memory Access (NUMA)
Memory logically shared, physically distributed
Any processor can access any address
Close accesses are faster than far accesses
Cache coherence is still a pain
Most big modern chips are NUMA
Many-core accelerators tend to be NUMA as well
When we have a lot of processors, we are generally forced to
move away from sending all data across a single bus. In this
case, there is a more complex network that connects the
processors and memories. Often, memories are physically located
together with processors, so each processor has "local" memory
and "remote" memory. These memories may all be accessed through
the same logical address space, but it takes different amounts
of time to read or write data depending on whether the read is
to local or remote memory addresses. This is called non-uniform
memory access, or NUMA.
NUMA systems scale to large numbers of cores much better than
uniform access (or "symmetric") multiprocessors. Most modern
big chips are NUMA, as are most many-core accelerators.
Unfortunately, it is harder to keep the caches in a NUMA chip
consistent with each other than it is in an SMP (though there
are mechanisms for this).
Punchline
Shared memory is expensive!
Uniform access means bus contention
Non-uniform access scales better
(but now access costs vary)
Cache coherence is tricky
May forgo sequential consistency for performance
So: shared memory hardware presents some challenges. If we want
uniform memory access, we need a beefy bus to connect the
processors and memories together, and contention for that bus
limits our performance. Giving up on uniform access means that
we can scale to more processors, but now the costs of accessing
memories vary - though maybe we are OK with this, as there is
already variation in access times due to the effect of the cache
system. Keeping the caches coherent with each other and with
main memory is a challenge, particularly in the non-uniform
case. There are clever solutions, but for the sake of
performance, we often don't seek to make those solutions
perfectly maintain sequential consistency.
Don't worry if you didn't catch all that. We'll go into it in
more detail when we talk about shared memory programming in
OpenMP in a couple weeks.
Message passing model
The major alternative to shared memory programming is
programming via message passing.
Message-passing programming model
Collection of named processes
Data is partitioned
Communication by send/receive of explicit message
Lingua franca: MPI (Message Passing Interface)
In the message-passing model of parallel programming, we have a
collection of named processes, each with private memory spaces.
The data for the program is partitioned across these private
memories; there is no global shared space. Communication
happens by explicitly sending and receiving messages between
processors. The standard approach to message-passing
parallelism in scientific computing is the Message Passing
Interface, or MPI.
Message passing dot product: v1
Processor 1:
Partial sum s1
Send s1 to P2
Receive s2 from P2
s = s1 + s2
Processor 2:
Partial sum s2
Send s2 to P1
Receive s1 from P1
s = s1 + s2
What could go wrong? Think of phones vs letters...
So let's talk about how parallel dot product might work with two
processors in a message-passing model. Each processor holds a
part of x and a part of y in its memory. The processor dots its
piece, then sends the partial sum to the other processor. Then
the other processor receives the outside partial sum, adds it to
the partial sum that it computed, and that gives the overall dot
product.
Alas, you knew it couldn't be quite that easy. The problem is
that sending a message *may* be less like a letter, and more
like placing a telephone call. You can't hang up the phone
while you are dialing and waiting to deliver the message! In a
world where there are no busy signals or answering machines, if
I call you at the same time that you call me, maybe we both
spend forever waiting for the other side to pick up. So this
code is prone to deadlock.
As it turns out, the system has buffers in it, and whether
sending a message is phone-call-like or letter-like depends on
the state of those buffers. So the same code may work most of
the time, but periodically deadlock because of the issue we
sketched above. This is pretty maddening to debug.
Message passing dot product: v1
Processor 1:
Partial sum s1
Send s1 to P2
Receive s2 from P2
s = s1 + s2
Processor 2:
Partial sum s2
Receive s1 from P1
Send s2 to P1
s = s1 + s2
Better, but what if more than two processors?
There's a way around the potential deadlock issue in the
previous example. If the first processor sends before
receiving, and the second processor receives before sending,
then we no longer have the possibility of deadlock. P1 sends
and P2 receives; then P2 sends and P1 receives. The order is
nailed down.
Of course, this business of swapping the sends and receives is
fine for two processors communicating with each other in a
fairly simple exchange. What if there are more processors, and
the communication between them is more complex?
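With many processors, we usually don't hand-code the sends and
receives at all; MPI provides collective operations for exactly this
pattern. Here is a minimal sketch, assuming each rank already holds
its local pieces of x and y (of length nlocal) and reusing the serial
dot routine.

#include <mpi.h>

double pdot_mpi(int nlocal, double* xlocal, double* ylocal)
{
    double partial = dot(nlocal, xlocal, ylocal);  // local partial sum
    double s;
    // Sum the partial results across all ranks; every rank gets the total.
    MPI_Allreduce(&partial, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return s;
}

The library is free to organize the reduction however it likes (for
example, as a tree), which is part of why collectives are usually
preferred over hand-rolled exchanges.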
MPI: the de facto standard
Pro: Portability
Con: least-common-denominator for mid 80s
The “assembly language” (or C?) of parallelism...
but, alas, assembly language can be high performance.
There are other message-passing programming environments out
there, but the lingua franca in scientific computing is MPI: the
Message-Passing Interface. It can be pretty low-level, but
there are a number of implementations that provide pretty good
performance.
Punchline
Message passing hides less than shared memory
But correctness is still subtle
Shared memory programming sometimes seems like magic pixie dust:
we take a serial code and sprinkle in some parallel constructs,
and we get out immediate speedup. Unfortunately, there are a
lot of subtleties in both the performance and correctness of
shared memory code. Codes written in a message-passing style
are often more verbose, and they have their own subtleties when
it comes to correctness. But it is sometimes easier to reason
about performance, because the communication events are
explicit, and not hidden behind an innocuous looking read from a
variable that happens to be stored in a remote shared memory.
Distributed memory machines
All right. We've talked about message-passing programming. Now
let's talk about distributed memory machines.
Distributed memory machines
Each node has local memory
... and no direct access to memory on other nodes
Nodes communicate via network interface
Example: most modern clusters!
In distributed memory machines, the hardware doesn't provide
direct support for memory on other nodes. Instead, nodes
communicate with each other by sending messages across a network
via a network interface card (or NIC). Most modern clusters are
distributed memory between nodes, but have a shared memory
architecture within a given node. We can mix-and-match these
architectural ideas.
Back of the envelope
c is about \(3 \times 10^8\) m/s.
One light-ns is about 0.3 m (about a foot)
A big machine is often over 300 feet across
May still be dominated by NIC latency (microseconds)
Communication across a big machine will always be order(s)-of-magnitude
slower than local memory access
Another reason locality matters!
Sending a message across the network can be a lot more expensive
than retrieving data from memory, even when there is a cache miss.
The problem is that one light nanosecond - or the distance that
light travels in one nanosecond - is about a foot. And big
supercomputers often have a space footprint the size of a football
field. That means that even without the overheads for routing
data through a network, simple speed-of-light delays might give us
a delay of something like 600 ns to send data across the machine
and back. That's significantly worse than fetching data from
DRAM! On top of that, we might spend on the order of microseconds
to get data through the NIC and onto the network. And we can do a lot
of flops in a couple microseconds!
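As a quick sanity check on those numbers, taking 300 feet to be roughly
90 meters:
\[
  t_{\text{round trip}} \approx \frac{2 \times 90\,\text{m}}{3 \times 10^8\,\text{m/s}}
  = 6 \times 10^{-7}\,\text{s} = 600\,\text{ns},
\]
which is already several DRAM accesses' worth of time before any NIC or
routing overhead is added.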
Paths to Parallel Performance
All right. We've talked about shared memory programming, shared
memory hardware, message passing, and distributed memory.
Correctness is hard in every case we've considered; performance is
harder still. Ideally, we'd like to write fast code with these
programming models and hardware platforms without having to master
every last bit of minutiae. So how can we do that?
Reminder: what do we want?
High-level: solve big problems fast
Start with good serial performance
Given \(p\) processors, could then ask for
Good speedup: \(p^{-1}\) times serial time
Good scaled speedup: \(p \times\) the work, same time
Easiest to get speedup from bad serial code!
First, let's be clear what we want. We want good scaling
in either the strong sense - where the problem stays the same
as the processor counts vary; or in the weak sense - where the
work per processor stays the same as the processor counts vary.
But good scaling is a hollow victory if we start with poor
single-core performance, so ideally we will start with a
well-tuned serial code and then parallelize.
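In symbols, using the standard definitions (writing \(T(p)\) for the
time on \(p\) processors):
\[
  \text{speedup}(p) = \frac{T(1)}{T(p)}, \qquad
  \text{efficiency}(p) = \frac{\text{speedup}(p)}{p},
\]
so ideal (linear) speedup is \(p\) and ideal efficiency is 1. Scaled
(weak) speedup instead asks whether a problem with \(p\) times the work
finishes in about the same time on \(p\) processors.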
The story so far
Parallel performance is limited by:
Single-core performance
Communication and synchronization costs
Non-parallel work (Amdahl)
Plan now: talk about how to overcome these limits for some types
of scientific applications
What limits the performance of parallel codes? As we just said,
one factor is single-core performance. But it also matters how
much time we spend on communication and synchronization;
and how much parallelism is in our workload. If much of the
workload is serial, we are not going to get great parallel speedups no
matter how many processors we might use!
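The limit from non-parallel work is the usual statement of Amdahl's
law: if a fraction \(s\) of the work is inherently serial, then
\[
  \text{speedup}(p) \le \frac{1}{s + (1-s)/p} \le \frac{1}{s},
\]
so even with an unlimited number of processors, the speedup can never
exceed \(1/s\).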
The key to getting lots of parallelism and minimizing
communication costs is to exploit parallelism and locality in
the problem. How this works varies from problem to problem, but
there are some common patterns that appear across many
scientific codes.
Parallelism and locality
Can get more parallelism / locality through model
Limited range of dependency between time steps
Can neglect or approximate far-field effects
Why do independent parallel work and local communication arise
naturally in so many simulations? It's really because a lot of
interactions in the physical world are local. For example, just
as there is a speed of light or sound in physics, there is a
rate at which information can travel across a computational mesh
in an explicit time stepper - and that rate is usually connected
to the rates in the physics problem. Also, when we look at
computations in which far-away things have influence, the
influence of those far-away things can often be approximated in
a pretty simple way. When we model gravity in the solar system,
we treat the planets as point masses; and if we are going to
compute the influence of distant star systems, we will probably
just treat them as a single point mass! This is a pretty big
simplification over modeling the gravitational attraction of the
matter as it is truly spread over space.
Parallelism and locality
Often get parallelism at multiple levels
Hierarchical circuit simulation
Interacting models for climate
Parallelizing individual experiments in MC or optimization
Moreover, there is often a natural partitioning or hierarchy in
how we treat models, and this structure helps us when it comes
time to find parallelism. We might be able to use parallel
resources to simulate parts of a circuit or different components
of a climate model, and then combine those parts later on
without too much communication.
Yes, talking about this will mean that I talk about the physics
and modeling in simulations. So this means that all of you who
like numerics and physics but don't know architecture get to
enjoy the change of pace, and those of you who knew about
computer architecture but know little of physics modeling will
get to learn something new!
Next up
Parallel patterns in simulation
Discrete events and particle systems
Differential equations (ODEs and PDEs)
So the plan for next week is to explore locality and parallelism
in different types of simulations. We will probably treat
discrete event simulations and particle systems on Tuesday, and
ordinary and partial differential equation models on Thursday.
But in this new world order, I can combine slide decks when it
makes sense, so you might also get everything in a giant deck on
Tuesday. We will see. Either way, until next time!