Applications of Parallel Computers
Parallel Machines and Models
Prof David Bindel
Please click the play button below.
Welcome to another edition of 5220. For the last few slide decks,
we've been talking about single core performance, and maybe you
thought "when do we get to the parallel case"? Well, we start
today. This is an overview lecture to set the stage; we'll get
into much more detail on the topics presented today in later
slide decks.
Parallel computer hardware
Have processors, memory, interconnect.
Where is memory physically?
Is it attached to processors?
What is the network connectivity?
So far, our discussions of parallel machines have mostly focused
on hardware. We've talked about peak flop rates and power usage,
and of course we've mentioned what types of processors these
machines use. But after our discussion of memory, maybe it won't
surprise you to learn that the positioning of memory and the way
we communicate with memory -- and with other processors -- plays
just as big a role in performance. In general, parallel
architecture is about dealing with these three components: the
individual processors, the memory, and the networks used to
connect everything. I'll sometimes say network fabric or
interconnect when referring to the network; these terms
basically all mean the same thing. The network is often the thing
that most distinguishes a supercomputer from a small cluster.
Parallel programming model
Programming model through languages, libraries.
What are the control mechanisms?
What data semantics? Private, shared?
What synchronization constructs?
For performance, need cost models!
This class is pretty low-level, but it's still worth
distinguishing between the hardware and the programming model --
that is, the abstractions we use for writing our parallel codes.
The model tells us how we initiate and control parallel jobs,
share data between processors, and synchronize the efforts of
different processors. The parallel programming models we'll
discuss are pretty close to the way that we think about certain
types of hardware, but they aren't identical. We can implement
shared memory programming abstractions even if we only have
hardware support for passing messages around, and we can implement
message passing on top of shared memory hardware. Indeed, these
can be really useful things to do! So it is worthwhile keeping
the programming abstraction distinct from the hardware in our
minds. Of course, if we want to think about performance as well
as correctness of our parallel codes, we need to have some
understanding of the hardware, too. At the very least, we need to
know enough to build cost models that will help us predict
performance and guide us toward good implementations.
Simple example
double dot(int n, double* x, double* y)
{
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
Examples always help to make things concrete. For this lecture,
our running example will be dot products. From our centroid
example, we know this naïve implementation of the dot product
might not be optimal on a single core; but this doesn't really
matter for our discussion today. What matters is that it is
pretty obvious how to split the dot product into independent
pieces of work that we could assign to different processors.
Simple example
double pdot(int n, double* x, double* y)
{
    double s = 0;
    for (int p = 0; p < NUM_PROC; ++p) { // Loop to parallelize
        int i = p*n/NUM_PROC;
        int inext = (p+1)*n/NUM_PROC;
        double partial = dot(inext-i, x+i, y+i);
        s += partial;
    }
    return s;
}
Well, let's be a little more explicit about how we might partition
the work. The idea is that we are going to split the big dot product
into smaller dot products, each about size n/p. Then we take the
partial dot products, and accumulate them into the total sum.
I've summarized the logic in the pdot code above, while leaving
vague how we would actually parallelize the main loop.
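To make that concrete, here is one possible way to parallelize the main
loop, previewing the OpenMP interface we'll cover in detail later. This
is a minimal sketch under a couple of assumptions: it reuses the serial
dot routine and the NUM_PROC constant from the slide, and it lets
OpenMP's reduction clause handle the final accumulation.

double pdot_omp(int n, double* x, double* y)
{
    double s = 0;
    // Each thread gets a private copy of s; OpenMP combines the
    // per-thread copies into the shared s at the end of the loop.
    #pragma omp parallel for reduction(+:s)
    for (int p = 0; p < NUM_PROC; ++p) {
        int i = p*n/NUM_PROC;
        int inext = (p+1)*n/NUM_PROC;
        s += dot(inext-i, x+i, y+i);
    }
    return s;
}

The reduction clause quietly takes care of exactly the questions we're
about to raise: who owns which data, and how the partial results get
combined.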
Simple example
How can we parallelize?
Where do arrays \(x\) and \(y\) live? One CPU? Partitioned?
Who does what work?
How do we combine to get a single final result?
Actually, there is a lot that goes into thinking about parallel
implementation for even this simple example. We've referred to
the arrays x and y, but in a parallel setting, there is a question
of where they live. Are they on the memory of a particular
processor? Does that concept even make sense? Then, of course,
there's the issue of who does what work. We suggested one way for
splitting up the sum, but there are other ways to do the split-up
as well. Finally, once each processor has done its work, we need
to combine the partial sums, which means some type of
communication.
OK. Let's turn now to a couple of different ways we might do this.
Shared memory model
The first programming model we'll discuss is shared memory.
Shared memory model
Program consists of threads of control.
Can be created dynamically
Each has private variables (e.g. local)
Each has shared variables (e.g. heap)
Communication through shared variables
Coordinate by synchronizing on variables
Examples: OpenMP, pthreads
In shared memory programming systems, like OpenMP or pthreads,
there are independent "threads" of execution that communicate
through a common memory space. A thread has its own program
counter and call stack, so typically would have some private
(stack) variables as well as accessing shared space. Threads
may correspond to physical processors, but they don't strictly
need to. In some systems, threads can be dynamically added or
removed; in others, there is a fixed pool of threads. The
tricky part of shared memory programming is synchronizing the
access to the shared space.
Shared memory dot product
Dot product of two \(n\) vectors on \(p \ll n\) processors:
Each CPU: partial sum (\(n/p\) elements, local)
Everyone tallies partial sums
Can we go home now?
In words, one shared memory or thread-based approach to dot
products involves each CPU taking a partial dot product, and
then adding that partial sum into a shared accumulator.
Of course, it can't be that easy...
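To see why, here is a deliberately naive sketch in an OpenMP flavor
(the thread-count logic and the pdot_racy name are just for
illustration). Each thread dots its own slice and then updates the
shared sum with no synchronization at all; that unprotected update is
exactly what the next slides pick apart.

#include <omp.h>

double pdot_racy(int n, double* x, double* y)
{
    double s = 0;   // shared accumulator
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int i = p*n/nthreads;
        int inext = (p+1)*n/nthreads;
        double partial = dot(inext-i, x+i, y+i);  // private partial sum
        s += partial;   // unsynchronized update of shared s: a race!
    }
    return s;
}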
Race condition
A race condition:
Two threads access same variable
At least one write.
Accesses are concurrent – no ordering guarantees
Could happen simultaneously!
The problem here is that each thread is reading and writing to
the same shared sum variable, but so far we haven't said
anything about synchronization. That's dangerous because it
sets us up for what is called a race condition. This is when
two threads are accessing the same variable, with at least one
write, concurrent access, and no ordering guarantees. Race
conditions lead to unpredictable variations in results.
Race to the dot
Consider S += partial on two CPUs
Let's consider what could go wrong with two processors trying to
simultaneously update the shared sum without synchronization.
Race to the dot
P1                P2
load S
add partial
                  load S
store S
                  add partial
                  store S
Let's consider what happens during an update, written out in
pseudo-assembly language. An update operation consists of three parts: first, we
load the current sum into a register; then we update with the
partial sum; and then we store back. If both processors read
before either one writes back the update, then part of the sum
will be lost. The thing that's really awful about this problem
is that it won't always happen! Whether we get the full sum, or
one part, or the other depends on how the reads and writes
interleave with each other, which is pretty unpredictable and
will vary from run to run.
Sequential consistency
Idea: Looks like processors take turns, in order
Convenient for thinking through correctness
Really hard for performance!
Will talk about memory models later
You might already find the explanation of the race condition on the
previous slide unintuitive. Unfortunately, it's even worse than
that. The explanation we gave implicitly assumes the idea of
sequential consistency: that is, the state of memory is consistent
with some serial interleaving of the instructions from the
different threads. In reality, though,
it's really hard to have sequential consistency and still get
good performance on modern machines. Some computer architects
still try, but for the most part we have to live with weaker
models of consistency, where even stranger things can happen
than what we described. We'll talk about these alternate models
of memory later in the class.
Shared memory dot with locks
Solution: consider S += partial_sum a critical section
Only one CPU at a time allowed in critical section
Can violate invariants locally
Enforce via a lock or mutex
How do we avoid this type of race condition? The problem we saw
really had to do with the fact that load, add, and store from
one thread could interleave with the same operations from
another thread. We can avoid that by requiring that each thread
get a lock, also called a mutual exclusion variable (or mutex),
before applying the update. This establishes what is called a
critical section - critical sections are parts of the program
that only one thread can enter at a time. We will talk about
locks and critical sections in more detail later.
Shared memory dot with locks
Dot product with mutex:
Create global mutex l
Compute partial_sum
Lock l
S += partial_sum
Unlock l
OK, let's sketch how this works. We start by creating a lock or
mutex variable. Only one thread can "hold" or "acquire" the
lock at a time. So for the dot product, we compute our partial
sum, acquire the lock, update the global sum, and release the
lock. This means each update is applied in sequence, without
interleaving. Of course, we don't know in what order the
partial sums will be accumulated, and that matters in floating
point, though only at the level of roundoff. So this code,
though correct, doesn't provide bitwise reproducibility of
results. As I might have already said, parallel execution is
subtle stuff!
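Here is a sketch of how that might look with pthreads, one of the
shared memory systems mentioned earlier. It assumes the partial sums
are computed elsewhere and shows only the protected update; the names
s, l, and accumulate are just for illustration.

#include <pthread.h>

static double s = 0;                                  // shared sum
static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER; // global mutex

void accumulate(double partial_sum)
{
    pthread_mutex_lock(&l);    // only one thread gets past this at a time
    s += partial_sum;          // critical section
    pthread_mutex_unlock(&l);  // release so the next thread can enter
}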
A problem
Processor 1:
Acquire lock 1
Acquire lock 2
Do something
Release locks
Processor 2:
Acquire lock 2
Acquire lock 1
Do something
Release locks
What if both processors execute step 1 simultaneously?
In the dot product example, we only need one lock, which we use
to protect accesses to the global sum. But what happens if we
need more than one lock because we want to compute more than one
thing? Let's consider the example above. If both threads are
able to execute the first step simultaneously, then we run into
trouble at the second step. The first thread holds lock 1 and
wants lock 2; and the second thread holds lock 2 and wants lock 1.
Nobody can make progress! This situation is called deadlock.
We'll talk about this more later; it turns out that there are
ways that we can ensure that we avoid deadlock, which those of
you who took an OS class probably studied already (and maybe
forgot!). But let's now briefly mention a synchronization approach
that will definitely not deadlock, and is really useful for lots
of scientific codes.
Shared memory with barriers
Lots of sci codes have phases (e.g. time steps)
Communication only needed at end of phases
Idea: synchronize on end of phase with barrier
More restrictive (less efficient?) than small locks
But easier to think through! (e.g. less chance of deadlocks)
Sometimes called bulk synchronous programming
In many scientific codes, and really in many codes in general,
the computation has natural phases. Within each phase, we can
do independent work; for correctness, we just need to ensure
that one phase is completely done before the next can start. A
barrier is a synchronization construct that does exactly this:
every computation on every thread before the barrier has to
finish before we start computations after the barrier. This is
not as flexible as fine-grain locking, but it's also less
difficult to reason about correctness of code written with
barriers. This style of programming, where we have phases
separated by barriers, is sometimes called bulk synchronous
programming (or BSP for short).
Dot with barriers
partial[threadid] = local partial sum
barrier
sum = sum(partial)
What does the dot product look like with barriers? A typical
organization might involve each thread writing a partial sum
into an array, then a barrier, and then each thread summing the
array to get a sum (in a private variable). If we wanted the sum to go
into a global space, we might have a designated thread copy the
result out.
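A minimal OpenMP-flavored sketch of this organization follows. The
MAX_THREADS bound and the routine name are assumptions for the sake of
illustration, and we again reuse the serial dot routine.

#include <omp.h>

#define MAX_THREADS 64  // assumed upper bound on the thread count

double pdot_barrier(int n, double* x, double* y)
{
    double partial[MAX_THREADS];  // shared array of partial sums
    double s = 0;
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int i = tid*n/nthreads;
        int inext = (tid+1)*n/nthreads;
        partial[tid] = dot(inext-i, x+i, y+i);  // phase 1: local work

        #pragma omp barrier  // wait until every partial sum is written

        double local = 0;    // phase 2: each thread tallies privately
        for (int p = 0; p < nthreads; ++p)
            local += partial[p];

        #pragma omp single
        s = local;           // one designated thread copies the result out
    }
    return s;
}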
Punchline
Shared memory correctness is hard
Too little synchronization: races
Too much synchronization: deadlock
And this is before we talk performance!
So far, we've talked mostly about correctness in the shared
memory model. And it's not that simple! We have to
synchronize to avoid data races, but too much synchronization
(done without care) might lead to deadlock. And none of this
has even touched on performance yet! But to say anything about
performance, we need to say a bit more about hardware.
Shared memory machines
So let's talk now about shared memory hardware.
Uniform shared memory
Processors and memories talk through a bus
Symmetric Multiprocessor (SMP)
Hard to scale to lots of processors (think \(\leq 32\))
Bus becomes bottleneck
Cache coherence via snooping
One of the most straightforward approaches to shared memory is
to attach processors and main memory to a common bus. The
caches live with the processors and not with the memory, so we
need a way to keep them consistent (or coherent) with the main
memory in the face of memory writes that other processors might
commit. A bus is a shared broadcast medium; only one core or
memory can send on the bus at a time, but anyone can hear what
is being sent. On the down side, the more processors and
memories are on the bus, the harder it is to share that
resource. On the up side, because everyone can see the bus, it
is possible to keep the caches consistent by "snooping" on
memory traffic sent across the bus.
Multithreaded processor machine
Maybe threads > processors!
Idea: Switch threads on long latency ops.
Called hyperthreading by Intel
Cray MTA was an extreme example
The shared memory interface also sometimes makes sense when
threads don't exactly correspond to processors. Usually, that
means having more threads than processors. When a thread has to
wait for the results of a memory read, disk write, or other
long-latency task, it can yield the processor for use by other
threads. Sometimes this is a purely software setup, and
sometimes it involves hardware support. On modern Intel chips,
this is called hyper-threading, and we mentioned it briefly in
our discussion of single-core architecture. But there are some
architectures that have gone way more extreme than what Intel
did. The Cray MTA machine took a really long time to access
memory (though that time was pretty uniform), and tried to hide
that latency with a *lot* of threads. Needless to say, it was
not easy to program. I had an amusing semester in graduate
school listening to one of my officemates swear at that
machine. Good times! And better him than me.
Distributed shared memory
Non-Uniform Memory Access (NUMA)
Memory logically shared, physically distributed
Any processor can access any address
Close accesses are faster than far accesses
Cache coherence is still a pain
Most big modern chips are NUMA
Many-core accelerators tend to be NUMA as well
When we have a lot of processors, we are generally forced to
move away from sending all data across a single bus. In this
case, there is a more complex network that connects the
processors and memories. Often, memories are physically located
together with processors, so each processor has "local" memory
and "remote" memory. These memories may all be accessed through
the same logical address space, but it takes different amounts
of time to read or write data depending on whether the read is
to local or remote memory addresses. This is called non-uniform
memory access, or NUMA.
NUMA systems scale to large numbers of cores much better than
uniform access (or "symmetric") multiprocessors. Most modern
big chips are NUMA, as are most many-core accelerators.
Unfortunately, it is harder to keep the caches in a NUMA chip
consistent with each other than it is in an SMP (though there
are mechanisms for this).
Punchline
Shared memory is expensive!
Uniform access means bus contention
Non-uniform access scales better
(but now access costs vary)
Cache coherence is tricky
May forgo sequential consistency for performance
So: shared memory hardware presents some challenges. If we want
uniform memory access, we need a beefy bus to connect the
processors and memories together, and contention for that bus
limits our performance. Giving up on uniform access means that
we can scale to more processors, but now the costs of accessing
memories vary - though maybe we are OK with this, as there is
already variation in access times due to the effect of the cache
system. Keeping the caches coherent with each other and with
main memory is a challenge, particularly in the non-uniform
case. There are clever solutions, but for the sake of
performance, we often don't seek to make those solutions
perfectly maintain sequential consistency.
Don't worry if you didn't catch all that. We'll go into it in
more detail when we talk about shared memory programming in
OpenMP in a couple weeks.
Message passing model
The major alternative to shared memory programming is
programming via message passing.
Message-passing programming model
Collection of named processes
Data is partitioned
Communication by send/receive of explicit message
Lingua franca: MPI (Message Passing Interface)
In the message-passing model of parallel programming, we have a
collection of named processes, each with private memory spaces.
The data for the program is partitioned across these private
memories; there is no global shared space. Communication
happens by explicitly sending and receiving messages between
processors. The standard approach to message-passing
parallelism in scientific computing is the Message Passing
Interface, or MPI.
Message passing dot product: v1
Processor 1:
Partial sum s1
Send s1 to P2
Receive s2 from P2
s = s1 + s2
Processor 2:
Partial sum s2
Send s2 to P1
Receive s1 from P1
s = s1 + s2
What could go wrong? Think of phones vs letters...
So let's talk about how parallel dot product might work with two
processors in a message-passing model. Each processor holds a
part of x and a part of y in its memory. The processor dots its
piece, then sends the partial sum to the other processor. Then
the other processor receives the outside partial sum, adds it to
the partial sum that it computed, and that gives the overall dot
product.
Alas, you knew it couldn't be quite that easy. The problem is
that sending a message *may* be less like a letter, and more
like placing a telephone call. You can't hang up the phone
while you are dialing and waiting to deliver the message! In a
world where there are no busy signals or answering machines, if
I call you at the same time that you call me, maybe we both
spend forever waiting for the other side to pick up. So this
code is prone to deadlock.
As it turns out, the system has buffers in it, and whether
sending a message is phone-call-like or letter-like depends on
the state of those buffers. So the same code may work most of
the time, but periodically deadlock because of the issue we
sketched above. This is pretty maddening to debug.
Message passing dot product: v1
Processor 1:
Partial sum s1
Send s1 to P2
Receive s2 from P2
s = s1 + s2
Processor 2:
Partial sum s2
Receive s1 from P1
Send s2 to P1
s = s1 + s2
Better, but what if more than two processors?
There's a way around the potential deadlock issue in the
previous example. If the first processor sends before
receiving, and the second processor receives before sending,
then we no longer have the possibility of deadlock. P1 sends
and P2 receives; then P2 sends and P1 receives. The order is
nailed down.
Of course, this business of swapping the sends and receives is
fine for two processors communicating with each other in a
fairly simple exchange. What if there are more processors, and
the communication between them is more complex?
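With many processors, we usually don't hand-code the sends and
receives at all; MPI provides collective operations for exactly this
pattern. Here is a minimal sketch, assuming each rank already holds
its local pieces of x and y (of length nlocal) and reusing the serial
dot routine.

#include <mpi.h>

double pdot_mpi(int nlocal, double* xlocal, double* ylocal)
{
    double partial = dot(nlocal, xlocal, ylocal);  // local partial sum
    double s;
    // Sum the partial results across all ranks; every rank gets the total.
    MPI_Allreduce(&partial, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return s;
}

The library is free to organize the reduction however it likes (for
example, as a tree), which is part of why collectives are usually
preferred over hand-rolled exchanges.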
MPI: the de facto standard
Pro: Portability
Con: least-common-denominator for mid 80s
The “assembly language” (or C?) of parallelism...
but, alas, assembly language can be high performance.
There are other message-passing programming environments out
there, but the lingua franca in scientific computing is MPI: the
Message-Passing Interface. It can be pretty low-level, but
there are a number of implementations that provide pretty good
performance.
Punchline
Message passing hides less than shared memory
But correctness is still subtle
Shared memory programming sometimes seems like magic pixie dust:
we take a serial code and sprinkle in some parallel constructs,
and we get out immediate speedup. Unfortunately, there are a
lot of subtleties in both the performance and correctness of
shared memory code. Codes written in a message-passing style
are often more verbose, and they have their own subtleties when
it comes to correctness. But it is sometimes easier to reason
about performance, because the communication events are
explicit, and not hidden behind an innocuous looking read from a
variable that happens to be stored in a remote shared memory.
Distributed memory machines
All right. We've talked about message-passing programming. Now
let's talk about distributed memory machines.
Distributed memory machines
Each node has local memory
... and no direct access to memory on other nodes
Nodes communicate via network interface
Example: most modern clusters!
In distributed memory machines, the hardware doesn't provide
direct support for memory on other nodes. Instead, nodes
communicate with each other by sending messages across a network
via a network interface card (or NIC). Most modern clusters are
distributed memory between nodes, but have a shared memory
architecture within a given node. We can mix-and-match these
architectural ideas.
Back of the envelope
c is about \(3 \times 10^8\) m/s.
One light-ns is about 0.3 m (about a foot)
A big machine is often over 300 feet across
May still be dominated by NIC latency (microseconds)
Communication across a big machine will always be order(s)-of-magnitude
slower than local memory access
Another reason locality matters!
Sending a message across the network can be a lot more expensive
than retrieving data from memory, even when there is a cache miss.
The problem is that one light nanosecond - or the distance that
light travels in one nanosecond - is about a foot. And big
supercomputers often have a space footprint the size of a football
field. That means that even without the overheads for routing
data through a network, simple speed-of-light delays might give us
a delay of something like 600 ns to send data across the machine
and back. That's significantly worse than fetching data from
DRAM! On top of that, we might spend on the order of microseconds
to get data through the NIC and onto the network. And we can do a lot
of flops in a couple microseconds!
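As a quick sanity check on those numbers, taking 300 feet to be roughly
90 meters:
\[
  t_{\text{round trip}} \approx \frac{2 \times 90\,\text{m}}{3 \times 10^8\,\text{m/s}}
  = 6 \times 10^{-7}\,\text{s} = 600\,\text{ns},
\]
which is already several DRAM accesses' worth of time before any NIC or
routing overhead is added.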
Paths to Parallel Performance
All right. We've talked about shared memory programming, shared
memory hardware, message passing, and distributed memory.
Correctness is hard in every case we've considered; performance is
harder still. Ideally, we'd like to write fast code with these
programming models and hardware platforms without having to master
every last bit of minutiae. So how can we do that?
Reminder: what do we want?
High-level: solve big problems fast
Start with good serial performance
Given \(p\) processors, could then ask for
Good speedup: \(p^{-1}\) times serial time
Good scaled speedup: \(p \times\) the work, same time
Easiest to get speedup from bad serial code!
First, let's be clear what we want. We want good scaling
in either the strong sense - where the problem stays the same
as the processor counts vary; or in the weak sense - where the
work per processor stays the same as the processor counts vary.
But good scaling is a hollow victory if we start with poor
single-core performance, so ideally we will start with a
well-tuned serial code and then parallelize.
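In symbols, using the standard definitions (writing \(T(p)\) for the
time on \(p\) processors):
\[
  \text{speedup}(p) = \frac{T(1)}{T(p)}, \qquad
  \text{efficiency}(p) = \frac{\text{speedup}(p)}{p},
\]
so ideal (linear) speedup is \(p\) and ideal efficiency is 1. Scaled
(weak) speedup instead asks whether a problem with \(p\) times the work
finishes in about the same time on \(p\) processors.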
The story so far
Parallel performance is limited by:
Single-core performance
Communication and synchronization costs
Non-parallel work (Amdahl)
Plan now: talk about how to overcome these limits for some types
of scientific applications
What limits the performance of parallel codes? As we just said,
one factor is single-core performance. But it also matters how
much time we spend on communication and synchronization;
and how much parallelism is in our workload. If much of the
workload is serial, we are not going to get great parallel speedups no
matter how many processors we might use!
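The limit from non-parallel work is the usual statement of Amdahl's
law: if a fraction \(s\) of the work is inherently serial, then
\[
  \text{speedup}(p) \le \frac{1}{s + (1-s)/p} \le \frac{1}{s},
\]
so even with an unlimited number of processors, the speedup can never
exceed \(1/s\).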
The key to getting lots of parallelism and minimizing
communication costs is to exploit parallelism and locality in
the problem. How this works varies from problem to problem, but
there are some common patterns that appear across many
scientific codes.
Parallelism and locality
Can get more parallelism / locality through model
Limited range of dependency between time steps
Can neglect or approximate far-field effects
Why do independent parallel work and local communication arise
naturally in so many simulations? It's really because a lot of
interactions in the physical world are local. For example, just
as there is a speed of light or sound in physics, there is a
rate at which information can travel across a computational mesh
in an explicit time stepper - and that rate is usually connected
to the rates in the physics problem. Also, when we look at
computations in which far-away things have influence, the
influence of those far-away things can often be approximated in
a pretty simple way. When we model gravity in the solar system,
we treat the planets as point masses; and if we are going to
compute the influence of distant star systems, we will probably
just treat them as a single point mass! This is a pretty big
simplification over modeling the gravitational attraction of the
matter as it is truly spread over space.
Parallelism and locality
Often get parallelism at multiple levels
Hierarchical circuit simulation
Interacting models for climate
Parallelizing individual experiments in MC or optimization
Moreover, there is often a natural partitioning or hierarchy in
how we treat models, and this structure helps us when it comes
time to find parallelism. We might be able to use parallel
resources to simulate parts of a circuit or different components
of a climate model, and then combine those parts later on
without too much communication.
Yes, talking about this will mean that I talk about the physics
and modeling in simulations. So this means that all of you who
like numerics and physics but don't know architecture get to
enjoy the change of pace, and those of you who knew about
computer architecture but know little of physics modeling will
get to learn something new!
Next up
Parallel patterns in simulation
Discrete events and particle systems
Differential equations (ODEs and PDEs)
So the plan for next week is to explore locality and parallelism
in different types of simulations. We will probably treat
discrete event simulations and particle systems on Tuesday, and
ordinary and partial differential equation models on Thursday.
But in this new world order, I can combine slide decks when it
makes sense, so you might also get everything in a giant deck on
Tuesday. We will see. Either way, until next time!