# CS 5220

## Applications of Parallel Computers

### Parallel Machines and Models

Prof David Bindel

### Parallel computer hardware

Have processors, memory, interconnect.

• Where is memory physically?
• Is it attached to processors?
• What is the network connectivity?

### Parallel programming model

Programming model through languages, libraries.

• What are the control mechanisms?
• What data semantics? Private, shared?
• What synchronization constructs?

For performance, need cost models!

### Simple example

```c
double dot(int n, double* x, double* y)
{
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```

### Simple example

```c
double pdot(int n, double* x, double* y)
{
    double s = 0;
    for (int p = 0; p < NUM_PROC; ++p) { // Loop to parallelize
        int i = p*n/NUM_PROC;            // start of processor p's block
        int inext = (p+1)*n/NUM_PROC;    // end of processor p's block
        double partial = dot(inext-i, x+i, y+i);
        s += partial;
    }
    return s;
}
```

### Simple example

How can we parallelize?

• Where do arrays $$x$$ and $$y$$ live? One CPU? Partitioned?
• Who does what work?
• How do we combine to get a single final result?

## Shared memory model

### Shared memory model

Program consists of threads of control.

• Can be created dynamically
• Each has private variables (e.g. local)
• Each has shared variables (e.g. heap)
• Communication through shared variables
• Coordinate by synchronizing on variables

### Shared memory dot product

Dot product of two $$n$$ vectors on $$p \ll n$$ processors:

1. Each CPU: partial sum ($$n/p$$ elements, local)
2. Everyone tallies partial sums

Can we go home now?
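Not yet. Here is a minimal sketch of that plan in POSIX threads (the globals, NTHREADS, and the thread-spawning boilerplate are illustrative, not from the slides); note which variables are shared and which are private:

```c
#include <pthread.h>

#define NTHREADS 4

int n;                 /* vector length (shared, read-only) */
double *x, *y;         /* input vectors (shared, read-only) */
double s = 0.0;        /* global sum (shared, read-write: trouble!) */

void* thread_dot(void* arg)   /* spawned once per thread via pthread_create */
{
    int tid = (int)(long) arg;
    int i0 = tid*n/NTHREADS, i1 = (tid+1)*n/NTHREADS;
    double partial = 0.0;     /* private: lives on this thread's stack */
    for (int i = i0; i < i1; ++i)
        partial += x[i] * y[i];
    s += partial;             /* unsynchronized update of shared s... */
    return NULL;
}
```

The last line is where the trouble lives.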

### Race condition

A race condition:

• Two threads access the same variable
• At least one access is a write
• Accesses are concurrent – no ordering guarantees
• Could happen simultaneously!

### Race to the dot

Consider S += partial on two CPUs. Each update is really a load, an add, and a store, and these can interleave:

```
P1: load S        (reads 0)
P2: load S        (reads 0)
P1: add partial1
P2: add partial2
P1: store S       (S = partial1)
P2: store S       (S = partial2; P1's update is lost!)
```

### Sequential consistency

• Idea: Looks like processors take turns, in order
• Convenient for thinking through correctness
• Really hard for performance!
• Will talk about memory models later
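A standard two-thread illustration (a sketch, not from the slides): under sequential consistency, if the reader sees flag set, it must also see the data written before it.

```c
#include <stdio.h>

int data = 0, flag = 0;    /* shared; both start at zero */

void writer(void)          /* runs on thread 1 */
{
    data = 42;             /* (1) */
    flag = 1;              /* (2) */
}

void reader(void)          /* runs on thread 2 */
{
    if (flag)                         /* (3) saw store (2)... */
        printf("data = %d\n", data);  /* (4) under SC, must print 42 */
}
```

Real compilers and CPUs may reorder (1)/(2) or (3)/(4), so without extra synchronization (4) can print 0; that is what makes sequential consistency expensive to guarantee.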

### Shared memory dot with locks

Solution: consider S += partial_sum a critical section

• Only one CPU at a time allowed in critical section
• Can violate invariants locally
• Enforce via a lock or mutex

### Shared memory dot with locks

Dot product with mutex:

1. Create global mutex l
2. Compute partial_sum
3. Lock l
4. S += partial_sum
5. Unlock l
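In POSIX threads, steps 1 and 3-5 look like this (a sketch; partial_sum is computed as before):

```c
#include <pthread.h>

pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;  /* global mutex (step 1) */
double S = 0.0;                                 /* shared accumulator */

void add_partial(double partial_sum)
{
    pthread_mutex_lock(&l);     /* step 3: at most one thread gets past here */
    S += partial_sum;           /* step 4: the critical section */
    pthread_mutex_unlock(&l);   /* step 5: let the next thread in */
}
```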

### A problem

Processor 1:

1. Acquire lock 1
2. Acquire lock 2
3. Do something
4. Release locks

Processor 2:

1. Acquire lock 2
2. Acquire lock 1
3. Do something
4. Release locks

What if both processors execute step 1 simultaneously? Each then waits forever for the lock the other holds: deadlock!
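The same scenario as a pthreads sketch (the lock names are illustrative): if each thread grabs its first lock before the other grabs its second, both block forever.

```c
#include <pthread.h>

pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

void* proc1(void* arg)
{
    pthread_mutex_lock(&lock1);    /* step 1 */
    pthread_mutex_lock(&lock2);    /* step 2: blocks if proc2 holds lock2 */
    /* ... do something ... */
    pthread_mutex_unlock(&lock2);  /* release in reverse order */
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void* proc2(void* arg)
{
    pthread_mutex_lock(&lock2);    /* step 1 */
    pthread_mutex_lock(&lock1);    /* step 2: blocks if proc1 holds lock1 */
    /* ... do something ... */
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}
```

The standard fix is a global lock order: every thread acquires lock1 before lock2.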

### Shared memory with barriers

• Lots of scientific codes have phases (e.g. time steps)
• Communication only needed at end of phases
• Idea: synchronize on end of phase with barrier
• More restrictive (less efficient?) than fine-grained locks
• But easier to think through! (e.g. less chance of deadlock)
• Sometimes called bulk synchronous programming

### Dot with barriers

1. partial[threadid] = local partial sum
2. barrier
3. sum = sum(partial)
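A pthreads sketch of this scheme, reusing dot from the serial example (setup, spawning, and the barrier initialization are assumptions, not from the slides):

```c
#include <pthread.h>

#define NTHREADS 4

extern double dot(int n, double* x, double* y);  /* serial dot from earlier */
int n; double *x, *y;       /* shared inputs, set up before spawning */
double partial[NTHREADS];   /* one slot per thread: no write conflicts */
double sum = 0.0;
pthread_barrier_t barrier;  /* pthread_barrier_init(&barrier, NULL, NTHREADS) */

void* thread_dot(void* arg)
{
    int tid = (int)(long) arg;
    int i0 = tid*n/NTHREADS, i1 = (tid+1)*n/NTHREADS;
    partial[tid] = dot(i1-i0, x+i0, y+i0);  /* phase 1: local partial sums */
    pthread_barrier_wait(&barrier);         /* wait for everyone to finish */
    if (tid == 0)                           /* phase 2: one thread tallies */
        for (int p = 0; p < NTHREADS; ++p)
            sum += partial[p];
    return NULL;
}
```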

### Punchline

Shared memory correctness is hard

• Too little synchronization: races
• Too much synchronization: deadlock

And this is before we talk performance!

## Shared memory machines

### Uniform shared memory

• Processors and memories talk through a bus
• Symmetric Multiprocessor (SMP)
• Hard to scale to lots of processors (think $$\leq 32$$)
• Bus becomes bottleneck
• Cache coherence via snooping

### Multithreaded processor machines

• Idea: Switch threads on long-latency ops
• Cray MTA was an extreme example

### Distributed shared memory

• Non-Uniform Memory Access (NUMA)
• Memory logically shared, physically distributed
• Any processor can access any address
• Close accesses are faster than far accesses
• Cache coherence is still a pain
• Most big modern chips are NUMA
• Many-core accelerators tend to be NUMA as well

### Punchline

Shared memory is expensive!

• Uniform access means bus contention
• Non-uniform access scales better (but now access costs vary)
• Cache coherence is tricky
• May forgo sequential consistency for performance

## Message passing model

### Message-passing programming model

• Collection of named processes
• Data is partitioned
• Communication by send/receive of explicit message
• Lingua franca: MPI (Message Passing Interface)

### Message passing dot product: v1

Processor 1:

1. Partial sum s1
2. Send s1 to P2
3. Receive s2 from P2
4. s = s1 + s2

Processor 2:

1. Partial sum s2
2. Send s2 to P1
3. Receive s1 from P1
4. s = s1 + s2

What could go wrong? Think of phones vs letters... (with phone-like blocking sends, both processors wait for the other to pick up: deadlock)

### Message passing dot product: v2

Processor 1:

1. Partial sum s1
2. Send s1 to P2
3. Receive s2 from P2
4. s = s1 + s2

Processor 2:

1. Partial sum s2
2. Receive s1 from P1
3. Send s2 to P1
4. s = s1 + s2

Better, but what if more than two processors?
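With $$p > 2$$ processors, the pairwise exchange generalizes to a reduction, which MPI provides directly. A sketch (assuming MPI_Init has been called and each rank owns a contiguous slice of the vectors):

```c
#include <mpi.h>

/* Each rank computes a partial sum over its local slice (n_local
   elements), then all partials are combined across ranks. */
double pdot(int n_local, const double* x, const double* y)
{
    double partial = 0.0, s = 0.0;
    for (int i = 0; i < n_local; ++i)
        partial += x[i] * y[i];
    /* Sum the partials; every rank receives the global result. */
    MPI_Allreduce(&partial, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return s;
}
```

MPI_Allreduce also sidesteps the send/receive ordering pitfalls above.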

### MPI: the de facto standard

• Pro: Portability
• Con: least-common-denominator for mid-80s machines

The “assembly language” (or C?) of parallelism...
but, alas, assembly language can be high performance.

### Punchline

• Message passing hides less than shared memory
• But correctness is still subtle

## Distributed memory machines

### Distributed memory machines

• Each node has local memory
• Nodes communicate via network interface
• Example: most modern clusters!

### Back of the envelope

• c is about $$3 \times 10^8$$ m/s
• One light-ns is about 0.3 m
• A big machine is often over 300 feet (about 90 m) across
• May still be dominated by NIC latency (microseconds)
• Communication across a big machine will always be order(s)-of-magnitude slower than local memory access
• Another reason locality matters!
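Concretely, with the numbers above, one light-speed crossing of a 100 m machine costs

$$t = \frac{100\ \text{m}}{3 \times 10^8\ \text{m/s}} \approx 330\ \text{ns},$$

already a few times a local DRAM access (on the order of 100 ns), before adding any NIC latency.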

## Paths to Parallel Performance

### Reminder: what do we want?

• High-level: solve big problems fast
• Given $$p$$ processors, could then ask for
• Good speedup: $$p^{-1}$$ times serial time
• Good scaled speedup: $$p \times$$ the work, same time
• Easiest to get speedup from bad serial code!
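In symbols (standard definitions, with $$T(p)$$ the time on $$p$$ processors):

$$\text{speedup}(p) = \frac{T(1)}{T(p)}, \qquad \text{efficiency}(p) = \frac{T(1)}{p \, T(p)}.$$

Ideal speedup is $$p$$ (efficiency 1). The last point is the catch: a slow serial baseline inflates $$T(1)$$, so always compare against the best serial code.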

### The story so far

Parallel performance is limited by:

• Single-core performance
• Communication and synchronization costs
• Non-parallel work (Amdahl)

Plan now: talk about how to overcome these limits for some types of scientific applications

### Parallelism and locality

Can get more parallelism / locality through the model:

• Limited range of dependency between time steps
• Can neglect or approximate far-field effects

### Parallelism and locality

Often get parallelism at multiple levels

• Hierarchical circuit simulation
• Interacting models for climate
• Parallelizing individual experiments in Monte Carlo or optimization

### Next up

Parallel patterns in simulation

• Discrete events and particle systems
• Differential equations (ODEs and PDEs)