CS 5220

Applications of Parallel Computers

Intro to Shared Memory

Prof David Bindel


Message passing pain

Common message passing pattern

  • Logical global structure
  • Local representation per processor
  • Local data may have redundancy
    • Example: Data in ghost cells
    • Example: Replicated book-keeping data

Message passing pain

Big pain point:

  • Many partly-overlapping representations
  • Need consistent picture across processes

Wouldn’t it be nice to have just one representation?

Shared memory vs message passing

  • Implicit comm via memory vs explicit messages
  • Still need separate global vs local picture?

Shared memory vs message passing

Still need separate global vs local picture?

  • No: One thread-safe data structure may be easier
  • Yes: More sharing can hurt performance
    • Synchronization costs even with no contention
    • Contention for locks reduces parallelism
    • Cache coherency slows even non-contending access

Shared memory vs message passing

Still need separate global vs local picture?

  • “Easy” approach: add multi-threading to serial code
  • Better performance: design like message-passing

Let’s dig a little deeper on the HW side

Memory model

  • Single processor: a read returns the value of the last write
    • What about DMA and memory-mapped I/O?
  • Simplest multiprocessor generalization: sequential consistency

Sequential consistency

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
– Lamport, 1979

Sequential consistency

Program behaves as if:

  • Each process runs in program order
  • Instructions from different processes are interleaved
  • Interleaved instructions ran on one processor

Example: Spin lock

Initially, flag = 0 and sum = 0

Processor 1:

sum += p1;
flag = 1;

Processor 2:

while (!flag);
sum += p2;

Example: Spin lock

Without sequential consistency support, what if

  1. Processor 2 caches flag?
  2. Compiler optimizes away loop?
  3. Compiler reorders assignments on P1?

Starts to look restrictive!

Sequential consistency

Program behavior is “intuitive”:

  • Nobody sees garbage values
  • Time always moves forward

One issue is cache coherence:

  • Coherence: different copies, same value
  • Requires (nontrivial) hardware support

Also an issue for optimizing compiler!

There are cheaper relaxed consistency models.

Snoopy bus protocol

Basic idea:

  • Broadcast operations on memory bus
  • Cache controllers “snoop” on all bus transactions
    • Memory writes induce serial order
    • Act to enforce coherence (invalidate, update, etc.)

Snoopy bus protocol

Problems:

  • Bus bandwidth limits scaling
  • Contending writes are slow

Have other protocol options (e.g. directory-based).
But usually give up on full sequential consistency.

Weakening sequential consistency

Try to pay only the true cost of sharing

  • volatile tells compiler: worry about sharing
  • Atomic operations do reads/writes as a single op
  • Memory fences tell when to force consistency
  • Synchronization ops (lock/unlock) include fences

Unprotected data races give undefined behavior.

Sharing

True sharing:

  • Frequent writes cause a bottleneck.
  • Idea: make independent copies (if possible).
  • Example problem: malloc/free data structure.

Sharing

False sharing:

  • Distinct variables on same cache block
  • Fix: keep each processor’s data contiguous and cache-line aligned (if possible)
  • Example problem: array of ints, one per processor

Take-home message

  • Sequentially consistent shared memory useful
    • “Natural” analogue to serial case
    • Architects work hard to support it
  • ... but implementation is costly!
    • Makes life hard for optimizing compilers
    • Coherence traffic slows things down
    • Helps to limit sharing

Think about these things to get good performance.