CS 5220: Applications of Parallel Computers
Parallel machines and models
10 Sep 2015
Why clusters?
- Clusters of SMPs are everywhere
- Commodity hardware – economics!
- Supercomputer = cluster + custom interconnect
- Relatively simple to set up and administer (?)
- But still costs room, power, ...
- Economy of scale ⟹ clouds?
- Amazon now has HPC instances on EC2
- StarCluster: launch your own EC2 cluster
- Lots of interesting challenges here
Totient structure
Consider:
- Each core has vector parallelism
- Each chip has six cores and shares memory with the others
- Each accelerator has sixty cores with shared memory
- Each box has two chips + accelerators
- Eight instructional nodes communicate via Ethernet
How did we get here? Why this type of structure? And how does the
programming model match the hardware?
Parallel computer hardware
- Physical machine has processors, memory, and interconnect
- Where is memory physically?
- Is it attached to processors?
- What is the network connectivity?
- Programming model through languages, libraries.
- Programming model ≠ hardware organization!
- Can run MPI on a shared memory node!
Parallel programming model
- Control
- How is parallelism created?
- What ordering is there between operations?
- Data
- What data is private or shared?
- How is data logically shared or communicated?
- Synchronization
- What operations are used to coordinate?
- What operations are atomic?
- Cost: how do we reason about each of above?
Simple example
Consider dot product of x and y.
- Where do arrays x and y live? One CPU? Partitioned?
- Who does what work?
- How do we combine to get a single final result?
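For reference, the serial version is a single loop; the parallel variants below distribute this loop across processors and then combine the partial results. A minimal sketch in C (names are illustrative, not from the course materials):

/* Serial baseline: one loop over all n elements. */
double dot(int n, const double* x, const double* y)
{
    double s = 0;
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}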
Shared memory programming model
Program consists of threads of control.
- Can be created dynamically
- Each has private variables (e.g. local)
- Each has shared variables (e.g. heap)
- Communication through shared variables
- Coordinate by synchronizing on variables
- Examples: OpenMP, pthreads
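A minimal OpenMP sketch of the model (illustrative, not from the course materials): the pragma creates a team of threads dynamically, tid is private to each thread, and nshared is shared by all of them.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int nshared = 100;                     /* Shared: one copy, seen by every thread */
    #pragma omp parallel                   /* Create a team of threads dynamically */
    {
        int tid = omp_get_thread_num();    /* Private: each thread has its own tid */
        printf("Thread %d sees nshared = %d\n", tid, nshared);
    }
    return 0;
}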
Shared memory dot product
Dot product of two n-vectors on p ≪ n processors:
- Each CPU evaluates partial sum (n/p elements, local)
- Everyone tallies partial sums
Can we go home now?
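Concretely, a hedged OpenMP sketch of this scheme (compile with OpenMP enabled; names are illustrative): each thread accumulates its chunk into a private partial_sum, then adds into the shared S. As written, that last update is unprotected – which is exactly the race condition discussed next.

double shared_dot(int n, const double* x, const double* y)
{
    double S = 0;                            /* Shared accumulator */
    #pragma omp parallel
    {
        double partial_sum = 0;              /* Private partial sum */
        #pragma omp for
        for (int i = 0; i < n; ++i)          /* ~n/p iterations per thread */
            partial_sum += x[i] * y[i];
        S += partial_sum;                    /* Unprotected update of shared S! */
    }
    return S;
}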
Race condition
A race condition:
- Two threads access the same variable, and at least one access is a write.
- Accesses are concurrent – no ordering guarantees
- Could happen simultaneously!
Need synchronization via lock or barrier.
Race to the dot
Consider S += partial_sum on 2 CPUs:
- P1: Load S
- P1: Add partial_sum
- P2: Load S
- P1: Store new S
- P2: Add partial_sum
- P2: Store new S
Shared memory dot with locks
Solution: treat S += partial_sum as a critical section
- Only one CPU at a time allowed in critical section
- Can violate invariants locally
- Enforce via a lock or mutex (mutual exclusion variable)
Dot product with mutex:
- Create global mutex l
- Compute partial_sum
- Lock l
- S += partial_sum
- Unlock l
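One possible realization with OpenMP locks (a pthread mutex or #pragma omp critical would work equally well; this is a sketch, not the course's reference code):

#include <omp.h>

double shared_dot_locked(int n, const double* x, const double* y)
{
    double S = 0;
    omp_lock_t l;
    omp_init_lock(&l);                       /* Create global mutex l */
    #pragma omp parallel
    {
        double partial_sum = 0;
        #pragma omp for
        for (int i = 0; i < n; ++i)
            partial_sum += x[i] * y[i];      /* Compute partial_sum */
        omp_set_lock(&l);                    /* Lock l */
        S += partial_sum;                    /* Critical section */
        omp_unset_lock(&l);                  /* Unlock l */
    }
    omp_destroy_lock(&l);
    return S;
}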
Shared memory with barriers
- Many codes have phases (e.g. time steps)
- Communication only needed at end of phases
- Idea: synchronize on end of phase with barrier
- More restrictive (less efficient?) than small locks
- Easier to think through! (e.g. less chance of deadlocks)
- Sometimes called bulk synchronous programming
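A sketch of the bulk-synchronous pattern in OpenMP (the averaging update is only a placeholder, not from the slides; compile with OpenMP enabled): each time step has a compute phase and a publish phase, and barriers keep any thread from racing ahead into the next phase.

void relax(int nsteps, int n, double* u, double* unew)
{
    #pragma omp parallel
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma omp for nowait
            for (int i = 1; i < n-1; ++i)
                unew[i] = 0.5 * (u[i-1] + u[i+1]);   /* Compute phase */

            #pragma omp barrier    /* End of phase: all of unew is now written */

            #pragma omp for nowait
            for (int i = 1; i < n-1; ++i)
                u[i] = unew[i];                      /* Publish phase */

            #pragma omp barrier    /* Synchronize before the next time step */
        }
    }
}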
Shared memory machine model
- Processors and memories talk through a bus
- Symmetric Multiprocessor (SMP)
- Hard to scale to lots of processors (think ≤32)
- Bus becomes bottleneck
- Cache coherence is a pain
- Example: 6-core chips on cluster
Multithreaded processor machine
- May have more threads than processors!
- Can switch threads on long latency ops
- Cray MTA was an extreme example
- Similar to hyperthreading
- But hyperthreading doesn’t switch – it just schedules multiple threads onto the same CPU functional units
Distributed shared memory
- Non-Uniform Memory Access (NUMA)
- Can logically share memory while physically distributing it
- Any processor can access any address
- Cache coherence is still a pain
- Example: SGI Origin (or multiprocessor nodes on cluster)
Message-passing programming model
- Collection of named processes
- Data is partitioned
- Communication by send/receive of explicit messages
- Lingua franca: MPI (Message Passing Interface)
Message passing dot product: v1
Step  Processor 1           Processor 2
1     Partial sum s1        Partial sum s2
2     Send s1 to P2         Send s2 to P1
3     Receive s2 from P2    Receive s1 from P1
4     s = s1 + s2           s = s1 + s2
What could go wrong? Think of phones vs letters...
Message passing dot product: v2
Step  Processor 1           Processor 2
1     Partial sum s1        Partial sum s2
2     Send s1 to P2         Receive s1 from P1
3     Receive s2 from P2    Send s2 to P1
4     s = s1 + s2           s = s1 + s2
Better, but what if more than two processors?
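A hedged MPI sketch (function names are illustrative): the first routine is v2 for exactly two ranks, with rank 0 playing Processor 1; the second is the usual answer for general p – let a collective reduction do the combining, so the pairwise bookkeeping disappears.

#include <mpi.h>

/* v2 for exactly two ranks: rank 0 sends first, rank 1 receives first. */
double dot_v2_two_ranks(double partial, MPI_Comm comm)
{
    int rank;
    double other;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        MPI_Send(&partial, 1, MPI_DOUBLE, 1, 0, comm);
        MPI_Recv(&other, 1, MPI_DOUBLE, 1, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&other, 1, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, comm);
    }
    return partial + other;
}

/* For any number of ranks: every rank gets the global sum. */
double dot_allreduce(double partial, MPI_Comm comm)
{
    double s;
    MPI_Allreduce(&partial, &s, 1, MPI_DOUBLE, MPI_SUM, comm);
    return s;
}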
MPI: the de facto standard
- Pro: Portability
- Con: least-common-denominator of mid-80s message passing
The “assembly language” (or C?) of parallelism...
but, alas, assembly language can be high performance.
Distributed memory machines
- Each node has local memory
- ... and no direct access to memory on other nodes
- Nodes communicate via network interface
- Example: our cluster!
- Other examples: IBM SP, Cray T3E
The story so far
- Serial performance as groundwork
- Complicated function of architecture and memory
- Understand to design data and algorithms
- Parallel performance
- Serial issues + communication/synch overheads
- Limit: parallel work available (Amdahl)
- Also discussed serial architecture and some of the basics of
parallel machine models and programming models.
- Next: Parallelism and locality in simulations