CS 5220
Distributed memory
Modeling message costs
06 Oct 2015
Basic questions
- How much does a message cost?
- Latency: time for a message to get between processors
- Bandwidth: data transferred per unit time
- How does contention affect communication?
- This is a combined hardware-software question!
- Goal: understand just enough to model roughly
Conventional wisdom
- Roughly constant latency (?)
- Wormhole (or cut-through) routing flattens latencies vs store-and-forward at the hardware level
- Software stack dominates HW latency!
- Latencies are not the same across networks (within a box vs across boxes)
- May also have store-and-forward at the library level
- Avoid topology-specific optimization
- Want code that runs on next year’s machine, too!
- Bundle topology awareness in vendor MPI libraries?
- Sometimes specify a software topology
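As one concrete way to specify a software topology (a minimal sketch using MPI's virtual-topology interface, not code from this class), describe a 2D periodic process grid and let the library reorder ranks to match the hardware:

    #include <mpi.h>

    /* Sketch: build a 2D periodic process grid as an MPI virtual topology.
     * reorder = 1 allows the library to renumber ranks for better placement. */
    void make_grid(MPI_Comm comm, MPI_Comm *grid_comm)
    {
        int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
        MPI_Comm_size(comm, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);   /* choose a near-square grid */
        MPI_Cart_create(comm, 2, dims, periods, 1, grid_comm);
    }

Whether the vendor MPI actually uses this hint for rank placement varies by implementation.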
α-β model
Crudest model: t_comm = α + βM
- t_comm = communication time
- α = latency
- β = inverse bandwidth
- M = message size
Works pretty well for basic guidance!
Typically α ≫ β ≫ t_flop. More money on the network, lower α.
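A rough worked example of the model (the α and β values below are invented for illustration, not Totient measurements):

    #include <stdio.h>

    /* Predicted transfer time under the alpha-beta model: t = alpha + beta*M */
    double t_comm(double alpha, double beta, double M)
    {
        return alpha + beta * M;
    }

    int main(void)
    {
        double alpha = 1e-6;  /* hypothetical latency: 1 microsecond             */
        double beta  = 1e-9;  /* hypothetical inverse bandwidth: 1 ns/B (1 GB/s) */
        printf("8 B:  %g s\n", t_comm(alpha, beta, 8));    /* ~ alpha: latency-bound    */
        printf("1 MB: %g s\n", t_comm(alpha, beta, 1e6));  /* ~ beta*M: bandwidth-bound */
        return 0;
    }

With these numbers the crossover between latency-bound and bandwidth-bound messages sits near M ≈ α/β = 1000 bytes.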
LogP model
Like α-β, but includes CPU time on send/recv:
- Latency: the usual
- Overhead: CPU time to send/recv
- Gap: min time between send/recv
- P: number of processors
Assumes small messages (for a fixed message size, the gap corresponds to inverse bandwidth).
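A short worked cost under the usual LogP accounting (L = latency, o = overhead, g = gap, assuming g ≥ o):
- One small message, one way: t ≈ L + 2o (send overhead, then network latency, then receive overhead)
- k back-to-back small messages from one sender: t ≈ L + 2o + (k − 1)g, so the gap, not the latency, limits the sustained message rate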
Communication costs
Some basic goals:
- Prefer larger to smaller messages (amortize latency over more data)
- Avoid communication when possible
- Great speedup for Monte Carlo and other embarrassingly parallel
codes!
- Overlap communication with computation
- Models tell you how much computation is needed to mask
communication costs.
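A minimal sketch of the overlap pattern with nonblocking MPI calls (the function and buffer names and the placeholder computation are illustrative; whether the transfer actually progresses in the background depends on the MPI implementation):

    #include <mpi.h>

    /* Sketch: start the exchange, do independent work, then wait.
     * Per the alpha-beta model, the local work should take at least about
     * alpha + beta * N * sizeof(double) seconds to fully hide the message. */
    void exchange_and_compute(double *sendbuf, double *recvbuf, double *local,
                              int N, int partner, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, comm, &reqs[1]);

        /* Work that touches neither sendbuf nor recvbuf (placeholder) */
        for (int i = 0; i < N; ++i)
            local[i] = 2.0 * local[i];

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* recvbuf is now safe to read; sendbuf is safe to overwrite. */
    }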
Intel MPI on Totient
- Two 6-core chips per node, eight nodes
- Heterogeneous network:
- Ring between cores
- Bus between chips
- Gigabit Ethernet between nodes
- Test ping-pong
- Between cores on same chip
- Between chips on same node
- Between nodes
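A minimal ping-pong sketch (not the course's benchmark code; the message size and trial count are placeholders, and which cores the two ranks land on is controlled by the launcher's pinning/placement options):

    #include <mpi.h>
    #include <stdio.h>

    /* Ranks 0 and 1 bounce a message back and forth; half the average
     * round-trip time estimates alpha + beta*nbytes for that link. */
    int main(int argc, char **argv)
    {
        int rank, nbytes = 1 << 10, ntrials = 1000;
        char buf[1 << 10] = {0};       /* buffer size matches nbytes */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ntrials; ++i) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%d bytes: %g s one-way\n", nbytes, (t1 - t0) / (2.0 * ntrials));
        MPI_Finalize();
        return 0;
    }

Sweeping the message size and fitting a line to the one-way times gives estimates of α (intercept) and β (slope) for each link type.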
Approximate α-β parameters (on node)

Approximate α-β parameters (cross-node)

Network model

- On-chip: α-β works well!
- Off-chip: Not so much
- But cross-node communication is clearly expensive!
Moral
Not all links are created equal!
- Might handle with mixed paradigm
- OpenMP on node, MPI across
- Have to worry about thread-safety of MPI calls (see the MPI_Init_thread sketch below)
- Can handle purely within MPI
- Can ignore the issue completely?
For today, we’ll take the last approach.
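For reference (we ignore it today), a minimal sketch of how a hybrid OpenMP + MPI code typically requests thread support, assuming only the master thread makes MPI calls (MPI_THREAD_FUNNELED):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI library lacks the requested thread support\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... OpenMP parallel regions for on-node work; MPI calls only from
         *     the master thread (outside parallel regions or in omp master) ... */

        MPI_Finalize();
        return 0;
    }

Requesting more support than the code needs (e.g., MPI_THREAD_MULTIPLE) can cost performance with some MPI implementations.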