Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
Reviewed by Zhiyuan Chen and Yin Zhang, February 1998
Idea:
- Scalable synchronization on shared-memory multiprocessors is crucial for performance.
- Rather than using special-purpose hardware, busy-wait software algorithms can solve
this problem quite well.
- The key to a scalable busy-wait algorithm is to minimize remote references. But the
relative performance of the different algorithms depends heavily on the architecture.
Algorithms
- Spin Locks: Basic Idea: keep retrying until you get the lock.
- Simple Test_and_Set Lock Solution.
Only one global lock. Everyone tries to get it.
Advantage: simplicity.
Disadvantage: Heavy traffic. No guarantee of fairness.
Improvements over the basic algorithm: exponential backoff with randomness (a sketch
follows below).
A similar algorithm is used in networking: CSMA.
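A minimal C sketch of a test_and_set lock with randomized exponential backoff (C11
atomics; the backoff cap and delay loop are our own illustrative choices, not values
from the paper):

    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct { atomic_flag held; } tas_lock;  /* init: { ATOMIC_FLAG_INIT } */

    static void tas_acquire(tas_lock *l)
    {
        unsigned limit = 1;
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
            /* Lock was held: wait a random time, doubling the bound
             * (up to a cap) so retries back off under contention. */
            unsigned delay = 1 + rand() % limit;
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                   /* spin locally */
            if (limit < 1024)
                limit *= 2;
        }
    }

    static void tas_release(tas_lock *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }

The doubling bound keeps retry traffic low under contention, much as CSMA backs off on
a shared medium.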
- Other Versions of Spin Locks:
Processors are more polite -- they know they should wait in a queue. Each processor keeps
waiting until its turn comes (i.e., there is no one before it in the queue); see the
ticket-lock sketch after this list.
Advantage: Less traffic. FIFO ordering of lock acquisitions.
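A hedged sketch of the simplest queue-style lock, the ticket lock (names are
illustrative); handing out tickets with one fetch_and_increment per acquisition is what
gives the FIFO guarantee:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* next ticket to hand out */
        atomic_uint now_serving;   /* ticket currently allowed in */
    } ticket_lock;                 /* init: both fields 0 */

    static void ticket_acquire(ticket_lock *l)
    {
        /* fetch_and_increment: take a place in line (one atomic op). */
        unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                                memory_order_relaxed);
        /* Wait for our turn; proportional backoff would pause here for
         * roughly (me - now_serving) units instead of spinning tightly. */
        while (atomic_load_explicit(&l->now_serving,
                                    memory_order_acquire) != me)
            ;
    }

    static void ticket_release(ticket_lock *l)
    {
        atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
    }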
- Barriers: Basic Idea: First indicate your existence, then keep waiting until everyone has
shown up.
- Centralized Barriers. Use a counter. Indicate your existence by decrementing the counter.
Use a global flag to indicate that everyone has come. The processor who comes last sets
the flag (sketched below).
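A sketch of the centralized barrier, with the sense-reversing flag the paper's version
uses to make it reusable (P and the names are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P 8                            /* number of participants (illustrative) */

    static atomic_int  count = P;
    static atomic_bool sense = false;      /* global release flag */

    /* Each thread keeps its own local_sense, initially false. */
    void central_barrier(bool *local_sense)
    {
        *local_sense = !*local_sense;      /* phase we will wait for */
        if (atomic_fetch_sub(&count, 1) == 1) {
            /* Last arrival: reset the counter and release everyone. */
            atomic_store(&count, P);
            atomic_store(&sense, *local_sense);
        } else {
            while (atomic_load(&sense) != *local_sense)
                ;                          /* spin on the global flag */
        }
    }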
- Dissemination Barrier: Similar to an information-dissemination algorithm: in round k,
each processor signals the processor 2^k positions away; after ceil(log2 P) rounds
everyone has heard, directly or indirectly, from everyone (sketch below).
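A single-episode sketch of the dissemination barrier (a reusable version would add sense
reversal and alternating flag sets, as in the paper; P and LOGP are illustrative):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P    8                         /* participants (illustrative) */
    #define LOGP 3                         /* ceil(log2 P) rounds */

    static atomic_bool flags[P][LOGP];     /* flags[i][k]: i was signaled in round k */

    void dissemination_barrier(int me)
    {
        for (int k = 0; k < LOGP; k++) {
            /* Signal the processor 2^k positions away ... */
            int partner = (me + (1 << k)) % P;
            atomic_store(&flags[partner][k], true);
            /* ... then wait until someone signals us this round. */
            while (!atomic_load(&flags[me][k]))
                ;
        }
    }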
- Other Versions of Barriers: In essence, all the other versions are simply the recursive
version of the centralized barrier.
Basic Design Technique: Divide and Conquer.
Performance comparison
- The authors explored the performance of the spin lock and barrier algorithms on two kinds
of architecture: distributed-shared-memory machines with fetch_and_Φ instructions, and
cache-coherent, shared-bus multiprocessors.
- Spin locks:
- Scalability and induced network load: MCS and array-based queuing locks (on cache-coherent
machines) are best. Test_and_set with exponential backoff and the ticket lock with
proportional backoff also work well.
- One-processor latency (no contention): test_and_set and the ticket lock are small; the
others are reasonable.
- Space requirement: Anderson's and G&T's array-based locks are prohibitive (O(P) space
per lock).
- Fairness/sensitivity to preemption: MCS (needs compare_and_swap), the ticket lock, and
array-based queuing locks all guarantee FIFO ordering.
- Required atomic operations: all require some sort of fetch_and_Φ.
- Overall: MCS (sketched below) is most attractive; the ticket lock with proportional
backoff is second if compare_and_swap is unavailable; test_and_set with exponential
backoff is also attractive if processes may be preempted while spinning, if latency is
important, or if fetch_and_Φ is lacking.
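For reference, a sketch of the MCS lock itself, assuming C11 atomics; the key point is
that each waiter spins only on a flag in its own queue node, so spinning stays on local
memory:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool                locked;
    } mcs_node;

    typedef _Atomic(mcs_node *) mcs_lock;  /* tail of the waiter queue, init NULL */

    void mcs_acquire(mcs_lock *lock, mcs_node *me)
    {
        atomic_store(&me->next, NULL);
        /* fetch_and_store: append ourselves as the new tail. */
        mcs_node *pred = atomic_exchange(lock, me);
        if (pred != NULL) {
            /* Queue was non-empty: link in behind pred and spin on a
             * flag in our own node -- local spinning is the key. */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))
                ;
        }
    }

    void mcs_release(mcs_lock *lock, mcs_node *me)
    {
        mcs_node *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* No visible successor: compare_and_swap the tail back to
             * empty; this is why MCS needs compare_and_swap for FIFO. */
            mcs_node *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;
            /* A waiter is mid-enqueue; wait for it to link itself in. */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false); /* pass the lock on */
    }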
- Barriers:
- Space: the centralized barrier is constant; tree-based and combining-tree barriers are
linear; the dissemination and tournament barriers are O(P log P).
- Number of network transactions: tournament and tree-based barriers are O(P), dissemination
is O(P log P); on machines with broadcast-based coherent caches, the centralized and
combining-tree barriers are also O(P), but unbounded without such caches.
- Length of critical path: if network transactions can proceed in parallel, all are
O(log P) except the centralized barrier, which is O(P). If not, the critical path is close
to the total number of network transactions.
- Instructions required: the centralized and combining-tree barriers need atomic increment
and decrement; the others need only atomic read and write.
- Overall:
- For distributed memory without broadcast, dissemination is best, since its transactions
can proceed in parallel and its critical path is only log2 P; the tree barrier's is
log2 P + log4 P and the tournament barrier's 2 log2 P. But the tree algorithm uses less
space.
- For broadcast-based coherent-cache machines, the tree-based barrier with a wakeup flag
(rather than a wakeup tree) is fastest: only O(P) simple writes. A centralized barrier may
also be used.
- When the number of processors changes dynamically, the centralized barrier needs no
change to its internal organization.
- The authors also argue that Dance-Hall machines are bad since they have no coherent
caches, and that the use of special synchronization hardware is not cost-effective.
Question:
- Are these algorithms applicable in other areas?