Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
Reviewed by Zhiyuan Chen and Yin Zhang, February 1998
Idea:
Scalable synchronization on shared-memory multiprocessors is crucial for performance.
Rather than using special-purpose hardware, busy-wait software algorithms can solve
this problem quite well.
The key to a scalable busy-wait algorithm is to minimize remote references, but the
performance of the different algorithms depends heavily on the underlying architecture.
Algorithms
Spin Locks: Basic Idea: keep spinning until the lock is acquired.
Simple Test_and_Set Lock Solution.
Only one global lock word; every processor repeatedly tries to test_and_set it.
Advantage: simplicity.
Disadvantage: Heavy traffic. No guarantee of fairness.
Some improvements over the basic algorithm: exponential backoff and randomization (a sketch follows below).
A similar scheme is used in networking: CSMA.
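A minimal sketch of a test_and_set lock with randomized exponential backoff, written with C11 atomics; the type and constant names (tas_lock_t, the backoff bounds) are our own assumptions, not taken from the paper:

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct { atomic_flag held; } tas_lock_t;
    #define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

    static void pause_ns(long ns) {             /* sleep for a short interval */
        struct timespec ts = { 0, ns };
        nanosleep(&ts, NULL);
    }

    void tas_acquire(tas_lock_t *l) {
        unsigned delay = 64;                    /* initial backoff bound in ns */
        /* test_and_set returns the previous value: false means we got the lock */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
            pause_ns(rand() % delay);           /* randomization avoids lock-step retries */
            if (delay < (1u << 16))
                delay <<= 1;                    /* exponential growth, capped */
        }
    }

    void tas_release(tas_lock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }

Every retry of the test_and_set is a remote reference, which is why the backoff matters: it spaces out the misses that would otherwise saturate the interconnect. (rand() is used here only to keep the sketch short; a per-thread generator would be more appropriate.)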
Other Versions of Spin Locks:
Processors are more polite -- they wait in a queue. Each processor keeps
waiting until its turn comes (i.e., there is no one ahead of it in the queue).
Advantage: Less traffic. FIFO ordering of lock acquisitions.
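A hedged sketch of a list-based queue lock in the spirit of MCS, using C11 atomics; each processor spins only on a flag in its own queue node, so remote traffic is limited to the enqueue and the hand-off (names such as mcs_node_t are assumptions for this sketch):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;        /* successor in the waiting queue */
        atomic_bool locked;                     /* spun on locally by this waiter */
    } mcs_node_t;

    typedef _Atomic(mcs_node_t *) mcs_lock_t;   /* tail of the queue, NULL if free */

    void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        mcs_node_t *prev = atomic_exchange(lock, me);   /* fetch_and_store on the tail */
        if (prev != NULL) {                     /* someone ahead of us: link in and spin */
            atomic_store(&prev->next, me);
            while (atomic_load(&me->locked))
                ;                               /* local spinning only */
        }
    }

    void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;                         /* no one waiting: lock is now free */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                               /* a waiter swapped in; wait for it to link */
        }
        atomic_store(&succ->locked, false);     /* hand the lock to the next in line */
    }

The queue gives FIFO ordering, and because each waiter spins on its own node, the scheme stays scalable even under heavy contention.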
Barriers: Basic Idea: first announce your arrival, then keep waiting until everyone has
shown up.
Centralized Barriers. Use a counter: indicate your arrival by decrementing the counter.
Use a global flag to indicate that everyone has arrived; the processor that arrives last sets
the flag.
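A minimal sense-reversing centralized barrier sketch in C11, following the counter-plus-flag description above; the thread count NTHREADS is an assumption for the sketch:

    #include <stdatomic.h>

    #define NTHREADS 8                          /* number of participants (assumption) */

    static atomic_int count = NTHREADS;         /* arrivals still outstanding */
    static atomic_int sense = 0;                /* global flag flipped by the last arrival */
    static _Thread_local int local_sense = 0;   /* per-thread copy of the expected sense */

    void central_barrier(void) {
        local_sense = !local_sense;             /* each episode expects the opposite sense */
        if (atomic_fetch_sub(&count, 1) == 1) { /* we were the last to arrive */
            atomic_store(&count, NTHREADS);     /* reset the counter for the next episode */
            atomic_store(&sense, local_sense);  /* release everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                               /* spin until the flag flips */
        }
    }

Sense reversal lets the same counter and flag be reused across episodes; note, though, that all the spinning hits the single flag, which is the scalability problem the later barriers avoid.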
Dissemination Barrier: Similar to an information-dissemination algorithm: in each round, every
processor signals one other processor, with the distance between partners doubling each round.
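A one-shot dissemination barrier sketch in C11 atomics; NTHREADS and ROUNDS = ceil(log2 NTHREADS) are assumptions, and the paper's reusable version additionally uses sense reversal and alternating flag parities, which this sketch omits:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 8
    #define ROUNDS   3                                  /* ceil(log2(NTHREADS)) */

    /* flags[i][k] is set by the thread that signals thread i in round k */
    static atomic_bool flags[NTHREADS][ROUNDS];

    void dissemination_barrier(int me) {
        for (int k = 0; k < ROUNDS; k++) {
            int partner = (me + (1 << k)) % NTHREADS;   /* signal thread me + 2^k */
            atomic_store_explicit(&flags[partner][k], true, memory_order_release);
            while (!atomic_load_explicit(&flags[me][k], memory_order_acquire))
                ;                                       /* wait for thread me - 2^k */
        }
    }

After log2(NTHREADS) rounds every thread has, directly or transitively, heard from every other thread, so leaving the loop implies that all threads have arrived.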
Other Versions of Barriers: In essence, the other versions (tree, combining tree, tournament) are
recursive versions of the centralized barrier.
Basic Design Technique: Divide and Conquer.
Performance comparison
The authors explored the performance of the spin lock and barrier algorithms on two kinds of
architecture: distributed shared memory with fetch_and_Φ instructions, and cache-coherent,
shared-bus multiprocessors.
Spin locks:
Scalability and induced network load: MCS and array-based queuing locks (on cache-coherent
machines) are best. Test_and_set with exponential backoff and the ticket lock with proportional
backoff also work well.
One-processor latency (no contention): test_and_set and ticket locks are small; the others are
reasonable.
Space requirements: Anderson's and the G&T (Graunke & Thakkar) array-based locks are prohibitive.
Fairness/sensitivity to preemption: MCS (needs compare_and_swap), the ticket lock, and
array-based queuing locks all guarantee FIFO ordering.
Required atomic operations: all require some sort of fetch_and_Φ.
Overall: MCS is most attractive; the ticket lock with proportional backoff is second if no
compare_and_swap is available; test_and_set with exponential backoff is also attractive if
processes may be preempted while spinning, if latency is important, or if fetch_and_Φ is lacking.
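For reference, a hedged sketch of the ticket lock with proportional backoff mentioned above, which needs only fetch_and_increment rather than compare_and_swap; the pause constant is an assumption standing in for the expected length of a critical section:

    #include <stdatomic.h>
    #include <time.h>

    typedef struct {
        atomic_uint next_ticket;                /* next ticket to hand out */
        atomic_uint now_serving;                /* ticket currently allowed in */
    } ticket_lock_t;

    #define PAUSE_PER_WAITER_NS 200             /* rough critical-section cost (assumption) */

    static void pause_ns(long ns) {
        struct timespec ts = { 0, ns };
        nanosleep(&ts, NULL);
    }

    void ticket_acquire(ticket_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);     /* fetch_and_increment */
        for (;;) {
            unsigned serving = atomic_load_explicit(&l->now_serving, memory_order_acquire);
            if (serving == my)
                return;
            /* back off in proportion to the number of holders still ahead of us */
            pause_ns((long)(my - serving) * PAUSE_PER_WAITER_NS);
        }
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
    }

Waiting in proportion to one's position in line keeps the lock word from being polled constantly, which is what makes the ticket lock competitive without compare_and_swap.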
Barriers:
Space: the centralized barrier is constant, tree-based and combining-tree barriers are linear,
and the dissemination and tournament barriers are O(P log P).
Number of network transactions: tournament and tree-based barriers are O(P), dissemination is
O(P log P); on machines with broadcast-based coherent caches, centralized and combining-tree
barriers are also O(P), but unbounded without such caches.
Length of critical path: if network transactions can proceed in parallel, all are O(log P)
except the centralized barrier, which is O(P). If not, the critical path approaches the total
number of network transactions.
Instructions required: centralized and combining-tree barriers need atomic increment and
decrement; the others need only atomic reads and writes.
Overall:
For distributed memory without broadcast, the dissemination barrier is best: since its
transactions can proceed in parallel, its critical path is only log P, versus log P + log_4 P
for the tree barrier and 2 log P for the tournament barrier. The tree algorithm, however,
uses less space.
For machines with broadcast-based coherent caches, the tree-based barrier with a central wakeup
flag (rather than a wakeup tree) is fastest: it requires only O(P) simple writes. A centralized
barrier may also be used.
When the number of processors changes dynamically, the centralized barrier needs no change to
its internal organization.
The authors also argue that dance-hall machines are a poor design, since they lack coherent
caches (or local memory to spin on), and that special-purpose synchronization hardware is not
cost-effective.