Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
Reviewed by Zhiyuan Chen and Yin Zhang, February 1998
Idea:
Scalable synchronization on shared-memory multiprocessors is crucial for performance.
Rather than using special-purpose hardware, busy-wait software algorithms can solve
this problem quite well.
The key to a scalable busy-wait algorithm is to minimize remote references, but the
performance of the different algorithms depends heavily on the underlying architecture.
Algorithms
Spin Locks: Basic Idea: keep spinning until the lock is acquired.
Simple Test_and_Set Lock Solution.
Only one global lock word; every processor repeatedly tries to test_and_set it.
Advantage: simplicity.
Disadvantage: Heavy traffic. No guarantee of fairness.
Some improvements over the basic algorithm: exponential backoff and randomization (a sketch follows below).
A similar scheme is used in networking: CSMA.
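A minimal sketch of a test_and_set lock with randomized exponential backoff, written with C11 atomics; the type and constant names (tas_lock_t, the backoff bounds) are our own assumptions, not taken from the paper:

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct { atomic_flag held; } tas_lock_t;
    #define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

    static void pause_ns(long ns) {             /* sleep for a short interval */
        struct timespec ts = { 0, ns };
        nanosleep(&ts, NULL);
    }

    void tas_acquire(tas_lock_t *l) {
        unsigned delay = 64;                    /* initial backoff bound in ns */
        /* test_and_set returns the previous value: false means we got the lock */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
            pause_ns(rand() % delay);           /* randomization avoids lock-step retries */
            if (delay < (1u << 16))
                delay <<= 1;                    /* exponential growth, capped */
        }
    }

    void tas_release(tas_lock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }

Every retry of the test_and_set is a remote reference, which is why the backoff matters: it spaces out the misses that would otherwise saturate the interconnect. (rand() is used here only to keep the sketch short; a per-thread generator would be more appropriate.)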
Other Versions of Spin Locks:
Processors are more polite -- they wait in a queue. Each processor keeps
waiting until its turn comes (i.e., there is no one ahead of it in the queue).
Advantage: Less traffic. FIFO ordering of lock acquisitions.
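A hedged sketch of a list-based queue lock in the spirit of MCS, using C11 atomics; each processor spins only on a flag in its own queue node, so remote traffic is limited to the enqueue and the hand-off (names such as mcs_node_t are assumptions for this sketch):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;        /* successor in the waiting queue */
        atomic_bool locked;                     /* spun on locally by this waiter */
    } mcs_node_t;

    typedef _Atomic(mcs_node_t *) mcs_lock_t;   /* tail of the queue, NULL if free */

    void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        mcs_node_t *prev = atomic_exchange(lock, me);   /* fetch_and_store on the tail */
        if (prev != NULL) {                     /* someone ahead of us: link in and spin */
            atomic_store(&prev->next, me);
            while (atomic_load(&me->locked))
                ;                               /* local spinning only */
        }
    }

    void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;                         /* no one waiting: lock is now free */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                               /* a waiter swapped in; wait for it to link */
        }
        atomic_store(&succ->locked, false);     /* hand the lock to the next in line */
    }

The queue gives FIFO ordering, and because each waiter spins on its own node, the scheme stays scalable even under heavy contention.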
Barriers: Basic Idea: first announce your arrival, then keep waiting until everyone has
shown up.
Centralized Barriers. Use a counter: indicate your arrival by decrementing the counter.
Use a global flag to indicate that everyone has arrived; the processor that arrives last sets
the flag.
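A minimal sense-reversing centralized barrier sketch in C11, following the counter-plus-flag description above; the thread count NTHREADS is an assumption for the sketch:

    #include <stdatomic.h>

    #define NTHREADS 8                          /* number of participants (assumption) */

    static atomic_int count = NTHREADS;         /* arrivals still outstanding */
    static atomic_int sense = 0;                /* global flag flipped by the last arrival */
    static _Thread_local int local_sense = 0;   /* per-thread copy of the expected sense */

    void central_barrier(void) {
        local_sense = !local_sense;             /* each episode expects the opposite sense */
        if (atomic_fetch_sub(&count, 1) == 1) { /* we were the last to arrive */
            atomic_store(&count, NTHREADS);     /* reset the counter for the next episode */
            atomic_store(&sense, local_sense);  /* release everyone */
        } else {
            while (atomic_load(&sense) != local_sense)
                ;                               /* spin until the flag flips */
        }
    }

Sense reversal lets the same counter and flag be reused across episodes; note, though, that all the spinning hits the single flag, which is the scalability problem the later barriers avoid.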
Dissemination Barrier: Similar to an information-dissemination algorithm: in each round, every
processor signals one other processor, with the distance between partners doubling each round.
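A one-shot dissemination barrier sketch in C11 atomics; NTHREADS and ROUNDS = ceil(log2 NTHREADS) are assumptions, and the paper's reusable version additionally uses sense reversal and alternating flag parities, which this sketch omits:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 8
    #define ROUNDS   3                                  /* ceil(log2(NTHREADS)) */

    /* flags[i][k] is set by the thread that signals thread i in round k */
    static atomic_bool flags[NTHREADS][ROUNDS];

    void dissemination_barrier(int me) {
        for (int k = 0; k < ROUNDS; k++) {
            int partner = (me + (1 << k)) % NTHREADS;   /* signal thread me + 2^k */
            atomic_store_explicit(&flags[partner][k], true, memory_order_release);
            while (!atomic_load_explicit(&flags[me][k], memory_order_acquire))
                ;                                       /* wait for thread me - 2^k */
        }
    }

After log2(NTHREADS) rounds every thread has, directly or transitively, heard from every other thread, so leaving the loop implies that all threads have arrived.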
Other Versions of Barriers: In essence, the other versions (tree, combining tree, tournament) are
recursive versions of the centralized barrier.
Basic Design Technique: Divide and Conquer.
Performance comparison
The authors explored the performance of the spin lock and barrier algorithms on two kinds of
architecture: distributed shared memory with fetch_and_Φ instructions, and cache-coherent,
shared-bus multiprocessors.
Spin locks:
Scalability and induced network load: MCS and array-based queuing locks (on cache-coherent
machines) are best. Test_and_set with exponential backoff and the ticket lock with proportional
backoff also work well.
One-processor latency (no contention): test_and_set and ticket locks are small; the others are
reasonable.
Space requirements: Anderson's and the G&T (Graunke & Thakkar) array-based locks are prohibitive.
Fairness/sensitivity to preemption: MCS (needs compare_and_swap), the ticket lock, and
array-based queuing locks all guarantee FIFO ordering.
Required atomic operations: all require some sort of fetch_and_Φ.
Overall: MCS is most attractive; the ticket lock with proportional backoff is second if no
compare_and_swap is available; test_and_set with exponential backoff is also attractive if
processes may be preempted while spinning, if latency is important, or if fetch_and_Φ is lacking.
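For reference, a hedged sketch of the ticket lock with proportional backoff mentioned above, which needs only fetch_and_increment rather than compare_and_swap; the pause constant is an assumption standing in for the expected length of a critical section:

    #include <stdatomic.h>
    #include <time.h>

    typedef struct {
        atomic_uint next_ticket;                /* next ticket to hand out */
        atomic_uint now_serving;                /* ticket currently allowed in */
    } ticket_lock_t;

    #define PAUSE_PER_WAITER_NS 200             /* rough critical-section cost (assumption) */

    static void pause_ns(long ns) {
        struct timespec ts = { 0, ns };
        nanosleep(&ts, NULL);
    }

    void ticket_acquire(ticket_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next_ticket, 1);     /* fetch_and_increment */
        for (;;) {
            unsigned serving = atomic_load_explicit(&l->now_serving, memory_order_acquire);
            if (serving == my)
                return;
            /* back off in proportion to the number of holders still ahead of us */
            pause_ns((long)(my - serving) * PAUSE_PER_WAITER_NS);
        }
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
    }

Waiting in proportion to one's position in line keeps the lock word from being polled constantly, which is what makes the ticket lock competitive without compare_and_swap.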
Barriers:
Space: the centralized barrier is constant, tree-based and combining-tree barriers are linear,
and the dissemination and tournament barriers are O(P log P).
Number of network transactions: tournament and tree-based barriers are O(P), dissemination is
O(P log P); on machines with broadcast-based coherent caches, centralized and combining-tree
barriers are also O(P), but unbounded without such caches.
Length of critical path: if network transactions can proceed in parallel, all are O(log P)
except the centralized barrier, which is O(P). If not, the critical path approaches the total
number of network transactions.
Instructions required: centralized and combining-tree barriers need atomic increment and
decrement; the others need only atomic reads and writes.
Overall:
For distributed memory without broadcast, the dissemination barrier is best: since its
transactions can proceed in parallel, its critical path is only log P, versus log P + log_4 P
for the tree barrier and 2 log P for the tournament barrier. The tree algorithm, however,
uses less space.
For machines with broadcast-based coherent caches, the tree-based barrier with a central wakeup
flag (rather than a wakeup tree) is fastest: it requires only O(P) simple writes. A centralized
barrier may also be used.
When the number of processors changes dynamically, the centralized barrier needs no change to
its internal organization.
The authors also argue that dance-hall machines are a poor design, since they lack coherent
caches (or local memory to spin on), and that special-purpose synchronization hardware is not
cost-effective.