Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors

(a.k.a. Everything You Ever Wanted to Know about Spin Locks and Barriers)

Revised Notes by Vicky Weissman, March 1999
Original Notes by Zhiyuan Chen and Yin Zhang, February 1998


Basic Definitions

  • A spin lock guarantees that no two processes execute the same critical-section code simultaneously.
  • A barrier requires that all processes reach a particular point before any process continues past that point.
Key Concepts
  • The poor performance of spin locks and barriers significantly affects the overall performance of shared-memory multiprocessors.
  • Special-purpose hardware is not necessary. By minimizing the number of remote references, software can reduce synchronization contention to effectively zero.
  • The software must be designed to complement the specific hardware architecture.
Algorithms
Spin Locks
  • Each process repeatedly executes a Test_and_Set instruction to query a global flag and, if possible, set the flag. When a process succeeds in setting the flag, it holds the lock. Optimizations tinker with when the polling occurs (e.g., test-and-test-and-set or exponential backoff), but the approach is fundamentally unfair and generates heavy traffic and contention; a minimal sketch appears after this list.
  • Each process waits its turn to acquire the lock by holding a ticket or a slot in an array. The MCS lock uses a linked list instead of an array, so only a small, constant amount of space is needed per lock and coherent caching is not required for good performance. These approaches create less contention than Test_and_Set and also guarantee fairness; a sketch of the MCS lock also appears below.
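
A minimal sketch of the basic test-and-set lock, written with C11 atomics rather than the paper's pseudocode; the names tas_acquire/tas_release are illustrative only:

    #include <stdatomic.h>

    typedef atomic_flag tas_lock_t;        /* initialise with ATOMIC_FLAG_INIT */

    void tas_acquire(tas_lock_t *lock) {
        /* atomic_flag_test_and_set returns the previous value; keep trying
           until we are the process that flips the flag from clear to set */
        while (atomic_flag_test_and_set(lock))
            ;   /* every failed attempt touches the shared flag -> heavy contention */
    }

    void tas_release(tas_lock_t *lock) {
        atomic_flag_clear(lock);
    }

The spin loop hammers the single shared flag on every iteration, which is exactly the traffic the paper is trying to eliminate; backoff variants only change how often that polling happens.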

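A sketch of the MCS list-based queuing lock in the same style, again assuming C11 atomics; each process supplies its own queue node (mcs_node_t here), so the lock itself is just one tail pointer:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool                locked;
    } mcs_node_t;

    typedef _Atomic(mcs_node_t *) mcs_lock_t;   /* tail of the queue, NULL when free */

    void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
        atomic_store(&me->next, NULL);
        mcs_node_t *pred = atomic_exchange(lock, me);   /* swap ourselves in as tail */
        if (pred != NULL) {
            /* queue was non-empty: link behind the predecessor and spin locally */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))
                ;   /* spin only on our own node -> no shared hot spot */
        }
    }

    void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no visible successor: try to swing the tail back to NULL */
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;                              /* lock is now free */
            /* a successor is mid-enqueue; wait for it to link itself in */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);          /* hand the lock straight to it */
    }

Because each waiter spins on a flag in its own node, the only shared writes are one swap on arrival and one store on hand-off, which is why the lock scales even without coherent caching.
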
Barriers

  • When a process arrives at the barrier, it decrements a global counter. The process that reaches the barrier last (i.e., decrements the counter to 0) resets the counter and flips a global boolean to allow the other processes to continue; see the sketch after this list.
  • Processes pair up, synchronizing with a series of partners (the dissemination barrier).
  • Processes pair up in a tournament where, after both partners reach the barrier, one of them advances to the next round.
  • Each process is a node in a tree. When the process and its children have reached the barrier, the process's parent is informed. When all processes have reached the barrier, the go-ahead is passed back down from parents to children.
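
A minimal sketch of the centralized counter barrier from the first bullet, with the usual sense-reversing trick so the same barrier can be reused across episodes. This uses C11 atomics in the same style as the lock sketches and is not the paper's exact code; the type and function names are illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  count;      /* processes still to arrive in this episode */
        atomic_bool sense;      /* flipped by the last arrival to release everyone */
        int         nprocs;
    } central_barrier_t;

    void barrier_init(central_barrier_t *b, int nprocs) {
        atomic_init(&b->count, nprocs);
        atomic_init(&b->sense, false);
        b->nprocs = nprocs;
    }

    /* local_sense lives in each process's private memory, initially false */
    void barrier_wait(central_barrier_t *b, bool *local_sense) {
        *local_sense = !*local_sense;                 /* my sense for this episode */
        if (atomic_fetch_sub(&b->count, 1) == 1) {
            /* last arrival: reset the counter, then flip the global sense */
            atomic_store(&b->count, b->nprocs);
            atomic_store(&b->sense, *local_sense);
        } else {
            while (atomic_load(&b->sense) != *local_sense)
                ;   /* spin until the last arrival flips the sense */
        }
    }

Every waiter spins on the single shared sense flag, so this barrier is simple but creates exactly the hot spot that the dissemination, tournament, and tree variants above are designed to avoid.
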
Experimental Results
  • The algorithms were tested on two different multiprocessor architectures. One used distributed shared memory, while the other had a cache-coherent, shared bus. The authors do not believe an efficient algorithm can be designed for 'dance-hall' architectures, in which all memory is equally distant from every processor.
  • The MCS lock (which the authors designed) proved to be the most efficient under contention. Ticket locks worked best when the hardware's atomic fetch operations were slow and single-processor latency was a concern.
  • The best barrier algorithm was the tree-based one the authors designed. If the number of participating processes varies from barrier to barrier, however, the centralized approach is best.
Questions
  • Can these algorithms be used to solve other performance problems?
  • Is there a way to design the lock and barrier algorithms so that they are less dependent on the type of underlying hardware?
  • Does avoiding the cost of special-purpose synchronization hardware justify the use of these fairly complex algorithms?