Synchronization
When you program with threads, you use a shared-memory parallelism programming model. This means that multiple streams of instructions are running simultaneously, and they can both read and write the same region of memory.
Problems abound in shared-memory parallelism. This lecture is about recognizing and fixing those problems.
Atomicity
For example, imagine that you have two threads that both concurrently run this line of C code:
*x += 1;
If the value x points to starts out at 0 before these two threads run, it would be nice to be guaranteed that *x contains 2 after both threads finish.
But, as you know, *x += 1 is not a single action that your machine takes all at once. You need to break it down into at least three steps: load the value, add 1 to it, and then store it back to memory.
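In C-like code, that expansion looks something like this (tmp is a stand-in for a CPU register):

int tmp = *x;   // Step 1: load the value from memory.
tmp = tmp + 1;  // Step 2: add 1 to it.
*x = tmp;       // Step 3: store it back to memory.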
What can happen as these three steps from the two threads interleave?
For example, consider what happens with this ordering of events:
- thread 1 loads the value x points to
- thread 2 loads the same value
- thread 1 increments the value
- thread 2 increments it
- thread 1 stores its modified value back to address x
- thread 2 stores its modified value back to address x
What would the value of *x be then?
If this is not the intended behavior (that is, if the programmer intended both copies of *x += 1 to take place as a single unit, resulting in the final value 2), then this is a violation of atomicity.
That is, the programmer might intend for an action like *x += 1 to be atomic: to happen all at once, without the ability for any thread to observe or interfere with the intermediate states between the beginning and the end of the operation. But in C (and in the equivalent assembly), this is not an atomic operation: it consists of several smaller operations, and other threads can interfere in the middle.
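You can watch this atomicity violation happen on a real machine. Here is a minimal sketch using POSIX threads (the names and iteration count are ours, chosen just to make the race easy to observe); on most machines, the final count usually comes out well under 2,000,000:

#include <pthread.h>
#include <stdio.h>

int counter = 0;  // Shared memory: both threads read and write it.

void *increment(void *arg) {
    for (int i = 0; i < 1000000; ++i) {
        counter += 1;  // Not atomic: load, add, store.
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d\n", counter);  // Rarely 2000000. Compile with: cc -pthread race.c
    return 0;
}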
Mutual Exclusion
Synchronization is a technique to avoid the problems that arise from shared-memory parallelism, such as atomicity violations. There are many forms of synchronization, and this lecture will explore a few of them.
An extremely popular form of synchronization is mutual exclusion, or mutex for short, also known as locking. The idea is that we want to delimit parts of the code where only one thread can be running at a time. Imagine that C had a special construct for mutual exclusion; then we might write this:
mutex {
    *x += 1;
}
This would mean that only one thread would be allowed to run inside those curly braces at a time. The region of code protected by mutual exclusion (the code inside the braces in this imaginary construct) is called a critical section. So if thread 1 entered the critical section, and thread 2 then arrived at the top of the section, thread 2 would need to wait until thread 1 left before it could enter.
Can you convince yourself that this mutual exclusion would fix the atomicity problems from our example? If we enforce mutually exclusive execution of that critical section, is that enough? (It is.)
Sadly, C does not have a built-in mutex construct. Instead, we need to use a library or build it ourselves.
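If we reach for a library, the standard choice on Unix-like systems is the mutex from the POSIX threads library. Here is a sketch of how our running example might use it (the function name is ours):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void safe_increment(int *x) {
    pthread_mutex_lock(&mutex);    // Enter the critical section.
    *x += 1;                       // Only one thread runs this at a time.
    pthread_mutex_unlock(&mutex);  // Leave the critical section.
}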
A Failed Attempt
Here’s a naive way that you might try to implement mutual exclusion: use a lock variable to keep track of whether someone is currently occupying the critical section. Something like this:
int lock = 0;
while (lock) {} // Wait for the lock to be free.
lock = 1; // Acquire the lock.
*x += 1; // Critical section here.
lock = 0; // Release the lock.
That should do it, right? What happens if two different threads run this code concurrently?
It doesn’t work. Imagine that both threads encounter the while statement at around the same time. Both read lock while it is still 0, so both exit the loop before either one sets lock to 1, and both threads end up inside the critical section at once. We have failed to enforce mutual exclusion.
It’s possible to fall down a deep rabbit hole of techniques for implementing mutual exclusion. A famous example is Peterson’s algorithm, which coordinates one flag variable per thread (instead of one shared flag variable) with a turn variable; see the sketch below.
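For the curious, here is a sketch of the classic two-thread version, with thread IDs 0 and 1. (The function and variable names are ours.) As the next paragraph explains, you should not actually use this: built from ordinary loads and stores, it cannot work reliably on modern hardware.

int flag[2] = {0, 0};  // flag[i] == 1 means thread i wants to enter.
int turn = 0;          // Which thread must yield if both want in.

void peterson_lock(int me) {
    int other = 1 - me;
    flag[me] = 1;   // Announce that we want to enter.
    turn = other;   // Politely offer to let the other thread go first.
    while (flag[other] && turn == other) {
        // Spin: wait until the other thread is done or it is our turn.
    }
}

void peterson_unlock(int me) {
    flag[me] = 0;   // Let the other thread enter.
}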
However, these custom algorithms for mutual exclusion are neither necessary nor sufficient. They are not necessary because CPUs provide special instructions just for implementing synchronization mechanisms such as mutual exclusion. They are not sufficient because CPUs implement optimizations that typically mean that any synchronization mechanism implemented using ordinary loads and stores, instead of the special instructions, cannot work reliably.
This insufficiency is a deep topic of its own that is out of scope for CS 3410, but here’s a brief summary. Please skip this paragraph unless you are super duper curious about an entirely separate branch of computer science. In a multiprocessor system, it takes a while for each processor to publish its memory stores so that they can be read by other processors. (The architectural component to blame is a store buffer.) That means that each CPU can read its own writes immediately, but other processors see these updates only after a delay. This results in a memory consistency model that allows updates to appear “out of order” to remote processors. Processors have therefore developed special instructions that bypass these optimizations and, at the cost of performance, force certain memory accesses to happen in a sequentially consistent order. All correct synchronization implementations, therefore, must use these special instructions instead of ordinary load and store instructions.
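If you are still curious: in C, the usual portable way to reach these special instructions is the C11 <stdatomic.h> interface. A minimal sketch (the names here are ours):

#include <stdatomic.h>

atomic_int ready = 0;

void publish(void) {
    // A sequentially consistent store: the compiler emits the special
    // instructions (or fences) needed to make this write visible to other
    // processors in order, at some cost in performance.
    atomic_store_explicit(&ready, 1, memory_order_seq_cst);
}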
Atomic Instructions
RISC-V provides two basic atomic instructions to support the implementation of synchronization operations such as mutual exclusion. They are called lr, for load reserved, and sc, for store conditional. These two instructions work together to provide the basic mechanisms required to implement any style of synchronization. (In other ISAs, this pattern is called load-link/store-conditional.)
The instructions come in different access sizes; for example, lr.w and sc.w are the word-sized (32-bit) versions.
Here’s what the instructions do:
- lr.w rd, (rs1): Load the 32-bit value at the address in rs1 and put the value in rd. (So far, like a normal lw.) Also, create a “reservation” of this address. (What is a “reservation”? Keep reading.)
- sc.w rd, rs2, (rs1): Store the value of rs2 at the address in rs1. (Again, so far, like a normal store.) But also check whether a reservation of this address exists. If so, then the store proceeds as normal, and set rd to 0. (Call this a “success.”) If not, then cancel the store altogether: do not write anything at all to memory, and set rd to 1. (This is a “failure.”)
This “reservation” business is a mechanism for checking whether anyone else wrote to a given address. While a reservation exists, think of the CPU as carefully monitoring the given address to see if anyone else writes to it. If nobody writes to the address between the lr and the sc, the reservation is preserved and the sc succeeds. If somebody else does write to the given address, then the reservation is lost and the sc fails.
Implementing Synchronization Operations
The usual way to use lr and sc together is to put them at the beginning and the end of some region of code, and then wrap the whole thing in a loop. The loop lets you try the code repeatedly until the sc succeeds. If you’re careful, this can mean that the code surrounded by the lr/sc pair eventually executes atomically.
The pattern looks vaguely like this:
loop:
    lr.w t0, (a0)
    # ... do something with t0 to compute t1 ...
    sc.w t2, t1, (a0)
    bnez t2, loop  # if the lr/sc failed, then try again
The memory address in this example is in register a0. This little loop tries to do something with the value at that address and then store the result back. If any other thread ever interferes, it gives up and tries again, over and over, until the operation succeeds. The end result is that we get to perform an atomic operation on the value stored at the address in a0.
You will use this pattern to implement interesting synchronization operations, including mutual exclusion, in this week’s assignment.
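By the way, this retry structure may look familiar if you have seen compare-and-swap loops in C. As a point of comparison, here is a sketch of an atomic increment using C11 atomics; on RISC-V, compilers typically implement the loop below with exactly this kind of lr/sc pattern (the function name is ours):

#include <stdatomic.h>

void atomic_increment(atomic_int *x) {
    int old = atomic_load(x);
    // Retry until no other thread interferes between our load and our store.
    // On failure, atomic_compare_exchange_weak updates old to the current
    // value, so the next attempt starts from fresh data.
    while (!atomic_compare_exchange_weak(x, &old, old + 1)) {
    }
}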