Atomics

Last time, we saw how fence instructions can synchronize different threads to avoid data races. This lecture explores a wider variety of synchronization strategies.

Atomicity

Imagine a C program that spawns two threads that both concurrently increment a shared global variable through a pointer x:

*x += 1;

Without any synchronization, this is a data race. (There are two different threads accessing the same memory location, and they’re both reading and writing.)

Our goal with this code is to have both threads increment *x. If the value x points to starts out at 0 before these two threads run, it would be nice if we could be guaranteed that *x contained 2 after both threads finish. That’s clearly not the case currently: in fact, the program’s behavior is undefined because there’s a data race.

The formal term for what we want here is atomicity. We want the increment to be atomic in the sense that it is indivisible: it happens all at once, and no other thread can observe or interfere partway through the execution of *x += 1.

To see a violation of atomicity in action, here’s a program that “amplifies” the problem by incrementing many times in many different threads:

#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

#define THREADS 100

int count;  // Shared by every thread; the data race is on this variable.

void* add_stuff(void* arg) {
    for (int i = 0; i < 10000; ++i) {
        count += 1;  // The unsynchronized read-modify-write (this is the race).
    }
    return NULL;
}

int main() {
    count = 0;

    pthread_t threads[THREADS];
    for (int i = 0; i < THREADS; ++i) {
        pthread_create(&threads[i], NULL, add_stuff, NULL);
    }

    for (int i = 0; i < THREADS; ++i) {
        pthread_join(threads[i], NULL);
    }

    printf("count = %d\n", count);

    return 0;
}

This program has undefined behavior, so we don’t know what it will do in general. On my machine, the final value of count is nondeterministic and frequently less than the expected value of 1,000,000.

The next few steps in this lecture are about trying to make *x += 1 execute atomically.

Fences Don’t Fix It

Let’s expand *x += 1 into an instruction sequence. It would look something like this:

lw t0, 0(a0)
addi t0, t0, 1
sw t0, 0(a0)

Following our advice from last time, we could try inserting fence instructions. While adding enough fences can avoid the data race, that still doesn’t fix the atomicity problem. In the limit, you could add a fence between every pair of instructions:

fence
lw t0, 0(a0)
fence
addi t0, t0, 1
fence
sw t0, 0(a0)
fence

Even though this code is free of data races, we still don’t have atomicity. A bad ordering of events like this is still possible (reading time from top to bottom):

THREAD 1            THREAD 2
--------            --------
lw t0, 0(a0)
                    lw t0, 0(a0)
addi t0, t0, 1
                    addi t0, t0, 1
sw t0, 0(a0)
                    sw t0, 0(a0)

If that happens, the two threads will both load the initial value 0, both increment it to 1, and then both store 1. The result is that one of the two updates is lost.

We need a different strategy to enforce atomicity.

Mutual Exclusion

One way to enforce atomicity is mutual exclusion, or mutex for short, also known as locking. The idea is that we want to delimit parts of the code where only one thread can be running at a time. Imagine that C had a special construct for mutual exclusion; then we might write this:

mutex {
  x += 1;
}

This would mean that only one thread would be allowed to be running inside those curly braces at a time. The region of code protected by mutual exclusion (the code inside the braces of this imaginary construct) is called a critical section. So if thread 1 entered the critical section, and thread 2 then arrived at the top of the section, thread 2 would need to wait until thread 1 left the critical section before it could enter.

Can you convince yourself that this mutual exclusion would fix the atomicity problems from our example? If we enforce mutually exclusive execution of that critical section, is that enough? (It is: with at most one thread inside the braces at a time, each load-increment-store sequence finishes before the next one starts, so no update can be lost.)

Sadly, C does not have a built-in mutex construct. Instead, we need to use a library or build it ourselves.
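For the library route, the pthreads library we’re already using provides a mutex type. Here’s a sketch of how our counting program’s thread function might use it; pthread_mutex_t, pthread_mutex_lock, and pthread_mutex_unlock are the real pthreads API, shown here just to illustrate the shape of the solution:

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
int count;

void* add_stuff(void* arg) {
    for (int i = 0; i < 10000; ++i) {
        pthread_mutex_lock(&mutex);    // Wait for the lock to be free, then take it.
        count += 1;                    // The critical section.
        pthread_mutex_unlock(&mutex);  // Let other threads enter.
    }
    return NULL;
}

With the increment protected this way, the program should print count = 1000000 every time. But it’s worth seeing what it takes to build a lock like this ourselves.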

A Failed Attempt

Here’s a naive way that you might try to implement mutual exclusion: use a lock variable to keep track of whether someone is currently occupying the critical section. Something like this:

int lock = 0;    // A global variable shared between threads.

while (lock) {}  // Wait for the lock to be free.
lock = 1;        // Acquire the lock.
*x += 1;         // Critical section here.
lock = 0;        // Release the lock.

That should do it, right (at least with enough fence instructions to avoid the data race on the lock variable)?

It doesn’t work. Imagine that both threads load the lock variable from memory at around the same time, and both observe that the value is zero. Then they will both exit the while loop, both set lock to 1, and both proceed into the critical section simultaneously. So we have failed to enforce mutual exclusion.

It’s possible to fall down a deep rabbit hole of techniques for implementing mutual exclusion. A famous example is Peterson’s algorithm, which uses one flag variable per thread (instead of one shared lock variable), plus a shared “turn” variable to break ties. By adding enough fence instructions to avoid data races, it is possible to get this to work.
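To give a flavor of that rabbit hole, here’s a sketch of Peterson’s algorithm for two threads with IDs 0 and 1. The names are our own, and the sketch assumes that loads and stores become visible to the other thread in program order, which real hardware does not guarantee; in practice you would also need fences around these accesses:

int flag[2] = {0, 0};  // flag[i] means thread i wants to enter.
int turn = 0;          // Which thread must yield when both want in.

void lock(int me) {
    int other = 1 - me;
    flag[me] = 1;  // Announce that we want to enter.
    turn = other;  // Politely offer to let the other thread go first.
    // Wait while the other thread wants in and it's their turn.
    while (flag[other] && turn == other) {}
}

void unlock(int me) {
    flag[me] = 0;  // We no longer want to be inside.
}

But there is a better way: ISAs provide special instructions to support atomicity.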

Atomic Instructions

RISC-V provides two basic atomic instructions to support the implementation of synchronization operations. They are called lr, for load reserved, and sc, for store conditional. These two instructions work together to provide the basic mechanisms required to implement any style of synchronization. (In other ISAs, this pattern is called load-link/store-conditional.)

The instructions come in different access sizes; for example, lr.w and sc.w are the word-sized (32-bit) versions. Here’s what the instructions do:

  • lr.w rd, (rs1): Load the 32-bit value at the address in rs1 and put the value in rd. (So far, like a normal lw.) Also, create a “reservation” of this address. (What is a “reservation”? Keep reading.)
  • sc.w rd, rs2, (rs1): Store the value of rs2 at the address in rs1. (Again, so far, like a normal store.) But also check whether a reservation of this address exists. If so, the store proceeds as normal, and rd is set to 0. (Call this a “success.”) If not, the store is canceled altogether: nothing at all is written to memory, and rd is set to 1. (This is a “failure.”)

This “reservation” business is a mechanism for checking whether anyone else wrote to a given address. While a reservation exists, think of the CPU as carefully monitoring the given address to see whether any other thread writes to it. If nobody writes to the address between the lr and the sc, the reservation is preserved and the sc succeeds. If somebody else does write to the address, the reservation is lost and the sc fails.

The .w suffix indicates a word-sized (4-byte) access, like the lw and sw instructions for ordinary non-atomic loads and stores. For double-word atomic instructions, there are also lr.d and sc.d, which are like ld and sd.

Implementing Atomic Operations

The usual way to use lr and sc together is to put them at the beginning and the end of some region of code, and then wrap the whole thing in a loop. The loop lets you retry the code repeatedly until the sc succeeds. If you’re careful, this can mean that the code surrounded by the lr/sc pair eventually executes atomically. The pattern looks something like this:

loop:
  lr.w t0, (a0)
  # ... do something with t0 to compute t1 ...
  sc.w t2, t1, (a0)
  bnez t2, loop     # if the lr/sc failed, then try again

The memory address in this example is in register a0. This little loop tries to do something with the value at this address and then store it back. If any other thread ever interferes, then it gives up and tries again—over and over, until the operation succeeds. The end result is that we get to perform an atomic operation on the value stored at the address in a0.

You will use this pattern to implement interesting synchronization operations, including mutual exclusion, in an upcoming assignment.

An Atomic Update

Let’s use this pattern to implement *x += 1 atomically. Imagine that a0 contains the pointer x (the address of the value we want to increment), so here’s the ordinary (non-atomic) version of this code:

lw t0, 0(a0)
addi t0, t0, 1
sw t0, 0(a0)

To make this atomic, we’ll first replace lw with lr.w and sw with sc.w:

lr.w t0, (a0)
addi t0, t0, 1
sc.w t1, t0, (a0)

Then, we need to account for potential failures in the sc.w by jumping back to the top and trying again:

.retry:
  lr.w t0, (a0)
  addi t0, t0, 1
  sc.w t1, t0, (a0)
  bnez t1, .retry

Try putting this code into an assembly file and wrapping it in a function. You can call it atomic_incr. Then, in our original code above, replace the count += 1 line with a call to atomic_incr().
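Concretely, the C side might look something like this. It’s just a sketch: it assumes the loop above lives in its own assembly file under an atomic_incr label (with a ret at the end), so the pointer argument arrives in a0 per the standard RISC-V calling convention:

extern void atomic_incr(int *addr);  // Defined in the assembly file.

void* add_stuff(void* arg) {
    for (int i = 0; i < 10000; ++i) {
        atomic_incr(&count);  // Was: count += 1;
    }
    return NULL;
}

Now, our code reliably outputs 1,000,000 because we have successfully enforced atomicity.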

Synchronization Constructs

It’s possible to use lr, sc, and fence to implement many general-purpose, reusable synchronization constructs. You will do some of this in an assignment. Here is a very incomplete list of synchronization constructs that people use in real multithreaded code:

  • Mutual exclusion. You write a lock() function that waits until the lock is free and acquires it, and a corresponding unlock() function that releases the lock and lets other threads enter the critical section.
  • Condition variables. These extend mutual-exclusion locks with the ability to temporarily release a lock until some condition changes. (A sketch appears after this list.)
  • Barriers. A barrier has a parameter \(n\), and it waits until \(n\) threads reach the same barrier. When that threshold is reached, all \(n\) threads simultaneously proceed past the barrier.
  • Semaphores. A semaphore limits the number of threads that can simultaneously acquire some resource.
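To make the condition-variable idea concrete, here’s a sketch using the pthreads API. The ready flag and the two function names are made up for illustration, but pthread_cond_wait really does temporarily release the lock while waiting, exactly as described above:

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int ready = 0;  // The "condition" we're waiting on.

void wait_for_ready(void) {
    pthread_mutex_lock(&m);
    while (!ready) {
        // Atomically releases m and sleeps; reacquires m before returning.
        pthread_cond_wait(&cv, &m);
    }
    pthread_mutex_unlock(&m);
}

void announce_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);  // Wake up one waiting thread.
    pthread_mutex_unlock(&m);
}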

Different synchronization mechanisms have different strengths and weaknesses, and you’ll find that each one may be appropriate in a different parallel programming scenario.

You Might Still Need Fences

The fence instruction we studied last time and lr/sc are related but not equivalent. We have seen that lr/sc can implement atomic updates to a single variable. But unlike fence, they don’t enforce any ordering with respect to other “plain” loads and stores your program might do. In some circumstances, you might need to use both: for example, if you implement a mutual-exclusion lock with lr/sc, you may still want a fence to enforce an ordering between lock acquires/releases and the critical sections they protect.
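For concreteness, here’s a sketch of one way that combination might look for a simple spinlock. The details are assumptions rather than the required design for the assignment: the lock’s address is in a0, and its value is 0 when free and 1 when held.

acquire:
  lr.w t0, (a0)       # load the lock value and reserve the address
  bnez t0, acquire    # lock is held; spin until it looks free
  li t1, 1
  sc.w t2, t1, (a0)   # try to claim the lock
  bnez t2, acquire    # reservation lost; start over
  fence               # keep the critical section's accesses after the acquire

  # ... critical section goes here ...

release:
  fence               # keep the critical section's accesses before the release
  sw zero, 0(a0)      # mark the lock free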

Note

RISC-V provides extra flags for the lr and sc instructions that can optionally enforce memory ordering, obviating the need for a fence. They are the acquire and release flags, which appear as aq and rl suffixes on the assembly instructions. For example, you can write lr.d.aq, lr.d.rl, or lr.d.aqrl. Here’s a synopsis: aq prevents memory accesses that appear after the instruction from being reordered before it, and rl prevents accesses that appear before the instruction from being reordered after it.

You can read more about these flags in the RISC-V ISA reference manual. For 3410 purposes, it’s fine to stick to using fence.