Synchronization
When you program with threads, you are using a shared-memory parallelism programming model. This means that multiple streams of instructions are running simultaneously, and they can both read and write the same region of memory. As we discussed last time, this programming model is relatively natural; threads don’t need to do anything special to communicate with each other and they all run the same program (usually different parts of the same program). Do not be deceived by this apparent simplicity though, as programming with threads is notoriously complex and error prone.
While each thread executes the program sequentially, there are (almost) no ordering or timing guarantees between threads. This leads to a whole class of bugs that are hard to reason about and may be impossible to reproduce. In this lecture, we will focus on recognizing and fixing these problems with synchronization.
The Problem
Let’s try writing two threads that communicate to each other using the memory they share. We’ll use a global variable:
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

int value;

void* send(void* arg) {
    value = 3410;
    return NULL;
}

void* receive(void* arg) {
    printf("value = %d\n", value);
    return NULL;
}

int main() {
    pthread_t thread1;
    pthread_t thread2;
    value = 0;
    pthread_create(&thread1, NULL, send, NULL);
    pthread_create(&thread2, NULL, receive, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}
The send thread writes to the global variable,
and the receive thread reads from it.
This kind of communication is possible because threads share a virtual memory address space.
But this program has a problem:
it does not do anything to guarantee that the load in receive happens after the store in send.
We may get lucky and see receive print 3410 sometimes, but that won’t happen every time.
Here are a couple of scenarios in which it might not happen:
- When we launch the two threads, the OS scheduler gets to choose which to run when. The scheduler uses a complicated heuristic to choose threads, and even though we create the receive thread second, it is allowed to schedule it first.
- Imagine send and receive both start at about the same time on different CPU cores. Then, a regularly-scheduled timer interrupt happens on the core running send, delaying it before it can store to the global variable. But the core running receive doesn't happen to have a timer scheduled, so that thread goes ahead.
Both circumstances would cause the load in receive not to “see” the store in send.
In general, because our program does not do anything to make sure the events happen in the right order, the behavior will be unpredictable.
(This situation is called a data race, and we’ll define it more clearly soon. Data races are undefined behavior in C. So the problem is actually deeper than just unpredictability in the OS: as with any other undefined behavior in C, this can cause the program to do something arbitrarily bad.)
Attempt 1: Synchronization with sleep
Here’s one idea that won’t work.
Let’s insert calls to sleep!
For example, try adding a sleep(1) call at the top of receive.
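Concretely, the attempted fix would look something like this:

void* receive(void* arg) {
    sleep(1);  /* hope that send has already run by now */
    printf("value = %d\n", value);
    return NULL;
}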
All fixed, right??
Warning
No.
You should be suspicious of any solution that relies on physical time passing. Here are a few arguments why:
- Philosophically, it seems bad that the behavior of your program depends on the passage of time instead of the actions of instructions.
- How do you know how much is enough time to force a given ordering? (Remember that the threads will not start at precisely the same time, and that sleep will always sleep a little more than the time you request.)
- The same scenarios from the previous section are still possible, even if the sleep call makes them less likely. It seems bad to write code that is merely very likely to be correct; we want it to be right every time.
How Bad Are Nondeterministic Bugs?
We have created a bug here that happens sometimes but not every time. In other words, the bug occurs nondeterministically.
Imagine you have a nondeterministic bug that occurs with probability \(P\). If \(P=0\), then it’s not a bug at all, and that’s good; if \(P=1\), then it’s a plain old deterministic bug, and that’s clearly bad. When \(0 \lt P \lt 1\), how bad is it? For whatever definition of “bad,” think about how “badness” is related to \(P\).
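As a rough illustration of why small \(P\) is still dangerous: if each run independently triggers the bug with probability \(P\), the chance of hitting it at least once in \(n\) runs is \(1 - (1 - P)^n\), which creeps toward 1 as \(n\) grows. A bug with \(P = 0.001\) sounds rare, but over 10,000 runs it appears at least once with probability above 99.99%.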
One problem with nondeterministic bugs is that they are sometimes also heisenbugs: they go away when you look for them.
Attempt 2: Synchronization with a Shared Variable
Let’s try to force the ordering using another global variable. Here’s the idea:
int value;
int done;

void* send(void* arg) {
    sleep(1);
    value = 3410;
    done = 1;
    return NULL;
}

void* receive(void* arg) {
    while (!done) {}
    printf("value = %d\n", value);
    return NULL;
}
The new variable, done, starts out holding 0.
After send stores to value, it then sets done to 1.
receive uses a loop to wait for done to become 1, and then it prints out value.
So even though we’re sleeping at the beginning of send, this looks like it should force the load in receive to happen after the store in send.
This also won’t work.
At least on my machine, it seems to work OK without optimization, but using gcc -O1 to turn on some compiler optimizations makes it run forever.
It seems bad to have programs that only work with optimizations turned off!
Why did this happen?
Ordinary Loads and Stores Do Not Suffice
The central problem is that you cannot synchronize threads with ordinary loads and stores alone.
The foundational reason for this rule stems from the hardware. In a multiprocessor system, it takes a while for each processor to publish its memory stores so that they can be read by other processors. (The architectural component to blame is a store buffer.) That means that each CPU can read its own writes immediately, but other processors see these updates only after a delay—and possibly in a different order. The hardware details here are out of scope for CS 3410, but the consequence is that maintaining an “obvious” ordering for all loads and stores in a multiprocessor would be prohibitively expensive. The result is that all modern hardware implements a memory consistency model that allows memory accesses to appear “out of order” to remote processors.
Processors have therefore developed special instructions that bypass these optimizations and, at the cost of performance, force certain memory accesses to happen in a sequentially consistent order. All correct synchronization implementations, therefore, must use these special instructions instead of ordinary load and store instructions.
The key takeaway here is that, to implement correct synchronization between threads, we need hardware support.
Beyond the hardware, the compiler is also a problem. It wants to optimize your program by reordering (or even eliminating) accesses to memory. Using special instructions will also instruct the compiler not to do that.
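As a sketch of what went wrong in Attempt 2 under gcc -O1: nothing inside the waiting loop writes to done, so the compiler is allowed to load it once and reuse that value. (This is one plausible transformation; the exact output depends on your compiler.)

/* What the optimizer may effectively turn receive's loop into: */
int tmp = done;    /* read done once... */
while (!tmp) {}    /* ...and spin forever on the stale copy if it was 0 */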
Fence
RISC-V provides an instruction that prevents problematic reorderings, called fence.
You can write it like this:
fence
(This instruction has other options you can set, but a plain fence is the “strongest” form that comes with the most guarantees.)
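For reference, the options select which kinds of accesses the fence orders. A rough sketch, assuming the GNU assembler’s syntax:

fence rw, rw    # order earlier reads/writes before later reads/writes
fence           # no operands: shorthand for fence iorw, iorw, the full barrier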
Like the name implies, you can think of fence as a barrier that prevents accesses from being reordered “across” it.
All loads and stores that appear before the fence in your program must finish before the fence runs,
and all loads and stores that appear afterward must actually occur afterward.
Fixing Our Example
At last, we can write a correct version of our send/receive program.
But because ordinary loads and stores do not suffice on their own, we will need to write assembly.
We’ll use the same tactic as we did when writing our first functions in assembly.
Let’s start with some function signatures in our C code:
void sync_set(int* flag);
void sync_wait(int* flag);
These two functions will synchronize across threads.
sync_set will wait for all prior memory accesses to be done,
and then set the supplied variable to 1.
sync_wait will wait until that variable becomes 1,
and then make sure that all following memory accesses come afterward.
Together, the idea is to synchronize the code before sync_set on one thread and after sync_wait on the other thread.
This way, the latter code can be sure to “see” all the memory accesses performed by the former.
We then amend our thread code to use those functions:
int value;
int done;

void* send(void* arg) {
    sleep(1);
    value = 3410;
    sync_set(&done);
    return NULL;
}

void* receive(void* arg) {
    sync_wait(&done);
    printf("value = %d\n", value);
    return NULL;
}
Finally, let’s implement those functions in assembly:
.global sync_set
.global sync_wait

sync_set:
    li   t0, 1
    fence           # finish all earlier loads and stores first
    sw   t0, 0(a0)  # then set *flag = 1
    ret

sync_wait:
.loop:
    lw   t0, 0(a0)  # load *flag
    beqz t0, .loop  # keep spinning until it becomes 1
    fence           # later loads and stores must happen after the loop
    ret
The placement of the fence instructions is important.
In sync_set, we need to make sure that all memory accesses the thread performs before the call are finished before we tell the other thread to continue.
In sync_wait, we need to make sure that all memory accesses that the thread does after the call actually happen after we’re done waiting in the loop.
Compile and run this code together by typing something like:
rv gcc sync.c sync.s -o sync
rv qemu sync
This is finally a reliable way to synchronize between threads.
Note
In recent versions of C, it is possible to write C code that compiles to special synchronization instructions (i.e., to implement correct synchronization without hand-writing assembly). Check out the stdatomic header if you’re curious. But we’ll stick to RISC-V in 3410.
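If you do peek at it, here is a rough sketch (not the 3410 approach) of what sync_set and sync_wait could look like with C11 atomics; note that the flag would need to be declared as an atomic_int:

#include <stdatomic.h>

void sync_set(atomic_int* flag) {
    /* release: all earlier memory accesses finish before the flag becomes 1 */
    atomic_store_explicit(flag, 1, memory_order_release);
}

void sync_wait(atomic_int* flag) {
    /* acquire: all later memory accesses happen after we observe the 1 */
    while (!atomic_load_explicit(flag, memory_order_acquire)) {}
}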
Synchronization Orders Events Across Threads
By writing sync_set and sync_wait, we have created an inter-thread synchronization construct.
There are many styles of synchronization, but they are all different ways of doing the same thing:
establishing a reliable order between events that occur on different threads.
One way to think about this is to imagine the execution of a multithreaded program as a graph.
The vertices are events (e.g., instructions or statements)
and the edges represent the ordering between these events.
Within a single thread, all events are ordered:
you can draw an edge between every instruction and the instruction that immediately follows it in the thread’s execution.
By default, however, there are no edges between events on different threads.
Synchronization (like the calls to sync_set and sync_wait) creates inter-thread ordering edges.
This idea is called a happens-before graph.
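For our send/receive program, a sketch of the happens-before graph might look like this (each arrow is an ordering edge; the vertical one is the edge created by the synchronization):

send thread:     value = 3410  -->  sync_set(&done)
                                         |
                                         v
receive thread:          sync_wait(&done)  -->  printf("value = %d\n", value)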
Data Races
Let’s take a step back and think about what we’ve accomplished with our sync_set/sync_wait synchronization strategy.
We’ve created a way to force an ordering between threads, regardless of any accidents of timing.
Even if the OS makes very strange decisions about when to schedule threads, the same ordering happens every time.
In general, synchronization establishes a reliable order between events in different threads.
Let’s think about which events in different threads are problematic.
In our example above, the problem was that one thread stores to the value address and the other thread loads from the same location in memory.
Synchronization guarantees that the store happens before the load.
If that ordering isn’t guaranteed, that’s a problem.
The formal name for this kind of problem is a data race.
Here’s a definition:
a data race occurs when two different threads perform unsynchronized accesses to the same memory location,
and at least one of those accesses is a store.
Our original program above has a data race between the store to value in the send thread and the load from value in the other thread.
To understand this definition, it can be useful to think through things that are not data races:
- Memory accesses within a single thread. Memory accesses can of course be buggy for other reasons, but they are not data races!
- When different threads access different memory locations. Writing to one variable and reading from a totally different variable is fine, even when those accesses happen in different threads.
- Multithreaded reads of the same data. It is always OK for different threads to share read-only data. The only data-race situations are when one thread writes while another reads, or when both threads write.
The subtlest part of the definition is that unsynchronized qualifier. This part means that there are no synchronization operations establishing an order between the accesses in question. The implication is that you can always fix data races by adding synchronization.
Example
Let’s consider our original (buggy) program above. The problematic lines are:
value = 3410;
And:
printf("value = %d\n", value);
Let’s check the four parts of our definition:
- These lines execute in different threads.
- The accesses are unsynchronized: we haven’t done anything to ensure ordering.
- The accesses go to the same memory location. (There is only one global value variable.)
- The first access is a write, and the second access is a read.
So this is indeed a data race.
However, the final version is free of data races because sync_set and sync_wait suffice to enforce an order between the store and the load.
Data Races are Undefined Behavior
Data races are undefined behavior in C (and C++). That means they are just as problematic as violations of the heap commandments: use-after-free bugs, out-of-bounds accesses, and so on. The compiler is allowed to assume your program has no data races, and it bases its optimizations on that assumption.
The consequence is that you cannot reason about the behavior of racy programs; they can do anything. To write working parallel software, you must avoid data races.