Parallel Programming

One of the two motivations we used when introducing threads was the idea of harnessing parallel hardware to make computations go faster. Parallelism is important because the overwhelming majority of computers in the modern world are parallel. When was the last time (if ever) that you saw a laptop for sale with a single-core CPU? Core counts like 8 are much more common today. Even the Apple Watch has a dual-core processor. And on the other end of the spectrum, server processors have core counts like 96 and 192. The result is that, when performance matters, parallelism is the only way to take full advantage of the hardware.

Now that you know about the “building blocks” for parallelism (namely, atomic instructions), this lecture is about writing software that uses them to get work done. In CS 3410, we focus on the shared-memory multiprocessing approach, a.k.a. threads. There are many other programming models for writing parallel software out there, but the shared-memory approach is ubiquitous: because threads are an incremental extension of the sequential programming paradigm, they are kind of the “default” way for modern software to incorporate parallelism.

pthreads

Last week’s assignment was on implementing synchronization operations to support parallel programming. It turns out that Unix has a standard library, called POSIX Threads or, affectionately, pthreads, that implements many of these sync ops for you. This lecture is about moving up the abstraction hierarchy: now that you know how these building blocks work, we can grant ourselves permission to use the “standard” version.

You can read the entire pthread.h header to see what’s available. Let’s walk through the basics step by step.

Spawn & Join Threads

The pthread_create function launches a new thread. It’s a tiny bit like fork and exec, except that it creates a new thread within the current process instead of spawning a new subprocess. Here’s the signature:

int pthread_create(pthread_t* thread, const pthread_attr_t* attr,
    void *(*thread_func)(void*), void* arg);

We’ll come back to the other arguments, but the important ones for now are:

  • The first argument, thread, is a pthread_t pointer to initialize. This value is what the parent will use to interact with its brand-new child thread.
  • The third argument, thread_func, is a function pointer to the code to run in the new thread. The thread function has to have a specific signature: void* thread_func(void* arg). The void* argument and return types are C’s way of letting the thread function receive and return “anything.”

It’s OK (for now) to pass NULL for the other parameters. So the basic recipe for spawning a new thread looks like this:

void* thread_func(void* arg) {
    // code to run in a new thread!
}

// ...

pthread_t thread;
pthread_create(&thread, NULL, thread_func, NULL);

Whenever you spawn a thread, you will also want to wait for it to finish, a.k.a. join the thread. There is a pthreads call for that too, the pthread_join function:

int pthread_join(pthread_t thread, void** out_value);

We will again ignore the second parameter for a moment (it can be NULL). The first parameter is the pthread_t value that we previously initialized with pthread_create. The call blocks until the given thread finishes.

Putting it all together, here’s a complete program that launches a thread and then properly waits for it to finish:

#include <stdio.h>
#include <pthread.h>

void* my_thread(void* arg) {
    printf("Hello from a child thread!\n");
    return NULL;
}

int main() {
    printf("Hello from the main thread!\n");

    pthread_t thread;
    pthread_create(&thread, NULL, my_thread, NULL);
    pthread_join(thread, NULL);

    printf("Main thread is done!\n");
    return 0;
}

There are no race conditions here; this program is properly synchronized and is guaranteed to print the three messages in order:

Hello from the main thread!
Hello from a child thread!
Main thread is done!

Arguments & Return Values

Thread functions take a void* argument and return a void* value so that the parent can communicate with the child thread. You pass a pointer to the argument value to pthread_create, and pthreads will pass this along to the thread function’s argument. Then, if you return a value from the thread function, the parent can receive that value through an “out-parameter” in pthread_join. That also means the parent has to wait for the child to finish for the return value to become available.

Here’s an example of a thread that performs the incredibly heavy-duty work of multiplying an integer by 2:

#include <stdio.h>
#include <pthread.h>

void* doubler_thread(void* arg) {
    int* num = (int*)arg;
    *num = *num * 2;
    return arg;
}

int main() {
    int my_number = 21;
    printf("Before, my_number = %d\n", my_number);

    pthread_t thread;
    pthread_create(&thread, NULL, doubler_thread, &my_number);
    int* result;
    pthread_join(thread, (void**)&result);
    printf("Result returned: %d\n", *result);

    printf("After, my_number = %d\n", my_number);
    return 0;
}

The parent passes a pointer to my_number to the doubler_thread thread function. The thread function then passes the same pointer right back to the parent.

While thread arguments are really important, to be honest, I don’t usually find thread return values all that useful. It’s usually easier to just use the thread argument: to pass a pointer to where the thread should write its results. You’ll see that happen in the rest of the examples in this lecture.

Launching Lots of Threads

You usually want to create many threads at once, not just one. You still need one pthread_t per thread, so a good tactic is to use an array (on the stack or the heap) of these. Use a loop to launch the threads with pthread_create, and then another loop to wait for each one with pthread_join.

Here’s an example that launches one thread per number in a range to check if it’s prime (in the slowest way possible):

#include <stdio.h>
#include <pthread.h>
#include <stdbool.h>

#define NUMBERS 20

bool is_prime(int n) {
    for (int i = 2; i < n; ++i) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

typedef struct {
    int number;
    bool* prime_flags;
} my_thread_args_t;

void* prime_thread(void* args_in) {
    my_thread_args_t* args = (my_thread_args_t*)args_in;
    args->prime_flags[args->number] = is_prime(args->number);
    return NULL;
}

int main() {
    // We'll set `prime[i]` to true iff `i` is prime.
    bool prime[NUMBERS];

    // Launch a thread to check every number.
    pthread_t threads[NUMBERS];
    my_thread_args_t thread_args[NUMBERS];
    for (int i = 1; i < NUMBERS; ++i) {
        thread_args[i] = (my_thread_args_t){
            .number = i,
            .prime_flags = prime,
        };
        pthread_create(&threads[i], NULL, prime_thread, &thread_args[i]);
    }

    // Join all threads and print results when ready.
    for (int i = 1; i < NUMBERS; ++i) {
        pthread_join(threads[i], NULL);
        printf("%d is %s\n", i, prime[i] ? "prime" : "composite");
    }

    return 0;
}

This example also demonstrates another useful technique: defining your own little struct just to use as the argument to the thread function. If thread functions could take multiple arguments, we might just do that. But using a struct for the arguments is the next best thing. Here, my_thread_args_t contains the number that the thread is supposed to process and a pointer to the results array where it should write. To ensure that the argument struct remains “alive” for the entire duration of the thread, we also need an array to store all these my_thread_args_t values. (It would not work, for example, to use a local variable inside the loop.)

Make Threads Do Coarse-Grained Chunks of Work

Threads are not free. Launching a thread takes time to coordinate with the OS; joining similarly costs waiting time; each running thread costs bookkeeping memory; and frequent context switching between threads adds overhead. And if you are aiming to fully harness a parallel CPU, it doesn’t help to have more threads than you have available hardware parallelism anyway.

It is therefore not a good idea to launch threads that only do a tiny amount of work, such as checking a single number for primality. Checking thousands or millions of numbers is perfectly practical, but launching millions of threads to check each one is not. In practical programming, you will want to divide a problem into coarser-grained chunks of work. Then you can launch a small number of threads—probably somewhere close to the number of cores in your machine.
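
If you want to size the thread count to the machine, one way to ask how many cores are available on Unix-like systems is sysconf. (A minimal sketch; _SC_NPROCESSORS_ONLN is widely supported on Linux and macOS but is not strictly required by POSIX.)

#include <unistd.h>

long count_cores(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) {
        cores = 1;  // Fall back to a single thread if the query fails.
    }
    return cores;
}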

For our primality example, it could make sense to divide up the numbers we need to check. We can extend our my_thread_args_t struct to contain not just one number but a start/end interval. Then, we just need to change our thread to loop over the range. Here’s a full implementation:

#include <stdio.h>
#include <pthread.h>
#include <stdbool.h>

#define THREADS 8
#define NUMBERS 1024

bool is_prime(int n) {
    for (int i = 2; i < n; ++i) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

typedef struct {
    int start_number;
    int end_number;
    bool* prime_flags;
} my_thread_args_t;

void* prime_thread(void* args_in) {
    my_thread_args_t* args = (my_thread_args_t*)args_in;

    for (int n = args->start_number; n < args->end_number; ++n) {
        args->prime_flags[n] = is_prime(n);
    }

    return NULL;
}

int main() {
    // We'll set `prime[i]` to true iff `i` is prime.
    bool prime[NUMBERS];

    // Launch a thread to check chunks of numbers.
    pthread_t threads[THREADS];
    my_thread_args_t thread_args[THREADS];
    int numbers_per_thread = NUMBERS / THREADS;  // Hopefully they divide.
    for (int i = 0; i < THREADS; ++i) {
        thread_args[i] = (my_thread_args_t){
            .start_number = i == 0 ? 1 : i * numbers_per_thread,
            .end_number = (i + 1) * numbers_per_thread,
            .prime_flags = prime,
        };
        pthread_create(&threads[i], NULL, prime_thread, &thread_args[i]);
    }

    // Join all threads and print results when ready.
    for (int i = 0; i < THREADS; ++i) {
        pthread_join(threads[i], NULL);
        for (int n = thread_args[i].start_number;
             n < thread_args[i].end_number;
             ++n) {
            printf("%d is %s\n", n, prime[n] ? "prime" : "composite");
        }
    }

    return 0;
}

The nice thing about this version is that the problem size (the number of integers to check for primality) is not related to the thread count. So we can freely change the two parameters independently.

Concurrency Bugs

Sadly, parallel programming comes with an entirely new category of bugs to worry about. You have already seen atomicity violations, for example, and many other forms of concurrency bugs also lurk in shared-memory programming. In essence, the whole game of parallel programming is avoiding concurrency bugs without sacrificing too much of the awesome performance potential of parallel hardware.

A Racy Program

Let’s try changing our multithreaded primality checker to, instead of reporting which numbers are prime, just count how many primes exist in a range of numbers. Here’s the complete program:

#include <stdio.h>
#include <pthread.h>
#include <stdbool.h>

#define THREADS 8
#define NUMBERS 1024

bool is_prime(int n) {
    for (int i = 2; i < n; ++i) {
        if (n % i == 0) {
            return false;
        }
    }
    return true;
}

typedef struct {
    int start_number;
    int end_number;
    int* prime_count;
} my_thread_args_t;

void* prime_thread(void* args_in) {
    my_thread_args_t* args = (my_thread_args_t*)args_in;

    for (int n = args->start_number; n < args->end_number; ++n) {
        if (is_prime(n)) {
            (*(args->prime_count))++;
        }
    }

    return NULL;
}

int main() {
    int primes = 0;

    // Launch a thread to check chunks of numbers.
    pthread_t threads[THREADS];
    my_thread_args_t thread_args[THREADS];
    int numbers_per_thread = NUMBERS / THREADS;  // Hopefully they divide.
    for (int i = 0; i < THREADS; ++i) {
        thread_args[i] = (my_thread_args_t){
            .start_number = i == 0 ? 1 : i * numbers_per_thread,
            .end_number = (i + 1) * numbers_per_thread,
            .prime_count = &primes,
        };
        pthread_create(&threads[i], NULL, prime_thread, &thread_args[i]);
    }

    // Join all threads.
    for (int i = 0; i < THREADS; ++i) {
        pthread_join(threads[i], NULL);
    }

    // Print final prime count.
    printf("%d numbers in the range 1-%d are prime\n",
           primes, (NUMBERS - 1));

    return 0;
}

When I compiled and ran this program on my machine, it gave disturbingly inconsistent answers. Here are a few runs:

$ gcc -O2 threads-racy.c -o racy
$ ./racy
153 numbers in the range 1-1023 are prime
$ ./racy
163 numbers in the range 1-1023 are prime
$ ./racy
154 numbers in the range 1-1023 are prime
$ ./racy
167 numbers in the range 1-1023 are prime
$ ./racy
153 numbers in the range 1-1023 are prime
$ ./racy
159 numbers in the range 1-1023 are prime
$ ./racy
161 numbers in the range 1-1023 are prime

It’s bad enough that these answers are incorrect, but even worse, the program is nondeterministically incorrect.

The problem is reminiscent of the basic atomicity violation that we saw recently, but it actually indicates an even deeper problem.

Data Races

The fundamental problem in the buggy program above is an unsynchronized memory access. The formal name is a data race. Here’s a definition: a data race occurs when two different threads perform unsynchronized accesses to the same memory location, and at least one of those accesses is a write.

To understand this definition, it can be useful to think through things that are not data races:

  • Memory accesses within a single thread. Memory accesses can of course be buggy for other reasons, but they are not data races!
  • When different threads access different memory locations. In our original primality check program, for example, different threads wrote to different prime[i] indices. But no two threads ever tried to write to the same index, so there was no data race.
  • Multithreaded reads of the same data. It is always OK for different threads to share read-only data. The only situations that are data races are when one thread writes while another reads, and when both threads write.

The final criterion is the unsynchronized qualifier. This has a more nuanced definition, but it broadly means that there are no synchronization operations (such as locks) protecting the data. The implication is that you can always fix data races by adding synchronization.

The line in our program with the data race is this one:

(*(args->prime_count))++;

Let’s check the four parts of our definition:

  • Multiple threads run this line.
  • The access is unsynchronized: we haven’t done anything to ensure ordered access.
  • The accesses go to the same memory location. (There is only one prime_count variable.)
  • Although the ++ syntax makes it slightly harder to see, this line both reads and writes the variable.

So this is indeed a data race.

Data races are undefined behavior in C (and C++). That means they are just as problematic as violations of the heap commandments: use-after-free bugs, out-of-bounds accesses, and so on. The compiler is allowed to assume that your program has no data races, and it bases its optimizations on that assumption.

The consequence is that you cannot reason about the behavior of racy programs; they can do anything. To write working parallel software, you must avoid data races.
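
To get a feel for the kind of trouble this causes, consider a racy polling loop. Because the compiler may assume there are no data races, it is allowed to load the flag once and never look at memory again. (This is an illustrative sketch of one legal transformation; whether a particular compiler actually performs it will vary.)

#include <stdbool.h>

bool ready = false;  // Written by another thread, with no synchronization.

void wait_for_ready(void) {
    // Under the no-races assumption, the compiler may load `ready` once,
    // turning this into an infinite loop that never rechecks memory.
    while (!ready) {
    }
}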

Locks in pthreads

You can fix data races by adding synchronization. We could even use the spin-lock mutex that is on your current assignment. But pthreads also provides a mutual exclusion lock, pthread_mutex_t. There are three steps to use a pthreads lock:

  • Declare and initialize a pthread_mutex_t, either statically with PTHREAD_MUTEX_INITIALIZER or at runtime with pthread_mutex_init.
  • Call pthread_mutex_lock at the start of each critical section to acquire the lock.
  • Call pthread_mutex_unlock at the end of the critical section to release it.

To fix our racy program above, we can declare a new mutex in main:

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

Then, we’ll need to pass this mutex along to each thread by adding it to our my_thread_args_t struct. Within each thread, we’ll acquire and release the mutex to protect a critical section:

pthread_mutex_lock(args->mutex);
(*(args->prime_count))++;
pthread_mutex_unlock(args->mutex);
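
For completeness, here’s roughly what the extended argument struct might look like (a sketch; the mutex field name is our own choice):

typedef struct {
    int start_number;
    int end_number;
    int* prime_count;        // Shared counter for all threads.
    pthread_mutex_t* mutex;  // Guards accesses to `*prime_count`.
} my_thread_args_t;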

We now have a properly synchronized program with no data races. If we run this program, it reliably gets the right answer:

$ gcc -O2 threads-mutex.c -o mutex
$ ./mutex
173 numbers in the range 1-1023 are prime
$ ./mutex
173 numbers in the range 1-1023 are prime
$ ./mutex
173 numbers in the range 1-1023 are prime

Catching Races with Thread Sanitizer

To catch other forms of undefined behavior such as out-of-bounds accesses, we recommend enabling sanitizers in the compiler. Is there a similar way to detect data races?

Fortunately, yes: ThreadSanitizer is a feature built into some compilers that does exactly this. Unfortunately, it doesn’t (yet) work in the CS 3410 RISC-V container. But if you like and you have a recent compiler set up on your host machine, you can enable ThreadSanitizer with -fsanitize=thread. For example, this will find the data race in our buggy example above (before we added the lock):

$ clang -g -fsanitize=thread threads-racy.c -o racy
$ ./racy
==================
WARNING: ThreadSanitizer: data race (pid=56484)
  Write of size 4 at 0x00016dd9efe0 by thread T2:
    #0 prime_thread threads-racy.c:28 (racy:arm64+0x100003c04)

  Previous write of size 4 at 0x00016dd9efe0 by thread T1:
    #0 prime_thread threads-racy.c:28 (racy:arm64+0x100003c04)
[...]

This error indicates that line 28 of threads-racy.c had a data race with itself.

Producer/Consumer Parallelism

Locks and critical sections are only one way to coordinate work between multiple threads. This section will build up toward a different style.

One limitation of our approach so far to dividing work into chunks is imbalance between threads. Our primality program, for example, takes as long as its slowest thread. Larger numbers take longer to check, so the earlier chunks will run faster than the later chunks. Dealing with this kind of imbalance is a major challenge in parallel programming.

One parallel programming technique to help automatically deal with imbalance is the producer/consumer pattern. The idea is that you will have one thread producing the work to do and \(n\) parallel threads consuming the work items and actually doing the work. You need a data structure to keep track of the work and to intermediate between the producer and the consumers.

We’ll start by designing that data structure and then build up to a new automatically-balancing implementation of our primality checker.

Circular Buffer

We need a queue data structure to intermediate between the producer and the consumers. The idea is that the producer will push work items onto the tail of the queue, and consumers will pop items from the head.

A sensible way to implement a bounded-size queue is with a circular buffer (a.k.a. a ring buffer). The idea is to allocate an array of \(n\) elements, and to hope that you never need to have more than \(n\) things in your queue at once. Then, you keep track of two indices: the head and the tail of the queue. They “wrap around” the \(n\)-element array.

Here’s a sample implementation of a bounded buffer without any parallelism involved. We’ll need a struct to keep track of the state:

typedef struct {
    int* data;
    int capacity;  // The size of the `data` array.
    int head;      // The next index to pop.
    int tail;      // The next index to push.
} bounded_buffer_t;

Here are the functions to push into and pop from the queue:

void bb_push(bounded_buffer_t* bb, int value) {
    assert(!bb_full(bb));
    bb->data[bb->tail] = value;
    bb->tail = (bb->tail + 1) % bb->capacity;
}

int bb_pop(bounded_buffer_t* bb) {
    assert(!bb_empty(bb));
    int value = bb->data[bb->head];
    bb->head = (bb->head + 1) % bb->capacity;
    return value;
}

The functions work by advancing the head or tail index by one and then “wrapping around” the capacity-sized array.

There is a critical detail here represented by the assert calls. (You can imagine simple implementations of bb_full and bb_empty: the buffer is empty if the head and tail indices are equal, for example.) We really don’t want to push into a full buffer or pop from an empty one. When we take this data structure into a parallel context, we will want to handle these conditions by waiting for some other thread to push or pop before proceeding with our own operation.
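
Here’s one way those helpers might look, using the common convention of sacrificing one array slot so that “full” and “empty” are distinguishable states (a sketch; the actual bb_full and bb_empty are not shown above):

bool bb_empty(bounded_buffer_t* bb) {
    return bb->head == bb->tail;
}

bool bb_full(bounded_buffer_t* bb) {
    // With this convention, the buffer holds at most `capacity - 1` items.
    return (bb->tail + 1) % bb->capacity == bb->head;
}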

A Simple Lock and Busy Waiting

One way to make the producer/consumer pattern work is to wrap all our accesses to the queue in a lock, just like any other shared data structure.

We’ll start by extending the queue data structure:

typedef struct {
    int* data;
    int capacity;  // The size of the `data` array.
    int head;      // The next index to pop.
    int tail;      // The next index to push.

    pthread_mutex_t* mutex;
    bool done;
} bounded_buffer_t;

We add a mutex to protect the buffer’s state, and also a done flag to signal to consumers that there are no more items coming. Next, we will implement variants of the bb_push and bb_pop functions that are safe to call from separate threads, and which block (wait) until they can succeed. Our goal is to write a couple of thread functions like this:

void* producer_thread(void* arg) {
    bounded_buffer_t* buf = (bounded_buffer_t*)arg;
    for (int i = 0; i < NUMBERS; ++i) {
        printf("producing %d\n", i);
        bb_block_push(buf, i);
    }
    bb_finish(buf);
    return NULL;
}

void* consumer_thread(void* arg) {
    bounded_buffer_t* buf = (bounded_buffer_t*)arg;
    while (1) {
        bool done;
        int number = bb_block_pop(buf, &done);
        if (done)
            break;
        printf("consuming %d\n", number);
    }
    return NULL;
}

The producer thread pushes the numbers 0 through NUMBERS-1 into the queue. Whenever the queue is full, bb_block_push should wait until there is room and then proceed.

The consumer thread pops one number at a time. The bb_block_pop call blocks until there is at least one item in the queue to consume or until the done flag becomes true, in which case the thread should shut down.

Let’s look at bb_block_push first:

void bb_block_push(bounded_buffer_t* bb, int value) {
    pthread_mutex_lock(bb->mutex);

    // Spin to wait until the queue has room to push.
    while (bb_full(bb)) {
        // Release the lock for a moment to let other threads proceed.
        pthread_mutex_unlock(bb->mutex);
        pthread_mutex_lock(bb->mutex);
    }

    // Actually do the push.
    bb_push(bb, value);

    pthread_mutex_unlock(bb->mutex);
}

This is a busy-waiting loop: we repeatedly check whether the queue has room, and when it finally does, we push. The tricky thing I’ve done here is to briefly unlock and relock the buffer’s mutex on every iteration. If we didn’t do this, no other thread could ever acquire the lock to pop, so we could never make progress.

The critical sections here (the regions between a pthread_mutex_lock and the matching pthread_mutex_unlock) are a little harder to see because of this trick. But they protect all the shared state: all the accesses to the buffer’s internal data happen with the lock held.

The bb_block_pop function looks somewhat similar:

int bb_block_pop(bounded_buffer_t* bb, bool* done) {
    pthread_mutex_lock(bb->mutex);

    // Spin to wait until queue has a value (or until we are done).
    while (bb_empty(bb) && !bb->done) {
        pthread_mutex_unlock(bb->mutex);
        pthread_mutex_lock(bb->mutex);
    }

    // Either we're done or we can pop.
    int value;
    if (bb->done) {
        *done = true;
        value = 0;
    } else {
        value = bb_pop(bb);
    }

    pthread_mutex_unlock(bb->mutex);
    return value;
}

One main difference here is that we also need to check for the done flag. Because it’s shared state, that access also needs to be protected by the buffer’s mutex.

This implementation totally works. It is a little sad that we had to resort to busy-waiting, though: it is inefficient to need to repeatedly acquire a lock to check a condition until it happens to change. This should be a clue that a mutex alone may not be the perfect tool for the job.

Condition Variables

This is a perfect use case for a different synchronization construct: a condition variable. You always pair a condition variable with a lock. Condition variables let you temporarily release the lock while you wait for other threads to change some condition you care about. In this case, the condition we need to wait for is the fullness or emptiness of the buffer.

The pthreads library provides a pthread_cond_t type for condition variables. Aside from initialization/destruction, there are three important operations:

  • pthread_cond_wait atomically releases a mutex and blocks until the condition variable is signaled; it reacquires the mutex before returning.
  • pthread_cond_signal wakes up at least one of the threads currently waiting on the condition variable (if there are any).
  • pthread_cond_broadcast wakes up all of the threads currently waiting on the condition variable.

An important thing to realize about the condition variable API is that it doesn’t say anything about whether an actual logical condition about your program is true or false. That’s up to you. It just handles the mechanics of waiting for the abstract idea of condition changes.

The Correct Way™ to use condition variables is to wait on them in a loop that checks your actual, logical condition to become true. Something like this:

pthread_mutex_lock(mutex);
while (!check_your_condition()) {
    pthread_cond_wait(cond, mutex);
}
do_stuff();  // Now you know `check_your_condition()` returned true.
pthread_mutex_unlock(mutex);

The specification for pthread_cond_wait allows for spurious wakeups: the call can sometimes return even when nobody signaled. That’s why it’s a good idea to always put your wait call in a loop that checks whether the condition actually changes. It also lets other threads “err on the side of signalling”: it is OK to signal a condition even if there’s a chance the logical condition did not actually change. Because you know all the waiting threads will double-check the condition in their loops, you can feel safe in signalling even when you don’t strictly need to.
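
The other half of the pattern is the thread that changes the condition: it updates the shared state while holding the lock and then signals. Something like this (a sketch; change_the_state is a placeholder for whatever makes the condition true):

pthread_mutex_lock(mutex);
change_the_state();  // Make `check_your_condition()` start returning true.
pthread_mutex_unlock(mutex);
pthread_cond_signal(cond);  // Or pthread_cond_broadcast to wake every waiter.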

Using Condition Variables in the Producer/Consumer Pattern

Let’s try replacing the busy waiting in our producer/consumer program with condition variables.

We will associate two pthread_cond_t condition variables with our buffer in its definition:

typedef struct {
    int* data;
    int capacity;  // The size of the `data` array.
    int head;      // The next index to pop.
    int tail;      // The next index to push.

    pthread_mutex_t* mutex;
    bool done;

    pthread_cond_t* full_cv;
    pthread_cond_t* empty_cv;
} bounded_buffer_t;

The two condition variables reflect two abstract states: whether the queue is full and whether it is empty. We’ll signal the full_cv condition variable when the buffer goes from full to non-full. Similarly, we’ll signal empty_cv when it goes from empty to non-empty.

Here’s what the push function looks like with condition variables:

void bb_block_push(bounded_buffer_t* bb, int value) {
    pthread_mutex_lock(bb->mutex);
    while (bb_full(bb)) {
        pthread_cond_wait(bb->full_cv, bb->mutex);
    }
    bb_push(bb, value);
    pthread_mutex_unlock(bb->mutex);
    pthread_cond_signal(bb->empty_cv);
}

The loop looks pretty similar; we just get to replace that unlock/lock pair with a pthread_cond_wait. The wait call appears in a loop that checks the actual logical condition. After the critical section finishes, we know that the queue’s emptiness may have changed, so we signal the empty_cv condition.

We can change the pop function in a similar way:

int bb_block_pop(bounded_buffer_t* bb, bool* done) {
    pthread_mutex_lock(bb->mutex);
    while (bb_empty(bb) && !bb->done) {
        pthread_cond_wait(bb->empty_cv, bb->mutex);
    }
    int value;
    if (bb->done) {
        *done = true;
        value = 0;
    } else {
        value = bb_pop(bb);
    }
    pthread_mutex_unlock(bb->mutex);
    pthread_cond_signal(bb->full_cv);
    return value;
}

This time, we need to signal the full_cv condition because, after this pop is done, the queue may have just gone from full to non-full.

The code is shorter this way, and the pthreads library can help put these threads to sleep while they’re waiting. Awesome!
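
One piece we never showed is bb_finish, which the producer calls to shut everything down. Here’s a sketch of how it might look: it sets the done flag inside the critical section and then wakes all waiting consumers. Broadcast is the right tool here because pthread_cond_signal might wake only one of them.

void bb_finish(bounded_buffer_t* bb) {
    pthread_mutex_lock(bb->mutex);
    bb->done = true;
    pthread_mutex_unlock(bb->mutex);
    // Every blocked consumer needs to notice `done`, so wake them all.
    pthread_cond_broadcast(bb->empty_cv);
}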

Deadlock

We have seen two types of concurrency bugs so far: atomicity violations and data races. This section is about a third kind. Deadlock is the name for the problem that happens when two threads get stuck waiting for each other.

Here’s the general scenario. Imagine a situation with two threads, T1 and T2, that need to use some sort of shared resources, R1 and R2. The program wants to prevent concurrent use: i.e., only one thread can be using a resource at a given time. Now imagine that T1 is currently using only R1 and T2 is currently using only R2. Next, imagine that T1 also wants to start using R2, and that T2 wants to start using R1. Because R2 is busy, T1 must wait for T2 to be done with it. Similarly, because R1 is busy, T2 must wait. Neither thread can make progress, so neither can relinquish its reservation on either resource. So we are stuck.

An Example

We can turn this abstract idea into real code using locks. We’ll spawn two threads, and use two locks (representing the shared resources R1 and R2 above). The program looks like this:

#include <stdio.h>
#include <pthread.h>

pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;

void* thread1(void* arg) {
    printf("Hello from thread 1!\n");
    pthread_mutex_lock(&lock1);
    /*** Potential deadlock here! ***/
    pthread_mutex_lock(&lock2);
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void* thread2(void* arg) {
    printf("Hello from thread 2!\n");
    pthread_mutex_lock(&lock2);
    /*** Potential deadlock here! ***/
    pthread_mutex_lock(&lock1);
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}

int main() {
    printf("Hello main!\n");

    pthread_t threads[2];
    pthread_create(&threads[0], NULL, thread1, NULL);
    pthread_create(&threads[1], NULL, thread2, NULL);
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);

    printf("Main is done!\n");
    return 0;
}

I’ve added a comment to mark the problematic point in both threads. If both threads were to reach that point at the same time, then thread1 would need to wait for thread2 to release lock2 and vice versa. Deadlock!

If you try to compile and run this example, however, it will be hard to make this potential deadlock manifest. You have to get unlucky with the relative progress of the two threads. If one thread happens to finish before the other one even gets started, for example, there’s no deadlock here.

This is the worst kind of concurrency bug: the kind that manifests rarely. If the bug happens every time, that’s not great, but at least you can find it, reproduce it, and fix it. If a bug manifests only once every N days or months, it’s hopeless: you can recreate exactly the same conditions that led to the bug and still not be able to trigger the behavior so you can inspect it. As one recent example, here’s a blog post from some Netflix engineers about an intermittent concurrency bug (not a deadlock, but the point still stands). In that story, it was easier to just periodically kill the problematic processes than to find and fix the bug.

Just so we can prove it’s a problem, we can force the deadlock to happen every time by synchronizing the threads at the problematic point. Like this:

void* thread1(void* arg) {
    printf("Hello from thread 1!\n");
    pthread_mutex_lock(&lock1);
    barrier();
    printf("Passed the barrier in thread 1!\n");
    pthread_mutex_lock(&lock2);
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void* thread2(void* arg) {
    printf("Hello from a thread 2!\n");
    pthread_mutex_lock(&lock2);
    barrier();
    printf("Hello from thread 2!\n");
    pthread_mutex_lock(&lock1);
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}

By using a barrier to ensure that both threads reach the point just before acquiring the second lock, we can make the deadlock manifest deterministically.
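
The barrier() helper is not a pthreads call; it’s something we have to supply. One way to implement it is with pthreads’ own barrier type (a sketch; pthread_barrier_t is part of POSIX but is missing on some platforms, notably macOS):

pthread_barrier_t the_barrier;

void barrier(void) {
    pthread_barrier_wait(&the_barrier);
}

// In main, before spawning the two threads:
// pthread_barrier_init(&the_barrier, NULL, 2);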

A Rule for Avoiding Deadlock

The crucial mistake that makes our example above deadlock is that the threads acquire the locks in different orders. thread1 has a lock1 critical section surrounding a lock2 critical section; thread2 acquires and releases the locks in the opposite order. Think about what would happen instead if both threads acquired lock1 and then, within that critical section, had a smaller lock2 critical section.

It turns out that you can use this observation to concoct a rule for avoiding deadlocks when using mutexes:

  1. Decide on a total order among all your mutexes.
  2. Always acquire the mutexes in that order.
  3. Always release them in opposite order.

A different way of describing the third element in the rule is that, when critical sections overlap, one should always entirely contain the other—they should never partially overlap. So this is OK:

pthread_mutex_lock(&lock1);
// do stuff with one lock
pthread_mutex_lock(&lock2);
// do more stuff with both locks
pthread_mutex_unlock(&lock2);
// do even more stuff with just lock1
pthread_mutex_unlock(&lock1);

But this is not, because neither critical section entirely contains the other:

pthread_mutex_lock(&lock1);
// do stuff with one lock
pthread_mutex_lock(&lock2);
// do more stuff with both locks
pthread_mutex_unlock(&lock1);
// do even more stuff with just lock2
pthread_mutex_unlock(&lock2);

If you always “scope” your critical sections, and you always acquire your locks in a consistent order, you can avoid deadlock that arises from locks.