Memory Safe Languages

This semester we have emphasized the importance of memory safety. Back in our fifth lecture when we introduced The Heap Laws, we claimed that following these laws was the hardest part of programming in C. Now that you have spent nearly an entire semester programming in C, I hope you can see why! You likely dealt with frustrating segmentation faults, searched for memory leaks, and perhaps even encountered a double free or two. All of these problems result in undefined behavior, meaning that anything could happen (like demons flying out of your nose). In the best case, your program crashes because it tried to do something it shouldn’t have. In the worst case though, your program contains extremely dangerous vulnerabilities that can be nigh impossible to find.

For example, the Morris worm relied on a buffer overflow vulnerability (among others) to spread itself across the entire Internet, causing between $100,000 and $10,000,000 in total economic impact. As a fun aside, the Morris worm was written by Robert Tappan Morris during his first year of graduate school here at Cornell University! The CrowdStrike outage of July 2024 is another prominent, recent example, where an out-of-bounds read prevented roughly 8.5 million Windows systems globally from booting. The worldwide economic impact of the outage has been estimated to be upwards of $10 billion. A 2019 study by Microsoft found that 70% of all the security vulnerabilities found in their software stemmed from memory safety issues. In 2020, Google reported that around 70% of all “serious security bugs are memory safety problems” in the Chromium project. Hopefully these few examples have illustrated how severe memory safety bugs can be.

Take a moment to reflect on the fact that these problems are really only possible in languages like C and C++, where the programmer (i.e., you!) is responsible for managing memory on the heap. In contrast, Python, Java, OCaml, Swift, Haskell, C#, Go, and Rust are all memory safe languages, meaning that they manage the heap automatically for you. This is not just a convenience; these languages can rule out these extremely dangerous memory bugs altogether. As we will shortly see, while they give up some performance or control to do so, programmers in these languages find these downsides to be an acceptable trade-off to avoid the extreme challenge posed by memory bugs. The rest of this lecture focuses on how these languages automatically manage dynamically allocated memory for you.

Garbage Collection

Garbage collection is a popular strategy that many languages (e.g., Java) use to automatically free dynamically allocated memory. A garbage collector is a system that searches the heap for memory blocks that were allocated by the program, but are no longer used. Garbage collection was invented by John McCarthy in 1959 for the LISP programming language.

The goal of a garbage collector is to find and free all memory that is unreachable (garbage) by the program at a given point in time. To do this, garbage collectors make the key insight that memory can be viewed as a directed graph, where the vertices are memory blocks and the edges are pointers or references between blocks. Each vertex can have an arbitrary number of edges pointing in and pointing out. For example, the integer 42 can have any number of pointers pointing to it, but because 42 is a value, not a reference, it wouldn’t have any outgoing edges. On the other hand, a struct or a Java object may have any number of incoming and outgoing edges. The graph may also contain cycles and self-loops.

Tracing Garbage Collection

The most common type of garbage collector is known as a tracing garbage collector. Usually when people refer to garbage collection, they are talking about tracing garbage collection. These garbage collectors employ a two-phase algorithm called mark-and-sweep to locate unreachable memory. In the mark phase all reachable memory is marked as, well, reachable. Then, in the sweep phase all memory that has not been marked as reachable is freed. Let’s take a closer look at each phase in turn.

The mark phase is concerned with figuring out which memory blocks are reachable. Informally, a block is reachable if there is a pointer to it or it is otherwise accessible. For example, we can assume that local and global variables are always accessible by the program. We call the set of memory blocks that we assume are always reachable the root set. We can now formally define reachability:

Reachable

A memory block is reachable if it is either:

  1. in the root set, or
  2. referenced (pointed to) by a block of memory that is reachable.

Tri-Color Marking

This definition of reachability essentially outlines how the mark phase distinguishes reachable memory blocks from unreachable ones. First, the garbage collector builds the directed graph of memory. Then, it uses some graph-traversal algorithm (such as DFS or BFS) to visit all the vertices reachable from the root set. While traversing the graph, the garbage collector colors each vertex it touches with one of three colors: white, grey, or black. The first time the collector visits a memory block, it colors it grey. Grey denotes the vertices which are reachable, but whose edges haven’t yet been fully explored. You can think of the grey vertices as a sort of “worklist” for the garbage collector. Once all the outgoing edges of a vertex have been explored, the vertex becomes black. Black vertices are fully explored, reachable memory blocks. All the remaining, unreachable memory blocks are left white. The mark phase terminates when all grey vertices have been exhausted.

At this point, all vertices in the graph are either black or white. The sweep phase then goes through the entire heap and frees all white memory blocks.
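To make the two phases concrete, here is a minimal sketch of mark-and-sweep in Rust over a toy heap. Everything here is invented for illustration: memory blocks are just indices, and each block’s outgoing pointers are an adjacency list.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Color {
    White, // not (yet) known to be reachable
    Grey,  // reachable, outgoing edges not yet explored
    Black, // reachable and fully explored
}

/// Returns the indices of the memory blocks that survive collection.
fn mark_and_sweep(heap: &[Vec<usize>], roots: &[usize]) -> Vec<usize> {
    let mut color = vec![Color::White; heap.len()];

    // Mark phase: the grey vertices act as the worklist.
    let mut worklist: Vec<usize> = roots.to_vec();
    for &r in roots {
        color[r] = Color::Grey;
    }
    while let Some(block) = worklist.pop() {
        for &succ in &heap[block] {
            if color[succ] == Color::White {
                color[succ] = Color::Grey;
                worklist.push(succ);
            }
        }
        color[block] = Color::Black;
    }

    // Sweep phase: everything still white is garbage and would be freed;
    // here we just report the black (surviving) blocks.
    (0..heap.len()).filter(|&b| color[b] == Color::Black).collect()
}

fn main() {
    // Blocks 0..=5; block i points to the blocks listed in heap[i].
    let heap = vec![
        vec![1], // 0 -> 1 (a root)
        vec![2], // 1 -> 2
        vec![],  // 2
        vec![4], // 3 -> 4 (unreachable)
        vec![3], // 4 -> 3 (unreachable cycle with 3)
        vec![],  // 5 (unreachable)
    ];
    let live = mark_and_sweep(&heap, &[0]);
    println!("surviving blocks: {live:?}"); // surviving blocks: [0, 1, 2]
}
```

Note that the grey set is literally the worklist: a block is grey from the moment it is discovered until all of its outgoing edges have been explored. Blocks 3 and 4 form a cycle, yet the collector still frees them because nothing in the root set reaches them.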

Example

Let’s see an example of a mark-and-sweep garbage collector in action! Below is a graph of all memory blocks that currently exist in a program. There are two root nodes on the left-hand side (e.g., local variables).

The first step is to color the root nodes grey.

Next, the garbage collector explores all of the outgoing edges from all of the grey vertices until all the grey nodes have been exhausted.

The last step is for the garbage collector to dispose of all the garbage (i.e., the white vertices).

Reference Counting

Another popular strategy for automatic memory management is reference counting. In comparison to the mark-and-sweep algorithm, reference counting is pretty simple! Instead of periodically searching for unreachable memory, reference counting keeps a tally of how many references (e.g., pointers) each memory block has. Whenever a new reference is created, the tally is incremented. Similarly, when a reference is deleted the tally is decremented. When the tally reaches zero (i.e., there are no references/pointers pointing at the memory block), the memory block is freed.
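As a concrete illustration, Rust’s standard-library Rc (“reference counted”) smart pointer implements exactly this scheme and exposes the tally, so we can watch it change:

```rust
use std::rc::Rc;

fn main() {
    let a = Rc::new(5); // one reference: the tally starts at 1
    assert_eq!(Rc::strong_count(&a), 1);

    let b = Rc::clone(&a); // a new reference is created: tally becomes 2
    assert_eq!(Rc::strong_count(&a), 2);

    drop(b); // a reference is destroyed: tally drops back to 1
    assert_eq!(Rc::strong_count(&a), 1);

    println!("final tally: {}", Rc::strong_count(&a)); // final tally: 1
    // When `a` goes out of scope the tally hits 0 and the 5 is freed.
}
```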

Example

Let’s work through an example together. Consider the graph below depicting the layout of memory at some point in a program.

Square boxes around a “P” denote local pointer variables. Vertices A-H are memory blocks located on the heap. Take a moment to count how many references currently exist for memory blocks A-H. Once you’ve given it a go, you may check your answer below.

Answer

Now suppose that the reference inside of memory block A currently pointing at memory block B is updated to point to memory block G, shown below in red.

By updating this pointer, a reference to G was created and a reference to B was destroyed. So, memory block G’s reference tally is incremented to 2 and memory block B’s tally is decremented to 0 (shown below).

Since B’s tally is now zero, its memory is freed. However, by doing so one of C’s incoming references has been destroyed! Whenever a memory block is freed, reference counting recursively updates the tallies of all memory blocks that were referenced by the freed memory block. So, memory block C’s tally is decremented to 1, as shown below in red.

At this point, all the reference counts are updated and all memory blocks with a tally of 0 have been freed. However, we have a problem: memory blocks C-E are unreachable from the rest of the program’s memory but they haven’t been freed. Worse, they will never be freed, resulting in a memory leak. This example highlights the key disadvantage of reference counting: it is unable to handle cycles. Because memory blocks C-E form a cycle, their reference counts will never drop below 1. Therefore, their memory will never be freed. For this reason, languages that use reference counting (e.g., Python) often also use a garbage collector to deal with cycles.

Garbage Collecting vs. Reference Counting

We just discussed one of the key downsides of reference counting over garbage collection, namely that reference counting struggles with cyclical references. Garbage collection avoids this issue by directly checking whether each node is reachable from the root set.

Another key distinction between these two techniques is when each is run. Garbage collection is run periodically; it can run when memory is low, when it is manually triggered, or simply on a schedule. However, when it runs it must pause execution of the program. If it didn’t, the program might modify the edges of the memory graph while the garbage collector is traversing the graph. This could result in memory errors as the garbage collector might inadvertently free memory that was just made reachable. As you might expect, pausing the program to run garbage collection can have significant performance impacts. It can also be difficult or impossible to predict when garbage collection may run, causing issues for timing-sensitive programs. In comparison, reference counting updates tallies as soon as a pointer is created or destroyed. While this still affects performance, the benefit is that memory is freed as soon as it is no longer referenced.

Garbage collection and reference counting also differ in the amount of metadata each must manage. Garbage collection only needs to store the “color” of each object while it is running; this mark can be as small as a single bit. Reference counting, on the other hand, needs to store a tally (i.e., an integer) for every object.

The last difference I’ll highlight is that reference counting is much simpler to implement than garbage collection. There are many, many variations of the naive mark-and-sweep algorithm discussed above. Further, it can be easier to estimate the performance impact of reference counting than that of garbage collection, as reference counting is more predictable. Ultimately, the choice between these two methods depends on the specific needs and constraints of the application, balancing the trade-offs between implementation complexity, performance, and memory management efficiency.

Rust

Up until now we have been discussing strategies for automatically managing memory at runtime. Dynamic, automated memory management techniques, such as garbage collection and reference counting, generally introduce a non-trivial amount of overhead which can negatively affect performance. For example, in 2017 [one paper][pereira2017] measured the energy efficiency of many popular programming languages, from C/C++ to Python and Java. They found that C was the most energy efficient language, primarily because C doesn’t have the overhead that (most) memory safe languages do. The one exception to this rule is Rust.

Rust is a strongly typed, compiled, memory safe, systems-oriented programming language first released in 2012. Rust’s killer feature is that memory is managed at compile-time rather than runtime. That is, the compiler knows where to insert de-allocation calls (i.e., free()). This results in the best of both worlds — a memory safe language without the runtime performance impacts of garbage collection and/or reference counting! Additionally, because these memory errors are caught at compile-time rather than surfacing as undefined behavior at runtime, Rust programs also tend to exhibit greater reliability and stability than C/C++ programs.

There is no such thing as a free lunch, though. Rust requires the programmer to follow certain ownership rules. These rules — which are checked by the compiler — encourage memory-safe programming and allow the compiler to accurately determine where to allocate and deallocate memory.

Ownership

Ownership is Rust’s “secret sauce” for how it efficiently manages memory at compile-time. In Rust, all data has a single owner in the form of a variable. Only the data’s owner can access it. Then, when the variable goes out of scope the memory associated with the variable is deallocated. Let’s see a few examples.

fn increment(x: i32) -> i32 {
  x + 1
}

fn main() {
  let n = 5;
  let y = increment(n);
  println!("The value of y is: {y}");
}

The program above is simple: it initializes the variable n with the value 5, calls increment() with the argument n which just returns n+1, and prints this value out. A few notes:

  • All memory in this program is stored on the stack, just like in C.
  • In Rust, if the last line of a function’s body doesn’t end in a semicolon, the expression is implicitly returned. So, the increment() function’s body could also be written with an explicit return x + 1; statement.
  • An i32 is a signed, 32-bit integer. In comparison, a u32 is an unsigned, 32-bit integer.

Let’s trace the ownership of the value 5. First, 5 belongs to the variable n. Next, when increment(n) is called, ownership is transferred or moved to the variable x in the increment() function. Then, when the function returns, ownership is moved to the variable y. Lastly, ownership is moved for a final time when the println! macro is called. (Strictly speaking, because i32 implements Rust’s Copy trait, each of these moves is actually a cheap copy and n remains usable after the call; true ownership transfers are easier to see with heap-allocated data, as in the examples below.)
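One subtlety: because i32 implements Rust’s Copy trait, the value 5 is actually copied at each of these steps rather than moved, which is why the variation below (which keeps using n after the call) still compiles and runs:

```rust
fn increment(x: i32) -> i32 {
    x + 1
}

fn main() {
    let n = 5;
    // i32 implements Copy, so the 5 is copied into x rather than moved.
    let y = increment(n);
    // n is therefore still valid here.
    println!("n = {n}, y = {y}"); // n = 5, y = 6
}
```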

Now let’s see an example with dynamic memory allocation on the heap.

fn make_and_drop() {
  let a_box = Box::new(5);
}

fn main() {
  let a_num = 4;
  make_and_drop();
}

In Rust, a Box is a type that allocates memory for and stores the value it is given on the heap. So, Box::new(5) allocates memory for an integer on the heap and stores the value 5 in it. The owner of this heap data is the variable a_box. However, notice that a_box is a local variable in the make_and_drop() function. When the make_and_drop() function exits, a_box goes out of scope and the Box containing 5 is deallocated (or dropped, in Rust terminology). Therefore, all the make_and_drop() function does is allocate some memory on the heap, place a value there, and then free that memory.
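We can watch this deallocation happen by giving the boxed value a Drop implementation, Rust’s equivalent of a destructor. The Noisy type and the drop counter below are invented for the demo:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many values have been dropped so far.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Noisy;

impl Drop for Noisy {
    // Runs at the moment the owning variable goes out of scope.
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn make_and_drop() {
    let _a_box = Box::new(Noisy);
    // _a_box goes out of scope here: the Box and its heap memory are freed.
}

fn main() {
    assert_eq!(DROPS.load(Ordering::SeqCst), 0);
    make_and_drop();
    // The drop ran exactly when make_and_drop() returned.
    assert_eq!(DROPS.load(Ordering::SeqCst), 1);
    println!("drops so far: {}", DROPS.load(Ordering::SeqCst)); // drops so far: 1
}
```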

Many other data types in Rust also store their data on the heap, for instance Strings.

fn greet(mut name: String) -> String {
  name.insert_str(0, "Hi, ");
  name.push_str("!");
  name
}

fn main() {
  let name = String::from("Zach");
  let greeting = greet(name);
  println!("{}", greeting);
  println!("Bye, {name}!");
}

The above program is a bit more complicated, so let’s step through it together. First, name is initialized to the String "Zach". In Rust, a String is a mutable string stored on the heap. This String is then given as the argument to the greet() function. The greet() function then modifies name by inserting the prefix "Hi, " and the suffix "!" before returning the updated String. Lastly, the program prints the (just created) greeting and says goodbye to the user. When the main() function exits, the memory associated with name and greeting is deallocated.

This is what would happen if the above program were accepted by the Rust compiler. Unfortunately for us, Rust would reject this program as it is not memory safe. Recall that a String is mutable, meaning we can insert and remove characters, and that it is stored on the heap. When "Hi, " is inserted at the beginning of name, more memory might have to be allocated for name. In fact, if this were to happen, a fresh, larger memory block would first be allocated, the old data would then be copied into the new memory block, and lastly the old memory block would be freed. This means that the data that name was pointing to back in the main() function may no longer exist (i.e., name could be a dangling pointer). So, Rust would return a compiler error flagging the last line of the above program.
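The reallocation described here is observable through a String’s capacity. The sketch below starts with a buffer sized exactly for "Zach" and shows the buffer growing when "Hi, " is inserted (the exact growth amount is an implementation detail):

```rust
fn main() {
    // Reserve a heap buffer with room for 4 bytes.
    let mut name = String::with_capacity(4);
    name.push_str("Zach");

    // Inserting "Hi, " brings the length to 8 bytes. If that exceeds
    // the current capacity, the String allocates a fresh, larger block,
    // copies the old bytes over, and frees the old block. Any raw
    // pointer into the old block would now dangle.
    name.insert_str(0, "Hi, ");

    assert_eq!(name, "Hi, Zach");
    assert!(name.capacity() >= 8); // the buffer had to grow
    println!("length {}, capacity {}", name.len(), name.capacity());
}
```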

To fix this, we need to keep the data associated with name separate from the data that we provide to the greet() function. There are many ways to do this, but one simple way is to clone the data. The program below does just that and will be accepted by Rust’s compiler.

fn greet(mut name: String) -> String {
  name.insert_str(0, "Hi, ");
  name.push_str("!");
  name
}

fn main() {
  let name = String::from("Zach");
  let name_clone = name.clone();
  let greeting = greet(name_clone);
  println!("{}", greeting);
  println!("Bye, {name}!");
}

References

While cloning data is a quick and easy fix, it is inefficient. Ideally, we would like to reuse name, but Rust’s ownership rules won’t let us. This is where references come in.

A reference is a non-owning pointer. References allow us to provide temporary access to a variable without transferring ownership. For example, the program below uses references — denoted with an ampersand — to print the same strings as before.

fn greet(name: &String) {
  println!("Hi, {name}!");
}

fn main() {
  let name = String::from("Zach");
  greet(&name);
  println!("Bye, {name}!");
}

Now when we call greet() we pass it &name instead of name. Similar to C, by prefixing name with an ampersand we are creating a reference to name. Since references don’t own the data they point to, we don’t get an error when we say goodbye to the user.

However, there is a catch. In Rust, variables and references are either immutable or explicitly marked as mutable. Immutable references are read-only aliases to some data. They cannot be used to write to or in any way modify the data they point to. Mutable references, on the other hand, can read and write the data they point to. Still, neither owns the data it points to.

To prevent memory errors, Rust restricts how many references there can be to a single piece of data. Specifically, in any scope there can be either:

  1. any number of immutable references, or
  2. at most one mutable reference referring to the same variable.

It is the job of Rust’s borrow checker to enforce these rules.
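A small (made-up) program shows both halves of the rule in action: the immutable references coexist, and the mutable reference is only created after the immutable ones are no longer used.

```rust
fn main() {
    let mut message = String::from("hello");

    // Rule 1: any number of immutable references may coexist.
    let r1 = &message;
    let r2 = &message;
    println!("{r1} and {r2}");

    // Rule 2: once r1 and r2 are no longer used, one (and only one)
    // mutable reference may be created.
    let m = &mut message;
    m.push_str(", world");

    assert_eq!(message, "hello, world");
    println!("{message}"); // hello, world

    // Creating another reference while `m` is still in use, e.g.
    //     let r3 = &message; println!("{r3} {m}");
    // would be rejected by the borrow checker.
}
```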

Rust Resources

Hopefully this quick introduction to Rust has piqued your interest enough to learn more! If so, here are some handy resources to start with:

  • The Rust Programming Language is the official, free, online textbook for Rust. It is the best place to get started learning Rust.
  • The Rust website contains many links to other learning resources and instructions for installing Rust.
  • The Rust playground is an online Rust environment that you can use to play around with small Rust programs. For example, here is a link to a playground with the code from earlier!
  • Rust by Example provides many examples of all the major features of Rust. It can be helpful to quickly get a feel for the language.