Lecture 24: Garbage Collection

Administrivia

Prelim post mortem: class average = 84 (!). People had the most trouble, oddly enough, with the argv problem. This pretty much guarantees that a problem like it will be on the final.

What does SML do?

Any guesses? It turns out that SML does nothing halfway clever for allocation, because its deallocation scheme is so good. The deallocation scheme guarantees that free space is contiguous, so allocation just grabs the next piece of storage!
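To make "grab the next piece of storage" concrete, here is a minimal sketch of such an allocator. It is a toy model, not SML's actual runtime: the class name BumpAllocator, the word-indexed heap, and the sizes are all assumptions made for illustration.

```python
class BumpAllocator:
    """Toy heap where free space is one contiguous region [next_free, size)."""

    def __init__(self, size):
        self.size = size        # total number of words in the heap
        self.next_free = 0      # everything at or above this index is free

    def allocate(self, n_words):
        """Return the start address of a fresh block of n_words words."""
        if self.next_free + n_words > self.size:
            raise MemoryError("heap exhausted; a collection would run here")
        address = self.next_free
        self.next_free += n_words   # the whole allocation is one addition
        return address


heap = BumpAllocator(size=16)
inner = heap.allocate(2)   # e.g. a pair such as (2, 3)
outer = heap.allocate(2)   # e.g. a pair holding 1 and a pointer to inner
print(inner, outer, heap.next_free)   # prints: 0 2 4
```

Allocation never searches for a hole of the right size, which is only possible because the deallocation scheme keeps the free space contiguous.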

Deallocation

What about deallocation? It is easy for memory to become inaccessible. Consider:

let val x = (1, (2, 3))
    val y = #2 x
in
    y
end

If we draw what happens in the environment model when we evaluate this code, we see that x points to a pair that points to another pair. The variable y, and the result of the whole expression, is bound to the second pair, but the first pair is no longer accessible after the whole let expression finishes evaluating. This first pair is garbage.

Reachability

Any boxed value in our environment-based model of SML can become garbage. This includes tuples, records, datatype values, strings, arrays, and function closures.

Most garbage collectors are based on the idea of reclaiming whole blocks that are no longer reachable from a set of roots, which are pointers into the heap that are assumed to always be accessible. The roots of a given computation consist of pointers that appear in the environment, plus the pointer to the currently computed result. A block of memory is reachable from the roots if there is a direct pointer to that block among the roots, or if there is a pointer to that block in another block that is reachable from the roots.

Recall that the environment is implemented as a stack. So anything reachable from the stack is considered not to be garbage. Anything not reachable from the stack cannot be accessed by any future computation, so it is garbage. It cannot be accessed by any future computation because our box-and-pointer model of computation doesn't permit any way to get at a box to which all pointers have been lost.

Looking at memory more abstractly, we see that the memory heap is simply a directed graph in which the nodes are blocks of memory and the edges are the pointers between these blocks. So reachability can be computed as a graph traversal.
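The graph view can be sketched in a few lines. This is an illustrative model only: the heap is assumed to be a dictionary from block names to the names of the blocks they point to, and the names "outer" and "inner" are made up to stand for the two pairs in the let example above.

```python
def reachable(heap, roots):
    """Return the set of block names reachable from the roots.

    heap maps a block name to the list of block names it points to.
    """
    seen = set()
    stack = list(roots)          # explicit stack: depth-first traversal
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(heap[block])
    return seen


# After the let finishes, the only root is its result, the inner pair.
heap = {"outer": ["inner"], "inner": []}
live = reachable(heap, roots=["inner"])
garbage = set(heap) - live
print(sorted(live), sorted(garbage))   # prints: ['inner'] ['outer']
```

Everything outside the returned set is garbage by the definition above, whether or not other garbage still points to it.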

Explicit vs. automatic garbage collection

There are two basic strategies for dealing with garbage: explicit garbage collection by the programmer, and automatic garbage collection built into the language run-time system. Explicit garbage collection is provided by languages like C and C++, which give the programmer a way to explicitly deallocate (or "free") allocated memory when that memory is expected to become garbage. Languages like Java and SML provide automatic garbage collection: the system automatically identifies blocks of memory that can never be used again by the program, and reclaims their space for use by later allocations.

Automatic garbage collection offers the advantage that the programmer does not have to worry about when to deallocate a given block of memory. In languages like C the need to explicitly manage memory complicates any code that allocates data on the heap, and is a significant burden on the programmer. Worse, if the programmer fails to deallocate properly, bugs are introduced into the program that are hard to find:

· If the programmer neglects to deallocate some garbage, it creates a memory leak in which some allocated memory can never again be reused. This is a problem for long-running programs, which tend to grow in size until they consume all of memory.

· If the programmer is too aggressive and deallocates a block of memory that is still in use, this creates a dangling pointer that may be followed later even though it now points to unallocated memory or to a newly allocated value that may be of a different type.

· If a block of memory is deallocated twice, this typically corrupts the memory heap data structure even if the block was initially garbage. Corruption of the memory heap is likely to cause unpredictable effects later during execution and be difficult to debug.
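The double-free case is easy to see in a toy model. The sketch below is not how any particular C library implements malloc and free; FreeListHeap and its fixed-size numbered blocks are assumptions chosen to keep the example small.

```python
class FreeListHeap:
    """Toy explicit allocator over fixed-size blocks numbered 0..n-1."""

    def __init__(self, n_blocks):
        self.free_list = list(range(n_blocks))

    def malloc(self):
        return self.free_list.pop()     # hand out some free block

    def free(self, block):
        # Like a simple allocator, this does not check whether the
        # block is already on the free list.
        self.free_list.append(block)


heap = FreeListHeap(4)
p = heap.malloc()
heap.free(p)
heap.free(p)            # double free: p is now on the free list twice

a = heap.malloc()
b = heap.malloc()
print(a == b)           # prints: True
```

Two allocations that the program believes are distinct now share one block, so a write through one silently changes the other. The bug shows up far from the second free, which is why it is hard to debug.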

In practice, programmers manage explicit allocation and deallocation by keeping track of what piece of code "owns" each pointer in the system. That piece of code is responsible for deallocating the pointer later. The tracking of pointer ownership shows up in the specifications of code that manipulates pointers, complicating the specification, use, and implementation of the abstraction.

Automatic garbage collection helps modular programming, because two modules can share a value without having to agree on which module is responsible for deallocating it. The details of how boxed values will be managed do not pollute the interfaces in the system.

Requirements for automatic garbage collection

Many programs written in SML (and Java) generate garbage at a high rate, so it is important to have an effective way to collect the garbage. The following properties are desirable in a garbage collector:

· It should identify most garbage.
· Anything it identifies as garbage must be garbage.
· It should impose a low added time overhead.
· During garbage collection the program may be paused; these pauses should be short.

Fortunately modern garbage collectors provide all of these important properties. We will not have time for a complete survey of modern garbage collection techniques, but we can look at some simple garbage collectors.

Reference counting

A simple technique for automatic garbage collection that is occasionally used is reference counting. The idea is to keep track, for each block of memory, of how many pointers point to that block. When the count goes to zero, the block must be unreachable and can be deallocated.

There are a few problems with this conceptually simple solution:

· It imposes a lot of run-time overhead, because each time a pointer is updated, the reference counts of two blocks of memory must be updated (one incremented, one decremented).
· It can cause long pauses, because deallocating one object can cause a cascade of other objects to be deallocated at the same time (yet there is no way to defer the pause this introduces, as there is in a mark-and-sweep collector).
· It cannot collect garbage that lies in a cycle in the heap graph, because the reference counts in the cycle never go down to zero.
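The cascade and the cycle problem can both be seen in a small simulation. This is a sketch under simplifying assumptions: Block, add_ref, drop_ref, and point are hypothetical names, roots are modeled by calling add_ref directly, and "deallocation" just records the block's name in a list.

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.count = 0        # number of incoming pointers
        self.fields = []      # outgoing pointers to other blocks


freed = []                    # names of deallocated blocks, in order


def add_ref(block):
    block.count += 1


def drop_ref(block):
    block.count -= 1
    if block.count == 0:
        freed.append(block.name)
        for child in block.fields:   # cascade: freeing drops children
            drop_ref(child)


def point(src, dst):
    """Store a pointer to dst in src."""
    src.fields.append(dst)
    add_ref(dst)


# A chain a -> b is reclaimed in a cascade once the root lets go of a.
a, b = Block("a"), Block("b")
add_ref(a)            # a root points to a
point(a, b)
drop_ref(a)
print(freed)          # prints: ['a', 'b']

# A cycle c <-> d is never reclaimed: each keeps the other's count at 1.
c, d = Block("c"), Block("d")
add_ref(c)            # a root points to c
point(c, d)
point(d, c)
drop_ref(c)           # root gone, but c.count is still 1
print(c.count, d.count, freed)   # prints: 1 1 ['a', 'b']
```

After the last line nothing in the program can reach c or d, yet neither count is zero, so a pure reference-counting scheme leaks them.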

Tag bits

To compute reachability accurately, the garbage collector needs to be able to identify pointers. Since a word of memory is just a sequence of bits, how can the garbage collector tell a pointer apart from, say, an integer? One simple strategy is to reserve a bit in every word to indicate whether the value in that word is a pointer or not. This tag bit uses up about 3% of memory, which may be acceptable. It also limits the range of integers (and pointers) that can be used. On a 32-bit machine, using a single tag bit means that integers can go up to about 1 billion, and that the machine can address about 2GB instead of the 4GB that would otherwise be possible. Adding tag bits also introduces a small run-time cost that is incurred during arithmetic or when dereferencing a pointer.
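The arithmetic behind these numbers can be sketched as follows. The encoding here is an assumption, since the notes do not say which bit is the tag: the low bit is used, with 1 meaning integer and 0 meaning pointer. The function names are hypothetical.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def tag_int(n):
    """Encode a 31-bit signed integer as a 32-bit word with low bit 1."""
    assert -(1 << 30) <= n < (1 << 30), "does not fit in 31 bits"
    return ((n << 1) | 1) & WORD_MASK


def untag_int(word):
    value = word >> 1                 # drop the tag bit
    if value >= (1 << 30):            # restore the sign
        value -= 1 << 31
    return value


def is_pointer(word):
    return word & 1 == 0              # low bit 0 marks a pointer


def tagged_add(w1, w2):
    # (2a+1) + (2b+1) - 1 = 2(a+b) + 1: one extra subtraction per add
    return (w1 + w2 - 1) & WORD_MASK


print((1 << 30) - 1)                                  # prints: 1073741823
print(untag_int(tagged_add(tag_int(5), tag_int(-7)))) # prints: -2
print(is_pointer(tag_int(5)), is_pointer(0x10010))    # prints: False True
```

The largest representable integer is 2^30 - 1, which is the "about 1 billion" mentioned above, and the extra subtraction in tagged_add is an instance of the small run-time cost during arithmetic.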

A different solution is to have the compiler record information that the garbage collector can query at run time to find out the types of the various locations on the stack. Given the types of stack locations, the successive pointers can be followed from these roots and the types used at every step to determine where the pointers are. This approach avoids the need for tag bits but is substantially more complicated because the garbage collector and the compiler become more tightly coupled.

Finally, it is possible to build a garbage collector that works even if you can't tell apart pointers and integers. The idea is that if the collector encounters something that looks like it might be a pointer, it treats it as if it is one, and the memory block it points to is treated as reachable. Memory is considered unreachable only if there is nothing that looks like it might be a pointer to it. This kind of collector is called a conservative collector because it may fail to collect some garbage, but it won't deallocate anything but garbage. In practice it works pretty well because most integers are small and most pointers look like large integers. So there are relatively few cases in which the collector is not sure whether a block of memory is garbage.
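The "looks like it might be a pointer" test can be sketched as a simple membership check. This is a deliberately simplified model: the addresses are invented, only pointers to the start of a block are recognized, and real conservative collectors use more elaborate tests.

```python
def conservative_roots(stack_words, block_starts):
    """Return the blocks that some stack word might point to.

    stack_words: raw integers found on the stack (types unknown).
    block_starts: set of addresses where allocated blocks begin.
    """
    return {w for w in stack_words if w in block_starts}


block_starts = {0x10000, 0x10010, 0x10020}
stack_words = [3, 42, 0x10010, 65536]     # 65536 == 0x10000, an integer
kept = conservative_roots(stack_words, block_starts)
print(sorted(hex(a) for a in kept))       # prints: ['0x10000', '0x10010']
```

The integer 65536 happens to equal a block address, so that block is retained even if it is really garbage. Small integers such as 3 and 42 never collide with heap addresses, which is why this works well in practice.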

Mark and sweep collection

Mark-and-sweep proceeds in two phases: a mark phase in which all reachable memory is marked as reachable, and a sweep phase in which all memory that has not been marked is deallocated. This algorithm requires that every block of memory have a bit reserved in it to indicate whether it has been marked.

Marking for reachability is essentially a graph traversal; it can be implemented as either a depth-first or a breadth-first traversal, though depth-first traversal is likely to be faster.

In the sweep phase all unmarked blocks are deallocated. This phase requires the ability to find all the allocated blocks in the memory heap, which is possible with a little more bookkeeping information per block.

The key limitation of mark-sweep is that it has to look through all of memory. In practice, most of memory ends up being garbage, so this is wasteful.
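The two phases can be sketched over a toy heap. The Block class, the explicit list of all blocks (standing in for the per-block bookkeeping), and the names a through d are assumptions for illustration.

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.fields = []      # pointers to other blocks
        self.marked = False   # the mark bit


def mark(roots):
    stack = list(roots)       # depth-first traversal with explicit stack
    while stack:
        block = stack.pop()
        if not block.marked:
            block.marked = True
            stack.extend(block.fields)


def sweep(all_blocks):
    """Return the surviving blocks and clear their mark bits."""
    live = [blk for blk in all_blocks if blk.marked]
    for blk in live:
        blk.marked = False    # reset for the next collection
    return live


a, b, c, d = (Block(n) for n in "abcd")
a.fields.append(b)
c.fields.append(d)
d.fields.append(c)            # c <-> d is an unreachable cycle
heap = [a, b, c, d]

mark(roots=[a])
heap = sweep(heap)
print([blk.name for blk in heap])   # prints: ['a', 'b']
```

Unlike reference counting, the unreachable cycle is reclaimed, because the mark phase never visits it. Note that sweep walks every block in the heap, live or not, which is the limitation just described.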

Triggering garbage collection

When should the garbage collector be invoked? An obvious choice is to do it whenever the process runs out of memory. However, this may create an excessively long pause for garbage collection. Also, it is likely that memory is almost completely full of garbage when garbage collection is invoked. This will reduce overall performance and may also be unfair to other processes that happen to be running on the same computer. Typically, garbage collectors are invoked periodically, perhaps after a fixed number of allocation requests are made, or a number of allocation requests that is proportional to the amount of non-garbage (live) data after the last GC was performed.
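The last policy, a threshold proportional to the live data, might look like the sketch below. The class GcTrigger and the parameters factor and minimum are hypothetical, and their default values are arbitrary.

```python
class GcTrigger:
    """Decide when to collect, based on allocations since the last GC."""

    def __init__(self, factor=2, minimum=4):
        self.factor = factor          # allocations allowed per live block
        self.minimum = minimum        # floor so a tiny heap still collects
        self.threshold = minimum
        self.since_last_gc = 0

    def on_allocate(self):
        """Record one allocation; return True if a GC should run now."""
        self.since_last_gc += 1
        return self.since_last_gc >= self.threshold

    def on_gc_finished(self, live_blocks):
        self.since_last_gc = 0
        self.threshold = max(self.minimum, self.factor * live_blocks)


trigger = GcTrigger()
fired = [trigger.on_allocate() for _ in range(4)]
trigger.on_gc_finished(live_blocks=10)
print(fired, trigger.threshold)   # prints: [False, False, False, True] 20
```

A program with a lot of live data collects less often, so the cost of each traversal is spread over proportionally more allocations.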

Reducing GC pauses

One problem with mark-and-sweep is that it can take a long time -- it has to scan through the entire memory heap. While it is going on, the program is usually stopped. Thus, garbage collection can cause long pauses in the computation. This can be awkward if, for example, one is relying on the program to help pilot an airplane. To address this problem there are incremental garbage collection algorithms that permit the program to keep computing on the heap in parallel with garbage collection, and generational collectors that only compute whether memory blocks are garbage for a small part of the heap.

Compacting garbage collection

Collecting garbage is nice, but the space that it creates may be scattered among many small blocks of memory. This external fragmentation may prevent the space from being used effectively. A compacting collector is one that tries to move the blocks of allocated memory together, compacting them so that there is no unused space between them. Compacting collectors tend to cause caches to become more effective, improving run-time performance after collection.
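External fragmentation is easy to quantify in a toy model. Here the heap is assumed to be a list of booleans, one per word, and compact only models where the used words end up, not the pointer updates a real collector would need.

```python
def largest_hole(heap):
    """heap is a list of booleans: True = word in use, False = free."""
    best = run = 0
    for used in heap:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


def compact(heap):
    """Slide all used words to the front, leaving one free region."""
    used = sum(heap)
    return [True] * used + [False] * (len(heap) - used)


heap = [True, False, True, False, False, True, False, True]
print(heap.count(False), largest_hole(heap), largest_hole(compact(heap)))
# prints: 4 2 4
```

Before compaction there are four free words but no request larger than two words can be satisfied. Afterwards the same four words form one usable region.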

A simple compacting collector is based on mark-sweep: you divide memory in half (OLD and NEW) and move things in use into NEW. But what happens in the middle of the mark phase, when you chase a pointer to something in OLD that has been moved to NEW?

More broadly, compacting collectors are difficult to implement because they change the locations of the objects in the heap. This means that all pointers to moved objects must also be updated. Finding all these pointers can be expensive and requires added storage or time.
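One standard answer to the question of what to do on reaching an already-moved object, which these notes do not spell out, is to leave a forwarding pointer in the old copy. The sketch below assumes that technique, with a Cheney-style scan of NEW. Obj, copy, and collect are hypothetical names, and Python lists stand in for the two halves of memory.

```python
class Obj:
    def __init__(self, name, fields=()):
        self.name = name
        self.fields = list(fields)   # pointers to other objects
        self.forward = None          # set once the object has moved


def copy(obj, new_space):
    """Move obj to NEW once; later visits follow the forwarding pointer."""
    if obj.forward is None:
        clone = Obj(obj.name, obj.fields)   # fields still point into OLD
        obj.forward = clone
        new_space.append(clone)
    return obj.forward


def collect(roots):
    new_space = []
    new_roots = [copy(r, new_space) for r in roots]
    scan = 0
    while scan < len(new_space):            # scan objects already in NEW
        obj = new_space[scan]
        obj.fields = [copy(f, new_space) for f in obj.fields]
        scan += 1
    return new_roots, new_space


shared = Obj("shared")
a = Obj("a", [shared])
b = Obj("b", [shared, a])
junk = Obj("junk", [a])                     # unreachable, never copied

roots, new_space = collect([b])
print([o.name for o in new_space])          # prints: ['b', 'shared', 'a']
print(roots[0].fields[0] is roots[0].fields[1].fields[0])   # prints: True
```

The object named shared is reached twice but copied once, and both pointers to it end up referring to the same new copy. The forwarding field is an instance of the added storage mentioned above.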