Memory Leak Detection Using Shape Analysis

This page provides a general overview of my Fall 2003 CS 490 project, which will be on display at BOOM 2004. The goal of this page is to clarify and elaborate on the presentation.

Brian Hackett, bwh6 at cornell.edu

Prof. Radu Rugina, rugina at cs.cornell.edu

Table of Contents

Project Goals

Background

Results

Analysis

Project Goals

Background

This section provides a brief outline of existing alternatives to this analysis, in the form of other ways to detect memory leaks, and other shape analyses.

We contrast our analysis with dynamic analyses such as Purify, which instrument code to detect memory leaks in individual runs. Other static analyses exist that detect memory leaks, but there is very little common ground between them.

While the purpose of our analysis is to detect memory leaks, it can also be used on its own as a shape analysis. Shape analyses are a family of static analyses that seek to determine the 'shape' of a program's structures. This usually takes the form of distinguishing between tree-like structures, whose cells each have a single incoming reference, and arbitrary graphs, whose cells may have many incoming references. A list, a set of disjoint lists, or a binary tree is tree-like, while a DAG or a doubly-linked list is not.

Most shape analyses function by directly keeping track of the shape of a program's structures, forming an abstract representation capturing all possible states of the program at each point in its execution. Unfortunately, this technique does not seem to scale well to moderately sized programs, as the number of structures and their representations is simply too large to efficiently track.

Our analysis is apparently unique among shape analyses in that it is concerned with the incoming references of one cell of the program at a time. We perform the analysis once for each allocation site in the program and, by combining the results, obtain shape information for every structure at each point. Since this eliminates interdependencies between different structures of the program, it is faster than tracking every structure at once, and the analysis scales up more reliably. Moreover, algorithms that preserve shape generally preserve reference counts, so the precision of our analysis is essentially the same as that of other shape analyses.

Results

On test programs designed to assess the power of the shape analysis, the analysis has performed very well. The main way to test a shape analysis is to apply it to various algorithms that create or modify recursive structures, checking whether it can detect that the tree-ness of the structures is maintained. Our shape analysis is on par with the best existing shape analyses, detecting that operations such as list insertion, merging, in-place reversal, and insertion sort all preserve the list-ness of their parameters. While we have not done much testing on tree algorithms, the design of the analysis and its performance on a few key operations (such as rotations) suggest that it will perform equally well.

Results on real programs are only preliminary. While we hope to eventually run the analysis on GCC, a compiler with about 100,000 lines of code, our testing so far has only been on segments of binutils, a collection of programs whose total size is about 150,000 lines of code. Still, performance has been promising. Of roughly 7,000 lines of code analyzed, 10 bugs have been reported, and 8 correspond to real errors in the code.

Analysis

The analysis is divided into two phases. The goal of the first is to give reasonably precise information about aliasing in the program, i.e. to determine how the heap is laid out, what each pointer may reference, and whether two expressions may refer to the same location. Since in C and C++ pointers can reference data on the stack, an alias analysis is necessary to have any hope of reasoning conservatively about the behavior of a program using pointers. The goal of the second phase is to leverage the first phase's results to give extremely precise aliasing information: information precise enough to determine not just the rough outline of the heap, but the shape of each of its structures as well.

During the first phase, the analysis divides the heap into a finite set of distinct regions, such that all objects in each region share some relationship with the objects of other regions. For example, if a local variable x has type int*, then there is a region Rx pointing to some region Rt, such that Rx contains the local copy of x, and Rt contains every cell that x can ever point to. Regions are combined so that for any assignment that can be performed by the program, the values on the left and right sides bear the same relationship to the heap. If there is an assignment x = y, then Rx and Ry must point to the same region. This ensures that no cell in a region violates the constraints imposed by that region. More importantly, our specification of regions ensures that any assignment changes the value of a cell in a particular region, and we can place a very specific constraint on the possible values of the assignment's right hand side.

During the second phase, we accumulate information that may not hold universally, or even at all times when a particular point in the program is reached. While the first phase of the analysis is very coarse, accumulating information that holds at all times during a program's execution, the second phase is extremely fine-grained, accumulating information that holds only when a particular point is reached in a particular configuration. These configurations count the number of references to a particular object originating in each region of the heap; for example, if x is the only pointer to the object, the reference counts are {Rx[1]}.

Rather than tracking reference counts between all objects of the heap, we accurately determine the possible reference counts for a single object. We trace forwards from the allocation site (usually a call to malloc) for this object, and by capturing all possible states at the point when it was allocated, the results we get apply to every object allocated at that site. We then repeat the process for every allocation site in the program. Importantly, this does not require the same effort as tracing once while tracking every site: handling sites separately removes the need to model the computationally prohibitive interactions between different objects, and safely approximating those interactions is precisely the job of the first phase. Combining the per-site results yields reference counts for each point, such that whenever the program reaches that point, each object in the program has reference counts consistent with one of the possible counts.
For example, if Rx is the region containing the local copy of a list pointer x, and Rl is the region containing all the pointers in the possible lists x can point to, then the reference count {Rx[1]} is consistent with objects pointed to uniquely by x, and the reference count {Rl[1]} is consistent with objects pointed to uniquely by some element of the list. If these are the only two possible reference counts for elements of the list x points to, then x must either be NULL or point to an acyclic list whose elements have not been combined with any other list in the program. This sort of shape information is essential for accurately finding leaks in list-manipulation code: by maintaining the reference counts for an object, we can track it arbitrarily deep into structures, until either it is deleted and its references destroyed, or it is leaked and we report the error.