Adrian Sampson

Manual Test-Case Reduction

2023-12-21T00:00:00-06:00

Test-case reduction is a useful research skill in my line of work. We build lots of tools, and those tools are full of bugs: it’s a normal part of the work to run into weird problems and to figure out what’s going wrong. Especially for people who are new to a research project:

Reduced test cases are an extremely powerful communication tool for asking questions and getting help from people who have been around longer.
When you don’t have intuition yet for where bugs usually come from, reducing a test case can help with your guesswork.

The concept behind test-case reduction is really simple, but—maybe because it’s so simple—sometimes it’s hard to convey what I mean when I say, “can you try reducing that test?” I think the idea might be easier to show than to tell. This post will do both.

The Recipe

Here are the steps in test-case reduction:

Run into a bug.
Capture your input that reproduces the bug. In our research, this input is usually a program. You’ll need both the input program and a command you can run on the program to trigger the bug.
Delete stuff from your input. Try to delete as much as possible without making the bug go away. Remember to repeatedly run your command after each little deletion to be sure the bug still happens.
Stop when you don’t think you can delete anything more without making the bug go away.

Now you have a reduced test case. The hope here is that you and your collaborators will gain a flash of inspiration by staring at the reduced test case that leads you directly to the root cause. Critically, that flash of inspiration was impossible with your original, big test case because it had lots of extraneous stuff in it that obscured the real problem.

Because this recipe is so mechanical, there are many good automated test-case reducer tools out there that can do it for you. Automation is especially important for big programs. Manually reducing test cases is still a useful skill: it helps to understand what the automated tools are doing for you, and it might be faster when your test case is already pretty small. I’ll demonstrate an automated reducer in a follow-up post.
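To show just how mechanical the recipe is, here is a minimal sketch of a greedy, one-line-at-a-time reducer. It is not how any particular reducer tool works. The `still_fails` callback is a hypothetical stand-in for "run your command on this input and check that the same bug shows up"; in practice it would write the candidate to a file and invoke the real command.

```rust
/// Greedily delete one line at a time, keeping a deletion only when the bug
/// still reproduces. `still_fails` is a hypothetical predicate standing in for
/// "run the command on this input and look for the same error."
fn reduce_lines<F>(input: &str, still_fails: F) -> String
where
    F: Fn(&str) -> bool,
{
    let mut lines: Vec<&str> = input.lines().collect();
    let mut changed = true;
    // Keep making passes until a whole pass deletes nothing.
    while changed {
        changed = false;
        let mut i = 0;
        while i < lines.len() {
            // Tentatively delete line i.
            let mut candidate = lines.clone();
            candidate.remove(i);
            if still_fails(&candidate.join("\n")) {
                // The bug survived, so the deletion sticks.
                lines = candidate;
                changed = true;
            } else {
                // The bug went away, so keep this line and move on.
                i += 1;
            }
        }
    }
    lines.join("\n")
}
```

Calling `reduce_lines(program, |p| p.contains("jmp"))` with a toy predicate like that one keeps only the lines the predicate depends on. A human doing this by hand gets to be smarter than the loop: you can delete whole functions at once and simplify within a line.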

A Demo

This video tries to convey what it feels like to manually reduce a test case. This one revealed a bug in an interpreter for Bril, the instruction-based intermediate language we use in Cornell’s PhD-level compilers course. A student helpfully reported a program that crashes the interpreter:

$ bril2json < problem.bril | cargo run -- -p false false
thread 'main' panicked at src/interp.rs:543:45:
index out of bounds: the len is 0 but the index is 1

The original program from the report isn’t very long—just 25 lines—but it still does enough stuff that it’s hard to see exactly what went wrong in the interpreter. To help find the problem, we want a program that does nothing other than trigger the bug.

In this demo, I deleted all but 4 lines:

@main() {
  .lbl:
    jmp .lbl;
}

Even if you’ve never seen Bril before, I hope you agree that it’s now easy to imagine where to start looking in the interpreter for a fix.

To follow along at home, check out revision c543ae2 of the Bril repo, follow the README’s instructions to get the basic Bril tools set up, build the buggy interpreter with cd brilirs ; cargo build, get the original unreduced problem.bril, and then try the command above to see the Rust panic message.

Critiquing a PhD Application Statement

2023-12-06T00:00:00-06:00

My best advice for applying to grad school is to get feedback on your statement. Let’s see what that feedback might look like by critiquing a sample statement. I’ll quote the full text verbatim, as it was submitted in 2008.

When I saw my mother after I heard my first NP-completeness proof in my Sophomore-year algorithms course, she asked how my classes were going.

This is a worrisome start. This draft makes two classic mistakes right out of the gate:

Resist the temptation to open with a cute anecdote. To the extent they convey anything, little personal stories like this mainly serve to illustrate your enthusiasm—but everyone applying to CS PhD programs is enthusiastic about CS (I hope). Spend the space instead showing off your experience, interests, and expertise, which are what make you unique.
It really helps for the first few sentences to help readers quickly “route” your application. That means writing, clearly and early on, which research areas you like. These areas can be broad, like “systems” or “theory,” and they help people know who should pay attention.

My first instinct, of course, was to relate to my mom—a public school child psychologist with a math phobia—the bizarre and thrilling tale of NP-completeness. Her hesitant acknowledgements and supportive but bewildered face betrayed that my explanation needed some work. That curious, surprised face, however, reminded me of my own astonishment at a host of earlier concepts that now seem basic: asymptotic complexity, self-balancing search trees, functional programming, and a litany of others. The repeated feeling of surprise, intrigue, and elation has marked time as I have become entranced with computer science.

I recommend cutting this whole paragraph. The anecdote doesn’t add much, and—worse—the topics you mention are both pretty basic and fairly scattershot. Again, this story can really only illustrate that you like CS in general, not anything about your specific interests or past experience. That latter, more individual stuff is what potential advisors want to know about.

My experiences with research and tutoring as an undergraduate have focused my intent to pursue a career in CS. I am applying to the PhD program in CS at the University of Washington in order to eventually become a professor.

This is too generic. It’s not necessarily a bad idea to say what career you eventually want, but these sentences are otherwise a no-op. Consider replacing this with specificity about the research areas you hope to work in.

My first major research was part of an NSF Research Experience for Undergraduates project after my sophomore year. I worked with two other students and Professor Ran Libeskind-Hadas on algorithm design and complexity analysis in the realm of optical network routing. My colleagues and I started with very little direction, defined our own research area, and constructed a solution start-to-finish. This was my first encounter with the intricacies of constructive, undirected theoretical research and my first large-scale technical writing project.

We are finally talking about research. It’s helpful that you have given the context for the project and given credit to other students who worked on it with you. However, this summary is missing two big things:

Any specifics about what the research project was actually about. You presumably learned something about how to frame and “sell” your research during this project; put that into action here. Tell the reader about the problem, its importance, and your solution. By clearly describing the research contribution, you will not only illustrate your experience but also show off your ability to discuss research.
A specific description of your role in the project. It’s good to know that you worked with two other students on this project, but which part of it was “yours”? The original ideas, the algorithmic development, the complexity analysis, the experimental results, or something else? Be specific.

Our paper, “On-line Distributed Traffic Grooming,” was accepted to the IEEE Communications Society’s 2008 International Conference on Communications. I traveled to Beijing with Professor Libeskind-Hadas to present it. I prepared the presentation with the help of Professor Libeskind-Hadas but was the sole presenter.

It’s useful to know that you have experience with preparing and giving conference talks. But this could be reduced to one sentence. You can just cite the paper to convey where it was published to anyone who might care and delete the part about traveling to Beijing, which doesn’t matter.

The experience exposed the unique difficulty of designing presentations that keep theoretical topics tenable and interesting for an uninitiated audience.

This is pretty generic; I’d cut it. As a general rule, I recommend removing most of the stuff in this statement that talks about your reactions to your experience unless you have something truly unique to add. The problem with this stuff is that it’s usually obvious: these are exactly the things that everybody learns when giving their first talk. You can use the space you’ll recover to say things that distinguish you from other applicants.

My enthusiasm for this first project ensured that I would pursue research constantly throughout the remainder of my career at Harvey Mudd.

Most statements about enthusiasm are not helpful: we hope most people applying to grad school are enthusiastic about research. Skip it.

I pursued an independent research project with Professor Robert Keller on neural-network techniques for automated processing of structured text. I identified a problem called “adaptive parsing”: the interpretation of certain kinds of grammars with minimal knowledge of the grammars themselves. I constructed and implemented a technique based on rival-penalized competitive learning to accomplish a basic adaptive parsing task. Working alone with occasional advice from my advisor, I was entirely responsible for the framing of my problem and the construction of my solution.

This research-project summary is better than the last one; you’ve told us what the problem was and how you approached it. You even have a sentence in there specifically describing your role. Nice work! The problem description still goes by pretty quickly, though; another sentence or two could make it clearer why this work matters.

I’d probably add a citation for “rival-penalized competitive learning” because this technique is not common knowledge. (If it were k-means or whatever, you wouldn’t need to cite it.)

While the experience was challenging, I proved to myself that I had the motivation required to produce something I consider significant.

You can delete this sentence, which is another somewhat generic “my reaction” statement.

More recently, I’ve conducted research in two fields distant from theory and machine learning: computer security and filesystems. The first, conducted under the supervision of Professor Everett Bull at Pomona College, examined the security implications of a unique storage system called Venti. I proposed a succinct, low-overhead method for implementing capabilities-based security in the system.

It’s a good start, but this needs more detail. What is Venti? A citation would be helpful, but please also describe whatever is salient about Venti and what makes security interesting or different in this setting. Otherwise, it’s hard to tell what you actually did, i.e., what experience you have that may be relevant to your time in grad school.

During the same time period, I worked with Professor Geoff Kuenning on the structural complications of filesystems with interfaces based on arbitrary, unstructured metadata (“tags”) rather than directory hierarchies. I applied the canonical disk-layout techniques from the Berkeley Fast Filesystem (FFS) to outline a filesystem optimized for tag-based storage.

This sounds like a pretty wild project! I think you may be underselling the “big idea” a little bit. Are you really proposing to replace hierarchical directory-based filesystems with something else entirely? That’s a radical change, and you could amp up the marketing here to make it sound as radical as it is.

Also, I’m not sure what you mean by “structural complications”; maybe you can think of a more specific thing to say here.

The project exposed me to the intricacies of constructive systems research. An enormous range of considerations, from interface to low-level data structures, had to be identified and carefully analyzed in order to create a coherent and useful proposal.

The passive voice obscures your role in the project. Instead of focusing on this somewhat vague “lesson” that you learned, it would be more helpful to write down what you actually did for this project. Did you implement a whole filesystem? Did you do any performance experiments? Be specific about your concrete work, which will help potential advisors understand what you’re capable of.

My positive experiences with this wide range of research were enough to convince me to investigate graduate school in CS. During my REU project in particular, it was exhilarating to have intellectual pursuit as my primary day-to-day responsibility, to spend eight hours a day talking to my research team and coming away every day feeling like we invented something clever and insightful.

This is more “reaction” stuff; I’d cut it.

Aside from research, however, my experiences with tutoring both in computer science and in writing have further solidified my intention to become a professor. I have tutored for Harvey Mudd’s CS department and worked as a consultant for the Writing Center for three years. My success in these areas has suggested that I am able to communicate abstract ideas clearly. As a writing consultant, I constantly reflect on the infinite complexities of writing and how to best inspire the same considerations in other students. This year, four other consultants and I presented at the 2008 National Conference on Peer Tutoring in Writing on the relationships between tutoring styles for technical and non-technical exposition. This experience has motivated me to think of teaching, alongside research, as an end goal for my career.

It’s interesting that you have this other experience and I’m glad you included it. Teaching and writing are both big parts of grad school, so any “extracurricular” background you have on these themes is relevant. It also helps explain why you say you want an academic career.

I’m not sure about the “my success” sentence. You haven’t actually shown us specific successes here. It might be better to just let the reader decide what this experience suggests you are able to do.

As evidenced by the diversity of my research endeavors, upper-division electives have not helped at all to narrow my enthusiasm for computer science. Every time I take a new course—in filesystems, in complexity theory, in computer security—I plan out a new path through graduate school in another subfield. Only intense consideration has led me to narrow my aspirations to the fields I find most fascinating: theory and systems. While I am fascinated by both of these fields individually, I also see possibility in studying in their intersection.

Move this direct statement of your research interests to the top so it’s easier to find.

While I think it’s awesome that you have such broad CS interests, I admit I’m a little concerned about the emphasis you place on that here. I can imagine some readers being confused enough by this generality that they don’t see themselves as potential advisors for you. I’d consider leading with the specific areas and only briefly mentioning this breadth of interest, without spending much ink on it.

Remember that a grad app statement is not a contract. It’s pretty common for PhD students to change their focus after they start. (Maybe you’ll get interested in programming languages or computer architecture; who knows?) Being specific for the sake of specificity will help you show off your technical depth better, even if the choice is somewhat arbitrary.

I want to attend the University of Washington because it is a large university with outstanding programs in many subfields. If I find that systems-based research is not my favorite research area, for instance, UW also has an exceptional theory group. Not only is the department top-ranked, but it also comes highly recommended from professors I trust and respect. At a large, diverse, and highly-ranked school like UW, I anticipate opportunities to explore many research areas before committing to one.

This is a weird paragraph. “Large university,” while accurate, isn’t exactly a distinguishing feature of UW. And while it’s also true that a big department affords you the opportunity to change your mind, this doesn’t seem all that relevant to the committee deciding whether or not to admit you. Maybe just cut this stuff because it doesn’t add much?

For instance, I am intrigued by Professor Paul Beame’s research on modern data structures. His work highlights the interactions between theory and systems by examining the algorithmic implications of realistic instruction sets. This intersection between two disparate fields is attractive for its combination of two different ways of reasoning about the same set of problems. I am applying to UW because of the wide variety of compelling research opportunities like Professor Beame’s.

It’s a good idea to briefly mention the professors you might want to work with at each school. It’s also nice that you found someone with research that nominally combines two of your interests.

I recommend adding a few more professors. Including a handful of names will again help “route” your application to the right set of readers. Some names to consider here might include Luis Ceze and Dan Grossman, for example.

Well, it’s a start! I recommended a few places where you can cut, especially some material that doesn’t feel very unique to you. Hopefully you can use the extra space to add more technical detail about your past work, which will help readers understand your experience better. Remember, your past research experience is probably the most important factor readers will be looking for.

Good luck! I don’t know if this statement will get you into UW, but if it does, I’m sure it will be a great place to pursue your PhD.

Flattening ASTs (and Other Compiler Data Structures)

2023-05-01T00:00:00-05:00

Normal and flattened ASTs for the expression a * b + c.

Arenas, a.k.a. regions, are everywhere in modern language implementations. One form of arenas is both super simple and surprisingly effective for compilers and compiler-like things. Maybe because of its simplicity, I haven’t seen the basic technique in many compiler courses—or anywhere else in a CS curriculum for that matter. This post is an introduction to the idea and its many virtues.

Arenas or regions mean many different things to different people, so I’m going to call the specific flavor I’m interested in here data structure flattening. Flattening uses an arena that only holds one type, so it’s actually just a plain array, and you can use array indices where you would otherwise need pointers. We’ll focus here on flattening abstract syntax trees (ASTs), but the idea applies to any pointer-laden data structure.

To learn about flattening, we’ll build a basic interpreter twice: first the normal way and then the flat way. Follow along with the code in this repository, where you can compare and contrast the two branches. The key thing to notice is that the changes are pretty small, but we’ll see that they make a microbenchmark go 2.4× faster. Besides performance, flattening also brings some ergonomics advantages that I’ll outline.

A Normal AST

Let’s start with the textbook way to represent an AST. Imagine the world’s simplest language of arithmetic expressions, where all you can do is apply the four basic binary arithmetic operators to literal integers. Some “programs” you can write in this language include 42, 0 + 14 * 3, and (100 - 16) / 2.

Maybe the clearest way to write the AST for this language would be as an ML type declaration:

type binop = Add | Sub | Mul | Div
type expr = Binary of binop * expr * expr
          | Literal of int

But for this post, we’ll use Rust instead. Here are the equivalent types in Rust:

enum BinOp { Add, Sub, Mul, Div }
enum Expr {
    Binary(BinOp, Box<Expr>, Box<Expr>),
    Literal(i64),
}

If you’re not a committed Rustacean, Box<Expr> may look a little weird, but that’s just Rust for “a plain ol’ pointer to an Expr.” In C, we’d write Expr* to mean morally the same thing; in Java or Python or OCaml, it would just be Expr because everything is a reference by default.¹

With the AST in hand, we can write all the textbook parts of a language implementation, like a parser, a pretty-printer, and an interpreter. All of them are thoroughly unremarkable. The whole interpreter is just one method on Expr:

fn interp(&self) -> i64 {
    match self {
        Expr::Binary(op, lhs, rhs) => {
            let lhs = lhs.interp();
            let rhs = rhs.interp();
            match op {
                BinOp::Add => lhs.wrapping_add(rhs),
                BinOp::Sub => lhs.wrapping_sub(rhs),
                BinOp::Mul => lhs.wrapping_mul(rhs),
                BinOp::Div => lhs.checked_div(rhs).unwrap_or(0),
            }
        }
        Expr::Literal(num) => *num,
    }
}

My language has keep-on-truckin’ semantics; every expression eventually evaluates to an i64, even if it’s not the number you wanted.²
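To see those semantics in action, here is a self-contained sketch that wraps the interpreter above in an `impl` block and evaluates a few hand-built trees. The `lit` and `bin` helpers are hypothetical conveniences I made up for this example; they are not part of the repository.

```rust
enum BinOp { Add, Sub, Mul, Div }

enum Expr {
    Binary(BinOp, Box<Expr>, Box<Expr>),
    Literal(i64),
}

impl Expr {
    fn interp(&self) -> i64 {
        match self {
            Expr::Binary(op, lhs, rhs) => {
                let lhs = lhs.interp();
                let rhs = rhs.interp();
                match op {
                    // Overflow wraps around instead of panicking.
                    BinOp::Add => lhs.wrapping_add(rhs),
                    BinOp::Sub => lhs.wrapping_sub(rhs),
                    BinOp::Mul => lhs.wrapping_mul(rhs),
                    // Division by zero (or an overflowing division) yields 0.
                    BinOp::Div => lhs.checked_div(rhs).unwrap_or(0),
                }
            }
            Expr::Literal(num) => *num,
        }
    }
}

// Hypothetical helpers that make hand-built trees less noisy.
fn lit(n: i64) -> Box<Expr> {
    Box::new(Expr::Literal(n))
}

fn bin(op: BinOp, lhs: Box<Expr>, rhs: Box<Expr>) -> Box<Expr> {
    Box::new(Expr::Binary(op, lhs, rhs))
}

/// Builds and evaluates (100 - 16) / 2.
fn example() -> i64 {
    bin(BinOp::Div, bin(BinOp::Sub, lit(100), lit(16)), lit(2)).interp()
}
```

Here `example()` evaluates to 42, while `bin(BinOp::Div, lit(1), lit(0)).interp()` quietly produces 0 rather than crashing.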

For extra credit, I also wrote a little random program generator. It’s also not all that interesting to look at; it just uses a recursively-increasing probability of generating a literal so it eventually terminates. Using fixed PRNG seeds, the random generator enables some easy microbenchmarking. By generating and then immediately evaluating an expression, we can measure the performance of AST manipulation without the I/O costs of parsing and pretty-printing.
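Here is a rough sketch of what a generator in that style could look like. This is not the code from the repository: the tiny xorshift PRNG, the 0.1 probability step, and the names `XorShift`, `gen_expr`, and `size` are all assumptions I picked so the example needs no external crates.

```rust
enum BinOp { Add, Sub, Mul, Div }

enum Expr {
    Binary(BinOp, Box<Expr>, Box<Expr>),
    Literal(i64),
}

/// A tiny xorshift PRNG (an assumption for this sketch). The seed must be nonzero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// A float in [0, 1), built from the top 53 bits.
    fn unit(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Generate a random expression. `p_lit` is the probability of emitting a
/// literal at this level; it grows with depth, so recursion must bottom out.
fn gen_expr(rng: &mut XorShift, p_lit: f64) -> Expr {
    if rng.unit() < p_lit {
        Expr::Literal((rng.next_u64() % 100) as i64)
    } else {
        let op = match rng.next_u64() % 4 {
            0 => BinOp::Add,
            1 => BinOp::Sub,
            2 => BinOp::Mul,
            _ => BinOp::Div,
        };
        // Children are more likely to be literals than their parent. Once the
        // probability reaches 1.0, every node is a literal.
        let child_p = (p_lit + 0.1).min(1.0);
        let lhs = Box::new(gen_expr(rng, child_p));
        let rhs = Box::new(gen_expr(rng, child_p));
        Expr::Binary(op, lhs, rhs)
    }
}

/// Count the nodes in a tree, which is handy for sizing a benchmark input.
fn size(expr: &Expr) -> usize {
    match expr {
        Expr::Binary(_, lhs, rhs) => 1 + size(lhs) + size(rhs),
        Expr::Literal(_) => 1,
    }
}
```

Because the seed is fixed, two calls like `gen_expr(&mut XorShift(42), 0.0)` build the same tree every time, which is what makes the microbenchmark repeatable.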

You can check out the relevant repo and try it out:

$ echo '(29 * 3) - 9 * 5' | cargo run
$ cargo run gen_interp  # Generate and immediately evaluate a random program.

Flattening the AST

The flattening idea has two pieces:

Instead of allocating Expr objects willy-nilly on the heap, we’ll pack them into a single, contiguous array.
Instead of referring to children via pointers, Exprs will refer to their children using their indices in that array.

Let’s look back at the doodle from the top of the post. We want to use a single Expr array to hold all our AST nodes. These nodes still need to point to each other; they’ll now do that by referring to “earlier” slots in that array. Plain old integers will take the place of pointers.

If that plan sounds simple, it is—it’s probably even simpler than you’re thinking. The main thing we need is an array of Exprs. I’ll use Rust’s newtype idiom to declare our arena type, ExprPool, as a shorthand for an Expr vector:

struct ExprPool(Vec<Expr>);

To keep things fancy, we’ll also give a name to the plain old integers we’ll use to index into an ExprPool:

struct ExprRef(u32);

The idea is that, everywhere we previously used a pointer to an Expr (i.e., Box<Expr> or sometimes &Expr), we’ll use an ExprRef instead. ExprRefs are just 32-bit unsigned integers, but by giving them this special name, we’ll avoid confusing them with other u32s. Most importantly, we need to change the definition of Expr itself:

 enum Expr {
-    Binary(BinOp, Box<Expr>, Box<Expr>),
+    Binary(BinOp, ExprRef, ExprRef),
     Literal(i64),
 }

Next, we need to add utilities to ExprPool to create Exprs (allocation) and look them up (dereferencing). In my implementation, these little functions are called add and get, and their implementations are extremely boring. To use them, we need to look over our code and find every place where we create new Exprs or follow a pointer to an Expr. For example, our parse function used to be a method on Expr, but we’ll make it a method on ExprPool instead:

-fn parse(tree: Pair<Rule>) -> Self {
+fn parse(&mut self, tree: Pair<Rule>) -> ExprRef {

And where we used to return a newly allocated Expr directly, we’ll now wrap that in self.add() to return an ExprRef instead. Here’s the match case for constructing a literal expression:

 Rule::number => {
     let num = tree.as_str().parse().unwrap();
-    Expr::Literal(num)
+    self.add(Expr::Literal(num))
 }

Our interpreter gets the same treatment. It also becomes an ExprPool method, and we have to add self.get() to go from an ExprRef to an Expr we can pattern-match on:

-fn interp(&self) -> i64 {
+fn interp(&self, expr: ExprRef) -> i64 {
-    match self {
+    match self.get(expr) {

That’s about it. I think it’s pretty cool how few changes are required—see for yourself in the complete diff. You replace Box<Expr> with ExprRef, insert add and get calls in the obvious places, and you’ve got a flattened version of your code. Neat!
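For reference, here is a self-contained sketch of how the flattened pieces fit together. The types and the `interp` method follow the snippets above, but the bodies of `add` and `get` are my guess at the "extremely boring" versions and may not match the repository line for line.

```rust
enum BinOp { Add, Sub, Mul, Div }

#[derive(Clone, Copy)]
struct ExprRef(u32);

enum Expr {
    Binary(BinOp, ExprRef, ExprRef),
    Literal(i64),
}

struct ExprPool(Vec<Expr>);

impl ExprPool {
    /// Allocation: push onto the array and hand back the new slot's index.
    fn add(&mut self, expr: Expr) -> ExprRef {
        let idx = self.0.len();
        assert!(idx <= u32::MAX as usize, "too many exprs in the pool");
        self.0.push(expr);
        ExprRef(idx as u32)
    }

    /// Dereferencing: a plain array lookup.
    fn get(&self, expr: ExprRef) -> &Expr {
        &self.0[expr.0 as usize]
    }

    fn interp(&self, expr: ExprRef) -> i64 {
        match self.get(expr) {
            Expr::Binary(op, lhs, rhs) => {
                let lhs = self.interp(*lhs);
                let rhs = self.interp(*rhs);
                match op {
                    BinOp::Add => lhs.wrapping_add(rhs),
                    BinOp::Sub => lhs.wrapping_sub(rhs),
                    BinOp::Mul => lhs.wrapping_mul(rhs),
                    BinOp::Div => lhs.checked_div(rhs).unwrap_or(0),
                }
            }
            Expr::Literal(num) => *num,
        }
    }
}

/// Builds (100 - 16) / 2 bottom-up and evaluates it.
fn example() -> i64 {
    let mut pool = ExprPool(Vec::new());
    let a = pool.add(Expr::Literal(100));
    let b = pool.add(Expr::Literal(16));
    let diff = pool.add(Expr::Binary(BinOp::Sub, a, b));
    let two = pool.add(Expr::Literal(2));
    let root = pool.add(Expr::Binary(BinOp::Div, diff, two));
    pool.interp(root)
}
```

Note that `ExprRef` derives `Copy` here so we can pass references around as freely as the integers they wrap.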

But Why?

Flattened ASTs come with a bunch of benefits. The classic ones most people cite are all about performance:

Locality. Allocating normal pointer-based Exprs runs the risk of fragmentation. Flattened Exprs are packed together in a contiguous region of memory, which is good for spatial locality. Your data caches will work better because Exprs are more likely to share a cache line, and even simple prefetchers will do a better job of predicting which Exprs to load before you need them. A sufficiently smart memory allocator might achieve the same thing, especially if you allocate the whole AST up front and never add to it, but using a dense array removes all uncertainty.
Smaller references. Normal data structures use pointers for references; on modern 64-bit architectures, those are typically 64 bits. After flattening, you can use smaller integers: if you’re pretty sure you’ll never need more than 4,294,967,295 AST nodes, you can get by with 32-bit references, like we did in our example. That’s a 50% space savings for all your references, which could amount to a substantial overall memory reduction in pointer-heavy data structures like ASTs. Smaller memory footprints mean less bandwidth pressure and even better spatial locality. And you might save even more if you can get away with 16- or even 8-bit references for especially small data structures.
Cheap allocation. In flatland, there is no need for a call to malloc every time you create a new AST node. Instead, provided you pre-allocate enough memory to hold everything, allocation can entail just bumping the tail pointer to make room for one more Expr. Again, a really fast malloc might be hard to compete with—but you basically can’t beat bump allocation on sheer simplicity.
Cheap deallocation. Our flattening setup assumes you never need to free individual Exprs. That’s probably true for many, although not all, language implementations: you might build up new subtrees all the time, but you don’t need to reclaim space from many old ones. ASTs tend to “die together,” i.e., it suffices to deallocate the entire AST all at once. While freeing a normal AST entails traversing all the pointers to free each Expr individually, you can deallocate a flattened AST in one fell swoop by just freeing the whole ExprPool.
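If you want to check the "smaller references" claim on your own machine, a quick sketch like this one will report the sizes. The names `BoxedExpr`, `FlatExpr`, and `reference_sizes` are made up for this example so both representations can live side by side; the exact node sizes depend on your target and on how the compiler lays out the enums, so treat the numbers as something to measure rather than assume.

```rust
use std::mem::size_of;

#[allow(dead_code)]
enum BinOp { Add, Sub, Mul, Div }

// The pointer-based node (renamed for this sketch).
#[allow(dead_code)]
enum BoxedExpr {
    Binary(BinOp, Box<BoxedExpr>, Box<BoxedExpr>),
    Literal(i64),
}

#[allow(dead_code)]
struct ExprRef(u32);

// The flattened node (renamed for this sketch).
#[allow(dead_code)]
enum FlatExpr {
    Binary(BinOp, ExprRef, ExprRef),
    Literal(i64),
}

/// Returns the sizes in bytes of (a pointer reference, an index reference,
/// a pointer-based node, a flattened node).
fn reference_sizes() -> (usize, usize, usize, usize) {
    (
        size_of::<Box<BoxedExpr>>(),
        size_of::<ExprRef>(),
        size_of::<BoxedExpr>(),
        size_of::<FlatExpr>(),
    )
}
```

On a typical 64-bit target, I would expect the first two numbers to be 8 and 4. The cheap-allocation point is similarly easy to try: creating the pool with `Vec::with_capacity(n)` reserves room up front, so each later `push` that fits is just a write and a length bump.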

I think it’s interesting that many introductions to arena allocation tend to focus on cheap deallocation (#4) as the main reason to do it. The Wikipedia page, for example, doesn’t (yet!) mention locality (#1 or #2) at all. You can make an argument that #4 might be the least important for a compiler setting—since ASTs tend to persist all the way to the end of compilation, you might not need to free them at all.

Beyond performance, there are also ergonomic advantages:

Easier lifetimes. In the same way that it’s easier for your computer to free a flattened AST all at once, it’s also easier for humans to think about memory management at the granularity of an entire AST. An AST with n nodes has just one lifetime instead of n for the programmer to think about. This simplification is quadruply true in Rust, where lifetimes are not just in the programmer’s head but in the code itself. Passing around a u32 is way less fiddly than carefully managing lifetimes for all your &Exprs: your code can rely instead on the much simpler lifetime of the ExprPool. I suspect this is why the technique is so popular in Rust projects. As a Rust partisan, however, I’ll argue that the same simplicity advantage applies in C++ or any other language with manual memory management—it’s just latent instead of explicit.
Convenient deduplication. A flat array of Exprs can make it fun and easy to implement hash consing or even simpler techniques to avoid duplicating identical expressions. For example, if we notice that we are using Literal expressions for the first 128 nonnegative integers a lot, we could reserve the first 128 slots in our ExprPool just for those. Then, when someone needs the integer literal expression 42, our ExprPool doesn’t need to construct a new Expr at all; we can just produce ExprRef(42) instead. This kind of game is possible with a normal pointer-based representation too, but it probably requires some kind of auxiliary data structure.
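Here is a minimal sketch of that reserved-slots trick. None of this is in the repository as far as I know: the constructor `new`, the `add_literal` method, and the `SMALL` constant are hypothetical names for illustration.

```rust
#[allow(dead_code)]
enum BinOp { Add, Sub, Mul, Div }

#[derive(Clone, Copy)]
struct ExprRef(u32);

#[allow(dead_code)]
enum Expr {
    Binary(BinOp, ExprRef, ExprRef),
    Literal(i64),
}

struct ExprPool(Vec<Expr>);

/// How many small literals get a reserved slot (hypothetical constant).
const SMALL: i64 = 128;

impl ExprPool {
    /// Start every pool with Literal(0) through Literal(127) in slots 0..128.
    fn new() -> Self {
        ExprPool((0..SMALL).map(Expr::Literal).collect())
    }

    fn add(&mut self, expr: Expr) -> ExprRef {
        let idx = self.0.len();
        assert!(idx <= u32::MAX as usize, "too many exprs in the pool");
        self.0.push(expr);
        ExprRef(idx as u32)
    }

    fn get(&self, expr: ExprRef) -> &Expr {
        &self.0[expr.0 as usize]
    }

    /// Small literals reuse their reserved slot; everything else allocates.
    fn add_literal(&mut self, n: i64) -> ExprRef {
        if (0..SMALL).contains(&n) {
            ExprRef(n as u32)
        } else {
            self.add(Expr::Literal(n))
        }
    }
}
```

With this in place, `pool.add_literal(42)` returns `ExprRef(42)` without growing the pool, and only literals outside the reserved range cost a new slot. The trade-off is that every pool pays for 128 slots up front, even if the program never mentions a small number.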

Performance Results

Since we have two implementations of the same language, let’s measure those performance advantages. For a microbenchmark, I randomly generated a program with about 100 million AST nodes and fed it directly into the interpreter (the parser and pretty printer are not involved). This benchmark is not very realistic: all it does is generate and then immediately run one enormous program. Some caveats include:

I reserved enough space in the Vec<Expr> to hold the whole program; in the real world, sizing your arena requires more guesswork.
I expect this microbenchmark to over-emphasize the performance advantages of cheap allocation and deallocation, because it does very little other work.
I expect it to under-emphasize the impact of locality, because the program is so big that only a tiny fraction of it will fit in the CPU cache at a time.

Still, maybe we can learn something.

I used Hyperfine to compare the average running time over 10 executions on my laptop.³ Here’s a graph of the running times (please ignore the “extra-flat” bar; we’ll cover that next). The plot’s error bars show the standard deviation over the 10 runs. In this experiment, the normal version took 3.1 seconds and the flattened version took 1.3 seconds—a 2.4× speedup. Not bad for such a straightforward code change!

Of that 2.4× performance advantage, I was curious to know how much comes from each of the four potential advantages I mentioned above. Unfortunately, I don’t know how to isolate most of these effects—but #4, cheaper deallocation, is especially enticing to isolate. Since our interpreter is so simple, it seems silly that we’re spending any time on freeing our Exprs after execution finishes—the program is about to shut down anyway, so leaking that memory is completely harmless.

So let’s build versions of both of our interpreters that skip deallocation altogether⁴ and see how much time they save. Unsurprisingly, the “no-free” version of the flattened interpreter takes about the same amount of time as the standard version, suggesting that it doesn’t spend much time on deallocation anyway. For the normal interpreter, however, skipping deallocation takes the running time from 3.1 to 1.9 seconds—it was spending around 38% of its time just on freeing memory!
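One way to skip deallocation in Rust is `std::mem::forget`, which takes ownership of a value and never runs its destructor. The sketch below is only an illustration of that idea, not the benchmark code: `Node`, `chain`, and the drop counter are stand-ins I invented so the effect is observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many nodes have been freed (illustration only).
static DROPS: AtomicUsize = AtomicUsize::new(0);

/// A stand-in for a pointer-based AST node.
#[allow(dead_code)]
struct Node {
    child: Option<Box<Node>>,
}

impl Drop for Node {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Build a linked chain of n heap-allocated nodes.
fn chain(n: usize) -> Option<Box<Node>> {
    let mut head = None;
    for _ in 0..n {
        head = Some(Box::new(Node { child: head }));
    }
    head
}

/// Free the chain the normal way and report how many nodes were dropped.
fn drops_when_freed(n: usize) -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    drop(chain(n));
    DROPS.load(Ordering::SeqCst) - before
}

/// Leak the chain with mem::forget and report how many nodes were dropped.
fn drops_when_leaked(n: usize) -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    std::mem::forget(chain(n));
    DROPS.load(Ordering::SeqCst) - before
}
```

Freeing normally visits every node, while the leaked version does no per-node work at all. Leaking is only reasonable when the process is about to exit anyway, as in this benchmark.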

Even comparing the “no-free” versions head-to-head, however, the flattened interpreter is still 1.5× faster than the normal one. So even if you don’t care about deallocation, the other performance ingredients, like locality and cheap allocation, still have measurable effects.

Bonus: Exploiting the Flat Representation

So far, flattening has happened entirely “under the hood”: arenas and integer offsets serve as drop-in replacements for normal allocation and pointers. What could we do if we broke this abstraction layer and exploited properties of the flattened representation that don’t hold for a normal pointer-based AST?

The idea is to build a third kind of interpreter that exploits an extra fact about ExprPools that arises from the way we build them up. Because Exprs are immutable, we have to construct trees of them “bottom-up”: we have to create all child Exprs before we can construct their parent. If we build the expression a * b, a and b must appear earlier in the ExprPool than the * that refers to them. Let’s bring that doodle back again: visually, you can imagine that reference arrows always go backward in the array, and data always flows forward.
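If you want to convince yourself that a given pool really has this shape, the property is easy to check. Here is a sketch of such a check; the method name `is_bottom_up` is hypothetical, and nothing like it is needed when the pool is only ever built through `add`.

```rust
#[allow(dead_code)]
enum BinOp { Add, Sub, Mul, Div }

#[derive(Clone, Copy)]
struct ExprRef(u32);

enum Expr {
    Binary(BinOp, ExprRef, ExprRef),
    Literal(i64),
}

struct ExprPool(Vec<Expr>);

impl ExprPool {
    /// True if every Binary node refers only to slots strictly before its own,
    /// i.e., all reference arrows point backward in the array.
    fn is_bottom_up(&self) -> bool {
        self.0.iter().enumerate().all(|(i, expr)| match expr {
            Expr::Binary(_, lhs, rhs) => (lhs.0 as usize) < i && (rhs.0 as usize) < i,
            Expr::Literal(_) => true,
        })
    }
}

/// The pool for a * b + c, with stand-in literal values for a, b, and c.
fn example_pool() -> ExprPool {
    ExprPool(vec![
        Expr::Literal(2),                                // slot 0: a
        Expr::Literal(3),                                // slot 1: b
        Expr::Binary(BinOp::Mul, ExprRef(0), ExprRef(1)), // slot 2: a * b
        Expr::Literal(4),                                // slot 3: c
        Expr::Binary(BinOp::Add, ExprRef(2), ExprRef(3)), // slot 4: a * b + c
    ])
}
```

A pool that someone assembled by hand with a forward reference would fail this check, and a left-to-right scan over it would read a child's result before computing it.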

Let’s write a new interpreter that exploits this invariant. Instead of starting at the root of the tree and recursively evaluating each child, we can start at the beginning of the ExprPool and scan from left to right. This iteration is guaranteed to visit parents after children, so we can be sure that the results for subexpressions will be ready when we need them. Here’s the whole thing:

fn flat_interp(self, root: ExprRef) -> i64 {
    let mut state: Vec<i64> = vec![0; self.0.len()];
    for (i, expr) in self.0.into_iter().enumerate() {
        let res = match expr {
            Expr::Binary(op, lhs, rhs) => {
                let lhs = state[lhs.0 as usize];
                let rhs = state[rhs.0 as usize];
                match op {
                    BinOp::Add => lhs.wrapping_add(rhs),
                    BinOp::Sub => lhs.wrapping_sub(rhs),
                    BinOp::Mul => lhs.wrapping_mul(rhs),
                    BinOp::Div => lhs.checked_div(rhs).unwrap_or(0),
                }
            }
            Expr::Literal(num) => num,
        };
        state[i] = res;
    }
    state[root.0 as usize]
}

We use a dense state table to hold one result value per Expr. The state[i] = res line fills this vector up whenever we finish an expression. Critically, there’s no recursion—binary expressions can get the value of their subexpressions by looking them up directly in state. At the end, when state is completely full of results, all we need to do is return the one corresponding to the requested expression, root.

This “extra-flat” interpreter has two potential performance advantages over the recursive interpreter: there’s no stack bookkeeping for the recursive calls, and the linear traversal of the ExprPool could be good for locality. On the other hand, it has to randomly access a really big state vector, which could be bad for locality.

To see if it wins overall, let’s return to our bar chart from earlier. The extra-flat interpreter takes 1.2 seconds, compared to 1.3 seconds for the recursive interpreter for the flat AST. That’s marginal compared to how much better flattening does on its own than the pointer-based version, but an 8.2% performance improvement ain’t nothing.

My favorite observation about this technique, due to a Reddit comment by Bob Nystrom, is that it essentially reinvents the idea of a bytecode interpreter. The Expr structs are bytecode instructions, and they contain variable references encoded as u32s. You could make this interpreter even better by swapping out our simple state table for some kind of stack, and then it would really be no different from a bytecode interpreter you might design from first principles. I just think it’s pretty nifty that “merely” changing our AST data structure led us directly from the land of tree walking to the land of bytecode.
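Here is a rough sketch of what the stack variant might look like. It rests on a strong assumption that the flat interpreter above does not need: the instruction sequence must be exactly the post-order serialization of a single expression tree, with no sharing and no unused entries. Under that assumption the operand references become implicit, so the hypothetical Instr type below drops the ExprRefs entirely.

```rust
#[derive(Clone, Copy)]
enum BinOp {
    Add,
    Sub,
    Mul,
    Div,
}

// Stack-machine instructions: operands are implicit, so no references are stored.
#[derive(Clone, Copy)]
enum Instr {
    Push(i64),
    Binary(BinOp),
}

// Evaluate a post-order instruction sequence; None means the program is malformed.
fn stack_interp(code: &[Instr]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    for instr in code {
        match *instr {
            Instr::Push(n) => stack.push(n),
            Instr::Binary(op) => {
                // The right operand was pushed last, so it comes off first.
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(match op {
                    BinOp::Add => lhs.wrapping_add(rhs),
                    BinOp::Sub => lhs.wrapping_sub(rhs),
                    BinOp::Mul => lhs.wrapping_mul(rhs),
                    BinOp::Div => lhs.checked_div(rhs).unwrap_or(0),
                });
            }
        }
    }
    // A well-formed program leaves exactly one value on the stack.
    if stack.len() == 1 {
        stack.pop()
    } else {
        None
    }
}
```

For (2 + 3) * 4, the program is Push(2), Push(3), Binary(Add), Push(4), Binary(Mul). The stack never grows deeper than the expression’s nesting requires, whereas the dense state table always has one slot per Expr.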

Very Large Scale Disintegration

2023-03-26T00:00:00-05:00

Figure from Chasing Carbon by Gupta et al. showing Facebook’s datacenter carbon footprint over time. “Scope 3” is the supply chain, 49% of which is construction & hardware manufacturing. “Scope 2” is the datacenter power, i.e., the energy it takes to run the machines. The dotted line shows what Scope 2 would look like without buying renewable energy.

Research communities in computer systems should worry about capex carbon emissions. Capex or embodied carbon accounts for the carbon manufacturers produce when building a machine. It’s in contrast to opex carbon, which counts the emissions we incur to use a machine, i.e., from the electricity we feed into a datacenter or a smartphone’s charging port. In a way, systems researchers are already all experts on opex carbon: we worship at the temple of computational efficiency, and making machines faster almost always means getting more work done per joule of energy. But researchers have recently suggested that, over the lifetime of a computer system, its capex carbon can outstrip—perhaps dramatically—its opex emissions.

If capex carbon is the real problem in computing’s climate impact, systems researchers should worry because our favorite tools are a poor fit for the job. It does not suffice to design new and better computers that work more efficiently than the old computers, as we usually do; we instead need to figure out how to use the same old hardware for longer. Reuse and longevity are the key metrics for climate-aware computing.

Meanwhile, a technology trend is promising a different kind of reuse: multi-chip modules (MCMs) replace one big chip with a network of separately manufactured chiplets. Chiplets are suddenly everywhere: AMD’s latest Threadripper parts have 9 dies, and Intel’s Ponte Vecchio GPU consists of 47 chiplets. One selling point for the chiplet revolution is the cost-saving advantage of design reuse: you can tape out one chiplet and use it across several MCM products. Four of seven chiplets in AWS’s Graviton3 MCM, for example, are DDR5 memory controllers. It’s not hard to imagine that these DDR5 chiplets will still be useful for next year’s AWS server product, so AWS can amortize the cost of building that chiplet across multiple generations.

Reusing chiplets saves money, but it does not save capex carbon. Every MCM still consists of brand-new silicon, with all the concomitant manufacturing emissions, just like a monolithic chip.

What if there were a way to literally reuse chiplets? To recover chiplets from old and obsolete MCMs that could still be useful as a building block for new products?

Silicon Recycling

We envision silicon recycling: an imaginary world where we make new MCMs by harvesting chiplets from old computers and remixing them in new ways. Silicon recycling is the general principle of design for active disassembly applied to integrated circuits. In the same way a couch or a toaster could be built with debondable adhesives to make recycling easier at the end of its life, the idea is to build MCMs with a debondable process.

In the real world, MCM packaging uses a bonding process to attach chiplets to a silicon interposer. I like to imagine the world’s tiniest soldering iron (at, say, a 10 μm pitch) attaching the bumps on each chiplet to the corresponding pad on the interposer. In our imaginary world of silicon recycling, the idea is to (somehow) make this bonding process reversible. We build the MCM in the same way, but we design the bonding process so that it is possible to undo the tiny, metaphorical soldering job. By applying heat, lasers, some magical solvent, or a combination of the three, the chiplets break free from the interposer, and both are undamaged, ready to be bonded again in a new product.

In a hypothetical world with silicon recycling, when you upgrade your phone and send your old one off for recycling, the recycler doesn’t just recover the precious metals from the case, PCBs, and screen. They also take the MCM at the heart of the machine, debond all its chiplets, and put them up for sale on a marketplace for second-hand silicon. Your smartphone’s chiplets may go into a next-generation smartphone, coupled with some brand-new chiplets that differentiate it, or they may go downmarket into a camera or a microwave.

Reversible Packaging is a Fantasy (For Now)

The problem with this vision is that it is science fiction. In the real world, bonding is irreversible—there is no way to safely disassemble an MCM and recover working chiplets.

I am very far from an expert on bonding and packaging; I base this conclusion only on a reasonably thorough literature search that turned up no indication that anyone is even working on reversible bonding for MCMs. The closest thing appears to be temporary bonding technologies, which are useful during the manufacturing process. For example, some technologies temporarily bond chiplets to silicon or glass carriers while processing them; then, IR lasers debond the silicon (avoiding any mechanical force) before packaging. The final MCM uses a permanent bond.

On the other hand, I did not find evidence that reversible bonding is infeasible in principle. The vacuum in the literature seems to indicate that no one is trying, perhaps because the idea is just too ridiculous.

Research Directions in Computer Systems

Reversible packaging is a problem of materials and technology—not something that can be solved by systems-level research: architecture, programming languages, operating systems, and the like. But the consequences of silicon recycling technology would be systems problems. Even though it is not yet practical, we can already imagine the systems research that silicon recycling would entail:

Carbon-Aware Architectural Disaggregation

The silicon recycling vision needs architecture research that explores how to build MCMs that maximize their potential for reuse. As in brick and mortar architecture, the idea is to take your favorite monolithic processor design and disaggregate it into little chiplet-sized pieces. Disaggregated architectures need to balance two goals: bigger chiplets can better mitigate the costs of inter-chiplet communication, while finer-grained chiplets are more reusable. An ALU chiplet is more likely to be useful in future designs, for example, than a chiplet that bundles together a particular processor’s needs for arithmetic, registers, address calculation, pipeline bypassing, and branch prediction. But a single ALU is probably too tiny to be practical as a standalone chiplet. This kind of disaggregated architecture research needs to start with a prior assumption about what other, future architectures will look like. Today’s designs can then use this prior to maximize the likelihood that their components will be useful in tomorrow’s designs.

Tools for Design from Spare Parts

Today’s design tools all produce hardware “from scratch.” To wildly oversimplify, you feed in your HDL code and the toolchain produces a physical design ready to tape out. To enable silicon recycling, we will need tools that can synthesize hardware made from an inventory of “spare parts”: chiplets we have on hand or think we can easily buy. In spare-parts synthesis, the designer feeds in (alongside their HDL code) a list of descriptions of all that second-hand hardware; the toolchain’s job is to produce a design for a complete MCM that maximizes the use of those repurposed chiplets. The tools will surely still need to generate some new, project-specific hardware, but the goal is to make this fresh silicon a minority of the overall area.

Physically Reconfigurable Hardware

Today’s reconfigurable hardware (FPGAs and CGRAs) gives you a toolbox of components that you can hook up however you like. But the mixture of components in each toolbox is fixed. If you buy an FPGA from AMD, for example, the FPGA comes with a fixed ratio of basic logic elements (LUTs) to memories (BRAMs) to arithmetic units (DSPs). With silicon recycling, we could make physically reconfigurable hardware: where you start with an assortment of LUT chiplets, BRAM chiplets, and DSP chiplets and mix them in the proportion and arrangement that your application domain demands. Once you have crafted your custom FPGA MCM, you then configure and reconfigure it as many times as you need to implement your application as it evolves. Physically reconfigurable FPGAs need a kind of two-level compiler: they need to jointly produce (1) a physical configuration of chiplets into an FPGA, and (2) a logical configuration of the FPGA into your design. This kind of compiler needs to be aware that physical reconfiguration is expensive and logical reconfiguration is cheap, so the former should admit as much flexibility in the latter as possible while still optimizing for efficiency.

A Call to Action

I confess that I do not know how feasible reversible MCM packaging is. It may be a technical impossibility. But it seems equally likely that it’s the victim of a chicken-and-egg problem: it doesn’t exist, so no one has done the research on how to exploit it for silicon recycling, so there is no pressure to develop the technology, so it doesn’t exist.

Given the urgency of mitigating computing’s capex carbon footprint, we should break this incentive deadlock. Systems researchers should rush ahead and do the work to understand how to design for reusability and how to exploit second-hand chiplets. By demonstrating the systems-level potential for silicon recycling, we can create the incentive to develop the technology that will make it possible.

Try Snapshot Testing for Compilers and Compiler-Like Things

2022-07-22T00:00:00-05:00

Over the past few years, folks in our lab have become devotees of snapshot testing. Snapshot tests are preposterously simple: they’re just pairs of complete input and output files that you check into version control. It’s a good fit for programs that turn text into other text, which describes compilers and lots of other compiler-like things we tend to build. I like snapshots because they take the drudgery out of writing new tests, so I tend to write a lot more of them.

This approach is so basic and so widespread that I don’t think most people bother to give it a name. It’s like air: it’s so obvious and so obviously useful that there’s no need to talk about it most of the time. But the philosophy is very different from other kinds of testing I am used to, so this post introduces the idea and the reasons you might want to try it.

I’ll demonstrate Turnt, a kind of ascetically simple snapshot testing tool we built in the lab. There are other great options, like LLVM’s lit (which directly inspired Turnt), the Insta crate for Rust, Jane Street’s ppx-based framework for OCaml, and Mercurial’s Cram (the OG, I think). A particularly good option is Runt, Rachit Nigam’s fast and full-featured realization in Rust.

An Example

To feel what snapshot testing is like, let’s test something contrived but convenient. We’ll test the venerable Unix wc command.

The first thing we need is an input file. This is a critical thing about this style of testing: it assumes the thing you want to test is a program that transforms text into other text. Fortunately, that describes lots of compiler-like things, and it also describes our SUT, wc. Let’s make a test file, hi.t:

hello, world!

You can probably guess what wc < hi.t will say:

       1       2      14

The idea in snapshot testing is to “lock in” this output so, as we make changes in the future, we can make sure we didn’t break anything. It’s easy to generate a snapshot file:

$ wc < hi.t > hi.out

If we were really working on the wc implementation, we would check both hi.t and hi.out into version control.

Now all we need is a convenient way to make sure wc < hi.t still matches hi.out. That way, we can write a whole slew of these input files and get into the habit of checking that they all still do the same thing.
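The check itself is tiny. As a sketch of the idea (not the code of any tool mentioned in this post), here it is with Rust’s standard library; the function names run_with_stdin and matches_snapshot are made up for this example.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;
use std::process::{Command, Stdio};

// Run `program args... < input` and capture the bytes it writes to standard output.
fn run_with_stdin(program: &str, args: &[&str], input: &Path) -> io::Result<Vec<u8>> {
    let output = Command::new(program)
        .args(args)
        .stdin(Stdio::from(File::open(input)?))
        .output()?;
    Ok(output.stdout)
}

// True if the command's current output is byte-for-byte what the snapshot file holds.
fn matches_snapshot(
    program: &str,
    args: &[&str],
    input: &Path,
    snapshot: &Path,
) -> io::Result<bool> {
    let actual = run_with_stdin(program, args, input)?;
    let expected = fs::read(snapshot)?;
    Ok(actual == expected)
}
```

Calling matches_snapshot("wc", &[], Path::new("hi.t"), Path::new("hi.out")) plays the role of comparing wc < hi.t against hi.out. The comparison is deliberately exact: no normalization, no fuzzy matching.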

Trying Out Turnt

That’s what Turnt does. (And that’s all that it does, more or less.) You can install it with pip:

$ pip install --user turnt

We need to tell Turnt what command to run. Put this into a file called turnt.toml:

command = "wc < {filename}"

Then run Turnt on our little test:

$ turnt hi.t
1..1
ok 1 - hi.t

Success! Turnt tells us that it ran a grand total of one (1) test, and it succeeded—in the sense that wc < hi.t printed, on its standard output, exactly the same stuff that’s saved in hi.out.

Let’s add a second test. Put this in 2lines.t:

hello,
world!

The first time around, we created the *.out file for our test ourselves. But Turnt will happily do it for us with the --save flag:

$ turnt --save 2lines.t
1..1
not ok 1 - 2lines.t # skip: updated 2lines.out
# missing: 2lines.out

It might be a good idea to cat 2lines.out to make sure it looks OK. Then we can run our entire little test suite:

$ turnt *.t
1..2
ok 1 - 2lines.t
ok 2 - hi.t

Success again! We’re already two tests into the business of growing a thorough test suite. The cornerstone of the snapshot testing philosophy is that it should be extremely easy to add new tests: we just need to write an input file and turnt --save its output, and our test suite will grow.
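The check-or-save workflow boils down to a small decision procedure. Here is a sketch of that logic under my own naming (Outcome and check_or_save are hypothetical, and this is not Turnt’s actual implementation): compare against the snapshot if it exists, and in save mode overwrite it with whatever the command printed.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq)]
enum Outcome {
    Matches,   // snapshot exists and equals the actual output
    Differing, // snapshot exists but the output changed
    Missing,   // no snapshot on disk yet
    Updated,   // save mode wrote a new snapshot
}

// Compare `actual` against the snapshot file, or rewrite the file when `save` is set.
fn check_or_save(actual: &str, snapshot: &Path, save: bool) -> io::Result<Outcome> {
    // A missing file is an expected state, not an error.
    let expected = match fs::read_to_string(snapshot) {
        Ok(text) => Some(text),
        Err(e) if e.kind() == io::ErrorKind::NotFound => None,
        Err(e) => return Err(e),
    };
    if expected.as_deref() == Some(actual) {
        return Ok(Outcome::Matches);
    }
    if save {
        fs::write(snapshot, actual)?;
        return Ok(Outcome::Updated);
    }
    Ok(if expected.is_some() {
        Outcome::Differing
    } else {
        Outcome::Missing
    })
}
```

The important design choice is that saving is never automatic: a test without a snapshot fails until a human asks for the output to be recorded.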

Turnt’s spartan output is in TAP format, so you can make it prettier using one of a million TAP consumers, like Faucet:

$ turnt *.t | faucet
✓ 2lines.t
✓ hi.t
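TAP is simple enough to produce by hand, which is part of why so many consumers exist. As a sketch covering only the subset that appears in this post (a plan line followed by one ok or not ok line per test), with a made-up to_tap function:

```rust
// Render test results as Test Anything Protocol (TAP) text.
// Each result is a pair of (test name, whether it passed).
fn to_tap(results: &[(&str, bool)]) -> String {
    // The plan line announces how many tests will be reported.
    let mut out = format!("1..{}\n", results.len());
    for (i, (name, passed)) in results.iter().enumerate() {
        let status = if *passed { "ok" } else { "not ok" };
        // TAP numbers tests starting from 1.
        out.push_str(&format!("{} {} - {}\n", status, i + 1, name));
    }
    out
}
```

Feeding it [("2lines.t", true), ("hi.t", true)] reproduces the three lines of output from the turnt *.t run above.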

Adapting to Changes

The trade-off for snapshot testing’s convenience is that its “specifications” are brittle. Because tests have to match the saved output exactly, even tiny changes count as failures. The remedy is to rely on human review—and to make these manual checks as convenient as possible.

Let’s change one of our tests and watch it fail:

$ echo goodbye >> 2lines.t
$ turnt *.t | faucet
⨯ 2lines.t # differing: 2lines.out
✓ hi.t
⨯ fail  1

We want to see what changed in our failing test. Running turnt --diff shows the change:

$ turnt --diff 2lines.t
1..1
--- 2lines.out	2022-07-17 16:04:35.000000000 -0400
+++ /tmp/tmpnim30l99	2022-07-20 14:55:21.000000000 -0400
@@ -1 +1 @@
-       2       2      14
+       3       3      22
not ok 1 - 2lines.t # differing: 2lines.out

That looks good, so we can now turnt --save to accept the new output. In fact, since we’ve checked our output files into version control, it’s sometimes easier to skip turnt --diff altogether: you can just turnt --save the new output and then run git diff to see what’s new. Rolling back is just a git stash away.

If you use pull requests and code reviews, changes to test outputs will appear there too. Your reviewers might appreciate these diffs as an easy way to see what behavior has changed.

Overrides

A snapshot test is just a pair of an input file and an output file. If either is a program of some kind, this setup means that the files also work as standalone examples of the input or output language. (You might want to configure the output so it uses the right filename extension for your language.)

If you need to configure something special about a test, there’s a way to do that inside the input file. It works by assuming your input language has some way of commenting out text, and it extracts options from that text. For example, you can configure your turnt.toml to use {args} as a placeholder for per-test command-line flags:

command = "wc {args} < {filename}"

Then, you put a special marker in your input file:

// ARGS: -l

Turnt doesn’t care what comments look like in your language; it just looks for the string ARGS: anywhere in the input file. This test will run wc -l instead of just plain wc. You can also mark tests as expected to fail with a given exit status using something like RETURN: 1.
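The override mechanism amounts to a string search plus template substitution. Here is a sketch of that idea; extract_args and render_command are names I made up, and Turnt’s real parsing may differ in its details (for example, in how it handles several markers in one file).

```rust
// Find the first `ARGS:` marker in the input text and return the rest of that line.
// The comment syntax in front of the marker is ignored entirely.
fn extract_args(source: &str) -> Option<&str> {
    source
        .lines()
        .find_map(|line| line.split_once("ARGS:").map(|(_, rest)| rest.trim()))
}

// Fill the `{args}` and `{filename}` placeholders in a command template.
fn render_command(template: &str, args: &str, filename: &str) -> String {
    template
        .replace("{args}", args)
        .replace("{filename}", filename)
}
```

With the template wc {args} < {filename}, an input file containing // ARGS: -l renders as wc -l < hi.t. A file with no marker supplies an empty string, which leaves a harmless double space in the command.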

Interactive Execution

When debugging a test setup, it can be handy to see exactly what a given test is doing. The -p flag turns off all output checking and just shows you the test command and its result:

$ turnt -p hi.t
$ wc  < hi.t
       1       2      14

You can combine -p with --args to interactively try different variants of the test command:

$ turnt -p hi.t --args=-w
$ wc -w < hi.t
       2

In this mode, Turnt becomes a simple way to avoid typing out complicated commands to run them on different input files.

Turnt also supports:

gathering multiple output files from one command,
running several commands on the same input file, and
comparing the outputs from different commands as a form of differential testing.
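The differential testing idea deserves a quick illustration. In this sketch, two in-process functions stand in for the two commands whose outputs get compared; disagreements and both line counters are hypothetical names for this example, not part of Turnt.

```rust
// Run two implementations on each input and return the inputs where they disagree.
fn disagreements<'a, A, B>(inputs: &[&'a str], reference: A, candidate: B) -> Vec<&'a str>
where
    A: Fn(&str) -> String,
    B: Fn(&str) -> String,
{
    inputs
        .iter()
        .copied()
        .filter(|input| reference(*input) != candidate(*input))
        .collect()
}

// Counts lines the way `str::lines` sees them: a final unterminated line still counts.
fn count_lines_split(text: &str) -> String {
    text.lines().count().to_string()
}

// Counts newline bytes, which is how `wc -l` defines a line.
fn count_lines_bytes(text: &str) -> String {
    text.bytes().filter(|&b| b == b'\n').count().to_string()
}
```

The two counters agree on well-formed files but diverge on input without a trailing newline. That is exactly the kind of edge case that differential testing tends to surface without anyone having to write down the expected answer.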

Check out Bril’s Turnt setup or Calyx’s Runt configuration for full-scale examples of snapshot testing in action.

The Snapshot Philosophy

Snapshot testing is a liberation from the drudgery of “normal” tests. If you’re like me, you’ve internalized that a morally good test is one with a minimal, flexible assertion on the output—one that checks no more than is absolutely necessary. This path is righteous, but it makes testing a bummer. Faced with the prospect of carefully crafting good test logic, in practice I’ll avoid writing tests at all.

Snapshot tests are decadent and depraved. They tempt you into giving up on any semblance of precision: fuck it; just commit the entire output! Let that be your spec! The spoils of the dark side are a joyful, carefree feeling of lightness as you add new tests with abandon.

The sinister philosophy of snapshot testing is:

It should be as easy and as fast as possible to add new tests. Everyone should be able to “lock in” features and fixes with tests, and they should have a good time doing it.
Manual change review is a small price to pay for the better test coverage that stems from convenience.
It’s a feature, not a bug, that the SUT must be a Unixy tool with text input and text output. It forces you to build a simple command-line interface that does a straightforward text-to-text translation, which humans also like.
Tests can act as a crude form of documentation in the form of input/output examples.

Join us!

Addenda on other names for the same idea:

Steffen Smolka says that Google calls it golden testing.
@matt_dz points out that there is a Wikipedia page about this kind of test under the name characterization test.
@bmc_ reports that PostgreSQL’s “regression tests” are snapshot tests.