Adrian Sampson: Try Snapshot Testing for Compilers and Compiler-Like Things

Over the past few years, folks in our lab have become devotees of snapshot testing. Snapshot tests are preposterously simple: they’re just pairs of complete input and output files that you check into version control. It’s a good fit for programs that turn text into other text, which describes compilers and lots of other compiler-like things we tend to build. I like snapshots because they take the drudgery out of writing new tests, so I tend to write a lot more of them.

This approach is so basic and so widespread that I don’t think most people bother to give it a name. It’s like air: it’s so obvious and so obviously useful that there’s no need to talk about it most of the time. But the philosophy is very different from other kinds of testing I am used to, so this post introduces the idea and the reasons you might want to try it.

I’ll demonstrate Turnt, a kind of ascetically simple snapshot testing tool we built in the lab. There are other great options, like LLVM’s lit (which directly inspired Turnt), the Insta crate for Rust, Jane Street’s ppx-based framework for OCaml, and Mercurial’s Cram (the OG, I think). A particularly good option is Runt, Rachit Nigam’s fast and full-featured realization in Rust.

An Example

To feel what snapshot testing is like, let’s test something contrived but convenient. We’ll test the venerable Unix wc command.

The first thing we need is an input file. This is a critical thing about this style of testing: it assumes the thing you want to test is a program that transforms text into other text. Fortunately, that describes lots of compiler-like things, and it also describes our SUT, wc. Let’s make a test file, hi.t:

hello, world!

You can probably guess what wc < hi.t will say:

       1       2      14

The idea in snapshot testing is to “lock in” this output so, as we make changes in the future, we can make sure we didn’t break anything. It’s easy to generate a snapshot file:

$ wc < hi.t > hi.out

If we were really working on the wc implementation, we would check both hi.t and hi.out into version control.

Now all we need is a convenient way to make sure wc < hi.t still matches hi.out. That way, we can write a whole slew of these input files and get into the habit of checking that they all still do the same thing.

Trying Out Turnt

That’s what Turnt does. (And that’s all that it does, more or less.) You can install it with pip:

$ pip install --user turnt

We need to tell Turnt what command to run. Put this into a file called turnt.toml:

command = "wc < {filename}"

Then run Turnt on our little test:

$ turnt hi.t
1..1
ok 1 - hi.t

Success! Turnt tells us that it ran a grand total of one (1) test, and it succeeded—in the sense that wc < hi.t printed, on its standard output, exactly the same stuff that’s saved in hi.out.

Let’s add a second test. Put this in in 2lines.t:

hello,
world!

The first time around, we created the *.out file for our test ourselves. But Turnt will happily do it for us with the --save flag:

$ turnt --save 2lines.t
1..1
not ok 1 - 2lines.t # skip: updated 2lines.out
# missing: 2lines.out

It might be a good idea to cat 2lines.out to make sure it looks OK. Then we can run our entire little test suite:

$ turnt *.t
1..2
ok 1 - 2lines.t
ok 2 - hi.t

Success again! We’re already two tests into the business of growing a thorough test suite. The cornerstone of the snapshot testing philosophy is that it should be extremely easy to add new tests: we just need to write an input file and turnt --save its output, and our test suite will grow.

Turnt’s spartan output is in TAP format, so you can make it prettier using one of a million TAP consumers, like Faucet:

$ turnt *.t | faucet
✓ 2lines.t
✓ hi.t

Adapting to Changes

The trade-off for snapshot testing’s convenience is that its “specifications” are brittle. Because tests have to match the saved output exactly, even tiny changes count as failures. The remedy is to rely on human review—and to make these manual checks as convenient as possible.

Let’s change one of our tests and watch it fail:

$ echo goodbye >> 2lines.t
$ turnt *.t | faucet
⨯ 2lines.t # differing: 2lines.out
✓ hi.t
⨯ fail  1

We want to see what changed in our failing test. Running turnt --diff shows the change:

$ turnt --diff 2lines.t
1..1
--- 2lines.out	2022-07-17 16:04:35.000000000 -0400
+++ /tmp/tmpnim30l99	2022-07-20 14:55:21.000000000 -0400
@@ -1 +1 @@
-       2       2      14
+       3       3      22
not ok 1 - 2lines.t # differing: 2lines.out

That looks good, so we can now turnt --save to accept the new output. In fact, since we’ve checked our output files into version control, it’s sometimes easier to skip turnt --diff altogether: you can just turnt --save the new output and then run git diff to see what’s new. Rolling back is just a git stash away.

If you use pull requests and code reviews, changes to test outputs will appear there too. Your reviewers might appreciate these diffs as an easy way to see what behavior has changed.

Overrides

A snapshot test is just a pair of an input file and an output file. If either is a program of some kind, this setup means that the files also work as standalone examples of the input or output language. (You might want to configure the output so it uses the right filename extension for your language.)

If you need to configure something special about a test, there’s a way to do that inside the input file. It works by assuming your input language has some way of commenting out text, and it extracts options from that text. For example, you can configure your turnt.toml to use {args} as a placeholder for per-test command-line flags:

command = "wc {args} < {filename}"

Then, you put a special marker in your input file:

// ARGS: -l

Turnt doesn’t care what comments look like in your language; it just looks for the string ARGS: anywhere inside it. This test will run wc -l instead of just plain wc. You can also mark tests as expected to fail with a given exit status using something like RETURN: 1.

Interactive Execution

When debugging a test setup, it can be handy to see exactly what a given test is doing. The -p flag turns off all output checking and just shows you the test command and its result:

$ turnt -p hi.t
$ wc  < hi.t
       1       2      14

You can combine -p with --args to interactively try different variants of the test command:

$ turnt -p hi.t --args=-w
$ wc -w < hi.t
       2

In this mode, Turnt becomes a simple way to avoid typing out complicated commands to run them on different input files.

Turnt also supports:

gathering multiple output files from one command,
running several commands on the same input file, and
comparing the outputs from different commands as a form of differential testing.

Check out Bril’s Turnt setup or Calyx’s Runt configuration for full-scale examples of snapshot testing in action.

The Snapshot Philosophy

Snapshot testing is a liberation from the drudgery of “normal” tests. If you’re like me, you’ve internalized that a morally good test is one with a minimal, flexible assertion on the output—one that checks no more than is absolutely necessary. This path is righteous, but it makes testing a bummer. Faced with the prospect of carefully crafting good test logic, in practice I’ll avoid writing tests at all.

Snapshot tests are decadent and depraved. They tempt you into giving up on any semblance of precision: fuck it; just commit the entire output! Let that be your spec! The spoils of the dark side are a joyful, carefree feeling of lightness as you add new tests with abandon.

The sinister philosophy of snapshot testing is:

It should be as easy and as fast as possible to add new tests. Everyone should be able to “lock in” features and fixes with tests, and they should have a good time doing it.
Manual change review is a small price to pay for the better test coverage that stems from convenience.
It’s a feature, not a bug, that the SUT must be a Unixy tool with text input and text output. It forces you to build a simple command-line interface that does a straightforward text-to-text translation, which humans also like.
Tests can act as a crude form of documentation in the form of input/output examples.

Join us!

Addenda on other names for the same idea:

Steffen Smolka says that Google calls it golden testing.
@matt_dz points out that there is a Wikipedia page about this kind of test under the name characterization test.
@bmc_ reports that PostgreSQL’s “regression tests” are snapshot tests.