# Compiler Validation via Equivalence Modulo Inputs

Imagine you, a clever person who wants her C programs to run faster, sat down, thought very hard, and developed a new compiler optimization. Say you implement it as a transformation pass in LLVM so that other people can take advantage of your cleverness to make their programs run faster as well. You run your optimization pass over a few benchmarks and see that it does indeed make some programs run faster. But a question nags you: how do you know that your optimization is correct? That is, how do you know that your optimization doesn't change the semantics of the input program?

Equivalence Modulo Inputs (henceforth EMI), a testing technique introduced by Le et al. in a PLDI 2014 paper, allows our compiler hacker above to test her optimization rigorously without much effort. EMI is especially effective at finding miscompilation bugs, in which the compiler silently produces wrong code; these are far more pernicious than compiler crashes, in which the compiler terminates abnormally. This focus allows EMI to test the optimization phases of compilers much more rigorously than prior work such as Csmith, which finds fewer miscompilations than compiler crashes.

## Background

EMI is a form of differential testing, the widely applicable idea that if multiple systems presumed to be equivalent produce different outputs on the same input, then there is a bug in at least one of the systems.

Csmith also differentially tests compilers by generating random test cases, leading to 281 bug reports for GCC and LLVM by the time it was published. Whereas Csmith compares output from multiple compilers on the same input program, EMI compares output from different programs compiled by the same compiler.

Explicitly wanting to avoid Csmith's painstaking approach of restricting random program generation with aggressive safety analysis, Le et al. design their implementation of EMI around generating equivalent variants of valid seed programs, as we will see below. The EMI and Csmith approaches are not oppositional; in fact, Le et al. incorporate Csmith into their workflow. The vast majority (all but four) of the bugs identified by EMI were found with EMI variants of random programs generated by Csmith.

## Some definitions

Let us now formally define EMI and show how we can use it as a condition to determine whether a compiler is buggy.

**Equivalence modulo inputs (EMI).** Given an input set $I$ and a program $P$, a program $Q$ is an EMI-variant of $P$ (read: $P$ and $Q$ are EMI) relative to $I$ if $P$ and $Q$ have the same denotations ("mean the same thing") over all inputs in $I$. Formally, for all $i$ in $I$, $\llbracket P \rrbracket(i) = \llbracket Q \rrbracket (i)$.

It is immediately clear that EMI is a relaxation of semantic equivalence, wherein $P$ and $Q$ have the same denotations for all possible inputs.

For example, the following two programs are semantically equivalent (and thus EMI for any input set):

$\lambda x. (x + x)$ and $\lambda x. (2 * x)$

The following two programs are not semantically equivalent yet EMI over input set {0}:

$\lambda x. x$ and $\lambda x. 0$
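Both toy examples can be checked directly. In the following Python sketch, plain functions stand in for the lambda terms:

```python
# The first pair is semantically equivalent (spot-checked here over a
# range of inputs); the second pair agrees only on the input set {0}.
P1 = lambda x: x + x
Q1 = lambda x: 2 * x

P2 = lambda x: x
Q2 = lambda x: 0

# Semantic equivalence implies EMI for any input set:
assert all(P1(x) == Q1(x) for x in range(-100, 100))

# EMI over {0}, yet not semantically equivalent:
input_set = {0}
assert all(P2(i) == Q2(i) for i in input_set)
assert P2(1) != Q2(1)
```

Of course, spot-checking a range of inputs only suggests semantic equivalence; EMI over a finite input set is exactly what such a check can establish.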

Now that we have a formal definition of EMI, how can we use it as a condition to check whether a compiler is buggy or not?

**EMI-validity.** Given an input set $I$, a compiler $C$ is EMI-valid relative to $I$ if for any program $P$ and any EMI-variant $Q$ of $P$ relative to $I$, it is the case that $C(P)(i) = C(Q)(i)$ for all $i$ in $I$.

If a compiler is not EMI-valid, then we consider it buggy. But the inverse is not true: if a compiler is EMI-valid, it can still be buggy! Consider the degenerate compiler that maps all source programs to the same target program. The compiler is EMI-valid for any input set, but it is obviously buggy. Thus EMI-validity is a conservative overapproximation for compiler correctness.
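The degenerate compiler is easy to model. In this minimal Python sketch (where degenerate_compile is, of course, a hypothetical stand-in, with "programs" and "target code" modeled as Python functions), every source program compiles to the same target, so any two EMI-variants trivially produce identical outputs:

```python
# A "compiler" that maps every source program to the same target
# program: EMI-valid for any input set, yet obviously buggy.
def degenerate_compile(program):
    return lambda i: 0  # same target regardless of the source

P = lambda x: x + x
Q = lambda x: 2 * x

# The compiled programs agree on every input, so EMI-validity holds.
for i in [0, 1, 2]:
    assert degenerate_compile(P)(i) == degenerate_compile(Q)(i)
```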

Why is this useful? Couldn't we just define validity for a compiler over semantic equivalence, not just its relaxed counterpart in EMI? (You can imagine defining EMI-validity where the input set is all possible inputs.) EMI solves two hard practical problems in differential testing of compilers:

1. How do we generate "equivalent" variants of input programs?

2. How do we check that the compiler's output programs are "equivalent"?

Using the more stringent condition of semantic equivalence makes solving these practical problems hard; indeed, in the general case (2) is undecidable, by a famous result in computability theory (it follows from Rice's theorem). But the more relaxed condition of EMI makes both tractable. As we'll see with the implementation of Orion below, there is an efficient procedure for generating EMI-variants from seed programs, solving (1). And we deem output programs "equivalent" if they are EMI; since we only check equivalence over a particular set of inputs, this gives an efficient procedure for (2).

## EMI in Practice: Orion

Le et al. realize the promise of Equivalence Modulo Inputs by implementing Orion, a bug-finding tool for C compilers. Given a seed program and an input set, it generates EMI-variants of the seed program and then checks if a compiler configuration is EMI-valid with respect to these.

To generate EMI-variants, Orion uses code coverage information provided by tools such as gcov to modify unexecuted parts of the seed program. Intuitively, this procedure generates EMI-variants of the seed program since unexecuted statements should not affect the output of the compiled program.
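Orion relies on gcov for C programs; as an illustration of the same idea in a different setting, the following Python sketch uses sys.settrace to record which line numbers a function executes on a given input (this is a toy stand-in for gcov-style profiling, not anything Orion does):

```python
import sys

def coverage(func, inp):
    """Return the set of line numbers executed by func(inp).

    A toy stand-in for gcov-style coverage, for illustration only.
    """
    executed = set()

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            executed.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(inp)
    finally:
        sys.settrace(None)
    return executed

def example(x):
    if x > 0:
        return x   # executed only for positive inputs
    return -x      # executed only for non-positive inputs

cov = coverage(example, 5)  # the negative branch is absent from cov
```

Statements whose lines never appear in the coverage set for any input are exactly the candidates Orion considers for deletion.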

Specifically, Orion probabilistically deletes unexecuted statements of seed programs to generate EMI-variants. The authors consider other mutation strategies as part of future work, but as we'll see in the evaluation section below, the simple strategy of deleting statements works well in practice.

Orion's EMI-variant generation algorithm is sketched below in gen_variant.

```python
def prune_visit(prog, statement, coverage_set):
    # probabilistically delete unexecuted statement
    if statement not in coverage_set and flip_coin(statement):
        prog.delete(statement)

    # otherwise, traverse its children
    else:
        for child in statement.children:
            prune_visit(prog, child, coverage_set)

def gen_variant(prog, coverage_set):
    emi_variant = clone(prog)
    for statement in emi_variant:
        prune_visit(emi_variant, statement, coverage_set)

    return emi_variant
```

gen_variant takes as input a seed program and the coverage set—the set of all statements executed for some input in the input set. It clones the program into emi_variant and then uses prune_visit to probabilistically delete unexecuted statements.
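To make the sketch concrete, here is a small runnable version over a toy statement tree. The Stmt class, the list-based program representation, and the fixed-probability flip_coin are illustrative stand-ins, not Orion's actual data structures:

```python
import random

class Stmt:
    """A toy statement node with a label and child statements."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def flip_coin(stmt, p=0.5):
    # fixed deletion probability; Orion's is randomized per variant
    return random.random() < p

def prune_visit(kept, stmt, coverage_set):
    # probabilistically delete an unexecuted statement, which also
    # drops its entire subtree
    if stmt.label not in coverage_set and flip_coin(stmt):
        return
    kept.append(stmt.label)
    for child in stmt.children:
        prune_visit(kept, child, coverage_set)

def gen_variant(stmts, coverage_set):
    kept = []
    for stmt in stmts:
        prune_visit(kept, stmt, coverage_set)
    return kept

# "s2" (with child "s3") was never executed, so it may be pruned;
# the executed statement "s1" always survives.
prog = [Stmt("s1"), Stmt("s2", [Stmt("s3")])]
variant = gen_variant(prog, {"s1"})
```

Note that deleting a statement removes its whole subtree, which is why the recursion only descends into the children of statements that are kept.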

With its EMI-variant generation algorithm outlined, we can now sketch the algorithm by which Orion validates C compilers.

```python
def validate(compiler, prog, input_set):
    # compile with no optimizations
    out_prog = compiler.compile(prog, NO_OPTIMIZATION)

    # generate reference output
    in_out_set = [(i, out_prog.execute(i)) for i in input_set]

    # get coverage info
    # a statement is considered covered if it was executed
    # by the program on any input
    coverage_set = set()
    for i in input_set:
        coverage_set = union(coverage_set, coverage(prog, i))

    for _ in range(MAX_ITER):
        emi_variant = gen_variant(prog, coverage_set)

        for config in compiler.configurations:
            out_emi_variant = compiler.compile(emi_variant, config)

            # check if compiled EMI variant is equivalent over all inputs
            for i, o in in_out_set:
                emi_o = out_emi_variant.execute(i)
                if emi_o != o:
                    # compiler is not EMI-valid!
                    report_bug(compiler, config, prog, emi_variant, i, o, emi_o)
```

validate takes as input a compiler (compiler), a seed program (prog), and an input set (input_set). First, validate compiles prog with compiler using no optimizations, the output of which (out_prog) is then used to generate a reference set of outputs for input_set. Next, it uses a code coverage tool (coverage) to determine the set of executed statements over all inputs in input_set. In its "main loop," validate generates an EMI-variant of prog using the computed coverage information by calling gen_variant. For every relevant compiler configuration, it then compiles the EMI-variant and runs the output program over all inputs in input_set to check that it returns the same output as recorded in the reference set. If not, we flag the current compiler configuration as having a bug (report_bug). validate repeats this main loop for some number of iterations (MAX_ITER) to find more bugs using different EMI-variants.
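The differential check at the heart of this loop can be modeled in a few lines of runnable Python. In this simplified sketch, "compilers" are functions from programs to programs, variant generation is omitted, and the deliberately buggy configuration is a contrived example, not a real compiler bug:

```python
def compile_O0(prog):
    # reference configuration: preserves semantics exactly
    return prog

def compile_buggy(prog):
    # a contrived miscompilation: wrong output when the input is 3
    return lambda x: prog(x) + (1 if x == 3 else 0)

def validate(configs, prog, input_set):
    # reference outputs from the unoptimized compilation
    reference = {i: compile_O0(prog)(i) for i in input_set}
    bugs = []
    for name, compile_fn in configs.items():
        out = compile_fn(prog)
        for i, o in reference.items():
            if out(i) != o:
                bugs.append((name, i, out(i), o))
    return bugs

prog = lambda x: x * 2
bugs = validate({"O2": compile_buggy}, prog, [1, 2, 3])
# the divergence on input 3 is flagged as a bug in configuration "O2"
```

In Orion proper, the divergence check runs against the compiled EMI-variant rather than against a recompilation of the seed itself, but the report-on-divergence logic is the same.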

The authors note that the implementation effort for Orion is much less burdensome than other bug-finding tools such as Csmith: whereas Csmith is about 30K-40K lines of C++, Orion is only about 500 lines of shell scripts and 1K lines of C++.

## Evaluation

In order to evaluate EMI—even in its concrete implementation Orion—several questions must be answered:

1. What compilers (and compiler configurations) will be tested?

The authors test GCC and LLVM, popular open-source compilers with transparent bug tracking. The latest development builds of the compilers were tested on an x86_64 machine, targeting both 32- and 64-bit machines. Because the goal is to find miscompilations arising from optimizations, the common optimization configurations were all tested: -O0, -O1, -Os, -O2, -O3.

2. What seed programs will be profiled and pruned?

Some seed programs were taken from the GCC, LLVM, and KCC regression test suites. The authors report attempting to use tests from open-source projects, but were unable to reduce and interpret the resulting bugs.

The bulk of the bugs were found by starting from randomly generated Csmith programs, likely because each consisted of, on average, thousands of lines of code with a high proportion of unexecuted lines.

Though the compiler test-suite programs had been vetted by experts, no one verified that the random Csmith programs produced correct output. This does not matter for EMI: only equivalence between seed and variant (which the pruning process preserves) is needed for EMI variants to detect bugs, not correctness of the seed itself. This greatly increases the pool of usable seed programs.

3. What parameters will guide the pruning process?

For each seed program, a random number of variants was generated, eight in expectation. The two random parameters controlling the likelihood that a given statement is pruned were independently redrawn, uniformly from [0, 1], after each pruning.

4. What will be done with bugs once any have been found?

The authors used a combination of C-reduce and Berkeley Delta to shrink the size of EMI programs that generated different outputs. They attempted to reject programs that triggered undefined behavior by using compiler warnings, static analysis, and CompCert. The final step was reporting the bugs using the compilers' transparent bug-tracking tools.

With this context, on to the headline result:

Orion found 147 confirmed, unique bugs in GCC and LLVM over the course of eleven months in 2013.

Le et al. evaluate these bugs in a twofold way: 1) a quantitative description of the components affected by bugs, and 2) a qualitative evaluation of about ten of the generated programs.

#### Quantitative description

A major strength of the evaluation is its integration with the bug reporting workflows for GCC and LLVM. The authors perhaps go too far in asserting, "First, most of our reported bugs have been confirmed and fixed by developers, illustrating their relevance and importance (as it often takes substantial effort to fix a miscompilation)." Still, the fact that 182 of the 195 total reported bugs (35 of which were marked as duplicates) were confirmed by outside experts to really exist is evidence that EMI is a viable bug-finding strategy.

95 of the confirmed bugs were miscompilations, lending credence to the authors' initial claim that Orion targets miscompilations more effectively than Csmith alone can. The most bugs were found in the development trunks of both GCC and LLVM, and more bugs were found at higher optimization levels, with the most under -O3.

The authors also found performance bugs by comparing compilers to each other, in a differential-testing scenario similar to that used by Csmith. 19 of the 147 confirmed bugs were performance issues.

It's important to note that these were only the bugs that were found by Orion. Because Orion specifically targeted optimization phases, it is understandable that GCC Tree Optimization and RTL optimization were the components with the most discovered bugs (LLVM developers did not classify the reported bugs). These components did not necessarily have more bugs than others, nor were these the only possible bugs.

The authors make no attempt to evaluate the search space that Orion explored in producing these reported bugs, nor do they explicitly determine the proportion of generated variants that led to identified bugs. They report only that they did not record how many seed programs they started with or how many variants they generated (estimating merely "millions to tens of millions"). They also do not report (and likely did not record) the Csmith configurations or Orion's dynamic pruning parameters.

#### Qualitative examples

The confirmed bugs are said to span compiler segfaults, internal compiler errors, performance problems, and wrong code generation. The authors present and interpret a handful of the bugs that were confirmed and fixed by compiler developers. We highlight just two of those for a flavor of the generated programs. Note that the authors only show the reduced code they reported to compiler developers; they show neither the non-reduced versions nor the EMI variants.

The following example led to a segfault when compiled with GCC due to a wrong offset computation in an optimization pass called "predictive commoning," a form of common subexpression elimination:

```c
int b, f, d[5][2];
unsigned int c;
int main() {
  for (c = 0; c < 2; c++)
    if (d[b + 3][c] & d[b + 4][c])
      if (f)
        break;
  return 0;
}
```

Clang incorrectly vectorized the following code:

```c
int main() {
  int a = 1;
  char b = 0;
lbl:
  a &= 4;
  b--;
  if (b) goto lbl;
  return a;
}
```

#### Current statistics

The website for the EMI project tracks the number of bugs found and fixed by tools using the EMI methodology. It shows a remarkable number of bugs found in GCC and LLVM, and demonstrates the technique's usefulness for compilers for other languages, such as Scala.

| Compiler | Bugs reported | Bugs fixed |
| --- | --- | --- |
| GCC/LLVM | 1,602 | 1,007 |
| ICC | 35 | ? |
| CompCert | 31 | 27 |
| scala/dotty | 42 | 17 |

## Discussion

The remaining examples in the paper cover problems with jump-threading logic, global value numbering, inlining, vectorization, and performance. Because the authors analyze only a few cherry-picked examples, a question remains: Are these eight examples representative of all of the other bugs?

Additionally, the authors claim: "EMI variants generated from existing code, say via Orion, are likely programs that people actually write." Is this true, especially when random programs are used as seeds? Is this even true of the two examples discussed above?

The results indicate that the kind of seed program heavily influences the number of bugs that are discovered. Randomly-generated Csmith seed programs revealed far more bugs than those taken from compiler test suites and open-source project tests. This suggests that EMI should be used in conjunction with an existing fuzzer. Do other fuzzers provide amenable seed programs?

Finally, the authors tout EMI as a general validation technique that can be used to differentially test applications such as compilers for other languages. Do you think this methodology will be as useful for other applications as it is for testing C compilers?