Dealing with SAT, levels of optimality, and aspirations towards perfection: Denali
Background and Significance
Denali is a project that aims to generate actually optimal programs using automatic theorem proving. Whereas traditional compilers rely on decades of carefully tuned heuristics, peephole optimizations, and target-specific patterns, Denali aims to search the full design space of machine-code sequences and provably discover the fastest possible implementation of a given computation.
The work is historically notable because it builds directly on Massalin’s “superoptimizer” idea from the late 1980s. Denali replaces brute-force enumeration with a goal-directed search powered by satisfiability solving and automated reasoning. The goal is not to improve code marginally, but to eliminate the gap between compiler output and expert-crafted assembly for short, performance-critical routines – precisely those parts of programs that developers often hand-tune for specific architectures. Denali promises machine-generated code that not only rivals but sometimes exceeds what a seasoned engineer might write, while preserving portability and correctness guarantees.
Key Ideas & Contributions
The core contribution of Denali is a formal system that recasts optimal code generation as a satisfiability problem over an E-graph of equivalent expressions. The authors develop a structured set of axioms, a matching engine, and a SAT-based scheduling/search mechanism that collectively enumerate all computations of a program fragment and then identify the shortest one.
Axioms and Representation
The paper begins by defining several classes of axioms:
- Mathematical Axioms — Generic identities (commutativity, associativity, select/store rules) that capture algebraic facts about operators, independent of any target machine.
- Architectural Axioms — Definitions of available machine instructions in terms of pure functional equivalents.
These axioms serve as the foundation for building a rigorous space of program equivalences. Rather than rewriting expressions directly, Denali uses these axioms to populate an E-graph—a compact representation that tracks all ways an expression can be equivalently computed. This lets Denali explore a massive space of possible instruction sequences without brute-force enumeration.
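To make the representation concrete, here is a toy e-graph sketch (a hypothetical, heavily simplified model, not Denali's implementation): a union-find over hash-consed terms, where merging two class ids records an axiom-derived equivalence such as commutativity.

```python
class EGraph:
    """Toy e-graph. Terms are tuples like ('add', a, b) whose children are
    class ids; leaves are variable names. Each term is hash-consed into an
    equivalence class; merging two classes records that all their terms
    compute the same value, without duplicating shared subterms."""

    def __init__(self):
        self.parent = []    # union-find over class ids
        self.classes = []   # class id -> set of terms known equal
        self.memo = {}      # canonical term -> class id (hash-consing)

    def find(self, c):
        while self.parent[c] != c:
            self.parent[c] = self.parent[self.parent[c]]  # path halving
            c = self.parent[c]
        return c

    def add(self, term):
        if isinstance(term, tuple):
            # Canonicalize child ids so equal terms share one entry.
            term = (term[0],) + tuple(self.find(a) for a in term[1:])
        if term in self.memo:
            return self.find(self.memo[term])
        cid = len(self.parent)
        self.parent.append(cid)
        self.classes.append({term})
        self.memo[term] = cid
        return cid

    def merge(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.classes[a] |= self.classes[b]
        return a

# Applying the commutativity axiom records x+y == y+x in one class:
g = EGraph()
x, y = g.add('x'), g.add('y')
e1 = g.add(('add', x, y))
e2 = g.add(('add', y, x))
g.merge(e1, e2)
```

The point of the structure is sharing: a class holds every known way to compute a term, so downstream search can pick any member without re-enumerating the rest of the expression.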
The E-Graph Expansion Phase
Once the guarded multi-assignment (the primitive IR of Denali) is established, the system repeatedly applies axioms to expand the E-graph. Each application adds equivalences: introducing alternate instruction sequences, identifying common subexpressions, or restructuring expressions to expose new opportunities.
This expansion continues until quiescence: a state where all relevant instances of axioms have been considered. The result is a connected equivalence structure representing potentially thousands of candidate straight-line programs.
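The expansion-to-quiescence loop can be pictured as a fixpoint computation. The sketch below is a hypothetical illustration with two made-up axioms (commutativity and a strength-reduction rule); Denali's matcher works over the shared e-graph rather than an explicit set of terms, which is what keeps the space from exploding.

```python
# Toy quiescence loop: grow a set of terms equal to the seed by applying
# axioms until a full pass adds nothing new. Terms are nested tuples.

def axiom_instances(t):
    """Yield terms equal to t under a tiny, invented axiom set."""
    if isinstance(t, tuple):
        op, *args = t
        if op in ('add', 'mul'):             # mathematical: commutativity
            yield (op, args[1], args[0])
        if op == 'mul' and args[1] == 4:     # architectural: *4 == <<2
            yield ('shl', args[0], 2)
        # Rewriting a subterm yields an equal whole term.
        for i, a in enumerate(args):
            for a2 in axiom_instances(a):
                yield (op, *args[:i], a2, *args[i + 1:])

def saturate(seed):
    known = {seed}
    while True:
        new = {t2 for t in known for t2 in axiom_instances(t)} - known
        if not new:          # quiescence: no axiom instance adds anything
            return known
        known |= new

progs = saturate(('mul', 'x', 4))   # discovers the shift-based variant
```

Unlike this set-of-terms toy, the real e-graph shares subterms across all candidates, so quiescence is reached over a compact structure rather than an exponential list.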
Formalizing Optimality
With the E-graph constructed, Denali then formalizes the statement:
“No program of the target architecture computes the goal terms within K cycles.”
This is encoded entirely as a propositional formula. Boolean variables represent whether each machine operation is launched or completed in a given cycle, whether its operands are available, and whether the equivalence class of a term has been computed.
By negating the conjecture and feeding the resulting constraints to a SAT solver, Denali determines whether a valid ≤K-cycle program exists. A failed proof (a satisfying assignment) yields an explicit instruction schedule; a successful proof means no such program exists, forcing the system to try a larger K. Through this process Denali discovers the smallest K for which the solver finds a model.
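A miniature version of this search might look like the following, with brute-force model-finding standing in for the SAT solver. All ops, dependencies, and latencies are invented, and the machine is assumed to be single-issue; the real encoding also tracks which equivalence class each cycle's result lands in.

```python
from itertools import product

def schedulable(ops, K):
    """ops: name -> (deps, latency). True iff some assignment of issue
    cycles (the 'model') finishes every op within K cycles, issuing at
    most one op per cycle and only after its operands are available."""
    names = list(ops)
    for cycles in product(range(K), repeat=len(names)):
        issue = dict(zip(names, cycles))
        if len(set(cycles)) != len(cycles):           # one issue per cycle
            continue
        ready = all(issue[o] >= issue[d] + ops[d][1]  # operands available
                    for o in names for d in ops[o][0])
        done = all(issue[o] + ops[o][1] <= K for o in names)
        if ready and done:
            return True
    return False

def min_cycles(ops):
    """Grow K until a model exists -- the smallest K whose conjecture
    'no program runs in <= K cycles' is refuted."""
    K = 1
    while not schedulable(ops, K):
        K += 1
    return K

# c consumes a and b; every op has latency 1 -> best schedule is 3 cycles.
ops = {'a': ((), 1), 'b': ((), 1), 'c': (('a', 'b'), 1)}
```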
Preliminary Results
To validate the approach, the authors present several small but illustrative benchmarks:
- Byte Swap: Denali synthesizes an optimal 4-byte swap on the Alpha EV6. The resulting five-cycle program matches or slightly beats highly tuned compiler output.
- 5-Byte Swap: Denali outperforms the production C compiler by one cycle.
- Packet Checksum: A more complex routine involving unrolling, software pipelining, and word-parallel operations. Even without a full pipeline-automation pass, Denali generates a 31-instruction, 10-cycle loop.
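For reference, the 4-byte swap benchmark asks for the fastest machine-code implementation of the following specification (this is just the functional spec written in Python, not the Alpha code Denali emits):

```python
def bswap32(x):
    """Reverse the four bytes of a 32-bit word, e.g. converting between
    little- and big-endian representations. Denali's job is to find the
    cheapest Alpha instruction sequence computing exactly this function."""
    return (((x & 0x000000FF) << 24) |
            ((x & 0x0000FF00) << 8)  |
            ((x & 0x00FF0000) >> 8)  |
            ((x & 0xFF000000) >> 24))

# bswap32(0x12345678) == 0x78563412
```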
Merits and Shortcomings
Fundamentally, formulating the problem as a goal-directed search over an E-graph representation is really cool; it's far more efficient than Massalin's earlier brute-force enumeration, and E-graphs are widely used today in a variety of applications.
The online and in-class discussions yielded some interesting insights and critiques. There were several frequently cited limitations of the work. The first is that the techniques discussed in the paper can only be applied to small instruction sequences that must be manually identified. Superoptimizers cannot operate effectively on long sequences; the search space grows exponentially with input size, making computation and correctness verification (via SMT solvers) expensive or infeasible. To be fair to the authors, the paper doesn’t claim to solve this problem – the proposed strategy is meant to be used to optimize small critical subroutines – but this does require some level of human competence to effectively use the tool.
Furthermore, the Denali superoptimizer (and anything that uses an equivalence-based approach in general) depends on algebraic and architectural axioms that are manually specified (and potentially incorrect). This raises the question of whether we can methodically generate these, and how we can verify their correctness. In particular, enumerating these axioms for different CPU architectures can be expensive, and likely requires human labor or at least verification somewhere down the line. There was also an interesting point raised about the possibility of improving computation efficiency by using only a subset of axioms, and whether this can be done without severely compromising the optimality of the solution.
It’s also worth examining the reliance of Denali on assumptions about the cost model and hardware behavior, e.g. memory access latency data collected by a profiler. The “optimal” solution generated by the superoptimizer depends on a set of axioms encoding the behavior of the target machine; however, it is often very difficult to accurately model real hardware behavior. The actual costs of cache misses, memory access latencies, branch prediction, etc. can be very different in reality than what’s encoded in the cost model, thereby making it possible, or even likely, that the “optimal” solution is not actually optimal in practice.
Despite these drawbacks, Denali put forward a couple of highly impactful ideas: using satisfiability solvers for provably optimal code generation, and using E-graphs to efficiently encode equivalence relations. We'll discuss some later work on superoptimizers in more detail in the next section.
Relevance of the Work
SAT Solving and Equality Saturation
Denali pioneered a goal-directed search by combining SAT solvers with equality saturation (E-graphs). In practice, Denali builds an E-graph of all equivalent expressions, then encodes timing constraints into a SAT query to pick the lowest-cost program. This contrasts with brute-force enumeration and lets Denali prove optimality for small kernels. Modern SAT and SMT solvers are far more powerful now, making this approach more feasible on real code. Indeed, later synthesizing superoptimizers like Souper extract straight-line code from LLVM IR and use SMT solvers to synthesize optimal sequences.
Automated Vectorization and Kernel Fusion
By exposing vector and fused operations as equivalences in the E-graph, a superoptimizer can automatically derive SIMD or fused code paths. For example, one can add rules that say scalar operations are equivalent to an AVX-style vector shuffle or a fused multiply-add. Denali-like search will then naturally consider those options. This is essentially automated vectorization: the solver chooses the option that minimizes cycles. Similarly, if the E-graph includes fusing patterns, the superoptimizer can discover fused kernels that reduce memory traffic. These generated equivalences can be validated and, if safe, added as peephole rewrites.
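A minimal sketch of this idea, with invented costs and a random-testing check standing in for a real equivalence proof: two implementations of a 4-wide elementwise add are treated as members of one equivalence class, and a cost model picks the cheaper one (the role the SAT query plays in a Denali-style search).

```python
import random

def scalar_impl(a, b):
    """Four scalar adds -- the straightforward lowering."""
    return [a[i] + b[i] for i in range(4)]

def simd_impl(a, b):
    """Stand-in for a single 4-wide vector add (an AVX-style 'vaddps')."""
    return [x + y for x, y in zip(a, b)]

# (implementation, cycle cost) -- the costs here are invented, not real.
IMPLS = {'scalar': (scalar_impl, 4), 'simd': (simd_impl, 1)}

def equivalent(f, g, trials=100):
    """Cheap empirical check that two members of the class agree;
    a real system would prove this from the vector-ISA axioms."""
    for _ in range(trials):
        a = [random.randrange(1 << 16) for _ in range(4)]
        b = [random.randrange(1 << 16) for _ in range(4)]
        if f(a, b) != g(a, b):
            return False
    return True

assert equivalent(scalar_impl, simd_impl)
choice = min(IMPLS, key=lambda k: IMPLS[k][1])   # cost model picks 'simd'
```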
Superoptimizing FlashAttention and GPU Memory Patterns
FlashAttention is an optimized GPU kernel for transformer attention that manually reorders computation to maximize shared-memory reuse. Superoptimization tools can reproduce and extend this by searching over reorderings and loop schedules. For instance, the Mirage superoptimizer treated FlashAttention as a target: it automatically discovered graph-based kernel transformations (new custom kernels and data layouts) that matched or beat the expert FlashAttention and FlashDecoding implementations. The key was modeling GPU memory hierarchies in the search: Mirage's μGraph representation lets it reorder matmuls and partition sums. One discovered kernel reshapes and fuses the query/key matrices so that blocks can use full GEMMs (tensor cores) instead of slower GEMVs.
Discovering New Equivalences and Peephole Optimizations
A key byproduct of superoptimization is the resulting equivalence rules. Once an optimal sequence is found for a snippet, it can be turned into a rewrite rule or peephole optimization for a compiler. For example, the Souper superoptimizer has automatically learned thousands of LLVM IR rewrites by synthesizing constant/bit-manipulation patches. However, generalization must be done carefully – a sequence that's optimal for one specific piece of code might not apply universally, so developers often need to vet the learned rules. In practice, superoptimization is a powerful way to seed a compiler's optimization database with machine-checked equivalences.
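As a sketch of that vetting step (Souper itself verifies candidates with an SMT solver; here an exhaustive check over all 8-bit inputs stands in for the proof, and both the snippet and the rewrite are invented examples):

```python
# Vet a learned rewrite before promoting it to a peephole rule.
W = 8                    # toy bit-width; real verification covers 32/64 bits
MASK = (1 << W) - 1

def original(x):
    return (x * 7) & MASK            # snippet the superoptimizer was given

def optimized(x):
    return ((x << 3) - x) & MASK     # learned rewrite: x*7 == (x<<3) - x

# Exhaustive check over the whole input space accepts the rule.
assert all(original(x) == optimized(x) for x in range(1 << W))

def bogus(x):
    return (x << 3) & MASK           # drops the '- x'; wrong for most x

# A rule that only coincidentally matched some inputs must be rejected.
assert not all(original(x) == bogus(x) for x in range(1 << W))
```

This is exactly why learned rules need machine checking before they enter a compiler's rewrite database: a candidate can look right on the examples that produced it and still fail on other inputs.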
Integration in Modern Compilers
Superoptimization ideas are now finding their way into production compilers via equality-saturation engines. For instance, the Cranelift compiler (used for WebAssembly/JIT) was recently reworked to use an e-graph in its mid-end. Cranelift's new e-graph representation (the "ægraph") replaces many separate optimization passes (GVN, LICM, etc.) with one saturation process, applying rewrite rules eagerly without fixpoint loops. This essentially does a bounded form of superoptimization during compilation. Denali's time-bounded search philosophy also appears in practice: compilers typically impose strict limits (time or IR size) on such search. Cranelift's rewrite engine (rules written in its ISLE DSL) applies all known rewrites immediately to each expression, ensuring compile times remain JIT-friendly.