This is a course about the design and implementation of compilers. It is intended to give a reasonably complete introduction to the major aspects of compilers, providing students with the mental toolbox to build compilers and to understand more advanced compiler techniques they may encounter later.
Compilers are notoriously complex. Compiler construction was historically considered a significant challenge requiring major engineering effort. It is still not easy to build a compiler, but fortunately, techniques have been developed over the past few decades that greatly simplify compiler design.
A running theme of this course is that the different parts of a compiler can be described in a declarative way. These declarative specifications can then be turned into efficient and clear code, even if we are programming in an imperative language. Compared to other introductions to compiler construction, this course focuses more on describing these specifications precisely, using inference rules and other mathematical techniques where other courses might be satisfied with less precision. The advantage of precision is that it gives us as developers a well-defined implementation goal, so that it makes sense to talk about whether an implementation is correct or not.
These precise descriptions mean that we are taking a theory-oriented approach to compilers. In fact, a surprisingly large amount of computer science theory turns out to be useful for the design and implementation of compilers. Many theoretical contributions in computer science were originally motivated by compiler construction. That theory includes not only the theory of programming languages, but also the theory behind useful algorithms and data structures.
This course is not just about theory; it's also intended to give students the ability to implement a real, working compiler. So we sometimes get into implementation details too. These details are often very interesting in their own right!
Of course, what programming language you are compiling has a large influence on how you implement a compiler. Some language features require special techniques or impose restrictions on what compilers can do. Understanding how to build compilers naturally deepens our understanding of programming languages. And understanding the interactions between the language and the compiler also makes us better programmers even when we are not implementing compilers.
A compiler is a translator from one language, the source language, to another, the target language. Translation is not an easy task when the source and target languages are very different. In a classic compiler, the source language is a programming language designed to be understandable to humans, whereas the target language is assembly language or machine language, which is designed to be efficiently executed by hardware. However, other translations are useful and indeed commonly used. For example, the Java compiler compiles Java source code to Java Virtual Machine (JVM) code (called bytecode), and the JVM runtime system has a second, “just-in-time” (JIT) compiler that translates frequently executed bytecode into machine code that the processor can execute directly.
Do we need a compiler to run programs? No—programs can be run using an interpreter. For example, Java bytecode is executed by a JVM interpreter that simulates the execution of a hypothetical machine. However, software interpreters are much less efficient than processor hardware, so compiled code tends to execute at least an order of magnitude faster than interpreted code. In fact, a processor is simply a hardware implementation of an interpreter for machine code. The performance advantage of compiled code, running on this hardware interpreter, also means that compiled code consumes an order of magnitude less energy and generates an order of magnitude less heat. Of course, the actual step of compiling code can require a lot of computation, so if you are only going to run a program once, it's better to interpret it with a software interpreter. Compilation makes more sense for programs that are going to be run many times or for a long time.
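To make the idea of a software interpreter concrete, here is a minimal sketch of one for a hypothetical stack machine. The instruction set (`PUSH`, `ADD`, `MUL`) is invented for illustration and is far simpler than real JVM bytecode, but the structure—a loop that fetches and dispatches on each instruction—is the same shape that hardware implements directly.

```python
# Minimal sketch of a bytecode interpreter for a hypothetical stack machine.
# Each iteration fetches one instruction and dispatches on its opcode,
# mirroring (in slow software) what a processor does in hardware.

def interpret(program):
    stack = []
    pc = 0  # program counter
    while pc < len(program):
        op = program[pc]
        if op == "PUSH":
            pc += 1
            stack.append(program[pc])  # operand follows the opcode
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        pc += 1
    return stack[-1]

# Evaluates (2 + 3) * 4:
result = interpret(["PUSH", 2, "PUSH", 3, "ADD", "PUSH", 4, "MUL"])
# result == 20
```

The per-instruction dispatch overhead of this loop is a large part of why software interpretation is an order of magnitude slower than running compiled machine code.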
Source code and target code are typically quite different. Source code is optimized for humans to read and modify. It tries to match human notions of language to some extent, and it has redundancy to help prevent programmer mistakes. Source code supports analysis and reasoning, either by humans or by algorithms. Target code, on the other hand, is usually optimized for efficient interpretation. It is compact, not very readable, and does not waste space on redundancy. Target code often discards much of the high-level structure of the source code, so it is hard to reason about.
Even more important for compilers than the performance of the generated code is its correctness. Debugging code is hard enough when you can trust the compiler; if the bugs might have been inserted by the compiler, debugging becomes truly difficult. The differences between the source and target languages also make it more difficult to show that the compiler is doing its job correctly. Interestingly, it is actually possible to define with mathematical precision what it means for a compiler to be correct, when there is a defined semantics for both the source and target languages: that is, a mathematical description of the behavior of both languages. And indeed, a number of compilers have been developed along with proofs of correctness—even proofs that can be automatically checked to ensure they are constructed correctly. Developing compilers with this high degree of assurance is not our goal here, but the approach we take of precisely specifying different parts of the compilation process is helpful for achieving this goal.
To see the flavor of the work that a compiler does, consider the following short function implemented in the source language we will be using in this course. Its syntax is similar to a number of popular languages, so it should be easy to understand.
code
An unoptimizing compiler might translate the above code to the x86-64 assembly code shown below on the left. The variables from the source program are stored in memory on the program stack and are accessed using memory operands. For example, the variable a is stored at memory location [rbp - 16], where rbp is a register that points to the base of the current stack frame. Also notable is that the control structure of the high-level source code, such as the while-statement, has been replaced with low-level jump instructions like jne and jmp.
The unoptimized code is less efficient than the code an experienced assembly programmer could produce, but a good optimizing compiler can often match or improve on hand-written code, generating code like that shown on the right. Variables are stored entirely in registers, the compiler has done a better job of choosing efficient instructions, and instructions associated with managing the stack have been removed because the stack is not used.
The compiler usually doesn't finish the job of generating machine code; this step is left to the assembler. The result of assembling the optimized assembly code above is the following sequence of machine code bytes, expressed in hexadecimal, with their positions shown in hexadecimal in the left column and the corresponding assembly code shown on the right. Notice that the labels in the assembly code have been replaced by hexadecimal offsets.
machine code
It might seem almost magical that it is possible to translate source code to efficient, correct low-level code, and even more so that it can be done reasonably quickly. The key insight that makes high-quality translation feasible is not to do it all at once. The way to solve a hard problem is often to break it into smaller, easier problems, and compilers are no exception.
Instead of translating directly from the source code (a stream of bytes) to the target code (another stream of bytes), we define a sequence of compiler phases, each of which represents the original program differently. Essentially, between the source and target languages we define a sequence of intermediate languages, each designed for the convenience of the corresponding compiler phase.
To see how this works in a nutshell, let's look at how a program statement like the following one might be represented in each compiler phase:
if (val >= 0) pos = val
The first step in compilation is to turn the input program, represented as a sequence of bytes or characters, into a tree data structure that supports later compiler phases. Lexical analysis divides up the input into a sequence of tokens (or lexemes). These are the “words” of the program. Whitespace and other token separators are discarded:
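A lexer for this statement can be sketched with regular expressions, one per token kind. The token names and patterns below are hypothetical—a real lexer for the course language would recognize many more kinds—but they show how whitespace is matched and then discarded while everything else becomes a token:

```python
import re

# A sketch of a lexer: each token kind gets a named regular expression.
# Order matters: keywords (IF) are tried before identifiers, and the
# two-character operator >= is tried before the one-character =.
TOKEN_SPEC = [
    ("IF",     r"if\b"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("NUM",    r"\d+"),
    ("GE",     r">="),
    ("ASSIGN", r"="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),       # whitespace: matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

tokenize("if (val >= 0) pos = val")
# [('IF', 'if'), ('LPAREN', '('), ('IDENT', 'val'), ('GE', '>='),
#  ('NUM', '0'), ('RPAREN', ')'), ('IDENT', 'pos'), ('ASSIGN', '='),
#  ('IDENT', 'val')]
```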
The next step is parsing, which creates a tree structure that captures the syntax of the program, discarding syntactic elements such as parentheses that do not further affect the meaning of the program. The output of this phase is an abstract syntax tree:
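The abstract syntax tree for our example statement can be sketched with a few node classes. These class definitions are hypothetical—the course compiler's actual AST design may differ—but they illustrate how the tree structure itself conveys the grouping that parentheses expressed in the source:

```python
from dataclasses import dataclass

# Hypothetical AST node classes for a fragment of the language.

@dataclass
class Var:
    name: str

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Var
    expr: object

@dataclass
class If:
    cond: object
    then: object

# AST for: if (val >= 0) pos = val
# The parentheses around the condition are gone; the tree shape alone
# records how the program is structured.
ast = If(cond=BinOp(">=", Var("val"), Num(0)),
         then=Assign(Var("pos"), Var("val")))
```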
Semantic analysis completes the task of determining whether the source code represents a legal program. It often involves type checking. Additional information may be collected about the program during this phase, such as the types of program expressions. This information may be used to decorate the abstract syntax tree.
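Type checking can be sketched as a recursive function over expressions. The sketch below assumes a hypothetical language with just int and bool types and a type environment mapping variable names to their declared types:

```python
# A sketch of type checking for expressions, assuming a hypothetical
# two-type language (int, bool). Expressions are tagged tuples; env maps
# variable names to types.

def type_of(expr, env):
    kind = expr[0]
    if kind == "num":
        return "int"
    if kind == "var":
        return env[expr[1]]            # undeclared variable raises KeyError
    if kind == "binop":
        _, op, left, right = expr
        lt, rt = type_of(left, env), type_of(right, env)
        if lt != "int" or rt != "int":
            raise TypeError(f"operands of {op} must be int")
        return "bool" if op in (">=", "<", "==") else "int"
    raise ValueError(f"unknown expression kind: {kind}")

env = {"val": "int", "pos": "int"}
type_of(("binop", ">=", ("var", "val"), ("num", 0)), env)
# 'bool' — the condition of the if-statement is well-typed
```

The type computed for each subexpression is exactly the kind of information that could decorate the corresponding AST node for use by later phases.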
The compiler generates new representations of the program in multiple phases. Typically, there is an initial phase of code generation that produces an intermediate representation (IR) of the program. A later phase of the compiler generates target code. When the target language is assembly code, this phase is called instruction selection.
A common intermediate representation is as a control-flow graph. This would be an output of intermediate code generation:
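A control-flow graph can be sketched as a set of basic blocks, each holding straight-line instructions and naming its successor blocks. The block names, temporary `t1`, and three-address instruction syntax below are invented for illustration:

```python
# A sketch of a control-flow graph for: if (val >= 0) pos = val
# Each basic block holds straight-line instructions plus labeled edges
# to its successors. The IR syntax and names are hypothetical.

cfg = {
    "entry": {
        "instrs": ["t1 = val >= 0"],
        "succs":  {"true": "then", "false": "exit"},  # conditional branch
    },
    "then": {
        "instrs": ["pos = val"],
        "succs":  {"fall": "exit"},
    },
    "exit": {
        "instrs": [],
        "succs":  {},
    },
}

# Many analyses walk the graph; e.g., finding all blocks reachable
# from the entry block:
def reachable(cfg, start="entry"):
    seen, work = set(), [start]
    while work:
        block = work.pop()
        if block not in seen:
            seen.add(block)
            work.extend(cfg[block]["succs"].values())
    return seen

reachable(cfg)
# {'entry', 'then', 'exit'}
```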
A later phase of code generation is instruction selection, in which the intermediate representation is translated to assembly instructions. Instructions can be generated as abstract assembly code, in which variables are assumed to be usable as if they were hardware registers. For our example, the result of translating to abstract (x86-64) assembly code might be the following:
cmp val, 0
jl L1
mov pos, val
L1:
Optimization can be applied to make programs better. Typically “better” means “faster”, although other goals are sometimes sought, such as reducing the memory footprint or the energy consumption of the program. Optimization can be applied in almost every compiler phase.
The most important optimization is register allocation, which assigns variables to hardware registers. Applying register allocation to the assembly code above might yield the following x86-64 assembly code:
cmp rax, 0
jl L1
mov rcx, rax
L1:
Another optimization might allow us to use a more efficient instruction sequence that avoids branching:
cmp rax, 0
cmovge rcx, rax
The phases of a compiler are typically structured into a front end and a back end.
The front end of the compiler comprises phases that do the first part of compilation: lexical, syntactic, and semantic analysis, intermediate code generation, and intermediate code optimization. These phases convert the program from input source code to an optimizable intermediate representation that is relatively free of details about the final target machine architecture. The front end is therefore largely machine-independent and can be used for more than one target machine.
The back end handles the machine-dependent aspects of compilation: instruction selection and machine-dependent optimizations such as register allocation. In principle, the back end is largely independent of the original source language, so if the intermediate code representation it takes as input is sufficiently general, the back end can be reused for multiple source languages.
The compiler usually only does one part of the task of producing executable code. After it produces assembly code for each of the compilation units (source files), an assembler is used to produce object code modules that have dependencies on each other. A linker resolves the dependencies between the object files and determines how to patch or extend the assembly code so that these dependencies are satisfied. A loader, often built into the operating system, then loads the linker output into memory and produces an executable image that the processor can interpret.
Traditionally, many compilers were implemented using the language they were implementing. This process is a bit like pulling yourself up by your own bootstraps, but these self-hosting compilers have some nice properties, such as avoiding dependencies on other languages and helping to identify limitations in the language design.
In the modern era, many programming languages are available for implementing a compiler, and some of them may be better suited to writing a compiler than the language the compiler itself implements, so self-hosting is less valuable than it once was.
In practice, most modern, general-purpose languages work reasonably well for compiler implementation. However, there are some trade-offs worth thinking about before you plunge into coding.
A compiler is a big project with many moving parts, and one that is often too big for a single person to tackle alone or to complete in a timely fashion. Good language support for modularity and separation of concerns is essential. A language with a static type system is also very helpful to guide the design of interfaces and to help constrain the interaction of different code modules.
One pain point is that compilers use complex, often cyclic data structures that are shared across multiple compiler phases. A language with automatic garbage collection is recommended so that these data structures can be shared across module boundaries without needing to complicate interfaces with memory management directives, performing unnecessary copying, or worrying about memory corruption errors.
Languages that support pattern matching—particularly functional languages—work well for the more tree-structured front end of the compiler, but many of these languages make it more difficult to manage arbitrary graph structures, so they will probably be less of a win for the back end of the compiler.