Adrian Sampson: LLVM for Grad Students

This is an introduction to doing research with the LLVM compiler infrastructure. It should be enough for a grad student to go from mostly uninterested in compilers to excited to use LLVM to do great work.

What is LLVM?

LLVM is a compiler. It’s a really nice, hackable, ahead-of-time compiler for “native” languages like C and C++.

Of course, since LLVM is so awesome, you will also hear that it is much more than this (it can also be a JIT; it powers a great diversity of un-C-like languages; it is the new delivery format for the App Store; etc.; etc.). These are all true, but for our purposes, the above definition is what matters.

A few huge things make LLVM different from other compilers:

LLVM’s intermediate representation (IR) is its great innovation. LLVM works on a representation of programs that you can actually read if you can read assembly. This may not seem like a great revelation, but it is: other compilers’ IRs tend to be in-memory structures so complicated that you can’t really write them down. This makes other compilers harder to understand and messier to implement.
LLVM is nicely written: its architecture is way more modular than other compilers. Part of the reason for this niceness comes from its original implementor, who is one of us.
Despite being the research tool of choice for squirrelly academic hackers like us, LLVM is also an industrial-strength compiler backed by the largest company on Earth. This means you don’t have to compromise between a great compiler and a hackable compiler, as you do in Javaland when you choose between HotSpot and Jikes.

Why Would a Grad Student Care About LLVM?

LLVM is a great compiler, but who cares if you don’t do compilers research?

A compiler infrastructure is useful whenever you need to do stuff with programs. Which, in my experience, is a lot. You can analyze programs to see how often they do something, transform them to work better with your system, or change them to pretend to use your hypothetical new architecture or OS without actually fabbing a new chip or writing a kernel module. For grad students, a compiler infrastructure is more often the right tool than most people give it credit for. I encourage you to reach for LLVM by default before hacking any of these tools unless you have a really good reason:

an architectural simulator
a dynamic binary instrumentation tool like Pin
source-level transformation (from simple stuff like sed to more complete tools involving AST parsing and serialization)
hacking the kernel to intercept system calls
anything resembling a hypervisor

Even if a compiler doesn’t seem like a perfect match for your task, it can often get you 90% of the way there far easier than, say, a source-to-source translation.

Here are some nifty examples of research projects that used LLVM to do things that are not all that compilery:

Virtual Ghost from UIUC showed you could use a compiler pass to protect processes from compromised OS kernels.
CoreDet from UW makes multithreaded programs deterministic.
In our approximate computing work, we use an LLVM pass to inject errors into programs to simulate error-prone hardware.

I’ll reemphasize: LLVM is not just for implementing new compiler optimizations!

The Pieces

Here’s a picture that shows the major components of LLVM’s architecture (and, really, the architecture of any modern compiler):

Front End, Passes, Back End

There are:

The front end, which takes your source code and turns it into an intermediate representation or IR. This translation simplifies the job of the rest of the compiler, which doesn’t want to deal with the full complexity of C++ source code. You, an intrepid grad student, probably do not need to hack this part; you can use Clang unmodified.
The passes, which transform IR to IR. In ordinary circumstances, passes usually optimize the code: that is, they produce an IR program as output that does the same thing as the IR they took as input, except that it’s faster. This is where you want to hack. Your research tool can work by looking at and changing IR as it flows through the compilation process.
The back end, which generates actual machine code. You almost certainly don’t need to touch this part.

Although this architecture describes most compilers these days, one novelty about LLVM is worth noting here: programs use the same IR throughout the process. In other compilers, each pass might produce code in a unique form. LLVM opts for the opposite approach, which is great for us as hackers: we don’t have to worry much about when in the process our code runs, as long as it’s somewhere between the front end and back end.

Getting Oriented

Let’s start hacking.

Get LLVM

You’ll need to install LLVM. Linux distributions often have LLVM and Clang packages you can use off the shelf. But you’ll need to ensure you get a version that includes all the headers necessary to hack with it. The OS X build that comes with Xcode, for example, is not complete enough. Fortunately, it’s not hard to build LLVM from source using CMake. Usually, you only need to build LLVM itself: your system-provided Clang will do just fine as long as the versions match (although there are instructions for building Clang too).

On macOS in particular, the Homebrew formula is a great way to do it, but otherwise Brandon Holt has good instructions.

RTFM

You will need to get friendly with the documentation. I find these links in particular are worth coming back to:

The automatically generated Doxygen pages are super important. You will need to live inside these API docs to make any progress at all while hacking on LLVM. Those pages can be hard to navigate, though, so I recommend going through Google. If you append “LLVM” to any function or class name, Google usually finds the right Doxygen page. (If you’re diligent, you can even train Google to give you LLVM results first even without typing “LLVM”!) I realize this sounds ridiculous, but you really need to jump around LLVM’s API docs like this to survive—and if there’s a better way to navigate the API, I haven’t found it.
The language reference manual is handy if you ever get confused by syntax in an LLVM IR dump.
The programmer’s manual describes the toolchest of data structures peculiar to LLVM, including efficient strings, STL alternatives for maps and vectors, etc. It also outlines the fast type introspection tools (isa, cast, and dyn_cast) that you’ll run into everywhere.
Read the Writing an LLVM Pass tutorial whenever you have questions about what your pass can do. Because you’re a researcher and not a day-to-day compiler hacker, this article disagrees with that tutorial on some details. (Most urgently, ignore the Makefile-based build system instructions and skip straight to the CMake-based “out-of-source” instructions.) But it’s nonetheless the canonical source for answers about passes in general.
The GitHub mirror is sometimes convenient for browsing the LLVM source online.

Let’s Write a Pass

Productive research with LLVM usually means writing a custom pass. This section will guide you through building and running a simple pass that transforms programs on the fly.

A Skeleton

I’ve put together a template repository that contains a useless LLVM pass. I recommend you start with the template: when starting from scratch, getting the build configuration set up can be painful.

Clone the llvm-pass-skeleton repository from GitHub:

$ git clone https://github.com/sampsyo/llvm-pass-skeleton.git

The real work gets done in skeleton/Skeleton.cpp, so open up that file. Here’s where the business happens:

PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM) {
    for (auto &F : M) {
        errs() << "I saw a function called " << F.getName() << "!\n";
    }
    return PreservedAnalyses::all();
};

There are several kinds of LLVM pass, and we’re using one called a module pass. Exactly as you would expect, LLVM invokes the method above for every module, which roughly corresponds to one source-code file. For now, all it does is loop over all the functions in the module and out their names.

Details:

That errs() thing is an LLVM-provided C++ output stream we can use to print to the console.
The function returns PreservedAnalyses::all() to indicate that it didn’t modify M. Later, when we actually transform the program, we’ll need to return something like PreservedAnalyses::none().

Build It

Build the pass with CMake:

$ cd llvm-pass-skeleton
$ mkdir build
$ cd build
$ cmake ..  # Generate the Makefile.
$ make  # Actually build the pass.

If LLVM isn’t installed globally, you will need to tell CMake where to find it. You can do that by giving it the path to the share/llvm/cmake/ directory inside wherever LLVM resides in the LLVM_DIR environment variable. Here’s an example with the path from Homebrew:

$ LLVM_DIR=`brew --prefix llvm`/lib/cmake/llvm cmake ..

Building your pass produces a shared library. You can find it at build/skeleton/libSkeletonPass.so or a similar name, depending on your platform. In the next step, we’ll load this library to run the pass on some real code.

Run It

To run your new pass, invoke clang on some C program and use some freaky flags to point at the shared library you just compiled:

$ clang -fpass-plugin=`echo build/skeleton/SkeletonPass.*` something.c
I saw a function called main!

Instead of just typing clang, you will want to use the Clang binary associated with the LLVM installation you used to build the pass. For Homebrew’s keg-only LLVM, for example, use `brew --prefix llvm`/bin/clang.

That -fpass-plugin=build/skeleton/SkeletonPass.so option is all you need to load and activate your pass in Clang. So if you need to process larger projects, you can just add those arguments to a Makefile’s CFLAGS or the equivalent for your build system.

(You can also run passes one at a time, independently from invoking clang. This way, which uses LLVM’s opt command, is the official documentation-sanctioned way, but I won’t cover it here.)

Congratulations; you’ve just hacked a compiler! In the next steps, we’ll extend this hello-world pass to do something interesting to the program.

Understanding LLVM IR

Module, Function, BasicBlock, Instruction — Modules contain Functions, which contain BasicBlocks, which contain Instructions. Everything but Module descends from Value.

To work with programs in LLVM, you need to know a little about how the IR is organized.

Containers

Here’s an overview of the most important components in an LLVM program:

A Module represents a source file (roughly) or a translation unit (pedantically). Everything else is contained in a Module.
Most notably, Modules house Functions, which are exactly what they sound like: named chunks of executable code. (In C++, both functions and methods correspond to LLVM Functions.)
Aside from declaring its name and arguments, a Function is mainly a container of BasicBlocks. The basic block is a familiar concept from compilers, but for our purposes, it’s just a contiguous chunk of instructions.
An Instruction, in turn, is a single code operation. The level of abstraction is roughly the same as in RISC-like machine code: an instruction might be an integer addition, a floating-point divide, or a store to memory, for example.

Most things in LLVM—including Function, BasicBlock, and Instruction—are C++ classes that inherit from an omnivorous base class called Value. A Value is any data that can be used in a computation—a number, for example, or the address of some code. Global variables and constants (a.k.a. literals or immediates, like 5) are also Values.

An Instruction

Here’s an example of an Instruction in the human-readable text form of LLVM IR:

%5 = add i32 %4, 2

This instruction adds two 32-bit integer values (indicated by the type i32). It adds the number in register 4 (written %4) and the literal number 2 (written 2) and places its result in register 5. This is what I mean when I say LLVM IR looks like idealized RISC machine code: we even use the same terminology, like register, but there are infinitely many registers.

That same instruction is represented inside the compiler as an instance of the Instruction C++ class. The object has an opcode indicating that it’s an addition, a type, and a list of operands that are pointers to other Value objects. In our case, it points to a Constant object representing the number 2 and another Instruction corresponding to the register %4. (Since LLVM IR is in static single assignment form, registers and Instructions are actually one and the same. Register numbers are an artifact of the text representation.)

By the way, if you ever want to see the LLVM IR for your program, you can instruct Clang to do that:

$ clang -emit-llvm -S -o - something.c

Inspecting IR in Our Pass

Let’s get back to that LLVM pass we were working on. We can inspect all of the important IR objects by sending them to a C++ ostream with <<. It just prints out the human-readable representation of an object in the IR. Since our pass gets handed Functions, let’s use it to iterate over each Function’s BasicBlocks, and then over each BasicBlock’s set of Instructions.

Here’s some code to do that. You can get it by checking out the containers branch of the llvm-pass-skeleton git repository:

errs() << "Function body:\n" << F << "\n";
for (auto& B : F) {
  errs() << "Basic block:\n" << B << "\n";
  for (auto& I : B) {
    errs() << "Instruction: " << I << "\n";
  }
}

Using C++11’s fancy auto type and foreach syntax makes it easy to navigate the hierarchy in LLVM IR.

If you build the pass again and run a program through it, you should now see the various parts of the IR split out as we traverse them.

Now Make the Pass Do Something Mildly Interesting

The real magic comes in when you look for patterns in the program and, optionally, change the code when you find them. Here’s a really simple example: let’s say we want to replace the first binary operator (+, -, etc.) in every function with a multiply. Sounds useful, right?

Here’s the code to do that. This version, along with an example program to try it on, is available in the mutate branch of the llvm-pass-skeleton git repository:

for (auto& B : F) {
  for (auto& I : B) {
    if (auto* op = dyn_cast<BinaryOperator>(&I)) {
      // Insert at the point where the instruction `op` appears.
      IRBuilder<> builder(op);

      // Make a multiply with the same operands as `op`.
      Value* lhs = op->getOperand(0);
      Value* rhs = op->getOperand(1);
      Value* mul = builder.CreateMul(lhs, rhs);

      // Everywhere the old instruction was used as an operand, use our
      // new multiply instruction instead.
      for (auto& U : op->uses()) {
        User* user = U.getUser();  // A User is anything with operands.
        user->setOperand(U.getOperandNo(), mul);
      }

      // We modified the code.
      return true;
    }
  }
}

Details:

That dyn_cast<T>(p) construct is an LLVM-specific introspection utility. It uses some conventions from the LLVM codebase to make dynamic type tests efficient, because compilers have to use them all the time. This particular construct returns a null pointer if I is not a BinaryOperator, so it’s perfect for special-casing like this.
The IRBuilder is for constructing code. It has a million methods for creating any kind of instruction you could possibly want.
To stitch our new instruction into the code, we have to find all the places it’s used and swap in our new instruction as an argument. Recall that an Instruction is a Value: here, the multiply Instruction is used as an operand in another Instruction, meaning that the product will be fed in as an argument.
We should probably also remove the old instruction, but I left bit that off for brevity.

Now if we compile a program like this (example.c in the repository):

#include <stdio.h>
int main(int argc, const char** argv) {
    int num;
    scanf("%i", &num);
    printf("%i\n", num + 2);
    return 0;
}

Compiling it with an ordinary compiler does what the code says, but our plugin makes it double the number instead of adding 2:

$ cc example.c
$ ./a.out
10
12
$ clang -fpass-plugin=build/skeleton/SkeletonPass.so example.c
$ ./a.out
10
20

Like magic!

Linking With a Runtime Library

When you need to instrument code to do something nontrivial, it can be painful to use IRBuilder to generate the LLVM instructions to do it. Instead, you probably want to write your run-time behavior in C and link it with the program you’re compiling. This section will show you how to write a runtime library that logs the results of binary operators instead of silently changing them.

Here’s the LLVM pass code, which is in the rtlib branch of the llvm-pass-skeleton repository:

// Get the function to call from our runtime library.
LLVMContext& Ctx = F.getContext();
FunctionCallee logFunc = F.getParent()->getOrInsertFunction(
  "logop", Type::getVoidTy(Ctx), Type::getInt32Ty(Ctx)
);

for (auto& B : F) {
  for (auto& I : B) {
    if (auto* op = dyn_cast<BinaryOperator>(&I)) {
      // Insert *after* `op`.
      IRBuilder<> builder(op);
      builder.SetInsertPoint(&B, ++builder.GetInsertPoint());

      // Insert a call to our function.
      Value* args[] = {op};
      builder.CreateCall(logFunc, args);

      return true;
    }
  }
}

The tools you need are Module::getOrInsertFunction and IRBuilder::CreateCall. The former adds a declaration for your runtime function logop, which is analogous to declaring void logop(int i); in the program’s C source without a function body. The instrumentation code pairs with a run-time library (rtlib.c in the repository) that defines that logop function:

#include <stdio.h>
void logop(int i) {
  printf("computed: %i\n", i);
}

To run an instrumented program, link it with your runtime library:

$ cc -c rtlib.c
$ clang -fpass-plugin=build/skeleton/SkeletonPass.so -c example.c
$ cc example.o rtlib.o
$ ./a.out
12
computed: 14
14

If you like, it’s also possible to stitch together the program and runtime library before compiling to machine code. The llvm-link utility, which you can think of as the rough IR-level equivalent of ld, can help with that.

Annotations

Most projects eventually need to interact with the programmer. You’ll eventually wish for annotations: some way to convey extra information from the program to your LLVM pass. There are several ways to build up annotation systems:

The practical and hacky way is to use magic functions. Declare some empty functions with special, probably-unique names in a header file. Include that file in your source and call those do-nothing functions. Then, in your pass, look for CallInst instructions that invoke your functions and use them to trigger your magic. For example, you might use calls like __enable_instrumentation() and __disable_instrumentation() to let the program confine your code-munging to specific regions.
If you need to let programmers add markers to function or variable declarations, Clang’s __attribute__((annotate("foo"))) syntax will emit metadata with an arbitrary string that you can process in your pass. Brandon Holt again has some background on this technique. If you need to mark expressions instead of declarations, the undocumented and sadly limited __builtin_annotation(e, "foo") intrinsic might work.
You can jump in full dingle and modify Clang itself to interpret your new syntax. I don’t recommend this.
If you need to annotate types—and I believe people often do, even if they don’t realize it—I’m developing a system called Quala. It patches Clang to support custom type qualifiers and pluggable type systems, à la JSR-308 for Java. Let me know if you’re interested in collaborating on this project!

I hope to expand on some of these techniques in future posts.

And More

LLVM is enormous. Here are a few more topics I didn’t cover here:

Using the vast array of classic compiler analyses available in LLVM’s junk drawer.
Generating any special machine instructions, as architects often want to do, by hacking the back end.
Exploiting debug info, so you can connect back to the source line and column corresponding to a point in the IR.
Writing frontend plugins for Clang.

I hope this gave you enough background to make something awesome. Explore, build, and let me know if this helped!

Thanks to the UW architecture and systems groups, who sat through an out-loud version of this post and asked many shockingly good questions.

Addenda, courtesy of kind readers:

Emery Berger pointed out that dynamic binary instrumentation tools, like Pin, are still the right choice if you need to observe architecture specifics: registers, the memory hierarchy, instruction encoding, etc.
Brandon Holt just posted tips for debugging in LLVM, including how to draw control flow graphs with GraphViz.
John Regehr mentioned in a comment a drawback to hitching your software wagon to LLVM’s star: API instability. LLVM internals change a lot from release to release, so maintaining a project means keeping up with the project. Alex Bradbury’s LLVM Weekly newsletter is a great resource for following the LLVM ecosystem.