LLVM for Grad Students

August 3, 2015

This is an introduction to doing research with the LLVM compiler infrastructure. It should be enough for a grad student to go from mostly uninterested in compilers to excited to use LLVM to do great work.

What is LLVM?

LLVM is a compiler. It’s a really nice, hackable, ahead-of-time compiler for “native” languages like C and C++.

Of course, since LLVM is so awesome, you will also hear that it is much more than this (it can also be a JIT; it powers a great diversity of un-C-like languages; it is the new delivery format for the App Store; etc.; etc.). These are all true, but for our purposes, the above definition is what matters.

A few huge things make LLVM different from other compilers:

Why Would a Grad Student Care About LLVM?

LLVM is a great compiler, but who cares if you don’t do compilers research?

A compiler infrastructure is useful whenever you need to do stuff with programs. Which, in my experience, is a lot. You can analyze programs to see how often they do something, transform them to work better with your system, or change them to pretend to use your hypothetical new architecture or OS without actually fabbing a new chip or writing a kernel module. For grad students, a compiler infrastructure is more often the right tool than most people give it credit for. I encourage you to reach for LLVM by default before hacking any of these tools unless you have a really good reason:

Even if a compiler doesn’t seem like a perfect match for your task, it can often get you 90% of the way there far easier than, say, a source-to-source translation.

Here are some nifty examples of research projects that used LLVM to do things that are not all that compilery:

I’ll reemphasize: LLVM is not just for implementing new compiler optimizations!

The Pieces

Here’s a picture that shows the major components of LLVM’s architecture (and, really, the architecture of any modern compiler):

Front End, Passes, Back End

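There are three major stages: the front end, which takes your source code and turns it into an intermediate representation (IR); the passes, which transform IR into IR, usually to analyze or optimize it; and the back end, which turns the IR into machine code.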

Although this architecture describes most compilers these days, one novelty about LLVM is worth noting here: programs use the same IR throughout the process. In other compilers, each pass might produce code in a unique form. LLVM opts for the opposite approach, which is great for us as hackers: we don’t have to worry much about when in the process our code runs, as long as it’s somewhere between the front end and back end.
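
To make the "same IR throughout" point concrete, here is a toy sketch in plain C++. It does not use LLVM at all: ToyIR, Pass, removeDead, addLogging, and runPipeline are made-up names for illustration. The only point is that when every pass maps one IR form to that same form, passes can be chained in any order.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for an IR: just a list of textual instructions.
using ToyIR = std::vector<std::string>;

// Every pass has the same shape: IR in, IR out.
using Pass = std::function<ToyIR(const ToyIR&)>;

// Pretend optimization: drop instructions whose text starts with "dead".
ToyIR removeDead(const ToyIR& ir) {
  ToyIR out;
  for (const auto& inst : ir) {
    if (inst.rfind("dead", 0) != 0) out.push_back(inst);
  }
  return out;
}

// Pretend instrumentation: append a logging call.
ToyIR addLogging(const ToyIR& ir) {
  ToyIR out = ir;
  out.push_back("call log");
  return out;
}

// Run the passes in order. Any order type-checks because all of them
// consume and produce the same IR form.
ToyIR runPipeline(ToyIR ir, const std::vector<Pass>& passes) {
  for (const auto& pass : passes) {
    assert(pass && "every pass must be callable");
    ir = pass(ir);
  }
  return ir;
}
```

Calling runPipeline(ir, {removeDead, addLogging}) and runPipeline(ir, {addLogging, removeDead}) both compile and run. In a compiler where each stage had its own representation, the second ordering might not even make sense.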

Getting Oriented

Let’s start hacking.

Get LLVM

You’ll need to install LLVM. Linux distributions often have LLVM and Clang packages you can use off the shelf. But you’ll need to ensure you get a version that includes all the headers necessary to hack with it. The macOS build that comes with Xcode, for example, is not complete enough. Fortunately, it’s not hard to build LLVM from source using CMake. Usually, you only need to build LLVM itself: your system-provided Clang will do just fine as long as the versions match (although there are instructions for building Clang too).

On macOS in particular, the Homebrew formula is a great way to do it, but otherwise Brandon Holt has good instructions.

RTFM

You will need to get friendly with the documentation. I find these links in particular are worth coming back to:

Let’s Write a Pass

Productive research with LLVM usually means writing a custom pass. This section will guide you through building and running a simple pass that transforms programs on the fly.

A Skeleton

I’ve put together a template repository that contains a useless LLVM pass. I recommend you start with the template: when starting from scratch, getting the build configuration set up can be painful.

Clone the llvm-pass-skeleton repository from GitHub:

$ git clone https://github.com/sampsyo/llvm-pass-skeleton.git

The real work gets done in skeleton/Skeleton.cpp, so open up that file. Here’s where the business happens:

PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM) {
    for (auto &F : M) {
        errs() << "I saw a function called " << F.getName() << "!\n";
    }
    return PreservedAnalyses::all();
}

There are several kinds of LLVM pass, and we’re using one called a module pass. Exactly as you would expect, LLVM invokes the method above for every module, which roughly corresponds to one source-code file. For now, all it does is loop over all the functions in the module and print out their names.

Details:

Build It

Build the pass with CMake:

$ cd llvm-pass-skeleton
$ mkdir build
$ cd build
$ cmake ..  # Generate the Makefile.
$ make  # Actually build the pass.

If LLVM isn’t installed globally, you will need to tell CMake where to find it. You can do that by setting the LLVM_DIR environment variable to the directory inside your LLVM installation that holds its CMake configuration files. In recent releases that is lib/cmake/llvm/; older ones used share/llvm/cmake/. Here’s an example with the path from Homebrew:

$ LLVM_DIR=`brew --prefix llvm`/lib/cmake/llvm cmake ..

Building your pass produces a shared library. You can find it at build/skeleton/SkeletonPass.so or a similar name, depending on your platform. In the next step, we’ll load this library to run the pass on some real code.

Run It

To run your new pass, invoke clang on some C program and use some freaky flags to point at the shared library you just compiled:

$ clang -fpass-plugin=`echo build/skeleton/SkeletonPass.*` something.c
I saw a function called main!

Instead of just typing clang, you will want to use the Clang binary associated with the LLVM installation you used to build the pass. For Homebrew’s keg-only LLVM, for example, use `brew --prefix llvm`/bin/clang.

That -fpass-plugin=build/skeleton/SkeletonPass.so option is all you need to load and activate your pass in Clang. So if you need to process larger projects, you can just add that flag to a Makefile’s CFLAGS or the equivalent for your build system.

(You can also run passes one at a time, independently from invoking clang. This way, which uses LLVM’s opt command, is the official documentation-sanctioned way, but I won’t cover it here.)

Congratulations; you’ve just hacked a compiler! In the next steps, we’ll extend this hello-world pass to do something interesting to the program.

Understanding LLVM IR

Module, Function, BasicBlock, Instruction
Modules contain Functions, which contain BasicBlocks, which contain Instructions. Everything but Module descends from Value.

To work with programs in LLVM, you need to know a little about how the IR is organized.

Containers

Here’s an overview of the most important components in an LLVM program:

Most things in LLVM—including Function, BasicBlock, and Instruction—are C++ classes that inherit from an omnivorous base class called Value. A Value is any data that can be used in a computation—a number, for example, or the address of some code. Global variables and constants (a.k.a. literals or immediates, like 5) are also Values.

An Instruction

Here’s an example of an Instruction in the human-readable text form of LLVM IR:

%5 = add i32 %4, 2

This instruction adds two 32-bit integer values (indicated by the type i32). It adds the number in register 4 (written %4) and the literal number 2 (written 2) and places its result in register 5. This is what I mean when I say LLVM IR looks like idealized RISC machine code: we even use the same terminology, like register, but there are infinitely many registers.

That same instruction is represented inside the compiler as an instance of the Instruction C++ class. The object has an opcode indicating that it’s an addition, a type, and a list of operands that are pointers to other Value objects. In our case, it points to a Constant object representing the number 2 and another Instruction corresponding to the register %4. (Since LLVM IR is in static single assignment form, registers and Instructions are actually one and the same. Register numbers are an artifact of the text representation.)
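
If it helps to see that object structure spelled out, here is a toy model in plain C++. These are not LLVM’s real classes: ToyValue, ToyConstant, ToyInstruction, and makeAddTwo are hypothetical names that only mirror the description above (an opcode, a type, and operand pointers to other values).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy model only: these mimic the shape described above, not LLVM's API.
struct ToyValue {
  virtual ~ToyValue() = default;
};

// A literal, such as the `2` in the example instruction.
struct ToyConstant : ToyValue {
  int value;
  explicit ToyConstant(int v) : value(v) {}
};

// An instruction is itself a value: other instructions refer to its result
// by pointing at the instruction object, not at a register number.
struct ToyInstruction : ToyValue {
  std::string opcode;               // e.g. "add"
  std::string type;                 // e.g. "i32"
  std::vector<ToyValue*> operands;  // non-owning pointers to other values
  ToyInstruction(std::string op, std::string ty, std::vector<ToyValue*> ops)
      : opcode(std::move(op)), type(std::move(ty)), operands(std::move(ops)) {}
};

// Build the equivalent of `%5 = add i32 %4, 2`, given the instruction
// whose result the text form calls %4.
ToyInstruction makeAddTwo(ToyInstruction* reg4, ToyConstant* two) {
  assert(reg4 != nullptr && two != nullptr);
  return ToyInstruction("add", "i32", {reg4, two});
}
```

Notice that nothing in the model stores the numbers 4 or 5. The first operand is simply a pointer to whichever instruction produced the value, which is the sense in which registers and Instructions are the same thing.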

By the way, if you ever want to see the LLVM IR for your program, you can instruct Clang to do that:

$ clang -emit-llvm -S -o - something.c

Inspecting IR in Our Pass

Let’s get back to that LLVM pass we were working on. We can inspect all of the important IR objects by sending them to an LLVM output stream, such as errs(), with <<. It just prints out the human-readable representation of an object in the IR. Since our pass already loops over the Functions in a Module, let’s extend that loop to iterate over each Function’s BasicBlocks, and then over each BasicBlock’s Instructions.

Here’s some code to do that. You can get it by checking out the containers branch of the llvm-pass-skeleton git repository:

errs() << "Function body:\n" << F << "\n";
for (auto& B : F) {
  errs() << "Basic block:\n" << B << "\n";
  for (auto& I : B) {
    errs() << "Instruction: " << I << "\n";
  }
}

Using C++11’s fancy auto type deduction and range-based for loops makes it easy to navigate the hierarchy in LLVM IR.

If you build the pass again and run a program through it, you should now see the various parts of the IR split out as we traverse them.

Now Make the Pass Do Something Mildly Interesting

The real magic comes in when you look for patterns in the program and, optionally, change the code when you find them. Here’s a really simple example: let’s say we want to replace the first binary operator (+, -, etc.) in every function with a multiply. Sounds useful, right?

Here’s the code to do that. This version, along with an example program to try it on, is available in the mutate branch of the llvm-pass-skeleton git repository:

for (auto& B : F) {
  for (auto& I : B) {
    if (auto* op = dyn_cast<BinaryOperator>(&I)) {
      // Insert at the point where the instruction `op` appears.
      IRBuilder<> builder(op);

      // Make a multiply with the same operands as `op`.
      Value* lhs = op->getOperand(0);
      Value* rhs = op->getOperand(1);
      Value* mul = builder.CreateMul(lhs, rhs);

      // Everywhere the old instruction was used as an operand, use our
      // new multiply instruction instead.
      for (auto& U : op->uses()) {
        User* user = U.getUser();  // A User is anything with operands.
        user->setOperand(U.getOperandNo(), mul);
      }

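      // We modified the code, so tell LLVM that no analyses are preserved.
      // (Returning here also ends the pass after the first match.)
      return PreservedAnalyses::none();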
    }
  }
}

Details:

Now let’s try the pass on a program like this (example.c in the repository):

#include <stdio.h>
int main(int argc, const char** argv) {
    int num;
    scanf("%i", &num);
    printf("%i\n", num + 2);
    return 0;
}

Compiling it with an ordinary compiler does what the code says, but our plugin makes it double the number instead of adding 2:

$ cc example.c
$ ./a.out
10
12
$ clang -fpass-plugin=build/skeleton/SkeletonPass.so example.c
$ ./a.out
10
20

Like magic!
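
The least obvious step in that pass is the loop that rewrites uses. Here is a toy version in plain C++ that shows the same idea without LLVM. ToyInst and replaceUses are hypothetical names, and unlike LLVM, which tracks each value’s uses for you, this sketch just scans every operand in the block.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction: an opcode plus pointers to the instructions it reads.
struct ToyInst {
  std::string opcode;
  std::vector<ToyInst*> operands;
};

// Point every operand slot that refers to `oldInst` at `newInst` instead.
// Returns how many operand slots were rewritten.
std::size_t replaceUses(const std::vector<ToyInst*>& block, ToyInst* oldInst,
                        ToyInst* newInst) {
  assert(oldInst != nullptr && newInst != nullptr);
  std::size_t rewritten = 0;
  for (ToyInst* inst : block) {
    for (ToyInst*& operand : inst->operands) {
      if (operand == oldInst) {
        operand = newInst;
        ++rewritten;
      }
    }
  }
  return rewritten;
}
```

As in the real pass, the old instruction is left in place; it just has no users anymore. LLVM’s Value class also provides a replaceAllUsesWith method that performs this kind of rewrite in a single call, which is worth knowing about once the manual loop makes sense.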

Linking With a Runtime Library

When you need to instrument code to do something nontrivial, it can be painful to use IRBuilder to generate the LLVM instructions to do it. Instead, you probably want to write your run-time behavior in C and link it with the program you’re compiling. This section will show you how to write a runtime library that logs the results of binary operators instead of silently changing them.

Here’s the LLVM pass code, which is in the rtlib branch of the llvm-pass-skeleton repository:

// Get the function to call from our runtime library.
LLVMContext& Ctx = F.getContext();
FunctionCallee logFunc = F.getParent()->getOrInsertFunction(
  "logop", Type::getVoidTy(Ctx), Type::getInt32Ty(Ctx)
);

for (auto& B : F) {
  for (auto& I : B) {
    if (auto* op = dyn_cast<BinaryOperator>(&I)) {
      // Insert *after* `op`.
      IRBuilder<> builder(op);
      builder.SetInsertPoint(&B, ++builder.GetInsertPoint());

      // Insert a call to our function.
      Value* args[] = {op};
      builder.CreateCall(logFunc, args);

      // We modified the code, so no analyses are preserved.
      return PreservedAnalyses::none();
    }
  }
}

The tools you need are Module::getOrInsertFunction and IRBuilder::CreateCall. The former adds a declaration for your runtime function logop, which is analogous to declaring void logop(int i); in the program’s C source without a function body. The instrumentation code pairs with a run-time library (rtlib.c in the repository) that defines that logop function:

#include <stdio.h>
void logop(int i) {
  printf("computed: %i\n", i);
}

To run an instrumented program, link it with your runtime library:

$ cc -c rtlib.c
$ clang -fpass-plugin=build/skeleton/SkeletonPass.so -c example.c
$ cc example.o rtlib.o
$ ./a.out
12
computed: 14
14
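
The "insert after" trick in the pass (bumping the insertion point one step past op) can look odd at first. Here is a toy analogue in plain C++, using a std::list of strings as a stand-in for a basic block. ToyBlock and instrumentFirstAdd are made-up names, and this is only a sketch of the idea, not how IRBuilder works internally.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Toy basic block: an ordered list of textual instructions.
using ToyBlock = std::list<std::string>;

// Insert a logging call right after the first instruction whose text starts
// with "add". Returns true if the block was changed.
bool instrumentFirstAdd(ToyBlock& block) {
  for (auto it = block.begin(); it != block.end(); ++it) {
    if (it->compare(0, 3, "add") == 0) {
      // Lists insert *before* a position, so inserting before the next
      // position places the call immediately after `it`.
      block.insert(std::next(it), "call logop");
      assert(*std::next(it) == "call logop");
      return true;
    }
  }
  return false;
}
```

The call has to come after the operation because it consumes the operation’s result, which does not exist yet at the point where the operation itself sits.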

If you like, it’s also possible to stitch together the program and runtime library before compiling to machine code. The llvm-link utility, which you can think of as the rough IR-level equivalent of ld, can help with that.

Annotations

Most projects eventually need to interact with the programmer. Sooner or later you’ll wish for annotations: some way to convey extra information from the program to your LLVM pass. There are several ways to build up annotation systems:

I hope to expand on some of these techniques in future posts.

And More

LLVM is enormous. Here are a few more topics I didn’t cover here:

I hope this gave you enough background to make something awesome. Explore, build, and let me know if this helped!


Thanks to the UW architecture and systems groups, who sat through an out-loud version of this post and asked many shockingly good questions.

Addenda, courtesy of kind readers: