Project 3: Write an LLVM Pass

Proposal due: October 30, 2019
Report due: November 13, 2019

Overview

The point of Project 3 is to learn to use the LLVM compiler infrastructure. The scope is similar to Project 1, i.e., small: you don't need to make anything too ambitious; just build something that works and evaluate it rigorously.

The end result of your project should be an LLVM pass—a thing that reads or transforms LLVM IR programs. Unless you have a very good reason not to, you will implement your pass in C++ so you have access to the full power of the LLVM API. Unlike in Project 1, your pass needs to work on full-scale, real-world programs rather than just on small tests.

To understand the scope of Project 3, think back to Project 1, where you built something cool for Bril. Project 3 is like that, with some important differences:

LLVM, because it is an industrial-strength compiler, is harder to use than Bril, which is an educational toy.
For the same reason, LLVM ships with lots of standard compiler functionality built in for you to reuse. For example, you might consider relying on an existing alias analysis pass or using the CloneFunction utility method.
You don't want to propose to extend the LLVM language, which would take a lot of time to implement. And it already has stuff like pointers and floating-point numbers, of course!
Because the LLVM language is so large and complete, you won't want to propose a project that requires you to completely address every possible LLVM instruction. For example, building a backend would be a bad idea.
Your choice of language is constrained: you really want to use C++ for this project. (Don't worry; in Project 4, you can go back to using whatever language you want.)
There are not many Bril programs in the world yet, so you didn't have access to full-scale applications in your evaluation. That excuse won't work in Project 3; your evaluation will probably need to use full-scale, real-world programs.

Finally, for simplicity, we're requiring that projects be implemented as passes on LLVM IR. Your pass need not actually transform the IR; implementing an analysis that generates warnings or errors, for example, is in scope. But building a new frontend that emits LLVM IR is out of scope.

Ideas

Here are some general categories of LLVM passes you might implement. This is not meant as an exhaustive list; please consider getting creative with your proposal:

Implement a standard optimization, even if you think an existing LLVM pass can do the same optimization. That's OK, but your evaluation will probably want to ensure that the baseline treatment doesn't do the optimization in question. While some Project 2 ideas might be too ambitious, these optimizations seem tractable:
- Dead code elimination (based on data or control).
- Strength reduction, e.g., replacing x * 2 with x << 1.
- Loop unrolling.
- Loop-invariant code motion.
- Function inlining or outlining.
- If-conversion, i.e., replacing control flow with a conditional move.
- Basic block reordering for locality.
Some sort of static code analysis. The goal would be to emit an error or warning when the code does something you consider dangerous. Pick soundness or completeness as a goal. Here are some properties you might check:
- Simple information flow from "untrusted" sources to "dangerous" sinks you determine (e.g., by detecting certain libc calls like fread and fwrite).
- Check for basic memory safety bugs, such as use-after-free or double-free errors.
- Implement some approach to alias analysis.
Similarly, instrument the code to do some dynamic checking. The idea is that the program should print an error message or forcibly exit before it does something you consider dangerous. For example:
- A more precise memory safety bug detector for a specific kind of issue. For example, you could insert a null check before every dereference or a bounds check before every GEP.
- Collect some sort of run-time execution statistic, such as the number of LLVM instructions, basic blocks, or memory operations executed.
- Profile edges or paths.

Doing the Project

Start with my LLVM pass skeleton repository. (I recommend you use the noauto branch, which avoids a problem on recent LLVM version.) Before you do anything else, make sure you can build and run the SkeletonPass that just prints out the names of functions in a program. Then you can steal my skeleton code, copy it to your own repository, and start hacking from there.

For a walkthrough of some LLVM basics, see my example-driven blog post. Invariably, you will want to rely on the official documentation and the auto-generated Doxygen pages for the LLVM API while you're working. You might also find the "Kaleidoscope" tutorial useful, although it is meant for people building frontends rather than passes.

To evaluate your pass, use real code written in a language that compiles to LLVM. That probably means C and C++ code via Clang, but it might include code in other languages that compile to LLVM. Here are some benchmark suites you can consider:

PARSEC is a serious and very widely used C/C++ benchmark suite. There are three versions of each benchmark; you probably want to stick to the "pthreads" version and ignore the OpenMP and TBB versions.
Embench is a recent effort to create a C benchmark suite targeted toward embedded systems as opposed to "normal" computers. The authors claim that it fills a similar role but is a better choice than the very old and widespread Dhrystone benchmark.
Probably not SPEC, which has an annoying license.

Because time is short, you don't need to get a complete benchmark suite running if that's too hard. But you should not rely exclusively on hand-written tests.