The CS 6120 Course Blog

Bril Bytecode Interpreter

by Charles Sherk, Susan Garry

Motivation

In order to interpret Bril, we currently have access to brili, which is written with an emphasis on readable code and extensiblity. However, it is written in typescript and is not very fast. brilirs attempts to solve this problem with a Rust implementation, but is also intended to be a showcase of the bril-rs library for working with Bril in Rust. It also does some runtime checking, which requires more time. There were some points in this course where time was spent on “waiting for the Bril interpreter” that could have been avoided. So, we propose and implement an interpreter for Bril whose sole focus is speed. On top of this, we provide a reasonably well specified bytecode format for Bril, as well as a translation from this bytecode to armv8-a assembly.

Design overview

Big Red Bytecode

In order to make our interpreter as efficient as possible, we have designed a very compressed representation of Bril instructions: (almost) all of them are represented in 64 bits, with some taking multiple 64 bit words as needed. See the documentation.

The first component of the bytecode translation is variable names. We translate variables to 16 bit numbers in the order that they appear in the function. For functions with arguments, the first argument is 0, then 1 et cetera, and the variables used in the body follow. This allows the following encoding of “most” instructions:

By “most” instructions we mean the instructions of the form t1 = t2 op t3. arg1, arg2, and arg3 are the aforementioned numbers of the variables.

We also have to translate labels, as strings don’t fit into our representation very well. We will have the instructions of a function in an array of 64 bit bytecode instructions, so we represent labels as indices in this array. We also have to keep track of which instructions had a label in the original program so that we can interpret the φ instruction efficiently.

We notice that nearly all operations are coincidentally encodings of type information. For example, add only operates on int operands/results. However, the id instruction is an exception to this, but it only has one argument so we encode the type in the space for the second argument. See the more detailed docs for the type encoding.

We also have a translation from operations to numbers. These are provided in config files in our repo, as well as our documentation PDF.

Some Bril instructions simply won’t fit into a 64 bit (or even constant sized) word, so they have variable length encoding. The call instruction is a prime example of this. Function calls may have any number of arguments, so the first word contains the information about how many words following it are part of this instruction. Everything is always a multiple of 64 in order to maintain alignment.

Functions & more

We “encode” function names as their index in a program. So, the first function is 0, then 1, et cetera. This is all well and good for the current state of Bril, but it would be nice for the compiled programs to be linkable, so we preserve the function names in our function type. We have not specified a format for the bytecode other than the instruction encoding, as we feel that this is the more important contribution.

In order to appease the C typechecker and not rely on undefined behavior, we use a union type to internally represent instructions, but most have the first 16 bits as the opcode, so we can pull this out and then determine how to decode the rest of the instruction. In principle, we could also use uint64_t or uint16_t[4], but using unions is the most user friendly, and lets the C compiler do more work to access the fields of structs, in case there are any cool tricks for that behind the scenes.

Features

We support a superset of the behavior provided by brili, so options like -p work exactly the same. We also support the following:

fastbrili, despite the name, is also a compiler to armv8-a assembly

We translate variables to consecutive natural numbers (the first seen is 0, then 1, etc.), and we couldn’t help but notice that these numbers could easily be multiplied by 8, and then just turned into stack locations.

We have a working compiler to armv8 with no register allocation, and are working on linear scan register allocation in arm as well as x86 code generation. Unfortunately, these features aren’t quite finished due to being undergraduate students with other finals, however there’s a good chance they’ll be fixed after the class is over (in the interest of having something nice to merge into the Bril repo).

Our general translation from Bril instructions to assembly was as follow:

  1. Load the temps used by the instruction from the stack into registers
  2. Run the instruction on these registers
  3. Store the result from whatever register it ends up in back to the stack

Our assembly does depend on the C standard library for functions like malloc (to support the memory extension) as well as a few printing functions as doing it all with system calls just didn’t seem like a great use of time. We have some functions in the lib/ directory for printing base types, as well as parsing them in order for the main function to read command line arguments. A future improvement would be putting malloc and other C functions we call into this file so that we could keep a compiled version around and not have to link with C.

Assembly difficulties

The translation to armv8-a went quite smoothly. The most difficult part was passing arguments to functions, as there are 8 argument registers for ints, and 8 for floats. As we immediately save these to the stack in the prologue of our functions, we could have just ignored the difference between the ints and floats and used all 16 registers in order, but obeying the calling convention allows us to generate code that can link with other assembly.

Register allocation was enough work that we have not finished it yet, but this is more due to the existence of other classes than any serious difficulties, and it will probably be working at some point this summer.

Implementation

Our interpreter consists of an initial pass to translate into this bytecode, which turns temps and labels from strings into numbers, and reduces the size of the code overall. The bytecode instructions are stored in an array so that we can benefit from cache locality. We then interpret them by iterating through this array. One minor inconvenience is that certain instructions, such as print, require multiple words. This means that our array index for iteration will NOT be an induction variable, so the C compiler will not be able to optimize it as aggressively. One solution to this could be using a less flat representation of a list of instructions, and using pointers for the multiword instructions, but this seems like more trouble than it’s worth, especially considering the benefits of cache locality.

For the armv8 assembly generation, we initially implemented it as a single pass over the bytecode that printed the instructions as it went. This is fine, but it leaves little room for extensions and improvements, so in order to get started on register allocation we refactored to have an internal representation of arm instructions. For this, we went with a tagged union, as it is similar to a variant which is very convenient for this, as learned in CS 4120.

Evaluation

We used the Bril benchmarks from the repo, and used wall-clock time to evaluate the difference. This isn’t exactly optimal as most of these benchmarks are very small, so the startup time might be significant, but the startup time should hurt our interpreter the most, as it has the translation phase. Our results are here, in ods format. On (geometric) average, brili-rs runs in .13 the time of brili, and fastbrili runs in .21 the time of brili-rs. fastbrili also runs in .03 the time of brili, so I think we can call this a job well done.

One interesting part of the performance evaluation was that the compiled code ran in .80 the time of fastbrili. This means either our interpreter is very good, or our compiler isn’t. We very trivial instruction selection as well as no register allocation, so probably the latter, but this gives us a really good idea of how fast the interpreter is. We use 4 armv8 instructions for (most) of the Bril instructions, which means that if most of the runtime is indeed running the program, fastbrili runs on the order of 5 machine instructions to interpret one Bril instruction.

Drawbacks

While we would love to say that fastbrili is the answer to all of your Bril needs, there are some minor drawbacks, mainly that invalid Bril programs cause undefined behavior. Most of the type information is erased and then inferred (e.g., if you use add on two floats, it will not check, assume they are ints, and run integer addition on the values), but we do deal with all correct Bril programs correctly. The Bril spec explicitly permits that we only deal with well defined programs, so we have.We believe that this isn’t much of an issue, as brili isn’t actually that helpful with these errors, though we do acknowledge that fastbrili is worse.

We impose a couple extra limits on the Bril language:

We check all of these and will fail if they are violated, though this should be very uncommon.

Extensibility

One of the really cool things about Bril is that it is relatively easy to extend. Our bytecode is designed in a way that it supports $2^{15}$ opcodes, which is room for plenty of extensions, but if you wanted to add more types this would be considerably more difficult. We rely on being able to represent types in 16 bits, by using 2 of them for the “base type” and then the other 14 for pointer depth. Realistically $2^{14}$ is deeper than any pointer will be nested, so adding a few more base types wouldn’t be too much of an issue, but adding more complex types would be a serious complication. This could be solved by having one bit of tag for whether the type is complex, and then using the rest of the bits to index into some memoized array of the types that show up, but this would require a bit of work.

Saving Bytecode

While we do support saving the bytecode, we have not specified exactly how it works, so if you want to interoperate with it right now it requires reading the byte-io.c file to see the exact format. This is fine since right now we are the only ones using this bytecode, but in the interest of this being a helpful extension to the Bril ecosystem, this should be documented. We’ll probably do this soon, but didn’t get to it yet.