Finding Redundant Structures in Data Flow Graphs

Wed, 18 Dec 2019 00:00:00 +0000

Go with the flow 🌊
Analyzing the data flow graphs of programs allows us to think about the shape of the computation, independent of the literal order a programmer used to specify it. In particular, two separate programs are more likely to share data flow structure than literal source code, since many reorderings of the source preserve the same data flow. Even within a single source program, shared structure in the data flow graph may indicate core computational patterns.
Data flow graphs for computational acceleration
If our goal is to compile faster or more energy-efficient code, data flow graphs can help show us where to focus. By identifying redundant subgraphs in the structure of data flow graphs, we can find groupings of operations that we expect to occur frequently enough to benefit from additional optimization effort. What's more, the shape of the subgraphs is also a signal for how useful the acceleration might be: subgraphs that are wider, rather than simply linear chains, indicate more opportunity for fine-grained parallelism. Our goals in this project are shaped by the domain of hardware acceleration with heterogeneous computing, where a compiler's goal is to target multiple processors, each with differing strengths and weaknesses.
For this project, we build on the LLVM compiler infrastructure to find redundant structures in programs' static data flow graphs. Our goal is to find a fixed number of subgraph structures that occur the most frequently (that is, cover the highest number of instructions) throughout the program. We focus on finding candidate subgraphs with high frequency, and leave analysis and heterogeneous compilation of those subgraphs to later work.
Building data flow graphs from LLVM
Data flow graphs exist at multiple levels of abstraction in a compiler toolchain, and there are trade-offs to targeting any particular choice.
First, data flow graphs can either represent a program statically, purely from the program's source code, or dynamically, from a program execution trace. A static DFG has a one-to-one relation to the source code: each operation and its dependencies are directly translated. The control flow of the program exists only implicitly: if a data value's flow depends on the branching structure of the program, the DFG would have back edges and cycles. A dynamic DFG captures a single trace throughout the program, where operations are repeated each time they are executed. In this case, the data flow graph remains acyclic (with values only flowing "down"), and loops in the control flow repeat in the subgraph for each time the loop is executed. However, dynamic data flow graphs only represent a single execution of the program and may not even cover the full program behavior. They also may be infeasible to generate ahead of time for long-running applications, and they tend in practice to be so large as to limit analysis to fragments of the full dynamic DFGs.
In addition, DFGs can target either the intermediate representation level, with LLVM-level operations, or the machine code level, with operations corresponding to the exact instruction set architecture. The machine code data flow graph corresponds more directly to the program's actual execution, but is not as general across different targets.
For this project, we use LLVM to target the static DFG at the intermediate representation level of abstraction. LLVM translates the program source to static single assignment (SSA) form, where every variable name is assigned exactly once. Because LLVM's in-memory intermediate representation stores pointers to instructions' operands, we can build a program's static data flow graph by inserting an edge from each of an instruction's operands to the instruction itself. We narrow the project's scope to acyclic subgraphs by considering subgraphs only within basic block boundaries, which lack branching control flow.
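As a rough illustration of the construction described above, the sketch below builds one DFG per basic block from a flat instruction list. The tuple format and the instruction list are hypothetical stand-ins for LLVM's in-memory IR (they are not the JSON format our pass emits); the point is only that SSA makes each operand name a unique producer, so an edge runs from producer to consumer, and edges that cross block boundaries are dropped.

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for LLVM IR:
# (basic block, result name, opcode, operand names).
INSTRUCTIONS = [
    ("entry", "%a", "mul", ["%x", "%y"]),
    ("entry", "%b", "add", ["%a", "%z"]),
    ("entry", "%c", "srem", ["%b", "%n"]),
    ("loop", "%d", "shl", ["%c", "%k"]),  # %c is defined in another block
    ("loop", "%e", "add", ["%d", "%c"]),
]


def build_block_dfgs(instructions):
    """Return {block: (nodes, edges)} with edges kept inside each block."""
    # SSA guarantees each result name has exactly one defining instruction.
    defined_in = {result: block for block, result, _, _ in instructions}
    dfgs = defaultdict(lambda: ({}, set()))
    for block, result, opcode, operands in instructions:
        nodes, edges = dfgs[block]
        nodes[result] = opcode
        for operand in operands:
            # Skip arguments, constants, and values from other blocks.
            if defined_in.get(operand) == block:
                edges.add((operand, result))
    return dict(dfgs)


dfgs = build_block_dfgs(INSTRUCTIONS)
```

Here the `entry` block yields the chain `%a` → `%b` → `%c`, while the `loop` block keeps only the `%d` → `%e` edge because `%c` is produced elsewhere.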
Matching fixed DFG stencils
To begin, let's imagine we already have some oracle that has given us a great candidate subgraph (which we'll call a stencil), and our job is to find all the redundant instantiations of that stencil. If we consider a large program DFG `G` and a smaller stencil DFG `H`, the task is to find as many subgraph isomorphisms from `H` into `G` as possible. Here, the larger program DFG `G` is generated directly from the LLVM in-memory representation as described above, but does not include edges across control flow boundaries. Rather, `G` is a collection of DFG components, one per basic block. In addition, we focus on operations that consume and produce values directly (such as arithmetic and shift operations) rather than those that read from or write to memory or modify control flow (`load`, `store`, `branch`, and `return`).
While graph isomorphism is a notoriously tricky problem, it is also a common one, and we make heavy use of out-of-the-box graph algorithms. We employ the `networkx.isomorphism` Python package, which provides tools for iterating over matches (subgraph isomorphisms) between the program DFG `G` and a stencil DFG `H`. Two features distinguish a matching from a plain subgraph isomorphism: (1) nodes must be matched to nodes with the same `opcode`, which technically makes the problem a colored subgraph isomorphism (and fortunately makes it easier), and (2) we need to select mutually exclusive subgraphs, where each node can be assigned to at most one isomorphic instance (to model actual hardware acceleration). In the case of a single stencil, we use a greedy heuristic that randomly chooses isomorphisms until no remaining choice is mutually exclusive with those already chosen. When trying to match multiple stencils, our heuristic tries to find the largest stencils first. We describe this search process in more detail in our implementation section.
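The greedy selection just described can be sketched in a few lines of plain Python. This is a minimal sketch, not our actual implementation: it assumes the matches have already been enumerated (for example by a subgraph isomorphism matcher) and are given as tuples of node IDs, and the function name, stencil names and data are made up for illustration.

```python
import random


def greedy_largest_first(matches_by_stencil, seed=0):
    """Pick mutually exclusive matches, trying larger stencils first.

    matches_by_stencil maps a stencil name to a list of matches, each a
    tuple of DFG node IDs. Within one stencil, matches are tried in
    random order; a match is kept only if none of its nodes is covered.
    """
    rng = random.Random(seed)
    covered, chosen = set(), []

    def stencil_size(item):
        _, matches = item
        return len(matches[0]) if matches else 0

    for name, matches in sorted(
        matches_by_stencil.items(), key=stencil_size, reverse=True
    ):
        order = list(matches)
        rng.shuffle(order)
        for match in order:
            if covered.isdisjoint(match):
                chosen.append((name, match))
                covered.update(match)
    return chosen, covered


# Hypothetical matches over node IDs 0..9.
matches = {
    "shl->add": [(1, 2), (4, 5), (8, 9)],
    "mul->add->srem": [(0, 2, 3), (5, 6, 7)],
}
chosen, covered = greedy_largest_first(matches)
```

With a single entry in `matches_by_stencil`, this reduces to the random greedy choice used for one stencil. In the example, both three-node matches are taken first, which rules out `(1, 2)` and `(4, 5)` and leaves only `(8, 9)` for the smaller stencil.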
We started our testing by hand-picking chains of instructions found in our benchmarking code. From the Embench embedded programming benchmarking suite, we used `matmult-int.c` to choose a few common chains of operations:
- `mul` → `add` → `srem`
- `shl` → `add`
- `sdiv` → `mul` → `add`
As we expected, these small human-selected stencil subgraphs performed especially poorly. On the original program, `matmult-int.c`, these stencils matched less than 4% of instructions.
Identifying common DFG stencils
Of course, finding the common subgraphs by hand is pretty antithetical to a reasonable approach to compiling code. Our real goal is to automate the process of finding the common DFG stencils to accelerate.
Formal description of the task
In this context of ignoring control flow and considering data flows within basic blocks, we can look at the problem purely graph-theoretically. For a single trace through the program, the data flow graph $G$ is acyclic, and we would like to cover as much of it as possible with subgraphs corresponding to the stencils that we accelerate. Statically, we do not know what the final data flow graph is, but we do know that we will be able to assemble one by connecting dangling edges from control-flow-free components: basic blocks. We would like to find a small collection of graph components $\mathcal H = \{H_1, \ldots, H_k\}$, which we can use to replace parts of and accelerate programs having basic blocks $\mathcal G = \{G_1,\ldots,G_n\}$, that maximizes the total saved time:
$$\mathcal S_{\mathcal H}(\mathcal G) := \max_{\mathcal C \in \text{Cov}(\mathcal G, \mathcal H)}~ \sum_{G \in \mathcal G} w_G \cdot \sum_{H \in \mathcal C_G} f_H \cdot |H|$$
where:
- $\text{Cov}(\mathcal G, \mathcal H)$ is the set of all valid (partial) coverings of the basic blocks in which each node is covered by at most one stencil, that is, injective graph morphisms $\varphi: (\cup \mathcal G) \to \cup \mathcal H$.
- $\mathcal C_G$ is the component of the total covering $\mathcal C$ on the particular basic block graph $G$.
- $w_G$ is independent of $\mathcal H$ and proportional to the expected number of times $G$ is executed.
- $f_H$ is the expected speedup factor from accelerating the component $H$.
Now supposing that $f_H$ were roughly constant, we could achieve the maximum savings by trivially choosing $\mathcal H := \mathcal G$.
There are a few problems with this:
1. $|\mathcal H|$ is large; there are many of these subgraphs, which makes the search process substantially less efficient.
2. Each $H_i \in \mathcal H$ is also large, making the specialized component more expensive.
3. There is now a dependency between $\mathcal H$ and $\mathcal G$, and so we need to know our program in order to build the components we use to accelerate.
The third issue is the most important; to a first approximation, limiting the first two is a heuristic that helps solve it. This intuition suggests an alternative framing as a statistical learning problem: you're given some training data in the form of pieces of program DFGs ($\mathcal G$), and the objective is to find a collection of snippets $\mathcal H$ that not only covers this program, but also can be re-configured to accelerate other programs in the future. We might imagine that there's some underlying distribution $\mathtt{Programs}$ of programs that people write, in which case the work we present here can be seen as a solution to the following optimization problem:
$$\arg\max_{\mathcal H}\left( \mathop{\mathbb E}\limits_{\mathcal G\sim \mathtt{Programs}}~ \mathcal S_{\mathcal H}(\mathcal G) - \text{Cost}(\mathcal H) \right)$$
where $\text{Cost}(\mathcal H)$ is the additional cost incurred by choosing to accelerate the stencil collection $\mathcal H$, which is higher for larger subgraphs. Note that in this presentation, we can no longer choose $\mathcal H$ based on $\mathcal G$, which resolves issue (3); issues (1) and (2) are partially incorporated into the $\text{Cost}(\mathcal H)$ term, effectively regularizing the search space by artificially imposing a cost for larger or more numerous subgraphs. Just like $\ell^1$ regularization, we are explicitly adding a preference for short descriptions, to avoid overfitting to a single program (like setting $\mathcal H := \mathcal G$ as discussed above).
Rather than solve this optimization problem in closed form, we explicitly look for a given number of subgraphs (issue 1) of a pre-selected size (issue 2). In doing so, we are using the regularization knobs to explicitly avoid overfitting to the program at hand, which is necessary because our held-out evaluation (below) shows that generalizing to unseen programs is considerably more difficult.
Implementation strategy
We first implement an LLVM module pass that writes out a JSON representation of the DFG. Our Python module then explores candidate subgraphs using a combination of our heuristics and the out-of-the-box graph isomorphism tooling.
We implemented and compared two separate algorithms for finding the stencils from static DFGs generated per basic block from LLVM programs. In both cases, the general idea is to iterate over the DFG's connected components, successively building larger subgraphs. Our first approach is node-based, and exhaustively considers node subsets up to some size. The second approach is edge-based, and uses smaller subgraph components to build graphs of the desired size. In preliminary experiments we found the edge-based approach to be slightly faster, so we used that approach for our evaluation.
More specifically, the edge-based subgraph stencil generation iteratively grows connected subgraphs. For each $k$-edge subgraph in the DFG, the algorithm considers adding every edge in the DFG, keeping the new ($k+1$)-edge subgraphs that are connected. It then finds which of these subgraphs are isomorphic to each other, and constructs a canonical stencil name for each isomorphic set. These ($k+1$)-edge subgraphs are used for the next iteration of the algorithm.
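To make the edge-based growth concrete, here is a small self-contained sketch. It is not our implementation: the function names are invented, the DFG is a hypothetical pair of `mul` → `add` → `srem` chains, and the canonical name is computed by brute force over node orderings (which is only reasonable for the tiny stencils considered here) rather than with a graph isomorphism library.

```python
from collections import Counter
from itertools import permutations


def canonical_name(edge_set, opcode):
    """Smallest (opcode labels, relabeled edges) pair over all node orders."""
    nodes = sorted({n for edge in edge_set for n in edge})
    best = None
    for perm in permutations(nodes):
        index = {n: i for i, n in enumerate(perm)}
        labels = tuple(opcode[n] for n in perm)
        edges = tuple(sorted((index[u], index[v]) for u, v in edge_set))
        if best is None or (labels, edges) < best:
            best = (labels, edges)
    return best


def grow(subgraphs, all_edges):
    """Extend each connected k-edge subgraph by one adjacent DFG edge."""
    grown = set()
    for sub in subgraphs:
        touched = {n for edge in sub for n in edge}
        for edge in all_edges:
            # Sharing an endpoint keeps the (k+1)-edge subgraph connected.
            if edge not in sub and touched & set(edge):
                grown.add(sub | {edge})
    return grown


def stencil_counts(opcode, all_edges, num_edges):
    """Count (possibly overlapping) occurrences of each num_edges stencil."""
    subgraphs = {frozenset([edge]) for edge in all_edges}
    for _ in range(num_edges - 1):
        subgraphs = grow(subgraphs, all_edges)
    return Counter(canonical_name(sub, opcode) for sub in subgraphs)


# Hypothetical DFG: two copies of mul -> add -> srem.
opcode = {0: "mul", 1: "add", 2: "srem", 3: "mul", 4: "add", 5: "srem"}
edges = {(0, 1), (1, 2), (3, 4), (4, 5)}
two_edge = stencil_counts(opcode, edges, 2)
one_edge = stencil_counts(opcode, edges, 1)
```

Both two-edge subgraphs receive the same canonical name, so `two_edge` contains a single stencil with two occurrences. Note that these counts include overlapping occurrences; choosing mutually exclusive matches is a separate step.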
Mutually exclusive matches
To generate a valid choice of subgraph stencils, we need more than simply an enumeration of all subgraphs in a program and a way to match them: we also have to make sure the matches don't step on one another's toes. That is, we need to throw out matches until each instruction is covered by at most one match. Finding the optimal selection is difficult: it is related to the weighted interval scheduling problem (which can be solved with dynamic programming in $O(n \log n)$ time, but on a general directed graph, we get an exponential factor in the branching coefficient). Rather than solve this problem optimally in the general case, we implement the greedy biggest-first strategy, and focus instead on searching for collections of matches which have higher coverage in the first place. After generating possible subgraph stencils, we choose a combination that achieves the highest static coverage of the DFG using only mutually exclusive matches.
Search
Ultimately, we do not need to search the space exhaustively if we have reasonable heuristics that suggest we're going in the right direction with certain stencils. We can then do our search traversal in a different order, guided by the objective function. This can be done in the form of a beam search: we only keep around the $k$ best subgraphs in the search frontier, and at each step try to expand one to a random neighboring node. Though not used in our evaluation, beam search can speed up generating larger subgraphs in the future.
Evaluation
We primarily look at what coverage (percent of instructions matched by some subgraph over total instructions) we can get on a given source program. We consider both static and dynamic coverage. In both cases, 100% coverage is impossible because we exclude instructions with control-flow implications (`phi`, `branch`, `return`) and those that read from and write to memory (`load` and `store`).
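As a small worked example of the two metrics, the sketch below computes coverage from per-block instruction counts. The block names, counts and the `coverage` helper are hypothetical; the only assumption taken from the text is that static coverage weights every block once, while dynamic coverage weights each block by how many times it executed.

```python
def coverage(blocks, exec_counts=None):
    """Fraction of instructions matched by some stencil.

    blocks maps a block name to (matched, total) instruction counts.
    With exec_counts (block name -> times executed) the result is the
    dynamic coverage; without it, the static coverage.
    """
    def weight(block):
        return 1 if exec_counts is None else exec_counts.get(block, 0)

    matched = sum(weight(b) * m for b, (m, _) in blocks.items())
    total = sum(weight(b) * t for b, (_, t) in blocks.items())
    return matched / total if total else 0.0


# Hypothetical program: a sparsely matched entry block and a hot loop.
blocks = {"entry": (2, 10), "loop": (6, 8)}
static_cov = coverage(blocks)                                # 8 / 18
dynamic_cov = coverage(blocks, {"entry": 1, "loop": 100})    # 602 / 810
```

In this made-up case the dynamic coverage is higher than the static coverage because the well-matched block is the one that runs most often; the opposite can happen just as easily.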
Dynamic coverage instrumentation
To generate dynamic coverage information, our LLVM pass instruments the program. The pass annotates each matched instruction with the stencil that covers it, along with the node isomorphism. Because basic blocks execute atomically (ignoring exceptions), we generate the count of matched and total instructions per block at compile time. We then link a C profiling module with state for these dynamic counters. At the end of each basic block, our pass adds a call to a function that increments the profiling counters by the statically determined amounts. We add a final function call to LLVM's global destructors list to write the final profiling values both to standard output and to an auxiliary file. For convenience, we also save the static coverage the same way.
Embench evaluation
We chose to use the Embench embedded programming benchmarking suite because it represents a small but fairly diverse set of programs that can be easily compiled and executed with LLVM tooling. For each benchmark, we generated all allowable three-node subgraphs and chose the two that statically covered the most instructions in that benchmark's DFG. Three-node stencil generation took between a few seconds and 17 minutes for each benchmark on a 2017 MacBook Pro (2.3 GHz Intel Core i5, 8 GB RAM).
The following graphs show static and dynamic code coverage for each benchmark. Note that each benchmark's coverage was calculated with the subgraphs generated from that benchmark (and coverages are deterministic, rendering error bars unnecessary). As stated above, a coverage of 100% is elusive because of our restrictions on what instructions we consider.
An interesting component of this profiling data is that, as expected, static and dynamic coverage correlate, but which is higher depends on the particular benchmark. From smaller-scale experimentation, the coverage also varies based on the compiler flags used to generate the original LLVM IR.
In particular, running at a more aggressive `-O3` optimization level (rather than the `-O1` used here) changes the coverage metrics as loops are statically unrolled, introducing more redundancy.
Embench case study: `nettle-256sha`
Digging into `nettle-256sha`, the benchmark with the best coverage, we can see that the following combination of three-node subgraph stencils was chosen out of 66 possible three-node subgraphs:

| Stencil | Number of static matches |
| --- | --- |
| `lshr` → `or` ← `shl` | 208 |
| `xor` → `xor` → `add` | 80 |

Here are a close-up and a closer-up (marked with a heavy black rectangle) view of the DFG, with vertices matched to a stencil shown in bright red. The latter shows three matches of the first stencil and one of the second.
Stencil generalization
We also explored generating stencils from one benchmark and testing how well they generalized to the other benchmarks. The three-node stencils generated and chosen from `minver`:
1. `fcmp` → `select` ← `fsub`
2. `getelementptr` ← `pointer` → `getelementptr`
were found at least once in all but five of the other Embench benchmarks, producing dynamic coverage ratios between 0 and 7.68% (with an average of 1.45% ± 2.35%). Stencils generated from other benchmarks achieved even less coverage on the rest of the benchmarks. For example, stencils generated from `edn`, `libwiki`, and `nbody` were not matched in any other benchmarks.
Ongoing directions
While finding redundancies in DFGs within each basic block is a good initial approach, this project could be extended in several directions. We could build on existing literature in extended basic blocks to find subgraphs that speculatively occur.
That is, in extended basic blocks, we consider control flows that are likely to jump from one block to another in the common case, and only fall back to different branches in the case that our guess of the next block was wrong. In the context of hardware acceleration, we can imagine building accelerators that handle these larger speculative subgraphs when possible, and fall back to slower CPU execution if the control flow differs.
In addition, it would be interesting to compare this project against dynamic data flow graphs. For example, the Redux paper essentially introduced the formulation of dynamic data flow graphs as we describe them here, and outlines how to efficiently generate them. From the perspective of hardware acceleration, the RADISH project (“Iterative Search for Reconfigurable Accelerator Blocks with a Compiler in the Loop”) uses Python wrappers to generate dynamic data flow graphs, and heuristic genetic algorithms to "fuse" similar dynamic graphs together. Like RADISH, we could extend our application to target groups of applications instead of single programs. The scale of this undertaking would require more clever heuristics than our current search strategies, but would ideally help us find more general subgraphs to accelerate.
Finally, the impact of this project could be more clearly explicated by evaluating our subgraph identification with actual computational acceleration. In particular, we hope this strategy will prove useful in conjunction with other work that uses compile-time analysis for heterogeneous targets.

Composable Brili Extensions

Wed, 18 Dec 2019 00:00:00 +0000

The Goal
The first project for this course was to extend the Bril language in any way that we wanted. The initial Bril codebase included a parser from a textual representation of Bril to an AST of the source program encoded in JSON.
It also included a program that transforms TypeScript programs into Bril programs and an interpreter for the language once it was in JSON format. Some projects added new standalone extensions to the codebase, such as a Bril debugger or a Bril to C translator. As long as these projects don't make any changes to the underlying grammar or representation and only add new files to the existing codebase, they can be merged into the codebase relatively easily.
However, some projects created extensions to the Bril language that added new operations, new state, and new control flow to the language. These projects include adding record types, dynamically allocated memory, and function calls to the language. All of these projects change the grammar of the language by adding new types of operations. Additionally, all of the projects modify the interpreter to support their new operations. All of these changes require adding additional state to the evaluation function, and these additions conflict with each other.
In this project, I designed and implemented a composable language extension system for the Bril interpreter that allows developers to independently create language extensions that can later be composed together. For example, consider an extension A that adds a new operation 'a' to the language and an extension B that adds a new operation 'b' to the language. Once these two extensions are merged into the codebase, other developers should be able to compose extensions A and B together to create an interpreter that supports both operations 'a' and 'b'. Furthermore, the developers of extension A and extension B should ideally never have to know about each other's existence. This is different from an extensible language framework, such as Polyglot, because language extensions in Polyglot explicitly specify the language that they are extending.
With composable extensions, the goal is that the language extension simply defines a small set of assumptions about the abstract language that it is extending and then implements its extensions based on those assumptions. Then, any base language that satisfies those assumptions can be extended by the extension.
The Implementation
I implemented the extensible interpreter in TypeScript. I first identified some common datatypes that all extensions would use. The first was the base instruction interface. I decided that all instructions would be identifiable by an operation field called `op` (sometimes called a discriminant):

    interface BaseInstruction {
      op: string;
    }

Next, I decided that all functions in Bril would have a name and a list of instructions or labels:

    interface BaseFunction<I extends BaseInstruction> {
      name: Ident;
      instrs: (I | Label)[];
    }

One interesting thing to note is that the `BaseFunction` interface is generic over the type of instructions that it contains. Finally, a Bril program contains a list of functions:

    interface BaseProgram<I extends BaseInstruction, F extends BaseFunction<I>> {
      functions: F[];
    }

These three interfaces define the most basic structure that a Bril program can have. All extensions must be extensions to Bril that respect this generic language definition. Put another way, every Bril extension must work on a subtype of the `BaseProgram` interface parameterized over some `I` and `F` that extend `BaseInstruction` and `BaseFunction` respectively.
Next, I defined a generic evaluation function for evaluating instructions:

    type evalInstr<A, PS, FS, I> = (instr: I, programState: PS, functionState: FS) => A;

The `evalInstr` function is parameterized over 4 types. The first type, `A`, is the action type. Every evaluation function needs to generate an action that specifies which instruction to execute next.
The second and third types, `PS` and `FS`, represent the program state and the function state respectively. The program state type is meant to represent the entire state of the currently running Bril program (think of things like global variables). The function state holds only function-local state for the currently executing function (like the values of local variables). The final parameter is `I`, which is the type of the instruction the evaluation function operates over.
Each extension defines an `evalInstr` function that specifies how to update the program and function state in terms of the instructions it adds or extends, as well as which action to generate for those instructions. In order to compose these functions, each extension defines a function that takes in a function of type `evalInstr` and returns a function of type `evalInstr`. The idea is that the returned function has 2 cases. In the first case, the instruction passed in is of the type that this extension is extending or adding. In this case the function executes the logic to update the program and function states and returns an action according to the operation. In the other case, the instruction is not one that this extension implements, so it dispatches to the `evalInstr` function that was passed in to the original function.
The code looks something like this:

    function evalInstr<A, PS extends ProgramState, FS extends FunctionState, I extends BaseInstruction>(
      baseEval: (instr: I, programState: PS, functionState: FS) => A
    ) {
      return (instr: bril.Instruction | I, programState: PS, functionState: FS): A | brili_base.Action => {
        if (isExtInstr(instr)) {
          // This extension owns the instruction: handle it here.
          return handleExtInstr(instr);
        } else {
          // Otherwise defer to the evaluator being extended.
          return baseEval(instr, programState, functionState);
        }
      }
    }

Each extension also usually defines types `ProgramState` and `FunctionState`, which are record types representing the fields that this extension expects to be in the program state and the function state respectively. For example, if

    type FunctionState = {env: Env};

then this extension expects the function state to have an `env` field of type `Env` (this is the mapping from variable names to values in the base version of Bril). Because `FunctionState` is a record type, the restriction `FS extends FunctionState` implies that any object of type `FS` must have at least the same fields with the same types as specified in `FunctionState`. In the above case, `FS` must be a record that has the `env` field. It may have more, but it has at least that one field. This is because TypeScript uses structural subtyping for records (as opposed to nominal subtyping).
Composition of extensions is now a simple matter of composing the `evalInstr` functions together. I implemented a `Composer` class that composes extensions in the following way:

    constructor(evalExts: ((baseEval: evalFunc<A, PS, FS, I>) => evalFunc<A, PS, FS, I>)[], ...) {
      this.evalInstr = (instr, programState, functionState) => {
        throw `unhandled instruction: ${instr.op}`;
      }
      for (let ext of evalExts) {
        this.evalInstr = ext(this.evalInstr);
      }
      ...
    }

It simply creates a base `evalInstr` function that throws an "unhandled instruction" exception and then, in a loop, composes all of the `evalInstr` functions in `evalExts`.
In addition to defining how different operations are handled, an extension may also want to override or extend the handling of actions generated by evaluating the various operations. First, however, I need to define how control flow is handled in Bril programs. When a Bril program is executing, there is a program counter variable called `pc` that keeps track of where we are in the program. The type of this `pc` variable is:

    type PC<I extends BaseInstruction, F extends BaseFunction<I>> = { function: F; index: number };

It contains a `function` field, which specifies which Bril function we are currently in, and an `index`, which specifies which instruction in that function we are executing. These are again parameterized over an instruction and function type in order to support many different kinds of extensions to functions.
The handlers for actions take in the action generated by the current instruction and the current PC and generate a new PC. We also supply the action handler functions with the current function and program state in case action handler extensions want access to them. The type of these action handler functions is thus:

    type actionHandler<A, PS, FS, I extends BaseInstruction, F extends BaseFunction<I>> =
      (action: A, pc: PC<I, F>, programState: PS, functionState: FS) => PC<I, F>;

These handlers are also composable in a very similar manner to the `evalInstr` functions. The extensions export a function that takes in an `actionHandler` function that represents the action handler function being extended and outputs an `actionHandler` function.
The outputted function executes the extension's action handling functions if the current action is one that it handles and otherwise dispatches to the action handling function that it was extending. In the `Composer` class I do the following, which is very similar to how I composed `evalInstr` functions:

    constructor(..., actionHandleExts: ((extFunc: actionHandler<A, P, FS, I, F>) => actionHandler<A, P, FS, I, F>)[], ...) {
      this.handleAction = (action, pc, programState, functionState) => {
        throw `unhandled action`;
      };
      for (let ext of actionHandleExts) {
        this.handleAction = ext(this.handleAction);
      }
      ...
    }

Finally, the composer exposes a function that evaluates a Bril program object which extends the `BaseProgram` described above:

    evalProg<Prog extends BaseProgram<I,F>>(prog: Prog)

This function finds the main function object and then creates a new `PC` object with that function set as its function field and the index field set to 0. Then, in a loop, it gets the current instruction, executes it and gets back an action, and then updates the `pc` based on that action. It repeats this until the index of the `pc` goes out of the current function's bounds. The `Composer` class also takes in two functions, `initP` and `initF`, which initialize the program and function states respectively.
In order to create a new Bril interpreter which is the composition of extensions A, B, and C, you simply need to define a function state that contains all the fields required by all the function states in all of the extensions, a program state that contains all of the fields required by all of the program states in the extensions, and initialization functions for those types. Then, you simply create a new instance of the `Composer` class with the extensions and action handlers you want.
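The composition pattern itself is independent of TypeScript's type system, so here is a minimal, untyped sketch of it in Python. It is only an analogue of the `Composer` loop: the two extensions, the instruction dictionaries, the single `state` argument and the `"next"` action are all invented for illustration and do not correspond to the real Bril extensions.

```python
def const_ext(base_eval):
    """Hypothetical extension handling only the `const` operation."""
    def eval_instr(instr, state):
        if instr["op"] == "const":
            state["env"][instr["dest"]] = instr["value"]
            return "next"
        return base_eval(instr, state)  # not ours: defer to the base
    return eval_instr


def print_ext(base_eval):
    """Hypothetical extension handling only the `print` operation."""
    def eval_instr(instr, state):
        if instr["op"] == "print":
            state["out"].append(state["env"][instr["arg"]])
            return "next"
        return base_eval(instr, state)
    return eval_instr


def compose(extensions):
    """Wrap a failing base evaluator with each extension in turn."""
    def unhandled(instr, state):
        raise ValueError(f"unhandled instruction: {instr['op']}")

    eval_instr = unhandled
    for ext in extensions:
        eval_instr = ext(eval_instr)
    return eval_instr


eval_instr = compose([const_ext, print_ext])
state = {"env": {}, "out": []}
program = [
    {"op": "const", "dest": "x", "value": 4},
    {"op": "print", "arg": "x"},
]
for instr in program:
    eval_instr(instr, state)
```

Neither extension refers to the other; each only assumes the state has the fields it uses, which mirrors the structural-subtyping constraint on `FS` and `PS` in the TypeScript version.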
Example Extensions
Bril Base
In order to demonstrate the usability of this system, I implemented a few extensions and then composed them together. First, I implemented the base Bril language as an extension. For the `evalInstr` function, this mostly just involved copying over the switch statement on the instruction operation from the base implementation and adding an `env` field to the function state. Then, the `actionHandler` function performed the same logic as in the `evalFunc` function in the current brili code, except that the variable `i` was replaced with the `pc`. I also had to add checks to each of the two new functions to make sure that the instruction or action being handled was actually one that the function was meant to handle. Otherwise, it would just dispatch to the base instruction evaluation or action handler function that was passed in (as described above). Implementing this base extension required very few changes to the code that was in the `brili.ts` file. Most of it was simply copied over with a few minor tweaks.
Manually Managed Memory
The next extension that I implemented was the manually managed memory extension. This extension added a heap data structure to the program state and a way to allocate space on that heap. It also added a new value type, a pointer, which points to values in the heap. The new operations added by this extension can also load values from the heap into variables and store the value of a variable into the heap. This also requires an environment field in the function state.
This leads to the following definitions in the code for this extension: type ProgramState = {heap: Heap<Value>}; type FunctionState = {env: Env}; </pre> After declaring these definitions, creating the rest of the extension was simply a matter of copying over the code that handled the new operations from that project into the evalInstr</code> function for that extension. I didn't need to define an actionHandler</code> function because this extension did not add any additional control flow to the base Bril language. Because of this assumption, and the assumption that there was an environment field that was being managed in the function state, this extension did kind of need to know that there was going to be an underlying base Bril language that already implemented the control flow and correctly maintained the environment. Record Types</h3> The third extension that I ported over was the record types</a> extension. This extension added record types to the language, as well as operations to create records and access/set record fields. It added a typeEnv</code> variable to the function state to keep track of the record types defined in a function. Similar to the memory extension, it also assumed that there was some kind of local variable environment, so the state types for this extension looked like: type ProgramState = {}; type FunctionState = {env: Env, typeEnv: TypeEnv}; </pre> This extension also didn't require any additional control flow compared to the base Bril language so I also didn't implement an actionHandler</code> function. Function Calls</h3> The final extension was to add function calls. This extension was interesting because it added new control flow to the Bril language. There were two different projects that both added function calls to the language: Function Calls and Property-Based Testing</a> and Exceptions in Bril</a>. 
The Function Calls and Property-Based Testing project added function calls by recursively calling the evalFunc</code> function and augmenting the behavior of the ret</code> instruction to have evalFunc</code> return a value. This approach didn't really fit that well into the new framework that I had built, because in my framework control flow revolves around the pc</code> variable. The second project primarily added exceptions, but in order to add exceptions it also added basic support for function calls. This project's method of adding function calls involved creating new stack frames for each function call and pushing and popping the stack frames to implement call and return instructions. This method was more in line with the way my framework handled state, so I decided to just add the function call part of the exceptions project. This extension required adding several additional fields to both the program and function states. The main thing that was added was an array of stack frames (i.e., a stack) in order to correctly handle function calls and returns. Each stack frame holds the function state and the return pc: type FunctionState = { env: Env }; type StackFrame<FS extends FunctionState, I extends BaseInstruction, F extends BaseFunction<I>> = [FS, PC<I,F>]; export type ProgramState<FS extends FunctionState, I extends BaseInstruction, F extends Function<I>> = { functions: F[], currentFunctionState: FS, callStack: StackFrame<FS,I,F>[], initF: () => FS }; </pre> Furthermore, arguments were added to the function records. These arguments consist of a name and a type: interface Function<I extends BaseInstruction> { name: bril.Ident; args: Argument[]; instrs: (I | Label)[]; } </pre> The only additional operation added in this extension was the call</code> operation. The evalInstr</code> function for this extension was quite simple: it just returned a new call action with the name of the function being called and the argument variables being passed into the function. 
Most of the control flow logic was handled in the actionHandler</code> function exported by this extension. On a call action, this handler created a new stack frame with the initF</code> function provided in the program state record and then pushed the current stack frame onto the call stack. It also made the assumption that all of the evalInstr</code> functions were called with the program state currentFunctionState</code> field. This was indeed the case, as the main loop of the Composer</code> class contained this code: let action = this.evalInstr(line, programState, programState.currentFunctionState); pc = this.handleAction(action, pc, programState, programState.currentFunctionState); </pre> This allowed the actionHandler</code> function to set the new function state on a function call. In addition, this extension overrides the end</code> action. This action is usually generated by the ret</code> operation in the base Bril implementation. However, instead of terminating the program, this extension modified the behavior to pop the current stack frame off of the program state's callStack</code> field, set the currentFunctionState</code> field to the function state in that popped stack frame, and finally set the pc to the return pc in the popped stack frame. This extension required a bit more effort than the other extensions because of the changes in control flow made to the interpreter. However, much of the logic could simply be copied over from the exceptions project with only minor tweaks. Bringing it all together</h3> In order to test the extensions and their composition I created a new instance of the Composer</code> class with the four evalInstr</code> functions, one from each of the extensions, as well as the two actionHandler</code> functions from the base and the function call extensions. 
I then added all of the test cases from each of the extensions to the project and ran the composed interpreter over all of the test cases, which all passed (there was one minor bug but it was easily fixed). I also added a test case that combined some operations from each extension, which was also correct. Evaluation of the system</h2> Extension development</h3> One of the goals that I had when developing this system was that extensions should be able to be developed in isolation. In evaluating this goal, I will look back at the four extensions that I implemented. For the most part, each extension could be developed in isolation as long as each extension specified the assumptions it was making about the state. For example, the base extension specified that the function state had to include an environment field that stored all of the function local variables. However, because of the use of generics, the base extension never specifically names a type that the function state must be. Instead, it leaves it up to the composer of the extensions. Furthermore, even though the base extension assumes that function state has an environment field, it is kind of implicitly assuming that this field will be used correctly by all other extensions that are composed with this extension. For example, if an extension comes along and lowercases all variable names before getting and setting to the environment, it could potentially interfere with the base extension. Or, if there is a different extension that assumes that function local data is stored in a field called local_vars</code> instead of env</code>, then these two extensions may not be properly composable. This was a more general trend that I realized while developing the extensions. Extensions can be developed in relative isolation using the framework. 
However, when the extensions are composed together, the composer needs to be aware of all assumptions, implicit and explicit, that are made by each extension in order to determine if the extensions can be composed in a meaningful way. One example is that there is some code in the manually managed memory code that assumes that a value is either a number, a boolean, or a pointer. However, when we compose this extension with the record type extension we have values that can be record types as well. Similarly, there is code in the record types extension that assumes that a value is either a number, a boolean, or a record. While these two extensions can operate in isolation in the same program, if I try to create a record with a pointer field things start to break. In order to have extensions be correctly composable with each other there need to be some conventions that extension developers follow. To solve the above example, if the two extensions were a bit more generic with the types of values they assumed in the environment, then they could probably be correctly composable. In other words, while extensions can be developed in isolation, they need to be aware that they will be extended. The ease with which extensions could be developed was also an important feature that I wanted to have. The minimal number of changes that I had to make to the code for the base, memory, and record type extensions indicates that writing these extensions is not that much more challenging than writing the extensions by simply adding code to the original Bril interpreter. The main challenge then remains in actually composing the extensions together. However, all of the Bril extensions assumed only the base extension, so multiple versions of the interpreter can easily be made by simply composing the base extension with the extension developed for the project. 
Then, all of the interpreter code can be merged into the repository in a conflict-free way, because of the isolation of the extensions from each other. There is still another challenge that needs to be solved, which is the extensibility of the parser. This may be solved by rewriting the parser using parser combinators</a>. However, I have not explored this option in any great depth. Type safety</h3> Another goal that I had for this project was that when composing extensions, the composer should be assisted as much as possible by the type checker. In TypeScript, you can basically revert to untyped JavaScript by simply making everything have type any</code>. If all of the generic types were removed and replaced with the all-encompassing any</code>, then it would be very easy to compose two extensions because their function signatures trivially match. However, you lose some type safety, such as the restrictions placed on the function and program state by each extension. So I added in generics for the program and function states. This is able to catch some type errors at compile time. However, when there is a type error the generics usually generate quite verbose type errors from the compiler, which can sometimes be difficult to parse through, especially because of all of the restrictions placed on the generics (such as the extends</code> conditions). The generics for the instruction and function types were also added to increase type safety within the main evaluation loop and within the handler functions. Having the pc</code> parameterized on those types allowed me to write code that knew that each function would have a list of instructions and a name. However, because the Bril program is parsed from a JSON file and then cast to a Bril program type without any dynamic checks, this type safety was largely negated by the initial cast. More fundamentally, the way most extensions check that the instruction passed into evalInstrs</code> is pretty unsound. 
The instr</code> type that evalInstrs</code> usually takes in is given the following type: instr: Instruction | I </pre> where I</code> is the generic type and Instruction</code> is the type of the instructions that the extension implements. The isInstruction</code> function is usually just a function that returns a type guard if the opcode of the instruction matches one of the implemented instruction ops (here in the instrOps</code> array): function isInstruction(instr: BaseInstruction): instr is Instruction { return instrOps.some(op => op === instr.op); } </pre> This is not really a safe type guard because one of your instructions could assume a dest</code> field in the instruction but this is only checking that the opcode matches, not that the instruction has all of the correct fields. This could maybe be improved by more comprehensive run time type checking, but would still probably not rule out everything that could go wrong. Furthermore, it could do the wrong thing if extensions are composed in the wrong way. Suppose two extensions A</code> and B</code> both implement operation a</code>, but extension A</code> removes some fields of operation a</code>. If B</code> extends A</code>, then B</code> will be operating on the wrong kind of instruction. However, if A</code> extends B</code>, then the a</code> operation will be handled by extension A</code> and never be propagated to B</code>. Furthermore, one other issue with the above code is that the instrOps</code> array might not contain all of the correct operations. This is actually a bug that I did run into because I forgot to include the "const"</code> opcode in the base extension's instrOps</code> array. If the type Instruction</code> is the discriminated union of all of the instructions supported by an extension then Instruction["op"]</code> is the union of all of the opcodes of all of the instructions. 
Ideally, we would want to transform the type Instruction["op"]</code> into the instrOps</code> array statically. TypeScript doesn't really support this transformation, but you can convert const arrays to union types, so I came up with the following code snippet to make sure that the instrOps</code> array is equivalent to the union of the opcodes of the instructions implemented by the extension: const instrOps = [...] as const; // This implements a type equality check for the above array, providing some static safety type CheckLE = (typeof instrOps)[number] extends (Instruction["op"]) ? any : never; type CheckGE = (Instruction["op"]) extends (typeof instrOps)[number] ? any : never; let _: [CheckLE, CheckGE] = [0, 0]; </pre> Because TypeScript has conditional types and because constant arrays can be transformed to union types, if the assignment of [0,0]</code> to a type [CheckLE, CheckGE]</code> does not produce a type error then the instrOps</code> array contains exactly the opcodes in the union Instruction["op"]</code>. Final Thoughts</h2> TypeScript actually turned out to be a very good language to develop this framework in. The ability to decide how much type safety I want greatly simplified early development of the framework and allowed me to add type safety via generics in an incremental way. More generally, I think this project is a good prototype for building composable interpreter extensions for Bril. The biggest issue right now is that the use of generics makes some boilerplate code quite verbose and can lead to confusing type errors because of the large number of generics and the dependencies between them. The lack of a composable parser does still prevent this approach from being integrated into the codebase as is, but the project is an interesting step in that direction. 
The code for this project can be found here</a>. An Autoscheduler for Halide Wed, 18 Dec 2019 00:00:00 +0000 Halide</a> is a domain-specific language embedded in C++ for writing code that processes images and, more generally, arrays. The main innovation of Halide is that it separates the algorithm (the actual function being computed) from the schedule (the decisions regarding when to perform computations and when to store intermediate results). This allows developers to write the function that their image pipelines implement once and then performance-tune the implementation by swapping out schedules: different schedules can be used for different platforms without modifying the function code. Writing an efficient schedule for Halide functions requires expertise in performance tuning. To alleviate this, in this project we create a toy autoscheduler for Halide that attempts to automatically generate an efficient schedule for Halide functions. (Note that Halide has an autoscheduler built-in: see this paper</a> for more information.) Our autoscheduler is implemented in Python 2.7 and can be found at this repository</a>. Design Overview</h2> The following presentation of schedules as trees manipulated by schedule transformers closely follows Chapter 7 of Jonathan Ragan-Kelley's thesis</a>. The images below are from that document. In order to search for schedules, we represent them as schedule trees, wherein the ancestry relationships between nodes represent ordering information. Schedule trees have the following kinds of nodes: Root nodes represent the top of the schedule tree. </li> Loop nodes represent the traversal of how the function is computed along a given dimension. Loop nodes are associated with a function and a variable (dimension). Since functions are assumed two-dimensional, by default functions have two variables: x and y. Loop nodes also contain information such as whether the loop is run sequentially, run in parallel, or vectorized. 
</li> Storage nodes represent storage for intermediate results to be used later. </li> Compute nodes are the leaves of the schedule tree, and they represent computation being performed. Compute nodes can have other compute nodes as children to represent functions that are inlined instead of loaded from intermediate storage. </li> </ul> Schedule trees are considered well-formed if they satisfy the following criteria: The ancestry path from a function's compute node to the root node contains all the loop nodes and the storage node (if the function is not the output; see footnote 1</a>) for that function. Intuitively, this means that the traversal of how the function is computed is completely defined, and storage for the function's results is available. </li> If a function calls another function and the callee is not inlined, the compute node for the callee must occur before the compute node of the caller in a depth-first traversal. Intuitively, this ensures that the callee's results are stored before the caller is computed. </li> </ul> 1. By convention the output function does not have a storage node in the schedule tree since it is assumed that storage for the output has already been allocated and thus there is no decision to be made about the granularity with which to allocate it. </div> For any function we can define the default schedule, which traverses the output function in row-major order and inlines all called functions, like so: We can give a semantics for schedule trees as nested loops. Consider the schedule below for three functions, in, bx, and by, where by calls bx and bx calls in. The schedule tree on the left represents the nested loop on the right. Schedule Transformers</h3> We define transformers over schedule trees. We use these to traverse the search space of schedules. Split - split a function's variable into two. For example, we can split a function's x</code> variable into x_inner</code> and x_outer</code>. 
This allows tiered traversal of a function's extent along one dimension. For example, splitting the x</code> variable changes this loop: for x in [1..16]: a[x] = ... </pre> into: for x_outer in [1..4]: for x_inner in [1..4]: a[((x_outer-1)*4)+x_inner] = ... </pre> Combined with reorder, split can represent schedules that tile computations. </li> Change Loop Type - change how the loop will be traversed; by default the loop type is sequential</code>, but it could also be parallel</code>, unrolled</code>, or vectorized</code>. For simplicity our implementation only supports sequential</code> and vectorized</code>. </li> Reorder - switch loop nodes for the same function. </li> Hoist / lower compute - change the granularity in which intermediate results are computed. </li> Hoist / lower storage - change the granularity in which storage for intermediate results is allocated. </li> Inline / deinline - inline functions into callers (don't store their results in intermediate storage) or deinline functions out of callers. Intuitively, inlining functions trades off smaller memory usage for redundant computations, while de-inlining trades off higher memory usage for fewer redundant computations. </li> </ul> Below are some diagrams to give intuition for how these schedule transformers work. Bounds Inference</h3> Now that we have a representation for schedules and a set of schedule transformers, we are close to arriving at a search algorithm for finding efficient schedules. The last component that we need is a notion of cost for schedules. In order to provide a cost model for schedules, we need to determine the number of iterations performed by loops in the schedule. This determines the number of times instructions inside the body of loops will be executed, as well as the size of intermediate storage to be allocated. We determine this by computing the extent in which functions will be computed. 
For the output function, we assume that the extent is given by a call to the realize</code> function. For called functions that are not inlined, the extent is the dimensions of the function that will be stored as intermediate results. Because storage will be reused depending on the granularity with which intermediate results are stored, the extent of called functions does not necessarily coincide with the total extent over which the function will be computed (e.g., the called function might be computed on a per-scanline basis). For example, consider the simple pipeline below that has one producer (g</code>) and one consumer (f</code>): g(x, y) = x * y</code> f(x, y) = g(x, y) + g(x+1, y+1)</code> Given that f</code> is realized in a 512x512 box and a schedule where g</code> is computed in total before computing f</code>, the extent of g</code> is 513x513. Computing the extent in which functions will be computed is hard in general, but since Halide makes the simplifying assumption that all extents are rectangular (as opposed to, say, polytopes in the polyhedral model), there is a simple method for doing this: we only need to check the maximum and minimum points of the caller functions and check the arguments to the callee. Note that we also assume that function arguments are drawn from a grammar of "simple" arithmetic expressions consisting only of +</code>, -</code>, *</code>, /</code>, variables and constants. In the example above, the extent of f</code> is defined by the box bounded by (1,1)</code> and (512, 512)</code>. The arguments to g</code> at these points are: at (1,1)</code>: (1,1), (2,2)</code></li> at (512,512)</code>: (512,512), (513,513)</code></li> </ul> Thus we can determine the extent of g</code> to be 513x513. We encode these caller-callee relationships into logical formulas and use the Z3</a> SMT solver to retrieve a model that contains concrete values for the arguments. 
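The corner-checking idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's implementation: it evaluates the call-site arguments directly at the corners of the caller's box instead of encoding them for Z3, it assumes the arguments are affine in the caller's variables (so the extremes occur at corners), and the names `callee_bounds` and `extent` are hypothetical.

```python
from itertools import product


def callee_bounds(caller_min, caller_max, call_sites):
    """Return the (lo, hi) corners of the box covering all callee arguments.

    caller_min, caller_max: inclusive (x, y) corners of the caller's extent.
    call_sites: one function per call, mapping caller (x, y) to callee (x, y).
    Assumption: arguments are affine in x and y, so their minima and maxima
    are reached at the corners of the caller's box.
    """
    xs = (caller_min[0], caller_max[0])
    ys = (caller_min[1], caller_max[1])
    # Evaluate every call site at each of the four corners of the caller box.
    points = [site(x, y) for site in call_sites for x, y in product(xs, ys)]
    lo = tuple(min(p[d] for p in points) for d in range(2))
    hi = tuple(max(p[d] for p in points) for d in range(2))
    return lo, hi


def extent(lo, hi):
    """Size of an inclusive box along each dimension."""
    return tuple(h - l + 1 for l, h in zip(lo, hi))


# f(x, y) = g(x, y) + g(x+1, y+1), with f realized over (1,1)..(512,512).
g_calls = [lambda x, y: (x, y), lambda x, y: (x + 1, y + 1)]
lo, hi = callee_bounds((1, 1), (512, 512), g_calls)
print(lo, hi, extent(lo, hi))  # (1, 1) (513, 513) (513, 513)
```

For the example pipeline this reproduces the 513x513 extent of `g`. A solver-based encoding becomes more attractive once the bounds of one function depend on the inferred bounds of several others.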
Search Algorithm for Schedules</h3> Once loop sizes have been inferred, we have enough information to determine important execution features of the schedule, such as how much memory it will allocate and how many operations it will perform. The cost of the schedule is then a weighted sum of these data points. By default our implementation groups execution features into the following: mem - amount of memory allocated </li> loads - number of intermediate results loaded from storage </li> stores - number of intermediate results stored </li> arithmetic operations - number of +</code>, -</code>, *</code> and /</code> operations performed </li> mathematical operations - number of sin</code>, cos</code>, tan</code>, sqrt</code> operations performed </li> </ul> Each of these groups has a weight that determines the importance of these features with respect to the schedule's cost (see Evaluation below). Now that we can give a notion of cost to schedules, we can search for efficient schedules. We use beam search as our search algorithm, with the default schedule as the starting node. We describe the concrete parameters used for search below in Evaluation. Conversion to Halide</h3> Once we have a candidate schedule tree, we convert it into Halide. We do this by checking the ancestry path from compute nodes: this path determines whether a function's variables are split, the traversal order for computing the function, and, for called functions, the granularity at which the function is stored and computed. Consider the schedule above for the functions bx</code>, by</code>, and in</code>. Converted into Halide code, the schedule looks like the following: by.reorder(y, x); bx.store_root(); bx.compute_at(by, y); bx.reorder(y, x); </pre>Evaluation</h2> We evaluate the performance of the autoscheduler over three benchmarks. We do this by comparing the performance of the autoscheduled run (OPT configuration) vs. the run with the default schedule (DEF configuration). 
We measure runtime and memory usage using gprof</code>. For the experiments, we set the weights for execution features as follows: Group</th> Weight</th></tr></thead> mem</td> 0.1</td></tr> loads</td> 0.5</td></tr> stores</td> 0.5</td></tr> arith ops</td> 1.0</td></tr> math ops</td> 10.0</td></tr> </tbody></table> We run beam search with a depth of 10 and beam width of 300. For all benchmarks, the output functions are realized across an extent of 2048x2048. The results below are averaged across three runs. Benchmark 1</h3> g(x,y) = sqrt(cos(x) + sin(y)); f(x,y) = g(x + 1,y) + g(x,y) + g(x + 1,y + 1) + g(x,y + 1); </pre>Function</th> Runtime (ms)</th> Peak heap usage (bytes)</th></tr></thead> f (DEF)</td> 87.72</td> 0</td></tr> f (OPT)</td> 13.48</td> 0</td></tr> g (DEF)</td> N/A</td> 0</td></tr> g (OPT)</td> 172.70</td> 32874</td></tr> </tbody></table> Benchmark 2</h3> blur_x(x,y) = input(x - 1,y) + input(x,y) + input(x + 1,y) / 3; blur_y(x,y) = blur_x(x - 1,y) + blur_x(x,y) + blur_x(x + 1,y) / 3; </pre>Function</th> Runtime (ms)</th> Peak heap usage (bytes)</th></tr></thead> blur_y (DEF)</td> 12.70</td> 0</td></tr> blur_y (OPT)</td> 19.02</td> 0</td></tr> blur_x (DEF)</td> N/A</td> 0</td></tr> blur_x (OPT)</td> 16.16</td> 16400</td></tr> </tbody></table> Benchmark 3</h3> f(x,y) = x + y; </pre>Function</th> Runtime (ms)</th> Peak heap usage (bytes)</th></tr></thead> f (DEF)</td> 11.85</td> 0</td></tr> f (OPT)</td> 12.18</td> 0</td></tr> </tbody></table> Discussion</h3> Note that for the DEF configuration, only the output functions have runtimes associated with them since all called functions are inlined. The autoscheduler performs rather poorly relative to the default schedule. While it successfully makes space-runtime tradeoffs (e.g., f</code> in Benchmark 1), allowing the computation of a function to run much faster by saving intermediate results, it runs more slowly and uses more memory than the default schedule across all benchmarks. 
We believe the poor performance of the autoscheduler has two main causes: Wrong feature weights. The feature weights for the cost model are chosen by fiat; if these were instead learned from a set of training data, then the weights could probably better capture the execution profile of schedules. </li> Missing execution features. There are some execution features not captured in the current cost model that probably have a significant effect on performance. Most importantly, the cost model does not reason about locality. Because of this, the autoscheduler sometimes generates schedules with a loop order that has poor locality (e.g., a function being traversed in column-major order instead of row-major order). It is not clear how to quantify locality in the cost model, but it is an obvious extension to the cost model. </li> </ul> Software Simulation for Data Streaming in HeteroCL Wed, 18 Dec 2019 00:00:00 +0000 With the pursuit of higher performance under physical constraints, there has been an increasing deployment of special-purpose hardware accelerators such as FPGAs. The traditional approach to programming such devices is to use hardware description languages (HDLs). However, with the rising complexity of the applications, we need a higher level of abstraction for productive programming. C-based high-level synthesis (HLS) is thus proposed and adopted by many companies such as Xilinx and Intel. Nonetheless, to achieve high performance, users usually need to modify the algorithms of applications to incorporate different types of hardware optimization, which makes the programs harder to write and maintain. To solve the challenge, recent work such as HeteroCL</a> proposes the idea of decoupling the algorithm from the hardware customization techniques, which allows users to explore the design space and the trade-offs efficiently. 
In this project, we focus on extending HeteroCL with data streaming support by providing functional software-level simulation (in contrast with hardware-level simulation, where we simulate after hardware synthesis). Experimental results show that with the LLVM JIT runtime, we can achieve orders of magnitude of speedup compared with the software simulation provided by HLS tools. Why Data Streaming?</h3> Unlike traditional devices such as CPUs and GPUs, FPGAs do not have a pre-defined memory hierarchy (e.g., caches and register files). Namely, to achieve better performance, the users are required to design their memory hierarchy, including data access methods such as streaming. In this project, we focus on the streaming between on-chip modules. The reason that we are interested in the cross-module streaming is that it introduces more parallelism to the designs. To be more specific, we can use streaming to implement task-level parallelism. We use the following example written in HeteroCL to illustrate the idea of streaming. @hcl.def_([A.shape, B.shape, C.shape]) def M1(A, B, C): with hcl.for_(0, 10) as i: B[i] = A[i] + 1 C[i] = A[i] - 1 @hcl.def_([B.shape, D.shape]) def M2(B, D): with hcl.for_(0, 10) as i: D[i] = B[i] + 1 @hcl.def_([C.shape, E.shape]) def M3(C, E): with hcl.for_(0, 10) as i: E[i] = C[i] - 1 M1(A, B, C) M2(B, D) M3(C, E) </pre> In this example, M1</code> takes in one input tensor A</code> and writes to two output tensors B</code> and C</code>. Then, M2</code> and M3</code> read from B</code> and C</code> and write to D</code> and E</code>, respectively. We can see that M2</code> and M3</code> have no data dependence and can thus be run in parallel. Moreover, these two modules can start as soon as they receive an output produced by M1</code>. To realize such task-level parallelism, we can replace the intermediate results B</code> and C</code> with data streams. We illustrate the difference between before and after applying data streaming with the following figure. 
Data Streaming in HeteroCL</h3> The key feature of HeteroCL is to decouple the algorithm specification from the hardware optimization techniques, which is also applicable to streaming optimization. To specify streaming between modules, we use the primitive to(tensor, dst, src, depth=1)</code>. It takes four arguments. The first one is the tensor that will be replaced with a stream. The second one is the destination module and the third one is the source module. Finally, users can also specify the depth of the stream. Currently, the data stream is implemented with FIFOs. HeteroCL will provide other types of streaming in the future. Below, we show how to specify data streaming with our previous example. s = hcl.create_schedule([A, B, C, D, E]) s.to(B, s[M2], s[M1], depth=1) s.to(C, s[M3], s[M1], depth=1) </pre>Software Simulation for Data Streaming</h3> Programming language support alone is not enough. We also need the ability to simulate the programs after applying data streaming. One way to do that is by using the existing HeteroCL back ends. Namely, we can generate HLS code with data streaming and use the HLS tools to run software simulation. Note that the software simulation here refers to cycle-inaccurate simulation. The reason why we only focus on cycle-inaccurate simulation is that to complete cycle-accurate simulation, we need to run through high-level synthesis, which could be time-consuming in some cases. We can see that the existing back ends require users to have HLS tools installed, which is not ideal for an open-source programming framework. Moreover, the users will need a separate compilation to run the simulation. Thus, in this project, we introduce a CPU simulation flow to HeteroCL by extending the LLVM JIT runtime. With this feature, users can quickly verify the correctness of a program after adding data streaming. Implementation Details</h3> The code can be seen here</a>. The key idea is to simulate data streaming with threads. 
In other words, each module is executed by a single thread. We also implement a scheduling algorithm that decides when each thread fires and how threads synchronize. We implement the streams as one-dimensional buffers whose size is set by the specified FIFO depth. Currently, we only provide blocking reads and blocking writes; non-blocking operations are left as future work. In the following sections, we describe the algorithms and implementation details.
Module Scheduling</h4>
The purpose of this algorithm is to schedule each module by assigning it a timestep, which indicates the execution order between modules. Namely, modules that can be executed in parallel are assigned the same timestep. Similarly, if two modules are executed in sequence, they are assigned different timesteps. Note that the timesteps assigned to two consecutive executions do not need to be consecutive numbers. Since each module is executed by a single thread, a thread synchronization is enforced between two consecutive timesteps.
To begin with, we assign each module a group number. Modules within the same group are executed in sequence, while modules in different groups can be executed in parallel. To assign the group number, we first build a dataflow graph (DFG) from the input program. An example is shown in the following figure, where the solid lines denote normal read/write operations and the dotted lines denote reads/writes of data streams.
After the DFG is built, we remove all the dotted lines. Then, we assign a unique ID to each connected component. This ID is the group number. An example is shown below.
Now, we can start the scheduling process by assigning a timestep to each module. We first perform a very simple as-soon-as-possible (ASAP) algorithm. Namely, the first module within each group is assigned timestep 0.
After that, we assign the timestep of each module according to the data dependences. An example is shown below.
However, this schedule is not yet correct because, as mentioned above, modules connected by streams should run in parallel. Namely, they must share the same timestep. To fix this, we add the dotted lines back one at a time and correct the timesteps. We also need to correct the timesteps of the succeeding modules accordingly. After all dotted lines are added, the scheduling algorithm finishes.
Note that there exist cases that we cannot solve. For example, if two modules A</code> and B</code> are connected with a solid line, and the producer A</code> streams to a module M</code> while B</code> also streams from M</code>, then there exists no valid schedule under our constraints. One possible way to solve that is by merging A</code> and B</code> into a new module A_B</code>. In this case, the streaming from/to M</code> becomes an internal stream, which can be scheduled easily by assigning A_B</code> and M</code> the same timestep. We do not merge the two modules in this implementation because A</code> or B</code> may be reused for other computations, in which case we would need to reconstruct the DFG. Thus, we leave this as future work.
Parallel Execution with Threads</h4>
After we assign each module a timestep, we can start to execute the modules with threads. Before we execute a module with a new thread, we check whether all modules with smaller timesteps are completed. More precisely, we first check whether all modules with smaller timesteps have been fired. If not, we schedule the current module to be executed later by pushing it into a sorted execution list. Otherwise, if all modules with smaller timesteps have been fired, we check whether they are finished. If not, we perform thread synchronization (e.g., by using thread.join()</code> in C++). Finally, we need to execute the modules in the execution list.
Since the list is sorted, we do not need to worry about new modules being inserted into the list.
Stream Buffers</h4>
In this work, we implement the streams with buffers that act like FIFOs. Instead of popping from or pushing to the buffers, we maintain a head and a tail pointer for each buffer. The pointers are stored as integers. The head pointer points to the next element to be read, and the tail pointer points to the next element to be written. We update the pointers each time an element is written to or read from the buffer. We apply a modulo operation so that a pointer wraps around when it would exceed the buffer size (i.e., the FIFO depth). Since two threads may update the pointers at the same time, we use std::atomic</code> provided by C++ to make sure there is no data race. Finally, we maintain a map so that we can access a stream by its ID.
LLVM JIT Extension</h4>
To give users a one-pass compilation flow, we extend the existing LLVM JIT runtime in HeteroCL. Implementing both threads and stream buffers in pure LLVM would be complicated and hard to maintain. Thus, we implement them in C++ and design an interface composed of a set of functions. For instance, we have BlockingRead</code>, BlockingWrite</code>, ThreadLaunch</code>, and ThreadSync</code>. Then, inside our JIT compiler, we call these functions using LLVM external calls.
Evaluation</h3>
In this section, we evaluate our implementation using both unit tests and a realistic benchmark. Experiments are performed on a server with a 2.20 GHz Intel Xeon processor and 128 GB of memory. We verify the correctness of the final result and compare the total run time in different cases.
Unit Tests</h4>
The tests can be found here</a>. Below, we briefly illustrate what each test does using its DFG.
For the unit tests, we compare the run time before and after applying data streaming. The results are shown in the following table.
We run each test 1000 times and report the average.
Testcase</th> Original (ms)</th> Multi-threading (ms)</th> Speedup</th></tr></thead>
two_stages</td> 0.0592</td> 0.0554</td> 1.070</td></tr>
three_stages</td> 0.0831</td> 0.0715</td> 1.162</td></tr>
internal_stage</td> N/A</td> 0.0638</td> N/A</td></tr>
fork_stage</td> 0.0865</td> 0.0758</td> 1.141</td></tr>
merge_stage</td> 0.0906</td> 0.0739</td> 1.226</td></tr>
</tbody></table>
The average speedup across our test cases is 1.150, which is expected because we now execute the modules with multiple threads. Note that for the third test (i.e., test_internal_stage</code>), the functionality differs before and after applying data streaming. To be more specific, we list the test program here.
@hcl.def_([A.shape, B.shape, C.shape, D.shape])
def M1(A, B, C, D):
    with hcl.for_(0, 10) as i:
        B[i] = A[i] + 1
        D[i] = C[i] + 1

@hcl.def_([B.shape, C.shape])
def M2(B, C):
    with hcl.for_(0, 10) as i:
        C[i] = B[i] + 1

M1(A, B, C, D)
M2(B, C)

s = hcl.create_schedule([A, B, C, D])
s.to(B, s[M2], s[M1], depth=1)
s.to(C, s[M1], s[M2], depth=1)
</pre>
We can see that without streaming, the production of D</code> is not affected by M2</code>. However, if we specify C</code> to be streamed from M2</code> to M1</code>, the original memory read of C</code> in M1</code> becomes a blocking read. This also demonstrates that without simulation support for streaming, some hardware behaviors cannot be correctly represented.
Realistic Benchmark</h4>
We also show the evaluation results from a realistic benchmark, which is more complicated than the synthetic unit tests. Due to time limitations, we only use the Sobel edge detector, a popular edge detection algorithm in image processing. We compare the results with the software simulation tool provided by the HLS compiler. More specifically, we first generate Vivado HLS code with hls::stream</code>. Then we use csim</code> to run the software simulation.
The evaluation results are shown below. We also show the time overhead due to compilation.
Simulation Method</th> Simulation Time (s)</th> Compilation Overhead (s)</th> Total Run Time (s)</th></tr></thead>
LLVM JIT</td> 0.00094</td> 0</td> 0.00094</td></tr>
Vivado HLS csim</td> 1.63</td> 1.29</td> 2.92</td></tr>
</tbody></table>
We can see that with the LLVM JIT runtime, simulation is roughly three orders of magnitude faster than HLS simulation (about 3,100x in total run time). Moreover, the compilation overhead of HLS simulation is not negligible.
Conclusion and Future Work</h3>
In this work, we implement a software simulation runtime for data streams in HeteroCL by extending the existing LLVM JIT back end. We implement the simulation runtime with multi-threading in C++. Moreover, we propose a scheduling algorithm that exploits the task-level parallelism of a program after applying data streaming. Finally, we use unit tests to verify our work and use a realistic benchmark to demonstrate the efficiency gain over existing HLS tools. Our next step will be testing our extension with more realistic benchmarks. In addition, by parsing HLS reports, we may be able to perform cycle-accurate simulation. Then we can compare the performance of our scheduling algorithm with those implemented in existing HLS tools. In the end, we want to submit a pull request to the upstream HeteroCL repository.
Evaluating the Performance Implications of Physical Addressing
Wed, 18 Dec 2019 00:00:00 +0000
Introduction to Virtual Addressing</h2>
Modern processors use virtual addressing</a> to access actual memory locations through a translation layer. Only highly privileged software, such as the operating system (OS), has access to physical memory addresses, while all other processes can only refer to memory via virtual ones. When a process requests memory (e.g., via malloc</code>), the OS allocates physical memory in fixed-size chunks, called pages, and then maps them into the process' virtual address space.
This allows the OS to allocate whichever regions of physical memory happen to be free, even when the process has requested a large, contiguous allocation. Virtual addressing provides a few key abstractions for user-level software:
A fully contiguous address space.</li>
A unique address space not shared by any other process.</li>
</ol>
The former enables software to easily calculate relative memory addresses; accessing any element in an array requires only one or two instructions to add the offset to the base pointer and then load from memory. Similarly, locations on the program stack are computed relative to the current stack pointer. Neither of these "pointer arithmetic" operations would be valid if executed on the physical addresses. The latter is a useful security primitive that enables strong process memory isolation "for free," since there is no way for a process to even reference memory owned by another process (unless the OS maps some physical location into both address spaces).
The Case Against Virtual Addressing</h2>
The translation of virtual addresses is accelerated by dedicated hardware called the Translation Lookaside Buffer</a> (TLB). This acts as a "translation cache" and hides most of the cost of virtual address translation, except when an address is not present in the TLB. A TLB miss triggers a complex series of physical memory accesses called "walking the page table" and tends to be extremely expensive (especially if it has to be handled by software). For workloads that allocate very large amounts of memory, the TLB can't actually "reach" all of the necessary memory addresses, causing frequent TLB misses</a>. In these cases, it's not uncommon for the CPU to be running only a single application which would like to manage its own memory anyway; the aforementioned advantages of virtual addressing are significantly reduced, but the cost in TLB misses can be devastating to performance.
The other major cause of TLB misses is frequent context switching between processes, which typically triggers a complete flush of the TLB state. For multithreaded applications which rely heavily on system calls (e.g., web servers), this can incur overheads of up to 20%</a>.
Furthermore, virtual addressing is not a requirement for memory security. There are many different proposals (and even some usable implementations) of tagged memory architectures, where physical memory locations are associated with tags that control how those locations can be accessed by software. Some examples include: the CHERI capability architecture</a>; the PUMP processor for software-defined metadata</a>; and the secure information flow CPU, Hyperflow</a>. Instead of relying on a process' inability to address memory, these designs use hardware to efficiently check whether or not a memory access is allowed by the system's security policy. In these designs, the protection provided by virtual addressing is either mostly or completely redundant.
Removing Virtual Addressing</h1>
Let us imagine that we are running code on one of these tagged memory architectures and we want to eliminate virtual addressing and the overheads it entails. In this world, we can still ask our OS for memory via malloc</code>; however, it returns a physically contiguous memory region (rather than a virtually contiguous one). The large-memory applications described above that manage their own memory would likely start by malloc</code>-ing most of the computer's physical memory and then never call malloc</code> again. Little would change for such programs (except that the spatial locality assumptions their designers had originally made about memory layout are more likely to reflect reality). However, programs which request new allocations throughout their lifetimes may no longer be able to execute correctly.
Since malloc</code> returns a physical memory region, the OS needs to find a large enough contiguous free region of physical memory to allocate. Due to fragmentation</a>, it is possible that no such region exists. In that case, malloc</code> returns 0</code> and, in all likelihood, the program explodes. Remember that such fragmentation was present with virtual addressing as well, but the OS could stitch together various fragmented segments to form a single virtual allocation. Therefore, programs should strive to allocate memory in fixed-size chunks; essentially, they should assume that the OS can only allocate them pages of physical memory and that it is their job to stripe data structures across them.
Experimental Setup</h1>
To evaluate the impact of the software changes required in lieu of virtual addressing, we ran experiments with the following configuration. First, we ran all of our tests on a computer with 8 Intel i7-7700 CPUs clocked at 3.60 GHz, with 32 GB of physical memory, running Ubuntu 16.04. Second, we followed the guidelines</a> provided by LLVM to reduce variance; in particular, every test was executed on a single, isolated processor core. We ran all of our tests ten times and report averages of our measurements; with this setup we observed very little variance, typically less than 0.01% standard deviation. Finally, we assumed that some reasonable amount of stack could be pre-allocated contiguously, even on a physically addressed machine. We chose 32 KB since that was approximately the smallest stack size required to execute the benchmarks normally.
Unfortunately, we could not actually execute any tests using physical addressing, since there is no reliable method for allocating physical memory in user space. While there are several proposals</a> for how to implement these features, they aren't currently supported in Linux. While there are reconfigurations and workarounds that could enable this evaluation, the solutions are not lightweight.
Therefore, our results are overhead measurements that represent worst-case performance; we don't actually expect any of our tests to result in speedups.
Dealing With The Stack</h1>
The stack</a> presents another potential issue. Current compilers assume that stack-allocated variables can be addressed relative to the stack pointer, which is stored in a register. While this is an efficient mechanism for address computation, it doesn't work if a stack frame does not consist of physically contiguous memory.
For certain applications, it is likely that we can allocate a single stack page at start-up and then go on with our lives. In this case, the restrictions mentioned above aren't really an issue. However, programs may allocate large data structures on the stack, may recurse deeply, or may have dynamically sized stack allocations. In these cases, we can run into the issues described above since the stack we've already allocated may not be large enough.
One solution to this problem is to dynamically allocate stack frames whenever a function call is made. In this case, every function prologue needs to check the current stack and see if there's enough space. If there is, then the function executes normally; otherwise, the function asks the OS for a memory region big enough to store the current function's entire stack frame before running. During the function epilogue, the program should then free</code> that memory.
It turns out that gcc</a> has implemented exactly this functionality and calls it "stack splitting"</a>. You can check out that link for a detailed explanation of the -fsplit-stack</code> option for gcc, but it essentially implements the algorithm described above, modulo some tricks for making the common case fast and maintaining its own small free list for stack pages.
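To make the prologue and epilogue checks concrete, here is a toy model in Python. It is a sketch under our own assumptions and not how gcc implements split stacks: the class SplitStack</code>, its segment_size</code> parameter, and the fixed frame_size</code> are hypothetical, and the "stack" is just a list of free-byte counters rather than real memory.

```python
class SplitStack:
    """Toy model of a stack built from non-contiguous segments."""

    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [segment_size]  # free bytes left in each live segment
        self.frames = []                # (frame size, did it open a segment?)
        self.allocations = 0            # number of run-time segment requests

    def enter(self, frame_size):
        # Function prologue: check whether the current segment has room.
        if self.segments[-1] >= frame_size:
            self.segments[-1] -= frame_size
            self.frames.append((frame_size, False))
        else:
            # Not enough room: request a segment big enough for this frame.
            size = max(self.segment_size, frame_size)
            self.segments.append(size - frame_size)
            self.allocations += 1
            self.frames.append((frame_size, True))

    def leave(self):
        # Function epilogue: release the frame, and its segment if it opened one.
        frame_size, opened_segment = self.frames.pop()
        if opened_segment:
            self.segments.pop()
        else:
            self.segments[-1] += frame_size


def recursive_sum(n, stack, frame_size=64):
    # Naive recursion that "pays" for one frame per call.
    stack.enter(frame_size)
    result = 0 if n == 0 else n + recursive_sum(n - 1, stack, frame_size)
    stack.leave()
    return result
```

With a 256-byte segment and 64-byte frames, recursive_sum(10, stack)</code> makes eleven calls and has to request a new segment twice; shallow call chains never leave the initial segment and only pay for the check.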
Overhead of Stack Checking</h2>
We evaluated the performance impact of using split stacks on two microbenchmarks designed to be bottlenecked on function calls, one of which triggers run-time memory allocations and one of which does not. The first microbenchmark was a naive program to compute the 50th number in the Fibonacci sequence without memoization; this did not require a large amount of stack, so we use it to measure the overhead of just checking whether or not there is enough space. The other microbenchmark naively and recursively computes the sum of the first n integers. We executed this benchmark with n=1000000000</code> so that it would trigger run-time allocations by recursing very deeply.
This diagram plots the execution time of these two microbenchmarks with the -fsplit-stack</code> option enabled, normalized to regular execution (statically allocated stacks).
As you can see, our Fibonacci benchmark has only about a 15% increase in runtime caused by checking remaining stack space. While not an insignificant cost, most programs will not execute nearly as high a density of function calls and should not see such high overheads. The recursive sum benchmark ended up executing 6410250 separate run-time allocations of 1544 bytes each. While it had a very large performance impact, we could certainly tune the stack allocation algorithm to request larger chunks of memory to reduce the frequency of calls to malloc</code>.
PARSEC Benchmarks</h3>
While these microbenchmarks give us a good upper bound on worst-case overhead, we wanted to evaluate some more realistic tests. We chose the PARSEC</a> benchmarks, mostly because we used them for a prior project in this class and could test them easily. The execution times for these benchmarks cover the built-in "Regions of Interest" and exclude initialization and warm-up times for each program.
With these four benchmarks, there was almost no impact of applying the split-stack option.
We should note that the StreamCluster benchmark actually sped up when using split-stack; likely this is some sort of memory alignment effect à la Mytkowicz et al.</a>. In any case, we should probably consider this impact to be negligible. Of these benchmarks, only Ferret actually required dynamically allocating stack space. Each of these allocations was 618 KB, which is a potential concern. It is unclear whether, in a real system using only physical addressing, allocations of this size would be frequently serviceable. We hypothesize that real systems with many gigabytes or terabytes of memory, even with severe fragmentation, will be able to regularly respond to allocations in the kilobyte range; however, evaluating this is future work.
Large Object Allocations</h1>
The other major modification to programs would be supporting large memory allocations. Since it is probably unreliable to request very large contiguous memory regions, we must adopt a new strategy. To evaluate the potential impact of these changes, we modified the Blackscholes benchmark from the PARSEC suite. Blackscholes uses two dynamically allocated arrays, which we replaced with custom Array</code> objects that we implemented to use only fixed-size allocations. We chose to modify this application not only because it consists of a single, easily modifiable source file, but also because it iterates through large arrays and is very likely to be negatively impacted by array access latency and spatial locality within the array.
Custom Array Implementation</h2>
We implemented our Array</code> using a tree-based data structure that mimics the functionality of page tables in a virtual address environment. We support up to three levels of allocation, where the final level contains data and all previous levels contain pointers to other pages. In the following diagram, the L1 page contains pointers to L2 pages, which contain pointers to the L3 pages. Each L3 page contains the actual object array data.
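As a rough illustration of this layout, the following Python sketch stripes an array across fixed-size "pages" using a three-level tree. It is not the C++ Array</code> used in our experiments: the class name PagedArray</code> and the page_slots</code> parameter are hypothetical, pages are allocated lazily on first touch, and the tree always has three levels, all of which are simplifying assumptions made for the example.

```python
class PagedArray:
    """Array striped across fixed-size pages via a three-level tree."""

    def __init__(self, length, page_slots=512):
        # Each page holds page_slots entries (pointers or data values).
        if length > page_slots ** 3:
            raise ValueError("length exceeds three-level capacity")
        self.length = length
        self.page_slots = page_slots
        self.l1 = [None] * page_slots  # L1 page: pointers to L2 pages

    def _data_page(self, index):
        # Split the flat index into one offset per tree level.
        if not 0 <= index < self.length:
            raise IndexError(index)
        s = self.page_slots
        i1, rest = divmod(index, s * s)
        i2, i3 = divmod(rest, s)
        if self.l1[i1] is None:
            self.l1[i1] = [None] * s  # L2 page: pointers to L3 pages
        l2 = self.l1[i1]
        if l2[i2] is None:
            l2[i2] = [0] * s          # L3 page: the actual data
        return l2[i2], i3

    def __getitem__(self, index):
        page, slot = self._data_page(index)
        return page[slot]

    def __setitem__(self, index, value):
        page, slot = self._data_page(index)
        page[slot] = value
```

Every access walks two pointer pages before touching data, which is the extra work that shows up as additional instructions and memory accesses in the measurements.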
As an optimization, the constructor for Array</code> determines how many levels of pages are required to store all of the data. For instance, if the data fits in a single page, then the L1 page holds data and no L2 or L3 pages are allocated.
Evaluating Tree Overheads</h2>
We had to modify the Blackscholes benchmark slightly to replace the calls to malloc</code> with our C++ Array</code> objects. This only involved modifying the array allocation and deallocation, since we overloaded the []</code> operator for normal array dereferences. Blackscholes only uses pointer arithmetic for allocation purposes, so we didn't need to modify any other instructions.
The main sources of overhead we anticipated from these data structures were not only the increased number of instructions to access data, but also the reduced spatial locality of data within the array, which depends on how big our physical allocations actually are. Therefore, we evaluated a number of different configurations, where our library used different sized "pages," ranging from 4 KB to 1 MB. Furthermore, the original Blackscholes program used complicated pointer arithmetic to allocate five contiguous arrays from a single call to malloc</code>. In our original modification, we treated these as five separate Array</code> objects since we can't guarantee address contiguity. In the "Modified Blackscholes" test, we rewrote this to be a single array of struct</code> objects so that there might be more spatial locality between fields accessed around the same time.
We saw a worst-case overhead of around 17% with 8 KB pages. Among the base Array</code> configurations, 32 KB pages performed the best. Intuitively, it makes sense that larger pages start to provide diminishing returns once they exceed L1 and L2 CPU cache sizes.
We tried to confirm these intuitions using performance counters but found that L1 cache miss rates were very close across all configurations and LLC (last level cache) miss rates varied wildly even across executions of the same configuration. Likely this was caused by interference from processes running on other cores or from the OS itself. However, using other performance counters we did notice that both the original and modified Blackscholes programs had very similar IPC (instructions per cycle) values, indicating that CPU efficiency wasn't significantly impacted and the primary overhead was simply caused by executing more instructions.
As a simple test of this, we modified the Array</code> code to always use trees of depth three (two layers of pointers and a single data layer), which removed some of the runtime checks required to access data. The results for that test are in the "No Branches" column above. Other than in the 1 MB case, this configuration performed much better than the others, with only a 3% overhead in the 16 KB case. In a robust implementation, one could achieve this effect by using a "factory"</a> pattern to create the appropriate depth Array</code> for the given allocation. Ideally, we would be able to determine this requirement statically so that array accesses could be inlined; this would probably avoid the vast majority of the overheads caused by using this data structure.
Random Access Overheads</h2>
The Blackscholes benchmark primarily scans through large arrays; we wanted to measure the overhead of a microbenchmark with significantly less spatial locality. In addition, we wanted to compare the impact of our changes on a long-running test that used small arrays. To achieve both of these, we wrote a small benchmark that initializes the values of an array to a set of pseudorandom values (generated by multiplying the index by a large prime number).
Then we execute a pointer chase through the array by looking up the value at location 0 and then treating the value as the next index to inspect (modulo array size). Unlike the Blackscholes benchmark, this runs a large number of iterations but can be configured to use a small or large array. The small test was sized to fit into a single data page and therefore should not incur extra memory accesses compared to the traditional array implementation. For this test, we used 4 KB pages. As before, we include a "No Branches" configuration which is precompiled to remove the run-time checks that determine the correct element look-up behavior.
We can see here that the random access cost on large arrays is, unsurprisingly, expensive. In the linear access case, there is still quite a bit of spatial locality and Array</code> pointer pages are likely to be in cache for multiple data accesses. With random access, the larger amount of memory required to store the data increases the cache pressure. Small arrays do suffer from some overhead in this test, but likely this is primarily caused by the increase in dynamic instruction count (the "No Branches" case executes 1.5 times as many instructions as the baseline).
Conclusion</h1>
While our results are not entirely easy to explain, they do at least roughly follow our intuitions. Memory allocation is a complex process dependent upon a number of system variables and the operating system implementation. In order to better understand what is going on with these results, we must both sample across more test benchmarks and measure performance in a more controlled setting (e.g., by using Stabilizer</a>). These preliminary results suggest that highly optimized data structures, tuned for physical memory allocation, could impose very little overhead. Furthermore, at least for programs which do not allocate large amounts of memory on the stack, the cost of checking stack size and occasionally allocating new stack frame pages would be negligible.
All in all, if we could actually address physical memory, we very well might see improvements in performance while also simplifying much of the underlying hardware and operating system.
The source code for the Array</code> object, the microbenchmarks, and the instrumentation modifications we made to libgcc can be found on GitHub</a>.
Quantum Vectorization
Wed, 18 Dec 2019 00:00:00 +0000
The code used in this blog post is hosted here</a>.
Introduction</h2>
In this blog, we describe our efforts to develop a compiler pass to vectorize the implicit parallelism present in quantum algorithms. Quantum algorithms are probabilistic, and so need to be run multiple times to get a "reliable" result. Since each of these program runs is independent, several can be performed simultaneously on the same hardware without changing the final result, so long as the hardware has space to support the additional logic. In this project, we developed an LLVM pass to transform code to help take advantage of this program structure. Our LLVM pass rewrites code to duplicate all algorithm instructions associated with each array onto physical hardware. We cannot conclude whether this approach provides speedup without a proper experimental setup, but we have found that such a pass can be run on realistic quantum code to produce somewhat vectorized algorithms.
Quantum Computing</h2>
Quantum computing has exploded into the popular imagination in the past decade due to the promise of massive theoretical speedups over conventional digital computers. Whether real quantum hardware can live up to the promise remains to be seen, but that has not stopped researchers from developing complex toolflows and algorithms. The computing paradigm of quantum computers is inherently different from that of a standard "classical" computer.
Instead of representing a bit as either a 0</code> or 1</code>, quantum bits (qubits) represent a bit of information using a quantum superposition a |0> + b |1></code>, where |0></code> and |1></code> represent possible realizable states and a</code> and b</code> are normalized coefficients related to the probability of measuring the respective state. Although a</code> and b</code> can in theory hold unbounded information, in practice only one bit can be measured from the state, as in the classical case, because the state collapses to one of the realizable states |0></code> or |1></code> upon measurement.
Quantum computing offers unique computing properties due to the nature of a qubit. The main computational differences between quantum and classical computing include the following properties:
Property</th> Quantum</th> Classical CPU</th></tr></thead>
Architecture</td> Spatial</td> von Neumann</td></tr>
Data</td> Quantum State</td> Voltage Level</td></tr>
Control</td> External (Laser)</td> Voltage Level</td></tr>
States per bit</td> Exponential</td> Linear</td></tr>
</tbody></table>
In general, a quantum computer can implement all of the computation primitives that a classical one can. Both, in theory, can be Turing complete with a universal set of logic gates. However, quantum computing also has computational primitives that classical computers don't share. These primitives are key to quantum supremacy: the concept that a quantum computer can theoretically outperform classical computers.
Unique compute</th> Example Usage</th></tr></thead>
Large State Space</td> Chemical Reaction Simulation</td></tr>
Entanglement</td> Combinatorial Optimization (TSP)</td></tr>
Amplitude Amplification</td> Database Search</td></tr>
Probabilistic (multiple results)</td> ?</td></tr>
Phase</td> ?</td></tr>
</tbody></table>
A potential downside to quantum computing is that it is inherently probabilistic.
An output on one execution may be entirely different from the output on the next run. Quantum algorithms must be designed so that the correct answer has measurement probability >50%. The answer can then be inferred by repeating the execution many times and taking the majority result. Many quantum algorithms belong to the complexity class BQP (bounded-error quantum polynomial time), where the correct answer can be found in polynomial time with probability at least 2/3. In practice, however, it can be time and resource intensive to run quantum programs a sufficient number of times to achieve reasonable confidence.
Opportunities for Vectorization</h2>
The probabilistic nature of quantum algorithms requires multiple repeated runs of the same program. The number of runs required to obtain a "correct" result depends on one's error threshold and the design of the algorithm; specifically, the number of runs needed to obtain error e</code> is O(log(1/e))</code>. Thus, there are diminishing returns for running the algorithm many times, but it is important to run the algorithm a "reasonable" number of times to achieve acceptably low error.
The naive method to repeatedly apply the algorithm is to run many iterations of the algorithm sequentially. This serialization can increase the runtime, depending on how many repeated applications are required. Consider, for instance, the entanglement program below written in Scaffold</a>. It must be run multiple times to get representative results.
<pre>
module catN ( qbit bit, const int n ) {
  H( bit[0] );
  for ( int i=1; i &lt; n; i++ ) {
    CNOT( bit[0][i-1], bit[0][i] );
    CNOT( bit[1][i-1], bit[1][i] );
  }
  // measure each bit
  for ( int i=0; i &lt; n; i++ )
    MeasZ(bit[i]);
}
void main () {
  qbit bits[4];
  catN( bits, 4 );
}
</pre>
By preempting the need to run this program multiple times, we can directly incorporate the implicit outer loop and produce something like the following:
<pre>
module catN ( qbit bit, const int n ) {
  H( bit[0] );
  for ( int i=1; i &lt; n; i++ ) {
    CNOT( bit[0][i-1], bit[0][i] );
    CNOT( bit[1][i-1], bit[1][i] );
  }
  // measure each bit
  for ( int i=0; i &lt; n; i++ )
    MeasZ(bit[i]);
}
void main () {
  qbit bits[2][4];
  for (int i = 0; i &lt; 2; i++) {
    catN( &amp;(bits[i]), 4 );
  }
}
</pre>
Now that we have a data-parallel outer loop, we can schedule multiple runs together on spare quantum resources and potentially vectorize the runs if the architecture allows. By exposing this parallelism, we expect to achieve a speedup over running the repeats sequentially.
Note that we are not trying to vectorize the underlying algorithm; such an implementation would require more information about each algorithm and may fail due to data dependencies. We are instead vectorizing the implicit data-parallel structure that repeated runs give to probabilistic computing.
<h2>Implementation</h2>
We designed a quantum compiler pass within the ScaffCC compiler infrastructure. ScaffCC adds IR passes and quantum computer backends on top of LLVM, so our pass is written as one would write it for a classical compiler.
The pass first records each instance of the <code>alloca</code> instruction for vectorization. We make the assumption that qubit arrays are the only memory structures allocated by these programs. This assumption is based on observations of the program samples included in the ScaffCC repository. Once every allocation is recorded, each of these instructions is cloned a number of times equal to the <code>qvlen</code> argument.
We then fully traverse the dataflow graph to copy all dependent instructions. We traverse the dependence graph starting from the allocations in a breadth-first manner, so that we copy a dependent instruction only when all of its dependencies have already been copied. This is required so that the copied values are available for use in later instruction copies. Quantum computers are spatial architectures, so functions are inlined into a single basic block. Thus, our dataflow graph algorithm was able to reach the whole program.
We do not actually implement vector instructions because that would require extensive backend work to target the simulator. The simulator does support operations in parallel, but it does not give any timing information. Because of this, the extensive backend work would not show any meaningful results either.
It is worth noting that this implementation does not scale to situations where qubit allocations have dependencies (for example, if a qubit allocation used the size of a previous allocation). We choose to ignore these cases as a simplifying assumption.
<h2>Evaluation</h2>
We evaluated our technique using the ScaffCC compiler infrastructure and a quantum computer simulator. Due to constraints of the simulator, we were limited to small benchmarks using a small number of qubits. We still chose to use the simulator to check for correctness and to get a sense of the probability distributions for the simulated algorithm.
We identified six benchmarks that use roughly 10 or fewer qubits. One of the benchmarks, QFT (quantum Fourier transform), is an intermediate step in most algorithms and not meant to be measured, so we excluded it. The benchmarks are enumerated below, along with the number of times to repeat the execution. This number was chosen somewhat arbitrarily and is between 10 and 100 depending on how fast the algorithm ran. We used a pass to count the number of dynamic gate operations for each of the benchmarks.
Note that all loops are unrolled in a quantum program because quantum computers are a spatial architecture, i.e., the number of static instructions is the same as the number of dynamic instructions.
<table>
<thead><tr><th>Benchmark</th><th>Qubits Used</th><th>Gates</th><th>Repeats</th></tr></thead>
<tbody>
<tr><td>Cat</td><td>4</td><td>8</td><td>100</td></tr>
<tr><td>Ising</td><td>10</td><td>220</td><td>100</td></tr>
<tr><td>VQE</td><td>4</td><td>148</td><td>100</td></tr>
<tr><td>Grover</td><td>11</td><td>174</td><td>100</td></tr>
<tr><td>Ground State</td><td>6</td><td>8713</td><td>10</td></tr>
</tbody></table>
The simulator does not give timing information, so we created a rough timing model. We assume the target quantum computer uses ion trap technology. Here, a microwave laser can implement quantum gates by shining onto qubits. SIMD is possible by directing the laser to multiple qubits at once (citation). These "instructions" are likely not as fast as a >1GHz instruction cache on a classical computer, so amortizing the cost of control is important. Thus, we quantify the timing by the total number of laser pulses required to run the algorithm with enough repeats.
Additionally, we consider a quantum computer with 20 logical qubits that can be used for multiple simultaneous runs. We do not consider any spatial scheduling problems and assume the qubit regions working on different runs are effectively isolated from each other.
We can statically predict the best run-time using our model:
<code>Time = Repeats * Gates * Gate_Time / Vector_Length</code>
We consider relative speedup over the baseline, so the actual time to execute a gate is a constant factor that divides out. The theoretical speedup is then given by:
<code>Speedup = Floor(Max Qubits / Used Qubits)</code>
Our theoretical results for each benchmark are given below for a 20 qubit machine and a 53 qubit machine like Google's recent Sycamore computer, which arguably achieved quantum supremacy for the first time.
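As a quick sanity check on the speedup formula, here is a minimal Rust sketch that evaluates it for each benchmark. The qubit counts come from the benchmark table; the names <code>BENCHMARKS</code>, <code>theoretical_speedup</code>, and <code>speedup_table</code> are hypothetical and exist only for this illustration.

```rust
/// (benchmark name, qubits used), copied from the benchmark table in the text.
const BENCHMARKS: [(&str, u32); 5] = [
    ("Cat", 4),
    ("Ising", 10),
    ("VQE", 4),
    ("Grover", 11),
    ("Ground State", 6),
];

/// Speedup = Floor(Max Qubits / Used Qubits): the number of independent runs
/// that fit on the machine at once. Unsigned integer division already floors.
fn theoretical_speedup(max_qubits: u32, used_qubits: u32) -> u32 {
    max_qubits / used_qubits
}

/// Evaluate the model for every benchmark on a machine with `max_qubits` qubits.
fn speedup_table(max_qubits: u32) -> Vec<(&'static str, u32)> {
    BENCHMARKS
        .iter()
        .map(|&(name, used)| (name, theoretical_speedup(max_qubits, used)))
        .collect()
}
```

Calling <code>speedup_table(20)</code> and <code>speedup_table(53)</code> evaluates the model for the two machine sizes considered here.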
<table>
<thead><tr><th>Benchmark</th><th>Speedup (20 qubits)</th><th>Speedup (53 qubits)</th></tr></thead>
<tbody>
<tr><td>Cat</td><td>5</td><td>13</td></tr>
<tr><td>Ising</td><td>2</td><td>5</td></tr>
<tr><td>VQE</td><td>5</td><td>13</td></tr>
<tr><td>Grover</td><td>1</td><td>4</td></tr>
<tr><td>Ground State</td><td>3</td><td>8</td></tr>
</tbody></table>
We also experimentally compile each benchmark with our pass and execute the program through the simulator to check for "correctness". Each program successfully compiled and executed on the simulator. We explicitly checked the <code>cat</code> program for correctness. In this algorithm, a group of 4 qubits is entangled to be either all 0s or all 1s. We verified that there were multiple groups of 4 qubits with this property. The other algorithms also seemed to have reasonable outputs (a mix of 0s and 1s that changed on a run-by-run basis).
<h2>Conclusion</h2>
We implemented an LLVM pass to vectorize the implicit data-parallel repetition loop needed to produce precise quantum computing results. Through this implementation, we show that such an optimization is possible and can be readily applied to some common quantum programs. We then used this pass with a quantum gate simulator to predict the speedup possible by applying such an optimization.

A Simple Way to Implement a Bad High Level Synthesis Compiler

Fri, 13 Dec 2019 00:00:00 +0000

<h2>Goal</h2>
The goal of this project was to experiment with a novel approach to writing an HLS compiler. In particular, we compile Futil, a novel intermediate representation, to Verilog by generating finite state machines that implement Futil's control constructs.
This project is divided into two main parts:
<ul>
<li>Convert a Control AST in Futil to an intermediate FSM structure</li>
<li>Generate RTL from the intermediate FSM structure as well as the Futil structure</li>
</ul>
<h2>Background</h2>
Futil is made of two sub-languages: a structure language for describing a static computation graph that represents the physical structure of a circuit, and a control language for dynamically choosing which part of the static computation graph runs at a particular time. The ultimate goal of Futil is to be a general framework, similar to LLVM, for working with optimizing HLS compilers. However, the immediate goal of Futil is to provide a Verilog backend for the Dahlia language. This is what we present in this blog post.
The structural language is straightforward to convert to Verilog; it is already very close to Verilog. However, the control language does not have a straightforward representation in Verilog. Our plan is to convert these statements into a finite state machine with the same semantics. The finite state machine is then easy to translate into Verilog.
A typical Futil program is shown below:
<pre>
(define/namespace prog
  (define/component main () ()
    ==>
    ((new-std a0 (std_reg 32 0))
     (-> (@ const0 out) (@ a0 in))
     (new-std const0 (std_const 32 2))
     (new-std b0 (std_reg 32 0))
     (-> (@ const1 out) (@ b0 in))
     (new-std const1 (std_const 32 1))
     (new-std gt0 (std_gt 32))
     (-> (@ a0 out) (@ gt0 left))
     (-> (@ const2 out) (@ gt0 right))
     (new-std const2 (std_const 32 1))
     (new-std y0 (std_reg 32 0))
     (-> (@ const3 out) (@ y0 in))
     (new-std const3 (std_const 32 2))
     (new-std z0 (std_reg 32 0))
     (-> (@ const4 out) (@ z0 in))
     (new-std const4 (std_const 32 4)))
    ==>
    (seq
      (par (enable a0 const0) (enable b0 const1))
      (if (@ gt0 out) (enable gt0 a0 const2)
          (enable y0 const3)
          (enable z0 const4)))))
</pre>
The first arrow is pointing to the structure and the second arrow is pointing to the control.
The structure is an unordered list of two kinds of simple statements: <code>(new-std b0 (std_reg 32 0))</code> instantiates the library component <code>b0</code> with a bitwidth parameter of <code>32</code> and a value parameter of <code>0</code>, and <code>(-> (@ a0 out) (@ gt0 left))</code> wires the <code>out</code> port of component <code>a0</code> to the <code>left</code> port of component <code>gt0</code>. The control part specifies which components are active with the <code>enable</code> keyword, and the execution logic with the <code>par</code>, <code>seq</code>, <code>if</code>, and <code>while</code> keywords. Think of activating a component like a function call. When a component is active, it is allowed to run and produce valid outputs.
In this project, we are interested in converting all the control logic to finite state machines (FSMs) and then generating a simulatable Verilog program based on both the FSMs and the structure.
<h2>Design Overview</h2>
Futil is the backend for Dahlia. The Futil semantics are designed to allow for easy translation from a higher-level language like Dahlia, but this creates a gap between the Futil semantics and Verilog implementations. The table below shows what is required to translate the Futil semantics into synthesizable Verilog implementations.
<table>
<thead><tr><th>Futil Semantics</th><th>Verilog</th></tr></thead>
<tbody>
<tr><td>Invalid wire</td><td>Read wires</td></tr>
<tr><td>Component Reusing</td><td>MUX, Read wires</td></tr>
<tr><td>Control</td><td>FSM</td></tr>
</tbody></table>
<h3>Read Signals</h3>
In the Futil semantics, the <code>enable</code> keyword is used to determine whether a component is active. It is the easiest way of translating a program into hardware. However, this implicitly assumes that the signal on a wire is not valid or readable until we <code>enable</code> a component. We therefore require every data wire to have one extra bit that specifies whether the signal is readable. Another way to think about this is that the type of a data wire is <code>Option&lt;T&gt;</code>; the wire is either <code>Some(t)</code> or <code>None</code>, depending on whether the module is enabled.
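To make the <code>Option&lt;T&gt;</code> view concrete, here is a minimal Rust sketch of a 32-bit data wire carrying one extra read bit. The packing into a <code>u64</code> with the read bit stored in bit 32, and the names <code>DATA_BITS</code>, <code>encode</code>, and <code>decode</code>, are assumptions made for illustration; they are not taken from the Futil implementation.

```rust
/// Width of the data part of the wire, matching the 32-bit registers in the example.
const DATA_BITS: u32 = 32;

/// Pack an optional 32-bit value into 33 bits: bit 32 is the read (tag) bit,
/// bits 0..=31 hold the data. A wire that is not readable is encoded as all zeros.
fn encode(wire: Option<u32>) -> u64 {
    match wire {
        Some(value) => (1u64 << DATA_BITS) | u64::from(value),
        None => 0,
    }
}

/// Recover the optional value. The data bits are ignored when the read bit is low.
fn decode(bits: u64) -> Option<u32> {
    if ((bits >> DATA_BITS) & 1) == 1 {
        Some(bits as u32) // truncation keeps only the 32 data bits
    } else {
        None
    }
}
```

Under this assumed encoding, a consumer only looks at the data bits when the read bit is set, which mirrors pattern matching on <code>Some(t)</code>.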
In order to encode this in Verilog, we add an extra bit to every data wire to encode the tag of the variant.
<h3>MUX</h3>
A component can be used more than once in Futil. For instance, in the above example, <code>const0</code> and <code>const2</code> are both connected to the input of <code>a0</code>. In Futil, we deal with this by only enabling <code>a0</code> and <code>const0</code>, or <code>a0</code> and <code>const2</code>, at the same time (as seen in the example below). However, in Verilog the register needs to choose between the inputs from <code>const0</code> and <code>const2</code>. This introduces a multiplexer (MUX).
<pre>
(enable a0 const0)
(enable a0 const2)
</pre>
At each time step, the read signals tell which wire into the MUX is readable. Therefore, we can use the read wires (the variant tag bits) as the select signals for the MUX. In other words, this MUX can be thought of as a function <code>List&lt;Option&lt;T&gt;&gt; -> T</code> that chooses a <code>Some(t)</code> from a list of options. We assume that only one wire feeding into a MUX is valid at a time.
<h3>FSM</h3>
In Futil, there are control constructs like <code>if</code> and <code>while</code>. These can be translated into FSMs in the Verilog implementation, which is the main goal of this project. However, before getting to that, we created intermediate FSM expressions in Futil. An FSM component has:
<ul>
<li>input and output ports,</li>
<li>connections of wires between its own ports and other components' ports,</li>
<li>internal control logic that determines the output signals.</li>
</ul>
The internal control logic of an FSM component can be divided into several states that determine the output signals. A state transfers to another according to some input signals. In general, all FSM components are composed of one Start state, some Intermediate states, and one End state.
Consider the syntax <code>(enable A B)</code>. The Start state transfers to the Intermediate state when the valid signal is high.
At the Intermediate state, the FSM sends out valid signals to subcomponents <code>A</code> and <code>B</code>, and waits for the ready signals from them to go high. Once both ready signals are high, the FSM transfers to the End state and outputs a ready signal to notify any components waiting for this component to finish. It transfers back to the Start state when the valid signal goes low, which indicates that the upper components have received the ready signal and finished execution, so it is safe for the FSM to go back to the Start state.
The same design logic applies to all FSMs. The only difference is in the intermediate state(s): the <code>seq</code> FSM has one or more intermediate states, and each intermediate state only transfers to the next state on receiving a high ready signal from the previous state; the <code>if</code> FSM sends valid to the module that executes the comparison and receives both ready and condition signals, which determine the state it should transfer to; the <code>while</code> FSM transfers to the loop Body state when the condition signal is high and goes to the End state when the condition is low.
<h2>Implementation</h2>
To realize what we describe in the design overview, we gradually added intermediate passes.
<h3>First Intermediate Pass</h3>
This pass is built around a Visitor trait, which performs a recursive walk of the abstract syntax tree (AST), so each individual pass can modify the AST with function calls including <code>add_structure</code>, <code>add_input_port</code>, <code>remove_structure</code>, etc.
<h4>Read Wires</h4>
The first pass adds read wires. We go through all input and output ports of each component, adding corresponding read ports, and then go through each wire of the components, adding read wires to the ports. We do this pass ahead of creating FSM signatures because we don't want to create read wires for control signals like valid and ready.
<h4>FSM Signatures</h4>
This pass translates control syntax to FSM components.
Based on the design logic of FSMs, we can specify the inputs and outputs of each FSM component and the wires connecting each port to its subcomponents. Notice that we also need to add cond_read signals to specify whether the condition signal from the comparison component is readable.
<table>
<thead><tr><th>FSM</th><th>Input Ports</th><th>Output Ports</th></tr></thead>
<tbody>
<tr><td><code>enable</code>/<code>par</code>/<code>seq</code></td><td>val, rdy_A, rdy_B, ...</td><td>rdy, val_A, val_B, ...</td></tr>
<tr><td><code>if</code></td><td>val, rdy_con, cond, cond_read, rdy_T, rdy_F</td><td>rdy, val_con, val_T, val_F</td></tr>
<tr><td><code>while</code></td><td>val, rdy_con, cond, cond_read, rdy_body</td><td>rdy, val_con, val_body</td></tr>
</tbody></table>
<h3>Interfacing</h3>
This pass creates a clock input port for all components and a valid input port for the top-level component. Notice that in Futil, we do not have an explicit notion of time steps. However, to make things easier for RTL translation, we created these ports.
<h4>MUX Signatures</h4>
Similar to creating FSM signatures, we need to specify the inputs and outputs of each MUX component and the wires connecting each port to its subcomponents. To do this, we create a HashMap indexed by destination port that stores a vector of source ports according to the wiring of the component. For each destination port, if there is more than one source port connecting to it, we create a MUX.
Notice that there is a difference between control signals and data signals, though. Control signals do not have corresponding read wires, and no matter which control signal is high, the output is high. On the other hand, data signals always have corresponding read wires that explicitly specify whether the data on the data wire is readable. The data signal should be chosen according to its read wires. Therefore, in the implementation, we go through the wires twice. The first time, we record the read wires going to the same destination components. The second time, we actually create the large MUX with both data and corresponding read wires.
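The grouping step and the two kinds of MUX described above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the code from the Futil compiler: the <code>Port</code> pair representation and the names <code>port</code>, <code>mux_candidates</code>, <code>data_mux</code>, and <code>control_mux</code> are hypothetical.

```rust
use std::collections::HashMap;

/// A port named by (component, port). This pair representation is hypothetical.
type Port = (String, String);

fn port(component: &str, name: &str) -> Port {
    (component.to_string(), name.to_string())
}

/// Group wires (source, destination) by destination port and keep only the
/// destinations driven by more than one source, since those need a MUX.
fn mux_candidates(wires: &[(Port, Port)]) -> HashMap<Port, Vec<Port>> {
    let mut by_dest: HashMap<Port, Vec<Port>> = HashMap::new();
    for (src, dest) in wires {
        by_dest.entry(dest.clone()).or_default().push(src.clone());
    }
    by_dest.retain(|_, sources| sources.len() > 1);
    by_dest
}

/// Data MUX: pick the single readable input. Returns None when no input,
/// or more than one input, is readable.
fn data_mux<T: Copy>(inputs: &[Option<T>]) -> Option<T> {
    let mut readable = inputs.iter().flatten();
    match (readable.next(), readable.next()) {
        (Some(value), None) => Some(*value),
        _ => None,
    }
}

/// Control MUX: control signals have no read wires, so the output is high
/// whenever any input is high.
fn control_mux(inputs: &[bool]) -> bool {
    inputs.iter().any(|&signal| signal)
}
```

For example, given wires from <code>a0.out</code> and <code>b0.out</code> into <code>gt0.left</code> and a single wire into <code>gt0.right</code>, only <code>gt0.left</code> is reported as needing a MUX. Returning <code>None</code> when several inputs are readable is a design choice of this sketch; the text instead assumes that case never happens.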
The last step is removing the old wires that connect the same destination port with more than one source port.
<h3>Second Intermediate Pass</h3>
<h4>FSM Implementation</h4>
This pass creates true FSM representations in the Futil AST. Each <code>FSM</code> has a name field, a states HashMap that maps state indices to states, a start index corresponding to the first state created, and a last index pointing to the last state created. Each <code>State</code> is composed of a vector of outputs, where each output is specified by an output value and a port name, and a vector of transitions, where each transition is a tuple of the next state index and input values with their port names, so that the transition happens when the inputs have certain values. Finally, each state has an optional default field that tells which state it should transition to when no transition condition is met.
We provide the methods <code>new(name: &amp;str) -> StateIndex</code>, <code>new_state() -> StateIndex</code>, and <code>get_state(idx: StateIndex) -> &amp;mut State</code> for <code>FSM</code>, and <code>push_output(output: ValuedPort)</code> and <code>add_transition(transition: Edge)</code> for <code>State</code>, to generate the actual FSM inner logic according to the graph we made in the design overview.
<h3>FSM to RTL generation</h3>
After generating an FSM, we need to translate the entire structure of inputs, outputs, and states to synthesizable hardware using Verilog. This is done by breaking down a Verilog file into distinct components.
<ol>
<li>Module Declaration: Here we define the name of the module along with its inputs and outputs.</li>
<li>Wire/reg definitions: These are internal signals that are used within the module.</li>
<li>FSM: FSMs can be represented in Verilog using 3 <code>always</code> blocks: State transition, Next state logic, State outputs.</li>
</ol>
To expand on how the 3 <code>always</code> blocks are generated, we discuss them below:
<ol>
<li>State transition: This is pretty standard. It actually changes the state at a clock edge. Since this is a generic block, it can be created without any inputs.</li>
<li>Next state logic: This block has a case for each of the states. For each state in the FSM struct, based on the input transitions to it, we have <code>if else</code> statements for the next state logic.</li>
<li>Output logic: This block contains the output signals for each state, represented by cases, similar to the previous block. In addition to having Verilog statements for all the relevant outputs in the state, we also assign the rest of the outputs of the FSM to zero for now. This is done to avoid inferred latches, which can occur if not all outputs are assigned in each state, even when they don't change.</li>
</ol>
We used a Wadler-style printing API provided by the pretty Rust crate to format the Verilog files.
<h2>Hardest Parts</h2>
<ol>
<li>Futil is implemented in Rust, so we spent some time getting familiar with the language.</li>
<li>The design of the FSM representation changed multiple times, because each state has to be stored behind a pointer and then modified when we add transitions and outputs to it. Rust forces the user to use reference counting for this, which we did not realize at first. Also, even with reference counting, we would have had to create reference cycles, which would prevent the FSMs from being freed. We therefore used HashMaps in the end.</li>
<li>Futil read signals are not common in Verilog coding conventions. We had two models in our minds: the Verilog valid/response model and our Futil valid/read model. We mixed the two models up and spent a huge amount of time discussing which one would be the ideal design.</li>
</ol>
<h2>Evaluation</h2>
We evaluated our compiler by simulating the generated Verilog. We generated Futil programs with a simple backend we wrote for the Dahlia compiler.
The Verilog simulation was done with an open source tool called Verilator, which turns Verilog into a <code>C++</code> object that you can link to, drive the inputs of, and watch the outputs of. This generates a <code>.vcd</code> file that you can view in a waveform viewer like <code>gtkwave</code>. From here you can explore the values of different wires across time.
Although we got the core of the compiler working, we weren't able to test very complicated Dahlia programs because we did not implement memories. We also do not correctly generate the logic to multiplex between different inputs to a single component, so we were not able to take full advantage of the parallelism that hardware can provide.
Despite these problems, we were still able to get some programs working. Below is a very simple program that checks whether a number is greater than 5.
<pre>
let a = 10;
let b = 1;
---
if (a > 5) {
  let y = 20;
} else {
  let z = 40;
}
</pre>
Below is an almost equivalent Futil program. We've drawn an arrow to the difference between the two. The marked statement simply performs the comparison <code>b > 5</code> using the same comparison component as the one used in the condition of the <code>if</code> statement. We've made this change to show that we can use the same module multiple times. Although this sounds simple, it actually requires muxing between the control signals produced by two different FSMs. We also have to generate muxing between <code>a0</code> and <code>b0</code>.
<pre>
(define/namespace prog
  (define/component main () ()
    ((new-std a0 (std_reg 32 0))
     (-> (@ const0 out) (@ a0 in))
     (new-std const0 (std_const 32 10))
     (new-std b0 (std_reg 32 0))
     (-> (@ const1 out) (@ b0 in))
     (new-std const1 (std_const 32 1))
     (new-std gt0 (std_gt 32))
     (-> (@ a0 out) (@ gt0 left))
     (-> (@ b0 out) (@ gt0 left))
     (-> (@ const2 out) (@ gt0 right))
     (new-std const2 (std_const 32 5))
     (new-std y0 (std_reg 32 0))
     (-> (@ const3 out) (@ y0 in))
     (new-std const3 (std_const 32 20))
     (new-std z0 (std_reg 32 0))
     (-> (@ const4 out) (@ z0 in))
     (new-std const4 (std_const 32 40)))
    (seq
      (par (enable a0 const0) (enable b0 const1))
 ---> (enable gt0 b0 const2)
      (if (@ gt0 out) (gt0 a0 const2)
          (enable y0 const3)
          (enable z0 const4)))))
</pre>
This simple program results in a whopping 969 lines of Verilog code (which I will not paste here). We simulated the code in our simple test bench and were able to generate the following signal diagram:
From top to bottom, the signals are:
<ul>
<li>clock</li>
<li>state of the fsm for the if control statement</li>
<li>output of the greater than comparison component</li>
<li>read output signal for the greater than component</li>
<li>the value in the y register</li>
<li>the value in the z register</li>
</ul>
Notice that the read output signal goes high twice. The first one corresponds to the first time we do the comparison and the second time it goes high is for the comparison in the condition of the if statement. From this diagram, we can see that the true branch of the program was correctly taken and that value <code>20</code> was put into the register <code>y</code>. The <code>z</code> register remains in its default state.
For a slightly more interesting example, and because it is the classic hello world program of hardware, we implemented a counter.
The Dahlia code is the following:
<pre>
let i = 0;
---
while (i &lt; 10) {
  i := i + 1;
}
</pre>
and the equivalent Futil code:
<pre>
(define/namespace prog
  (define/component main () ()
    ((new-std i0 (std_reg 32 0))
     (-> (@ const0 out) (@ i0 in))
     (new-std const0 (std_const 32 0))
     (new-std add0 (std_add 32))
     (-> (@ i0 out) (@ add0 left))
     (-> (@ const2 out) (@ add0 right))
     (new-std const2 (std_const 32 1))
     (-> (@ add0 out) (@ i0 in))
     (new-std lt0 (std_lt 32))
     (-> (@ i0 out) (@ lt0 left))
     (-> (@ const1 out) (@ lt0 right))
     (new-std const1 (std_const 32 10)))
    (seq
      (enable i0 const0)
      (while (@ lt0 out) (lt0 i0 const1)
        (enable i0 add0 const2)))))
</pre>
Running the resulting 509 lines of Verilog code gives us the following trace:
Finally, we got a simple implementation of Fibonacci running. Here is the Dahlia code:
<pre>
let a = 1;
let i = 0;
---
let b = 1;
---
while (i &lt; 10) {
  let tmp = b;
  i := i + 1;
  ---
  b := a + tmp;
  ---
  a := tmp;
}
</pre>
Notice that we have a lot more triple dashes than Dahlia requires. This is because without them we run into the aforementioned problems with muxing. Our compiler produced 1412 lines of Verilog code. I compared this against the equivalent C++ program compiled with the Vivado HLS toolchain. Their compiler produced 148 lines of Verilog code. Although this is an imperfect metric, it does show that this method of compiling to hardware has a large overhead.
<h2>Conclusion</h2>
Overall this was a very interesting project. Although the overhead of this approach is very high, we successfully demonstrated that you can build a simple but functional HLS compiler in ~2 weeks with this approach. Additionally, because most of the compilation work took place within Futil, it would be straightforward to improve the quality of the output by writing more passes. I think that this provides an excellent baseline from which we can explore the impact of more optimizations.

CompCert: Formally Verified C Compiler

Mon, 09 Dec 2019 00:00:00 +0000

<h2>Motivation</h2>
The primary motivation of this paper is that compilers form a base of trust for most modern applications: we assume that if the application's code is correct, then the compiled executable of that application will also be correct. However, most modern compilers, like GCC and LLVM, do have bugs, some of which silently miscompile code without emitting any error messages. Most of these bugs occur when the compiler performs transformations and optimization passes over the source program.
The goal of CompCert is to create a compiler that will never silently miscompile code. CompCert accomplishes this by formally verifying (i.e., proving) that each compiler pass does not change the original meaning of the program. The formally verified parts of CompCert are written in Coq, which is a proof assistant based on the calculus of inductive constructions.
<h2>Semantic Preservation</h2>
In order for a compiler to be correct, it needs to preserve the semantics of the source program. In this section, we discuss how the paper formalizes the notion of semantic correctness.
The paper assumes that the source and target languages have formal semantics that assign observable behaviors to each program. The notation $S \Downarrow B$ means that the program $S$ executes with observable behavior $B$. An observable behavior includes things like whether the program terminates or not, and various going-wrong behaviors such as accessing an array out of bounds or invoking an undefined operation like dividing by zero. It also includes a trace of all external calls (system calls) that records the input and output of the external functions. However, it doesn't include the state of memory.
The strongest definition of semantic preservation is that a source program $S$ has exactly the same set of possible behaviors as a compiled program $C$:
$\forall B, S \Downarrow B \Leftrightarrow C \Downarrow B$
However, this definition is too strict because it doesn't give the compiler room to perform certain desirable optimizations, such as dead code elimination, which may optimize away certain going-wrong behaviors. For example, if the result of an operation that divides a number by zero is never used, we want the compiler to be able to get rid of it. But doing so means that the compiled program has one fewer going-wrong behavior than the source program. For this reason, the paper only requires that all of the safe behaviors of the source program are preserved in the compiled program:
$S \texttt{safe} \Rightarrow (\forall B, C \Downarrow B \Rightarrow S \Downarrow B)$
$S \texttt{safe}$ is a predicate that means $S$ doesn't have any going-wrong behaviors. This definition enforces that the observable behaviors of $C$ are a subset of the possible behaviors of $S$ and that if $S$ does not go wrong, then $C$ doesn't go wrong either. The paper actually proves a related forward property, which implies the statement above when the target language is deterministic. The forward direction is easier to prove in practice because you can induct on the execution of $S$:
$\forall B \notin \texttt{Wrong}, S \Downarrow B \Rightarrow C \Downarrow B$
<h3>Verification vs. Validation</h3>
The paper models a compiler as a total function, <code>Comp(S)</code>, from source programs to either <code>OK(C)</code>, a compiled program, or <code>Error</code>, the output that represents a compile-time error, signifying that the compiler was unable to produce code.
There are two approaches for establishing that a compiler has the semantic preservation property discussed above: verifying the compiler directly using formal methods, or verifying a validator, a boolean function accompanying the compiler that checks the output of the compiler separately.
An unverified compiler along with a verified validator provides the same guarantees as a verified compiler because you can guard the result of the unverified compiler with the validator:
<pre>
Comp'(S) =
  match Comp(S) with
  | Error -> Error
  | OK(C) -> if Validate(S, C) then OK(C) else Error
</pre>
The validation approach is convenient because sometimes the validator is significantly simpler than the compiler. We'll see this approach used later for verifying part of the register allocation pass.
Verifying the compiler directly using formal methods amounts to proving that each step in the semantics of the source program corresponds to a sequence of steps in the semantics of the target program with the same observable effects. If you can also show that the initial states and final states of the source and target programs are equivalent, then this proves semantic equivalence. This is represented in the following simulation diagram:
$S_1$ represents a state in the execution of a program from the source language and $S_1'$ represents the equivalent state in the execution of the target program. The <code>~</code> line is an equivalence relation between states from the source semantics and states in the target semantics. The <code>~</code> line in the diagram shows that $S_1$ and $S_1'$ are equivalent states. The down arrows represent a single step in the execution of the program and the $t$ label represents the observable effects that took place in this step.
<h2>Structure of the Compiler</h2>
The source language of the CompCert compiler is Clight, which is a subset of C that includes most familiar C programming constructs like pointers, arrays, structs, if/then statements, and loops. The compiler front end consists of an unverified parser that parses a source file to a Clight AST. From there, the formally verified section of the compiler performs several passes that repeatedly simplify and transform the representation of the source code all the way to PowerPC assembly code.
Then, an unverified assembler and linker take the assembly code and generate an executable that can be run. In total, CompCert formally defines 8 intermediate languages and 14 passes over them. The 14 passes must be proven to preserve the semantics of the original program.
The first few passes simplify the C code by converting all types to either ints or floats (pointers get converted to ints) and explicitly describing memory accesses. The result of these passes is an intermediate representation called Cminor, and below is an example of a translation from Clight to Cminor.
As you can see, function signatures have been made explicit, implicit casts have been made explicit (like the cast from float to int), array accesses have been transformed into exact byte offsets from pointers, and the size of the function's activation record has been made explicit. The explicit function activation record is used for dealing with the address-of (&amp;) operator, which requires function local variables to be mapped to a location on the stack frame of the function.
Next, CompCert performs instruction selection for the specific architecture that it is targeting. This is done via instruction tiling, which can recognize basic algebraic identities. For example, the instruction selection pass will transform <code>8 + (x + 1) × 4</code> into <code>x × 4 + 12</code>. These algebraic identities are proven in CompCert in order to assist in the semantic preservation proof. The selected instructions are very similar to the available PowerPC instructions.
Next, CompCert makes the control flow of the program more explicit via a transformation to the RTL IR, which represents control flow using a CFG. In addition to generating a CFG, the RTL representation transforms variables into pseudo-registers, of which there is an unlimited supply.
The RTL representation is convenient to perform optimizations on, so CompCert runs several dataflow analyses on the program in order to perform optimizations such as constant propagation, common subexpression elimination, and lazy code motion.
The next transformation pass maps the pseudo-registers to hardware registers or abstract stack locations using a register allocation algorithm. The allocator is an approximate graph coloring algorithm implemented in unverified OCaml, and CompCert uses its output through the validator method discussed above. Further passes linearize the CFG, spill registers to the stack, and insert the necessary loads for temporaries. Some simple optimizations like branch tunneling (removing branches to branches) are also performed as part of these passes. Finally, CompCert performs instruction scheduling to increase instruction-level parallelism on superscalar processors, such as PowerPC processors, and generates PowerPC assembly code.
<h2>Verification of the Register Allocation Pass</h2>
In order to explain the verification process in more depth, the paper describes some of the more technical details of the register allocation pass. The register allocation pass operates on the RTL IR. In this representation, functions are represented as CFGs with instructions that roughly map to assembly instructions supported by the PowerPC architecture. However, the instructions use an infinite supply of pseudo-registers, also known as temporaries.
The execution semantics of the RTL IR are given as a small-step semantics. The small-step semantics operate over a global environment, which includes a list of all of the temporaries and their values as well as the state of memory. CompCert represents memory as a collection of blocks, each with a bounded size. A pointer is described as an offset from the base of a memory block.
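To make the block-and-offset memory model concrete, here is a minimal Python sketch. It is only an illustration of the idea described above, not CompCert's Coq definition: the names `Memory`, `InvalidAccess`, and `ptr_add` are hypothetical, and cells hold whole values rather than bytes to keep the example short.

```python
class InvalidAccess(Exception):
    """Raised when a pointer does not denote a valid cell of an allocated block."""


class Memory:
    """Memory as a collection of separate, fixed-size blocks (simplified)."""

    def __init__(self):
        self._blocks = []  # block id -> list of cells

    def alloc(self, size):
        """Allocate a fresh block of `size` cells and return a pointer to its base."""
        self._blocks.append([0] * size)
        return (len(self._blocks) - 1, 0)

    def _cell(self, ptr):
        # A pointer is a (block id, offset) pair; both parts must be in range.
        block, offset = ptr
        if not 0 <= block < len(self._blocks):
            raise InvalidAccess(f"unknown block {block}")
        if not 0 <= offset < len(self._blocks[block]):
            raise InvalidAccess(f"offset {offset} outside block {block}")
        return self._blocks[block], offset

    def load(self, ptr):
        cells, offset = self._cell(ptr)
        return cells[offset]

    def store(self, ptr, value):
        cells, offset = self._cell(ptr)
        cells[offset] = value


def ptr_add(ptr, delta):
    """Pointer arithmetic changes only the offset, never the block."""
    block, offset = ptr
    return (block, offset + delta)
```

Because arithmetic never changes the block component, a pointer into one block can never be turned into a pointer into another; stepping past the bound of a block simply makes the access invalid.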
In order to produce performant code, as many of the temporaries as possible should be mapped to hardware registers instead of being stored on the stack. The register allocation algorithm starts with a liveness analysis of each program point. For every program point $l$, the liveness analysis computes the set of variables that are live coming into program point $l$. This is typically expressed by solving the backward dataflow equations with a transfer function that removes all temporaries defined at program point $l$ and adds all temporaries used at program point $l$. Consider the following code snippet:
<pre>
1 b = a + 2;
2 c = b * b;
3 b = c + 1;
4 return b * a;
</pre>
On line 4, the variables $a$ and $b$ are live. Program point 3 defines $b$ and uses $c$. Therefore, variable $c$ must be live at line 3. However, $b$ is not live before line 3 because line 3 redefines it. On line 2, $c$ is defined and $b$ is used, so $c$ is no longer live but $b$ is. Finally, on line 1, $b$ is defined, so it is not live before line 1. This gives the following live variable sets coming into each program point:
<pre>
1 b = a + 2;     LV = {a}
2 c = b * b;     LV = {a,b}
3 b = c + 1;     LV = {a,c}
4 return b * a;  LV = {a,b}
</pre>
Live variable analysis is important for register allocation because it helps build an interference graph. The interference graph represents temporaries as nodes. An edge between two nodes A and B means that temporaries A and B cannot be assigned the same hardware register. If two temporaries are live at the same time, then they cannot be assigned to the same hardware register. To build the interference graph, you inspect the live variable set at each program point and add edges between all pairs of temporaries in that set. Then, you color the graph to assign temporaries to hardware registers. For the above code snippet, the interference graph has an edge between $a$ and $b$ and an edge between $a$ and $c$, so $b$ and $c$ may share a register.
The live variable analysis is implemented in Coq.
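The liveness computation and interference graph construction for the four-line example can be sketched in a few lines of Python. This is a minimal illustration for straight-line code only, not CompCert's Coq implementation: a real analysis solves the dataflow equations to a fixed point over a CFG. The names `PROGRAM`, `live_in_sets`, and `interference_edges` are hypothetical, and the encoding assumes line 2 reads `b` and line 4 reads both `a` and `b`.

```python
from itertools import combinations

# Each instruction is (defined temporaries, used temporaries), in program order.
PROGRAM = [
    ({"b"}, {"a"}),       # 1: b = a + 2
    ({"c"}, {"b"}),       # 2: c = b * b
    ({"b"}, {"c"}),       # 3: b = c + 1
    (set(), {"a", "b"}),  # 4: return b * a
]


def live_in_sets(program):
    """Walk backward, applying live_in = (live_out - defs) | uses at each point."""
    live = set()  # nothing is live after the final return
    result = [None] * len(program)
    for i in range(len(program) - 1, -1, -1):
        defs, uses = program[i]
        live = (live - defs) | uses
        result[i] = set(live)
    return result


def interference_edges(live_sets):
    """Add an edge between every pair of temporaries live at the same point."""
    edges = set()
    for live in live_sets:
        for u, v in combinations(sorted(live), 2):
            edges.add((u, v))
    return edges


live = live_in_sets(PROGRAM)
edges = interference_edges(live)
print(live)
print(sorted(edges))
```

The computed sets match the `LV` annotations in the example, and the only edges are between `a` and `b` and between `a` and `c`, so two registers are enough for the three temporaries.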
The liveness analysis is proven to generate live variable sets that are supersets of the actual live variable sets at each program point. The paper claims this is easier to prove and does not violate the correctness of the register allocation step. This is because supersets of the actual live variable sets only add more edges to the interference graph, which can cost registers but still maintains the correctness of the register allocation pass.
The actual coloring of the interference graph is implemented in unverified OCaml code due to its complexity. The function is then used in the proof of semantic preservation via the validator approach. As a reminder, the validator approach allows CompCert to use the output of the OCaml code only if that output is valid. Otherwise, compilation fails and no code is emitted. The correctness conditions for the coloring $\phi$ of the temporaries are:
<ol>
<li>$\phi(r) \neq \phi(r')$ if $r$ and $r'$ interfere</li>
<li>$\phi(r) \neq l$ if $r$ and $l$ interfere ($l$ is a machine register or stack location; the interference graph can be pre-colored, and some pseudo-registers can interfere with hardware registers or stack locations)</li>
<li>$\phi(r)$ and $r$ have the same register class (either int or float)</li>
</ol>
After coloring the graph, the RTL IR is transformed to the LTL IR by replacing temporaries according to the mapping $\phi$. For each temporary $r$, $\phi(r)$ is either a hardware register or a stack location.
In order to prove that the transformation preserves the semantics of the original program, the lock-step simulation approach discussed above is used. The equivalence relation on states requires that control flow is preserved, that memory contents are the same, and that the register contents are preserved in some suitable sense. The first two properties are intuitively correct because register allocation doesn't really affect control flow or memory state at a program point. However, the equivalence between temporaries and hardware registers is a bit more subtle.
The subtlety is that the value in a hardware register might not be the same as the value in a temporary mapped to it. For example, if two temporaries that do not interfere are mapped to the same hardware register, then at some point in time the value in the hardware register will differ from the value of one of the two temporaries. Therefore, the paper states that a relaxed property was proven: at every program point $l$, $R(r) = R'(\phi(r))$ for all $r$ live at point $l$ (where $R$ and $R'$ are the register states of the source and target programs).
<h2>Evaluation</h2>
In general, CompCert is able to output code with similar performance characteristics to gcc at -O1 and -O2. However, we think a more important metric is the correctness of CompCert, since this was the primary purpose of creating it. The paper could not evaluate this because nobody had seriously tested CompCert at the time of the paper's release. Since then, several automated compiler testing tools such as Csmith and Orion (from the Equivalence Modulo Inputs paper) have reported a handful of bugs in CompCert over the years. We looked into issues on the official CompCert GitHub page and bug reports generated by Orion to try to figure out what some of the bugs were and where they manifested themselves.
The GitHub repository for CompCert seems to have been created in 2015, while the paper is dated 2009. So there may be several bugs found in the original version of CompCert that were patched before the GitHub repository was started. The Orion project includes a page with all of the bugs it found that led to open issues on the CompCert GitHub repository. There are 31 total issues on this list, with 27 of them marked as fixed and the remaining 4 marked as won't fix. These issues were reported between August 2016 and May 2017. Most of the reported issues seem to be in the front-end OCaml code and are mostly crash failures.
There are a few crash failures associated with the unverified register allocation code that seem to have required some updates to the Coq proofs, such as issue 183. However, it does not appear that there was a case where CompCert silently generated a miscompilation due to an error in the formally verified parts of CompCert. This suggests that CompCert does indeed succeed at its goal of creating a C compiler with no miscompilation bugs.

Provably Correct Peephole Optimizations with Alive

Fri, 06 Dec 2019 00:00:00 +0000

In previous discussions, we've considered research systems that find bugs in compiler implementations via differential testing. To page you back in, Csmith and Equivalence Modulo Inputs (Orion) both used clever tactics to generate randomized test programs and inputs, with the goal of finding instances where compilers produce different output than expected. These systems exploit a key assumption: while we don't have an oracle that determines the ground truth correct behavior for any program (without a precise semantics), we can expect compilers to produce the "same" behavior across different implementations. On the other hand, there are fully verified compilers such as CompCert that guarantee against miscompilations, but do so at the cost of supporting entire language surfaces and getting fast, optimized code.
What about a middle ground, where we leverage a correctness oracle for some particularly tricky portions of a commonly used optimizing compiler? Lopes et al.’s “Provably Correct Peephole Optimizations with Alive”, from PLDI 2015, takes one flavor of this approach. Instead of treating the compiler itself as a black-box system that we try to break from the outside, Alive proves that the high-level insights behind certain optimizations are correct. Alive is built for LLVM, our friendly massively-optimizing, ahead-of-time, heavily-used beast of a compiler.
Alive aims to hit a design point that is both practical and formal: the provable guarantees of verified compilation, for one component of a very pragmatic compiler.
<h3>Peephole optimizations</h3>
In particular, Alive focuses on LLVM's peephole optimizations, which replace a small set of (typically adjacent) instructions with an equivalent, faster set. For example, a clever compiler might replace <code>%x = mul i8 %y, 2</code> (<code>x = y * 2</code>) with <code>%x = shl i8 %y, 1</code> (<code>x</code> = <code>y</code> shift left <code>1</code>). While these optimizations may "delight hackers", they are also extremely tricky to get right for edge cases and boundary conditions.
Alive's specific focus was inspired by the authors' previous work on Csmith, which found that the single buggiest file in LLVM was in the instruction combiner, home to over 20,000 lines (!) of C++ peephole optimizations. Since its publication in 2015, Alive has been used to fix and prevent dozens of bugs and improve code concision in production LLVM.
<h2>System overview</h2>
Below is a high-level overview of Alive's approach. First, Alive comes with its own domain-specific language (DSL) that was designed to resemble LLVM's intermediate representation. Optimizations are written in this DSL with a source (left-hand side) and a target (right-hand side) template, which abstract over constant values and exact data types. The semantics of each side are encoded into logical formulas. Then, Alive generates verification conditions that cover the full range of potential cases, including special treatment of undefined behavior. The verification conditions are handed to an off-the-shelf SMT (Satisfiability Modulo Theories) solver, Z3, which either proves their validity or provides a counterexample. If the verification conditions are provably correct, Alive is able to generate C++ code that implements the optimization (which the developer can then link into LLVM).
If the verification conditions fail, Alive provides the developer with a counter-example (in terms of the original source and target template).
<h2>Grokking undefined behavior</h2>
The greatest technical challenge for a compiler or verification engineer in this space is wrangling with undefined behavior. One of the authors of Alive, John Regehr, has several excellent blog posts on the topic.
<h3>Refinement</h3>
By definition, compilers are allowed to produce different results for the same source program in the presence of undefined behavior. However, compilers are not allowed to introduce undefined behavior for a program and input that was well-defined in the unoptimized source code. That is, the burden on a verifier like Alive is to show that optimization targets are refinements of the source: the target's set of possible behaviors may be narrower than the source's, but never broader.
To illustrate this, let's look at a real bug in an optimization that Alive discovered in production LLVM. The optimization aims to simplify an expression that negates the division of a variable <code>x</code> by a constant <code>C</code>, from the explicit <code>0 - (x/C)</code> to the simpler <code>x / -C</code>. In the Alive DSL, we specify this with:
<pre>
%div = sdiv %x, C
%r = sub 0, %div
=>
%r = sdiv %x, -C
</pre>
When we hand this optimization off to Alive, we get:
<pre>
Precondition: true
%div = sdiv %x, C
%r = sub 0, %div
=>
%r = sdiv %x, -C

Done: 1
ERROR: Domain of definedness of Target is smaller than Source's for i2 %r

Example:
%x i2 = 2 (0x2)
C i2 = 1 (0x1)
%div i2 = 2 (0x2)
Source value: 2 (0x2)
Target value: undef
</pre>
The problem here is the interplay between an edge case (signed integer overflow) and undefined behavior. When the concrete type is <code>i2</code> and the values are <code>x = -2</code> (the bit pattern <code>0x2</code> in the output above) and <code>C = 1</code>, we get <code>x/-C = -2/-1 = 2</code>, but <code>2</code> overflows a 2-bit signed integer!
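Because <code>i2</code> has only four values, the counterexample above can be reproduced by brute force. The following Python sketch enumerates every <code>x</code> and <code>C</code> and compares the two templates. It is not how Alive works (Alive encodes the templates as SMT formulas for Z3), and the helper names are hypothetical. It assumes, following LLVM's language reference, that <code>sdiv</code> is undefined on division by zero and on overflow, while a plain <code>sub</code> wraps; it does not model <code>undef</code> or poison.

```python
BITS = 2
LO, HI = -(1 << (BITS - 1)), (1 << (BITS - 1)) - 1  # i2 ranges over -2..1
UB = "UB"  # marker for undefined behavior


def wrap(v):
    """Wrap an integer to BITS-bit two's complement."""
    v &= (1 << BITS) - 1
    return v - (1 << BITS) if v > HI else v


def sdiv(x, c):
    """Signed division rounding toward zero; UB on /0 and on overflow."""
    if c == 0 or (x == LO and c == -1):
        return UB
    q = abs(x) // abs(c)
    return q if (x < 0) == (c < 0) else -q


def source(x, c):
    """%div = sdiv %x, C ; %r = sub 0, %div"""
    d = sdiv(x, c)
    return UB if d == UB else wrap(0 - d)


def target(x, c):
    """%r = sdiv %x, -C"""
    return sdiv(x, wrap(-c))


def refinement_failures(precondition=lambda c: True):
    """Inputs where the source is defined but the target is UB or differs."""
    failures = []
    for x in range(LO, HI + 1):
        for c in range(LO, HI + 1):
            if not precondition(c):
                continue
            s = source(x, c)
            if s == UB:
                continue  # source undefined: any target behavior is allowed
            if target(x, c) != s:
                failures.append((x, c))
    return failures


print(refinement_failures())
print(refinement_failures(lambda c: c != 1 and c != LO))
```

With no precondition, the search reports `(x, C) = (-2, 1)`, the case Alive printed, and also `(-2, -2)`: when `C` is the sign bit, `-C` wraps back to `C`, so the target computes `x / C` instead of its negation. Excluding `C == 1` and the sign bit leaves no failures at this bit width.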
Mathematically, the source template overflows on these inputs too, but LLVM's language reference states that overflow in <code>sdiv</code> is undefined behavior, which is not true for <code>sub</code>. Thus, the target template introduced undefined behavior in a case where there previously was none, so it is not a refinement.
In order to fix this bug, the LLVM developers added a precondition that <code>C != 1</code> and <code>C</code> is not a sign bit. In Alive, we can represent this precondition as <code>((C != 1) && !isSignBit(C))</code>, and the optimization verifies.
<h3>Poison</h3>
An additional complication in handling undefined behavior is that LLVM actually has two flavors of deferred (non-crashing) undefined behavior: the <code>undef</code> value, and implicit poison values. Poison values are a stronger form of undefined behavior: they happen when a side-effect-free instruction produces a result that might later trigger undefined behavior. The true undefined behavior only occurs if and when a poison value is later used by an instruction that does have side effects (for example, a division by zero). Poison values are not represented explicitly in LLVM IR, and can only be identified via careful analysis. Alive models poison in a similar way to <code>undef</code> values: target templates can only yield poison values if the source did as well.
<h2>Evaluating Alive's impact</h2>
At the time of publication in 2015, Alive's authors had (manually) ported 334 peephole optimizations. Optimizations varied in verification time from a few seconds to several hours. Among these 334 optimizations, Alive found 8 bugs. In addition, the authors built a version of LLVM with the default instruction combiner replaced by Alive-generated C++ for their 334 optimizations. They found that despite not covering all of the previous optimizations, LLVM+Alive stayed within 10% of the performance of LLVM on the SPEC 2000 and 2006 benchmarks.
Much more interestingly, however, the authors show how little coverage these optimizations received in the existing tests and benchmarks. An instrumented LLVM+Alive run on LLVM's nightly test suite and both SPEC benchmarks found that only 159 of the 334 optimizations were triggered. That is, more than half of the peephole optimizations ported to Alive were untested by the existing manual test and benchmark flow!
In addition to their hard performance numbers, Alive's authors reached out to LLVM developers to incorporate Alive into work-in-progress patches. The authors report that Alive helped them catch "dozens" of incorrect proposed optimizations, for which they were able to provide counter-examples.
<h2>Key take-aways</h2>
Alive leaves us with several key nuggets of wisdom:
<h4>DSL + SMT = profit</h4>
Alive demonstrates that finding a domain-specific language for your goals, in this case concise peephole optimizations, can be especially fruitful for verification. The authors argue that DSLs help compiler engineers reason about code. Beyond that, Alive shows that a DSL makes translation of semantics to a formal logic like SMT more tractable than trying to wrangle with the full semantics of languages like C or LLVM IR directly. Later work on Alive ("Alive2") has also introduced tools to help translate LLVM IR to Alive's DSL in an automated fashion.
<h4>Usability matters in formal methods</h4>
Alive is a formal system, but it is also a deeply practical one. It recognized that there is impact to be had from building verification systems closer to where working programmers spend their day-to-day hacking, in part by targeting a massive existing code base in a piecewise, workable way. In addition, Alive's DSL and counter-examples were designed with an interface meant to be familiar to LLVM engineers, which undoubtedly paid off in the adoption of this work.
Finally, the authors of Alive engaged closely with the LLVM community, from frequenting the RFC discussion channels to publishing high-level blog posts on their contributions.
A less optimistic lesson, however, is that technology transfer is still really hard. Despite the project's deep engagement with the community, LLVM has still not wholesale replaced most of its instruction combiner with Alive-generated code. There is always more work to be done reconciling ideals from research prototypes with the difficult constraints of industry-scale software engineering!
<h4>Undefined behavior is pernicious</h4>
One of the trickiest parts of the job for both industry compiler engineers and research verification hackers is dealing with undefined behavior. Eliminating undefined behavior entirely isn't feasible in an aggressive optimizing compiler that wants to exploit speculation, so researchers need to continue to push for better methodology to keep it contained, understandable, and verifiable.
In particular, the authors of Alive have been among several researchers who have pushed for LLVM to change its treatment of deferred undefined behavior. In 2016, they shared a proposal titled "Killing undef and spreading poison" that advocated for removing the <code>undef</code> value, adding an IR-level <code>poison</code> value, and introducing a <code>freeze</code> instruction that would stop the propagation of poison by resolving to an arbitrary value. Alive includes a branch modeling these new semantics. Just last month, LLVM took another step toward realizing this vision by adding the <code>freeze</code> instruction.
<blockquote>the freeze instruction finally landed in LLVM! https://t.co/W6odosWUe0 docs: https://t.co/kKcCpJUH1L lots of work left to do but this is a big step towards making LLVM have a clear and consistent undefined behavior model. John Regehr (@johnregehr), November 5, 2019</blockquote>

<h2>How to learn more...</h2>
<ul>
<li>WebAssembly paper</li>
<li>A cartoon intro to WebAssembly</li>
<li>WebAssembly will finally let you run high performance applications in your browser</li>
</ul>

<h1>Conclusion</h1>
<p>Thanks to logical indexing, we generated two implementations from one high-level stencil algorithm. I plan to use the flexibility and expressiveness afforded by logical indexing as a foundation to implement future language features.</p>

Composable Brili Extensions

An Autoscheduler for Halide

Software Simulation for Data Streaming in HeteroCL

Evaluating the Performance Implications of Physical Addressing

Quantum Vectorization

A Simple Way to Implement a Bad High Level Synthesis Compiler

CompCert: Formally Verified C Compiler

Provably Correct Peephole Optimizations with Alive

CS 6120

Bringing You Up to Speed on How Compiling WebAssembly is Faster

Logical Tensor Indexing for SPMD Programming

Complexity Minimizer

Bril JIT with On-Stack Replacement

A Dependently Typed Language

Finding Redundant Structures in Data Flow Graphs
