<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
    <channel>
        <title>CS 6120</title>
        <link>https://www.cs.cornell.edu/courses/cs6120/2019fa</link>
        <description></description>
        <generator>Zola</generator>
        <language>en</language>
        <atom:link href="https://www.cs.cornell.edu/courses/cs6120/2019fa/rss.xml" rel="self" type="application/rss+xml"/>
        <icon>https://www.cs.cornell.edu/courses/cs6120/2019fa/img/favicon.ico</icon>
        <lastBuildDate>Fri, 06 Mar 2020 00:00:00 +0000</lastBuildDate>
        
            <item>
                <title>Bringing You Up to Speed on How Compiling WebAssembly is Faster</title>
                <pubDate>Fri, 06 Mar 2020 00:00:00 +0000</pubDate>
                <link>https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/wasm/</link>
                <guid>https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/wasm/</guid>
                <description>&lt;p&gt;If you&#x27;re a geography buff like me (you&#x27;re probably not, but one can hope), 
you might remember the time when you downloaded an app to get the exhilarating
experience of zooming in on the top of a volcano, an island in the middle of the 
Pacific, or the top of the Eiffel Tower. Do you also remember when you 
could first do these feats from the convenience of the Chrome browser? Well, 
you might have just heard that now you can do it in (almost) any browser out there. 
For me, that&#x27;s what WebAssembly brings.&lt;&#x2F;p&gt;
&lt;p&gt;In this article, I won&#x27;t talk much about the language (or binary code format) 
itself, but rather about the implications of its compilation, which make 
things like &lt;a href=&quot;https:&#x2F;&#x2F;www.google.com&#x2F;earth&#x2F;&quot;&gt;Google Earth&lt;&#x2F;a&gt; in a browser possible.&lt;&#x2F;p&gt;
&lt;img src=&quot;google-earth.png&quot; width=&quot;700&quot; &gt;
&lt;h2 id=&quot;what-is-webassembly&quot;&gt;What is WebAssembly?&lt;&#x2F;h2&gt;
&lt;p&gt;WebAssembly is a binary code format for transferring web applications from the 
server to the browser. It is incorporated in modern browsers to be used in 
tandem with existing JavaScript applications, and it reuses components of 
existing JavaScript engines to decode and execute code. It has taken the World 
Wide Web by storm because:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;it has been rolled out by four major browsers (Chrome, Edge, Firefox and Safari),
i.e., it is platform independent&lt;&#x2F;li&gt;
&lt;li&gt;it is programming model independent&lt;&#x2F;li&gt;
&lt;li&gt;it is hardware independent&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;What this means is that you could write a web application in C, compile it to 
WebAssembly and use it in any browser on any hardware. In general, this performs 
much better than a JavaScript program, which pays overheads for determining 
types at run time as well as for parsing and verification.
Existing technologies also have performance implications that depend on which browser 
you target (different browsers use different techniques to optimize the applications 
they run).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;is-webassembly-just-c-on-browsers&quot;&gt;Is WebAssembly just C on browsers?&lt;&#x2F;h2&gt;
&lt;p&gt;WebAssembly does bring the performance of pre-compiled languages to web browsers. 
But it does so side by side with traditional JavaScript running on the same engines.
This gives browsers the best of both worlds: they can drive the Ferrari on the race track 
(run performant WebAssembly when they need performance) or drive the Prius to the 
grocery store (quickly get a simple web application up and running).&lt;&#x2F;p&gt;
&lt;p&gt;With WebAssembly, you could use JavaScript for fast development, but also 
use C where you need performance. You could use C (and many other languages) where static typing is
useful, but JavaScript where dynamic typing is a necessity for productivity.
And you could combine these modules, leveraging interoperability to balance productivity and performance,
as both these styles can now be executed in the same compiler flow.
So WebAssembly is more than just C on browsers; it&#x27;s a carefully coordinated, 
well-designed, massive engineering effort to reconcile two worlds at odds.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-does-it-differ-from-existing-tools&quot;&gt;How does it differ from existing tools?&lt;&#x2F;h2&gt;
&lt;p&gt;So WebAssembly is portable, but doesn&#x27;t the Java Virtual Machine (JVM) already 
offer portability? Even though the overall goals have some similarities, Java 
cannot be run in a browser without plugins and doesn&#x27;t have the language
flexibility WebAssembly provides. This is because Java&#x27;s portability comes 
from the ability to run Java programs on different platforms; it was not
designed to run programs written in multiple languages in a browser. Java also doesn&#x27;t solve the issues
JavaScript poses through interpreter-based execution and
garbage collection. Moreover, WebAssembly is designed to make validation easier. 
Java was not designed with formalization in mind, and is therefore complex to validate.
Structured control flow and single-pass compilation and validation make WebAssembly a 
more mature option than existing bytecode formats such as JVM bytecode and CIL 
(Common Intermediate Language), which contain features such as irreducible loops and unbalanced locking 
that are hard for JIT compilers to handle and are therefore relegated to 
the interpreter. These design decisions have made WebAssembly a better 
candidate for a performant and portable way to run different languages 
natively in a browser.&lt;&#x2F;p&gt;
&lt;p&gt;WebAssembly also performs better than recent technologies with a similar
goal. Firstly, it can outperform &lt;a href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2017&#x2F;03&#x2F;why-webassembly-is-faster-than-asm-js&#x2F;&quot;&gt;asm.js&lt;&#x2F;a&gt; through optimizations beyond what can be done with 
JavaScript: WebAssembly downloads faster thanks to its smaller code size, and it does not require
parsing as it&#x27;s already in a binary format. Even though asm.js compilation is different 
from regular JavaScript interpretation, and asm.js is statically typed like WebAssembly, 
it is still limited by JavaScript and is not as compact as WebAssembly as a transport format. 
Moreover, WebAssembly is designed to use CPU features that are not expressible in JavaScript, such as 64-bit integers.
Secondly, it replaces &lt;a href=&quot;https:&#x2F;&#x2F;developer.chrome.com&#x2F;native-client&quot;&gt;NaCl&lt;&#x2F;a&gt; as the go-to tool for running C programs in a browser,
as WebAssembly is well integrated with the JavaScript ecosystem, whereas NaCl relied on
sandboxing within applications to integrate. In fact, the two optimization
fronts offered by these two tools provided the incentive to merge them 
into WebAssembly.&lt;&#x2F;p&gt;
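&lt;p&gt;To make the 64-bit integer point concrete, here is a minimal sketch in C of the kind of code that benefits. The function name &lt;code&gt;mul64&lt;&#x2F;code&gt; is hypothetical, and the sketch assumes &lt;code&gt;unsigned long long&lt;&#x2F;code&gt; is exactly 64 bits wide, as it is on common toolchains. WebAssembly has a native 64-bit integer type, so a compiler can lower the multiplication to a single &lt;code&gt;i64.mul&lt;&#x2F;code&gt; instruction, whereas a JavaScript-based target would typically have to emulate it with 32-bit pieces.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```c
/* Hypothetical example: multiply two 64-bit unsigned integers.
   Unsigned overflow wraps around modulo 2^64 (assuming a 64-bit
   unsigned long long). */
unsigned long long mul64(unsigned long long a, unsigned long long b) {
  return a * b;
}
```
&lt;&#x2F;pre&gt;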
&lt;h2 id=&quot;compiling-for-webassembly&quot;&gt;Compiling for WebAssembly&lt;&#x2F;h2&gt;
&lt;p&gt;The overall execution flow of WebAssembly has two phases. First, the frontend
compiles and optimizes C (and other) programs to WebAssembly. Then the WebAssembly
bytecode is downloaded to the browser, where the time-critical compilation 
to machine code occurs. It is this second phase that is compared against conventional JavaScript
execution.&lt;&#x2F;p&gt;
&lt;img src=&quot;wasm.png&quot; width=&quot;700&quot; &gt;
&lt;p&gt;A major component of the speedup from WebAssembly comes from compilation.
With JavaScript, your JavaScript engine goes through the phases of 
parsing, baseline compilation, optimizing compilation, re-optimization and 
bailout, execution and garbage collection to run an application. 
WebAssembly makes each of these stages more performant, and even completely
obviates the need for some of them.&lt;&#x2F;p&gt;
&lt;p&gt;To begin with, WebAssembly is more compact than JavaScript source code, 
making it faster to fetch from the server. Then WebAssembly doesn&#x27;t need 
parsing; it&#x27;s already compiled down to virtual instructions which only 
need decoding, as in actual hardware. The WebAssembly compiler can decode them 
much faster than the JavaScript engine can parse JavaScript code into an IR.&lt;&#x2F;p&gt;
&lt;p&gt;Then the benefits of WebAssembly during actual compilation kick in. 
JavaScript needs to be compiled to multiple versions based on what types 
are in use (as with any other dynamically typed language). WebAssembly 
is a sound, statically typed language, whose types are encoded during offline compilation to WebAssembly.
Therefore, the engine doesn&#x27;t need to monitor execution to figure out 
the types, or to maintain multiple versions. Unlike with JavaScript, 
the engine also doesn&#x27;t need to do most optimizations for WebAssembly, except for platform- and hardware-dependent 
ones, as everything else is already done at static compile time.&lt;&#x2F;p&gt;
&lt;p&gt;Since WebAssembly doesn&#x27;t need assumptions (such as which type a certain 
object has) during execution, the compiler doesn&#x27;t need to 
bail out and reoptimize, as mispredictions of this kind never occur. Finally, WebAssembly also allows you to manage memory manually (it only 
supports manual memory management as of now, but automatic management is to be added
as an option), which allows you to avoid expensive garbage collection.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, WebAssembly only requires the compiler backend, a module-loading frontend, 
security sandboxing and supporting VM components from the existing JavaScript engine,
and skips parsing, most compiler optimizations, bailouts and garbage collection.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementing-webassembly-compilers&quot;&gt;Implementing WebAssembly compilers&lt;&#x2F;h2&gt;
&lt;p&gt;WebAssembly is walking a tightrope between high performance (the C world) and 
safety and portability (the JS world). It is hard to theorize how to optimize for
it, so the WebAssembly team implemented extensions in different browsers to
validate that it is indeed possible to achieve this goal.&lt;&#x2F;p&gt;
&lt;p&gt;V8 (from Chrome) and SpiderMonkey (from Firefox) reuse their optimizing JIT 
compilers to compile ahead of time. This provides them predictable high 
performance, as opposed to the unpredictable warmup times with JavaScript.
Chakra (from Edge) uses lazy translation and optimizes only hot code. This 
achieves faster startup time.&lt;&#x2F;p&gt;
&lt;p&gt;To permit efficient use of baseline JIT compilers, WebAssembly is designed to
allow fast validation and to let a single pass track registers for allocation without generating
an IR. This is done through careful design of WebAssembly instructions, from which 
the pass can extract register information. To integrate well with optimizing JIT compilers, WebAssembly is 
designed to allow direct-to-SSA translation (WebAssembly is not in SSA form, but offers some tools
to derive an SSA form). Moreover, structured control flow 
makes decoding simpler.&lt;&#x2F;p&gt;
&lt;p&gt;Another key aspect of running native code in a browser is security. Sandboxing
for C&#x2F;C++ has traditionally required extensive code rewriting. WebAssembly, however, is executed
in a sandboxed environment clearly separated from the host, making it possible
to provide some security guarantees. To keep the performance overhead of sandboxing low 
and run at near-native speed, WebAssembly follows the model created by &lt;a href=&quot;https:&#x2F;&#x2F;developer.chrome.com&#x2F;native-client&quot;&gt;Native Client&lt;&#x2F;a&gt;,
which employs static validation of x86 machine code by requiring code generators 
to follow certain patterns.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-good-is-it&quot;&gt;How good is it?&lt;&#x2F;h2&gt;
&lt;p&gt;Writing code in WebAssembly doesn&#x27;t mean it&#x27;ll automatically be faster. 
However, programs are already optimized during compilation to WebAssembly 
(where the compiler has more time to optimize), trading off flexibility for an 
optimized general case, and the WebAssembly compiler in the browser can then apply 
additional optimizations more effectively than it could on generic JavaScript. So in practice
WebAssembly can be much more performant.&lt;&#x2F;p&gt;
&lt;p&gt;The following bar graph shows how WebAssembly performs relative to
native code (running a C application). Most benchmarks are within 10% of 
native performance.&lt;&#x2F;p&gt;
&lt;img src=&quot;figure-5.png&quot; width=&quot;700&quot; &gt;
&lt;p&gt;An &lt;a href=&quot;https:&#x2F;&#x2F;spectrum.ieee.org&#x2F;computing&#x2F;software&#x2F;webassembly-will-finally-let-you-run-highperformance-applications-in-your-browser&quot;&gt;article&lt;&#x2F;a&gt; by some of the designers of WebAssembly suggests that WebAssembly 
can run at about 80% of the performance of native applications. The WebAssembly paper also 
reports that its performance is about 33.7% better than that of asm.js, the previous state-of-the-art mechanism for 
executing native programs in a browser.&lt;&#x2F;p&gt;
&lt;img src=&quot;figure-6.png&quot; width=&quot;700&quot; &gt;
&lt;p&gt;This scatter plot illustrates WebAssembly&#x27;s benefits in terms of code size. WebAssembly binaries
are on average 37.5% smaller than asm.js code and 14.7% smaller than native code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-does-this-mean&quot;&gt;What does this mean?&lt;&#x2F;h2&gt;
&lt;p&gt;Well, as I said, Google Earth in any browser. Just in case that&#x27;s not enough:
playing a game in a browser is going to be very similar to downloading an app
and running it natively. You can share your CAD design with someone over the web
and have them take a look at it without downloading the files or even having the
necessary tools installed. You&#x27;ll see a lot more sophisticated features on 
websites such as social media. All you need is a device with a browser to 
experience almost any service. Downloading an app is not going to be the same: 
why would you, when you get similar (in human perception) performance by running
it in a browser without the overhead of installing?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-s-next&quot;&gt;What&#x27;s next?&lt;&#x2F;h2&gt;
&lt;p&gt;WebAssembly has already achieved a lot in a short span of time (3 years 
since its introduction in 2017). It is already supported in all major browsers, 
and the popular development kit &lt;a href=&quot;https:&#x2F;&#x2F;emscripten.org&quot;&gt;Emscripten&lt;&#x2F;a&gt; offers compiling C down to &lt;code&gt;.wasm&lt;&#x2F;code&gt;
via &lt;code&gt;asm.js&lt;&#x2F;code&gt;. For me personally, it has already enabled running Google Earth in 
any browser (what more do you need!!).&lt;&#x2F;p&gt;
&lt;p&gt;Yet, WebAssembly is in its infancy. The popular compiler toolchain, 
&lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&quot;&gt;the LLVM project&lt;&#x2F;a&gt;, has already added a backend for it and developers are adding debug tools
to improve the WebAssembly ecosystem.&lt;&#x2F;p&gt;
&lt;p&gt;More exciting and enabling future additions to WebAssembly may include:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;streaming compilation: start compiling while the bytecode is still being downloaded,&lt;&#x2F;li&gt;
&lt;li&gt;shared memory concurrency: reduce synchronization costs by handling it 
efficiently on shared memory, and&lt;&#x2F;li&gt;
&lt;li&gt;SIMD: parallelize execution by applying a single instruction to multiple data elements.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;how-to-learn-more&quot;&gt;How to learn more...&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;people.mpi-sws.org&#x2F;%7Erossberg&#x2F;papers&#x2F;Haas,%20Rossberg,%20Schuff,%20Titzer,%20Gohman,%20Wagner,%20Zakai,%20Bastien,%20Holman%20-%20Bringing%20the%20Web%20up%20to%20Speed%20with%20WebAssembly.pdf&quot;&gt;WebAssembly paper&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2017&#x2F;02&#x2F;a-cartoon-intro-to-webassembly&#x2F;&quot;&gt;A cartoon intro to WebAssembly&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;spectrum.ieee.org&#x2F;computing&#x2F;software&#x2F;webassembly-will-finally-let-you-run-highperformance-applications-in-your-browser&quot;&gt;WebAssembly will finally let you run high performance applications in your browser&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
            </item>
        
            <item>
                <title>Logical Tensor Indexing for SPMD Programming</title>
                <pubDate>Sat, 21 Dec 2019 00:00:00 +0000</pubDate>
                <link>https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/logical-indexing-on-hb/</link>
                <guid>https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/logical-indexing-on-hb/</guid>
                <description>&lt;p&gt;My current research explores programming abstractions for an experimental, parallel architecture (codenamed HammerBlade).
The language feature prototyped in this project aims to simplify &lt;em&gt;SPMD&lt;&#x2F;em&gt; programming, wherein all nodes apply the same kernel over different pieces of input data.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;cross-cutting-concerns-in-spmd-programming&quot;&gt;Cross-cutting Concerns in SPMD Programming&lt;&#x2F;h1&gt;
&lt;p&gt;The need to distribute inputs over cores in SPMD programs introduces problems that are orthogonal to the kernel, but that nonetheless affect implementation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data format of inputs&#x2F;outputs&lt;&#x2F;strong&gt;: 
Each tensor $A$ must be encoded in a particular &lt;strong&gt;format&lt;&#x2F;strong&gt;, $f_A$, that maps the &lt;strong&gt;logical index&lt;&#x2F;strong&gt; of each tensor element $A\langle x \rangle \langle y \rangle$ to &lt;strong&gt;physical index&lt;&#x2F;strong&gt; $f_A(x,y)$ into array $A&#x27;$, s.t., $A\langle x \rangle \langle y \rangle = A&#x27;[f_A(x,y)]$.&lt;&#x2F;p&gt;
&lt;p&gt;Low-level languages like C require this dimensionality reduction to maximize performance contributors such as data locality and bulk memory allocations&#x2F;movements.
Different formatting choices lead to &lt;em&gt;different implementations&lt;&#x2F;em&gt; of the &lt;em&gt;same kernel&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We constrain our prototype to use the &lt;strong&gt;row-major order&lt;&#x2F;strong&gt; format, which sequentially lays out rows of $A$ via $f_A(x,y) = x * stride + y$, where $stride$ indicates the number of elements per row of $A&#x27;$.
Our results generalize to similar formats (e.g., &lt;em&gt;column-major order&lt;&#x2F;em&gt;), and we leave exploration of more exotic formats to future work.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Caching&lt;&#x2F;strong&gt;:
Preemptive promotion of required inputs in the memory hierarchy, another form of data locality, often leads to significant performance gains.
However, the choice of caching can change data formatting. 
For instance, if a &lt;strong&gt;row-major order&lt;&#x2F;strong&gt; sub-matrix is cached from a &lt;strong&gt;row-major order&lt;&#x2F;strong&gt; matrix, its stride will be different.
Thus, the choice of caching also leads to &lt;em&gt;different implementations&lt;&#x2F;em&gt; of the &lt;em&gt;same kernel&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This project implements a formal separation of &lt;strong&gt;logical indexing&lt;&#x2F;strong&gt; from &lt;strong&gt;data formatting&lt;&#x2F;strong&gt; via a simple 2-D stencil algorithm example.&lt;&#x2F;p&gt;
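&lt;p&gt;As a small illustration of why caching changes the data format, here is a sketch in C that is not taken from the prototype. The helper names &lt;code&gt;row_major_index&lt;&#x2F;code&gt; and &lt;code&gt;cache_tile&lt;&#x2F;code&gt; are hypothetical. It copies an n-by-m tile out of a row-major parent into a compact buffer: reads use the parent&#x27;s stride, while writes use the tile&#x27;s own stride m, so the same logical element has a different physical index in each array.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```c
/* Hypothetical helper: physical index of logical element (x, y) in a
   row-major array with `stride` elements per row. */
static int row_major_index(int x, int y, int stride) {
  return x * stride + y;
}

/* Hypothetical helper: copy the n-by-m tile whose origin sits at
   (row_off, col_off) of a row-major parent into a compact row-major
   buffer. The cached copy has stride m, not parent_stride. */
static void cache_tile(const float *parent, int parent_stride,
                       int row_off, int col_off,
                       float *tile, int n, int m) {
  for (int x = 0; x != n; x++) {
    for (int y = 0; y != m; y++) {
      tile[row_major_index(x, y, m)] =
          parent[row_major_index(row_off + x, col_off + y, parent_stride)];
    }
  }
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For a 3-by-4 parent and a 2-by-2 tile at offset (1, 1), logical element (1, 0) of the tile lives at physical index 9 in the parent but at physical index 2 in the cached copy.&lt;&#x2F;p&gt;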
&lt;h1 id=&quot;stencil-algorithm-using-logical-indexing&quot;&gt;Stencil Algorithm Using Logical Indexing&lt;&#x2F;h1&gt;
&lt;p&gt;The proposed syntax for logical indexing uses angle brackets &lt;code&gt;A&amp;lt;x&amp;gt;&amp;lt;y&amp;gt;&lt;&#x2F;code&gt; to distinguish from traditional array indexing &lt;code&gt;A[x][y]&lt;&#x2F;code&gt;.
We demonstrate this language feature by means of a 2-D stencil algorithm that computes 2-D tensor $B\langle x \rangle \langle y \rangle$ from tensor $A$ as the average of $A\langle x \rangle \langle y \rangle$ and its neighbors. &lt;&#x2F;p&gt;
&lt;p&gt;At the boundaries of $B$ (i.e., $\forall x, y, B\langle 0 \rangle \langle y \rangle$, $B\langle n \rangle \langle y \rangle$, $B\langle x \rangle \langle 0 \rangle$, $B\langle x \rangle \langle m \rangle$), the algorithm simply assigns the input directly.
In SPMD style, each core transforms a block of data, denoted $A_{\text{tile}}$ into a block of output, denoted $B_{\text{tile}}$.
The pseudo-code below uses logical indexing to capture this behavior; note that a more efficient implementation is possible but unnecessary for our purposes:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x : B_tile.row_ix &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;in&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B_tile.row_range) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(y : B_tile.col_ix &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;in&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B_tile.col_range)  {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;as&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B.row_ix &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;B.max_row&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
       &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;as&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B.col_ix &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;B.max_col&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; {
      B&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; = neighbor_avg(A&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
    }
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
      B&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; = A&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    }
  &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both loops iterate over zero-based logical row and column ranges, &lt;code&gt;B_tile.row_range&lt;&#x2F;code&gt; and &lt;code&gt;B_tile.col_range&lt;&#x2F;code&gt;, with a branch detecting points at a logical boundary where behavior differs from inner points.&lt;&#x2F;p&gt;
&lt;p&gt;To determine that an index from sub-tensor $B_{\text{tile}}$ resides on the logical boundary of parent tensor $B$, we perform &lt;strong&gt;index casting&lt;&#x2F;strong&gt; from &lt;code&gt;B_tile.row_ix&lt;&#x2F;code&gt; to &lt;code&gt;B.row_ix&lt;&#x2F;code&gt; and analogously for columns.&lt;&#x2F;p&gt;
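&lt;p&gt;Index casting can be sketched in C as follows, under the assumption that a tile records the logical position of its origin within the parent. The names &lt;code&gt;cast_ix&lt;&#x2F;code&gt; and &lt;code&gt;on_boundary&lt;&#x2F;code&gt; are hypothetical and are not part of the prototype; the sketch also assumes that indices are already within range.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```c
/* Hypothetical: cast a tile-local index to the parent's index space by
   adding the offset of the tile's origin within the parent. */
static int cast_ix(int tile_ix, int tile_off) {
  return tile_off + tile_ix;
}

/* Hypothetical: 1 if a parent index lies on the first or last row (or
   column) of a dimension with `extent` entries, else 0. Assumes the
   index is already within [0, extent - 1]. */
static int on_boundary(int parent_ix, int extent) {
  if (parent_ix == 0) return 1;
  if (parent_ix == extent - 1) return 1;
  return 0;
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With an 8-row parent split into two 4-row tiles, row 0 of the second tile casts to parent row 4, which is interior, even though it is the first row of its tile.&lt;&#x2F;p&gt;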
&lt;h1 id=&quot;specifying-the-data-format&quot;&gt;Specifying the Data Format&lt;&#x2F;h1&gt;
&lt;p&gt;By expressing the stencil algorithm in terms of logical indexing, we disentangle it from implementation details of data formatting and caching.&lt;&#x2F;p&gt;
&lt;p&gt;We can specify these details by decorating inputs and outputs in the kernel signature as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;stencil_kernel(input[blocked cached, row-major] A : float[n][m], output[non-cached, row-major] B: float[n][m])
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This signature specifies that both &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; are formatted in row-major order, that a block of &lt;code&gt;A&lt;&#x2F;code&gt; will be cached by each core, and that cores will write directly to &lt;code&gt;B&lt;&#x2F;code&gt; in global memory.
Such a specification suggests to the compiler particular implementations for data formats $f_A, f_B$.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;stencil-algorithm-with-logical-indexing-in-c&quot;&gt;Stencil Algorithm with Logical Indexing in C&lt;&#x2F;h1&gt;
&lt;p&gt;Logical indexing and data formatting will be incorporated into my research language; however, compilation for this syntax has not been implemented yet. 
Here, I share a prototype of logical indexing in C that generates cached and non-cached variants of the above stencil algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ll take the stencil algorithm as a diving board into the interplay of logical indexing and data formatting:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; stencil algorithm
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;stencil&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(tensor A, tensor B) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B.n; x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B.m; y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_row(B, x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_row(B, x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;top_level_tensor(B)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_col(B, y) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_col(B, y) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;top_level_tensor(B)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
        log_write(B, x, y, avg(A, x, y));
      }
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
        log_write(B, x, y, log_read(A, x, y));
      }
    }
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The translation from the earlier pseudo-code is nearly straightforward: logical reads and writes are respectively denoted &lt;code&gt;log_read&lt;&#x2F;code&gt; and &lt;code&gt;log_write&lt;&#x2F;code&gt;.
Note the &lt;code&gt;tensor&lt;&#x2F;code&gt; type, which collects data formatting information of an array to implement logical data operations; it&#x27;s pretty &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dope_vector&quot;&gt;dope&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;typedef struct&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tensor
  { &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tensor &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;parent;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; arr;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; stride; 
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; physical offset of logical origin in parent
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; phys_row_off; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; phys_col_off; 
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; cached;
  } tensor;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using this information, we implement &lt;code&gt;log_read&lt;&#x2F;code&gt; as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;log_read&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(tensor &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_row, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_col) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; phys_row &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_row(A, log_row);
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; phys_col &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_col(A, log_col);

    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;arr[phys_row &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;stride &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; phys_col];
  }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;log_to_phys_row&lt;&#x2F;code&gt; applies sub-tensor offsets to the logical row index recursively up the nested sub-tensor structure:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;log_to_phys_row&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(tensor &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_row) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;cached &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;||&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;parent &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;NULL &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;?&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; 
    log_row &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;phys_row_off&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_row(A-&amp;gt;parent, log_row &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;phys_row_off);
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Importantly, we do not recurse over cached sub-tensors to compute the physical index, since such tensors are only logically, but not physically, nested inside the parent tensor.&lt;&#x2F;p&gt;
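&lt;p&gt;As a rough illustration of that rule, the sketch below restates &lt;code&gt;log_to_phys_row&lt;&#x2F;code&gt; on a stripped-down tensor and applies it to a three-level nest. The offsets (4 and 1), the &lt;code&gt;nest&lt;&#x2F;code&gt; array, and the &lt;code&gt;int&lt;&#x2F;code&gt; flag used in place of &lt;code&gt;bool&lt;&#x2F;code&gt; are assumptions made to keep the example self-contained; they are not taken from the actual implementation.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```c
/* Stripped-down tensor: only the fields the row mapping needs (assumed layout). */
typedef struct tensor {
  const struct tensor *parent;
  int phys_row_off;
  int cached; /* int instead of bool so the sketch needs no headers */
} tensor;

/* Hypothetical nest: nest[0] is the top-level tensor, nest[1] is a tile at
   row offset 4, nest[2] is a cached scratchpad copy with a one-row halo. */
static const tensor nest[3] = {
  { .phys_row_off = 0 },
  { .parent = nest, .phys_row_off = 4 },
  { .parent = nest + 1, .phys_row_off = 1, .cached = 1 },
};

/* Same rule as above: stop at a cached tensor or at the top level. */
int log_to_phys_row(const tensor *A, int log_row) {
  return A->cached || !A->parent
    ? log_row + A->phys_row_off
    : log_to_phys_row(A->parent, log_row + A->phys_row_off);
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With these assumed values, logical row 2 of the tile resolves to physical row 6 of the top-level array, while logical row -1 of the scratchpad resolves to physical row 0 of the scratchpad itself: the tile&#x27;s offset of 4 is never added because the recursion stops at the cached tensor.&lt;&#x2F;p&gt;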
&lt;p&gt;We calculate logical upper bounds by recursively traversing up the &lt;code&gt;parent&lt;&#x2F;code&gt;s of sub-tensors and returning the max row or column of the top-level &lt;code&gt;parent&lt;&#x2F;code&gt;.
Caching plays no role in calculating this, or any, logical quantity.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;tensor&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;top_level_tensor&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(tensor &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A-&amp;gt;parent &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;NULL &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;?&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; A &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;top_level_tensor(A-&amp;gt;parent);
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Unfortunately, index casting is not actually implemented yet.
Instead, the above implementation cheats by using the &lt;code&gt;log_to_phys&lt;&#x2F;code&gt; functions: rather than performing logical boundary checking, it performs physical boundary checking.
This works for our purposes, since the logical and physical boundaries are equivalent for the row-major order data format.&lt;&#x2F;p&gt;
&lt;p&gt;Note that these recursive functions for looking up tensor information are expensive: an actual compiler for logical indexing will do this lookup up-front to generate code with inlined values.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;cached-vs-non-cached-implementations&quot;&gt;Cached vs Non-cached Implementations&lt;&#x2F;h1&gt;
&lt;p&gt;Both the cached and non-cached implementations define tensors as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; cores_x;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; cores_y;

  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_base_row &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tid_x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_n;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_base_col &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tid_y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_m;

  tensor a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ .arr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a_arr,
      .n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n, .m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m,
      .stride &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m };
  tensor b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= 
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ .arr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_arr,
      .n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n, .m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m,
      .stride &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m };
  tensor a_tile &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= 
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ .arr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a_arr, 
      .parent &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a,
      .n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_n, .m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_m,
      .stride &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m,
      .phys_row_off &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_base_row, .phys_col_off &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_base_col};
  tensor b_tile &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= 
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ .arr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_arr, 
      .parent &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b,
      .n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_n, .m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_m,
      .stride &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m,
      .phys_row_off &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_base_row, .phys_col_off &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b_base_col};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;a_tile&lt;&#x2F;code&gt; and &lt;code&gt;b_tile&lt;&#x2F;code&gt; represent the sub-tensors from input &lt;code&gt;A&lt;&#x2F;code&gt; and output &lt;code&gt;B&lt;&#x2F;code&gt; that are operated on by the current tile.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;global memory (non-cached) implementation&lt;&#x2F;strong&gt; uses these tensor definitions to run the stencil algorithm as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  stencil(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a_tile, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b_tile);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;strong&gt;cached implementation&lt;&#x2F;strong&gt; defines the same tensors, but additionally performs a caching routine that moves data from tensor &lt;code&gt;a_tile&lt;&#x2F;code&gt; to a new tensor &lt;code&gt;a_scr&lt;&#x2F;code&gt;, as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
tensor a_scr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ .arr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; scr,
    .parent &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a_tile,
    .n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_n&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, .m &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tile_m&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,
    .stride &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; m&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,
    .phys_row_off &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, .phys_col_off &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,
    .cached &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};

  cache(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a_tile, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a_scr);
  stencil(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a_scr, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b_tile);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The caching strategy reads a window around the tile data consisting of two additional rows and columns:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;cache&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(tensor &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;input, tensor &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;cache) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; input-&amp;gt;n&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; input-&amp;gt;m&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_row(input, x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_row(input, x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;top_level_tensor(input)-&amp;gt;n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_col(input, y) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;
          &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;log_to_phys_col(input, y) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;top_level_tensor(input)-&amp;gt;m ) {
        log_write(cache, x, y,
            log_read(input, x, y));
      }
    }
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This caching routine shows how we can use the physical offset to arbitrarily shift the otherwise zero-based logical indexing.
Using physical offset for this purpose is also cheating, since logical indexing may in principle be changed while preserving physical indexing.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;evaluation-cached-vs-non-cached&quot;&gt;Evaluation: Cached vs Non-cached&lt;&#x2F;h1&gt;
&lt;p&gt;Both stencil implementations ran on a gem5 simulation of a 4x4 HammerBlade architecture.
Both stencil implementations were tested using matrices with elements equal to their row-major order indices. 
This choice of matrix allowed for simple confirmation of correctness, since these matrices are not transformed by the stencil algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;We compare performance of both implementations by reading the &lt;code&gt;numCycles&lt;&#x2F;code&gt; statistic produced by the simulation for each input size and averaging the result over the four cores, and over three trials:&lt;&#x2F;p&gt;
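&lt;p&gt;&lt;strong&gt;non-cached&lt;&#x2F;strong&gt; (average &lt;code&gt;numCycles&lt;&#x2F;code&gt; per input size)&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;4x4&lt;&#x2F;th&gt;&lt;th&gt;8x8&lt;&#x2F;th&gt;&lt;th&gt;16x16&lt;&#x2F;th&gt;&lt;th&gt;34x34&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;7605&lt;&#x2F;td&gt;&lt;td&gt;28740&lt;&#x2F;td&gt;&lt;td&gt;123769&lt;&#x2F;td&gt;&lt;td&gt;580502&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;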
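&lt;p&gt;&lt;strong&gt;cached&lt;&#x2F;strong&gt; (average &lt;code&gt;numCycles&lt;&#x2F;code&gt; per input size)&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;4x4&lt;&#x2F;th&gt;&lt;th&gt;8x8&lt;&#x2F;th&gt;&lt;th&gt;16x16&lt;&#x2F;th&gt;&lt;th&gt;34x34&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;14592&lt;&#x2F;td&gt;&lt;td&gt;35961&lt;&#x2F;td&gt;&lt;td&gt;115200&lt;&#x2F;td&gt;&lt;td&gt;474928&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;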
&lt;h1 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h1&gt;
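&lt;p&gt;These results show that the cost of caching is amortized as the input grows: the cached implementation takes nearly twice as many cycles on the 4x4 input (14592 vs. 7605) but about 18% fewer on the 34x34 input (474928 vs. 580502), with the crossover falling between the 8x8 and 16x16 inputs.&lt;&#x2F;p&gt;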
&lt;p&gt;Although this result is expected, I wondered why the cost of caching seemed so high.
Cursory testing revealed that removing the branching from the cache loop removed ~10k cycles from the 4x4 and 8x8, ~40k cycles from the 16x16 input, and ~130k cycles from 34x34.
These results suggest that a more efficient caching routine that minimizes branching would improve performance.&lt;&#x2F;p&gt;
&lt;p&gt;Further analysis can be done to determine how much the simulation penalizes global memory vs scratchpad operations.
This analysis would also inform whether caching on the output might present an opportunity for additional performance benefits.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h1&gt;
&lt;p&gt;Thanks to logical indexing, we generated two implementations from one high-level stencil algorithm.
I plan to use the flexibility and expressiveness afforded by logical indexing as a foundation to implement future language features.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Complexity Minimizer</title>
                <pubDate>Fri, 20 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/internal-function-memoization/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/internal-function-memoization/</guid>
                <description>&lt;p&gt;The goal of this project is to transform a subset of recursive programs into their iterative counterparts. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;&#x2F;h2&gt;
&lt;p&gt;Execution time of a program is important. As a metric, execution time is relevant not just for saving loads of money from unnecessary compute, but is also relevant in terms of functionality. At the limit, writing faster programs can enable us to solve larger problems that require massive computational work in a tangible amount of time.
Compilers can make a given program run faster. Often, we tend to look at classic compiler optimizations as a way to iteratively speed up a given program.&lt;&#x2F;p&gt;
&lt;p&gt;However, the scope of the optimization pass model is limited, especially when the program is algorithmically inefficient. If we are able to transform programs with an unnecessary $O(n)$ space cost into programs with $O(1)$ space cost, or programs with $O(2^n)$ runtime into programs with $O(n)$ runtime, why couldn&#x27;t a compiler do the same?&lt;&#x2F;p&gt;
&lt;p&gt;One of the challenges of this idea is the notion of an &lt;em&gt;algorithm&lt;&#x2F;em&gt;. As humans, we can read the program, determine the task, and write a better program given the necessary computation. However, most compilers do not have a notion of the &amp;quot;goal&amp;quot; of a program. Similar questions arise in program synthesis when determining the &amp;quot;specification&amp;quot; of a program.&lt;&#x2F;p&gt;
&lt;p&gt;Computing the &lt;em&gt;n&lt;&#x2F;em&gt;th Fibonacci number can be done recursively in $O(2^n)$ time or iteratively in $O(n)$ time. However, there is no compiler optimization that can produce the performance of the iterative algorithm when given the recursive one, so we attempt to build a generalizable optimization pass that transforms similar recursive programs into their iterative counterparts.&lt;&#x2F;p&gt;
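&lt;p&gt;To make that gap concrete, here is a small sketch of the naive recursion instrumented with a call counter. The name &lt;code&gt;fib_rec&lt;&#x2F;code&gt; and the global &lt;code&gt;calls&lt;&#x2F;code&gt; counter are illustrative additions, not part of the project.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```c
/* Illustrative only: a global counter records how many calls the naive recursion makes. */
static long calls = 0;

int fib_rec(int n) {
  calls++;
  if (n == 0) return 0;
  if (n == 1) return 1;
  return fib_rec(n - 1) + fib_rec(n - 2);
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Starting from &lt;code&gt;calls == 0&lt;&#x2F;code&gt;, &lt;code&gt;fib_rec(10)&lt;&#x2F;code&gt; returns 55 after 177 calls and &lt;code&gt;fib_rec(20)&lt;&#x2F;code&gt; needs 21891, whereas an iterative loop needs only &lt;em&gt;n&lt;&#x2F;em&gt; - 1 additions.&lt;&#x2F;p&gt;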
&lt;h2 id=&quot;approach&quot;&gt;Approach&lt;&#x2F;h2&gt;
&lt;p&gt;In order to scope our project to tangible action items, we separated our idea into smaller milestones.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;V0.&lt;&#x2F;em&gt; Transform a recursive Fibonacci program into an iterative program&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;V1.&lt;&#x2F;em&gt; Transform a recursive program with an arbitrary combination of elements on return (e.g., Fibonacci was f(n-1) + f(n-2))&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;V2.&lt;&#x2F;em&gt; Expand scope of input set of valid programs to two-dimensional DP programs&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-first-program&quot;&gt;A First Program&lt;&#x2F;h2&gt;
&lt;p&gt;Before digging in to any code, we spent time thinking deeply about recursive DP programs and their iterative counterparts. &lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;What are the distinguishing parts of the recursive implementation that can give rise to an equivalent iterative program?&lt;&#x2F;li&gt;
&lt;li&gt;Which parts of these recursive functions are similar enough that we can generalize across them?&lt;&#x2F;li&gt;
&lt;li&gt;How do we know these are equivalent?&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We decided to implement this at the LLVM IR level so that our optimization would be source-language-independent.&lt;&#x2F;p&gt;
&lt;p&gt;When considering iterative Fibonacci, there are two main parts: the base cases and the iteration. The base cases consist of returning 0 or 1 when &lt;em&gt;n&lt;&#x2F;em&gt; is 0 or 1 respectively, and the iteration part entails a while loop with an induction variable that iterates from where the base cases end to the desired &lt;em&gt;n&lt;&#x2F;em&gt; (i.e., from 2 to &lt;em&gt;n&lt;&#x2F;em&gt; because the last base case is Fibonacci 1).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;information-collection&quot;&gt;Information Collection&lt;&#x2F;h3&gt;
&lt;p&gt;Given a recursive function, we found that the following set of information was sufficient for generating an iterative version of that function.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Base Cases&lt;&#x2F;li&gt;
&lt;li&gt;Loop induction variable and its bounds&lt;&#x2F;li&gt;
&lt;li&gt;How the recursive calls are combined (e.g., Fibonacci adds the results of f(n-1) and f(n-2) together)&lt;&#x2F;li&gt;
&lt;li&gt;The offset from the induction variable in each recursive call (e.g., (n-1), (n-3), etc.)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In the case of Fibonacci, we demonstrate how an iterative version of the program can be created using only the information above. We use C code here for better readability, but our implementation generates this in LLVM IR.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int fibonacci(int n) {
  &#x2F;&#x2F; Base cases
  if(n==0) return 0;
  if(n==1) return 1;

  &#x2F;&#x2F; i1 and i2 store offset results for the current i
  &#x2F;&#x2F; i1 is Fibonacci(i-2), an offset of 2
  &#x2F;&#x2F; i2 is Fibonacci(i-1), an offset of 1
  int i1 = 0;
  int i2 = 1;

  &#x2F;&#x2F; i is the loop induction variable, bounded from 2 to n
  int i = 2;

  &#x2F;&#x2F; Iteration part
  while(i &amp;lt;= n){
    &#x2F;&#x2F; the combination instruction (Fibonacci only has one, but there can be multiple)
    int current = i1+i2;
    &#x2F;&#x2F; shift the stored results so they describe i+1
    i1 = i2;
    i2 = current;
    i++;
  }
  return i2;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Information collection posed many challenges due to the low-level nature of LLVM IR and our desire to generalize our solution.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;finding-the-base-cases&quot;&gt;Finding the Base Cases&lt;&#x2F;h4&gt;
&lt;p&gt;To more formally define the term &lt;em&gt;base cases&lt;&#x2F;em&gt;, we required base cases to be comparisons with the input argument to the function, found at the top of a program, that resulted in returning a value within the body guarded by the comparison.&lt;&#x2F;p&gt;
&lt;p&gt;While humans have no trouble identifying the base cases of a program, it is much more challenging to find them when sifting through LLVM IR. We used an ad-hoc approach: we found the constant return values and then walked up the CFG to find the conditional statements that lead to those constant outputs. From each condition instruction we extracted the function argument value that causes that constant to be returned. The result is a list of argument-return pairs, which sufficiently describes the base cases of the function.&lt;&#x2F;p&gt;
&lt;p&gt;Doing this better would involve abstract interpretation. For a particular function argument, it could trigger a recursive call or it could not. With abstract interpretation, we could find all values of the argument such that the basic blocks with recursive function calls are not traversed. Those would be the base cases. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;finding-the-loop-induction-variable&quot;&gt;Finding the Loop Induction Variable&lt;&#x2F;h4&gt;
&lt;p&gt;For this project, we simply initialized the induction variable at the largest base case argument and incremented until it reached the function call&#x27;s argument.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;finding-the-instructions-that-combine-the-recursive-calls&quot;&gt;Finding the Instructions that Combine the Recursive Calls&lt;&#x2F;h4&gt;
&lt;p&gt;We keep a running list of all recursive call instructions. For all instructions in the recursive function that use the result of a recursive call as an operand, we clone them and add the clone to a list of &lt;em&gt;dependents&lt;&#x2F;em&gt; as they depend on the recursive calls. We clone instructions here so we can delete all instructions in the recursive function and insert our own. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;finding-recursive-call-offsets&quot;&gt;Finding Recursive Call Offsets&lt;&#x2F;h4&gt;
&lt;p&gt;For each recursive call that we find, we look at the call argument and assert that it is a constant offset from the original function argument, i.e., of the form $(n-c)$ where $n$ is the original function argument and $c$ is a constant. We record that $c$ for use during code generation.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;creating-a-model-for-iterative-programs-to-use-collected-information&quot;&gt;Creating a Model for Iterative Programs to use Collected Information&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have the information from the recursive program, we need a model that, given this information, can generate an iterative program. Instead of writing a single transformation function &lt;code&gt;LLVM_Pass(Input Recursive Program) = Output Iterative Program&lt;&#x2F;code&gt;, we break down the programs according to a layout.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the layout of recursive and iterative Fibonacci programs below.&lt;&#x2F;p&gt;
&lt;img src=&quot;fiblay.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;Here, we separate the base cases from the recursive calls in order to know which set of instructions will need to be in a loop, and which set will only need to be called once. It is important to note that in the black box, you can have any set of instructions to combine the recursive calls.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the layout of the iterative version of the same program below.&lt;&#x2F;p&gt;
&lt;img src=&quot;memo.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;Note: We keep the function header the same in order to more easily swap out invocations to Fibonacci outside the function with our generated function.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;new-code-generation&quot;&gt;New Code Generation&lt;&#x2F;h3&gt;
&lt;p&gt;To minimize changes to the overall code structure, we delete all blocks and instructions inside the original recursive function and insert our new instructions.&lt;&#x2F;p&gt;
&lt;p&gt;First we add all the base case conditionals. Next, we declare and initialize all the values we need for iteration and the incrementing iterator. After that, we create the while loop and clone in the instructions dependent on the original recursive calls (e.g., the add instruction in fib(n-1) + fib(n-2)). Lastly, we add a return statement after the while loop to finish the function. &lt;&#x2F;p&gt;
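&lt;p&gt;To make these steps concrete, here is a minimal Python sketch (not our LLVM implementation) of the loop that can be built from the collected information. The names &lt;code&gt;iterate_recurrence&lt;&#x2F;code&gt;, &lt;code&gt;base_cases&lt;&#x2F;code&gt;, &lt;code&gt;offsets&lt;&#x2F;code&gt;, and &lt;code&gt;combine&lt;&#x2F;code&gt; are hypothetical, and the Katinacci base cases are made up for illustration. The sketch assumes the base cases are contiguous starting at 0 and that there are at least as many of them as the largest offset.&lt;&#x2F;p&gt;

```python
def iterate_recurrence(base_cases, offsets, combine, n):
    """Evaluate f(n) bottom-up from the collected information.

    base_cases: dict mapping an argument to its returned constant, e.g. {0: 0, 1: 1}
    offsets:    the constants c from each recursive call f(n - c), e.g. [1, 2]
    combine:    stand-in for the cloned "dependent" instructions; it receives
                the recursive results in the same order as offsets
    """
    if n in base_cases:
        return base_cases[n]

    # Sliding window of the most recent results, newest value last,
    # so window[-c] is the value at offset c from the induction variable.
    window_size = max(offsets)
    start = max(base_cases) + 1
    window = [base_cases[start - k] for k in range(window_size, 0, -1)]

    # The induction variable runs from just past the largest base case up to n.
    for _ in range(start, n + 1):
        current = combine(*(window[-c] for c in offsets))
        window = window[1:] + [current]
    return window[-1]


def fib(n):
    return iterate_recurrence({0: 0, 1: 1}, [1, 2], lambda a, b: a + b, n)


def kat(n):
    # Hypothetical base cases; the recurrence is k(n) = k(n-1) + k(n-3).
    return iterate_recurrence({0: 0, 1: 1, 2: 1}, [1, 3], lambda a, b: a + b, n)


print([fib(i) for i in range(10)])
```

&lt;p&gt;Only &lt;code&gt;max(offsets)&lt;&#x2F;code&gt; previous values are kept alive at a time, which mirrors the role of &lt;code&gt;i1&lt;&#x2F;code&gt; and &lt;code&gt;i2&lt;&#x2F;code&gt; in the C example above.&lt;&#x2F;p&gt;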
&lt;p&gt;Our code for the LLVM pass is available &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;liuhenry4428&#x2F;llvm-pass-skeleton&#x2F;tree&#x2F;noauto&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;In order to evaluate correctness, we look at the programs generated for a set of benchmarks that we created to test the generalizability of our pass.&lt;&#x2F;p&gt;
&lt;p&gt;Function equivalence is an undecidable problem, as deciding it would solve the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Halting_problem&quot;&gt;halting problem&lt;&#x2F;a&gt;. Therefore, to approximately validate that our generated programs are correct, we test the base cases and a few arguments for the iterative case.&lt;&#x2F;p&gt;
&lt;p&gt;We randomly generate a few integers between 2 and 47; the 47th Fibonacci number is the largest that still fits in an unsigned 32-bit integer.&lt;&#x2F;p&gt;
&lt;p&gt;A next step to improve correctness checking would be to keep using randomized testing, but to determine the space of valid inputs from the input program itself.&lt;&#x2F;p&gt;
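&lt;p&gt;The checking strategy described above can be sketched in Python as follows. This is an illustration rather than our actual test harness: &lt;code&gt;check_equivalence&lt;&#x2F;code&gt; and both Fibonacci functions are hypothetical stand-ins, the reference is memoized only so that large arguments finish quickly, and Python integers do not overflow the way 32-bit integers do.&lt;&#x2F;p&gt;

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_recursive(n):
    """Reference: the original recursive definition (memoized to keep it quick)."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Stand-in for the function the pass would generate."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    i2, i1 = 0, 1
    for _ in range(2, n + 1):
        i2, i1 = i1, i1 + i2
    return i1


def check_equivalence(reference, candidate, base_args, low, high, samples, seed):
    """Return the list of arguments on which the two functions disagree."""
    rng = random.Random(seed)  # a fixed seed keeps any failure reproducible
    args = list(base_args) + [rng.randint(low, high) for _ in range(samples)]
    return [a for a in args if reference(a) != candidate(a)]


mismatches = check_equivalence(fib_recursive, fib_iterative, [0, 1], 2, 47, 5, 6120)
print("mismatches:", mismatches)
```

&lt;p&gt;An empty list only means that no disagreement was found on the sampled arguments; it is not a proof of equivalence.&lt;&#x2F;p&gt;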
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;p&gt;To measure performance, we ran derivatives of the Fibonacci program. We compare the execution time of the recursive program to the execution time of the iterative program.&lt;&#x2F;p&gt;
&lt;p&gt;Here are the results of running the Fibonacci benchmark shown above. We see that the recursive benchmark follows an $O(2^n)$ execution time trendline, as expected. We also see that the iterative benchmark&#x27;s trendline is close to linear. According to the data, the iterative execution time only varies between 3 and 3.5 hundredths of a millisecond between inputs 7 and 55, which is quite small. One likely reason is that increasing n by 1 in the iterative benchmark only increases the number of add instructions by 1.
&lt;img src=&quot;call.png&quot; style=&quot;max-width: 100%&quot; &gt;&lt;&#x2F;p&gt;
&lt;p&gt;When interpreting these results, we considered the difference between these two programs. 
Consider the call tree for recursive Fibonacci. As $n \longrightarrow \infty$, the number of calls in the call tree grows as $O(2^n)$.&lt;&#x2F;p&gt;
&lt;img src=&quot;fib.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;In addition to the redundant computation, the recursive Fibonacci program takes up more space, as it is not tail-recursive, and therefore requires many more stack frames for each computation.&lt;&#x2F;p&gt;
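&lt;p&gt;The difference in work can be counted directly. The following Python sketch (with hypothetical helper names) counts how many invocations naive recursive Fibonacci makes and how many loop iterations the iterative version needs for the same argument.&lt;&#x2F;p&gt;

```python
def count_calls(n):
    """Number of invocations made by naive recursive Fibonacci for argument n."""
    if n == 0 or n == 1:
        return 1
    # One call for this invocation plus everything its two children perform.
    return 1 + count_calls(n - 1) + count_calls(n - 2)


def count_iterations(n):
    """Loop iterations in the iterative version: one per value from 2 to n."""
    return max(n - 1, 0)


for n in (5, 10, 20):
    print(n, count_calls(n), count_iterations(n))
```

&lt;p&gt;For an argument of 20 the recursive version already makes 21891 calls, while the loop runs 19 times.&lt;&#x2F;p&gt;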
&lt;p&gt;Next, consider Katinacci, a derivative of Fibonacci that we created, with the recurrence: $$k(n) = k(n-1) + k(n-3)$$&lt;&#x2F;p&gt;
&lt;img src=&quot;kat.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;Katinacci was designed to test recurrences with offsets that are not adjacent. We noticed that Katinacci does not follow the exponential trendline quite as closely as the rest of the benchmarks. One possible reason is that it approaches a base case much more quickly than the other benchmarks. Therefore, Katinacci may sit a little below the exponential trendline on the graph, as shown.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, consider Henrinacci, another derivative of Fibonacci that we created, with the recurrence: $$h(n) = h(n-1) + h(n-2) + h(n-3)$$&lt;&#x2F;p&gt;
&lt;img src=&quot;rec_henri.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;img src=&quot;it_henri.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;We see the recursive cases of these benchmarks follow the exponential trendline, while the iterative benchmarks look much more linear. The data from the iterative benchmarks cover smaller amounts of time and therefore appear to be more influenced by noise.
We ran another set of benchmarks on the iterative versions of each of our benchmarks, 3 times per input argument, and plot the results below. Despite increasing the number of runs per benchmark, the error bars indicate there is little statistically significant difference between input arguments. Since these benchmarks were run together per input argument, it is likely that noise from other processes is the reason the benchmarks vary similarly.&lt;&#x2F;p&gt;
&lt;img src=&quot;lin.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;While these benchmarks are challenging to compare side-by-side for the iterative cases, it is clear that the iterative cases take much less time than their recursive counterparts. To achieve more precise results for the iterative benchmarks, it may be possible to run these benchmarks on larger input arguments and modify the program to return larger than 32-bit integer results.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h3&gt;
&lt;p&gt;We have hit our &lt;em&gt;V0&lt;&#x2F;em&gt; goal of generating an iterative Fibonacci from a recursive one. We have also satisfied our &lt;em&gt;V1&lt;&#x2F;em&gt; goal of generalizing our implementation to recursive functions with any number of recursive calls with arbitrary constant offsets. &lt;&#x2F;p&gt;
&lt;p&gt;Our results show that we have successfully reduced the original functions&#x27; time complexities from exponential to linear.&lt;&#x2F;p&gt;
&lt;p&gt;We were not able to satisfy our &lt;em&gt;V2&lt;&#x2F;em&gt; goals as they were too technically challenging. They would require analyzing a program and then procedurally generating a multidimensional array structure to iterate through, which we found difficult to do in LLVM IR.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-steps&quot;&gt;Next Steps&lt;&#x2F;h2&gt;
&lt;p&gt;We are currently using an iterator that always increments by 1. This is not optimal when the recursion contains large holes, as we do not need &lt;em&gt;all&lt;&#x2F;em&gt; the values from the base cases up to $n$. One such example is $$f(n) = f(n-13) + f(n-17)$$
Instead, we need to calculate the minimum set of values to compute. This can be done by &amp;quot;hopping backwards&amp;quot; from the desired $n$, where the hops are the offsets (in this case, 13 and 17). This finds the minimum number of values, and also means that the induction variable needs to increase by values other than 1.&lt;&#x2F;p&gt;
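&lt;p&gt;A minimal Python sketch of this backwards hopping is shown below. We have not implemented this; the function name &lt;code&gt;needed_indices&lt;&#x2F;code&gt; and the parameter &lt;code&gt;first_recursive&lt;&#x2F;code&gt; are hypothetical, and the sketch assumes every argument from 0 up to (but not including) &lt;code&gt;first_recursive&lt;&#x2F;code&gt; is a base case, with &lt;code&gt;first_recursive&lt;&#x2F;code&gt; at least as large as the largest offset.&lt;&#x2F;p&gt;

```python
def needed_indices(n, offsets, first_recursive):
    """Indices whose values are required to compute f(n).

    Arguments in range(first_recursive) are treated as base cases, so they
    are recorded but not expanded any further.
    """
    needed = set()
    worklist = [n]
    while worklist:
        k = worklist.pop()
        if k in needed:
            continue
        needed.add(k)
        if k in range(first_recursive):
            continue  # base case: no further hops
        # Hop backwards once per recursive call offset.
        worklist.extend(k - c for c in offsets)
    return sorted(needed)


hops = needed_indices(60, [13, 17], 17)
print(len(hops), "of", 61, "indices are needed")
```

&lt;p&gt;Iterating over the returned list in increasing order visits each required value after the values it depends on, so the loop could step through these indices instead of incrementing by 1.&lt;&#x2F;p&gt;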
&lt;p&gt;We have also only been considering functions with only 1 argument. The same principles should apply when extrapolating this algorithm to functions with multiple arguments. &lt;&#x2F;p&gt;
&lt;p&gt;There is also no reason this pass needs to work only with integers. However, implementing this pass for functions with something like string arguments becomes difficult as strings require memory accesses which introduces pointer analysis and much complexity.&lt;&#x2F;p&gt;
&lt;p&gt;This seems like a very powerful optimization pass if implemented in full. Many coding challenge questions involve dynamic programming, and having a compiler that can optimize a naïve recursive solution into an iterative version seems desirable in all circumstances.&lt;&#x2F;p&gt;
&lt;p&gt;One thing that doesn&#x27;t seem feasible is handling non-constant recursive call offsets (e.g., &lt;code&gt;f(n) = f(f(n-1))&lt;&#x2F;code&gt;), or recursive calls that are not unidirectional (e.g., &lt;code&gt;f(n) = f(n+3) + f(-n*2)&lt;&#x2F;code&gt;), even if they are valid programs due to their base cases.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;primary-challenges&quot;&gt;Primary Challenges&lt;&#x2F;h2&gt;
&lt;p&gt;The primary challenge was using LLVM. The documentation is there, but it is highly technical and required a broad understanding of LLVM before things started &amp;quot;clicking&amp;quot;. For example, we had immense difficulty with LLVM contexts (which were required for IRBuilder and basic block creation). Mixing and matching LLVM contexts results in mysterious segfaults, and correct usage was not obvious due to a lack of examples.&lt;&#x2F;p&gt;
&lt;p&gt;It was also difficult to store and replay the instructions dependent on the recursive calls due to messy pointer management, specifically when setting the &lt;code&gt;Use&lt;&#x2F;code&gt; objects of those instructions to point at the new values we operate on for the iterative version. &lt;&#x2F;p&gt;
&lt;p&gt;Code generation was also on the more tedious side as we were writing LLVM IR. This was fine for our 1D functions, but for 2+D functions it seems like a better approach to generate higher level source code (such as C++ code) that implements our iterative function and then use Clang to compile it down to LLVM IR which we would then link in. This allows us to write code for complex things like multi-dimensional arrays in a higher-level language instead of LLVM IR. &lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Bril JIT with On-Stack Replacement</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-osr/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-osr/</guid>
                <description>&lt;h2 id=&quot;interpreters&quot;&gt;Interpreters&lt;&#x2F;h2&gt;
&lt;p&gt;Many widely-used programming languages such as Python and JavaScript rely
heavily on interpreters. Some representation of the source code is passed to the
interpreter, where the instructions are evaluated immediately.&lt;&#x2F;p&gt;
&lt;p&gt;Since interpreters do not translate source programs to machine code and they
only consider the subset of instructions that are actually evaluated at run time,
startup cost for a program is minimized. This is huge for developers who may
update and run programs hundreds of times over. So interpreters are fast for
source translation but overall have slow performance because they are not
optimized for the architecture itself.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;jits-and-on-stack-replacement&quot;&gt;JITs and On-Stack Replacement&lt;&#x2F;h2&gt;
&lt;p&gt;Ideally, we want to compile our program to machine code which will allow us to
perform platform specific optimizations. However, compilation can be costly and
in doing so, we lose the startup time benefit of an interpreter.&lt;&#x2F;p&gt;
&lt;p&gt;A just-in-time (JIT) compiler attempts to find the best of both worlds. Instead
of compiling source code ahead of time, a JIT compiler aims to compile code at
run time. There are three advantages to highlight with this approach:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Only the functions that the program executes at run time are compiled to
machine code. In addition, every compilation is much smaller in scope, so this
overhead is spread throughout the execution.&lt;&#x2F;li&gt;
&lt;li&gt;Run-time profiling information can be utilized for more advanced
optimizations that ahead-of-time (AOT) compilers cannot take advantage of.&lt;&#x2F;li&gt;
&lt;li&gt;By compiling at run time, users only need to install one version of a program,
as the JIT can dynamically resolve the architecture it needs to compile source
code to. For example, Java code is compiled to platform-independent Java
bytecode, which can then be executed by a platform-aware Java Virtual Machine
(JVM).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The main benefit of a JIT is the speed-up; the process that starts from source
code and ends with execution can often be completed faster with JIT compilation
than with interpretation or with regular compilation followed by execution.
While a basic JIT compiler will always compile a code fragment when it is
reached in the execution, more sophisticated versions can incorporate an
interpreter. The JIT will then decide to interpret a section of code if it seems
like the upfront time cost of compilation would not be made up for by the speed
gains of native execution over interpretation. This allows us to combine the
start-up benefits of an interpreter with the speed of a JIT.&lt;&#x2F;p&gt;
&lt;p&gt;One problem with traditional JITs is that they often compile code in fairly
large segments, typically at the function level or even entire files. This poses
a problem because a JIT compiler may decide to interpret a function based off of
its heuristic, when it should actually compile it for the best performance.
Furthermore, once the JIT has decided to, say, interpret a function, it cannot
change that decision until the next time the function is reached. If a function
being interpreted turns out to execute for much longer than expected, we have no
way to optimize it until it ends. As such, we need a way to retroactively decide
when to compile snippets of code.&lt;&#x2F;p&gt;
&lt;p&gt;The solution to this is &lt;strong&gt;on-stack replacement&lt;&#x2F;strong&gt; (OSR). In the middle of
executing a function in interpreted mode, we may decide to compile the function
to get better performance. This is different from a normal JIT, which will
either compile the entire function or not compile it at all.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the following &lt;a href=&quot;https:&#x2F;&#x2F;capra.cs.cornell.edu&#x2F;bril&#x2F;&quot;&gt;Bril&lt;&#x2F;a&gt; program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  one: int = const 1;
  b: bool = const true;
  c: int = const 0;
  max: int = const 1000;
loop:  
  c: int = add c one;
  b: bool = le c max;
  br b loop end;
end:
  # Rest of program
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Naïvely, a JIT compiler may not compile the main function as it appears to be
short or not computationally intensive. However, upon execution, a JIT with OSR
will be able to see that the same snippet of code is running over and over
again, and will switch from interpreted mode to compiled mode to enhance
performance.&lt;&#x2F;p&gt;
&lt;p&gt;On-stack replacement is more complicated implementation-wise than normal
just-in-time compilation, as it needs to map the current state of execution from
the interpreter to the compiled execution. This includes placing the values of
variables in the correct positions, as well as resuming execution from the
correct program point. We use a heuristic marking code sections that are
frequently run as &amp;quot;hot,&amp;quot; and then compiling a function with OSR if any section
in it is determined to be sufficiently hot. This allows us to switch from
interpreted to compiled code when it will likely be beneficial, and allows us to
get the benefits of both modes of execution.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;In order to explore the use cases for JITs and on-stack replacement, we built
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;splashofcrimson&#x2F;jit-bril&quot;&gt;Bril-OSR&lt;&#x2F;a&gt;. Here&#x27;s how it works.&lt;&#x2F;p&gt;
&lt;p&gt;At Bril-OSR&#x27;s core is an efficient interpreter, with a series of arguments to
toggle its various features on and off. With the &lt;code&gt;-jit&lt;&#x2F;code&gt; flag turned on, the
interpreter will compile functions to x86-64 machine code at run time, and
execute them. The threshold at which the interpreter compiles functions can be
tuned by the programmer, representing the number of times a function must be
called before it is compiled. The &lt;code&gt;-osr&lt;&#x2F;code&gt; flag is similar, although it operates
at a basic block level. If a block is executed &lt;em&gt;n&lt;&#x2F;em&gt; times by the program at
run time, the function housing it will be compiled to machine code, and
evaluation will resume where the interpreter left off. The value of &lt;em&gt;n&lt;&#x2F;em&gt; is also
configurable as an argument to Bril-OSR.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: We decided to write Bril-OSR in Rust for two main reasons:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;A compiled systems language, when used well, should give some exciting
performance boosts (especially when compared to the baseline TypeScript
interpreter, Brili).&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;dynasm-rs&lt;&#x2F;code&gt; crate is an awesome tool for code generation, and is built
with just-in-time compilers in mind!&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;building-a-faster-interpreter-in-rust&quot;&gt;Building a Faster Interpreter in Rust&lt;&#x2F;h3&gt;
&lt;p&gt;To provide a fair baseline to compare with both the JIT and the JIT with OSR, we
built an efficient interpreter in Rust.&lt;&#x2F;p&gt;
&lt;p&gt;First, we took advantage of the &lt;a href=&quot;https:&#x2F;&#x2F;serde.rs&#x2F;&quot;&gt;Serde&lt;&#x2F;a&gt; crate to deserialize Bril JSON into
native Rust structs that represent a Bril program. With
&lt;code&gt;#[derive(Deserialize)]&lt;&#x2F;code&gt;, this process is fairly automatic, and the generated
code is quite efficient as well! To make interpretation easier, the structs
implement the &lt;code&gt;From&lt;&#x2F;code&gt; trait to map Rust strings to specialized enums that
represent opcodes and types in Bril.&lt;&#x2F;p&gt;
&lt;p&gt;Interpretation is implemented by pattern matching by opcode, since each line in
Bril only has a single operation. Variables and their values are stored in an
environment map local to the function that is currently evaluating. On function
calls, Bril-OSR creates a new environment map with the function arguments, and
passes evaluation to the callee.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: We built off of our first project, &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;making-function-calls-work&#x2F;&quot;&gt;Bril()&lt;&#x2F;a&gt;, as we need to support
multiple functions, arguments, calls, and return values.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;code-generation-and-jit-compilation&quot;&gt;Code Generation and JIT Compilation&lt;&#x2F;h3&gt;
&lt;p&gt;To build a JIT compiler, we needed infrastructure for code generation. We used
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;CensoredUsername&#x2F;dynasm-rs&quot;&gt;&lt;code&gt;dynasm-rs&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; to be able to write assembly code and have it dynamically
assemble and execute as part of the Rust program. Our translation from Bril to
assembly is rather simplistic, as we do not implement any sort of register
allocation or other optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;We first create a standard assembly &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Function_prologue&quot;&gt;function prologue&lt;&#x2F;a&gt;, which involves
storing the current base pointer, and allocating stack space for this function.
We allocate one 64-bit word of stack space for each unique variable that is
defined in the function or is an argument to the function.&lt;&#x2F;p&gt;
&lt;p&gt;Next, we iterate through the instructions of the Bril program and generate
assembly code for each. The assembly for most Bril instructions generally
follows the structure of getting a variable from the stack, applying some
operation to it, and then storing back on the stack. For example, the assembly
for an instruction &lt;code&gt;a: int = add b c&lt;&#x2F;code&gt; would look roughly like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;mov rax, [rbp - b_stack_position]
add rax, [rbp - c_stack_position]
mov [rbp - a_stack_position], rax
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
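&lt;p&gt;As a rough illustration of this translation scheme, the following Python sketch assigns stack slots and emits the three instructions as text. It is not the Rust and &lt;code&gt;dynasm-rs&lt;&#x2F;code&gt; code used in Bril-OSR: the helper names &lt;code&gt;assign_stack_slots&lt;&#x2F;code&gt; and &lt;code&gt;emit_add&lt;&#x2F;code&gt; and the exact slot layout (8 bytes per variable, in first-seen order) are assumptions made for this example.&lt;&#x2F;p&gt;

```python
def assign_stack_slots(args, instrs):
    """Give each unique variable one 64-bit word below rbp, in first-seen order."""
    slots = {}
    names = list(args) + [instr["dest"] for instr in instrs if "dest" in instr]
    for name in names:
        if name not in slots:
            slots[name] = 8 * (len(slots) + 1)
    return slots


def emit_add(instr, slots):
    """Emit load, add, store for a Bril instruction such as a: int = add b c."""
    lhs, rhs = instr["args"]
    return [
        "mov rax, [rbp - {}]".format(slots[lhs]),
        "add rax, [rbp - {}]".format(slots[rhs]),
        "mov [rbp - {}], rax".format(slots[instr["dest"]]),
    ]


# The instruction in Bril JSON form; b and c are treated as function arguments.
instr = {"op": "add", "dest": "a", "type": "int", "args": ["b", "c"]}
slots = assign_stack_slots(["b", "c"], [instr])
print("\n".join(emit_add(instr, slots)))
```

&lt;p&gt;Because every value lives on the stack and is reloaded for each instruction, no register allocation is needed, at the cost of extra memory traffic.&lt;&#x2F;p&gt;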
&lt;p&gt;In order to implement the &lt;code&gt;print&lt;&#x2F;code&gt; instruction, we created basic Rust functions
to print an integer or a boolean, and then used standard calling conventions to
call these functions from the assembly code.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;function-calls&quot;&gt;Function Calls&lt;&#x2F;h4&gt;
&lt;p&gt;The most difficult part of code generation was implementing function calls (as
defined in the Bril() language extension). Because this is a just-in-time
compiler, a function being called may or may not already be compiled. As such,
we cannot simply call the other function using standard calling conventions.
Instead, we created another Rust function, &lt;code&gt;handle_call&lt;&#x2F;code&gt;, for the assembly to
call when encountering a Bril &lt;code&gt;call&lt;&#x2F;code&gt; instruction. To this function, the assembly
passes an integer identifying the Bril function that it wants to call.&lt;&#x2F;p&gt;
&lt;p&gt;This function can then see if the Bril function has already been compiled, and
run it if so, or otherwise decide whether to compile it now or interpret it.
However, to perform these actions, &lt;code&gt;handle_call&lt;&#x2F;code&gt; needs access to the internal
state of the compiler, and, as such, really needs to be a method of the main
compiler struct. And, to call a method, you need to pass in a reference to the
&lt;code&gt;self&lt;&#x2F;code&gt; variable. As such, the assembly Bril function needs to have a reference
to this struct. To solve this, we decided on a convention that every time an
assembly function is called, it is passed a reference to &lt;code&gt;self&lt;&#x2F;code&gt; as an argument.
That reference is then stored on the stack for later use.&lt;&#x2F;p&gt;
&lt;p&gt;Handling function arguments was quite tricky, as Bril() functions can have an
arbitrary number of arguments, and the Rust &lt;code&gt;handle_call&lt;&#x2F;code&gt; function must be able
to pass along these arguments. Because Rust does not support functions with
variable numbers of arguments, we decided to pass the Bril arguments in a Rust
vector to &lt;code&gt;handle_call&lt;&#x2F;code&gt;. It took considerable effort to figure out how to create
a vector from assembly, and pass it as an argument. Unable to find any
definitive information on how vectors are implemented, we resorted to
reverse-engineering vector uses. We heavily used the &lt;a href=&quot;https:&#x2F;&#x2F;godbolt.org&quot;&gt;Godbolt&lt;&#x2F;a&gt; compiler
explorer to quickly see the assembly generated from Rust functions. We made
assumptions about how vectors work based on the outputs we saw, and based our
implementation of function arguments on those assumptions. Having done this, we
were able to construct Rust vectors from assembly and pass them to the
&lt;code&gt;handle_call&lt;&#x2F;code&gt; function, which then forwards them to the callee Bril function to
be deconstructed.&lt;&#x2F;p&gt;
&lt;p&gt;Implementing returning values from Bril functions was considerably easier, as
they can just be placed in the return register (&lt;code&gt;rax&lt;&#x2F;code&gt;), and then forwarded back
through &lt;code&gt;handle_call&lt;&#x2F;code&gt; to the caller.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;combining-the-jit-with-the-interpreter&quot;&gt;Combining the JIT with the Interpreter&lt;&#x2F;h4&gt;
&lt;p&gt;To improve the performance of the JIT, we combined it with the interpreter. We
extended the interpreter to perform basic profiling. It maintains a counter for
each function, keeping track of how many times that function has run. Then, once
that counter reaches a fixed value, the program will then compile the function
instead of interpreting it. After that, every time the function is called, the
compiled version can be simply run.&lt;&#x2F;p&gt;
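&lt;p&gt;The counting scheme can be sketched in Python as follows. This is a simplified model rather than the Rust implementation: the class &lt;code&gt;TieredRunner&lt;&#x2F;code&gt; and the callables passed to it are hypothetical, and &amp;quot;compiling&amp;quot; is represented by swapping in a different Python function.&lt;&#x2F;p&gt;

```python
class TieredRunner:
    """Interpret a function threshold times, then compile it and reuse the result."""

    def __init__(self, interpret, compile_fn, threshold):
        self.interpret = interpret    # callable(name, args): the slow path
        self.compile_fn = compile_fn  # callable(name) returning a fast callable
        self.threshold = threshold
        self.call_counts = {}
        self.compiled = {}

    def call(self, name, args):
        if name in self.compiled:
            return self.compiled[name](args)
        if self.call_counts.get(name, 0) == self.threshold:
            # The function is hot: compile once and cache the result.
            self.compiled[name] = self.compile_fn(name)
            return self.compiled[name](args)
        self.call_counts[name] = self.call_counts.get(name, 0) + 1
        return self.interpret(name, args)


log = []


def interpret(name, args):
    log.append("interpreted")
    return sum(args)


def compile_fn(name):
    def native(args):
        log.append("compiled")
        return sum(args)
    return native


runner = TieredRunner(interpret, compile_fn, threshold=2)
results = [runner.call("sum3", [1, 2, 3]) for _ in range(4)]
print(log)
```

&lt;p&gt;In this model a threshold of 0 compiles every function on its first call, and a threshold that is never reached leaves the program fully interpreted.&lt;&#x2F;p&gt;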
&lt;h3 id=&quot;on-stack-replacement&quot;&gt;On-Stack Replacement&lt;&#x2F;h3&gt;
&lt;p&gt;To implement on-stack replacement, we extend the profiling information from the
previous section to keep track of the number of times each basic block in a
function is executed. If a function is being interpreted and one of these
counters reaches a certain fixed value, OSR will be initiated immediately. Code
generation will then proceed as before, but with a few OSR-specific additions.
The basic structure of a function compiled via OSR is as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;osr_start:
    # prologue
    # move variable values to stack
    jmp osr_first_inst

regular_start:
    # prologue
    # ...
osr_first_inst:
    # ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As such, the assembly has additional code at &lt;code&gt;osr_start&lt;&#x2F;code&gt; which will only execute
this one time for completing the on-stack replacement. Future calls to this
function will ignore all of that and start from the &lt;code&gt;regular_start&lt;&#x2F;code&gt;. In this new
section of the assembly, we move the current values of all of the variables to
their appropriate positions in the stack, and then jump to whatever instruction
the interpreter was about to execute.&lt;&#x2F;p&gt;
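&lt;p&gt;To illustrate the hand-off, here is a small Python model of OSR on the loop from the earlier Bril example. It is a sketch under several assumptions and not how Bril-OSR is implemented: the tuple encoding of instructions, the explicit &lt;code&gt;jmp&lt;&#x2F;code&gt; and &lt;code&gt;ret&lt;&#x2F;code&gt; terminators, and the names &lt;code&gt;run_block&lt;&#x2F;code&gt;, &lt;code&gt;compiled_main&lt;&#x2F;code&gt;, and &lt;code&gt;run_with_osr&lt;&#x2F;code&gt; are all hypothetical, and the &amp;quot;compiled&amp;quot; code is a hand-written Python loop that can only be entered at the &lt;code&gt;loop&lt;&#x2F;code&gt; label.&lt;&#x2F;p&gt;

```python
import operator

# The loop from the Bril example, split into basic blocks. The jmp and ret
# terminators are additions so that this sketch is self-contained.
BLOCKS = {
    "entry": [("const", "one", 1), ("const", "b", True), ("const", "c", 0),
              ("const", "max", 1000), ("jmp", "loop")],
    "loop": [("add", "c", "c", "one"), ("le", "b", "c", "max"),
             ("br", "b", "loop", "end")],
    "end": [("ret", "c")],
}
BINARY_OPS = {"add": operator.add, "le": operator.le}


def run_block(label, env):
    """Interpret one basic block; return the next label, or None after ret."""
    for instr in BLOCKS[label]:
        op = instr[0]
        if op == "const":
            env[instr[1]] = instr[2]
        elif op in BINARY_OPS:
            env[instr[1]] = BINARY_OPS[op](env[instr[2]], env[instr[3]])
        elif op == "jmp":
            return instr[1]
        elif op == "br":
            return instr[2] if env[instr[1]] else instr[3]
        elif op == "ret":
            env["result"] = env[instr[1]]
            return None


def compiled_main(env, resume_label):
    """Stand-in for machine code entered at osr_start."""
    # osr_start: copy the interpreter variables into locals (the stack slots).
    c, one, limit = env["c"], env["one"], env["max"]
    # jmp osr_first_inst: this stand-in only supports resuming at the loop head.
    assert resume_label == "loop"
    while True:
        c = c + one
        if not operator.le(c, limit):
            return c


def run_with_osr(block_threshold):
    """Interpret until some block has run block_threshold times, then switch."""
    env, counts, label = {}, {}, "entry"
    while label is not None:
        if counts.get(label, 0) == block_threshold:
            return compiled_main(env, label), "osr at " + label
        counts[label] = counts.get(label, 0) + 1
        label = run_block(label, env)
    return env["result"], "interpreted"


print(run_with_osr(10))
print(run_with_osr(5000))
```

&lt;p&gt;Both calls compute the same value; only the first one switches to the stand-in compiled code partway through the loop, carrying the current value of &lt;code&gt;c&lt;&#x2F;code&gt; across the switch.&lt;&#x2F;p&gt;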
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To evaluate Bril-OSR, we aimed to first verify its correctness using the
existing &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&quot;&gt;brili&lt;&#x2F;a&gt; interpreter as a benchmark,
and then its performance on a series of reasonably large-scale Bril tests.&lt;&#x2F;p&gt;
&lt;p&gt;All tests were run on an Intel Core i7-7700HQ CPU @ 2.80GHz with 16GB of RAM,
and using Ubuntu on WSL.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;correctness-tests&quot;&gt;Correctness Tests&lt;&#x2F;h2&gt;
&lt;p&gt;First, we generated multiple programs that covered the breadth of Bril opcodes
from a configurable Python script. We ran the correctness suite on both the
interpreter and just-in-time code generation engine, and ensured that the
run-time effects matched those of Brili.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: Inspired by Alexa and Gregory&#x27;s foray into
&lt;a href=&quot;https:&#x2F;&#x2F;hypothesis.works&#x2F;&quot;&gt;Hypothesis&lt;&#x2F;a&gt;, we attempted to write property-based
tests for Bril-OSR using Rust&#x27;s
&lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;quickcheck&#x2F;0.9.0&#x2F;quickcheck&#x2F;&quot;&gt;quickcheck&lt;&#x2F;a&gt; crate. Unfortunately,
generating structures representing Bril programs was difficult, so we abandoned
this project. Still, this would be a cool extension to Bril-OSR&#x27;s correctness
test suite.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;performance-tests-sensitivity-study&quot;&gt;Performance Tests (Sensitivity Study)&lt;&#x2F;h2&gt;
&lt;p&gt;In order to compare the performance of the various components of Bril-OSR, we
composed an ablation study.&lt;&#x2F;p&gt;
&lt;p&gt;The three benchmarks we used to evaluate against are as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;FibFunc with &lt;code&gt;n&lt;&#x2F;code&gt; = 100, &lt;code&gt;n&lt;&#x2F;code&gt; = 500&lt;&#x2F;strong&gt;: &lt;code&gt;n&lt;&#x2F;code&gt; functions each iteratively compute
a random, long-running Fibonacci sequence, and return to main.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;KNN with &lt;code&gt;n&lt;&#x2F;code&gt; = 100, &lt;code&gt;n&lt;&#x2F;code&gt; = 500&lt;&#x2F;strong&gt;: implementation of K=1 Nearest Neighbors in
Bril, with &lt;code&gt;n&lt;&#x2F;code&gt; training and testing points.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Ackermann with &lt;code&gt;n&lt;&#x2F;code&gt; = 2, &lt;code&gt;m&lt;&#x2F;code&gt; = 3&lt;&#x2F;strong&gt;: implementation of the Ackermann function.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Each benchmark was run against each of the following configurations at least 10
times, and the averages&#x2F;deviations were computed with
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sharkdp&#x2F;hyperfine&quot;&gt;hyperfine&lt;&#x2F;a&gt; (shoutout to Wil and Daniel
Glus for the
&lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;faster-interpreter&#x2F;&quot;&gt;inspiration&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The first row for each benchmark runs just the Rust interpreter, the second row
runs the JIT on every function, and the remaining rows all run different
combinations of heuristics for both the function-level JIT and our
implementation of OSR. &lt;em&gt;Function threshold&lt;&#x2F;em&gt; is defined as the number of times a
function is interpreted before it is compiled, and &lt;em&gt;basic block threshold&lt;&#x2F;em&gt; is
defined as the number of times a basic block is interpreted before it is
compiled. All times are listed in milliseconds.&lt;&#x2F;p&gt;
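&lt;p&gt;As a rough illustration of how these two thresholds act, the following Python sketch models counter-based tiering. It is not Bril-OSR&#x27;s Rust implementation: the &lt;code&gt;TieringPolicy&lt;&#x2F;code&gt; class and its method names are hypothetical, and it only decides when to compile rather than compiling anything.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from collections import Counter


class TieringPolicy:
    """Hypothetical counter-based heuristic: compile a unit once it has
    been interpreted `threshold` times."""

    def __init__(self, function_threshold, block_threshold):
        self.function_threshold = function_threshold
        self.block_threshold = block_threshold
        self.function_runs = Counter()  # interpreted executions per function
        self.block_runs = Counter()     # interpreted executions per basic block

    def should_jit_function(self, name):
        """Call before running a function; True means compile it."""
        if self.function_runs[name] >= self.function_threshold:
            return True
        self.function_runs[name] += 1
        return False

    def should_osr_block(self, label):
        """Call before interpreting a basic block; True means switch to
        compiled code mid-function (OSR)."""
        if self.block_runs[label] >= self.block_threshold:
            return True
        self.block_runs[label] += 1
        return False


policy = TieringPolicy(function_threshold=2, block_threshold=3)
decisions = [policy.should_jit_function("fib") for _ in range(4)]
print(decisions)  # prints [False, False, True, True]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Under this reading, a function threshold of 0 corresponds to compiling every function before its first run.&lt;&#x2F;p&gt;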
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bril-osr&#x2F;fib_func_100.png&quot; alt=&quot;&quot; &#x2F;&gt;
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bril-osr&#x2F;fib_func_500.png&quot; alt=&quot;&quot; &#x2F;&gt;
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bril-osr&#x2F;knn_100.png&quot; alt=&quot;&quot; &#x2F;&gt;
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bril-osr&#x2F;knn_500.png&quot; alt=&quot;&quot; &#x2F;&gt;
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bril-osr&#x2F;ackermann.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Bril-OSR&#x27;s interpreter is typically at least 10x slower than any of the other
configurations, except interestingly with the Ackermann benchmark. This is most
likely because it&#x27;s a smaller benchmark, and the compilation overhead offsets
any performance gains from machine code.&lt;&#x2F;p&gt;
&lt;p&gt;Solely using the function-level JIT performed well on both KNN benchmarks
at smaller thresholds. As the threshold increased, however, performance
degraded sharply, and was almost 10x worse for n = 500 when the threshold was 25. In
addition, these configurations performed poorly on the FibFunc benchmarks,
presumably because those benchmarks lack many calls to the same function.&lt;&#x2F;p&gt;
&lt;p&gt;Solely using on-stack replacement on a basic block level performed well on the
FibFunc and Ackermann benchmarks, and actually showed very similar performance
to the mixed configurations tested later. However, it performed relatively
poorly on the KNN benchmarks, presumably because those benchmarks lack many loops.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the mixed configurations tended to work well universally when compared
to the others, and based on our benchmarks we lean towards concluding that both
function-level JITs and basic-block level OSR are effective for performance
gains. However, these results still are not very conclusive. JITing every single
function performs the best on every single benchmark, and so we can only really
conclude that compiling more often is better. Regardless, these strategies still
show promise: given more expansive benchmarks (on the level of SQL or a
similarly large project), the benefits of JITs and OSR could become more
apparent.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;&#x2F;h2&gt;
&lt;p&gt;Bril-OSR successfully implements examples of both just-in-time compilation and
on-stack replacement, but more importantly, it provides a framework to build
upon with more interesting heuristics and utilization.&lt;&#x2F;p&gt;
&lt;p&gt;First, we would love to incorporate more evaluators to the framework. We could
add an optimizing compiler and switch between all three evaluators based on more
complex heuristics. We can add arbitrary levels of evaluators like this, and do
a more in-depth ablation study.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, it would be awesome to design more interesting heuristics for
both OSR and JIT. We currently use simple counts as our heuristics, but there&#x27;s
room to use strategies like run-time profiling to make these heuristics more
accurate.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>A Dependently Typed Language</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/dependently-typed-language/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/dependently-typed-language/</guid>
                <description>&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;For this project, we implement the &lt;a href=&quot;https:&#x2F;&#x2F;www.sciencedirect.com&#x2F;science&#x2F;article&#x2F;pii&#x2F;0890540188900053&quot;&gt;Calculus of Constructions&lt;&#x2F;a&gt; (CoC)
by Coquand and Huet, which is a lambda calculus that sits at the top of the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lambda_cube&quot;&gt;Lambda Cube&lt;&#x2F;a&gt;. This means that
it includes polymorphism, type level operators, and dependent types. With the Simply Typed Lambda Calculus, terms are only
allowed to depend on terms. Polymorphism lets terms depend on types (e.g., generics). Type level operators allow types to
depend on types (e.g., &lt;code&gt;list&lt;&#x2F;code&gt; in OCaml). Dependent types allow types to depend on terms. Dependent types are rather
uncommon in most programming languages, so I&#x27;d suggest reading &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs4110&#x2F;2018fa&#x2F;lectures&#x2F;lecture31.pdf&quot;&gt;these notes&lt;&#x2F;a&gt;
created by our very own &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;%7Easampson&#x2F;&quot;&gt;Adrian Sampson&lt;&#x2F;a&gt; for an introduction to how dependent types work.&lt;&#x2F;p&gt;
&lt;p&gt;Ultimately, dependent types allow us to make full use of the Curry-Howard isomorphism; that is, types correspond to
logical statements, and programs correspond to proofs of such statements. For example, consider the polymorphic
identity function $\Lambda A. \lambda x: A. x$, written $(\lambda A : *)(\lambda x : A)x$ in CoC. The type of this program
is written $[A: *][x: A]A$, which represents the logical statement $\forall A. A \implies A$. By the Curry-Howard isomorphism,
the identity function serves as a proof for that statement because it inhabits the corresponding type. Through this,
we can write proofs and be assured that they are free of mistakes (or at least more so than hand-written proofs).&lt;&#x2F;p&gt;
&lt;p&gt;The Calculus of Constructions provides a quite simple type system that allows us to write proofs through programming.
Our goal is to implement CoC and show the ability to write some proofs.&lt;&#x2F;p&gt;
&lt;p&gt;My implementation can be found on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;chrisroman&#x2F;coc&quot;&gt;GitHub&lt;&#x2F;a&gt;. Special thanks to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Xaec6&quot;&gt;Vincent Imbimbo&lt;&#x2F;a&gt;
for talking through and learning with me how CoC worked.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;Some of the design decisions about how to write CoC programs came from an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambda-11235&#x2F;ttyped&quot;&gt;existing implementation&lt;&#x2F;a&gt;
of CoC. That implementation is well written and good for working with CoC, though it uses a &lt;em&gt;slightly&lt;&#x2F;em&gt; different syntax.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-grammar&quot;&gt;The Grammar&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;%7Efp&#x2F;papers&#x2F;mfps89.pdf&quot;&gt;This paper&lt;&#x2F;a&gt; provides a concise grammar for CoC:&lt;&#x2F;p&gt;
&lt;img src=&quot;grammar.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;This formulation is much simpler to understand than what is presented by Coquand and Huet.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;typing-rules&quot;&gt;Typing Rules&lt;&#x2F;h3&gt;
&lt;p&gt;CoC makes a distinction between &lt;em&gt;contexts&lt;&#x2F;em&gt; and &lt;em&gt;objects&lt;&#x2F;em&gt;. Contexts are products over $*$, that is, terms
of the form $[x_1 : M_1] [x_2 : M_2] ... [x_n : M_n] *$. These terms are denoted as $\Gamma$ and $\Delta$.
All other terms are considered objects. The typing rules are presented as follows:&lt;&#x2F;p&gt;
&lt;img src=&quot;typing_rules.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;There are two typing judgements: one of the form $\Gamma \vdash \Delta$, which means that $\Delta$ is a valid
context given the valid context $\Gamma$, and $\Gamma \vdash M : P$, which means that $M$ has type $P$ in the
context $\Gamma$. Since programs usually start assuming an empty context, we have $\vdash \Gamma$ meaning
$\Gamma$ is a valid context, and $\vdash M : N$ meaning that $M$ has type $N$.&lt;&#x2F;p&gt;
&lt;p&gt;Here is an example of a proof derivation for the type of the identity function $(\lambda A : *)(\lambda x : A)x$:&lt;&#x2F;p&gt;
&lt;img src=&quot;typing_example.jpg&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;We can see the context is named for good reason: it keeps track of the types of variables. Interestingly, it
also doubles as a way to abstract over types. CoC enjoys nice properties like strong normalization.&lt;&#x2F;p&gt;
&lt;p&gt;In addition to the typing rules, we have $\beta$-conversion rules (denoted $P \cong Q$) that essentially provide a way to
reduce terms to a normal form. They are also needed to prove that two types are equivalent. The conversion
rules can be found on pp. 101-102 in Coquand and Huet&#x27;s paper. With these conversion rules, they introduce
a new typing rule that takes advantage of them:&lt;&#x2F;p&gt;
&lt;img src=&quot;beta_conversion.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;h3 id=&quot;additions-to-coc&quot;&gt;Additions to CoC&lt;&#x2F;h3&gt;
&lt;p&gt;To make CoC easier to work with, we make a few additions to the language:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  | LET; UNTYPED; x = ID; EQUALS; t1 = term; IN; p = program
    { Let (Untyped, x, t1, p) }
  | LET; x = ID; EQUALS; t1 = term; IN; p = program
    { Let (Typed, x, t1, p) }
  | THEOREM; x = ID; EQUALS; t = term; WITH; PROOF; proof = term; SEMICOLON; p = program
    { Theorem (x, t, proof, p) }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first was adding &lt;code&gt;let&lt;&#x2F;code&gt; expressions so that we can write longer, more useful CoC programs. This
is especially useful because the language can become very hard to parse by hand very quickly, so
breaking it up into &lt;code&gt;let&lt;&#x2F;code&gt; expressions allows us to build bigger abstractions. I added the option to
have the term be typed or untyped. I found this useful sometimes when there was a long term that
could be replicated across later terms but wasn&#x27;t a valid expression on its own.&lt;&#x2F;p&gt;
&lt;p&gt;The second was adding a mechanism that allows us to state a theorem (i.e., a type) and give a proof
of that theorem (i.e., a program with that type). My implementation will make sure that the proof
actually has that type, or else the program will fail. This gives the user a slightly better experience
using the language to prove things.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;I use OCaml for the implementation because pattern matching and functional programming are well suited
for language implementation.&lt;&#x2F;p&gt;
&lt;p&gt;For the parser I use a parser generator, namely Menhir, with OCamllex for the lexer. This made the grammar
straightforward to implement and extend with new constructs.&lt;&#x2F;p&gt;
&lt;p&gt;The typing judgement of the form $\Gamma \vdash M : P$ is quite simple to implement. I created a function
&lt;code&gt;typecheck_term&lt;&#x2F;code&gt; which takes the $\Gamma$ and $M$ and determines what the type $P$ is. The other judgement
of the form $\Gamma \vdash \Delta$ is a little different because just matching on $\Gamma$ is insufficient
for figuring out what $\Delta$ is. In other words, it is not syntax directed based just on the form of
$\Gamma$. Instead, I chose to write a function &lt;code&gt;typecheck_context&lt;&#x2F;code&gt; which takes both $\Gamma$ and $\Delta$
and checks if we can apply the inference rules to derive this statement, instead of directly deriving
$\Delta$ just from $\Gamma$.&lt;&#x2F;p&gt;
&lt;p&gt;Substitution was tricky to implement because I didn&#x27;t use De Bruijn notation at all. Because of this,
it seems like there are still some bugs in substitution. This makes writing the programs somewhat more
tricky, but I found that if I expected something to typecheck when it really didn&#x27;t, then I would change
some of the variable names so they are not being reused. In addition, because I didn&#x27;t use De Bruijn notation,
I had to make sure I compared terms up to $\alpha$-equivalence when necessary.&lt;&#x2F;p&gt;
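&lt;p&gt;One way to compare named terms up to $\alpha$-equivalence is to resolve each variable to the depth of its binder while walking both terms in lockstep. The Python sketch below illustrates the idea (the project itself is written in OCaml); the tuple encoding of terms and the names &lt;code&gt;alpha_equiv&lt;&#x2F;code&gt; and &lt;code&gt;_lookup&lt;&#x2F;code&gt; are assumptions made for this example, not the repository&#x27;s code.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# Assumed term encoding (hypothetical, for illustration only):
#   ("star",)                    the sort *
#   ("var", name)                a variable
#   ("app", fn, arg)             application
#   ("lam", name, type, body)    abstraction (\name: type) body
#   ("pi", name, type, body)     product [name: type] body


def _lookup(name, env):
    """Return ("bound", depth) for the innermost binder of name,
    or ("free", name) if no enclosing binder introduces it."""
    for depth, bound in enumerate(reversed(env)):
        if bound == name:
            return ("bound", depth)
    return ("free", name)


def alpha_equiv(s, t, env_s=(), env_t=()):
    """Compare two terms up to renaming of bound variables.

    env_s and env_t are stacks of binder names, innermost binder last.
    """
    if s[0] != t[0]:
        return False
    tag = s[0]
    if tag == "star":
        return True
    if tag == "var":
        return _lookup(s[1], env_s) == _lookup(t[1], env_t)
    if tag == "app":
        return (alpha_equiv(s[1], t[1], env_s, env_t)
                and alpha_equiv(s[2], t[2], env_s, env_t))
    # "lam" and "pi" bind their name in the body, not in the annotation.
    _, x, x_type, x_body = s
    _, y, y_type, y_body = t
    return (alpha_equiv(x_type, y_type, env_s, env_t)
            and alpha_equiv(x_body, y_body, env_s + (x,), env_t + (y,)))


star = ("star",)
id_a = ("lam", "A", star, ("lam", "x", ("var", "A"), ("var", "x")))
id_b = ("lam", "B", star, ("lam", "y", ("var", "B"), ("var", "y")))
print(alpha_equiv(id_a, id_b))  # prints True
```
&lt;&#x2F;pre&gt;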
&lt;p&gt;It was also tricky to fold $\beta$-equivalence into the typing rules because I didn&#x27;t know exactly how
to deal with some rules like reflexivity, symmetry, and transitivity. I wrote a function &lt;code&gt;beta_equiv&lt;&#x2F;code&gt;
that took $\Gamma$ and $M$ and tried to apply a single $\beta$-equivalence rule. Because all well-typed terms have
a normal form, I kept applying this function until no changes could be made.&lt;&#x2F;p&gt;
&lt;p&gt;Just like &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lambda-11235&#x2F;ttyped&quot;&gt;ttyped&lt;&#x2F;a&gt;, I created a simple REPL that I used for quick
testing to see the reduced value and type of various terms.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;For the evaluation, I decided to see what kinds of theorems I could prove from the Coq standard library.
I decided to choose theorems from the &lt;a href=&quot;https:&#x2F;&#x2F;coq.inria.fr&#x2F;library&#x2F;Coq.Init.Logic.html&quot;&gt;Logic&lt;&#x2F;a&gt; library, which
seemed like one of the simplest libraries. This had definitions for things like True and False (or Top and
Bottom respectively), propositional or and and, if and only if, etc. I was able to state and prove a few
theorems.&lt;&#x2F;p&gt;
&lt;p&gt;First we can make some definitions that the library does:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let False = [a:*]a in
let True = [a: *][I: a]a in
let not = [A: *][_: A] False in
let sum = (\A: *)(\B: *) [R: *][f: [_: A]R][g: [_: B]R]R in
let or_introl = (\A: *)(\B: *)(\x: A)(\R: *)(\f: [_: A]R)(\g: [_: B]R)(f x) in
let or_intror = (\A: *)(\B: *)(\y: B)(\R: *)(\f: [_: A]R)(\g: [_: B]R)(g y) in
...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These are the same definitions as in Coq, with slightly less elegant syntax, especially because
inductive types have to be encoded.&lt;&#x2F;p&gt;
&lt;p&gt;Now here are some proofs about &lt;code&gt;and&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let prod = (\A: *)(\B: *) [R: *] [_: [_: A][_: B]R] R in
let conj = (\A: *)(\B: *)(\x: A)(\y: B)(\R: *)(\f: [_: A][_: B]R)((f x) y) in

Theorem proj1 = [A: *][B: *][_:  ((prod A) B)]A with
    Proof (\A: *)(\B: *)(\p: ((prod A) B))((p A) (\x: A)(\y: B)x);

Theorem proj2 = [A: *][B: *][_:  ((prod A) B)]B with
    Proof (\A: *)(\B: *)(\p: ((prod A) B))((p B) (\x: A)(\y: B)y);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, &lt;code&gt;prod&lt;&#x2F;code&gt; represents propositional &lt;code&gt;and&lt;&#x2F;code&gt;, while &lt;code&gt;conj&lt;&#x2F;code&gt; is the constructor which produces a term
of that type. Then, Theorem &lt;code&gt;proj1&lt;&#x2F;code&gt; and &lt;code&gt;proj2&lt;&#x2F;code&gt; show that $\forall A, B. A \wedge B \implies A$ and
$\forall A, B. A \wedge B \implies B$. When we consider these terms as encodings of pairs, &lt;code&gt;proj1&lt;&#x2F;code&gt;
and &lt;code&gt;proj2&lt;&#x2F;code&gt; are simply the &lt;code&gt;fst&lt;&#x2F;code&gt; and &lt;code&gt;snd&lt;&#x2F;code&gt; functions like in OCaml.&lt;&#x2F;p&gt;
&lt;p&gt;Here are some more proofs relating to &amp;quot;if and only if&amp;quot;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let iff = (\X: *)(\Y: *) ((prod [_: X]Y) [_: Y]X) in

Theorem iff_refl = [A: *]((iff A) A) with
Proof (\A: *)((((conj [_: A]A) [_: A]A) (\x:A)x) (\x:A)x);

let swap = (\A1:*)(\B1:*)(\p1: ((prod A1) B1))
    ( ( ((conj B1) A1)
        (((proj2 A1) B1) p1)
      )
      (((proj1 A1) B1) p1)
    ) in

Theorem iff_sym = [A2: *][B2: *][_: ((iff A2) B2)]((iff B2) A2) with
Proof (\A2:*)(\B2:*)(\A_imp_B2: ((iff A2) B2))
    (((swap [_: A2]B2) [_: B2]A2) A_imp_B2);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;iff_refl&lt;&#x2F;code&gt; claims that if and only if is reflexive, while &lt;code&gt;iff_sym&lt;&#x2F;code&gt; claims that it is symmetric.
&lt;code&gt;iff_sym&lt;&#x2F;code&gt; was actually a case where bugs in substitution came up, so I just used different names for
abstractions&#x2F;products.&lt;&#x2F;p&gt;
&lt;p&gt;Interestingly, the last two theorems actually took about a second to typecheck, which is quite
surprising because Coq can typecheck these terms extremely quickly. This may be because Coq is based on
the Calculus of &lt;em&gt;Inductive&lt;&#x2F;em&gt; Constructions, which differs from CoC in that it adds inductive types and
a hierarchy of universes. This increased expressivity probably allows for faster implementations.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, I didn&#x27;t have more time to try proving more theorems, though the current evaluation
shows that at least relatively simple theorems can be proved.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hardest-parts-to-get-right&quot;&gt;Hardest Parts to Get Right&lt;&#x2F;h2&gt;
&lt;p&gt;Two of the hardest things to get right were substitution and $\beta$-equivalence. These were the
main reasons why I had bugs in my implementation. Additionally, debugging became extremely
difficult as the programs increased in size: so many rules were being applied that it was
hard to see exactly where things were going wrong. While CoC is a relatively simple language, some of
the examples in Coquand and Huet&#x27;s paper were very hard to follow, like the &lt;code&gt;inter&lt;&#x2F;code&gt; example. This made
it more difficult to get an intuitive understanding of how CoC worked.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;further-work&quot;&gt;Further Work&lt;&#x2F;h2&gt;
&lt;p&gt;With more time, I would have liked to implement a general framework for creating inductive types
in CoC, as explained by &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;%7Efp&#x2F;papers&#x2F;mfps89.pdf&quot;&gt;this paper&lt;&#x2F;a&gt;. With my implementation,
it is possible to write the encodings for inductive types like natural numbers and pairs, but we must
do the encoding by hand. One could also consider giving the language the style of an interactive theorem
prover by being able to manipulate proof&#x2F;program state.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Hopefully this project provides readers with a better understanding of how a dependently typed language
like CoC is implemented. It seems that things like System F$\omega$ get a lot of attention and are
well understood by many people, but CoC and dependent types aren&#x27;t as mainstream. Formal verification
seems to be increasingly important, so learning CoC may prove to be useful.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Finding Redundant Structures in Data Flow Graphs</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/dfg-cover/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/dfg-cover/</guid>
                <description>&lt;p&gt;&lt;em&gt;The source code for this project and our profiling analysis can be found
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;avanhatt&#x2F;dfg-coverings&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In a conventional &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Von_Neumann_architecture&quot;&gt;von Neumann architecture&lt;&#x2F;a&gt;, we might think of computation
at a high level as our computers faithfully carrying out a series of steps.
Like dominoes in a line, the program counter runs through each instruction once
its predecessor completes.&lt;&#x2F;p&gt;
&lt;img src=&quot;dominos.gif&quot; width=30%&#x2F;&gt;
&lt;p&gt;Of course, this mental model is far from the truth—in modern &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Out-of-order_execution&quot;&gt;out-of-order&lt;&#x2F;a&gt;
processors, instructions are aggressively reordered to take advantage of
multiple processing elements at once. The major caveat here is that reordering
must respect the original program&#x27;s &lt;em&gt;data flow&lt;&#x2F;em&gt;. That is, if an instruction
needs to use data that is generated by a previous instruction, then it cannot be
reordered to happen beforehand. From this perspective, we can think of
computation as dictating the flow of data through operations. Each instruction
is a node, and dependencies form edges that data flows along. These dependencies form
a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;&lt;em&gt;data flow graph&lt;&#x2F;em&gt; (DFG)&lt;&#x2F;a&gt; for the program.&lt;&#x2F;p&gt;
&lt;img src=&quot;pipes.gif&quot; width=30%&#x2F;&gt;
&lt;h2 id=&quot;go-with-the-flow-ocean&quot;&gt;Go with the flow 🌊&lt;&#x2F;h2&gt;
&lt;p&gt;Analyzing the data flow graphs of programs allows us to think about the &lt;em&gt;shape&lt;&#x2F;em&gt;
of the computation, independent of the literal order a programmer used
to specify it. In particular, two separate programs are more likely to share
data flow structure than literal source code redundancy (since many reorderings
can maintain the same data flow). Even within the same source program, shared
structure in the data flow graph may indicate core computational patterns.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;data-flow-graphs-for-computational-acceleration&quot;&gt;Data flow graphs for computational acceleration&lt;&#x2F;h3&gt;
&lt;p&gt;If our goal is to compile faster or more energy-efficient code, data flow graphs
can help show us where to focus. By identifying redundant subgraphs in the
structure of data flow graphs, we can find groupings of operations that we
expect to occur frequently enough to benefit from additional optimization
effort. What&#x27;s more, the shape of the subgraphs is also a signal for how
&lt;em&gt;useful&lt;&#x2F;em&gt; the acceleration might be: subgraphs that are wider, rather than simply
linear chains, indicate more opportunity for &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Granularity_(parallel_computing)#Fine-grained_parallelism&quot;&gt;&lt;em&gt;fine-grained parallelism&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;.
Our goals in this project are shaped by the domain of hardware acceleration with
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Heterogeneous_computing&quot;&gt;&lt;em&gt;heterogeneous computing&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, where a compiler&#x27;s goal is to target multiple
processors, each with differing strengths and weaknesses.&lt;&#x2F;p&gt;
&lt;p&gt;For this project, we build on the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&quot;&gt;LLVM compiler infrastructure&lt;&#x2F;a&gt; to find
redundant structures in programs&#x27; static data flow graphs. Our goal is to find a
fixed number of subgraph structures that occur the most frequently (that is,
cover the highest number of instructions) throughout the program. We focus on
finding candidate subgraphs with high frequency, and leave analysis and
heterogeneous compilation of those subgraphs to later work.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;building-data-flow-graphs-from-llvm&quot;&gt;Building data flow graphs from LLVM&lt;&#x2F;h2&gt;
&lt;p&gt;Data flow graphs exist at multiple levels of abstraction in a compiler
toolchain, and there are trade-offs to targeting any particular choice.&lt;&#x2F;p&gt;
&lt;p&gt;First, data flow graphs can either represent a program &lt;em&gt;statically&lt;&#x2F;em&gt;, purely from
the program&#x27;s source code, or &lt;em&gt;dynamically&lt;&#x2F;em&gt;, from a program execution trace. A
static DFG has a one-to-one relation to the source code: each operation and its
dependencies are directly translated. The control flow of the program exists
only implicitly: if a data value&#x27;s flow depends on the branching structure of
the program, the DFG would have back edges and cycles. A dynamic DFG captures a
single trace throughout the program, where operations are repeated each time
they are executed. In this case, the data flow graph remains acyclic (with
values only flowing &amp;quot;down&amp;quot;), and loops in the control flow repeat in the
subgraph for each time the loop is executed. However, dynamic data flow graphs
only represent a single execution of the program and may not even cover the
full program behavior. They also may be infeasible to generate ahead of time
for long-running applications, and they tend in practice to be so large
as to limit analysis to fragments of the full dynamic DFGs.&lt;&#x2F;p&gt;
&lt;p&gt;In addition, DFGs can target either the &lt;em&gt;intermediate representation&lt;&#x2F;em&gt; level,
with LLVM-level operations, or at the &lt;em&gt;machine code&lt;&#x2F;em&gt; level, with operations
corresponding to the exact instruction set architecture. The machine code
data flow graph corresponds more directly to the program&#x27;s actual execution, but
is not as general across different targets.&lt;&#x2F;p&gt;
&lt;p&gt;For this project, we use LLVM to target the static DFG at the intermediate
representation level of abstraction. LLVM translates the program source to
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;&lt;em&gt;static single assignment (SSA)&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; form, where every variable name can only
be assigned to once. Because
LLVM&#x27;s in-memory intermediate representation stores pointers to instructions&#x27;
operands, we can build a program&#x27;s static data flow graph by inserting an edge
from each of an instruction&#x27;s operands to that instruction. We narrow the project&#x27;s scope
to only consider acyclic subgraphs by considering subgraphs only within basic
block boundaries, which lack branching control flow.&lt;&#x2F;p&gt;
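&lt;p&gt;The edge-building step can be sketched in a few lines of Python. This is a simplified model rather than the project&#x27;s LLVM code: instructions are plain tuples instead of LLVM objects, and the function name &lt;code&gt;build_block_dfgs&lt;&#x2F;code&gt; and the dictionary layout are invented for the example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def build_block_dfgs(instructions):
    """Build one data flow graph per basic block.

    instructions: list of (result, opcode, operands, block) tuples in SSA
    form, a hypothetical stand-in for LLVM's in-memory IR.
    Returns {block: {"opcode": {node: opcode}, "edges": {(operand, user)}}}.
    """
    defined_in = {result: block for result, _, _, block in instructions}
    graphs = {}
    for result, opcode, operands, block in instructions:
        graph = graphs.setdefault(block, {"opcode": {}, "edges": set()})
        graph["opcode"][result] = opcode
        for operand in operands:
            # Keep only edges whose source is defined in the same basic
            # block; constants, arguments, and cross-block values are dropped.
            if defined_in.get(operand) == block:
                graph["edges"].add((operand, result))
    return graphs


program = [
    ("t0", "mul", ["a", "b"], "entry"),
    ("t1", "add", ["t0", "c"], "entry"),
    ("t2", "srem", ["t1", "d"], "entry"),
    ("t3", "shl", ["t2", "one"], "loop"),   # t2 comes from another block
    ("t4", "add", ["t3", "t3"], "loop"),
]
dfgs = build_block_dfgs(program)
print(sorted(dfgs["entry"]["edges"]))  # prints [('t0', 't1'), ('t1', 't2')]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Because edges never cross a block boundary in this model, each per-block graph is acyclic as long as the input is in SSA form and contains no &lt;code&gt;phi&lt;&#x2F;code&gt;-style self references.&lt;&#x2F;p&gt;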
&lt;h2 id=&quot;matching-fixed-dfg-stencils&quot;&gt;Matching fixed DFG stencils&lt;&#x2F;h2&gt;
&lt;p&gt;To begin, let&#x27;s imagine we already have some oracle that has given us a great
candidate subgraph (which we&#x27;ll call a &lt;em&gt;stencil&lt;&#x2F;em&gt;), and our job is to find all
the redundant instantiations of that stencil. If we consider a large program
DFG &lt;code&gt;G&lt;&#x2F;code&gt; and a smaller stencil DFG &lt;code&gt;H&lt;&#x2F;code&gt;, the task is to find as many subgraph
isomorphisms of &lt;code&gt;H&lt;&#x2F;code&gt; in &lt;code&gt;G&lt;&#x2F;code&gt; as possible. Here, the larger program DFG &lt;code&gt;G&lt;&#x2F;code&gt; is generated
directly from the LLVM in-memory representation as described above, but does not
include edges across control flow boundaries. Rather, &lt;code&gt;G&lt;&#x2F;code&gt; is a collection of DFG
components per basic block. In addition, we focus on operations that consume and
produce values directly (such as arithmetic and shift operations) rather than
those that read or write from memory or modify control flow (&lt;code&gt;load&lt;&#x2F;code&gt;, &lt;code&gt;store&lt;&#x2F;code&gt;,
&lt;code&gt;branch&lt;&#x2F;code&gt;, and &lt;code&gt;return&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;While subgraph isomorphism is a notoriously hard problem, it is also a
common one, and we make heavy use of out-of-the-box graph algorithms. We employ
the &lt;a href=&quot;https:&#x2F;&#x2F;networkx.github.io&#x2F;documentation&#x2F;stable&#x2F;reference&#x2F;algorithms&#x2F;isomorphism.html&quot;&gt;&lt;code&gt;networkx.isomorphism&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;  Python package, which provides tools for iterating
over matches (subgraph isomorphisms) between the program DFG &lt;code&gt;G&lt;&#x2F;code&gt; and a stencil DFG
&lt;code&gt;H&lt;&#x2F;code&gt;. There are two features of a matching which distinguish it from a subgraph isomorphism:
(1) nodes must be matched to nodes of the same &lt;code&gt;opcode&lt;&#x2F;code&gt;, which technically makes the problem
a &lt;em&gt;colored&lt;&#x2F;em&gt; subgraph isomorphism (and fortunately an easier one), and (2) we need to select
mutually exclusive subgraphs, where each node can be assigned to at most one
isomorphic instance (to model actual hardware acceleration).
In the case of a single stencil, we use a greedy heuristic to randomly choose isomorphisms until
there are no longer any remaining choices that are mutually exclusive. When
trying to match multiple stencils, our heuristic tries to find the largest
stencils first. We describe this search process in more detail in our
implementation section.&lt;&#x2F;p&gt;
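&lt;p&gt;The greedy selection step can be sketched as follows, assuming the candidate matches have already been enumerated (for example, by a subgraph-isomorphism matcher). The function name, the &lt;code&gt;largest_first&lt;&#x2F;code&gt; flag, and the representation of a match as a set of node names are assumptions for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import random


def greedy_disjoint_matches(matches, seed=0, largest_first=False):
    """Pick matches in random order, skipping any that overlap a pick.

    matches: list of node collections, one per subgraph isomorphism found.
    Returns the chosen matches as frozensets; no node appears in two of them.
    """
    candidates = [frozenset(m) for m in matches]
    random.Random(seed).shuffle(candidates)  # random order, reproducible
    if largest_first:
        # Stable sort: equally sized matches keep their shuffled order.
        candidates.sort(key=len, reverse=True)
    used = set()
    chosen = []
    for match in candidates:
        if used.isdisjoint(match):
            chosen.append(match)
            used |= match
    return chosen


matches = [{"t0", "t1"}, {"t1", "t2"}, {"t3", "t4"}]
chosen = greedy_disjoint_matches(matches)
print(len(chosen))  # prints 2
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The first two candidates share &lt;code&gt;t1&lt;&#x2F;code&gt;, so only one of them can be kept; which one depends on the shuffle. A greedy pass like this gives a maximal selection, not necessarily a maximum one.&lt;&#x2F;p&gt;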
&lt;p&gt;We started our testing by hand-picking chains of instructions found in our
benchmarking code. From the &lt;a href=&quot;https:&#x2F;&#x2F;embench.org&quot;&gt;Embench&lt;&#x2F;a&gt; embedded programming benchmarking suite,
we used &lt;code&gt;matmult-int.c&lt;&#x2F;code&gt; to choose a few common chains of operations:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;mul&lt;&#x2F;code&gt; → &lt;code&gt;add&lt;&#x2F;code&gt; → &lt;code&gt;srem&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;shl&lt;&#x2F;code&gt; → &lt;code&gt;add&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;sdiv&lt;&#x2F;code&gt; → &lt;code&gt;mul&lt;&#x2F;code&gt; → &lt;code&gt;add&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As we expected, these small human-selected stencil subgraphs performed
especially poorly. On the original program, &lt;code&gt;matmult-int.c&lt;&#x2F;code&gt;, these stencils
matched less than 4% of instructions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;identifying-common-dfg-stencils&quot;&gt;Identifying common DFG stencils&lt;&#x2F;h2&gt;
&lt;p&gt;Of course, finding the common subgraphs by hand is pretty antithetical to a
reasonable approach to compiling code. Our real goal is to automate the process
of finding the common DFG stencils to accelerate.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;formal-description-of-the-task&quot;&gt;Formal description of the task&lt;&#x2F;h3&gt;
&lt;p&gt;In this context of ignoring control flow and considering data flows within basic
blocks, we can look at the problem purely graph-theoretically. For a single trace
through the program, the data flow graph $G$ is acyclic, and we would like to
cover as much of it as possible with subgraphs corresponding to the stencils
that we accelerate. Statically, we do not know what the final data flow graph
is, but we do know that we will be able to assemble one by connecting dangling
edges from control-flow-free components: basic blocks.&lt;&#x2F;p&gt;
&lt;p&gt;We would like to find a small collection of graph components $\mathcal H = \{H_1, \ldots, H_k\}$, which we can use to replace parts of and accelerate programs having basic blocks $\mathcal G = \{G_1,\ldots,G_n\}$, that maximizes the total saved time:&lt;&#x2F;p&gt;
&lt;p&gt;$$\mathcal S_{\mathcal H}(\mathcal G) := \max_{\mathcal C \in \text{Cov}(\mathcal G, \mathcal H)}~ \sum_{G \in \mathcal G} w_G \cdot  \sum_{H \in \mathcal C_G} f_H \cdot |H|$$&lt;&#x2F;p&gt;
&lt;p&gt;where:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$\text{Cov}(\mathcal G, \mathcal H)$ is the set of all valid (partial)
coverings of the basic blocks in which each node is covered by at most one stencil, that is, injective graph
morphisms $\varphi: (\cup \mathcal G) \to \cup \mathcal H$.&lt;&#x2F;li&gt;
&lt;li&gt;$\mathcal C_G$ is the restriction of the total covering $\mathcal C$ to
the particular basic block graph $G$.&lt;&#x2F;li&gt;
&lt;li&gt;$w_G$ is independent of $\mathcal H$ and proportional to the expected
number of times $G$ is executed.&lt;&#x2F;li&gt;
&lt;li&gt;$f_H$ is the expected speedup factor from accelerating the component $H$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
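&lt;p&gt;As a rough illustration of this objective (not part of our tooling), the sketch below evaluates the weighted sum for one fixed covering; the true $\mathcal S_{\mathcal H}(\mathcal G)$ is the maximum of this quantity over all valid coverings. The block names, stencil names, weights, and speedup factors are made-up values, and we assume $|H|$ is the stencil&#x27;s node count.&lt;&#x2F;p&gt;

```python
# Hypothetical inputs: every name and number here is illustrative.
block_weight = {"loop_body": 100.0, "epilogue": 1.0}    # w_G
stencil_size = {"mul_add": 2, "shl_add_add": 3}         # |H|, assumed node count
stencil_speedup = {"mul_add": 0.5, "shl_add_add": 0.4}  # f_H

# One fixed covering C: for each block, the stencils matched in it.
# A stencil listed twice was matched at two disjoint places.
covering = {
    "loop_body": ["mul_add", "mul_add", "shl_add_add"],
    "epilogue": ["mul_add"],
}


def saved_time(covering, block_weight, stencil_size, stencil_speedup):
    """Sum over blocks G of w_G times the sum over matched H of f_H * |H|."""
    total = 0.0
    for block, matches in covering.items():
        inner = sum(stencil_speedup[h] * stencil_size[h] for h in matches)
        total += block_weight[block] * inner
    return total


print(saved_time(covering, block_weight, stencil_size, stencil_speedup))
```

&lt;p&gt;Because the hot block dominates the weights, almost all of the savings in this toy example come from matches inside &lt;code&gt;loop_body&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;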
&lt;p&gt;Now supposing that $f_H$ was roughly constant, we could achieve the maximum savings by trivially choosing $\mathcal H := \mathcal G$. There are a few problems with this:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$|\mathcal H|$ is large; there are many of these sub-graphs, which makes the search process substantially less efficient.&lt;&#x2F;li&gt;
&lt;li&gt;Each $H_i \in \mathcal H$ is also large, making the specialized component more expensive.&lt;&#x2F;li&gt;
&lt;li&gt;There is now a dependency between $\mathcal H$ and $\mathcal G$, and so we need to know our program in order to build the components we use to accelerate.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The third issue is the most important; to a first approximation, the first two are heuristics which help solve it.
This intuition suggests an alternative framing as a statistical learning problem: you&#x27;re given some training data in the form of pieces of program DFGs ($\mathcal G$), and the objective is to find a collection of snippets $\mathcal H$ that not only covers this program, but also can be re-configured to accelerate other programs in the future.
We might imagine that there&#x27;s some
underlying distribution $\mathtt{Programs}$ of programs that people write, in which case
the work we present here can be seen as a solution to the following optimization problem:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \arg\max_{\mathcal H}\left( \mathop{\mathbb E}\limits_{\mathcal G\sim \texttt{Programs}}~ \mathcal S_{\mathcal H}(\mathcal G) - \text{Cost}(\mathcal H) \right)$$&lt;&#x2F;p&gt;
&lt;p&gt;where $\text{Cost}(\mathcal H)$ is the additional cost incurred by choosing to
accelerate the stencils in $\mathcal H$, which is higher for larger
subgraphs. Note that in this presentation, we can no longer choose $\mathcal H$ based on $\mathcal G$,
which resolves issue (3); issues (1) and (2) are partially incorporated into the $\text{Cost}(\mathcal H)$
term, effectively regularizing the search space by imposing a cost on larger or more numerous subgraphs. Just like $\ell^1$ regularization, we are explicitly adding a preference for short descriptions, to avoid overfitting to a single program (like setting $\mathcal H:= \mathcal G$ as discussed above).&lt;&#x2F;p&gt;
&lt;p&gt;Rather than solve this optimization problem in closed form, we explicitly look for
a fixed number of subgraphs (issue 1), each of a pre-selected size (issue 2). In doing so,
we are using the regularization knobs to avoid overfitting to the program at hand,
which is necessary because &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;dfg-cover&#x2F;#stencil-generalization&quot;&gt;our held-out evaluation (below)&lt;&#x2F;a&gt; shows
that generalizing to unseen programs is considerably more difficult.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation-strategy&quot;&gt;Implementation strategy&lt;&#x2F;h3&gt;
&lt;p&gt;We first implement an LLVM module pass that writes out a JSON representation of
the DFG. Our Python module then explores candidate subgraphs using a combination
of our heuristics and out-of-the-box graph isomorphism tooling.&lt;&#x2F;p&gt;
&lt;p&gt;We implemented and compared two separate algorithms for finding the stencils
from static DFGs generated per-basic-block from LLVM programs.&lt;&#x2F;p&gt;
&lt;p&gt;In both cases, the general idea is to iterate over the DFG&#x27;s connected
components, successively building larger subgraphs. Our first approach is
node-based, and exhaustively considers node subsets up to some size. The second
approach is edge-based, and uses smaller subgraph components to build graphs of
the desired size. In preliminary experiments we found the edge-based approach to
be slightly faster, so we used that approach for our evaluation.&lt;&#x2F;p&gt;
&lt;p&gt;More specifically, the edge-based subgraph stencil generation iteratively grows connected subgraphs.
For each $k$-edge subgraph in the DFG, the algorithm considers adding every edge in the DFG, keeping the new ($k+1$)-edge subgraphs that are connected.
It then finds which of these subgraphs are isomorphic to each other, and constructs a canonical stencil name for each isomorphic set.
These ($k+1$)-edge subgraphs are used for the next iteration of the algorithm.&lt;&#x2F;p&gt;
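&lt;p&gt;A minimal sketch of this edge-based growth, under some simplifying assumptions: the DFG is a plain list of directed edges with an opcode label per node, a new edge is kept when it touches the current subgraph (which preserves connectivity), and isomorphic subgraphs are grouped once at the end by a brute-force canonical form that is only practical for very small stencils. The function names are hypothetical, and this is not the graph isomorphism tooling our Python module uses.&lt;&#x2F;p&gt;

```python
from itertools import permutations


def canonical_name(edges, op):
    """Brute-force canonical form of a small opcode-labeled directed subgraph."""
    nodes = sorted({n for e in edges for n in e})
    # Try every node ordering and keep the lexicographically smallest encoding.
    return min(
        (
            tuple(op[n] for n in perm),
            tuple(sorted((perm.index(u), perm.index(v)) for u, v in edges)),
        )
        for perm in permutations(nodes)
    )


def grow(subgraphs, dfg_edges):
    """Extend each connected k-edge subgraph by one adjacent DFG edge."""
    grown = set()
    for sub in subgraphs:
        touched = {n for e in sub for n in e}
        for edge in dfg_edges:
            if edge not in sub and (edge[0] in touched or edge[1] in touched):
                grown.add(sub | {edge})
    return grown


def stencils(dfg_edges, op, num_edges):
    """Group all connected num_edges-edge subgraphs by canonical stencil name."""
    subgraphs = {frozenset([e]) for e in dfg_edges}
    for _ in range(num_edges - 1):
        subgraphs = grow(subgraphs, dfg_edges)
    groups = {}
    for sub in subgraphs:
        groups.setdefault(canonical_name(sub, op), []).append(sub)
    return groups


# Toy DFG: two independent mul -> add -> srem chains.
op = {"a": "mul", "b": "add", "e": "srem", "c": "mul", "d": "add", "f": "srem"}
dfg_edges = [("a", "b"), ("b", "e"), ("c", "d"), ("d", "f")]
for name, members in stencils(dfg_edges, op, 2).items():
    print(name, len(members))
```

&lt;p&gt;On the toy DFG, both chains collapse into a single two-edge stencil with two matches, which is the kind of redundancy the search is looking for.&lt;&#x2F;p&gt;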
&lt;h4 id=&quot;mutually-exclusive-matches&quot;&gt;Mutually exclusive matches&lt;&#x2F;h4&gt;
&lt;p&gt;To generate a &lt;em&gt;valid&lt;&#x2F;em&gt; choice of subgraph stencils, we need more than simply an
enumeration of all subgraphs in a program and a way to match them: we also have
to make sure the matches don&#x27;t step on one another&#x27;s toes—that is, we need to
throw out matches until each instruction is only covered by at most a single
component.&lt;&#x2F;p&gt;
&lt;p&gt;Finding the optimal set of matches is difficult: the problem is related to weighted interval
scheduling (which &lt;a href=&quot;https:&#x2F;&#x2F;courses.cs.washington.edu&#x2F;courses&#x2F;cse521&#x2F;13wi&#x2F;slides&#x2F;06dp-sched.pdf&quot;&gt;can be solved with dynamic programming&lt;&#x2F;a&gt;
in $O(n \log n)$ time), but on a general directed graph, we get an exponential
factor in the branching coefficient. Rather than solve this problem optimally in
the general case, we implement a greedy biggest-first strategy, and focus
instead on searching for collections of matches which have higher coverage in
the first place.&lt;&#x2F;p&gt;
&lt;p&gt;After generating possible subgraph stencils, we choose a combination that achieves the highest static coverage of the DFG using only mutually exclusive matches.&lt;&#x2F;p&gt;
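&lt;p&gt;The greedy biggest-first idea can be sketched as follows. This is an illustrative version with hypothetical names, assuming each candidate match is given as a stencil name plus the set of DFG nodes it would cover; it is not guaranteed to find the optimal cover.&lt;&#x2F;p&gt;

```python
def greedy_cover(matches):
    """Pick mutually exclusive matches, preferring those that cover more nodes.

    `matches` is a list of (stencil_name, node_set) pairs.
    Returns the chosen matches and the set of covered nodes.
    """
    covered = set()
    chosen = []
    # sorted() is stable, so equally sized matches keep their input order.
    for name, nodes in sorted(matches, key=lambda m: -len(m[1])):
        if covered.isdisjoint(nodes):
            chosen.append((name, nodes))
            covered |= set(nodes)
    return chosen, covered


# Toy example: integers stand in for instructions in one basic block.
matches = [("A", {1, 2}), ("B", {2, 3, 4}), ("C", {5, 6}), ("A", {4, 5})]
chosen, covered = greedy_cover(matches)
print([name for name, _ in chosen], sorted(covered))
```

&lt;p&gt;Here the three-node match is taken first, which rules out both matches of stencil &lt;code&gt;A&lt;&#x2F;code&gt; because each shares an instruction with an already chosen match.&lt;&#x2F;p&gt;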
&lt;h3 id=&quot;search&quot;&gt;Search&lt;&#x2F;h3&gt;
&lt;p&gt;Ultimately, we do not need to search the space exhaustively if we have reasonable heuristics that might cause us to believe that we&#x27;re going in the right direction with certain stencils.
We can then do our search traversal in a different order, guided by the objective function.
This can be done in the form of a beam search: we only keep around the $k$ best subgraphs in the search frontier, and at each step try to expand one to a random neighboring node.
Though not used in our evaluation, beam search can speed up generating larger subgraphs in the future.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We primarily look at what coverage (the percentage of
instructions matched by some subgraph out of all instructions) we can get on a
given source program. We consider both static and dynamic coverage. In both
cases, 100% coverage is impossible because we exclude instructions with
control-flow implications (&lt;code&gt;phi&lt;&#x2F;code&gt;, &lt;code&gt;branch&lt;&#x2F;code&gt;, &lt;code&gt;return&lt;&#x2F;code&gt;) and those that read from or
write to memory (&lt;code&gt;load&lt;&#x2F;code&gt; and &lt;code&gt;store&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;dynamic-coverage-instrumentation&quot;&gt;Dynamic coverage instrumentation&lt;&#x2F;h3&gt;
&lt;p&gt;To generate dynamic coverage information, we instrument our LLVM pass. The pass
adds an annotation to each matched instruction specifying which stencil
covers it, along with the node isomorphism. Because basic blocks execute
atomically (ignoring exceptions), we generate the count of matched and total
instructions per block at compile time. We then link a C profiling module with
state for these dynamic counters. At the end of each basic block, our pass adds a call
to a function that increments the profiling counters by the
statically determined amounts. We add a final function call to LLVM&#x27;s global
destructors list to write the final profiling values to both standard output and an
auxiliary file. For convenience, we also save the static coverage the same way.&lt;&#x2F;p&gt;
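&lt;p&gt;To make the bookkeeping concrete, here is a small model of how the two coverage numbers relate, written in Python rather than as the C profiling module itself. The block names, instruction counts, and execution counts are invented for illustration.&lt;&#x2F;p&gt;

```python
# Hypothetical per-block numbers. The (matched, total) instruction counts are
# known at compile time; the execution counts would come from a profiling run.
static_counts = {"entry": (0, 4), "loop": (6, 10), "exit": (1, 2)}
executions = {"entry": 1, "loop": 1000, "exit": 1}


def static_coverage(static_counts):
    """Matched instructions over total instructions, each block counted once."""
    matched = sum(m for m, _ in static_counts.values())
    total = sum(t for _, t in static_counts.values())
    return matched / total


def dynamic_coverage(static_counts, executions):
    """Same ratio, with each block weighted by how often it executed."""
    matched = sum(m * executions[b] for b, (m, _) in static_counts.items())
    total = sum(t * executions[b] for b, (_, t) in static_counts.items())
    return matched / total


print(static_coverage(static_counts))
print(dynamic_coverage(static_counts, executions))
```

&lt;p&gt;In this toy example the dynamic figure is pulled toward the hot loop&#x27;s own ratio, which is why static and dynamic coverage can differ in either direction.&lt;&#x2F;p&gt;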
&lt;h3 id=&quot;embench-evaluation&quot;&gt;Embench evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;We chose to use the &lt;a href=&quot;https:&#x2F;&#x2F;embench.org&quot;&gt;Embench&lt;&#x2F;a&gt; embedded programming benchmarking suite because
it represents a small but fairly diverse set of programs that can be easily
compiled and executed with LLVM tooling.&lt;&#x2F;p&gt;
&lt;p&gt;For each benchmark, we generated all allowable three-node subgraphs and chose
the two that statically covered the most instructions in that benchmark&#x27;s DFG.
Three-node stencil generation took between a few seconds and 17 minutes for each benchmark
on a 2017 MacBook Pro (2.3 GHz Intel Core i5, 8 GB RAM).&lt;&#x2F;p&gt;
&lt;p&gt;The following graphs show static and dynamic code coverage for each benchmark.
Note that each benchmark&#x27;s coverage was calculated with the subgraphs generated
from that benchmark (and coverages are deterministic, rendering error bars unnecessary).&lt;&#x2F;p&gt;
&lt;img src=&quot;embench-profiling_best-stencil-combos-per-benchmark_half-1.png&quot; width=100%&#x2F;&gt;
&lt;img src=&quot;embench-profiling_best-stencil-combos-per-benchmark_half-2.png&quot; width=100%&#x2F;&gt;
&lt;p&gt;As stated above, a coverage of 100% is elusive because of our restrictions on
what instructions we consider. An interesting aspect of this profiling data
is that, as expected, static and dynamic coverage correlate, but which is higher
depends on the particular benchmark. From smaller scale experimentation, the
coverage also varies based on the compiler flags used to generate the original
LLVM IR. In particular, running at a more aggressive &lt;code&gt;-O3&lt;&#x2F;code&gt; optimization level
(rather than the &lt;code&gt;-O1&lt;&#x2F;code&gt; used here) changes the coverage metrics as loops are
statically unrolled, introducing more redundancy.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;embench-case-study-nettle-256sha&quot;&gt;Embench case study: &lt;code&gt;nettle-sha256&lt;&#x2F;code&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Digging into &lt;code&gt;nettle-sha256&lt;&#x2F;code&gt;, the benchmark with the best coverage, we can see
that the following combination of three-node subgraph stencils was chosen out of 66
possible three-node subgraphs:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Stencil&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Number of static matches&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;lshr&lt;&#x2F;code&gt; → &lt;code&gt;or&lt;&#x2F;code&gt; ← &lt;code&gt;shl&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;208&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;xor&lt;&#x2F;code&gt; → &lt;code&gt;xor&lt;&#x2F;code&gt; → &lt;code&gt;add&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;80&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Here are a close-up and a closer-up (marked with a heavy black rectangle) view
of the DFG, with vertices matched to a stencil shown in bright red.
The latter shows three matches of the first stencil and one of the second.&lt;&#x2F;p&gt;
&lt;img src=&quot;nettle-sha256-cropped.png&quot; width=100%&#x2F;&gt;
&lt;img src=&quot;nettle-sha256-cropped-zoomed.png&quot; width=100% style=&#x27;border:2px solid black;&#x27;&#x2F;&gt;
&lt;h3 id=&quot;stencil-generalization&quot;&gt;Stencil generalization&lt;&#x2F;h3&gt;
&lt;p&gt;We also explored generating stencils from one benchmark and testing how well they generalized to the other benchmarks.
The three-node stencils generated and chosen from &lt;code&gt;minver&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;fcmp&lt;&#x2F;code&gt; → &lt;code&gt;select&lt;&#x2F;code&gt; ← &lt;code&gt;fsub&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;getelementptr&lt;&#x2F;code&gt; ← &lt;code&gt;pointer&lt;&#x2F;code&gt; → &lt;code&gt;getelementptr&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;were found at least once in all but five of the other Embench benchmarks, producing dynamic coverage ratios between 0 and 7.68% (with an average of 1.45% ± 2.35).&lt;&#x2F;p&gt;
&lt;p&gt;Stencils generated from other benchmarks achieved even less coverage on the rest of the benchmarks. For example, stencils generated from &lt;code&gt;edn&lt;&#x2F;code&gt;, &lt;code&gt;libwiki&lt;&#x2F;code&gt;, and &lt;code&gt;nbody&lt;&#x2F;code&gt; were not matched in any other benchmarks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ongoing-directions&quot;&gt;Ongoing directions&lt;&#x2F;h2&gt;
&lt;p&gt;While finding redundancies in DFGs within each basic block is a good initial
approach, this project could be extended in several directions.&lt;&#x2F;p&gt;
&lt;p&gt;We could build on existing literature in &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Extended_basic_block&quot;&gt;extended basic blocks&lt;&#x2F;a&gt; to find
subgraphs that &lt;em&gt;speculatively&lt;&#x2F;em&gt; occur. That is, in extended basic blocks, we
consider control flows that are likely to jump from one block to another in the
common case, and only fall back to different branches in the case that our guess
of the next block was wrong. In the context of hardware acceleration, we can
imagine building accelerators that handle these larger speculative subgraphs
when possible, and fall back to slower CPU execution if the control flow
differs.&lt;&#x2F;p&gt;
&lt;p&gt;In addition, it would be interesting to compare this project against dynamic
data flow graphs. For example, the &lt;a href=&quot;http:&#x2F;&#x2F;citeseerx.ist.psu.edu&#x2F;viewdoc&#x2F;download;jsessionid=7CE631B431BCCBA459061BC458D53E8F?doi=10.1.1.63.2083&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Redux&lt;&#x2F;a&gt; paper essentially introduced the
formulation of dynamic data flow graphs as we describe them here, and outlined
how to efficiently generate them. From the perspective of hardware
acceleration, the &lt;a href=&quot;https:&#x2F;&#x2F;ieeexplore.ieee.org&#x2F;document&#x2F;8509181&quot;&gt;RADISH&lt;&#x2F;a&gt; project (“Iterative Search for Reconfigurable
Accelerator Blocks with a Compiler in the Loop”) uses Python wrappers to
generate dynamic data flow graphs, and heuristic genetic algorithms to &amp;quot;fuse&amp;quot;
similar dynamic graphs together.&lt;&#x2F;p&gt;
&lt;p&gt;Like RADISH, we could extend our application to target &lt;em&gt;groups&lt;&#x2F;em&gt; of applications
instead of single programs. The scale of this undertaking would require more
clever heuristics than our current search strategies, but would ideally help us
find more general subgraphs to accelerate.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the impact of this project could be demonstrated more clearly by
evaluating our subgraph identification with actual computational acceleration.
In particular, we hope this strategy will prove useful in conjunction with other
work that uses compile-time analysis for heterogeneous targets.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Composable Brili Extensions</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/extensible/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/extensible/</guid>
                <description>&lt;h2 id=&quot;the-goal&quot;&gt;The Goal&lt;&#x2F;h2&gt;
&lt;p&gt;The first project for this course was to extend the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&quot;&gt;Bril language&lt;&#x2F;a&gt; in any way that we wanted. The initial Bril codebase included a parser from a textual representation of Bril to an AST of the source program encoded in JSON. It also included a program that transforms TypeScript programs into Bril programs and an interpreter for the language once it was in JSON format. Some projects added new standalone extensions to the codebase such as a &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;brildb&#x2F;&quot;&gt;Bril debugger&lt;&#x2F;a&gt; or a &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bril-c-backend&#x2F;&quot;&gt;Bril to C translator&lt;&#x2F;a&gt;. As long as these projects don&#x27;t make any changes to the underlying grammar or representation and only add new files to the existing codebase, then they can be relatively easily merged into the codebase.&lt;&#x2F;p&gt;
&lt;p&gt;However, some projects created extensions to the Bril language that added new operations, new state, and new control flow to the language. These projects include adding &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;recordtypes&#x2F;&quot;&gt;record types&lt;&#x2F;a&gt;, &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;manually-managed-memory&#x2F;&quot;&gt;dynamically allocated memory&lt;&#x2F;a&gt;, and &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;function-calls&#x2F;&quot;&gt;function calls&lt;&#x2F;a&gt; to the language. All of these projects make changes to the grammar of the language by adding new types of operations. Additionally, all of the projects modify the interpreter to support their new operations. Each of these changes adds state to the evaluation function, and the changes conflict with one another.&lt;&#x2F;p&gt;
&lt;p&gt;In this project, I designed and implemented a composable language extension system for the Bril interpreter that allows developers to independently create language extensions that can later be composed together. For example, consider an extension A that adds a new operation &#x27;a&#x27; to the language and an extension B that adds a new operation &#x27;b&#x27; to the language. Once these two extensions are merged into the codebase, other developers should be able to compose extensions A and B together to create an interpreter that supports both operations &#x27;a&#x27; and &#x27;b&#x27;. Furthermore, the developers of extension A and extension B should ideally never have to know about each other&#x27;s existence.&lt;&#x2F;p&gt;
&lt;p&gt;This is different from an extensible language framework, such as &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;andru&#x2F;papers&#x2F;polyglot.pdf&quot;&gt;Polyglot&lt;&#x2F;a&gt;, because language extensions in Polyglot explicitly specify the language that they are extending. With composable extensions, the goal is that the language extension simply defines a small set of assumptions about the abstract language that it is extending and then implements its extensions based on those assumptions. Then, any base language that satisfies those assumptions can be extended by the extension.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-implementation&quot;&gt;The Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;I implemented the extensible interpreter in TypeScript. I first identified some common datatypes that all extensions would use. The first was the base instruction interface. I decided that all instructions would be identifiable by an operation field called &lt;code&gt;op&lt;&#x2F;code&gt; (sometimes called a discriminant):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;interface &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  op: string;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Next, I decided that all functions in Bril would have a name and a list of instructions or labels:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;interface &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseFunction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  name: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Ident&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  instrs: (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Label&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)[];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One interesting thing to note is that the &lt;code&gt;BaseFunction&lt;&#x2F;code&gt; interface is generic over the type of instructions that it contains. Finally, a Bril program contains a list of functions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;interface &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseProgram&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseFunction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  functions: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These three interfaces define the most basic structure that a Bril program can have. All extensions must be extensions to Bril that respect this generic language definition. Put another way, every Bril extension must work on a subtype of the &lt;code&gt;BaseProgram&lt;&#x2F;code&gt; interface parameterized over some &lt;code&gt;I&lt;&#x2F;code&gt; and &lt;code&gt;F&lt;&#x2F;code&gt; that extend &lt;code&gt;BaseInstruction&lt;&#x2F;code&gt; and &lt;code&gt;BaseFunction&lt;&#x2F;code&gt; respectively.&lt;&#x2F;p&gt;
&lt;p&gt;Next, I defined a generic evaluation function for evaluating instructions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type evalInstr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt; = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(instr: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;programState: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;functionState: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function is parameterized over 4 types. The first type, &lt;code&gt;A&lt;&#x2F;code&gt;, is the action type. Every evaluation function needs to generate an action that specifies which instruction to execute next. The second and third types, &lt;code&gt;PS&lt;&#x2F;code&gt; and &lt;code&gt;FS&lt;&#x2F;code&gt;, represent the program state and the function state respectively. The program state type is meant to represent the entire state of the currently running Bril program (so think things like global variables). The function state holds only function local state for the currently executing function (like the values of local variables). The final parameter is &lt;code&gt;I&lt;&#x2F;code&gt;, which is the type of the instruction the evaluation function operates over.&lt;&#x2F;p&gt;
&lt;p&gt;Each extension defines an &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function that specifies how to update the program and function state for the instructions it adds or extends, as well as which action to generate for those instructions. In order to compose these functions, each extension defines a function that takes in a function of type &lt;code&gt;evalInstr&lt;&#x2F;code&gt; and returns a function of type &lt;code&gt;evalInstr&lt;&#x2F;code&gt;. The returned function has two cases. In the first case, the instruction passed in is one that this extension adds or extends, so the function executes the logic to update the program and function states and returns an action according to the operation. In the other case, the instruction is not one that this extension implements, so the function dispatches to the &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function that was passed in to the original function. The code looks something like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;function &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;evalInstr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ProgramState&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(
    baseEval: (instr: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, programState:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, functionState:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(instr: bril.Instruction &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;programState:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;functionState:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;): &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;brili_base.Action &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(isExtInstr(instr)) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;handleExtInstr(instr);
        } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;baseEval(instr, programState, functionState);
        }
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
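&lt;p&gt;To see the dispatch chain run end to end, here is a small self-contained sketch of the same pattern, written in Python for brevity rather than in the interpreter&#x27;s TypeScript. The two toy extensions, the &lt;code&gt;compose&lt;&#x2F;code&gt; helper, and the &lt;code&gt;&quot;next&quot;&lt;&#x2F;code&gt; action string are all hypothetical stand-ins, not the actual Bril extension code.&lt;&#x2F;p&gt;

```python
def base_eval(instr, program_state, function_state):
    """Innermost evaluator: rejects any op that no extension claimed."""
    raise ValueError("unknown op: " + instr["op"])


def const_ext(inner_eval):
    """Hypothetical extension that adds a `const` instruction."""
    def eval_instr(instr, program_state, function_state):
        if instr["op"] == "const":
            function_state["env"][instr["dest"]] = instr["value"]
            return "next"  # placeholder action: fall through to next instruction
        return inner_eval(instr, program_state, function_state)
    return eval_instr


def add_ext(inner_eval):
    """Hypothetical extension that adds an `add` instruction."""
    def eval_instr(instr, program_state, function_state):
        if instr["op"] == "add":
            env = function_state["env"]
            env[instr["dest"]] = sum(env[a] for a in instr["args"])
            return "next"
        return inner_eval(instr, program_state, function_state)
    return eval_instr


def compose(extensions, innermost):
    """Wrap the innermost evaluator with each extension in turn."""
    evaluator = innermost
    for ext in extensions:
        evaluator = ext(evaluator)
    return evaluator


evaluator = compose([const_ext, add_ext], base_eval)
state = {"env": {}}
program = [
    {"op": "const", "dest": "a", "value": 2},
    {"op": "const", "dest": "b", "value": 3},
    {"op": "add", "dest": "c", "args": ["a", "b"]},
]
for instr in program:
    evaluator(instr, {}, state)
print(state["env"])
```

&lt;p&gt;Neither toy extension refers to the other; each only knows how to handle its own op and how to defer to whatever evaluator it was given.&lt;&#x2F;p&gt;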
&lt;p&gt;Each extension also usually defines types &lt;code&gt;ProgramState&lt;&#x2F;code&gt; and &lt;code&gt;FunctionState&lt;&#x2F;code&gt;, which are record types representing the types of fields that this extension expects to be in the program state and the function state respectively. For example, if&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{env: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Env&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;then this extension expects the function state to have an &lt;code&gt;env&lt;&#x2F;code&gt; field of type &lt;code&gt;Env&lt;&#x2F;code&gt; (the mapping from variable names to values in the base version of Bril). Because &lt;code&gt;FunctionState&lt;&#x2F;code&gt; is a record type, the restriction &lt;code&gt;FS extends FunctionState&lt;&#x2F;code&gt; implies that any object of type &lt;code&gt;FS&lt;&#x2F;code&gt; must have at least the same fields with the same types as specified in &lt;code&gt;FunctionState&lt;&#x2F;code&gt;. In the above case, &lt;code&gt;FS&lt;&#x2F;code&gt; must be a record that has the &lt;code&gt;env&lt;&#x2F;code&gt; field; it may have more fields, but it has at least that one. This works because TypeScript uses structural subtyping for records (as opposed to nominal subtyping).&lt;&#x2F;p&gt;
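&lt;p&gt;As a minimal sketch of that structural rule, a function that only asks for a &lt;code&gt;FunctionState&lt;&#x2F;code&gt; will accept a richer record. The &lt;code&gt;Env&lt;&#x2F;code&gt; alias, the &lt;code&gt;lookup&lt;&#x2F;code&gt; helper, and the &lt;code&gt;heapSize&lt;&#x2F;code&gt; field below are hypothetical stand-ins rather than the project&#x27;s definitions, and the sketch uses a plain parameter type instead of a generic constraint:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical stand-ins for the real Bril types.
type Env = { [name: string]: number };
type FunctionState = { env: Env };

// Only requires the `env` field; extra fields on the argument are ignored.
function lookup(state: FunctionState, name: string): number {
    const value = state.env[name];
    if (value === undefined) {
        throw new Error(`undefined variable: ${name}`);
    }
    return value;
}

// A composed state with an extra (made-up) field still satisfies FunctionState.
const composedState = { env: { x: 42 }, heapSize: 8 };
const result = lookup(composedState, "x");
```
&lt;&#x2F;pre&gt;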
&lt;p&gt;Composing extensions is now a simple matter of composing their &lt;code&gt;evalInstr&lt;&#x2F;code&gt; functions. I implemented a &lt;code&gt;Composer&lt;&#x2F;code&gt; class that does so in the following way:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;constructor(evalExts: ((baseEval: evalFunc&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;evalFunc&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)[]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, ...&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; {
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;evalInstr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(instr,programState,functionState) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;throw &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;`unhandled instruction: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;${instr.op}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;`&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    }
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ext &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;evalExts) {
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.evalInstr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ext(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.evalInstr);
    }

    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It simply creates a base &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function that throws an &quot;unhandled instruction&quot; error and then, in a loop, wraps it with each of the extension functions in &lt;code&gt;evalExts&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
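&lt;p&gt;To see the wrapping pattern in isolation, here is a small self-contained sketch. It is not the project&#x27;s code: the types are simplified (no program or function state, and results are plain numbers rather than actions), and &lt;code&gt;addExt&lt;&#x2F;code&gt;, &lt;code&gt;mulExt&lt;&#x2F;code&gt;, and &lt;code&gt;composeEval&lt;&#x2F;code&gt; are hypothetical names:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical, simplified types: the real evalInstr also takes program and function state.
type Instr = { op: string; args: number[] };
type EvalFunc = (instr: Instr) => number;
type Extension = (baseEval: EvalFunc) => EvalFunc;

// Handles "add" and defers everything else to the evaluator it extends.
const addExt: Extension = (baseEval) => (instr) =>
    instr.op === "add" ? instr.args[0] + instr.args[1] : baseEval(instr);

// Handles "mul" and defers everything else to the evaluator it extends.
const mulExt: Extension = (baseEval) => (instr) =>
    instr.op === "mul" ? instr.args[0] * instr.args[1] : baseEval(instr);

// Start from an evaluator that rejects everything, then wrap it once per extension.
function composeEval(exts: Extension[]): EvalFunc {
    let evalInstr: EvalFunc = (instr) => {
        throw new Error(`unhandled instruction: ${instr.op}`);
    };
    for (const ext of exts) {
        evalInstr = ext(evalInstr);
    }
    return evalInstr;
}

const evalInstr = composeEval([addExt, mulExt]);
const sum = evalInstr({ op: "add", args: [2, 3] });
const product = evalInstr({ op: "mul", args: [2, 3] });
```
&lt;&#x2F;pre&gt;
&lt;p&gt;One consequence of this shape is that extensions later in the list wrap the earlier ones, so they get the first chance to handle an instruction.&lt;&#x2F;p&gt;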
&lt;p&gt;In addition to defining how different operations are handled, an extension may also want to override&#x2F;extend the handling of actions generated by evaluating the various operations. First, however, I need to define how control flow is handled in Bril programs. When a Bril program is executing there is a program counter variable called &lt;code&gt;pc&lt;&#x2F;code&gt; that keeps track of where we are in the program. The type of this &lt;code&gt;pc&lt;&#x2F;code&gt; variable is:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PC&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseFunction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;gt; = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ function: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; index: number };
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It contains a &lt;code&gt;function&lt;&#x2F;code&gt; field which specifies which Bril function we are currently in and an &lt;code&gt;index&lt;&#x2F;code&gt;, which specifies which instruction in that function we are executing. These are again parameterized over an instruction and function type in order to support many different kinds of extensions to functions.&lt;&#x2F;p&gt;
&lt;p&gt;The handlers for actions take in the action generated by the current instruction and the current PC and generate a new PC. We also supply the action handler functions the current function and program state in case action handler extensions want access to the current program state. The type of these action handler functions is thus:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type actionHandler&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseFunction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;gt; = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(action: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;pc: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PC&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span 
style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;programState: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PS&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;functionState: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PC&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These handlers are composable in much the same way as the &lt;code&gt;evalInstr&lt;&#x2F;code&gt; functions. Each extension exports a function that takes in the &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function being extended and returns a new &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function. The returned function runs the extension&#x27;s own action handling if the current action is one that it handles and otherwise dispatches to the action handler that it was extending. In the &lt;code&gt;Composer&lt;&#x2F;code&gt; class I do the following, which is very similar to how I composed &lt;code&gt;evalInstr&lt;&#x2F;code&gt; functions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;constructor(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, actionHandleExts: ((extFunc: actionHandler&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;P&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;actionHandler&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;A&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;P&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span 
style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)[]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, ...&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; {
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;handleAction &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(action,pc,programState,functionState) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;throw &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;`unhandled action`&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    };
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ext &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;actionHandleExts) {
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.handleAction &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ext(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.handleAction);
    }
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Finally, the composer exposes a function that evaluates a Bril program object which extends the &lt;code&gt;BaseProgram&lt;&#x2F;code&gt; described above:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;evalProg&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Prog &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseProgram&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(prog: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Prog&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This function finds the main function object and then creates a new &lt;code&gt;PC&lt;&#x2F;code&gt; object with that function set as its function field and the index field set to 0. Then, in a loop it gets the current instruction, executes it and gets back an action, and then updates the &lt;code&gt;pc&lt;&#x2F;code&gt; based on that action. It repeats this until the index of the pc goes out of the current function&#x27;s bounds. The &lt;code&gt;Composer&lt;&#x2F;code&gt; class also takes in two functions &lt;code&gt;initP&lt;&#x2F;code&gt; and &lt;code&gt;initF&lt;&#x2F;code&gt;, which initialize the program and function states respectively.&lt;&#x2F;p&gt;
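&lt;p&gt;The loop described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual &lt;code&gt;Composer&lt;&#x2F;code&gt; code: the types are non-generic, there is no program or function state, and the &lt;code&gt;Action&lt;&#x2F;code&gt; shape and the returned trace of executed operations are made up for the example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical, simplified shapes; the real PC is generic over instruction and function types.
type Instr = { op: string; target?: number };
type Func = { name: string; instrs: Instr[] };
type PC = { function: Func; index: number };
type Action = { kind: "next" } | { kind: "jump"; index: number };

// Turn an instruction into an action (stand-in for the composed evalInstr).
function evalInstr(instr: Instr): Action {
    if (instr.op !== "jmp" || instr.target === undefined) {
        return { kind: "next" };
    }
    return { kind: "jump", index: instr.target };
}

// Turn an action into the next pc (stand-in for the composed handleAction).
function handleAction(action: Action, pc: PC): PC {
    if (action.kind === "jump") {
        return { function: pc.function, index: action.index };
    }
    return { function: pc.function, index: pc.index + 1 };
}

// Run `main` until the pc index leaves the function's bounds; returns the ops executed.
function evalProg(functions: Func[]): string[] {
    const main = functions.filter((f) => f.name === "main")[0];
    if (main === undefined) {
        throw new Error("no main function");
    }
    const trace: string[] = [];
    let pc: PC = { function: main, index: 0 };
    while (pc.function.instrs.length > pc.index) {
        const instr = pc.function.instrs[pc.index];
        trace.push(instr.op);
        pc = handleAction(evalInstr(instr), pc);
    }
    return trace;
}

// The jump at index 1 skips the "print" at index 2.
const trace = evalProg([
    { name: "main", instrs: [{ op: "nop" }, { op: "jmp", target: 3 }, { op: "print" }, { op: "ret" }] },
]);
```
&lt;&#x2F;pre&gt;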
&lt;p&gt;In order to create a new Bril interpreter which is the composition of extensions A, B, and C, you simply need to define a function state that contains all the fields required by all the function states in all of the extensions, a program state that contains all of the fields required by all of the program states in the extensions, and initialization functions for those types. Then, you simply create a new instance of the &lt;code&gt;Composer&lt;&#x2F;code&gt; class with the extensions and action handlers you want.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;example-extensions&quot;&gt;Example Extensions&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;bril-base&quot;&gt;Bril Base&lt;&#x2F;h3&gt;
&lt;p&gt;In order to demonstrate the usability of this system, I implemented a few extensions and then composed them together. First, I implemented the base Bril language as an extension. For the &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function, this mostly just involved copying over the switch statement on the instruction operation from the base implementation and adding an &lt;code&gt;env&lt;&#x2F;code&gt; field to the function state. Then, the &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function performed the same logic as in the &lt;code&gt;evalFunc&lt;&#x2F;code&gt; function in the current brili code except that variable &lt;code&gt;i&lt;&#x2F;code&gt; got replaced with the &lt;code&gt;pc&lt;&#x2F;code&gt;. I also had to add checks to each of the two new functions to make sure that the instruction&#x2F;action being handled was actually one of the instructions that the function was meant to handle. Otherwise, it would just dispatch to the base instruction evaluation or action handler function that was passed in (similar to how it was described above). Implementing this base extension required very few changes to the code that was in the &lt;code&gt;brili.ts&lt;&#x2F;code&gt; file. Most of it was simply copied over and a few minor tweaks were made.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;manually-managed-memory&quot;&gt;Manually Managed Memory&lt;&#x2F;h3&gt;
&lt;p&gt;The next extension that I implemented was the &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;manually-managed-memory&#x2F;&quot;&gt;manually managed memory&lt;&#x2F;a&gt; extension. This extension added a heap data structure to the program state and a way to allocate space on that heap. It also added a new value type, a pointer, which points to values in the heap. The new operations added by this extension can also load values from the heap into a variable and store the value of a variable into the heap. Because these operations read and write variables, the extension also requires an environment field in the function state. This leads to the following definitions in the code for this extension:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ProgramState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{heap: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Heap&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Value&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{env: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Env&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After declaring these definitions, creating the rest of the extension was simply a matter of copying over the code that handled the new operations from that project into the &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function for that extension. I didn&#x27;t need to define an &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function because this extension did not add any additional control flow to the base Bril language. Because of this assumption, and the assumption that there was an environment field that was being managed in the function state, this extension did kind of need to know that there was going to be an underlying base Bril language that already implemented the control flow and correctly maintained the environment.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;record-types&quot;&gt;Record Types&lt;&#x2F;h3&gt;
&lt;p&gt;The third extension that I ported over was the &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;recordtypes&#x2F;&quot;&gt;record types&lt;&#x2F;a&gt; extension. This extension added record types to the language, as well as operations to create records and access&#x2F;set record fields. It keeps a &lt;code&gt;typeEnv&lt;&#x2F;code&gt; field in the function state to track the record types defined in a function. Similar to the memory extension, it also assumed that there was some kind of local variable environment, so the state types for this extension looked like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ProgramState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{};
type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{env: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Env&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, typeEnv: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;TypeEnv&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This extension also didn&#x27;t require any additional control flow compared to the base Bril language so I also didn&#x27;t implement an &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;function-calls&quot;&gt;Function Calls&lt;&#x2F;h3&gt;
&lt;p&gt;The final extension was to add function calls. This extension was interesting because it added new control flow to the Bril language. There were two different projects that both added function calls to the language: &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;function-calls&#x2F;&quot;&gt;Function Calls and Property-Based Testing&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;exceptions-in-bril&#x2F;&quot;&gt;Exceptions in Bril&lt;&#x2F;a&gt;. The Function Calls and Property-Based Testing project added function calls by recursively calling the &lt;code&gt;evalFunc&lt;&#x2F;code&gt; function and augmenting the behavior of the &lt;code&gt;ret&lt;&#x2F;code&gt; instruction to have &lt;code&gt;evalFunc&lt;&#x2F;code&gt; return a value. This approach didn&#x27;t fit well into the new framework that I had built, because in my framework control flow revolves around the &lt;code&gt;pc&lt;&#x2F;code&gt; variable. The second project primarily added exceptions, but in order to add exceptions it also added basic support for function calls. This project&#x27;s method of adding function calls involved creating new stack frames for each function call and pushing and popping the stack frames to implement call and return instructions. This method was more in line with the way my framework handled state, so I decided to just add the function call part of the exceptions project.&lt;&#x2F;p&gt;
&lt;p&gt;This extension required adding several additional fields to both the program and function states. The main thing that was added was an array of stack frames (i.e., a stack) in order to correctly handle function calls and returns. Each stack frame holds the function state and the return pc:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ env: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Env &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};

type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;StackFrame&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseFunction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;gt; = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;PC&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];

&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;export &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ProgramState&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FunctionState&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Function&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&amp;gt; = &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ functions: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;F&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[], currentFunctionState: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, callStack: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;StackFrame&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS&lt;&#x2F;span&gt;&lt;span 
style=&quot;color:#323232;&quot;&gt;,I,F&amp;gt;[], &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;initF&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: () &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;FS &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
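&lt;p&gt;To make the role of these types more concrete, here is a rough sketch of how a frame might be pushed on a call and popped on a return. It is only an illustration: the types are simplified and non-generic, &lt;code&gt;handleCall&lt;&#x2F;code&gt; and &lt;code&gt;handleRet&lt;&#x2F;code&gt; are hypothetical names, and both the choice to resume at the instruction after the call and the error on an empty stack are assumptions rather than details taken from the project:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical, simplified shapes: a frame pairs a saved function state with a return pc.
type Env = { [name: string]: number };
type FunctionState = { env: Env };
type PC = { function: string; index: number };
type StackFrame = [FunctionState, PC];
type ProgramState = {
    currentFunctionState: FunctionState;
    callStack: StackFrame[];
    initF: () => FunctionState;
};

// On a call: save the caller's state and return pc, then start the callee with a fresh state.
// Assumption: the caller resumes at the instruction after the call.
function handleCall(state: ProgramState, pc: PC, callee: string): PC {
    state.callStack.push([state.currentFunctionState, { function: pc.function, index: pc.index + 1 }]);
    state.currentFunctionState = state.initF();
    return { function: callee, index: 0 };
}

// On a return: restore the caller's state and resume at the saved pc.
// Assumption: returning with an empty stack is treated as an error in this sketch.
function handleRet(state: ProgramState): PC {
    const frame = state.callStack.pop();
    if (frame === undefined) {
        throw new Error("return with an empty call stack");
    }
    state.currentFunctionState = frame[0];
    return frame[1];
}

const state: ProgramState = {
    currentFunctionState: { env: { x: 1 } },
    callStack: [],
    initF: () => ({ env: {} }),
};
const inCallee = handleCall(state, { function: "main", index: 4 }, "f");
const calleeSeesX = "x" in state.currentFunctionState.env;
const backInMain = handleRet(state);
```
&lt;&#x2F;pre&gt;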
&lt;p&gt;Furthermore, arguments were added to the function records. These arguments consist of a name and a type:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;interface &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Function&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;extends &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    name: bril.Ident;
    args: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Argument&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[];
    instrs: (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Label&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)[];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The only additional operation added in this extension was the &lt;code&gt;call&lt;&#x2F;code&gt; operation. The &lt;code&gt;evalInstr&lt;&#x2F;code&gt; function for this extension was quite simple: it just returned a new call action with the name of the function being called and the argument variables being passed into the function. Most of the control flow logic was handled in the &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function exported by this extension. On a call action, this handler created a new stack frame with the &lt;code&gt;initF&lt;&#x2F;code&gt; function provided in the program state record and then pushed the current stack frame onto the call stack. It also made the assumption that all of the &lt;code&gt;evalInstr&lt;&#x2F;code&gt; functions were called with the program state &lt;code&gt;currentFunctionState&lt;&#x2F;code&gt; field. This was indeed the case, as the main loop of the &lt;code&gt;Composer&lt;&#x2F;code&gt; class contained this code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;action &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.evalInstr(line, programState, programState.currentFunctionState);
pc &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#ed6a43;&quot;&gt;this&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.handleAction(action, pc, programState, programState.currentFunctionState);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This allowed the &lt;code&gt;actionHandler&lt;&#x2F;code&gt; function to set the new function state on a function call. In addition, this extension overrode the &lt;code&gt;end&lt;&#x2F;code&gt; action, which is usually generated by the &lt;code&gt;ret&lt;&#x2F;code&gt; operation in the base Bril implementation. Instead of terminating the program, the extension popped the top stack frame off of the program state&#x27;s &lt;code&gt;callStack&lt;&#x2F;code&gt; field, restored the &lt;code&gt;currentFunctionState&lt;&#x2F;code&gt; field from that popped stack frame, and finally set the pc to the return pc in the popped stack frame.&lt;&#x2F;p&gt;
&lt;p&gt;This extension required a bit more effort than the other extensions because of the changes in control flow made to the interpreter. However, much of the logic could simply be copied over from the exceptions project with only minor tweaks.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;brining-it-all-together&quot;&gt;Bringing it all together&lt;&#x2F;h3&gt;
&lt;p&gt;In order to test the extensions and their composition, I created a new instance of the &lt;code&gt;Composer&lt;&#x2F;code&gt; class with the four &lt;code&gt;evalInstr&lt;&#x2F;code&gt; functions (one from each extension) as well as the two &lt;code&gt;actionHandler&lt;&#x2F;code&gt; functions from the base and the function call extensions. I then added all of the test cases from each of the extensions to the project and ran the composed interpreter over them; all of them passed once I fixed one minor bug. I also added a test case that combined some operations from each extension, which also ran correctly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation-of-the-system&quot;&gt;Evaluation of the system&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;extension-development&quot;&gt;Extension development&lt;&#x2F;h3&gt;
&lt;p&gt;One of the goals that I had when developing this system was that extensions should be able to be developed in isolation. In evaluating this goal, I will look back at the four extensions that I implemented. For the most part, each extension could be developed in isolation as long as each extension specified the assumptions it was making about the state. For example, the base extension specified that the function state had to include an environment field that stored all of the function local variables. However, because of the use of generics, the base extension never specifically names a type that the function state must be. Instead, it leaves it up to the composer of the extensions. Furthermore, even though the base extension assumes that function state has an environment field, it implicitly assumes that this field will be used correctly by all other extensions that are composed with this extension. For example, if an extension comes along and lowercases all variable names before getting and setting to the environment, it could potentially interfere with the base extension. Or, if there is a different extension that assumes that function local data is stored in a field called &lt;code&gt;local_vars&lt;&#x2F;code&gt; instead of &lt;code&gt;env&lt;&#x2F;code&gt;, then these two extensions may not be properly composable.&lt;&#x2F;p&gt;
&lt;p&gt;This was a more general trend that I realized while developing the extensions. Extensions can be developed in relative isolation using the framework. However, when the extensions are composed together, the composer needs to be aware of all assumptions, implicit and explicit, that are made by each extension in order to determine if the extensions can be composed in a meaningful way. One example is that there is some code in the manually managed memory extension that assumes that a value is either a number, a boolean, or a pointer. However, when we compose this extension with the record type extension we have values that can be records as well. Similarly, there is code in the record types extension that assumes that a value is either a number, a boolean, or a record. While these two extensions can operate in isolation in the same program, if I try to create a record with a pointer field things start to break.&lt;&#x2F;p&gt;
&lt;p&gt;In order to have extensions be correctly composable with each other there need to be some conventions that extension developers need to follow. To solve the above example, if the two extensions were a bit more generic with the types of values they assumed in the environment, then they could probably be correctly composable. In other words, while extensions can be developed in isolation, they need to be aware that they will be extended.&lt;&#x2F;p&gt;
&lt;p&gt;The ease with which extensions could be developed was also an important feature that I wanted to have. The minimal number of changes that I had to make to the code for the base, memory, and record type extensions indicates that writing these extensions is not much more challenging than simply adding code to the original Bril interpreter.&lt;&#x2F;p&gt;
&lt;p&gt;The main challenge then remains in actually composing the extensions together. However, all of the Bril extensions assumed only the base extension, so multiple versions of the interpreter can easily be made by simply composing the base extension with the extension developed for the project. Then, all of the interpreter code can be merged into the repository in a conflict-free way, because of the isolation of the extensions from each other. There is still another challenge that needs to be solved, which is the extensibility of the parser. This may be solved by rewriting the parser using &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Parser_combinator&quot;&gt;parser combinators&lt;&#x2F;a&gt;. However, I have not explored this option in any great depth.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;type-safety&quot;&gt;Type safety&lt;&#x2F;h3&gt;
&lt;p&gt;Another goal that I had for this project was that when composing extensions, the composer should be assisted as much as possible by the type checker. In TypeScript, you can basically revert to untyped JavaScript by simply making everything have type &lt;code&gt;any&lt;&#x2F;code&gt;. If all of the generic types were removed and replaced with the all-encompassing &lt;code&gt;any&lt;&#x2F;code&gt;, then it would be very easy to compose two extensions because their function signatures trivially match. However, you lose some type safety, such as the restrictions placed on the function and program state by each extension. So I added in generics for the program and function states. This is able to catch some type errors at compile time. However, when there is a type error the generics usually generate quite verbose type errors from the compiler, which can sometimes be difficult to parse through, especially because of all of the restrictions placed on the generics (such as the &lt;code&gt;extends&lt;&#x2F;code&gt; conditions).&lt;&#x2F;p&gt;
&lt;p&gt;The generics for the instruction and function types were also added to increase type safety within the main evaluation loop and within the handler functions. Having the &lt;code&gt;pc&lt;&#x2F;code&gt; parameterized on those types allowed me to write code that knew that each function would have a list of instructions and a name. However, because the Bril program is parsed from a JSON file and then cast to a Bril program type without any dynamic checks, this type safety was largely negated by the initial cast. More fundamentally, the way most extensions check whether the instruction passed into &lt;code&gt;evalInstrs&lt;&#x2F;code&gt; is one they implement is pretty unsound. The &lt;code&gt;instr&lt;&#x2F;code&gt; parameter that &lt;code&gt;evalInstrs&lt;&#x2F;code&gt; usually takes in is given the following type:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;instr: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Instruction &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;I
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;I&lt;&#x2F;code&gt; is the generic type and &lt;code&gt;Instruction&lt;&#x2F;code&gt; is the type of the instructions that the extension implements. The &lt;code&gt;isInstruction&lt;&#x2F;code&gt; function is usually just a type guard that succeeds if the opcode of the instruction matches one of the implemented instruction ops (here in the &lt;code&gt;instrOps&lt;&#x2F;code&gt; array):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;function &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;isInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(instr: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;BaseInstruction&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: instr is &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Instruction &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instrOps.some(op &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;op &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instr.op);
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is not really a safe type guard because one of your instructions could assume a &lt;code&gt;dest&lt;&#x2F;code&gt; field in the instruction but this is only checking that the opcode matches, not that the instruction has all of the correct fields. This could maybe be improved by more comprehensive run time type checking, but would still probably not rule out everything that could go wrong. Furthermore, it could do the wrong thing if extensions are composed in the wrong way. If two extensions &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; implement operation &lt;code&gt;a&lt;&#x2F;code&gt; but an extension &lt;code&gt;A&lt;&#x2F;code&gt; removes some fields of operation &lt;code&gt;a&lt;&#x2F;code&gt;, if &lt;code&gt;B&lt;&#x2F;code&gt; extends &lt;code&gt;A&lt;&#x2F;code&gt;, then &lt;code&gt;B&lt;&#x2F;code&gt; will be operating on the wrong kind of instruction. However, if &lt;code&gt;A&lt;&#x2F;code&gt; extends &lt;code&gt;B&lt;&#x2F;code&gt;, then the &lt;code&gt;a&lt;&#x2F;code&gt; operation will be handled by extension &lt;code&gt;A&lt;&#x2F;code&gt; and never be propagated to &lt;code&gt;B&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, one other issue with the above code is that the &lt;code&gt;instrOps&lt;&#x2F;code&gt; array might not contain all of the correct operations. This is actually a bug that I did run into because I forgot to include the &lt;code&gt;&amp;quot;const&amp;quot;&lt;&#x2F;code&gt; opcode in the base extension&#x27;s &lt;code&gt;instrOps&lt;&#x2F;code&gt; array. If the type &lt;code&gt;Instruction&lt;&#x2F;code&gt; is the discriminated union of all of the instructions supported by an extension, then &lt;code&gt;Instruction[&amp;quot;op&amp;quot;]&lt;&#x2F;code&gt; is the discriminated union of all of the opcodes of all of the instructions. Ideally, we would want to transform the type &lt;code&gt;Instruction[&amp;quot;op&amp;quot;]&lt;&#x2F;code&gt; into the &lt;code&gt;instrOps&lt;&#x2F;code&gt; array statically, but TypeScript doesn&#x27;t really support this transformation. You can, however, convert const arrays to discriminated unions, so I came up with the following code snippet to make sure that the &lt;code&gt;instrOps&lt;&#x2F;code&gt; array is equivalent to the discriminated union of the opcodes of the instructions implemented by the extension:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instrOps &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] as &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; This implements a type equality check for the above array, providing some static safety
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;CheckLE &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;typeof &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instrOps)[number] extends (&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Instruction&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;? &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;any &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;never;
type &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;CheckGE &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Instruction&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]) extends (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;typeof &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instrOps)[number] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;? &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;any &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;never;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;let &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;_: [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;CheckLE&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;CheckGE&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because TypeScript has conditional types and because constant arrays can be transformed to discriminated union types, if the assignment of &lt;code&gt;[0,0]&lt;&#x2F;code&gt; to a type &lt;code&gt;[CheckLE, CheckGE]&lt;&#x2F;code&gt; does not produce a type error then the &lt;code&gt;instrOps&lt;&#x2F;code&gt; array contains exactly the opcodes in the discriminated union &lt;code&gt;Instruction[&amp;quot;op&amp;quot;]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;&#x2F;h2&gt;
&lt;p&gt;TypeScript actually turned out to be a very good language to develop this framework in. The ability to decide how much type safety I want greatly simplified early development of the framework and allowed me to add type safety via generics in an incremental way. More generally, I think this project is a good prototype for building composable interpreter extensions for Bril. The biggest issue right now is that the use of generics makes some boilerplate code quite verbose and can lead to confusing type errors because of the large number of generics and the dependencies between them. The lack of a composable parser still prevents this approach from being integrated into the codebase as is, but the project is an interesting step in that direction.&lt;&#x2F;p&gt;
&lt;p&gt;The code for this project can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Dan12&#x2F;bril&#x2F;tree&#x2F;extensible&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>An Autoscheduler for Halide</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/halide-autoscheduler/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/halide-autoscheduler/</guid>
                <description>&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;halide-lang.org&#x2F;&quot;&gt;Halide&lt;&#x2F;a&gt; is a domain-specific language embedded in C++ for writing code
that processes images and, more generally, arrays.
The main innovation of Halide is that it separates &lt;em&gt;algorithm&lt;&#x2F;em&gt; --- the actual
function being computed --- from &lt;em&gt;schedule&lt;&#x2F;em&gt; --- the decisions regarding when
to perform computations and when to store intermediate results.
This allows developers to write the function that their image pipelines
implement once and then performance-tune the implementation by swapping out
schedules --- different schedules can be used for different platforms while
not modifying function code.&lt;&#x2F;p&gt;
&lt;p&gt;Writing an efficient schedule for Halide functions requires expertise in
performance tuning.
To alleviate this, in this project we create a toy autoscheduler for Halide
that attempts to automatically generate an efficient schedule
for Halide functions.
(Note that Halide has an autoscheduler built-in: see &lt;a href=&quot;https:&#x2F;&#x2F;halide-lang.org&#x2F;papers&#x2F;autoscheduler2019.html&quot;&gt;this paper&lt;&#x2F;a&gt;
for more information.)&lt;&#x2F;p&gt;
&lt;p&gt;Our autoscheduler is implemented in Python 2.7 and can be found at
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rolph-recto&#x2F;cs6120-autoscheduler&quot;&gt;this repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;The following presentation of schedules as trees manipulated by
schedule transformers closely follows Chapter 7 of Jonathan Ragan-Kelley&#x27;s
&lt;a href=&quot;https:&#x2F;&#x2F;people.csail.mit.edu&#x2F;jrk&#x2F;jrkthesis.pdf&quot;&gt;thesis&lt;&#x2F;a&gt;.
The images below are from that document. &lt;&#x2F;p&gt;
&lt;p&gt;In order to search for schedules, we represent them as &lt;em&gt;schedule trees&lt;&#x2F;em&gt;,
wherein the ancestry relationships between nodes represent ordering information.
Schedule trees have the following kinds of nodes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Root nodes&lt;&#x2F;strong&gt; represent the top of the schedule tree.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Loop nodes&lt;&#x2F;strong&gt; represent the traversal of how the function is computed
along a given dimension.
Loop nodes are associated with a function and a variable (dimension).
Since functions are assumed to be two-dimensional, by default functions have
two variables: &lt;em&gt;x&lt;&#x2F;em&gt; and &lt;em&gt;y&lt;&#x2F;em&gt;.
Loop nodes also contain information such as whether the loop is run
sequentially, run in parallel, or vectorized.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage nodes&lt;&#x2F;strong&gt; represent storage for intermediate results to be used later.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compute nodes&lt;&#x2F;strong&gt; are the leaves of the schedule tree, and they represent
computation being performed.
Compute nodes can have other compute nodes as children to represent
functions that are inlined instead of loaded from intermediate storage.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Schedule trees are considered &lt;em&gt;well-formed&lt;&#x2F;em&gt; if they satisfy the
following criteria:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The ancestry path from a function&#x27;s compute node to the root node contains
all the loop nodes and the storage node
(if the function is not the output&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#storage-output&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;)
for that function.
Intuitively, this means that the traversal of how the function is computed
is completely defined, and storage for the function&#x27;s results is available.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;If a function calls another function and the callee is not inlined,
the compute node for the callee must occur before the compute node of the
caller in a depth-first traversal.
Intuitively, this ensures that the callee&#x27;s results are stored before
the caller is computed.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;storage-output&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;By convention the output function does not have a storage
node in the schedule tree since it is assumed that storage for the output
has already been allocated and thus there is no decision to be made about
the granularity with which to allocate it.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
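&lt;p&gt;As a rough illustration, both criteria can be checked with a single depth-first traversal.
The sketch below is not taken from our autoscheduler: the &lt;code&gt;Node&lt;&#x2F;code&gt; class, the
&lt;code&gt;is_well_formed&lt;&#x2F;code&gt; and &lt;code&gt;loops_for&lt;&#x2F;code&gt; functions, and the &lt;code&gt;variables&lt;&#x2F;code&gt;,
&lt;code&gt;calls&lt;&#x2F;code&gt;, and &lt;code&gt;output&lt;&#x2F;code&gt; parameters are hypothetical names chosen for this example,
and it is written in Python 3 for brevity.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                    # "root", "loop", "storage", or "compute"
    func: Optional[str] = None   # owning function (None for the root)
    var: Optional[str] = None    # loop variable (loop nodes only)
    children: List["Node"] = field(default_factory=list)


def is_well_formed(root, variables, calls, output):
    """Check both criteria in one depth-first traversal.

    variables: function name -> loop variables it must be traversed over
    calls:     caller name -> set of callees that are not inlined
    output:    name of the output function, which needs no storage node
    """
    computed = []  # functions whose compute node was already visited

    def visit(node, ancestors, inlined):
        if node.kind == "compute" and not inlined:
            own = [a for a in ancestors if a.func == node.func]
            loops = {a.var for a in own if a.kind == "loop"}
            stored = any(a.kind == "storage" for a in own)
            # Criterion 1: every loop is present, plus storage unless output.
            if not set(variables[node.func]).issubset(loops):
                return False
            if node.func != output and not stored:
                return False
            # Criterion 2: non-inlined callees were computed earlier.
            if not calls.get(node.func, set()).issubset(computed):
                return False
            computed.append(node.func)
        # Compute nodes nested under a compute node are inlined callees.
        below_compute = inlined or node.kind == "compute"
        return all(visit(c, ancestors + [node], below_compute)
                   for c in node.children)

    return visit(root, [], False)


def loops_for(func):
    """Row-major loops (y outside x) around the compute node of `func`."""
    inner = Node("loop", func, "x", [Node("compute", func)])
    return Node("loop", func, "y", [inner])


variables = {"bx": ["x", "y"], "by": ["x", "y"]}
calls = {"by": {"bx"}}
good = Node("root", children=[
    Node("storage", "bx", children=[loops_for("bx"), loops_for("by")])])
bad = Node("root", children=[
    Node("storage", "bx", children=[loops_for("by"), loops_for("bx")])])
print(is_well_formed(good, variables, calls, "by"))  # True
print(is_well_formed(bad, variables, calls, "by"))   # False
```

&lt;p&gt;The second tree is rejected because the compute node of &lt;code&gt;by&lt;&#x2F;code&gt; is visited before
the compute node of its callee &lt;code&gt;bx&lt;&#x2F;code&gt;, which violates the second criterion.&lt;&#x2F;p&gt;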
&lt;p&gt;For any function we can define the &lt;em&gt;default schedule&lt;&#x2F;em&gt;, which traverses
the output function in row-major order and inlines all called functions,
like so:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;default.png&quot; alt=&quot;default schedule&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We can give a semantics for schedule trees as nested loops.
Consider the schedule below for three functions, &lt;em&gt;in&lt;&#x2F;em&gt;, &lt;em&gt;bx&lt;&#x2F;em&gt;, and &lt;em&gt;by&lt;&#x2F;em&gt;, where
&lt;em&gt;by&lt;&#x2F;em&gt; calls &lt;em&gt;bx&lt;&#x2F;em&gt; and &lt;em&gt;bx&lt;&#x2F;em&gt; calls &lt;em&gt;in&lt;&#x2F;em&gt;.
The schedule tree on the left represents the nested loop on the right.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;semantics.png&quot; alt=&quot;schedule semantics&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
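&lt;p&gt;To make the loop-nest reading concrete, here is a small sketch that prints a schedule tree
as pseudocode. It is an illustration only and not code from our autoscheduler: the
&lt;code&gt;Node&lt;&#x2F;code&gt; class and the &lt;code&gt;describe&lt;&#x2F;code&gt;, &lt;code&gt;lower&lt;&#x2F;code&gt;, and
&lt;code&gt;default_schedule&lt;&#x2F;code&gt; helpers are hypothetical, the output format is made up, and the
code targets Python 3. It assumes that the inlined callees form a simple chain, as in the
three-function example above.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str                    # "root", "loop", "storage", or "compute"
    func: Optional[str] = None   # owning function (None for the root)
    var: Optional[str] = None    # loop variable (loop nodes only)
    children: List["Node"] = field(default_factory=list)


def describe(node):
    """Render a compute node with its inlined callees, e.g. by(bx(in))."""
    if not node.children:
        return node.func
    inner = ", ".join(describe(child) for child in node.children)
    return f"{node.func}({inner})"


def lower(node, depth=0):
    """Return the loop nest for `node` as indented pseudocode lines."""
    pad = "  " * depth
    if node.kind == "compute":
        return [f"{pad}compute {describe(node)}"]
    lines = []
    if node.kind == "loop":
        lines.append(f"{pad}for {node.func}.{node.var}:")
        depth += 1  # only loops introduce a new nesting level
    elif node.kind == "storage":
        lines.append(f"{pad}allocate {node.func}")
    for child in node.children:
        lines.extend(lower(child, depth))
    return lines


def default_schedule(output, inlined):
    """Row-major loops over `output`, with the callee chain `inlined` inlined."""
    compute = None
    for func in reversed([output] + inlined):
        nested = [compute] if compute is not None else []
        compute = Node("compute", func, children=nested)
    loop_x = Node("loop", output, "x", [compute])
    return Node("root", children=[Node("loop", output, "y", [loop_x])])


print("\n".join(lower(default_schedule("by", ["bx", "in"]))))
```

&lt;p&gt;For the default schedule of &lt;code&gt;by&lt;&#x2F;code&gt;, this prints a loop over &lt;code&gt;by.y&lt;&#x2F;code&gt; that
contains a loop over &lt;code&gt;by.x&lt;&#x2F;code&gt; around a single &lt;code&gt;compute by(bx(in))&lt;&#x2F;code&gt; line,
since every callee is inlined and no storage is allocated.&lt;&#x2F;p&gt;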
&lt;h3 id=&quot;schedule-transformers&quot;&gt;Schedule Transformers&lt;&#x2F;h3&gt;
&lt;p&gt;We define transformers over schedule trees.
We use these to traverse the search space of schedules.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Split&lt;&#x2F;strong&gt; - split a function&#x27;s variable into two.
For example, we can split a function&#x27;s &lt;code&gt;x&lt;&#x2F;code&gt; variable into &lt;code&gt;x_inner&lt;&#x2F;code&gt;
and &lt;code&gt;x_outer&lt;&#x2F;code&gt;.
This allows &lt;em&gt;tiered&lt;&#x2F;em&gt; traversal of a function&#x27;s extent along one dimension.
For example, splitting the &lt;code&gt;x&lt;&#x2F;code&gt; variable changes this loop:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for x in [1..16]:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;  a[x] = ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;into:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for x_outer in [1..4]:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;  for x_inner in [1..4]:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;    a[((x_outer-1)*4)+x_inner] = ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Combined with &lt;em&gt;reorder&lt;&#x2F;em&gt;, &lt;em&gt;split&lt;&#x2F;em&gt; can represent schedules that &lt;em&gt;tile&lt;&#x2F;em&gt;
computations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Change Loop Type&lt;&#x2F;strong&gt; - change how the loop will be traversed; by default the
loop type is &lt;code&gt;sequential&lt;&#x2F;code&gt;, but it could also be &lt;code&gt;parallel&lt;&#x2F;code&gt;, &lt;code&gt;unrolled&lt;&#x2F;code&gt;,
or &lt;code&gt;vectorized&lt;&#x2F;code&gt;.
For simplicity our implementation only supports &lt;code&gt;sequential&lt;&#x2F;code&gt; and &lt;code&gt;vectorized&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reorder&lt;&#x2F;strong&gt; - switch loop nodes for the same function.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hoist &#x2F; lower compute&lt;&#x2F;strong&gt; - change the granularity at which intermediate
results are computed.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hoist &#x2F; lower storage&lt;&#x2F;strong&gt; - change the granularity at which storage for
intermediate results is allocated.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inline &#x2F; deinline&lt;&#x2F;strong&gt; - inline functions into callers (don&#x27;t store their results
in intermediate storage) or deinline functions out of callers.
Intuitively, inlining functions trades off smaller memory usage for
redundant computations, while deinlining trades off higher memory usage
for fewer redundant computations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
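&lt;p&gt;As a sanity check on the index arithmetic behind &lt;em&gt;split&lt;&#x2F;em&gt;, the sketch below enumerates
the indices visited before and after splitting, and then combines two splits with a reorder to
obtain a tiled traversal. The function names &lt;code&gt;original&lt;&#x2F;code&gt;, &lt;code&gt;split&lt;&#x2F;code&gt;, and
&lt;code&gt;tiled&lt;&#x2F;code&gt; are hypothetical and chosen for this example. It uses 1-based inclusive ranges
like the pseudocode above, so the flattened index is &lt;code&gt;(x_outer - 1) * factor + x_inner&lt;&#x2F;code&gt;,
and it assumes that the split factor divides the extent evenly.&lt;&#x2F;p&gt;

```python
def original(n):
    """Indices visited by `for x in [1..n]`."""
    return list(range(1, n + 1))


def split(n, factor):
    """Indices visited after splitting x into x_outer and x_inner.

    Assumes `factor` divides `n`; remainders are ignored in this sketch.
    """
    assert n % factor == 0
    visited = []
    for x_outer in range(1, n // factor + 1):
        for x_inner in range(1, factor + 1):
            visited.append((x_outer - 1) * factor + x_inner)
    return visited


def tiled(width, height, tile):
    """Split x and y, then reorder so that both inner loops are innermost."""
    assert width % tile == 0 and height % tile == 0
    points = []
    for y_outer in range(1, height // tile + 1):
        for x_outer in range(1, width // tile + 1):
            for y_inner in range(1, tile + 1):
                for x_inner in range(1, tile + 1):
                    x = (x_outer - 1) * tile + x_inner
                    y = (y_outer - 1) * tile + y_inner
                    points.append((x, y))
    return points


# Splitting changes the loop structure, not the set or order of indices.
print(split(16, 4) == original(16))  # True
# Tiling visits every point exactly once, one 4x4 tile at a time.
print(len(tiled(8, 8, 4)), len(set(tiled(8, 8, 4))))  # 64 64
```

&lt;p&gt;Splitting alone leaves the traversal order unchanged; it is the subsequent reorder that
changes the order in which points are visited, which is what makes tiling possible.&lt;&#x2F;p&gt;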
&lt;p&gt;Below are some diagrams to give intuition for how these schedule transformers
work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;reorder.png&quot; alt=&quot;reorder&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;hoist-compute.png&quot; alt=&quot;hoist compute&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;lower-compute.png&quot; alt=&quot;lower compute&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;lower-compute.png&quot; alt=&quot;hoist storage&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;inline.png&quot; alt=&quot;inline&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;halide-autoscheduler&#x2F;deinline.png&quot; alt=&quot;deinline&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bounds-inference&quot;&gt;Bounds Inference&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we have a representation for schedules and a set of
schedule transformers, we are close to arriving at a search algorithm
for finding efficient schedules.
The last component that we need is a notion of &lt;em&gt;cost&lt;&#x2F;em&gt; for schedules.
In order to provide a cost model for schedules, we need to determine the
number of iterations performed by loops in the schedule.
This determines the number of times instructions inside the body of loops
will be executed, as well as the size of intermediate storage to be
allocated.
We determine this by computing the extent over which functions will be computed.
For the output function, we assume that the extent is given by a call to the
&lt;code&gt;realize&lt;&#x2F;code&gt; function.
For called functions that are not inlined, the extent is the dimensions of
the function that will be stored as intermediate results.
Because storage will be reused depending on the granularity
with which intermediate results are stored, the extent of called functions
does not necessarily coincide with the total extent over which the function
will be computed (e.g., the called function might be computed on
a per-scanline basis).&lt;&#x2F;p&gt;
&lt;p&gt;For example, consider the simple pipeline below that has one producer (&lt;code&gt;g&lt;&#x2F;code&gt;)
and one consumer (&lt;code&gt;f&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;g(x, y) = x * y&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;f(x, y) = g(x, y) + g(x+1, y+1)&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Given that &lt;code&gt;f&lt;&#x2F;code&gt; is realized in a 512x512 box and a schedule where &lt;code&gt;g&lt;&#x2F;code&gt; is
computed in total before computing &lt;code&gt;f&lt;&#x2F;code&gt;, the extent of &lt;code&gt;g&lt;&#x2F;code&gt; is 513x513.&lt;&#x2F;p&gt;
&lt;p&gt;Computing the extent over which functions will be computed is hard in general,
but since Halide makes the simplifying assumption that all extents are
rectangular (as opposed to, say, polytopes in the polyhedral model),
there is a simple method for doing this:
we evaluate the arguments passed to the callee at the minimum and maximum
points of the caller&#x27;s extent.
Note that we also assume that function arguments are drawn from a grammar
of &amp;quot;simple&amp;quot; arithmetic expressions consisting only of &lt;code&gt;+&lt;&#x2F;code&gt;, &lt;code&gt;-&lt;&#x2F;code&gt;, &lt;code&gt;*&lt;&#x2F;code&gt;, &lt;code&gt;&#x2F;&lt;&#x2F;code&gt;,
variables and constants.&lt;&#x2F;p&gt;
&lt;p&gt;In the example above, the extent of &lt;code&gt;f&lt;&#x2F;code&gt; is defined by the box bounded by
&lt;code&gt;(1,1)&lt;&#x2F;code&gt; and &lt;code&gt;(512, 512)&lt;&#x2F;code&gt;.
The arguments to &lt;code&gt;g&lt;&#x2F;code&gt; at these points are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;at &lt;code&gt;(1,1)&lt;&#x2F;code&gt;: &lt;code&gt;(1,1), (2,2)&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;at &lt;code&gt;(512,512)&lt;&#x2F;code&gt;: &lt;code&gt;(512,512), (513,513)&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Thus we can determine the extent of &lt;code&gt;g&lt;&#x2F;code&gt; to be 513x513.&lt;&#x2F;p&gt;
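&lt;p&gt;The corner check above can be sketched in a few lines of Python.
This is a minimal illustration rather than our implementation (which uses Z3):
the helper &lt;code&gt;callee_extent&lt;&#x2F;code&gt; and the lambda encoding of call arguments are
hypothetical, and sampling only the corners assumes the argument expressions
are affine in the caller&#x27;s variables.&lt;&#x2F;p&gt;

```python
from itertools import product

# Caller f is realized over the box bounded by (1, 1) and (512, 512).
f_min, f_max = (1, 1), (512, 512)

# Each call that f makes to g, as a map from f's point to g's arguments.
# These mirror f(x, y) = g(x, y) + g(x+1, y+1).
calls_to_g = [
    lambda x, y: (x, y),
    lambda x, y: (x + 1, y + 1),
]

def callee_extent(lo, hi, calls):
    """Bounding box of all callee arguments, sampled at the caller's corners."""
    corners = list(product(*zip(lo, hi)))
    points = [call(*corner) for call in calls for corner in corners]
    dims = list(zip(*points))  # group the coordinates per dimension
    box_min = tuple(min(d) for d in dims)
    box_max = tuple(max(d) for d in dims)
    extent = tuple(b - a + 1 for a, b in zip(box_min, box_max))
    return box_min, box_max, extent

g_min, g_max, g_extent = callee_extent(f_min, f_max, calls_to_g)
print(g_min, g_max, g_extent)  # (1, 1) (513, 513) (513, 513)
```

&lt;p&gt;The result matches the 513x513 extent derived by hand above.&lt;&#x2F;p&gt;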
&lt;p&gt;We encode these caller-callee relationships into logical formulas and
use the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Z3Prover&#x2F;z3&quot;&gt;Z3&lt;&#x2F;a&gt; SMT solver to retrieve a model that contains concrete values
for the arguments.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;search-algorithm-for-schedules&quot;&gt;Search Algorithm for Schedules&lt;&#x2F;h3&gt;
&lt;p&gt;Once loop sizes have been inferred, we have enough information to determine
important execution features of the schedule, such as how much memory
it will allocate and how many operations it will perform.
The &lt;em&gt;cost&lt;&#x2F;em&gt; of the schedule is then a weighted sum of these data points.&lt;&#x2F;p&gt;
&lt;p&gt;By default our implementation groups execution features into the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;mem&lt;&#x2F;em&gt; - amount of memory allocated&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;loads&lt;&#x2F;em&gt; - number of intermediate results loaded from storage&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;stores&lt;&#x2F;em&gt; - number of intermediate results stored&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;arithmetic operations&lt;&#x2F;em&gt; - number of &lt;code&gt;+&lt;&#x2F;code&gt;, &lt;code&gt;-&lt;&#x2F;code&gt;, &lt;code&gt;*&lt;&#x2F;code&gt; and &lt;code&gt;&#x2F;&lt;&#x2F;code&gt; operations performed&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;mathematical operations&lt;&#x2F;em&gt; - number of &lt;code&gt;sin&lt;&#x2F;code&gt;, &lt;code&gt;cos&lt;&#x2F;code&gt;, &lt;code&gt;tan&lt;&#x2F;code&gt;, &lt;code&gt;sqrt&lt;&#x2F;code&gt; operations
performed&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Each of these groups has a weight that determines the importance of these
features with respect to the schedule&#x27;s cost (see &lt;strong&gt;Evaluation&lt;&#x2F;strong&gt; below).&lt;&#x2F;p&gt;
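&lt;p&gt;As a rough sketch of the weighted sum, the snippet below uses the weights
listed in the Evaluation section. The function name &lt;code&gt;schedule_cost&lt;&#x2F;code&gt;, the
dictionary keys, and the feature counts are made up for illustration and are
not taken from our implementation or measurements.&lt;&#x2F;p&gt;

```python
# Weights taken from the Evaluation section of this post.
WEIGHTS = {
    "mem": 0.1,
    "loads": 0.5,
    "stores": 0.5,
    "arith": 1.0,
    "math": 10.0,
}

def schedule_cost(features, weights=WEIGHTS):
    """Weighted sum of execution-feature counts; missing features count as zero."""
    return sum(weights[name] * features.get(name, 0) for name in weights)

# Hypothetical feature counts for two schedules of the same pipeline.
inlined = {"mem": 0, "loads": 0, "stores": 0, "arith": 400, "math": 1200}
stored = {"mem": 100, "loads": 400, "stores": 100, "arith": 400, "math": 300}

print(schedule_cost(inlined))  # 12400.0
print(schedule_cost(stored))   # 3660.0
```

&lt;p&gt;With these invented counts, storing intermediate results is cheaper because
the heavily weighted mathematical operations are no longer recomputed.&lt;&#x2F;p&gt;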
&lt;p&gt;Now that we can give a notion of cost to schedules, we can search for efficient
schedules.
We use beam search as our search algorithm, with the default schedule as
the starting node.
We describe the concrete parameters used for search below in &lt;strong&gt;Evaluation&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
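&lt;p&gt;For readers unfamiliar with beam search, here is a generic sketch.
It is not our autoscheduler: the states are plain integers standing in for
schedule trees, and &lt;code&gt;expand&lt;&#x2F;code&gt; and &lt;code&gt;cost&lt;&#x2F;code&gt; are toy stand-ins for the schedule
transformers and the cost model.&lt;&#x2F;p&gt;

```python
def beam_search(start, expand, cost, depth, width):
    """Keep the `width` cheapest candidates at each of `depth` levels."""
    beam = [start]
    best = start
    for _ in range(depth):
        # Apply every transformer to every state in the beam, dropping duplicates.
        candidates = list(dict.fromkeys(
            nxt for state in beam for nxt in expand(state)))
        if not candidates:
            break
        beam = sorted(candidates, key=cost)[:width]
        best = min([best, beam[0]], key=cost)
    return best

# Toy stand-in for schedules: a state is an integer, each "transformer"
# nudges it, and the cost is the distance to a target value.
def expand(state):
    return [state + 1, state - 1, state * 2]

def cost(state):
    return abs(state - 37)

print(beam_search(1, expand, cost, depth=10, width=3))  # 37
```

&lt;p&gt;A wider beam keeps more candidates alive at each level, trading search time
for a lower chance of discarding a schedule that only pays off after several
transformations.&lt;&#x2F;p&gt;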
&lt;h3 id=&quot;conversion-to-halide&quot;&gt;Conversion to Halide&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have a candidate schedule tree, we convert it into Halide.
We do this by checking the ancestry path from compute nodes:
this path determines whether a function&#x27;s variables are split,
the traversal order for computing the function, and, for called functions,
the granularity at which the function is stored and computed.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the schedule above for the functions &lt;code&gt;bx&lt;&#x2F;code&gt;, &lt;code&gt;by&lt;&#x2F;code&gt;, and &lt;code&gt;in&lt;&#x2F;code&gt;.
Converted into Halide code, the schedule looks like the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;by.reorder(y, x);

bx.store_root();
bx.compute_at(by, y);
bx.reorder(y, x);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We evaluate the performance of the autoscheduler over three benchmarks.
We do this by comparing the performance of the autoscheduled run
(OPT configuration) vs. the run with the default schedule (DEF configuration).
We measure runtime and memory usage using &lt;code&gt;gprof&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For the experiments, we set the weights for execution features as follows:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Group&lt;&#x2F;th&gt;&lt;th&gt;Weight&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;mem&lt;&#x2F;td&gt;&lt;td&gt;0.1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;loads&lt;&#x2F;td&gt;&lt;td&gt;0.5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;stores&lt;&#x2F;td&gt;&lt;td&gt;0.5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;arith ops&lt;&#x2F;td&gt;&lt;td&gt;1.0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;math ops&lt;&#x2F;td&gt;&lt;td&gt;10.0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We run beam search with a depth of 10 and beam width of 300.&lt;&#x2F;p&gt;
&lt;p&gt;For all benchmarks, the output functions are realized across an extent of
2048x2048.
The results below are averaged across three runs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;benchmark-1&quot;&gt;Benchmark 1&lt;&#x2F;h3&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;g(x,y) = sqrt(cos(x) + sin(y));
f(x,y) = g(x + 1,y) + g(x,y) + g(x + 1,y + 1) + g(x,y + 1);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Function&lt;&#x2F;th&gt;&lt;th&gt;Runtime (ms)&lt;&#x2F;th&gt;&lt;th&gt;Peak heap usage (bytes)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;f (DEF)&lt;&#x2F;td&gt;&lt;td&gt;87.72&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;f (OPT)&lt;&#x2F;td&gt;&lt;td&gt;13.48&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;g (DEF)&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;g (OPT)&lt;&#x2F;td&gt;&lt;td&gt;172.70&lt;&#x2F;td&gt;&lt;td&gt;32874&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;benchmark-2&quot;&gt;Benchmark 2&lt;&#x2F;h3&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;blur_x(x,y) = input(x - 1,y) + input(x,y) + input(x + 1,y) &#x2F; 3;
blur_y(x,y) = blur_x(x - 1,y) + blur_x(x,y) + blur_x(x + 1,y) &#x2F; 3;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Function&lt;&#x2F;th&gt;&lt;th&gt;Runtime (ms)&lt;&#x2F;th&gt;&lt;th&gt;Peak heap usage (bytes)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;blur_y (DEF)&lt;&#x2F;td&gt;&lt;td&gt;12.70&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;blur_y (OPT)&lt;&#x2F;td&gt;&lt;td&gt;19.02&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;blur_x (DEF)&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;blur_x (OPT)&lt;&#x2F;td&gt;&lt;td&gt;16.16&lt;&#x2F;td&gt;&lt;td&gt;16400&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;benchmark-3&quot;&gt;Benchmark 3&lt;&#x2F;h3&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;f(x,y) = x + y;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Function&lt;&#x2F;th&gt;&lt;th&gt;Runtime (ms)&lt;&#x2F;th&gt;&lt;th&gt;Peak heap usage (bytes)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;f (DEF)&lt;&#x2F;td&gt;&lt;td&gt;11.85&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;f (OPT)&lt;&#x2F;td&gt;&lt;td&gt;12.18&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h3&gt;
&lt;p&gt;Note that for the DEF configuration, only the output functions have runtimes
associated with them since all called functions are inlined.&lt;&#x2F;p&gt;
&lt;p&gt;The autoscheduler performs rather poorly relative to the default schedule.
While it successfully makes space-runtime tradeoffs (e.g., &lt;code&gt;f&lt;&#x2F;code&gt; in Benchmark 1),
allowing the computation of a function to run much faster by saving
intermediate results, it runs more slowly and uses more memory than the
default schedule across all benchmarks.&lt;&#x2F;p&gt;
&lt;p&gt;We believe the poor performance of the autoscheduler has two main causes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Wrong feature weights.&lt;&#x2F;strong&gt; The feature weights for the cost model
are chosen by fiat; if they were instead learned from a set of
training data, the weights could probably better capture the
execution profile of schedules.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Missing execution features.&lt;&#x2F;strong&gt; Some execution features not
captured in the current cost model probably have a significant effect
on performance.
Most importantly, the cost model does not reason about locality.
Because of this, the autoscheduler sometimes generates schedules with
loop orders that have poor locality
(e.g., a function being traversed in column-major order instead
of row-major order).
It is not clear how to quantify locality in the cost model, but doing so is an
obvious extension.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
            </item>
        
            <item>
                <title>Software Simulation for Data Streaming in HeteroCL</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/heterocl-stream/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/heterocl-stream/</guid>
                <description>&lt;p&gt;With the pursuit of higher performance under physical constraints, there has been an increasing deployment of special-purpose hardware accelerators such as FPGAs. The traditional approach to programming such devices is to use hardware description languages (HDLs). However, with the rising complexity of applications, we need a higher level of abstraction for productive programming. C-based high-level synthesis (HLS) has thus been proposed and adopted by vendors such as Xilinx and Intel. Nonetheless, to achieve high performance, users usually need to modify the algorithms of applications to incorporate different types of hardware optimization, which makes the programs less productive to write and harder to maintain. To address this challenge, recent work such as &lt;a href=&quot;http:&#x2F;&#x2F;heterocl.csl.cornell.edu&#x2F;&quot;&gt;HeteroCL&lt;&#x2F;a&gt; proposes decoupling the algorithm from the hardware customization techniques, which allows users to explore the design space and the trade-offs efficiently. In this project, we focus on extending HeteroCL with data streaming support by providing functional software-level simulation (in contrast with hardware-level simulation, where we simulate after hardware synthesis). Experimental results show that with the LLVM JIT runtime, we achieve orders of magnitude of speedup compared with the software simulation provided by HLS tools.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-data-streaming&quot;&gt;Why Data Streaming?&lt;&#x2F;h3&gt;
&lt;p&gt;Unlike traditional devices such as CPUs and GPUs, FPGAs do not have a pre-defined memory hierarchy (e.g., caches and register files). As a result, to achieve better performance, users must design their own memory hierarchy, including data access methods such as streaming. In this project, we focus on streaming between on-chip modules. We are interested in cross-module streaming because it introduces more parallelism into the designs. More specifically, we can use streaming to implement task-level parallelism. We use the following example written in HeteroCL to illustrate the idea of streaming.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;@hcl.def_([A.shape, B.shape, C.shape])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;M1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(A, B, C):&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;with &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.for_(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        B[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        C[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
@hcl.def_([B.shape, D.shape])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;M2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(B, D):&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;with &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.for_(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        D[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;B[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
@hcl.def_([C.shape, E.shape])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;M3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(C, E):&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;with &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.for_(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        E[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;C[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
M1(A, B, C)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
M2(B, D)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
M3(C, E)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this example, &lt;code&gt;M1&lt;&#x2F;code&gt; takes in one input tensor &lt;code&gt;A&lt;&#x2F;code&gt; and writes to two output tensors &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt;. Then, &lt;code&gt;M2&lt;&#x2F;code&gt; and &lt;code&gt;M3&lt;&#x2F;code&gt; read from &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt; and write to &lt;code&gt;D&lt;&#x2F;code&gt; and &lt;code&gt;E&lt;&#x2F;code&gt;, respectively. We can see that &lt;code&gt;M2&lt;&#x2F;code&gt; and &lt;code&gt;M3&lt;&#x2F;code&gt; have no data dependence and can thus be run in parallel. Moreover, these two modules can start as soon as they receive an output produced by &lt;code&gt;M1&lt;&#x2F;code&gt;. To realize such task-level parallelism, we can replace the intermediate results &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt; with data streams. We illustrate the difference between before and after applying data streaming with the following figure.&lt;&#x2F;p&gt;
&lt;img src=&quot;exec_time.png&quot; width=&quot;500&quot; &gt;
&lt;h3 id=&quot;data-streaming-in-heterocl&quot;&gt;Data Streaming in HeteroCL&lt;&#x2F;h3&gt;
&lt;p&gt;The key feature of HeteroCL is that it decouples the algorithm specification from the hardware optimization techniques, and this also applies to streaming optimization. To specify streaming between modules, we use the primitive &lt;code&gt;to(tensor, dst, src, depth=1)&lt;&#x2F;code&gt;. It takes four arguments. The first is the tensor that will be replaced with a stream. The second is the destination module and the third is the source module. Finally, users can also specify the depth of the stream. Currently, the data stream is implemented with FIFOs; HeteroCL will provide other types of streaming in the future. Below, we show how to specify data streaming for our previous example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.create_schedule([A, B, C, D, E])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
s.to(B, s[M2], s[M1], depth&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
s.to(C, s[M3], s[M1], depth&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;software-simulation-for-data-streaming&quot;&gt;Software Simulation for Data Streaming&lt;&#x2F;h3&gt;
&lt;p&gt;Programming language support alone is not enough. We also need the ability to simulate programs after applying data streaming. One way to do that is to use the existing HeteroCL back ends. Namely, we can generate HLS code with data streaming and use the HLS tools to run software simulation. Note that software simulation here refers to cycle-inaccurate simulation. We focus only on cycle-inaccurate simulation because cycle-accurate simulation requires running high-level synthesis, which can be time-consuming. However, the existing back ends require users to have HLS tools installed, which is not ideal for an open-source programming framework. Moreover, users need a separate compilation step to run the simulation. Thus, in this project, we introduce a CPU simulation flow to HeteroCL by extending the LLVM JIT runtime. With this feature, users can quickly verify the correctness of a program after adding data streaming.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation-details&quot;&gt;Implementation Details&lt;&#x2F;h3&gt;
&lt;p&gt;The code can be seen &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Hecmay&#x2F;heterocl&#x2F;tree&#x2F;stream&quot;&gt;here&lt;&#x2F;a&gt;. The key idea is to simulate data streaming with threads. In other words, each module will be executed using a single thread. We also implement a scheduling algorithm to decide the firing of a thread and the synchronization between threads. For streaming, we implement the streams by using one-dimensional buffers. We assign the size of a buffer according to the specified FIFO depth. Currently, we only provide blocking reads and blocking writes. Non-blocking operations will be left as our future work. In the following sections, we describe the algorithms and implementation details.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;module-scheduling&quot;&gt;Module Scheduling&lt;&#x2F;h4&gt;
&lt;p&gt;The purpose of this algorithm is to schedule each module by assigning it a timestep, which indicates the execution order between modules. Namely, modules that can be executed in parallel are assigned the same timestep. Similarly, if two modules are executed in sequence, they are assigned different timesteps. Note that the timesteps assigned to two consecutive executions do not need to be contiguous. Since each module is executed by a single thread, thread synchronization is enforced between two consecutive timesteps.&lt;&#x2F;p&gt;
&lt;p&gt;To begin with, we assign each module a group number. Modules within the same group are executed in sequence, while modules in different groups can be executed in parallel. To assign the group number, we first build a dataflow graph (DFG) according to the input program. An example is shown in the following figure, where the solid lines mean normal read&#x2F;write operations while the dotted lines refer to the read&#x2F;write of data streams.&lt;&#x2F;p&gt;
&lt;img src=&quot;DFG.png&quot; width=&quot;300&quot; &gt;
&lt;p&gt;After the DFG is built, we remove all the dotted lines. Then, we assign a unique ID to each connected component. This ID will be the group number. An example is shown below.&lt;&#x2F;p&gt;
&lt;img src=&quot;group.png&quot; width=&quot;300&quot; &gt;
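&lt;p&gt;The grouping step amounts to labeling connected components of the DFG once the stream edges are dropped. The sketch below shows one way to do this in Python; the function &lt;code&gt;assign_groups&lt;&#x2F;code&gt; and the four-module graph are hypothetical and are not taken from the HeteroCL code base.&lt;&#x2F;p&gt;

```python
from collections import defaultdict

def assign_groups(modules, solid_edges):
    """Label connected components of the DFG restricted to solid edges."""
    neighbors = defaultdict(set)
    for a, b in solid_edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    group = {}
    next_id = 0
    for module in modules:
        if module in group:
            continue
        # Depth-first walk over everything reachable through solid edges.
        stack = [module]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group[current] = next_id
            stack.extend(neighbors[current])
        next_id += 1
    return group

# Hypothetical DFG: M1 -- M2 is a normal read/write, M2 ~~ M3 is a stream
# (dropped here), and M3 -- M4 is a normal read/write.
modules = ["M1", "M2", "M3", "M4"]
solid = [("M1", "M2"), ("M3", "M4")]
print(assign_groups(modules, solid))
# {'M1': 0, 'M2': 0, 'M3': 1, 'M4': 1}
```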
&lt;p&gt;Now, we can start the scheduling process by assigning a timestep to each module. We first perform a very simple as-soon-as-possible (ASAP) algorithm. Namely, the first module within each group is assigned timestep 0. After that, we assign the timestep of each remaining module according to its data dependences. An example is shown below.&lt;&#x2F;p&gt;
&lt;img src=&quot;schedule1.png&quot; width=&quot;300&quot; &gt;
&lt;p&gt;However, this schedule is not correct because, as mentioned above, modules connected by streams should run in parallel and therefore share the same timestep. To fix this, we add the dotted lines back one at a time and correct the timesteps of the modules they connect. We also need to correct the timesteps of their succeeding modules accordingly. After all dotted lines are added, the scheduling algorithm is finished.&lt;&#x2F;p&gt;
&lt;img src=&quot;schedule2.png&quot; width=&quot;300&quot; &gt;
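&lt;p&gt;One way to read the two phases (ASAP on solid edges, then adding stream edges back one at a time) is as a fixed-point computation. The Python sketch below is an interpretation of the description, not the actual HeteroCL scheduler; the function &lt;code&gt;schedule&lt;&#x2F;code&gt;, the edge-list encoding, and the module names are assumptions made for illustration.&lt;&#x2F;p&gt;

```python
def schedule(modules, solid_edges, stream_edges):
    """ASAP over solid edges, then pull stream-connected modules together."""
    step = {m: 0 for m in modules}
    limit = len(modules)  # a valid timestep always stays below this value
    active_streams = []

    def settle():
        # Repeat until no constraint changes a timestep.
        changed = True
        while changed:
            changed = False
            for src, dst in solid_edges:       # consumer runs after producer
                want = max(step[dst], step[src] + 1)
                if want != step[dst]:
                    step[dst], changed = want, True
            for src, dst in active_streams:    # stream endpoints run together
                want = max(step[src], step[dst])
                if (step[src], step[dst]) != (want, want):
                    step[src] = step[dst] = want
                    changed = True
            if max(step.values()) not in range(limit):
                raise ValueError("no valid schedule under these constraints")

    settle()                        # plain ASAP, streams ignored
    for edge in stream_edges:       # add the dotted lines back one at a time
        active_streams.append(edge)
        settle()
    return step

# Hypothetical DFG: P -- Q and R -- S are solid edges, Q ~~ R is a stream.
print(schedule(["P", "Q", "R", "S"], [("P", "Q"), ("R", "S")], [("Q", "R")]))
# {'P': 0, 'Q': 1, 'R': 1, 'S': 2}
```

&lt;p&gt;In this sketch, constraints that cannot all be satisfied make the timesteps grow without bound, so the loop stops with an error once a timestep reaches the number of modules.&lt;&#x2F;p&gt;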
&lt;p&gt;Note that there exist cases that we cannot solve. For example, if two modules &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; are connected with a solid line, and the producer &lt;code&gt;A&lt;&#x2F;code&gt; streams to a module &lt;code&gt;M&lt;&#x2F;code&gt; while &lt;code&gt;B&lt;&#x2F;code&gt; streams from &lt;code&gt;M&lt;&#x2F;code&gt;, then there exists no valid schedule under our constraints. One possible solution is to merge &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; into a new module &lt;code&gt;A_B&lt;&#x2F;code&gt;. In this case, the streaming from&#x2F;to &lt;code&gt;M&lt;&#x2F;code&gt; becomes an internal stream, which can be scheduled easily by assigning &lt;code&gt;A_B&lt;&#x2F;code&gt; and &lt;code&gt;M&lt;&#x2F;code&gt; the same timestep. We do not merge the two modules in this implementation because &lt;code&gt;A&lt;&#x2F;code&gt; or &lt;code&gt;B&lt;&#x2F;code&gt; might be reused for other computations, in which case we would need to reconstruct the DFG. Thus, we leave this as future work.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;parallel-execution-with-threads&quot;&gt;Parallel Execution with Threads&lt;&#x2F;h4&gt;
&lt;p&gt;After we assign each module a timestep, we can start to execute the modules via threads. Before we execute a module on a new thread, we check whether all modules assigned smaller timesteps have completed. We do this in two steps. First, we check whether all modules with smaller timesteps have been fired. If not, we defer the current module by pushing it into a sorted execution list. Otherwise, we check whether those modules have finished. If not, we perform thread synchronization (e.g., by using &lt;code&gt;thread.join()&lt;&#x2F;code&gt; in C++). Finally, we execute the modules in the execution list. Since the list is sorted, we do not need to worry about new modules being inserted into the list.&lt;&#x2F;p&gt;
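&lt;p&gt;The effect of this policy can be approximated with a simple barrier per timestep. The sketch below is a simplification written in Python rather than the C++ runtime: it groups modules by timestep up front instead of maintaining a sorted execution list, and the names &lt;code&gt;run_by_timestep&lt;&#x2F;code&gt; and &lt;code&gt;make_body&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;

```python
import threading
from collections import defaultdict

def run_by_timestep(timestep, bodies):
    """Launch one thread per module, joining each timestep before the next."""
    by_step = defaultdict(list)
    for module, step in timestep.items():
        by_step[step].append(module)

    for step in sorted(by_step):
        threads = [threading.Thread(target=bodies[m]) for m in by_step[step]]
        for t in threads:
            t.start()
        for t in threads:   # synchronization point between timesteps
            t.join()

# Hypothetical modules that only record when they ran.
log = []
lock = threading.Lock()

def make_body(name):
    def body():
        with lock:
            log.append(name)
    return body

timestep = {"P": 0, "Q": 1, "R": 1, "S": 2}
run_by_timestep(timestep, {m: make_body(m) for m in timestep})
print(log[0], sorted(log[1:3]), log[3])  # P ['Q', 'R'] S
```

&lt;p&gt;Modules that share a timestep may finish in either order, which is why the two middle entries are sorted before printing.&lt;&#x2F;p&gt;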
&lt;h4 id=&quot;stream-buffers&quot;&gt;Stream Buffers&lt;&#x2F;h4&gt;
&lt;p&gt;In this work, we implement the streams with buffers that act like FIFOs. Instead of popping from or pushing to the buffers, we maintain a &lt;strong&gt;head&lt;&#x2F;strong&gt; and a &lt;strong&gt;tail&lt;&#x2F;strong&gt; pointer for each buffer. The pointers are stored as integers. The head pointer points to the next element to be read, and the tail pointer points to the next slot to be written. We update the pointers each time an element is written to or read from the buffer, wrapping them around with a modulo operation when they reach the buffer size (i.e., FIFO depth). Since two threads may update the pointers at the same time, we use &lt;code&gt;std::atomic&lt;&#x2F;code&gt; provided by C++ to make sure there is no data race. Finally, we maintain a map so that we can access a stream by its ID.&lt;&#x2F;p&gt;
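&lt;p&gt;To make the head&#x2F;tail bookkeeping concrete, here is a small Python sketch of a fixed-depth stream with blocking reads and writes. It is an illustration, not the C++ implementation: it swaps &lt;code&gt;std::atomic&lt;&#x2F;code&gt; for a &lt;code&gt;threading.Condition&lt;&#x2F;code&gt;, adds an element counter to tell a full buffer from an empty one, and the class name &lt;code&gt;StreamBuffer&lt;&#x2F;code&gt; and the stream ID are made up.&lt;&#x2F;p&gt;

```python
import threading

class StreamBuffer:
    """Fixed-depth FIFO backed by a list plus head and tail indices."""

    def __init__(self, depth):
        self.depth = depth
        self.data = [None] * depth
        self.head = 0    # next slot to read
        self.tail = 0    # next slot to write
        self.count = 0   # number of elements currently stored
        self.cond = threading.Condition()

    def blocking_write(self, value):
        with self.cond:
            while self.count == self.depth:   # full: wait for a reader
                self.cond.wait()
            self.data[self.tail] = value
            self.tail = (self.tail + 1) % self.depth
            self.count += 1
            self.cond.notify_all()

    def blocking_read(self):
        with self.cond:
            while self.count == 0:            # empty: wait for a writer
                self.cond.wait()
            value = self.data[self.head]
            self.head = (self.head + 1) % self.depth
            self.count -= 1
            self.cond.notify_all()
            return value

# Streams are looked up by ID through a map; the ID "B" is illustrative.
streams = {"B": StreamBuffer(depth=1)}
received = []

def producer():
    for i in range(10):
        streams["B"].blocking_write(i + 1)

def consumer():
    for _ in range(10):
        received.append(streams["B"].blocking_read())

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

&lt;p&gt;With a depth of 1, the producer and consumer alternate strictly, yet the values still arrive in order.&lt;&#x2F;p&gt;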
&lt;h4 id=&quot;llvm-jit-extension&quot;&gt;LLVM JIT Extension&lt;&#x2F;h4&gt;
&lt;p&gt;To give users one-pass compilation, we extend the existing LLVM JIT runtime in HeteroCL. Implementing both threads and stream buffers directly in LLVM would be complicated and hard to maintain. Thus, we implement them in C++ and design an interface composed of a set of functions, such as &lt;code&gt;BlockingRead&lt;&#x2F;code&gt;, &lt;code&gt;BlockingWrite&lt;&#x2F;code&gt;, &lt;code&gt;ThreadLaunch&lt;&#x2F;code&gt;, and &lt;code&gt;ThreadSync&lt;&#x2F;code&gt;. Then, inside our JIT compiler, we call these functions using LLVM external calls.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;In this section, we evaluate our implementation using both unit tests and realistic benchmarks. Experiments are performed on a server with a 2.20 GHz Intel Xeon processor and 128 GB of memory. We verify the correctness of the final result and compare the total run time in different cases.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;unit-tests&quot;&gt;Unit Tests&lt;&#x2F;h4&gt;
&lt;p&gt;The tests can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Hecmay&#x2F;heterocl&#x2F;blob&#x2F;stream&#x2F;tests&#x2F;test_cpu_stream.py&quot;&gt;here&lt;&#x2F;a&gt;. Below, we briefly illustrate what each test does using its DFG.&lt;&#x2F;p&gt;
&lt;img src=&quot;unit_test.png&quot; width=&quot;700&quot; &gt;
&lt;p&gt;For unit tests, we compare the run time before and after applying data streaming. The results are shown in the following table. We run each test 1000 times and report the average.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Testcase&lt;&#x2F;th&gt;&lt;th&gt;Original (ms)&lt;&#x2F;th&gt;&lt;th&gt;Multi-threading (ms)&lt;&#x2F;th&gt;&lt;th&gt;Speedup&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;two_stages&lt;&#x2F;td&gt;&lt;td&gt;0.0592&lt;&#x2F;td&gt;&lt;td&gt;0.0554&lt;&#x2F;td&gt;&lt;td&gt;1.070&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;three_stages&lt;&#x2F;td&gt;&lt;td&gt;0.0831&lt;&#x2F;td&gt;&lt;td&gt;0.0715&lt;&#x2F;td&gt;&lt;td&gt;1.162&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;internal_stage&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;0.0638&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;fork_stage&lt;&#x2F;td&gt;&lt;td&gt;0.0865&lt;&#x2F;td&gt;&lt;td&gt;0.0758&lt;&#x2F;td&gt;&lt;td&gt;1.141&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;merge_stage&lt;&#x2F;td&gt;&lt;td&gt;0.0906&lt;&#x2F;td&gt;&lt;td&gt;0.0739&lt;&#x2F;td&gt;&lt;td&gt;1.226&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The average speedup of our test cases is 1.150, which makes sense because we now use multi-threaded execution. Note that for the third test case (i.e., &lt;code&gt;test_internal_stage&lt;&#x2F;code&gt;), the functionality differs before and after applying data streaming, so no baseline run time is reported. To be more specific, we list the test program here.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;@hcl.def_([A.shape, B.shape, C.shape, D.shape])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;M1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(A, B, C, D):&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;with &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.for_(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        B[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        D[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;C[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
@hcl.def_([B.shape, C.shape])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;M2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(B, C):&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;with &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.for_(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;as &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        C[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;B[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
M1(A, B, C, D)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
M2(B, C)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hcl.create_schedule([A, B, C, D])&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
s.to(B, s[M2], s[M1], depth&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
s.to(C, s[M1], s[M2], depth&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that without applying streaming, the production of &lt;code&gt;D&lt;&#x2F;code&gt; is not affected by &lt;code&gt;M2&lt;&#x2F;code&gt;. However, if we specify &lt;code&gt;C&lt;&#x2F;code&gt; to be streamed from &lt;code&gt;M2&lt;&#x2F;code&gt; to &lt;code&gt;M1&lt;&#x2F;code&gt;, the original memory read of &lt;code&gt;C&lt;&#x2F;code&gt; in &lt;code&gt;M1&lt;&#x2F;code&gt; now becomes a blocking read. This also demonstrates that without the simulation support for streaming, some hardware behaviors cannot be correctly represented.&lt;&#x2F;p&gt;
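&lt;p&gt;The blocking behavior can be mimicked outside of HeteroCL. The following is a minimal Python sketch, not the HeteroCL runtime: it assumes that two threads stand in for &lt;code&gt;M1&lt;&#x2F;code&gt; and &lt;code&gt;M2&lt;&#x2F;code&gt;, and that two bounded queues of depth 1 stand in for the streamed &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt;. The names &lt;code&gt;fifo_b&lt;&#x2F;code&gt;, &lt;code&gt;fifo_c&lt;&#x2F;code&gt;, &lt;code&gt;m1&lt;&#x2F;code&gt;, and &lt;code&gt;m2&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;import queue
import threading

N = 10
A = list(range(N))
D = [None] * N

# Depth-1 FIFOs standing in for the streamed tensors B and C.
fifo_b = queue.Queue(maxsize=1)
fifo_c = queue.Queue(maxsize=1)

def m1():
    for i in range(N):
        fifo_b.put(A[i] + 1)         # B[i] = A[i] + 1, sent to m2
        D[i] = fifo_c.get() + 1      # blocking read of C[i] produced by m2

def m2():
    for i in range(N):
        fifo_c.put(fifo_b.get() + 1) # C[i] = B[i] + 1, sent back to m1

threads = [threading.Thread(target=m1), threading.Thread(target=m2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(D)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because each &lt;code&gt;get&lt;&#x2F;code&gt; blocks until the other thread has produced a value, every &lt;code&gt;D[i]&lt;&#x2F;code&gt; in this sketch depends on the output of &lt;code&gt;m2&lt;&#x2F;code&gt;, unlike the sequential version where &lt;code&gt;M1&lt;&#x2F;code&gt; finishes before &lt;code&gt;M2&lt;&#x2F;code&gt; starts.&lt;&#x2F;p&gt;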
&lt;h4 id=&quot;realistic-benchmark&quot;&gt;Realistic Benchmark&lt;&#x2F;h4&gt;
&lt;p&gt;We also show evaluation results from a realistic benchmark, which is more complicated than the synthetic programs used in the unit tests. Due to time constraints, we only use the Sobel edge detector, a popular edge detection algorithm in image processing. We compare our results against the software simulation tool provided by the HLS compiler. More specifically, we first generate Vivado HLS code with &lt;code&gt;hls::stream&lt;&#x2F;code&gt;. Then we use &lt;code&gt;csim&lt;&#x2F;code&gt; to run the software simulation. The evaluation results are shown below, including the time overhead due to compilation.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Simulation Method&lt;&#x2F;th&gt;&lt;th&gt;Simulation Time (s)&lt;&#x2F;th&gt;&lt;th&gt;Compilation Overhead (s)&lt;&#x2F;th&gt;&lt;th&gt;Total Run Time (s)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;LLVM JIT&lt;&#x2F;td&gt;&lt;td&gt;0.00094&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0.00094&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;Vivado HLS csim&lt;&#x2F;td&gt;&lt;td&gt;1.63&lt;&#x2F;td&gt;&lt;td&gt;1.29&lt;&#x2F;td&gt;&lt;td&gt;2.92&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We can see that the LLVM JIT runtime is roughly three orders of magnitude faster than HLS simulation (0.00094 s versus 1.63 s of simulation time). Moreover, the overhead caused by compilation is not negligible for HLS simulation: it accounts for 1.29 s of the 2.92 s total run time.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;conclusion-and-future-work&quot;&gt;Conclusion and Future Work&lt;&#x2F;h3&gt;
&lt;p&gt;In this work, we implement a software simulation runtime for data streams in HeteroCL by extending the existing LLVM JIT back end. We implement the simulation runtime with multi-threading in C++. Moreover, we propose a scheduling algorithm that exploits the task-level parallelism of a program after applying data streaming. Finally, we use unit tests to verify our work and use a realistic benchmark to demonstrate the programming efficiency over existing HLS tools.&lt;&#x2F;p&gt;
&lt;p&gt;Our next step will be testing our extension with more realistic benchmarks. In addition, by parsing HLS reports, we may be able to perform cycle-accurate simulation. Then we can compare the performance of our scheduling algorithm with those implemented in existing HLS tools. Finally, we want to submit a pull request to the upstream HeteroCL repository.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Evaluating the Performance Implications of Physical Addressing</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/novm/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/novm/</guid>
                <description>&lt;h2 id=&quot;introduction-to-virtual-addressing&quot;&gt;Introduction to Virtual Addressing&lt;&#x2F;h2&gt;
&lt;p&gt;Modern processors use &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Virtual_address_space&quot;&gt;&lt;em&gt;virtual addressing&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
to access &lt;em&gt;actual&lt;&#x2F;em&gt; memory locations through a translation layer.
Only highly privileged software, such as the operating system (OS),
has access to physical memory addresses while all other processes
can only refer to memory via these virtual ones.
When a process requests memory (e.g., via &lt;code&gt;malloc&lt;&#x2F;code&gt;),
the OS will allocate physical memory in fixed-size chunks, called pages,
and then map them into the process&#x27; virtual address space.
This allows the OS to allocate whichever regions of physical memory happen to be free despite the fact that the process may have requested a large, contiguous allocation.&lt;&#x2F;p&gt;
&lt;p&gt;Virtual addressing provides a few key abstractions for user-level software:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;A fully contiguous address space.&lt;&#x2F;li&gt;
&lt;li&gt;A unique address space not shared by any other process.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The former enables software to easily calculate relative memory addresses;
accessing any element in an array requires only one or two instructions
to add an offset to the base pointer and then load from memory.
Similarly, locations on the program stack are computed relative to the current stack pointer.
Neither of these &amp;quot;pointer arithmetic&amp;quot; operations would be valid if
executed on physical addresses, where an allocation may be scattered across non-contiguous pages.
The latter is a useful security primitive that enables
strong process memory isolation &amp;quot;for free,&amp;quot; since there is no way for a process
to even reference memory owned by another process
(unless the OS maps some physical location into both address spaces).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-case-against-virtual-addressing&quot;&gt;The Case Against Virtual Addressing&lt;&#x2F;h2&gt;
&lt;p&gt;The translation of virtual addresses is accelerated by dedicated hardware
called the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Translation_lookaside_buffer&quot;&gt;Translation Lookaside Buffer&lt;&#x2F;a&gt; (TLB).
This acts as a &amp;quot;translation cache&amp;quot; and hides most of the cost of virtual address translation,
except for when an address is not present in the TLB.
Missing in the TLB triggers a complex series of physical memory accesses
called &amp;quot;walking the page table&amp;quot; and tends to be extremely expensive
(especially if this has to be handled by software).&lt;&#x2F;p&gt;
&lt;p&gt;For workloads that allocate very large amounts of memory, the TLB can&#x27;t actually &amp;quot;reach&amp;quot;
all of the necessary memory addresses, causing frequent
&lt;a href=&quot;https:&#x2F;&#x2F;research.cs.wisc.edu&#x2F;multifacet&#x2F;papers&#x2F;isca13_direct_segment.pdf&quot;&gt;TLB misses&lt;&#x2F;a&gt;.
In these cases, it&#x27;s not uncommon for the CPU to be running only a single application
which would like to manage its own memory anyway; the aforementioned advantages
of virtual addressing are significantly reduced but the cost in TLB misses can be devastating to performance.
The other major cause of TLB misses is frequent context switching between processes,
which typically triggers a complete flush of the TLB state. For multithreaded applications
which rely heavily on system calls (e.g., webservers), this can incur
&lt;a href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;wp-content&#x2F;uploads&#x2F;2016&#x2F;02&#x2F;osr2007_rethinkingsoftwarestack.pdf&quot;&gt;overheads of up to 20%&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, virtual addressing is not a requirement for memory security.
There are many different proposals (and even some usable implementations)
of &lt;em&gt;tagged memory&lt;&#x2F;em&gt; architectures, where physical memory locations are associated with
&lt;em&gt;tags&lt;&#x2F;em&gt; that control how those locations can be accessed by software.
Some examples include: the &lt;a href=&quot;https:&#x2F;&#x2F;www.cl.cam.ac.uk&#x2F;research&#x2F;security&#x2F;ctsrd&#x2F;cheri&#x2F;&quot;&gt;CHERI capability architecture&lt;&#x2F;a&gt;;
the &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2694383&quot;&gt;PUMP processor for software-defined metadata&lt;&#x2F;a&gt;;
and the &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=3243743&quot;&gt;secure information flow CPU, Hyperflow&lt;&#x2F;a&gt;.
Instead of relying on a process&#x27; inability to address memory,
these designs use hardware to efficiently check whether or not a memory access
is allowed by the system&#x27;s security policy. In these designs,
the protection provided by virtual addressing is either mostly or completely redundant.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;removing-virtual-addressing&quot;&gt;Removing Virtual Addressing&lt;&#x2F;h1&gt;
&lt;p&gt;Let us imagine that we are running code on one of these tagged memory architectures
and we want to eliminate virtual addressing and the overheads it entails.
In this world, we can still ask our OS for memory via &lt;code&gt;malloc&lt;&#x2F;code&gt;; however, it returns
a physically contiguous memory region (rather than a virtually contiguous one).
The large-memory applications described above, which manage their own memory,
would likely start by &lt;code&gt;malloc&lt;&#x2F;code&gt;-ing most of the computer&#x27;s physical memory
and then never call &lt;code&gt;malloc&lt;&#x2F;code&gt; again. Little would change for such programs
(except that the spatial locality assumptions their designers had originally
made about memory layout are more likely to reflect reality).&lt;&#x2F;p&gt;
&lt;p&gt;However, programs which request new allocations throughout their lifetimes
may no longer be able to execute correctly. Since &lt;code&gt;malloc&lt;&#x2F;code&gt; returns a physically contiguous region,
the OS needs to find a large enough free span of physical memory to allocate. Due to the
presence of &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fragmentation_(computing)&quot;&gt;fragmentation&lt;&#x2F;a&gt;,
it is possible that no such region exists. In that case, &lt;code&gt;malloc&lt;&#x2F;code&gt; returns &lt;code&gt;0&lt;&#x2F;code&gt; and,
in all likelihood, the program explodes.&lt;&#x2F;p&gt;
&lt;p&gt;Remember that such fragmentation was present
with virtual addressing as well, but the OS could stitch together various fragmented segments
to form a single virtual allocation. Therefore, programs should strive to allocate memory in fixed-size chunks;
essentially, they should assume that the OS can only allocate them pages of physical memory
and it&#x27;s &lt;em&gt;their job to stripe data structures across them&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;experimental-setup&quot;&gt;Experimental Setup&lt;&#x2F;h1&gt;
&lt;p&gt;To evaluate the impact of software changes required in lieu of virtual addressing,
we ran experiments with the following configurations. First, we ran all of our tests
on a computer with an Intel i7-7700 CPU (8 logical cores) clocked at 3.60GHz, with 32GB of physical memory, running Ubuntu 16.04.
Secondly, we followed the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;Benchmarking.html&quot;&gt;guidelines&lt;&#x2F;a&gt; provided by LLVM to reduce
variance; in particular, every test was executed on a single, isolated processor core.
We ran all of our tests ten times and report averages of our measurements; with
this setup we observed very little variance, typically less than 0.01% standard deviation.
Finally, we assumed that some reasonable amount of stack could be pre-allocated contiguously,
even on a physically addressed machine. We chose 32KB since that was approximately the smallest
sized stack required to execute the benchmarks normally.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, we could not actually execute any tests using physical addressing,
since there is no reliable method for allocating physical memory in user space.
While there are several &lt;a href=&quot;https:&#x2F;&#x2F;blog.linuxplumbersconf.org&#x2F;2017&#x2F;ocw&#x2F;system&#x2F;presentations&#x2F;4669&#x2F;original&#x2F;Support%20user%20space%20POSIX%20conformant%20contiguous__v1.00.pdf&quot;&gt;proposals&lt;&#x2F;a&gt; for how to implement these
features, they aren&#x27;t currently supported in Linux.
While there are reconfigurations and workarounds that could enable
this evaluation, these solutions are not lightweight.
Therefore, our results are overhead measurements that represent worst-case performance;
we don&#x27;t actually expect any of our tests to result in speedups.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;dealing-with-the-stack&quot;&gt;Dealing With The Stack&lt;&#x2F;h1&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Call_stack&quot;&gt;stack&lt;&#x2F;a&gt; presents another potential issue.
Current compilers assume that stack-allocated variables can be addressed relative to the
stack pointer, which is stored in a register.
Obviously, while this is an efficient mechanism for address computation,
this scheme doesn&#x27;t work if any given stack frame does not consist of physically contiguous memory.&lt;&#x2F;p&gt;
&lt;p&gt;For certain applications, it is likely that we can allocate a single stack page at start-up
and then go on with our lives. In this case, the restrictions mentioned above aren&#x27;t really
an issue. However, programs &lt;em&gt;may&lt;&#x2F;em&gt; allocate large data structures on the stack, may recurse
deeply or may have dynamically sized stack allocations. In these cases, we can run into
the issues described above since the stack we&#x27;ve already allocated may not be large enough.&lt;&#x2F;p&gt;
&lt;p&gt;One solution to this problem is to dynamically allocate stack frames whenever a function call
is made. In this case, every function prologue needs to check the current stack and see if
there&#x27;s enough space. If there is, then the function executes normally; otherwise,
the function asks the OS for a memory region big enough to store the current function&#x27;s
entire stack frame before running. During the function epilogue, the program should
then &lt;code&gt;free&lt;&#x2F;code&gt; that memory.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that &lt;a href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;&quot;&gt;gcc&lt;&#x2F;a&gt; has implemented
exactly this functionality and calls it &lt;a href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;wiki&#x2F;SplitStacks&quot;&gt;&amp;quot;stack splitting&amp;quot;&lt;&#x2F;a&gt;.
You can check out that link for a detailed explanation of the &lt;code&gt;-fsplit-stack&lt;&#x2F;code&gt; option for
gcc, but it essentially implements the algorithm described above, modulo some tricks
for making the common case fast and maintaining its own small free list for stack pages.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overhead-of-stack-checking&quot;&gt;Overhead of Stack Checking&lt;&#x2F;h2&gt;
&lt;p&gt;We evaluated the performance impact of using split stacks on two microbenchmarks
designed to be bottlenecked on function calls, which respectively do and do not
trigger run-time memory allocations.
The first microbenchmark was a naive program to compute the 50th
number in the Fibonacci sequence without memoization; this did not require
a large amount of stack so we use it to measure the overhead of &lt;em&gt;just checking&lt;&#x2F;em&gt;
whether or not there is enough space.
The other microbenchmark naively recursively computes the sum of the first &lt;em&gt;n&lt;&#x2F;em&gt; integers.
We executed this benchmark with &lt;code&gt;n=1000000000&lt;&#x2F;code&gt; so that it would trigger run-time allocations
by recursing very deeply.&lt;&#x2F;p&gt;
&lt;img src=&quot;stackcheckmicro.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;This diagram plots the execution time of these two microbenchmarks with the &lt;code&gt;-fsplit-stack&lt;&#x2F;code&gt;
option enabled, normalized to regular execution (statically allocated stacks).
As you can see, our Fibonacci benchmark has only about a 15% increase in runtime
caused by checking remaining stack space. While not an insignificant cost, most
programs will not execute nearly as high a density of function calls and should not
see such high overheads.
The recursive sum benchmark ended up executing 6410250 different run-time allocations
of 1544 bytes. While it had a very large performance impact, we could certainly tune
the stack allocation algorithm to request larger chunks of memory to reduce the frequency
of &lt;code&gt;malloc&lt;&#x2F;code&gt; calls.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;parsec-benchmarks&quot;&gt;PARSEC Benchmarks&lt;&#x2F;h3&gt;
&lt;p&gt;While these microbenchmarks give us a good upper bound on worst-case overhead,
we wanted to evaluate some more realistic tests. We chose the &lt;a href=&quot;https:&#x2F;&#x2F;parsec.cs.princeton.edu&#x2F;&quot;&gt;PARSEC&lt;&#x2F;a&gt;
benchmarks, mostly because we used them for a prior project in this class and could test them easily.
The execution times for these benchmarks are the built-in &amp;quot;Regions of Interest&amp;quot; and exclude initialization
and warm up times for each program.&lt;&#x2F;p&gt;
&lt;img src=&quot;stackcheckparsec.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;With these four benchmarks, there was almost no impact of applying the split-stack option.
We should note that the StreamCluster benchmark actually &lt;em&gt;sped up&lt;&#x2F;em&gt; when using split-stack;
likely this is some sort of memory alignment effect à la &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1508275&quot;&gt;Mytkowicz et al.&lt;&#x2F;a&gt;.
In any case, we should probably consider this impact to be negligible.&lt;&#x2F;p&gt;
&lt;p&gt;Of these benchmarks, only Ferret actually required dynamically allocating stack space.
Each of these allocations was 618 KB, which is a potential concern. It is unclear whether a real
system using only physical addressing could frequently service allocations of this size.
We hypothesize that real systems with many gigabytes or terabytes
of memory will be able to regularly satisfy allocations
in the kilobyte range even under severe fragmentation; however, evaluating this is future work.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;large-object-allocations&quot;&gt;Large Object Allocations&lt;&#x2F;h1&gt;
&lt;p&gt;The other major modification to programs would be supporting large memory allocations.
Since it is probably unreliable to request very large contiguous memory regions, we must adopt a new
strategy. To evaluate the potential impact of these changes, we modified the Blackscholes
benchmark from the PARSEC suite. Blackscholes uses two dynamically allocated arrays,
which we replaced with custom &lt;code&gt;Array&lt;&#x2F;code&gt; objects that we implemented to use only fixed-size
allocations. We chose to modify this application not only because it consists of a single,
easily-modifiable source file, but also because it iterates through large arrays
and is very likely to be negatively impacted by array access latency and spatial locality within the array.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;custom-array-implementation&quot;&gt;Custom Array Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;We implemented our &lt;code&gt;Array&lt;&#x2F;code&gt; using a tree-based data structure that mimics
the functionality of page tables in a virtual address environment.
We support up to three levels of allocation, where the final
level contains data and all previous levels contain pointers to
other pages. In the following diagram, the &lt;strong&gt;L1&lt;&#x2F;strong&gt; page contains
pointers to &lt;strong&gt;L2&lt;&#x2F;strong&gt; pages, which contain pointers to the &lt;strong&gt;L3&lt;&#x2F;strong&gt; pages.
Each &lt;strong&gt;L3&lt;&#x2F;strong&gt; page contains the actual object array data.&lt;&#x2F;p&gt;
&lt;img src=&quot;treedata.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;As an optimization, the constructor for &lt;code&gt;Array&lt;&#x2F;code&gt; determines how many
levels of pages are required to store all of the data. For instance,
in the event the data fits in a single page, then the &lt;strong&gt;L1&lt;&#x2F;strong&gt; page will hold data
and no &lt;strong&gt;L2&lt;&#x2F;strong&gt; or &lt;strong&gt;L3&lt;&#x2F;strong&gt; pages will be allocated.&lt;&#x2F;p&gt;
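&lt;p&gt;To make the lookup path concrete, here is a minimal Python sketch of such a tree. It is not our C++ &lt;code&gt;Array&lt;&#x2F;code&gt; implementation: the names &lt;code&gt;PagedArray&lt;&#x2F;code&gt;, &lt;code&gt;new_tree&lt;&#x2F;code&gt;, and &lt;code&gt;PAGE_SLOTS&lt;&#x2F;code&gt; are hypothetical, the page capacity is counted in slots rather than bytes, and every page is allocated eagerly for brevity.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;PAGE_SLOTS = 4  # hypothetical page capacity, counted in slots

def new_tree(depth):
    # A depth-1 tree is a single data page; deeper trees hold child pages.
    if depth == 1:
        return [0] * PAGE_SLOTS
    return [new_tree(depth - 1) for _ in range(PAGE_SLOTS)]

class PagedArray:
    def __init__(self, length):
        self.length = length
        # Use the shallowest tree (at most three levels) that holds the data.
        self.depth = next(d for d in (1, 2, 3) if PAGE_SLOTS ** d &amp;gt;= length)
        self.root = new_tree(self.depth)

    def _locate(self, index):
        # Walk the pointer pages, consuming one base-PAGE_SLOTS digit per level.
        page = self.root
        for level in range(self.depth - 1, 0, -1):
            digit, index = divmod(index, PAGE_SLOTS ** level)
            page = page[digit]
        return page, index

    def __getitem__(self, index):
        page, slot = self._locate(index)
        return page[slot]

    def __setitem__(self, index, value):
        page, slot = self._locate(index)
        page[slot] = value

arr = PagedArray(40)   # 40 slots need three levels when pages hold 4 slots
for i in range(40):
    arr[i] = i * i
print(arr.depth, arr[37])
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this sketch, an array that fits in one page has depth 1, so the loop in &lt;code&gt;_locate&lt;&#x2F;code&gt; never runs and the root page is indexed directly. Deeper trees pay one extra pointer dereference per level on every access.&lt;&#x2F;p&gt;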
&lt;h2 id=&quot;evaluating-tree-overheads&quot;&gt;Evaluating Tree Overheads&lt;&#x2F;h2&gt;
&lt;p&gt;We had to modify the Blackscholes benchmark slightly to replace the calls
to &lt;code&gt;malloc&lt;&#x2F;code&gt; with our C++ &lt;code&gt;Array&lt;&#x2F;code&gt; objects. This only involved modifying
the array allocation and deallocation since we overloaded the &lt;code&gt;[]&lt;&#x2F;code&gt; operator
for normal array dereferences.
Blackscholes only uses pointer arithmetic for allocation purposes so we didn&#x27;t
need to modify any other instructions.&lt;&#x2F;p&gt;
&lt;p&gt;The main sources of overhead we anticipated from these data structures
were not only the increased number of instructions to access data,
but also the reduced spatial locality of data within the array, which
depends on how big our physical allocations actually were.
Therefore, we evaluated a number of different configurations, where
our library used different sized &amp;quot;pages,&amp;quot; ranging from 4KB to 1MB.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, the original Blackscholes program used complicated pointer
arithmetic to allocate five contiguous arrays from a single call to &lt;code&gt;malloc&lt;&#x2F;code&gt;.
In our original modification, we treated these as five separate &lt;code&gt;Array&lt;&#x2F;code&gt; objects
since we can&#x27;t guarantee address contiguity. In the &amp;quot;Modified Blackscholes&amp;quot;
test, we re-wrote this to be a single array of &lt;code&gt;struct&lt;&#x2F;code&gt; objects so that
there might be more spatial locality between fields accessed around the same time.&lt;&#x2F;p&gt;
&lt;img src=&quot;treearraybs.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;We saw a worst-case overhead of around 17% with 8 KB pages. Among the
base &lt;code&gt;Array&lt;&#x2F;code&gt; configurations, 32 KB pages performed the best. Intuitively,
it makes sense that larger pages start to provide diminishing returns once they
exceed L1 and L2 CPU cache sizes. We tried to confirm these intuitions using
performance counters but found that L1 cache miss rates were very close across
all configurations and LLC (last level cache) miss rates varied wildly even
across executions of the same configuration. Likely this was caused by interference
with processes running on other cores or the OS itself.&lt;&#x2F;p&gt;
&lt;p&gt;However, using other performance counters we did notice that both the
original and modified Blackscholes programs had very similar IPC (instructions per cycle)
values, indicating that CPU efficiency wasn&#x27;t significantly impacted and
the primary overhead was simply caused by executing more instructions.&lt;&#x2F;p&gt;
&lt;p&gt;As a simple test of this, we modified the &lt;code&gt;Array&lt;&#x2F;code&gt; code to always use trees
of depth three (two layers of pointers and a single data layer), which removed
some of the runtime checks required to access data. The results for that test are
in the &amp;quot;No Branches&amp;quot; column above. Other than in the 1 MB case, this configuration
performed much better than the others, with only a 3% overhead in the 16 KB case.
In a robust implementation, one could achieve this effect by
using a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Factory_method_pattern&quot;&gt;&amp;quot;factory&amp;quot;&lt;&#x2F;a&gt; pattern
to create the appropriate depth &lt;code&gt;Array&lt;&#x2F;code&gt; for the given allocation. Ideally,
we would be able to determine this requirement statically so that array accesses
could be in-lined; this would probably avoid the vast majority of the overheads
caused by using this data structure.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;random-access-overheads&quot;&gt;Random Access Overheads&lt;&#x2F;h2&gt;
&lt;p&gt;The Blackscholes benchmark primarily scans through large arrays; we wanted
to measure the overhead of a microbenchmark with significantly less spatial locality.
In addition, we wanted to compare the impact of our changes on a long running test
that used small arrays. To achieve both of these, we wrote a small benchmark that
initializes the values of an array to a set of pseudorandom values (generated by
multiplying the index by a large prime number). Then we execute a pointer chase
through the array by looking up the value at location 0 and then treating the value
as the next index to inspect (modulo array size).&lt;&#x2F;p&gt;
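&lt;p&gt;The access pattern can be sketched in a few lines of Python. This is an illustration rather than our benchmark code: the function name, the particular prime, and the offset of one added to each index are assumptions. The offset is there so that slot 0 does not hold the value 0 and point back to itself.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def pointer_chase(size, steps, prime=1000003):
    # Fill each slot with a pseudorandom value derived from its index.
    data = [(i + 1) * prime for i in range(size)]
    index = 0
    for _ in range(steps):
        # Treat the stored value as the next index, modulo the array size.
        index = data[index] % size
    return index

print(pointer_chase(size=1024, steps=100000))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each load depends on the result of the previous one, so the loop exposes memory access latency directly instead of letting accesses overlap.&lt;&#x2F;p&gt;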
&lt;p&gt;Unlike the Blackscholes benchmark, this runs a large number of iterations but
can be configured to use a small or large array. The small test was
sized to fit into a single data page and therefore should not incur extra memory
accesses compared to the traditional array implementation. For this test, we
used 4KB pages. Like before, we include a &amp;quot;No Branches&amp;quot; configuration which is
precompiled to remove the run-time checks to determine the correct element look-up behavior.&lt;&#x2F;p&gt;
&lt;img src=&quot;randomtest.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;We can see here that the random access cost on large arrays is, unsurprisingly,
expensive. In the linear access case, there is still quite a bit of spatial locality
and &lt;code&gt;Array&lt;&#x2F;code&gt; pointer pages are likely to be in cache for multiple data accesses. With random
access, the larger amount of memory required to store the data increases the cache pressure.
Small arrays do suffer from some overhead in this test but likely this is primarily
caused by the increase in dynamic instruction count (the &amp;quot;No Branches&amp;quot; case executes 1.5 times
as many instructions as the baseline).&lt;&#x2F;p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h1&gt;
&lt;p&gt;While our results are not entirely easy to explain, they do at least somewhat
follow our intuitions. Memory allocation is a complex process dependent upon
a number of system variables and operating system implementation. In order
to better understand what is going on with these results, we must both
sample across more test benchmarks and measure performance in a more controlled
setting (e.g., by using &lt;a href=&quot;https:&#x2F;&#x2F;emeryberger.com&#x2F;research&#x2F;stabilizer&#x2F;&quot;&gt;Stabilizer&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;These preliminary results suggest that highly optimized data structures
tuned for physical memory allocation could impose very little overhead.
Furthermore, at least for programs which do not allocate large amounts of memory
on the stack, the cost of checking stack size and occasionally allocating new stack
frame pages would be negligible. All in all, if we could actually address physical
memory, we very well might see improvements in performance while also
simplifying much of the underlying hardware and operating system.&lt;&#x2F;p&gt;
&lt;p&gt;The source code for the &lt;code&gt;Array&lt;&#x2F;code&gt; object, the microbenchmarks and the instrumentation
modifications we made to libgcc can be found on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;dz333&#x2F;non-contiguous-mem&quot;&gt;GitHub&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Quantum Vectorization</title>
                <pubDate>Wed, 18 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/qvec/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/qvec/</guid>
                <description>&lt;p&gt;The code used in this blog post is hosted &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pbb59&#x2F;ScaffCC&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;In this blog post, we describe our efforts to develop a compiler pass to vectorize the implicit parallelism present in quantum algorithms. Quantum algorithms are probabilistic, and so need to be run multiple times to get a &amp;quot;reliable&amp;quot; result. Since each of these program runs is independent, several can be performed simultaneously on the same hardware without changing the final result, so long as the hardware has space to support the additional logic.&lt;&#x2F;p&gt;
&lt;p&gt;In this project, we developed an LLVM pass to transform code to help take advantage of this program structure.  Our LLVM pass rewrites code to duplicate all algorithm instructions associated with each array onto physical hardware.  We cannot conclude if this approach provides speedup without a proper experimental setup, but we have found that such a pass can be run on realistic quantum code to produce somewhat vectorized algorithms.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;quantum-computing&quot;&gt;Quantum Computing&lt;&#x2F;h2&gt;
&lt;p&gt;Quantum computing has exploded into the popular imagination in the past decade due to the promise of massive theoretical speedups over conventional digital computers. Whether real quantum hardware can live up to the promise remains to be seen, but that has not stopped researchers from developing complex toolflows and algorithms.&lt;&#x2F;p&gt;
&lt;p&gt;The computing paradigm of quantum computers is inherently different from that of a standard &amp;quot;classical&amp;quot; computer. Instead of representing a bit as either a &lt;code&gt;0&lt;&#x2F;code&gt; or &lt;code&gt;1&lt;&#x2F;code&gt;, quantum bits (qubits) represent a bit of information using a quantum superposition &lt;code&gt;a |0&amp;gt; + b |1&amp;gt;&lt;&#x2F;code&gt;, where &lt;code&gt;|0&amp;gt;&lt;&#x2F;code&gt; and &lt;code&gt;|1&amp;gt;&lt;&#x2F;code&gt; represent the possible realizable states and &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt; are normalized amplitudes whose squared magnitudes give the probability of measuring the respective state. Although &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt; theoretically hold infinite information, a measurement yields only a single bit, as in the classical case, because the state collapses to one of the realizable states &lt;code&gt;|0&amp;gt;&lt;&#x2F;code&gt; or &lt;code&gt;|1&amp;gt;&lt;&#x2F;code&gt; upon measurement.&lt;&#x2F;p&gt;
&lt;p&gt;Quantum computing offers unique computing properties due to the nature of a qubit. The main computational differences between quantum and classical computing include the following properties:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Property&lt;&#x2F;th&gt;&lt;th&gt;Quantum&lt;&#x2F;th&gt;&lt;th&gt;Classical CPU&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Architecture&lt;&#x2F;td&gt;&lt;td&gt;Spatial&lt;&#x2F;td&gt;&lt;td&gt;von Neumann&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Data&lt;&#x2F;td&gt;&lt;td&gt;Quantum State&lt;&#x2F;td&gt;&lt;td&gt;Voltage Level&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Control&lt;&#x2F;td&gt;&lt;td&gt;External (Laser)&lt;&#x2F;td&gt;&lt;td&gt;Voltage Level&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;States per bit&lt;&#x2F;td&gt;&lt;td&gt;Exponential&lt;&#x2F;td&gt;&lt;td&gt;Linear&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In general, a quantum computer can implement all of the computational primitives that a classical one can. Both, in theory, can be Turing complete given a universal set of logic gates. However, quantum computing also has computational primitives that classical computers lack. These primitives are key to quantum supremacy: the concept that a quantum computer can theoretically outperform classical computers.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Unique compute&lt;&#x2F;th&gt;&lt;th&gt;Example Usage&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Large State Space&lt;&#x2F;td&gt;&lt;td&gt;Chemical Reaction Simulation&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Entanglement&lt;&#x2F;td&gt;&lt;td&gt;Combinatorial Optimization (TSP)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Amplitude Amplification&lt;&#x2F;td&gt;&lt;td&gt;Database Search&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Probabilistic (multiple results)&lt;&#x2F;td&gt;&lt;td&gt;?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Phase&lt;&#x2F;td&gt;&lt;td&gt;?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;A potential downside to quantum computing is that it is inherently probabilistic. The output of one execution may be entirely different from the output of the next. Quantum algorithms must therefore be designed so that the correct answer has measurement probability &amp;gt;50%. The answer can then be inferred by repeating the execution many times and taking the majority result. Many quantum algorithms solve problems in the bounded-error quantum polynomial time class (BQP), where the correct answer can be found in polynomial time with probability at least 2&#x2F;3.  In practice, however, running a quantum program enough times to achieve reasonable confidence can be time- and resource-intensive.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;opportunities-for-vectorization&quot;&gt;Opportunities for Vectorization&lt;&#x2F;h2&gt;
&lt;p&gt;This probabilistic nature requires repeated runs of the same program. The number of runs required to obtain a &amp;quot;correct&amp;quot; result depends on one&#x27;s error threshold and the design of the algorithm; specifically, the number of runs needed to reach error &lt;code&gt;e&lt;&#x2F;code&gt; is &lt;code&gt;O(log(1&#x2F;e))&lt;&#x2F;code&gt;. Thus, there are diminishing returns for running the algorithm many times, but it is important to run the algorithm a &amp;quot;reasonable&amp;quot; number of times to achieve acceptably low error.&lt;&#x2F;p&gt;
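&lt;p&gt;To give a feel for this scaling, here is a minimal Python sketch based on the Hoeffding bound for a majority vote over independent runs. It is an illustration rather than part of the project: the function name &lt;code&gt;runs_needed&lt;&#x2F;code&gt; is hypothetical, each run is assumed to be correct with the same probability &lt;code&gt;p&lt;&#x2F;code&gt; above one half, and the bound is conservative, so the exact counts matter less than the logarithmic growth.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
import math


def runs_needed(p, e):
    # Conservative number of independent runs for a majority vote to be
    # wrong with probability at most e, when each run is correct with
    # probability p (assumed to be strictly above one half).
    gap = p - 0.5
    if max(gap, 0.0) == 0.0:
        raise ValueError("p must be strictly greater than 0.5")
    # Hoeffding bound: P(majority wrong) is at most exp(-2 * n * gap**2),
    # so n = ln(1/e) / (2 * gap**2) runs are enough.
    return math.ceil(math.log(1.0 / e) / (2.0 * gap ** 2))


# Illustrative error targets; p = 2/3 matches the BQP threshold.
for e in (0.1, 0.01, 0.001):
    print(e, runs_needed(2.0 / 3.0, e))
&lt;&#x2F;pre&gt;
&lt;p&gt;With &lt;code&gt;p = 2&#x2F;3&lt;&#x2F;code&gt;, each tenfold reduction in &lt;code&gt;e&lt;&#x2F;code&gt; adds a constant number of runs under this bound (about 41), which is the logarithmic growth described above.&lt;&#x2F;p&gt;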
&lt;p&gt;The naive method to repeatedly apply the algorithm is to run many iterations of the algorithm sequentially. This serialization increases the total runtime in proportion to the number of repetitions required. Consider, for instance, the entanglement program below, written in &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.princeton.edu&#x2F;research&#x2F;techreps&#x2F;TR-934-12&quot;&gt;Scaffold&lt;&#x2F;a&gt;. It must be run multiple times to get representative results.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;catN &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( qbit &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bit, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;n ) {
  H( bit[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] );
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    CNOT( bit[i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;], bit[i] );
  }
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; measure each bit
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
    MeasZ(bit[i]);
}

&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
  qbit bits[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
  catN( bits, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
}

&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because we know in advance that this program must be run multiple times, we can make the implicit outer loop explicit and produce something like the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;catN &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( qbit &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bit, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;n ) {
  H( bit[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] );
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    CNOT( bit[i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;], bit[i] );
  }
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; measure each bit
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
    MeasZ(bit[i]);
}

&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
  qbit bits[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;][&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    catN( &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(bits[i]), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now that we have a data-parallel outer loop, we can schedule multiple runs together on spare quantum resources and potentially vectorize the runs if the architecture allows. By exposing this parallelism, we expect to achieve a speedup over running the repeats sequentially. Note that we are &lt;em&gt;not&lt;&#x2F;em&gt; trying to vectorize the underlying algorithm; such an implementation would require more information about each algorithm and may fail due to data dependencies. Instead, we vectorize across the repeated runs that probabilistic computing requires, which are data-parallel by construction.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;We designed a quantum compiler pass within the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;epiqc&#x2F;ScaffCC&quot;&gt;ScaffCC&lt;&#x2F;a&gt; compiler infrastructure. ScaffCC adds IR passes and quantum computer backends on top of LLVM, so our pass is written just as one would write a pass for a classical compiler.&lt;&#x2F;p&gt;
&lt;p&gt;The pass first records each &lt;code&gt;alloca&lt;&#x2F;code&gt; instruction as a candidate for vectorization. We make the assumption that qubit arrays are the only memory structures allocated by these programs.  This assumption is based on observations of program samples included in the ScaffCC repository.&lt;&#x2F;p&gt;
&lt;p&gt;Once every allocation is recorded, each of these instructions is cloned a number of times equal to the &lt;code&gt;qvlen&lt;&#x2F;code&gt; argument. We then fully traverse the dataflow graph to copy all dependent instructions. We traverse the dependence graph breadth-first, starting from the allocations, and copy a dependent instruction only once all of its dependencies have been copied. This ordering ensures that the copied values are available for use in later instruction copies. Quantum computers are spatial architectures, so functions are inlined into a single basic block. Thus, our dataflow traversal is able to reach the whole program.&lt;&#x2F;p&gt;
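&lt;p&gt;The traversal described above can be sketched in Python on a toy IR. This is not the ScaffCC pass itself: the dictionary-based instruction format, the function name &lt;code&gt;vectorize&lt;&#x2F;code&gt;, and the &lt;code&gt;name.k&lt;&#x2F;code&gt; naming scheme for copies are all hypothetical, and the ordering is implemented here as a breadth-first topological sort (Kahn&#x27;s algorithm). The sketch only shows why copying in dependency order keeps every operand available.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
from collections import deque


def vectorize(instrs, qvlen):
    # instrs maps an instruction name to (opcode, list of operand names).
    # Returns a new dict holding qvlen copies of every instruction, where
    # copy k of an instruction reads copy k of each of its operands.
    users = {name: [] for name in instrs}
    pending = {}
    for name, (_, operands) in instrs.items():
        pending[name] = len(operands)
        for op in operands:
            users[op].append(name)

    # Start from instructions with no operands (the alloca instructions).
    ready = deque(name for name in instrs if pending[name] == 0)
    out = {}
    while ready:
        name = ready.popleft()
        opcode, operands = instrs[name]
        for k in range(qvlen):
            out[f"{name}.{k}"] = (opcode, [f"{op}.{k}" for op in operands])
        # An instruction becomes ready once all of its operands are copied.
        for user in users[name]:
            pending[user] -= 1
            if pending[user] == 0:
                ready.append(user)
    return out


# Toy program: one qubit array, two element pointers, and two gates.
program = {
    "bits": ("alloca", []),
    "q0": ("gep", ["bits"]),
    "q1": ("gep", ["bits"]),
    "h": ("H", ["q0"]),
    "cx": ("CNOT", ["q0", "q1"]),
}
copies = vectorize(program, 2)
print(copies["cx.1"])
&lt;&#x2F;pre&gt;
&lt;p&gt;Because the output dictionary is filled in dependency order, every copy appears after the copies it reads from, which mirrors the requirement that copied values exist before later instruction copies use them.&lt;&#x2F;p&gt;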
&lt;p&gt;We do not actually implement vector instructions because doing so would require extensive backend work to target the simulator. The simulator does support parallel operations, but does not give any timing information, so even with that backend work we could not measure any meaningful results.&lt;&#x2F;p&gt;
&lt;p&gt;It is worth noting that this implementation does not handle situations where one qubit allocation depends on another (for example, if a qubit allocation used the size of a previous allocation).  We choose to ignore these cases as a simplifying assumption.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We evaluated our technique using the ScaffCC compiler infrastructure and a &lt;a href=&quot;http:&#x2F;&#x2F;qutech.nl&#x2F;qx-quantum-computer-simulator&#x2F;&quot;&gt;quantum computer simulator&lt;&#x2F;a&gt;. Due to constraints of the simulator, we were limited to small benchmarks using a small number of qubits. We still chose to use the simulator to check for correctness and to get a sense of the probability distributions for the simulated algorithm. We identified six benchmarks that used roughly 10 or fewer qubits. One of the benchmarks, QFT (quantum Fourier transform), is an intermediate step in most algorithms and not meant to be measured, so we excluded it. The remaining five benchmarks are listed below, along with the number of times each execution is repeated. This number was chosen somewhat arbitrarily, between 10 and 100, depending on how fast the algorithm ran.&lt;&#x2F;p&gt;
&lt;p&gt;We used a pass to count the number of dynamic gate operations for each of the benchmarks. Note that all loops are unrolled in a quantum program because quantum computers are a spatial architecture, i.e., the number of static instructions is the same as the number of dynamic instructions.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;Qubits Used&lt;&#x2F;th&gt;&lt;th&gt;Gates&lt;&#x2F;th&gt;&lt;th&gt;Repeats&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Cat&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Ising&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;220&lt;&#x2F;td&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;VQE&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;148&lt;&#x2F;td&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Grover&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;174&lt;&#x2F;td&gt;&lt;td&gt;100&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Ground State&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;8713&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The simulator does not give timing information, so we created a rough timing model. We assume the target quantum computer uses ion trap technology, in which a microwave laser implements quantum gates by shining onto qubits. SIMD is possible by directing the laser to multiple qubits at once (&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2694357&quot;&gt;citation&lt;&#x2F;a&gt;). These &amp;quot;instructions&amp;quot; are likely not as fast as a &amp;gt;1GHz instruction cache on a classical computer, so amortizing the cost of control is important. Thus, we quantify the timing by the total number of laser pulses required to run the algorithm with enough repeats.
Additionally, we consider a quantum computer with 20 logical qubits that can be used for multiple simultaneous runs.&lt;&#x2F;p&gt;
&lt;p&gt;We do not consider any spatial scheduling problems and assume the qubit regions working on different runs are effectively in isolation. We can statically predict the best run time using our model:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;Time = Repeats * Gates * Gate_Time &#x2F; Vector_Length&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We consider speedup relative to the baseline, so the actual time to execute a gate is a constant factor that divides out. Since the vector length is the number of runs that fit on the machine at once, the theoretical speedup is then given by:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;Speedup = Floor(Max Qubits &#x2F; Used Qubits)&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Our theoretical results for each benchmark are given below for a 20 qubit machine and a 53 qubit machine like Google&#x27;s recent Sycamore computer, which &lt;a href=&quot;https:&#x2F;&#x2F;www.ibm.com&#x2F;blogs&#x2F;research&#x2F;2019&#x2F;10&#x2F;on-quantum-supremacy&#x2F;&quot;&gt;arguably&lt;&#x2F;a&gt; achieved quantum supremacy for the first time.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;Speedup (20 qubits)&lt;&#x2F;th&gt;&lt;th&gt;Speedup (53 qubits)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Cat&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Ising&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;VQE&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Grover&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Ground State&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
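&lt;p&gt;The table above can be reproduced from the speedup formula with a few lines of Python. The sketch also estimates laser pulse counts under one added assumption that is not part of the model above: a partially filled final batch still costs a full pass over the gates. The function names are hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
import math

# Qubits, gates and repeats per benchmark, copied from the benchmark table.
BENCHMARKS = {
    "Cat": (4, 8, 100),
    "Ising": (10, 220, 100),
    "VQE": (4, 148, 100),
    "Grover": (11, 174, 100),
    "Ground State": (6, 8713, 10),
}


def vector_length(max_qubits, used_qubits):
    # Number of independent runs that fit on the machine at once.
    return max_qubits // used_qubits


def laser_pulses(gates, repeats, vlen):
    # Each pulse applies one gate to vlen runs at once; a partial last
    # batch still costs a full pass over the gates (an added assumption).
    return math.ceil(repeats / vlen) * gates


for name, (qubits, gates, repeats) in BENCHMARKS.items():
    row = [vector_length(m, qubits) for m in (20, 53)]
    print(name, row, laser_pulses(gates, repeats, row[0]))
&lt;&#x2F;pre&gt;
&lt;p&gt;Under that rounding assumption, Ground State on the 20 qubit machine needs 4 batches to cover its 10 repeats, for an effective speedup of 2.5 rather than the ideal 3.&lt;&#x2F;p&gt;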
&lt;p&gt;We also compiled each benchmark with our pass and executed the resulting program on the simulator to check for &amp;quot;correctness&amp;quot;. Each program successfully compiled and executed on the simulator. We explicitly checked the &lt;code&gt;cat&lt;&#x2F;code&gt; program for correctness. In this algorithm, a group of 4 qubits is entangled so that its members measure as either all 0s or all 1s. We verified that the output contained multiple groups of 4 qubits with this property. The other algorithms also appeared to produce reasonable outputs (a mix of 0s and 1s that changed on a run-by-run basis).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;We implemented an LLVM pass to vectorize the implicit data-parallel repetition loop needed to produce precise quantum computing results.  Through this implementation, we show that such an optimization is possible and can be readily applied to some common quantum programs.  We then ran the transformed programs on a quantum gate simulator to check them and used a simple timing model to predict the speedup such an optimization could provide.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>A Simple Way to Implement a Bad High Level Synthesis Compiler</title>
                <pubDate>Fri, 13 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/futil-fsm/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/futil-fsm/</guid>
                <description>&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of this project was to experiment with a novel approach to writing an HLS compiler. In particular, we compile &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cucapra&#x2F;futil&quot;&gt;Futil&lt;&#x2F;a&gt;, a novel intermediate representation, to Verilog by generating finite state machines
that implement Futil&#x27;s control constructs. This project is divided into two main parts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Convert a Control AST in Futil to an intermediate FSM structure&lt;&#x2F;li&gt;
&lt;li&gt;Generate RTL from the intermediate FSM structure as well as the Futil structure&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;Futil is made of two sub-languages, a &lt;em&gt;structure&lt;&#x2F;em&gt; language for describing a static computation graph that represents the physical structure of a circuit,
and a &lt;em&gt;control&lt;&#x2F;em&gt; language for dynamically choosing which part of the static computation graph runs at a particular time.
The ultimate goal of Futil is to be a general framework, similar to LLVM, for working with optimizing HLS compilers.
However, the immediate goal of Futil is to provide a Verilog backend for the Dahlia language. This is what we present in this blog post.&lt;&#x2F;p&gt;
&lt;p&gt;The structural language is straightforward to convert to Verilog; it already is very close to Verilog. However, the control language does not have a straightforward representation in Verilog. Our plan is to convert these statements into a finite state machine with the same semantics.
The finite state machine is then easy to translate into Verilog.&lt;&#x2F;p&gt;
&lt;p&gt;A typical Futil program is shown below:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;(define&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;namespace &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;prog
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(define&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;component main () ()
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;==&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;((new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std a0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const0 out) (@ a0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const0 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std b0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const1 out) (@ b0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const1 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std gt0 (std_gt &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ a0 out) (@ gt0 left))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const2 out) (@ gt0 right))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const2 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std y0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const3 out) (@ y0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const3 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std z0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const4 out) (@ z0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const4 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)))
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;==&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(seq
     (par
      (enable a0 const0)
      (enable b0 const1))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ gt0 out) (enable gt0 a0 const2)
         (enable y0 const3)
         (enable z0 const4)))))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first arrow points to the structure and the second arrow points to the control.
The structure is an unordered list of two kinds of simple statements: &lt;code&gt;(new-std b0 (std_reg 32 0))&lt;&#x2F;code&gt; instantiates the library component &lt;code&gt;std_reg&lt;&#x2F;code&gt; under the name &lt;code&gt;b0&lt;&#x2F;code&gt;, with a bitwidth parameter of &lt;code&gt;32&lt;&#x2F;code&gt; and a value parameter of &lt;code&gt;0&lt;&#x2F;code&gt;, and &lt;code&gt;(-&amp;gt; (@ a0 out) (@ gt0 left))&lt;&#x2F;code&gt; wires the &lt;code&gt;out&lt;&#x2F;code&gt; port of component &lt;code&gt;a0&lt;&#x2F;code&gt; to the &lt;code&gt;left&lt;&#x2F;code&gt; port of component &lt;code&gt;gt0&lt;&#x2F;code&gt;. The control part specifies which components are &lt;em&gt;active&lt;&#x2F;em&gt; with the &lt;code&gt;enable&lt;&#x2F;code&gt; keyword, and the execution logic with the &lt;code&gt;par&lt;&#x2F;code&gt;, &lt;code&gt;seq&lt;&#x2F;code&gt;, &lt;code&gt;if&lt;&#x2F;code&gt;, and &lt;code&gt;while&lt;&#x2F;code&gt; keywords. Think of activating a component like a function call. When a component is &lt;em&gt;active&lt;&#x2F;em&gt; it is allowed to &lt;em&gt;run&lt;&#x2F;em&gt; and produce valid outputs.&lt;&#x2F;p&gt;
&lt;p&gt;In this project, we are interested in changing all the control logic to finite state machines (FSMs) and then generating a simulatable Verilog program based on both the FSMs and the structure.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;Futil is the backend for Dahlia. The Futil semantics are designed to allow for easy translation from higher-level languages such as Dahlia, but this creates a gap between the Futil semantics and Verilog implementations. The table below shows what is required to translate each piece of the Futil semantics to a synthesizable Verilog implementation.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Futil Semantics&lt;&#x2F;th&gt;&lt;th&gt;Verilog&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Invalid wire&lt;&#x2F;td&gt;&lt;td&gt;Read wires&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Component Reusing&lt;&#x2F;td&gt;&lt;td&gt;MUX, Read wires&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Control&lt;&#x2F;td&gt;&lt;td&gt;FSM&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;read-signals&quot;&gt;&lt;em&gt;Read&lt;&#x2F;em&gt; Signals&lt;&#x2F;h3&gt;
&lt;p&gt;In Futil semantics, the &lt;code&gt;enable&lt;&#x2F;code&gt; keyword is used to determine whether a component is active. It is the easiest way of translating a program into hardware. However, this implicitly assumes that the signal on a wire is not valid or readable until we &lt;code&gt;enable&lt;&#x2F;code&gt; a component. We therefore require every &lt;em&gt;data&lt;&#x2F;em&gt; wire to have one extra bit that specifies whether the signal is readable.&lt;&#x2F;p&gt;
&lt;p&gt;Another way to think about this is that the type of a data wire is &lt;code&gt;Option&amp;lt;T&amp;gt;&lt;&#x2F;code&gt;; the wire is either &lt;code&gt;Some(t)&lt;&#x2F;code&gt; or &lt;code&gt;None&lt;&#x2F;code&gt;, depending on whether the module is enabled.
In order to encode this in Verilog, we add an extra bit to every data wire to encode the tag of the variant.&lt;&#x2F;p&gt;
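&lt;p&gt;To make the encoding concrete, here is a minimal Python sketch of packing an optional value into a data word with one extra tag bit. The function names &lt;code&gt;encode&lt;&#x2F;code&gt; and &lt;code&gt;decode&lt;&#x2F;code&gt;, and the choice of placing the tag in the top bit, are assumptions made for illustration and not necessarily the layout our compiler emits.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def encode(value, width):
    """Pack an optional unsigned value into width + 1 bits.

    None becomes all zeros; a real value gets the tag bit (bit `width`) set.
    The top-bit placement is an assumption of this sketch.
    """
    if value is None:
        return 0
    if value not in range(2 ** width):
        raise ValueError("value does not fit in the given width")
    return 2 ** width + value


def decode(word, width):
    """Return the payload if the tag bit is set, otherwise None."""
    tag, payload = divmod(word, 2 ** width)
    return payload if tag == 1 else None
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With this sketch, &lt;code&gt;decode(encode(10, 32), 32)&lt;&#x2F;code&gt; returns &lt;code&gt;10&lt;&#x2F;code&gt;, while &lt;code&gt;decode(0, 32)&lt;&#x2F;code&gt; returns &lt;code&gt;None&lt;&#x2F;code&gt; because the tag bit is clear.&lt;&#x2F;p&gt;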
&lt;h3 id=&quot;mux&quot;&gt;&lt;em&gt;MUX&lt;&#x2F;em&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;A component can be used more than once in Futil. For instance, in the above example,
&lt;code&gt;const0&lt;&#x2F;code&gt; and &lt;code&gt;const2&lt;&#x2F;code&gt; are both connected to the input of &lt;code&gt;a0&lt;&#x2F;code&gt;. In Futil, we deal with this by only enabling
&lt;code&gt;a0&lt;&#x2F;code&gt; and &lt;code&gt;const0&lt;&#x2F;code&gt;, or &lt;code&gt;a0&lt;&#x2F;code&gt; and &lt;code&gt;const2&lt;&#x2F;code&gt; at the same time (as seen in the example below).
However, in Verilog the register needs to choose the input from &lt;code&gt;const0&lt;&#x2F;code&gt; and &lt;code&gt;const2&lt;&#x2F;code&gt;.
This introduces a multiplexer (MUX).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;(enable a0 const0)
(enable a0 const2)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At each time step, the read signals tell us which wire into the MUX is readable. Therefore, we can use the read wires (the variant tag bits) as the &lt;strong&gt;sel&lt;&#x2F;strong&gt; signals for the MUX.
In other words, this MUX can be thought of as a function &lt;code&gt;List&amp;lt;Option&amp;lt;T&amp;gt;&amp;gt; -&amp;gt; T&lt;&#x2F;code&gt; that chooses a &lt;code&gt;Some(t)&lt;&#x2F;code&gt; from a list of options. We assume that only one wire feeding into a MUX will be valid at a time.&lt;&#x2F;p&gt;
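&lt;p&gt;The selection rule can be sketched in Python as follows. This is only a model of the behavior, not generated hardware: the name &lt;code&gt;mux&lt;&#x2F;code&gt; and the representation of each wire as a &lt;code&gt;(read, data)&lt;&#x2F;code&gt; pair are assumptions. Unlike the function type above, the sketch returns a pair so that it can also report the case where no input is readable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def mux(inputs):
    """Select the single readable input from a list of (read, data) pairs.

    Returns (read, data) for the output wire. When no input is readable the
    result is (False, 0), which models a wire carrying None.
    """
    readable = [data for read, data in inputs if read]
    if len(readable) > 1:
        # The design assumes at most one readable input at a time.
        raise ValueError("more than one input is readable at the same time")
    if not readable:
        return (False, 0)
    return (True, readable[0])
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For example, &lt;code&gt;mux([(True, 10), (False, 5)])&lt;&#x2F;code&gt; returns &lt;code&gt;(True, 10)&lt;&#x2F;code&gt;. Raising an error when two inputs are readable is a choice made for the sketch; real hardware would simply produce an unspecified value.&lt;&#x2F;p&gt;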
&lt;h3 id=&quot;fsm&quot;&gt;&lt;em&gt;FSM&lt;&#x2F;em&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;Futil has control constructs such as &lt;code&gt;if&lt;&#x2F;code&gt; and &lt;code&gt;while&lt;&#x2F;code&gt;. Translating these into FSMs in the Verilog implementation is the main goal of this project. Before getting to that, however, we create intermediate FSM expressions in Futil. An FSM component has:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;input and output ports,&lt;&#x2F;li&gt;
&lt;li&gt;wires connecting its own ports to other components&#x27; ports,&lt;&#x2F;li&gt;
&lt;li&gt;internal control logic that determines the output signals.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The internal control logic of an FSM component can be divided into several states that determine the output signals. A state transfers to another according to some input signals. In general, all FSM components are composed of one &lt;strong&gt;Start&lt;&#x2F;strong&gt; state, some &lt;strong&gt;Intermediate&lt;&#x2F;strong&gt; states and one &lt;strong&gt;End&lt;&#x2F;strong&gt; state.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the syntax &lt;code&gt;(enable A B)&lt;&#x2F;code&gt;. The &lt;strong&gt;Start&lt;&#x2F;strong&gt; state transfers to the &lt;strong&gt;Intermediate&lt;&#x2F;strong&gt; state when the &lt;em&gt;valid&lt;&#x2F;em&gt; signal is high. In the &lt;strong&gt;Intermediate&lt;&#x2F;strong&gt; state, the FSM sends valid signals to subcomponents &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; and waits for their &lt;em&gt;ready&lt;&#x2F;em&gt; signals to go high. Once both &lt;em&gt;ready&lt;&#x2F;em&gt; signals are high, the FSM transfers to the &lt;strong&gt;End&lt;&#x2F;strong&gt; state and outputs a &lt;em&gt;ready&lt;&#x2F;em&gt; signal to notify any component waiting for this one to finish. It transfers back to the &lt;strong&gt;Start&lt;&#x2F;strong&gt; state when the &lt;em&gt;valid&lt;&#x2F;em&gt; signal goes low, which indicates that the parent component has received the &lt;em&gt;ready&lt;&#x2F;em&gt; signal and finished execution, so it is safe for the FSM to return to the &lt;strong&gt;Start&lt;&#x2F;strong&gt; state. The same design logic applies to all FSMs. The only difference is in the intermediate state(s): the &lt;code&gt;seq&lt;&#x2F;code&gt; FSM has one or more intermediate states, and each one transfers to the next state only when the &lt;em&gt;ready&lt;&#x2F;em&gt; signal for the preceding step is high; the &lt;code&gt;if&lt;&#x2F;code&gt; FSM sends &lt;em&gt;valid&lt;&#x2F;em&gt; to the module that executes the comparison and receives both &lt;em&gt;ready&lt;&#x2F;em&gt; and &lt;em&gt;condition&lt;&#x2F;em&gt; signals, which determine the state it should transfer to; the &lt;code&gt;while&lt;&#x2F;code&gt; FSM transfers to the loop &lt;strong&gt;Body&lt;&#x2F;strong&gt; state when the &lt;em&gt;condition&lt;&#x2F;em&gt; signal is high and goes to the &lt;strong&gt;End&lt;&#x2F;strong&gt; state when the condition is low.&lt;&#x2F;p&gt;
&lt;img src=&quot;fsm.png&quot; style=&quot;width: 100%&quot;&gt;
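&lt;p&gt;The handshake for &lt;code&gt;(enable A B)&lt;&#x2F;code&gt; can be modeled with a small Python step function. This is a behavioral sketch under our reading of the description above, not the generated Verilog; the function name &lt;code&gt;step&lt;&#x2F;code&gt; and the dictionary of outputs are assumptions, while the signal names follow the port table in the FSM Signatures section.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
START, RUN, END = "Start", "Intermediate", "End"


def step(state, valid, ready_a, ready_b):
    """Advance the (enable A B) FSM by one clock step.

    Returns (next_state, outputs), where outputs are the signals driven
    while the FSM is in `state`.
    """
    outputs = {
        "val_A": state == RUN,  # subcomponents are enabled only while running
        "val_B": state == RUN,
        "rdy": state == END,    # tell the parent that we are done
    }
    if state == START and valid:
        return RUN, outputs
    if state == RUN and ready_a and ready_b:
        return END, outputs
    if state == END and not valid:
        return START, outputs
    return state, outputs
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Starting from &lt;code&gt;START&lt;&#x2F;code&gt; with &lt;em&gt;valid&lt;&#x2F;em&gt; high, the model moves to the intermediate state, stays there until both &lt;em&gt;ready&lt;&#x2F;em&gt; inputs are high, then holds &lt;em&gt;rdy&lt;&#x2F;em&gt; high in the end state until &lt;em&gt;valid&lt;&#x2F;em&gt; drops.&lt;&#x2F;p&gt;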
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;img src=&quot;flow.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;To realize what we describe in the design overview, we gradually added intermediate passes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;first-intermediate-pass&quot;&gt;First Intermediate Pass&lt;&#x2F;h3&gt;
&lt;p&gt;There is a &lt;em&gt;Visitor&lt;&#x2F;em&gt; trait in this pass that performs a recursive walk of the abstract syntax tree (AST), so each individual pass can modify the AST with function calls such as &lt;code&gt;add_structure&lt;&#x2F;code&gt;, &lt;code&gt;add_input_port&lt;&#x2F;code&gt;, and &lt;code&gt;remove_structure&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;read-wires&quot;&gt;Read Wires&lt;&#x2F;h4&gt;
&lt;p&gt;The first pass adds read wires. We go through all input and output ports of each component, adding corresponding &lt;em&gt;read&lt;&#x2F;em&gt; ports, and then through each wire of the components, adding &lt;em&gt;read&lt;&#x2F;em&gt; wires between those ports. We do this pass ahead of creating FSM signatures because we don&#x27;t want to create &lt;em&gt;read&lt;&#x2F;em&gt; wires for control signals like &lt;em&gt;valid&lt;&#x2F;em&gt; and &lt;em&gt;ready&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;fsm-signatures&quot;&gt;FSM Signatures&lt;&#x2F;h4&gt;
&lt;p&gt;This pass translates control syntax to FSM components.&lt;&#x2F;p&gt;
&lt;p&gt;Based on the design logic of FSMs, we can specify the inputs and outputs of each FSM component and the wires connecting each port to its subcomponents. Notice that we also need to add &lt;em&gt;cond_read&lt;&#x2F;em&gt; signals to specify whether the &lt;em&gt;condition&lt;&#x2F;em&gt; signal from the comparison component is readable.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;FSM&lt;&#x2F;th&gt;&lt;th&gt;Input Ports&lt;&#x2F;th&gt;&lt;th&gt;Output Ports&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;enable&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;par&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;seq&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;em&gt;val, rdy_A, rdy_B, ...&lt;&#x2F;em&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;em&gt;rdy, val_A, val_B, ...&lt;&#x2F;em&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;if&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;em&gt;val, rdy_con, cond, cond_read, rdy_T, rdy_F&lt;&#x2F;em&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;em&gt;rdy, val_con, val_T, val_F&lt;&#x2F;em&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;while&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;em&gt;val, rdy_con, cond, cond_read, rdy_body&lt;&#x2F;em&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;em&gt;rdy, val_con, val_body&lt;&#x2F;em&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;interfacing&quot;&gt;Interfacing&lt;&#x2F;h3&gt;
&lt;p&gt;This pass creates an input port &lt;em&gt;clock&lt;&#x2F;em&gt; for all components and an input port &lt;em&gt;valid&lt;&#x2F;em&gt; for the top-level component. Notice that Futil has no explicit notion of time steps; we created these ports to make RTL translation easier.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;mux-signatures&quot;&gt;MUX Signatures&lt;&#x2F;h4&gt;
&lt;p&gt;Similar to creating FSM signatures, we need to specify the inputs and outputs of each MUX component and the wires connecting each port to its subcomponents. To do this, we create a HashMap keyed by destination port that stores a vector of the source ports wired to it. For each destination port, if more than one source port connects to it, we create a MUX.&lt;&#x2F;p&gt;
&lt;p&gt;Notice that control signals and data signals are handled differently. Control signals do not have corresponding &lt;em&gt;read&lt;&#x2F;em&gt; wires, and the output is high no matter which control signal is high. Data signals, on the other hand, always have corresponding &lt;em&gt;read&lt;&#x2F;em&gt; wires that explicitly specify whether the data on the &lt;em&gt;data&lt;&#x2F;em&gt; wire is readable, and the data signal should be chosen according to its &lt;em&gt;read&lt;&#x2F;em&gt; wire. In the implementation, we therefore go through the wires twice. The first time, we record the &lt;em&gt;read&lt;&#x2F;em&gt; wires going to the same destination components. The second time, we actually create the large MUX with both the &lt;em&gt;data&lt;&#x2F;em&gt; wires and their corresponding &lt;em&gt;read&lt;&#x2F;em&gt; wires.&lt;&#x2F;p&gt;
&lt;p&gt;The last step is removing old wires connecting the same destination port with more than one source port.&lt;&#x2F;p&gt;
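&lt;p&gt;The grouping step can be sketched in Python as follows. Our implementation is in Rust and uses a HashMap; this sketch uses a dictionary instead, and the function name &lt;code&gt;find_muxes&lt;&#x2F;code&gt; and the tuple representation of ports are assumptions made for illustration. It covers only the detection of destinations that need a MUX, not the handling of &lt;em&gt;read&lt;&#x2F;em&gt; wires.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from collections import defaultdict


def find_muxes(wires):
    """Group wires by destination port and report ports that need a MUX.

    `wires` is a list of (source, destination) pairs, with each port written
    as a (component, port) tuple. Returns a dict that maps every destination
    with more than one source to its list of sources, in wiring order.
    """
    sources_by_dest = defaultdict(list)
    for source, dest in wires:
        sources_by_dest[dest].append(source)
    return {dest: srcs for dest, srcs in sources_by_dest.items() if len(srcs) > 1}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Given wires from &lt;code&gt;a0&lt;&#x2F;code&gt; and &lt;code&gt;b0&lt;&#x2F;code&gt; into the &lt;code&gt;left&lt;&#x2F;code&gt; port of &lt;code&gt;gt0&lt;&#x2F;code&gt; and a single wire into its &lt;code&gt;right&lt;&#x2F;code&gt; port, the sketch reports only the &lt;code&gt;left&lt;&#x2F;code&gt; port as needing a MUX.&lt;&#x2F;p&gt;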
&lt;h3 id=&quot;second-intermediate-pass&quot;&gt;Second Intermediate Pass&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;fsm-implementation&quot;&gt;FSM Implementation&lt;&#x2F;h4&gt;
&lt;p&gt;This pass creates true FSM representations in the Futil AST. Each &lt;code&gt;FSM&lt;&#x2F;code&gt; has a &lt;em&gt;name&lt;&#x2F;em&gt; field, a &lt;em&gt;states&lt;&#x2F;em&gt; HashMap that maps state indices to states, a &lt;em&gt;start index&lt;&#x2F;em&gt; for the first state created, and a &lt;em&gt;last index&lt;&#x2F;em&gt; for the last state created. Each &lt;code&gt;State&lt;&#x2F;code&gt; is composed of a vector of &lt;em&gt;outputs&lt;&#x2F;em&gt;, where each &lt;em&gt;output&lt;&#x2F;em&gt; is an output &lt;em&gt;value&lt;&#x2F;em&gt; and a &lt;em&gt;port name&lt;&#x2F;em&gt;, and a vector of &lt;em&gt;transitions&lt;&#x2F;em&gt;, where each transition is a tuple of a &lt;em&gt;next state index&lt;&#x2F;em&gt; and &lt;em&gt;inputs&lt;&#x2F;em&gt;, each a &lt;em&gt;value&lt;&#x2F;em&gt; and a &lt;em&gt;port name&lt;&#x2F;em&gt;, so that the transition happens when the inputs have the given &lt;em&gt;values&lt;&#x2F;em&gt;. Finally, each state has an optional default state, which says where to transition when no &lt;em&gt;transition&lt;&#x2F;em&gt; condition is met.&lt;&#x2F;p&gt;
&lt;p&gt;We provide the methods &lt;code&gt;new(name: &amp;amp;str) -&amp;gt; StateIndex&lt;&#x2F;code&gt;, &lt;code&gt;new_state() -&amp;gt; StateIndex&lt;&#x2F;code&gt;, and &lt;code&gt;get_state(idx: StateIndex) -&amp;gt; &amp;amp;mut State&lt;&#x2F;code&gt; for &lt;code&gt;FSM&lt;&#x2F;code&gt;, and &lt;code&gt;push_output(output: ValuedPort)&lt;&#x2F;code&gt; and &lt;code&gt;add_transition(transition: Edge)&lt;&#x2F;code&gt; for &lt;code&gt;State&lt;&#x2F;code&gt;, to generate the actual FSM inner logic according to the graph we made in the design overview.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;fsm-to-rtl-generation&quot;&gt;FSM to RTL generation&lt;&#x2F;h3&gt;
&lt;p&gt;After generating an FSM, we need to translate the entire structure of inputs, outputs and states to synthesizable hardware using Verilog. This is done by breaking down a Verilog file into distinct components.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Module Declaration: Here we define the name of the module along with the inputs and outputs for it.&lt;&#x2F;li&gt;
&lt;li&gt;Wire&#x2F;reg definitions: These are internal signals that are used within the module.&lt;&#x2F;li&gt;
&lt;li&gt;FSM: FSMs can be represented in Verilog using 3 &lt;code&gt;always&lt;&#x2F;code&gt; blocks - State transition, Next state logic, State outputs.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The three &lt;code&gt;always&lt;&#x2F;code&gt; blocks are generated as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;State transition: This is pretty standard. It actually changes the state at a clock edge. Since this is a generic block it can be created without any inputs.&lt;&#x2F;li&gt;
&lt;li&gt;Next state logic: This block has a bunch of cases for all the states. For each state in the FSM struct, based on the input transitions to it, we have &lt;code&gt;if else&lt;&#x2F;code&gt;  statements for next state logic.&lt;&#x2F;li&gt;
&lt;li&gt;Output logic: This block contains output signals for each state represented by cases, similar to the previous block. In addition to having Verilog statements for all the relevant outputs in the state, we also assign the rest of the outputs of the FSM to be zero for now. This is done to avoid inferred latches, which can occur if all outputs are not assigned in each state even though they don&#x27;t change.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
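&lt;p&gt;As a rough illustration of the next state logic step, the Python sketch below emits a combinational &lt;code&gt;always&lt;&#x2F;code&gt; block from a simplified description of the states. It is not our Rust generator: the function name &lt;code&gt;next_state_block&lt;&#x2F;code&gt;, the signal names &lt;code&gt;state&lt;&#x2F;code&gt; and &lt;code&gt;next_state&lt;&#x2F;code&gt;, and the restriction to one input port per transition are all assumptions of the sketch.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def next_state_block(states):
    """Emit a combinational next-state block as Verilog text.

    `states` maps a state name to (transitions, default), where transitions
    is a list of (port, value, next_state) tuples and default is the state
    taken when no transition condition is met.
    """
    lines = ["always @(*) begin", "  case (state)"]
    for name, (transitions, default) in states.items():
        lines.append(f"    {name}: begin")
        keyword = "if"
        for port, value, target in transitions:
            lines.append(f"      {keyword} ({port} == {value}) next_state = {target};")
            keyword = "else if"
        # Always assign next_state on every path to avoid inferred latches.
        prefix = "else " if transitions else ""
        lines.append(f"      {prefix}next_state = {default};")
        lines.append("    end")
    lines += ["    default: next_state = state;", "  endcase", "end"]
    return "\n".join(lines)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Assigning &lt;code&gt;next_state&lt;&#x2F;code&gt; on every path, including the final &lt;code&gt;default&lt;&#x2F;code&gt; case, follows the same latch-avoidance reasoning as the output logic above.&lt;&#x2F;p&gt;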
&lt;p&gt;We used a &lt;a href=&quot;http:&#x2F;&#x2F;homepages.inf.ed.ac.uk&#x2F;wadler&#x2F;papers&#x2F;prettier&#x2F;prettier.pdf&quot;&gt;Wadler-style&lt;&#x2F;a&gt; printing API provided by the &lt;a href=&quot;https:&#x2F;&#x2F;docs.rs&#x2F;pretty&#x2F;0.7.1&#x2F;pretty&#x2F;&quot;&gt;pretty&lt;&#x2F;a&gt; Rust crate to format the Verilog files.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hardest-parts&quot;&gt;Hardest Parts&lt;&#x2F;h2&gt;
&lt;ol&gt;
&lt;li&gt;Futil is implemented in &lt;a href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;Rust&lt;&#x2F;a&gt;, so we spent some time getting familiar with the language.&lt;&#x2F;li&gt;
&lt;li&gt;The design of the FSM representation changed multiple times. States need to be stored behind a pointer so that they can be modified as we add transitions and outputs to them, and Rust forces the user to use reference counting for this, which we did not realize at first. Even with reference counting, we would have had to create reference cycles, which would prevent the FSMs from being freed. We therefore relied on HashMaps in the end.&lt;&#x2F;li&gt;
&lt;li&gt;Futil &lt;em&gt;read&lt;&#x2F;em&gt; signals are not common in Verilog coding conventions. We had two models in mind: the Verilog valid&#x2F;response model and our Futil valid&#x2F;read model. We mixed the two up and spent a huge amount of time discussing which one was the better design.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We evaluated our compiler by simulating the generated Verilog. We generated Futil programs with a simple backend we wrote for the Dahlia compiler. The Verilog simulation was done with an open source tool called &lt;a href=&quot;https:&#x2F;&#x2F;www.veripool.org&#x2F;projects&#x2F;verilator&#x2F;wiki&#x2F;Intro&quot;&gt;Verilator&lt;&#x2F;a&gt;, which turns Verilog into a &lt;code&gt;C++&lt;&#x2F;code&gt; object that you can link against to manipulate
the inputs and watch the outputs. This generates a &lt;code&gt;.vcd&lt;&#x2F;code&gt; file that you can view in a waveform viewer like &lt;code&gt;gtkwave&lt;&#x2F;code&gt;. From there you can explore the values of different wires across time.&lt;&#x2F;p&gt;
&lt;p&gt;Although we got the core of the compiler working, we weren&#x27;t able to test very complicated Dahlia programs because we did not implement memories. We also do not correctly generate the logic to multiplex between different inputs to a single component so we were not able to fully take advantage of the parallelism that hardware can provide. Despite these problems, we were still able to get some programs working. Below is a very simple program that simply checks whether a number is greater than 5.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
let b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---
if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  let y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;20&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
} &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  let z &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;40&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Below is an almost equivalent Futil program. We&#x27;ve drawn an arrow to the difference between the two. This statement simply performs the comparison &lt;code&gt;b &amp;gt; 5&lt;&#x2F;code&gt; using the same comparison component as the one used in the condition of the &lt;code&gt;if&lt;&#x2F;code&gt; statement. We&#x27;ve made this change to prove that we can use the same module multiple times. Although this sounds simple, it actually requires muxing between the control signals produced by two different FSMs. We also have to generate muxing between &lt;code&gt;a0&lt;&#x2F;code&gt; and &lt;code&gt;b0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;(define&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;namespace &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;prog
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(define&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;component main () ()
    ((new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std a0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const0 out) (@ a0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const0 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std b0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const1 out) (@ b0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const1 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std gt0 (std_gt &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ a0 out) (@ gt0 left))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ b0 out) (@ gt0 left))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const2 out) (@ gt0 right))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const2 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std y0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const3 out) (@ y0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const3 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 20&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std z0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const4 out) (@ z0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const4 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 40&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)))
    (seq
     (par
      (enable a0 const0)
      (enable b0 const1))
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(enable gt0 b0 const2)
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ gt0 out) (gt0 a0 const2)
         (enable y0 const3)
         (enable z0 const4)))))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This simple program results in a whopping 969 lines of Verilog code (which I will not paste here).
We simulated the code in our simple test bench and were able to generate the following signal diagram:&lt;&#x2F;p&gt;
&lt;img src=&quot;if_trace.png&quot; width=&quot;200%&quot; height=&quot;400%&quot;&gt;
&lt;p&gt;From top to bottom, the signals are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;clock&lt;&#x2F;li&gt;
&lt;li&gt;state of the FSM for the &lt;code&gt;if&lt;&#x2F;code&gt; control statement&lt;&#x2F;li&gt;
&lt;li&gt;output of the greater than comparison component&lt;&#x2F;li&gt;
&lt;li&gt;read output signal for the greater than component&lt;&#x2F;li&gt;
&lt;li&gt;the value in the y register&lt;&#x2F;li&gt;
&lt;li&gt;the value in the z register&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Notice that the read output signal goes high twice. The first time corresponds to the standalone comparison, and the second time corresponds to the comparison in the condition of the &lt;code&gt;if&lt;&#x2F;code&gt; statement.&lt;&#x2F;p&gt;
&lt;p&gt;From this diagram, we can see that the true branch of the program was correctly taken and that the value &lt;code&gt;20&lt;&#x2F;code&gt; was put into the register &lt;code&gt;y&lt;&#x2F;code&gt;. The &lt;code&gt;z&lt;&#x2F;code&gt; register remains in its default state.&lt;&#x2F;p&gt;
&lt;p&gt;For a slightly more interesting example, and because it is the classic hello world program of hardware, we implemented a counter. The Dahlia code is the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---
while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and the equivalent Futil code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;(define&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;namespace &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;prog
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(define&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;component main () ()
    ((new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std i0 (std_reg &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const0 out) (@ i0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const0 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std add0 (std_add &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ i0 out) (@ add0 left))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const2 out) (@ add0 right))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const2 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ add0 out) (@ i0 in))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std lt0 (std_lt &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ i0 out) (@ lt0 left))
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ const1 out) (@ lt0 right))
     (new&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;std const1 (std_const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32 10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)))
    (seq
     (enable i0 const0)
     (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(@ lt0 out) (lt0 i0 const1)
       (enable i0 add0 const2)))))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Running the resulting 509 lines of Verilog code gives us the following trace:&lt;&#x2F;p&gt;
&lt;img src=&quot;while_trace.png&quot; width=&quot;200%&quot;&gt;
&lt;p&gt;Finally, we got a simple implementation of Fibonacci running. Here is the Dahlia code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
let i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
let b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---
while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  let tmp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
  i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;---&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice that we have a lot more triple dashes than Dahlia requires. This is because without them we run into the aforementioned problems with muxing.&lt;&#x2F;p&gt;
&lt;img src=&quot;fib.png&quot; width=&quot;100%&quot;&gt;
&lt;p&gt;Our compiler resulted in 1412 lines of Verilog code. I compared this against the equivalent C++ program compiled with the Vivado HLS toolchain. Their compiler resulted in 148 lines of Verilog code. Although this is an imperfect metric, it does show that this method of compiling to hardware has a large overhead.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Overall this was a very interesting project. Although the overhead of this approach is very high, we successfully demonstrated that you can build a simple but functional HLS compiler in ~2 weeks with this approach. Additionally, because most of the compilation work took place within Futil, it would be straightforward to improve the quality of the output by writing more passes. I think that this provides an excellent baseline so that we can explore the impact of more optimizations.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>CompCert: Formally Verified C Compiler</title>
                <pubDate>Mon, 09 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/comp-cert/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/comp-cert/</guid>
                <description>&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;&#x2F;h2&gt;
&lt;p&gt;The primary motivation of this paper is that compilers these days form a base of trust for most modern applications. We assume that if the application&#x27;s code is correct, then the compiled executable of that application will also be correct. However, most modern compilers like GCC and LLVM do have bugs, some of which silently miscompile code without emitting any error messages. Most of these bugs occur when the compiler performs transformations and optimization passes over the source program. The goal of CompCert is to create a compiler that will never silently miscompile code. The way CompCert accomplishes this is by formally verifying (i.e., proving) that each compiler pass does not change the original meaning of the program. The formally verified parts of CompCert are written in &lt;a href=&quot;https:&#x2F;&#x2F;coq.inria.fr&#x2F;&quot;&gt;Coq&lt;&#x2F;a&gt;, which is a proof assistant based on the calculus of inductive constructions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;semantic-preservation&quot;&gt;Semantic Preservation&lt;&#x2F;h2&gt;
&lt;p&gt;In order for a compiler to be correct it needs to preserve the semantics of our source program. In this section, we discuss how the paper formalizes the notion of semantic correctness.&lt;&#x2F;p&gt;
&lt;p&gt;The paper assumes that the source and target languages have formal semantics that assign &lt;em&gt;observable behaviors&lt;&#x2F;em&gt; to each program. The notation $S \Downarrow B$ means that the program $S$ executes with observable behavior $B$. An observable behavior includes things like whether the program terminates or not, and various &lt;em&gt;going wrong&lt;&#x2F;em&gt; behaviors such as accessing an array out of bounds or invoking an undefined operation like dividing by zero. It also includes a trace of all external calls (system calls) that record the input and output of the external functions. However, it doesn&#x27;t include the state of memory.&lt;&#x2F;p&gt;
&lt;p&gt;The strongest definition of semantic preservation is that a source program $S$ has exactly the same set of possible behaviors as a compiled program $C$:&lt;&#x2F;p&gt;
&lt;p&gt;$\forall B, S \Downarrow B \Leftrightarrow C \Downarrow B$&lt;&#x2F;p&gt;
&lt;p&gt;However, this definition is too strict because it doesn&#x27;t give the compiler room to perform certain desirable optimizations, such as dead code elimination, because doing so may optimize away certain &lt;em&gt;going wrong&lt;&#x2F;em&gt; behaviors. For example, if the result of an operation that divides a number by zero is never used, we want the compiler to be able to get rid of it. But doing so means that the compiled program has one fewer going wrong behavior than the source program. For this reason, the paper only requires that all of the safe behaviors of the source program are preserved in the compiled program:&lt;&#x2F;p&gt;
&lt;p&gt;$S \texttt{safe} \Rightarrow (\forall B, C \Downarrow B \Rightarrow S \Downarrow B)$&lt;&#x2F;p&gt;
&lt;p&gt;$S \texttt{safe}$ is a predicate that means $S$ doesn&#x27;t have any going wrong behaviors. This definition enforces that all observable behaviors of $C$ are a subset of the possible behaviors of $S$ and that if $S$ does not go wrong, then $C$ doesn&#x27;t go wrong either.&lt;&#x2F;p&gt;
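&lt;p&gt;The paper actually proves a closely related forward statement, which implies the one above when the target language is deterministic (a safe $S$ has some behavior $B$, $C$ must share it, and a deterministic $C$ has no other). This form is practically easier to prove since you can induct on the execution of $S$:&lt;&#x2F;p&gt;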
&lt;p&gt;$\forall B \notin \texttt{Wrong}, S \Downarrow B \Rightarrow C \Downarrow B$&lt;&#x2F;p&gt;
&lt;h3 id=&quot;verification-vs-validation&quot;&gt;Verification vs. Validation&lt;&#x2F;h3&gt;
&lt;p&gt;The paper models a compiler as a total function, &lt;code&gt;Comp(S)&lt;&#x2F;code&gt;, from source programs to either &lt;code&gt;OK(C)&lt;&#x2F;code&gt;, a compiled program, or &lt;code&gt;Error&lt;&#x2F;code&gt;, the output that represents a compile-time error, signifying that the compiler was unable to produce code. There are two approaches for establishing that a compiler has the semantic preservation property discussed above: verifying the compiler directly using formal methods or verifying a &lt;em&gt;validator&lt;&#x2F;em&gt;, a boolean function accompanying the compiler that verifies the output of the compiler separately. An unverified compiler along with a verified validator provides the same guarantees as a verified compiler because you can guard the result of the unverified compiler with the validator:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
Comp&#x27;(S) =
  match Comp(S) with
  | Error -&amp;gt; Error
  | OK(C) -&amp;gt; if Validate(S, C) then OK(C) else Error
&lt;&#x2F;pre&gt;
&lt;p&gt;The validation approach is convenient because sometimes the validator is significantly simpler than the compiler. We&#x27;ll see this approach used later for verifying part of the register allocation pass.&lt;&#x2F;p&gt;
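&lt;p&gt;To make the guard concrete, here is a minimal Python sketch of the same construction. Everything in it is a hypothetical stand-in rather than CompCert code: programs are plain strings, &lt;code&gt;None&lt;&#x2F;code&gt; plays the role of &lt;code&gt;Error&lt;&#x2F;code&gt;, and the toy compiler and validator exist only for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from typing import Callable, Optional

# Hypothetical stand-ins: a "program" is just a string here, and None
# plays the role of the Error result from the pseudocode above.
Program = str
Compiler = Callable[[Program], Optional[Program]]
Validator = Callable[[Program, Program], bool]


def guarded(comp: Compiler, validate: Validator) -> Compiler:
    """Build Comp' from an untrusted Comp and a trusted Validate."""
    def comp_prime(source: Program) -> Optional[Program]:
        compiled = comp(source)
        if compiled is None:              # Comp(S) = Error
            return None
        if validate(source, compiled):    # OK(C) and the validator accepts
            return compiled
        return None                       # validator rejects: report Error
    return comp_prime


# Toy instance: the "compiler" should upper-case its input but has a bug.
def buggy_comp(source: Program) -> Optional[Program]:
    return source.upper() if source != "bug" else "oops"


def validate(source: Program, compiled: Program) -> bool:
    return compiled == source.upper()


safe_comp = guarded(buggy_comp, validate)
print(safe_comp("ok"))   # OK
print(safe_comp("bug"))  # None
```
&lt;&#x2F;pre&gt;
&lt;p&gt;In this toy the validator is as complicated as the compiler, so nothing is gained. The approach pays off when checking an answer is much easier than finding one, as with graph coloring.&lt;&#x2F;p&gt;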
&lt;p&gt;Verifying the compiler directly using formal methods amounts to proving that each step in the semantics of the source program corresponds to a sequence of steps in the semantics of the target program with the same observable effects. If you can also show that the initial states and final states of the source and target programs are equivalent then this proves semantic equivalence. This is represented in the following simulation diagram:&lt;&#x2F;p&gt;
&lt;img src=&quot;simulation.png&quot; style=&quot;width: 50%&quot;&gt;
&lt;p&gt;$S_1$ represents a state in the execution of a program from the source language and $S_1&#x27;$ represents the equivalent state in the execution of the target program. The &lt;code&gt;~&lt;&#x2F;code&gt; line is an equivalence relation between states from the source semantics to states in the target semantics. The &lt;code&gt;~&lt;&#x2F;code&gt; line in the diagram is showing that $S_1$ and $S_1&#x27;$ are equivalent states. The down arrows represent a single step in the execution of the program and the $t$ label represents the observable effects that took place in this step.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;structure-of-the-compiler&quot;&gt;Structure of the Compiler&lt;&#x2F;h2&gt;
&lt;p&gt;The source language of the CompCert compiler is Clight, which is a subset of C that includes most familiar C programming constructs like pointers, arrays, structs, if&#x2F;then statements and loops. The compiler front end consists of an unverified parser that parses a source file to a Clight AST. From there, the formally verified section of the compiler performs several passes that repeatedly simplify and transform the representation of the source code all the way to PowerPC assembly code. Then, an unverified assembler and linker take the assembly code and generate an executable that can be run.&lt;&#x2F;p&gt;
&lt;img src=&quot;structure.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;In total, CompCert formally defines 8 intermediate languages and 14 passes over them. The 14 passes must be proven to preserve the semantics of the original program. The first few passes simplify the C code by converting all types to either ints or floats (pointers get converted to ints) and explicitly describing memory accesses. The result of these passes is an intermediate representation called Cminor, and below is an example of a translation from Clight to Cminor.&lt;&#x2F;p&gt;
&lt;img src=&quot;transf_ex.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;As you can see, function signatures have been made explicit, implicit casts have been made explicit (like the cast from float to int), array accesses have been transformed into exact byte offsets from pointers, and the size of the function&#x27;s activation record has been made explicit. The explicit function activation record is used for dealing with the address-of (&amp;amp;) operator, which requires function local variables to be mapped to a location on the stack frame of the function.&lt;&#x2F;p&gt;
&lt;p&gt;Next, CompCert performs instruction selection for the specific architecture that it is targeting. This is done via instruction tiling, which can recognize basic algebraic identities. For example, the instruction selection pass will transform &lt;code&gt;8 + (x + 1) × 4&lt;&#x2F;code&gt; into &lt;code&gt;x × 4 + 12&lt;&#x2F;code&gt;. These algebraic identities are proven in CompCert in order to assist in the semantic preservation proof. The selected instructions are very similar to available PowerPC instructions. Next, CompCert makes the control flow of the program more explicit via a transformation to the RTL IR, which represents control flow using a CFG. In addition to generating a CFG, the RTL representation transforms variables into pseudo-registers, of which there is an unlimited supply. The RTL representation is a convenient representation to perform optimizations on, so CompCert runs several dataflow analyses on the program in order to perform optimizations such as constant propagation, common subexpression elimination, and lazy code motion.&lt;&#x2F;p&gt;
&lt;p&gt;The next transformation pass performed by CompCert maps the pseudo-registers to hardware registers or abstract stack locations using a register allocation algorithm. The allocator is an approximation of graph coloring implemented in unverified OCaml, and CompCert checks its output using the validator method discussed above. Further passes linearize the CFG, spill registers to the stack, and insert the necessary loads for temporaries. Some simple optimizations like branch tunneling (removing branches to branches) are also performed as part of these passes. Finally, CompCert performs instruction scheduling to increase instruction-level parallelism on superscalar processors, such as PowerPC processors, and generates PowerPC assembly code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;verification-of-the-register-allocation-pass&quot;&gt;Verification of the Register Allocation Pass&lt;&#x2F;h2&gt;
&lt;p&gt;In order to explain the verification process in more depth, the paper describes some of the more technical details of the register allocation pass. The register allocation pass operates on the RTL IR. In this representation functions are represented as CFGs with instructions that roughly map to assembly instructions supported by the PowerPC architecture. However, the instructions use an infinite supply of pseudo-registers, also known as temporaries. The execution semantics of the RTL IR are given by a set of small step semantics. The small step semantics operate over a global environment, which includes a list of all of the temporaries and their values as well as the state of memory. CompCert represents memory as a collection of blocks with a bounded size. Pointers are described as pointing to some offset from the base of a memory block.&lt;&#x2F;p&gt;
&lt;p&gt;In order to produce performant code, as many of the temporaries as possible should be mapped to hardware registers instead of being stored on the stack. The register allocation algorithm starts with a liveness analysis of each program point. For every program point $l$ the liveness analysis computes the set of variables that are live coming into program point $l$. This is typically expressed by solving the reverse dataflow equations with a transfer function that removes all defined temporaries at program point $l$ and adds all temporaries that were used at program point $l$. Consider the following code snippet:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;1  b = a + 2;
2  c = b*b;
3  b = c + 1;
4  return b*a;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On line 4, the variables $a$ and $b$ are live. Program point 3 defines $b$ and uses $c$. Therefore, variable $c$ must be live at line 3. However, $b$ must no longer be live before line 3 because it was just redefined. On line 2, $c$ is defined and $b$ is used, so $c$ is no longer live but $b$ is. Finally, on line 1 $b$ is defined so it is not live before line 1. This gives the following live variable sets coming into each program point:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;1  b = a + 2;     LV = {a}
2  c = b*b;       LV = {a,b}
3  b = c + 1;     LV = {a,c}
4  return b*a;    LV = {a,b}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
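&lt;p&gt;The backward walk described above can be sketched in a few lines of Python. This is only an illustration for straight-line code, not CompCert&#x27;s Coq implementation: the tuple encoding of instructions is our own assumption, and a real analysis over a CFG would iterate to a fixed point instead of making a single pass.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# Each instruction is (defined variable or None, set of used variables),
# transcribed by hand from the four-line snippet above.
program = [
    ("b", {"a"}),        # 1  b = a + 2
    ("c", {"b"}),        # 2  c = b*b
    ("b", {"c"}),        # 3  b = c + 1
    (None, {"a", "b"}),  # 4  return b*a
]


def live_in_sets(instrs):
    """Return the live-in set of every instruction of straight-line code."""
    live = set()              # nothing is live after the final return
    result = []
    for defined, used in reversed(instrs):
        # Transfer function: remove what is defined, then add what is used.
        live = (live - {defined}) | used
        result.append(live)
    result.reverse()
    return result


for number, live in enumerate(live_in_sets(program), start=1):
    print(number, sorted(live))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Running it prints &lt;code&gt;{a}&lt;&#x2F;code&gt;, &lt;code&gt;{a,b}&lt;&#x2F;code&gt;, &lt;code&gt;{a,c}&lt;&#x2F;code&gt;, and &lt;code&gt;{a,b}&lt;&#x2F;code&gt; for lines 1 through 4, matching the sets listed above.&lt;&#x2F;p&gt;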
&lt;p&gt;The reason live variable analysis is important for register allocation is that it helps build an interference graph. The interference graph represents temporaries as nodes. Edges between two nodes A and B mean that temporaries A and B cannot be assigned the same hardware register. If two temporaries are live at the same time, then they cannot be assigned to the same hardware register. When building the interference graph you simply inspect the live variable sets at each program point and add edges between all temporaries in the live variable set. Then, you need to color the graph to assign temporaries to hardware registers. For the above code snippet the interference graph would be:&lt;&#x2F;p&gt;
&lt;img src=&quot;int.png&quot; style=&quot;width: 40%&quot;&gt;
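&lt;p&gt;As a rough sketch of the construction just described, the following Python builds the edge set directly from the live variable sets. It follows the simplified rule from this post (connect every pair of temporaries that appear in the same live set); the function name and the set-of-pairs representation are our own choices.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from itertools import combinations

# Live-in sets copied from the listing above, one per program point.
live_sets = [{"a"}, {"a", "b"}, {"a", "c"}, {"a", "b"}]


def interference_edges(sets):
    """Connect every pair of temporaries that share a live set."""
    edges = set()
    for live in sets:
        # Sorting gives each undirected edge one canonical orientation.
        for left, right in combinations(sorted(live), 2):
            edges.add((left, right))
    return edges


print(sorted(interference_edges(live_sets)))  # [('a', 'b'), ('a', 'c')]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Since $b$ and $c$ never appear in the same live set, there is no edge between them and they are free to share a hardware register.&lt;&#x2F;p&gt;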
&lt;p&gt;The live variable analysis is implemented in Coq. Furthermore, the analysis is proven to generate live variable sets that are supersets of the actual live variable sets at a program point. The paper claims this is easier to prove and does not violate the correctness of the register allocation step. This is because supersets of the actual live variable sets only add more edges to the interference graph, which still maintains the correctness of the register allocation pass.&lt;&#x2F;p&gt;
&lt;p&gt;The actual coloring of the interference graph is implemented in unverified OCaml code due to its complexity. The function is then used in the proof of semantic preservation using the validator approach. As a reminder, the validator approach allows CompCert to use the OCaml code only if the output is valid. Otherwise, compilation fails and no code is emitted. The correctness conditions for the coloring $\phi$ of the temporaries are:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;$\phi(r) \neq \phi(r&#x27;)$ if $r$ and $r&#x27;$ interfere&lt;&#x2F;li&gt;
&lt;li&gt;$\phi(r) \neq l$ if $r$ and $l$ interfere ($l$ is a machine register or stack location. The interference graph can be pre-colored and some pseudo registers can interfere with hardware registers or stack locations)&lt;&#x2F;li&gt;
&lt;li&gt;$\phi(r)$ and $r$ have the same register class (either int or float)&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
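&lt;p&gt;Checking these three conditions is far simpler than finding a coloring, which is what makes the validator approach attractive here. Below is a minimal Python sketch of such a check. The data representation and the register names (&lt;code&gt;R1&lt;&#x2F;code&gt;, &lt;code&gt;R2&lt;&#x2F;code&gt;, &lt;code&gt;F1&lt;&#x2F;code&gt;) are hypothetical, and CompCert&#x27;s real validator is written and proven in Coq.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def valid_coloring(phi, interference, reg_class, loc_class):
    """Check the three conditions on a candidate coloring phi.

    phi maps each temporary to a location. interference is a set of
    pairs (r, other) where r is a temporary and other is either another
    temporary or a fixed location (machine register or stack slot).
    """
    for r, other in interference:
        # Conditions 1 and 2: interfering nodes never share a location.
        # A fixed location is not a key of phi, so it stands for itself.
        if phi[r] == phi.get(other, other):
            return False
    # Condition 3: a temporary and its location have the same class.
    return all(reg_class[r] == loc_class[phi[r]] for r in phi)


edges = {("a", "b"), ("a", "c")}
reg_class = {"a": "int", "b": "int", "c": "int"}
loc_class = {"R1": "int", "R2": "int", "F1": "float"}

good = {"a": "R1", "b": "R2", "c": "R2"}   # b and c may share R2
bad = {"a": "R1", "b": "R1", "c": "R2"}    # a and b interfere
print(valid_coloring(good, edges, reg_class, loc_class))  # True
print(valid_coloring(bad, edges, reg_class, loc_class))   # False
```
&lt;&#x2F;pre&gt;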
&lt;p&gt;After coloring the graph, the RTL IR is transformed to LTL IR by replacing temporaries according to the mapping $\phi$. For each temporary $r$, $\phi(r)$ is either a hardware register or a stack location. In order to prove that the transformation preserves the semantics of the original program, the lock-step simulation approach discussed above is used. The equivalence relation on states requires that control flow is preserved, memory contents are the same, and that the registers are somehow preserved. The first two properties are intuitively correct because register allocation doesn&#x27;t really affect control flow or memory state at a program point. However, the equivalence between temporaries and hardware registers is a bit more subtle. This is because the value in a hardware register might not be the same as the value in a temporary mapped to it. For example, if two temporaries that do not interfere are mapped to the same hardware register, the value in the hardware register will not be the same as one of the two temporary values at some point in time. Therefore, the paper states that a relaxation was proven where at every program point $l$, $R(r) = R&#x27;(\phi(r))$ for all $r$ live at point $l$ (here $R$ gives the values of temporaries before the transformation and $R&#x27;$ gives the values of locations after it).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;In general CompCert is able to output code with similar performance characteristics to gcc at -O1 and -O2.&lt;&#x2F;p&gt;
&lt;img src=&quot;perf.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;However, we think a more important metric is the correctness of CompCert, since this was the primary purpose of creating it. This is something the paper was not able to evaluate because nobody had seriously tested CompCert at the time of the paper&#x27;s release. Since then, several automated compiler testing tools such as &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.utah.edu&#x2F;%7Eregehr&#x2F;papers&#x2F;pldi11-preprint.pdf&quot;&gt;Csmith&lt;&#x2F;a&gt; and Orion (from the &lt;a href=&quot;http:&#x2F;&#x2F;vuminhle.com&#x2F;pdf&#x2F;pldi14-emi.pdf&quot;&gt;Equivalence Modulo Inputs&lt;&#x2F;a&gt; paper) have reported a handful of bugs in CompCert over the years. We looked into issues on the official CompCert GitHub page and bug reports generated by Orion to try and figure out what some of the bugs were and where they manifested themselves.&lt;&#x2F;p&gt;
&lt;p&gt;The GitHub repository for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;AbsInt&#x2F;CompCert&quot;&gt;CompCert&lt;&#x2F;a&gt; seems to have been created in 2015, while the paper is dated 2009. So there may be several bugs found in the original version of CompCert that were patched when the GitHub repository was started. The Orion project includes a &lt;a href=&quot;https:&#x2F;&#x2F;web.cs.ucdavis.edu&#x2F;%7Esu&#x2F;emi-project&#x2F;compcert.html&quot;&gt;page&lt;&#x2F;a&gt; with all of the bugs it found that led to open issues on the CompCert GitHub repository. There are 31 total issues on this list, with 27 of them being marked as fixed and the remaining 4 marked as won&#x27;t fix. This list of issues was reported between August 2016 and May 2017. Most of the issues reported seem to be issues in the front end OCaml code and are mostly crash failures. There are a few crash failures associated with the unverified register allocation code that seem to have required some updates to the Coq proofs such as &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;AbsInt&#x2F;CompCert&#x2F;issues&#x2F;183&quot;&gt;issue 183&lt;&#x2F;a&gt;. However, it does not appear as if there was a case where CompCert silently generated a miscompilation due to an error in the formally verified parts of CompCert. This suggests that CompCert does indeed succeed at its goal of creating a C compiler with no miscompilation bugs.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Provably Correct Peephole Optimizations with Alive</title>
                <pubDate>Fri, 06 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/alive/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/alive/</guid>
                <description>&lt;p&gt;In previous discussions, we&#x27;ve considered research systems that find bugs in compiler implementations via &lt;em&gt;differential testing&lt;&#x2F;em&gt;.
To page you back in, &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bug-finding&#x2F;&quot;&gt;Csmith&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;equivalence-modulo-inputs&#x2F;&quot;&gt;Equivalence Modulo Inputs (Orion)&lt;&#x2F;a&gt; both used clever tactics to generate randomized test programs and inputs, with the goal of finding instances where compilers produce different output than expected.
These systems exploit a key assumption: while we don&#x27;t have an oracle that determines the ground truth correct behavior for any program (without a precise semantics), we can expect compilers to produce the &amp;quot;same&amp;quot; behavior across different implementations.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, there are fully verified compilers such as &lt;a href=&quot;http:&#x2F;&#x2F;compcert.inria.fr&quot;&gt;CompCert&lt;&#x2F;a&gt; that guarantee against mis-compilations, but do so at the cost of supporting entire language surfaces and getting fast, optimized code.&lt;&#x2F;p&gt;
&lt;p&gt;What about a middle ground, where we leverage a correctness oracle for some particularly tricky portions of a commonly-used optimizing compiler?&lt;&#x2F;p&gt;
&lt;p&gt;Lopes et al.’s &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2737965&quot;&gt;“Provably Correct Peephole Optimizations with Alive”&lt;&#x2F;a&gt;, from PLDI 2015, takes one flavor of this approach.
Instead of treating the compiler itself as a black-box system that we try to break from the outside, Alive &lt;em&gt;proves&lt;&#x2F;em&gt; that the high-level insights behind certain optimizations are correct.
Alive is built for &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&quot;&gt;LLVM&lt;&#x2F;a&gt;, our friendly massively-optimizing, ahead-of-time, heavily-used beast of a compiler.
Alive aims to hit a design point that is &lt;em&gt;both&lt;&#x2F;em&gt; practical and formal—the provable guarantees of verified compilation, for one component of a very pragmatic compiler.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;peephole-optimizations&quot;&gt;Peephole optimizations&lt;&#x2F;h3&gt;
&lt;p&gt;In particular, Alive focuses on LLVM&#x27;s peephole optimizations—those that involve replacing a small set of (typically adjacent) instructions with an equivalent, faster set.
For example, a clever compiler might replace &lt;code&gt;%x = mul i8 %y, 2&lt;&#x2F;code&gt; (&lt;code&gt;x = y * 2&lt;&#x2F;code&gt;) with &lt;code&gt;%x = shl i8 %y, 1&lt;&#x2F;code&gt; (&lt;code&gt;x&lt;&#x2F;code&gt; = &lt;code&gt;y&lt;&#x2F;code&gt; &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Arithmetic_shift&quot;&gt;shift left&lt;&#x2F;a&gt; &lt;code&gt;1&lt;&#x2F;code&gt;).
While these optimizations may &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2462741&quot;&gt;&amp;quot;delight hackers&amp;quot;&lt;&#x2F;a&gt;, they are also extremely tricky to get right for edge cases and boundary conditions.&lt;&#x2F;p&gt;
&lt;p&gt;Alive&#x27;s specific focus was inspired by the authors&#x27; previous work on &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;bug-finding&#x2F;&quot;&gt;Csmith&lt;&#x2F;a&gt;, which found that the single buggiest file in LLVM was in the instruction combiner, home of over 20,000 lines of C++ (!) of peephole optimizations.
Since its publication in 2015, Alive has been used to fix and prevent dozens of bugs and improve code concision in production LLVM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;system-overview&quot;&gt;System overview&lt;&#x2F;h2&gt;
&lt;p&gt;Below is a high-level overview of Alive&#x27;s approach.&lt;&#x2F;p&gt;
&lt;p&gt;First, Alive comes with its own domain-specific language (DSL) that was designed to resemble LLVM&#x27;s intermediate representation.
Optimizations are written in this DSL with a source (left hand side) and target (right hand side) template, which abstract over constant values and exact data types.
The semantics of each side are encoded into logical formulas.
Then, Alive generates verification conditions that cover the full range of potential cases, including special treatment of undefined behavior.
The verification conditions are handed to an off-the-shelf SMT (Satisfiability Modulo Theories) solver, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Z3Prover&#x2F;z3&quot;&gt;Z3&lt;&#x2F;a&gt;, which either proves their validity or provides a counterexample.
If the verification conditions are provably correct, Alive is able to generate C++ code that implements the optimization (which the developer can then link into LLVM).
If the verification conditions fail, Alive provides the developer with a counter-example (in terms of the original source and target template).&lt;&#x2F;p&gt;
&lt;img src=&quot;sys-diagram.png&quot; width=&quot;700&quot; &gt;
&lt;h2 id=&quot;grokking-undefined-behavior&quot;&gt;Grokking undefined behavior&lt;&#x2F;h2&gt;
&lt;p&gt;The greatest technical challenge for a compiler or verification engineer in this space is wrangling with undefined behavior.
One of the authors of Alive, John Regehr, has &lt;a href=&quot;https:&#x2F;&#x2F;blog.regehr.org&#x2F;archives&#x2F;1170&quot;&gt;several&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;blog.regehr.org&#x2F;archives&#x2F;1467&quot;&gt;excellent&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;blog.regehr.org&#x2F;archives&#x2F;1467&quot;&gt;blog&lt;&#x2F;a&gt; posts on the topic.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;refinement&quot;&gt;Refinement&lt;&#x2F;h3&gt;
&lt;p&gt;By definition, compilers are allowed to produce different results for the same source program in the presence of undefined behavior.
However, compilers are &lt;em&gt;not&lt;&#x2F;em&gt; allowed to introduce undefined behavior for a program and input that was well-defined in the unoptimized source code.
That is, the burden on a verifier like Alive is to show that optimization targets are &lt;em&gt;refinements&lt;&#x2F;em&gt; of the source: the optimized target may exhibit only a subset of the source&#x27;s behaviors, so it can be more specific than the source but never less.&lt;&#x2F;p&gt;
&lt;p&gt;To illustrate this, let&#x27;s look at a &lt;a href=&quot;https:&#x2F;&#x2F;bugs.llvm.org&#x2F;show_bug.cgi?id=20186&quot;&gt;real bug&lt;&#x2F;a&gt; in an optimization that Alive discovered in production LLVM (also described &lt;a href=&quot;https:&#x2F;&#x2F;blog.regehr.org&#x2F;archives&#x2F;1170&quot;&gt;here&lt;&#x2F;a&gt;).
The optimization aims to simplify an expression that negates the division of a variable &lt;code&gt;x&lt;&#x2F;code&gt; with a constant &lt;code&gt;C&lt;&#x2F;code&gt;, from the explicit &lt;code&gt;0 - (x&#x2F;C)&lt;&#x2F;code&gt;, to the simpler &lt;code&gt;x &#x2F; -C&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In the Alive DSL, we specify this with:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;%div = sdiv %x, C
%r = sub 0, %div
  =&amp;gt;
%r = sdiv %x, -C
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When we hand this optimization off to Alive, we get:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;Precondition: true
%div = sdiv %x, C
%r = sub 0, %div
  =&amp;gt;
%r = sdiv %x, -C

Done: 1
ERROR: Domain of definedness of Target is smaller than Source&amp;#39;s for i2 %r

Example:
%x i2 = 2 (0x2)
C i2 = 1 (0x1)
%div i2 = 2 (0x2)
Source value: 2 (0x2)
Target value: undef
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The problem here is the interplay between an edge case (signed integer overflow) and undefined behavior.
When the concrete type is &lt;code&gt;i2&lt;&#x2F;code&gt; and the values are &lt;code&gt;x = -2&lt;&#x2F;code&gt; and &lt;code&gt;C = 1&lt;&#x2F;code&gt;, &lt;code&gt;x&#x2F;-C = -2&#x2F;-1 = 2&lt;&#x2F;code&gt;, but &lt;code&gt;2&lt;&#x2F;code&gt; overflows a 2-bit signed integer! While mathematically this is also true in the source template, LLVM&#x27;s language reference states that overflow in &lt;code&gt;sdiv&lt;&#x2F;code&gt; is undefined behavior, which is not the case for &lt;code&gt;sub&lt;&#x2F;code&gt;.
Thus, the target template introduced undefined behavior in a case where there previously was none, so it is &lt;em&gt;not&lt;&#x2F;em&gt; a refinement.&lt;&#x2F;p&gt;
&lt;p&gt;In order to fix this bug, the LLVM developers added a precondition that &lt;code&gt;C != 1&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt; is not a sign bit.
In Alive, we can represent this precondition as &lt;code&gt; ((C != 1) &amp;amp;&amp;amp; !isSignBit(C))&lt;&#x2F;code&gt;, and the optimization verifies.&lt;&#x2F;p&gt;
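&lt;p&gt;For a type as small as &lt;code&gt;i2&lt;&#x2F;code&gt; we can replay this reasoning by brute force. The Python sketch below is our own simplified model, not how Alive works (Alive encodes the templates as SMT formulas over all bit widths): it models undefined behavior as &lt;code&gt;None&lt;&#x2F;code&gt;, ignores &lt;code&gt;undef&lt;&#x2F;code&gt; and poison, and simply enumerates every input.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
BITS = 2
HALF = 2 ** (BITS - 1)            # i2 holds the values -2, -1, 0, 1
VALUES = range(-HALF, HALF)


def wrap(value):
    """Wrap an integer into the signed BITS-bit range."""
    return (value + HALF) % (2 * HALF) - HALF


def sdiv(x, c):
    """Signed division; None models undefined behavior."""
    if c == 0 or (x == -HALF and c == -1):   # divide by zero or overflow
        return None
    quotient = abs(x) // abs(c)              # round toward zero
    return quotient if (x >= 0) == (c > 0) else -quotient


def source(x, c):
    div = sdiv(x, c)
    return None if div is None else wrap(0 - div)   # sub wraps, no UB


def target(x, c):
    return sdiv(x, wrap(-c))


def counterexamples(precondition):
    """Inputs where the source is defined but the target differs."""
    return [(x, c) for x in VALUES for c in VALUES
            if precondition(c)
            and source(x, c) is not None
            and source(x, c) != target(x, c)]


print(counterexamples(lambda c: True))                   # [(-2, -2), (-2, 1)]
print(counterexamples(lambda c: c != 1 and c != -HALF))  # []
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The pair &lt;code&gt;(-2, 1)&lt;&#x2F;code&gt; is the counterexample Alive reported, where the target becomes undefined. The pair &lt;code&gt;(-2, -2)&lt;&#x2F;code&gt; shows why the sign-bit case is excluded too: negating the sign bit wraps back to itself, so the target computes a different value than the source.&lt;&#x2F;p&gt;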
&lt;h3 id=&quot;poison&quot;&gt;Poison&lt;&#x2F;h3&gt;
&lt;p&gt;An additional complication in handling undefined behavior is that LLVM actually has two flavors of &lt;em&gt;deferred&lt;&#x2F;em&gt; (non-crashing) undefined behavior: the &lt;code&gt;undef&lt;&#x2F;code&gt; value, and implicit &lt;em&gt;poisoned&lt;&#x2F;em&gt; values.&lt;&#x2F;p&gt;
&lt;p&gt;Poison values are a stronger form of undefined behavior: they happen when a side-effect-free instruction produces a result that might later trigger undefined behavior.
The true undefined behavior only occurs if&#x2F;when a poisoned value is later used by an instruction that &lt;em&gt;does&lt;&#x2F;em&gt; have side effects (for example, a division by zero).
Poison values are not represented explicitly in LLVM IR, and can only be identified via careful analysis.
Alive models poison in a similar way to &lt;code&gt;undef&lt;&#x2F;code&gt; values: target templates can only yield poison values if the source did as well.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluating-alive-s-impact&quot;&gt;Evaluating Alive&#x27;s impact&lt;&#x2F;h2&gt;
&lt;p&gt;At the time of publication in 2015, Alive&#x27;s authors (manually) ported 334 peephole optimizations.
Optimizations varied in verification time from a few seconds to several hours.
From these 334 optimizations, Alive found 8 bugs.&lt;&#x2F;p&gt;
&lt;p&gt;In addition, the authors built a version of LLVM with the default instruction combiner replaced by Alive-generated C++ for their 334 optimizations.
They found that despite not covering all of the previous optimizations, LLVM+Alive stayed within 10% of the performance of LLVM on SPEC 2000 and 2006 benchmarks.
Much more interestingly, however, the authors show how little coverage these optimizations received in the existing tests and benchmarks.
An instrumented LLVM-Alive run on LLVM&#x27;s nightly test suite and both SPEC benchmarks found that only 159 of the 334 optimizations were triggered:&lt;&#x2F;p&gt;
&lt;img src=&quot;llvm-alive.png&quot; width=&quot;500&quot; &gt;
&lt;p&gt;That is, nearly half of the peephole optimizations ported to Alive were untested via the existing manual test and benchmark flow!&lt;&#x2F;p&gt;
&lt;p&gt;In addition to their hard performance numbers, Alive&#x27;s authors reached out to LLVM developers to incorporate Alive into work-in-progress patches.
The authors report that they found &amp;quot;dozens&amp;quot; of incorrect proposed optimizations, and that they were able to head them off by providing counter-examples generated with Alive.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;key-take-aways&quot;&gt;Key take-aways&lt;&#x2F;h2&gt;
&lt;p&gt;Alive leaves us with several key nuggets of wisdom:&lt;&#x2F;p&gt;
&lt;h4 id=&quot;dsl-smt-profit&quot;&gt;&lt;em&gt;DSL + SMT = profit&lt;&#x2F;em&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Alive demonstrates that finding a domain-specific language for your goals, in this case concise peephole optimizations, can be especially fruitful for verification.
The authors argue that DSLs help compiler engineers reason about code.
Beyond that, Alive shows that a DSL makes translation of semantics to a formal logic like SMT more tractable than trying to wrangle with the full semantics of languages like C or LLVM IR directly.
Later work on Alive (&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;AliveToolkit&#x2F;alive2&quot;&gt;&amp;quot;Alive2&amp;quot;&lt;&#x2F;a&gt;) has also introduced tools to help translate LLVM IR to Alive&#x27;s DSL in an automated fashion.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;usability-matters-in-formal-methods&quot;&gt;&lt;em&gt;Usability matters in formal methods&lt;&#x2F;em&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;Alive is a formal system, but it is also a deeply practical one.
It recognized that there is impact to be had from building verification systems closer to where working programmers spend their day-to-day hacking, in part by targeting a massive existing code base in a piecewise, workable way.
In addition, Alive&#x27;s DSL and counter-examples were designed with an interface meant to be familiar to LLVM engineers, which undoubtedly paid off in the adoption of this work.
Finally, the authors of Alive engaged closely with the LLVM community, from frequenting the RFC discussion channels to publishing high-level blog posts on their contributions.
A less optimistic lesson, however, is that technology transfer is &lt;em&gt;still&lt;&#x2F;em&gt; &lt;strong&gt;really&lt;&#x2F;strong&gt; hard.
Despite the project&#x27;s deep engagement with the community, LLVM has still not wholesale replaced most of its instruction combinations with Alive-generated code.
There is always more work to be done reconciling ideals from research prototypes with the difficult constraints of industry-scale software engineering!&lt;&#x2F;p&gt;
&lt;h4 id=&quot;undefined-behavior-is-pernicious&quot;&gt;&lt;em&gt;Undefined behavior is pernicious&lt;&#x2F;em&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;One of the trickiest parts of the job for both industry compiler engineers and research verification hackers is dealing with undefined behavior.
Eliminating undefined behavior entirely isn&#x27;t feasible in an aggressive optimizing compiler that wants to exploit speculation, so researchers need to continue to push for better methodology to keep it contained, understandable, and verifiable.
In particular, the authors of Alive have been among several researchers who have pushed for LLVM to change its treatment of deferred undefined behavior.
In 2016, they shared a proposal titled &lt;a href=&quot;https:&#x2F;&#x2F;lists.llvm.org&#x2F;pipermail&#x2F;llvm-dev&#x2F;2016-October&#x2F;106182.html&quot;&gt;&amp;quot;Killing undef and spreading poison&amp;quot;&lt;&#x2F;a&gt; that advocated for removing the &lt;code&gt;undef&lt;&#x2F;code&gt; value, adding an IR-level &lt;code&gt;poison&lt;&#x2F;code&gt; value, and introducing a &lt;code&gt;freeze&lt;&#x2F;code&gt; instruction that would stop the propagation of poison by resolving to an arbitrary value.
Alive includes a &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;nunoplopes&#x2F;alive&#x2F;tree&#x2F;newsema&quot;&gt;branch&lt;&#x2F;a&gt; modeling these new semantics.
Just last month, LLVM took another step toward realizing this vision by adding the &lt;code&gt;freeze&lt;&#x2F;code&gt; instruction.&lt;&#x2F;p&gt;
&lt;div class=&quot;center&quot;&gt;
&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;the freeze instruction finally landed in LLVM!&lt;a href=&quot;https:&#x2F;&#x2F;t.co&#x2F;W6odosWUe0&quot;&gt;https:&#x2F;&#x2F;t.co&#x2F;W6odosWUe0&lt;&#x2F;a&gt;&lt;br&gt;docs:&lt;a href=&quot;https:&#x2F;&#x2F;t.co&#x2F;kKcCpJUH1L&quot;&gt;https:&#x2F;&#x2F;t.co&#x2F;kKcCpJUH1L&lt;&#x2F;a&gt;&lt;br&gt;lots of work left to do but this is a big step towards making LLVM have a clear and consistent undefined behavior model&lt;&#x2F;p&gt;&amp;mdash; John Regehr (@johnregehr) &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;johnregehr&#x2F;status&#x2F;1191765816422760448?ref_src=twsrc%5Etfw&quot;&gt;November 5, 2019&lt;&#x2F;a&gt;&lt;&#x2F;blockquote&gt; &lt;script async src=&quot;https:&#x2F;&#x2F;platform.twitter.com&#x2F;widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;&#x2F;script&gt;
&lt;&#x2F;div&gt;
&lt;h2 id=&quot;what-s-left-on-the-table&quot;&gt;What&#x27;s left on the table&lt;&#x2F;h2&gt;
&lt;p&gt;Alive has several threats to validity called out in the paper: Alive&#x27;s implementation itself is not verified (though a later &lt;a href=&quot;https:&#x2F;&#x2F;sf.snu.ac.kr&#x2F;aliveinlean&#x2F;&quot;&gt;AliveInLean&lt;&#x2F;a&gt; verifies components), the correctness relies on the authors faithfully manually translating tricky LLVM semantics, and the bounded-verification with SMT solving is often either incomplete or slow.
In addition, later work tackles extending Alive to some &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.rutgers.edu&#x2F;%7Esantosh.nagarakatte&#x2F;papers&#x2F;alive-fp-sas16.pdf&quot;&gt;floating point optimizations&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Alive also opens the question of whether SMT solving can be used to &lt;em&gt;synthesize&lt;&#x2F;em&gt; optimizations, instead of just verifying them once they are already written.
This type of work is often in the category of super-optimization, and is undertaken by both the previously-discussed &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;&quot;&gt;Chlorophyll&lt;&#x2F;a&gt; project and &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1711.04422.pdf&quot;&gt;other&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;link.springer.com&#x2F;chapter&#x2F;10.1007&#x2F;978-3-662-46663-6_9&quot;&gt;related&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;devmtg&#x2F;2011-11&#x2F;Sands_Super-optimizingLLVMIR.pdf&quot;&gt;projects&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>The Cult of Posits</title>
                <pubDate>Tue, 03 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/posits/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/posits/</guid>
                <description>&lt;p&gt;Computers are incapable of representing arbitrary real numbers exactly.
This is due to two intractable facts of real numbers:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uncountably Infinite Domain&lt;&#x2F;strong&gt;: There are an infinite number of real numbers.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Uncountably Infinite Precision&lt;&#x2F;strong&gt;: Some real numbers require infinite precision.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Since computers use a finite number of bits, computer architects must settle on capturing a finite number of real numbers at a finite level of precision.
This is done by fixing a mapping between bit patterns and real numbers called a &lt;strong&gt;representation&lt;&#x2F;strong&gt;.
A representation makes a tradeoff between the quantity of representable numbers and the level of precision.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-floating-point-representation&quot;&gt;The Floating Point Representation&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;strong&gt;floating point representation&lt;&#x2F;strong&gt; is the most widely used.
Numbers are written in the form: 
$$(-1)^s * 1.m * 2^e$$ &lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;$1.m$, the &lt;em&gt;mantissa&lt;&#x2F;em&gt;, and $e$, the &lt;em&gt;exponent&lt;&#x2F;em&gt;, are fractional and integer binary values, respectively. &lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;$s$ is a single bit denoting the sign of the represented number.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The design tradeoff between quantity and precision is captured by the number of bits dedicated to the mantissa and exponent.&lt;&#x2F;p&gt;
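&lt;p&gt;As a rough illustration of this form, the Python sketch below pulls $s$, $e$, and $1.m$ out of a 32-bit float. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits stored with a bias of 127, 23 mantissa bits), handles only normalized values, and the name &lt;code&gt;decode_float32&lt;&#x2F;code&gt; is made up for this example.&lt;&#x2F;p&gt;

```python
import struct


def decode_float32(x):
    """Split x, rounded to a 32-bit float, into (s, e, 1.m).

    Only meaningful for normalized values (not zero, denormals, inf, NaN).
    """
    bits = int.from_bytes(struct.pack(">f", x), "big")
    s = bits // 2 ** 31                    # 1 sign bit
    biased = (bits // 2 ** 23) % 2 ** 8    # 8 exponent bits, bias of 127
    frac = bits % 2 ** 23                  # 23 mantissa bits
    e = biased - 127
    mantissa = 1 + frac / 2 ** 23          # the implicit leading 1 plus .m
    return s, e, mantissa


s, e, m = decode_float32(-6.5)
value = (-1) ** s * m * 2 ** e
print(s, e, m, value)    # 1 2 1.625 -6.5
```

&lt;p&gt;Giving more of the 32 bits to the exponent would widen the range of $e$ at the cost of fewer mantissa bits, which is the tradeoff described above.&lt;&#x2F;p&gt;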
&lt;p&gt;In practice, however, the IEEE 754 floating point standard slightly modifies this scheme to account for two perceived limitations:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Small Number Gap: there is a relatively large gap between the representation of the largest negative number, and the smallest positive number.
&lt;ul&gt;
&lt;li&gt;To account for this, the numbers with the smallest exponent are &lt;em&gt;denormalized&lt;&#x2F;em&gt;. 
Denormalized values are spread out linearly, rather than exponentially.
For floating points, denormalization occurs between the largest negative and smallest positive numbers raised to the second largest exponent.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Before denormalization:
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;not-denormalized.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;After denormalization:
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;denormalized.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Bogus Results: the results of overflow (e.g., dividing by a very small number) or of partial functions applied to elements outside of their domains (e.g., division by zero) have no representation.
&lt;ul&gt;
&lt;li&gt;The case of overflow is captured by the &lt;em&gt;positive and negative infinity&lt;&#x2F;em&gt; values, each represented by the bit pattern corresponding to an all ones exponent and all zeros mantissa, and differentiated by the sign bit.&lt;&#x2F;li&gt;
&lt;li&gt;The case of a non-result of a partial function is captured by the &lt;em&gt;NaN&lt;&#x2F;em&gt; value (meaning, &amp;quot;not a number&amp;quot;), represented by the various bit patterns with an all ones exponent and non-zero mantissa.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;All this results in a gluttonous representation:
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;float-fun.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-posit-representation&quot;&gt;The Posit Representation&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;em&gt;posit representation&lt;&#x2F;em&gt; &lt;em&gt;should&lt;&#x2F;em&gt; be the most widely used representation.
The numbers represented by posits are similar to floating points, but differ by the introduction of a so-called &lt;em&gt;regime&lt;&#x2F;em&gt; term, as follows: &lt;&#x2F;p&gt;
&lt;p&gt;$$(-1)^s * 1.m * useed^k * 2^e$$&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;$useed = 2^{2^{es}}$, where $es$ is a parameter of the representation.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
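&lt;p&gt;To make the formula concrete, here is a hedged Python sketch of decoding a small posit bit pattern. The post does not spell out the bit layout, so the details here are assumptions based on our reading of the posit encoding: negative values are stored in two&#x27;s complement, and $k$ is read from the length of the run of identical bits that follows the sign bit. The function name &lt;code&gt;decode_posit&lt;&#x2F;code&gt; and its defaults are hypothetical, and the single infinity pattern is returned as Python&#x27;s positive &lt;code&gt;inf&lt;&#x2F;code&gt; for lack of an unsigned one.&lt;&#x2F;p&gt;

```python
def decode_posit(bits, n=8, es=0):
    """Decode an n-bit posit given as an unsigned integer bit pattern."""
    if bits == 0:
        return 0.0
    if bits == 2 ** (n - 1):
        return float("inf")                         # the single point at infinity
    s = bits // 2 ** (n - 1)
    if s:
        bits = 2 ** n - bits                        # undo two's complement
    body = format(bits, "0{}b".format(n))[1:]       # drop the sign bit
    run = len(body) - len(body.lstrip(body[0]))     # length of the regime run
    k = run - 1 if body[0] == "1" else -run
    rest = body[run + 1:]                           # skip the regime terminator
    e_bits = rest[:es].ljust(es, "0")               # missing exponent bits are 0
    m_bits = rest[es:]
    e = int(e_bits, 2) if e_bits else 0
    m = int(m_bits, 2) / 2 ** len(m_bits) if m_bits else 0.0
    useed = 2 ** (2 ** es)
    return (-1) ** s * (1 + m) * float(useed) ** k * 2 ** e


print(decode_posit(0b01000000))          # 1.0
print(decode_posit(0b01100000))          # 2.0
print(decode_posit(0b00100000))          # 0.5
print(decode_posit(0b11000000))          # -1.0
print(decode_posit(0b01010000, es=1))    # 2.0
```

&lt;p&gt;Note how a longer regime run trades mantissa bits for magnitude: the all-ones body &lt;code&gt;0b01111111&lt;&#x2F;code&gt; has no mantissa bits left at all.&lt;&#x2F;p&gt;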
&lt;p&gt;In his &lt;a href=&quot;https:&#x2F;&#x2F;posithub.org&#x2F;docs&#x2F;Posits4.pdf&quot;&gt;seminal paper&lt;&#x2F;a&gt;, Gustafson explains the &lt;em&gt;genius&lt;&#x2F;em&gt; behind this design:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The regime bits may seem like a weird and artificial construct, 
but they actually arise from a natural and elegant geometric mapping of binary integers to the projective real numbers on a circle.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;... The value $2^{2^{es}}$ is called useed because it arises so often.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Fascinating. &lt;&#x2F;p&gt;
&lt;p&gt;The purity of posits is reflected in its effortless expression:
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;posit-fun.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-divinity-of-posits&quot;&gt;The Divinity of Posits&lt;&#x2F;h2&gt;
&lt;p&gt;The posit representation maps numbers around the topological circular loop in religiously significant quadrants.
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;holy-circle.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;At the heavenly North of the circle, symbolizing the Alpha and Omega, Our Father to which we solemnly pray, lies the glorious positive and negative infinity.
At its opposite, the wicked, immoral South of the circle, lies nothing of value, the value $0$.
Meanwhile, on the earthly plane, God&#x27;s children enjoy free will, where they choose between positive one at the East and negative one at the West.&lt;&#x2F;p&gt;
&lt;p&gt;The quadrants induced by these points are then symmetrically populated by the rest of the points. 
The $useed$ determines where the &amp;quot;center&amp;quot; of these quadrants resides as follows:
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;useed-circle.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Much like Adam and Eve, the $useed$ determines how the quadrants in the circle are populated.
Positive values lie at the right of the circle, while negative values lie at the left, and reciprocal values reflect across the equator.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;populated-circle.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h1 id=&quot;comparing-numerical-representations&quot;&gt;Comparing Numerical Representations&lt;&#x2F;h1&gt;
&lt;p&gt;Tragically, we live in a fallen world, full of non-believers and doubting Thomases.
This necessitates effective proselytizing that speaks not only to broken spirits, but also to faithless minds.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;qualitative-comparison&quot;&gt;Qualitative Comparison&lt;&#x2F;h2&gt;
&lt;p&gt;Unlike IEEE&#x27;s extravagantly piecewise nature, posits opt for piecewise minimalism:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Whereas there are two floating point representations of zero (both positive and negative), there is only one such posits representation: the all zero bit pattern.&lt;&#x2F;li&gt;
&lt;li&gt;Whereas positive and negative infinity are distinctly represented as floating points, posits unite these values into one representation: the bit pattern with all zeros save for the first bit.&lt;&#x2F;li&gt;
&lt;li&gt;Whereas floating point numbers are polluted with &lt;code&gt;NaN&lt;&#x2F;code&gt; values, posits are cleansed of such unclean special values.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Often, this simple observation suffices to awaken floating point enthusiasts out of their delusion.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;decimal-accuracy-comparison&quot;&gt;Decimal Accuracy Comparison&lt;&#x2F;h2&gt;
&lt;p&gt;At other times, a more quantitative approach is called for;
to this end, Gustafson proposes a variety of metric-based comparisons for numerical representation.
First, he asks: how can we assess how accurately a number is represented in a given representation?
To answer this, he proposes the &lt;em&gt;decimal accuracy&lt;&#x2F;em&gt; metric: the number of significant digits of a real number on a logarithmic scale, as in the decibel system.&lt;&#x2F;p&gt;
&lt;p&gt;Charitably, Gustafson elucidates for non-believers why the logarithmic scale is the one true choice:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A “perfect spacing” of ten numbers between 1 and 10 in a real number system would be not the evenly-spaced counting numbers 1 through 10, 
but exponentially-spaced $1,10^{\frac{1}{10}},10^{\frac{2}{10}},...,10^{\frac{9}{10}},10$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Thus, the &amp;quot;canonical position&amp;quot; of a (positive, non-zero) real number $n$ is given by $p$, where $n = 10^{\frac{p}{10}}$,
and so it follows that the distance, or representation error, between a real number $n$ and its representation $n&#x27;$ is given by $| \log_{10}(\frac{n}{n&#x27;}) | $.
A straightforward adjustment adapts this scheme to negative numbers as well, and we ignore zero, since posits represent it exactly.
The inverse of this error yields the &lt;em&gt;perfect metric for accuracy&lt;&#x2F;em&gt;.
Equipped with this metric, we glean the significant digits of the representation, the &lt;em&gt;decimal accuracy&lt;&#x2F;em&gt;, by taking its base 10 logarithm.&lt;&#x2F;p&gt;
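&lt;p&gt;Read literally, the recipe above fits in a few lines. This Python sketch is our own transcription of the definition, not code from Gustafson; the name &lt;code&gt;decimal_accuracy&lt;&#x2F;code&gt; and the choice to return infinity for exact representations are assumptions.&lt;&#x2F;p&gt;

```python
import math


def decimal_accuracy(n, n_repr):
    """Decimal accuracy of representing positive n by positive n_repr."""
    error = abs(math.log10(n / n_repr))    # the representation error
    if error == 0:
        return math.inf                    # exact representation
    return math.log10(1 / error)           # log of the inverse error


# Representing 3.14159 by 3.14 keeps between three and four digits.
print(decimal_accuracy(3.14159, 3.14))
```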
&lt;p&gt;Gustafson exhaustively plots decimal accuracy for both representations in 8-bits, demonstrating posit supremacy:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;decimal-accuracy.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We make three observations from this plot:
First, posits distribute decimal accuracy symmetrically across representations, while floating points fail to deliver at larger numbers, which are ignored in favor of &lt;code&gt;NaN&lt;&#x2F;code&gt;.
Furthermore, posits favor decimal accuracy at the meat of the number line, a sizable domain around 0, whereas floating points over-emphasize the ends of the number line.
Finally, posits represent more numbers: they cast the light of representation across a larger domain. 
In other words, the &lt;em&gt;dynamic range&lt;&#x2F;em&gt; of posits, a further metric, exceeds that of floating points.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;operation-accuracy-comparison&quot;&gt;Operation Accuracy Comparison&lt;&#x2F;h2&gt;
&lt;p&gt;Besides raw representation error, there is the possibility of errors generated from primitive operations.
Gustafson addressed this for basic arithmetic operations by means of &amp;quot;closure plots&amp;quot;.
Such a plot visually depicts the accuracy of every combination of inputs.
For instance, the multiplication closure plot below paints the input domain darker where more accurate results are achieved, as measured by decimal accuracy:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;mul-closure.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Accuracy of single-argument operations between representations is compared by plotting sorted error.
For instance, Gustafson compares accuracy of square root in the following plot:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posits&#x2F;sqrt-losses.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;p&gt;As part of our cultist duties, we compare the 32-bit floating point and posit representations by measuring their accuracy under a variety of benchmarks.
In each benchmark, we express real number calculations in terms of operations over 64-bit &lt;code&gt;double&lt;&#x2F;code&gt;s.
In other words, the &lt;code&gt;double&lt;&#x2F;code&gt; type serves as the &amp;quot;oracle&amp;quot; baseline from which we compute accuracy.
Each &lt;code&gt;double&lt;&#x2F;code&gt; benchmark is then compiled to LLVM IR, upon which we apply a &lt;code&gt;posit&lt;&#x2F;code&gt; LLVM pass and a &lt;code&gt;float&lt;&#x2F;code&gt; LLVM pass respectively to generate associated benchmarks.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we compute percent error by comparing &lt;code&gt;float&lt;&#x2F;code&gt; and &lt;code&gt;posit&lt;&#x2F;code&gt; benchmark results to the &lt;code&gt;double&lt;&#x2F;code&gt; baseline benchmark.
We regard benchmark errors as arising from the inaccuracy that is characteristic of the particular 32-bit representation, compounded with successive operations.
With our deepest apologies to Gustafson, we compute these errors on a linear scale, rather than a logarithmic one, and use these as metrics to compare the accuracy of the two representations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;llvm-pass-implementation&quot;&gt;LLVM pass implementation&lt;&#x2F;h2&gt;
&lt;p&gt;We aimed for our &lt;code&gt;float&lt;&#x2F;code&gt; and &lt;code&gt;posit&lt;&#x2F;code&gt; passes to insert as much representation-specific functionality as possible into our benchmarks.
The case of &lt;code&gt;float&lt;&#x2F;code&gt;s allows a full translation, since every &lt;code&gt;double&lt;&#x2F;code&gt; operation can be cast as a &lt;code&gt;float&lt;&#x2F;code&gt; operation.
The case of &lt;code&gt;posit&lt;&#x2F;code&gt;s lacked full support, however: we did not provide implementations for all existing LLVM &lt;code&gt;double&lt;&#x2F;code&gt; operations.&lt;&#x2F;p&gt;
&lt;p&gt;To accommodate both cases under one pass implementation, 
we designed our pass such that it first lowers each supported &lt;code&gt;double&lt;&#x2F;code&gt; operation and operands down to the target representation, subsequently lifting the result back to a &lt;code&gt;double&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Since each rewrite has a purely local effect, we can avoid reasoning about complex allocation and type casting scenarios.
Though computationally inefficient, this approach immediately supports all programs and is easily extendible.&lt;&#x2F;p&gt;
&lt;p&gt;In pseudo-LLVM, a pass targeting type &lt;code&gt;typ&lt;&#x2F;code&gt; rewrites an arbitrary &lt;code&gt;double&lt;&#x2F;code&gt; operation &lt;code&gt;op&lt;&#x2F;code&gt; and its operands as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x : double = ..
y : double = ..
result : double = op double x y
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;--&amp;gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x : double = ..
y : double = ..

x_typ : typ = convert double x to typ
y_typ : typ = convert double y to typ
result_typ : typ = typ_op x_typ y_typ

result : double = convert typ result_typ to double
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This generic pass structure parametrizes &lt;code&gt;typ&lt;&#x2F;code&gt; into three components:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;convert double to typ&lt;&#x2F;code&gt;: conversion from &lt;code&gt;double&lt;&#x2F;code&gt; to &lt;code&gt;typ&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;convert typ to double&lt;&#x2F;code&gt;: conversion from &lt;code&gt;typ&lt;&#x2F;code&gt; to &lt;code&gt;double&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;typ_op&lt;&#x2F;code&gt;: implementation over &lt;code&gt;typ&lt;&#x2F;code&gt; values of &lt;code&gt;op&lt;&#x2F;code&gt;, the corresponding &lt;code&gt;double&lt;&#x2F;code&gt; operation&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The &lt;code&gt;float&lt;&#x2F;code&gt; pass specifies these components using the following LLVM constructs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The LLVM precision-demotion instruction: &lt;code&gt;fptrunc double to float&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;The LLVM precision-promotion instruction: &lt;code&gt;fpext float to double&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;LLVM&#x27;s floating point operations with type parameter set to &lt;code&gt;float&lt;&#x2F;code&gt; (e.g., &lt;code&gt;fadd float x y&lt;&#x2F;code&gt;)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
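&lt;p&gt;The effect of the &lt;code&gt;float&lt;&#x2F;code&gt; pass on a single addition can be imitated outside of LLVM. The Python sketch below is only a simulation, not our pass: it uses the standard &lt;code&gt;struct&lt;&#x2F;code&gt; module to round a double to 32-bit precision in place of &lt;code&gt;fptrunc&lt;&#x2F;code&gt;, and the names &lt;code&gt;to_float&lt;&#x2F;code&gt; and &lt;code&gt;lowered_add&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;

```python
import struct


def to_float(x):
    """Round a double to the nearest 32-bit float (the role of fptrunc)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lowered_add(x, y):
    """Lower both operands, add in float precision, lift the result back."""
    x_f, y_f = to_float(x), to_float(y)
    # Rounding the double sum of two floats gives the float sum; the
    # returned Python float plays the role of the fpext back to double.
    return to_float(x_f + y_f)


exact = 0.1 + 0.2                  # the double baseline
approx = lowered_add(0.1, 0.2)     # the same operation after the rewrite
percent_error = abs(approx - exact) / abs(exact) * 100
print(percent_error)
```

&lt;p&gt;Because the rewrite touches one operation at a time, the same wrapper shape would apply to every other supported operation.&lt;&#x2F;p&gt;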
&lt;p&gt;The &lt;code&gt;posit&lt;&#x2F;code&gt; pass draws these components from external C functions implemented in &lt;a href=&quot;https:&#x2F;&#x2F;gitlab.com&#x2F;cerlane&#x2F;SoftPosit-Python&quot;&gt;Cerlane Leong&#x27;s SoftPosit repository&lt;&#x2F;a&gt;.
In particular, we borrow the basic arithmetic operations (&lt;code&gt;+&lt;&#x2F;code&gt;, &lt;code&gt;-&lt;&#x2F;code&gt;, &lt;code&gt;*&lt;&#x2F;code&gt;, and &lt;code&gt;&#x2F;&lt;&#x2F;code&gt;) over 32-bit posits. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;&#x2F;h3&gt;
&lt;p&gt;We ran several benchmarks from &lt;a href=&quot;https:&#x2F;&#x2F;fpbench.org&#x2F;benchmarks.html&quot;&gt;FPBench&lt;&#x2F;a&gt; that accommodated our limited selection of posit operators.&lt;&#x2F;p&gt;
&lt;p&gt;For each input, we calculate percent error for &lt;code&gt;float&lt;&#x2F;code&gt; and &lt;code&gt;posit&lt;&#x2F;code&gt; benchmark results, regarding the corresponding &lt;code&gt;double&lt;&#x2F;code&gt; benchmark results as the correct, &amp;quot;oracular&amp;quot; values. 
We chose computationally feasible sample sizes for each benchmark approximately using the heuristic $20^{vars}$, where $vars$ denotes the number of input variables for the benchmark, reflecting the need for more samples over larger input spaces.
Inputs were linearly sampled over the valid input ranges specified for each benchmark.
This simple strategy ensured that test cases would also include the case of exactly representable integers, in which all of the error arises from operation error.
We report the mean and standard deviation of the percent error over all examined inputs:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;code&gt;sum&lt;&#x2F;code&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean Error&lt;&#x2F;th&gt;&lt;th&gt;Error Std Dev&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Float&lt;&#x2F;td&gt;&lt;td&gt;7.7e-8&lt;&#x2F;td&gt;&lt;td&gt;4.8e-8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Posit&lt;&#x2F;td&gt;&lt;td&gt;5.1e-9&lt;&#x2F;td&gt;&lt;td&gt;3.2e-9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$n$ = 1000&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;code&gt;x_by_xy&lt;&#x2F;code&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean Error&lt;&#x2F;th&gt;&lt;th&gt;Error Std Dev&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Float&lt;&#x2F;td&gt;&lt;td&gt;1.1e-7&lt;&#x2F;td&gt;&lt;td&gt;9.2e-8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Posit&lt;&#x2F;td&gt;&lt;td&gt;6.9e-9&lt;&#x2F;td&gt;&lt;td&gt;5.7e-9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$n$ = 961&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;code&gt;delta4&lt;&#x2F;code&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean Error&lt;&#x2F;th&gt;&lt;th&gt;Error Std Dev&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Float&lt;&#x2F;td&gt;&lt;td&gt;5.8e-7&lt;&#x2F;td&gt;&lt;td&gt;6.5e-6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Posit&lt;&#x2F;td&gt;&lt;td&gt;3.6e-8&lt;&#x2F;td&gt;&lt;td&gt;3.8e-7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$n$ = 262144&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;code&gt;kepler0&lt;&#x2F;code&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean Error&lt;&#x2F;th&gt;&lt;th&gt;Error Std Dev&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Float&lt;&#x2F;td&gt;&lt;td&gt;2.5e-7&lt;&#x2F;td&gt;&lt;td&gt;1.0e-7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Posit&lt;&#x2F;td&gt;&lt;td&gt;1.6e-8&lt;&#x2F;td&gt;&lt;td&gt;7.9e-9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$n$ = 262144&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;code&gt;floudas1&lt;&#x2F;code&gt;&lt;&#x2F;th&gt;&lt;th&gt;Mean Error&lt;&#x2F;th&gt;&lt;th&gt;Error Std Dev&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Float&lt;&#x2F;td&gt;&lt;td&gt;2.1e-7&lt;&#x2F;td&gt;&lt;td&gt;1.6e-7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Posit&lt;&#x2F;td&gt;&lt;td&gt;1.4e-8&lt;&#x2F;td&gt;&lt;td&gt;1.2e-8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;$n$ = 92572956&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
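&lt;p&gt;For readers who want to reproduce the shape of this measurement, the Python sketch below computes the mean and standard deviation of percent error for a two-variable benchmark under simulated &lt;code&gt;float&lt;&#x2F;code&gt; lowering. It is not our harness: the formula &lt;code&gt;x &#x2F; (x + y)&lt;&#x2F;code&gt;, the input range from 1 to 4, and the 31 samples per variable are all assumptions chosen for illustration, and no posit arithmetic is simulated.&lt;&#x2F;p&gt;

```python
import statistics
import struct


def to_float(x):
    """Round a double to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def linspace(lo, hi, count):
    """count evenly spaced samples from lo to hi, inclusive."""
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


def benchmark_double(x, y):
    """The double baseline, treated as the oracle."""
    return x / (x + y)


def benchmark_float(x, y):
    """The same benchmark with every operation lowered to float and lifted."""
    x_f, y_f = to_float(x), to_float(y)
    return to_float(x_f / to_float(x_f + y_f))


samples = [(x, y) for x in linspace(1.0, 4.0, 31) for y in linspace(1.0, 4.0, 31)]
errors = [abs(benchmark_float(x, y) - benchmark_double(x, y))
          / abs(benchmark_double(x, y)) * 100
          for x, y in samples]
mean_error = statistics.mean(errors)
std_error = statistics.pstdev(errors)
print(len(samples), mean_error, std_error)
```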
&lt;h1 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h1&gt;
&lt;p&gt;As prophesied by Gustafson, posits consistently outperform floating points, here by roughly an order of magnitude in mean error.
However, our LLVM &lt;code&gt;posit&lt;&#x2F;code&gt; pass is grossly inefficient: it introduces load and store operations to convert between &lt;code&gt;double&lt;&#x2F;code&gt; and &lt;code&gt;posit&lt;&#x2F;code&gt; for each &lt;code&gt;double&lt;&#x2F;code&gt; operation.
We blame this sad state of affairs on ignorant, non-posit architectures, which necessitate such a pass in the first place.&lt;&#x2F;p&gt;
&lt;p&gt;Gustafson shares what we&#x27;re missing out on by not being woke:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A posit processing unit takes less circuitry than an IEEE float FPU. With lower power use and smaller silicon footprint, the posit operations per second (POPS) supported by a chip can be significantly higher than the FLOPS using similar hardware resources. GPU accelerators and Deep Learning processors, in particular, can do more per watt and per dollar with posits, yet deliver superior answer quality&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Indeed, we could be talking in terms of POPS instead of FLOPS.
Nonetheless, our pass allows us to simulate and give thanks for posit accuracy.&lt;&#x2F;p&gt;
&lt;p&gt;Our approach relies on treating the &lt;code&gt;double&lt;&#x2F;code&gt; type as the &amp;quot;ground truth&amp;quot; representation for benchmarks. 
Although this is an approximation, since no finite representation has perfect accuracy, 
we assume that the accumulated error in &lt;code&gt;double&lt;&#x2F;code&gt; benchmarks will be truncated or rounded off when comparing with the less precise 32-bit representations.
In other words, we assume that casting to a &lt;code&gt;double&lt;&#x2F;code&gt; from a &lt;code&gt;float&lt;&#x2F;code&gt; or 32-bit &lt;code&gt;posit&lt;&#x2F;code&gt; can be done without loss of information.
Although this assumption does not always hold, we found it sufficient for practical testing, and it justifies the streamlined design of our LLVM pass.&lt;&#x2F;p&gt;
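&lt;p&gt;The float half of this assumption is easy to check. The following is a minimal Python sketch, not the code of the pass itself: the helper names are hypothetical, and it covers only IEEE floats (posits would need a posit library, which is not assumed here). It rounds a double to 32 bits with the standard &lt;code&gt;struct&lt;&#x2F;code&gt; module, confirms that the result widens back to a double unchanged, and measures the rounding error with the double as ground truth.&lt;&#x2F;p&gt;

```python
import struct


def to_float32(x):
    """Round a Python float (an IEEE double) to the nearest 32-bit float.

    The result is handed back as a double that holds the 32-bit value.
    """
    return struct.unpack("f", struct.pack("f", x))[0]


def widen_is_lossless(x):
    """Check that the 32-bit rounding of x survives a trip through double."""
    narrow = to_float32(x)
    # `narrow` is stored as a double here. Rounding it to 32 bits again
    # must return the same value if widening lost no information.
    return to_float32(narrow) == narrow


def float32_error(x):
    """Absolute error of the 32-bit rounding, taking the double as truth."""
    return abs(x - to_float32(x))
```

&lt;p&gt;Under these assumptions, &lt;code&gt;float32_error(0.5)&lt;&#x2F;code&gt; is exactly zero because 0.5 is representable in both formats, while &lt;code&gt;float32_error(0.1)&lt;&#x2F;code&gt; is small but nonzero.&lt;&#x2F;p&gt;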
&lt;h1 id=&quot;conclusion-the-posit-prayer&quot;&gt;Conclusion: The Posit Prayer&lt;&#x2F;h1&gt;
&lt;p&gt;Every morning and every evening, run your favorite posit benchmark and recite the following:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;[posits] provide compelling advantages over floats, including larger dynamic range, higher accuracy, better closure, bitwise identical results across systems, simpler hardware, and simpler exception handling.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Confess your use of floating points, repent, and be cleansed.
Contact the authors for induction into the Cult of Posits.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Compiler Validation via Equivalence Modulo Inputs</title>
                <pubDate>Mon, 02 Dec 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/equivalence-modulo-inputs/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/equivalence-modulo-inputs/</guid>
                <description>&lt;p&gt;Imagine you, a clever person who wants her C programs to run faster,
sat down, thought very hard, and developed a new compiler optimization.
Say you implement it as a transformation pass in LLVM so that other people
can take advantage of your cleverness to make their programs run faster as well.
You run your optimization pass over a few benchmarks and see that it does
indeed make some programs run faster.
But a question nags you: how do you know that your optimization is correct?
That is, how do you know that your optimization doesn&#x27;t change the semantics
of the input program?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Equivalence Modulo Inputs&lt;&#x2F;em&gt; (henceforth EMI), a testing technique introduced
by Le &amp;amp; al. in a &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2594334&quot;&gt;PLDI 2014 paper&lt;&#x2F;a&gt;, allows our compiler hacker
above to test her optimization rigorously without much effort.
EMI is especially effective at finding &lt;em&gt;miscompilation&lt;&#x2F;em&gt; bugs,
wherein compilers produce wrong code, which is much more pernicious than
&lt;em&gt;compiler crashes&lt;&#x2F;em&gt;, bugs where the compiler terminates abnormally.
This allows EMI to test the optimization phase of compilers much more rigorously
than prior work such as &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1993532&quot;&gt;Csmith&lt;&#x2F;a&gt;, which finds fewer miscompilations than
compiler crashes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;EMI is a form of &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.swarthmore.edu&#x2F;%7Ebylvisa1&#x2F;cs97&#x2F;f13&#x2F;Papers&#x2F;DifferentialTestingForSoftware.pdf&quot;&gt;differential testing&lt;&#x2F;a&gt;, the widely applicable idea that if
multiple systems presumed to be equivalent produce different outputs on the same
input, then there is a bug in at least one of the systems.&lt;&#x2F;p&gt;
&lt;p&gt;Csmith also differentially tests compilers by generating random test cases,
leading to 281 bug reports for GCC and LLVM by the time it was published.
Whereas Csmith compares output from multiple compilers on the same input
program, EMI compares output from different programs compiled by the same
compiler.&lt;&#x2F;p&gt;
&lt;p&gt;Explicitly wanting to avoid Csmith&#x27;s painstaking approach to restricting random
program generation with aggressive safety analysis, Le &amp;amp; al. design their
implementation of EMI around generating equivalent variants of valid seed
programs, as we will see below. The EMI and Csmith approaches are not
oppositional; in fact, Le &amp;amp; al. incorporate Csmith into their workflow. The vast
majority (all but four) of bugs identified by EMI were found with EMI variants
of random programs generated by Csmith.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;some-definitions&quot;&gt;Some definitions&lt;&#x2F;h2&gt;
&lt;p&gt;Let us now formally define EMI and show how we can use it as a condition
to determine whether a compiler is buggy.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Equivalence modulo inputs (EMI)&lt;&#x2F;strong&gt;.
Given an input set $I$ and a program $P$, a program $Q$ is an EMI-variant of
$P$ (read: $P$ and $Q$ are EMI) relative to $I$ if $P$ and $Q$ have
the same denotations (&amp;quot;mean the same thing&amp;quot;) over all inputs in $I$.
Formally, for all $i$ in $I$,
$\llbracket P \rrbracket(i) = \llbracket Q \rrbracket (i)$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;It is immediately clear that EMI is a relaxation of semantic equivalence,
wherein $P$ and $Q$ have the same denotations for all possible inputs.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the following two programs are semantically equivalent
(and thus EMI for any input set):&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;$\lambda x. (x + x)$ and $\lambda x. (2 * x)$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The following two programs are &lt;em&gt;not&lt;&#x2F;em&gt; semantically equivalent, yet they are EMI
over the input set {0}:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;$\lambda x. x$ and $\lambda x. 0$&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
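&lt;p&gt;The definition can be sketched directly in Python, under the assumption that programs are modeled as ordinary functions and the input set is finite. The name &lt;code&gt;is_emi&lt;&#x2F;code&gt; and the four example functions are hypothetical and serve only to mirror the two examples above.&lt;&#x2F;p&gt;

```python
def is_emi(p, q, inputs):
    """Return True if p and q agree on every input in the given set."""
    return all(p(i) == q(i) for i in inputs)


# Semantically equivalent programs: EMI relative to any input set.
def double_by_adding(x):
    return x + x


def double_by_scaling(x):
    return 2 * x


# Not semantically equivalent, but EMI relative to the input set {0}.
def identity(x):
    return x


def always_zero(x):
    return 0
```

&lt;p&gt;Here &lt;code&gt;is_emi(identity, always_zero, {0})&lt;&#x2F;code&gt; holds, but adding 1 to the input set breaks the relation, which is exactly the sense in which EMI relaxes semantic equivalence.&lt;&#x2F;p&gt;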
&lt;p&gt;Now that we have a formal definition of EMI, how can we use it as a condition
to check whether a compiler is buggy or not?&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;EMI-validity&lt;&#x2F;strong&gt;. Given an input set $I$, a compiler $C$ is &lt;em&gt;EMI-valid&lt;&#x2F;em&gt;
relative to $I$ if, for any program $P$ and any EMI-variant $Q$ of $P$ relative to $I$,
it is the case that $C(P)(i) = C(Q)(i)$ for all $i$ in $I$.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;If a compiler is not EMI-valid, then we consider it buggy.&lt;&#x2F;strong&gt; 
But the inverse is not true: if a compiler &lt;em&gt;is&lt;&#x2F;em&gt; EMI-valid, it can
still be buggy!
Consider the degenerate compiler that maps all source programs to the same
target program.
The compiler is EMI-valid for any input set, but it is obviously buggy.
Thus EMI-validity is a necessary but not sufficient condition for compiler correctness:
it conservatively overapproximates the set of correct compilers.&lt;&#x2F;p&gt;
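&lt;p&gt;A small Python sketch makes the degenerate case concrete. It assumes a toy model in which a program is a Python function and a compiler maps one function to another. All names are hypothetical, and the check covers a single pair of programs rather than all pairs.&lt;&#x2F;p&gt;

```python
def emi_valid_on(compiler, p, q, inputs):
    """Check C(P)(i) == C(Q)(i) for every i, for one pair of EMI programs."""
    compiled_p = compiler(p)
    compiled_q = compiler(q)
    return all(compiled_p(i) == compiled_q(i) for i in inputs)


def preserves_semantics_on(compiler, p, inputs):
    """Check that the compiled program agrees with its source on the inputs."""
    compiled_p = compiler(p)
    return all(compiled_p(i) == p(i) for i in inputs)


def degenerate_compiler(program):
    """Hypothetical broken compiler: every source maps to the same target."""
    return lambda i: 0


def double_by_adding(x):
    return x + x


def double_by_scaling(x):
    return 2 * x
```

&lt;p&gt;On the inputs 1, 2, and 3, the degenerate compiler passes &lt;code&gt;emi_valid_on&lt;&#x2F;code&gt; for the two doubling programs yet fails &lt;code&gt;preserves_semantics_on&lt;&#x2F;code&gt;, so passing the EMI check alone does not establish correctness.&lt;&#x2F;p&gt;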
&lt;p&gt;Why is this useful? Couldn&#x27;t we just define validity for a compiler over
&lt;em&gt;semantic equivalence&lt;&#x2F;em&gt;, not just its relaxed counterpart in EMI?
(You can imagine defining EMI-validity where the input set is all possible
inputs.)
EMI solves two hard practical problems in differential testing of compilers:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;How do we generate &amp;quot;equivalent&amp;quot; variants of input programs?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;How do we check that the compiler&#x27;s output programs are &amp;quot;equivalent&amp;quot;?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Using the more stringent condition of semantic equivalence makes solving
these practical problems hard—indeed, in the general case
(2) is undecidable by a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rice%27s_theorem&quot;&gt;famous result in computability theory&lt;&#x2F;a&gt;.
But the more relaxed condition of EMI makes these tractable.
As we&#x27;ll see with the implementation of Orion below, there is
an efficient procedure for generating EMI-variants from seed programs,
thus solving (1).
We determine that output programs are &amp;quot;equivalent&amp;quot; if they are EMI,
which, since we&#x27;re only checking equivalence over a particular set of
inputs, gives an efficient procedure for (2).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;emi-in-practice-orion&quot;&gt;EMI in Practice: Orion&lt;&#x2F;h2&gt;
&lt;p&gt;Le &amp;amp; al. realize the promise of Equivalence Modulo Inputs by implementing Orion,
a bug-finding tool for C compilers.
Given a seed program and an input set, it generates EMI-variants of the seed
program and then checks if a compiler configuration is EMI-valid with respect
to these.&lt;&#x2F;p&gt;
&lt;p&gt;To generate EMI-variants, Orion uses code coverage information provided
by tools such as &lt;code&gt;gcov&lt;&#x2F;code&gt; to modify unexecuted parts of the seed program.
Intuitively, this procedure generates EMI-variants of the seed program
since unexecuted statements should not affect the output of the compiled program.&lt;&#x2F;p&gt;
&lt;p&gt;Specifically, Orion probabilistically deletes unexecuted statements of
seed programs to generate EMI-variants.
The authors consider other mutation strategies as part of future work,
but as we&#x27;ll see in the evaluation section below, the simple strategy of
deleting statements works well in practice.&lt;&#x2F;p&gt;
&lt;p&gt;Orion&#x27;s EMI-variant generation algorithm is sketched below in &lt;code&gt;gen_variant&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;prune_visit&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(prog, statement, coverage_set):
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# probabilistically delete unexecuted statement
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;statement &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;not in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;coverage_set &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;and &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;flip_coin(statement):
    prog.delete(statement)

  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# otherwise, traverse its children
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;:
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;child &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;statement.children:
      prune_visit(prog, child, coverage_set)

&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;gen_variant&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(prog, coverage_set):
  emi_variant &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;clone(prog)
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;statement &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;emi_variant:
    prune_visit(emi_variant, statement, coverage_set)

  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;emi_variant
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;gen_variant&lt;&#x2F;code&gt; takes as input a seed program and the coverage set—the
set of all statements executed for some input in the input set.
It clones the program into &lt;code&gt;emi_variant&lt;&#x2F;code&gt; and then uses &lt;code&gt;prune_visit&lt;&#x2F;code&gt; to
probabilistically delete unexecuted statements.&lt;&#x2F;p&gt;
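&lt;p&gt;For readers who want to run the idea, here is a minimal Python sketch of the same pruning step on a toy statement tree. It is not Orion&#x27;s implementation: the &lt;code&gt;Stmt&lt;&#x2F;code&gt; class, the string labels, the &lt;code&gt;p_delete&lt;&#x2F;code&gt; parameter, and the &lt;code&gt;labels&lt;&#x2F;code&gt; helper are all assumptions made for illustration, and the coverage set is taken to be a set of statement labels.&lt;&#x2F;p&gt;

```python
import random
from dataclasses import dataclass, field


@dataclass
class Stmt:
    label: str  # hypothetical unique statement id
    children: list = field(default_factory=list)


def prune(stmts, covered, flip_coin):
    """Return a pruned copy of stmts; executed statements always survive."""
    kept = []
    for stmt in stmts:
        if stmt.label not in covered and flip_coin():
            continue  # drop the unexecuted statement and its whole subtree
        # Otherwise keep the statement and recurse into its children.
        kept.append(Stmt(stmt.label, prune(stmt.children, covered, flip_coin)))
    return kept


def gen_variant(prog, covered, rng, p_delete=0.5):
    """Build one EMI-variant; p_delete is an assumed deletion probability."""
    def flip_coin():
        return rng.choices((True, False), weights=(p_delete, 1 - p_delete))[0]
    return prune(prog, covered, flip_coin)


def labels(stmts):
    """Flatten a statement tree into a preorder list of labels."""
    out = []
    for stmt in stmts:
        out.append(stmt.label)
        out.extend(labels(stmt.children))
    return out
```

&lt;p&gt;The sketch builds a pruned copy instead of deleting in place, which sidesteps mutating a tree while iterating over it. Passing a seeded &lt;code&gt;random.Random&lt;&#x2F;code&gt; as &lt;code&gt;rng&lt;&#x2F;code&gt; makes a given variant reproducible.&lt;&#x2F;p&gt;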
&lt;p&gt;With its EMI-variant generation algorithm outlined, we can now sketch
the algorithm by which Orion validates C compilers.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;validate&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(compiler, prog, input_set):
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# compile with no optimizations
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;out_prog &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;compiler.compile(prog, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;NO_OPTIMIZATION&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)

  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# generate reference output
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;in_out_set &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[(i, out_prog.execute(i)) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;input_set]

  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# get coverage info
  # a statement is considered covered if it was executed
  # by the program on any input
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;coverage_set &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;()
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;input_set:
    coverage_set &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;union(coverage_set, coverage(prog, i))

  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;MAX_ITER&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;):
    emi_variant &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;gen_variant(prog, coverage_set)

    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;config &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;compiler.configurations:
      out_emi_variant &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;compiler.compile(emi_variant, config)

      &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# check if compiled EMI variant is equivalent over all inputs
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i, o &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;in_out_set:
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# compiler is not EMI-valid!
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;emi_o &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;out_emi_variant.execute(i)
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;emi_o &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;o:
          report_bug(compiler, config, prog, emi_variant, i, o, emi_o)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;validate&lt;&#x2F;code&gt; takes as input a compiler (&lt;code&gt;compiler&lt;&#x2F;code&gt;), a seed program (&lt;code&gt;prog&lt;&#x2F;code&gt;),
and an input set (&lt;code&gt;input_set&lt;&#x2F;code&gt;).
First, &lt;code&gt;validate&lt;&#x2F;code&gt; compiles &lt;code&gt;prog&lt;&#x2F;code&gt; with &lt;code&gt;compiler&lt;&#x2F;code&gt; using no optimizations,
the output of which (&lt;code&gt;out_prog&lt;&#x2F;code&gt;) is then used to generate a reference set
of outputs for &lt;code&gt;input_set&lt;&#x2F;code&gt;.
Next, it uses a code coverage tool (&lt;code&gt;coverage&lt;&#x2F;code&gt;) to determine the set of
executed statements over all inputs in &lt;code&gt;input_set&lt;&#x2F;code&gt;.
In its &amp;quot;main loop,&amp;quot; &lt;code&gt;validate&lt;&#x2F;code&gt; generates an EMI-variant of &lt;code&gt;prog&lt;&#x2F;code&gt; using
the computed coverage information by calling &lt;code&gt;gen_variant&lt;&#x2F;code&gt;.
For every relevant compiler configuration, it then compiles the EMI-variant
and runs the output program over all inputs in &lt;code&gt;input_set&lt;&#x2F;code&gt; to check that it
returns the same output as recorded in the reference set.
If not, we flag the current compiler configuration as having a bug
(&lt;code&gt;report_bug&lt;&#x2F;code&gt;).
&lt;code&gt;validate&lt;&#x2F;code&gt; repeats this main loop for some number of iterations (&lt;code&gt;MAX_ITER&lt;&#x2F;code&gt;)
to find more bugs using different EMI-variants.&lt;&#x2F;p&gt;
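&lt;p&gt;The checking loop can also be exercised end to end on a toy model. The Python sketch below is an illustration under stated assumptions, not Orion itself: programs are plain functions, the variants are supplied by the caller instead of being generated from coverage data, and &lt;code&gt;toy_compile&lt;&#x2F;code&gt; is a hypothetical compiler with a deliberately planted miscompilation in its optimized configuration.&lt;&#x2F;p&gt;

```python
def validate(compile_fn, configs, seed_prog, variants, inputs, baseline):
    """Return (config, input, expected, actual) for every EMI violation."""
    # Reference outputs come from the seed compiled with the baseline config.
    reference_prog = compile_fn(seed_prog, baseline)
    reference = {i: reference_prog(i) for i in inputs}

    bugs = []
    for variant in variants:
        for config in configs:
            compiled = compile_fn(variant, config)
            for i, expected in reference.items():
                actual = compiled(i)
                if actual != expected:
                    # The compiler is not EMI-valid for this configuration.
                    bugs.append((config, i, expected, actual))
    return bugs


def toy_compile(prog, config):
    """Hypothetical compiler: correct at O0, wrong on input 3 otherwise."""
    if config == "O0":
        return prog
    return lambda i: prog(i) + (1 if i == 3 else 0)


def seed(x):
    return 2 * x


def variant(x):
    return x + x
```

&lt;p&gt;Calling &lt;code&gt;validate(toy_compile, [&quot;O0&quot;, &quot;O3&quot;], seed, [variant], [1, 2, 3], &quot;O0&quot;)&lt;&#x2F;code&gt; reports a single violation, for the optimized configuration on input 3, which is the planted bug.&lt;&#x2F;p&gt;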
&lt;p&gt;The authors note that the implementation effort for Orion is much less
burdensome than other bug-finding tools such as Csmith:
whereas Csmith is about 30K-40K lines of C++, Orion is only about 500 lines
of shell scripts and 1K lines of C++.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;In order to evaluate EMI—even in its concrete implementation
Orion—several questions must be answered:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;What compilers (and compiler configurations) will be tested?&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
&lt;br &#x2F;&gt;
The authors test GCC and LLVM, popular open-source compilers with transparent
bug tracking. The latest development builds of the compilers were tested on an
x86_64 machine, targeting both 32- and 64-bit machines. Because the goal is to
find miscompilations arising from optimizations, the common optimization
configurations were all tested: &lt;code&gt;-O0&lt;&#x2F;code&gt;, &lt;code&gt;-O1&lt;&#x2F;code&gt;, &lt;code&gt;-Os&lt;&#x2F;code&gt;, &lt;code&gt;-O2&lt;&#x2F;code&gt;, &lt;code&gt;-O3&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;What seed programs will be profiled and pruned?&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
&lt;br &#x2F;&gt;
Some seed programs were taken from the GCC, LLVM, and &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kframework&#x2F;c-semantics&quot;&gt;KCC&lt;&#x2F;a&gt; regression test
suites. The authors report attempting to use tests from open-source projects,
but were unable to reduce and interpret the resulting bugs.&lt;br &#x2F;&gt;
&lt;br &#x2F;&gt;
The bulk of the bugs were found by starting from randomly generated Csmith
programs, likely because each consisted of, on average, thousands of lines of
code with a high proportion of unexecuted lines.&lt;br &#x2F;&gt;
&lt;br &#x2F;&gt;
Though the compiler test programs were verified to be correct by experts,
there was no one verifying that the random Csmith programs produced correct
output. Only equivalence (preserved by the pruning process) is necessary to
ensure EMI variants are able to detect bugs, greatly increasing the pool of seed
programs.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;What parameters will guide the pruning process?&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
&lt;br &#x2F;&gt;
Orion generated a random number of variants for each seed program, eight in
expectation. The two parameters that control how likely a given statement is to
be pruned were each redrawn independently, uniformly between 0 and 1, after each
pruning.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;What will be done with bugs once any have been found?&lt;&#x2F;em&gt;&lt;br &#x2F;&gt;
&lt;br &#x2F;&gt;
The authors used a combination of &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2254104&quot;&gt;C-reduce&lt;&#x2F;a&gt; and &lt;a href=&quot;http:&#x2F;&#x2F;delta.tigris.org&#x2F;&quot;&gt;Berkeley Delta&lt;&#x2F;a&gt; to shrink the size
of EMI programs that generated different outputs. They attempted to reject
programs that triggered undefined behavior by using compiler warnings, static
analysis, and &lt;a href=&quot;http:&#x2F;&#x2F;compcert.inria.fr&quot;&gt;CompCert&lt;&#x2F;a&gt;. The final step was reporting the bugs using the
compilers&#x27; transparent bug-tracking tools.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;With this context, on to the headline result:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Orion found 147 confirmed, unique bugs in GCC and LLVM over the course of
eleven months in 2013.&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Le &amp;amp; al. evaluate these bugs in a twofold way: 1) quantitative description of
components affected by bugs, and 2) qualitative evaluation of about ten
generated programs.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;quantitative-description&quot;&gt;Quantitative description&lt;&#x2F;h4&gt;
&lt;p&gt;A major strength of the evaluation is its integration with the bug reporting
workflows for GCC and LLVM. The authors perhaps go too far in asserting,
&amp;quot;First, most of our reported bugs have been confirmed and fixed by developers,
illustrating their relevance and importance (as it often takes substantial
effort to fix a miscompilation).&amp;quot; Still, outside experts confirmed 182 of the
195 reported bugs. Of those, 35 were marked as duplicates, which leaves the 147
unique bugs in the headline result, evidence that EMI is a viable bug-finding strategy.&lt;&#x2F;p&gt;
&lt;p&gt;95 of the confirmed bugs were miscompilations, lending credence to the authors&#x27;
initial claim that Orion is able to target miscompilations more easily than
Csmith alone can. The development trunks of both GCC and LLVM yielded the most
bugs. The number of bugs found also grew with the optimization level, with
the most under &lt;code&gt;-O3&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The authors also found performance bugs through comparing compilers to each
other, in a differential testing scenario similar to that used by Csmith. 19 of
the 147 confirmed bugs were performance issues.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s important to note that these were only the bugs that were found by Orion.
Because Orion specifically targeted optimization phases, it is understandable
that GCC Tree Optimization and RTL optimization were the components with the
most discovered bugs (LLVM developers did not classify the reported bugs). These
components did not necessarily have more bugs than others, nor were these the
only possible bugs.&lt;&#x2F;p&gt;
&lt;p&gt;The authors do not make an attempt to evaluate the search space that Orion
explored in producing these reported bugs. Nor do they explicitly determine the
proportion of the generated variants that led to identified bugs. They only
report that they didn&#x27;t record how many seed programs they started with or how
many variants they generated (merely estimating &amp;quot;millions to tens of millions&amp;quot;).
They also do not report (and likely did not record) the Csmith configurations or
Orion&#x27;s dynamic pruning parameters.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;qualitative-examples&quot;&gt;Qualitative examples&lt;&#x2F;h4&gt;
&lt;p&gt;The confirmed bugs are said to span compiler segfaults, internal compiler
errors, performance problems, and wrong code generation. The authors present and
interpret a handful of the bugs that were confirmed and fixed by compiler
developers. We highlight just two of those for a flavor of the generated
programs. Note that the authors only show the reduced code they reported to
compiler developers; they show neither the non-reduced versions nor the EMI
variants.&lt;&#x2F;p&gt;
&lt;p&gt;The following example led to a segfault when compiled with GCC due to a wrong
offset computation in an optimization pass called &amp;quot;predictive commoning,&amp;quot; a form
of common subexpression elimination:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b, f, d[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;][&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;unsigned int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; c&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(d[b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;][c] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; d[b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;][c])
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(f)
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;break&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Clang incorrectly vectorized the following code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  lbl:
    a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    b&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;--&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(b) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;goto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lbl;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;current-statistics&quot;&gt;Current statistics&lt;&#x2F;h4&gt;
&lt;p&gt;The website for the &lt;a href=&quot;https:&#x2F;&#x2F;web.cs.ucdavis.edu&#x2F;%7Esu&#x2F;emi-project&#x2F;&quot;&gt;EMI project&lt;&#x2F;a&gt; shows the number of bugs found and fixed
by tools using the EMI methodology.
It lists well over a thousand bugs found in GCC and LLVM and shows that
the technique is also useful for compilers for other languages,
such as Scala.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Compiler&lt;&#x2F;th&gt;&lt;th&gt;Bugs reported&lt;&#x2F;th&gt;&lt;th&gt;Bugs fixed&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;GCC&#x2F;LLVM&lt;&#x2F;td&gt;&lt;td&gt;1,602&lt;&#x2F;td&gt;&lt;td&gt;1,007&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;ICC&lt;&#x2F;td&gt;&lt;td&gt;35&lt;&#x2F;td&gt;&lt;td&gt;?&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CompCert&lt;&#x2F;td&gt;&lt;td&gt;31&lt;&#x2F;td&gt;&lt;td&gt;27&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;scala&#x2F;dotty&lt;&#x2F;td&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h2&gt;
&lt;p&gt;The remaining examples in the paper cover problems with jump-threading logic,
global value numbering, inlining, vectorization, and performance. Because the
authors analyze only a few cherry-picked examples, a question remains: Are these
eight examples representative of all of the other bugs?&lt;&#x2F;p&gt;
&lt;p&gt;Additionally, the authors claim: &amp;quot;EMI variants generated from existing code, say
via Orion, are likely programs that people actually write.&amp;quot;
Is this true, especially when random programs are used as seeds?
Is this even true of the two examples discussed above?&lt;&#x2F;p&gt;
&lt;p&gt;The results indicate that the kind of seed program heavily influences the number
of bugs that are discovered. Randomly-generated Csmith seed programs revealed
far more bugs than those taken from compiler test suites and open-source project
tests. This suggests that EMI should be used in conjunction with an existing
fuzzer. Do other fuzzers provide amenable seed programs?&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the authors tout EMI as a general validation technique that can be
used to differentially test applications such as compilers for other languages.
Do you think this methodology will be as useful for other applications as it
is for testing C compilers?&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Finding and Understanding Bugs in C Compilers</title>
                <pubDate>Thu, 21 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bug-finding/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bug-finding/</guid>
                <description>&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;Csmith is a tool that helps uncover bugs in compilers through random testing, also known as &lt;em&gt;fuzzing&lt;&#x2F;em&gt;. Given multiple C compilers that implement the C standard, we can run programs through all the compilers and compare their output.&lt;&#x2F;p&gt;
&lt;img src=&quot;random-testing.jpg&quot; style=&quot;width: 60%&quot;&gt;
&lt;p&gt;We can determine which compilers have a &amp;quot;bug&amp;quot; by taking a majority vote on the output. This can misattribute a bug to the wrong compiler, but any disagreement between compilers on a well-defined program is alarming in itself. This kind of testing allows a relatively quick search of the program space for programs that are very likely to trigger compiler bugs.&lt;&#x2F;p&gt;
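&lt;p&gt;As a rough illustration of the voting step, the sketch below compares the outputs of three compilers on one test program and reports which one disagrees. The function name &lt;code&gt;odd_one_out&lt;&#x2F;code&gt;, the use of a single integer (for instance a checksum) as each compiler&#x27;s output, and the return codes are assumptions made for this example, not details of the Csmith test harness.&lt;&#x2F;p&gt;

```c
/* Hypothetical helper: each argument is the output (say, a checksum)
 * that one compiler's binary produced for the same test program.
 * Returns the index (0, 1, or 2) of the dissenting compiler,
 * -1 if all three agree, or -2 if all three differ (no majority). */
static int odd_one_out(int out0, int out1, int out2) {
    if (out0 == out1) {
        return (out1 == out2) ? -1 : 2;
    }
    if (out0 == out2) {
        return 1;
    }
    if (out1 == out2) {
        return 0;
    }
    return -2;
}
```

&lt;p&gt;A result of &lt;code&gt;-2&lt;&#x2F;code&gt; means the vote cannot assign blame at all, and even a clear majority can be wrong if two compilers share the same bug.&lt;&#x2F;p&gt;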
&lt;p&gt;&lt;em&gt;Discussion Question: Is this kind of random testing still useful in the face of formally verified compilers, e.g., CompCert?&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;examples-of-wrong-code&quot;&gt;Examples of Wrong Code&lt;&#x2F;h2&gt;
&lt;p&gt;To give a more concrete idea of what kinds of programs we can expect to find bugs in, the authors provide some actual bugs that were found. Generally, these bugs occur when the compiler performs optimizations on programs.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Wrong Safety Check:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;(x == c1) || (x &amp;lt; c2)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;is equivalent to&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x &amp;lt; c2
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;when &lt;code&gt;c1&lt;&#x2F;code&gt; and &lt;code&gt;c2&lt;&#x2F;code&gt; are constants and &lt;code&gt;c1 &amp;lt; c2&lt;&#x2F;code&gt;. However, Csmith found an example in LLVM where &lt;code&gt;(x == 0) || (x &amp;lt; -3)&lt;&#x2F;code&gt; was transformed to &lt;code&gt;(x &amp;lt; -3)&lt;&#x2F;code&gt;, even though it&#x27;s not the case that &lt;code&gt;0 &amp;lt; -3&lt;&#x2F;code&gt;. The issue was that LLVM checked &lt;code&gt;c1 &amp;lt; c2&lt;&#x2F;code&gt; with an unsigned comparison, under which &lt;code&gt;-3&lt;&#x2F;code&gt; is a very large value, so the transformation appeared safe.&lt;&#x2F;p&gt;
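&lt;p&gt;To make the signed versus unsigned distinction concrete, here is a small sketch. All four function names are hypothetical, and &lt;code&gt;precondition_unsigned&lt;&#x2F;code&gt; is only an assumed reconstruction of the faulty check, not LLVM&#x27;s actual code. The comparison &lt;code&gt;c2 &amp;gt; c1&lt;&#x2F;code&gt; is the same test as &lt;code&gt;c1 &amp;lt; c2&lt;&#x2F;code&gt; with the operands swapped.&lt;&#x2F;p&gt;

```c
/* The rewrite drops the (x == c1) test, so it is only valid when
 * c1 is already covered by the other test, i.e. when c2 > c1. */
static int precondition_signed(int c1, int c2) {
    return c2 > c1;
}

/* Assumed reconstruction of the faulty check: the same comparison
 * done on unsigned values, where -3 converts to a huge number. */
static int precondition_unsigned(int c1, int c2) {
    return (unsigned int)c2 > (unsigned int)c1;
}

/* The original expression for c1 = 0 and c2 = -3. */
static int original(int x) {
    return (x == 0) || (-3 > x);
}

/* The expression after the (incorrect) rewrite. */
static int rewritten(int x) {
    return -3 > x;
}
```

&lt;p&gt;With &lt;code&gt;c1 = 0&lt;&#x2F;code&gt; and &lt;code&gt;c2 = -3&lt;&#x2F;code&gt;, the signed precondition fails while the unsigned one holds, and the two expressions disagree exactly at &lt;code&gt;x == 0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;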
&lt;p&gt;&lt;em&gt;Wrong Analysis:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:  static int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; g[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:  static int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;g[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:  static int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;q &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;g[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:  int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;foo (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;      g[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:      *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:      *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;q;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:      return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; g[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, GCC evaluated this program to 1 instead of 0. The reason is that &lt;code&gt;q&lt;&#x2F;code&gt; was accidentally marked as read only, so the compiler concluded that &lt;code&gt;p&lt;&#x2F;code&gt; and &lt;code&gt;q&lt;&#x2F;code&gt; can&#x27;t alias (even though they really do). Line 7 then looks like a dead store because line 8 simply overwrites &lt;code&gt;*p&lt;&#x2F;code&gt;, a conclusion that is only safe when &lt;code&gt;p&lt;&#x2F;code&gt; and &lt;code&gt;q&lt;&#x2F;code&gt; don&#x27;t alias. So line 7 was removed by dead store elimination.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Wrong Analysis:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: void &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;foo(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:   int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:   for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:     if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;continue&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:     if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;break&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:   &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;%d&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, x);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compiling with LLVM caused this program to print &lt;code&gt;1&lt;&#x2F;code&gt; instead of &lt;code&gt;5&lt;&#x2F;code&gt;. A loop optimization called &amp;quot;scalar evolution analysis&amp;quot; evaluates loop properties like induction variables and maximum number of iterations. The optimization saw line 5 and incorrectly determined that the loop runs once, meaning &lt;code&gt;x&lt;&#x2F;code&gt; must evaluate to &lt;code&gt;1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I find this kind of program interesting because very few programmers in practice would end up writing a function that uses a continue and a break guarded by the same condition. However, when testing optimizations that compute properties of programs, it seems reasonable to test code that never actually runs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-goals&quot;&gt;Design Goals&lt;&#x2F;h2&gt;
&lt;p&gt;The authors provide two design goals for Csmith:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Every randomly generated program must be well formed and have a single interpretation based on the C standard.&lt;&#x2F;li&gt;
&lt;li&gt;Maximize &amp;quot;expressiveness&amp;quot;. Expressiveness is the idea that the generated programs should use a wide variety of language features and combinations of them. For example, Csmith allows for programs with function definitions, global definitions, most C expressions and statements, control flow, arrays, pointers, and more. Some language features that are disallowed include dynamic memory allocation, recursion, and function pointers.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;It is important to avoid undefined behavior because by definition, compilers are allowed to produce different outputs for the same program in the presence of undefined behavior. In an extreme case, consider this program from &lt;a href=&quot;https:&#x2F;&#x2F;kristerw.blogspot.com&#x2F;2017&#x2F;09&#x2F;why-undefined-behavior-may-call-never.html&quot;&gt;this blog post:&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;#include &amp;lt;cstdlib&amp;gt;

typedef int (*Function)();

static Function Do;

static int EraseAll() {
  return system(&amp;quot;rm -rf &#x2F;&amp;quot;);
}

void NeverCalled() {
  Do = EraseAll;
}

int main() {
  return Do();
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When compiling with &lt;code&gt;clang&lt;&#x2F;code&gt; with no optimizations, the program segfaults. However, when compiling with &lt;code&gt;-O1&lt;&#x2F;code&gt;, the program will run &lt;code&gt;rm -rf &#x2F;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Interestingly, design goal #2 is in direct opposition to design goal #1. The more features we allow to be tested (like arrays and pointers), the easier it is to cause undefined behavior. First, we will discuss at a high level how programs are randomly generated. Then, we will explore how Csmith ensures the generated code doesn&#x27;t exhibit undefined behavior.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;randomly-generating-programs&quot;&gt;Randomly Generating Programs&lt;&#x2F;h2&gt;
&lt;p&gt;To generate programs, Csmith first generates types that can be used later during program generation. Then, starting from &lt;code&gt;main&lt;&#x2F;code&gt;, it fills in the program as the authors describe:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Csmith selects a random production from the grammar, based on the current context, with potentially non-uniform probability. For example, when inside a function body, Csmith can choose things like &lt;code&gt;if&lt;&#x2F;code&gt; and &lt;code&gt;for&lt;&#x2F;code&gt; statements, but not a &lt;code&gt;continue&lt;&#x2F;code&gt; if the code is not in a loop.
&lt;ul&gt;
&lt;li&gt;Csmith operates on a subset of C. Interestingly, it doesn&#x27;t seem like the authors have a grammar explicitly written down for this subset.&lt;&#x2F;li&gt;
&lt;li&gt;For productions that require a &lt;em&gt;target&lt;&#x2F;em&gt; (e.g., a variable), Csmith can randomly decide to generate a new target or use an existing one.&lt;&#x2F;li&gt;
&lt;li&gt;The same goes for productions requiring a type.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;When choosing a nonterminal production, Csmith will recursively choose productions given the appropriate context. For example, if a function call production is chosen, then expressions for the parameters must be generated.&lt;&#x2F;li&gt;
&lt;li&gt;Csmith implements an interprocedural pointer analysis, which needs to keep track of &amp;quot;points-to facts&amp;quot; about the program. Csmith keeps the set of points-to facts up to date while it generates programs. A more detailed explanation can be found on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;csmith-project&#x2F;csmith&#x2F;blob&#x2F;master&#x2F;doc&#x2F;pa.txt&quot;&gt;GitHub&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Csmith uses a set of &lt;em&gt;safety checks&lt;&#x2F;em&gt; to make sure the newly generated bit of code is well formed, following design goal #1. If a safety check fails, the changes are undone and Csmith tries again.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
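&lt;p&gt;The context-dependent choice in step 1 can be sketched as follows. This is a toy with a four-statement grammar; the names &lt;code&gt;stmt_kind&lt;&#x2F;code&gt; and &lt;code&gt;pick_statement&lt;&#x2F;code&gt; are made up for the example, and the choice is uniform for simplicity, whereas Csmith may weight productions differently.&lt;&#x2F;p&gt;

```c
/* Statement kinds in a toy grammar. STMT_CONTINUE is listed last so
 * that it can be excluded by shrinking the range of choices. */
enum stmt_kind { STMT_ASSIGN, STMT_IF, STMT_FOR, STMT_CONTINUE };

/* Map a random number onto the productions allowed in this context:
 * continue is only a candidate when we are inside a loop. */
static enum stmt_kind pick_statement(unsigned int roll, int in_loop) {
    unsigned int allowed = in_loop ? 4u : 3u;
    return (enum stmt_kind)(roll % allowed);
}
```

&lt;p&gt;A real generator would recurse from here, for instance generating a condition and a body after picking &lt;code&gt;STMT_IF&lt;&#x2F;code&gt;, and would run its safety checks on the result.&lt;&#x2F;p&gt;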
&lt;img src=&quot;program-generation.png&quot; style=&quot;width: 70%&quot;&gt;
&lt;h2 id=&quot;safety-mechanisms&quot;&gt;Safety Mechanisms&lt;&#x2F;h2&gt;
&lt;p&gt;A generated program can crash for various reasons. The cause might be a compiler bug, but more often it is that the program itself is unsafe. Csmith avoids these pitfalls with dedicated safety mechanisms.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discussion Question: Think about what safety issues can be caused by the randomly generated programs.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A summary of all mechanisms is shown below. We will provide detailed explanations for each.&lt;&#x2F;p&gt;
&lt;img src=&quot;safety_mechanism.jpg&quot; style=&quot;width: 70%&quot;&gt;
&lt;h3 id=&quot;integer-safety&quot;&gt;Integer Safety&lt;&#x2F;h3&gt;
&lt;p&gt;The safety problem of integers comes from undefined behaviors (UB) such as signed overflow:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;signedOverflow&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; either true or UB due to signed overflow
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and shift-past-bitwidth:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;shiftPastBitwidth&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; UB when int is 32 bits wide
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Csmith generates wrapper functions for arithmetic operators whose operands might trigger undefined behavior according to the C standard. Some tricky cases are not spelled out clearly in the standard, and the authors had to work them out themselves (such as &lt;code&gt;INT_MIN % -1&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
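&lt;p&gt;A minimal sketch of what such a wrapper might look like for the modulo operator is shown below. The name &lt;code&gt;safe_mod&lt;&#x2F;code&gt; and the fallback values are assumptions chosen for this example; the wrappers that Csmith actually emits may differ.&lt;&#x2F;p&gt;

```c
/* Hypothetical wrapper around the % operator that never triggers
 * undefined behavior, whatever the operands are. */
static int safe_mod(int a, int b) {
    if (b == 0) {
        return a;   /* division by zero: fall back to the left operand */
    }
    if (b == -1) {
        return 0;   /* covers INT_MIN % -1; any value modulo -1 is 0 */
    }
    return a % b;
}
```

&lt;p&gt;What matters for differential testing is that every compiler sees the same well-defined result, so the particular fallback value is a free choice.&lt;&#x2F;p&gt;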
&lt;h3 id=&quot;type-safety&quot;&gt;Type Safety&lt;&#x2F;h3&gt;
&lt;p&gt;The tricky aspect of C&#x27;s type safety is qualifier safety. Modifying constant-qualified or volatile-qualified objects through nonconstant references is &lt;a href=&quot;https:&#x2F;&#x2F;wiki.sei.cmu.edu&#x2F;confluence&#x2F;display&#x2F;c&#x2F;EXP40-C.+Do+not+modify+constant+objects&quot;&gt;undefined behavior&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const int **&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ipp; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; the value pointed to shall not be modified
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ip;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;constViolation&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  ipp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ip; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; UB
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ipp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Valid
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ip &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;   &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Modifies constant i (was 42)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Csmith ensures type safety with type checks.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pointer-safety&quot;&gt;Pointer Safety&lt;&#x2F;h3&gt;
&lt;p&gt;The first kind of pointer safety problem is null pointer dereference.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;nullDereference&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; undefined behavior (typically a crash) if p is NULL
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can be easily avoided by dynamic checks.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;safeDereference&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    	&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a; 
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, there is no reliable dynamic check that identifies an invalid pointer to a function-scoped variable, that is, a pointer that still refers to a local variable after its function has returned.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;invalidDereference&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;NULL&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;){
    	&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a; 
    }
}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; outside this function, we cannot dereference p or compare it with other pointers
&#x2F;&#x2F; before it becomes valid again!
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;em&gt;Discussion Question: What could be the solution here?&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One way is to force a pointer that points to a function-scoped variable to never outlive the function; however, this is too restrictive. Csmith instead chooses to do a pointer analysis that is flow sensitive, field sensitive, context sensitive, path insensitive, and array-element insensitive. It maintains points-to sets that contain the explicit locations (including null and out-of-scope elements) that a pointer may reference, and it checks them before each dereference.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;effect-safety&quot;&gt;Effect Safety&lt;&#x2F;h3&gt;
&lt;p&gt;In the C99 standard, &amp;quot;the order of evaluation of the function designator, the actual arguments, and subexpressions within the actual arguments&amp;quot; is unspecified, and behavior is undefined if &amp;quot;between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored.&amp;quot;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;undefinedExecutionSequence&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
	some_func(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;first&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;), &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;second&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; which printf is called first?
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;	i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= ++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; what is i?
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Csmith conservatively analyzes the &lt;em&gt;effect&lt;&#x2F;em&gt; of every expression. An effect contains the &lt;em&gt;may-read&lt;&#x2F;em&gt; and &lt;em&gt;may-write&lt;&#x2F;em&gt; locations of the expression. The same location cannot be both read and written, or written twice, except for assignment. The analysis is done incrementally: for newly generated code, a check decides whether to keep or abandon it. For example, when generating &lt;code&gt;p + funcMayModifyP()&lt;&#x2F;code&gt;, Csmith will abandon &lt;code&gt;funcMayModifyP()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
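&lt;p&gt;The conflict check can be sketched roughly as follows, assuming each memory location is identified by a small integer id. The &lt;code&gt;effect&lt;&#x2F;code&gt; struct, the fixed capacity &lt;code&gt;MAX_LOCS&lt;&#x2F;code&gt;, and the helper names are all hypothetical; Csmith&#x27;s own data structures are not described in this post.&lt;&#x2F;p&gt;

```c
enum { MAX_LOCS = 4 };

/* Hypothetical effect record: ids of the locations an expression
 * may read and may write. */
struct effect {
    int reads[MAX_LOCS];
    int n_reads;
    int writes[MAX_LOCS];
    int n_writes;
};

/* Returns 1 if id occurs among the first n entries of ids. */
static int contains(const int *ids, int n, int id) {
    int i;
    for (i = 0; i != n; i++) {
        if (ids[i] == id) {
            return 1;
        }
    }
    return 0;
}

/* Two subexpressions evaluated without a sequence point between them
 * conflict if one may write a location the other may read or write. */
static int conflicts(const struct effect *a, const struct effect *b) {
    int i;
    for (i = 0; i != a->n_writes; i++) {
        if (contains(b->reads, b->n_reads, a->writes[i]) ||
            contains(b->writes, b->n_writes, a->writes[i])) {
            return 1;
        }
    }
    for (i = 0; i != b->n_writes; i++) {
        if (contains(a->reads, a->n_reads, b->writes[i])) {
            return 1;
        }
    }
    return 0;
}
```

&lt;p&gt;In the &lt;code&gt;p + funcMayModifyP()&lt;&#x2F;code&gt; case, the left operand may read &lt;code&gt;p&lt;&#x2F;code&gt; and the call may write it, so a check like this reports a conflict and the generator would discard the call.&lt;&#x2F;p&gt;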
&lt;h3 id=&quot;array-safety&quot;&gt;Array Safety&lt;&#x2F;h3&gt;
&lt;p&gt;Out-of-bounds accesses are a constant safety concern for arrays.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;outOfBound&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
    a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; out of boundary
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The example above is simple to avoid, but when the index of &lt;code&gt;a&lt;&#x2F;code&gt; is a variable, it is hard to tell whether it is in bounds. Where possible, Csmith uses &lt;code&gt;for&lt;&#x2F;code&gt; loop counters as index variables and ensures that the &lt;code&gt;for&lt;&#x2F;code&gt; loop never exceeds the array bound. For arbitrary index variables, Csmith applies the modulo operator. If neither technique works (for example, when the array length increases), Csmith emits explicit checks against array lengths.&lt;&#x2F;p&gt;
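&lt;p&gt;The three techniques can be sketched in C as follows. The array length, the index value, and the unsigned index type (which keeps the modulo result non-negative) are assumptions made for this illustration; this is not actual Csmith output.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
#include &amp;lt;stdio.h&amp;gt;

#define LEN 4

int main(void) {
    int a[LEN] = {0};
    unsigned idx = 11;                 &#x2F;* arbitrary index value *&#x2F;

    &#x2F;* 1. A for loop counter used as the index never reaches LEN. *&#x2F;
    for (unsigned i = 0; i &amp;lt; LEN; i++)
        a[i] = (int)i;

    &#x2F;* 2. An arbitrary index is reduced modulo the array length. *&#x2F;
    a[idx % LEN] = 10;

    &#x2F;* 3. Otherwise, guard the access with an explicit length check. *&#x2F;
    if (idx &amp;lt; LEN)
        a[idx] = 20;

    printf(&amp;quot;%d %d %d %d\n&amp;quot;, a[0], a[1], a[2], a[3]);
    return 0;
}
&lt;&#x2F;pre&gt;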
&lt;h3 id=&quot;initializer-safety&quot;&gt;Initializer Safety&lt;&#x2F;h3&gt;
&lt;p&gt;Csmith initializes variables right after declaration and bans &lt;code&gt;goto&lt;&#x2F;code&gt; statements, so that execution proceeds in order and cannot jump past an initializer.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discussion Question: What kinds of programs are left out by the current design of program generation and its safety constraints?&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;global-safety&quot;&gt;Global Safety&lt;&#x2F;h3&gt;
&lt;p&gt;Because Csmith generates code incrementally, newly generated code can make previously generated code unsafe.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;incrementallyGeneratedUnsafeProgram&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;while &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; 
        p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; unsafe because of the back-edge
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Csmith checks safety at each newly generated line, except inside loops. Loops are checked at the end, when the back-edge is logically created. If unsafe lines appear, Csmith deletes code line by line until the safety requirements are satisfied.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-trade-offs&quot;&gt;Design Trade-offs&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;allow-implementation-defined-behavior&quot;&gt;Allow Implementation-defined Behavior&lt;&#x2F;h3&gt;
&lt;p&gt;Implementation-defined behavior is &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Unspecified_behavior&quot;&gt;unspecified behavior&lt;&#x2F;a&gt; that each implementation must document; it may vary across implementations of a programming language. The Csmith designers believe it is unrealistic to &amp;quot;retain a single interpretation across all possible choices of implementation-defined behaviors&amp;quot;. They allow Csmith-generated programs to produce different outputs when compilers differ in the following implementation-defined behaviors:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;the widths and representations of integers.&lt;&#x2F;li&gt;
&lt;li&gt;behavior when casting to a signed integer type when the value cannot be represented in an object of the target type.&lt;&#x2F;li&gt;
&lt;li&gt;results of bitwise operations on signed integers. &lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;There are roughly three equivalence classes of compiler targets:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Targets&lt;&#x2F;th&gt;&lt;th&gt;&lt;code&gt;int&lt;&#x2F;code&gt; bit length&lt;&#x2F;th&gt;&lt;th&gt;&lt;code&gt;long&lt;&#x2F;code&gt; bit length&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;x86-64 (ARMv8)&lt;&#x2F;td&gt;&lt;td&gt;32&lt;&#x2F;td&gt;&lt;td&gt;64&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;x86, ARM, and PowerPC&lt;&#x2F;td&gt;&lt;td&gt;32&lt;&#x2F;td&gt;&lt;td&gt;32&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;MSP430 and AVR&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;td&gt;32&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Csmith compares results within each equivalence class, but not across classes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-ground-truth&quot;&gt;No Ground Truth&lt;&#x2F;h3&gt;
&lt;p&gt;Csmith has no ground truth, since it is unrealistic to have a human check each program. Instead, it takes a majority vote among the compilers under test. According to the authors, two unrelated compilers have never in practice produced the same incorrect output. Their explanation is the substantial diversity among compilers&#x27; intermediate representations (IRs).&lt;&#x2F;p&gt;
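&lt;p&gt;The voting step can be sketched in C as below. It assumes each compiled program reports a single checksum value; the function name and the sample checksums are hypothetical, and the two-pass majority search is one possible way to implement the vote, not necessarily how the Csmith test harness does it.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
#include &amp;lt;stddef.h&amp;gt;
#include &amp;lt;stdint.h&amp;gt;

&#x2F;* Finds the checksum reported by a strict majority of compilers.
   Returns 1 and stores it in *out, or returns 0 if there is no majority. *&#x2F;
static int majority_checksum(const uint32_t *sums, size_t n, uint32_t *out) {
    uint32_t candidate = 0;
    size_t count = 0;
    for (size_t i = 0; i &amp;lt; n; i++) {        &#x2F;* voting pass *&#x2F;
        if (count == 0) { candidate = sums[i]; count = 1; }
        else if (sums[i] == candidate) count++;
        else count--;
    }
    count = 0;
    for (size_t i = 0; i &amp;lt; n; i++)          &#x2F;* verification pass *&#x2F;
        if (sums[i] == candidate) count++;
    if (2 * count &amp;lt;= n) return 0;
    *out = candidate;
    return 1;
}

int main(void) {
    &#x2F;* Hypothetical checksums of one test program under four compilers. *&#x2F;
    uint32_t sums[] = { 0xBEEF, 0xBEEF, 0x1234, 0xBEEF };
    uint32_t winner;
    if (!majority_checksum(sums, 4, &amp;amp;winner)) return 2;
    &#x2F;* A compiler that disagrees with the majority is flagged as suspect. *&#x2F;
    return sums[2] != winner ? 0 : 1;
}
&lt;&#x2F;pre&gt;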
&lt;h3 id=&quot;no-guarantee-of-termination&quot;&gt;No Guarantee of Termination&lt;&#x2F;h3&gt;
&lt;p&gt;A Csmith-generated program may run for an arbitrarily long time. In practice, a timeout terminates any program that takes too long to finish.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;target-middle-end-bugs&quot;&gt;Target Middle-End Bugs&lt;&#x2F;h3&gt;
&lt;p&gt;Csmith focuses on how compilers transform IRs rather than on standards conformance, which is the focus of commercial test suites. For instance, Csmith does not spend effort testing how compilers handle long identifier names.&lt;&#x2F;p&gt;
&lt;p&gt;Several design choices follow from this target:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Manually tuning the 80 probability variables so that Csmith generates programs with a balanced mix of arithmetic and bitwise operations, of loops and straight-line code, of single-level and multi-level indirections, etc.&lt;&#x2F;li&gt;
&lt;li&gt;Encouraging Csmith to generate idiomatic code, e.g., loops that access all elements of an array.&lt;&#x2F;li&gt;
&lt;li&gt;Discouraging Csmith from generating source-level diversity that is unlikely to improve the “coverage” of a compiler’s IR, e.g., additional levels of parentheses around expressions.&lt;&#x2F;li&gt;
&lt;li&gt;Making Csmith efficient enough to generate runnable programs of a few tens of thousands of lines in a few seconds.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;opportunistic-bug-finding&quot;&gt;Opportunistic Bug Finding&lt;&#x2F;h3&gt;
&lt;p&gt;The Csmith designers tested 11 compilers and reported the errors to the developers. The commercial compiler developers did not care, while the GCC and LLVM teams responded quickly. By the time the paper was finalized, 79 GCC bugs and 202 LLVM bugs (2% of all LLVM bug reports) had been reported, and most of them had been fixed. CompCert stood out: its under-development version is the only compiler in which Csmith found just 2 wrong-code errors, even after 6 CPU-years of testing.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bug-types&quot;&gt;Bug Types&lt;&#x2F;h3&gt;
&lt;p&gt;Before we move on, I would like to introduce the bug types needed to understand the experimental results:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;A &lt;em&gt;crash error&lt;&#x2F;em&gt; is one where the compiler crashes during compilation or exits with a non-zero exit code.&lt;&#x2F;li&gt;
&lt;li&gt;A &lt;em&gt;wrong-code error&lt;&#x2F;em&gt; is one where the compiled program, at run time, produces wrong results, crashes, exits abnormally, or never terminates.&lt;&#x2F;li&gt;
&lt;li&gt;A &lt;em&gt;silent wrong-code error&lt;&#x2F;em&gt; is a wrong-code error for which the compiler produces no warning during compilation.&lt;&#x2F;li&gt;
&lt;li&gt;An &lt;em&gt;assertion failure&lt;&#x2F;em&gt; is an LLVM internal consistency check failure.&lt;&#x2F;li&gt;
&lt;li&gt;An &lt;em&gt;internal compiler failure&lt;&#x2F;em&gt; is a GCC internal consistency check failure.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;quantitative-comparison-of-gcc-and-llvm-versions&quot;&gt;Quantitative Comparison of GCC and LLVM Versions&lt;&#x2F;h3&gt;
&lt;p&gt;The following figure shows the compilation and execution results of LLVM 1.9–2.8, GCC 3.[0–4].0, and GCC 4.[0–5].0 on 1,000,000 randomly generated Csmith programs. Every program was compiled at -O0, -O1, -O2, -Os, and -O3. A test case is valid if every compiler terminated within five minutes and every compiled program terminated within five seconds. The top and bottom rows show the error rates of the different compiler versions. The authors also tracked down the sources of the compilation errors and plotted them in the middle row.&lt;&#x2F;p&gt;
&lt;img src=&quot;llvm_gcc.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;h3 id=&quot;bug-finding-performance-as-a-function-of-test-case-size&quot;&gt;Bug-Finding Performance as a Function of Test-Case Size&lt;&#x2F;h3&gt;
&lt;p&gt;Csmith is designed to find many defects quickly, which raises the question of how large the generated programs should be. When reporting errors, the authors preferred smaller programs over larger ones because they are easier to debug and report. The figure below shows an experiment that measures how the number of errors found varies with program size when the total testing time is held fixed.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discussion Questions:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Is there a big benefit in using small vs. large programs as tests?&lt;&#x2F;li&gt;
&lt;li&gt;Most tests we write are relatively small; are there bugs that can only be caught with really large programs?&lt;&#x2F;li&gt;
&lt;li&gt;How small would our program space need to be in order to &amp;quot;exhaustively&amp;quot; search for bugs?&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;img src=&quot;size_error.jpg&quot; style=&quot;width: 60%&quot;&gt;
&lt;h3 id=&quot;bug-finding-performance-compared-to-other-tools&quot;&gt;Bug-Finding Performance Compared to Other Tools&lt;&#x2F;h3&gt;
&lt;p&gt;The authors also compare Csmith against other bug-finding tools, listed in the following table.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Tools&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Randprog (Turner 2005, Eide 2008)&lt;&#x2F;td&gt;&lt;td&gt;Csmith was forked from it. &lt;br&gt; No complex control flow or data structures such as pointers, arrays, and structs.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;DDT (McKeeman 1998)&lt;&#x2F;td&gt;&lt;td&gt;The first to propose differential testing and to handle undefined behavior (but only a small subset). &lt;br&gt; More expressive (can generate more legal programs).&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Quest (Lindig 2007)&lt;&#x2F;td&gt;&lt;td&gt;Self-checking, not differential testing. &lt;br&gt; Less expressive.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Given the same amount of bug-finding time, Csmith is much more effective than the existing testing tools.&lt;&#x2F;p&gt;
&lt;img src=&quot;other-tool.jpg&quot; style=&quot;width: 60%&quot;&gt;
&lt;h3 id=&quot;code-coverage&quot;&gt;Code Coverage&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, the authors measure code coverage for Csmith-generated programs, as shown in the following table.&lt;&#x2F;p&gt;
&lt;img src=&quot;coverage.jpg&quot; style=&quot;width: 70%&quot;&gt;
&lt;p&gt;Adding 10,000 Csmith-generated programs to the existing test suites of LLVM and GCC increased coverage only slightly.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Discussion Questions:&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The authors do not come up with a good explanation for the code coverage issue. What might be the reason? &lt;&#x2F;li&gt;
&lt;li&gt;Tests that are randomly generated will never be like tests that are created by humans. For example, the factorial function will almost never be randomly generated. Does this mean that this kind of testing is still useful? Why and how?&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
</description>
            </item>
        
            <item>
                <title>The Transmeta Code Morphing Software</title>
                <pubDate>Thu, 21 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/transmeta/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/transmeta/</guid>
                <description>&lt;p&gt;Today we are reading the 2003 paper on Transmeta CMS (Code Morphing
Software&lt;sup&gt;TM&lt;&#x2F;sup&gt;). The CMS layer ran x86 programs on the Transmeta
Corporation&#x27;s Crusoe microprocessor, which had an internal architecture that was
much simpler than an x86. While much of the terminology in the paper is
non-standard, I hope it was clear that CMS is a just-in-time (JIT) compiler for
x86 targeting Crusoe&#x27;s internal instruction set architecture (ISA).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-a-jit&quot;&gt;Why a JIT?&lt;&#x2F;h2&gt;
&lt;p&gt;How and when the computer industry settled on the x86 instruction set
architecture I do not know, but we can surmise from the engineering effort
expended by Transmeta that it happened before 2003. At the time, new
general-purpose processors needed to expose an x86 interface to programmers.&lt;&#x2F;p&gt;
&lt;p&gt;Faced with this task, Transmeta engineers could have gone the obvious route and
built an x86 clone in hardware. The paper argues for the internal ISA and CMS
technique as follows.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;This approach allows a simple, compact, low-power microprocessor
implementation, with the freedom to modify the internal ISA between
generations, while supporting the broad range of legacy x86 software
available.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To flip all these adjectives, the Crusoe designers recognized that a direct
x86 implementation like Intel&#x27;s was complicated, sprawling, and high-power. They
understood that Intel&#x27;s infamous commitment to backwards compatibility was good
for business but made it difficult to modify hardware when it might improve
maintainability, space usage, or power efficiency. Hiding their actual
architecture behind an x86 abstraction solved this problem.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-internal-isa&quot;&gt;The internal ISA&lt;&#x2F;h2&gt;
&lt;p&gt;Crusoe&#x27;s internal ISA is a VLIW (very long instruction word) ISA with 64
general-purpose registers and 32 floating-point registers, which is more than
x86 provides. In a VLIW instruction set like Crusoe&#x27;s, each instruction is really
several smaller instructions which are issued in parallel. In the terminology of
the paper, the large instructions are &amp;quot;molecules&amp;quot; composed of 2 or 4 &amp;quot;atoms&amp;quot;.
The internal ISA avoids handling pipeline stalls; instead, it expects the CMS
compiler to generate safe code by separating conflicting operations.&lt;&#x2F;p&gt;
&lt;p&gt;The hardware supports deoptimization by shadowing state and exposing &lt;code&gt;commit&lt;&#x2F;code&gt;
and &lt;code&gt;rollback&lt;&#x2F;code&gt; operations for copying live state to the shadowed state and
reverting to shadowed state respectively. In particular, every register has
a corresponding shadow register. All writes to memory are held in a gated store
buffer that is only flushed to main memory following a &lt;code&gt;commit&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
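&lt;p&gt;The commit and rollback semantics can be modeled in software. The C sketch below is only a toy model: the store buffer and memory sizes are assumed, all names are hypothetical, and the real mechanism is implemented in Crusoe hardware.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
#include &amp;lt;string.h&amp;gt;

#define NREGS 64    &#x2F;* general-purpose registers *&#x2F;
#define NBUF  16    &#x2F;* assumed store buffer capacity *&#x2F;
#define MEMSZ 256   &#x2F;* assumed memory size in words *&#x2F;

static struct {
    int regs[NREGS];          &#x2F;* working registers *&#x2F;
    int shadow[NREGS];        &#x2F;* shadow copies, updated only by commit *&#x2F;
    unsigned buf_addr[NBUF];  &#x2F;* gated store buffer: addresses *&#x2F;
    int buf_val[NBUF];        &#x2F;* gated store buffer: values *&#x2F;
    int nbuf;                 &#x2F;* number of buffered stores *&#x2F;
    int mem[MEMSZ];           &#x2F;* main memory *&#x2F;
} m;

&#x2F;* Buffers a store; returns 0 if the buffer is full or the address is invalid. *&#x2F;
static int store(unsigned addr, int value) {
    if (m.nbuf == NBUF || addr &amp;gt;= MEMSZ) return 0;
    m.buf_addr[m.nbuf] = addr;
    m.buf_val[m.nbuf] = value;
    m.nbuf++;
    return 1;
}

&#x2F;* Copies live registers to the shadow state and flushes buffered stores. *&#x2F;
static void commit(void) {
    memcpy(m.shadow, m.regs, sizeof m.regs);
    for (int i = 0; i &amp;lt; m.nbuf; i++)
        m.mem[m.buf_addr[i]] = m.buf_val[i];
    m.nbuf = 0;
}

&#x2F;* Restores registers from the shadow state and drops uncommitted stores. *&#x2F;
static void rollback(void) {
    memcpy(m.regs, m.shadow, sizeof m.regs);
    m.nbuf = 0;
}

int main(void) {
    m.regs[0] = 1;
    commit();          &#x2F;* checkpoint with regs[0] equal to 1 *&#x2F;
    m.regs[0] = 2;     &#x2F;* speculative register update *&#x2F;
    store(5, 42);      &#x2F;* speculative store, held in the buffer *&#x2F;
    rollback();        &#x2F;* a fault occurred: revert to the checkpoint *&#x2F;
    &#x2F;* Exit status 0 means neither speculative effect became visible. *&#x2F;
    return (m.regs[0] == 1 &amp;amp;&amp;amp; m.mem[5] == 0) ? 0 : 1;
}
&lt;&#x2F;pre&gt;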
&lt;h2 id=&quot;the-code-morphing-system&quot;&gt;The Code Morphing System&lt;&#x2F;h2&gt;
&lt;p&gt;The CMS includes a software x86 interpreter which runs programs accurately while
also monitoring performance statistics. Once it notices a particular code region
has run more than some threshold number of times, it stops interpreting, commits
the current state, and tries running a just-in-time compiled version of the code
region. The compiled code is stored in a &amp;quot;translation cache&amp;quot; or Tcache.&lt;&#x2F;p&gt;
&lt;p&gt;The JIT will reorder instructions in order to get an efficient schedule on the
VLIW architecture. This is necessary for performance: Figure 2 in the paper
shows a mean of 33% performance degradation across several applications when
reordering is disabled.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;exceptions-and-interrupts&quot;&gt;Exceptions and Interrupts&lt;&#x2F;h3&gt;
&lt;p&gt;Occasionally compiled code will encounter exceptions in the internal ISA.
Sometimes these exceptions are the result of speculative compilation (e.g.,
a reordering of instructions causing a memory fault) but sometimes they are
genuine exceptions which should be propagated up to the x86 layer (e.g.,
division by zero).&lt;&#x2F;p&gt;
&lt;p&gt;When Tcached code hits an exception, the CMS issues a rollback instruction to
restore architectural state to a previous checkpoint and tries interpreting the
region instead. If the exception goes away, CMS assumes it was due to
reordering. Otherwise it is a genuine x86 exception and gets propagated up to
the program.&lt;&#x2F;p&gt;
&lt;p&gt;Interrupts work similarly to exceptions, but CMS does not try retranslating the
region in which the interrupt occurs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;reordering-constraints&quot;&gt;Reordering constraints&lt;&#x2F;h3&gt;
&lt;p&gt;Consider a program that writes a 1 to address &lt;code&gt;x&lt;&#x2F;code&gt; and then reads from address
&lt;code&gt;y&lt;&#x2F;code&gt;. If these two pointers alias, it is unsafe to reorder the read to run before
the write. Similarly, if &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;y&lt;&#x2F;code&gt; are backed by a memory-mapped I&#x2F;O device,
reordering the operations would be unsafe because it would cause the program&#x27;s
I&#x2F;O behavior to change.&lt;&#x2F;p&gt;
&lt;p&gt;The CMS speculatively optimizes with reordering and handles these potential
issues by turning them into faults, which trigger deoptimization before anything
bad can happen. Reordered instructions are tagged to let the processor know they
were reordered. Special &amp;quot;alias hardware&amp;quot; does lightweight alias tracking at run
time and faults if there may be aliasing between two reordered operations.
Reordered operations that access I&#x2F;O address space also fault. The offending code
region is then recompiled without the reorderings.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;self-modifying-code&quot;&gt;Self-modifying code&lt;&#x2F;h3&gt;
&lt;p&gt;It is not uncommon for x86 programs to modify themselves. The paper observes
that it is a standard technique in games, embedded code, and Windows device drivers.
Following the approach of turning correctness issues into faults, Transmeta
could (and apparently at one point did) write-protect code pages to cause faults
and then fall back to interpreting the self-modifying code. Falling back to
the interpreter is a serious performance penalty for self-modifying programs, so
the paper includes a few techniques for handling self-modifying code.&lt;&#x2F;p&gt;
&lt;p&gt;Finer-grained write protection can help, since code is likely to be modified
only in a few places. Crusoe supports this and it gets some speedup over the
page granularity write protection approach.&lt;&#x2F;p&gt;
&lt;p&gt;Introducing &amp;quot;prologues&amp;quot; that check preconditions on translated code can also
work. The paper refers to this as self-validation and self-checking. The idea is
to tack a header onto the translated code which looks up the source page and
verifies that nothing has changed since it was translated. &lt;&#x2F;p&gt;
&lt;p&gt;Finally, it is possible to recognize common self-modifying code patterns and
compile them to ordinary static code. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The Transmeta CMS system is a compiler solution to a hardware problem. While
implementing a just-in-time compiler for x86 is subtle and difficult, the
performance and maintainability benefits for user code and the hardware seem
worth it.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Chlorophyll: Synthesis-Aided Compiler for Low-Power Spatial Architectures</title>
                <pubDate>Wed, 20 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/chlorophyll/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/chlorophyll/</guid>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Energy efficiency has become increasingly important for embedded processors and internet of things (IoT) systems. Many spatial hardware architectures have been proposed to address the power consumption problem without compromising performance. GreenArrays 144 (GA144) is one example of a low-power spatial processor. GA144 is a stack-based 18-bit processor with neither shared memory nor a fixed clock. To program GA144, users have to write in an assembly-like low-level programming language &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;#1&quot;&gt;[1]&lt;&#x2F;a&gt; to define the data storage, data movement, and communication logic that make use of the spatial architecture. This paper proposes a programming model and a synthesis-aided compiler that compiles high-level C-like programs into binary executables for GA144. The compiler-optimized programs can achieve results comparable to expert-written low-level programs.&lt;&#x2F;p&gt;
&lt;p&gt;An additional motivation for investigating Chlorophyll is the need for a higher-level abstraction over arrayForth, the language used on GreenArrays chips. ArrayForth requires the programmer to partition data structures and code as well as to insert the requisite communication code. Chlorophyll aims to abstract away communication code as well as some partitioning logic that can be inferred by the synthesizer and separator.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;background&quot;&gt;Background&lt;&#x2F;h3&gt;
&lt;p&gt;Spatial hardware is an architecture where the users are required to assign data, computations, and communications explicitly to use its computing resources, storage, and interconnect network. Compared with general purpose processors, spatial architectures can achieve comparatively good, or even better performance and energy efficiency for some specific applications. However, spatial hardware is especially hard to program and debug as users have to manually manage the low level hardware resources and efficiently transform their programs in order to take advantage of the hardware features. &lt;&#x2F;p&gt;
&lt;p&gt;To mitigate programmability concerns, one possible solution is to use a domain-specific language (DSL) like Spatial &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;#2&quot;&gt;[2]&lt;&#x2F;a&gt; or T2S &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;#3&quot;&gt;[3]&lt;&#x2F;a&gt;. DSLs provide specialized APIs that help users realize different optimizations on the target hardware with much less effort. Hardware templates are another option. DNN&#x2F;FPGA Co-Design &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;#4&quot;&gt;[4]&lt;&#x2F;a&gt; developed a compilation flow from an ML workload to a spatial accelerator such as an FPGA or CGRA by searching through template IP pools under certain constraints. Such approaches improve programmability without sacrificing much performance, but still require users to have expertise in the target hardware.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;chlorophyll-overview&quot;&gt;Chlorophyll Overview&lt;&#x2F;h3&gt;
&lt;p&gt;Chlorophyll decomposes the problem of compiling high-level programs to spatial machine code into four sub-problems:&lt;&#x2F;p&gt;
&lt;img src=&quot;flow.png&quot; width=&quot;700&quot; &gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partitioning&lt;&#x2F;strong&gt;: Search for the partitioning scheme that minimizes the communication between different logical cores&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Layout&lt;&#x2F;strong&gt;: Assign the partitioned program segments to physical cores such that the actual communication cost is minimized &lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Code Separation&lt;&#x2F;strong&gt;: Generate communication code for data movements between physical cores&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Code Generation&lt;&#x2F;strong&gt;: Search and generate high performance binary code taking advantage of hardware features of the target machine&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Users can develop their programs in a C-like language and compile them to GA144 without worrying about communication and data movement. The compiler partitions the code, explores the design space, and automatically optimizes the program to improve performance.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;partitioning-synthesizer&quot;&gt;Partitioning Synthesizer&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;type-system&quot;&gt;Type System&lt;&#x2F;h4&gt;
&lt;p&gt;Chlorophyll provides an annotation-based syntax for users to indicate partition types for variables, arrays, and operators. Partition types can be declared with annotations or inferred by the synthesizer. The following example annotates arrays with different partitions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@{[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;} x[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]; 
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@{[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;} y[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]; 
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@{[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;} z[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]; 
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i from &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; to &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  z[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;leftrotate(x[i],y[i],r)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@place(z[i]) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; 
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6 32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;@&lt;&#x2F;code&gt; annotates a variable or operator with its partition index, and &lt;code&gt;@place&lt;&#x2F;code&gt; returns the partition type of its argument. &lt;code&gt;!&lt;&#x2F;code&gt; sends data from one partition to another. In the example above, arrays x, y, and z are each split across two logical cores (the first and second halves are stored on separate cores). The &lt;code&gt;leftrotate&lt;&#x2F;code&gt; computation may happen on the same logical core or a different one, depending on the result of type inference.&lt;&#x2F;p&gt;
&lt;p&gt;For loops in Chlorophyll programs, the synthesizer performs loop splitting so that every variable in each split loop has a single partition ID. The loop in the example above is decomposed into non-overlapping sub-loops.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i from &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; to &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  z[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;leftrotate(x[i]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,y[i]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,r&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6 32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i from &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; to &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;64&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  z[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;leftrotate(x[i]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,y[i]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,r&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;@&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6 32&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the partitioning stage, the Chlorophyll compiler performs type checking and type inference to generate completely partitioned code, in which each operand has an explicit partition type as well as a destination partition ID (if it is used by operations on other logical cores). Partitioning relies on the following two key functions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Communication Interpreter&lt;&#x2F;strong&gt;: The interpreter calculates the communication count of a given program &lt;code&gt;P&lt;&#x2F;code&gt; for every valid input &lt;code&gt;x&lt;&#x2F;code&gt; under a complete partition type annotation $\sigma$. The maximum communication count is defined as
$$
MaxComm(P, \sigma) = \max_{x} Comm(P, \sigma, x)
$$
If the input program is only partially annotated, the interpreter instead returns a formula over symbolic variables that stand for the unknown partition types. This formula is given to a constraint-based solver, which searches for an optimal partitioning of the program (i.e., it infers partition types for the un-annotated variables). &lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Space Checker&lt;&#x2F;strong&gt;: Given a complete partitioning of every variable, array, and other construct in the program, the space checker computes the space used in each partition and ensures that it does not exceed the storage capacity of the core.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The partitioning synthesizer passes these two functions as constraints to a backend solver to infer a partitioning scheme that minimizes communication without violating the memory constraint. Every time the solver finds a feasible solution, the synthesizer lowers the upper bound on the communication count and runs the solver again to push for a better solution.&lt;&#x2F;p&gt;
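&lt;p&gt;The bound-tightening loop can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the toy program, the variable sizes, the capacity, and the brute-force &lt;code&gt;solve&lt;&#x2F;code&gt; function (a stand-in for the real constraint solver) are all hypothetical and do not come from the paper.&lt;&#x2F;p&gt;

```python
from itertools import product

# Hypothetical toy program: each operation reads two operands and writes a result.
# The names, sizes, and capacity below are assumptions made for illustration.
OPS = [("x", "y", "t"), ("t", "r", "z")]
SIZE = {"x": 2, "y": 2, "t": 1, "r": 1, "z": 2}
PARTITIONS = (0, 1)
CAPACITY = 5


def comm_count(sigma):
    """Count operand transfers, assuming each op runs where its result lives."""
    return sum(
        1
        for a, b, out in OPS
        for operand in (a, b)
        if sigma[operand] != sigma[out]
    )


def fits(sigma):
    """Space check: memory used in every partition stays within CAPACITY."""
    used = {p: 0 for p in PARTITIONS}
    for var, part in sigma.items():
        used[part] += SIZE[var]
    return all(CAPACITY >= u for u in used.values())


def solve(bound):
    """Stand-in for the solver: any feasible sigma with comm_count at most bound."""
    names = sorted(SIZE)
    for parts in product(PARTITIONS, repeat=len(names)):
        sigma = dict(zip(names, parts))
        if fits(sigma) and bound >= comm_count(sigma):
            return sigma
    return None


def synthesize():
    """Lower the communication bound until the solver reports infeasibility."""
    best, bound = None, 2 * len(OPS)  # every operand transferred: trivial upper bound
    while bound >= 0:
        sigma = solve(bound)
        if sigma is None:
            break
        best, bound = sigma, comm_count(sigma) - 1
    return best
```

&lt;p&gt;In this toy instance the variables cannot all share one partition because of the capacity limit, so the loop stops at a partitioning with a single transfer.&lt;&#x2F;p&gt;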
&lt;h3 id=&quot;layout&quot;&gt;Layout&lt;&#x2F;h3&gt;
&lt;p&gt;The layout problem is formalized as a quadratic assignment problem (QAP), where we search for the assignment function to minimize the total communication cost:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;a set of facilities $P$ (the partition types derived by the partitioning synthesizer)&lt;&#x2F;li&gt;
&lt;li&gt;a weight function $W(a,b)$ returning the weight factor between two facilities $a$ and $b$&lt;&#x2F;li&gt;
&lt;li&gt;a distance function $D(a, b)$ returning the distance between two given locations&lt;&#x2F;li&gt;
&lt;li&gt;an assignment function $f$ mapping each facility (in our case, a partition) to a specific location (i.e., a core on the GA144)
$$
f = \arg\min_{f}\sum_{a, b\in P}W(a, b)D(f(a), f(b))
$$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The QAP is a classic combinatorial optimization problem, and good solutions can be found efficiently with simulated annealing (SA). The layout result is then used to generate communication code in the code separation phase.&lt;&#x2F;p&gt;
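&lt;p&gt;To make the objective concrete, here is a minimal simulated annealing sketch for a tiny QAP instance. The weight matrix, the 2x2 grid of cores, the Manhattan distance, and the cooling schedule are all assumptions chosen for illustration; they are not the paper&#x27;s actual parameters.&lt;&#x2F;p&gt;

```python
import math
import random
from itertools import combinations

# Hypothetical instance: 4 partitions placed on a 2x2 grid of cores.
# W[a][b] is an assumed message count between partitions a and b.
W = [
    [0, 9, 1, 0],
    [9, 0, 0, 1],
    [1, 0, 0, 9],
    [0, 1, 9, 0],
]
CORES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def distance(c1, c2):
    """Manhattan distance, assuming messages hop between neighboring cores."""
    return abs(c1[0] - c2[0]) + abs(c1[1] - c2[1])


def cost(f):
    """Sum of W(a, b) * D(f(a), f(b)) over unordered pairs of partitions."""
    return sum(
        W[a][b] * distance(CORES[f[a]], CORES[f[b]])
        for a, b in combinations(range(len(f)), 2)
    )


def anneal(start, steps=2000, temp=10.0, cooling=0.995, seed=0):
    """Swap-based simulated annealing; f[a] is the core index of partition a."""
    rng = random.Random(seed)
    f = list(start)
    best = list(f)
    for _ in range(steps):
        a, b = rng.sample(range(len(f)), 2)
        candidate = list(f)
        candidate[a], candidate[b] = candidate[b], candidate[a]
        delta = cost(candidate) - cost(f)
        # Always accept improvements; accept regressions with Boltzmann probability.
        if 0 >= delta or math.exp(-delta / temp) > rng.random():
            f = candidate
            if cost(best) > cost(f):
                best = list(f)
        temp *= cooling
    return best
```

&lt;p&gt;Calling &lt;code&gt;anneal([0, 3, 1, 2])&lt;&#x2F;code&gt; starts from a layout that places the heavily communicating pairs diagonally apart; the returned layout is never worse than the starting one because the best layout seen so far is kept.&lt;&#x2F;p&gt;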
&lt;h3 id=&quot;code-separation&quot;&gt;Code Separation&lt;&#x2F;h3&gt;
&lt;p&gt;The program is now separated into per-core program fragments. These fragments communicate through read and write operations, which the separator inserts. This design fits the GA144 architecture, which does not support shared memory: cores communicate via synchronous channels.&lt;&#x2F;p&gt;
&lt;!-- &lt; TODO INSERT why preserving the order of operations within each program fragment prevents deadlock&gt; --&gt;
&lt;p&gt;Consider this example with only basic statements.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int@3 x = (1 +@2 2)!3 *@3 (3 +@1 4)!3;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The separator traverses the AST in postorder to place sub-expressions according to their partitions, and then adds the requisite communication code.
This program is separated into these program fragments:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;partition 1: write(E, 3 + 4);
partition 2: write(E, 1 + 2); write(E, read(W));
partition 3: int x = read(W) * read(W);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These partitions are mapped to cores (0,1), (0,2), and (0,3), arranged west to east.&lt;&#x2F;p&gt;
&lt;img src=&quot;com.png&quot; width=&quot;200&quot;&gt;
&lt;p&gt;In the program, the operation &lt;code&gt;(3+4)&lt;&#x2F;code&gt; occurs at partition 1 and its result is sent to partition 3. However, since these partitions are not on adjacent cores, the separator inserts a &lt;code&gt;write EAST&lt;&#x2F;code&gt; of this result at partition 1, which delivers it to partition 2.
Next, at partition 2, the operation &lt;code&gt;(1+2)&lt;&#x2F;code&gt; occurs, and its result is sent to partition 3, so the separator inserts a &lt;code&gt;write EAST&lt;&#x2F;code&gt; of this result. In addition, the separator inserts a second &lt;code&gt;write EAST&lt;&#x2F;code&gt; at partition 2 that forwards the result it reads from partition 1. Note that the arithmetic operations on partition 1 and partition 2 happen in parallel.
Finally, once both results have reached the core that hosts partition 3, the operation &lt;code&gt;(3*7)&lt;&#x2F;code&gt; occurs there, reading its two source operands from the west.&lt;&#x2F;p&gt;
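&lt;p&gt;The postorder traversal for this example can be sketched as follows. This is our own simplified model, not the paper&#x27;s implementation: it assumes partitions sit on a single row with partition numbers increasing from west to east, that data only ever flows east, and the tuple-based AST encoding is hypothetical.&lt;&#x2F;p&gt;

```python
from collections import defaultdict

# Hypothetical AST encoding: ("num", value) or ("op", symbol, partition, left, right).
AST = (
    "op", "*", 3,
    ("op", "+", 2, ("num", 1), ("num", 2)),
    ("op", "+", 1, ("num", 3), ("num", 4)),
)


def separate(ast):
    """Split an expression into per-partition fragments (eastward sends only)."""
    fragments = defaultdict(list)

    def visit(node, dest):
        """Return the text that denotes this node's value at partition dest."""
        if node[0] == "num":
            return str(node[1])
        _, symbol, part, left, right = node
        # Postorder: children are placed before the operation that uses them.
        expr = f"{visit(left, part)} {symbol} {visit(right, part)}"
        if part == dest:
            return expr
        fragments[part].append(f"write(E, {expr})")
        # Intermediate cores forward the value one hop at a time.
        for hop in range(part + 1, dest):
            fragments[hop].append("write(E, read(W))")
        return "read(W)"

    root_partition = ast[2]
    fragments[root_partition].append(f"int x = {visit(ast, root_partition)}")
    return dict(fragments)
```

&lt;p&gt;On the example expression, &lt;code&gt;separate(AST)&lt;&#x2F;code&gt; produces the same three fragments listed above, including the forwarding write on partition 2.&lt;&#x2F;p&gt;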
&lt;p&gt;This process is much more challenging with control flow and functions. When considering functions, a function call in the original program translates to a function call at each of the cores on which the function resides.&lt;&#x2F;p&gt;
&lt;!-- TODO maybe talk about arrays --&gt;
&lt;h3 id=&quot;code-generation&quot;&gt;Code Generation&lt;&#x2F;h3&gt;
&lt;img src=&quot;overview.png&quot; width=&quot;600&quot; &gt;
&lt;p&gt;Typically, code generation is performed using a dynamic programming algorithm that utilizes local optimization to optimize larger and larger code fragments. Think of tiling an abstract assembly tree and building up larger and larger optimized tiles from smaller ones. This approach is not as effective for code generation for nontraditional architectures with nonlocal optimizations, such as optimizations for circular stacks.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of writing new transformation rules, the authors search for an optimized program in the space of candidate programs. One way to do this is through superoptimization.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Superoptimization&lt;&#x2F;em&gt; is the process of searching the space of all instruction sequences and verifying each candidate program against a reference implementation, here the naively generated code. Unfortunately, according to this paper, superoptimization only scales to roughly 25 instructions. However, more recent work mitigates this limit with increased contextual information and early pruning of invalid candidate programs &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;#5&quot;&gt;[5]&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;naive-code-generation&quot;&gt;Naive Code Generation&lt;&#x2F;h4&gt;
&lt;p&gt;In order to create this reference implementation, each per-core program is translated to machine code. Each operation in the high-level program is stored in a &lt;em&gt;superoptimizable unit&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;em&gt;state&lt;&#x2F;em&gt; of the machine is a collection of the data stack, return address stack, memory, and special registers. &lt;&#x2F;p&gt;
&lt;p&gt;Each superoptimizable unit contains a &lt;em&gt;live region&lt;&#x2F;em&gt;. A &lt;em&gt;live region&lt;&#x2F;em&gt; indicates the parts of a machine&#x27;s state that store live variables at the end of execution of the unit.&lt;&#x2F;p&gt;
&lt;p&gt;These units can then be merged into a longer sequence as a &lt;em&gt;superoptimizable segment&lt;&#x2F;em&gt;. &lt;&#x2F;p&gt;
&lt;p&gt;Since Chlorophyll does not support recursion, it is possible to statically determine the depth of the stack at any point in the program, and thus reject all programs that overflow the stack.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;modular-superoptimization&quot;&gt;Modular Superoptimization&lt;&#x2F;h4&gt;
&lt;p&gt;The behavior of a program is specified by its sequence of instructions &lt;em&gt;P&lt;&#x2F;em&gt; and its live region &lt;em&gt;L&lt;&#x2F;em&gt;.
Consider the resulting data stacks from executing &lt;em&gt;P&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;img src=&quot;stack_spec.png&quot; width=&quot;500&quot; &gt;
&lt;p&gt;&lt;em&gt;P&lt;&#x2F;em&gt; changes the data stack from $\alpha$|$\beta$ to $\alpha$|$\gamma$. Assume $\alpha$|$\gamma$ is in the live region. We say &lt;em&gt;P&#x27;&lt;&#x2F;em&gt; is equivalent to &lt;em&gt;P&lt;&#x2F;em&gt; if &lt;em&gt;P&#x27;&lt;&#x2F;em&gt; produces $\alpha$|$\gamma$, and the stack pointers after executing &lt;em&gt;P&lt;&#x2F;em&gt; and &lt;em&gt;P&#x27;&lt;&#x2F;em&gt; point to the same location.&lt;&#x2F;p&gt;
&lt;p&gt;Consider figure (b) above. Let &lt;em&gt;P&#x27;&lt;&#x2F;em&gt; be the sequence of instructions resulting in (b). Note that under this definition of equivalence, the resulting stacks in (a) and (b) are different. This is because the stack pointer is pointing to different places on each stack.&lt;&#x2F;p&gt;
&lt;p&gt;In order to decrease the number of false negatives, the authors loosen the specification. Assume &lt;em&gt;P&lt;&#x2F;em&gt; results in a data stack of $\alpha$|$\gamma$. &lt;em&gt;P&#x27;&lt;&#x2F;em&gt; is equivalent to &lt;em&gt;P&lt;&#x2F;em&gt; if &lt;em&gt;P&#x27;&lt;&#x2F;em&gt; results in a data stack of $\delta$|$\alpha$|$\gamma$, where $\delta$ may be empty. Additionally, they place no restrictions on the stack pointer.&lt;&#x2F;p&gt;
&lt;p&gt;This is valid because the stacks are circular: garbage left at the bottom of the stack will simply be overwritten as the stack pointer is incremented.&lt;&#x2F;p&gt;
&lt;p&gt;This engenders additional optimization as well. Consider the corresponding stacks for a subtract operation.&lt;&#x2F;p&gt;
&lt;img src=&quot;subexample.png&quot; width=&quot;400&quot; &gt;
&lt;p&gt;Here, b and a are popped from the stack, subtracted, and the result is pushed back onto the stack. The &lt;em&gt;P&lt;&#x2F;em&gt; that results in only &lt;code&gt;b-a&lt;&#x2F;code&gt; left on the stack is 8 instructions long, with 3 instructions to remove the remaining garbage value &lt;code&gt;a&lt;&#x2F;code&gt; from the stack. However, as noted before, it is in fact legal to leave &lt;code&gt;a&lt;&#x2F;code&gt; at the bottom of the stack, thus saving 3 instructions. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;sliding-windows&quot;&gt;Sliding Windows&lt;&#x2F;h4&gt;
&lt;p&gt;To scale superoptimization to real-world programs, the authors employ the &lt;em&gt;sliding windows technique&lt;&#x2F;em&gt;.
This algorithm merges superoptimizable units into a superoptimizable segment. It starts with an empty segment, superoptimizes the largest sequence possible from the input instruction sequence, and appends the result to the output if it is valid.
If the superoptimizer finds no valid segment, the algorithm appends the first unit of the segment to the output unchanged and attempts to build another sequence to superoptimize.
It repeats this process until the input instruction sequence is empty.&lt;&#x2F;p&gt;
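&lt;p&gt;A simplified sketch of this loop is shown below. It is our own reading of the description rather than the authors&#x27; implementation: it tries the largest window first and shrinks it, and &lt;code&gt;toy_superoptimize&lt;&#x2F;code&gt; is a hypothetical stand-in that only cancels a &lt;code&gt;dup&lt;&#x2F;code&gt; followed by a &lt;code&gt;drop&lt;&#x2F;code&gt; and gives up on segments longer than an assumed limit.&lt;&#x2F;p&gt;

```python
MAX_LEN = 4  # assumed scalability limit of the stand-in optimizer


def toy_superoptimize(segment):
    """Return a shorter equivalent segment, or None if none is found."""
    if len(segment) > MAX_LEN:
        return None  # models a timeout on segments that are too long
    out = []
    for ins in segment:
        if out and out[-1] == "dup" and ins == "drop":
            out.pop()  # "dup" followed by "drop" leaves the stack unchanged
        else:
            out.append(ins)
    return out if len(segment) > len(out) else None


def sliding_window(units, superoptimize):
    """Greedily superoptimize windows of consecutive units."""
    output = []
    i = 0
    while len(units) > i:
        # Try the largest window starting at unit i first, then shrink it.
        for j in range(len(units), i, -1):
            segment = [ins for unit in units[i:j] for ins in unit]
            result = superoptimize(segment)
            if result is not None:
                output.extend(result)
                i = j
                break
        else:
            # No window starting here improved: emit the first unit unchanged.
            output.extend(units[i])
            i += 1
    return output
```

&lt;p&gt;For instance, with &lt;code&gt;units = [[&quot;dup&quot;], [&quot;drop&quot;, &quot;+&quot;], [&quot;dup&quot;, &quot;dup&quot;], [&quot;drop&quot;, &quot;drop&quot;], [&quot;+&quot;]]&lt;&#x2F;code&gt;, the first two units collapse to a single &lt;code&gt;+&lt;&#x2F;code&gt;, the next two cancel entirely, and the last unit is emitted as is.&lt;&#x2F;p&gt;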
&lt;p&gt;The superoptimizer executes a binary search over programs given a cost model. It uses counterexample-guided inductive synthesis (CEGIS) to synthesize a program of cost $k$, and if one exists, to synthesize a program of cost $k&#x2F;2$, and so on.&lt;&#x2F;p&gt;
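&lt;p&gt;The search over the cost budget might look like the sketch below. The &lt;code&gt;synthesize&lt;&#x2F;code&gt; callback is a hypothetical stand-in for a CEGIS query that returns a program whose cost is at most the given budget, or &lt;code&gt;None&lt;&#x2F;code&gt; if no such program exists; we also assume that success is monotone in the budget.&lt;&#x2F;p&gt;

```python
def cheapest(synthesize, upper):
    """Binary search for the smallest budget at which synthesis succeeds."""
    best = synthesize(upper)
    if best is None:
        return None
    lo, hi = 0, upper  # invariant: synthesis succeeds at budget hi
    while hi > lo:
        mid = (lo + hi) // 2
        candidate = synthesize(mid)
        if candidate is not None:
            best, hi = candidate, mid  # a cheaper program exists: search lower
        else:
            lo = mid + 1  # nothing this cheap: search higher
    return hi, best


def fake_synthesize(budget):
    """Toy oracle: pretend the cheapest correct program costs 5."""
    return ["program", "of", "cost", 5] if budget >= 5 else None
```

&lt;p&gt;Each probe halves the remaining range of budgets, so the number of synthesis queries grows only logarithmically with the initial upper bound.&lt;&#x2F;p&gt;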
&lt;p&gt;When perusing papers that cite this one, we found the primary critique of the &lt;em&gt;sliding windows technique&lt;&#x2F;em&gt; to be its focus on local optimization and its inability to make whole-program optimality guarantees. Other synthesis approaches, such as metasketches, claim to provide these guarantees &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;chlorophyll&#x2F;#6&quot;&gt;[6]&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;!-- #### SMT Formulas
Once we have our superoptimizable segment, we optimize the segment using SMT solvers, specifically Z3. We use the GreenArrays cost model to approximate execution time for a given program. Then, we perform binary search over a candidate program space to find one of optimal execution time. --&gt;
&lt;!-- #### Address Space Compression
When it comes to supertoptimization, the smaller the memory required, the smaller the search space. In order to allow superoptimization to scale, the authors compress our address space during superoptimization. One primary example of compression focuses on memory assigned to arrays.  --&gt;
&lt;!-- INSERT: It is unclear how they change arrays to be of length two and modify the rest of the addresses accordingly? --&gt;
&lt;!-- 
After a valid output program is found, the address space is then decompressed and verified to be the same as the original input program. This is advantageous as verification is much faster than synthesis. --&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;p&gt;To assess the performance of the partitioning synthesizer, the authors implement a heuristic partitioner that greedily merges small unknown partitions when there is communication between the two.
To assess the layout synthesizer, the authors compare the default layout synthesizer, which considers approximate communication costs, with a modified version that assumes the communication cost between every pair is 1. To assess the superoptimizer, the authors compare generated programs with and without superoptimization. Finally, the authors compare sliding windows against fixed windows.&lt;&#x2F;p&gt;
&lt;img src=&quot;expert.png&quot; width=&quot;600&quot; &gt;
&lt;p&gt;For each benchmark, 5 different versions of the program are generated. We include a key here to interpret the figure above:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;(sliding | fixed | ns)&lt;&#x2F;strong&gt; = (sliding-window | fixed-window | no) superoptimization&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;(p | hp)&lt;&#x2F;strong&gt; = partitioning synthesizer | heuristic partitioner&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;(l | il)&lt;&#x2F;strong&gt; = layout synthesizer | imprecise layout synthesizer&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;On average, when comparing (ns+p+l) with (ns+hp+l), the authors find a 5% speedup when using the partitioning synthesizer over the heuristic partitioner. They confirm that the precise layout is crucial and note a 1.8x speedup when using the precise layout for the Convolution benchmark. This is due to the significant parallel patterns in the Convolution benchmark, where placing parallel groups close together can have a large impact.
When looking at superoptimization, the authors find an average 30% speedup on programs that use superoptimization. Furthermore, the programs that use sliding-window superoptimization are on average 4% faster than programs that use fixed-window superoptimization.&lt;&#x2F;p&gt;
&lt;p&gt;The authors further claim that programs generated with synthesis are comparable to &lt;em&gt;highly-optimized&lt;&#x2F;em&gt; &lt;em&gt;expert-written&lt;&#x2F;em&gt; programs. &lt;&#x2F;p&gt;
&lt;img src=&quot;exec_time.png&quot; width=&quot;490&quot; &gt;
&lt;img src=&quot;space.png&quot; width=&quot;500&quot; &gt;
&lt;img src=&quot;energy.png&quot; width=&quot;500&quot; &gt;
&lt;p&gt;On these single-core programs, the authors found the generated programs to be on average 46% slower, 44% less energy efficient, and 47% longer than the expert-written programs.
It is important to note that all of these benchmarks are single-core. For the one multicore program the authors have access to, the partitioning synthesizer times out and the heuristic partitioner fails to produce a program that fits in memory. The authors instead partition this program by hand: they write partition annotations, examine the machine code, and iterate. They compare two programs partitioned this way: one that is initially too large for memory until it is superoptimized, and one that fits on the cores without superoptimization.&lt;&#x2F;p&gt;
&lt;p&gt;While the authors find that the multicore program with superoptimization is 65% slower and 70% less energy efficient than the experts&#x27; implementation, and 7% faster than the non-superoptimized program, it is unclear whether this confirms that &amp;quot;these generated programs are comparable with experts&#x27; not only on small programs but on a real application.&amp;quot; The multicore results are hard to compare with the experts&#x27; implementation because the authors manually partitioned these &lt;em&gt;generated programs&lt;&#x2F;em&gt; themselves.&lt;&#x2F;p&gt;
&lt;p&gt;While the threshold for stating these generated programs are &lt;em&gt;comparable&lt;&#x2F;em&gt; to expert implementations is subjective, the data from the multicore test appears moderately contrived. We would like to see more naturally generated partitions for more multicore examples before leaping to the claim that these generated programs are comparable to expert implementations on real applications.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;discovering-optimizations-for-unusual-hardware&quot;&gt;Discovering Optimizations for Unusual Hardware&lt;&#x2F;h3&gt;
&lt;p&gt;The authors go on to claim their superoptimizer can discover optimizations that traditional compilers may not. They hand-select a couple of programs with potential bit-manipulation optimizations and find that the superoptimizer discovers the expected strength reduction optimizations as well as common subexpression elimination (CSE). We found the primary takeaway from this claim to be the superoptimizer&#x27;s ability to find optimizations specific to the unusual hardware, exploiting special instructions not found in common ISAs. This addresses one of the motivating challenges of this paper: building a mature compiler with heuristic-guided tree rewrites takes quite a bit of time, and low-power architectures are a moving target as they are under very active investigation. This challenge demands a way to discover optimizations for changing architectures, and these initial abilities of the superoptimizer are promising.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;opportunity-for-improvement-with-human-insight&quot;&gt;Opportunity for Improvement with Human Insight&lt;&#x2F;h3&gt;
&lt;p&gt;One last item to note is that the superoptimizer takes a significant amount of time to converge on a valid optimized program. Below we see the results for single-core and multicore benchmarks with their corresponding convergence times.&lt;&#x2F;p&gt;
&lt;img src=&quot;singlecore.png&quot; width=&quot;500&quot; &gt;
&lt;img src=&quot;multicore.png&quot; width=&quot;500&quot;&gt;
&lt;p&gt;The authors claim that injecting human insight into the superoptimizer, in the form of templates or pinning code to cores, can help reduce these times. They provide minimal examples and evidence of the potential impact of this human insight. We note the tradeoff between convergence times and programmability, one of their primary motivations for Chlorophyll as a higher-level abstraction over arrayForth.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimization-opportunity-loss&quot;&gt;Optimization Opportunity Loss&lt;&#x2F;h3&gt;
&lt;p&gt;The separation of partitioning and superoptimization leaves a few places for improvement. Chlorophyll uses a schedule-oblivious routing strategy. Therefore, if core A can communicate with core B via core X or core Y, it will choose the route arbitrarily. If core X is busy during this time, the better route would be through core Y, instead of waiting on X to finish its work. When determining the importance of this scheduling issue, we wanted to know what proportion of execution time is dedicated to communication cost, including time spent waiting. Given the strict space constraints, it is unclear whether inserting instructions to conditionally use different communication instructions saves enough execution time to be worth the space. Further benchmarks would be required to determine if it is worth it, and if so, in which cases. &lt;&#x2F;p&gt;
&lt;!-- If this time spent is significant, it may be worth it to perform a static analysis before code separation in order to better predict which cores will be busy to more optimally route data. --&gt;
&lt;h2 id=&quot;ga144-powered-lemonade-bleach-battery-demo&quot;&gt;GA144 Powered Lemonade-Bleach Battery Demo&lt;&#x2F;h2&gt;
&lt;p&gt;Check out a &lt;a href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;zMfdef-nYGY&quot;&gt;demo&lt;&#x2F;a&gt; in which the GA144 chip runs an integer division program entirely powered by a lemonade-bleach battery. The battery provides 1.6V and the application draws 2-3 mA.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;references&quot;&gt;References&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a id=&quot;1&quot;&gt;[1]&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;bitlog.it&#x2F;20141224_getting_started_with_the_ga144_and_arrayforth.html&quot;&gt;Getting Started With The GA144 And ArrayForth&lt;&#x2F;a&gt;
&lt;a id=&quot;2&quot;&gt;[2]&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;web.stanford.edu&#x2F;%7Ekozyraki&#x2F;publications&#x2F;2018.spatial.pldi.pdf&quot;&gt;David Koeplinger et al. Spatial: A Language and Compiler for Application Accelerators&lt;&#x2F;a&gt;
&lt;a id=&quot;3&quot;&gt;[3]&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;nitish2112.github.io&#x2F;publication&#x2F;t2s-tensor-fccm2019.pdf&quot;&gt;Nitish Srivastava et al. T2S-Tensor : Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations&lt;&#x2F;a&gt;
&lt;a id=&quot;4&quot;&gt;[4]&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1904.04421&quot;&gt;Cong Hao et al. FPGA&#x2F;DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge&lt;&#x2F;a&gt;
&lt;a id=&quot;5&quot;&gt;[5]&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2872387&quot;&gt;Phitchaya Mangpo Phothilimthana et al. Scaling up Superoptimization&lt;&#x2F;a&gt;
&lt;a id=&quot;6&quot;&gt;[6]&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;homes.cs.washington.edu&#x2F;%7Eluisceze&#x2F;publications&#x2F;synapse-popl16.pdf&quot;&gt;James Bornholt et al. Optimizing Synthesis with Metasketches&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Bril Phase Selection and Ordering</title>
                <pubDate>Tue, 19 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-phase-selection-and-ordering/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-phase-selection-and-ordering/</guid>
                <description>&lt;h1 id=&quot;background&quot;&gt;Background&lt;&#x2F;h1&gt;
&lt;p&gt;Modern compilers are often organized in terms of independent &amp;quot;passes&amp;quot;. That
is, a compiler will consist of tens or even hundreds of optimization passes
that each do a clearly defined task. Along with these many passes, however,
comes the problem of ordering them. For example, it isn&#x27;t obvious whether one
should run constant folding or copy propagation first. Although some of these
may have well defined answers, it&#x27;s likely that one can find an example in
which the ordering performs poorly.&lt;&#x2F;p&gt;
&lt;p&gt;We set out to build an ML-based phase selection&#x2F;ordering system. Because of
problems and discoveries along the way, we ended up only analyzing the
effects of different orders and generating an optimal sequence of
optimizations.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;optimizations&quot;&gt;Optimizations&lt;&#x2F;h2&gt;
&lt;p&gt;To set up the phase ordering task, we selected and implemented several optimization passes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For Dead Code Elimination (DCE), we used the given implementation directly.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For optimization passes based on data flow analysis, we implemented Copy Propagation and Constant Folding, as defined in lecture. Each pass first gathers data flow analysis information and then modifies the instructions in place.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For control flow graph optimizations, we implemented Branch Removal, Unreachable Code Elimination, CFG Cleaning, and Tail Merging. See these &lt;a href=&quot;http:&#x2F;&#x2F;user.it.uu.se&#x2F;%7Ekostis&#x2F;Teaching&#x2F;KT2-10&#x2F;Slides&#x2F;ControlFlowOpts.pdf&quot;&gt;slides&lt;&#x2F;a&gt; for more details.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Branch Removal&lt;&#x2F;p&gt;
&lt;p&gt;If we can determine that the guard value is always true or always false, we can eliminate one of the two branches accordingly.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Unreachable Code Elimination (UCE)&lt;&#x2F;p&gt;
&lt;p&gt;Basic blocks that are unreachable from the entry block are removed.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;CFG cleaning&lt;&#x2F;p&gt;
&lt;p&gt;We can simplify the CFG with several transformations that eliminate useless edges and combine basic blocks. Specifically, we can 1) replace a br instruction whose two destinations are identical with a jmp instruction; 2) merge empty blocks; and 3) in some cases, merge two blocks that are connected by a single edge.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Tail Merging&lt;&#x2F;p&gt;
&lt;p&gt;If the predecessors or successors of a block share the same section of code at their end or beginning, respectively, we can change the jumps to reuse a single copy of that code.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In general, the optimizations we selected are tailored to the simple
structure of Bril, which has a lot of constant calculation and branches that
do not allow falling through. Passes like DCE and UCE are likely to remove
more code, but other passes may or may not generate new opportunities
for DCE and UCE. Thus we need to order passes properly to obtain the best
performance, especially when we can only apply a limited number
of passes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;random-program-generator&quot;&gt;Random Program Generator&lt;&#x2F;h2&gt;
&lt;p&gt;In order to analyze programs in Bril, one obstacle was that we didn&#x27;t have a
large enough corpus of code to analyze. To do statistical analysis or any
kind of optimization techniques, we need a substantial training set to
evaluate our ordering methods.&lt;&#x2F;p&gt;
&lt;p&gt;To create this training set, we built a random Bril program generator. Our
program generator makes each decision independently, and doesn&#x27;t take any
care to preserve program-wide structure. We follow these steps for each
instruction.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;First, we randomly choose whether we want to insert a binary operation, a constant, or an id.&lt;&#x2F;li&gt;
&lt;li&gt;Next, given the instruction type, we decide whether it should be a boolean or an integer.&lt;&#x2F;li&gt;
&lt;li&gt;Finally, we choose whether this instruction should write to a new identifier or an already existing identifier. If it&#x27;s an already existing identifier we uniformly choose one from all the identifiers of the correct type.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
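&lt;p&gt;The three steps above can be sketched in Python as follows. This is an illustrative sketch rather than the generator used for the experiments: the function names, the dictionary layout (loosely modeled on Bril&#x27;s JSON form), the operator lists, and the probabilities are all assumptions made for the example.&lt;&#x2F;p&gt;

```python
import random

INT_OPS = ["add", "sub", "mul"]   # assumed operator choices for the sketch
BOOL_OPS = ["and", "or"]

def random_instr(rng, env, p_new=0.5):
    """Generate one instruction. env maps a type name to its defined variables."""
    kind = rng.choice(["binop", "const", "id"])   # step 1: instruction kind
    typ = rng.choice(["int", "bool"])             # step 2: result type
    if not env[typ]:
        kind = "const"   # nothing of this type to read yet, so emit a constant
    # step 3: overwrite an existing identifier of the same type, or make a new one
    if env[typ] and rng.random() >= p_new:
        dest = rng.choice(env[typ])
    else:
        dest = "v{}".format(len(env["int"]) + len(env["bool"]))
    if kind == "const":
        value = rng.randint(0, 9) if typ == "int" else rng.choice([True, False])
        instr = {"op": "const", "dest": dest, "type": typ, "value": value}
    elif kind == "id":
        instr = {"op": "id", "dest": dest, "type": typ,
                 "args": [rng.choice(env[typ])]}
    else:
        op = rng.choice(INT_OPS if typ == "int" else BOOL_OPS)
        args = [rng.choice(env[typ]), rng.choice(env[typ])]
        instr = {"op": op, "dest": dest, "type": typ, "args": args}
    if dest not in env[typ]:
        env[typ].append(dest)   # record the new identifier only after picking args
    return instr

def random_block(seed, n):
    """Generate n straight-line instructions from a seeded random source."""
    rng = random.Random(seed)
    env = {"int": [], "bool": []}
    return [random_instr(rng, env) for _ in range(n)]
```

&lt;p&gt;Because arguments are drawn only from identifiers that already exist, every use in the generated block is preceded by a definition of the matching type.&lt;&#x2F;p&gt;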
&lt;p&gt;One thing that we&#x27;ve ignored so far is control flow. Considering that several
of our optimizations are control flow optimizations, we clearly need to
handle it somehow. However, introducing arbitrary control flow is fairly
tricky. For example, we may construct non-terminating programs, which makes
evaluating them and checking our optimizations for bugs tricky.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, we settled on constructing a Directed Acyclic Graph. At a fixed
interval, we insert a label. In addition to that label, we have a chance of
adding a jump&#x2F;branch to a succeeding label. Because control only ever
transfers forward, every generated program terminates.&lt;&#x2F;p&gt;
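&lt;p&gt;A minimal Python sketch of this forward-only scheme is shown here. It is a simplification under stated assumptions: it only emits unconditional jumps (no two-way branches), and the label names, the &lt;code&gt;labels&lt;&#x2F;code&gt; field, and the parameters &lt;code&gt;interval&lt;&#x2F;code&gt; and &lt;code&gt;p_jump&lt;&#x2F;code&gt; are hypothetical names chosen for the example.&lt;&#x2F;p&gt;

```python
import random

def add_control_flow(body, interval, p_jump, seed=0):
    """Insert a label every `interval` instructions, with optional forward jumps."""
    rng = random.Random(seed)
    chunks = [body[i:i + interval] for i in range(0, len(body), interval)]
    labels = ["b{}".format(i) for i in range(len(chunks))]
    out = []
    for i, chunk in enumerate(chunks):
        out.append({"label": labels[i]})
        out.extend(chunk)
        later = labels[i + 1:]          # only labels that come after this block
        if later and p_jump > rng.random():
            out.append({"op": "jmp", "labels": [rng.choice(later)]})
    return out
```

&lt;p&gt;Since every jump targets a label that appears later in the program, the resulting control flow graph has no back edges.&lt;&#x2F;p&gt;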
&lt;p&gt;We note that there are a lot of hand-set probabilities in this generation
process that have the ability to heavily affect the types of programs
generated. For example, introducing a lot of control flow makes it more
difficult for other optimizations to be applied. Thus, we use a mix of
settings for these constants.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;phase-ordering-analysis&quot;&gt;Phase Ordering Analysis&lt;&#x2F;h2&gt;
&lt;p&gt;So, first of all, how much does the phase ordering even matter? To evaluate
the effectiveness of our optimizations, we will use # of instructions as a
proxy.&lt;&#x2F;p&gt;
&lt;p&gt;We will use three different test sets for this: one with no control flow,
one with an intermediate amount of control flow, and one with a lot of
control flow. These will be called CFLow, CFMid, and CFHigh, respectively.&lt;&#x2F;p&gt;
&lt;p&gt;Then, we apply 200 random permutations of the 7 optimizations mentioned
above. We record the instruction counts for the best and the worst performing
permutation, and report the averages, along with the best-to-worst ratio, here.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dataset&lt;&#x2F;th&gt;&lt;th&gt;Avg Low&lt;&#x2F;th&gt;&lt;th&gt;Avg High&lt;&#x2F;th&gt;&lt;th&gt;Avg Ratio&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CFLow&lt;&#x2F;td&gt;&lt;td&gt;18.7&lt;&#x2F;td&gt;&lt;td&gt;26.3&lt;&#x2F;td&gt;&lt;td&gt;.72&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CFMid&lt;&#x2F;td&gt;&lt;td&gt;34.0&lt;&#x2F;td&gt;&lt;td&gt;39.0&lt;&#x2F;td&gt;&lt;td&gt;.87&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CFHigh&lt;&#x2F;td&gt;&lt;td&gt;52.7&lt;&#x2F;td&gt;&lt;td&gt;54.7&lt;&#x2F;td&gt;&lt;td&gt;.96&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;As we can see, especially on the programs with low amounts of control flow,
there can be a huge difference between the best and worst permutation.&lt;&#x2F;p&gt;
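&lt;p&gt;The sampling experiment can be sketched in Python as follows. The two passes here are toy stand-ins operating on a list of integers, not the Bril passes described earlier, and the helper names are assumptions for the example; the point is only to show how sampling random orders exposes a best and a worst result.&lt;&#x2F;p&gt;

```python
import random

def apply_order(program, order):
    """Run each pass once, in the given order, and return the resulting program."""
    for opt in order:
        program = opt(program)
    return program

def best_and_worst(program, passes, trials=200, seed=0):
    """Sample random pass orders; return (fewest, most) instructions observed."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(trials):
        order = rng.sample(passes, len(passes))   # one random permutation
        sizes.append(len(apply_order(program, order)))
    return min(sizes), max(sizes)

# Toy stand-ins: "fold" rewrites every 1 into 0, and "dce" deletes every 0.
def fold(prog):
    return [0 if x == 1 else x for x in prog]

def dce(prog):
    return [x for x in prog if x != 0]

low, high = best_and_worst([1, 0, 2, 1], [fold, dce])
```

&lt;p&gt;Running &lt;code&gt;fold&lt;&#x2F;code&gt; before &lt;code&gt;dce&lt;&#x2F;code&gt; leaves one element, while the reverse order leaves three, which mirrors how one pass can create opportunities for another.&lt;&#x2F;p&gt;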
&lt;p&gt;However, the min and max alone do not provide much information about how
the distribution of possible permutations performs. To show that, we
normalize all permutation performances by the worst performing permutation
and plot the distribution as a violin plot. We see that there is a tail of
permutations that perform substantially better than the majority of
permutations, many of which perform as badly as the worst permutation.&lt;&#x2F;p&gt;
&lt;p&gt;CFLow:&lt;&#x2F;p&gt;
&lt;img src=&quot;low_violin.png&quot; style=&quot;width: 50%&quot;&gt;
&lt;p&gt;CFMid:&lt;&#x2F;p&gt;
&lt;img src=&quot;mid_violin.png&quot; style=&quot;width: 50%&quot;&gt;
&lt;p&gt;CFHigh:&lt;&#x2F;p&gt;
&lt;img src=&quot;high_violin.png&quot; style=&quot;width: 50%&quot;&gt;
&lt;h2 id=&quot;phase-ordering-heuristics&quot;&gt;Phase Ordering Heuristics&lt;&#x2F;h2&gt;
&lt;p&gt;However, questions remain about the nature of the good performing
permutations. Is there a single true permutation? Or does the ideal
permutation vary among programs? To answer that, we generated random
permutations and evaluated them across all 3 datasets. We report the
percentage (# instrs of optimized program&#x2F;# instrs of unoptimized program) of
the best single permutation across a given dataset, the average percentage
for optimizing the permutation per program, and the percentage for the best
overall permutation across all three datasets.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dataset&lt;&#x2F;th&gt;&lt;th&gt;Best Permutation&lt;&#x2F;th&gt;&lt;th&gt;Best Individual Permutation&lt;&#x2F;th&gt;&lt;th&gt;Overall Permutation&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CFLow&lt;&#x2F;td&gt;&lt;td&gt;35.81&lt;&#x2F;td&gt;&lt;td&gt;35.32&lt;&#x2F;td&gt;&lt;td&gt;36.15&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CFMid&lt;&#x2F;td&gt;&lt;td&gt;59.28&lt;&#x2F;td&gt;&lt;td&gt;59.24&lt;&#x2F;td&gt;&lt;td&gt;59.28&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CFHigh&lt;&#x2F;td&gt;&lt;td&gt;77.60&lt;&#x2F;td&gt;&lt;td&gt;77.54&lt;&#x2F;td&gt;&lt;td&gt;77.61&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Somewhat disappointingly, we find that there are permutations that always
perform very close to the optimal permutation for any program. On average,
the optimal permutations reduced the code length to 57.68% of the original
code length.&lt;&#x2F;p&gt;
&lt;p&gt;Doing so, we also find that there are many permutations with the same
performance as the best permutation (selected by best average performance
across all 3 datasets). We were curious if there was essentially some subset
of the optimizations that mattered. We analyzed all of the &amp;quot;optimal&amp;quot;
permutations for common subsequences, and found that all of the optimal
permutations had the following order of optimizations:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Copy Propagation&lt;&#x2F;li&gt;
&lt;li&gt;Constant Folding&lt;&#x2F;li&gt;
&lt;li&gt;Branch Removal&lt;&#x2F;li&gt;
&lt;li&gt;Dead Code Elimination&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We find that for our datasets, these are the only optimizations that matter.
Removing the other 3 optimizations still provides the same performance across
our dataset.&lt;&#x2F;p&gt;
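&lt;p&gt;One simple way to look for such a shared ordering is to collect every pair of passes whose relative order is the same in all of the best permutations. The Python sketch below illustrates that idea; the function name and the abbreviated pass names are assumptions for the example, and it is not necessarily the analysis used to produce the list above.&lt;&#x2F;p&gt;

```python
from itertools import permutations

def common_precedences(orders):
    """Return every pair (a, b) where pass a runs before pass b in all orders."""
    names = orders[0]
    return {
        (a, b)
        for a, b in permutations(names, 2)
        if all(o.index(b) > o.index(a) for o in orders)
    }

# Two hypothetical orders that agree on everything except where "uce" runs.
example = [
    ("cp", "cf", "uce", "br", "dce"),
    ("uce", "cp", "cf", "br", "dce"),
]
shared = common_precedences(example)
```

&lt;p&gt;In this example the pairs involving only &lt;code&gt;cp&lt;&#x2F;code&gt;, &lt;code&gt;cf&lt;&#x2F;code&gt;, &lt;code&gt;br&lt;&#x2F;code&gt;, and &lt;code&gt;dce&lt;&#x2F;code&gt; are shared, while the position of &lt;code&gt;uce&lt;&#x2F;code&gt; relative to &lt;code&gt;cp&lt;&#x2F;code&gt; and &lt;code&gt;cf&lt;&#x2F;code&gt; is not.&lt;&#x2F;p&gt;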
&lt;p&gt;However, note that per program, there are still occasional performance
improvements to be found. Of our 150 random programs, there are 10 programs
for which we can find a better order than the fixed dataset-wide one. We&#x27;re
unable to isolate any particular pattern these programs&#x2F;permutations have.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;future-work&quot;&gt;Future Work&lt;&#x2F;h1&gt;
&lt;p&gt;Originally we were interested in using ML to choose optimization orders.
However, because of how effective a fixed optimization order is, we decided
that we were unlikely to get substantive improvements from an ML-based
ordering system. We suspect that this is due to the small number of
optimizations we have, as well as the lack of more &amp;quot;ambiguous&amp;quot; optimizations.
Optimizations like constant folding or dead code elimination always improve
the performance of code (we think...). Thus, it&#x27;s possible that there is a fixed
optimization order that achieves almost optimal performance. However, the
implementation cost of adding more optimization passes (including verifying
their correctness) limited us to only 7 optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;Future work would include more optimization passes. If adding more
optimization passes gave more ambiguity in the optimization ordering, we
could then explore program specific orderings.&lt;&#x2F;p&gt;
&lt;p&gt;Another aspect of optimization passes that we completely ignored is the fact
that we can run each optimization pass more than once. For example, dead code
elimination is often run many times due to its computational cheapness and
the fact that dead code is often generated by other optimization passes. It
could be worth exploring that as well.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>A Spatial Path Scheduling Algorithm for EDGE Architectures</title>
                <pubDate>Mon, 18 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/edge-architecture/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/edge-architecture/</guid>
                <description>&lt;p&gt;This paper proposes a new scheduling algorithm for compiling programs to EDGE architectures, especially the TRIPS architecture. EDGE (Explicit Data Graph Execution) is a class of ISAs different from the usual ISAs we have seen before (e.g., RISC and CISC). It supports &lt;strong&gt;direct instruction communication&lt;&#x2F;strong&gt;. To be more specific, instead of specifying the source and destination registers of an instruction, direct instruction communication describes the producer-consumer behavior of a set of instructions. Namely, the instructions are executed in a dataflow fashion: once the instructions that produce its inputs have finished, the current instruction can fire. With the EDGE ISA, we need a corresponding compiler that schedules the instructions and maps them to spatial architectures such as TRIPS. This paper first demonstrates how previous work tackles the problem. Then, the paper shows how to get approximately optimal solutions with simulated annealing. Finally, it proposes a new scheduling algorithm, the spatial path scheduling (SPS) algorithm, and evaluates it with benchmarks from a wide range of applications. The results show that the SPS algorithm achieves a substantial improvement over the previous work.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;trips-an-edge-architecture&quot;&gt;TRIPS: An EDGE Architecture&lt;&#x2F;h2&gt;
&lt;p&gt;With the rising complexity of existing applications and the need for low-power solutions for emerging silicon technology, we need an instruction-set architecture (ISA) that provides the following features. First, it can exploit different kinds of parallelism (e.g., data-level parallelism and thread-level parallelism) under fixed pipeline depth. Second, it can support power-efficient performance. Third, it can provide flexible data communication to avoid delays caused by long on-chip wires. Finally, it can run various applications with the same set of execution and memory units.&lt;&#x2F;p&gt;
&lt;p&gt;To achieve all four features, the TRIPS architecture builds up an array of execution units, where the instructions can be executed, and the computation results can be moved flexibly between different units. The following shows an example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;ADD r1 r2 r3
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The above instruction is a traditional RISC instruction. It first reads values from registers &lt;code&gt;r1&lt;&#x2F;code&gt; and &lt;code&gt;r2&lt;&#x2F;code&gt;. It then adds the values and stores the result in register &lt;code&gt;r3&lt;&#x2F;code&gt;. For an EDGE instruction, however, we do not specify the inputs; we only specify the outputs. The following is an example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;ADD T2 T3
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the above EDGE instruction, &lt;code&gt;T2&lt;&#x2F;code&gt; and &lt;code&gt;T3&lt;&#x2F;code&gt; are the destination execution units. Namely, after the addition, the result will be sent to units &lt;code&gt;T2&lt;&#x2F;code&gt; and &lt;code&gt;T3&lt;&#x2F;code&gt;. Each execution unit is implemented with an ALU and instruction buffers. In addition, the TRIPS architecture uses &lt;strong&gt;block-atomic execution&lt;&#x2F;strong&gt;. Namely, the instructions are grouped into blocks (each usually consisting of 128 instructions), which are mapped to the execution units. The TRIPS architecture can fire at most eight blocks at the same time. The architecture also has other components such as register files, caches, and control units. Since the architecture implements the EDGE ISA, each instruction describes which units its computation result will go to. An instruction in an execution unit fires once all of its inputs arrive.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;compilation-flow-for-trips&quot;&gt;Compilation Flow for &lt;a href=&quot;https:&#x2F;&#x2F;ieeexplore.ieee.org&#x2F;document&#x2F;1310240&quot;&gt;TRIPS&lt;&#x2F;a&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;A naive way to compile a program and deploy it to a TRIPS architecture is to generate the RISC assembly first. Then, the compiler analyzes the data dependence and builds a dataflow graph (DFG). The compiler also includes instructions from all branches of a conditional instruction (e.g., &lt;code&gt;br&lt;&#x2F;code&gt;). Next, according to the DFG, the compiler maps all instructions to the execution units with a scheduler. The compiler also needs to add data movement instructions if necessary. Finally, the compiler generates the EDGE ISA by referring to the mapping it creates in the previous stage. &lt;&#x2F;p&gt;
&lt;p&gt;One observation from the compilation flow is that out-of-order execution is enabled naturally by the TRIPS architecture. Namely, the execution order is determined dynamically at run time, while the compiler only statically determines the instruction mapping. By comparison, for VLIW, both the execution order and the instruction mapping are determined statically at compile time. On the other hand, for an out-of-order pipelined processor, both the execution order and the instruction mapping are determined at run time.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;spatial-scheduling-problem&quot;&gt;Spatial Scheduling Problem&lt;&#x2F;h2&gt;
&lt;p&gt;There are many optimization possibilities within this naive compilation flow. For instance, how should we group the instructions into blocks? How should we perform branch prediction? And most importantly, how should we map the instructions to the execution units? The last problem is discussed in detail within the paper, where the authors first use simulated annealing to derive approximately optimal results. Then, they propose a spatial path scheduling (SPS) algorithm that maps the instructions to the execution units with several heuristics. Finally, the authors compare the SPS algorithm with a baseline algorithm proposed in 2004.&lt;&#x2F;p&gt;
&lt;p&gt;The main idea of the baseline and the SPS algorithms is the same. First, we create an initial set containing certain instructions. Then, we assign each instruction in the set a &lt;strong&gt;placement cost&lt;&#x2F;strong&gt;. The instruction with the highest cost is scheduled first. After that, we add other instructions to the set until all instructions are scheduled. The difference between the baseline and the proposed algorithm is the function used to calculate the cost.&lt;&#x2F;p&gt;
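&lt;p&gt;The shared skeleton can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it uses the GRST-style rule that an instruction becomes a candidate once all of its inputs are scheduled, it records only the scheduling order (not a position in the ALU array), and it uses the longest path to the end of the dataflow graph as a stand-in cost. The function names and the tiny dataflow graph are hypothetical.&lt;&#x2F;p&gt;

```python
def heights(deps):
    """Longest path (in instructions) from each instruction to the end of the graph."""
    users = {i: [j for j in deps if i in deps[j]] for i in deps}
    memo = {}

    def h(i):
        if i not in memo:
            memo[i] = 1 + max((h(j) for j in users[i]), default=0)
        return memo[i]

    return {i: h(i) for i in deps}

def greedy_schedule(deps, cost):
    """deps maps an instruction to the set of instructions that produce its inputs."""
    order, done, remaining = [], set(), set(deps)
    while remaining:
        # candidate set: instructions whose inputs have all been scheduled
        ready = sorted(i for i in remaining if deps[i].issubset(done))
        if not ready:
            raise ValueError("dependence cycle")
        pick = max(ready, key=cost)     # highest placement cost goes first
        order.append(pick)
        done.add(pick)
        remaining.remove(pick)
    return order

# Hypothetical dataflow graph: a feeds b, and b and c both feed d.
DEPS = {"a": set(), "b": {"a"}, "c": set(), "d": {"b", "c"}}
H = heights(DEPS)
ORDER = greedy_schedule(DEPS, H.get)
```

&lt;p&gt;Swapping in a different &lt;code&gt;cost&lt;&#x2F;code&gt; function changes the scheduling priority without touching the loop, which is the sense in which the two algorithms differ only in their cost functions.&lt;&#x2F;p&gt;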
&lt;h3 id=&quot;greedy-algorithm-grst&quot;&gt;Greedy Algorithm - GRST&lt;&#x2F;h3&gt;
&lt;p&gt;The initial set contains the instructions that have no inputs or whose inputs have all been scheduled. GRST adopts several heuristics. For example, it schedules the instructions on the critical path first. It also considers data locality by placing load and store instructions next to the caches. Similarly, it places instructions that have register outputs next to the register files. Ideally, the distance between an instruction and the register file is the same as the number of succeeding instructions until the final write operation. However, there are many limitations. For example, the register files and caches are banked within the TRIPS architecture. This algorithm also serves as the baseline for the SPS algorithm discussed later.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simulated-annealing&quot;&gt;Simulated Annealing&lt;&#x2F;h3&gt;
&lt;p&gt;One naive way to evaluate the scheduling results is to compare them with the optimal results. However, finding the optimal schedule is an NP-complete problem. To work around that, the authors use simulated annealing, which can find an approximately optimal solution within a large search space. The cost function is defined as the number of cycles needed to run the whole schedule. The authors further apply guided simulated annealing to obtain results in less time.&lt;&#x2F;p&gt;
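&lt;p&gt;For readers unfamiliar with the technique, here is a generic simulated annealing sketch in Python applied to a toy placement problem. It is not the paper&#x27;s setup: the cost below is total Manhattan distance between producers and consumers on a 2x2 grid, used as a stand-in for the cycle count, and the cooling schedule, step count, and all names are assumptions made for the example.&lt;&#x2F;p&gt;

```python
import math
import random

def anneal(initial, cost, neighbor, steps=2000, t0=2.0, seed=0):
    """Return the best state seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9    # linear cooling schedule
        cand = neighbor(current, rng)
        delta = cost(cand) - current_cost
        # always accept improvements; accept regressions with prob. exp(-delta / temp)
        if 0 >= delta or math.exp(-delta / temp) > rng.random():
            current, current_cost = cand, current_cost + delta
            if best_cost > current_cost:
                best, best_cost = current, current_cost
    return best, best_cost

EDGES = [(0, 1), (1, 2), (2, 3)]    # hypothetical producer-consumer chain

def wire_length(place):
    """place[i] is the (row, col) slot of instruction i."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in EDGES)

def swap_two(place, rng):
    """Neighbor move: exchange the slots of two instructions."""
    i, j = rng.sample(range(len(place)), 2)
    new = list(place)
    new[i], new[j] = new[j], new[i]
    return tuple(new)

START = ((0, 0), (1, 1), (0, 1), (1, 0))
BEST, BEST_COST = anneal(START, wire_length, swap_two)
```

&lt;p&gt;Accepting occasional regressions early on is what lets the search escape local minima; as the temperature drops, it behaves more and more like plain hill climbing.&lt;&#x2F;p&gt;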
&lt;h3 id=&quot;spatial-path-scheduling-sps&quot;&gt;Spatial Path Scheduling (SPS)&lt;&#x2F;h3&gt;
&lt;p&gt;The main idea of this algorithm is that, for each instruction, it calculates the cost of each possible position in the ALU array. The final placement cost is the minimal cost among all positions. Unlike GRST, the initial set contains the instructions that have no inputs or &lt;strong&gt;at least one&lt;&#x2F;strong&gt; scheduled input. After the placement costs are calculated, the instruction with the highest placement cost is scheduled first. This idea is similar to GRST, where the instructions on the critical path are scheduled first. To further improve the algorithm, the authors derive additional heuristics by comparing the results generated by SPS and simulated annealing.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;contention-modeling&quot;&gt;Contention Modeling&lt;&#x2F;h4&gt;
&lt;p&gt;The authors model two types of contention: ALU contention and link contention. ALU contention is the resource conflict that arises when two instructions are scheduled on the same ALU (i.e., execution unit). It can be further divided into two types: intra-block contention and inter-block contention. Intra-block contention is handled similarly to GRST: the algorithm checks whether two instructions that are mapped to the same ALU may have a resource conflict. If so, a penalty is added to the placement cost of the instruction being scheduled.&lt;&#x2F;p&gt;
&lt;p&gt;Inter-block contention is not handled by GRST. The idea is that two instructions in two different blocks may have resource conflicts. The authors first try adding the number of consumers of the already scheduled instructions to the placement cost. However, this is too conservative, since the consumers of the scheduled instructions do not necessarily conflict with the instruction being scheduled. To counter this, the algorithm removes candidates that cannot have resource conflicts. Link contention, in contrast, is not trivial to calculate because the network utilization is unknown until run time.&lt;&#x2F;p&gt;
&lt;p&gt;To calculate the final penalty, the algorithm sums the intra- and inter-block contention for ALUs. In addition, it introduces two factors: fullness and criticality. Fullness corresponds to how full a block is. Criticality compares the maximum path length through an instruction with the length of the critical path. The final placement cost is a combination of all the above numbers.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;global-register-prioritization&quot;&gt;Global Register Prioritization&lt;&#x2F;h4&gt;
&lt;p&gt;The main idea of this heuristic is performing loop-related optimizations. The algorithm performs the following three heuristics.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Smaller loops (in terms of the number of blocks) will be scheduled first. &lt;&#x2F;li&gt;
&lt;li&gt;Loop-carried dependencies will be scheduled first. &lt;&#x2F;li&gt;
&lt;li&gt;When calculating the longest path, the algorithm will also consider the predecessor and the successor of the current block.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h4 id=&quot;path-volume-scheduling&quot;&gt;Path Volume Scheduling&lt;&#x2F;h4&gt;
&lt;p&gt;The goal here is to find the best path from a source unit to a destination unit. This is not as simple as it seems because some units might already be fully occupied by other instructions. In addition, we need to fit all intermediate instructions onto the path we find. The authors apply a depth-first search with &lt;strong&gt;iterative deepening&lt;&#x2F;strong&gt;. The idea is to set an upper bound on the path length first, and then keep increasing the bound until the search finds a feasible solution.&lt;&#x2F;p&gt;
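&lt;p&gt;The iterative deepening idea can be sketched in Python as follows. This is a simplified illustration, not the paper&#x27;s algorithm: it only searches for a route through units that still have room, ignores the placement of intermediate instructions, and treats the array as a grid with four-way neighbors. The function names and the 3x3 example are assumptions.&lt;&#x2F;p&gt;

```python
def bounded_dfs(free, node, goal, limit, path):
    """Depth-first search that gives up once `limit` hops have been used."""
    if node == goal:
        return path
    if limit == 0:
        return None
    r, c = node
    for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
        if nxt in free and nxt not in path:
            found = bounded_dfs(free, nxt, goal, limit - 1, path + [nxt])
            if found is not None:
                return found
    return None

def iterative_deepening(free, start, goal, max_limit):
    """Raise the hop bound one step at a time until some path fits."""
    for limit in range(max_limit + 1):
        found = bounded_dfs(free, start, goal, limit, [start])
        if found is not None:
            return found
    return None

# Hypothetical 3x3 array in which units (0, 1) and (1, 1) are already full.
FREE = {(r, c) for r in range(3) for c in range(3)} - {(0, 1), (1, 1)}
ROUTE = iterative_deepening(FREE, (0, 0), (0, 2), max_limit=8)
```

&lt;p&gt;Because the bound grows one hop at a time, the first path found is also a shortest one, at the price of repeating the shallow part of the search on every round.&lt;&#x2F;p&gt;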
&lt;h4 id=&quot;putting-it-all-together&quot;&gt;Putting It All Together&lt;&#x2F;h4&gt;
&lt;p&gt;Finally, we can combine all heuristics into a single function that calculates the placement cost. Namely, we first consider the penalty brought by the contention. Then, we consider the loop-related optimizations. Finally, we include the routing cost derived by the pathfinding algorithm.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;The authors selected a wide range of benchmarks with different levels of concurrency and memory behavior. Since dense blocks require more algorithmic work, the authors modified the selected benchmarks by hand. Then they compared the results with the baseline GRST algorithm and the simulated annealing results. They also show the results after applying each heuristic alone. In conclusion, their SPS algorithm improves the cycle count by over 21% on average compared with GRST. When the heuristics are not combined, each heuristic alone provides at most a 4% improvement on average, and path volume scheduling provides no improvement at all when applied alone. Finally, the SPS results are within 5% of the simulated annealing results on average. In addition, the authors performed a cross-validation by running the same algorithm on the unmodified benchmarks. The results still demonstrate an improvement of over 17% on average compared with GRST.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion-and-thoughts&quot;&gt;Conclusion and Thoughts&lt;&#x2F;h2&gt;
&lt;p&gt;This paper proposes a scheduling algorithm specific to the TRIPS architecture and shows decent evaluation results. This work is interesting because spatial architectures are becoming more and more common. For example, coarse-grained reconfigurable arrays (CGRAs) are very similar to the TRIPS architecture. The major difference is that CGRAs are reconfigurable. Namely, for TRIPS, each execution unit is just an ALU. On the other hand, for CGRAs, each compute unit is reconfigurable. They also share the same routing&#x2F;scheduling problem. This problem is more complicated when we have other kinds of spatial architectures, such as FPGAs. If we make the problem more fine-grained, it becomes the real place &amp;amp; route problem during hardware synthesis.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;questions&quot;&gt;Questions&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;What are the trade-offs brought by the TRIPS architecture compared with traditional multi-stage pipelined processors?&lt;&#x2F;li&gt;
&lt;li&gt;How can we compare the proposed scheduling algorithm with other traditional algorithms, e.g., list scheduling, ASAP, ALAP? What are the differences?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
            </item>
        
            <item>
                <title>MemPass</title>
                <pubDate>Thu, 14 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/mempass/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/mempass/</guid>
                <description>&lt;p&gt;While manual access to memory allocation&#x2F;deallocation can allow the experienced programmer to significantly improve the performance of their program, it can also open the door to a host of memory safety bugs: use after free, double free, out of bounds memory accesses, and the like. Detecting these problems is difficult on the programmer&#x27;s end, and they leave room for exploitation by malicious users.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;memory-safety-vulnerabilities&quot;&gt;Memory Safety Vulnerabilities&lt;&#x2F;h2&gt;
&lt;p&gt;Many memory safety vulnerabilities occur when the user tries to access or deallocate memory that isn&#x27;t available for use, either because it was freed, or never allocated in the first place. Take for example the following program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ptr;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
free(ptr);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After 16 bytes are allocated (assuming a 4-byte &lt;code&gt;int&lt;&#x2F;code&gt;), the starting address of this block of memory is stored in &lt;code&gt;ptr&lt;&#x2F;code&gt;, and the block is freed shortly afterward.&lt;&#x2F;p&gt;
&lt;p&gt;Now, assume that the programmer attempts to access the memory stored at the address in &lt;code&gt;ptr&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ptr, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ptr2;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
free(ptr);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
printf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;%d&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, ptr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since &lt;code&gt;ptr&lt;&#x2F;code&gt; has been freed, the second call to &lt;code&gt;malloc&lt;&#x2F;code&gt; (which requests the same size as the first) is likely to reuse the freed space that &lt;code&gt;ptr&lt;&#x2F;code&gt; still points to. Now, let&#x27;s assume a malicious attacker managed to store a dangerous payload in this space. Since the program uses &lt;code&gt;ptr&lt;&#x2F;code&gt; after it has been freed, it might end up accessing the attacker&#x27;s payload rather than the original &lt;code&gt;ptr&lt;&#x2F;code&gt; data as intended!&lt;&#x2F;p&gt;
&lt;p&gt;Double free errors occur when &lt;code&gt;free()&lt;&#x2F;code&gt; is called more than once on the same memory address. For instance, consider the following code snippet from our test cases (described in the evaluation section):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
	&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
	&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;free&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(ptr);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
	&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;free&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(ptr); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;* Double Free Error! *&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, &lt;code&gt;ptr&lt;&#x2F;code&gt; is allocated and then freed twice. Generally, double-freeing a block of heap memory can corrupt the state of the memory manager, and could allow a malicious attacker to write arbitrary dangerous code to the memory location that was freed twice.&lt;&#x2F;p&gt;
&lt;p&gt;Memory leaks occur when the program never frees memory that is no longer needed. For instance:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The consequence of a memory leak is often degraded performance. Memory leaks reduce the amount of memory available in a system, and if not enough space is available, the program may slow down due to thrashing or stop altogether.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;introducing-mempass&quot;&gt;Introducing MemPass&lt;&#x2F;h2&gt;
&lt;p&gt;Here, we present an analysis tool (creatively titled MemPass) for LLVM to detect use-after-free bugs, double-free bugs, and memory leaks. Here&#x27;s how it works.&lt;&#x2F;p&gt;
&lt;p&gt;Once any of the above vulnerabilities is detected, the instrumented program prints a warning to the user, along with the line number of the offending operation, so the programmer can debug the issue. Upon completion of a program&#x27;s execution, MemPass generates a report listing the vulnerabilities detected in the original program.&lt;&#x2F;p&gt;
&lt;p&gt;In practice, we can only detect memory leaks after the program has completely executed, while use-after-free and double-free errors can be detected during runtime and should result in the program halting to prevent subsequent malicious code from executing. For this project, we chose to emit warnings only rather than halt, which made testing more manageable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;As mentioned at the beginning, there is a large breadth of possible memory safety vulnerabilities. In order to tackle a modest subset of these bugs, our strategy of choice is a simple dynamic analysis pass over the LLVM IR.&lt;&#x2F;p&gt;
&lt;p&gt;MemPass inserts instrumentation after relevant memory allocation instructions, recording the relevant addresses. Any time the program attempts to either access or deallocate memory, MemPass will then check if those addresses are still available for use. If not, the program will throw a warning.&lt;&#x2F;p&gt;
&lt;p&gt;In essence, if the program ever tries to deallocate or access a memory address that isn&#x27;t currently allocated in MemPass&#x27;s hashtable, we have a bug!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;instrumenting-llvm-ir&quot;&gt;Instrumenting LLVM IR&lt;&#x2F;h2&gt;
&lt;p&gt;To track memory allocations, deallocations, and accesses in LLVM IR, MemPass needs to insert instrumentation after relevant instructions.&lt;&#x2F;p&gt;
&lt;p&gt;While we could insert our own, carefully crafted LLVM instructions every time, we opted to write a runtime library and link it to the main program with our pass (as described &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;llvm-pass-skeleton&#x2F;tree&#x2F;rtlib&quot;&gt;here&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;In this runtime library, we can write a series of functions to grab relevant data, and then perform the appropriate steps to detect any memory safety vulnerabilities. For each instruction, we log:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Instruction&lt;&#x2F;th&gt;&lt;th&gt;Logging&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;alloca&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;size and (stack) pointer address&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;malloc&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;size and pointer address&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;calloc&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;size and pointer address&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;free&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;pointer address&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;load&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;address to load from&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;store&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;address to store to&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Now, all we have to do is insert a &lt;code&gt;call&lt;&#x2F;code&gt; to one of our library functions after every relevant memory instruction, and the &lt;code&gt;llvm-link&lt;&#x2F;code&gt; tool will do all the heavy lifting!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tracking-memory-allocation&quot;&gt;Tracking Memory Allocation&lt;&#x2F;h2&gt;
&lt;p&gt;To better illustrate how MemPass works on a real program, let&#x27;s take the following buggy program segment:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
free(ptr);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
free(ptr);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We see a double free vulnerability, which we want to catch and emit as a warning to the user.&lt;&#x2F;p&gt;
&lt;p&gt;The relevant call to &lt;code&gt;malloc&lt;&#x2F;code&gt; translates to LLVM IR as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;%7 = call noalias i8* @malloc(i64 16) #3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To pass the relevant data to our logging functions, we need to grab both the memory address that &lt;code&gt;malloc&lt;&#x2F;code&gt; returned and the amount of memory that was allocated. Luckily, &lt;code&gt;%7&lt;&#x2F;code&gt; holds the returned address as an &lt;code&gt;i8*&lt;&#x2F;code&gt; (a pointer to an 8-bit integer), and the size is the 64-bit integer operand of the &lt;code&gt;malloc&lt;&#x2F;code&gt; call itself.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, MemPass inserts a call to our &lt;code&gt;logMalloc&lt;&#x2F;code&gt; library function:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;%7 = call noalias i8* @malloc(i64 16) #3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
call void @logMalloc(i8* %7, i64 16)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Armed with this data, MemPass stores the address and memory size as key-value pairs in a hashtable. This allows us to easily check if an address has already been allocated.&lt;&#x2F;p&gt;
&lt;p&gt;Note that this approach will work similarly with any calls to &lt;code&gt;calloc&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
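&lt;p&gt;The bookkeeping described above can be sketched in a few lines. This is only a minimal Python model of the idea, not MemPass&#x27;s actual runtime library; the &lt;code&gt;AllocationTable&lt;&#x2F;code&gt; class and its method names are hypothetical and chosen purely for illustration.&lt;&#x2F;p&gt;

```python
class AllocationTable:
    """Toy model of an allocation table: maps a base address to a size in bytes."""

    def __init__(self):
        self.live = {}  # base address: size of the block

    def log_malloc(self, addr, size):
        # Record a block returned by malloc (or calloc) as currently allocated.
        self.live[addr] = size

    def is_allocated(self, addr):
        # True only for the exact base address of a live block.
        return addr in self.live


table = AllocationTable()
table.log_malloc(0x1000, 16)       # models: call void @logMalloc(i8* %7, i64 16)
print(table.is_allocated(0x1000))  # the block we just recorded
print(table.is_allocated(0x2000))  # an address that was never allocated
```

&lt;p&gt;A dictionary keyed by base address makes the exact-address lookup needed for &lt;code&gt;free&lt;&#x2F;code&gt; cheap; checking addresses that fall in the middle of a block needs extra work.&lt;&#x2F;p&gt;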
&lt;h2 id=&quot;checking-free&quot;&gt;Checking Free&lt;&#x2F;h2&gt;
&lt;p&gt;Returning to the above example, the two calls to &lt;code&gt;free&lt;&#x2F;code&gt; roughly translate to (with instrumentation, after omitting a few loads and bitcasts):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;call void @free(i8* %10) #3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
call void @logFree(i8* %10)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
call void @free(i8* %12) #3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
call void @logFree(i8* %12)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In essence, MemPass sends the relevant addresses that we want to free to our runtime library. Taking the first address in &lt;code&gt;%10&lt;&#x2F;code&gt;, MemPass checks the allocation table to see if it has been allocated. If so, this is a valid attempt to free!&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, our second call to &lt;code&gt;free&lt;&#x2F;code&gt; attempts to free &lt;code&gt;%12&lt;&#x2F;code&gt;, which is the same address as &lt;code&gt;%10&lt;&#x2F;code&gt;. MemPass simply checks the allocation table, and since the address is no longer stored there, we have a double free bug. MemPass prints this as a warning to the console, and continues searching for vulnerabilities.&lt;&#x2F;p&gt;
&lt;p&gt;While MemPass doesn&#x27;t handle calls to &lt;code&gt;realloc&lt;&#x2F;code&gt;, it would be quite tractable to do so now. Just remove the old memory address from the allocation hash, and add the new address that the function returns (along with its size).&lt;&#x2F;p&gt;
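&lt;p&gt;The free check, together with the &lt;code&gt;realloc&lt;&#x2F;code&gt; extension suggested above, might look roughly like the following. This is a hedged Python model rather than the real runtime library: &lt;code&gt;log_malloc&lt;&#x2F;code&gt;, &lt;code&gt;log_free&lt;&#x2F;code&gt;, and &lt;code&gt;log_realloc&lt;&#x2F;code&gt; are illustrative names, and &lt;code&gt;log_realloc&lt;&#x2F;code&gt; in particular is something MemPass does not implement.&lt;&#x2F;p&gt;

```python
live = {}       # base address: size, for blocks that are currently allocated
warnings = []   # messages are collected instead of halting the program


def log_malloc(addr, size):
    live[addr] = size


def log_free(addr):
    # Freeing an address that is not in the table is reported as a double free.
    if addr in live:
        del live[addr]
    else:
        warnings.append("double free of " + hex(addr))


def log_realloc(old_addr, new_addr, new_size):
    # Hypothetical extension: forget the old block, then record the new one.
    live.pop(old_addr, None)
    live[new_addr] = new_size


log_malloc(0x1000, 16)
log_free(0x1000)                 # valid free: the address is in the table
log_free(0x1000)                 # second free of the same address
log_malloc(0x2000, 8)
log_realloc(0x2000, 0x3000, 32)  # the block moved and grew
print(warnings)
print(live)
```

&lt;p&gt;Note that this simple model cannot tell a double free apart from freeing an address that was never allocated; both show up as an address missing from the table.&lt;&#x2F;p&gt;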
&lt;h2 id=&quot;use-after-free-accessing-invalid-memory&quot;&gt;Use After Free: Accessing invalid memory&lt;&#x2F;h2&gt;
&lt;p&gt;One of the more difficult aspects of MemPass&#x27;s implementation is to find a way to handle accesses to memory after a pointer has been freed. Consider the following program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
free(ptr);&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ptr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The naïve solution would be to add instrumentation after every load or store instruction in the LLVM IR, and compare the addresses to our allocation table. However, we run into a series of complications once we look at the actual IR.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;%3 = alloca i32, align 4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%4 = alloca i32, align 4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%5 = alloca i8**, align 8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%6 = alloca i32*, align 8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i32 0, i32* %3, align 4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i32 %0, i32* %4, align 4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i8** %1, i8*** %5, align 8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%7 = call noalias i8* @malloc(i64 16) #3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%8 = bitcast i8* %7 to i32*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i32* %8, i32** %6, align 8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%9 = load i32*, i32** %6, align 8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%10 = bitcast i32* %9 to i8*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
call void @free(i8* %10) #3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%11 = load i32*, i32** %6, align 8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
%12 = getelementptr inbounds i32, i32* %11, i64 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i32 4, i32* %12, align 4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ret i32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;First, the program accesses more memory than what was allocated by the programmer through calls to &lt;code&gt;malloc&lt;&#x2F;code&gt;. If MemPass compared the address accessed by one of the first &lt;code&gt;store&lt;&#x2F;code&gt; instructions with the allocation hash, it wouldn&#x27;t find the address and would emit a false-positive warning.&lt;&#x2F;p&gt;
&lt;p&gt;In reality, programs also allocate stack frame memory with &lt;code&gt;alloca&lt;&#x2F;code&gt; instructions. In order to handle these extra allocations, MemPass adds extra instrumentation here, and adds the stack frame addresses&#x2F;sizes to the allocation table.&lt;&#x2F;p&gt;
&lt;p&gt;However, there&#x27;s another problem. Some of the load and store instructions use pointers of arbitrary types. If MemPass doesn&#x27;t know what the pointer types are, it can&#x27;t pass those addresses to the runtime library for evaluation.&lt;&#x2F;p&gt;
&lt;p&gt;A solution that MemPass employs is to insert &lt;code&gt;bitcast&lt;&#x2F;code&gt; instructions after every load or store instruction, converting the address pointer from its arbitrary type to an 8-bit integer pointer. Since we&#x27;re only comparing addresses and not the actual values at these addresses, this should work somewhat well.&lt;&#x2F;p&gt;
&lt;p&gt;With &lt;code&gt;i8*&lt;&#x2F;code&gt; pointers, all MemPass needs is the size of the memory chunk that a load or store plans to interact with. While this data is not immediately accessible, LLVM provides a handy DataLayout class. After grabbing the type of the element that the &lt;em&gt;original&lt;&#x2F;em&gt; pointer points to, MemPass can extract its size and pass that to our library functions.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we need to actually check if we are accessing memory that is available as per the allocation table. MemPass takes the difference of the address in question with every pointer address in the allocation table, and compares it to the appropriate sizes. If the address is within the bounds of a chunk of memory, then we are fine. Otherwise, the program will emit a use after free warning, and continue to look for more vulnerabilities.&lt;&#x2F;p&gt;
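&lt;p&gt;The bounds check described above can be modelled as a linear scan over the allocation table. The sketch below is a Python approximation under our own assumptions: &lt;code&gt;find_block&lt;&#x2F;code&gt; is a hypothetical helper, and &lt;code&gt;width&lt;&#x2F;code&gt; stands for the access size that MemPass obtains from LLVM&#x27;s DataLayout.&lt;&#x2F;p&gt;

```python
def find_block(live, addr, width):
    """Return the base of the live block containing [addr, addr + width), or None."""
    for base, size in live.items():
        offset = addr - base
        # The access must start inside the block and must not run past its end.
        if offset in range(size) and offset + width in range(size + 1):
            return base
    return None


live = {0x1000: 16}                  # one 16-byte heap block, e.g. four ints
print(find_block(live, 0x1000, 4))   # store to ptr[0]
print(find_block(live, 0x100C, 4))   # the last int in the block
print(find_block(live, 0x100E, 4))   # straddles the end of the block
live.clear()                         # models free(ptr)
print(find_block(live, 0x1000, 4))   # use after free: no live block matches
```

&lt;p&gt;Because every load and store triggers a scan of the whole table, the cost of this check grows with the number of live allocations.&lt;&#x2F;p&gt;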
&lt;h2 id=&quot;program-termination&quot;&gt;Program Termination&lt;&#x2F;h2&gt;
&lt;p&gt;On program termination, we need to check the allocation table for any remaining memory that has not been freed. In order to differentiate stack memory that was allocated with &lt;code&gt;alloca&lt;&#x2F;code&gt;, any memory allocated with &lt;code&gt;malloc&lt;&#x2F;code&gt; is written to a separate file, along with any pointer addresses passed to &lt;code&gt;free&lt;&#x2F;code&gt;. Now, MemPass just searches this file and emits any malloced addresses that were not freed. This is compiled into a final report, listing all of the memory safety vulnerabilities (among double free, use after free, and memory leak) that were detected throughout the execution of this program, along with their line numbers.&lt;&#x2F;p&gt;
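&lt;p&gt;The end-of-run leak check amounts to replaying the logged heap events and reporting whatever is still allocated. The following Python sketch assumes a simple in-memory event list; the record layout and the &lt;code&gt;find_leaks&lt;&#x2F;code&gt; name are our own illustration, not the file format MemPass writes.&lt;&#x2F;p&gt;

```python
def find_leaks(events):
    """Replay ("malloc", addr, size) and ("free", addr) records; return unfreed blocks."""
    live = {}
    for event in events:
        if event[0] == "malloc":
            live[event[1]] = event[2]
        elif event[0] == "free":
            live.pop(event[1], None)
    return live


log = [
    ("malloc", 0x1000, 16),
    ("malloc", 0x2000, 1),
    ("free", 0x1000),
]
for addr, size in find_leaks(log).items():
    print("leak:", size, "bytes at", hex(addr))
```

&lt;p&gt;Replaying events in order, rather than comparing two sets of addresses, keeps the result correct when the allocator hands out the same address again after a free.&lt;&#x2F;p&gt;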
&lt;h2 id=&quot;implementation-extras&quot;&gt;Implementation extras&lt;&#x2F;h2&gt;
&lt;p&gt;Another possible implementation scheme we considered was the use of a &lt;em&gt;deallocation&lt;&#x2F;em&gt; hashmap in addition to the allocation hashmap, to store memory that has been freed. This way, MemPass does not need to add instrumentation after &lt;code&gt;alloca&lt;&#x2F;code&gt; instructions; it just needs to store memory addresses allocated with &lt;code&gt;malloc&lt;&#x2F;code&gt;. However, every time memory is allocated or deallocated, MemPass must check addresses in the other map to ensure there are no overlaps.&lt;&#x2F;p&gt;
&lt;p&gt;Both for the proposed framework and our current one, some sort of segment tree or interval tree could be useful to store memory bounds as intervals and compare them quickly. However, the overhead of building this tree might not be worth the benefits for small programs.&lt;&#x2F;p&gt;
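&lt;p&gt;As a rough illustration of the interval idea, the sketch below keeps block base addresses in a sorted list and uses binary search to find the one candidate block for an address. Note that this swaps in a simpler structure (a sorted list with Python&#x27;s &lt;code&gt;bisect&lt;&#x2F;code&gt; module) for the tree mentioned above, assumes live blocks never overlap, and uses a hypothetical &lt;code&gt;IntervalIndex&lt;&#x2F;code&gt; class that is not part of MemPass.&lt;&#x2F;p&gt;

```python
import bisect


class IntervalIndex:
    """Sorted, non-overlapping memory blocks with binary-search lookup."""

    def __init__(self):
        self.bases = []   # base addresses, kept sorted
        self.sizes = {}   # base address: size of the block

    def add(self, base, size):
        bisect.insort(self.bases, base)
        self.sizes[base] = size

    def remove(self, base):
        self.bases.remove(base)
        del self.sizes[base]

    def contains(self, addr):
        # Find the closest base at or below addr, then check the block length.
        i = bisect.bisect_right(self.bases, addr)
        if i == 0:
            return False
        base = self.bases[i - 1]
        return addr - base in range(self.sizes[base])


index = IntervalIndex()
index.add(0x1000, 16)
index.add(0x2000, 8)
print(index.contains(0x1008))  # inside the first block
print(index.contains(0x1010))  # one byte past the first block
index.remove(0x1000)
print(index.contains(0x1008))  # the block is gone
```

&lt;p&gt;Lookups take logarithmic time in the number of live blocks, while insertion and removal in a Python list remain linear, so a balanced tree would still be the better fit for allocation-heavy programs.&lt;&#x2F;p&gt;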
&lt;p&gt;Another implementation idea that would have been much more effective in hindsight would be to find some way to invalidate pointers after they are freed. This way, MemPass does not need to add instrumentation after every load and store (the program can exit when accessing a specific &amp;quot;invalid&amp;quot; pointer).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;When evaluating, we aimed to catch all the use-after-free, double-free, and memory leak errors that we could. We wrote a series of small correctness tests first to verify that our algorithm worked as expected. Then, we selected a series of benchmark tests, checking both how many bugs MemPass was able to catch and the runtime overhead of our instrumentation.&lt;&#x2F;p&gt;
&lt;p&gt;Our testing procedure is as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Translate each test file to LLVM IR with &lt;code&gt;clang&lt;&#x2F;code&gt; and the &lt;code&gt;disable-O0-optnone&lt;&#x2F;code&gt; flag.&lt;&#x2F;li&gt;
&lt;li&gt;Run MemPass on the new LLVM IR with the &lt;code&gt;opt&lt;&#x2F;code&gt; tool.&lt;&#x2F;li&gt;
&lt;li&gt;Translate runtime libraries to LLVM IR with &lt;code&gt;clang&lt;&#x2F;code&gt; and the &lt;code&gt;O3&lt;&#x2F;code&gt; flag.&lt;&#x2F;li&gt;
&lt;li&gt;Use &lt;code&gt;llvm-link&lt;&#x2F;code&gt; to link all files together.&lt;&#x2F;li&gt;
&lt;li&gt;Compile this final, linked file with &lt;code&gt;clang&lt;&#x2F;code&gt; and run it with the bash &lt;code&gt;time&lt;&#x2F;code&gt; tool. Log the user time measurements (not the sys or real).&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;All tests were run on an Intel Core i7-7700HQ CPU @ 2.80GHz with 16GB of RAM, running Ubuntu on WSL.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;benchmark-testing-runtime-overhead&quot;&gt;Benchmark Testing: Runtime Overhead&lt;&#x2F;h3&gt;
&lt;p&gt;In order to evaluate the overhead cost due to MemPass, we instrumented a subset of the LLVM Test Suite (specifically the Stanford Test Suite). Each benchmark was run 5 times, and the average and standard deviation were recorded in the following tables.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Program (Original)&lt;&#x2F;th&gt;&lt;th&gt;Average (s)&lt;&#x2F;th&gt;&lt;th&gt;SD (s)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;BubbleSort&lt;&#x2F;td&gt;&lt;td&gt;0.048&lt;&#x2F;td&gt;&lt;td&gt;0.009&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;FloatMM&lt;&#x2F;td&gt;&lt;td&gt;1.220&lt;&#x2F;td&gt;&lt;td&gt;0.086&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;IntMM&lt;&#x2F;td&gt;&lt;td&gt;0.000&lt;&#x2F;td&gt;&lt;td&gt;0.000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Oscar&lt;&#x2F;td&gt;&lt;td&gt;0.010&lt;&#x2F;td&gt;&lt;td&gt;0.000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Perm&lt;&#x2F;td&gt;&lt;td&gt;0.030&lt;&#x2F;td&gt;&lt;td&gt;0.000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Queens&lt;&#x2F;td&gt;&lt;td&gt;0.010&lt;&#x2F;td&gt;&lt;td&gt;0.000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Quicksort&lt;&#x2F;td&gt;&lt;td&gt;0.064&lt;&#x2F;td&gt;&lt;td&gt;0.004&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;RealMM&lt;&#x2F;td&gt;&lt;td&gt;0.010&lt;&#x2F;td&gt;&lt;td&gt;0.000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Towers&lt;&#x2F;td&gt;&lt;td&gt;0.028&lt;&#x2F;td&gt;&lt;td&gt;0.009&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Treesort&lt;&#x2F;td&gt;&lt;td&gt;0.106&lt;&#x2F;td&gt;&lt;td&gt;0.019&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Program (MemPass)&lt;&#x2F;th&gt;&lt;th&gt;Average (s)&lt;&#x2F;th&gt;&lt;th&gt;SD (s)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;BubbleSort&lt;&#x2F;td&gt;&lt;td&gt;44.834&lt;&#x2F;td&gt;&lt;td&gt;0.200&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;FloatMM&lt;&#x2F;td&gt;&lt;td&gt;166.354&lt;&#x2F;td&gt;&lt;td&gt;2.174&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;IntMM&lt;&#x2F;td&gt;&lt;td&gt;0.244&lt;&#x2F;td&gt;&lt;td&gt;0.012&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Oscar&lt;&#x2F;td&gt;&lt;td&gt;1.270&lt;&#x2F;td&gt;&lt;td&gt;0.016&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Perm&lt;&#x2F;td&gt;&lt;td&gt;2.322&lt;&#x2F;td&gt;&lt;td&gt;0.186&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Queens&lt;&#x2F;td&gt;&lt;td&gt;8.198&lt;&#x2F;td&gt;&lt;td&gt;0.252&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Quicksort&lt;&#x2F;td&gt;&lt;td&gt;19.060&lt;&#x2F;td&gt;&lt;td&gt;0.240&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;RealMM&lt;&#x2F;td&gt;&lt;td&gt;0.446&lt;&#x2F;td&gt;&lt;td&gt;0.013&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Towers&lt;&#x2F;td&gt;&lt;td&gt;2.852&lt;&#x2F;td&gt;&lt;td&gt;0.279&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Treesort&lt;&#x2F;td&gt;&lt;td&gt;5.948&lt;&#x2F;td&gt;&lt;td&gt;0.305&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; Some programs had an almost negligible runtime, which we record as 0.000 in the tables.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;By the averages in the tables above, MemPass&#x27;s slowdown ranges from roughly 45x (RealMM) to over 900x (BubbleSort); IntMM&#x27;s baseline was too small to measure, so no ratio can be computed for it. This can be attributed to the fact that MemPass adds instrumentation not only after &lt;code&gt;malloc&lt;&#x2F;code&gt; and &lt;code&gt;free&lt;&#x2F;code&gt; calls, but also after every &lt;code&gt;load&lt;&#x2F;code&gt; and &lt;code&gt;store&lt;&#x2F;code&gt;. The benchmarks in question (especially FloatMM, IntMM, and RealMM) do many matrix computations, which end up blowing up the performance cost of our instrumentation significantly. In hindsight, restricting instrumentation to only &lt;code&gt;malloc&lt;&#x2F;code&gt; and &lt;code&gt;free&lt;&#x2F;code&gt; would significantly reduce the overhead of MemPass. As explained before, we could find some clever way to invalidate pointers once they are freed, thus causing the program to exit when trying to access an invalid location in memory.&lt;&#x2F;p&gt;
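&lt;p&gt;The slowdown factors can be recomputed directly from the two tables above. The short Python script below simply divides the instrumented averages by the original ones; the &lt;code&gt;slowdown&lt;&#x2F;code&gt; helper is our own illustrative name, and a baseline recorded as 0.000 is treated as unmeasurable rather than as a division by zero.&lt;&#x2F;p&gt;

```python
# Average user times in seconds, copied from the two tables above.
original = {
    "BubbleSort": 0.048, "FloatMM": 1.220, "IntMM": 0.000, "Oscar": 0.010,
    "Perm": 0.030, "Queens": 0.010, "Quicksort": 0.064, "RealMM": 0.010,
    "Towers": 0.028, "Treesort": 0.106,
}
mempass = {
    "BubbleSort": 44.834, "FloatMM": 166.354, "IntMM": 0.244, "Oscar": 1.270,
    "Perm": 2.322, "Queens": 8.198, "Quicksort": 19.060, "RealMM": 0.446,
    "Towers": 2.852, "Treesort": 5.948,
}


def slowdown(name):
    """Return instrumented time divided by original time, or None if unmeasurable."""
    base = original[name]
    if base == 0:
        return None
    return mempass[name] / base


for name in original:
    factor = slowdown(name)
    print(name, "n/a" if factor is None else str(round(factor)) + "x")
```

&lt;p&gt;Since several baselines are only a few hundredths of a second, these ratios are sensitive to timer resolution and should be read as rough orders of magnitude.&lt;&#x2F;p&gt;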
&lt;h3 id=&quot;benchmark-testing-correctness&quot;&gt;Benchmark Testing: Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;We evaluated the correctness of MemPass on subsets of the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;regehr&#x2F;itc-benchmarks&quot;&gt;Toyota ITC&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;samate.nist.gov&#x2F;SRD&#x2F;testsuite.php#sardsuites&quot;&gt;SARD-100&lt;&#x2F;a&gt; benchmark tests. The Toyota benchmark tests consist of a family of memory tests, and two test suites that we used are &lt;code&gt;Double Free&lt;&#x2F;code&gt; and &lt;code&gt;Memory Leak&lt;&#x2F;code&gt; tests.  The SARD-100 benchmark tests are similar, and we used the &lt;code&gt;cwe-401-memory-leak&lt;&#x2F;code&gt;, &lt;code&gt;cwe-415-double-free&lt;&#x2F;code&gt; and &lt;code&gt;cwe-416-use-after-free&lt;&#x2F;code&gt; suites.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: We wanted to use the Invalid Memory Access (Use After Free) test set in Toyota as well, but we kept running into segmentation faults with the tests and struggled to debug them.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In general, the &lt;code&gt;Double Free&lt;&#x2F;code&gt; and &lt;code&gt;cwe-415-double-free&lt;&#x2F;code&gt; benchmark tests consist of normal double free errors, freeing in constant&#x2F;variable if statements, freeing in a function, freeing in conditional while loops, and freeing in for loops.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;Memory Leak&lt;&#x2F;code&gt;, &lt;code&gt;cwe-401-memory-leak&lt;&#x2F;code&gt;, and &lt;code&gt;cwe-416-use-after-free&lt;&#x2F;code&gt; benchmark tests consist of a series of tests such as allocating memory without freeing, allocating in conditional statements, freeing based on function return values, allocating memory in mutually recursive functions and various branching scenarios.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;&#x2F;h3&gt;
&lt;p&gt;In &lt;a href=&quot;https:&#x2F;&#x2F;nikolai-kosmatov.eu&#x2F;publications&#x2F;vorobyov_ks_tap_2018.pdf&quot;&gt;Detection of Security Vulnerabilities in C Code using Runtime Verification&lt;&#x2F;a&gt;, the authors provide benchmark test results for E-ACSL, Google Sanitizer, and RV-Match on both the Toyota-ITC and SARD-100 test suites.&lt;&#x2F;p&gt;
&lt;p&gt;With the Toyota dataset, all three of these related memory vulnerability detection tools were run on the dynamic memory tests, which look for errors such as &lt;code&gt;Double Free&lt;&#x2F;code&gt;, &lt;code&gt;Memory Leak&lt;&#x2F;code&gt;, and &lt;code&gt;Null Memory&lt;&#x2F;code&gt;, among many others. The numbers below give the percentage of tests in which each tool correctly detected the expected memory vulnerability.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Defect Type&lt;&#x2F;th&gt;&lt;th&gt;E-ACSL&lt;&#x2F;th&gt;&lt;th&gt;Sanitizer&lt;&#x2F;th&gt;&lt;th&gt;RV-Match&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Dynamic Memory Tests&lt;&#x2F;td&gt;&lt;td&gt;94%&lt;&#x2F;td&gt;&lt;td&gt;78%&lt;&#x2F;td&gt;&lt;td&gt;94%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Unfortunately, the above statistics are not that useful, as our LLVM pass is only able to target double free, memory leaks and use-after-free errors. Still, within the Dynamic Memory tests, we considered the &lt;code&gt;Double Free&lt;&#x2F;code&gt; and &lt;code&gt;Memory Leak&lt;&#x2F;code&gt; suites of tests. &lt;code&gt;Double Free&lt;&#x2F;code&gt; contains 12 cases, and &lt;code&gt;Memory Leak&lt;&#x2F;code&gt; contains 18. Over these test suites we achieved the following results:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dynamic Memory Test&lt;&#x2F;th&gt;&lt;th&gt;Double Free&lt;&#x2F;th&gt;&lt;th&gt;Memory Leak&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;MemPass&lt;&#x2F;td&gt;&lt;td&gt;91.7% (11&#x2F;12)&lt;&#x2F;td&gt;&lt;td&gt;77.8% (14&#x2F;18)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;After looking through the failed test cases, we found that the failed double free test and three of the four failed memory leak tests used either static or global variables. Our somewhat sketchy pointer type conversion from earlier seemed to struggle to convert the types of these variables correctly to 8-bit pointers, and thus MemPass ended up generating many false positives. With a more involved type-checking&#x2F;conversion system (or something else we failed to consider), we would likely pass most of these cases.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, the paper provided a much more granular breakdown of test results on the SARD-100 test suite. Their results, along with ours, are displayed below:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Non-Memory Defects&lt;&#x2F;th&gt;&lt;th&gt;E-ACSL&lt;&#x2F;th&gt;&lt;th&gt;Sanitizer&lt;&#x2F;th&gt;&lt;th&gt;RV-Match&lt;&#x2F;th&gt;&lt;th&gt;MemPass&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;CWE-416: Use After Free&lt;&#x2F;td&gt;&lt;td&gt;100% (6&#x2F;6)&lt;&#x2F;td&gt;&lt;td&gt;100% (6&#x2F;6)&lt;&#x2F;td&gt;&lt;td&gt;100% (6&#x2F;6)&lt;&#x2F;td&gt;&lt;td&gt;100% (6&#x2F;6)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CWE-415: Double Free&lt;&#x2F;td&gt;&lt;td&gt;100% (6&#x2F;6)&lt;&#x2F;td&gt;&lt;td&gt;100% (6&#x2F;6)&lt;&#x2F;td&gt;&lt;td&gt;67% (4&#x2F;6)&lt;&#x2F;td&gt;&lt;td&gt;67% (4&#x2F;6)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;CWE-401: Memory Leak&lt;&#x2F;td&gt;&lt;td&gt;100% (5&#x2F;5)&lt;&#x2F;td&gt;&lt;td&gt;80% (4&#x2F;5)&lt;&#x2F;td&gt;&lt;td&gt;60% (3&#x2F;5)&lt;&#x2F;td&gt;&lt;td&gt;100% (5&#x2F;5)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Once again, the double free test cases that we fail are related to casting issues generating false positives.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; Our implementation attempts to achieve some sense of completeness, as we only emit a warning if a program tries to interact with memory that was not allocated to begin with. However, this dynamic analysis comes at the cost of soundness, especially since we only interact with one execution path at a time. In addition, due to casting errors we can end up with some false positives in certain cases.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h3 id=&quot;dynamic-vs-static-analysis&quot;&gt;Dynamic vs Static Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;While a dynamic analysis is certainly interesting and useful, it&#x27;s difficult to analyze certain memory bugs such as memory leaks, since they cannot be detected until the program terminates. In addition, a dynamic analysis can only check individual executions of a program, and therefore might miss bugs with programs that have a large number of inputs.&lt;&#x2F;p&gt;
&lt;p&gt;A static analysis that uses some sort of use-def chain would be another interesting way to triage these vulnerabilities, and would relax much of the overhead that our method produces. In addition, it would be able to analyze all possible execution paths at once, and therefore complete a more &amp;quot;sound&amp;quot; analysis of the various bugs that may be present in the program.&lt;&#x2F;p&gt;
&lt;p&gt;The code for MemPass can be found on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;splashofcrimson&#x2F;memPass&quot;&gt;GitHub&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Dynamic Edge Profiling with Optimal Counter Placement</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/dynamicedgeprofiling/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/dynamicedgeprofiling/</guid>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;For this project, we wanted to implement an LLVM pass that dynamically profiled basic block edges. The naïve implementation would be to add a profiling mechanism to every edge in the program, but that comes with a significant amount of overhead. To minimize overhead, we aimed to add profiling to the minimum number of edges necessary to obtain the same amount of profiling information. The key insight is that we can determine whether an edge is traversed using profiling information from preceding and succeeding edges, provided the CFG structure ensures the program can only run through a simple path between those profiled edges.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;algorithm&quot;&gt;Algorithm&lt;&#x2F;h2&gt;
&lt;p&gt;To select the edges to profile, we used Knuth&#x27;s algorithm:&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;img src=&quot;array.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Essentially, this algorithm first finds a spanning tree of the edges in a function and then adds profiling instrumentation (code that logs profiling information) to all edges not in the spanning tree. Note that this algorithm uses the first block in a function as the root of the spanning tree, and that edges are treated as bidirectional when checking for cycles.&lt;&#x2F;p&gt;
&lt;p&gt;This algorithm works because the acyclic nature of the spanning tree entails that there is exactly one path between any two instrumented edges (or the entry or exit of the function). Thus, logging the traversal of one instrumented edge implies traversal of the program through the path from the previously logged edge location to the current one. This also implies that this is the minimum number of instrumented edges, because having one fewer instrumented edge would either create a path that could be traversed without logging any profiling information or create multiple paths that lack sufficient profiling information to deduce which path was taken.&lt;&#x2F;p&gt;
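&lt;p&gt;The selection step can be sketched outside of LLVM in a few lines of Python. This is a minimal illustration rather than the pass itself: the function name &lt;code&gt;choose_instrumented_edges&lt;&#x2F;code&gt;, the integer block ids, and the small example CFG are all hypothetical, and the spanning tree is built with a union-find structure, which is one common way to detect cycles while treating edges as bidirectional.&lt;&#x2F;p&gt;

```python
def choose_instrumented_edges(num_blocks, edges):
    """Return the edges left out of a spanning tree (the ones to instrument).

    Blocks are numbered 0..num_blocks-1 and `edges` is a list of (src, dst)
    pairs. Edges are treated as undirected when checking for cycles.
    """
    parent = list(range(num_blocks))

    def find(x):
        # Walk up to the representative of x's component, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    instrumented = []
    for src, dst in edges:
        a, b = find(src), find(dst)
        if a == b:
            # Adding this edge would close a cycle, so it is a chord.
            instrumented.append((src, dst))
        else:
            parent[a] = b  # The edge joins two components: keep it in the tree.
    return instrumented


# Hypothetical CFG: block 1 is a loop header, and (3, 1) is the back edge.
cfg_edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 1)]
chords = choose_instrumented_edges(4, cfg_edges)
print(chords)
```

&lt;p&gt;For a connected CFG with V blocks and E edges, a spanning tree always has V - 1 edges, so this sketch instruments E - (V - 1) edges regardless of which spanning tree it happens to build.&lt;&#x2F;p&gt;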
&lt;h3 id=&quot;offline-edge-count-extrapolation-algorithm&quot;&gt;Offline Edge Count Extrapolation Algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;After we run our program, we know the number of times our instrumented edges were traversed. In order to determine the number of times the remaining edges were traversed, we need to do some post-processing of our CFG with our runtime results.&lt;&#x2F;p&gt;
&lt;p&gt;The idea behind this algorithm is to iterate over the edges and fill in the edge count until convergence. The program will converge under the assumption that the edge profiling information supplied is sufficient to determine the rest of the edge counts. It is valid to fill in an edge count if there is a simple path between two nodes with known edge count. You can find detailed pseudocode and proofs for this offline algorithm in this &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;pubs&#x2F;2010-04-NeustifterProfiling.pdf&quot;&gt;dissertation&lt;&#x2F;a&gt; by Andreas Neustifter.&lt;&#x2F;p&gt;
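&lt;p&gt;One simple way to realize this kind of fill-in-until-convergence loop is to use flow conservation: at every block, the counts on incoming edges must add up to the counts on outgoing edges. The Python sketch below is an illustration under that assumption and is not the pseudocode from the dissertation. The function name, the example CFG, and the measured counts are hypothetical, the graph is assumed to have no self-loops or parallel edges, and a virtual edge from the exit block back to the entry block carries the number of times the function was invoked.&lt;&#x2F;p&gt;

```python
def extrapolate_counts(edges, known):
    """Fill in missing edge counts using flow conservation at each block.

    `edges` is a list of (src, dst) pairs and `known` maps some of them to
    measured counts. Returns a dict with a count for every edge it can solve.
    """
    counts = dict(known)
    blocks = {block for edge in edges for block in edge}
    changed = True
    while changed:
        changed = False
        for block in blocks:
            incoming = [e for e in edges if e[1] == block]
            outgoing = [e for e in edges if e[0] == block]
            unknown = [e for e in incoming + outgoing if e not in counts]
            if len(unknown) != 1:
                continue  # Either fully known or not yet solvable.
            missing = unknown[0]
            flow_in = sum(counts[e] for e in incoming if e in counts)
            flow_out = sum(counts[e] for e in outgoing if e in counts)
            # Flow in equals flow out, so the gap is the missing count.
            if missing in incoming:
                counts[missing] = flow_out - flow_in
            else:
                counts[missing] = flow_in - flow_out
            changed = True
    return counts


# Hypothetical CFG with a virtual exit-to-entry edge (3, 0).
edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 1), (3, 0)]
measured = {(2, 3): 4, (3, 1): 9, (3, 0): 1}
all_counts = extrapolate_counts(edges, measured)
print(all_counts)
```

&lt;p&gt;Each pass either solves at least one more edge or stops, so the loop terminates. If the measured edges are not sufficient, some edges are simply left out of the result.&lt;&#x2F;p&gt;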
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;We wanted our instrumentation implementation to be as noninvasive as possible for the base code, as pervasive changes could make the LLVM IR harder to reason about and could prevent future optimizations. To achieve that, we implemented the bulk of the profiling information logging in a separate runtime library file. This strategy allows us to reduce instrumentation to two function calls, one to &lt;code&gt;logsrc(num1)&lt;&#x2F;code&gt; in the edge&#x27;s &amp;quot;source&amp;quot; block and one to &lt;code&gt;logdest(num2)&lt;&#x2F;code&gt; in the &amp;quot;destination&amp;quot; block, both of which reside in the library file. These function calls take a block-unique integer (&lt;code&gt;num1&lt;&#x2F;code&gt; and &lt;code&gt;num2&lt;&#x2F;code&gt; above) generated during the pass as their argument, meaning that we can identify an edge traversal with a pair of &lt;code&gt;logsrc(num1), logdest(num2)&lt;&#x2F;code&gt; function calls. The unique integers simply reflect the order in which the blocks were iterated through in our pass.&lt;&#x2F;p&gt;
&lt;p&gt;This approach allows us to leave the CFG unchanged, which would not be the case if we took the straightforward approach of adding a new block in the &amp;quot;middle&amp;quot; of every edge to hold the profiling code.&lt;&#x2F;p&gt;
&lt;p&gt;Since the function arguments are unique for each block, this means at most one &lt;code&gt;logsrc(num1)&lt;&#x2F;code&gt; and &lt;code&gt;logdest(num2)&lt;&#x2F;code&gt; is necessary in each block. To log each edge traversal, the runtime library keeps track of the order in which those functions are called as it is possible to trigger one of those functions without having traversed an edge with profiling instrumentation, though this does &lt;em&gt;not&lt;&#x2F;em&gt; impact correctness as the functions would not be triggered as a &lt;code&gt;logsrc&lt;&#x2F;code&gt;-&lt;code&gt;logdest&lt;&#x2F;code&gt; pair (and thus would not cause profiling information to be logged). For example, a natural loop&#x27;s backedge, from block &lt;code&gt;A&lt;&#x2F;code&gt; to block &lt;code&gt;B&lt;&#x2F;code&gt; (where &lt;code&gt;B&lt;&#x2F;code&gt; is the loop header) could be instrumented. In this case, &lt;code&gt;A&lt;&#x2F;code&gt; would contain a &lt;code&gt;logsrc(num1)&lt;&#x2F;code&gt; call and &lt;code&gt;B&lt;&#x2F;code&gt; would contain a &lt;code&gt;logdest(num2)&lt;&#x2F;code&gt; call. &lt;code&gt;B&lt;&#x2F;code&gt;&#x27;s &lt;code&gt;logdest(num2)&lt;&#x2F;code&gt; call would be triggered whenever the program enters the loop header, but since that call was not immediately preceded by the matching &lt;code&gt;logsrc(num1)&lt;&#x2F;code&gt; from &lt;code&gt;A&lt;&#x2F;code&gt;, the runtime library will know that the &lt;code&gt;A&lt;&#x2F;code&gt;→&lt;code&gt;B&lt;&#x2F;code&gt; edge traversal did not actually happen so no profiling information is logged. 
In some rare cases, there may be a path from a block with &lt;code&gt;logsrc(num1)&lt;&#x2F;code&gt; to &lt;code&gt;logdest(num2)&lt;&#x2F;code&gt; that does not go along the instrumented edge (i.e., it takes another route through multiple un-instrumented blocks), so in those cases we add a function call to &lt;code&gt;logdisrupt(num3)&lt;&#x2F;code&gt; in the first node on the un-instrumented path to signal that an alternative route was taken.&lt;&#x2F;p&gt;
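&lt;p&gt;The pairing logic can be modeled with a small Python class. This is a toy model rather than the runtime library itself: the function names &lt;code&gt;logsrc&lt;&#x2F;code&gt;, &lt;code&gt;logdest&lt;&#x2F;code&gt;, and &lt;code&gt;logdisrupt&lt;&#x2F;code&gt; come from the description above, while the class name &lt;code&gt;EdgeLog&lt;&#x2F;code&gt;, the block ids, and the choice to clear the pending source after every destination call are assumptions made for illustration.&lt;&#x2F;p&gt;

```python
from collections import Counter


class EdgeLog:
    """Toy model of the runtime library's source/destination pairing."""

    def __init__(self, instrumented_edges):
        self.instrumented = set(instrumented_edges)
        self.pending_src = None  # Block id of the most recent logsrc call.
        self.counts = Counter()

    def logsrc(self, block_id):
        self.pending_src = block_id

    def logdest(self, block_id):
        edge = (self.pending_src, block_id)
        if edge in self.instrumented:
            self.counts[edge] += 1
        self.pending_src = None  # A destination call always ends the pair.

    def logdisrupt(self, block_id):
        # An alternative, un-instrumented route was taken: drop the pairing.
        self.pending_src = None


# Hypothetical loop whose back edge from block 6 to header block 4 is
# instrumented.
log = EdgeLog([(6, 4)])
log.logdest(4)       # First entry into the header, not via block 6.
for _ in range(5):
    log.logsrc(6)    # Bottom of the loop body.
    log.logdest(4)   # Back at the header via the instrumented edge.
print(log.counts[(6, 4)])
```

&lt;p&gt;The first &lt;code&gt;logdest(4)&lt;&#x2F;code&gt; call is ignored because no matching source call precedes it, so only the five back-edge traversals are counted.&lt;&#x2F;p&gt;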
&lt;p&gt;As the profiling logic and data are written in a runtime library disjoint from the LLVM pass, this information can be outputted however the user wants with minor edits to the library file. We chose to print it to standard out for simplicity. There is also the matter of correlating the block numbers assigned during the LLVM pass to the actual blocks, and this is done by printing the block-to-number matching for the source and destination blocks of each instrumented edge to standard out during the pass.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;To check correctness, we ran a subset of C benchmarks from the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;TestSuiteGuide.html&quot;&gt;LLVM test suite&lt;&#x2F;a&gt;. We collected the tabulated results and CFGs to manually verify that our instrumentation was on an optimal number of edges with correct edge counts. We ran the algorithm described above to extrapolate the counts for the uninstrumented edges. We include one test here for visibility.
Consider the following CFGs for a program with two functions, main and testFunc.&lt;&#x2F;p&gt;
&lt;img src=&quot;main.png&quot;&#x2F;&gt;
&lt;img src=&quot;testfunc.png&quot;&#x2F;&gt;
&lt;p&gt;We run our instrumented program with input argument &lt;code&gt;5&lt;&#x2F;code&gt; and obtain the table below.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Edge&lt;&#x2F;th&gt;&lt;th&gt;Count&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;[6-&amp;gt;4]&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[9-&amp;gt;10]&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;When we ran our pass, we also output the edges that were instrumented.
These edges were &lt;code&gt;[0-&amp;gt;2]&lt;&#x2F;code&gt;, &lt;code&gt;[6-&amp;gt;4]&lt;&#x2F;code&gt;, and &lt;code&gt;[9-&amp;gt;10]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To verify these results, we consider our program. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;#include &amp;lt;stdio.h&amp;gt;

void testFunc(int n){
    if(n%2==1) printf(&amp;quot;The number is odd\n&amp;quot;);
}

int main(int num) {
    for(int i =0; i&amp;lt;5; i++){
        printf(&amp;quot;Iteration %i\n&amp;quot;, i);
    }
    if(num%2 == 0){
        printf(&amp;quot;The number is even\n&amp;quot;);
    }
    else{
        testFunc(num);
    }
    printf(&amp;quot;%i\n&amp;quot;, num + 2);
    return 0;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In order to automatically verify these results, we implement a naïve LLVM pass for edge profiling that instruments every edge. We then compare the results collected and extrapolated from optimal edge profiling with the results from running the naïve implementation.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the results from running the naïve implementation.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Edge&lt;&#x2F;th&gt;&lt;th&gt;Count&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;[1-&amp;gt;2]&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[3-&amp;gt;4]&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[4-&amp;gt;5]&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[5-&amp;gt;6]&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[6-&amp;gt;4]&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[7-&amp;gt;9]&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[9-&amp;gt;10]&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We confirm that our results for optimal placement match the results from naïve placement. In order to verify the remaining edges, we run the extrapolation algorithm discussed above by hand. The next step is to implement the extrapolation and automatically verify that the optimal placement pass produces the same results as the naïve pass for all edges.&lt;&#x2F;p&gt;
&lt;p&gt;We showed one test here, and performed similar analyses on several tests we designed with tricky CFGs in terms of loops and function calls.&lt;&#x2F;p&gt;
&lt;p&gt;To measure optimality in terms of the number of instrumented edges, we run the algorithm by hand on a few of our test CFGs. We find that our implementation results in the same number of instrumented edges as we expect. While we would like to automate this process, we were unable to find an LLVM pass that did similar edge count profiling to compare against.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;p&gt;We used C benchmarks from the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;TestSuiteGuide.html&quot;&gt;LLVM test suite&lt;&#x2F;a&gt;. Specifically, we used single-source benchmarks, as we did not want to support instrumenting external files&#x2F;libraries and complex build processes. There were 66 total (we excluded a few that had unusual clang compilation or linking errors).
For each program, we measured the performance after these three stages of passes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;original code → LLVM &lt;code&gt;-O3&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;original code → LLVM &lt;code&gt;-O3&lt;&#x2F;code&gt; → Our Profiling Pass&lt;&#x2F;li&gt;
&lt;li&gt;original code → LLVM &lt;code&gt;-O3&lt;&#x2F;code&gt; → Our Profiling Pass → LLVM &lt;code&gt;-O3&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We ran each stage 3 times per program. &lt;&#x2F;p&gt;
&lt;p&gt;On average, we found that the profiled code takes 5x as long to run as the unprofiled code, and the profiled code re-optimized with &lt;code&gt;-O3&lt;&#x2F;code&gt; takes 4.8x as long.&lt;&#x2F;p&gt;
&lt;p&gt;Below are our performance results. The data has been normalized to the average runtime for each program&#x27;s original code after LLVM &lt;code&gt;-O3&lt;&#x2F;code&gt;. This means that if a bar&#x27;s height is 2, that stage&#x27;s average runtime was 2x that of the optimized but unprofiled code.&lt;&#x2F;p&gt;
&lt;img src=&quot;results.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;A next step to further improve performance would be to place counters on edges that are less likely to be executed, decreasing the number of dynamic instructions. This can be done by additional profiling and then creating a maximum spanning tree using the estimated edge weights instead of an arbitrary spanning tree.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;challenges&quot;&gt;Challenges&lt;&#x2F;h2&gt;
&lt;p&gt;One of the primary challenges of this project was correctly installing LLVM on each of our machines with the correct version and &lt;code&gt;&#x2F;include&#x2F;&lt;&#x2F;code&gt; directories. When planning our design, we utilized several online resources that varied quite a bit in terms of targeted LLVM version. Available source files change quite a bit, and therefore we realized we needed to be more aware of the version of LLVM these resources were using when reading them. &lt;&#x2F;p&gt;
&lt;p&gt;The other main challenge was figuring out how to work with LLVM. Part of this work encompassed determining the tools and information we had access to and what information we needed to collect manually. We found the most effective way to learn how to work with LLVM was from experimenting with examples. In particular, when implementing dynamic edge profiling, we found passes that implement other dynamic passes like dynamic instruction count to be especially helpful.&lt;&#x2F;p&gt;
&lt;p&gt;The source code for the LLVM pass detailed above can be found on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kavoor&#x2F;llvm-pass-skeleton&quot;&gt;GitHub&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Implementing Efficient Path Profiling</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/efficient-path-prof/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/efficient-path-prof/</guid>
                <description>&lt;h2 id=&quot;the-goal&quot;&gt;The Goal&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of this project was to implement the path profiling algorithm from the &amp;quot;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=243857&quot;&gt;Efficient Path Profiling&lt;&#x2F;a&gt;&amp;quot; paper by Thomas Ball and James R. Larus as an LLVM pass. The LLVM pass should generate profiling instrumentation for each function in the source code and then generate a profiling report after the code is finished running.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;The paper already gave good explanations and some pseudocode for how to implement the path profiling instrumentation algorithm. I will now briefly describe what the algorithm is actually doing. The algorithm presented in the paper assigns a number to each path through a CFG from a designated entry block to a designated exit block. In the entry block of the CFG, instrumentation is inserted to set a path register &lt;code&gt;r = 0&lt;&#x2F;code&gt;. Then, the algorithm figures out which edges to instrument so that &lt;code&gt;r&lt;&#x2F;code&gt; equals the number of the path that was taken once execution reaches the exit block of the CFG. Then, some code is added to the exit block of the CFG to increment a value in a path table like so: &lt;code&gt;path_table[r]++&lt;&#x2F;code&gt;. While the program is running, the instrumentation accumulates a path profile of the execution of the program through the CFG. Then, when the program exits, this information can be saved somewhere and analyzed.&lt;&#x2F;p&gt;
&lt;p&gt;There are a few more details when the algorithm deals with loops. Whenever a back edge is taken the algorithm considers it a termination of the current execution through the CFG and then begins profiling a new execution of the CFG starting from the loop header. For example, consider the following CFG:&lt;&#x2F;p&gt;
&lt;img src=&quot;simple_cfg.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;There are 4 paths in this graph. There are two that start at block A: A,B,C and A,B,D. The other 2 start at the loop header, block B: B,C and B,D. The source of a back edge is similar to an exit block in that it can terminate a path. Then, the back edge also starts a new path.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;algorithm-steps&quot;&gt;Algorithm steps&lt;&#x2F;h3&gt;
&lt;p&gt;The version of the paper&#x27;s algorithm that I implemented has 6 steps in order to accomplish the above goal on any CFG with back edges. I will also briefly discuss the ways in which this differs from the algorithm in the paper.&lt;&#x2F;p&gt;
&lt;p&gt;First, the algorithm verifies that the CFG has exactly one block with no parents, which it denotes as the entry block, and one block with no children, which it denotes as the exit block.&lt;&#x2F;p&gt;
&lt;p&gt;Second, the algorithm removes all back edges from the CFG and for each back edge it adds two edges: one edge going from the entry block to the destination of the back edge (the loop header) and one edge going from the source of the back edge (the bottom of the loop) to the exit block. This creates a DAG out of the CFG. Adding these two edges allows the algorithm to simply count the number of paths in the graph by counting the number of paths from the entry to exit block in the DAG. This count will include the loop paths from the entry block to the bottom of a loop and the paths that start from loop headers. Sticking with our small graph example from above, this is the DAG that would be created by the algorithm:&lt;&#x2F;p&gt;
&lt;img src=&quot;simple_dag.png&quot; style=&quot;width: 30%&quot;&gt;
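&lt;p&gt;The back edge rewrite in this step can be sketched in Python as follows. This is an illustration rather than the LLVM pass: the function name &lt;code&gt;cfg_to_dag&lt;&#x2F;code&gt; is hypothetical, the back edges are assumed to have been identified already, and the example CFG (with a back edge from C to B) is my reading of the four paths listed earlier.&lt;&#x2F;p&gt;

```python
def cfg_to_dag(edges, back_edges, entry, exit_block):
    """Replace each back edge with entry-to-header and bottom-to-exit edges."""
    dag = [e for e in edges if e not in back_edges]
    for src, dst in back_edges:
        dag.append((entry, dst))       # Paths that start at the loop header.
        dag.append((src, exit_block))  # Paths that end at the loop bottom.
    return dag


# Assumed shape of the small example: A is the entry, D is the exit, and
# the edge from C back to B is the loop back edge.
cfg = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "B")]
dag = cfg_to_dag(cfg, [("C", "B")], "A", "D")
print(dag)
```

&lt;p&gt;Note that the result contains two edges from A to B, which is why the DAG has to be treated as a multigraph.&lt;&#x2F;p&gt;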
&lt;p&gt;Third, the algorithm performs a topological sort of this DAG and then uses the topological sort to compute a mapping &lt;code&gt;Val&lt;&#x2F;code&gt; from edges to integers. The &lt;code&gt;Val&lt;&#x2F;code&gt; mappings are computed such that each path from the entry block to the exit block has a unique &lt;code&gt;Val&lt;&#x2F;code&gt; sum. This is the algorithm for computing &lt;code&gt;Val&lt;&#x2F;code&gt; given in the paper:&lt;&#x2F;p&gt;
&lt;img src=&quot;val_algo.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;Informally, for each block &lt;code&gt;v&lt;&#x2F;code&gt; in reverse topological order, the algorithm iteratively computes the number of paths from &lt;code&gt;v&lt;&#x2F;code&gt; to the exit block by summing the path counts of the children of &lt;code&gt;v&lt;&#x2F;code&gt;. For each edge &lt;code&gt;e&lt;&#x2F;code&gt; from &lt;code&gt;v&lt;&#x2F;code&gt; to a child of &lt;code&gt;v&lt;&#x2F;code&gt;, &lt;code&gt;Val(e)&lt;&#x2F;code&gt; is set to the running path count of &lt;code&gt;v&lt;&#x2F;code&gt; at the moment that edge is processed. There is a proof in the paper of why this algorithm generates a unique &lt;code&gt;Val&lt;&#x2F;code&gt; sum for every path through the CFG. An example &lt;code&gt;Val(e)&lt;&#x2F;code&gt; mapping is shown for the small sample graph we have been working with:&lt;&#x2F;p&gt;
&lt;img src=&quot;simple_val.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;If we consider one of the edges from A to B to represent the loop back edge, then the &lt;code&gt;Val&lt;&#x2F;code&gt; sums of the paths are:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Path&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Val&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,D&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,C&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;B,C&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;B,D&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The edge with &lt;code&gt;Val = 2&lt;&#x2F;code&gt; from A to B represents reinitializing the path number register &lt;code&gt;r&lt;&#x2F;code&gt; to 2 when we take the back edge in the loop.&lt;&#x2F;p&gt;
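&lt;p&gt;The &lt;code&gt;Val&lt;&#x2F;code&gt; computation described above can be reproduced with a short Python sketch. The function name &lt;code&gt;compute_val&lt;&#x2F;code&gt; and the edge ordering are assumptions chosen for illustration (a different order of the outgoing edges gives a different but equally valid numbering), and edges are keyed by their position in the list because the DAG contains two parallel edges from A to B.&lt;&#x2F;p&gt;

```python
def compute_val(edges, reverse_topo_order, exit_block):
    """Assign Val to each DAG edge so every entry-to-exit path sums uniquely.

    `edges` is a list of (src, dst) pairs; parallel edges are allowed, so
    values are keyed by the index of the edge in the list.
    """
    num_paths = {}
    val = {}
    for block in reverse_topo_order:
        if block == exit_block:
            num_paths[block] = 1
            continue
        num_paths[block] = 0
        for index, (src, dst) in enumerate(edges):
            if src == block:
                # The edge gets the number of paths counted so far at src.
                val[index] = num_paths[block]
                num_paths[block] += num_paths[dst]
    return val, num_paths


# Assumed DAG for the running example. Edge 4 is the extra A-to-B edge that
# stands in for the back edge, and edge 3 is the loop-bottom-to-exit edge.
dag = [("A", "B"), ("B", "D"), ("B", "C"), ("C", "D"), ("A", "B")]
val, num_paths = compute_val(dag, ["D", "C", "B", "A"], "D")
print(val, num_paths["A"])
```

&lt;p&gt;With this ordering the sketch assigns 2 to the extra A-to-B edge and 1 to the B-to-C edge, which gives the four path sums 0, 1, 2, and 3 from the table above.&lt;&#x2F;p&gt;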
&lt;p&gt;Fourth, the algorithm generates a maximum spanning tree of the graph obtained by taking the DAG and adding an edge from the exit to the entry block (and also removing the directionality of the edges). This does create a cycle in the graph, so it is no longer a DAG, but the next step requires the graph to have this edge to work correctly. The maximum spanning tree of the graph is calculated by starting with an empty set of edges and iteratively adding edges as long as no cycle is created. For the small sample graph we have been using, this would be an example of a maximum spanning tree (the bolded edges are the edges in the tree):&lt;&#x2F;p&gt;
&lt;img src=&quot;simple_mst.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;Fifth, for every chord of the spanning tree (edge not in the maximum spanning tree) the algorithm computes a mapping &lt;code&gt;Inc&lt;&#x2F;code&gt; from edges to increment values. The algorithm to do this was specified in another paper referenced by the &amp;quot;Efficient Path Profiling&amp;quot; paper but is actually quite simple. For every edge &lt;code&gt;e&lt;&#x2F;code&gt; not in the maximum spanning tree, there must exist a path through the spanning tree between the two endpoints. We reinterpret edge &lt;code&gt;e&lt;&#x2F;code&gt; as a directed edge from some block &lt;code&gt;u&lt;&#x2F;code&gt; to some block &lt;code&gt;v&lt;&#x2F;code&gt; in the CFG and find the path through the spanning tree from &lt;code&gt;u&lt;&#x2F;code&gt; to &lt;code&gt;v&lt;&#x2F;code&gt;. We then take the directed sum of all &lt;code&gt;Val(e&#x27;)&lt;&#x2F;code&gt; where &lt;code&gt;e&#x27;&lt;&#x2F;code&gt; is an edge on the path through the spanning tree. By directed sum, I mean that if the path goes along edge &lt;code&gt;e&#x27;&lt;&#x2F;code&gt; in the same direction that &lt;code&gt;e&#x27;&lt;&#x2F;code&gt; has in the CFG, then we add &lt;code&gt;Val(e&#x27;)&lt;&#x2F;code&gt; to &lt;code&gt;Inc(e)&lt;&#x2F;code&gt;. If we go across the edge backwards, we subtract &lt;code&gt;Val(e&#x27;)&lt;&#x2F;code&gt; from &lt;code&gt;Inc(e)&lt;&#x2F;code&gt;. Finally, we add &lt;code&gt;Val((u,v))&lt;&#x2F;code&gt; to &lt;code&gt;Inc(e)&lt;&#x2F;code&gt; to finish the computation of &lt;code&gt;Inc(e)&lt;&#x2F;code&gt;. For the sample graph and the maximum spanning tree computed above, this is the &lt;code&gt;Inc&lt;&#x2F;code&gt; mapping:&lt;&#x2F;p&gt;
&lt;img src=&quot;simple_inc.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;All &lt;code&gt;Inc&lt;&#x2F;code&gt; mappings are required to have the same sum on a path &lt;code&gt;P&lt;&#x2F;code&gt; through the CFG as the sum of the &lt;code&gt;Val&lt;&#x2F;code&gt; mappings for path &lt;code&gt;P&lt;&#x2F;code&gt;. The edge with &lt;code&gt;Inc(e) = -2&lt;&#x2F;code&gt; for the edge from A to B was computed by taking the path from B to A in the spanning tree. &lt;code&gt;Val((A,B))&lt;&#x2F;code&gt; was 2 but because we traversed the edge backwards, we subtracted 2 from &lt;code&gt;Inc&lt;&#x2F;code&gt; of the other edge.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the algorithm inserts instrumentation according to the &lt;code&gt;Inc&lt;&#x2F;code&gt; mapping. For every edge in the original CFG with a non-zero &lt;code&gt;Inc&lt;&#x2F;code&gt; value, the instruction &lt;code&gt;r += Inc(e)&lt;&#x2F;code&gt; is inserted by creating a new basic block between the source and destination of the edge with that instruction. The block is then inserted into the CFG by rerouting the source&#x27;s successor to the new block and adding a jump at the end of the new block to the destination block. The entry block adds an instruction to initialize &lt;code&gt;r = 0&lt;&#x2F;code&gt; and the exit block adds the instruction &lt;code&gt;path_table[r]++&lt;&#x2F;code&gt;. Back edges are a bit more interesting. For back edges we insert the instruction sequence:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;r += backedge_inc
path_table[r]++
r = backedge_reset
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;backedge_inc&lt;&#x2F;code&gt; is set to &lt;code&gt;Inc((entry,backedge.dest))&lt;&#x2F;code&gt;, which is the &lt;code&gt;Inc&lt;&#x2F;code&gt; value of the edge from the entry block to the head of the loop containing the back edge. &lt;code&gt;backedge_reset&lt;&#x2F;code&gt; is set to &lt;code&gt;Inc((backedge.source, exit))&lt;&#x2F;code&gt;, which is the &lt;code&gt;Inc&lt;&#x2F;code&gt; value of the edge from the bottom of the loop to the exit block. This was somewhat underspecified in the paper but I believe it makes sense based on their descriptions of what the two inserted edges for each back edge represent. This is the instrumented CFG of the example we have been looking at:&lt;&#x2F;p&gt;
&lt;img src=&quot;simple_inst.png&quot; style=&quot;width: 50%&quot;&gt;
&lt;p&gt;We can verify that our path mappings are preserved by figuring out the value of &lt;code&gt;r&lt;&#x2F;code&gt; at the end of every path before we do &lt;code&gt;path_table[r]++&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Path&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;r&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,D&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,C&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;B,C&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;B,D&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Also, note that this matches the &lt;code&gt;Val&lt;&#x2F;code&gt; sum for each path computed above.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;differences-between-the-paper-and-this-implementation&quot;&gt;Differences between the paper and this implementation&lt;&#x2F;h3&gt;
&lt;p&gt;First, the paper seems to be a bit misleading when it comes to instrumenting back edges. It seems to imply that whenever you have a back edge, you should always insert the instructions &lt;code&gt;path_table[r]++; r = 0&lt;&#x2F;code&gt;. However, as the simple example above showed and the edge cases below will show, it actually needs to be a bit more complex than just incrementing the table entry and resetting &lt;code&gt;r&lt;&#x2F;code&gt; to 0. The edge cases analyzed below will hopefully show that the following code snippet is actually what back edges should be instrumented with:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;r += backedge_inc
path_table[r]++
r = backedge_reset
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I also diverged from the paper when constructing the maximum spanning tree. The authors used existing edge profiling information to estimate the cost of instrumenting an edge. Because instrumentation is only placed on the chords of the maximum spanning tree, those will be the edges least likely to be taken. When I construct the maximum spanning tree, I don&#x27;t assume any prior profiling information and assume that each edge is equally likely to be taken. This produces an instrumentation with minimal edges but not necessarily the edges that are taken the least. I felt like this was an acceptable compromise in order to keep this project fairly self contained and not rely on importing information from an edge profiler.&lt;&#x2F;p&gt;
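&lt;p&gt;As a rough illustration of the uniform-weight case, the sketch below (in Python for brevity, not the code from the pass itself) splits a CFG&#x27;s edges into spanning tree edges and chords with a small union-find, ignoring edge direction. The function name, the edge list, and the extra exit-to-entry edge are assumptions made for the example; the edge list is the diamond CFG from the correctness tests, reconstructed from its path table. With equal weights every spanning tree is maximal, so which edges end up as chords depends only on the order in which the edges are visited.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def spanning_tree_chords(nodes, edges):
    """Split edges into (tree, chords), treating the graph as undirected."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative of n's component.
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree, chords = [], []
    for src, dst in edges:
        a, b = find(src), find(dst)
        if a == b:
            chords.append((src, dst))  # closes a cycle, so it gets instrumented
        else:
            parent[a] = b
            tree.append((src, dst))
    return tree, chords


# Assumed diamond CFG; the exit-to-entry edge is listed first so it lands in the tree.
NODES = ["A", "B", "C", "D", "E", "F"]
EDGES = [("F", "A"), ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
         ("C", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
tree, chords = spanning_tree_chords(NODES, EDGES)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For a connected graph this always leaves &lt;code&gt;len(edges) - len(nodes) + 1&lt;&#x2F;code&gt; chords, which is 4 for the edge list above. Using real edge frequencies would only change the order in which edges are offered to the tree.&lt;&#x2F;p&gt;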
&lt;p&gt;The paper also introduces a few optimizations for instrumentation placement that I felt were not necessary to implement for this project. Specifically, it discusses moving the initialization of the path number register &lt;code&gt;r&lt;&#x2F;code&gt; from the entry block to the edges that are uniquely reached from the entry block. Additionally, they move the path table count increment up to the edge which uniquely reaches the exit block on every path.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;difficulties-and-edge-cases-in-the-implementation&quot;&gt;Difficulties and edge cases in the implementation&lt;&#x2F;h3&gt;
&lt;p&gt;There were a few parts of the algorithm that were underspecified in the paper, which caused some edge cases and led to some interesting implementation details. The main difficulty was dealing with the DAG that was created after removing the back edges and inserting the 2 new edges: this new graph was a multigraph, which means that there could be multiple edges with the same source and destination block. This meant that edges could not be uniquely identified by their source and destination blocks. Furthermore, because edges in the DAG corresponded directly to edges in the CFG, I needed a way to link edges in different graphs together. This led to a representation of a graph as a list of edge structs, where each edge struct contained a source and destination pointer as well as some additional data if the edge represented a back edge. This representation sacrifices efficiency, since child and parent lookups now need to traverse all edges in the graph, but I believe it makes up for that by being easy to use for the implementation of the path profiling instrumentation algorithm. Specifically, different graphs (like the CFG and the DAG) could have different lists but the elements in those lists were the same, which made it easy to match the edges in the &lt;code&gt;Inc&lt;&#x2F;code&gt; mapping to edges in the CFG (since they were the same edge structs). If, for example, I had chosen to represent the multigraph with a map from basic blocks to a list of the children of that basic block, then it would be difficult to match edges between two modified versions of the same graph. In the end, this representation made the implementation quite clean, so I was happy with it.&lt;&#x2F;p&gt;
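&lt;p&gt;The idea of sharing edge structs between graphs can be sketched in a few lines of Python. This is only an illustration of the representation described above, not the data structure from the pass: the &lt;code&gt;Edge&lt;&#x2F;code&gt; class, its &lt;code&gt;replaces&lt;&#x2F;code&gt; field, and the &lt;code&gt;children&lt;&#x2F;code&gt; helper are hypothetical names, and the block names anticipate the two-back-edge CFG discussed next.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity comparison keeps parallel edges distinct
class Edge:
    src: str
    dst: str
    replaces: Optional["Edge"] = None  # on a dummy edge: the back edge it stands in for


def children(graph, block):
    """Linear scan over every edge, which is the lookup cost mentioned above."""
    return [e.dst for e in graph if e.src == block]


# CFG with two back edges into B (C to B and D to B).
ab, bc, bd, de = Edge("A", "B"), Edge("B", "C"), Edge("B", "D"), Edge("D", "E")
cb, db = Edge("C", "B"), Edge("D", "B")
cfg = [ab, bc, bd, de, cb, db]

# DAG: the same Edge objects minus the back edges, plus two dummy edges per back edge.
dag = [ab, bc, bd, de]
for back in (cb, db):
    dag.append(Edge("A", "B", replaces=back))       # entry to loop head
    dag.append(Edge(back.src, "E", replaces=back))  # back edge source to exit

inc = {e: 0 for e in dag}              # placeholder values, keyed by edge identity
shared = [e for e in cfg if e in inc]  # CFG edges that have an Inc entry
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Here the DAG holds three distinct edges from A to B, and an &lt;code&gt;Inc&lt;&#x2F;code&gt; entry computed on the DAG can be looked up directly with the edge object taken from the CFG.&lt;&#x2F;p&gt;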
&lt;p&gt;One other edge case that I uncovered was how to correctly deal with chords of the maximum spanning tree that had the same source and destination blocks. Consider the following CFG:&lt;&#x2F;p&gt;
&lt;img src=&quot;same_head.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;When we remove the 2 back edges from C to B and D to B, insert the necessary edges for the back edges, and compute the &lt;code&gt;Val&lt;&#x2F;code&gt; mapping for the resulting graph we get:&lt;&#x2F;p&gt;
&lt;img src=&quot;same_head_val.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;If we use this &lt;code&gt;Val&lt;&#x2F;code&gt; mapping as the &lt;code&gt;Inc&lt;&#x2F;code&gt; mapping then we encounter 2 cases not covered in the paper. The first is that we have a loop header (block B) with 2 back edges pointing to it. If we follow the algorithm that I implemented, then we get that we need to reset &lt;code&gt;r&lt;&#x2F;code&gt; to 6 if we go along the back edge from C to B and we reset &lt;code&gt;r&lt;&#x2F;code&gt; to 3 if we take the back edge from D to B. This means that the path from B to C can have 2 possible mappings, 3 or 6, depending on which back edge you took to get to the top of the loop. Initially, I thought that this might be some kind of bug in the algorithm and that the algorithm should only insert the entry to loop header edge if it didn&#x27;t already exist. Then you would only get a single reset value when taking any back edge to the loop header and then the path from B to C would have a single value. However, I realized that these 2 mappings for the same path actually do capture important path information: they describe how you entered a path. In the above example, if path 3 gets incremented then I know I took the path from block B to C and I entered this path by taking the back edge from block D to B. If path 6 gets incremented, then I know that I took the path from B to C but that I entered that path by taking a back edge from C to B. This path entry information might be useful for some kind of profile driven optimization or if you are worried about path coverage. This also shows why non-zero resets are necessary. If the back edge from C to B reset &lt;code&gt;r&lt;&#x2F;code&gt; to zero, then path B,C would have the same value in &lt;code&gt;r&lt;&#x2F;code&gt; as path A,B,C.&lt;&#x2F;p&gt;
&lt;p&gt;The second case is when there are 2 edges from a block to the exit block. This happens with block D: in the DAG there are two edges from D to the exit block, one of which is the edge inserted for the back edge from D to B. The paper does not specify what to do if the edge inserted for a back edge has an &lt;code&gt;Inc&lt;&#x2F;code&gt; value. It seems to imply that you just ignore the value. But if we do this and instrument the back edge from D to B with just &lt;code&gt;path_table[r]++; r = 3&lt;&#x2F;code&gt; then path B,D and path B,D,E have the same &lt;code&gt;r&lt;&#x2F;code&gt; value: 4. This is clearly incorrect, since they are 2 different paths. This means that we need to add &lt;code&gt;r += 1&lt;&#x2F;code&gt; on the back edge from D to B before incrementing the count in &lt;code&gt;path_table&lt;&#x2F;code&gt;. These two observations are what led me to the modified version of the back edge instrumentation that I presented above:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;r += backedge_inc
path_table[r]++
r = backedge_reset
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I believe this is the correct instrumentation based on the examples and reasons given above.&lt;&#x2F;p&gt;
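&lt;p&gt;To make the collision at block D concrete, here is a minimal Python model of the three instrumentation steps. The helper names are hypothetical and a &lt;code&gt;Counter&lt;&#x2F;code&gt; stands in for &lt;code&gt;path_table&lt;&#x2F;code&gt;; the numbers (an &lt;code&gt;r&lt;&#x2F;code&gt; of 4 on reaching D, an increment of 1 and a reset to 3 on the back edge from D to B) are the ones from the example above.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from collections import Counter


def take_back_edge(r, path_table, backedge_inc, backedge_reset):
    """Apply the three instrumentation steps and return the new value of r."""
    r += backedge_inc      # finish numbering the path that ends on this back edge
    path_table[r] += 1     # path_table[r]++
    return backedge_reset  # start numbering the path that begins at the loop head


def take_exit(r, path_table):
    """Count the path that ends at the exit block."""
    path_table[r] += 1


path_table = Counter()
r_at_d = 4  # value of r on reaching D after entering the loop via the D to B back edge

take_exit(r_at_d, path_table)                 # path B,D,E is counted under 4
r = take_back_edge(r_at_d, path_table, 1, 3)  # path B,D is counted under 5
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With &lt;code&gt;backedge_inc&lt;&#x2F;code&gt; set to 0 instead of 1, both calls would bump entry 4, which is exactly the collision described above.&lt;&#x2F;p&gt;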
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;My evaluation consisted of 2 parts. In the first part I developed some programs that were edge cases for the algorithm and verified that the algorithm behaved as expected. The second part of the evaluation involved gathering some benchmark test cases and evaluating the profiling overhead for each of the benchmarks. For this project, I chose a few single file benchmark test cases from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&quot;&gt;LLVM test suite&lt;&#x2F;a&gt;. The primary reason for picking only single file benchmark test cases was that they were easy to put through my LLVM pass.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness-tests&quot;&gt;Correctness Tests&lt;&#x2F;h3&gt;
&lt;p&gt;Most of the correctness tests were either simple sanity checks or specifically crafted CFGs to test the edge cases discussed above. For example, one sanity check was to create the CFG that was initially discussed in the paper and see if the algorithm did the correct thing:&lt;&#x2F;p&gt;
&lt;img src=&quot;diamond.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;For this graph, my algorithm generated the following inc values:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Edge&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Inc&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;D -&amp;gt; F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;C -&amp;gt; D&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;B -&amp;gt; C&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This gives the following path numberings:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Path&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;value&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,D,E,F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,D,F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,C,D,E,F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,B,C,D,F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,C,D,E,F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;A,C,D,F&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This is similar to what the paper generates.&lt;&#x2F;p&gt;
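&lt;p&gt;The path numbering can be checked by hand: summing the inc values along each path, with every edge not listed in the inc table contributing 0, should reproduce the values above. A small Python check, where &lt;code&gt;path_value&lt;&#x2F;code&gt; is a hypothetical helper and each path is written as a string of block names:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# Inc values from the table above; unlisted edges are assumed to add 0.
INC = {("D", "F"): 1, ("C", "D"): 4, ("B", "C"): -2}


def path_value(path):
    """Sum the increments on consecutive block pairs of the path."""
    return sum(INC.get(edge, 0) for edge in zip(path, path[1:]))


PATHS = ["ABDEF", "ABDF", "ABCDEF", "ABCDF", "ACDEF", "ACDF"]
values = [path_value(p) for p in PATHS]  # [0, 1, 2, 3, 4, 5]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Each of the 6 paths gets a distinct value in the range 0 to 5, so the values can index a 6-entry path table directly.&lt;&#x2F;p&gt;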
&lt;p&gt;A more interesting edge case is one where there are 2 loops with the same loop head. This required manually writing some LLVM code since I was not sure how to generate such a CFG using only C:&lt;&#x2F;p&gt;
&lt;img src=&quot;samehead.png&quot; style=&quot;width: 30%&quot;&gt;
&lt;p&gt;The algorithm identifies 9 distinct paths through the graph and assigns the following inc values:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Edge&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Inc&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb -&amp;gt; bb8&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb4 -&amp;gt; bb28&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb18 -&amp;gt; bb4&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb -&amp;gt; bb8&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;One interesting thing about this table is that the edge bb -&amp;gt; bb8 is assigned 2 inc values. This is because one of them corresponds to restarting the loop by taking the edge bb12 -&amp;gt; bb8 and the other corresponds to restarting the loop by taking the edge bb4 -&amp;gt; bb8. To better illustrate this, these are the values assigned to each of the paths:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Path&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;value&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb,bb8,bb12&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb,bb8,bb18,bb4&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;bb,bb8,bb18,bb4,bb28&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;(from bb4) bb8,bb12&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;(from bb4) bb8,bb18,bb4&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;(from bb4) bb8,bb18,bb4,bb28&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;(from bb12) bb8,bb12&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;(from bb12) bb8,bb18,bb4&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;(from bb12) bb8,bb18,bb4,bb28&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;8&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Even though path bb8,bb12 appears more than once in this table with different values, the algorithm is inserting instrumentation to differentiate the ways that the path was started. If we take the back edge bb4 -&amp;gt; bb8, then we start with the path register equal to 3. If we take the back edge bb12 -&amp;gt; bb8, then we start with the path register equal to 6. These were the 2 inc values computed for the edge bb -&amp;gt; bb8 in the table above. This kind of example was not explicitly mentioned in the paper but I think this is the correct thing to do.&lt;&#x2F;p&gt;
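&lt;p&gt;One way to convince yourself of this numbering is to replay a block trace through a small model of the instrumentation. The Python sketch below is not the LLVM pass; it assumes the CFG shape implied by the path table, takes the nonzero increments from the inc table, and assumes that every unlisted edge (including the &lt;code&gt;backedge_inc&lt;&#x2F;code&gt; of both back edges) adds 0. The &lt;code&gt;profile&lt;&#x2F;code&gt; function is a hypothetical helper that returns the path numbers in the order they would be counted.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# Nonzero chord increments from the inc table; unlisted edges are assumed to add 0.
EDGE_INC = {("bb18", "bb4"): 1, ("bb4", "bb28"): 1}
# Back edges mapped to (backedge_inc, backedge_reset).
BACK_EDGES = {("bb12", "bb8"): (0, 6), ("bb4", "bb8"): (0, 3)}


def profile(blocks):
    """Replay a trace that ends at the exit block; return the counted path numbers."""
    counted = []
    r = 0
    for edge in zip(blocks, blocks[1:]):
        if edge in BACK_EDGES:
            inc, reset = BACK_EDGES[edge]
            r += inc
            counted.append(r)  # stands in for path_table[r]++
            r = reset
        else:
            r += EDGE_INC.get(edge, 0)
    counted.append(r)  # the count performed at the exit block
    return counted


trace = ["bb", "bb8", "bb12", "bb8", "bb18", "bb4",
         "bb8", "bb12", "bb8", "bb18", "bb4", "bb28"]
numbers = profile(trace)  # [0, 7, 3, 8]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The trace is split into four paths: bb,bb8,bb12 (0), then bb8,bb18,bb4 entered from bb12 (7), then bb8,bb12 entered from bb4 (3), and finally bb8,bb18,bb4,bb28 entered from bb12 (8), matching the rows of the table.&lt;&#x2F;p&gt;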
&lt;h3 id=&quot;performance-tests&quot;&gt;Performance Tests&lt;&#x2F;h3&gt;
&lt;p&gt;For the performance tests, I collected 11 programs from the SingleSource directory of the LLVM test suite. Each of these programs was compiled to LLVM, then sent through my profiling instrumentation pass, and then compiled to an executable. Additionally, a version of the executable was generated from the original LLVM (with no instrumentation). No optimizations were enabled for any of the compilation steps. Both executables were then run and their run times were measured with &lt;code&gt;time&lt;&#x2F;code&gt;. The table below describes the results:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Program name&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Normal execution time (secs)&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Instrumented execution time (secs)&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Instrumented &#x2F; Normal&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;CoyoteBench&#x2F;almabench.c&quot;&gt;almabench&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;8.50&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;10.04&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.18&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;McGill&#x2F;chomp.c&quot;&gt;chomp&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2.37&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;5.18&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2.19&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;Stanford&#x2F;FloatMM.c&quot;&gt;FloatMM&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.48&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3.61&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2.44&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;CoyoteBench&#x2F;huffbench.c&quot;&gt;huffbench&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;55.67&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;102.73&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.84&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;Linpack&#x2F;linpack-pc.c&quot;&gt;linpack-pc&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;15.80&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;37.49&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2.37&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;CoyoteBench&#x2F;lpbench.c&quot;&gt;lpbench&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;10.94&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;26.75&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2.45&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;McGill&#x2F;misr.c&quot;&gt;misr&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.32&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.41&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.28&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;Misc&#x2F;oourafft.c&quot;&gt;oourafft&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;13.45&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;18.24&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.36&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;McGill&#x2F;queens.c&quot;&gt;queens&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3.53&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;6.69&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.90&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;Misc&#x2F;ReedSolomon.c&quot;&gt;ReedSolomon&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;11.28&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;19.35&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.72&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;blob&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;Stanford&#x2F;Treesort.c&quot;&gt;Treesort&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.18&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.32&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.78&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;These results were acquired in one run of each program. No effort was made to isolate the running process from other processes on the system (a standard laptop) or to minimize the effects of memory alignment. This was because the performance differences were large enough that doing any of the above would have likely been overkill to get meaningful results.&lt;&#x2F;p&gt;
&lt;p&gt;The results show worse overhead than the paper reported, but this may be the result of multiple factors. The first is that the algorithm that I implemented did not include any of the instrumentation placement optimizations that the paper described. The second is that, in my implementation, the path register is stack allocated, so every instrumented edge needs to perform at least a memory read and a write. The path register could instead be pinned to a machine register throughout a function. However, this would leave one less register available for register allocation, so it would be interesting to see what the tradeoffs would be. Next, the operation to increment the path counter table is a function call. This is because I thought I would add locking to support multithreaded programs. However, none of the benchmarks are multithreaded, so this function call is unnecessary and the increment could be inlined. Finally, there is no strength reduction to store the address of the path counter in the path register instead of the index. The paper mentions this as a potential optimization, but no optimization passes were run on the programs after instrumentation.&lt;&#x2F;p&gt;
&lt;p&gt;In summary, the instrumentation generally causes about a 2x increase in program execution time, which is not that great. It would be interesting to see what performance improvements any or all of the above optimizations could achieve.&lt;&#x2F;p&gt;
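&lt;p&gt;For reference, the &quot;about 2x&quot; figure can be recomputed from the table. The short Python snippet below only re-derives the ratios from the rounded times listed above, so the last digit can differ slightly from the table&#x27;s ratio column; the variable names are made up for the example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from statistics import mean, median

# (normal seconds, instrumented seconds), copied from the table above.
TIMES = {
    "almabench": (8.50, 10.04),
    "chomp": (2.37, 5.18),
    "FloatMM": (1.48, 3.61),
    "huffbench": (55.67, 102.73),
    "linpack-pc": (15.80, 37.49),
    "lpbench": (10.94, 26.75),
    "misr": (0.32, 0.41),
    "oourafft": (13.45, 18.24),
    "queens": (3.53, 6.69),
    "ReedSolomon": (11.28, 19.35),
    "Treesort": (0.18, 0.32),
}

# Slowdown factor per benchmark: instrumented time divided by normal time.
ratios = {name: inst / normal for name, (normal, inst) in TIMES.items()}
mean_ratio = round(mean(list(ratios.values())), 2)      # 1.86
median_ratio = round(median(list(ratios.values())), 2)  # 1.85
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The slowdown ranges from about 1.18x (almabench) to about 2.45x (lpbench), with both the mean and the median a little under 1.9x.&lt;&#x2F;p&gt;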
&lt;p&gt;All code for this project can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Dan12&#x2F;llvm-pass-skeleton&#x2F;tree&#x2F;mutate&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Making LLVM Address Calculation Safe(r)</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/gep/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/gep/</guid>
                <description>&lt;h1 id=&quot;memory-safety-in-llvm&quot;&gt;Memory Safety in LLVM&lt;&#x2F;h1&gt;
&lt;p&gt;LLVM IR code is not generally memory safe.
While certain &lt;em&gt;obviously bad&lt;&#x2F;em&gt; behaviors are
disallowed, it is not hard to write code that
may execute out-of-bounds memory accesses at run time.&lt;&#x2F;p&gt;
&lt;p&gt;For instance, the size of an
array may be statically known,
but the access index may be unknown at compile time:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;foo&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;some code to set values in tmp
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[x];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this project we seek to improve the memory
safety of LLVM programs by inserting dynamic bounds
checks at run time that cause the program to stop
executing rather than violate memory safety.
After running our compilation pass the aforementioned
code would have the following run-time behavior:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;foo&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;some code to set values in tmp
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
       &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[x];
    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
       &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;exit&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;llvm-address-calculation&quot;&gt;LLVM Address Calculation&lt;&#x2F;h3&gt;
&lt;p&gt;When compiling high level array and struct access to LLVM code, compilers
generally use the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;GetElementPtr.html&quot;&gt;getelementptr&lt;&#x2F;a&gt;
(or GEP) instruction to calculate offsets into these memory allocations.
GEP instructions have the nice property that they are type aware; offsets are phrased
in terms of &amp;quot;number of elements&amp;quot; rather than &amp;quot;number of bytes.&amp;quot;
For example, the code in this stub dereferences some memory in
the middle of a struct (specifically the last element of the &lt;code&gt;b&lt;&#x2F;code&gt; field).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;EX {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;c;
};

&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;EX *X;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; X-&amp;gt;b[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In LLVM, we could write a single GEP instruction to calculate
the correct offset into the struct and then execute a &lt;code&gt;load&lt;&#x2F;code&gt; instruction
to actually dereference the pointer.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; getelementptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%struct&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.EX,  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%struct&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.EX&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;X, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i32 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load i8, i8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
ret &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
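&lt;p&gt;To see what &quot;number of elements&quot; means in bytes, the following Python sketch mirrors the struct with &lt;code&gt;ctypes&lt;&#x2F;code&gt; and recomputes the offset that the GEP above selects. It is only a model of the address arithmetic, not anything LLVM runs; &lt;code&gt;gep_byte_offset&lt;&#x2F;code&gt; is a hypothetical helper, and the concrete number depends on the platform&#x27;s &lt;code&gt;int&lt;&#x2F;code&gt; size.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import ctypes


class EX(ctypes.Structure):
    # Same field order as the C struct above.
    _fields_ = [
        ("a", ctypes.c_int),
        ("b", ctypes.c_char * 3),
        ("c", ctypes.POINTER(ctypes.c_int)),
    ]


def gep_byte_offset(i0, i2):
    """Byte offset for GEP indices (i0, 1, i2): struct number i0, field b, element i2."""
    return i0 * ctypes.sizeof(EX) + EX.b.offset + i2 * ctypes.sizeof(ctypes.c_char)


offset = gep_byte_offset(0, 2)  # 6 when int is 4 bytes
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The first index steps over whole structs, the second selects a field, and the third steps over elements of &lt;code&gt;b&lt;&#x2F;code&gt;; none of them is a byte count on its own.&lt;&#x2F;p&gt;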
&lt;p&gt;This functionality makes GEP an ideal point for analyzing out-of-bounds accesses.
Before a program might make an out-of-bounds access it has to
acquire an out-of-bounds pointer. Usually, this means it
executes a GEP whose result will then later be the argument of a
load or store operation.&lt;&#x2F;p&gt;
&lt;p&gt;Our approach in this implementation is to prevent the execution of
any run-time GEP instructions that might lead to
illegal memory accesses.
&lt;em&gt;If a program can never acquire an out-of-bounds pointer,
it can&#x27;t violate memory safety.&lt;&#x2F;em&gt; As we discuss &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;gep&#x2F;#soundness-and-completeness&quot;&gt;later&lt;&#x2F;a&gt;, this is not really a sufficient condition for memory safety in LLVM IR,
but it does cover a large class of problems.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;making-gep-safe-r&quot;&gt;Making GEP Safe(r)&lt;&#x2F;h3&gt;
&lt;p&gt;Let&#x27;s go back to our first example of an out-of-bounds array access:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[x];
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The return statement roughly translates to:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;addr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; getelementptr [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32], [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;tmp, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;val &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load i32, i32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;addr;
ret i32 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;val;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In order to insert a dynamic check for memory safety, there
are two things we need to know:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;What is the &lt;em&gt;actual&lt;&#x2F;em&gt; access index value?&lt;&#x2F;li&gt;
&lt;li&gt;What are &lt;strong&gt;legal&lt;&#x2F;strong&gt; access index values?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Happily, when considering GEP instructions, the first question is easy
to answer: each index operand is an access index
value. We can insert instructions into the program that compare
those operands to other values at run time.&lt;&#x2F;p&gt;
&lt;p&gt;The second question is a much more difficult problem,
whose subtleties we&#x27;ll address in the next section.
For the most part though, we leverage LLVM&#x27;s type
information. Based on its type annotation, we know
that &lt;code&gt;%tmp&lt;&#x2F;code&gt; points to an array with 10 32-bit integers (notated as &lt;code&gt;[10 x i32]&lt;&#x2F;code&gt;).
Therefore, we can conclude that the only valid values for &lt;code&gt;%x&lt;&#x2F;code&gt; are
between 0 (inclusive) and 10 (exclusive).&lt;&#x2F;p&gt;
&lt;p&gt;To execute this check completely, we need to check that &lt;em&gt;all&lt;&#x2F;em&gt; index operands
are legal. You may have noticed that our prior example actually has &lt;strong&gt;two&lt;&#x2F;strong&gt;
index operands; we have to check both that &lt;code&gt;tmp&lt;&#x2F;code&gt; points to at least one
integer array of size 10, &lt;em&gt;and&lt;&#x2F;em&gt; that &lt;code&gt;x&lt;&#x2F;code&gt; is a valid index for such an array.&lt;&#x2F;p&gt;
&lt;p&gt;In general, our algorithm for modifying LLVM code is this:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Initialize the current type to be the type of the first operand.&lt;&#x2F;li&gt;
&lt;li&gt;Initialize current operand to the first index operand.&lt;&#x2F;li&gt;
&lt;li&gt;If possible, insert instructions to check if current operand is in bounds based on current type.&lt;&#x2F;li&gt;
&lt;li&gt;Set current type to the next element type (e.g., if the current type is &lt;code&gt;[10 x i32]*&lt;&#x2F;code&gt;, the next type is &lt;code&gt;[10 x i32]&lt;&#x2F;code&gt;).&lt;&#x2F;li&gt;
&lt;li&gt;If there are no more index operands, exit.
Otherwise, set the current operand to the next in the operand list and go to step 3.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
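&lt;p&gt;The steps above can be sketched as a small Python model. This is only an illustration under simplifying assumptions rather than the actual implementation: the &lt;code&gt;Pointer&lt;&#x2F;code&gt; and &lt;code&gt;Array&lt;&#x2F;code&gt; classes, &lt;code&gt;plan_checks&lt;&#x2F;code&gt;, and the &lt;code&gt;pointee_count&lt;&#x2F;code&gt; parameter are hypothetical names, and struct and vector types are left out.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Array:
    """Models the LLVM type [size x elem]."""
    size: int
    elem: object


@dataclass(frozen=True)
class Pointer:
    """Models the LLVM type elem*."""
    elem: object


def next_element_type(ty):
    """Pointers and arrays both step to the type they contain."""
    return ty.elem


def max_elements(ty, pointee_count: Optional[int]):
    """Number of legal index values for ty, or None when unknown."""
    if isinstance(ty, Array):
        return ty.size            # read straight from the type
    return pointee_count          # pointers need allocation information


def plan_checks(pointer_type, indices, pointee_count=None):
    """Return one (index, bound) pair per index operand.

    A bound of None means no check can be inserted for that operand.
    """
    checks = []
    current = pointer_type                      # step 1
    for index in indices:                       # steps 2 and 5
        bound = max_elements(current, pointee_count)
        checks.append((index, bound))           # step 3
        current = next_element_type(current)    # step 4
    return checks


def in_bounds(index: int, bound: int) -> bool:
    """The run-time condition that an inserted check evaluates."""
    return bound > index >= 0


# getelementptr [10 x i32], [10 x i32]* %tmp, i64 0, i64 %x
tmp_type = Pointer(Array(10, "i32"))
checks = plan_checks(tmp_type, [0, "%x"], pointee_count=1)
print(checks)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For the running example, this pairs the constant index 0 with a bound of 1 and &lt;code&gt;%x&lt;&#x2F;code&gt; with a bound of 10. A bound of &lt;code&gt;None&lt;&#x2F;code&gt; stands for the case where no check can be inserted.&lt;&#x2F;p&gt;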
&lt;h3 id=&quot;gep-checking-a-walkthrough&quot;&gt;GEP Checking: A Walkthrough&lt;&#x2F;h3&gt;
&lt;p&gt;In this section we&#x27;ll walk through the above example in excruciating detail.
Feel free to skip ahead to &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;gep&#x2F;#pointer-sizes-and-tracking-allocations&quot;&gt;the next section&lt;&#x2F;a&gt;
if you&#x27;re an expert in how &lt;code&gt;getelementptr&lt;&#x2F;code&gt; works and&#x2F;or the above algorithm makes intuitive sense.&lt;&#x2F;p&gt;
&lt;p&gt;The main complication in the above algorithm is computing the next element type.
Based on the possible types that GEP expects, there are only a few cases to handle.
Intuitively, each type represents a container in some way and &amp;quot;indexing&amp;quot; into it
should get us the type contained by the outer one.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Type&lt;&#x2F;th&gt;&lt;th&gt;Next Element Type&lt;&#x2F;th&gt;&lt;th&gt;Notes&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;t*&lt;&#x2F;td&gt;&lt;td&gt;t&lt;&#x2F;td&gt;&lt;td&gt;For pointers, the next type is the type being pointed to&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;[ size x t ]&lt;&#x2F;td&gt;&lt;td&gt;t&lt;&#x2F;td&gt;&lt;td&gt;For arrays, the next type is the array element type&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&amp;lt; size x t &amp;gt;&lt;&#x2F;td&gt;&lt;td&gt;t&lt;&#x2F;td&gt;&lt;td&gt;Vectors, like arrays, have an element type&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;struct { f1, f2,...,fn }&lt;&#x2F;td&gt;&lt;td&gt;fi&lt;&#x2F;td&gt;&lt;td&gt;i is the index value; LLVM requires this to be a compile-time constant&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Checking the instruction: 
&lt;code&gt;%addr = getelementptr [10 x i32], [10 x i32]* %tmp, i64 0, i64 %x;&lt;&#x2F;code&gt;&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;PointerSource &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;tmp;
CurrentType &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;Get the max offset for accessing %tmp. Since %tmp was generated with
&#x2F;&#x2F; %tmp = alloca [10 x i32]
&#x2F;&#x2F;We know that %tmp points to only 1 integer array
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;NumElements &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
InsertCheck(CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; NumElements); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; (0 &amp;gt;= 0 &amp;amp;&amp;amp; 0 &amp;lt; 1)
&#x2F;&#x2F;Our implementation will actually automatically omit this
&#x2F;&#x2F;check since it can be easily statically determined to be `true`
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
CurrentType &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;NextElementType(CurrentType); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;[10 x i32]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i64 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x;

NumElements &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10 &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; retrieved from the type [10 x i32]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;InsertCheck(CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; NumElements); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; (x &amp;gt;= 0 &amp;amp;&amp;amp; x &amp;lt; 10)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
CurrentOp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;none&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;Done!
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;pointer-sizes-and-tracking-allocations&quot;&gt;Pointer Sizes and Tracking Allocations&lt;&#x2F;h2&gt;
&lt;p&gt;In the above examples, we could always tell how big our memory allocations
were since they were allocated with static sizes. &lt;code&gt;int tmp[10]&lt;&#x2F;code&gt; comprises two static allocations: 1) a single pointer-sized memory cell (to contain the local variable &lt;code&gt;tmp&lt;&#x2F;code&gt;); 2) a memory cell
containing 10 integers (the memory pointed to by &lt;code&gt;tmp&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;In many cases, the sizes of arrays may be difficult or impossible to determine at compile
time. Consider the following snippet:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;foo&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[x];
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;... &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;init values in tmp
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp[y];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the corresponding LLVM code, the type of &lt;code&gt;tmp&lt;&#x2F;code&gt; is no longer a sized type;
it is just &lt;code&gt;i32*&lt;&#x2F;code&gt;. We can no longer use types to help us determine what
are and are not legal offsets. In this case, however, there is something
that we can do. LLVM uses &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;LangRef.html#alloca-instruction&quot;&gt;&lt;code&gt;alloca&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
instructions to allocate local variables. &lt;code&gt;alloca&lt;&#x2F;code&gt; takes an argument to
determine how many elements must be allocated. If we keep track of the
sizes of local allocations we can infer that the above code is safe if and only if:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; y &amp;lt; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;we&amp;#39;ll assume x &amp;gt; 0 here
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In our implementation, we simply keep a map from allocations to their sizes.
Additionally, we track heap allocations by scanning for function calls to &lt;code&gt;malloc&lt;&#x2F;code&gt;.
This allows us to calculate maximum pointer index values for GEP instructions
where the types are unsized. Unfortunately, this is rather imprecise since
it tracks only &lt;em&gt;exact&lt;&#x2F;em&gt; value dependencies and doesn&#x27;t keep track of other ways
a pointer may reach a GEP. For instance, spilling a value to memory
and then reloading it will cause our analysis to lose track of the original
allocation.&lt;&#x2F;p&gt;
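&lt;p&gt;As a rough illustration of this bookkeeping, here is a minimal Python sketch. It assumes SSA values are identified by name; the &lt;code&gt;AllocationTracker&lt;&#x2F;code&gt; class, its methods, and the value names &lt;code&gt;%buf&lt;&#x2F;code&gt;, &lt;code&gt;%n&lt;&#x2F;code&gt; and &lt;code&gt;%reloaded&lt;&#x2F;code&gt; are hypothetical and do not come from our implementation.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class AllocationTracker:
    """Maps SSA value names to the element count of their allocation."""

    def __init__(self):
        self.sizes = {}

    def record_alloca(self, result, count):
        # %result = alloca T, i64 count  (count may itself be an SSA name)
        self.sizes[result] = count

    def record_malloc(self, result, byte_count):
        # %result = call i8* @malloc(i64 byte_count); an i8* indexes bytes
        self.sizes[result] = byte_count

    def bound_for(self, pointer):
        # None means the root allocation is unknown, so no check is emitted
        return self.sizes.get(pointer)


tracker = AllocationTracker()
tracker.record_alloca("%tmp", "%x")     # int tmp[x]
tracker.record_malloc("%buf", "%n")     # malloc(n)

# A pointer reloaded from memory is a new SSA value with no map entry.
print(tracker.bound_for("%tmp"))
print(tracker.bound_for("%reloaded"))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The lookup succeeds only for the exact value produced by the allocation, which is the source of the imprecision described above.&lt;&#x2F;p&gt;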
&lt;p&gt;Additionally, we run our transformation as a function pass so
it doesn&#x27;t track interprocedural allocations. The main
reason for this limitation is that, even if we knew
the original allocation size for all callers, we would
have to modify the function signature to communicate
legal index values from the caller.
This seemed both out of scope for our current project
and a potentially questionable design decision. Should
a compiler pass be modifying the signatures of
potentially every function?&lt;&#x2F;p&gt;
&lt;p&gt;Alternatively, one could implement a much more heavyweight
dynamic checker in the style of &lt;a href=&quot;http:&#x2F;&#x2F;valgrind.org&#x2F;&quot;&gt;Valgrind&lt;&#x2F;a&gt; or &lt;a href=&quot;..&#x2F;mempass&quot;&gt;this other CS 6120 project&lt;&#x2F;a&gt;. These checkers keep track of all allocated pointers and aliases in a large run-time data structure to ensure that no dereference is illegal. This is a different approach, focused less on static analysis and more on total safety, but it comes with much larger run-time overheads in both space and time.&lt;&#x2F;p&gt;
&lt;p&gt;Because it does not track interprocedural allocations, our pass is unable to improve the
memory safety of this function:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;foo&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x[y];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We had hoped to use LLVM&#x27;s alias analysis or copy propagation tools
to increase the precision of allocation tracking. However, we couldn&#x27;t get
these to work; they were difficult to integrate and didn&#x27;t seem to track
pointer value propagation as we expected them to. LLVM&#x27;s relatively new
&lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;MemorySSA.html&quot;&gt;MemorySSA&lt;&#x2F;a&gt; analysis seemed very
promising, since their example code finds domination relationships between
memory uses and definitions. This would allow us to, at least for some cases,
track allocation size information transitively through pointer reads and writes.
However, the implementation is less precise than the documentation lets on
and does not find accurate enough relationships to identify the root allocation
for any pointers in practice.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bitcasting&quot;&gt;Bitcasting&lt;&#x2F;h3&gt;
&lt;p&gt;Additionally, &lt;code&gt;bitcast&lt;&#x2F;code&gt; instructions complicate this process
even more, since they cause the &amp;quot;sizes&amp;quot; of memory allocations to
be interpreted differently.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloca i32, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; bitcast i32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; to i8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since 4 &lt;code&gt;i8&lt;&#x2F;code&gt; values fit into one &lt;code&gt;i32&lt;&#x2F;code&gt;, the allocation of &lt;code&gt;%2&lt;&#x2F;code&gt; represents a totally
different number of elements than &lt;code&gt;%1&lt;&#x2F;code&gt; even though they represent the result of the
same allocation operation.
In the above code, a GEP that uses &lt;code&gt;%1&lt;&#x2F;code&gt; can safely index into elements 0 to 9.
However, a GEP that uses &lt;code&gt;%2&lt;&#x2F;code&gt; as the base can safely index into elements 0 to 39.
For any bitcast instruction that casts an allocation of known size, we convert
and track the size of the new value, using integer multiplication and division
to soundly approximate the maximum safe index. For typical bitcasts (e.g., &lt;code&gt;char&lt;&#x2F;code&gt; to &lt;code&gt;int&lt;&#x2F;code&gt;)
this will not lose precision; however, LLVM supports integers of arbitrary bit width,
so element sizes need not divide evenly, which can make this estimate of allocation size an underestimate.&lt;&#x2F;p&gt;
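&lt;p&gt;The size conversion amounts to one multiplication and one floor division. The sketch below assumes element sizes are already known in bytes; the function name &lt;code&gt;convert_element_count&lt;&#x2F;code&gt; is hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def convert_element_count(count, old_elem_bytes, new_elem_bytes):
    """Whole elements of the new size that fit in the original allocation."""
    total_bytes = count * old_elem_bytes
    # Floor division rounds down, so the result never exceeds the allocation.
    return total_bytes // new_elem_bytes


# alloca i32, i64 10 viewed through an i8* pointer: 40 one-byte elements
print(convert_element_count(10, 4, 1))

# Three 4-byte elements viewed as 8-byte elements: only one fits entirely
print(convert_element_count(3, 4, 8))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;When the sizes do not divide evenly, as in the second call, the leftover bytes are treated as out of bounds. This keeps the bound safe at the cost of being an underestimate.&lt;&#x2F;p&gt;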
&lt;h2 id=&quot;soundness-and-completeness&quot;&gt;Soundness and Completeness&lt;&#x2F;h2&gt;
&lt;p&gt;Often, when trying to ensure a safety
property you&#x27;d like to show that your results are either &lt;em&gt;sound&lt;&#x2F;em&gt; or &lt;em&gt;complete&lt;&#x2F;em&gt;.
In our case, soundness would imply that any LLVM program which uses no
&amp;quot;type unsafe&amp;quot; features and is compiled with our pass 
executes &lt;em&gt;no GEP instructions which would generate out-of-bounds pointers&lt;&#x2F;em&gt;.
Completeness, on the other hand, would imply that we can compile
and execute &lt;em&gt;all programs that never execute out-of-bounds accesses&lt;&#x2F;em&gt;.
In other words, completeness ensures that all safe programs should still be executable after our instrumentation.&lt;&#x2F;p&gt;
&lt;p&gt;Typically, you cannot achieve both soundness and completeness simultaneously,
although the use of code transformation to insert run-time checks does make
this tractable for some problems. Our solution achieves only completeness and
not soundness; we allow all programs to execute by skipping checks where we
cannot determine the legal index bounds.
A simple &lt;em&gt;sound alternative&lt;&#x2F;em&gt; would be to reject any program containing a GEP whose bounds we cannot determine;
by placing an unconditional exit in front of any such GEP instruction
we could ensure safety but would prevent some safe programs from executing.&lt;&#x2F;p&gt;
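&lt;p&gt;The difference between the two policies can be summarized in a few lines of Python. This is a sketch for illustration only; &lt;code&gt;guard_for&lt;&#x2F;code&gt; and its &lt;code&gt;policy&lt;&#x2F;code&gt; argument are hypothetical names, and the returned strings merely describe what would be emitted.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def guard_for(index, bound, policy="complete"):
    """Describe what is placed in front of one GEP index operand."""
    if bound is not None:
        # Bounds are known: both policies insert the same run-time check.
        return f"check {index} in [0, {bound})"
    if policy == "sound":
        return "unconditional exit"   # rejects the program at this GEP
    return "no check"                 # complete: let the program run


print(guard_for("%x", 10))
print(guard_for("%y", None))
print(guard_for("%y", None, policy="sound"))
```
&lt;&#x2F;pre&gt;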
&lt;h3 id=&quot;a-note-on-soundness&quot;&gt;A Note on Soundness&lt;&#x2F;h3&gt;
&lt;p&gt;Soundness for our problem isn&#x27;t really achievable without some
assumptions about the behavior of LLVM programs outside of the GEP instructions
(&lt;em&gt;or without a much more complex interprocedural analysis&lt;&#x2F;em&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;To highlight one of the reasons for this, consider the following LLVM program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloca [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32]
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; bitcast [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; to [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*
%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; getelementptr [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32], [&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;11&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x i32]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The above LLVM code is totally legal and will compile using standard
LLVM tools. However, this example invalidates the assumption that our
pass uses to ensure GEP safety.&lt;&#x2F;p&gt;
&lt;p&gt;To be clear, &lt;code&gt;%1&lt;&#x2F;code&gt; is a pointer to an array of 10 32-bit integers, 
but the next instruction copies that same pointer value into &lt;code&gt;%2&lt;&#x2F;code&gt;
while treating it as a pointer to &lt;em&gt;11&lt;&#x2F;em&gt; 32-bit integers.&lt;&#x2F;p&gt;
&lt;p&gt;When our pass analyses &lt;code&gt;%3&lt;&#x2F;code&gt; it will insert the following bounds
check for &lt;code&gt;%x&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;= %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;11
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Executions where &lt;code&gt;%x == 10&lt;&#x2F;code&gt; will cause memory safety violations.
We consider such behaviors outside the scope of this project
and assume that the LLVM types for the arguments to the GEP
instruction reflect accurate allocations of memory. This assumption
is what we mean by not using &amp;quot;type unsafe&amp;quot; features.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To evaluate the utility of our pass,
we took a selection of &lt;a href=&quot;https:&#x2F;&#x2F;parsec.cs.princeton.edu&#x2F;&quot;&gt;PARSEC&lt;&#x2F;a&gt;
benchmarks and considered both: 1) how often we failed to determine the
legal bounds for a pointer; and 2) how much run-time overhead we incurred
with our dynamic checks. Furthermore, we ran a number of microbenchmarks
to ensure that our pass was properly instrumenting code in the absence
of the soundness problems that we mentioned above.&lt;&#x2F;p&gt;
&lt;p&gt;We chose primarily the benchmarks that we could get
to compile most easily, so they may not reflect as
wide a range of behaviors as possible. We maintained the
same compiler flags used in the original suite, specifically
using the &lt;code&gt;-O3&lt;&#x2F;code&gt; optimization flag. For an apples-to-apples comparison,
the &amp;quot;Baseline&amp;quot; uses Clang to compile but does not run our transformation
pass. The &amp;quot;Instrumented&amp;quot; code is generated by running our pass after &lt;code&gt;-O3&lt;&#x2F;code&gt; optimization but
does not run any further optimizations. Therefore, any overheads we find
should be considered upper bounds, as further optimization may remove some of
the checks or improve how they are calculated.&lt;&#x2F;p&gt;
&lt;p&gt;We ran each benchmark a total of 10 times and
calculated both the average and standard deviation of execution time.
These were executed on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz,
with 32GB of RAM and only one thread allocated per execution.
The reported execution times are based on the built-in PARSEC
&lt;em&gt;region of interest&lt;&#x2F;em&gt; measures, which report only execution of the
hot loop and omit initialization and clean-up times.&lt;&#x2F;p&gt;
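&lt;p&gt;For concreteness, the summary statistics can be computed along these lines. This sketch assumes the sample standard deviation and a slowdown defined as the ratio of mean execution times. The function names are hypothetical, and the timings are placeholders for illustration, not measured results.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from statistics import mean, stdev


def summarize(times):
    """Mean and sample standard deviation of repeated timings (seconds)."""
    return mean(times), stdev(times)


def slowdown(instrumented, baseline):
    """Ratio of mean execution times; 1.0 means no overhead."""
    return mean(instrumented) / mean(baseline)


# Placeholder timings for illustration only; these are not measured results.
baseline_runs = [2.0, 2.1, 1.9, 2.0]
instrumented_runs = [2.2, 2.3, 2.1, 2.2]

print(summarize(baseline_runs))
print(slowdown(instrumented_runs, baseline_runs))
```
&lt;&#x2F;pre&gt;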
&lt;h3 id=&quot;evaluation-precision&quot;&gt;Evaluation: Precision&lt;&#x2F;h3&gt;
&lt;p&gt;In all of the PARSEC programs we benchmarked, we failed to
instrument &lt;em&gt;most&lt;&#x2F;em&gt; of the unsafe memory accesses. As you can see in
the graph below, in the best case (Fluidanimate) we managed to instrument 30%
of GEP instructions, while in the worst we added &lt;em&gt;no run-time checks&lt;&#x2F;em&gt; (Blackscholes).&lt;&#x2F;p&gt;
&lt;img src=&quot;precision.png&quot;&#x2F;&gt;
&lt;p&gt;Based on our manual observation of the compiled code and testing
our instrumentation with microbenchmarks, we believe this lack of
precision stems from two main sources:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Pointers allocated outside function scope (arguments and global variables)&lt;&#x2F;li&gt;
&lt;li&gt;Allocation information loss due to operations on pointer variables&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As mentioned previously, the former is difficult to deal with
as an IR compiler pass since a general solution would require
modifying function signatures to pass allocation size information.&lt;&#x2F;p&gt;
&lt;p&gt;The latter problem stems primarily from three operations: &lt;code&gt;load&lt;&#x2F;code&gt;, &lt;code&gt;store&lt;&#x2F;code&gt; and &lt;code&gt;getelementptr&lt;&#x2F;code&gt;.
While complications arising from GEPs are straightforward to solve (since that&#x27;s an instruction
we&#x27;re already instrumenting), without precise memory dependency and alias analyses,
we cannot track allocation sizes of pointers derived from other pointers.
The most common case we noticed, anecdotally,
was global pointers to data that were &amp;quot;malloced&amp;quot; at the beginning of the &lt;code&gt;main&lt;&#x2F;code&gt; function,
but accessed throughout the program.&lt;&#x2F;p&gt;
&lt;p&gt;For instance, take the &lt;strong&gt;Blackscholes&lt;&#x2F;strong&gt; benchmark, for which we instrumented &lt;em&gt;no&lt;&#x2F;em&gt; GEP instructions.
It has a global pointer to an array of &lt;code&gt;floats&lt;&#x2F;code&gt; called &lt;code&gt;prices&lt;&#x2F;code&gt;, whose size is determined during
the beginning of execution:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;fptype &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;prices;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;main() {
   &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
   prices &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(fptype&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(numOptions&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(fptype));
   &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The above allocation translates to the following LLVM IR:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;46 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tail call noalias i8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; @&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i64 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;45&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;46&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;** &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bitcast (&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; @prices to i8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;), align &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our analysis determined that the run-time size of the memory pointed to
by &lt;code&gt;%46&lt;&#x2F;code&gt; was given by the value &lt;code&gt;%45&lt;&#x2F;code&gt;.
However, &lt;code&gt;%46&lt;&#x2F;code&gt; is not used as the argument to any GEP instruction; instead,
later operations use a &lt;code&gt;load&lt;&#x2F;code&gt; to retrieve the array pointer from &lt;code&gt;prices&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;179 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float**&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; @prices, align &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;180 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; getelementptr inbounds &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;float* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;179&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i64 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;159
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since we do not run a memory dependency analysis, we could
not determine that the size of the allocation pointed to by &lt;code&gt;%179&lt;&#x2F;code&gt; was &lt;code&gt;%45&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Often, these two problems combined, since &lt;code&gt;prices&lt;&#x2F;code&gt; may be accessed
outside the scope at which &lt;code&gt;%45&lt;&#x2F;code&gt; is available and we therefore would need
to modify the program to communicate this allocation information (potentially
via an extra global variable).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation-overhead&quot;&gt;Evaluation: Overhead&lt;&#x2F;h3&gt;
&lt;p&gt;We measured run-time overhead as wall-clock time because it
was the simplest thing to instrument and probably the most relevant bottom-line metric when inserting dynamic checks. The following graph reports the
average slowdown caused by our instrumentation (lower is better). A second graph at the end
of this section reports the underlying results rather than the ratio:
the mean execution time for both configurations. Error bars on that
graph represent one standard deviation.&lt;&#x2F;p&gt;
&lt;img src=&quot;overhead.png&quot;&#x2F;&gt;
&lt;p&gt;In the case of the Ferret benchmark, our instrumentation caused the implementation
to exit prematurely. Since the benchmark is quite large, we did not have time to investigate
why; it is possible that Ferret intentionally executes &amp;quot;unsafe&amp;quot; GEP instructions.
Since our instrumentation did not cause bugs in any of the other implementations, we find
this to be a likely cause, but it warrants further examination.&lt;&#x2F;p&gt;
&lt;p&gt;Interestingly, the instrumented Canneal and Streamcluster benchmarks run ever so slightly
faster; however, this result is within the standard deviation and could
also be influenced by effects covered in the &lt;a href=&quot;..&#x2F;measurement&quot;&gt;first blog for this course&lt;&#x2F;a&gt;.
Without running any real statistics, it seems like the instrumentation only had a meaningful
impact on the Fluidanimate benchmark. Somewhat unsurprisingly, this is also the benchmark
for which we managed to instrument the most GEP instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Intuitively, our instrumentation &lt;em&gt;should&lt;&#x2F;em&gt; add run-time overhead which scales with
the number of GEPs and the number of times each of those GEPs is executed. It would
have been interesting to determine how &amp;quot;hot&amp;quot; each GEP instruction was and drill
down into where the overhead was coming from. That would have involved much
more invasive profiling which we did not implement.&lt;&#x2F;p&gt;
&lt;img src=&quot;runtime.png&quot;&#x2F;&gt;</description>
            </item>
        
            <item>
                <title>LLVM Loop Autovectorization</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-autovec/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-autovec/</guid>
                <description>&lt;p&gt;We can use SIMD support in modern processors to optimize programs.
Specifically, a compiler can automatically detect when loops can be
&lt;em&gt;autovectorized&lt;&#x2F;em&gt; so that multiple iterations of the loop can be performed
at once.
We implement an autovectorizer pass for LLVM that performs this optimization
for LLVM IR. 
LLVM natively supports vector instructions already, so the pass was implemented
with relatively few lines of code by taking advantage of LLVM&#x27;s existing
infrastructure.&lt;&#x2F;p&gt;
&lt;p&gt;The repository can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rolph-recto&#x2F;cs6120-autovec&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;We implemented our vectorizer as a transformation pass in LLVM.
We subclassed &lt;code&gt;LoopPass&lt;&#x2F;code&gt;, which provides a &lt;code&gt;runOnLoop&lt;&#x2F;code&gt; method
that child classes can override; LLVM calls it on each natural loop found
in a function.&lt;&#x2F;p&gt;
&lt;p&gt;Our vectorizer pass comes in two stages.
First, it checks whether a loop can be vectorized at all.
Then, it vectorizes instructions in the loop and updates the loop stride to
match the vector size.&lt;&#x2F;p&gt;
&lt;p&gt;We deem a loop vectorizable if it satisfies the following criteria:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The loop is in canonical form: it has an inductive variable (read: a unique
variable that enters the loop at 0 and increments by 1 per iteration)
and it has a single block from which the loop exits.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The loop bound is a constant and is divisible by the vector size.
We determine the loop bound by checking the condition on which the unique
exiting block in the loop exits.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The loop has no cross-iteration dependencies.
Vectorization assumes that adjacent loop iterations can be parallelized,
which is not true if the iterations have dependencies on each other.
To check that cross-iteration dependencies do not exist, 
our vectorizer checks that all array accesses are indexed either by the
inductive variable or loop-invariant data.
It also checks that the operands of all operations in the loop are either
(1) vectorizable, (2) loop-invariant, or (3) the induction variable.
Branches in the loop are checked to see if their condition, if they have one,
is loop-invariant.
This ensures that the loop does not vary in control flow between adjacent
iterations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Once we have deemed a loop vectorizable, we vectorize instructions by changing
their types into the corresponding vector types.
This is possible because vector types in LLVM are first-class and are treated
like other types such as &lt;code&gt;int&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For &lt;code&gt;store&lt;&#x2F;code&gt;s, &lt;code&gt;load&lt;&#x2F;code&gt;s, and operation instructions (e.g., &lt;code&gt;add&lt;&#x2F;code&gt;, &lt;code&gt;mul&lt;&#x2F;code&gt;, &lt;code&gt;icmp&lt;&#x2F;code&gt;),
we replace the operands with their vectorized counterparts.
Constant operands are vectorized in place; otherwise, we create
a vectorized version of the operand immediately after its definition site
using &lt;code&gt;insertelement&lt;&#x2F;code&gt; instructions, and then replace uses of the operand
with its new vectorized definition.
There are two possible cases for operands:
(1) the operand is loop-invariant, in which case the vector contains &lt;code&gt;n&lt;&#x2F;code&gt; copies
of the original operand, where &lt;code&gt;n&lt;&#x2F;code&gt; is the vector size;
or (2) the operand is the inductive variable, in which case
(assuming the inductive variable is &lt;code&gt;i&lt;&#x2F;code&gt;) the vector is of the form
&lt;code&gt;&amp;lt;i, i+1, ..., i+(n-1)&amp;gt;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
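&lt;p&gt;As a minimal sketch of these two cases, the following Python models vector operands as plain lists. The helper names &lt;code&gt;splat&lt;&#x2F;code&gt; and &lt;code&gt;widen_induction&lt;&#x2F;code&gt; are hypothetical and are not taken from our pass; the sketch only illustrates the values each vector is expected to hold.&lt;&#x2F;p&gt;

```python
# Illustrative model only: vectors are Python lists, not LLVM values.
# splat and widen_induction are hypothetical names, not part of the pass.

def splat(value, n):
    """Case (1): a loop-invariant operand becomes n copies of itself."""
    return [value] * n


def widen_induction(i, n):
    """Case (2): the inductive variable i becomes [i, i+1, ..., i+(n-1)].

    Built as in the IR of the Example section: splat i, then add the
    constant vector [0, 1, ..., n-1] lane by lane.
    """
    offsets = list(range(n))
    return [lane + off for lane, off in zip(splat(i, n), offsets)]


if __name__ == "__main__":
    print(splat(7, 4))            # [7, 7, 7, 7]
    print(widen_induction(8, 4))  # [8, 9, 10, 11]
```

&lt;p&gt;Building the second vector as a splat plus a constant offset vector keeps the only non-constant input to the single scalar &lt;code&gt;i&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;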
&lt;p&gt;For &lt;code&gt;getelementptr&lt;&#x2F;code&gt; instructions, there are only two possible cases given
our vectorization conditions above:
(1) the base pointer is indexed by loop-invariant data, in which case
the vectorizer does nothing;
or (2) the base pointer is indexed in at least one position by the
inductive variable, in which case we immediately &lt;code&gt;bitcast&lt;&#x2F;code&gt; the result of the GEP
into a pointer to a vector type.
For example, if the GEP returns &lt;code&gt;uint8_t*&lt;&#x2F;code&gt;, we bitcast the resulting pointer
into the pointer-to-vector type &lt;code&gt;&amp;lt;n x uint8_t&amp;gt;*&lt;&#x2F;code&gt;, where &lt;code&gt;n&lt;&#x2F;code&gt; is the vector size.
We then replace all the uses of the GEP result with its &lt;code&gt;bitcast&lt;&#x2F;code&gt;ed counterpart.
A &lt;code&gt;load&lt;&#x2F;code&gt; from the pointer therefore reads not just the element at the array index but
also the elements at the adjacent indices, loading a vector from memory
rather than a single datum.
&lt;code&gt;store&lt;&#x2F;code&gt;s using GEP results can also write vectors to memory in this way.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, once all instructions are vectorized, we find the unique instruction
that increments the inductive variable and change its stride to match the
vector size.
(Given our vectorization check, we can assume the inductive variable is a
&lt;code&gt;PHINode&lt;&#x2F;code&gt; with two incoming definitions: a constant &lt;code&gt;0&lt;&#x2F;code&gt; and an addition
instruction that increments the inductive variable.
This is how we find the inductive variable&#x27;s unique update instruction.)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;example&quot;&gt;Example&lt;&#x2F;h2&gt;
&lt;p&gt;To show our vectorizer in action, consider the following example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int64_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int64_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    a[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i;
  }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This C code compiled to LLVM IR&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#mem2reg&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; looks like:&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;mem2reg&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;The IR generated has been transformed into SSA form using the
&lt;code&gt;mem2reg&lt;&#x2F;code&gt; pass.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;entry:
  %a = alloca [100 x i64], align 16
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %i.0 = phi i64 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i64 %i.0, 100
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %arrayidx = getelementptr [100 x i64], [100 x i64]* %a, i64 0, i64 %i.0
  store i64 %i.0, i64* %arrayidx, align 8
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i64 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 0
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Running our vectorization pass, we get the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;entry:
  %a = alloca [100 x i64], align 16
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %i.0 = phi i64 [ 0, %entry ], [ %inc, %for.inc ]
  %0 = insertelement &amp;lt;4 x i64&amp;gt; zeroinitializer, i64 %i.0, i64 0
  %1 = insertelement &amp;lt;4 x i64&amp;gt; %0, i64 %i.0, i64 1
  %2 = insertelement &amp;lt;4 x i64&amp;gt; %1, i64 %i.0, i64 2
  %3 = insertelement &amp;lt;4 x i64&amp;gt; %2, i64 %i.0, i64 3
  %4 = add &amp;lt;4 x i64&amp;gt; %3, &amp;lt;i64 0, i64 1, i64 2, i64 3&amp;gt;
  %cmp = icmp slt i64 %i.0, 100
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %arrayidx = getelementptr [100 x i64], [100 x i64]* %a, i64 0, i64 %i.0
  %5 = bitcast i64* %arrayidx to &amp;lt;4 x i64&amp;gt;*
  store &amp;lt;4 x i64&amp;gt; %4, &amp;lt;4 x i64&amp;gt;* %5, align 8
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i64 %i.0, 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 0
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The vectorizer changes the loop so that each new iteration performs 4 iterations
of the original loop.
The GEP in the &lt;code&gt;for.body&lt;&#x2F;code&gt; block computes the base address for a 4-vector;
the subsequent &lt;code&gt;store&lt;&#x2F;code&gt; through &lt;code&gt;%5&lt;&#x2F;code&gt; (the &lt;code&gt;bitcast&lt;&#x2F;code&gt; of &lt;code&gt;%arrayidx&lt;&#x2F;code&gt;) then uses the vectorized version
of the inductive variable &lt;code&gt;%i.0&lt;&#x2F;code&gt; computed in &lt;code&gt;for.cond&lt;&#x2F;code&gt; (&lt;code&gt;%4&lt;&#x2F;code&gt;)
to write into memory the next 4 values of &lt;code&gt;%i.0&lt;&#x2F;code&gt; at once.&lt;&#x2F;p&gt;
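&lt;p&gt;To make the equivalence concrete, here is a small Python simulation of the two loops. It is a sketch and not output of our pass: the function names are hypothetical, and a list slice assignment stands in for the single vector &lt;code&gt;store&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```python
# Illustrative simulation only; the function names are hypothetical.
VECTOR_SIZE = 4  # matches the 4-lane vectors in the IR above
BOUND = 100      # constant loop bound, divisible by VECTOR_SIZE


def scalar_loop():
    """The original loop: one element written per iteration."""
    a = [0] * BOUND
    for i in range(BOUND):
        a[i] = i
    return a


def vectorized_loop():
    """The transformed loop: VECTOR_SIZE elements written per iteration."""
    a = [0] * BOUND
    iterations = 0
    # The inductive variable now advances by VECTOR_SIZE per iteration.
    for i in range(0, BOUND, VECTOR_SIZE):
        lanes = [i + k for k in range(VECTOR_SIZE)]  # plays the role of %4
        a[i:i + VECTOR_SIZE] = lanes                 # one vector store
        iterations += 1
    return a, iterations


if __name__ == "__main__":
    vec, iterations = vectorized_loop()
    print(vec == scalar_loop())  # True
    print(iterations)            # 25
```

&lt;p&gt;The simulation also shows why the loop bound must be divisible by the vector size: otherwise the last vector store would extend past the end of the array.&lt;&#x2F;p&gt;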
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We evaluated our autovectorizer across a range of small benchmarks containing
loops.
We ran experiments on a ThinkPad T410 with an Intel Core i7-620M processor and
8GB of RAM.
We constructed our benchmarks so that all loops can be vectorized according
to the rather stringent vectorization conditions, which disallow indexing
arrays with anything other than the inductive variable or loop-invariant data.&lt;&#x2F;p&gt;
&lt;p&gt;The results below are averaged over three runs.
Note that DEF is the configuration without any optimizations (&lt;code&gt;-O0&lt;&#x2F;code&gt;),
while OPT is the configuration with only our autovectorizer enabled.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;DEF runtime (ms)&lt;&#x2F;th&gt;&lt;th&gt;OPT runtime (ms)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;test1&lt;&#x2F;td&gt;&lt;td&gt;26.00&lt;&#x2F;td&gt;&lt;td&gt;13.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;test2&lt;&#x2F;td&gt;&lt;td&gt;25.00&lt;&#x2F;td&gt;&lt;td&gt;22.67&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;test3&lt;&#x2F;td&gt;&lt;td&gt;19.00&lt;&#x2F;td&gt;&lt;td&gt;15.33&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;test4&lt;&#x2F;td&gt;&lt;td&gt;25.33&lt;&#x2F;td&gt;&lt;td&gt;15.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;test5&lt;&#x2F;td&gt;&lt;td&gt;10.00&lt;&#x2F;td&gt;&lt;td&gt;11.00&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Across the benchmarks, either OPT outperforms DEF handily
(as in &lt;code&gt;test1&lt;&#x2F;code&gt; and &lt;code&gt;test4&lt;&#x2F;code&gt;), or it runs in about the same time
as its DEF counterpart.
This is the behavior we expected.&lt;&#x2F;p&gt;
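&lt;p&gt;For reference, the speedups implied by the table can be computed directly. The short Python sketch below uses only the runtimes reported above; the &lt;code&gt;speedup&lt;&#x2F;code&gt; helper is a hypothetical name introduced for illustration.&lt;&#x2F;p&gt;

```python
# Runtimes in milliseconds, copied from the table above: (DEF, OPT).
runtimes = {
    "test1": (26.00, 13.00),
    "test2": (25.00, 22.67),
    "test3": (19.00, 15.33),
    "test4": (25.33, 15.00),
    "test5": (10.00, 11.00),
}


def speedup(def_ms, opt_ms):
    """Speedup of OPT over DEF; values above 1 mean OPT is faster."""
    return def_ms / opt_ms


speedups = {name: round(speedup(d, o), 2) for name, (d, o) in runtimes.items()}

if __name__ == "__main__":
    for name, value in speedups.items():
        print(name, value)
```

&lt;p&gt;This gives 2.0 for &lt;code&gt;test1&lt;&#x2F;code&gt; and 1.69 for &lt;code&gt;test4&lt;&#x2F;code&gt;, against 1.1, 1.24, and 0.91 for &lt;code&gt;test2&lt;&#x2F;code&gt;, &lt;code&gt;test3&lt;&#x2F;code&gt;, and &lt;code&gt;test5&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;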
&lt;p&gt;It is not immediately clear what makes &lt;code&gt;test1&lt;&#x2F;code&gt; and &lt;code&gt;test4&lt;&#x2F;code&gt; different from the
other testcases.
They are the only testcases with both of the following features:
division expressions and a single, non-nested loop.
The rest of the testcases have multiple loops, nested loops, or no
division expressions.
More investigation is needed to determine the exact reason why the other
vectorized testcases do not run noticeably faster than their
non-vectorized counterparts.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>LLVM Function Inlining Pass</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-function-inlining/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-function-inlining/</guid>
                <description>&lt;p&gt;It is a common convention in computer programming to abstract out repeated
functionality in a program into procedures, so that it can be referenced
multiple times without having to repeat the code. Separating out code in this
way can increase the legibility and modularity of the program, but comes at a
cost. Namely, function calls in programs can introduce a major runtime
performance overhead due to the costs of entering and exiting functions.
Typically, when a function is called, a new stack frame will need to be created
and the instruction pointer will need to be moved to the beginning of that
function—which may or may not be in the instruction cache—among other
function-initialization tasks. Also, some of the state of the calling function
(e.g., caller-saved registers) will need to be stored in preparation for the
call, such that it will not be overwritten by the called function. Separating
out functions also gives the compiler less information for optimizing the
program, because when looking at a function, the compiler cannot be sure of the
context that it will be running in. As such, it is often beneficial for
compilers to be able to get rid of function calls.&lt;&#x2F;p&gt;
&lt;p&gt;Function inlining is a classic compiler optimization that does exactly this; it
replaces calls in code with the body of the called function in order to remove
the overhead of making the call at runtime. In this project we implemented and
evaluated the effects of a function inlining pass which runs on programs in the
LLVM intermediate representation (IR).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;To properly inline a function call without breaking the program, there are
several considerations that must be made. As a first approximation, the call
instruction is erased, and all of the instructions in the function that was
called are put in its place. Any arguments or return values in the call must
then be handled. Uses of the arguments in the inlined function now need to refer
to the actual values that were provided in the call instruction. Due to the
possibility of multiple return instructions in the function, handling the data
flow and control flow for exiting the inlined function is a bit more
complicated. All return instructions are replaced with unconditional branches to
the block which immediately followed the call in the parent function. A phi node
is inserted into that block to coalesce the values from all of the return
instructions into a single place, which then can be used wherever the returned
value was used in the parent function. The LLVM API provides the function
&lt;code&gt;llvm::InlineFunction&lt;&#x2F;code&gt; which handles completing these operations. In general,
you would also need to handle naming collisions between variables in the caller
and callee functions. However, because the LLVM IR is in static single
assignment (SSA) form, and data dependencies are represented explicitly through
pointers, this is not a concern.&lt;&#x2F;p&gt;
&lt;p&gt;With a method for inlining in place, the next step is to determine when we want
to inline calls. Some function calls can be immediately eliminated from
consideration because they would be impossible to inline correctly. For example,
functions not defined locally cannot be inlined, as their implementations are
unknown. Also, functions that include indirect branch instructions may directly
address locations, and if they do, they cannot be inlined, as doing so could
cause the indirect branches to go to unexpected positions in the program.&lt;&#x2F;p&gt;
&lt;p&gt;Even if a function is able to be inlined without breaking the semantics of the
program, it still may not be beneficial to do so. For one, inlining functions
with large numbers of instructions can have a significant impact on the code
size of the program. As such, the space-time tradeoff must be considered before
inlining every function call. Also, duplicating code in this way can reduce the
temporal locality and thus the performance of the instruction cache.
Consequently, it is typically most beneficial for overall performance to inline
smaller functions and leave the larger ones as calls. In our implementation, we
decided to only inline functions whose instruction counts are below a constant
threshold. In the evaluation we discuss how the exact value of this threshold
affects the program performance. In general, more sophisticated heuristics could
be used to determine when to inline a function call in order to better optimize
the space-time tradeoff, such as considering the hotness of the call instruction
from profiling data. A call to a large function which is executed frequently
could be more beneficial to inline than a call to a small function that is
rarely reached.&lt;&#x2F;p&gt;
&lt;p&gt;Our function inlining pass runs through each of the function calls in the
original program and decides whether or not to inline them. As such, it never
gets caught in a loop trying to inline a recursive function. However, there are
cases when it is beneficial to inline deeply (i.e., inline calls that were
created from the inlining of another call). As such, our implementation provides
the capability to iterate through the program multiple times. In the evaluation,
we consider the effect that multiple iterations have on program performance.&lt;&#x2F;p&gt;
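&lt;p&gt;The interaction between the size threshold and the number of iterations can be sketched with a toy model in Python. This is not our LLVM pass: function bodies are plain lists, a call is the tuple &lt;code&gt;(&quot;call&quot;, callee)&lt;&#x2F;code&gt;, and every name in the sketch is hypothetical.&lt;&#x2F;p&gt;

```python
# Toy model, not LLVM: a function body is a list of instructions, and a call
# is the tuple ("call", callee). All names here are hypothetical.

def inline_pass(program, threshold):
    """Run one pass: inline each call that exists when the pass starts."""
    snapshot = {name: list(body) for name, body in program.items()}
    result = {}
    for name, body in snapshot.items():
        new_body = []
        for instr in body:
            is_call = isinstance(instr, tuple) and instr[0] == "call"
            callee = snapshot.get(instr[1]) if is_call else None
            # Keep non-calls, external callees, and callees over the threshold.
            if callee is None or len(callee) > threshold:
                new_body.append(instr)
            else:
                new_body.extend(callee)  # calls copied in wait for the next pass
        result[name] = new_body
    return result


def inline(program, threshold, levels):
    """Repeat the pass `levels` times (the recursive inlining level)."""
    for _ in range(levels):
        program = inline_pass(program, threshold)
    return program


program = {
    "leaf": ["mul"],
    "mid": ["add", ("call", "leaf")],
    "main": [("call", "mid"), ("call", "puts")],  # puts is external
    "loop": ["sub", ("call", "loop")],            # directly recursive
}

if __name__ == "__main__":
    print(inline(program, threshold=2, levels=1)["main"])
    print(inline(program, threshold=2, levels=2)["main"])
    print(inline(program, threshold=2, levels=2)["loop"])
```

&lt;p&gt;In this model, one pass leaves a call to &lt;code&gt;leaf&lt;&#x2F;code&gt; inside &lt;code&gt;main&lt;&#x2F;code&gt;, and a second pass removes it. The recursive function stops growing once its body exceeds the threshold, and each pass visits a finite set of call sites, so the process always terminates.&lt;&#x2F;p&gt;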
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We ran our optimization on a subset of programs from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&quot;&gt;LLVM test suite&lt;&#x2F;a&gt;.
Specifically, we focused on single-source programs in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-test-suite&#x2F;tree&#x2F;master&#x2F;SingleSource&#x2F;Benchmarks&#x2F;Misc&quot;&gt;Misc folder&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We benchmarked programs by measuring the total running time of each program. We
ran each benchmark test with varying function size thresholds and recursive
inlining levels to determine their effects on program performance. All benchmark
programs were run on a 2.90GHz Intel(R) Core(TM) i9-8950HK CPU with 32 GB 2400
MHz DDR4 RAM, and they were run five times. Average running times along with
standard deviations are shown below. &amp;quot;R&amp;quot; refers to the recursive inlining level
(e.g., &amp;quot;R=2&amp;quot; means that the inlining pass was run twice), and &amp;quot;T&amp;quot; refers to the
function size threshold in terms of LLVM instructions (e.g., &amp;quot;T=100&amp;quot; means that
all functions that translate to at most 100 LLVM instructions are inlined).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;llvm-function-inlining&#x2F;benchmarks.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;As seen above, function inlining usually improves the performance of the
benchmark programs, but there is no clear overall trend. For some benchmarks,
increasing the function size threshold (for a constant recursive inlining level)
improves performance. This is seen in &amp;quot;evalloop,&amp;quot; &amp;quot;fbench,&amp;quot; &amp;quot;mandel-2,&amp;quot; and a
few other benchmarks. However, in other cases, increasing the function size
threshold decreased performance. This is seen in &amp;quot;ReedSolomon,&amp;quot; &amp;quot;fp-convert,&amp;quot;
and several others. This variability in performance is likely due to the
unpredictable effects of function inlining. In some cases, function inlining may
improve performance due to the reduced number of stack allocations for function
calls. However, function inlining might cause page thrashing, in which chunks of
code are constantly swapped in and out of physical memory. From our benchmarks,
it seems that the average speedup is highest at R = 1 and T = 100, with a 7.59%
performance improvement. That being said, each individual benchmark&#x27;s
performance distribution varies quite a bit, and it would be naive to draw
conclusions from the average metrics.&lt;&#x2F;p&gt;
&lt;p&gt;Another important factor to consider with this optimization is code size. On
average, function inlining increased code size by about 20% (in terms of LLVM
instructions), and in some cases, bloated program size by almost 150%. There is
not a clear correlation between the inlining parameters and program size, since
the changes in code size are highly dependent on the programs themselves.
Overall, the space-time tradeoff seems reasonable, and if program performance is
crucial, the ~5-10% speedup may be worth a 100% increase in code size.&lt;&#x2F;p&gt;
&lt;p&gt;Ultimately, from the benchmarks, it seems that there are no universal optimal
function inlining parameters, and that each program requires manual
experimentation to determine optimal function inlining parameters. For future
work, it would be interesting to investigate a more complex cost model than
function size. For example, taking into account the number of function
parameters and whether the function is recursive or not could be interesting. It
would also be interesting to take a profile-guided approach to this optimization
by inlining &amp;quot;hot&amp;quot; functions and seeing the effect on program performance.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Runtime Execution Profiling using LLVM</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-profiling/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-profiling/</guid>
                <description>&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of the project is to add an LLVM pass that collects run-time information and remains accurate even in multithreaded programs. We are interested in three kinds of information: &lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Basic execution information, including the number of LLVM instructions, basic blocks, etc.&lt;&#x2F;li&gt;
&lt;li&gt;Expensive instruction information, such as the number of memory operations, multiplications, and branches, since these are most likely to affect execution time.&lt;&#x2F;li&gt;
&lt;li&gt;Function call information: the number of times each function is called.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;The LLVM architecture looks like this:&lt;&#x2F;p&gt;
&lt;img src=&quot;llvm.jpg&quot; style=&quot;width: 90%&quot;&gt;
&lt;p&gt;&amp;quot;IR&amp;quot; stands for &lt;em&gt;intermediate representation&lt;&#x2F;em&gt;. We only need to add one pass that transforms the existing IR program into another IR program that also gathers the desired profiling information. To add the pass, we followed the instructions in the &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;%7Easampson&#x2F;blog&#x2F;llvm.html&quot;&gt;blog post&lt;&#x2F;a&gt; by our instructor, &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;%7Easampson&#x2F;&quot;&gt;Adrian&lt;&#x2F;a&gt;. We also drew on the GitHub repo &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pranith&#x2F;AtomicCounter&#x2F;blob&#x2F;master&#x2F;AtomicCountPass&#x2F;AtomicCount.cpp&quot;&gt;atomicCounter&lt;&#x2F;a&gt;, in which increments to the global counter variables are atomic updates. In our project, we don&#x27;t profile atomic operations but use atomic updates to global variables.&lt;&#x2F;p&gt;
&lt;p&gt;We start from &lt;a href=&quot;http:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;WritingAnLLVMPass.html#the-modulepass-class&quot;&gt;ModulePass&lt;&#x2F;a&gt; and then visit the functions inside a Module, the BasicBlocks of each function, and the instructions of each BasicBlock to obtain all information. At the beginning and end of each module, we also call the custom methods &lt;code&gt;initialize&lt;&#x2F;code&gt;, to create global variables in the IR, and &lt;code&gt;finalize&lt;&#x2F;code&gt;, to print them. The structure of our profiling pass looks like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;SkeletonPass : &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ModulePass &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;static char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ID;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;SkeletonPass&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() : ModulePass(ID) {}
      
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;virtual bool &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;runOnModule&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(Module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;M); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;when there is a Module
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;virtual bool &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;runOnFunction&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(Function &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;F, Module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;M); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;called by runOnModule
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;virtual bool &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;runOnBasicBlock&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(BasicBlock &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;BB, Module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;M); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; called by runOnFunction
      
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;initialize&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(Module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;M); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;create global variable
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;bool &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;finialize&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(Module &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;M); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;print global variable
      
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;createInstr&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(BasicBlock &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bb, Constant &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;counter_ptr, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;num);
      
    vector&amp;lt;string&amp;gt; atomicCounter; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;keep global variable names for profiling. e.g. instr counter
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;basic-execution-information&quot;&gt;Basic Execution Information&lt;&#x2F;h3&gt;
&lt;p&gt;In the &lt;code&gt;initialize&lt;&#x2F;code&gt; method, we create the global variables &lt;code&gt;llvmInstrAtomicCounter&lt;&#x2F;code&gt; and &lt;code&gt;basicBlockAtomicCounter&lt;&#x2F;code&gt; with the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1GlobalVariable.html#a3ef813d6bda7e49e31cb6bf239c4e264&quot;&gt;GlobalVariable&lt;&#x2F;a&gt; constructor:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;GlobalVariable(M, I64Ty, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, GlobalValue::CommonLinkage, ConstantInt::get(I64Ty, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;), atomicCounter[i]);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then in &lt;code&gt;runOnBasicBlock&lt;&#x2F;code&gt;, we obtain a pointer to each global variable by name with the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1Module.html#abd8f7242df6ecb10f429c4d39403c334&quot;&gt;&lt;code&gt;getOrInsertGlobal&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; method. After counting the instructions in each block, we create an atomic addition with the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1AtomicRMWInst.html#abf7e0649c7f272cc49165e579be010a5&quot;&gt;&lt;code&gt;AtomicRMWInst&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; constructor.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, in the &lt;code&gt;finalize&lt;&#x2F;code&gt; method, we print the profiling results (each global variable name and its value) at the end of &lt;code&gt;main&lt;&#x2F;code&gt;, just before its &lt;code&gt;return&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;We create a &lt;code&gt;printf&lt;&#x2F;code&gt; &lt;code&gt;FunctionCallee&lt;&#x2F;code&gt; with the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1Module.html#a5310b7bb84192372c55cbc66cd975c59&quot;&gt;&lt;code&gt;getOrInsertFunction&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; method.&lt;&#x2F;li&gt;
&lt;li&gt;We call the &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1IRBuilder.html#aa87594a9d1f908486410d8fa9bea9c1f&quot;&gt;&lt;code&gt;CreateGlobalStringPtr&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; method to create a pointer to each string we would like to print.&lt;&#x2F;li&gt;
&lt;li&gt;Then we obtain the value of the global variable corresponding to each string with a &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1LoadInst.html&quot;&gt;&lt;code&gt;LoadInst&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The last step is to create the function call with &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1CallInst.html#a850d8262cd900958b3153c4aa080b2bb&quot;&gt;&lt;code&gt;Create&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The complete code is posted in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;neiladit&#x2F;llvm_profiling&#x2F;blob&#x2F;master&#x2F;skeleton&#x2F;Skeleton.cpp&quot;&gt;Neil&#x27;s GitHub repo&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
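&lt;p&gt;To make the flow above concrete, here is a minimal Python model of what the instrumented program does at run time. It is a sketch, not the pass itself: the block names, block sizes, and execution trace are made up for illustration, and a plain &lt;code&gt;Counter&lt;&#x2F;code&gt; stands in for the global variables and the atomic adds.&lt;&#x2F;p&gt;

```python
from collections import Counter

# Hypothetical program: basic block name -> number of LLVM instructions in it.
# These names and sizes are invented for illustration.
BLOCK_SIZES = {"entry": 3, "loop": 5, "exit": 2}

# Hypothetical run: the order in which the basic blocks execute.
TRACE = ["entry", "loop", "loop", "loop", "exit"]


def run_instrumented(trace, block_sizes):
    """Replay a trace, bumping the counters the way the inserted adds would."""
    counters = Counter()  # stands in for the zero-initialized globals
    for block in trace:
        # One add per counter per block; the operand is a constant that
        # was computed when the pass inspected the block.
        counters["llvmInstrAtomicCounter"] += block_sizes[block]
        counters["basicBlockAtomicCounter"] += 1
    return counters


def report(counters):
    """Format "name: value" lines, like the printf calls placed before return."""
    return ["{}: {}".format(name, value) for name, value in counters.items()]


counters = run_instrumented(TRACE, BLOCK_SIZES)
for line in report(counters):
    print(line)
```

&lt;p&gt;In this model the instruction counter ends at 20 and the block counter at 5. Because the per-block instruction count is known when the pass runs, each block needs only one add per counter, no matter how many instructions it contains.&lt;&#x2F;p&gt;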
&lt;h3 id=&quot;expensive-operations-information&quot;&gt;Expensive Operations Information&lt;&#x2F;h3&gt;
&lt;p&gt;This part follows the same flow as the basic execution information, except that we need to distinguish instruction types and increment the corresponding counter on a per-block basis. Therefore, in each call to &lt;code&gt;runOnBasicBlock&lt;&#x2F;code&gt;, we need the following lines:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;auto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; it &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; bb.begin(); it &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; bb.end(); it&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
	&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;switch &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(it-&amp;gt;getOpcode()) {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;case&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; Instruction::Mul:&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; multiplication
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;            mul_instr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;continue&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;case&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; Instruction::Br:&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; branch
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;            br_instr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;continue&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;case&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; Instruction::Store:&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; store
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;case&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; Instruction::Load:&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; load
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;            mem_instr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;continue&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;default&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;:
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;break&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;function-calls&quot;&gt;Function calls&lt;&#x2F;h3&gt;
&lt;p&gt;We profile the number of times each function is called in the program. We first create a zero-initialized global variable for every function. This is done statically by iterating over the module&#x27;s function list, given by &lt;code&gt;getFunctionList()&lt;&#x2F;code&gt;. The list contains every function in the module, including those that come from C libraries, like &lt;code&gt;printf&lt;&#x2F;code&gt; or &lt;code&gt;scanf&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;auto &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;functionList &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; M.getFunctionList(); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; gets the list of functions
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;auto &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;function &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; functionList) { &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;iterates over the list
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;    Value &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;atomic_counter &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; new GlobalVariable(M, I64Ty, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, GlobalValue:&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;CommonLinkage, ConstantInt:&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;get(I64Ty, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;),  function.getName()&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;.glob&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; create a global variable, name it based on the function name
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;} 

&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Next, we insert an atomic counter increment at the start of each function definition. Because every call passes through the entry block, the counter is incremented exactly once per call, no matter how many return points the function has. We start with the entry basic block of the function, which is given by the iterator &lt;code&gt;F.begin()&lt;&#x2F;code&gt;. To insert at the top of the basic block, we use &lt;code&gt;getFirstNonPHI()&lt;&#x2F;code&gt;, which returns the first instruction that is not a PHI node. We insert an atomic add instruction there, just as for the other profiling counters.&lt;&#x2F;p&gt;
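&lt;p&gt;The idea of counting on entry can be sketched in Python with a decorator. This is an analogy under stated assumptions, not the LLVM code: &lt;code&gt;count_calls&lt;&#x2F;code&gt;, &lt;code&gt;sign&lt;&#x2F;code&gt;, and &lt;code&gt;never_called&lt;&#x2F;code&gt; are hypothetical names, and a dictionary entry plays the role of each &lt;code&gt;.glob&lt;&#x2F;code&gt; global.&lt;&#x2F;p&gt;

```python
import functools
from collections import Counter

call_counts = Counter()  # stands in for the per-function ".glob" globals


def count_calls(fn):
    """Increment the function's counter on entry, before its body runs."""
    name = fn.__name__ + ".glob"
    call_counts[name] = 0  # created up front, like the static initialization

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_counts[name] += 1  # the "entry block" increment
        return fn(*args, **kwargs)

    return wrapper


@count_calls
def sign(x):
    # Three return points; none of them needs its own increment.
    if x == 0:
        return 0
    if x == abs(x):
        return 1
    return -1


@count_calls
def never_called():
    return None


results = [sign(v) for v in (-2, 0, 7)]
print(results, dict(call_counts))
```

&lt;p&gt;Counting at the entry rather than at each return keeps the instrumentation to one increment per function. Functions that are never called still get a counter, which simply stays at zero.&lt;&#x2F;p&gt;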
&lt;h2 id=&quot;hardest-parts&quot;&gt;Hardest Parts&lt;&#x2F;h2&gt;
&lt;ol&gt;
&lt;li&gt;For people who are new to LLVM, instructions are hard to find and follow. Searching on Google can help if you know what you&#x27;re looking for. It&#x27;s difficult to get helpful information unless your search is confined to existing functions. Even though the LLVM documentation is pretty exhaustive, it has too many functions to go through, and the lack of examples can be off-putting. Tutorials and existing skeleton code on GitHub can be really handy in these scenarios, and we took advantage of both. They not only helped us implement specific functionality like &lt;code&gt;printf&lt;&#x2F;code&gt; but also gave structure to our IR pass.&lt;&#x2F;li&gt;
&lt;li&gt;String manipulations: I am not sure if this is the right term to use, but LLVM seems to have two string types, Twine and StringRef. &lt;code&gt;getName&lt;&#x2F;code&gt; on a function returns a StringRef. To make a custom name, I write &lt;code&gt;F.getName()+&amp;quot;name&amp;quot;&lt;&#x2F;code&gt;, which returns a Twine. But the function &lt;code&gt;getGlobalVariable&lt;&#x2F;code&gt; only accepts a StringRef. Twine has a function &lt;code&gt;str&lt;&#x2F;code&gt; which can be used to convert it into a string. Even though this is a straightforward solution, it took time to figure out the problem and look into the documentation of these classes.&lt;&#x2F;li&gt;
&lt;li&gt;Setting up and running benchmarks: &amp;quot;It&#x27;s all fun and games until you run your IR pass on real programs&amp;quot; - anonymous. We faced issues setting up benchmarks like PARSEC on Mac. Embench has multiple source files to compile, which caused trouble partly because our IR pass was not thoughtfully written: we were defining global variables in every file, regardless of whether it was a function&#x2F;utility file or the main source file. We ended up using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kozyraki&#x2F;phoenix&quot;&gt;Phoenix&lt;&#x2F;a&gt;, which worked well on Linux but was not meant for Mac. Hence, to run our LLVM pass, we had to install and update libraries on Linux machines.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;evaluation-and-results&quot;&gt;Evaluation and results&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;gcov-profiling-tool&quot;&gt;Gcov profiling tool&lt;&#x2F;h3&gt;
&lt;p&gt;To validate our profiling results, we use the &lt;code&gt;gcov&lt;&#x2F;code&gt; coverage tool, which can also serve as a profiler that reports execution statistics. We first compile the code with the gcc flags required for &lt;code&gt;gcov&lt;&#x2F;code&gt;: &lt;code&gt;gcc -fprofile-arcs -ftest-coverage foo.c&lt;&#x2F;code&gt;. Then, after executing the resulting binary so that it writes its coverage data, we run &lt;code&gt;gcov&lt;&#x2F;code&gt; with the relevant flags to get statistics to compare with our profiler:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;gcov -b -c -f foo.c
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We ran the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kozyraki&#x2F;phoenix&quot;&gt;Phoenix&lt;&#x2F;a&gt; benchmark suite and used &lt;code&gt;gcov&lt;&#x2F;code&gt; to collect function call statistics. The makefile initially had the optimization flag &lt;code&gt;-O3&lt;&#x2F;code&gt;, but optimizations (inlining, for example) can change the number of function calls, so we compile without any optimization flags. We picked &lt;code&gt;Kmeans&lt;&#x2F;code&gt; arbitrarily to demonstrate a detailed example of our profiling tool below:&lt;&#x2F;p&gt;
&lt;h4 id=&quot;kmeans&quot;&gt;Kmeans&lt;&#x2F;h4&gt;
&lt;p&gt;Sequential execution without optimization flags yields the following output for function calls by &lt;code&gt;gcov&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main: 1
dump_matrix: 1
calc_means: 23
find_clusters: 23
add_to_sum: 23000
get_sq_dist: 230000
generate_points: 2
parse_args: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We only list the function calls from &lt;code&gt;gcov&lt;&#x2F;code&gt; as a sanity check. Our profiling pass also outputs other instruction statistics. The output generated by our profiling pass is listed below:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;llvmInstrAtomicCounter: 40091156
basicBlockAtomicCounter: 5019598
mulAtomicCounter: 691267
memOpAtomicCounter: 20250888
branchAtomicCounter: 4766547

parse_args: 1
generate_points: 2
add_to_sum: 23000
find_clusters: 23
get_sq_dist: 230000
calc_means: 23
dump_matrix: 1
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The numbers match with &lt;code&gt;gcov&lt;&#x2F;code&gt;. We compile the results for all benchmarks below.&lt;&#x2F;p&gt;
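&lt;p&gt;Since the two tools print functions in a different order, the comparison is easier to trust when done mechanically. The sketch below assumes both outputs have already been reduced to the simplified &lt;code&gt;name: count&lt;&#x2F;code&gt; form shown above (raw &lt;code&gt;gcov&lt;&#x2F;code&gt; output looks different), and &lt;code&gt;parse_counts&lt;&#x2F;code&gt; is a hypothetical helper, not part of our pass.&lt;&#x2F;p&gt;

```python
def parse_counts(text):
    """Parse "name: count" lines into a dict, skipping blank lines."""
    counts = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        name, value = line.split(":")
        counts[name.strip()] = int(value)
    return counts


# Function call counts for Kmeans, copied from the two listings above.
GCOV = """
main: 1
dump_matrix: 1
calc_means: 23
find_clusters: 23
add_to_sum: 23000
get_sq_dist: 230000
generate_points: 2
parse_args: 1
"""

OURS = """
parse_args: 1
generate_points: 2
add_to_sum: 23000
find_clusters: 23
get_sq_dist: 230000
calc_means: 23
dump_matrix: 1
main: 1
"""

gcov_counts = parse_counts(GCOV)
our_counts = parse_counts(OURS)

# Any function missing from one side, or with a different count, is a mismatch.
mismatches = {
    name
    for name in gcov_counts.keys() | our_counts.keys()
    if gcov_counts.get(name) != our_counts.get(name)
}
print("match" if not mismatches else sorted(mismatches))
```

&lt;p&gt;Comparing dictionaries rather than text makes the check independent of print order, which differs between the two tools.&lt;&#x2F;p&gt;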
&lt;h3 id=&quot;results&quot;&gt;Results&lt;&#x2F;h3&gt;
&lt;p&gt;The following results are for sequential execution of the benchmarks. We have reported the instruction counts in the following table:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;LLVM instruction count&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;basic block count&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;multiplication count&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;memory operation count&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;branch operation count&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Histogram&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1707337893&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;104532503&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;871090234&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;104532501&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Kmeans&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;40091156&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;5019598&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;691267&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;20250888&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;4766547&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Linear regression&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2244735757&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;86335992&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;86335983&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1093589204&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;86335991&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;PCA&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;35388&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3461&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;573&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;17975&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3454&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;String match&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3652012341&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;454936402&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1575281754&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;443891533&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Word Count&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1148215642&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;213577613&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;46026&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;512924102&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;199431461&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Function counts for the benchmarks are listed below (they have been matched with &lt;code&gt;gcov&lt;&#x2F;code&gt;&#x27;s output as well):&lt;&#x2F;p&gt;
&lt;p&gt;Histogram:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;test_endianess: 1
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Kmeans:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;parse_args: 1
generate_points: 2
add_to_sum: 23000
find_clusters: 23
get_sq_dist: 230000
calc_means: 23
dump_matrix: 1
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Linear Regression:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;PCA:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;parse_args: 1
dump_points: 2
generate_points: 1
calc_mean: 8
calc_cov: 8
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;String match:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;getnextline: 5522432
compute_hashes: 5522435
string_match: 1
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Word Count:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;wordcount_cmp: 579521
wordcount_splitter: 1
wordcount_getword: 1
wordcount_addword: 1513425
dobsearch: 1513425
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;pthread-execution&quot;&gt;Pthread execution&lt;&#x2F;h3&gt;
&lt;p&gt;We also ran &lt;code&gt;kmeans&lt;&#x2F;code&gt; using pthreads on &lt;code&gt;8&lt;&#x2F;code&gt; threads. Some of the function call counts in the output are scaled by 8 in both &lt;code&gt;gcov&lt;&#x2F;code&gt; and our profiling pass. The matching results also suggest that the atomic counters behave correctly when several threads update them.&lt;&#x2F;p&gt;
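&lt;p&gt;Why atomic updates matter here can be sketched with Python threads. This is only an analogy: a lock-protected counter stands in for the atomic add, &lt;code&gt;AtomicCounter&lt;&#x2F;code&gt; is a hypothetical class, and the figure of 23 calls per thread is an assumption chosen to be consistent with the 184 calls reported in this section.&lt;&#x2F;p&gt;

```python
import threading


class AtomicCounter:
    """A lock-protected counter, standing in for an atomic add on a global."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount=1):
        # The lock makes the read-modify-write a single indivisible step.
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


NUM_THREADS = 8
CALLS_PER_THREAD = 23  # assumption for illustration

find_clusters_calls = AtomicCounter()


def worker():
    # Each thread "calls" the function a fixed number of times.
    for _ in range(CALLS_PER_THREAD):
        find_clusters_calls.add()


threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("find_clusters:", find_clusters_calls.value)
```

&lt;p&gt;With indivisible updates, the total is the number of threads times the calls per thread. Without them, concurrent increments could overwrite each other and the total could come out lower.&lt;&#x2F;p&gt;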
&lt;p&gt;&lt;code&gt;gcov&lt;&#x2F;code&gt; output:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main: 1
calc_means: 184
find_clusters: 184
add_to_sum: 23000
get_sq_dist: 230000
generate_points: 2
parse_args: 1
dump_points: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our profiling output:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;llvmInstrAtomicCounter: 40125466
basicBlockAtomicCounter: 5023832
mulAtomicCounter: 691429
memOpAtomicCounter: 20266988
branchAtomicCounter: 4770459

dump_points: 1
parse_args: 1
generate_points: 2
add_to_sum: 23000
find_clusters: 184
get_sq_dist: 230000
calc_means: 184
main: 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;</description>
            </item>
        
            <item>
                <title>Loop Perforation</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-perforation/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-perforation/</guid>
                <description>&lt;p&gt;Have you ever been frustrated that your code takes too long to run?
Do you have a sneaking suspicion that most of the time is spent in loops?
Have you ever considered just &lt;em&gt;running fewer loops&lt;&#x2F;em&gt; by having your compiler mangle your code to skip arbitrary loop iterations?&lt;&#x2F;p&gt;
&lt;p&gt;Welcome to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Loop_perforation&quot;&gt;loop perforation&lt;&#x2F;a&gt;, an idea that sounds so ludicrous that we can barely believe it actually works at all!&lt;&#x2F;p&gt;
&lt;p&gt;The basic premise is common across the field of &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Approximate_computing&quot;&gt;approximate computing&lt;&#x2F;a&gt;: many applications spend a lot of time and energy getting results that are &lt;em&gt;exactly&lt;&#x2F;em&gt; right, when they could happily get away with results that are &lt;em&gt;mostly&lt;&#x2F;em&gt; right.
If a programmer is able to define what exactly &amp;quot;mostly right&amp;quot; means for their particular application, then approximate computing techniques allow them to explore trading off cost and correctness.
The original &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2025133&quot;&gt;loop perforation paper&lt;&#x2F;a&gt;, &amp;quot;Managing Performance vs. Accuracy Trade-offs with Loop Perforation&amp;quot;, from ESEC&#x2F;FSE’11, takes this idea to a beautifully flippant extreme: look at some loops, and replace something like &lt;code&gt;i++&lt;&#x2F;code&gt; with &lt;code&gt;i += 2&lt;&#x2F;code&gt; or even &lt;code&gt;i += 21&lt;&#x2F;code&gt;.
Skipping loops like this almost definitely makes your code both faster and more energy efficient (assuming it still runs!).&lt;&#x2F;p&gt;
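&lt;p&gt;As a small illustration of the trade-off, the sketch below perforates a loop that computes an average. The function name &lt;code&gt;perforated_mean&lt;&#x2F;code&gt; and the input data are made up for this example; the only idea taken from the paper is replacing a step of 1 with a larger step.&lt;&#x2F;p&gt;

```python
def perforated_mean(values, step=1):
    """Average every `step`-th element; step=1 is the unperforated loop."""
    total = 0.0
    visited = 0
    for i in range(0, len(values), step):  # i += step instead of i++
        total += values[i]
        visited += 1
    return total / visited, visited


data = [float(x) for x in range(1, 101)]  # hypothetical input: 1.0 .. 100.0

exact, exact_iters = perforated_mean(data)
approx, approx_iters = perforated_mean(data, step=2)

relative_error = abs(exact - approx) / exact
print(exact, exact_iters)
print(approx, approx_iters)
print(relative_error)
```

&lt;p&gt;On this input the perforated loop does half the iterations and returns 50.0 instead of 50.5, an error of about 1%. Whether that is acceptable depends entirely on the application, which is why the user has to supply an error metric.&lt;&#x2F;p&gt;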
&lt;p&gt;For this project, we set out to implement simple loop perforation as an &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&quot;&gt;LLVM&lt;&#x2F;a&gt; pass, with the goal of a richer exploration into what it means to actually accept worse accuracy from our programs in this domain.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-we-did&quot;&gt;What We Did&lt;&#x2F;h2&gt;
&lt;p&gt;LLVM is an industrial-strength compiler that structures optimizations as a series of passes that both act on and produce a human-readable intermediate representation.&lt;&#x2F;p&gt;
&lt;p&gt;We implemented two new LLVM passes:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;LoopCountPass&lt;&#x2F;code&gt;, a function pass that identifies and saves which program loops are candidates for perforation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;LoopPerforationPass&lt;&#x2F;code&gt;, a loop pass that perforates loops at a specified rate.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Both passes work in conjunction with additional infrastructure:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;An external driver program, written in Python.&lt;&#x2F;li&gt;
&lt;li&gt;User-provided representative inputs.&lt;&#x2F;li&gt;
&lt;li&gt;User-defined accuracy&#x2F;error metrics.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;You can find our implementation &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;avanhatt&#x2F;llvm-loop-perforation&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-meandering-tour-of-loop-perforation&quot;&gt;A Meandering Tour of Loop Perforation&lt;&#x2F;h2&gt;
&lt;p&gt;To understand the interplay between our LLVM pass, the user-defined error metrics, and the Python driver, let&#x27;s consider a toy example.&lt;&#x2F;p&gt;
&lt;p&gt;Say we want to write a silly function that sums the integers from 0 to some number &lt;code&gt;n&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;#include &amp;lt;stdio.h&amp;gt;

int sum_to_n(int n) {
    int sum = 0;
    for (int i = 0; i &amp;lt; n; i++) {
       sum += i;
    }
    return sum;
}

int main(int argc, char const *argv[]) {
    printf(&amp;quot;%d\n&amp;quot;, sum_to_n(5));
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can imagine that perforating the loop here is a &lt;em&gt;pretty&lt;&#x2F;em&gt; bad idea: this implementation doesn&#x27;t have a ton of wiggle room in getting a totally correct result.
However, if we suspend disbelief for a moment and imagine that some poor soul only cares about the &lt;em&gt;order of magnitude&lt;&#x2F;em&gt; of the resulting sum, then perforating this loop becomes an interesting task.&lt;&#x2F;p&gt;
&lt;p&gt;Conceptually, the driver just needs to take in the &lt;code&gt;sum_to_n&lt;&#x2F;code&gt; implementation and some oracle that can answer the question: &amp;quot;is this perforated implementation &lt;em&gt;good enough&lt;&#x2F;em&gt;?&amp;quot;
(For more complicated applications, the driver also needs a representative input, but for this example the executable takes no arguments.)
So, let&#x27;s also tell the driver how wrong the ultimate answer can be.&lt;&#x2F;p&gt;
&lt;p&gt;To do this, we require a Python implementation of an &lt;code&gt;errors&lt;&#x2F;code&gt; module.
At a high level, &lt;code&gt;errors&lt;&#x2F;code&gt; should tell the driver 1) which error metrics we care about for this application, and 2) how to compute a float value between 0 and 1 for each metric (0 is perfect, 1 is unacceptable).
For &lt;code&gt;sum_to_n&lt;&#x2F;code&gt;, let&#x27;s define a single error metric: the absolute difference between our new sum and the correct sum, divided by the correct sum:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;# Provide the name of each metric we care about
error_names = [&amp;quot;error_ratio&amp;quot;]

# The arguments `standard_fn` and `perforated_fn` are filenames of output files
def error(standard_fn, perforated_fn):
    standard = int(get_contents(standard_fn))
    perforated = int(get_contents(perforated_fn))

    delta = abs(standard - perforated)
    ratio = delta &#x2F; standard

    return {&amp;quot;error_ratio&amp;quot; : ratio}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
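&lt;p&gt;For a concrete feel of this metric, here is a self-contained variant that can be run on its own. Several parts are assumptions rather than our actual code: &lt;code&gt;get_contents&lt;&#x2F;code&gt; is a stand-in for the helper used above, the zero guard and the clamp to 1 are additions, and the perforated output is simulated in Python with a step of 2 instead of coming from the LLVM pass.&lt;&#x2F;p&gt;

```python
import os
import tempfile


def get_contents(filename):
    # Hypothetical helper: read an output file and strip trailing whitespace.
    with open(filename) as f:
        return f.read().strip()


def error(standard_fn, perforated_fn):
    standard = int(get_contents(standard_fn))
    perforated = int(get_contents(perforated_fn))
    if standard == 0:
        # Guard added for this sketch; avoids dividing by zero.
        return {"error_ratio": 0.0 if perforated == 0 else 1.0}
    ratio = abs(standard - perforated) / abs(standard)
    return {"error_ratio": min(ratio, 1.0)}  # clamp so the metric stays in [0, 1]


def sum_to_n(n, step=1):
    # Python stand-in for the C function; step=2 simulates perforation.
    return sum(range(0, n, step))


with tempfile.TemporaryDirectory() as tmp:
    standard_fn = os.path.join(tmp, "standard.txt")
    perforated_fn = os.path.join(tmp, "perforated.txt")
    with open(standard_fn, "w") as f:
        f.write("{}\n".format(sum_to_n(5)))
    with open(perforated_fn, "w") as f:
        f.write("{}\n".format(sum_to_n(5, step=2)))
    result = error(standard_fn, perforated_fn)

print(result)
```

&lt;p&gt;With a step of 2 the simulated loop visits 0, 2, and 4, so the sum is 6 instead of 10 and the error ratio is 0.4.&lt;&#x2F;p&gt;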
&lt;p&gt;Now, we can hand off this little application to the driver, to determine which loops it can successfully perforate. The driver takes an argument for what level of error is acceptable; we said we cared about the order of magnitude, so let&#x27;s say the error can be 50% and we&#x27;d still be happy:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;$ python3 driver.py tests&#x2F;sum_to_n -e 0.5
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s walk through what happens now. First, the driver needs a basic sense of what the correct behavior for this application should be, so it builds and executes the application (on the representative input, if provided).
For &lt;code&gt;sum_to_n&lt;&#x2F;code&gt;, our basic understanding of arithmetic holds up, and we get that the sum of the numbers from 0 to 4 is indeed:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;10
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This output is saved to disk for later comparisons.&lt;&#x2F;p&gt;
&lt;p&gt;Executing the application a single time assumes that the application&#x27;s output is deterministic, which is potentially a huge blind spot in loop perforation.
In particular, our implementation does nothing to detect non-determinism in the intact program, which seems consistent with much published work on this topic (though some systems, e.g., &lt;a href=&quot;https:&#x2F;&#x2F;dada.cs.washington.edu&#x2F;research&#x2F;tr&#x2F;2015&#x2F;01&#x2F;UW-CSE-15-01-01.pdf&quot;&gt;ACCEPT&lt;&#x2F;a&gt;, do address this by collecting multiple runs).
However, we do run the perforated variants multiple times, as we will see below.&lt;&#x2F;p&gt;
&lt;p&gt;After the driver executes the standard variant of the application, it needs to determine what loop structures the program has to exploit.
To accomplish this, the driver runs our first pass, &lt;code&gt;LoopCountPass&lt;&#x2F;code&gt;.
Because we are in LLVM-land, this pass gets to rely on existing LLVM infrastructure for most of the heavy lifting.
The pass invokes two dependent passes, &lt;code&gt;llvm::LoopInfo&lt;&#x2F;code&gt; and &lt;code&gt;llvm::LoopSimplify&lt;&#x2F;code&gt;, which return statistics and simplify loops to a canonical form where possible, respectively.
Our pass then examines which of these loops have both been successfully converted to a simple form and have a canonical induction variable.
We write these loops, which we consider to be perforation candidates, out to disk as a JSON file.
For &lt;code&gt;sum_to_n&lt;&#x2F;code&gt;, the implementation has exactly one loop in a simple form, so the resulting JSON looks something like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &amp;quot;sum_to_n-phis.ll&amp;quot;: {
        &amp;quot;sum_to_n&amp;quot;: [
            &amp;quot;%2&amp;lt;header&amp;gt;&amp;lt;exiting&amp;gt;,%4,%6&amp;lt;latch&amp;gt;&amp;quot;
        ]
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We use functionality from &lt;code&gt;llvm::Loop::Print()&lt;&#x2F;code&gt; to get the name of each loop, which includes which basic blocks are included in the loop (here, &lt;code&gt;%2&lt;&#x2F;code&gt;, &lt;code&gt;%4&lt;&#x2F;code&gt;, and &lt;code&gt;%6&lt;&#x2F;code&gt;) as well as their role within the loop.
Here, we enter the loop at block &lt;code&gt;%2&lt;&#x2F;code&gt;, then either exit the loop or continue through blocks &lt;code&gt;%4&lt;&#x2F;code&gt; and &lt;code&gt;%6&lt;&#x2F;code&gt; before branching back to &lt;code&gt;%2&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Next, the driver needs to explore how far it can mangle each loop before the results become unacceptable (remember, here that means with error under 50%).
The driver iteratively perforates each candidate loop with a set of possible perforation rates (for this example, 2, 3, and 5).
More concretely, the driver invokes the second LLVM pass, &lt;code&gt;LoopPerforationPass&lt;&#x2F;code&gt;, which finds each candidate loop&#x27;s canonical induction variable and scales its increment by the desired rate.
For our toy example, conceptually this means changing the loop increment expression from:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (int i = 0; i &amp;lt; n; i++) {
    ...
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (int i = 0; i &amp;lt; n; &#x2F;* Perforated rate here -&amp;gt; *&#x2F; i += 2 ) {
    ...
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At the LLVM intermediate representation level, this changes the loop&#x27;s basic blocks from:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;; &amp;lt;label&amp;gt;:2:
  %.01 = phi i32 [ 0, %1 ], [ %5, %6 ]
  %.0 = phi i32 [ 0, %1 ], [ %7, %6 ]
  %3 = icmp slt i32 %.0, %0            ;; Check if i &amp;lt; n
  br i1 %3, label %4, label %8         ;; Exit or loop again

; &amp;lt;label&amp;gt;:4:
  %5 = add nsw i32 %.01, %.0           ;; sum += i
  br label %6

; &amp;lt;label&amp;gt;:6:
  %7 = add nsw i32 %.0, 1              ;; i += 1
  br label %2
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;; &amp;lt;label&amp;gt;:2:
  %.01 = phi i32 [ 0, %1 ], [ %5, %6 ]
  %.0 = phi i32 [ 0, %1 ], [ %7, %6 ]
  %3 = icmp slt i32 %.0, %0
  br i1 %3, label %4, label %8         ;; &amp;lt;- Branching condition doesn&amp;#39;t change

; &amp;lt;label&amp;gt;:4:
  %5 = add nsw i32 %.01, %.0
  br label %6

; &amp;lt;label&amp;gt;:6:
  %7 = add nsw i32 %.0, 2              ;; &amp;lt;- Perforated rate here
  br label %2
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After the driver perforates the desired loop, it executes that modified executable on the representative input, this time multiple times, as these could be non-deterministic even if the original program was not.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The driver checks the return code of the execution and treats crashing programs as unacceptable.
It also measures the execution of each program in wall-clock time.
The output is saved to disk and then compared with the standard, non-perforated output.
The comparison uses the application-specific &lt;code&gt;errors&lt;&#x2F;code&gt; module, calculating the error between the perforated and expected result for a number of user-defined metrics.
The driver repeats this for every potential perforation rate, and then writes the results out as another JSON file.&lt;&#x2F;p&gt;
&lt;p&gt;For &lt;code&gt;sum_to_n&lt;&#x2F;code&gt;, this looks like the following (with some light post-processing for brevity):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &amp;quot;{\{\&amp;quot;sum_to_n\&amp;quot;: {\&amp;quot;%2,%4,%6\&amp;quot;: 1}}}&amp;quot;: {
        &amp;quot;return_code&amp;quot;: 0,
        &amp;quot;time&amp;quot;: 0.0195,
        &amp;quot;errors&amp;quot;: {
            &amp;quot;error_ratio&amp;quot;: 0.0
        }
    },
    &amp;quot;{\{\&amp;quot;sum_to_n\&amp;quot;: {\&amp;quot;%2,%4,%6\&amp;quot;: 2}}}&amp;quot;: {
        &amp;quot;return_code&amp;quot;: 0,
        &amp;quot;time&amp;quot;: 0.0194,
        &amp;quot;errors&amp;quot;: {
            &amp;quot;error_ratio&amp;quot;: 0.4
        }
    },
    &amp;quot;{\{\&amp;quot;sum_to_n\&amp;quot;: {\&amp;quot;%2,%4,%6\&amp;quot;: 3}}}&amp;quot;: {
        &amp;quot;return_code&amp;quot;: 0,
        &amp;quot;time&amp;quot;: 0.0197,
        &amp;quot;errors&amp;quot;: {
            &amp;quot;error_ratio&amp;quot;: 0.7
        }
    },
    &amp;quot;{\{\&amp;quot;sum_to_n\&amp;quot;: {\&amp;quot;%2,%4,%6\&amp;quot;: 5}}}&amp;quot;: {
        &amp;quot;return_code&amp;quot;: 0,
        &amp;quot;time&amp;quot;: 0.0196,
        &amp;quot;errors&amp;quot;: {
            &amp;quot;error_ratio&amp;quot;: 1.0
        }
    }
    &#x2F;&#x2F; ...
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, our wall-clock execution time is essentially a wash, because the loop is so small that the difference is negligible relative to noise.
However, we can see that the error ratio changes drastically as we increase the perforation rate, from perfect for the standard run to completely unacceptable at a perforation rate of 5, as we might expect (skipping 4 in every 5 iterations is skipping every integer except 0!). Note that if our application had multiple loops, we would see a number of results equal to the number of loops times the number of perforation rates (one run for each).&lt;&#x2F;p&gt;
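&lt;p&gt;As a sanity check, the error ratios above can be reproduced without the driver. The following is a minimal Python sketch, not part of our tool: &lt;code&gt;perforated_sum&lt;&#x2F;code&gt; is a hypothetical stand-in for the perforated binary, and it assumes the representative input is n = 5, so that the intact loop sums 0 through 4.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def perforated_sum(n, rate):
    """Sum the loop indices 0..n-1, stepping by `rate` (rate 1 is the intact loop)."""
    return sum(range(0, n, rate))


def error_ratio(standard, perforated):
    """Relative error, mirroring the error_ratio metric shown earlier."""
    return abs(standard - perforated) / standard


standard = perforated_sum(5, 1)
for rate in (1, 2, 3, 5):
    result = perforated_sum(5, rate)
    print(rate, result, error_ratio(standard, result))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Under those assumptions, the printed ratios are 0.0, 0.4, 0.7, and 1.0, which line up with the &lt;code&gt;error_ratio&lt;&#x2F;code&gt; entries in the JSON above.&lt;&#x2F;p&gt;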
&lt;p&gt;The final task of the driver is to combine what it learned about how much it can mangle each loop into one final answer.
That is, where we were previously perforating each loop in turn, we now want to perforate some subset of the loops that we determined don&#x27;t hurt the accuracy too much.
Here, we follow the lead of the original loop perforation paper in making a greedy assumption: that we can choose a final loop perforation strategy by joining the maximum perforation rate for each loop that was below the error tolerance.
This strategy makes the very optimistic assumption that errors do not interfere in destructive ways to cause crashes; in the published paper their system backtracks in the case that the combined executable is unacceptable.
For the purpose of this project, we simply ungracefully fail in that case.&lt;&#x2F;p&gt;
&lt;p&gt;In our &lt;code&gt;sum_to_n&lt;&#x2F;code&gt; toy example, this joined result sees that only a loop perforation rate of 2 produces an executable that is below our error threshold of 50%. We thus see the following final joined summary in the results:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;    &amp;quot;joined_{\{\&amp;quot;sum_to_n\&amp;quot;: {\&amp;quot;%2,%4,%6\&amp;quot;: 2}}}&amp;quot;: {
        &amp;quot;return_code&amp;quot;: 0,
        &amp;quot;time&amp;quot;: 0.020437002182006836,
        &amp;quot;errors&amp;quot;: {
            &amp;quot;error_ratio&amp;quot;: 0.4
        }
    }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
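&lt;p&gt;The greedy join can be sketched in a few lines of Python. This is an illustration rather than our driver&#x27;s actual code: the function name &lt;code&gt;choose_rates&lt;&#x2F;code&gt;, the in-memory layout of &lt;code&gt;results&lt;&#x2F;code&gt;, and the choice to check a single named metric are all assumptions made for the example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def choose_rates(results, tolerance, metric="error_ratio"):
    """Pick, for each loop, the largest rate whose run succeeded under the tolerance."""
    chosen = {}
    for loop, runs_by_rate in results.items():
        acceptable = [
            rate
            for rate, run in runs_by_rate.items()
            if run["return_code"] == 0 and tolerance > run["errors"][metric]
        ]
        # Rate 1 leaves the loop intact, so it is the fallback when nothing qualifies.
        chosen[loop] = max(acceptable, default=1)
    return chosen


# Hypothetical in-memory form of the per-rate results for the single sum_to_n loop.
results = {
    "%2,%4,%6": {
        1: {"return_code": 0, "errors": {"error_ratio": 0.0}},
        2: {"return_code": 0, "errors": {"error_ratio": 0.4}},
        3: {"return_code": 0, "errors": {"error_ratio": 0.7}},
        5: {"return_code": 0, "errors": {"error_ratio": 1.0}},
    }
}
print(choose_rates(results, tolerance=0.5))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With a tolerance of 0.5 this sketch selects rate 2 for the only loop. Note that it has no backtracking: nothing here re-checks the combined executable.&lt;&#x2F;p&gt;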
&lt;p&gt;To summarize, our pass took our little toy example of summing the integers up to &lt;code&gt;n&lt;&#x2F;code&gt; and made it much, much stupider.
We defined a simple error metric that the resulting sum must be within 50% of the real sum, which led the driver to choose a loop perforation rate of two.
The ultimate joined perforation pass thus changed the loop increment from &lt;code&gt;+= 1&lt;&#x2F;code&gt; to &lt;code&gt;+= 2&lt;&#x2F;code&gt;.
This means the sum skips the loop iterations for &lt;code&gt;i = 1&lt;&#x2F;code&gt; and &lt;code&gt;i = 3&lt;&#x2F;code&gt;, resulting in a total sum &lt;code&gt;0 + 2 + 4 = 6&lt;&#x2F;code&gt;, which is 40% off from the correct answer of 10.
Close enough!!!!!!1! ¯\_(ツ)_&#x2F;¯&lt;&#x2F;p&gt;
&lt;p&gt;Now, an enterprising reader might have picked up on the fact that while we claim that loop perforation will make your code run &lt;em&gt;faster&lt;&#x2F;em&gt;, we actually executed your entire executable &lt;em&gt;way more times&lt;&#x2F;em&gt; in order to find perforation rates that won&#x27;t crash or completely destroy your output accuracy.
That reader would be right!
However, the promise of loop perforation (along with many other optimizations that rely on dynamic analysis) is that we can run the expensive analysis on a small, representative input, and have the performance improvements scale to much larger examples.
For this silly toy, imagine we wanted to sum a list of millions of numbers—we would not need to rerun analysis, but could simply use the same executable.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Perforated programs could be non-deterministic even if the original program was not. For instance, the perforated &lt;code&gt;alloc-loop&lt;&#x2F;code&gt; reads allocated but uninitialized values, which are vestigial, meaningless bytes from a bygone age.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;h3 id=&quot;design-decisions&quot;&gt;Design Decisions&lt;&#x2F;h3&gt;
&lt;p&gt;We made the following design decisions in our implementation:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Our driver is less clever about &amp;quot;criticality testing&amp;quot; (that is, determining which loops are safe to perforate) than the original paper.
In particular, in addition to not implementing backtracking when combining loop perforation rates, we do not use Valgrind or anything similar to detect memory errors in perforated runs, and instead rely only on process return code and the user-defined accuracy metrics.&lt;&#x2F;li&gt;
&lt;li&gt;In some implementations of loop perforation, rather than modifying the induction variable directly, the pass adds an additional counter to each loop.
For example, this is the approach taken in ACCEPT&#x27;s loop perforation pass.
This allows the compiler to do more clever variants of loop perforation, such as copying the value from a previous iteration instead of skipping an iteration altogether.
We decided to modify the induction variable directly to be able to spend more effort on the driver and evaluation.&lt;&#x2F;li&gt;
&lt;li&gt;Unlike the original paper, we allow users to define any number of error metrics instead of just one.
This allows users to conduct a richer exploration of the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pareto_efficiency#Pareto_frontier&quot;&gt;Pareto frontier&lt;&#x2F;a&gt; for their given application.
We discuss this further in the Error Metrics section.&lt;&#x2F;li&gt;
&lt;li&gt;Our passes write and read JSON files rather than keeping all data in memory.
We made this decision to make it easier to assess progress and debug as we used the driver and passes.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;error-metrics&quot;&gt;Error Metrics&lt;&#x2F;h3&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2025133&quot;&gt;original loop perforation paper&lt;&#x2F;a&gt; uses the following accuracy metric:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \text{acc} = \frac{1}{m} \sum_{i=1}^m w_i \left|\frac{o_i - \hat o_i}{o_i}\right| $$&lt;&#x2F;p&gt;
&lt;p&gt;That is to say, it comes with a fixed division of the output into pre-selected &amp;quot;components&amp;quot; $o_i$.
Though these components are sold as a modular feature of the approach, this exact equation is extremely restrictive.
For instance, this means that matrix and vector accuracy calculations &lt;em&gt;must be&lt;&#x2F;em&gt; weighted sums of their dimensions.
Moreover, overwhelmingly there is no good choice for one component to be weighted over another: the representation is forced by the restriction to real valued outputs of programs, and so anything encoded across multiple components cannot be re-weighted.
More generally, if a programmer has an error metric in mind that requires a set of components to all be operating well (arguably very important for measuring functionality), they cannot easily encode this in this methodology.&lt;&#x2F;p&gt;
&lt;p&gt;One consequence of this restriction is that for the common distance metrics used for matrices, images, etc., only a &amp;quot;normalized&amp;quot; $\ell_1$ distance can be encoded.
It also assumes that zero is a privileged value for all computation metrics: relative error is given as distance away from zero.
Relative error approaches infinity, independent of the tolerance of the system to errors, as the correct output $o_i$ goes to zero.
For many applications, this might be far off from what an actual end-programmer would specify as their intended acceptable error.&lt;&#x2F;p&gt;
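&lt;p&gt;To make the blow-up near zero concrete, here is a small Python sketch of the metric above. The function name and the sample values are hypothetical; only the formula comes from the paper.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def weighted_relative_error(expected, observed, weights):
    """Mean of the weighted relative errors over the output components."""
    m = len(expected)
    total = sum(
        w * abs((o - o_hat) / o)
        for o, o_hat, w in zip(expected, observed, weights)
    )
    return total / m


# The same absolute error of 0.1, measured against two different correct values.
print(weighted_relative_error([10.0], [10.1], [1.0]))
print(weighted_relative_error([0.001], [0.101], [1.0]))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The first call reports an error of roughly 0.01 and the second roughly 100, even though both outputs are off by the same absolute amount.&lt;&#x2F;p&gt;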
&lt;p&gt;In an attempt to mitigate these problems, and investigate the dependence on the particular choice of error metric, we take a very different approach: we collect statistics, and compute &lt;em&gt;many&lt;&#x2F;em&gt; different error metrics.
For instance, we implement other common metrics for matrix errors that include: distance based on $\ell_2$, $\ell_1$, or Frobenius operator norms.&lt;&#x2F;p&gt;
&lt;p&gt;Each of these conceptually gives us a distance between matrices, but which distances are good?
The answer depends largely on the variance of the data.&lt;&#x2F;p&gt;
&lt;p&gt;We still want a &amp;quot;normalized error&amp;quot; in order to run the loop-perforation algorithm, as we need to set some threshold for what it means to be &amp;quot;good enough&amp;quot; to survive the culling pass.
Rather than using the absolute scale of the correct answer as our measuring stick, we effectively provide the variance as an input.
Assuming that the final distance measure is normally distributed (which is more and more appropriate the more sums and multiplications of random variables are involved) with mean zero and variance $1&#x2F;2$, the probability that the observed variable lies within a distance $x$ of the correct one is exactly the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Error_function&quot;&gt;&lt;em&gt;error function&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \text{erf}(x) := \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \mathrm{d}t $$&lt;&#x2F;p&gt;
&lt;p&gt;This can be scaled to a general variance $\sigma^2&#x2F;2$ by dividing the distance by $\sigma$, which gives $\text{erf}(d&#x2F;\sigma)$ for a distance $d$.
Effectively, this gives us a way of turning scalar distances into variance-parameterized error metrics.
For these reasons, we enter a total accuracy score for each error metric that we collect, as well as for each of many variances.
We then use all of these metrics to compute the Pareto frontier.&lt;&#x2F;p&gt;
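&lt;p&gt;As a rough illustration of how a scalar distance becomes a bounded score, the sketch below applies Python&#x27;s standard &lt;code&gt;math.erf&lt;&#x2F;code&gt; to a distance divided by a scale parameter. The name &lt;code&gt;normalized_error&lt;&#x2F;code&gt; and the convention that &lt;code&gt;sigma&lt;&#x2F;code&gt; divides the distance are assumptions for this example; it does not show how our driver derives metric names such as &lt;code&gt;l2_100000&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import math


def normalized_error(distance, sigma):
    """Map a non-negative distance into the range 0 to 1 with the error function."""
    return math.erf(distance / sigma)


# A larger sigma makes the same distance count as a smaller error.
for sigma in (1.0, 10.0, 100.0):
    scores = [round(normalized_error(d, sigma), 3) for d in (0.0, 1.0, 10.0, 100.0)]
    print(sigma, scores)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;A distance of zero always maps to zero error, and the score saturates toward one as the distance grows relative to the scale, which is what lets a single threshold be applied across metrics.&lt;&#x2F;p&gt;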
&lt;p&gt;These error metrics are more finicky than they might appear for real programs.
Consider a Sobel image filter for edge detection.
We use an implementation of this filter from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;uwsampa&#x2F;accept-apps&#x2F;blob&#x2F;master&#x2F;sobel&#x2F;sobel.c&quot;&gt;ACCEPT benchmarks&lt;&#x2F;a&gt; for approximate computing.&lt;&#x2F;p&gt;
&lt;p&gt;Running the standard implementation on this image of cute buffalo:&lt;&#x2F;p&gt;
&lt;img src=&quot;sobel-input.png&quot; width=500&#x2F;&gt;
&lt;p&gt;Gives the following correct output:&lt;&#x2F;p&gt;
&lt;img src=&quot;sobel-output.png&quot; width=500&#x2F;&gt;
&lt;p&gt;Running our driver with a default configuration perforates the loops in this application at the following rates:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    \&amp;quot;load_image_data\&amp;quot;: {\&amp;quot;%81,%84,%85,%97,%98,%88,%95\&amp;quot;: 5, \&amp;quot;%85,%88,%95\&amp;quot;: 5},
    \&amp;quot;save_image_data\&amp;quot;: {\&amp;quot;%11,%14,%15,%28,%29,%18,%26\&amp;quot;: 1, \&amp;quot;%15,%18,%26\&amp;quot;: 1},
    \&amp;quot;sobel_filtering\&amp;quot;: {\&amp;quot;%74,%77,%78,%88,%89,%81,%86\&amp;quot;: 5, \&amp;quot;%78,%81,%86\&amp;quot;: 5}
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The driver chooses this configuration because one of the error metrics, &lt;code&gt;&amp;quot;l2_100000&amp;quot;&lt;&#x2F;code&gt;, only has an error of 0.185.
However, the resulting image looks... very bad.&lt;&#x2F;p&gt;
&lt;img src=&quot;sobel-perforated-default.png&quot; width=500&#x2F;&gt;
&lt;p&gt;With a little design space exploration, we can try and find other metrics that produce less psychedelic results.
The following result only considers the &lt;code&gt;l1_10000&lt;&#x2F;code&gt; metric and allows an error of 0.6.&lt;&#x2F;p&gt;
&lt;img src=&quot;sobel-perforated-l1-10000-0.6.png&quot; width=500&#x2F;&gt;
&lt;p&gt;This image is &lt;em&gt;kind of&lt;&#x2F;em&gt; on the right track, in that we can clearly see some of the edges in the original image (just shifted and stretched to be almost useless).
Remarkably, the above image was the closest to correct we could find with any loop perforation at all via a casual design space exploration.
However, this is clearly &lt;strong&gt;not&lt;&#x2F;strong&gt; an acceptable result for edge detection!&lt;&#x2F;p&gt;
&lt;p&gt;Ultimately, this example illustrates that finding the right error metric might be both difficult and counterintuitive—to get a better image that gained any speedup, we had to select a higher error rate, just for a different metric.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tests-and-benchmarks&quot;&gt;Tests and Benchmarks&lt;&#x2F;h3&gt;
&lt;p&gt;We implemented three small test programs:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;sum_to_n&lt;&#x2F;code&gt;: Sums all numbers between 1 and n.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;alloc-loop&lt;&#x2F;code&gt;: A program that performs arithmetic on an array of integer pointers.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;matrix_multiply&lt;&#x2F;code&gt;: Matrix multiplication of two random 100-by-100 matrices.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;We additionally ran loop perforation on three larger benchmark programs:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;sobel&lt;&#x2F;code&gt;: A Sobel filter from the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;uwsampa&#x2F;accept-apps&#x2F;blob&#x2F;master&#x2F;sobel&#x2F;sobel.c&quot;&gt;ACCEPT benchmarks&lt;&#x2F;a&gt; for approximate computing.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;img-blur&lt;&#x2F;code&gt;: A Gaussian blur that operates on the same inputs as &lt;code&gt;sobel&lt;&#x2F;code&gt;; we implemented this ourselves using Sobel as a model. They share PGM processing code.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;blackscholes&lt;&#x2F;code&gt;: From the &lt;a href=&quot;https:&#x2F;&#x2F;parsec.cs.princeton.edu&quot;&gt;PARSEC&lt;&#x2F;a&gt; benchmark suite. We chose this benchmark for ease and simplicity of compilation, as well as having a straightforward potential for error metrics for the output data.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;It is worth noting that the &lt;code&gt;blackscholes&lt;&#x2F;code&gt; benchmark is meant to be a stress test and is hence wrapped in one outer loop which duplicates the computation in order to get more robust estimates. Skipping iterations of the outer loop, which does not degrade accuracy, clearly undermines the intent of the benchmark, once again suggesting that loop perforation gains may be overfitted to the evaluation metrics they&#x27;re trained on.&lt;&#x2F;p&gt;
&lt;p&gt;We had hoped to run our pass on additional PARSEC benchmarks, but had trouble either 1) compiling them with LLVM on our machines, or 2) determining reasonable error metrics.&lt;&#x2F;p&gt;
&lt;p&gt;The following plot shows run times for original programs and the final joined perforated programs.
Perforation rates were allowed to be 2, 3, 5, 8, 13, and 21.
Each program was run ten times on a 2017 MacBook Pro (2.3 GHz Intel Core i5, 8 GB RAM), and error bars represent 95% confidence intervals.
The perforated versions of &lt;code&gt;img-blur&lt;&#x2F;code&gt;, &lt;code&gt;sobel&lt;&#x2F;code&gt;, and &lt;code&gt;matrix_multiply&lt;&#x2F;code&gt; are faster than their corresponding originals because they execute far fewer instructions.
The other three programs are so small their runtimes are likely dominated by overhead.&lt;&#x2F;p&gt;
&lt;img src=&quot;all-runtimes.png&quot; width=&quot;80%&quot;&#x2F;&gt;
&lt;p&gt;For every test and benchmark, we plotted the error and runtime of every perforated and non-perforated version.
Each resulting graph shows a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pareto_efficiency#Pareto_frontier&quot;&gt;Pareto-optimal&lt;&#x2F;a&gt; frontier trading off error and runtime, as in the &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2025133&quot;&gt;original paper&lt;&#x2F;a&gt;.
A point is not on the frontier if it is eclipsed by another point that has both lower error by every metric and lower runtime.
The frontier can be thought of as the projection of the convex hull of all runs with all error metrics onto the particular space chosen for visualization.
Each program was run ten times, and all runs are included in the graphs. Runtimes were calculated as above.
The following graph is for &lt;code&gt;matrix_multiply&lt;&#x2F;code&gt; with an $\ell_2$-error and an error function variance of 10,000:&lt;&#x2F;p&gt;
&lt;img src=&quot;matrix_multiply-frontier.png&quot; width=&quot;80%&quot;&#x2F;&gt;
&lt;p&gt;The ten runs of the original program (in magenta) are on the bottom with zero error.
The black points at the top are from runs that did not complete successfully and were assigned the maximum possible error.
In this case, perforating some loops reduced the size of matrices, making them incomparable to the correct program output.
The frontier shows that perforated programs gain some speed by rapidly incurring error.
The joined perforations aren&#x27;t always on the frontier—even for a simple program like this, multiple perforated loops can perform worse than their individually perforated components.&lt;&#x2F;p&gt;
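&lt;p&gt;The frontier rule described above (a point drops out when another point is lower in every error metric and in runtime) is short enough to sketch directly. The function names and the sample points below are hypothetical and are not measured data.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def eclipses(p, q):
    """True when p is strictly lower than q in every coordinate."""
    return all(b > a for a, b in zip(p, q))


def pareto_frontier(points):
    """Keep the points that no other point eclipses."""
    return [q for q in points if not any(eclipses(p, q) for p in points)]


# Hypothetical (error, runtime) pairs; more error metrics would add more coordinates.
runs = [(0.0, 1.0), (0.4, 0.5), (0.7, 0.6), (1.0, 0.3)]
print(pareto_frontier(runs))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Here the run at (0.7, 0.6) is dropped because the run at (0.4, 0.5) is both more accurate and faster, while the other three runs each win on at least one axis.&lt;&#x2F;p&gt;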
</description>
            </item>
        
            <item>
                <title>Register Allocation for Bril</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/multiplication-strength-reduction/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/multiplication-strength-reduction/</guid>
                <description>&lt;h1 id=&quot;strength-reduction-for-multiplication&quot;&gt;Strength Reduction for Multiplication&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;&#x2F;h2&gt;
&lt;p&gt;Strength reduction replaces expensive operations with cheaper but equivalent ones in order to obtain a faster program. For this project, we focus on the weak form of strength reduction. Specifically, we focus on how to replace multiplications by constants with cheaper operations.&lt;&#x2F;p&gt;
&lt;p&gt;Most modern processors have different latencies and throughputs for different kinds of instructions. It is sometimes possible to find instructions that are mathematically equivalent but faster in practice. For example, on most processors, multiplication may run slower than a bitwise shift. Therefore, it is possible to replace a multiplication &lt;code&gt;x * 2&lt;&#x2F;code&gt; with a bitwise left shift operation &lt;code&gt;x &amp;lt;&amp;lt; 1&lt;&#x2F;code&gt; for better performance.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;alternatives-for-multiplication-with-constants&quot;&gt;Alternatives for Multiplication with Constants&lt;&#x2F;h2&gt;
&lt;p&gt;Strength reduction for multiplication by a constant that is a power of $2$ is obvious. However, even when the constant is not a power of $2$, reducing multiplications to bitwise shifts is still possible.&lt;&#x2F;p&gt;
&lt;p&gt;Since any constant can be represented as a sum of powers of $2$, we can use a sum of bitwise shifts to replace a multiplication. For example, &lt;code&gt;x * 7&lt;&#x2F;code&gt; can be represented as &lt;code&gt;(x &amp;lt;&amp;lt; 2) + (x &amp;lt;&amp;lt; 1) + x&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;$7$ can also be represented as $8-1$, a power of $2$ minus a constant. In fact, $7$ is &amp;quot;closer&amp;quot; to the next power of $2$. If we reduce &lt;code&gt;x * 7&lt;&#x2F;code&gt; to &lt;code&gt;x * (8 - 1) = x * 8 - x * 1 = (x &amp;lt;&amp;lt; 3) - x&lt;&#x2F;code&gt;, it requires fewer bitwise shifts and add&#x2F;subtract operations.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, we have three choices for multiplying a constant:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;multiply the constant directly&lt;&#x2F;li&gt;
&lt;li&gt;binary decompose the constant, and sum up the results of bitwise shifts&lt;&#x2F;li&gt;
&lt;li&gt;represent the constant as $2^k-c$, left shift $x$ by $k$ bits, binary decompose $c$, and then subtract the resulting bitwise shifts&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;In order to determine which method to use for a given multiplication, we need a cost function to compare the costs of the resulting instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Depending on the architecture, we can assign a cost to each of:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;bitwise shift operation&lt;&#x2F;li&gt;
&lt;li&gt;add&#x2F;subtract operation&lt;&#x2F;li&gt;
&lt;li&gt;multiplication&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The function calculates the total costs of these three approaches and determines which one has the lowest cost.&lt;&#x2F;p&gt;
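&lt;p&gt;One way such a cost function could look is sketched below in Python. This is an illustration under a simple assumed cost model, not the code of our pass: every shift and every add or subtract is charged once, a shift by zero is free, and all function names are hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def shifts_and_terms(c):
    """Count the shifts and the terms needed to build x*c from the set bits of c."""
    bits = bin(c)[2:]
    terms = bits.count("1")
    # Bit 0 contributes x itself, which needs no shift.
    shifts = terms - int(bits[-1])
    return shifts, terms


def binary_cost(c, shift_cost, add_cost):
    """Cost of summing one shifted copy of x per set bit of c."""
    shifts, terms = shifts_and_terms(c)
    return shifts * shift_cost + max(terms - 1, 0) * add_cost


def subtract_cost(c, shift_cost, add_cost):
    """Cost of x shifted by k, minus the binary decomposition of 2**k - c."""
    k = (c - 1).bit_length()  # smallest k with 2**k >= c
    shifts, terms = shifts_and_terms(2 ** k - c)
    top_shift = shift_cost if k else 0
    return top_shift + shifts * shift_cost + terms * add_cost


def cheapest(c, mul_cost, shift_cost=1, add_cost=1):
    """Return the name and cost of the cheapest strategy for a positive constant c."""
    options = {
        "multiply": mul_cost,
        "binary": binary_cost(c, shift_cost, add_cost),
        "subtract": subtract_cost(c, shift_cost, add_cost),
    }
    best = min(options, key=options.get)
    return best, options[best]


print(cheapest(7, mul_cost=4))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For the constant 7 with a multiplication cost of 4, this model picks the subtract form at a cost of 2 (one shift and one subtraction), against 4 for either the direct multiply or the plain binary decomposition.&lt;&#x2F;p&gt;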
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To estimate how much cost multiplication strength reduction can save, we ran a cost analysis that reports the average cost reduction of multiplying by each integer constant less than a given upper bound $n$.&lt;&#x2F;p&gt;
&lt;p&gt;In this analysis, we scaled the cost of add&#x2F;subtract operations and bitwise operations to $1$ unit. According to &lt;a href=&quot;https:&#x2F;&#x2F;www.agner.org&#x2F;optimize&#x2F;instruction_tables.pdf&quot;&gt;Agner Fog&#x27;s benchmarks on different AMD processors&lt;&#x2F;a&gt;, multiplication operations could cost $2$ to $16$ times more clock cycles than add&#x2F;subtract operations. In the following table, we show the expected proportion of clock cycles saved by multiplication strength reduction for different costs of the multiplication operation and different ranges of the constant factor.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;n&amp;lt;=128&lt;&#x2F;th&gt;&lt;th&gt;n&amp;lt;=1024&lt;&#x2F;th&gt;&lt;th&gt;n&amp;lt;=8192&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=2&lt;&#x2F;td&gt;&lt;td&gt;0.034884&lt;&#x2F;td&gt;&lt;td&gt;0.005854&lt;&#x2F;td&gt;&lt;td&gt;0.000915&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=4&lt;&#x2F;td&gt;&lt;td&gt;0.143411&lt;&#x2F;td&gt;&lt;td&gt;0.032683&lt;&#x2F;td&gt;&lt;td&gt;0.006469&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=8&lt;&#x2F;td&gt;&lt;td&gt;0.478682&lt;&#x2F;td&gt;&lt;td&gt;0.207073&lt;&#x2F;td&gt;&lt;td&gt;0.065605&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=16&lt;&#x2F;td&gt;&lt;td&gt;0.739341&lt;&#x2F;td&gt;&lt;td&gt;0.588293&lt;&#x2F;td&gt;&lt;td&gt;0.430108&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;From the table, we can see that the larger the performance gap between &lt;code&gt;ADD&#x2F;SUB&lt;&#x2F;code&gt; and &lt;code&gt;MUL&lt;&#x2F;code&gt; is, and the smaller the constant is, the more clock cycles we can save with multiplication strength reduction.&lt;&#x2F;p&gt;
&lt;p&gt;In order to show how much cost it can reduce on a set of benchmark programs, PARSEC, we implemented an LLVM pass that inserts two instruction counters: one counts the cost of the original instructions and the other counts the cost of the optimized instructions. For each multiplication instruction, we instrument instructions to update the costs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;parsec&quot;&gt;PARSEC&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;parsec.cs.princeton.edu&#x2F;index.htm&quot;&gt;PARSEC&lt;&#x2F;a&gt; is the Princeton Application Repository for Shared-Memory Computers, a benchmark suite consisted of 13 real-world applications. It is widely used in many literatures. However, it only supports gcc and icc and  porting to Clang is not trivial. Furthermore, it seems to have &lt;a href=&quot;https:&#x2F;&#x2F;yulistic.gitlab.io&#x2F;2016&#x2F;05&#x2F;parsec-3.0-installation-issues&#x2F;&quot;&gt;problems&lt;&#x2F;a&gt; on its own, and it is also not well-maintained anymore. For example, on the official website, it provides two &lt;a href=&quot;https:&#x2F;&#x2F;parsec.cs.princeton.edu&#x2F;help.htm#MailingLists&quot;&gt;mailing list&lt;&#x2F;a&gt;. The &lt;a href=&quot;https:&#x2F;&#x2F;lists.cs.princeton.edu&#x2F;mailman&#x2F;listinfo&#x2F;parsec-announce&quot;&gt;first one&lt;&#x2F;a&gt; does not exist anymore, and the questions in the &lt;a href=&quot;https:&#x2F;&#x2F;lists.cs.princeton.edu&#x2F;mailman&#x2F;listinfo&#x2F;parsec-users&quot;&gt;second one&lt;&#x2F;a&gt; are also unmoderated by origin teams. As a result, we use an unofficial &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cirosantilli&#x2F;parsec-benchmark&quot;&gt;repository&lt;&#x2F;a&gt;, which several problems have been fixed, as a starting point. We manage to port several programs in PARSEC to use Clang by manually fixing &lt;strong&gt;a lot of compile errors&lt;&#x2F;strong&gt; arising during Clang compilation.&lt;&#x2F;p&gt;
&lt;p&gt;The following table shows our evaluation results. For each pair &lt;code&gt;(a,b)&lt;&#x2F;code&gt;, a represents the cost for unoptimized multiplication cost and &lt;code&gt;b&lt;&#x2F;code&gt; is the multiplication cost after strength reduction. We can see that for application with a lot of integer multiplication, the difference is significant.
Our LLVM pass code can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;xu3kev&#x2F;llvm-pass-skeleton&quot;&gt;here&lt;&#x2F;a&gt; at &lt;code&gt;noauto&lt;&#x2F;code&gt; branch.&lt;&#x2F;p&gt;
&lt;p&gt;The cost is calculated by inserting counter-increment instructions at the places where multiplications occur. Our custom LLVM pass first inserts code at the beginning of the application to initialize a global counter and then instruments the multiplication code. As a result, we can count the number of multiplications dynamically. Finally, we also insert a &lt;code&gt;printf&lt;&#x2F;code&gt; call at the end of the main function to report the value stored in the counter. (Interestingly, modifying our C compiler to insert a &lt;code&gt;printf&lt;&#x2F;code&gt; that prints something to stdout on every run also breaks the build of one of the PARSEC applications, because that application uses a C program to print out code and then compiles that code. Thus, we need to be careful.)&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;blackscholes&lt;&#x2F;th&gt;&lt;th&gt;bodytrack&lt;&#x2F;th&gt;&lt;th&gt;facesim&lt;&#x2F;th&gt;&lt;th&gt;ferret&lt;&#x2F;th&gt;&lt;th&gt;fluidanimate&lt;&#x2F;th&gt;&lt;th&gt;streamcluster&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=2&lt;&#x2F;td&gt;&lt;td&gt;(14,10)&lt;&#x2F;td&gt;&lt;td&gt;(8046 , 4003)&lt;&#x2F;td&gt;&lt;td&gt;(389600456 , 195013270)&lt;&#x2F;td&gt;&lt;td&gt;(512 , 256)&lt;&#x2F;td&gt;&lt;td&gt;(12 , 8)&lt;&#x2F;td&gt;&lt;td&gt;(14400 , 5419)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=4&lt;&#x2F;td&gt;&lt;td&gt;(28,12)&lt;&#x2F;td&gt;&lt;td&gt;(16092, 4005)&lt;&#x2F;td&gt;&lt;td&gt;(779200912 , 195013270)&lt;&#x2F;td&gt;&lt;td&gt;(1024 , 256)&lt;&#x2F;td&gt;&lt;td&gt;(24 , 10)&lt;&#x2F;td&gt;&lt;td&gt;(28800 , 5419)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=8&lt;&#x2F;td&gt;&lt;td&gt;(56,12)&lt;&#x2F;td&gt;&lt;td&gt;(32184, 4006)&lt;&#x2F;td&gt;&lt;td&gt;(1558401824 , 195013270)&lt;&#x2F;td&gt;&lt;td&gt;(2048 , 256)&lt;&#x2F;td&gt;&lt;td&gt;(48 , 10)&lt;&#x2F;td&gt;&lt;td&gt;(57600 , 5419)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;cost(MUL)=16&lt;&#x2F;td&gt;&lt;td&gt;(112,12)&lt;&#x2F;td&gt;&lt;td&gt;(64368, 4006)&lt;&#x2F;td&gt;&lt;td&gt;(3116803648 , 195013270)&lt;&#x2F;td&gt;&lt;td&gt;(4096 , 256)&lt;&#x2F;td&gt;&lt;td&gt;(96 , 10)&lt;&#x2F;td&gt;&lt;td&gt;(115200 , 5419)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
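&lt;p&gt;As a reading aid for the table, here is a small Python sketch. It assumes that the first component of each pair is simply the dynamic multiplication count times &lt;code&gt;cost(MUL)&lt;&#x2F;code&gt;, which is what the numbers suggest. The function names are ours and are not part of the pass; the data is copied from the facesim column.&lt;&#x2F;p&gt;

```python
# Rows copied from the facesim column: cost(MUL) maps to (unoptimized, reduced).
facesim = {
    2: (389600456, 195013270),
    4: (779200912, 195013270),
    8: (1558401824, 195013270),
    16: (3116803648, 195013270),
}


def dynamic_mul_count(mul_cost, unoptimized_cost):
    """Recover the number of executed multiplications (assumes cost = count * cost(MUL))."""
    return unoptimized_cost // mul_cost


def reduction_factor(unoptimized_cost, reduced_cost):
    """How many times cheaper the strength-reduced program is under this cost model."""
    return unoptimized_cost / reduced_cost


for mul_cost, (before, after) in facesim.items():
    count = dynamic_mul_count(mul_cost, before)
    print(mul_cost, count, round(reduction_factor(before, after), 2))
```

&lt;p&gt;Under that assumption, every row recovers the same multiplication count, and the reduction factor grows with the assumed cost of a multiplication because the reduced cost stays constant.&lt;&#x2F;p&gt;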
&lt;h2 id=&quot;our-thoughts&quot;&gt;Our Thoughts&lt;&#x2F;h2&gt;
&lt;p&gt;On most modern processors, the performance difference between add&#x2F;subtract&#x2F;shift operations and multiplication operations is not that large. In a non-scientific computation workload, the benefit of strength reduction for multiplication is negligible.&lt;&#x2F;p&gt;
&lt;p&gt;We also found that strength reduction could be very useful in hardware description languages, such as Verilog for FPGA design. Since a multiplication circuit has a longer path than add&#x2F;subtract&#x2F;shift circuits, strength reduction may be used to shorten the critical path and reduce the total number of gates in a circuit.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Dynamic Null Pointer Checks Using LLVM</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/null-pointer-guards/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/null-pointer-guards/</guid>
                <description>&lt;h2 id=&quot;preface&quot;&gt;Preface&lt;&#x2F;h2&gt;
&lt;p&gt;This blog post is meant for those with limited knowledge of how to use LLVM
to write non-trivial compiler passes. Hopefully this can be a useful resource
to those new to LLVM since, in my experience, information on how to use LLVM
is relatively sparse. I am using LLVM 9. All of the code can be found at
https:&#x2F;&#x2F;github.com&#x2F;chrisroman&#x2F;llvm-pass-skeleton.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of this project is to add null pointer checks before pointer dereferences
to avoid segfaults. However, it would be wasteful to add a null pointer check to
a pointer dereference when we know for sure the pointer is non-null. For example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int *p = new int(4120);
&#x2F;&#x2F; No null check necessary here
*p = 6120;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Thus, we will do a null pointer analysis that determines, at each instruction,
which pointers may be null. Using this, we can limit the number of extraneous
checks.&lt;&#x2F;p&gt;
&lt;p&gt;To accomplish this, we will create a new compiler pass using LLVM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;Adding the null pointer checks themselves was pretty straightforward. To keep
things simple, I decided to just print to tell the user that there was an attempt
to dereference a null pointer and then &lt;code&gt;exit(1)&lt;&#x2F;code&gt;. Fancier implementations may be
able to print useful information so the user doesn&#x27;t have to &lt;code&gt;gdb&lt;&#x2F;code&gt; everything to
see what went wrong.&lt;&#x2F;p&gt;
&lt;p&gt;For the null pointer analysis (NPA henceforth), we aim for soundness rather than
completeness. That is, we mark a pointer as &lt;em&gt;DefinitelyNonNull&lt;&#x2F;em&gt; only if we are
guaranteed that the pointer is not null. This way, we are guaranteed that a null
check is removed only if it is really safe to do so.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;adding-the-null-checks&quot;&gt;Adding the null checks&lt;&#x2F;h3&gt;
&lt;p&gt;For null pointer checks, we first created a compiler pass called &lt;code&gt;AddNullCheckFuncPass&lt;&#x2F;code&gt;
which adds a new function to a module, which is equivalent to:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;nullcheck&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;nullptr&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;Found a null pointer. Exiting...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;exit&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then we created a compiler pass called &lt;code&gt;AddNullCheckPass&lt;&#x2F;code&gt;. This is a &lt;code&gt;FunctionPass&lt;&#x2F;code&gt;
which looks for a &lt;code&gt;LoadInst&lt;&#x2F;code&gt; or &lt;code&gt;StoreInst&lt;&#x2F;code&gt;, as these are the instructions which
actually dereference a pointer. Let&#x27;s look at how a dereference gets translated to
LLVM IR:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= ...
*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6120
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;is represented in the IR as:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloca i32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, align &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
%&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load i32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;** %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p, align &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
store i32 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6120&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, i32&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* %&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, align &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that we&#x27;re storing &lt;code&gt;6120&lt;&#x2F;code&gt; into &lt;code&gt;%1&lt;&#x2F;code&gt;, so &lt;code&gt;%1&lt;&#x2F;code&gt; is what we need to
perform the null check on. To do so, we can simply call the &lt;code&gt;nullcheck&lt;&#x2F;code&gt; function
that was created in the previous &lt;code&gt;AddNullCheckFuncPass&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
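&lt;p&gt;The shape of this instrumentation can be sketched in Python over a toy instruction list. This is only an illustration: the tuple encoding and the &lt;code&gt;add_null_checks&lt;&#x2F;code&gt; helper are hypothetical and do not use the LLVM API.&lt;&#x2F;p&gt;

```python
def add_null_checks(instructions):
    """Return a copy of the toy program with a nullcheck call before each load or store.

    Toy encoding (an assumption of this sketch, not LLVM's):
      ("load", dest, ptr)    reads through ptr
      ("store", value, ptr)  writes through ptr
    """
    out = []
    for inst in instructions:
        opcode = inst[0]
        if opcode in ("load", "store"):
            # In both encodings the dereferenced pointer is the last operand.
            out.append(("call", "nullcheck", inst[2]))
        out.append(inst)
    return out


program = [
    ("alloca", "%p"),
    ("load", "%1", "%p"),
    ("store", "6120", "%1"),
]

for inst in add_null_checks(program):
    print(inst)
```

&lt;p&gt;Every load and store gets a check here, including the load from the stack slot &lt;code&gt;%p&lt;&#x2F;code&gt;; deciding which of these checks can be dropped is the job of the analysis described next.&lt;&#x2F;p&gt;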
&lt;h3 id=&quot;implementing-the-null-pointer-analysis&quot;&gt;Implementing the Null Pointer Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;Now, we want to elide null checks provided we can determine that the pointer
is non-null at the time of dereferencing. Initially, I figured we could use
the existing alias analyses and check if the pointers alias &lt;code&gt;nullptr&lt;&#x2F;code&gt;. However,
when I tried this, it would always show as being &lt;code&gt;NoAlias&lt;&#x2F;code&gt;. There is the
&lt;code&gt;AAResults::pointsToConstantMemory&lt;&#x2F;code&gt; function, but according to the
&lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;AliasAnalysis.html#the-pointstoconstantmemory-method&quot;&gt;documentation&lt;&#x2F;a&gt;,&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The pointsToConstantMemory method returns true if and only if the analysis can prove that
the pointer only points to unchanging memory locations (functions, constant global
variables, and the null pointer).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This is an issue for two reasons. First, we can&#x27;t differentiate between pointing
to functions, global variables, or the null pointer. Secondly, this only returns
true when the pointer &lt;em&gt;only&lt;&#x2F;em&gt; points to one of these constant locations. This means
that if a pointer may or may not point to the null pointer, the function would
return false. This would be too restrictive for our use case, as we are really
looking for pointers that are &lt;em&gt;definitely not null&lt;&#x2F;em&gt; rather than those that are
&lt;em&gt;definitely null&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, to perform this null pointer analysis, we do a dataflow analysis
that is similar to CCP (Conditional Constant Propagation). Our lattice elements
will be a mapping from pointers in the program to &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt; or &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt;.
(We assume &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt; and &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt; also form a lattice with the former as &lt;em&gt;Top&lt;&#x2F;em&gt;
and the latter as &lt;em&gt;Bottom&lt;&#x2F;em&gt;.)
Our &lt;em&gt;Top&lt;&#x2F;em&gt; value will be a map where everything maps to &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt;, and &lt;em&gt;Bottom&lt;&#x2F;em&gt;
is a map where everything maps to &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt;. The analysis is a forward
analysis. The &lt;em&gt;Meet&lt;&#x2F;em&gt; operator is simply an elementwise &lt;em&gt;meet&lt;&#x2F;em&gt; on elements in the map.
For example, if we have the two lattice elements &lt;code&gt;X = {p: DefinitelyNonNull, q: PossiblyNull}&lt;&#x2F;code&gt;
and &lt;code&gt;Y = {p: DefinitelyNonNull, q: DefinitelyNonNull}&lt;&#x2F;code&gt;, then &lt;code&gt;meet(X, Y) = {p: DefinitelyNonNull, q: PossiblyNull}&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
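&lt;p&gt;A minimal Python sketch of this elementwise merge, reproducing the example above. With &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt; placed at the top, the merged value is the upper of the two inputs, so a pointer stays &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt; only if both inputs agree. Treating a pointer that is missing from one map as &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt; is an assumption of this sketch, not something taken from the actual pass.&lt;&#x2F;p&gt;

```python
POSSIBLY_NULL = "PossiblyNull"
DEFINITELY_NON_NULL = "DefinitelyNonNull"


def merge_value(a, b):
    """Merge two abstract values: non-null only if both inputs are non-null."""
    if a == DEFINITELY_NON_NULL and b == DEFINITELY_NON_NULL:
        return DEFINITELY_NON_NULL
    return POSSIBLY_NULL


def meet(x, y):
    """Elementwise merge of two pointer-to-state maps.

    A pointer missing from one map is treated as PossiblyNull (sketch assumption).
    """
    keys = set(x) | set(y)
    return {
        k: merge_value(x.get(k, POSSIBLY_NULL), y.get(k, POSSIBLY_NULL))
        for k in keys
    }


X = {"p": DEFINITELY_NON_NULL, "q": POSSIBLY_NULL}
Y = {"p": DEFINITELY_NON_NULL, "q": DEFINITELY_NON_NULL}
print(meet(X, Y))
```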
&lt;p&gt;Our transfer function is tricky to get right. Consider:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;store i32* %p, i32** %q, align 8
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s call the memory that &lt;code&gt;%q&lt;&#x2F;code&gt; points to &lt;code&gt;%deref_q&lt;&#x2F;code&gt;. Naively, we would just
say that if &lt;code&gt;%p&lt;&#x2F;code&gt; was &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt;, then &lt;code&gt;%deref_q&lt;&#x2F;code&gt; would also be &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt;.
While this is necessary, it is not sufficient. Consider the following program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6120&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int **&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;nullptr&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Observe that the line &lt;code&gt;*a = 100&lt;&#x2F;code&gt; is a null pointer dereference, because
&lt;code&gt;p&lt;&#x2F;code&gt; aliases &lt;code&gt;&amp;amp;a&lt;&#x2F;code&gt;. That is, &lt;code&gt;p&lt;&#x2F;code&gt; and &lt;code&gt;&amp;amp;a&lt;&#x2F;code&gt;
point to the same memory location. Thus, when we do &lt;code&gt;*p = nullptr&lt;&#x2F;code&gt;, we are also setting
&lt;code&gt;a&lt;&#x2F;code&gt; to &lt;code&gt;nullptr&lt;&#x2F;code&gt;. This means that whenever we store a value &lt;code&gt;%p&lt;&#x2F;code&gt; to &lt;code&gt;%deref_q&lt;&#x2F;code&gt;
(the location pointed to by &lt;code&gt;%q&lt;&#x2F;code&gt;), we must also update everything that aliases &lt;code&gt;%deref_q&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;One strange thing about this implementation is how to represent &lt;code&gt;%deref_p&lt;&#x2F;code&gt;.
Given &lt;code&gt;store i32* null, i32** %p, align 8&lt;&#x2F;code&gt;, we need to make sure that subsequent
loads from &lt;code&gt;%p&lt;&#x2F;code&gt; are &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt;. So, we keep a map &lt;code&gt;deref_map&lt;&#x2F;code&gt; from pointers like &lt;code&gt;%p&lt;&#x2F;code&gt; to
an arbitrary new pointer. Therefore, when we see &lt;code&gt;%val = load i32*, i32** %p, align 8&lt;&#x2F;code&gt;,
we set &lt;code&gt;lattice[val] = lattice[deref_map[p]]&lt;&#x2F;code&gt;, where &lt;code&gt;lattice[x]&lt;&#x2F;code&gt; tells us if
&lt;code&gt;x&lt;&#x2F;code&gt; is &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt; or &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
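&lt;p&gt;The store and load rules can be sketched in Python as follows. This is not the code from the repository: the &lt;code&gt;may_alias&lt;&#x2F;code&gt; table stands in for the result of an alias analysis, the function names are hypothetical, and updating aliased locations conservatively (they stay non-null only if both the old and the stored value are non-null) is a choice made for this sketch.&lt;&#x2F;p&gt;

```python
POSSIBLY_NULL = "PossiblyNull"
DEFINITELY_NON_NULL = "DefinitelyNonNull"


def transfer_store(lattice, deref_map, may_alias, value, ptr):
    """Model `store value, ptr` on a copy of the lattice."""
    # Unknown values, including the constant null, count as PossiblyNull.
    stored = lattice.get(value, POSSIBLY_NULL)
    out = dict(lattice)
    # The location behind ptr itself is overwritten.
    out[deref_map[ptr]] = stored
    # Locations behind possible aliases are merged with their old state.
    for other in may_alias.get(ptr, set()):
        old = out.get(deref_map[other], POSSIBLY_NULL)
        both_non_null = old == DEFINITELY_NON_NULL and stored == DEFINITELY_NON_NULL
        out[deref_map[other]] = DEFINITELY_NON_NULL if both_non_null else POSSIBLY_NULL
    return out


def transfer_load(lattice, deref_map, dest, ptr):
    """Model `dest = load ptr`: lattice[dest] = lattice[deref_map[ptr]]."""
    out = dict(lattice)
    out[dest] = lattice.get(deref_map[ptr], POSSIBLY_NULL)
    return out


# The aliasing example from above: %0 holds the address of a, so it aliases %a.
deref_map = {"%a": "deref_a", "%0": "deref_0"}
may_alias = {"%a": {"%0"}, "%0": {"%a"}}
state = {"%call": DEFINITELY_NON_NULL}               # result of new int(6120)
state = transfer_store(state, deref_map, may_alias, "%call", "%a")   # a = new int
state = transfer_store(state, deref_map, may_alias, "null", "%0")    # *p = nullptr
state = transfer_load(state, deref_map, "%1", "%a")                  # read a
print(state["%1"])
```

&lt;p&gt;Because the store through &lt;code&gt;%0&lt;&#x2F;code&gt; also updates the location behind &lt;code&gt;%a&lt;&#x2F;code&gt;, the later load yields &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt;, so the check before &lt;code&gt;*a = 100&lt;&#x2F;code&gt; is kept.&lt;&#x2F;p&gt;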
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;For correctness, I wrote some tests by hand to account for some of the
programs shown above. For the most part, the correctness of adding the nullchecks
was easy to check. However, the correctness of the NPA was trickier to determine,
as shown by the programs that require an alias analysis.&lt;&#x2F;p&gt;
&lt;p&gt;We also want to see how much of an impact these null checks have on the performance
of programs. Unfortunately I wasn&#x27;t able to get the PARSEC benchmark tests running.
Instead, I found a collection of benchmarks &lt;a href=&quot;https:&#x2F;&#x2F;benchmarksgame-team.pages.debian.net&#x2F;benchmarksgame&#x2F;&quot;&gt;online&lt;&#x2F;a&gt;
that has programs for different languages. I chose a small portion of
these tests to run with and without my compiler passes.&lt;&#x2F;p&gt;
&lt;p&gt;I chose to run:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;benchmarksgame-team.pages.debian.net&#x2F;benchmarksgame&#x2F;program&#x2F;binarytrees-gpp-2.html&quot;&gt;binary-trees&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;Ran with: &lt;code&gt;.&#x2F;binary-trees 13&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;benchmarksgame-team.pages.debian.net&#x2F;benchmarksgame&#x2F;program&#x2F;fannkuchredux-gpp-5.html&quot;&gt;fannkuch-redux&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;Ran with: &lt;code&gt;.&#x2F;fannkuch-redux 10&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;benchmarksgame-team.pages.debian.net&#x2F;benchmarksgame&#x2F;program&#x2F;fasta-gpp-1.html&quot;&gt;fasta&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;Ran with: &lt;code&gt;.&#x2F;fasta 250000&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;benchmarksgame-team.pages.debian.net&#x2F;benchmarksgame&#x2F;program&#x2F;mandelbrot-gpp-5.html&quot;&gt;mandelbrot&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;Ran with: &lt;code&gt;.&#x2F;mandelbrot 2000&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;benchmarksgame-team.pages.debian.net&#x2F;benchmarksgame&#x2F;program&#x2F;nbody-gpp-1.html&quot;&gt;n-body&lt;&#x2F;a&gt;
&lt;ul&gt;
&lt;li&gt;Ran with: &lt;code&gt;.&#x2F;nbody 5000000&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here we simply show the mean time taken to run these programs and the standard
deviation. There are three columns, representing the three different ways we
compiled the program.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Baseline&lt;&#x2F;em&gt; is just compiling using Clang without our null check passes.
Compiled with &lt;code&gt;clang++ -O2 &amp;lt;benchmark&amp;gt;.cpp -o &amp;lt;benchmark&amp;gt;&lt;&#x2F;code&gt;, e.g.,
&lt;code&gt;clang++ -O2 binary-trees.cpp -o binary-trees&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;No NPA&lt;&#x2F;em&gt; is compiling using the null check passes, but without taking into
consideration which pointers are &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt; or &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt;. That is,
we do a null check on every pointer dereference. To get this to work, I had
to go into the source code and remove the check for whether pointers are &lt;code&gt;PossiblyNull&lt;&#x2F;code&gt;.
Compiled with &lt;code&gt;..&#x2F;tests&#x2F;compile-with-opt.sh nbody.cpp -O2&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;With NPA&lt;&#x2F;em&gt; is compiling using the null check passes &lt;em&gt;and&lt;&#x2F;em&gt; eliding null checks
if the pointer is &lt;code&gt;DefinitelyNonNull&lt;&#x2F;code&gt;.
Compiled with the same command as above.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;&lt;strong&gt;Runtime Means and Standard Deviations&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Baseline&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;No NPA&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;With NPA&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;binary-trees&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;208 ms ± 4.78 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;217 ms ± 12.8 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;213 ms ± 6.78 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;fannkuch-redux&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;222 ms ± 5.02 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;536 ms ± 6.34 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;563 ms ± 54.9 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;fasta&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;200 ms ± 6.01 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;365 ms ± 7.77 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;375 ms ± 28.8 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;mandelbrot&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;319 ms ± 5.79 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;294 ms ± 2.52 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;298 ms ± 9.76 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;n-body&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;327 ms ± 16.7 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;733 ms ± 12.4 ms&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;734 ms ± 17.8 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Here we can see that in the majority of cases, the baseline performs significantly
better than the programs with null checks. To me this is a little surprising
because I figured the branch predictor would always know that the null checks
don&#x27;t do anything except return from the null checking function.&lt;&#x2F;p&gt;
&lt;p&gt;It is also interesting to note that the NPA didn&#x27;t cause any improvement! This
may be because not that many null pointer checks were removed, perhaps because
the analysis is too conservative; this would require further investigation.
This is a little disappointing because I spent a lot of time implementing the
NPA to reduce the overhead of null checks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hardest-parts-to-get-right&quot;&gt;Hardest Parts to Get Right&lt;&#x2F;h2&gt;
&lt;p&gt;One of the hardest things about this project was the learning curve of LLVM. The
documentation is fairly good, but there&#x27;s just not much information overall on
how to do specific things. For example, I spent a lot of time just figuring out
how to make a call to &lt;code&gt;printf&lt;&#x2F;code&gt; in the IR. For some reason, it wouldn&#x27;t work
if &lt;code&gt;printf&lt;&#x2F;code&gt; wasn&#x27;t already used in the file being compiled.&lt;&#x2F;p&gt;
&lt;p&gt;The other hard part was doing the null pointer analysis. It was frustrating
to know that I couldn&#x27;t check if a pointer aliased nullptr. I wasted a lot of time
on an incorrect solution that looks as follows:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a global pointer that is
always &lt;code&gt;nullptr&lt;&#x2F;code&gt;, and replace all instances of &lt;code&gt;nullptr&lt;&#x2F;code&gt; with this global pointer.
Then we can use the alias analysis to check what aliases this pointer to see what
is potentially null at an instruction. However, by doing this, all writes to
any pointer now end up writing to the location pointed to by the global variable,
which is incorrect.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This project gave me a newfound appreciation for compiler writers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;extras&quot;&gt;Extras&lt;&#x2F;h2&gt;
&lt;p&gt;When trying to debug certain programs, I found that certain variables were being
optimized away, which was quite annoying. In searching for a way to prevent this,
I found &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=nXaxk27zwlk&quot;&gt;this great talk&lt;&#x2F;a&gt; by Chandler Carruth
who discusses microbenchmarking of C++ code. He showed two special functions that
can force side effects on variables without actually emitting assembly. See
&lt;a href=&quot;https:&#x2F;&#x2F;youtu.be&#x2F;nXaxk27zwlk?t=2438&quot;&gt;this&lt;&#x2F;a&gt; part of the talk to learn more about it.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Partial Redundancy Elimination using Lazy Code Motion for LLVM</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/pre/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/pre/</guid>
                <description>&lt;h3 id=&quot;problem&quot;&gt;Problem&lt;&#x2F;h3&gt;
&lt;p&gt;This project aims to write a pass for LLVM that implements &lt;em&gt;partial redundancy elimination&lt;&#x2F;em&gt; (PRE) using &lt;em&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=143136&quot;&gt;lazy code motion&lt;&#x2F;a&gt;&lt;&#x2F;em&gt; (lcm).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;algorithm-design&quot;&gt;Algorithm Design&lt;&#x2F;h3&gt;
&lt;p&gt;The algorithm is the same as the one I used in my project 2 &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;brilpre&#x2F;&quot;&gt;BrilPRE&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;This project is implemented as an LLVM pass, and it is based on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;llvm-pass-skeleton&quot;&gt;llvm-pass-skeleton&lt;&#x2F;a&gt;. 
The full implementation repository is &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Neroysq&#x2F;llvm-pre&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Although the algorithm is the same, LLVM IR is way more complex than Bril, and its framework also takes more time to learn than building from scratch when I dealt with Bril. Here I&#x27;d like to talk about the biggest challenge and also the most significant difference from BrilPRE: LLVM IR&#x27;s SSA form.&lt;&#x2F;p&gt;
&lt;p&gt;In the LLVM IR framework, 
there are no names for local variables&#x2F;registers (at least in compiler-generated code). 
There are only pointers representing the results of instructions, 
which means I cannot create a new variable with a specific name. &lt;&#x2F;p&gt;
&lt;p&gt;I can use built-in functions to create an instruction and get the result pointer, 
but the real trouble is that there is no way to do this: 
create the same instruction in two branches, 
and refer to them using one pointer after the branches merge. 
What I can do is to create a &lt;code&gt;phi&lt;&#x2F;code&gt; function to get a merged result, 
but this breaks a beautiful property of the original algorithm: 
for each expression, 
this algorithm only creates one new variable,
then for any evaluation of this expression, 
either it is the first evaluation in this path so that we evaluate it and assign the result to this variable, 
or we just replace it with this variable. 
I think this is beautiful because in the implementation I only need to maintain a map storing the corresponding variables of each expression, 
but in this project, 
I need to maintain this map on the fly since all pointers are different, and &lt;code&gt;phi&lt;&#x2F;code&gt; functions also introduce changes to this map.&lt;&#x2F;p&gt;
&lt;p&gt;Also, since the code is in SSA form, 
the values of each variable (register) never change,
which means expressions also never change.
Therefore, 
some analyses in lazy code motion become useless.
For example, &lt;code&gt;kill(b)&lt;&#x2F;code&gt; means the set of expressions whose values are changed in block &lt;code&gt;b&lt;&#x2F;code&gt;.
For LLVM IR, it is always empty.
This shows that SSA form is not the ideal setting for this algorithm, and there are PRE algorithms designed for SSA form, such as &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=258940&quot;&gt;&amp;quot;A new algorithm for partial redundancy elimination based on SSA form&amp;quot;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;To test correctness, I first manually test some tiny programs and make sure the results of the standard compiler and my pass match. 
Then, 
I use a tool &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;guilhermeleobas&#x2F;tf&quot;&gt;tf&lt;&#x2F;a&gt;, 
which can help run LLVM&#x27;s test-suite benchmarks, 
to run large-scale test sets.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;setting&quot;&gt;Setting&lt;&#x2F;h4&gt;
&lt;p&gt;The benchmarks I run are: &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;BenchmarkGame, CoyoteBench, Dhrystone, Linpack, McGill, Misc, PolyBench, Shootout, Stanford
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;They all come from &lt;code&gt;llvm-test-suite&#x2F;SingleSource&#x2F;Benchmarks&lt;&#x2F;code&gt; and contain 98 programs in total.&lt;&#x2F;p&gt;
&lt;p&gt;I compare the performance of four settings:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The baseline: compile with only one pass, &lt;code&gt;mem2reg&lt;&#x2F;code&gt;, which promotes stack slots to registers so that the LLVM IR is in SSA form.&lt;&#x2F;li&gt;
&lt;li&gt;LLVM&#x27;s built-in PRE: compile with passes &lt;code&gt;mem2reg, gvn, simplifycfg&lt;&#x2F;code&gt;; &lt;code&gt;gvn&lt;&#x2F;code&gt; claims to perform redundant load elimination in addition to PRE.&lt;&#x2F;li&gt;
&lt;li&gt;My implementation &lt;code&gt;lcm&lt;&#x2F;code&gt; (a pass called &lt;code&gt;PRE&lt;&#x2F;code&gt;): compile with passes &lt;code&gt;mem2reg, PRE, simplifycfg&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Current state-of-the-art overall optimization: compile using &lt;code&gt;clang&lt;&#x2F;code&gt; with the option &lt;code&gt;-O2&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h4 id=&quot;result&quot;&gt;Result&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;chart2.png&quot; alt=&quot;chart2&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The chart above shows the runtime of the latter three settings relative to the baseline on my selected benchmarks. 
The result shows that:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Overall, both &lt;code&gt;gvn&lt;&#x2F;code&gt; and &lt;code&gt;lcm&lt;&#x2F;code&gt; occasionally optimize the program significantly (ratio &amp;lt; 0.9): 15&#x2F;98 for &lt;code&gt;gvn&lt;&#x2F;code&gt;, and 7&#x2F;98 for &lt;code&gt;lcm&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Sometimes, PRE makes performance significantly worse (ratio &amp;gt; 1.1): 6&#x2F;98 for &lt;code&gt;gvn&lt;&#x2F;code&gt;, and 4&#x2F;98 for &lt;code&gt;lcm&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;gvn&lt;&#x2F;code&gt; performs better in more cases than &lt;code&gt;lcm&lt;&#x2F;code&gt;: 55 out of 98.&lt;&#x2F;li&gt;
&lt;li&gt;There are 15 out of 98 cases where &lt;code&gt;gvn&lt;&#x2F;code&gt; performs significantly better (ratio &amp;lt; 0.9) than &lt;code&gt;lcm&lt;&#x2F;code&gt;; there are 7 out of 98 cases where &lt;code&gt;lcm&lt;&#x2F;code&gt; performs significantly better (ratio &amp;lt; 0.9) than &lt;code&gt;gvn&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;There are 43 out of 98 cases where &lt;code&gt;-O2&lt;&#x2F;code&gt; is significantly better (ratio &amp;lt; 0.9) than both &lt;code&gt;gvn&lt;&#x2F;code&gt; and &lt;code&gt;lcm&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h3&gt;
&lt;p&gt;In conclusion, I successfully implemented a partial redundancy elimination pass for LLVM and tested its correctness and performance. I found it very exciting to see that my implementation can do a significantly better job in some of the test cases. I guess the reason is that &lt;code&gt;lcm&lt;&#x2F;code&gt; does a better job of reducing register pressure. And since &lt;code&gt;-O2&lt;&#x2F;code&gt; performs equally well in these cases, I think that other passes in &lt;code&gt;-O2&lt;&#x2F;code&gt; cover register pressure optimization.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Space Efficient Conservative Garbage Collection</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/space-efficient-gc/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/space-efficient-gc/</guid>
                <description>&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=155109&quot;&gt;&amp;quot;Space efficient conservative garbage collection&amp;quot;&lt;&#x2F;a&gt; describes some inexpensive but useful techniques that can make &lt;strong&gt;conservative garbage collectors&lt;&#x2F;strong&gt; more space efficient.
The techniques reduce pointer misidentification so that less memory is retained at run time, and they also prevent some of the excess retention that arises because conservative collectors have less information about variable liveness than conventional collectors. The methods can be easily incorporated into any garbage-collecting allocator, transparently to client programs.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;pointer-misidentification-and-solutions-of-existing-conservative-garbage-collectors&quot;&gt;Pointer Misidentification and Solutions of Existing Conservative Garbage Collectors&lt;&#x2F;h1&gt;
&lt;p&gt;The goal of a garbage collector is to retain as little memory as it can, subject to the constraint that all memory that will be accessed in the future must be retained. A garbage collector is called &lt;strong&gt;conservative&lt;&#x2F;strong&gt; if it can operate with minimal information about the layout of the client program&#x27;s data. &lt;&#x2F;p&gt;
&lt;p&gt;For conservative collectors, the most apparent potential source of excess memory retention is &lt;strong&gt;pointer misidentification&lt;&#x2F;strong&gt; (e.g., misidentifying integers as pointers). Here is a typical example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;An integer variable contains the address of a valid but inaccessible object&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The probability of such misidentification increases as more of the address space is occupied by the heap.&lt;&#x2F;p&gt;
&lt;p&gt;Some &lt;em&gt;current ad hoc techniques&lt;&#x2F;em&gt; can help decrease the probability of pointer misidentification. One method is to design the allocator to avoid allocating objects at addresses that are likely to collide with other data (that is, to position the heap properly in the address space), and to align pointers properly. Otherwise, all possible alignments must be considered by the collector, which could result in more false pointers. For example, in the figure shown below, two small integers (&lt;em&gt;0000 0009&lt;&#x2F;em&gt; and &lt;em&gt;0000 000a&lt;&#x2F;em&gt;) can also be viewed as a valid heap address (&lt;em&gt;0009 0000&lt;&#x2F;em&gt;) formed by concatenating the low-order half word of the first integer with the high-order half word of the next.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;001.png&quot; alt=&quot;001&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
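&lt;p&gt;To make the alignment example concrete, here is a minimal Python sketch. It assumes a 32-bit big-endian layout (my assumption, chosen to match the digits above), and the helper &lt;code&gt;candidate_words&lt;&#x2F;code&gt; is a hypothetical name, not something from the paper. It shows that the false heap address only appears when the collector must also consider half-word-aligned windows.&lt;&#x2F;p&gt;

```python
import struct

# Two small 32-bit integers stored back to back, big-endian (assumed layout).
memory = struct.pack(">II", 0x00000009, 0x0000000A)


def candidate_words(buf, step):
    """Interpret every step-aligned 4-byte window of buf as a 32-bit word."""
    return [struct.unpack_from(">I", buf, off)[0]
            for off in range(0, len(buf) - 3, step)]


# A collector that can rely on word alignment only sees the two integers.
aligned = candidate_words(memory, 4)

# A collector that must consider 2-byte offsets also sees the straddling word.
half_aligned = candidate_words(memory, 2)
```

&lt;p&gt;With word alignment the collector sees only 9 and 10. Once 2-byte offsets are allowed, it also sees &lt;code&gt;0x00090000&lt;&#x2F;code&gt;, which may look like a heap address.&lt;&#x2F;p&gt;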
&lt;p&gt;Currently, most compilers always guarantee adequate alignment. However, it is hard to determine the proper position of the heap. Thus, this paper introduces a much less ad hoc and more flexible technique to avoid pointer misidentification.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;their-techniques&quot;&gt;Their Techniques&lt;&#x2F;h1&gt;
&lt;p&gt;This paper&#x27;s technique is composed of two steps: &lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Normal garbage collection at regular intervals;&lt;&#x2F;li&gt;
&lt;li&gt;Invalid pointer recording.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The first step ensures that normal garbage collection takes place at regular intervals, with at least a &lt;em&gt;fast and initial one happening right after system startup&lt;&#x2F;em&gt; before any allocation begins. The second step tries to record the invalid pointers found during a garbage collection so that these addresses can be avoided when placing valid objects in later allocations. Below is a detailed explanation of the second step.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;invalid-pointer-recording&quot;&gt;Invalid Pointer Recording&lt;&#x2F;h3&gt;
&lt;p&gt;They use a &lt;strong&gt;blacklist&lt;&#x2F;strong&gt; to keep a record of invalid pointers found during a garbage collection. Addresses in the blacklist, which could otherwise become valid object addresses later, cannot be used for new objects. The following algorithm is used to construct the blacklist.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;002.png&quot; alt=&quot;002.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A naive marking algorithm is modified to support blacklisting. The only modification is in bold face. Whenever an invalid object address is found, the address is added to the blacklist if it is in the vicinity of the heap. &lt;&#x2F;p&gt;
&lt;p&gt;This scheme mostly blacklists addresses that correspond to long-lived data values before these values become false references, as they are the data that could cause garbage to be retained indefinitely. One other thing to notice is that the scheme can eliminate the false references originating from statically allocated constant data scanned for roots by the collector, which are the most &lt;em&gt;troublesome&lt;&#x2F;em&gt;. Meanwhile, small and pointer-free objects can still be allocated at blacklisted addresses because they have little impact on erroneous retention.&lt;&#x2F;p&gt;
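&lt;p&gt;The modified marking procedure described above can be sketched roughly as follows. This is an illustrative Python model, not the paper&#x27;s code: the heap is modelled as a dictionary from object address to the words stored in that object, and the names &lt;code&gt;mark&lt;&#x2F;code&gt;, &lt;code&gt;vicinity_lo&lt;&#x2F;code&gt; and &lt;code&gt;vicinity_hi&lt;&#x2F;code&gt; are my own assumptions.&lt;&#x2F;p&gt;

```python
def mark(roots, heap, vicinity_lo, vicinity_hi):
    """Conservative mark phase that also records near-heap false pointers.

    heap maps each valid object address to the list of words stored in it.
    Returns the set of marked object addresses and the blacklist.
    """
    marked, blacklist = set(), set()
    stack = list(roots)
    while stack:
        word = stack.pop()
        if word in heap:
            if word not in marked:
                marked.add(word)
                # Every word inside the object is treated as a possible pointer.
                stack.extend(heap[word])
        elif vicinity_hi > word >= vicinity_lo:
            # Not a valid object address, but close enough to the heap that
            # it could become one later: this is the blacklisting step.
            blacklist.add(word)
    return marked, blacklist


# Tiny example: two objects pointing at each other, plus some stray words.
example_heap = {0x1000: [0x2000, 7], 0x2000: [0x1000]}
example_marked, example_blacklist = mark(
    [0x1000, 0x3000, 5], example_heap, 0x1000, 0x8000)
```

&lt;p&gt;In this example the word &lt;code&gt;0x3000&lt;&#x2F;code&gt; is not a valid object but lies in the assumed heap vicinity, so it is blacklisted, while the small integers 5 and 7 are simply ignored.&lt;&#x2F;p&gt;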
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;They implemented variants of this approach in some versions of &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=74862&quot;&gt;PCR&lt;&#x2F;a&gt; and other &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=52202&quot;&gt;garbage collectors&lt;&#x2F;a&gt;. They both conservatively scan the stacks, registers, static data and the heap.&lt;&#x2F;p&gt;
&lt;p&gt;Entire pages rather than individual addresses are blacklisted. That way, the blacklist can be implemented as a bit array indexed by page number. A hash table can be used for discontiguous heaps.&lt;&#x2F;p&gt;
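&lt;p&gt;A page-granularity blacklist of this kind might look like the sketch below. The page size, the class name &lt;code&gt;PageBlacklist&lt;&#x2F;code&gt; and its methods are assumptions made for illustration; the bit array is stored in a single Python integer for brevity, and a contiguous heap is assumed.&lt;&#x2F;p&gt;

```python
PAGE_SIZE = 4096  # assumed page size, not taken from the paper


class PageBlacklist:
    """Bit array with one bit per heap page, stored in a Python int."""

    def __init__(self, heap_base, num_pages):
        self.heap_base = heap_base
        self.num_pages = num_pages
        self.bits = 0

    def page_of(self, addr):
        # Index of the page containing addr, relative to the heap base.
        return (addr - self.heap_base) // PAGE_SIZE

    def add(self, addr):
        page = self.page_of(addr)
        if page in range(self.num_pages):  # ignore addresses outside the heap
            self.bits |= 2 ** page

    def is_blacklisted(self, addr):
        page = self.page_of(addr)
        return page in range(self.num_pages) and (self.bits >> page) % 2 == 1


# Blacklist the page that contains one false pointer.
bl = PageBlacklist(0x10000, 16)
bl.add(0x10000 + 3 * PAGE_SIZE + 17)
```

&lt;p&gt;An allocator would consult &lt;code&gt;is_blacklisted&lt;&#x2F;code&gt; before placing a new object on a page, so a single set bit rules out every address on that page.&lt;&#x2F;p&gt;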
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;They evaluate their techniques by running &lt;strong&gt;a&lt;&#x2F;strong&gt; program on different machines using both statically and dynamically linked versions of the C library. The program allocates 200 circular linked lists containing 100 Kbytes each. The collector retains an entire list if any data points to any of the 100,000 addresses corresponding to objects in the list. The results of &lt;strong&gt;storage retention&lt;&#x2F;strong&gt; with and without blacklisting are shown in the following table.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;003.png&quot; alt=&quot;003.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Several observations can be made based on the table:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Blacklisting is effective in eliminating nearly all accidental retention caused by garbage collector conservatism.&lt;&#x2F;li&gt;
&lt;li&gt;The numbers in the table are approximate because the results are not completely reproducible: the scanned part of the address space is polluted with UNIX environment variables. They are therefore specified as ranges.&lt;&#x2F;li&gt;
&lt;li&gt;If all interior pointers are considered valid, it would be difficult to allocate individual objects larger than about 100 Kbytes without violating the blacklist constraint, or requesting memory from the OS at a garbage collector specified location.&lt;&#x2F;li&gt;
&lt;li&gt;According to the paper, blacklisting can be easily incorporated into a garbage collecting allocator at &lt;em&gt;almost no performance cost&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;other-sources-of-excess-retention&quot;&gt;Other Sources of Excess Retention&lt;&#x2F;h3&gt;
&lt;p&gt;Usually, this is the end of a 2019 paper, or it may continue with some related work, future work and conclusion. However, it&#x27;s published in &lt;strong&gt;1993&lt;&#x2F;strong&gt; and here is more stuff!&lt;&#x2F;p&gt;
&lt;p&gt;Another source of excess memory retention is that conservative collectors usually have less information about &lt;strong&gt;variable liveness&lt;&#x2F;strong&gt; than conventional collectors. For example, a global variable may contain a valid pointer which is no longer used in the program. Because the collector lacks this information, such a variable will remain in the stack even after garbage collection. 
The stack might be under unrealistically heavy use due to this problem, causing performance degradation in garbage collection.&lt;&#x2F;p&gt;
&lt;p&gt;So here are some useful techniques to help address this problem:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Have the allocator and collector carefully clean up after themselves, clearing local variables before function exit.&lt;&#x2F;li&gt;
&lt;li&gt;The allocator can occasionally try to clear areas of the stack beyond the most recently activated frame.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The first technique eliminates the possible impact of irregularly triggered out-of-line allocation code and garbage collector invocations, which are relatively rare. The program itself may have a very regular execution, which ensures that the same stack locations are always overwritten. So it pays to use this means (like writing 0s over stack frames after popping them) to maintain the regular execution.&lt;&#x2F;p&gt;
&lt;p&gt;The second technique sounds really confusing to me. If stack frames are cleared individually as they are popped, it seems that nothing needs to be done to the rest of the stack. So why bother with the part of the stack beyond the most recently activated frame?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;minor-consequences-of-misidentification&quot;&gt;Minor: Consequences of Misidentification&lt;&#x2F;h3&gt;
&lt;p&gt;The data structure involved can greatly influence how much memory an individual false reference retains. &lt;&#x2F;p&gt;
&lt;p&gt;For example, given a balanced binary tree, the expected number of vertices retained by a false reference is about the height of the tree. The height of a tree with &lt;em&gt;n&lt;&#x2F;em&gt; nodes lies in &lt;em&gt;[log2(n+1) - 1, c log2(n+2) + b)&lt;&#x2F;em&gt;, which is usually tolerable. Queues and lazy lists could exhibit much worse behavior as they grow without bound and the whole data structure needs to be retained.&lt;&#x2F;p&gt;
&lt;p&gt;A more common problem is the construction of large strongly connected data structures, which could result in an unbounded memory leak if the structures are large enough. Take the data structures shown in the figure below as an example.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;004.png&quot; alt=&quot;004.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Figures 3 and 4 depict two different representations of a rectangular array in which the vertices are linked both horizontally and vertically. The structure can be accessed by traversing a row&#x2F;column. The left shows an embedded link representation, in which a false reference would result in the retention of a large fraction of the whole structure. The right shows the same structure with a separate link representation (the links are represented by ovals in the figure), in which at most a single row&#x2F;column is affected.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, the separate link version of some data structures would greatly help reduce storage retention and should be encouraged.&lt;&#x2F;p&gt;
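&lt;p&gt;The difference between the two representations can be checked with a small reachability experiment. The sketch below is my own toy model, not code from the paper: vertices and link cells are tuples, &lt;code&gt;embedded_grid&lt;&#x2F;code&gt; gives every vertex its own right and down pointers, and &lt;code&gt;separate_grid&lt;&#x2F;code&gt; keeps the pointers in separate row and column link cells. Counting what a single false reference keeps alive shows why the second layout retains far less.&lt;&#x2F;p&gt;

```python
def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def embedded_grid(n):
    """Each vertex holds its own right and down pointers."""
    edges = {}
    for r in range(n):
        for c in range(n):
            out = []
            if c + 1 != n:
                out.append(("v", r, c + 1))
            if r + 1 != n:
                out.append(("v", r + 1, c))
            edges[("v", r, c)] = out
    return edges


def separate_grid(n):
    """Rows and columns are lists of link cells that point at the vertices."""
    edges = {}
    for r in range(n):
        for c in range(n):
            row_out = [("v", r, c)]
            col_out = [("v", r, c)]
            if c + 1 != n:
                row_out.append(("row", r, c + 1))
            if r + 1 != n:
                col_out.append(("col", r + 1, c))
            edges[("row", r, c)] = row_out
            edges[("col", r, c)] = col_out
    return edges


# One false reference into a 10 x 10 structure in each representation.
kept_embedded = len(reachable(("v", 0, 0), embedded_grid(10)))
kept_separate = len(reachable(("row", 0, 0), separate_grid(10)))
```

&lt;p&gt;In this model a false reference to the corner vertex of the embedded version keeps all 100 vertices alive, while a false reference to the first link cell of a row in the separate version keeps only that row: 10 link cells and 10 vertices.&lt;&#x2F;p&gt;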
&lt;p&gt;There is much more on this topic in another paper by Hans Boehm, &lt;a href=&quot;https:&#x2F;&#x2F;www.hpl.hp.com&#x2F;techreports&#x2F;2001&#x2F;HPL-2001-251.pdf&quot;&gt;Bounding Space Usage of Conservative Garbage Collectors&lt;&#x2F;a&gt;, if you are interested.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;conclusion-and-thoughts&quot;&gt;Conclusion and Thoughts&lt;&#x2F;h1&gt;
&lt;p&gt;This paper introduces some simple but effective techniques for reducing storage retention in conservative garbage collectors. &lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;I think this paper gives a thorough explanation of the pointer misidentification problem in conservative garbage collectors, as well as the existing solutions and the ones they propose. The blacklisting solution is simple and effective, and can be easily incorporated into current collectors. &lt;&#x2F;li&gt;
&lt;li&gt;However, it doesn&#x27;t give a thorough evaluation of their techniques. Only a handwritten program is tested for the effectiveness and no experiment is conducted on the performance cost when incorporating it into current garbage collectors. Also, they don&#x27;t consider the possible memory fragmentation problem in the evaluation.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;questions&quot;&gt;Questions&lt;&#x2F;h1&gt;
&lt;ul&gt;
&lt;li&gt;Why is it important to run an initial garbage collection right after system startup? Why are false references originating from statically allocated constant data the most troublesome?&lt;&#x2F;li&gt;
&lt;li&gt;How much could fragmentation due to blacklisting influence the performance?&lt;&#x2F;li&gt;
&lt;li&gt;Still ad hoc? Any more insightful ideas recently?&lt;&#x2F;li&gt;
&lt;li&gt;How do you like the organization&#x2F;design of the evaluation? (I thought it quite inadequate but remember it&#x27;s from &lt;strong&gt;1993&lt;&#x2F;strong&gt;:)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
            </item>
        
            <item>
                <title>Strength Reduction Pass in LLVM</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/strength-reduction-pass-in-llvm/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/strength-reduction-pass-in-llvm/</guid>
                <description>&lt;p&gt;Strength reduction is an optimization technique which substitutes expensive operations with computationally cheaper ones. For example, a very weak strength reduction algorithm can substitute the instruction &lt;code&gt;b = a * 4&lt;&#x2F;code&gt; with &lt;code&gt;b = a &amp;lt;&amp;lt; 2&lt;&#x2F;code&gt;. For this course project, we implemented the loop strength reduction algorithm described in &lt;a href=&quot;http:&#x2F;&#x2F;www.cs.utexas.edu&#x2F;%7Epingali&#x2F;CS380C&#x2F;2019&#x2F;lectures&#x2F;strengthReduction.pdf&quot;&gt;Prof. Pingali&#x27;s lecture slides&lt;&#x2F;a&gt; as an LLVM pass. Our pass identifies induction variables and reduces the expensive computation inside the loop to improve the execution time of the compiled program. The source code can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Hecmay&#x2F;llvm-pass-skeleton&quot;&gt;here&lt;&#x2F;a&gt;. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;methodology&quot;&gt;Methodology&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;preprocessing&quot;&gt;Preprocessing&lt;&#x2F;h4&gt;
&lt;p&gt;While it is possible to implement a strength reduction pass without any preprocessing of the code generated by the compiler front-end, it is much easier to perform effective loop strength reduction on properly optimized code. Aside from gathering sufficient information from the CDFG, memory-to-register promotion, constant propagation, copy propagation, and loop canonicalization greatly reduce the complexity of loop strength reduction. Because we are using LLVM to implement our pass, we leverage existing LLVM passes to perform these optimizations for us. The LLVM loop simplification pass canonicalizes the loop to have a preheader block, a header block, one single exit block and only one backedge. After executing this pass, all loops in the program have the same regular structure, and we leverage this structure to simplify our implementation. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;loop-strength-reduction&quot;&gt;Loop Strength Reduction&lt;&#x2F;h4&gt;
&lt;p&gt;We illustrate our loop strength reduction algorithm using a simple example. The following code shows an unoptimized loop:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;while&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  a[j] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a[j] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can see that the integer &lt;code&gt;j&lt;&#x2F;code&gt; is recomputed in every iteration. Each computation of &lt;code&gt;j&lt;&#x2F;code&gt; uses one multiplication and one addition. However, since &lt;code&gt;j&lt;&#x2F;code&gt; is actually linearly dependent on &lt;code&gt;i&lt;&#x2F;code&gt; and &lt;code&gt;i&lt;&#x2F;code&gt; increments by two in every iteration, it is possible to compute a proper initial value of &lt;code&gt;j&lt;&#x2F;code&gt; outside the loop, and increment &lt;code&gt;j&lt;&#x2F;code&gt; by a pre-computed stride in every iteration of the loop. With this optimization, the update of &lt;code&gt;j&lt;&#x2F;code&gt; only uses one addition per iteration as shown in the following code snippet. The optimized code is computationally cheaper than the unoptimized version. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; j = 3 * 0 + 2
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;while&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  a[j] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a[j] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; j = j + 3 * 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h5 id=&quot;finding-induction-variables&quot;&gt;Finding Induction Variables&lt;&#x2F;h5&gt;
&lt;p&gt;The first step of performing loop strength reduction is to locate all loop induction variables. We use a map to store the names of the induction variables and their corresponding coefficients. Our map has the following structure:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;Map { Value =&amp;gt; (basic induction variable, multiplicative factor, additive factor) }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, the basic induction variable &lt;code&gt;i&lt;&#x2F;code&gt; in the example code will have an entry &lt;code&gt;i =&amp;gt; (i, 1, 0)&lt;&#x2F;code&gt; in the map, while the variable &lt;code&gt;j&lt;&#x2F;code&gt; will have an entry &lt;code&gt;j =&amp;gt; (i, 3, 2)&lt;&#x2F;code&gt; after we finalize all the induction variables and their corresponding coefficients. We identify the basic induction variables by checking the Phi nodes in the header block of the loop, and then  repeatedly scan the loop body to find all other induction variables using the following two rules: &lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;If we find an assignment of form &lt;code&gt;k = b * j&lt;&#x2F;code&gt; where &lt;code&gt;j&lt;&#x2F;code&gt; is an induction variable with triple &lt;code&gt;(i, c, d)&lt;&#x2F;code&gt;, then add &lt;code&gt;k =&amp;gt; (i, b * c, b * d)&lt;&#x2F;code&gt; to the map, since &lt;code&gt;k = b * (c * i + d)&lt;&#x2F;code&gt;;&lt;&#x2F;li&gt;
&lt;li&gt;If we find an assignment of form &lt;code&gt;k = j + b&lt;&#x2F;code&gt; where &lt;code&gt;j&lt;&#x2F;code&gt; is an induction variable with triple &lt;code&gt;(i, c, d)&lt;&#x2F;code&gt;, then add &lt;code&gt;k =&amp;gt; (i, c, d + b)&lt;&#x2F;code&gt; to the map. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The algorithm runs until convergence, i.e., it will stop when the size of the induction variable set does not increase any more. For our simple example, in the first iteration it will add the basic induction variable &lt;code&gt;i =&amp;gt; (i, 1, 0)&lt;&#x2F;code&gt; into the map, while in the second iteration it will add &lt;code&gt;j =&amp;gt; (i, 3, 2)&lt;&#x2F;code&gt; into the map. The algorithm will terminate in the third iteration. &lt;&#x2F;p&gt;
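&lt;p&gt;The fixed-point search can be illustrated with a small Python model. This is a sketch under simplifying assumptions, not our LLVM code: the loop body is a list of hypothetical &lt;code&gt;(dest, op, src, const)&lt;&#x2F;code&gt; tuples, and &lt;code&gt;find_induction_variables&lt;&#x2F;code&gt; is an illustrative name. For the multiplication rule the sketch scales both factors, because &lt;code&gt;k = b * (c * i + d)&lt;&#x2F;code&gt; expands to &lt;code&gt;(b * c) * i + b * d&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```python
def find_induction_variables(basic_vars, body):
    """Grow the induction variable map until it stops changing.

    body is a list of (dest, op, src, const) tuples with op in {"mul", "add"}.
    Each map entry is (basic variable, multiplicative factor, additive factor).
    """
    ind = {v: (v, 1, 0) for v in basic_vars}
    changed = True
    while changed:
        changed = False
        for dest, op, src, const in body:
            if dest in ind or src not in ind:
                continue
            base, c, d = ind[src]
            if op == "mul":
                ind[dest] = (base, const * c, const * d)
            elif op == "add":
                ind[dest] = (base, c, d + const)
            else:
                continue
            changed = True
    return ind


# j = 3 * i + 2 split into two steps, listed out of order on purpose
# so that more than one scan of the body is needed.
example_body = [("j", "add", "t", 2), ("t", "mul", "i", 3)]
example_map = find_induction_variables(["i"], example_body)
```

&lt;p&gt;For the running example this yields &lt;code&gt;j =&amp;gt; (i, 3, 2)&lt;&#x2F;code&gt;, matching the entry described above, plus an entry for the hypothetical temporary &lt;code&gt;t&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;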
&lt;h5 id=&quot;update-the-program&quot;&gt;Update the Program&lt;&#x2F;h5&gt;
&lt;p&gt;Computing the stride and initial value of each induction variable from each entry in our map is straightforward. Notice that for the initial value, we don&#x27;t directly compute an integer value for each variable, but relate each induction variable to its basic induction variable. Then we can update the program to perform loop strength reduction with the following four steps:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Initialize a non-basic induction variable with the computed initial value in the loop preheader;&lt;&#x2F;li&gt;
&lt;li&gt;Add one Phi node for this non-basic induction variable into the loop header, and set the incoming value of the Phi node to the initial value we just computed;&lt;&#x2F;li&gt;
&lt;li&gt;At the end of the loop body, insert an update instruction for this non-basic induction variable, and set it as the other incoming value of the Phi node we just inserted;&lt;&#x2F;li&gt;
&lt;li&gt;Replace all uses of the original induction variable with the Phi node we inserted. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The procedure is repeated for each non-basic induction variable we have identified. After the optimization, dead code elimination is performed to clean up the instructions used for computing the original induction variables. &lt;&#x2F;p&gt;
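&lt;p&gt;As a sanity check on the transformation, the following Python sketch simulates the example loop before and after strength reduction for a variable with triple &lt;code&gt;(i, c, d)&lt;&#x2F;code&gt;. The function names and parameters are illustrative assumptions; the point is only that the initial value &lt;code&gt;c * start + d&lt;&#x2F;code&gt; and the stride &lt;code&gt;c * step&lt;&#x2F;code&gt; reproduce the original sequence of values.&lt;&#x2F;p&gt;

```python
def original_loop(start, bound, step, c, d):
    """Recompute j = c * i + d in every iteration, as in the unoptimized loop."""
    seen = []
    i = start
    while bound > i:
        seen.append(c * i + d)
        i += step
    return seen


def reduced_loop(start, bound, step, c, d):
    """Initialize j once in the preheader, then bump it by a fixed stride."""
    seen = []
    i = start
    j = c * start + d   # initial value, computed outside the loop
    stride = c * step   # per-iteration increment of j
    while bound > i:
        seen.append(j)
        i += step
        j += stride
    return seen


# The running example: i goes 0, 2, 4, ... below 10 and j = 3 * i + 2.
before = original_loop(0, 10, 2, 3, 2)
after = reduced_loop(0, 10, 2, 3, 2)
```

&lt;p&gt;Both versions visit the same values of &lt;code&gt;j&lt;&#x2F;code&gt;, but the reduced loop needs only one addition per iteration to produce them.&lt;&#x2F;p&gt;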
&lt;h3 id=&quot;implementation-details&quot;&gt;Implementation Details&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;preprocessing-1&quot;&gt;Preprocessing&lt;&#x2F;h4&gt;
&lt;p&gt;For the loop preprocessing, we create a function pass manager to include all the passes we want to apply. The pass manager is instantiated with the module that contains the LLVM function. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;legacy::FunctionPassManager &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;FPM&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(module);
FPM.add(createConstantPropagationPass());
FPM.add(createIndVarSimplifyPass());
FPM.add(createDeadCodeEliminationPass());
FPM.add(createLoopSimplifyPass());
FPM.doInitialization();
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; changed &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; FPM.run(F);
FPM.doFinalization();
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We also run &lt;code&gt;LoopInfoWrapperPass&lt;&#x2F;code&gt; to retrieve all the loop information. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getAnalysisUsage&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(AnalysisUsage &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;AU) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  AU.setPreservesCFG();
  AU.addRequired&amp;lt;LoopInfoWrapperPass&amp;gt;();
  AU.addRequired&amp;lt;TargetLibraryInfoWrapperPass&amp;gt;();
} 
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;locate-basic-induction-variables&quot;&gt;Locate Basic Induction Variables&lt;&#x2F;h4&gt;
&lt;p&gt;For each loop in the program, we create a map to record all induction variables as described earlier. For each loop, we find basic induction variables by simply collecting all the Phi nodes in the loop header:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;map&amp;lt;Value&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, tuple&amp;lt;Value&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &amp;gt; IndVarMap;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; collect all basic indvars by visiting all phi nodes
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;auto &amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;I &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: *&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b_header) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(PHINode &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;PN &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;dyn_cast&amp;lt;PHINode&amp;gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;I)) {
    IndVarMap[&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;I] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;make_tuple(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;I, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice that this step may collect more variables than the basic induction variables we need, because not every Phi node in the loop header is an induction variable. We initialize every collected variable as a &amp;quot;basic induction variable&amp;quot;. Variables that our induction variable analysis does not recognize remain &amp;quot;basic&amp;quot; and are not touched when we perform the optimization. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;induction-variable-analysis&quot;&gt;Induction Variable Analysis&lt;&#x2F;h4&gt;
&lt;p&gt;We collect all induction variables by repeatedly iterating over the instructions in the loop until the map stops growing. A value is added to the map as a new induction variable if either of the following conditions holds: &lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The value is computed with an Add &#x2F; Sub instruction on a constant and an induction variable that is already in the map; &lt;&#x2F;li&gt;
&lt;li&gt;The value is computed with a Mul instruction on a constant and an induction variable that is already in the map. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
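&lt;p&gt;The fixed-point iteration described above can be sketched outside of LLVM. The Python below is only an illustration under stated assumptions: the toy instruction tuples, the name &lt;code&gt;find_induction_vars&lt;&#x2F;code&gt;, and the reading of each map entry as (basic variable, factor, offset) are made up for this example and are not taken from our pass. It also assumes that Sub means &amp;quot;variable minus constant&amp;quot;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# Toy IR (an assumption for this sketch): each instruction is
# (dest, op, source_variable, constant).
# A map entry (base, factor, offset) means: value == factor * base + offset.

def find_induction_vars(phi_names, instructions):
    # Every Phi node starts out as a basic induction variable: 1 * itself + 0.
    ind = {name: (name, 1, 0) for name in phi_names}
    while True:
        size_before = len(ind)
        for dest, op, src, const in instructions:
            if dest in ind or src not in ind:
                continue
            base, factor, offset = ind[src]
            if op == "add":
                ind[dest] = (base, factor, offset + const)
            elif op == "sub":
                ind[dest] = (base, factor, offset - const)
            elif op == "mul":
                ind[dest] = (base, factor * const, offset * const)
        # Stop once a full sweep adds nothing new to the map.
        if len(ind) == size_before:
            return ind

loop_body = [
    ("k", "add", "j", 5),  # k = j + 5, found only on the second sweep
    ("j", "mul", "i", 3),  # j = i * 3
    ("t", "add", "x", 1),  # x is not an induction variable, so t is skipped
]
result = find_induction_vars(["i"], loop_body)
print(result)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Because &lt;code&gt;k&lt;&#x2F;code&gt; is defined before &lt;code&gt;j&lt;&#x2F;code&gt; in this toy loop body, a single sweep is not enough, which is why the iteration repeats until the map size is stable.&lt;&#x2F;p&gt;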
&lt;h4 id=&quot;strength-reduction&quot;&gt;Strength Reduction&lt;&#x2F;h4&gt;
&lt;p&gt;We iterate through all Phi nodes in the loop header and, for each one, find all induction variables that correspond to it. For each induction variable, we insert instructions (one Mul and one Add) in the loop preheader to compute its initial value. A new Phi node is created to replace the original Phi node, and the mapping from each induction variable to its new Phi node is stored in a map. To compute the initial value of an induction variable, we need the initial value of the basic induction variable (the one defined by the Phi node). Since we have performed loop simplification, we can safely assume that every Phi node in the loop header has exactly two incoming values: one from the loop body and one from outside the loop. We always take the latter as the initial value. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (auto &amp;amp;I : *b_header) {
  &#x2F;&#x2F; we insert at the first phi node
  if (PHINode *PN = dyn_cast&amp;lt;PHINode&amp;gt;(&amp;amp;I)) {
    int num_income = PN-&amp;gt;getNumIncomingValues();
    assert(num_income == 2);
    &#x2F;&#x2F; find the preheader value of the phi node
    for (int i = 0; i &amp;lt; num_income; i++) {
      if (PN-&amp;gt;getIncomingBlock(i) == b_preheader) {
        preheader_val = PN-&amp;gt;getIncomingValue(i);
      } else {
        b_body = PN-&amp;gt;getIncomingBlock(i);
      }
    }
    ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We then insert update instructions into the loop body and set their result as the incoming value, from the loop body, of the new Phi node we just inserted. After creating Phi nodes for all induction variables, we replace all uses of the original values with the newly created Phi nodes to complete the optimization. &lt;&#x2F;p&gt;
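&lt;p&gt;To make the transformation concrete, here is a small Python sketch of the effect of strength reduction on a single derived induction variable of the form factor * i + offset. It is not our LLVM code: the function names and parameters are hypothetical, and the variable &lt;code&gt;j&lt;&#x2F;code&gt; merely plays the role of the new Phi node.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def original_loop(start, step, n, factor, offset):
    # Before the pass: one Mul and one Add are executed in every iteration.
    out = []
    i = start
    for _ in range(n):
        out.append(factor * i + offset)
        i += step
    return out

def reduced_loop(start, step, n, factor, offset):
    # After the pass: the Mul and Add run once, in the preheader.
    out = []
    j = factor * start + offset  # initial value computed in the preheader
    delta = factor * step        # loop-invariant increment
    for _ in range(n):
        out.append(j)            # j stands in for the new Phi node
        j += delta               # update instruction inserted in the loop body
    return out

before = original_loop(2, 3, 5, 4, 7)
after = reduced_loop(2, 3, 5, 4, 7)
print(before == after)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Both versions produce the same sequence of values; the reduced version only performs an addition inside the loop.&lt;&#x2F;p&gt;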
&lt;h3 id=&quot;experiment-results&quot;&gt;Experiment Results&lt;&#x2F;h3&gt;
&lt;p&gt;We evaluate our pass on &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;embench&#x2F;embench-iot&quot;&gt;embench-iot&lt;&#x2F;a&gt;, a benchmark suite designed to test the performance of deeply embedded systems. The strength reduction pass is applied to each program to evaluate its correctness and efficiency. Experiments are performed on a server with a 2.20GHz Intel Xeon processor and 128GB of memory. All programs are single-threaded.&lt;&#x2F;p&gt;
&lt;p&gt;To run the optimized program, we first use clang to emit LLVM IR for the original program with all optimization passes disabled. The IR is then passed to LLVM opt, optimized with our pass, and compiled into bitcode and object files. Finally, we link the object files into a binary and run it on a physical machine. Each benchmark has a macro &lt;code&gt;CPU_MHZ&lt;&#x2F;code&gt; that controls the number of times the top-level benchmark function executes. We set it to 1000 so that each benchmark executes a sufficient number of times. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;# generate LLVM IR for original program 
clang -c -emit-llvm -O0 -Xclang -disable-O0-optnone benchmark.c $EMBENCH_DIR&#x2F;support&#x2F;*.c -I&#x2F;path&#x2F;to&#x2F;benchmark \
-I$EMBENCH_DIR&#x2F;support -DCPU_MHZ=1000

for f in *.bc
do
  llvm-dis ${f} 
done

for f in *.ll
do
  # optimize with llvm opt
  opt -S -load build&#x2F;skeleton&#x2F;libSkeletonPass.so -mem2reg -sr -dce ${f} -o opt_${f}
  opt -S -load build&#x2F;skeleton&#x2F;libSkeletonPass.so -mem2reg -sr -dce ${f} -o opt_${f}.bc
  llc -filetype=obj opt_${f}.bc; 
done

gcc *.o -lm; 
time .&#x2F;a.out
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The LLVM pass is first compiled into a shared library, which is loaded into LLVM opt as a compiler pass (with the &lt;code&gt;-sr&lt;&#x2F;code&gt; argument, which is the custom name of our pass). The experiment results are shown in the following table. To make a fair comparison, we also commented out everything except the preprocessing and postprocessing parts of our strength reduction pass, compiled it into a &amp;quot;dummy pass&amp;quot;, and used the same commands to compile the benchmarks. The results of executing this version of the benchmarks are recorded as our baseline in the &amp;quot;Original&amp;quot; column of the table. &lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Benchmark&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Original (s)&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Optimized (s)&lt;&#x2F;th&gt;&lt;th&gt;Speedup&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;aha-mont64&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.261&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.244&lt;&#x2F;td&gt;&lt;td&gt;1.070&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;crc32&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.598&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.598&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;cubic&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.018&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.018&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;edn&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.439&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.389&lt;&#x2F;td&gt;&lt;td&gt;1.129&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;huffbench&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.564&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.027&lt;&#x2F;td&gt;&lt;td&gt;20.889&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;matmult-int&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.628&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.579&lt;&#x2F;td&gt;&lt;td&gt;1.085&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;minver&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.067&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.078&lt;&#x2F;td&gt;&lt;td&gt;0.859&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;nbody&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.01&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.015&lt;&#x2F;td&gt;&lt;td&gt;0.667&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;nettle-aes&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.316&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;wrong&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;nettle-sha256&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.348&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.364&lt;&#x2F;td&gt;&lt;td&gt;0.956&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;nsichneu&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.312&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.314&lt;&#x2F;td&gt;&lt;td&gt;0.994&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;picojpeg&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.545&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;stuck&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;qrduino&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.183&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1.136&lt;&#x2F;td&gt;&lt;td&gt;1.041&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;sglib-combined&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.612&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;wrong&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;slre&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.729&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;wrong&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;st&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.064&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.049&lt;&#x2F;td&gt;&lt;td&gt;1.306&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;statemate&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.372&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.401&lt;&#x2F;td&gt;&lt;td&gt;0.928&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;ud&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.55&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.538&lt;&#x2F;td&gt;&lt;td&gt;1.022&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;wikisort&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.215&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0.276&lt;&#x2F;td&gt;&lt;td&gt;0.779&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;Geomean&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;&lt;&#x2F;td&gt;&lt;td&gt;1.211&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Excluding the programs that fail, the geometric mean speedup is 1.211x over the original baseline. Unfortunately, our pass does not optimize all programs correctly: &lt;code&gt;nettle-aes&lt;&#x2F;code&gt;, &lt;code&gt;sglib-combined&lt;&#x2F;code&gt;, and &lt;code&gt;slre&lt;&#x2F;code&gt; do not produce correct results after our optimization, while &lt;code&gt;picojpeg&lt;&#x2F;code&gt; does not terminate after applying our pass. We have not yet figured out the cause of these errors. Possible causes are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;We may have failed to filter out all the non-induction variables collected at the beginning of the pass during analysis and optimization, causing the program to behave incorrectly;&lt;&#x2F;li&gt;
&lt;li&gt;When inserting the update instructions, we may have inserted them at incorrect positions or inserted them multiple times. This may also explain the 20x speedup for the huffbench program (i.e., the loop exits before finishing all the work, but the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;embecosm&#x2F;embench-beebs&#x2F;blob&#x2F;master&#x2F;src&#x2F;huffbench&#x2F;libhuffbench.c#L456&quot;&gt;verification function&lt;&#x2F;a&gt; still returns the correct result).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
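&lt;p&gt;Since the huffbench result is suspicious, it is worth checking how much it contributes to the geometric mean. The following Python sketch recomputes the mean from the Speedup column of the table, once over all benchmarks that ran correctly and once with huffbench left out. The inputs are the rounded table entries, so the first value may differ slightly from the reported one; the helper name &lt;code&gt;geomean&lt;&#x2F;code&gt; is our own.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import math

# Speedup column of the table, for benchmarks that produced correct output.
speedups = {
    "aha-mont64": 1.070, "crc32": 1.0, "cubic": 1.0, "edn": 1.129,
    "huffbench": 20.889, "matmult-int": 1.085, "minver": 0.859,
    "nbody": 0.667, "nettle-sha256": 0.956, "nsichneu": 0.994,
    "qrduino": 1.041, "st": 1.306, "statemate": 0.928, "ud": 1.022,
    "wikisort": 0.779,
}

def geomean(values):
    # Geometric mean computed as exp of the mean of the logarithms.
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

all_rows = geomean(speedups.values())
without_huffbench = geomean(
    v for name, v in speedups.items() if name != "huffbench"
)
print(round(all_rows, 3), round(without_huffbench, 3))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;If the second number is noticeably lower than the first, the overall speedup is dominated by the single huffbench data point and should be interpreted with care.&lt;&#x2F;p&gt;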
</description>
            </item>
        
            <item>
                <title>Generating traces from LLVM</title>
                <pubDate>Wed, 13 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/trace-gen-llvm/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/trace-gen-llvm/</guid>
                <description>&lt;h2 id=&quot;let-s-generate-traces-from-llvm&quot;&gt;Let&#x27;s generate traces from LLVM!&lt;&#x2F;h2&gt;
&lt;p&gt;The code described in this blog post is hosted &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pbb59&#x2F;ir-trace&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;In this project we use LLVM to generate traces and execute them. We use LLVM passes to log all the traces in a program and use another pass to create an executable that only executes a selected trace. This could be useful to implement efficient trace selection, or to implement trace-based LLVM passes in general.&lt;&#x2F;p&gt;
&lt;p&gt;In a real system, an interpreter would internally implement both of these passes (profiling and trace generation). However, we did not use an interpreter due to technological limitations in LLVM.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;Trace-based compilers have found many use cases in computer systems. In tracing compilers, a loop body is compiled to only execute a single path (a trace) from the beginning of the loop body to the end. A trace, by definition, is a single basic block with the original branches removed. The reduction in branches can have a significant performance impact in many architectures.&lt;&#x2F;p&gt;
&lt;img src=&quot;tracing-jit.png&quot; alt=&quot;Tracing&quot; width=&quot;20%&quot;&gt;
&lt;p&gt;Guards must be inserted to check whether the assumed path through the program is actually taken. If it is not, another path must be executed at a high performance overhead. These systems speculate on a certain control flow of the program in a somewhat high-risk, high-reward environment: the benefit of repeating the same trace is large (fewer instructions and more optimizations across basic blocks), but mispredicting the trace is worse than not running the trace at all.&lt;&#x2F;p&gt;
&lt;p&gt;Tracing systems found early success in VLIW architectures. These architectures try to schedule many instructions at once, but run into trouble scheduling across basic blocks, which may or may not be run. A single basic block can remedy this by only considering instructions along a single path through the program.&lt;&#x2F;p&gt;
&lt;p&gt;More recently, tracing has found success in Just-in-Time compilers (JITs) for dynamic languages. The most popular Python JIT is a tracing JIT that speculates on both types and control flow.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;trace-detection&quot;&gt;Trace Detection&lt;&#x2F;h2&gt;
&lt;p&gt;To determine all the traces in a given program, we first use LLVM functions to detect branch instructions. We isolate branch instructions that diverge control flow and use an LLVM pass to insert a function call to a run-time library to log the condition upon divergence. At run time, when the conditions to the diverging branch instructions are resolved, the sequence of branch selections during execution is logged in a CSV file of boolean values. This sequence of branch selection forms a trace through the entire program taken during the profiled execution.&lt;&#x2F;p&gt;
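&lt;p&gt;The effect of the instrumentation can be imitated in a few lines of Python. This is only a sketch of the idea: &lt;code&gt;log_branch&lt;&#x2F;code&gt; is a hypothetical stand-in for the run-time library call that the pass inserts, and the CSV is written to an in-memory buffer instead of a file.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import csv
import io

branch_log = []

def log_branch(taken):
    # Hypothetical stand-in for the logging call inserted before each
    # conditional branch; it records the condition and passes it through.
    branch_log.append(bool(taken))
    return taken

def example(a, b):
    # Same shape as a function with one conditional branch.
    if log_branch(a > 3):
        return a + b
    return a - b

value = example(5, 2)

# Write the recorded branch outcomes as one CSV row of 0 and 1 values.
buffer = io.StringIO()
csv.writer(buffer).writerow([int(t) for t in branch_log])
print(buffer.getvalue().strip())
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The resulting row of booleans is the profile that a later pass can replay to decide which side of each branch belongs to the trace.&lt;&#x2F;p&gt;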
&lt;!-- This LLVM pass was created broadly following the &quot;Linking With a Runtime Library&quot; section from [LLVM for Grad Students](https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;~asampson&#x2F;blog&#x2F;llvm.html) blog post. However, a `Value` was created from `FunctionCallee` object before passing into `CreateCall` function from [`IRBuilder`](https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1IRBuilder.html) to avoid a casting error. This pass is registered as `-skull` in `Skull.cpp`. --&gt;
&lt;p&gt;In order to trace loop bodies, we first need to identify where the loops are in the program. We attempted to do this using an LLVM loop pass and to trace within each individual loop. Unfortunately, we were not able to get this pass working. Instead, we manually extracted loops from programs and traced them in isolation from the rest of the original program.&lt;&#x2F;p&gt;
&lt;!-- The next step in trace detection was detecting back-edges and clipping off isolated traces from the full execution trace. LLVM provides great tools to achieve this through &quot;interactions between passes&quot; detailed in [Writing an LLVM Pass](http:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;WritingAnLLVMPass.html#specifying-interactions-between-passes). Using [LoopPass](https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1Loop.html) analysis to gather loop information, LLVM provides a nice way to detect loops and whether each basic block is within a loop or not. However, accessing entry point to the loop and exit point turned out to be hard. The pass is expected to mark the start and end points of each loop in the trace log, thereby enabling trace selection and optimizing. Work in progress on this pass can be found using `-pelvis` in `Pelvis.cpp`. --&gt;
&lt;h2 id=&quot;trace-generation&quot;&gt;Trace Generation&lt;&#x2F;h2&gt;
&lt;p&gt;Traces through the program are generated using a function pass in LLVM. The pass takes as input the profiling data generated in the trace detection pass. The pass modifies a function by merging the entry block with other basic blocks along the function&#x27;s path. At the end of the pass, the entry block will be the only block left in the entire function. We trace the program at the level of the function because it was the simplest method. The applicability of this to real benchmarks will be discussed in the Evaluation section.&lt;&#x2F;p&gt;
&lt;p&gt;Starting from the top of the entry block, we traverse instructions in program order. We modify the entry basic block when we encounter one of the four kinds of instructions listed below. Once such a modification occurs, we restart the traversal from the beginning of the modified entry block.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Case&lt;&#x2F;th&gt;&lt;th&gt;Action to Entry Block&lt;&#x2F;th&gt;&lt;th&gt;Instruction Action&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Conditional Branch&lt;&#x2F;td&gt;&lt;td&gt;Merge appropriate block (1 of 2) based on profile&lt;&#x2F;td&gt;&lt;td&gt;Remove Branch&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Unconditional Branch&lt;&#x2F;td&gt;&lt;td&gt;Merge block jumped to.&lt;&#x2F;td&gt;&lt;td&gt;Remove Branch&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Function Call&lt;&#x2F;td&gt;&lt;td&gt;Inline Function&lt;&#x2F;td&gt;&lt;td&gt;Remove Call&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Phi Node (only in -O1+)&lt;&#x2F;td&gt;&lt;td&gt;Update program dependencies to last defined variable&lt;&#x2F;td&gt;&lt;td&gt;Remove Phi Node&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Block merging transfers all of the instructions from a block into the entry block and removes the branch linking the blocks. Function inlining uses the built-in LLVM inline tool. Our algorithm is effectively recursive so control can be traced through branches nested in branches, functions in functions, branches in functions, etc.&lt;&#x2F;p&gt;
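&lt;p&gt;The block-merging part of the algorithm can be sketched on a toy control-flow graph. The Python below is a simplified model, not our pass: the dictionary-based CFG, the terminator tuples, and the name &lt;code&gt;build_trace&lt;&#x2F;code&gt; are assumptions for illustration. It builds a new instruction list instead of mutating the entry block, and it ignores function calls and Phi nodes.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def build_trace(blocks, entry, profile):
    # blocks maps a block name to (instructions, terminator).
    # A terminator is one of:
    #   ("br", target), ("condbr", if_true, if_false), ("ret",)
    # profile is the logged sequence of conditional branch outcomes.
    outcomes = iter(profile)
    trace = []
    current = entry
    while True:
        instructions, term = blocks[current]
        trace.extend(instructions)      # merge the block into the trace
        if term[0] == "ret":
            trace.append("ret")
            return trace
        if term[0] == "br":
            current = term[1]           # the branch itself is dropped
        else:
            taken = next(outcomes)      # consult the profile, drop the branch
            current = term[1] if taken else term[2]

cfg = {
    "entry": (["cmp = icmp sgt a, 3"], ("condbr", "then", "else")),
    "then": (["r = add a, b"], ("br", "exit")),
    "else": (["r = sub a, b"], ("br", "exit")),
    "exit": (["load r"], ("ret",)),
}
trace = build_trace(cfg, "entry", [True])
print(trace)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With a profile of &lt;code&gt;[True]&lt;&#x2F;code&gt; the trace contains only the instructions of the taken side, followed by the instructions of the exit block. The sketch assumes the profiled path reaches a return, which matches tracing a single loop iteration extracted into a function.&lt;&#x2F;p&gt;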
&lt;p&gt;We illustrate our pass with a simple test function shown below.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;example&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
  }
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
  } 
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The following multi-block LLVM IR is created at &lt;code&gt;-O0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;define dso_local i32 @example(i32, i32) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca i32, align 4
  store i32 %0, i32* %4, align 4
  store i32 %1, i32* %5, align 4
  %6 = load i32, i32* %4, align 4
  %7 = icmp sgt i32 %6, 3
  br i1 %7, label %8, label %12

8:                                                ; preds = %2
  %9 = load i32, i32* %4, align 4
  %10 = load i32, i32* %5, align 4
  %11 = add nsw i32 %9, %10
  store i32 %11, i32* %3, align 4
  br label %16

12:                                               ; preds = %2
  %13 = load i32, i32* %4, align 4
  %14 = load i32, i32* %5, align 4
  %15 = sub nsw i32 %13, %14
  store i32 %15, i32* %3, align 4
  br label %16

16:                                               ; preds = %12, %8
  %17 = load i32, i32* %3, align 4
  ret i32 %17
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After applying the trace generation pass (speculating block &lt;code&gt;%8&lt;&#x2F;code&gt; is taken), we produce the following LLVM IR.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;define dso_local i32 @example(i32, i32) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca i32, align 4
  store i32 %0, i32* %4, align 4
  store i32 %1, i32* %5, align 4
  %6 = load i32, i32* %4, align 4
  %7 = icmp sgt i32 %6, 3
  %8 = load i32, i32* %4, align 4
  %9 = load i32, i32* %5, align 4
  %10 = add nsw i32 %8, %9
  store i32 %10, i32* %3, align 4
  %11 = load i32, i32* %3, align 4
  ret i32 %11
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Phi Nodes should also be removed because there is no control flow to merge in the trace after block merging has occurred. When a Phi Node is removed, the single operand that exists should be forwarded to the uses of that Phi Node.&lt;&#x2F;p&gt;
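&lt;p&gt;Forwarding the single remaining operand of a Phi node can be sketched as follows. The tuple encoding of instructions and the name &lt;code&gt;remove_phis&lt;&#x2F;code&gt; are hypothetical, and the sketch assumes each Phi node in the trace has already been reduced to the one operand that comes from the merged predecessor.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def remove_phis(trace):
    # trace is a list of (dest, op, args) tuples in program order.
    replacement = {}
    result = []
    for dest, op, args in trace:
        # Rewrite operands that refer to an already removed Phi node.
        args = [replacement.get(arg, arg) for arg in args]
        if op == "phi":
            # Forward the single operand to all later uses and drop the Phi.
            replacement[dest] = args[0]
        else:
            result.append((dest, op, args))
    return result

trace_with_phi = [
    ("x1", "add", ["a", "b"]),
    ("x3", "phi", ["x1"]),
    ("r", "ret", ["x3"]),
]
cleaned = remove_phis(trace_with_phi)
print(cleaned)
```
&lt;&#x2F;pre&gt;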
&lt;p&gt;We do not remove basic blocks during the traversal described above. A block that was not taken by a conditional branch may still be jumped to later, so we cannot delete it right away. Once block merging and inlining are complete, we delete all dead basic blocks, that is, blocks that have no predecessors.&lt;&#x2F;p&gt;
&lt;p&gt;The traces generated take the place of the given function. They do not contain guards that check if the path speculation was correct. Thus they must be run in isolation as a single loop iteration.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We evaluate on a subset of benchmarks from &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;embench&#x2F;embench-iot&quot;&gt;Embench&lt;&#x2F;a&gt;. These are single-threaded benchmarks targeted at embedded systems.&lt;&#x2F;p&gt;
&lt;p&gt;We do not have an interpreter or language run time, so we could not inject trace code into the program or, importantly, handle failed traces. Additionally, there are no guards to hook into such a language run time even if it did exist. For these reasons, we do not evaluate on entire programs, but rather on small sections of the programs. Our methodology is shown below.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;&lt;&#x2F;th&gt;&lt;th&gt;Step&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;Manually extract the body of a &amp;quot;hot&amp;quot; loop into a function&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;Compile the function and run with the profiling pass&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;Compile the function with the trace generation pass&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;Run the generated trace&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;In certain scenarios, we removed some loops (set the iterators to their initial values) and unrolled fixed size loops when possible.&lt;&#x2F;p&gt;
&lt;p&gt;We evaluate on small code sections from four Embench benchmarks and one synthetic benchmark. The features of each code section are described below.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;Conditional Branches&lt;&#x2F;th&gt;&lt;th&gt;Function Calls&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;minver&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;aha-mont64&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;nbody&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;1 (built-in, don&#x27;t inline)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;ud&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;synthetic&lt;&#x2F;td&gt;&lt;td&gt;2 (nested)&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Our performance metric is the number of machine instructions generated for the observed function in the optimized program vs. the un-optimized program. We use &lt;code&gt;objdump -dC &amp;lt;binary&amp;gt;&lt;&#x2F;code&gt; and count the number of instructions in the function. We compiled both at &lt;code&gt;-O0&lt;&#x2F;code&gt;, but ran a dead code elimination (&lt;code&gt;--dce&lt;&#x2F;code&gt;) pass on both to eliminate instructions that are no longer useful in the trace (condition checks and initialization of unused variables).&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;Un-opt Machine Inst.&lt;&#x2F;th&gt;&lt;th&gt;Opt Machine Inst.&lt;&#x2F;th&gt;&lt;th&gt;Inst. Reduction (%)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;minver&lt;&#x2F;td&gt;&lt;td&gt;64&lt;&#x2F;td&gt;&lt;td&gt;33&lt;&#x2F;td&gt;&lt;td&gt;48&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;aha-mont64&lt;&#x2F;td&gt;&lt;td&gt;45&lt;&#x2F;td&gt;&lt;td&gt;34&lt;&#x2F;td&gt;&lt;td&gt;24&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;nbody&lt;&#x2F;td&gt;&lt;td&gt;130&lt;&#x2F;td&gt;&lt;td&gt;130&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;ud&lt;&#x2F;td&gt;&lt;td&gt;56&lt;&#x2F;td&gt;&lt;td&gt;50&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;synthetic&lt;&#x2F;td&gt;&lt;td&gt;39&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;69&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We see a reduction in the number of instructions generated for all programs with branches and inline-able functions.&lt;&#x2F;p&gt;
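&lt;p&gt;The reduction column follows directly from the two instruction counts. As a quick check, the Python sketch below recomputes it from the table, rounding to the nearest percent; the helper name &lt;code&gt;reduction_percent&lt;&#x2F;code&gt; is ours.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# (un-optimized, optimized) machine instruction counts from the table.
counts = {
    "minver": (64, 33),
    "aha-mont64": (45, 34),
    "nbody": (130, 130),
    "ud": (56, 50),
    "synthetic": (39, 12),
}

def reduction_percent(before, after):
    # Share of instructions removed, rounded to the nearest percent.
    return round(100 * (before - after) / before)

reductions = {name: reduction_percent(*pair) for name, pair in counts.items()}
print(reductions)
```
&lt;&#x2F;pre&gt;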
&lt;p&gt;We also executed the traces as standalone functions to check correctness. For each benchmark, we compared the output of a single iteration of the original loop with the output of the trace for the same input. In all cases, the original loop and the trace produced the same result.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Immix: a Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance</title>
                <pubDate>Mon, 11 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/immix/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/immix/</guid>
                <description>&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1375586&quot;&gt;&amp;quot;Immix: a mark-region garbage collector with space efficiency, fast collection, and mutator performance&amp;quot;&lt;&#x2F;a&gt; describes a new garbage collector &lt;em&gt;immix&lt;&#x2F;em&gt;, 
which combines two techniques: 
&lt;em&gt;mark-region&lt;&#x2F;em&gt; and &lt;em&gt;opportunistic&lt;&#x2F;em&gt; defragmentation. 
Through full-scale evaluations, 
the paper shows that immix can match the state of the art on three critical aspects: 
&lt;em&gt;space efficiency&lt;&#x2F;em&gt;, &lt;em&gt;fast collection&lt;&#x2F;em&gt;, and &lt;em&gt;mutator performance&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;the-imperfection-of-existing-tracing-collectors&quot;&gt;The Imperfection of Existing Tracing Collectors&lt;&#x2F;h1&gt;
&lt;p&gt;This paper focuses on three important demands on garbage collectors:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Space efficiency&lt;&#x2F;em&gt;: the less space overhead, the better.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Fast collection&lt;&#x2F;em&gt;: the faster the collector runs, the better.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Mutator performance&lt;&#x2F;em&gt;: the faster the mutator runs, the better.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This paper first investigates three existing canonical tracing collectors: 
&lt;em&gt;semi-space&lt;&#x2F;em&gt; (SS), &lt;em&gt;mark-sweep&lt;&#x2F;em&gt; (MS), and &lt;em&gt;mark-compact&lt;&#x2F;em&gt; (MC). &lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;001.png&quot; alt=&quot;001&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The above figure shows the performance tradeoffs for canonical collectors. 
It plots the geometric mean of total time, collection time, mutator time, 
and mutator cache misses as a function of the heap size, 
normalized to the best, for 20 DaCapo, SPECjvm98 and SPECjbb2000 benchmarks, 
and shows 99% confidence intervals. 
Figure (a) demonstrates that semi-space performs best when the heap size is large enough, 
but sacrifices space efficiency due to the 2x space overhead; 
Figure (b) shows that mark-compact performs poorly on collection time; 
Figures (c) and (d) indicate that mark-sweep cannot compete with the other two in terms of mutator performance.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, the primary goal of the new collector proposed by this paper is to satisfy all three demands at once.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;immix&quot;&gt;Immix&lt;&#x2F;h1&gt;
&lt;p&gt;Immix contains two components: 
a &lt;em&gt;mark-region&lt;&#x2F;em&gt; collector and an &lt;em&gt;opportunistic&lt;&#x2F;em&gt; defragmentation mechanism.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mark-region&quot;&gt;Mark-region&lt;&#x2F;h3&gt;
&lt;p&gt;A mark-region collector is similar to a mark-sweep collector, 
with two differences: 
(1) In mark-region, memory is divided into fixed-sized &lt;em&gt;regions&lt;&#x2F;em&gt; (pretty similar to the &amp;quot;regions&amp;quot; we discussed in &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=582421&quot;&gt;Reconsidering Custom Memory Allocation&lt;&#x2F;a&gt;), 
and objects may not span regions;
(2) Instead of maintaining a free list, 
which a mark-sweep collector allocates from and sweeps to, 
a mark-region collector bumps allocates into free regions, 
which are regions with no living objects, 
and reclaims free regions during collection.&lt;&#x2F;p&gt;
&lt;p&gt;A mark-region collector achieves better mutator performance than a mark-sweep collector because it allocates contiguously. 
However, it is space-inefficient, since a single live object can keep an entire region unavailable. 
Immix addresses this problem by 
(1) refining the mark-region collector to improve utilization within a region;
(2) introducing a defragmentation mechanism, 
which is presented in the next section.&lt;&#x2F;p&gt;
&lt;p&gt;Immix refines mark-region by operating at two levels: 
coarse-grained &lt;em&gt;blocks&lt;&#x2F;em&gt; and fine-grained &lt;em&gt;lines&lt;&#x2F;em&gt;. 
Compared to the original mark-region collector, 
the differences are: 
(1) objects may span lines (but not blocks); 
(2) during reclamation,
immix also identifies free lines in partially free blocks (formerly unavailable blocks); 
(3) during allocation, 
immix first tries to find contiguous free lines that are large enough for the allocation. 
If it cannot find suitable lines, 
it allocates into free blocks.&lt;&#x2F;p&gt;
&lt;p&gt;The paper elaborates on immix&#x27;s allocation process. 
The allocator is divided into two parts: 
a global allocator that maintains a pool of free blocks,
and a thread-local allocator that maintains free lines. &lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;002.png&quot; alt=&quot;002.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The figure above illustrates the basic immix heap organization during allocation. 
The allocator maintains two pointers (Bump Pointer Cursor and Bump Pointer Limit) that indicate the left and right bounds of the current hole (a run of contiguous free lines).
The right bound is either just before the next live line or the end of the block, since an object cannot span blocks. &lt;&#x2F;p&gt;
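&lt;p&gt;To make the cursor and limit idea concrete, here is a minimal Python sketch of bump allocation into the holes of a single block. It is only an illustration under stated assumptions: the names &lt;code&gt;find_holes&lt;&#x2F;code&gt; and &lt;code&gt;BumpAllocator&lt;&#x2F;code&gt;, the 128-byte line size, and the list-of-booleans line map are all made up here and are not taken from the paper or from MMTk.&lt;&#x2F;p&gt;

```python
# Hypothetical model: a block is a list of line marks (True means a live line).
LINE_SIZE = 128  # bytes per line; an assumed value for illustration


def find_holes(line_marks):
    """Return (start_line, end_line) pairs, one per run of free lines."""
    holes = []
    start = None
    for i, live in enumerate(line_marks):
        if not live and start is None:
            start = i                       # a hole begins here
        elif live and start is not None:
            holes.append((start, i))        # hole ends just before a live line
            start = None
    if start is not None:
        holes.append((start, len(line_marks)))  # hole runs to the block end
    return holes


class BumpAllocator:
    """Bump-allocates bytes into the holes of one block, left to right."""

    def __init__(self, line_marks):
        self.holes = find_holes(line_marks)
        self.cursor = 0   # left bound of the current hole (byte offset)
        self.limit = 0    # right bound of the current hole (byte offset)

    def alloc(self, size):
        """Return the byte offset of the new object, or None if nothing fits."""
        while True:
            if self.limit - self.cursor >= size:
                addr = self.cursor
                self.cursor += size         # bump the cursor past the object
                return addr
            if not self.holes:
                return None                 # a real allocator would fetch another block
            start, end = self.holes.pop(0)  # advance to the next hole
            self.cursor = start * LINE_SIZE
            self.limit = end * LINE_SIZE
```

&lt;p&gt;Note that this sketch simply abandons the rest of a hole when an object does not fit and moves on to the next one; what a real implementation does with the skipped space is outside the scope of this illustration.&lt;&#x2F;p&gt;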
&lt;h4 id=&quot;implementation-details&quot;&gt;Implementation Details&lt;&#x2F;h4&gt;
&lt;p&gt;The paper discusses many details of its implementation; 
here are some notable ones about mark-region.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Parallelism.&lt;&#x2F;strong&gt; Immix implementations are parallel. The thread-local allocators are mostly unsynchronized, while the global allocator is synchronized. The design therefore maximizes fast, unsynchronized thread-local activity and minimizes synchronized global activity. As a result, immix&#x27;s mark-region collector requires hardly any synchronization.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Demand Driven Overflow Allocation.&lt;&#x2F;strong&gt; Each immix allocator is paired with an &lt;em&gt;overflow allocator&lt;&#x2F;em&gt; that is also a contiguous allocator, but it always uses empty blocks. If immix cannot allocate a &lt;em&gt;medium&lt;&#x2F;em&gt; object (an object with size greater than a line) into a block that contains free lines, immix allocates it with the overflow allocator.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Block and Line Size.&lt;&#x2F;strong&gt; The block and line sizes are key parameters for immix. The authors roughly size blocks to match the page size of the operating system, and size lines to match the architecture&#x27;s cache line. Although this policy is meant to improve spatial locality, and therefore mutator performance, the evaluation shows performance to be insensitive to the choice of block and line size.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;opportunistic-evacuation&quot;&gt;Opportunistic Evacuation&lt;&#x2F;h3&gt;
&lt;p&gt;The mark-region collector, 
like a mark-sweep one, 
is non-moving and thus subject to fragmentation. 
Immix uses a defragmentation mechanism called &lt;em&gt;opportunistic evacuation&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Before evacuation can start, 
the collector needs to decide when to trigger defragmentation and how to select candidate blocks. 
These decisions are covered under the implementation details below; 
for now, suppose we already know the set of candidate blocks. &lt;&#x2F;p&gt;
&lt;p&gt;The evacuation process works as follows: 
when the collector encounters a live object in a candidate block, 
it &lt;em&gt;opportunistically&lt;&#x2F;em&gt; evacuates the object. 
That is, it evacuates the object only if it can find space for it: 
the collector allocates the object in the same way it allocates an object created by the mutator,
except that it doesn&#x27;t allocate into defragmentation candidate blocks. 
The collector then leaves a forwarding pointer that records the address of the new location.
When the collector later encounters references to a forwarded object, 
it replaces them with the new location.&lt;&#x2F;p&gt;
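&lt;p&gt;The process above can be sketched in a few lines of Python. This is a toy model, not the paper&#x27;s implementation: the &lt;code&gt;Obj&lt;&#x2F;code&gt; class, the &lt;code&gt;collect&lt;&#x2F;code&gt; function, and the &lt;code&gt;free_slots&lt;&#x2F;code&gt; dictionary (which stands in for the real allocator) are hypothetical names introduced for illustration, and the recursive traversal is chosen only for brevity.&lt;&#x2F;p&gt;

```python
class Obj:
    """Toy heap object; every field here is an assumption for illustration."""

    def __init__(self, name, block, refs=None, pinned=False):
        self.name = name
        self.block = block        # id of the block that holds the object
        self.refs = refs or []    # outgoing references
        self.pinned = pinned      # pinned objects are never moved
        self.forward = None       # forwarding pointer, set once evacuated
        self.marked = False


def collect(roots, candidates, free_slots):
    """Mark from the roots, evacuating live objects out of candidate blocks.

    free_slots maps a block id to the number of objects it can still take.
    Returns the updated list of roots.
    """
    def try_evacuate(obj):
        for block, slots in free_slots.items():
            if slots and block not in candidates:
                free_slots[block] = slots - 1
                copy = Obj(obj.name, block, obj.refs)
                obj.forward = copy        # leave a forwarding pointer behind
                return copy
        return obj                        # no space: keep the object in place

    def visit(obj):
        if obj.forward is not None:
            return obj.forward            # already moved: use the new location
        if obj.marked:
            return obj
        if obj.block in candidates and not obj.pinned:
            obj = try_evacuate(obj)
        obj.marked = True
        obj.refs = [visit(r) for r in obj.refs]   # fix up references
        return obj

    return [visit(r) for r in roots]
```

&lt;p&gt;The opportunism shows up in &lt;code&gt;try_evacuate&lt;&#x2F;code&gt;: when no space is available outside the candidate blocks, the object is simply marked where it is, so the collection still completes.&lt;&#x2F;p&gt;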
&lt;h4 id=&quot;implementation-details-1&quot;&gt;Implementation Details&lt;&#x2F;h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Defragmentation Trigger.&lt;&#x2F;strong&gt; If there are one or more partially free blocks that the allocator &amp;quot;did not use&amp;quot; or the previous collection didn&#x27;t free enough space, immix triggers defragmentation at the beginning of the current collection.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Candidate Selection.&lt;&#x2F;strong&gt; Immix selects as many candidate blocks as possible, preferring blocks with more holes. It maintains an array called the &lt;em&gt;available histogram&lt;&#x2F;em&gt; AH, where AH[i] reflects space occupied in all blocks with i holes. During selection, immix starts from the largest i and selects all blocks with i holes as candidates if the available space is not exhausted. It then decreases i and repeats this process until the available space is exhausted or i reaches 0.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Headroom.&lt;&#x2F;strong&gt; Headroom is a small number of free blocks that immix never returns to the global allocator and &lt;em&gt;only&lt;&#x2F;em&gt; uses for evacuating.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Pinning.&lt;&#x2F;strong&gt; In some situations, an application may request that an object not be moved, for example when programmers want to manually optimize locality, or when they need to interact with code that is not managed by the GC (e.g., external C code). Immix supports pinning by leveraging its opportunism: if the application pins an object, defragmentation never moves it.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;p&gt;This paper has put a lot of effort into evaluations.&lt;&#x2F;p&gt;
&lt;p&gt;The benchmarks contain SPECjvm98, the DaCapo suites, and pseudojbb2000 (a fixed workload version of SPEC jbb2000). 
They use three different hardware platforms, 
including one machine with a dual-core CPU:
Intel Core 2 Duo.
Most analyses focus on results from the Core 2 Duo using two processors.&lt;&#x2F;p&gt;
&lt;p&gt;Immix is implemented in the memory management toolkit (MMTk) in Jikes RVM.
The other algorithms evaluated in this paper are also implemented in MMTk.&lt;&#x2F;p&gt;
&lt;p&gt;The first set of evaluations compares immix to mark-sweep, semi-space, and mark-compact.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;003.png&quot; alt=&quot;003.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The figure above shows performance as the geometric mean for all benchmarks. 
(a) - (c) show that immix outperforms the others consistently on all three architectures. 
(d) and (e) demonstrate mutator time and L2 cache misses, 
where immix matches the performance of semi-space and mark-compact. 
Finally, (f) shows that immix performs as well as mark-sweep in terms of collection time. 
In addition, the chart below illustrates the minimum heap sizes needed to run each algorithm. 
Immix is 14% more space-efficient than mark-sweep on average and is close to mark-compact.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;004.png&quot; alt=&quot;004.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The second set of evaluations examines the performance of immix in a composite generational collector. 
(G|A-B) represents a generational collector that performs A in nursery space and B in mature space.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;005.png&quot; alt=&quot;005.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The figure above shows the total performance of mark-sweep and the three generational semi-space composites. 
Immix&#x27;s performance is consistently among the best. 
There is also a fairly large gap between the traditional tracing algorithms and their generational composites.&lt;&#x2F;p&gt;
&lt;p&gt;The authors also evaluated the performance sensitivity of immix. &lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;.&#x2F;006.png&quot; alt=&quot;006.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The table above shows the performance of immix when certain features are disabled, 
or parameters are changed. 
Here I want to focus on the algorithm features.
The four columns under algorithm features represent performance for mark-region collectors that provide only block marking (Block);
block and line marking with overflow allocation, 
but without defragmentation (Line); 
everything but headroom (No HR); 
everything but overflow allocation. 
It is interesting to me that disabling headroom affects the performance a lot while disabling overflow allocation doesn&#x27;t matter too much.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;conclusion-and-thoughts&quot;&gt;Conclusion and Thoughts&lt;&#x2F;h1&gt;
&lt;p&gt;This paper introduces mark-region, a new family of garbage collectors, 
and opportunistic evacuation, 
a lightweight defragmentation mechanism. 
It combines them in immix and shows that it matches or beats existing canonical garbage collectors by presenting a full-scale evaluation and detailed analysis.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;I think this paper puts a lot of effort into the details of immix&#x27;s implementation, which is good. However, I see a lot of optimizations applied to the original design: for example, adding an &lt;em&gt;overflow allocator&lt;&#x2F;em&gt; that always uses free blocks, and adding &lt;em&gt;headroom&lt;&#x2F;em&gt; that is reserved for evacuation only. Since some of them matter a lot to performance (judging by the last table I showed), it seems questionable to me whether this algorithm would still perform better if we stripped these optimizations off. Or maybe all production collectors tweak a lot?&lt;&#x2F;li&gt;
&lt;li&gt;This paper doesn&#x27;t talk about pause times. Maybe the reason is that all the algorithms here are stop-the-world, and techniques for improving pause times, such as incremental or concurrent collection, are orthogonal to the topic.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;influence&quot;&gt;Influence&lt;&#x2F;h1&gt;
&lt;p&gt;I wanted to find out any existing industrial instance of this algorithm, 
so I did a little survey on several popular languages&#x27; compilers to get a sense of the state of modern garbage collection.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Python&lt;&#x2F;em&gt; (CPython) and &lt;a href=&quot;https:&#x2F;&#x2F;www.php.net&#x2F;manual&#x2F;en&#x2F;features.gc.php&quot;&gt;&lt;em&gt;PHP&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. They are using reference counting plus some cycle detection mechanism.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.swift.org&#x2F;swift-book&#x2F;LanguageGuide&#x2F;AutomaticReferenceCounting.html&quot;&gt;&lt;em&gt;Swift&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. It also uses reference counting, but no cycle detection. Programmers are supposed to leverage weak references correctly to get rid of cycles by themselves. Wow.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.microsoft.com&#x2F;en-us&#x2F;dotnet&#x2F;standard&#x2F;garbage-collection&#x2F;fundamentals?redirectedfrom=MSDN&quot;&gt;&lt;em&gt;C#&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. Concurrent generational mark-compact.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;JavaScript&lt;&#x2F;em&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;blog&#x2F;trash-talk&quot;&gt;V8&lt;&#x2F;a&gt;). Parallel concurrent incremental generational semi-space+mark-compact, with idle time GC. &lt;a href=&quot;https:&#x2F;&#x2F;webkit.org&#x2F;blog&#x2F;7122&#x2F;introducing-riptide-webkits-retreating-wavefront-concurrent-garbage-collector&#x2F;&quot;&gt;Webkit&#x27;s GC&lt;&#x2F;a&gt; seems not related to regions, either.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Haskell&lt;&#x2F;em&gt; (&lt;a href=&quot;https:&#x2F;&#x2F;wiki.haskell.org&#x2F;GHC&#x2F;Memory_Management&quot;&gt;GHC&lt;&#x2F;a&gt;). Parallel generational GC, from a paper published by MSR in 2008.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Java&lt;&#x2F;em&gt;. OpenJDK has four GC options. And the newest one (&lt;a href=&quot;https:&#x2F;&#x2F;www.oracle.com&#x2F;technetwork&#x2F;tutorials&#x2F;tutorials-1876574.html&quot;&gt;G1&lt;&#x2F;a&gt;) is a concurrent parallel mark-region (!) collector with evacuate defragmentation (!). However, it doesn&#x27;t maintain finer-grained regions.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;So there is a state-of-the-art compiler which is applying a similar idea. If it is inspired by this paper, that&#x27;s a great impact!&lt;&#x2F;p&gt;
&lt;p&gt;In addition, more industrial information related to immix (thanks to Adrian!):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;An in-progress immix implementation &lt;a href=&quot;https:&#x2F;&#x2F;gitlab.haskell.org&#x2F;ghc&#x2F;ghc&#x2F;wikis&#x2F;commentary&#x2F;rts&#x2F;storage&#x2F;gc&#x2F;immix&quot;&gt;for GHC&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;A Haxe compiler with &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;HaxeFoundation&#x2F;hxcpp&#x2F;blob&#x2F;master&#x2F;src&#x2F;hx&#x2F;gc&#x2F;Immix.cpp&quot;&gt;an immix implementation&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Scala-Native apparently &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;scala-native&#x2F;immix&quot;&gt;uses immix&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;A &lt;em&gt;proposal&lt;&#x2F;em&gt; to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;crystal-lang&#x2F;crystal&#x2F;issues&#x2F;5271&quot;&gt;use immix in Crystal&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;questions&quot;&gt;Questions&lt;&#x2F;h1&gt;
&lt;ul&gt;
&lt;li&gt;How would you predict the performance when immix keeps everything but lines? &lt;&#x2F;li&gt;
&lt;li&gt;How do you like the organization&#x2F;design of the evaluation? (I actually found it quite overwhelming.)&lt;&#x2F;li&gt;
&lt;li&gt;This paper leaves concurrency to future work. Do you think it is easy to make immix concurrent?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
            </item>
        
            <item>
                <title>A Unified Theory of Garbage Collection</title>
                <pubDate>Fri, 08 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/unified-theory-gc/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/unified-theory-gc/</guid>
                <description>&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;p&gt;Tracing and reference counting are normally viewed as the two main, completely different approaches to garbage collection. However, in &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1028982&quot;&gt;A Unified Theory of Garbage Collection&lt;&#x2F;a&gt;, Bacon et al. showed tracing and reference counting to be duals of one another, and
that all garbage collectors are various types of hybrids of tracing and reference counting. Intuitively, tracing is tracking the live objects while reference counting is tracking dead objects.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;Broadly speaking, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to free the memory blocks occupied by objects that are no longer in use by the program. It relieves programmers from the burden of explicitly freeing allocated memory. Moreover, it also serves as part of the security strategy of languages like Java: in the Java virtual machine programmers are unable to accidentally (or purposely) crash the machine by incorrectly freeing memory. The opposite is manual memory management, which is available in C&#x2F;C++. This gives the maximum freedom for programmers and avoids the potential overhead that affects program performance.&lt;&#x2F;p&gt;
&lt;p&gt;The task garbage collection needs to solve is identifying the objects that the program can no longer reach in the reference graph. It then frees the unreachable objects and sometimes rearranges memory to reduce heap fragmentation. &lt;&#x2F;p&gt;
&lt;p&gt;The most traditional approaches are tracing and reference counting:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Tracing: Recursively mark reachability by starting from a set of root memory
blocks that are in use (e.g., pointed to by global variables or local
variables currently in stack frames).&lt;&#x2F;li&gt;
&lt;li&gt;Reference Counting: Count the number of pointers pointing to one particular
object by bookkeeping it every time a pointer is created or modified. It frees
the object when the counter decreases to zero.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These two approaches have a lot of differences:&lt;&#x2F;p&gt;
&lt;img src=&quot;diff.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;Although tracing naturally solves the reachability problem accurately, it requires traversing over a static graph and therefore suspending the whole program. On the other hand, reference counting is done incrementally along with each pointer assignment. However, it brings unnecessary overhead when the pointers are changed often and it does not collect cycles of garbage. Thus people proposed more complicated algorithms based on different hypotheses, such as deferred reference counting, generational garbage collection, etc.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tracing-reference-counting-are-duals&quot;&gt;Tracing &amp;amp; Reference Counting are Duals&lt;&#x2F;h2&gt;
&lt;p&gt;At a high level, tracing is tracking &amp;quot;matter&amp;quot; -- all reachable objects, while reference counting is tracking &amp;quot;anti-matter&amp;quot; -- all unreachable objects. Their connection is further revealed when we align them by removing certain &amp;quot;optimizations&amp;quot;. We can consider a version of tracing that computes the number of incoming edges from roots or live objects instead of a single bit, and a version of reference counting that postpones the decrements to be processed in batches. If the graph contains no cycles, both methods converge to tagging the same value for each object. Tracing achieves this by setting the value to zero and increasing it recursively, while reference counting starts from an upper bound and decrements it recursively. &lt;&#x2F;p&gt;
&lt;p&gt;To formalize this connection, we first define the value they converge to mathematically, then align their algorithmic structures.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mathematical-model&quot;&gt;Mathematical Model&lt;&#x2F;h3&gt;
&lt;p&gt;In order to analyze and compare different garbage collection strategies, the
paper presents a mathematical model of the memory management problem that
garbage collection is trying to solve. Objects in memory and the pointers that
they contain are modeled as the nodes and edges of a graph, respectively. The
set of objects is denoted as $V$, and the multiset of pointers between objects
is $E$. An object should not be freed if it will be used in the future. A
conservative approximation of this, without any program analysis, is that an
object might be used in the future if there exists a path of pointers to the
object which originates from the stack or from a register. We call the starting
points of such paths (i.e., all objects to which there is a direct pointer on
the stack or in a register) the roots of the graph, which make up the multiset
$R$.&lt;&#x2F;p&gt;
&lt;p&gt;Using these definitions, we can formulate the reference counts of objects
(denoted $\rho(v)$ for $v \in V$) as a fixed point of the following equation:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \rho(v) = \big|[v : v \in R]\big| +
\big|[(w, v) : (w, v) \in E \land \rho(w) &amp;gt; 0]\big| $$&lt;&#x2F;p&gt;
&lt;p&gt;Here we recursively define the reference count of an object $v$ to be the number
of root pointers to $v$ plus the number of pointers to $v$ from objects which
themselves have non-zero reference counts. Any object whose reference count is
zero according to a fixed point of $\rho$ can be freed, as there is no way for the
program to reference it in the future.&lt;&#x2F;p&gt;
&lt;p&gt;However, it is important to note that there could be multiple fixed points to
this equation, namely in the presence of cyclic garbage. If object $a$ points to
object $b$, and vice versa, but neither is a root and neither is pointed to from
elsewhere, then either $\rho(a) = \rho(b) = 1$ or $\rho(a) = \rho(b) = 0$ is a
valid solution to the fixed-point equation. Ideally, a garbage collection
algorithm will find the least fixed point of $\rho$, meaning that it will
consider all cyclic garbage as able to be freed. A tracing collector does this,
whereas a reference counting collector does not detect any cyclic garbage, and
thus finds the greatest fixed point.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;alignments-of-algorithmic-structures&quot;&gt;Alignments of Algorithmic Structures&lt;&#x2F;h3&gt;
&lt;p&gt;As modeled in the paper, a tracing garbage collector fundamentally works by
performing a traversal on the memory graph, starting from the root objects in
$R$. Every time it reaches an object, it increments its reference count. And,
when it reaches an object for the first time, it traverses the pointers from
that object. In order to complete the traversal, the algorithm maintains a work
list, $W$, of objects that it needs to visit. This traversal recomputes all
reference counts completely each time it runs. When it terminates, every live
object will have a positive reference count, and every dead object will have a
reference count of zero.&lt;&#x2F;p&gt;
&lt;p&gt;This formulation of the algorithm differs in a few ways from a standard tracing
collector. For example, there is no need in a real tracing collector to keep
track of the full reference count of each object; it is sufficient to keep just
a single mark bit which determines whether the reference count is non-zero. A
tracing collector might also employ other optimizations, such as &lt;a href=&quot;http:&#x2F;&#x2F;www-inst.eecs.berkeley.edu&#x2F;%7Ecs61bl&#x2F;r&#x2F;&#x2F;cur&#x2F;graphs&#x2F;garbage-collection-no-fringe.html&quot;&gt;pointer
reversal&lt;&#x2F;a&gt;, to improve its run time or memory usage. However, these
optimizations do not significantly change the core ideas of the algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;Reference counting collection differs from tracing collection in that the
reference counts persist over time and are computed iteratively. The paper
models reference counting collectors to keep a work list, just as tracing
collectors do. Every time a pointer is stored somewhere, the reference count of
the object it points to is immediately incremented. Every time a pointer is
overwritten or erased, the object it pointed to is added to the work list.
Then, when the collection algorithm runs, the work list
is iterated through, decrementing the reference count of each object in the
list, and, if this causes the reference count to become zero, adding any objects
that are pointed to by that object to the work list. When this algorithm
terminates, every live object will still have a positive reference count, and
the reference counts of dead objects could be zero (depending on if they are
part of a garbage cycle).&lt;&#x2F;p&gt;
&lt;p&gt;Real implementations of reference counting collection will typically not have a
work list. Instead, they will immediately decrement the reference count whenever
a pointer is overwritten, and, if the reference count becomes zero, recursively
decrement the reference count of anything that is pointed to by that object. The
algorithm is presented in this work-list format in order to more clearly
represent its relationship to tracing garbage collection. Formulated in this
way, if we look at the code for the central work-list algorithms for each type
of garbage collector, we can see striking similarities between them:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;tracing-wl.png&quot; style=&quot;width:100%&quot;&gt;&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;rc-wl.png&quot; style=&quot;width:100%&quot;&gt;&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;Graph traversal algorithm for tracing&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Work-list algorithm for RC&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
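&lt;p&gt;For readers who prefer code to figures, here is a rough Python rendering of the two work-list loops as they are described in this post. The dictionary-based graph and the function names are assumptions made for illustration, and the sweep phase is left out.&lt;&#x2F;p&gt;

```python
def scan_by_tracing(roots, edges):
    """Recompute every reference count from scratch, starting at the roots."""
    rho = {v: 0 for v in edges}
    work = list(roots)
    while work:
        w = work.pop()
        rho[w] += 1
        if rho[w] == 1:              # first visit: follow outgoing pointers
            work.extend(edges[w])
    return rho


def scan_by_counting(rho, work, edges):
    """Apply deferred decrements to existing reference counts, in place."""
    work = list(work)
    while work:
        w = work.pop()
        rho[w] -= 1
        if rho[w] == 0:              # object died: release what it points to
            work.extend(edges[w])
    return rho


# A tiny heap: a points to b, while c and d point to each other.
edges = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}

# Tracing from the single root a never reaches the cycle, so c and d get 0.
traced = scan_by_tracing(["a"], edges)

# Reference counting: the roots used to point to a and c, then the root
# pointer to c was erased, so c is on the work list.
counted = scan_by_counting({"a": 1, "b": 1, "c": 2, "d": 1}, ["c"], edges)
```

&lt;p&gt;In this example tracing assigns a count of zero to both &lt;code&gt;c&lt;&#x2F;code&gt; and &lt;code&gt;d&lt;&#x2F;code&gt;, whereas reference counting leaves both at one: the garbage cycle survives, which is the greatest-versus-least fixed point difference from the previous section.&lt;&#x2F;p&gt;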
&lt;p&gt;The only differences between the two are that tracing increments the reference
counts of objects on the worklist whereas RC decrements them, and that tracing
checks if $\rho(w) = 1$ for when to add the objects that $w$ points to whereas
RC checks if $\rho(w) = 0$. The two garbage collection strategies also differ in
how $W$ is initialized. For tracing, $W$ initially contains all of the roots of
the graph, $R$. For reference counting, $W$ starts with all of the objects that
have had a pointer to them erased since the last time the algorithm ran. By
viewing the two strategies through this formulation, we can see that they are
opposites of each other in many ways:&lt;&#x2F;p&gt;
&lt;img src=&quot;diff-v2.png&quot; style=&quot;width:100%&quot;&gt;
&lt;p&gt;As discussed in the mathematical model section, reference counting requires an extra pass to collect cycles. This is generally done with one of two strategies: backup tracing, which traces (part of) the heap occasionally, and trial deletion, which attempts to decrease $\rho(w)$ by trial-and-error guided by heuristics. However, notice that there is also a counterpart for that in tracing: the sweeping phase. In addition, while tracing converges to the fixed point value starting from the lower bound of all reference counts at 0, reference counting starts from the upper bound, which contains all incoming pointers that existed at the time of the previous collection. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;hybrids&quot;&gt;Hybrids&lt;&#x2F;h2&gt;
&lt;p&gt;The authors further show that all realistic garbage collectors are in fact hybrids of tracing and reference counting. In general, we can categorize collectors into unified heap collectors, split heap collectors, and multi-heap collectors. Different garbage collectors can then be seen as performing tracing or reference counting when tracking references within each region and across regions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unified-heap-collectors-deferred-reference-counting-partial-tracing&quot;&gt;Unified Heap collectors: Deferred Reference Counting &amp;amp; Partial Tracing&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;deferred.png&quot; alt=&quot;Snow&quot; style=&quot;width:100%&quot;&gt;&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;partial.png&quot; alt=&quot;Snow&quot; style=&quot;width:100%&quot;&gt;&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;Deferred Reference Counting&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Partial Tracing&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Rather than doing reference counting completely, Deferred Reference Counting defers updating the reference counts of objects pointed to directly by roots until batch processing. This is based on the observation that pointers from roots are likely to change very often as they are directly used in the program. Notice that we can view this as tracing from roots to their targets and reference counting for the intra-heap pointers: All the assignments that lead to intra-heap pointer changes would be tracked by reference counting as normal. When we suspend the program, we trace the roots for one level, which compensates for the delay.&lt;&#x2F;p&gt;
&lt;p&gt;Conversely, we could design Partial Tracing, which uses reference counting for edges from roots to the heap while tracing the intra-heap pointers. However, this combines the worst properties of both tracing and reference counting: it suffers from the high mutation cost of fast-changing root pointers while still needing to spend a long time tracing the heap. This design failure demonstrates that although tracing and reference counting are duals, they are not equally well suited to every case. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;split-heap-and-multi-heap-collectors&quot;&gt;Split-Heap and Multi-Heap Collectors&lt;&#x2F;h3&gt;
&lt;p&gt;The most common split-heap collector architecture is the generational collector. Generational collectors are based on the empirical observation that most objects are short lived, as shown in the left figure below. Here the Y axis shows the number of bytes allocated and the X axis shows the number of bytes allocated over time.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;ObjectLifetime.gif&quot; alt=&quot;&quot; style=&quot;width:100%&quot;&gt;&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;&lt;img src=&quot;gen.png&quot; alt=&quot;&quot; style=&quot;width:100%&quot;&gt;&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.oracle.com&#x2F;webfolder&#x2F;technetwork&#x2F;tutorials&#x2F;obe&#x2F;java&#x2F;gc01&#x2F;index.html&quot;&gt;Most objects are short lived&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Generational collectors&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;So a generational collector separates a nursery space from the remaining mature space. Most of the time it collects garbage only from the nursery space and moves the surviving objects to the mature space (minor collections). Once in a while it performs a garbage collection across the whole heap to clean the mature space (major collections). An example of generational collectors is available in &lt;a href=&quot;https:&#x2F;&#x2F;blogs.msdn.microsoft.com&#x2F;abhinaba&#x2F;2009&#x2F;03&#x2F;02&#x2F;back-to-basics-generational-garbage-collection&#x2F;&quot;&gt;this blog&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This process can also be seen as a combination of tracing and reference counting as shown by (a) in the right figure: Reference counting is performed to track edges from mature space to nursery space. Tracing is performed within nursery space during minor collections. And finally a full tracing is performed during major collections.&lt;&#x2F;p&gt;
&lt;p&gt;We can then explore different combinations of tracing and reference counting within each space. However, notice that reference counting is always used for the edges from mature to nursery space in order to avoid visiting all objects in the mature space. In fact, the authors claim that any algorithm
that collects some subset of objects independently is fundamentally making use of reference counting. &lt;&#x2F;p&gt;
&lt;p&gt;This conclusion also applies to multi-heap collectors, which split the heap into multiple regions and collect each region individually. The regions can be treated asymmetrically (as in generational collectors) or symmetrically. One of the basic extensions of generational collectors is the Train algorithm. The Train algorithm extends the mature space to multiple regions of different &amp;quot;generations&amp;quot;, where only one region is collected at a time. This is mainly to reduce the pause time for collecting the mature space. Specifically, it divides the Mature Object Space (MOS) into cars of fixed size, which are chained to form trains. Each time, the current youngest car is collected, until the whole first train is collected. We assume that this process then moves on to the next train, but it is unclear to us why we need several trains instead of one train with all cars.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cost-analysis&quot;&gt;Cost Analysis&lt;&#x2F;h2&gt;
&lt;p&gt;The paper also presents an abstract mathematical representation for analyzing
and comparing the space and time costs of different garbage collectors. In order
to represent the memory that a garbage collector requires, they define
$\sigma(X)$, for any garbage collector $X$, to be the space overhead of $X$ in
units of the size of a single object. For several of the garbage collectors
discussed in the paper, they give equations or approximations of $\sigma$. As an
example, for a tracing collector, $\sigma(T) \simeq \frac{\mathcal{M}}{\omega} +
\mathcal{D}$, where $\mathcal{M}$ is the capacity of memory in units of the
object size, $\omega$ is the size of an object in bits, and $\mathcal{D}$ is the
space cost of the traversal stack, proportional to the traversal depth. This
model represents a version of tracing collection without pointer reversal, and
hence includes the traversal stack. If pointer reversal were considered,
$\mathcal{D}$ could be removed from the space equation, but this would also
increase the time cost of the collector due to the additional graph traversal
required.&lt;&#x2F;p&gt;
&lt;p&gt;They define $\tau(X)$ to be the time overhead of a garbage collector. For each
collector, they define $\tau$ in terms of a linear function of various
properties of the program, omitting the constant factors that could vary between
implementations. In general, they define the overhead as $\tau(X) =
\phi(X)\kappa(X) + \mu(X)$, where $\phi$ is the frequency at which collection
occurs, $\kappa$ is the run time of a single garbage collection execution, and
$\mu$ is the overhead for mutation in the program (e.g., incrementing and
decrementing reference counts). Each of these terms depends on the type of
garbage collection, as well as the program that the collection is running on,
and is defined more specifically in the paper for several collection strategies.&lt;&#x2F;p&gt;
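&lt;p&gt;To make the shape of these formulas concrete, here is a small Python sketch that simply evaluates them. The function names and every number below are made up for illustration; they do not come from the paper, and the sketch says nothing about how $\phi$, $\kappa$, or $\mu$ are derived for a particular collector.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def tracing_space_overhead(memory_objects, object_bits, traversal_depth):
    # sigma(T) ~ M / omega + D, measured in units of one object
    return memory_objects / object_bits + traversal_depth


def time_overhead(frequency, collection_cost, mutation_cost):
    # tau(X) = phi(X) * kappa(X) + mu(X)
    return frequency * collection_cost + mutation_cost


# Hypothetical workload: M = 1024 objects, omega = 64 bits, D = 10.
sigma = tracing_space_overhead(1024, 64, 10)

# Hypothetical collector: phi = 5, kappa = 200, mu = 300.
tau = time_overhead(5, 200, 300)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With these invented numbers, the space overhead is 26 object-sized units and the time overhead is 1300. Doubling the collection frequency would double only the first term of $\tau$, which is the kind of sensitivity the abstract model is meant to expose.&lt;&#x2F;p&gt;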
&lt;p&gt;These equations for the time and space costs of various collectors in terms of
parameters of the workload enable the comparison of different collectors to
determine how well-suited they are for certain mutator programs and constraints.
While not giving the exact formulations for these properties, which would be
difficult to do in general, the abstracted equations give an idea of how the
space and time of a collector vary with their parameters, and thus what the
important constraints of each are.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;By better understanding the shared structure and duality of tracing and
reference counting collectors, we can consider the design of a high-performance
garbage collector as a balance between the two base strategies. And, by modeling
the time and space costs of different collectors, we can formally analyze the
trade-offs between different collection strategies for a specific workload or in
general. The paper suggests that the design of a garbage collector involves
three major decisions: how to partition memory, how to traverse memory, and what
space-time trade-offs should be used. The heap can be divided into multiple
segments, as with generational and multi-heap collectors, or can be collected
uniformly. Each component of the heap, as well as the roots of the graph, can
then be traversed and collected through either tracing or reference counting.
And within the implementation of the two strategies there are other space-time
trade-offs to consider as well, such as whether to employ pointer reversal.&lt;&#x2F;p&gt;
&lt;p&gt;In general, this paper provides a beautiful perspective that places garbage collectors on a spectrum between tracing and reference counting. It has shaped how garbage collectors have been viewed since, as well as the development of other unified theories in systems.&lt;&#x2F;p&gt;
&lt;p&gt;However, the paper also ignores several aspects. The authors are concerned only with identifying unreachable objects correctly, with high performance in terms of speed and space usage. This assumes that malloc can always do a perfect job of avoiding memory fragmentation. But some garbage collectors are also designed to compact objects (e.g., copying garbage collectors).&lt;&#x2F;p&gt;
&lt;p&gt;In addition, this formulation fundamentally ignores reachable memory leaks. Although the memory manager can recover unreachable memory, it cannot free memory that is still reachable and therefore potentially still useful. Modern memory managers therefore provide techniques for programmers to semantically mark memory with varying levels of usefulness, which correspond to varying levels of reachability.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, there are also other methods that do automatic memory management partially at or before compile time: &lt;a href=&quot;http:&#x2F;&#x2F;drops.dagstuhl.de&#x2F;opus&#x2F;volltexte&#x2F;2019&#x2F;10802&#x2F;pdf&#x2F;LIPIcs-ECOOP-2019-10.pdf&quot;&gt;abstract garbage collection&lt;&#x2F;a&gt;, which soundly over-approximates the behaviour of a concrete
interpreter, and &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1993493&quot;&gt;self-collecting mutators&lt;&#x2F;a&gt;, which impose extra invariants to help garbage collection.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Reconsidering Custom Memory Allocation</title>
                <pubDate>Sat, 02 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/custom-alloc/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/custom-alloc/</guid>
                <description>&lt;p&gt;In this post, I review &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=582421&quot;&gt;&amp;quot;Reconsidering Custom Memory Allocation&amp;quot;&lt;&#x2F;a&gt;, a 2002 paper which (1) argues against the then-pervasive wisdom that writing custom memory allocators gives you performance boosts, and (2) introduces a data structure called a &amp;quot;reap&amp;quot; to do hybrid region &#x2F; heap memory management. All images in this post are taken directly from the paper, or &lt;a href=&quot;https:&#x2F;&#x2F;people.cs.umass.edu&#x2F;%7Eemery&#x2F;talks&#x2F;OOPSLA-2002.ppt&quot;&gt;Emery&#x27;s slides&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;background-on-custom-allocation&quot;&gt;Background on Custom Allocation&lt;&#x2F;h1&gt;
&lt;p&gt;Memory is necessary, but allocating and freeing it takes time, and the general purpose memory management offered by C may not be optimal for your purposes. You probably know much more about the memory patterns of your application than the designers of the default memory allocator, and might be able to exploit this to save resources. Moreover, there is an opportunity to save a lot of time, as up to 40% (and 17% on average) of a program&#x27;s run time can be spent allocating and de-allocating memory&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;custom-alloc&#x2F;potential.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;All of this makes writing custom memory allocators very appealing.&lt;&#x2F;p&gt;
&lt;!--In C++, this amounts to overloading the `new` and `delete` operators. --&gt;
&lt;h4 id=&quot;saving-time-with-custom-allocation&quot;&gt;Saving Time with Custom Allocation&lt;&#x2F;h4&gt;
&lt;p&gt;For instance, suppose you have a large number of objects which you know you will eventually create, and have some upper bound on how many you&#x27;ll eventually get. Suppose further that you know there are relatively few objects that are freed during the first phase of the program, which takes the most time. By the second phase, which you know executes quickly, you don&#x27;t need any of the objects anymore. Now, a custom memory allocator can:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Reduce the allocation time, by using only a single call to malloc, and getting enough space for them all at once (even though you might not know the exact number, or use them all at the same time, and you can&#x27;t create them all yet).&lt;&#x2F;li&gt;
&lt;li&gt;Save time with free calls: you can free them all at once at the end, rather than freeing each object individually (which could still happen regularly) during phase one of the program when all of its references go away.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Allocating memory in this way would in fact be creating a &lt;em&gt;region&lt;&#x2F;em&gt; allocator: you allocate one large chunk of memory at first, and then increment a pointer into it; the entire region must then be freed all at one time.&lt;&#x2F;p&gt;
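&lt;p&gt;As a rough illustration of the idea, here is a minimal Python sketch of a region allocator. It is a toy model, not code from the paper: the names &lt;code&gt;Region&lt;&#x2F;code&gt;, &lt;code&gt;alloc&lt;&#x2F;code&gt;, and &lt;code&gt;free_all&lt;&#x2F;code&gt; are hypothetical, and integer offsets into an imaginary chunk stand in for real pointers.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class Region:
    # Toy region allocator over one chunk of `capacity` bytes.

    def __init__(self, capacity):
        self.capacity = capacity  # stands in for a single up-front malloc
        self.top = 0              # bump pointer: next unused offset

    def alloc(self, size):
        if self.top + size > self.capacity:
            raise MemoryError
        offset = self.top
        self.top += size          # allocating is just a pointer increment
        return offset

    def free_all(self):
        # There is no per-object free: the whole region goes at once.
        self.top = 0


region = Region(capacity=40)
x1 = region.alloc(8)
x2 = region.alloc(8)
x3 = region.alloc(16)
x4 = region.alloc(8)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The four allocations land at offsets 0, 8, 16, and 32, and none of that space can be reused until &lt;code&gt;free_all&lt;&#x2F;code&gt; resets the pointer.&lt;&#x2F;p&gt;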
&lt;h4 id=&quot;saving-space-with-custom-allocation&quot;&gt;Saving Space with Custom Allocation&lt;&#x2F;h4&gt;
&lt;p&gt;The example above actually uses more memory than a general-purpose heap would. That is the more common case, but custom allocators can also be used to reduce the memory footprint, primarily by preventing fragmentation. For instance, if memory fragmentation is a huge issue, but you know roughly what fraction of your memory is used by objects of different sizes, you can partition it into pieces, and deal with each size class separately, which offers much better guarantees about the worst and average case fragmentation. This is commonly done in practice; &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1902.04738.pdf&quot;&gt;Mesh&lt;&#x2F;a&gt;, for instance, requires that all pages in question are of the same size class. &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=582421&quot;&gt;The paper we analyze&lt;&#x2F;a&gt; refers to allocators like these as &amp;quot;per-class custom allocators&amp;quot;.&lt;&#x2F;p&gt;
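&lt;p&gt;One way to picture partitioning by size class is the following Python sketch. It is only an illustration under simplifying assumptions: the class names, the chosen size classes, and the slot counts are all hypothetical, and slot indices stand in for addresses. Because every slot in a pool has the same size, any freed slot can serve any later request in that class.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class SizeClassPool:
    # A fixed pool of equally sized slots for one size class.

    def __init__(self, slot_size, slot_count):
        self.slot_size = slot_size
        # Indices of free slots, ordered so that slot 0 is handed out first.
        self.free_slots = list(range(slot_count - 1, -1, -1))

    def alloc(self):
        if not self.free_slots:
            raise MemoryError
        return self.free_slots.pop()

    def free(self, slot):
        self.free_slots.append(slot)


class PerClassAllocator:
    def __init__(self, slots_per_class):
        # slots_per_class maps a slot size in bytes to a slot count.
        self.pools = {size: SizeClassPool(size, count)
                      for size, count in sorted(slots_per_class.items())}

    def alloc(self, request):
        # Use the smallest size class that can hold the request.
        for size, pool in self.pools.items():
            if size >= request:
                return size, pool.alloc()
        raise ValueError

    def free(self, size, slot):
        self.pools[size].free(slot)


allocator = PerClassAllocator({16: 4, 64: 2})
a = allocator.alloc(10)   # served from the 16-byte class
b = allocator.alloc(16)   # also the 16-byte class
c = allocator.alloc(40)   # too big for 16, served from the 64-byte class
allocator.free(*a)
d = allocator.alloc(12)   # reuses the slot that was just freed
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The price of this simplicity shows up when the guess about the mix of sizes is wrong: in this sketch a full 16-byte pool raises an error even if the 64-byte pool is empty.&lt;&#x2F;p&gt;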
&lt;p&gt;For all of these reasons, in 2002, it was common practice and widely considered a good idea to write a custom memory allocator for your program to improve performance.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-drawbacks-to-custom-allocation&quot;&gt;The Drawbacks to Custom Allocation&lt;&#x2F;h2&gt;
&lt;p&gt;Of course, even without any experiments, it is easy to see how doing this could be a mistake: the effectiveness of any of these relies on certain assumptions about the programs, and so any modifications need to be done while keeping the custom allocator in mind. Moreover, non-standard allocation schemes make it much more difficult to use existing tools to analyze your code. In more detail:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Accidentally calling the standard &lt;code&gt;free&lt;&#x2F;code&gt; on a custom-allocated object could corrupt the heap and lead to errors.&lt;&#x2F;li&gt;
&lt;li&gt;Custom memory allocation makes it impossible to use leak detection tools.&lt;&#x2F;li&gt;
&lt;li&gt;It also prevents you from using a different custom allocator to do something else that&#x27;s more valuable in the future.&lt;&#x2F;li&gt;
&lt;li&gt;It keeps you from using garbage collection.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;the-paper-reconsidering-custom-memory-management&quot;&gt;The Paper: Reconsidering Custom Memory Management&lt;&#x2F;h1&gt;
&lt;p&gt;The paper at hand, &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=582421&quot;&gt;&amp;quot;Reconsidering Custom Memory Allocation&amp;quot;&lt;&#x2F;a&gt;, is packaged and sold as the thesis that custom memory allocators in practice do not outperform general ones.&lt;&#x2F;p&gt;
&lt;p&gt;One of the key arguments of this paper is that standard baselines are not fair. Evidently the usual argument &lt;em&gt;in favor of&lt;&#x2F;em&gt; custom allocation in 2002 was a comparison against the win32 memory allocator, which was much slower than custom allocators. The first part of this paper is an evaluation against the Lea allocator (Doug Lea&#x27;s &lt;code&gt;dlmalloc&lt;&#x2F;code&gt;, now often the default allocator), which greatly reduces and in some cases eliminates the margin of victory for custom allocators not exploiting regions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-taxonomy-of-memory-allocators&quot;&gt;The Taxonomy of Memory Allocators&lt;&#x2F;h3&gt;
&lt;p&gt;This paper uses the following three bins to classify allocators:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;[1] Per-class allocators&lt;&#x2F;strong&gt;. You build a separate heap for every object size class, and optimize each one separately for objects of this size. This is fast, simple, and interacts with C well---but could be space inefficient if you don&#x27;t know how much memory of each class size you will use.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;[2] Regions&lt;&#x2F;strong&gt;. An allocator that allocates large chunks of memory, puts smaller pieces within them, and then must free them all at once. Regions are fast and convenient, but use a lot of space and are arguably dangerous because dangling references keep things from being freed. This (nominally) requires programmers to re-structure code to keep references to the region, and free the entire region at once, resulting in a usage pattern that looks somewhat like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```
createRegion(r);
x1 = regionMalloc(r,8);
x2 = regionMalloc(r,8);
x3 = regionMalloc(r,16);
x4 = regionMalloc(r,8);
```
&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;[3] Custom patterns&lt;&#x2F;strong&gt;. Anything else---for example, those that exploit stack-like patterns in memory allocation (the relevant benchmark is &lt;code&gt;197.parser&lt;&#x2F;code&gt;). The authors describe these as fast, but brittle.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;experimental-setup&quot;&gt;Experimental Setup&lt;&#x2F;h2&gt;
&lt;p&gt;Although this paper is well-regarded and remembered primarily for its evaluation, the actual experimental data involves only 8 programs, each run with a single input. The table containing the programs and their inputs can be seen in the figure below:&lt;&#x2F;p&gt;
&lt;img src=&quot;expts.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot; style=&quot;margin-left:100px;&quot;&#x2F;&gt;
&lt;p&gt;All of the other graphs in this blog post refer to the 8 benchmarks on the left. Each experiment was run on a single input, a single time, and the paper does not report variance, the impact of parameters, machine loads, or any other confounders. For a paper that is purported to be an exemplar takedown of a common design practice, and appears to have had a large impact on the state of memory allocation, its actual empirical backing could be more robust.&lt;&#x2F;p&gt;
&lt;p&gt;Several things make this even more worrying: there are only 8 data points, with no reported statistics; more than 8 graphs are presented in the paper alone (we must imagine the authors conducted far more analyses that they did not report); dozens of features were exposed; and the authors were willing to consider non-linear decision boundaries (the &lt;em&gt;is it a region&lt;&#x2F;em&gt; decision stump&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;). The authors also give very little attention to allocators that are not region-based. The hypothesis that the only design point of a custom allocator that could give it an edge over a general one is making use of regions, while very possibly true, is not particularly evident from the data: many of the region allocators also fail to outperform &lt;code&gt;dlmalloc&lt;&#x2F;code&gt;, and some non-region allocators have a slight edge.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;emulation&quot;&gt;Emulation&lt;&#x2F;h3&gt;
&lt;p&gt;In order to compare allocators which have different semantics and expose different interfaces to the programmers, Berger et al. write an emulator so that these programs can be run using malloc and free. This allows them to be linked to the analysis tools, and allows various properties of the programs, such as the drag, to be computed. This emulation hampers the custom allocators and could itself be a source of performance disparity.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;results-part-i-custom-vs-general-purpose-allocators&quot;&gt;Results Part I: Custom vs General-Purpose Allocators&lt;&#x2F;h2&gt;
&lt;p&gt;The primary finding of this paper is that, while custom allocators outperform the win32 memory allocator, they often roughly break even with the Lea allocator. The only custom allocators that seem to be able to get a reliable edge on &lt;code&gt;dlmalloc&lt;&#x2F;code&gt; are region allocators, as seen below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;custom-alloc&#x2F;noreap-time.png&quot; alt=&quot;Runtime Benchmarks&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;At a glance, we can see that dlmalloc roughly matches the performance of the custom allocators for all programs except &lt;code&gt;lcc&lt;&#x2F;code&gt; and &lt;code&gt;muddle&lt;&#x2F;code&gt;, both of which use region allocators. Given that the Lea allocator was already over a decade old when this paper was being written, it is surprising that people had not recognized this. Perhaps an alternate, less grand but equally appropriate framing of the discovery would be &amp;quot;the win32 memory allocator is not very good, and you have other options&amp;quot;. The failure of regions to help in the first three applications remains an unexplored question. One also wonders about the possibility of a ninth data point in which a non-region allocator vastly out-performs dlmalloc.&lt;&#x2F;p&gt;
&lt;p&gt;Here are the results in terms of memory usage, for comparison:
&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;custom-alloc&#x2F;noreap-space.png&quot; alt=&quot;Memory Benchmarks&quot; &#x2F;&gt;
The observation the authors make is that regions can provide performance benefits in both realms, though regions&#x27; benefits are misleading, because their peak memory consumption is higher, and they often leave a lot of memory lying around, as seen in the drag graphs below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;custom-alloc&#x2F;drag-graphs.png&quot; alt=&quot;drag&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h1 id=&quot;regions-and-reaps&quot;&gt;Regions and Reaps&lt;&#x2F;h1&gt;
&lt;p&gt;In order to provide the performance benefits of regions to general purpose allocators, Berger et al. introduce &amp;quot;reaps&amp;quot;, a merging of regions and heaps, sold as a generalization of both.
Recall that a heap exposes &lt;code&gt;malloc&lt;&#x2F;code&gt; and &lt;code&gt;free&lt;&#x2F;code&gt;; a region gives you a &lt;code&gt;malloc&lt;&#x2F;code&gt; and &lt;code&gt;freeAll&lt;&#x2F;code&gt;. The example in the paper is the following:&lt;&#x2F;p&gt;
&lt;img src=&quot;reap-example.png&quot; alt=&quot;reap example&quot; width=&quot;400&quot; style=&quot;margin-left:100px;&quot;&#x2F;&gt;
&lt;p&gt;Unfortunately, the example ends here, leaving readers to wonder why we allocated new memory and copied metadata when we really intended to free memory by deleting things. This is the entirety of the relevant description from the paper:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Reaps adapt to their use, behaving either like regions or like heaps. Initially, reaps behave
like regions. They allocate memory by bumping a pointer through
geometrically-increasing large chunks of memory... Reaps act in this region mode
until a call to &lt;code&gt;reapFree&lt;&#x2F;code&gt; deletes an individual object. Reaps place
freed objects onto an associated heap. Subsequent allocations from
that reap use memory from the heap until it is exhausted, at which
point it reverts to region mode.&amp;quot;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Owing in part to their elusiveness, I have not been able to determine with certainty how edge cases work, but I will give an expanded intuition for what reaps seem to be doing and why one might expect this to give some performance benefits:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;a-pair-of-heuristics&quot;&gt;A Pair of Heuristics&lt;&#x2F;h3&gt;
&lt;p&gt;We start a new heap because the assumption is that after freeing some memory, we will probably be both freeing and allocating memory frequently in the future. Thus, the very end of our memory acts like a heap for a little while, so a bout of object creation and deletion has only the impact of a single loop iteration on the consumed memory. Effectively, we are using the heuristic that freeing an individual object means we don&#x27;t want to be using regions &amp;quot;right now&amp;quot;, and in particular, for the upcoming allocations.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, how do we know we should go back to allocating things in region mode? This should happen when we&#x27;ve allocated a lot of memory without freeing any. Thus, we go back to region mode when the heap is full. This is our second heuristic. In some ways, tying these two together to the size of an artificial heap seems like a very neat trick.&lt;&#x2F;p&gt;
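&lt;p&gt;The two heuristics can be sketched in a few lines of Python. This is only one possible reading of the quoted description, not the paper&#x27;s implementation: the names are hypothetical, a single fixed-size chunk replaces the geometrically growing chunks, the associated heap is modeled as a plain first-fit list of freed blocks, and offsets stand in for pointers.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class Reap:
    # Toy reap: a bump-pointer region plus a list of individually freed blocks.

    def __init__(self, capacity):
        self.capacity = capacity  # single backing chunk (a simplification)
        self.top = 0              # bump pointer: next unused offset
        self.freed = []           # the associated heap: (offset, size) blocks

    def alloc(self, size):
        # Heap mode: reuse a freed block if one is large enough (first fit).
        for i, (offset, block_size) in enumerate(self.freed):
            if block_size >= size:
                del self.freed[i]
                return offset
        # Region mode: nothing reusable, so bump the pointer.
        if self.top + size > self.capacity:
            raise MemoryError
        offset = self.top
        self.top += size
        return offset

    def free(self, offset, size):
        # An individual free leaves the bump pointer alone; the block only
        # becomes available to later allocations.
        self.freed.append((offset, size))

    def free_all(self):
        # Region-style bulk free.
        self.top = 0
        self.freed.clear()


reap = Reap(capacity=64)
a = reap.alloc(8)   # region mode
b = reap.alloc(8)   # region mode
reap.free(a, 8)     # first individual free: the block joins the heap
c = reap.alloc(8)   # heap mode: reuses the freed block
d = reap.alloc(8)   # heap exhausted: back to bumping the pointer
```
&lt;&#x2F;pre&gt;
&lt;p&gt;In this model &lt;code&gt;c&lt;&#x2F;code&gt; lands where &lt;code&gt;a&lt;&#x2F;code&gt; used to be, while &lt;code&gt;d&lt;&#x2F;code&gt; extends the region again, so a burst of frees and allocations does not grow the bump pointer until the freed blocks run out.&lt;&#x2F;p&gt;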
&lt;h3 id=&quot;impact-on-programmers&quot;&gt;Impact on Programmers&lt;&#x2F;h3&gt;
&lt;p&gt;The authors argue that this is purely positive, as it gives programmers the option of programming in a style that is region-like, and thereby gaining all of its benefits, but also leaving them the freedom to do other things. This seems compelling, but one might also worry that by not forcing programmers to interact with the region semantics, one will lose their attention and mistakes will not be caught. For instance, a program that mostly uses region mode while freeing an occasional object pays a modest overhead yet in reality frees exactly zero memory; such a mistake might not be easy to see.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;results-part-ii&quot;&gt;Results Part II:&lt;&#x2F;h2&gt;
&lt;p&gt;Reaps do indeed reap the performance benefits that were expected of them (at least as far as these 8 program traces are concerned). Below, we can see the same graph, with the addition of reaps:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;custom-alloc&#x2F;runtime-benchmarks.png&quot; alt=&quot;&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The most important feature of this graph, of course, is the fact that reaps perform reasonably well compared to dlmalloc and the custom allocators for programs that don&#x27;t use regions, and don&#x27;t pay as much as dlmalloc on &lt;code&gt;lcc&lt;&#x2F;code&gt; and &lt;code&gt;muddle&lt;&#x2F;code&gt;, where regions have given a large performance bump beyond the other general-purpose allocators.&lt;&#x2F;p&gt;
&lt;p&gt;The authors draw this conclusion unreservedly, but there are some very worrying things about this graph.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Almost all of the performance benefit comes from a single data point, &lt;code&gt;lcc&lt;&#x2F;code&gt;, which dlmalloc handles no better than win32. Even for &lt;code&gt;muddle&lt;&#x2F;code&gt;, dlmalloc does not perform terribly.&lt;&#x2F;li&gt;
&lt;li&gt;In addition, reaps have a small overhead on nearly everything; it is particularly pronounced for the parser and c-breeze.&lt;&#x2F;li&gt;
&lt;li&gt;Combining the first two points, it seems as though dlmalloc would out-perform reap on average if we take out &lt;code&gt;lcc&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Almost none of the trends that one can read off of this graph are supported uniformly by all of the points. And once again, there are only 8 of them.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;from-2002-to-2019&quot;&gt;From 2002 to 2019&lt;&#x2F;h1&gt;
&lt;p&gt;The Lea allocator (Doug Lea&#x27;s malloc, now referred to as &lt;code&gt;dlmalloc&lt;&#x2F;code&gt;) is now the basis of the default allocator on Linux (glibc&#x27;s malloc is derived from it). The standard general purpose allocator to beat in evaluations is now &lt;a href=&quot;http:&#x2F;&#x2F;jemalloc.net&#x2F;&quot;&gt;jemalloc&lt;&#x2F;a&gt;, which seems to be considerably more efficient&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. The existence of even better general purpose allocators in some ways strengthens the point made by the paper: there&#x27;s even less to be gained by writing your own allocator.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, custom memory allocation is not quite dead. &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mtrebi&#x2F;memory-allocators&quot;&gt;Here&lt;&#x2F;a&gt;&#x27;s a tutorial on how and why custom allocators are helpful. Custom allocation is still seen as a potential reason to disregard projects such as &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1902.04738.pdf&quot;&gt;Mesh&lt;&#x2F;a&gt;. And, ironically, Emery Berger&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;plasma.cs.umass.edu&#x2F;emery&#x2F;heap-layers.html&quot;&gt;Heap Layers&lt;&#x2F;a&gt; project (used to construct reaps in the present paper), which has been much more widely employed, is by its own admission a framework for building custom allocators:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Heap Layers makes it easy to write high-quality custom and general-purpose memory allocators.&amp;quot;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What happened to reaps?&lt;&#x2F;strong&gt;
Though modern memory allocators more effectively put off frees to make them more region-like, &amp;quot;reap&amp;quot; is not a common term for memory allocators---in fact, only a small handful of the papers that cite this one mention reaps (and all share authors).&lt;&#x2F;p&gt;
&lt;p&gt;It is interesting that the reap, which one might consider the more substantive, creative contribution of this paper, does not seem to have caught on. In his &lt;a href=&quot;https:&#x2F;&#x2F;emeryblogger.com&#x2F;2012&#x2F;10&#x2F;28&#x2F;most-influential-oopsla2012&#x2F;&quot;&gt;2012 blog post&lt;&#x2F;a&gt; in which he acknowledges a most influential paper award for this paper, Emery tells people (with extreme brevity and no explanation) to use &lt;a href=&quot;http:&#x2F;&#x2F;hoard.org&quot;&gt;his more modern allocator, Hoard&lt;&#x2F;a&gt;, instead of reaps. Even more strangely, &lt;a href=&quot;https:&#x2F;&#x2F;people.cs.umass.edu&#x2F;%7Eemery&#x2F;pubs&#x2F;berger-asplos2000.pdf&quot;&gt;the technical paper&lt;&#x2F;a&gt; that Hoard is based on does not include the word &amp;quot;reap&amp;quot;, leaving us wondering what happened to them. Still, an implementation lives on in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;emeryberger&#x2F;Heap-Layers&#x2F;blob&#x2F;master&#x2F;legacy&#x2F;reap&#x2F;regionheap.cpp&quot;&gt;the legacy portion&lt;&#x2F;a&gt; of the Heap Layers project.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h1&gt;
&lt;p&gt;While this paper seems to have been largely correct in several important ways, many of its elements leave something to be desired, as discussed in the &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;custom-alloc&#x2F;#experimental-setup&quot;&gt;Experimental Setup section&lt;&#x2F;a&gt;. The evaluation seems thorough but has a limited scope and draws conclusions stronger than the data alone seem to support; the new algorithm and data structure (reaps) seem very useful and are sold well, but explained poorly, with only a partial example which doesn&#x27;t illustrate interesting cases, no theoretical backing, and no explanation of the heuristics that enable its performance.&lt;&#x2F;p&gt;
&lt;p&gt;These reaps do not seem to have been used more than once or twice since 2002, and not for lack of publicity (several claim, nebulously, to use &amp;quot;ideas from reaps&amp;quot;). It is unclear whether the reason they were not adopted was because of the presentation, because they never were quite as good as they seem (after all, there are only 8 data points, and it might be easy to overfit to them with parameter or design tuning), or because something better came along shortly thereafter. With no follow-up studies or testing on benchmarks for which they were not designed, it&#x27;s hard to know.&lt;&#x2F;p&gt;
&lt;p&gt;Even so, the general message is well-articulated, appears to have rung true in retrospect, and perhaps necessarily so. As general-purpose allocators get even better, and perhaps even begin to be tunable to custom programs, out-performing them becomes increasingly difficult. This is a general trend: while knowing something about the matrices you want to multiply may have been an invaluable place to mine in order to edge out general purpose matrix multiplication algorithms, today it is almost impossible to out-perform standard libraries for doing them---so much so, that it is often beneficial to rewrite other computations in terms of matrix multiplications to accelerate them; moreover, representing computations in this generic way often provides some more standard insights into the structure of your problem and allows you to immediately make use of relevant algorithms. For similar reasons, it is hard to imagine that custom memory allocators will do any more than dwindle.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Of course, there are only 8 programs, and so it is hard to really trust this analysis.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;A decision stump is a depth-1 decision tree, which is already a non-linearity, as it has a branch.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;https:&#x2F;&#x2F;suniphrase.wordpress.com&#x2F;2015&#x2F;10&#x2F;27&#x2F;jemalloc-vs-tcmalloc-vs-dlmalloc&#x2F;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</description>
            </item>
        
            <item>
                <title>One VM to Rule Them All</title>
                <pubDate>Fri, 01 Nov 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/one-vm/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/one-vm/</guid>
                <description>&lt;h1 id=&quot;a-short-story&quot;&gt;A Short Story&lt;&#x2F;h1&gt;
&lt;p&gt;You are the latest hire for the startup company &lt;em&gt;QuantBet&lt;&#x2F;em&gt;, which specializes in developing computational models that are used for proprietary sports betting.  &lt;em&gt;QuantBet&lt;&#x2F;em&gt; is fairly new, and as a gifted PL researcher, you are tasked with creating a new programming language, &lt;em&gt;QuantPL&lt;&#x2F;em&gt;, to assist gamblers in their analysis.&lt;&#x2F;p&gt;
&lt;p&gt;Sports betting is a fast business, and you want your code to run quickly.  You begin your project by developing a parser for &lt;em&gt;QuantPL&lt;&#x2F;em&gt; and an Abstract Syntax Tree (AST) interpreter.&lt;&#x2F;p&gt;
&lt;p&gt;Your colleagues are impressed by your work with &lt;em&gt;QuantPL&lt;&#x2F;em&gt;, which they claim is more intuitive when scripting models.  But soon, they start to notice that it&#x27;s a lot slower than what they&#x27;re used to.&lt;&#x2F;p&gt;
&lt;p&gt;You think about developing a real VM.  This would require spending a lot of time designing a run-time system in C without causing memory leaks.  You think about using Java instead as your backend.  You may have to even design a bytecode format and interpreter.  If they complain more, you&#x27;ll have to hire others to help you write a JIT compiler.  This is a slow, costly and painful process.&lt;&#x2F;p&gt;
&lt;p&gt;You hear that rival company &lt;em&gt;Quant2Bet&lt;&#x2F;em&gt; has developed a language &lt;em&gt;Quant2PL&lt;&#x2F;em&gt;  that is almost as fast as Java code without much effort.  They&#x27;re using something called &lt;a href=&quot;https:&#x2F;&#x2F;www.graalvm.org&#x2F;&quot;&gt;Truffle and Graal&lt;&#x2F;a&gt;, and only had to write an interpreter and a few high-level specialization strategies which automatically transformed into a JIT.  Now you&#x27;re stuck modifying your programming language from the very beginning.  You decide this isn&#x27;t worth your time, quit your job, and join &lt;em&gt;Quant2Bet&lt;&#x2F;em&gt; instead.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;a-solution&quot;&gt;A Solution&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=2509581&quot;&gt;&amp;quot;One VM to Rule Them All&amp;quot;&lt;&#x2F;a&gt; (2013) presents an architecture that lets new programming languages be implemented within a common framework, enabling language-agnostic optimizations.  Instead of designing the entire stack for a new programming language, you can just focus on creating a parser and AST interpreter.  You can then reuse functionality from existing virtual machines, and your language is already fast and optimized.  In the following sections, we call the new language that we are trying to build the &lt;em&gt;guest language&lt;&#x2F;em&gt;, and the language in which its interpreter is written (the backend of this infrastructure) the &lt;em&gt;host language&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Java Virtual Machine (JVM)&lt;&#x2F;strong&gt; is a virtual machine that allows computers to run programming languages that compile to Java bytecode.&lt;&#x2F;li&gt;
&lt;li&gt;A &lt;strong&gt;Just-In-Time (JIT) compiler&lt;&#x2F;strong&gt; compiles a program at run time instead of before execution.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Dispatch&lt;&#x2F;strong&gt; is the process of selecting, at run time, which implementation of a polymorphic operation to call.  Java implements run-time polymorphism using dynamic dispatch.  If a method is overridden, the run-time type of the object determines which version of the overridden method is called.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;one-vm-to-rule-them-all&quot;&gt;One VM to Rule Them All&lt;&#x2F;h2&gt;
&lt;p&gt;This paper claims the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A method of rewriting nodes in which an AST node can rewrite itself into a more specialized or general node.  Nodes capture constructs of the guest language, such as addition or division.&lt;&#x2F;li&gt;
&lt;li&gt;An optimizing compiler that takes in the structure of an interpreter and turns it into a compiler that emits machine code.  In Graal, the compiler is written in a subset of Java.&lt;&#x2F;li&gt;
&lt;li&gt;A method called &lt;em&gt;speculative assumption&lt;&#x2F;em&gt; in which heuristics determine the probability of executing certain sections of code.  For less frequent sections, deoptimization disregards compiled code and switches to execution using interpretation.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The combination of these ideas results in high performance from an interpreter without the need to implement a language-specific compiler.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;truffle-and-graal&quot;&gt;Truffle and Graal&lt;&#x2F;h3&gt;
&lt;p&gt;The prototype of the language implementation framework is called &lt;em&gt;Truffle&lt;&#x2F;em&gt; and the compilation infrastructure is called &lt;em&gt;Graal&lt;&#x2F;em&gt;, both of which have been &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;oracle&#x2F;graal&quot;&gt;open-sourced&lt;&#x2F;a&gt; by Oracle.  At a high level, a user of this ecosystem implements an AST interpreter for the guest language.  In the interpreter, each node encapsulates information regarding a construct of the guest language.  An addition node, for instance, may be specialized for integer, double and string inputs.&lt;&#x2F;p&gt;
&lt;p&gt;Node rewriting occurs during interpretation, using profiling feedback.  When a subtree of the AST is deemed to be stable (meaning unlikely to be rewritten), the AST is partially evaluated at that subtree and the Graal compiler produces optimized machine code to run on the VM.  Partial evaluation, combined with aggressive inlining, alleviates the overhead of interpreter dispatch code.&lt;&#x2F;p&gt;
&lt;p&gt;If during execution a node is not the correct specialized type, we &lt;em&gt;deoptimize&lt;&#x2F;em&gt;.  The optimized machine code is discarded and we switch to AST interpretation.  The node rewrites itself and the subtree is then recompiled.&lt;&#x2F;p&gt;
&lt;p&gt;Here, dynamic compilation is agnostic to the semantics of the guest language.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;example&quot;&gt;Example&lt;&#x2F;h4&gt;
&lt;p&gt;The below diagram describes an instance of an AST interpreter during execution.  In Figure 1, nodes are first uninitialized.  After profiling feedback, nodes are rewritten to become specialized and are then compiled.&lt;&#x2F;p&gt;
&lt;p&gt;Now suppose in Figure 3, there is integer overflow.  Continuing our example with JavaScript as the guest language, this would cause a change in value type.  Our speculation that all nodes in that subtree are of type integer is incorrect, so we have to deoptimize, rewrite a second time, and then recompile.  Note that node rewriting is not just for type transitions, but for any type of profiling feedback.&lt;&#x2F;p&gt;
&lt;img src=&quot;figure3.png&quot; style=&quot;width: 90%&quot;&gt;
&lt;h2 id=&quot;node-rewriting&quot;&gt;Node Rewriting&lt;&#x2F;h2&gt;
&lt;p&gt;It is the duty of the developer of the guest language to implement node rewriting.  As a guideline, the authors of the paper suggest that developers fulfill the following:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Completeness&lt;&#x2F;strong&gt; - Each node must provide rewrites for &lt;em&gt;all&lt;&#x2F;em&gt; cases that it does not handle itself.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Finiteness&lt;&#x2F;strong&gt; - The sequence of node replacements must eventually terminate to either a specialized node or a generic implementation that encompasses all possible cases.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Locality&lt;&#x2F;strong&gt; - Rewrites occur locally and only modify the subtree of the AST.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Examples of optimizations that use profiling feedback are type specializations, polymorphic inline caches and resolving operations.  As the program is executed repeatedly, profiling feedback yields more optimized compiled code.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the following example, where the guest language is JavaScript:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```
function sum(n) {
   var sum = 0;
   for (var i = 1; i &amp;lt; n; i++) {
      sum += i;
   }
   return sum;
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The following is an example of the AST immediately after parsing.  Note that only constants are typed.&lt;&#x2F;p&gt;
&lt;img src=&quot;figure5.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;In code, we can write the integer addition node as follows.  Here, Java is the host language for the JavaScript interpreter.&lt;&#x2F;p&gt;
&lt;img src=&quot;figure6.png&quot; style=&quot;width: 50%&quot;&gt;
&lt;p&gt;After execution, nodes replace themselves with nodes specialized for type &lt;em&gt;integer&lt;&#x2F;em&gt;, called IAdd, which are used for subsequent executions.  Note that IAdd nodes only operate on integer values.  If an IAdd node does not receive an integer value, it throws an exception, and the node is rewritten.&lt;&#x2F;p&gt;
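&lt;p&gt;To make the rewriting protocol concrete, here is a minimal sketch in Python rather than Java, so it is not the paper&#x27;s code from the figure above.  The class names &lt;code&gt;Node&lt;&#x2F;code&gt;, &lt;code&gt;Root&lt;&#x2F;code&gt;, &lt;code&gt;Input&lt;&#x2F;code&gt;, &lt;code&gt;UninitializedAdd&lt;&#x2F;code&gt; and &lt;code&gt;DAdd&lt;&#x2F;code&gt;, the &lt;code&gt;replace&lt;&#x2F;code&gt; method, and the simulated 32-bit integer range are all assumptions made for illustration.  The sketch handles numeric operands only, and it signals a failed speculation with a direct call to &lt;code&gt;rewrite&lt;&#x2F;code&gt; instead of an exception.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # simulated 32-bit integer range (assumption)


def fits_int32(value):
    """True if value is an int inside the simulated 32-bit range."""
    return isinstance(value, int) and value in range(INT_MIN, INT_MAX + 1)


class Node:
    def __init__(self, *children):
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def replace(self, new_node):
        """Local rewrite: swap this node for new_node in its parent."""
        new_node.parent = self.parent
        if self.parent is not None:
            index = self.parent.children.index(self)
            self.parent.children[index] = new_node
        return new_node


class Root(Node):
    def execute(self):
        return self.children[0].execute()


class Input(Node):
    """Leaf whose value can be changed between runs to drive the example."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def execute(self):
        return self.value


class Add(Node):
    def execute(self):
        left = self.children[0].execute()
        right = self.children[1].execute()
        return self.add(left, right)

    def rewrite(self, left, right):
        """Install a node that handles the observed operands, then finish the add."""
        if fits_int32(left) and fits_int32(right) and fits_int32(left + right):
            new_node = IAdd(*self.children)
        else:
            new_node = DAdd(*self.children)
        self.replace(new_node)
        return new_node.add(left, right)


class UninitializedAdd(Add):
    def add(self, left, right):
        return self.rewrite(left, right)  # first execution: specialize


class IAdd(Add):
    def add(self, left, right):
        if fits_int32(left) and fits_int32(right):
            result = left + right
            if fits_int32(result):
                return result
        return self.rewrite(left, right)  # speculation failed: rewrite


class DAdd(Add):
    """Most general node in this sketch, so it never rewrites again."""

    def add(self, left, right):
        return float(left) + float(right)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Executing &lt;code&gt;Root(UninitializedAdd(Input(1), Input(2)))&lt;&#x2F;code&gt; once installs an &lt;code&gt;IAdd&lt;&#x2F;code&gt; node under the root.  If an input later holds a value that pushes the sum outside the simulated range, the &lt;code&gt;IAdd&lt;&#x2F;code&gt; node replaces itself with a &lt;code&gt;DAdd&lt;&#x2F;code&gt; node, and the chain of rewrites ends there.&lt;&#x2F;p&gt;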
&lt;p&gt;The below picture shows an instance where nodes are specialized for the integer type (depicted by prefix &amp;quot;I&amp;quot;).&lt;&#x2F;p&gt;
&lt;img src=&quot;figure7.png&quot; style=&quot;width: 60%&quot;&gt;
&lt;p&gt;If &lt;code&gt;sum&lt;&#x2F;code&gt; overflows, we need to rewrite certain nodes to specialize for the double data type (prefix &amp;quot;D&amp;quot;).  The following image shows this case:&lt;&#x2F;p&gt;
&lt;img src=&quot;figure8.png&quot; style=&quot;width: 60%&quot;&gt;
&lt;p&gt;We can actually write more compact code using Java annotations if the host language is Java.  For a given node, in this case &lt;code&gt;add&lt;&#x2F;code&gt;, we can use the annotation &lt;code&gt;@Specialization&lt;&#x2F;code&gt; to denote a specialized implementation and use &lt;code&gt;@Generic&lt;&#x2F;code&gt; to denote the default, unspecialized implementation.  Now the Java compiler will call the Java Annotation Processor which iterates over all Node Specifications, marked by the annotations.  It is essentially the same as before though now we can use developer tools and IDEs more seamlessly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h2&gt;
&lt;p&gt;The main overhead is dynamic dispatch between nodes while rewriting is still occurring.  To decide when to compile, the runtime counts the number of times a tree is called; when the count exceeds a certain threshold, the tree is assumed to be stable and is compiled.  Deoptimization points invalidate the compiled code, allowing the node to be rewritten.  The counter is then reset, and after the threshold number of executions the tree is deemed stable and compiled again.&lt;&#x2F;p&gt;
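&lt;p&gt;The counting scheme can be sketched in a few lines of Python.  Everything here is hypothetical: the &lt;code&gt;CallTarget&lt;&#x2F;code&gt; class, its method names, and the threshold value are made up for illustration, and the &lt;code&gt;compile&lt;&#x2F;code&gt; method is only a stand-in for partial evaluation followed by Graal compilation.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
COMPILE_THRESHOLD = 3  # hypothetical value, chosen only for this example


class CallTarget:
    """Wraps an interpreted tree and compiles it once it looks stable."""

    def __init__(self, interpret, threshold=COMPILE_THRESHOLD):
        self.interpret = interpret  # callable that runs the AST interpreter
        self.threshold = threshold
        self.call_count = 0
        self.compiled = None        # compiled version, once one exists
        self.compilations = 0       # bookkeeping for the example

    def compile(self):
        # Stand-in for partial evaluation plus compilation: this sketch
        # simply reuses the interpreter function as the compiled code.
        self.compilations += 1
        return self.interpret

    def call(self, *args):
        if self.compiled is not None:
            return self.compiled(*args)       # fast path
        self.call_count += 1
        if self.call_count == self.threshold:
            self.compiled = self.compile()    # tree is assumed stable
        return self.interpret(*args)

    def invalidate(self):
        """Deoptimization: discard compiled code and restart counting."""
        self.compiled = None
        self.call_count = 0
```
&lt;&#x2F;pre&gt;
&lt;p&gt;After &lt;code&gt;invalidate&lt;&#x2F;code&gt; is called, the tree has to be executed &lt;code&gt;threshold&lt;&#x2F;code&gt; more times in the interpreter before it is compiled again.&lt;&#x2F;p&gt;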
&lt;p&gt;This architecture allows for additional optimizations, such as:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Injecting Static Information&lt;&#x2F;strong&gt;: where a node adds a guard that leads to a compiled code block or a deoptimizing point.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Local Variables&lt;&#x2F;strong&gt;: where read and write operations on local variables are delegated to a Frame object that holds values.  &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1064996&quot;&gt;Escape Analysis&lt;&#x2F;a&gt; is enforced allowing static single assignment (SSA) form.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Branch Probabilities&lt;&#x2F;strong&gt;: where probabilities of branch targets are incorporated to optimize code layout, which decreases branch and instruction cache misses.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;current-implementation-and-deployment&quot;&gt;Current Implementation and Deployment&lt;&#x2F;h1&gt;
&lt;p&gt;Currently, this infrastructure is prototyped using a subset of Java.&lt;&#x2F;p&gt;
&lt;img src=&quot;figure17.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;The &lt;strong&gt;Truffle API&lt;&#x2F;strong&gt; is the main interface for implementing the AST.  The &lt;strong&gt;Truffle Optimizer&lt;&#x2F;strong&gt; involves partial evaluation.  The &lt;strong&gt;VM Runtime Services&lt;&#x2F;strong&gt; provide basic VM services, which include Graal.&lt;&#x2F;p&gt;
&lt;p&gt;The authors suggest two main deployment scenarios:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Java VM&lt;&#x2F;strong&gt;:  The current implementation is in Java so it can technically run on any Java VM.  This is especially useful for debugging and comes at low cost.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Graal VM&lt;&#x2F;strong&gt;:  This provides API access to the compiler, so the guest language runs with partial evaluation.  This uses Graal as its dynamic compiler and is extremely fast.  It is useful for integrating the guest language in a current Java environment.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;h1 id=&quot;a-survey-on-languages&quot;&gt;A Survey on Languages&lt;&#x2F;h1&gt;
&lt;p&gt;The following languages have been tested by the authors: JavaScript, Ruby, Python, J, and R.  We will expand upon the first two as they pose interesting points.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;JavaScript&lt;&#x2F;strong&gt; has many implementations, several of which are high-performance.  Node rewriting is used for type specialization and inline caching.  Type specialization has been described before in the int&#x2F;double example for addition.  Inline caching seeks to optimize operations that are performed on objects of a specific type.  The implementation of this is similar to pattern matching in functional languages such as &lt;code&gt;Haskell&lt;&#x2F;code&gt; or &lt;code&gt;OCaml&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
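&lt;p&gt;As a rough illustration of inline caching, the Python sketch below caches the result of a property lookup on the node that performs the read.  It is not how Truffle implements it: the &lt;code&gt;Shape&lt;&#x2F;code&gt;, &lt;code&gt;Obj&lt;&#x2F;code&gt; and &lt;code&gt;PropertyRead&lt;&#x2F;code&gt; classes are hypothetical, and the cache remembers only one object layout at a time (a monomorphic cache).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class Shape:
    """Hypothetical object layout: maps property names to slot indices."""

    def __init__(self, offsets):
        self.offsets = offsets


class Obj:
    """Guest-language object: a layout plus a flat list of slot values."""

    def __init__(self, shape, slots):
        self.shape = shape
        self.slots = slots


class PropertyRead:
    """AST node for reading one property, with a one-entry inline cache."""

    def __init__(self, name):
        self.name = name
        self.cached_shape = None
        self.cached_offset = None
        self.misses = 0  # bookkeeping for the example

    def execute(self, obj):
        if obj.shape is self.cached_shape:
            # Cache hit: same layout as last time, so skip the lookup.
            return obj.slots[self.cached_offset]
        # Cache miss: do the full lookup and remember it for this layout.
        self.misses += 1
        self.cached_shape = obj.shape
        self.cached_offset = obj.shape.offsets[self.name]
        return obj.slots[self.cached_offset]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Objects that share a layout hit the cache after the first read; an object with a different layout causes a miss and overwrites the cached entry.&lt;&#x2F;p&gt;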
&lt;p&gt;&lt;strong&gt;Ruby&lt;&#x2F;strong&gt; is even more interesting.  Ruby allows any method to be redefined, so naturally, deoptimization can be applied to invalidate compiled nodes.  Most notably, &lt;a href=&quot;https:&#x2F;&#x2F;chrisseaton.com&#x2F;truffleruby&#x2F;ecoop14&#x2F;ecoop14-ruby-truffle.pdf&quot;&gt;later papers&lt;&#x2F;a&gt; indicate that Ruby implemented with Truffle is 45x faster than MRI (C backend).  Truffle JVM is also approximately 2x faster than &lt;a href=&quot;http:&#x2F;&#x2F;docs.topazruby.com&#x2F;en&#x2F;latest&#x2F;&quot;&gt;Topaz&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;One question I&#x27;d like to pose is whether these results would change if the backend compiler were written in C instead of Java.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;merits-and-shortcomings&quot;&gt;Merits and Shortcomings&lt;&#x2F;h1&gt;
&lt;h3 id=&quot;merits&quot;&gt;Merits&lt;&#x2F;h3&gt;
&lt;p&gt;This is an interesting idea that is definitely worth exploring when implementing the backend for a new programming language.  The main merit of Truffle is the ease of implementing a new programming language.  The developer only has to implement an AST interpreter, and Truffle handles the rest of the heavy lifting.  The main merit of Graal is its ease of pairing with Truffle.  Graal itself is an impressive compiler which can run as either a JIT or an AOT compiler, has many optimizations as previously discussed and can convert Truffle ASTs into native code.  Graal is also easily extensible with new optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;Most of Truffle&#x2F;Graal is available open-source!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;shortcomings&quot;&gt;Shortcomings&lt;&#x2F;h3&gt;
&lt;p&gt;One shortcoming I noticed was the amount of memory needed for this ecosystem during runtime.  At deoptimization points, metadata is stored, the partially evaluated AST is reloaded and the original AST is also kept in memory.  This is a lot of information, and the different speculative optimizations add even more memory usage.  Running this compiler will consume a lot of RAM, which can be costly and may even outweigh the performance gains.  This is not to mention the memory cost of Truffle and Graal themselves at runtime.  Currently, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;oracle&#x2F;graal&#x2F;tree&#x2F;master&#x2F;substratevm&quot;&gt;SubstrateVM&lt;&#x2F;a&gt; is a suggested solution to this shortcoming which aims to reduce the memory footprint of applications by compiling Java programs into executable containers.  But SubstrateVM itself has many limitations of its own, involving dynamic class loading&#x2F;unloading, finalizers, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;oracle&#x2F;graal&#x2F;blob&#x2F;master&#x2F;substratevm&#x2F;LIMITATIONS.md&quot;&gt;etc&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Another shortcoming involves the authors&#x27; solution to the dynamic dispatching problem.  The authors describe keeping a counter, and only after the counter reaches a threshold is the AST compiled for each specialization within a node.  This means that peak performance only occurs after many iterations of the program, when node rewriting is unlikely.  Traditional JITs have a similar approach where code is compiled once it is &amp;quot;hot&amp;quot;.  However, since in this paper only a specific portion of a node is compiled, it may take even longer for the execution count to reach the threshold, since all executions would have to follow the same specialized path.  Considering the example of the addition node, suppose that the snippet of code involving adding two integers is compiled.  In a subsequent execution, if the node is rewritten to double addition, the execution counter resets, whereas in a traditional JIT compiler, the entire node would remain compiled.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;in-industry&quot;&gt;In Industry&lt;&#x2F;h1&gt;
&lt;p&gt;Both &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=pR5NDkIZBOA&quot;&gt;Twitter&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;www.ivonet.nl&#x2F;2018&#x2F;10&#x2F;23&#x2F;codeone&#x2F;Tuesday&#x2F;DEV6082__one-vm-to-rule-them-all-lessons-learned-with-truffle-and-graal&#x2F;&quot;&gt;Goldman Sachs&lt;&#x2F;a&gt; have prototyped integrating Truffle&#x2F;Graal into their stack.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;related-work&quot;&gt;Related Work&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;strong&gt;PyPy&lt;&#x2F;strong&gt; runs much quicker than standard Python.  Python is normally interpreted by CPython.  PyPy runs faster than CPython since it includes a JIT compiler.  PyPy implements a Python VM written in RPython.  PyPy is also a meta-JIT that people use to write JITs for other languages.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Metacircular VMs&lt;&#x2F;strong&gt; are written in the guest language instead of the host language.  This allows sharing between the host and guest systems.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Self Optimizing Interpreters&lt;&#x2F;strong&gt; rely on improving interpretation performance during execution.  This paper claims that since compilers analyze larger portions of the code, they can perform global optimizations that an interpreter is not able to do.&lt;&#x2F;p&gt;
&lt;p&gt;It would be interesting to quantify this statement and develop interpreter optimizations to combine with Graal.  I am curious to see if interpreter optimizations in general make a significant difference.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;final-thoughts&quot;&gt;Final Thoughts&lt;&#x2F;h1&gt;
&lt;p&gt;The paper is interesting in its claims, though it does not provide benchmark tests.  The main benchmark I was surprised that the paper did not include is how guest language performance using the Truffle&#x2F;Graal ecosystem compares with the LLVM compiler for a variety of languages.&lt;&#x2F;p&gt;
&lt;p&gt;It took a lot of searching, but I eventually found a &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=8AYESZIaacg&quot;&gt;talk by Thomas Wuerthinger&lt;&#x2F;a&gt; at Devoxx which shows the following performance measures.  Maybe these weren&#x27;t measured at the time of publishing this paper, or the authors were wary of displaying results that are only marginally different from those of their counterparts.&lt;&#x2F;p&gt;
&lt;img src=&quot;chart.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;This paper also markets Truffle&#x2F;Graal as the first of its kind, but in its references, it acknowledges that it is actually not.  I mainly attribute this to the fact that the paper was published out of Oracle Labs and there may have been conflicts of interest between academics and industry-minded folks.  In fact, on a &lt;a href=&quot;https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=6232240&quot;&gt;YCombinator&lt;&#x2F;a&gt; message board thread, developers openly refuse to even consider using a product by Oracle.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;related-reading&quot;&gt;Related Reading&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;a href=&quot;http:&#x2F;&#x2F;openjdk.java.net&#x2F;projects&#x2F;metropolis&#x2F;&quot;&gt;Metropolis&lt;&#x2F;a&gt; which tries to implement Java using Java.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;chrisseaton.com&#x2F;truffleruby&#x2F;tenthings&#x2F;&quot;&gt;Top 10&lt;&#x2F;a&gt; things to do with GraalVM.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.graalvm.org&#x2F;docs&#x2F;getting-started&#x2F;&quot;&gt;Getting Started&lt;&#x2F;a&gt; with GraalVM.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;devmtg&#x2F;2016-01&#x2F;slides&#x2F;Sulong.pdf&quot;&gt;Building a JIT for LLVM IR using Graal&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Loop Invariant Code Motion and Loop Reduction for Bril</title>
                <pubDate>Wed, 30 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-reduction/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-reduction/</guid>
                <description>&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of the project is to replace expensive operations like multiplications with equivalent cheaper operations like additions. This is done with the help of loop invariant code motion and induction variable analysis. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;We divided the project into 3 parts:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Preprocessing to LICM: Finding back edges and associated loops along with reaching definitions of variables.&lt;&#x2F;li&gt;
&lt;li&gt;LICM algorithm: Finding loop invariant code and moving it into a pre-header outside the loop.&lt;&#x2F;li&gt;
&lt;li&gt;Strength reduction: Using tools like LICM to simplify constant expressions, and analyzing induction variables that can be computed using cheaper operations.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;finding-loops&quot;&gt;Finding loops&lt;&#x2F;h3&gt;
&lt;p&gt;All loops are built around back edges. We started by finding all pairs of &lt;code&gt;[tail, head]&lt;&#x2F;code&gt; such that there exists an edge &lt;code&gt;tail-&amp;gt;head&lt;&#x2F;code&gt; where &lt;code&gt;head&lt;&#x2F;code&gt; dominates &lt;code&gt;tail&lt;&#x2F;code&gt;. This is demonstrated by the pseudocode below:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;node &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;cfg:
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;dominator &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;dom[node]: &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# dom[node] is an array of dominators of the node
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;dominator &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;successor[node]: &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# successor[node] is an array of successors of the node
      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;back_edges.append([node,dominator]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#storing the backedge pair of [tail,head] into the list of backedges
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
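&lt;p&gt;A runnable version of the pseudocode above might look like the following sketch.  It assumes the CFG is given as a dictionary mapping every block name to a list of its successors; the function names &lt;code&gt;dominators&lt;&#x2F;code&gt; and &lt;code&gt;find_back_edges&lt;&#x2F;code&gt; and the example block names are made up here and are not taken from our implementation.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def dominators(succ, entry):
    """Map each node to the set of nodes that dominate it.

    succ maps every node to a list of its successors. Uses the
    standard iterative algorithm: dom[n] is n itself plus the
    intersection of the dominator sets of all predecessors of n.
    """
    nodes = set(succ)
    preds = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            preds[t].add(n)

    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = set.intersection(*pred_doms) if pred_doms else set()
            new = new | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def find_back_edges(succ, dom):
    """Return all (tail, head) edges where head dominates tail."""
    back_edges = []
    for tail in succ:
        for head in succ[tail]:
            if head in dom[tail]:
                back_edges.append((tail, head))
    return back_edges


# Example CFG with a single loop between "head" and "body".
example_cfg = {
    "entry": ["head"],
    "head": ["body", "exit"],
    "body": ["head"],
    "exit": [],
}
example_back_edges = find_back_edges(example_cfg, dominators(example_cfg, "entry"))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For the example CFG, the only edge whose target dominates its source is the one from &lt;code&gt;body&lt;&#x2F;code&gt; to &lt;code&gt;head&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;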
&lt;p&gt;Assuming this is a reducible CFG, all back edges would be associated with a natural loop. This assumption makes it easier to find loops associated with each back edge. Hence, we&#x27;d like to associate a list of nodes (basic blocks) which form a loop corresponding to a back edge pair from the previous step.&lt;&#x2F;p&gt;
&lt;p&gt;For a back edge &lt;code&gt;A-&amp;gt;B&lt;&#x2F;code&gt; we know that &lt;code&gt;B&lt;&#x2F;code&gt; dominates &lt;code&gt;A&lt;&#x2F;code&gt;, hence all paths to &lt;code&gt;A&lt;&#x2F;code&gt; go through &lt;code&gt;B&lt;&#x2F;code&gt;. So we can find predecessors of &lt;code&gt;A&lt;&#x2F;code&gt; recursively until we hit &lt;code&gt;B&lt;&#x2F;code&gt; and include them in the loop. This is the smallest set of nodes &lt;code&gt;L&lt;&#x2F;code&gt; that includes &lt;code&gt;A&lt;&#x2F;code&gt; and &lt;code&gt;B&lt;&#x2F;code&gt; such that for every node &lt;code&gt;n&lt;&#x2F;code&gt; in &lt;code&gt;L&lt;&#x2F;code&gt;: &lt;code&gt;preds(n)&lt;&#x2F;code&gt; $\subseteq$ &lt;code&gt;L&lt;&#x2F;code&gt; or &lt;code&gt;n = B&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
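&lt;p&gt;One possible way to write the recursive predecessor search is sketched below.  The extra &lt;code&gt;preds&lt;&#x2F;code&gt; argument (a dictionary mapping each block to a list of its predecessors) and the name &lt;code&gt;natural_loops&lt;&#x2F;code&gt; are assumptions made to keep the example self-contained; the actual helper may be organized differently.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def find_pred(node, natural_loop, explored, preds):
    """Add node and, recursively, its predecessors to natural_loop.

    The search stops at nodes that are already explored. The loop
    head is marked as explored up front, so the search never walks
    past it.
    """
    if node in explored:
        return
    explored.add(node)
    natural_loop.append(node)
    for pred in preds[node]:
        find_pred(pred, natural_loop, explored, preds)


def natural_loops(back_edges, preds):
    """Return one list of loop nodes per (tail, head) back edge."""
    all_loops = []
    for tail, head in back_edges:
        natural_loop = [head]
        explored = {head}
        find_pred(tail, natural_loop, explored, preds)
        all_loops.append(natural_loop)
    return all_loops
```
&lt;&#x2F;pre&gt;
&lt;p&gt;A back edge from a block to itself yields a loop containing just that block, since the head is already marked as explored when the search starts.&lt;&#x2F;p&gt;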
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;all_loops &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[]
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;A,B &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;back_edges:
  natural_loop&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[B]
  explored&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[B] &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# won&amp;#39;t find predecessors of explored nodes
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;find_pred(A,natural_loop,explored) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# add predecessors and current node to natural_loop list
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;all_loops.append(natural_loop)
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;all_loops
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;reaching-definition&quot;&gt;Reaching Definition&lt;&#x2F;h3&gt;
&lt;p&gt;To find loop invariant code, we look at the reaching definitions of each argument and check whether they are outside the loop. The reaching definition problem was discussed in the lecture of CS 6120. For each block we can generate a list of reaching inputs and outputs. The data structure that we chose to represent this is a dictionary of dictionaries. We decided to store the block number associated with the variable in each block. Hence the reaching definitions of each block form a dictionary of variables (as keys) and the block numbers they were defined in (as values). This helps in keeping track of reaching definitions outside the loop to identify loop invariant code.  The reaching definitions look something like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;reaching_defs = {blocks:{variable:[list of block numbers]}}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Instead of the usual:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;reaching_defs = {blocks:[list of variables]}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The reaching definition problem can be solved using the worklist algorithm after defining merge and the transfer function:&lt;&#x2F;p&gt;
&lt;img src=&quot;reaching_defs.png&quot; alt=&quot;reaching definition lecture&quot; style=&quot;zoom:60%;&quot; &#x2F;&gt;
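&lt;p&gt;With the dictionary-of-lists representation, the worklist iteration could be sketched as follows.  This is an illustrative sketch, not our exact code: the function names and the inputs &lt;code&gt;blocks&lt;&#x2F;code&gt;, &lt;code&gt;preds&lt;&#x2F;code&gt;, &lt;code&gt;succs&lt;&#x2F;code&gt; and &lt;code&gt;defs&lt;&#x2F;code&gt; are assumptions, and definitions are tracked per block, so a definition of a variable in a block replaces every earlier definition of that variable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def merge(dicts):
    """Union per-variable lists of defining blocks from several predecessors."""
    out = {}
    for d in dicts:
        for var, def_blocks in d.items():
            out[var] = sorted(set(out.get(var, [])) | set(def_blocks))
    return out


def transfer(block, defined_vars, in_defs):
    """A definition in this block kills all other definitions of that variable."""
    out = {var: list(def_blocks) for var, def_blocks in in_defs.items()}
    for var in defined_vars:
        out[var] = [block]
    return out


def reaching_definitions(blocks, preds, succs, defs):
    """Forward worklist algorithm.

    blocks: list of block numbers
    preds, succs: block number to list of predecessor or successor blocks
    defs: block number to list of variables defined in that block
    Returns (in_defs, out_defs), each shaped {block: {variable: [block numbers]}}.
    """
    in_defs = {b: {} for b in blocks}
    out_defs = {b: {} for b in blocks}
    worklist = list(blocks)
    while worklist:
        b = worklist.pop()
        in_defs[b] = merge([out_defs[p] for p in preds[b]])
        new_out = transfer(b, defs[b], in_defs[b])
        if new_out != out_defs[b]:
            out_defs[b] = new_out
            worklist.extend(succs[b])  # successors must be revisited
    return in_defs, out_defs
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For a loop where block 0 defines &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;i&lt;&#x2F;code&gt; and a loop body in block 2 redefines &lt;code&gt;i&lt;&#x2F;code&gt;, the loop header ends up with definitions of &lt;code&gt;i&lt;&#x2F;code&gt; from both blocks 0 and 2, but a definition of &lt;code&gt;x&lt;&#x2F;code&gt; only from block 0, which is outside the loop.&lt;&#x2F;p&gt;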
&lt;p&gt;The union function for our data structure is more nuanced than a simple union of sets of variables. In case of multiple definitions of a variable from predecessor blocks we union the list of block numbers associated with the variable. So if block 1 and block 2 are the inputs for the current block we merge (take union) the lists corresponding to each variable in these blocks. This way we keep track of the block numbers of definitions of each variable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
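&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;merge(dicts): &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# dicts: reaching definitions of the predecessor blocks (function name assumed)
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;out &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{}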
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;dicts:
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;k,v &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;s.items():
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;k &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;not in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;out:
                out.update({k:v}) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# add a reaching definition of a variable &amp;#39;k&amp;#39; if not already present
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;:
                out[k] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;list&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(out[k])&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;|&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;set&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(v)) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# take union of lists of block numbers for a  variable &amp;#39;k&amp;#39;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;out
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;finding-loop-invariant-code&quot;&gt;Finding Loop Invariant Code&lt;&#x2F;h3&gt;
&lt;p&gt;A loop invariant instruction is one whose result does not change during the execution of the loop. We need to go through individual loops, individual blocks in each loop and finally individual instructions in each block to check whether an instruction is loop invariant. An instruction is loop invariant when each of its arguments is either a constant or a variable &lt;code&gt;var&lt;&#x2F;code&gt; satisfying one of the following:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;All reaching definitions of &lt;code&gt;var&lt;&#x2F;code&gt; are outside the loop.&lt;&#x2F;li&gt;
&lt;li&gt;There is exactly one reaching definition of &lt;code&gt;var&lt;&#x2F;code&gt; and the definition is loop-invariant.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;These two conditions are realized by the following code.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;rd &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;reach_def[block][var] &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#var is defined at rd block
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;c1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;all&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;([x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;not in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops[loop] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;rd ]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#all rd blocks outside the loop
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;li &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loop_invariants[loop].get(rd[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#None or LIs in rd block
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;li &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;li &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;is &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;None &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;li
c2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;len&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(rd)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;==&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;and &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;any&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;([var &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;li]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#one reaching definition and var is defined as LI (matches one of dest in LIs in rd block).
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;create-pre-headers-of-loop-headers&quot;&gt;Create Pre-Headers of Loop Headers&lt;&#x2F;h3&gt;
&lt;p&gt;Before we actually move code, we need to create pre-headers for loop headers. These pre-headers are empty blocks that should be placed before loop header blocks. Notice this assumes Bril code loops have only one entry. Using these empty blocks, we can easily move loop invariants out of the loop when the requirements are satisfied. &lt;&#x2F;p&gt;
&lt;p&gt;Also note that pre-headers are usually created so that, even in a loop with multiple entries, the loop header has only a single predecessor where the compiler can put pre-loop code. In our implementation, the pre-header is not strictly necessary; it is a design choice.&lt;&#x2F;p&gt;
&lt;p&gt;For each block, we first copy the old block&#x27;s content and then check whether the next block is a loop header. If so, we create an empty block.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;edge &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops:&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#we use back edge as key to denote loop
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b_names[i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;edge[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]: &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#edge[1] is the loop header block name
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;name &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;fresh(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;b&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, new_blocks) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# generate block name that is never used before
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;new_blocks[name] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[]
        pre_header &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{x:name &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops[edge]}
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;break
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;img src=&quot;move_LI.png&quot; alt=&quot;reaching definition lecture&quot; style=&quot;zoom:60%;&quot; &#x2F;&gt;
&lt;h3 id=&quot;move-loop-invariant-code-to-pre-headers&quot;&gt;Move Loop Invariant Code to Pre-Headers&lt;&#x2F;h3&gt;
&lt;p&gt;Not all loop invariants are allowed to be moved to the pre-headers. If the destination of a loop-invariant instruction (LI) is &lt;code&gt;d&lt;&#x2F;code&gt;, it needs to satisfy the following conditions:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;There is only one definition of &lt;code&gt;d&lt;&#x2F;code&gt; in the loop.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;d&lt;&#x2F;code&gt; dominates all its uses, or equivalently, &lt;code&gt;d&lt;&#x2F;code&gt; is not live-out of its pre-header.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;d&lt;&#x2F;code&gt;&#x27;s block dominates all loop exits where &lt;code&gt;d&lt;&#x2F;code&gt; is live-out.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;To check the first condition, we collect all definitions inside the loop into the list &lt;code&gt;defs&lt;&#x2F;code&gt; and check that &lt;code&gt;d&lt;&#x2F;code&gt; appears in it exactly once.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;defs &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[ins.get(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops[back_edge] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ins &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;blocks[b] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ins.get(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)]
defs.count(instr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# if true, first check passed
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
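&lt;p&gt;As a self-contained illustration of this first check, here is a minimal sketch. The helper name &lt;code&gt;defined_once_in_loop&lt;&#x2F;code&gt; and the tiny two-block loop are hypothetical and only stand in for our actual data structures; instructions are assumed to be Bril-style dictionaries with an optional &lt;code&gt;dest&lt;&#x2F;code&gt; key.&lt;&#x2F;p&gt;

```python
def defined_once_in_loop(dest, loop_blocks, blocks):
    """Return True when `dest` is assigned exactly once inside the loop."""
    # Collect the destination of every instruction in the loop that has one.
    defs = [
        ins["dest"]
        for name in loop_blocks
        for ins in blocks[name]
        if "dest" in ins
    ]
    return defs.count(dest) == 1


# Hypothetical loop made of two blocks: `x` is assigned once, `i` twice.
blocks = {
    "header": [
        {"op": "const", "dest": "x", "value": 1},
        {"op": "add", "dest": "i", "args": ["i", "x"]},
    ],
    "body": [
        {"op": "add", "dest": "i", "args": ["i", "x"]},
        {"op": "print", "args": ["i"]},
    ],
}
loop_blocks = ["header", "body"]

x_is_unique = defined_once_in_loop("x", loop_blocks, blocks)
i_is_unique = defined_once_in_loop("i", loop_blocks, blocks)
```

&lt;p&gt;In this sketch only &lt;code&gt;x&lt;&#x2F;code&gt; passes the check; &lt;code&gt;i&lt;&#x2F;code&gt; is rejected because it is redefined in both blocks.&lt;&#x2F;p&gt;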
&lt;p&gt;For the second condition, we look at the predecessor block of the pre-header we just added, which we find by reading the index of the pre-header and subtracting 1. Then we check whether &lt;code&gt;d&lt;&#x2F;code&gt; is live-out of that block.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;ind &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b_names.index(pre_header[b_name]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#b_name is the name of block where d is LI.
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;not in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;live_var[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;][b_names[ind]] &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;#if true, second check passed
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For the third condition, we need to know which blocks are exit blocks. The exits are blocks that have successors not in the loop. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;exits &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;k &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops:
    exits[k] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[]
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;l &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops[k]: 
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;any&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;( s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;not in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;loops[k] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;succ[l]):
            exits[k].append(l)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
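&lt;p&gt;To make the exit computation concrete, the sketch below wraps the same idea in a function and runs it on a small control-flow graph. The function name &lt;code&gt;loop_exits&lt;&#x2F;code&gt; and the block names are assumptions made for illustration; as in our code, loops are keyed by their back edge and &lt;code&gt;succ&lt;&#x2F;code&gt; maps a block name to its successors.&lt;&#x2F;p&gt;

```python
def loop_exits(loops, succ):
    """Map each loop (keyed by its back edge) to the blocks that can leave it."""
    exits = {}
    for back_edge, members in loops.items():
        # A block is an exit if at least one successor lies outside the loop.
        exits[back_edge] = [
            block for block in members
            if any(s not in members for s in succ[block])
        ]
    return exits


# Hypothetical CFG: entry -> header -> body -> header, and header -> end.
succ = {
    "entry": ["header"],
    "header": ["body", "end"],
    "body": ["header"],
    "end": [],
}
loops = {("body", "header"): ["header", "body"]}

exits = loop_exits(loops, succ)
```

&lt;p&gt;Here the only exit is &lt;code&gt;header&lt;&#x2F;code&gt;, since it is the only loop block with a successor (&lt;code&gt;end&lt;&#x2F;code&gt;) outside the loop.&lt;&#x2F;p&gt;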
&lt;p&gt;After that, we just need to find all exit blocks where &lt;code&gt;d&lt;&#x2F;code&gt; is live-out and check that &lt;code&gt;d&lt;&#x2F;code&gt;&#x27;s block dominates each of them.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;edest &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[e &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;e &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;exits[back_edge] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;live_var[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;][e]]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;all&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;([b_name &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;dom[e] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;e &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;edest]) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# if true, third check passed.
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;block-to-dictionaries&quot;&gt;Block to Dictionaries&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;code&gt;json.load(sys.stdin)[&#x27;functions&#x27;]&lt;&#x2F;code&gt; gives us dictionaries, and in each dictionary the key &lt;code&gt;instrs&lt;&#x2F;code&gt; holds a list of instructions. We turn this list of instructions into a dictionary of blocks, and we would like to reverse the process to regenerate a list of instructions from the modified blocks. However, the original list of instructions does not contain the labels we introduced when generating block names and creating pre-headers. For example, the main function does not start with a label, but to form a block we generate the label &lt;code&gt;b0&lt;&#x2F;code&gt; for it. Similarly, when generating a pre-header, we might have inserted a block labeled &lt;code&gt;b1&lt;&#x2F;code&gt;. Luckily, the modified blocks are still in an ordered dictionary. Blocks that had no label at first are safe to leave unlabeled, and for pre-header blocks we assume they have only one in-edge and one out-edge.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, in this step we only create a &lt;code&gt;label&lt;&#x2F;code&gt; instruction when the block&#x27;s name is one of the original labels. The rest is just copying every instruction other than labels into the new list of instructions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;block &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;blocks:
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;block &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;label_name:
        new_instrs.append({&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;label&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;:block})
    new_instrs &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;new_instrs&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;blocks[block]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;strength-reduction&quot;&gt;Strength Reduction&lt;&#x2F;h3&gt;
&lt;p&gt;Strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations. A typical example is converting relatively expensive multiplications inside a loop &lt;code&gt;L&lt;&#x2F;code&gt; into cheaper additions. Here we are mostly interested in two kinds of values:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;loop invariant code: values that do not change within the body of a loop (as we have already discussed previously)&lt;&#x2F;li&gt;
&lt;li&gt;induction variables: values that are updated in a regular way each time through the loop&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;An induction variable is either:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;a basic induction variable &lt;code&gt;B&lt;&#x2F;code&gt;: a variable &lt;code&gt;B&lt;&#x2F;code&gt; whose only definitions within the loop are assignments of the form: &lt;code&gt;B = B + c&lt;&#x2F;code&gt; or &lt;code&gt;B = B - c&lt;&#x2F;code&gt;, where c is either a constant or a loop-invariant variable, or&lt;&#x2F;li&gt;
&lt;li&gt;a variable defined once within the loop, whose value is a linear function of some basic induction variable at the time of the definition: &lt;code&gt;A = c1 * B + c2&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The procedure of performing strength reduction is as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Create new variable: &lt;code&gt;A&#x27;&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Initialize in pre-header: &lt;code&gt;A&#x27; = c1 * B + c2&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Track value of B: add after &lt;code&gt;B = B + x&lt;&#x2F;code&gt;: &lt;code&gt;A&#x27; = A&#x27; + x * c1&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Replace assignment to A: replace lone &lt;code&gt;A = ...&lt;&#x2F;code&gt; with &lt;code&gt;A = A&#x27;&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Thus, the key idea is to first find each induction variable &lt;code&gt;A&lt;&#x2F;code&gt; and then replace its definition inside the loop.&lt;&#x2F;p&gt;
&lt;p&gt;To find out each induction variable, we scan through the code to&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;find out all the basic induction variables &lt;code&gt;B&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;find out all induction variables &lt;code&gt;A&lt;&#x2F;code&gt; in the family of &lt;code&gt;B&lt;&#x2F;code&gt;, where &lt;code&gt;A&lt;&#x2F;code&gt; is defined in terms of the value of &lt;code&gt;B&lt;&#x2F;code&gt; at the time of the definition&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Each such &lt;code&gt;A&lt;&#x2F;code&gt; must satisfy one of the following conditions:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;A&lt;&#x2F;code&gt; has a single assignment in the loop &lt;code&gt;L&lt;&#x2F;code&gt; of the form &lt;code&gt;A = B * c&lt;&#x2F;code&gt; | &lt;code&gt;A = c * B&lt;&#x2F;code&gt; | &lt;code&gt;A = B &#x2F; c&lt;&#x2F;code&gt; | &lt;code&gt;A = B + c&lt;&#x2F;code&gt; | &lt;code&gt;A = c + B&lt;&#x2F;code&gt; | &lt;code&gt;A = B - c&lt;&#x2F;code&gt; | &lt;code&gt;A = c - B&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;A&lt;&#x2F;code&gt; has a single assignment in the loop &lt;code&gt;L&lt;&#x2F;code&gt; of the form &lt;code&gt;A = D * c&lt;&#x2F;code&gt; | &lt;code&gt;A = c * D&lt;&#x2F;code&gt; | &lt;code&gt;A = D &#x2F; c&lt;&#x2F;code&gt; | &lt;code&gt;A = D + c&lt;&#x2F;code&gt; | &lt;code&gt;A = c + D&lt;&#x2F;code&gt; | &lt;code&gt;A = D - c&lt;&#x2F;code&gt; | &lt;code&gt;A = c - D&lt;&#x2F;code&gt;, where &lt;code&gt;D&lt;&#x2F;code&gt; is an induction variable in the family of &lt;code&gt;B&lt;&#x2F;code&gt;. In addition, no definition of &lt;code&gt;D&lt;&#x2F;code&gt; outside &lt;code&gt;L&lt;&#x2F;code&gt; reaches the assignment of &lt;code&gt;A&lt;&#x2F;code&gt;, and every path between the point of assignment to &lt;code&gt;D&lt;&#x2F;code&gt; in &lt;code&gt;L&lt;&#x2F;code&gt; and the assignment to &lt;code&gt;A&lt;&#x2F;code&gt; has the same sequence (possibly empty) of definitions of &lt;code&gt;B&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;After all induction variables are found, strength reduction is performed to add new initialization to the variable and reduce multiplications to additions following the procedure we have described above.&lt;&#x2F;p&gt;
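&lt;p&gt;The four-step procedure can be checked on a small example. The sketch below is not our optimizer; it is a hand-written before-and-after pair in plain Python, with hypothetical function names, that computes &lt;code&gt;A = c1 * B + c2&lt;&#x2F;code&gt; once with a multiplication in the loop and once with the tracked variable &lt;code&gt;A&#x27;&lt;&#x2F;code&gt;. Hoisting the product &lt;code&gt;x * c1&lt;&#x2F;code&gt; out of the loop is an extra assumption that relies on both values being loop invariant.&lt;&#x2F;p&gt;

```python
def with_multiplication(n, c1, c2, step):
    """Recompute A = c1 * B + c2 with a multiplication on every iteration."""
    values = []
    b = 0
    for _ in range(n):
        a = c1 * b + c2
        values.append(a)
        b = b + step
    return values


def with_strength_reduction(n, c1, c2, step):
    """Produce the same values, keeping A' in sync with B using additions only."""
    values = []
    b = 0
    a_prime = c1 * b + c2          # step 2: initialize A' in the pre-header
    delta = step * c1              # x * c1 is loop invariant, so compute it once
    for _ in range(n):
        a = a_prime                # step 4: the lone assignment becomes A = A'
        values.append(a)
        b = b + step
        a_prime = a_prime + delta  # step 3: update A' right after B = B + x
    return values


original = with_multiplication(5, 3, 2, 4)
reduced = with_strength_reduction(5, 3, 2, 4)
```

&lt;p&gt;Both versions yield the same sequence, but the second one performs no multiplication inside the loop body.&lt;&#x2F;p&gt;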
&lt;h1 id=&quot;challenges&quot;&gt;Challenges&lt;&#x2F;h1&gt;
&lt;ol&gt;
&lt;li&gt;We needed more program properties than we originally expected. At first, we only generated loops and reaching definitions. Then, for loop invariant code motion, we also needed the exits of the loop, dominance relations, and live variables.&lt;&#x2F;li&gt;
&lt;li&gt;The representations of several data structures were chosen arbitrarily at first, and code had to be reimplemented once we finalized them. For example, the loop invariant code was at first stored as a list of instructions. Later, we found it necessary to change the storage form to a dictionary whose key is the block name. Otherwise, we would need to search for the block that each instruction belongs to, e.g., in the &lt;code&gt;move_LI&lt;&#x2F;code&gt; function.&lt;&#x2F;li&gt;
&lt;li&gt;Basic induction variables and the variables in their families are more similar than we expected, and it is sometimes tricky to differentiate them. Thus, the definition flow of each induction variable is maintained to tell them apart.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;We would like to evaluate our optimizer in two aspects: theoretical improvement and actual wall-clock speedup.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;&#x2F;h3&gt;
&lt;p&gt;Since strength reduction and loop invariant code motion are independent optimizations, we tested them individually using &lt;code&gt;codemotion*.ts&lt;&#x2F;code&gt; and &lt;code&gt;strengthreduction*.ts&lt;&#x2F;code&gt;, and together using &lt;code&gt;both*.ts&lt;&#x2F;code&gt;. We used TypeScript to generate Bril code since it is a higher-level language and makes it easier to write more complex code with, say, &lt;code&gt;for&lt;&#x2F;code&gt; loops.&lt;&#x2F;p&gt;
&lt;p&gt;However, Bril code generated from TypeScript always contains redundancy introduced during translation, so some code motion will be performed by default. For example, &lt;code&gt;v10: int = const 5; v11: int = v11 + v10;&lt;&#x2F;code&gt; can be found in the body of a loop if there is a line &lt;code&gt;x = x + 5&lt;&#x2F;code&gt; in TypeScript. Because we don&#x27;t perform other optimizations, we wondered how much of the speedup is contributed by this redundancy. We therefore wrote test cases with no expected optimizations (although some appear once translated to Bril) as a baseline for comparing actual wall-clock speedup.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Filename&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;normal.ts&lt;&#x2F;code&gt;, &lt;code&gt;nested.ts&lt;&#x2F;code&gt;, &lt;code&gt;nestedif.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;No optimization expected.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;codemotion*.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Expecting code motion.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;strengthreduction*.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Expecting strength reduction optimizations.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;both*.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Expecting both kinds of optimizations.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;implicit-redundancy&quot;&gt;Implicit Redundancy&lt;&#x2F;h3&gt;
&lt;p&gt;Bril scripts generated from TypeScript introduce some redundant operations: usually an &lt;code&gt;id&lt;&#x2F;code&gt; or a &lt;code&gt;const&lt;&#x2F;code&gt;. Inside a loop, these are usually loop invariant code that gets moved out by our optimizer. However, we were not sure whether this would contribute to our speedup. To make sure that the effect is actually negligible, we benchmarked similar programs without explicit loop invariant code in the loop to see the speedups (if any).&lt;&#x2F;p&gt;
&lt;p&gt;We ran &lt;code&gt;normal.ts&lt;&#x2F;code&gt;, &lt;code&gt;nested.ts&lt;&#x2F;code&gt; and &lt;code&gt;nestedif.ts&lt;&#x2F;code&gt; to this end and got the following results:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Filename&lt;&#x2F;th&gt;&lt;th&gt;Unoptimized runtime&lt;&#x2F;th&gt;&lt;th&gt;Optimized runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;normal.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;193.4 ms&lt;&#x2F;td&gt;&lt;td&gt;193.3 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;nested.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;327.7 ms&lt;&#x2F;td&gt;&lt;td&gt;330 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;nestedif.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;280.4 ms&lt;&#x2F;td&gt;&lt;td&gt;276.8 ms&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;We observe that the runtimes do not change by a large margin, so this effect can be ignored.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;loop-invariant-code-motion-examples&quot;&gt;Loop Invariant Code Motion Examples&lt;&#x2F;h3&gt;
&lt;p&gt;The program &lt;code&gt;codemotion1.ts&lt;&#x2F;code&gt; has the following loop code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let a = 8;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;let x = 0;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;let y = 8;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;let z = 1;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;let n = 100000;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;for (let i = n; i &amp;gt; 0; i = i - 1) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;    x = y + z;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;    a = a + x * x;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since neither &lt;code&gt;y&lt;&#x2F;code&gt; nor &lt;code&gt;z&lt;&#x2F;code&gt; changes inside the loop, the computation of &lt;code&gt;x&lt;&#x2F;code&gt; in each iteration is redundant and can be moved outside the loop. The optimized Bril code does perform this computation outside the loop:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;v10: int = id y;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v11: int = id z;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v12: int = add v10 v11;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x: int = id v12; #x=y+z
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and the loop body is reduced to:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for.body.5:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v13: int = id a;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v14: int = id x;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v15: int = id x;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v16: int = mul v14 v15; #using the updated x for x*x
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v17: int = add v13 v16; #a=a+x*x
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a: int = id v17;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Assuming the cost of &lt;code&gt;x=y+z&lt;&#x2F;code&gt; is &lt;code&gt;c&lt;&#x2F;code&gt;, we reduce the computation in the program by &lt;code&gt;(n-1)*c&lt;&#x2F;code&gt;. To benchmark this, we timed our optimized version using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sharkdp&#x2F;hyperfine&quot;&gt;hyperfine&lt;&#x2F;a&gt;, which lets us set warmup runs (for warming up the cache) and the number of execution runs to perform (10 in our case):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; opt_codemotion&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     163.8 ms ±   3.2 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The unoptimized version of the code yields:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; orig_codemotion&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     216.8 ms ±   4.5 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
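&lt;p&gt;To put the two means side by side, the short sketch below computes the speedup ratio from the numbers reported above. The helper &lt;code&gt;speedup&lt;&#x2F;code&gt; is a hypothetical name, and the ratio ignores the reported standard deviations, so it should be read as a rough figure.&lt;&#x2F;p&gt;

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of unoptimized to optimized mean runtime."""
    return baseline_ms / optimized_ms


# Mean runtimes reported by hyperfine for codemotion1 (in milliseconds).
codemotion1_speedup = speedup(216.8, 163.8)
print(f"codemotion1 speedup: {codemotion1_speedup:.2f}x")
```

&lt;p&gt;This works out to roughly a 1.32x speedup for &lt;code&gt;codemotion1.ts&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;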
&lt;p&gt;Similarly, &lt;code&gt;codemotion2.ts&lt;&#x2F;code&gt; has nested loops where invariant code can be moved from inner loops to outer ones.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; orig_codemotion2&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     908.6 ms ±  14.8 ms
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; opt_codemotion2&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     665.4 ms ±  11.9 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;strength-reduction-examples&quot;&gt;Strength Reduction Examples&lt;&#x2F;h3&gt;
&lt;p&gt;The program &lt;code&gt;strengthreduction1.ts&lt;&#x2F;code&gt; has the following loop code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let a = 8;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;let n = 10;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;for (let i = n; i &amp;gt; 0; i = i - 1) {
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;    a = a*i;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since both &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;i&lt;&#x2F;code&gt; are induction variables and &lt;code&gt;i&lt;&#x2F;code&gt; is the basic induction variable, the multiplication of &lt;code&gt;a&lt;&#x2F;code&gt; can be reduced to addition by defining an &lt;code&gt;a&#x27;&lt;&#x2F;code&gt; variable outside the loop. The resulting code will act like &lt;code&gt;a = a + a - 8&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
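&lt;p&gt;For reference, a textbook instance of strength reduction on a derived induction variable is sketched below. It is a generic illustration rather than the output of our pass on the program above; the function names and the constant &lt;code&gt;4&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; Before: one multiplication per iteration.
function scaledSlow(n: number): number[] {
  const out: number[] = [];
  for (let i = 0; i &amp;lt; n; i = i + 1) {
    out.push(4 * i); &#x2F;&#x2F; derived induction variable j = 4 * i
  }
  return out;
}

&#x2F;&#x2F; After: j lives in its own variable and is updated by addition.
function scaledFast(n: number): number[] {
  const out: number[] = [];
  let j = 0; &#x2F;&#x2F; equals 4 * i at the top of every iteration
  for (let i = 0; i &amp;lt; n; i = i + 1) {
    out.push(j);
    j = j + 4;
  }
  return out;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both functions return the same array; the second trades the multiplication in the loop body for an addition.&lt;&#x2F;p&gt;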
&lt;p&gt;Assume the cost of &lt;code&gt;x = y*z&lt;&#x2F;code&gt; is &lt;code&gt;c1&lt;&#x2F;code&gt; and that of &lt;code&gt;x = y+z&lt;&#x2F;code&gt; is &lt;code&gt;c2&lt;&#x2F;code&gt;. We reduce the computation in the program by &lt;code&gt;n*(c1-c2) - c1&lt;&#x2F;code&gt;. To benchmark this, we timed our optimized version in the same way as for loop invariant code motion. The results are:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; orig_strengthreduction1&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     192.1 ms ±   7.2 ms
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; opt_strengthreduction1&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     173.2 ms ±   10.3 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Similarly, &lt;code&gt;strengthreduction2.ts&lt;&#x2F;code&gt; contains a division that can be reduced.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; orig_strengthreduction2&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     61.1 ms ±   2.4 ms
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; opt_strengthreduction2&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     56.3 ms ±   5.9 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;both-optimizations-examples&quot;&gt;Both Optimizations Examples&lt;&#x2F;h3&gt;
&lt;p&gt;To combine both loop optimizations, namely loop invariant code motion and strength reduction, we run tests &lt;code&gt;both1&lt;&#x2F;code&gt; and &lt;code&gt;both2&lt;&#x2F;code&gt; following the same evaluation method. The results are as follows.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; orig_both1&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     655.0 ms ±   8.6 ms
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; opt_both1&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     484.1 ms ±   13.7 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; orig_both2&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     66.2 ms ±   3.3 ms
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;hyperfine --warmup 1 &amp;#39;brili &amp;lt; opt_both2&amp;#39;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Time (mean ± σ):     49.8 ms ±   5.9 ms
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;final-results&quot;&gt;Final Results&lt;&#x2F;h3&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Filename&lt;&#x2F;th&gt;&lt;th&gt;Unoptimized runtime&lt;&#x2F;th&gt;&lt;th&gt;Optimized runtime&lt;&#x2F;th&gt;&lt;th&gt;Speedup&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;codemotion1.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;216.8 ms&lt;&#x2F;td&gt;&lt;td&gt;163.8 ms&lt;&#x2F;td&gt;&lt;td&gt;1.32x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;codemotion2.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;908.6 ms&lt;&#x2F;td&gt;&lt;td&gt;665.4 ms&lt;&#x2F;td&gt;&lt;td&gt;1.36x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;strengthreduction1.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;192.1 ms&lt;&#x2F;td&gt;&lt;td&gt;173.2 ms&lt;&#x2F;td&gt;&lt;td&gt;1.11x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;strengthreduction2.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;61.1 ms&lt;&#x2F;td&gt;&lt;td&gt;56.3 ms&lt;&#x2F;td&gt;&lt;td&gt;1.09x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;both1.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;655.0 ms&lt;&#x2F;td&gt;&lt;td&gt;484.1 ms&lt;&#x2F;td&gt;&lt;td&gt;1.35x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;both2.ts&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;66.2 ms&lt;&#x2F;td&gt;&lt;td&gt;49.8 ms&lt;&#x2F;td&gt;&lt;td&gt;1.33x&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Based on the evaluation results, we find that both loop optimization techniques provide some speedup. Loop invariant code motion provides up to a 1.36x speedup, while strength reduction yields only about a 10% performance improvement. This is because multiplication&#x2F;division does not take significantly more cycles than addition&#x2F;subtraction on modern machines, and because many redundant &lt;code&gt;id&lt;&#x2F;code&gt; operations are generated when translating &lt;code&gt;ts&lt;&#x2F;code&gt; programs to &lt;code&gt;bril&lt;&#x2F;code&gt;. Thus, only a limited performance improvement can be obtained. We expect that more speedup could be obtained if exponentiation were optimized. Also, note that &lt;code&gt;codemotion2.ts&lt;&#x2F;code&gt; and &lt;code&gt;both1.ts&lt;&#x2F;code&gt; contain nested loops, so their runtimes are much longer than the rest.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>An Efficient Implementation of Self</title>
                <pubDate>Mon, 28 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/self/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/self/</guid>
                <description>&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=74884&quot;&gt;An Efficient Implementation of Self, a Dynamically-Typed Object-Oriented
Language Based on Prototypes&lt;&#x2F;a&gt; presents techniques for runtime
compilation, now more commonly referred to as just-in-time (JIT) compilation,
for the &lt;a href=&quot;http:&#x2F;&#x2F;www.selflanguage.org&#x2F;&quot;&gt;Self&lt;&#x2F;a&gt; programming language.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;We have developed and implemented techniques that double the performance of
the dynamically-typed object oriented languages.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Due to the lack of a compiler name, I refer to the compiler presented in this
paper as &amp;quot;the Self compiler&amp;quot;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-challenge&quot;&gt;The Challenge&lt;&#x2F;h2&gt;
&lt;p&gt;The term &amp;quot;dynamic language&amp;quot; is most commonly associated with modern scripting
languages like Python and JavaScript. Self, a much older language developed
at the famed Xerox PARC labs, takes the philosophy of dynamism to a logical
extreme -- &lt;em&gt;everything&lt;&#x2F;em&gt; in Self is a message to an object. This includes
Java-like method calls on objects &lt;em&gt;as well as&lt;&#x2F;em&gt; control structures like loops
and conditionals.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: Self&#x27;s multi-argument methods have the form:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;( |
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;      IfTrue: trueBlock IfFalse: falseBlock = ( trueBlock value ).
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| )
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;allowing for multiple named arguments. In Self parlance, the method name
is read as &lt;code&gt;IfTrue:IfFalse:&lt;&#x2F;code&gt;, and the method is invoked by placing one argument after
each colon-terminated part of the name. See the &lt;a href=&quot;http:&#x2F;&#x2F;handbook.selflanguage.org&#x2F;4.5&#x2F;langref.html#slot-descriptors&quot;&gt;Self documentation&lt;&#x2F;a&gt; for more information.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;For example, the Self conditional &lt;code&gt;IfTrue:IfFalse:&lt;&#x2F;code&gt; is a method invocation on
the boolean objects &lt;code&gt;true&lt;&#x2F;code&gt; and &lt;code&gt;false&lt;&#x2F;code&gt;. For a C-style conditional like the
following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if (x) 1 else 2
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;the Self invocation will look like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x IfTrue: [ 1 ] IfFalse: [ 2 ]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At run time, this program invokes the method &lt;code&gt;IfTrue:IfFalse:&lt;&#x2F;code&gt;
on the object &lt;code&gt;x&lt;&#x2F;code&gt; and executes the &amp;quot;thunk&amp;quot; (a function with no arguments) corresponding to
the true or the false branch. Note that &lt;code&gt;x&lt;&#x2F;code&gt; is not required to be &lt;code&gt;true&lt;&#x2F;code&gt; or
&lt;code&gt;false&lt;&#x2F;code&gt;: any Self object can define the &lt;code&gt;IfTrue:IfFalse:&lt;&#x2F;code&gt; method and specify its
conditional behavior.&lt;&#x2F;p&gt;
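&lt;p&gt;A rough TypeScript analogue of this dispatch is sketched below. The names &lt;code&gt;Conditional&lt;&#x2F;code&gt;, &lt;code&gt;ifTrueIfFalse&lt;&#x2F;code&gt;, &lt;code&gt;selfTrue&lt;&#x2F;code&gt;, &lt;code&gt;selfFalse&lt;&#x2F;code&gt;, and &lt;code&gt;coin&lt;&#x2F;code&gt; are hypothetical stand-ins rather than Self syntax; the point is only that the receiver decides which thunk runs.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; A thunk is a function with no arguments.
type Thunk = () =&amp;gt; number;

interface Conditional {
  ifTrueIfFalse(trueBlock: Thunk, falseBlock: Thunk): number;
}

&#x2F;&#x2F; The boolean objects simply pick one of the two thunks.
const selfTrue: Conditional = {
  ifTrueIfFalse(trueBlock, falseBlock) { return trueBlock(); },
};
const selfFalse: Conditional = {
  ifTrueIfFalse(trueBlock, falseBlock) { return falseBlock(); },
};

&#x2F;&#x2F; Any other object may define its own conditional behavior.
const coin: Conditional = {
  ifTrueIfFalse(trueBlock, falseBlock) {
    return Math.random() &amp;lt; 0.5 ? trueBlock() : falseBlock();
  },
};

&#x2F;&#x2F; Analogue of: x IfTrue: [ 1 ] IfFalse: [ 2 ]
const x: Conditional = selfTrue;
const result = x.ifTrueIfFalse(() =&amp;gt; 1, () =&amp;gt; 2); &#x2F;&#x2F; 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;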
&lt;p&gt;Considering this dynamism, a Self compiler must:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Respect the semantics of method calls&lt;&#x2F;strong&gt;. Restricting or specializing conditionals and primitives does not follow the spirit of the language.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Provide interactive speed&lt;&#x2F;strong&gt;. Recompiling after a small change is not an option because Self is meant for rapid exploration in a programming environment.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Preserve stack traces&lt;&#x2F;strong&gt;. Self supports extreme reflection and introspection. A programming environment must be debuggable.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Generate fast code&lt;&#x2F;strong&gt;. Interpreting the whole language is probably too slow.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;the-solution&quot;&gt;The Solution&lt;&#x2F;h2&gt;
&lt;p&gt;While the paper goes into the nitty-gritty of object layouts and method
invocations, the essence of the paper can be summarized as:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;When type information for an object is available, generate specialized code
and let inlining and compiler optimizations work their magic.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The ideas in the paper have been widely adopted in modern JITs for dynamic
languages like JavaScript (&lt;a href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2017&#x2F;02&#x2F;a-crash-course-in-just-in-time-jit-compilers&#x2F;&quot;&gt;IonMonkey&lt;&#x2F;a&gt;) as well as seemingly static languages
like Java (&lt;a href=&quot;https:&#x2F;&#x2F;www.oracle.com&#x2F;technetwork&#x2F;java&#x2F;javase&#x2F;tech&#x2F;index-jsp-136373.html&quot;&gt;HotSpot&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;object-layout&quot;&gt;Object Layout&lt;&#x2F;h3&gt;
&lt;p&gt;Self programs use object prototypes to describe inheritance relationships.
Unlike classes, which have &lt;em&gt;constructors&lt;&#x2F;em&gt; used to build &lt;em&gt;instances&lt;&#x2F;em&gt;, Self uses
prototypes, which act as &lt;em&gt;exemplars&lt;&#x2F;em&gt; for other objects. Creating a new object
from a prototype is as simple as &lt;em&gt;cloning&lt;&#x2F;em&gt; it and setting its parent pointer to
the prototype. Changing the common behavior of all clones is as simple as
changing the behavior of a prototype. Since method and field lookups traverse
the parent hierarchy, clones can also override methods and fields of their
prototypes.&lt;&#x2F;p&gt;
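&lt;p&gt;JavaScript&#x27;s prototype chain follows similar lookup rules, so the idea can be sketched in TypeScript with &lt;code&gt;Object.create&lt;&#x2F;code&gt;, which returns a new object whose parent is the given prototype. This is an analogy rather than Self code (in particular, &lt;code&gt;Object.create&lt;&#x2F;code&gt; does not copy any fields), and the &lt;code&gt;point&lt;&#x2F;code&gt; object and its slots are hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; The prototype acts as an exemplar for its clones.
const point = {
  x: 0,
  y: 0,
  describe(): string { return &amp;quot;point at &amp;quot; + this.x + &amp;quot;,&amp;quot; + this.y; },
};

&#x2F;&#x2F; A new object whose parent pointer refers to point.
const p = Object.create(point);
p.x = 3; &#x2F;&#x2F; overrides the field in the clone only

p.describe(); &#x2F;&#x2F; &amp;quot;point at 3,0&amp;quot;: the method is found by walking to the parent

&#x2F;&#x2F; Changing the prototype changes the behavior of every clone
&#x2F;&#x2F; that has not overridden the method itself.
point.describe = () =&amp;gt; &amp;quot;relocated&amp;quot;;
p.describe(); &#x2F;&#x2F; &amp;quot;relocated&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;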
&lt;p&gt;A naive layout scheme for objects would copy all fields from a prototype and
end up wasting a lot of space describing potentially shared behaviors. The Self
compiler minimizes space usage of &lt;em&gt;clones&lt;&#x2F;em&gt; derived from the same prototype by
using &lt;em&gt;clone families&lt;&#x2F;em&gt;. Each clone has an &amp;quot;allocation map&amp;quot; which stores its
modifiable fields and has a pointer to its parent. The allocation map hierarchy
mirrors the inheritance hierarchy in a program.
If the instance ever
overrides one of its parents&#x27; methods, the compiler creates a new clone family to preserve lookup
semantics and propagate behavior changes to all clones of a prototype.&lt;&#x2F;p&gt;
&lt;img src=&quot;without-maps.png&quot; alt=&quot;drawing&quot; width=&quot;300&quot;&#x2F;&gt;
&lt;img src=&quot;with-maps.png&quot; alt=&quot;drawing&quot; width=&quot;300&quot;&#x2F;&gt;
&lt;p&gt;An allocation scheme without maps (left) and with maps (right). The scheme on
the right saves more memory.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bytecode&quot;&gt;Bytecode&lt;&#x2F;h3&gt;
&lt;p&gt;The Self compiler and runtime come with a bytecode format for executing compiled
programs. Instead of executing a structured AST, the compiler defines a set of
core &amp;quot;instructions&amp;quot; for the runtime (more commonly known as a &amp;quot;Virtual Machine&amp;quot;
in Java parlance). Compared to a traditional ISA, the bytecode supports
high-level constructs like message &lt;code&gt;SEND&lt;&#x2F;code&gt; (invoke a method on a receiver).
It is worth noting that the description of the Self bytecode predates the Java
Virtual Machine (JVM) and directly inspired it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;segregation-tagging&quot;&gt;Segregation &amp;amp; Tagging&lt;&#x2F;h3&gt;
&lt;p&gt;Finding object references is one of the most common operations a runtime for an
object oriented (OO) language has to perform. It is used for everything from
garbage collection to reflection in modern OOP. The central challenge in
finding objects is disambiguating pointers from integers since they look almost
identical. A naive approach would be adding header information to pointers
and parsing it at runtime. This incurs the overhead of parsing headers
and dramatically slows down the runtime.&lt;&#x2F;p&gt;
&lt;p&gt;The Self compiler &lt;em&gt;segregates&lt;&#x2F;em&gt; the layout of byte arrays (which can be confused with
object references) and object references. Byte arrays are allocated from the
top of the heap space while object references are allocated from the bottom. The
runtime can then completely skip objects beyond the object reference space
boundary.&lt;&#x2F;p&gt;
&lt;p&gt;The Self compiler also &lt;em&gt;tags&lt;&#x2F;em&gt; integers and floating point numbers to disambiguate
them from references. By default, two MSB bits are used to denote whether a
machine word is an integer, a floating point number, a reference, or a header
for an object. Integers and floating point numbers are directly encoded into the
remaining 30 bits, allowing them to be directly used after performing a single
logical shift.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;customized-compilation&quot;&gt;Customized Compilation&lt;&#x2F;h3&gt;
&lt;p&gt;The rampant dynamism in Self programs forces compilers to conservatively
generate slow code. For example &lt;code&gt;x &amp;lt; y&lt;&#x2F;code&gt; is a method call on the object &lt;code&gt;x&lt;&#x2F;code&gt; and
requires knowing the precise type of &lt;code&gt;x&lt;&#x2F;code&gt; to generate optimized code for &lt;code&gt;&amp;lt;&lt;&#x2F;code&gt;.
The paper mentions that contemporary Smalltalk-80 compilers restricted the
customization of primitive methods and control structures to allow for
specialized code generation. Instead of imposing these restrictions, the Self
compiler will generate specialized code for every &lt;em&gt;receiver&lt;&#x2F;em&gt; object at runtime.
This means that at the first invocation of &lt;code&gt;&amp;lt;&lt;&#x2F;code&gt; on an integer, the compiler will
specialize &lt;code&gt;&amp;lt;&lt;&#x2F;code&gt; with a check that the receiver&#x27;s type is &lt;code&gt;integer&lt;&#x2F;code&gt; and
generate a single compare instruction when this is the case. When the type of the
receiver has not been encountered before, Self will default to a method send
to the object.&lt;&#x2F;p&gt;
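&lt;p&gt;The control flow of this scheme can be approximated as follows. This is only a sketch: &lt;code&gt;lessThan&lt;&#x2F;code&gt;, &lt;code&gt;specialized&lt;&#x2F;code&gt;, and &lt;code&gt;genericSend&lt;&#x2F;code&gt; are hypothetical names, the receiver&#x27;s type is approximated with JavaScript&#x27;s &lt;code&gt;typeof&lt;&#x2F;code&gt;, and a real compiler would emit machine code rather than store closures.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Code = (x: any, y: any) =&amp;gt; boolean;

&#x2F;&#x2F; Specialized code, keyed by the receiver type seen at run time.
const specialized = new Map&amp;lt;string, Code&amp;gt;();

&#x2F;&#x2F; Slow path: a full dynamic send to the receiver.
function genericSend(x: any, y: any): boolean {
  return x.lessThan(y);
}

function lessThan(x: any, y: any): boolean {
  const receiverType = typeof x;
  let code = specialized.get(receiverType);
  if (code === undefined) {
    &#x2F;&#x2F; First time this receiver type is seen: generate code for it.
    code = receiverType === &amp;quot;number&amp;quot;
      ? (a, b) =&amp;gt; a &amp;lt; b &#x2F;&#x2F; stands in for a single compare instruction
      : genericSend;
    specialized.set(receiverType, code);
  }
  return code(x, y);
}

lessThan(1, 2); &#x2F;&#x2F; true, via the specialized compare
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;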
&lt;h3 id=&quot;message-inlining-and-splitting&quot;&gt;Message Inlining and Splitting&lt;&#x2F;h3&gt;
&lt;p&gt;Inlining remains one of the most crucial optimizations in modern compilers,
enabling other optimizations to become more effective. However, inlining
becomes impossible in the presence of dynamic method calls if the type of the
receiver object is unknown. This is an even bigger problem for Self since
most of a program is a sequence of dynamic calls.&lt;&#x2F;p&gt;
&lt;p&gt;The Self compiler performs two optimizations, message &lt;em&gt;inlining&lt;&#x2F;em&gt; and
&lt;em&gt;splitting&lt;&#x2F;em&gt;, to enable efficient execution. Message &lt;em&gt;inlining&lt;&#x2F;em&gt; works in a similar
fashion to customized compilation -- if the type of a receiver is known, either
at initial compilation through a dataflow analysis or at runtime, the compiler inlines the
method body at the call site. When type information for an object is lost
due to control flow splits, &lt;em&gt;message splitting&lt;&#x2F;em&gt; generates specialized code for
possible receiver types and guards them using type tests. If the type of the
receiver matches, fast code can be executed.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the following Self program&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x &amp;lt; y
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;corresponds to the method invocation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x[&amp;quot;lessThan&amp;quot;](y)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which can be specialized as:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if (typeof(x) == &amp;#39;integer&amp;#39; &amp;amp;&amp;amp; typeof(y) == &amp;#39;integer&amp;#39;) {
  Integer.lessThan(x, y)  &#x2F;&#x2F; Integer is the root object for all integers
}
else {
  x[&amp;quot;lessThan&amp;quot;](y)
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After further inlining the call to the root &lt;code&gt;Integer&lt;&#x2F;code&gt; object, this becomes:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if (typeof(x) == &amp;#39;integer&amp;#39; &amp;amp;&amp;amp; typeof(y) == &amp;#39;integer&amp;#39;) {
  cmp(x, y) &#x2F;&#x2F; cmp is a single instruction
}
else {
  x[&amp;quot;lessThan&amp;quot;](y)
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the Self compiler can guarantee that &lt;code&gt;x&lt;&#x2F;code&gt; has the type &lt;code&gt;integer&lt;&#x2F;code&gt;, it can
&lt;em&gt;inline&lt;&#x2F;em&gt; the comparison and remove the guard.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;programming-environment-support&quot;&gt;Programming Environment Support&lt;&#x2F;h3&gt;
&lt;p&gt;Unlike most modern languages, the Self language was designed to support tight
integration and exploratory programming in an IDE. Programmers were expected
and encouraged to inspect programs at runtime and dynamically modify the IDE
itself &lt;em&gt;while the program was executing&lt;&#x2F;em&gt; to allow for better exploration.&lt;&#x2F;p&gt;
&lt;p&gt;These features require fast recompilation and debugging capabilities from any
Self runtime and compiler. The compiler in the paper supports both &lt;em&gt;incremental
recompilation&lt;&#x2F;em&gt; and &lt;em&gt;source-level debugging&lt;&#x2F;em&gt; by keeping track of the provenance
of various method specializations in a map.&lt;&#x2F;p&gt;
&lt;p&gt;Incremental Recompilation occurs when the compiler observes a change in the
programming environment and invalidates compiled methods associated with the
affected data. At compile time, the compiler creates a dependency list encoding
the information required for specialization. For example, if a method was
inlined from a clone&#x27;s parent, updating the parent pointer or the method body
in the parent should invalidate the specialized code. Since the compiler
selectively invalidates code, methods that weren&#x27;t modified can still execute
with specialized code.&lt;&#x2F;p&gt;
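&lt;p&gt;One way to picture the dependency lists is as a map from each dependency to the compiled methods that relied on it. The sketch below is hypothetical: the names &lt;code&gt;dependents&lt;&#x2F;code&gt;, &lt;code&gt;compiled&lt;&#x2F;code&gt;, &lt;code&gt;recordCompilation&lt;&#x2F;code&gt;, and &lt;code&gt;invalidate&lt;&#x2F;code&gt;, as well as the string keys, are illustrative and not taken from the paper.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; Maps a dependency (e.g. a slot of a parent object) to the names of
&#x2F;&#x2F; the compiled methods that relied on it when they were specialized.
const dependents = new Map&amp;lt;string, Set&amp;lt;string&amp;gt;&amp;gt;();
const compiled = new Set&amp;lt;string&amp;gt;();

function recordCompilation(method: string, deps: string[]): void {
  compiled.add(method);
  for (const dep of deps) {
    if (!dependents.has(dep)) dependents.set(dep, new Set&amp;lt;string&amp;gt;());
    dependents.get(dep)!.add(method);
  }
}

&#x2F;&#x2F; Called when the environment observes a change to dep.
function invalidate(dep: string): void {
  for (const method of dependents.get(dep) ?? []) {
    compiled.delete(method); &#x2F;&#x2F; will be recompiled on its next call
  }
  dependents.delete(dep);
}

recordCompilation(&amp;quot;area&amp;quot;, [&amp;quot;shape.parent&amp;quot;, &amp;quot;shape.width&amp;quot;]);
recordCompilation(&amp;quot;name&amp;quot;, [&amp;quot;shape.name&amp;quot;]);
invalidate(&amp;quot;shape.parent&amp;quot;); &#x2F;&#x2F; drops &amp;quot;area&amp;quot;; &amp;quot;name&amp;quot; stays compiled
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;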
&lt;p&gt;Source-level Debugging requires language support and can often cause compiler
writers to forgo obvious optimizations. For example, the Chrome V8 team decided
to &lt;a href=&quot;https:&#x2F;&#x2F;bugs.chromium.org&#x2F;p&#x2F;v8&#x2F;issues&#x2F;detail?id=4698&quot;&gt;forgo&lt;&#x2F;a&gt; tail call optimization due to concerns about being
unable to reconstruct stack frames. The programming environment needs to be
able to step through specialized code as if it was going through the normal
method call chain. Self appends debugging information to each compiled method,
allowing the environment to reconstruct the state of the stack.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-selfish-evaluation&quot;&gt;A SELFish Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;The Self compiler was written in 33,000 lines of C++ code and ran on the Sun-3.
The authors wrote 9,000 lines of Self code to implement the object hierarchy
and prototype graphical user interface.&lt;&#x2F;p&gt;
&lt;p&gt;The quantitative evaluation compares Self running times to those of the fastest
Smalltalk-80 compiler at the time by translating the Stanford Integer
Benchmarks. The authors did both a straightforward transliteration of the benchmarks into Self
(marked &lt;code&gt;Self&lt;&#x2F;code&gt;) and a rewrite in idiomatic Self (marked &lt;code&gt;Self&#x27;&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;img src=&quot;eval.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot;&#x2F;&gt;
&lt;p&gt;On average, Self was 2-3x faster than the Smalltalk implementation and 4x slower
than the original C programs.&lt;&#x2F;p&gt;
&lt;p&gt;The authors also describe a new metric to measure the performance of an object
oriented programming language -- &lt;em&gt;Millions of Messages per Second (MiMS)&lt;&#x2F;em&gt;, which
is analogous to MIPS. To compute the MiMS of a specific virtual machine, divide
the number of messages the benchmark sends, in millions, by its total running time in seconds. The
first generation Self compiler ran at 3.3 MiMS, or a message executing every 300ns.&lt;&#x2F;p&gt;
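&lt;p&gt;The conversion between the two figures is simple arithmetic, sketched here with hypothetical helper names &lt;code&gt;mims&lt;&#x2F;code&gt; and &lt;code&gt;nsPerMessage&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; MiMS = messages sent &#x2F; running time, expressed in millions per second.
function mims(messagesSent: number, seconds: number): number {
  return messagesSent &#x2F; seconds &#x2F; 1e6;
}

&#x2F;&#x2F; Average time per message, in nanoseconds, at a given MiMS rate.
function nsPerMessage(rate: number): number {
  return 1e9 &#x2F; (rate * 1e6);
}

nsPerMessage(3.3); &#x2F;&#x2F; about 303, i.e. roughly one message every 300ns
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;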
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;If this sounds suspiciously similar to JavaScript, it is because the design
of JavaScript is directly inspired by Self.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</description>
            </item>
        
            <item>
                <title>Trace-based Just-in-Time Type Specialization for Dynamic Languages</title>
                <pubDate>Sun, 27 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/tbjit-type-specialization/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/tbjit-type-specialization/</guid>
                <description>&lt;h2 id=&quot;dynamic-languages&quot;&gt;Dynamic Languages&lt;&#x2F;h2&gt;
&lt;p&gt;Statically-compiled languages force the programmer to abide by the restrictions of the ISA. For example, languages like C and Java require type information to be specified by the programmer. The hardware only knows how to do arithmetic between certain type combinations, so type knowledge is a requirement when compiling for that hardware. In the RISC-V ISA, an integer addition compiles to an &lt;code&gt;add&lt;&#x2F;code&gt; instruction while a single-precision floating point addition compiles to an &lt;code&gt;fadd.s&lt;&#x2F;code&gt; instruction.&lt;&#x2F;p&gt;
&lt;p&gt;Dynamic languages allow the programmer to free themselves from the strict requirements of the ISA. In languages like JavaScript and Python, a programmer does not need to specify whether a type is an &lt;code&gt;int&lt;&#x2F;code&gt;, &lt;code&gt;float&lt;&#x2F;code&gt;, or any other primitive type. However, these languages cannot be directly compiled to machine code because there is no instruction for untyped operations in most ISAs. Most high-performance interpreters will compile the source code to &lt;em&gt;bytecode&lt;&#x2F;em&gt; instead. A bytecode instruction is neither a machine instruction nor a line of source code, but rather something in between. It encodes behavior much like a machine instruction opcode, but also includes other information from the source language that would not be encoded in a machine instruction. Each bytecode instruction is evaluated by the interpreter by (1) checking type information and opcode and (2) jumping to a function that evaluates that operation.&lt;&#x2F;p&gt;
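&lt;p&gt;A minimal sketch of such an interpreter loop is shown below. The bytecode format (&lt;code&gt;Instr&lt;&#x2F;code&gt; with &lt;code&gt;op&lt;&#x2F;code&gt;, &lt;code&gt;dest&lt;&#x2F;code&gt;, and &lt;code&gt;args&lt;&#x2F;code&gt;) is hypothetical and far simpler than any real engine&#x27;s; it only illustrates that the opcode dispatch and the type check are repeated every time an instruction executes.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Value = number | string;
type Instr = { op: &amp;quot;add&amp;quot; | &amp;quot;print&amp;quot;; dest?: string; args: string[] };

function interpret(program: Instr[], env: Map&amp;lt;string, Value&amp;gt;): void {
  for (const instr of program) {
    &#x2F;&#x2F; Decode the opcode, then jump to the code that evaluates it.
    switch (instr.op) {
      case &amp;quot;add&amp;quot;: {
        const a = env.get(instr.args[0]);
        const b = env.get(instr.args[1]);
        &#x2F;&#x2F; The type check happens on every execution of the instruction.
        if (typeof a === &amp;quot;number&amp;quot; &amp;amp;&amp;amp; typeof b === &amp;quot;number&amp;quot;) {
          env.set(instr.dest!, a + b);
        } else {
          env.set(instr.dest!, String(a) + String(b));
        }
        break;
      }
      case &amp;quot;print&amp;quot;:
        console.log(env.get(instr.args[0]));
        break;
    }
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;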
&lt;h2 id=&quot;problem-dynamic-languages-are-slow&quot;&gt;Problem - Dynamic Languages are Slow&lt;&#x2F;h2&gt;
&lt;p&gt;Consider a single iteration of vector-vector add (vvadd) as a motivating example. In each iteration, we load two values from memory, add them, and store the result back to memory.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; N; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  c[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b[i];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A C compiler would statically compile this iteration to the following RISC-V instructions assuming the types of the arrays were &lt;code&gt;int&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;img src=&quot;static-vvadd.png&quot; alt=&quot;Interpreter Performance&quot; width=&quot;25%&quot; &gt;
&lt;p&gt;If instead we ran the program using an interpreter, we would execute the following machine instructions representing the interpreter rather than the user program.&lt;&#x2F;p&gt;
&lt;img src=&quot;interp-vvadd.png&quot; alt=&quot;Interpreter Performance&quot; width=&quot;60%&quot;&gt;
&lt;p&gt;Notice the same &lt;code&gt;lw&lt;&#x2F;code&gt;, &lt;code&gt;add&lt;&#x2F;code&gt;, and &lt;code&gt;sw&lt;&#x2F;code&gt; are present in the code, but we have to jump to the appropriate function to execute them. The interpreter overhead is, thus, everything that isn&#x27;t the instructions required by vvadd. For every instruction in vvadd, it requires seven additional interpreter instructions and incurs a penalty of a few cycles due to the additional branches. In total, we can estimate that it takes at least 10 cycles to execute an equivalent machine instruction on the interpreter. Many real-world interpreters perform additional operations that will increase the overhead even more. For these reasons, interpreters are generally an order of magnitude slower than statically compiled code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;solution-just-in-time-compiler&quot;&gt;Solution - Just-in-Time Compiler&lt;&#x2F;h2&gt;
&lt;p&gt;Just-in-Time Compilers (JITs) provide speedups to dynamic languages. Although tracing JITs had been proposed previously, this paper optimizes and popularizes them for accelerating dynamically typed languages. The authors demonstrate the effectiveness of a tracing JIT in a real-world environment, namely the Mozilla Firefox web browser.&lt;&#x2F;p&gt;
&lt;p&gt;The core idea exploited in tracing JITs is the following: &lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A loop tends to have similar type information across multiple iterations.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In the case of vvadd, if on every iteration the types are &lt;code&gt;int&lt;&#x2F;code&gt; then we don&#x27;t actually need the flexibility of the interpreter. Instead, we can compile the bytecode during run-time to machine instructions where the type of each instruction is &lt;code&gt;int&lt;&#x2F;code&gt;. The run-time compilation procedure will greatly resemble the ahead-of-time compilation procedure of non-dynamic languages. Generally we only want to spend time compiling code that is run multiple times (i.e., in a loop).&lt;&#x2F;p&gt;
&lt;p&gt;Unlike an ahead-of-time compiler, a JIT makes assumptions about the type information of the bytecode and speculatively emits machine instructions. If our assumptions were wrong, we need to fall back to the interpreter. The JIT compiler then must also insert &lt;em&gt;guards&lt;&#x2F;em&gt; that detect when type information is wrong and will jump back to the interpreter. The state machine below describes the high-level process.&lt;&#x2F;p&gt;
&lt;img src=&quot;state-machine.png&quot; alt=&quot;Interpreter Performance&quot; width=&quot;100%&quot;&gt;
&lt;p&gt;Machine code emitted by a JIT might look something like the following. Notice that there are fewer &amp;quot;overhead&amp;quot; instructions than in the interpreter version (just two instead of seven).&lt;&#x2F;p&gt;
&lt;img src=&quot;guard-vvadd.png&quot; alt=&quot;Interpreter Performance&quot; width=&quot;40%&quot;&gt;
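&lt;p&gt;The same guard-and-fallback structure can be sketched in TypeScript. The functions &lt;code&gt;interpretAdd&lt;&#x2F;code&gt; and &lt;code&gt;compiledAdd&lt;&#x2F;code&gt; are hypothetical stand-ins for the interpreter and for JIT-emitted code; they are not taken from the paper.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; Slow, fully general path (stands in for the bytecode interpreter).
function interpretAdd(a: unknown, b: unknown): unknown {
  if (typeof a === &amp;quot;number&amp;quot; &amp;amp;&amp;amp; typeof b === &amp;quot;number&amp;quot;) return a + b;
  return String(a) + String(b);
}

&#x2F;&#x2F; Code compiled under the assumption that both operands are ints.
function compiledAdd(a: unknown, b: unknown): unknown {
  &#x2F;&#x2F; Guards: bail out to the interpreter if the assumption is wrong.
  if (typeof a !== &amp;quot;number&amp;quot; || !Number.isInteger(a)) return interpretAdd(a, b);
  if (typeof b !== &amp;quot;number&amp;quot; || !Number.isInteger(b)) return interpretAdd(a, b);
  return a + b; &#x2F;&#x2F; fast path: a single integer add
}

compiledAdd(1, 2); &#x2F;&#x2F; 3, on the fast path
compiledAdd(&amp;quot;a&amp;quot;, 2); &#x2F;&#x2F; &amp;quot;a2&amp;quot;, after a guard fails
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;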
&lt;h2 id=&quot;tracemonkey&quot;&gt;TraceMonkey&lt;&#x2F;h2&gt;
&lt;p&gt;The authors propose &lt;em&gt;TraceMonkey&lt;&#x2F;em&gt;, a JIT which roughly follows the high-level ideas described above. TraceMonkey is a &lt;em&gt;tracing&lt;&#x2F;em&gt; JIT as opposed to a &lt;em&gt;method&lt;&#x2F;em&gt; JIT, the kind that was predominant at the time of this publication. Method JITs compile a single function at a time, whereas a tracing JIT compiles single paths through the whole program.&lt;&#x2F;p&gt;
&lt;p&gt;The machine code generated by each kind of JIT differs significantly. The machine code generated by a method JIT will resemble the original source program. By comparison, the machine code generated by a tracing JIT will have most of its control flow removed (i.e., conditional control flow and function calls). This code will generally be faster because more basic blocks have been stitched together and can be optimized together.&lt;&#x2F;p&gt;
&lt;p&gt;The overall flow of TraceMonkey is to run the interpreter for a while, observe &amp;quot;hot&amp;quot; bytecode, compile the bytecode to machine code, and run the machine code whenever possible instead of the interpreter. Guards are inserted into the machine code to fall back to the interpreter when our assumptions fail. The main steps of TraceMonkey are &lt;em&gt;interpreting&lt;&#x2F;em&gt;, &lt;em&gt;recording&lt;&#x2F;em&gt;, &lt;em&gt;compilation&lt;&#x2F;em&gt;, &lt;em&gt;native execution&lt;&#x2F;em&gt;, and &lt;em&gt;aborting&lt;&#x2F;em&gt;. These steps are described in some detail in the next section.&lt;&#x2F;p&gt;
&lt;p&gt;To avoid confusion, keep in mind that there are four types of code in TraceMonkey: 1) Source code, 2) Bytecode, 3) Low-level intermediate representation (LIR), and 4) Machine code. Only bytecode and machine code are executed, while source code and LIR are only meant to compiled down to the subsequent code level.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;interpreting&quot;&gt;Interpreting&lt;&#x2F;h3&gt;
&lt;p&gt;The default state of TraceMonkey is to execute bytecode via an interpreter. This yields correct but slow execution of a user&#x27;s program.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;recording&quot;&gt;Recording&lt;&#x2F;h3&gt;
&lt;p&gt;When TraceMonkey detects a loop (simply a back-edge in the control-flow graph), it begins to record a trace. For each bytecode instruction, one or more LIR instructions are generated along with type guards. LIR instructions directly map to machine instructions, but are ISA agnostic.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Bytecode (not typed)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add a b;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; LIR Trace (typed)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;guard &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;typeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(a) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
guard &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;typeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(b) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; 
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add_int a b;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
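&lt;p&gt;As a rough illustration of recording, the Python sketch below emits two type guards and a specialized add for one untyped bytecode add, based on the runtime types it observes. The function names and the textual LIR format are hypothetical and are not TraceMonkey&#x27;s actual data structures.&lt;&#x2F;p&gt;

```python
# Toy trace recorder: a sketch only, not TraceMonkey's real recorder.
# record_add and the textual LIR format are made up for illustration.

def type_name(value):
    """Return a short type tag for a runtime value."""
    # bool is a subclass of int in Python, so check it first.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "other"


def record_add(dest, lhs, rhs, env):
    """Record 'dest = add lhs rhs' using the types observed in env.

    Returns a list of LIR-like strings: one type guard per operand,
    followed by an add specialized to the observed type.
    """
    lhs_type = type_name(env[lhs])
    rhs_type = type_name(env[rhs])
    if lhs_type != rhs_type or lhs_type not in ("int", "float"):
        raise ValueError("this sketch only handles int+int or float+float")
    return [
        f"guard typeof({lhs}) == {lhs_type};",
        f"guard typeof({rhs}) == {rhs_type};",
        f"{lhs_type} {dest} = add_{lhs_type} {lhs} {rhs};",
    ]


trace = record_add("c", "a", "b", {"a": 1, "b": 2})
print("\n".join(trace))
```

&lt;p&gt;Recording the same bytecode with float operands would yield a different trace, which is why the guards are needed before the specialized instruction can run.&lt;&#x2F;p&gt;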
&lt;p&gt;A trace can only follow a single path within a loop iteration. No type information is known about the paths that were not taken, so we can&#x27;t generate machine code for them. Therefore guards must also check branch conditions. In the following example, two possible traces can be generated from the code.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Source code
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  c&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  c&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;--&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; LIR Trace 1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;guard a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add_int c &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; LIR Trace 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;guard a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add_int c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each individual trace can be much shorter than the original program and can forgo any control flow in favor of specialized guards.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;compilation&quot;&gt;Compilation&lt;&#x2F;h3&gt;
&lt;p&gt;The LIR traces must first be compiled to machine code to execute natively on the processor. This compilation needs to be much faster than static compilation because it occurs during runtime. The authors propose limiting the number of code optimizations performed to keep the compilation runtime reasonable. For example, register allocation uses a greedy algorithm. Greedy algorithms generally give non-optimal results, but may be the only type of algorithm appropriate for a small time budget.&lt;&#x2F;p&gt;
&lt;p&gt;The compiled traces are stored in a trace buffer for later use by the interpreter.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; LIR code
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;guard a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add_int c &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Machine code
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;addi t0 x0 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; put &amp;#39;true&amp;#39; into a register
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bne  t1 t0 abort;
addi t2 t2 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the more assumptions a trace is allowed to make, the more specialized and the faster the machine code will be. The trade-off is that there are more guards that can fail, and the generated machine code may not be useful for most iterations of a loop.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;native-execution&quot;&gt;Native Execution&lt;&#x2F;h3&gt;
&lt;p&gt;The interpreter can execute traces when certain conditions are met. Effectively, the interpreter cedes program control to the generated native machine instructions. The performance of this code should approach that of statically compiled code, but is somewhat held back by the low-effort optimizations and additional guard instructions. However, the performance is much better than running in the interpreter.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;aborting&quot;&gt;Aborting&lt;&#x2F;h3&gt;
&lt;p&gt;Whenever a guard fails, we must abort from the current trace because our assumptions were wrong. For example, if we thought the type of a value was &lt;code&gt;int&lt;&#x2F;code&gt; but the value turned out to be a &lt;code&gt;float&lt;&#x2F;code&gt;, future instructions will have incorrect behavior. A simple example is shown below.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;lw  t0 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(s0); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; unexpected float!
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;add t1 t0 t1; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; actually need a fadd instruction!
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The unoptimized version of this mechanism always jumps back to the interpreter to decide how to proceed. The interpreter can then record a new trace and start executing machine code from that in future iterations. Effectively, the enumerated steps will repeat in the same order. The optimized version of this process is described in the Linked Traces section below.&lt;&#x2F;p&gt;
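&lt;p&gt;The fall-back behavior can be sketched in Python. In this toy model, which is an assumption made for illustration and not TraceMonkey&#x27;s implementation, a trace is a list of guard and operation steps. The hypothetical &lt;code&gt;run_trace&lt;&#x2F;code&gt; reports whether the trace completed, and &lt;code&gt;step&lt;&#x2F;code&gt; falls back to a stand-in &lt;code&gt;interpret&lt;&#x2F;code&gt; function when a guard fails.&lt;&#x2F;p&gt;

```python
# Toy model of guard failure: a sketch, not TraceMonkey's implementation.
# Models Trace 1 from the branch example: guard a == true; c = add_int c 1.

def run_trace(trace, env):
    """Run a trace on a copy of env. Return (completed, new_env).

    If any guard fails, the trace aborts and the original env is
    returned unchanged so the interpreter can take over.
    """
    scratch = dict(env)
    for kind, action in trace:
        if kind == "guard":
            if not action(scratch):
                return False, env      # abort: an assumption was wrong
        else:
            action(scratch)            # ordinary operation
    return True, scratch


def interpret(env):
    """Slow path that handles every case (the role of the interpreter)."""
    out = dict(env)
    out["c"] = out["c"] + 1 if out["a"] else out["c"] - 1
    return out


def increment_c(env):
    env["c"] += 1


TRACE_1 = [
    ("guard", lambda env: env["a"] is True),
    ("op", increment_c),
]


def step(env):
    """Prefer the compiled trace; fall back to the interpreter on abort."""
    completed, result = run_trace(TRACE_1, env)
    return result if completed else interpret(env)


print(step({"a": True, "c": 0}))   # the trace completes
print(step({"a": False, "c": 0}))  # the guard fails, the interpreter runs
```

&lt;p&gt;Running the trace on a scratch copy is a simplification chosen for this sketch; it sidesteps the question of how a real JIT restores interpreter state after a partially executed trace.&lt;&#x2F;p&gt;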
&lt;h2 id=&quot;optimizations&quot;&gt;Optimizations&lt;&#x2F;h2&gt;
&lt;p&gt;The authors&#x27; lower-level implementation of the ideas described above is the main contribution of this paper. They develop multiple optimizations to make traces less likely to abort. Aborts incur a high performance penalty, so the fewer aborts, the faster the user program will run. They also develop techniques to reduce the amount of storage required for the compiled traces.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;typed-traces&quot;&gt;Typed Traces&lt;&#x2F;h4&gt;
&lt;p&gt;Each trace is a basic block that has one entry node and no inner control flow. The interpreter will only enter a trace if the types of its input variables match the types the trace was compiled for. This is more efficient than entering the trace and immediately aborting because the incoming types were incorrect. In the case of multiple traces, the interpreter can decide which trace to run based on the input variable types and the trace signature (i.e., the type of each variable as would be given in a C function call).&lt;&#x2F;p&gt;
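&lt;p&gt;One way to picture trace selection is a table of compiled traces keyed by their entry type signature. The Python sketch below is only an illustration: the names &lt;code&gt;TRACES&lt;&#x2F;code&gt;, &lt;code&gt;signature&lt;&#x2F;code&gt;, and &lt;code&gt;select_trace&lt;&#x2F;code&gt; are hypothetical, and plain Python functions stand in for compiled traces.&lt;&#x2F;p&gt;

```python
# Sketch of choosing a trace by type signature; all names are hypothetical.

def signature(env, names):
    """Tuple of type names for the given input variables."""
    return tuple(type(env[name]).__name__ for name in names)


def add_ints(env):
    return env["a"] + env["b"]


def add_floats(env):
    return env["a"] + env["b"]


# Compiled traces for the same loop, keyed by the entry types they assume.
TRACES = {
    ("int", "int"): add_ints,
    ("float", "float"): add_floats,
}


def select_trace(env):
    """Return the trace whose signature matches env, or None.

    Returning None models staying in the interpreter instead of
    entering a trace that would abort immediately.
    """
    return TRACES.get(signature(env, ("a", "b")))


print(select_trace({"a": 1, "b": 2}).__name__)
print(select_trace({"a": 1.0, "b": 2.0}).__name__)
print(select_trace({"a": 1, "b": "x"}))
```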
&lt;h4 id=&quot;linked-traces&quot;&gt;Linked Traces&lt;&#x2F;h4&gt;
&lt;p&gt;A trace is a single forward path. A naive approach would jump back to the interpreter at the end of the trace and have the interpreter re-execute the same compiled trace. A trace can be expanded to include its jump back path if the loop is deemed &lt;em&gt;type-stable&lt;&#x2F;em&gt;, i.e., the type information does not change over consecutive iterations.&lt;&#x2F;p&gt;
&lt;p&gt;A trace can also jump to another similar trace that uses different input types. This can occur if a particular pattern is detected between different traces, e.g., if input types go from &lt;code&gt;int&lt;&#x2F;code&gt; to &lt;code&gt;float&lt;&#x2F;code&gt; to &lt;code&gt;string&lt;&#x2F;code&gt; consistently, we would want to link the three traces together.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;trace-branches&quot;&gt;Trace Branches&lt;&#x2F;h4&gt;
&lt;p&gt;As previously mentioned, a trace can only contain information about a single path through the loop. If the machine code encounters a branch that goes a different way than the recorded path, it needs to abort. However, it doesn&#x27;t necessarily need to abort back to the interpreter. If there is another trace that starts from the side path, we could jump directly to this other trace.&lt;&#x2F;p&gt;
&lt;p&gt;The diagram below presents two traces. The vertical trace (the root trace) is called directly from the interpreter, while the slanted trace is called from the root trace when a certain branch condition is met. These arrangements form tree-like structures called &lt;em&gt;trace trees&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;img src=&quot;tree-with-branch.png&quot; alt=&quot;Interpreter Performance&quot; width=&quot;50%&quot;&gt;
&lt;p&gt;Jumping to another trace instead of aborting back to the interpreter is much more efficient.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;nested-traces&quot;&gt;Nested Traces&lt;&#x2F;h4&gt;
&lt;p&gt;Traces always consist of a single forward path and end on a backward edge. In the case of a loop nest, instructions will be recorded from a single path through both loops. If there are any conditionals that are post-dominated by the outer loop, then the outer loop instructions can be compiled multiple times (once for each full path). This increases the amount of storage required for each trace.&lt;&#x2F;p&gt;
&lt;p&gt;The authors propose to effectively perform function outlining on nested loops. One trace can effectively call another trace as the interpreter would.&lt;&#x2F;p&gt;
&lt;img src=&quot;nest-trees.png&quot; alt=&quot;Interpreter Performance&quot; width=&quot;50%&quot;&gt;
&lt;h4 id=&quot;blacklisting&quot;&gt;Blacklisting&lt;&#x2F;h4&gt;
&lt;p&gt;Some traces are not worth generating. TraceMonkey blacklists them so that they are not recorded or run.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;The authors evaluate on a MacBook Pro! JavaScript is somewhat of a consumer language rather than a high-performance language, so a consumer-grade MacBook is somewhat relevant. Keeping in the spirit of consumerism, the authors evaluate on a consumer benchmark suite &lt;em&gt;SunSpider&lt;&#x2F;em&gt;. These benchmarks are all extremely small (&amp;lt; 250ms), but webpages generally load within this time.&lt;&#x2F;p&gt;
&lt;p&gt;The authors estimate that a bytecode instruction is 4x faster when compiled to machine code (one bytecode instruction can become multiple machine instructions, which is why this isn&#x27;t higher). Most benchmarks spend their time natively executing machine code rather than interpreting bytecode. Thus, every compatible benchmark achieved at least some speedup in TraceMonkey over &lt;em&gt;SpiderMonkey&lt;&#x2F;em&gt;, the interpreter-only version of TraceMonkey. TraceMonkey lacks support for certain JavaScript primitives and thus could not gain performance on benchmarks containing those primitives.&lt;&#x2F;p&gt;
&lt;p&gt;I would have liked to have seen the performance impact of the many optimizations they described in their paper, particularly the benefit of creating trace-trees and nested trees.&lt;&#x2F;p&gt;
&lt;p&gt;However, the authors report significant overhead in some benchmarks due to the JIT state machine, mainly the recording and compilation procedures. They estimate that a native trace must be executed 270 times to justify the overhead. This may seem small, but generally the total run-time of a JavaScript program is also quite small.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;discussion-questions&quot;&gt;Discussion Questions&lt;&#x2F;h2&gt;
&lt;p&gt;Is the short compilation time potentially limiting the performance that could be achieved by traces? Bytecode is only 4x faster when executed natively; what could be done to improve this?&lt;&#x2F;p&gt;
&lt;p&gt;Could an ISA be designed that does not require machine instructions to include type information?&lt;&#x2F;p&gt;
&lt;p&gt;This paper was written at the beginning of the multi-core processor era. Could multiprocessing improve the JIT performance?&lt;&#x2F;p&gt;
&lt;p&gt;Is it wasteful for billions of computers to &amp;quot;learn&amp;quot; the same traces when they load a webpage? Could anything be done to remedy this?&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Lazy Code Motion</title>
                <pubDate>Sat, 26 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/lazy-code-motion/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/lazy-code-motion/</guid>
                <description>&lt;p&gt;Code motion optimizations shift computations around a control-flow graph
(CFG) in order to avoid redundant computation, reduce code size, or save
resources. Loop invariant code motion, for example, identifies expressions
computed within loops which have the same value from iteration to iteration and
hoists them out of the loop so that they are computed only once. Common
subexpression elimination can also be viewed as a code motion
optimization: rather than computing a subexpression &lt;code&gt;e&lt;&#x2F;code&gt; twice in expressions
&lt;code&gt;f(e)&lt;&#x2F;code&gt; and &lt;code&gt;g(e)&lt;&#x2F;code&gt;, compute it once and store it in a temporary register.&lt;&#x2F;p&gt;
&lt;p&gt;Lazy code motion unifies and improves upon several forms of code motion
including loop invariant code motion and common subexpression elimination. It is
termed lazy to contrast with earlier approaches to unified code motion which
placed computations as early as possible. Perhaps this will make more sense if
we take a look at an example. I adapted the following from
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_redundancy_elimination&quot;&gt;Wikipedia&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  five: int = const 5;
  br some_condition compute do_nothing;
compute:
  y: int = add x five;
  jmp done;
do_nothing:
  jmp done;
done:
  z: int = add x five;
  ret;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If &lt;code&gt;some_condition&lt;&#x2F;code&gt; holds then &lt;code&gt;x + 5&lt;&#x2F;code&gt; will be computed twice, but if it is
false then the computation only happens once. This is a &amp;quot;partial redundancy,&amp;quot;
and traditional common subexpression elimination will not fix it. A good global
code motion analysis will fix it, emitting code that looks something like this.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  five: int = const 5;
  br some_condition compute do_nothing;
compute:
  tmp: int = add x five;
  y: int = id tmp;
  jmp done;
do_nothing:
  tmp: int = add x five;
  jmp done;
done:
  z: int = id tmp;
  ret;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now all paths through the program include only one computation of &lt;code&gt;x+5&lt;&#x2F;code&gt;. This is
optimal code, at least along that dimension, and a realistic result to expect from
partial redundancy elimination or lazy code motion. So what makes lazy code
motion different from eager alternatives?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;register-pressure&quot;&gt;Register pressure&lt;&#x2F;h2&gt;
&lt;p&gt;When lowering IR code to assembly, compilers have to allocate storage for
a finite but unbounded number of variables in an architecture with a finite and
fixed number of registers. If there are more variables than registers, some of
them are going to end up on the stack. &amp;quot;Spilling&amp;quot; to the stack is costly because memory
is slower than registers.&lt;&#x2F;p&gt;
&lt;p&gt;Earlier passes in the compiler should try to minimize the number of spills
introduced during register allocation. The precise number of spills introduced
into a program depends on the register allocation algorithm in use, which makes
designing optimizations against this metric something of a fool&#x27;s errand. Lazy
code motion looks to a more understandable proxy metric known as &lt;em&gt;register
pressure&lt;&#x2F;em&gt;. Register pressure, for each program point, is the number of live
variables at that point as determined by a standard liveness analysis.&lt;&#x2F;p&gt;
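&lt;p&gt;To make the metric concrete, here is a small Python sketch that computes register pressure for straight-line code with a backward liveness scan. The instruction encoding as (destination, arguments) pairs, the variable names, and the assumption that &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;y&lt;&#x2F;code&gt; stay live after the block are all illustrative choices, not part of the paper or of my implementation.&lt;&#x2F;p&gt;

```python
# Sketch: register pressure on straight-line code via backward liveness.
# Instructions are (dest, args) pairs; this encoding is an assumption.

def register_pressure(instrs, live_out):
    """Return the number of live variables just before each instruction."""
    live = set(live_out)
    pressure = []
    for dest, args in reversed(instrs):
        live.discard(dest)        # dest is defined here, so it is dead above
        live.update(args)         # args must be live before this point
        pressure.append(len(live))
    pressure.reverse()
    return pressure


# Eager placement: t is computed first, long before its only use.
eager = [
    ("t", ("x", "y")),
    ("a", ("p", "q")),
    ("b", ("a", "p")),
    ("r", ("b", "t")),
]
# Lazy placement: the same computation sits right before its use.
lazy = [
    ("a", ("p", "q")),
    ("b", ("a", "p")),
    ("t", ("x", "y")),
    ("r", ("b", "t")),
]

# Assume r, x, and y are all still needed after this block.
LIVE_OUT = {"r", "x", "y"}

print("eager:", register_pressure(eager, LIVE_OUT))
print("lazy: ", register_pressure(lazy, LIVE_OUT))
```

&lt;p&gt;With these inputs the eager ordering peaks at five live variables and the lazy ordering at four, because &lt;code&gt;t&lt;&#x2F;code&gt; no longer has to be kept alive across the unrelated instructions.&lt;&#x2F;p&gt;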
&lt;p&gt;Eager code motion moves variable definitions (computations) further away from
their uses, which lengthens their live ranges. The resulting register pressure
can easily claw back any performance gains due to code motion. Rather than put
computations as early as possible, lazy code motion moves them down to
a later program point that still avoids redundant computation. In fact, one
paper proves that lazy code motion places computations &amp;quot;as late as possible&amp;quot;
([1], Theorem 3), but this wording is misleading outside its original context.
The algorithm actually uses a static analysis to identify possible safe moves
and then selects the latest options. This is not always &amp;quot;as late as possible&amp;quot;
because in general the static analysis will underapproximate the set of possible
safe moves.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h1&gt;
&lt;p&gt;I implemented lazy code motion in OCaml. Since the existing Bril infrastructure
is written in Python and TypeScript, I had to implement Bril JSON parsing and
design data structures for representing control flow graphs. The open source
&lt;a href=&quot;http:&#x2F;&#x2F;ocamlgraph.lri.fr&#x2F;index.en.html&quot;&gt;OCamlgraph&lt;&#x2F;a&gt; library helped with this.
It includes excellent generic graph data structures and a framework for
implementing dataflow analyses over graphs.&lt;&#x2F;p&gt;
&lt;p&gt;There were some mismatches between the dataflow analyses presented by Drechsler
and Stadel [1] and the OCamlgraph dataflow framework. The biggest issue was that
the OCamlgraph framework made it hard to refer to results of other dataflow
passes in later ones. I worked around this by tagging every basic block in the
CFG with a map of &amp;quot;attributes&amp;quot; and writing OCaml functors like &lt;code&gt;MakeDataflow&lt;&#x2F;code&gt;
that require their input module to include an &lt;code&gt;attr_name&lt;&#x2F;code&gt; field. When you run
a dataflow analysis created with &lt;code&gt;MakeDataflow&lt;&#x2F;code&gt;, it saves the results of the
analysis at each basic block into the attributes map under the &lt;code&gt;attr_name&lt;&#x2F;code&gt; key.&lt;&#x2F;p&gt;
&lt;p&gt;The analyses were easy to write down once there was structure in place for
defining them. For example, here is the definition of an available expressions
analysis from the paper:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;let avail_out (_, block_attrs) avail_in =
  Bitv.bw_or
    (Bitv.bw_and
       (Cfg.Attrs.get block_attrs &amp;quot;transparent&amp;quot;)
       avail_in)
    (Cfg.Attrs.get block_attrs &amp;quot;computes&amp;quot;)

module AvailabilityIn =
  MakeDataflow
    (struct
      let attr_name = &amp;quot;availability_in&amp;quot;
      let direction = Graph.Fixpoint.Forward
      let init _ =
        EXPRS.build ~f:(fun _ -&amp;gt; false)
      let analyze (src, _, _) src_avail_in =
        avail_out src src_avail_in
    end)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
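&lt;p&gt;For readers who don&#x27;t read OCaml, the same transfer function can be sketched in Python with sets in place of bit vectors. This is the textbook formulation of available expressions (intersection over predecessors, iterated to a fixpoint), not a translation of my OCamlgraph setup; the CFG encoding and the helper names are assumptions made for the example. The CFG is the partial redundancy example from the start of this post.&lt;&#x2F;p&gt;

```python
# Sketch of a forward available-expressions analysis (textbook form).
# The CFG encoding and block attributes below are assumptions.

def avail_out(block, avail_in):
    """Transfer function: (transparent AND avail_in) OR computes."""
    return avail_in.intersection(block["transparent"]).union(block["computes"])


def available_in(cfg, entry, all_exprs):
    """Iterate to a fixpoint; returns the avail_in set of every block."""
    preds = {name: [] for name in cfg}
    for name, block in cfg.items():
        for succ in block["succs"]:
            preds[succ].append(name)

    # Optimistic start: everything is available, except at the entry.
    avail = {name: set(all_exprs) for name in cfg}
    avail[entry] = set()

    changed = True
    while changed:
        changed = False
        for name in cfg:
            if name == entry:
                continue
            # Available only if it is available on every incoming path.
            incoming = [avail_out(cfg[p], avail[p]) for p in preds[name]]
            new = set.intersection(*incoming) if incoming else set()
            if new != avail[name]:
                avail[name] = new
                changed = True
    return avail


E = "x + five"
cfg = {
    "entry": {"succs": ["compute", "do_nothing"], "transparent": {E}, "computes": set()},
    "compute": {"succs": ["done"], "transparent": {E}, "computes": {E}},
    "do_nothing": {"succs": ["done"], "transparent": {E}, "computes": set()},
    "done": {"succs": [], "transparent": {E}, "computes": {E}},
}

result = available_in(cfg, "entry", {E})
print(result["done"])
```

&lt;p&gt;The expression is not available on entry to &lt;code&gt;done&lt;&#x2F;code&gt; because only one of the two incoming paths computes it, which is exactly what makes the redundancy partial.&lt;&#x2F;p&gt;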
&lt;p&gt;I implemented a CFG printer using the OCamlgraph Graphviz backend so that
I could debug the optimization&#x27;s changes to CFGs directly, rather than staring
at pretty printed output. Here&#x27;s the result of running the optimization on
&lt;code&gt;test&#x2F;lcm&#x2F;simple&lt;&#x2F;code&gt;, a nested loop benchmark. There&#x27;s a way to include the
instructions from each block inside the nodes, but for this example that makes
it hard to follow. There is a loop invariant computation in &lt;code&gt;inner_body&lt;&#x2F;code&gt; that
lazy code motion hoists to &lt;code&gt;lcm_inserted_block0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;img src=&quot;simple-before.png&quot; &#x2F;&gt;
&lt;img src=&quot;simple-after.png&quot; &#x2F;&gt;
&lt;h1 id=&quot;limitations&quot;&gt;Limitations&lt;&#x2F;h1&gt;
&lt;p&gt;I implemented the full lazy code motion optimization, but there were some
issues. The optimization expects lexically equal expressions to always be stored
into the same pseudoregister. There may be adaptations of the dataflow analyses
in later work to weaken this assumption, but I introduced a conservative code
rewriting pass that made sure expressions went into a uniquely named temporary.
This increases register pressure and introduces extraneous move instructions to
retrieve computed values from temporaries. A more clever rewriting pass might be
able to ameliorate these costs.&lt;&#x2F;p&gt;
&lt;p&gt;On the other end of the optimization, the computation placement algorithm is
inefficient. Lazy code motion places computations on edges of the CFG, which
means stitching new basic blocks into edges. While these blocks are necessary in
general, on many edges an inserted basic block could be safely merged with its
predecessor or successor block. This could have performance benefits because it
would reduce the number of jumps. Similarly, the pretty-printer for CFGs does not
omit jumps where fall-through would work. This seems like a minor detail but
could have performance or code size implications. Both of these issues could be
addressed with a simplification pass to be run after lazy code motion.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;p&gt;I instrumented the interpreter to count operations and computations.
A computation is any instruction that computes the value of an expression. An
operation is any instruction at all, including control flow and return
instructions. Many of the programs in the &lt;code&gt;examples&lt;&#x2F;code&gt; and &lt;code&gt;test&lt;&#x2F;code&gt; directories are
either a single basic block or include no computations, so I excluded them from
this evaluation because the optimization doesn&#x27;t do anything to them.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;program&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;ops (before lcm)&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;computations (before lcm)&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;ops (after lcm)&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;computations (after lcm)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;test&#x2F;lcm&#x2F;dont-hoist-thru-loop&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;8&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;11&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;test&#x2F;lcm&#x2F;two-blocks&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;5&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;8&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;test&#x2F;lcm&#x2F;hoist-thru-loop&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;412&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;303&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;617&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;203&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;test&#x2F;lcm&#x2F;simple&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1217&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;906&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1925&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;605&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;examples&#x2F;lvn_test&#x2F;nonlocal&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;7&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;3&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;12&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;examples&#x2F;dom_test&#x2F;loopcond&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;117&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;46&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;165&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;46&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;examples&#x2F;df_test&#x2F;cond&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;9&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;12&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;examples&#x2F;df_test&#x2F;fact&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;62&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;25&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;89&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;25&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Since lazy code motion is designed to avoid redundant computations of
expressions, the number of computations never increases after optimization. It
looks like the overhead of the conservative temporary allocation strategy and
basic block insertion impacts the total number of operations negatively by
adding moves and jumps respectively. Loopy benchmarks (simple, hoist-thru-loop)
execute about a third fewer computations because computations are hoisted out of
their loops, although their total operation counts still increase.&lt;&#x2F;p&gt;
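&lt;p&gt;For the two loop benchmarks, the relative changes can be worked out directly from the table. The short Python sketch below does the arithmetic; the helper name &lt;code&gt;percent_change&lt;&#x2F;code&gt; is just for this example.&lt;&#x2F;p&gt;

```python
# Relative change in dynamic counts for the two loop benchmarks,
# using the numbers from the table above.

def percent_change(before, after):
    """Signed change from before to after, as a percentage of before."""
    return 100.0 * (after - before) / before


rows = {
    # program: (ops before, computations before, ops after, computations after)
    "test/lcm/hoist-thru-loop": (412, 303, 617, 203),
    "test/lcm/simple": (1217, 906, 1925, 605),
}

for program, (ops_b, comp_b, ops_a, comp_a) in rows.items():
    print(
        f"{program}: ops {percent_change(ops_b, ops_a):+.1f}%, "
        f"computations {percent_change(comp_b, comp_a):+.1f}%"
    )
```

&lt;p&gt;Computations fall by roughly a third in both programs, while total operations rise by roughly 50% and 58%, which is the cost of the extra moves and jumps.&lt;&#x2F;p&gt;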
&lt;h1 id=&quot;bibliography&quot;&gt;Bibliography&lt;&#x2F;h1&gt;
&lt;ol&gt;
&lt;li&gt;Drechsler, K.-H., Stadel, M. P. A variation of Knoop, Rüthing, and Steffen&#x27;s Lazy Code Motion. SIGPLAN Notices 28, 5, (1993), 29-38.&lt;&#x2F;li&gt;
&lt;li&gt;Knoop, J., Rüthing, O., Steffen, B. Lazy code motion. SIGPLAN Notices 27, 7, (1992), 224-234.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
</description>
            </item>
        
            <item>
                <title>Efficient Path Profiling </title>
                <pubDate>Fri, 25 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/efficient-path-profiling/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/efficient-path-profiling/</guid>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Identifying path bottlenecks in program execution is critical when evaluating and optimizing program performance.
Due to the exponential number of paths in large programs, however, naively counting how often each path is taken in a given execution is intractable for any realistic program.
Solutions to this problem long focused on counting other execution properties dynamically, such as the number of times each control flow graph (CFG) edge is taken.&lt;&#x2F;p&gt;
&lt;p&gt;Such an edge-focused algorithm, however, does not allow unique construction of paths, and cannot accurately predict the path profile of a given execution.
The most taken path is usually defined greedily: starting from the entry node, it is constructed by following the most frequently taken edge out of each node.
Consider, for instance, the edge profiles, which &lt;em&gt;both&lt;&#x2F;em&gt; compose to produce the path profile shown:&lt;&#x2F;p&gt;
&lt;img src=&quot;bad_paths.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot;&#x2F;&gt;
&lt;p&gt;Neither measurement, however, predicts the path &lt;em&gt;ABCDEF&lt;&#x2F;em&gt; as the most taken, despite it being defined as such by the usual heuristic.
Indeed, it was long thought that this loss of accuracy was necessary when defining a sufficiently fast algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;We will explore how Thomas Ball and James R. Larus&#x27;s paper &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=243857&quot;&gt;&lt;em&gt;Efficient Path Profiling&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; defines an efficient algorithm for dynamically computing information about path use.
This path profiling algorithm computes information about paths directly, and so does not suffer from the same loss of accuracy as approximations derived from edge use measurements.
As promised by the title, this algorithm is both memory efficient and fast enough to use when profiling realistic programs.
Experimentally, path profiling incurs a profiling overhead of roughly 31% on a standard benchmark suite, compared with roughly 16% for edge profiling.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-algorithm-for-dags&quot;&gt;The Algorithm for DAGs&lt;&#x2F;h2&gt;
&lt;p&gt;We start by assuming profiling over a directed acyclic graph (DAG), but will later explore extending to loops.
Any DAG has at least one node with no incoming edges, called an entry node, and at least one node with no outgoing edges, called an exit node.
We extend any CFG with a single entry node and a single exit node, thus making each unique.
The steps of the path profiling algorithm are as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Assign integer values to edges such that the sum of the integers along any path is unique and as small as possible.&lt;&#x2F;li&gt;
&lt;li&gt;Using a spanning tree, create and select instrumentation for computing the increment for each edge.&lt;&#x2F;li&gt;
&lt;li&gt;Collect the dynamic runtime profile.&lt;&#x2F;li&gt;
&lt;li&gt;Derive the executed paths based on the results and selected instrumentation.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;assigning-edge-integer-values&quot;&gt;Assigning Edge Integer Values&lt;&#x2F;h3&gt;
&lt;p&gt;The first step to path profiling is to assign edge integers such that the sum of edge integers along any path in the DAG results in a unique integer value.
There are many such assignments, but a minimal non-negative one can be found by reasoning about the number of paths from each node to the exit node.
Intuitively, calculating a unique integer based on paths works since each split creates a new path from the unique entry node and results in a distinct integer.
Formally, the value for each edge from &lt;em&gt;v&lt;&#x2F;em&gt; to &lt;em&gt;w&lt;&#x2F;em&gt; is the total number of paths to the exit from the targets of all previously examined edges leaving &lt;em&gt;v&lt;&#x2F;em&gt;.
For example, these are the integer values for our sample CFG:&lt;&#x2F;p&gt;
&lt;img src=&quot;integers.png&quot; alt=&quot;drawing&quot; width=&quot;300&quot;&#x2F;&gt;
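&lt;p&gt;As a rough illustration of this numbering, here is a minimal Python sketch. The function names and the six-node graph are assumptions made for the example (the graph is not necessarily the one in the figure), and the successor order stands in for the order in which edges are examined.&lt;&#x2F;p&gt;

```python
def assign_edge_values(succ, exit_node):
    """Return (num_paths, edge_val) for a DAG given as {node: [successors]}."""
    num_paths = {}
    edge_val = {}

    def visit(v):
        # Post-order walk, so every successor is finished before v itself.
        if v in num_paths:
            return
        if v == exit_node:
            num_paths[v] = 1
            return
        total = 0
        for w in succ[v]:
            visit(w)
            # The edge value is the path count of all earlier sibling targets.
            edge_val[(v, w)] = total
            total += num_paths[w]
        num_paths[v] = total

    for v in succ:
        visit(v)
    return num_paths, edge_val


def all_paths(succ, v, exit_node):
    """Enumerate every path from v to the exit as a list of nodes."""
    if v == exit_node:
        return [[v]]
    return [[v] + rest for w in succ[v] for rest in all_paths(succ, w, exit_node)]


# Hypothetical six-node DAG used only for this illustration.
SUCC = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": ["E", "F"],
    "E": ["F"],
    "F": [],
}
NUM_PATHS, EDGE_VAL = assign_edge_values(SUCC, "F")
PATH_SUMS = sorted(
    sum(EDGE_VAL[(a, b)] for a, b in zip(p, p[1:]))
    for p in all_paths(SUCC, "A", "F")
)
print(PATH_SUMS)
```

&lt;p&gt;For this graph the six paths receive the sums 0 through 5, one per path, which is the compactness property the numbering is after.&lt;&#x2F;p&gt;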
&lt;h3 id=&quot;creating-instrumentation&quot;&gt;Creating Instrumentation&lt;&#x2F;h3&gt;
&lt;p&gt;The unique path sums given by these integers are not yet efficient counting instrumentation.
Counting instrumentation is based on incrementing a specific register whenever a given edge is passed; such writes can be expensive, so we select edges to minimize the number of writes while still producing correct results.
To construct counting efficiently, we must first identify the most taken edges (based on the edge weights computed earlier) and avoid incrementing when using those edges.
Specifically, we compute the maximum spanning tree with respect to edge weights.
All updates then only reason about those edges in the set of &lt;em&gt;chords&lt;&#x2F;em&gt;, the edges not in the maximum spanning tree.&lt;&#x2F;p&gt;
&lt;p&gt;Instrumentation for the graph (the code for updating the register associated with the path) is then added for each chord edge based on the integers computed earlier.
Since each path of integers must be unique, the register selected by a given chord edge is simply given as the minimum integer increment since the last chord edge for any path.
This algorithm gives the instrumentation for the integers calculated on our sample CFG as follows, where edges with squares are chords:&lt;&#x2F;p&gt;
&lt;img src=&quot;instrumentation.png&quot; alt=&quot;drawing&quot; width=&quot;300&quot;&#x2F;&gt;
&lt;h3 id=&quot;regenerating-paths&quot;&gt;Regenerating Paths&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, after incrementing each register according to our instrumentation, we must recover the number of times each path was taken.
The integers calculated earlier give an intuition for reconstructing these values, since each integer path must have been unique.
We can then just calculate the integer value associated with each path by walking the path in the integer version of the CFG and use this to reference which register corresponds to each path.&lt;&#x2F;p&gt;
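&lt;p&gt;The lookup can also be run in the opposite direction, turning a recorded path sum back into the path it names. The sketch below assumes the edge values and path counts of a hypothetical six-node DAG and a made-up table of counters; none of these names or numbers come from the paper.&lt;&#x2F;p&gt;

```python
def regenerate_path(path_id, succ, edge_val, num_paths, entry, exit_node):
    """Decode a path sum by following, at each node, the edge whose range holds it."""
    path = [entry]
    remaining = path_id
    node = entry
    while node != exit_node:
        for w in succ[node]:
            offset = remaining - edge_val[(node, w)]
            # Edge (node, w) covers exactly num_paths[w] consecutive ids.
            if offset in range(num_paths[w]):
                remaining = offset
                node = w
                path.append(w)
                break
        else:
            raise ValueError("path id out of range")
    return "".join(path)


# Hypothetical graph, edge values, and path counts for illustration only.
SUCC = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"],
        "D": ["E", "F"], "E": ["F"], "F": []}
EDGE_VAL = {("A", "B"): 0, ("A", "C"): 4, ("B", "C"): 0, ("B", "D"): 2,
            ("C", "D"): 0, ("D", "E"): 0, ("D", "F"): 1, ("E", "F"): 0}
NUM_PATHS = {"A": 6, "B": 4, "C": 2, "D": 2, "E": 1, "F": 1}

# Made-up counter table: path sum mapped to how often it was observed.
COUNTERS = {0: 3, 5: 7}
PROFILE = {
    regenerate_path(pid, SUCC, EDGE_VAL, NUM_PATHS, "A", "F"): hits
    for pid, hits in COUNTERS.items()
}
print(PROFILE)
```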
&lt;h3 id=&quot;extending-to-cycles&quot;&gt;Extending to Cycles&lt;&#x2F;h3&gt;
&lt;p&gt;We have assumed so far that every graph is acyclic; how can we extend this algorithm to work with cycles?
It turns out that evaluating each type of cycle and how it relates to an arbitrarily inserted enter or exit node provides sufficient information to construct our integer edge values as with a DAG.&lt;&#x2F;p&gt;
&lt;p&gt;Cycles added to a DAG can be thought of as &lt;em&gt;backedges&lt;&#x2F;em&gt;, edges which visit a node previously seen in the DAG.
Through replacing these backedges with information-carrying forward edges, the algorithm reconstructs a DAG from any cyclic graph.
The details of these edges rely on technical casework, so those interested in the details should read the EPP paper directly; however, it is sufficient for our overview to state that each backedge is replaced by forward edges from the entry node to its target, and from its source to the exit node.
This process can be summarized visually with this excellent diagram:&lt;&#x2F;p&gt;
&lt;img src=&quot;cycle.png&quot; alt=&quot;drawing&quot; width=&quot;300&quot;&#x2F;&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To evaluate how realistic this approach can be, the authors built the PP tool and compared it with an existing edge profiling tool (QPT2).
Several optimizations were applied to PP to promote a realistic comparison, such as mapping local registers through the Executable Editing Library (EEL) and implementing hash tables to handle a large number of paths.
All experiments were run on the SPEC95 benchmark, which consists of a standard set of C and Fortran programs.&lt;&#x2F;p&gt;
&lt;p&gt;Experiments were run on a Sun Ultraserver, with 167 MHz UltraSPARC processors and a whopping 2 GB of memory.
PP&#x27;s overhead on this test suite averaged 30.9%, while QPT2&#x27;s overhead averaged 16.1%.
Note that cache interference caused by profiling was not recorded; however, in general, programs with little hashing (few paths) had comparable PP and QPT2 overhead while programs with substantial hashing had a larger PP overhead.
The PP tool had perfect accuracy by definition; in comparison, the QPT2 tool only averaged 37.9% accuracy, demonstrating the main strength of the PP approach.&lt;&#x2F;p&gt;
&lt;p&gt;These experiments show the relative power of this new algorithmic approach.
While the path profiling algorithm does cause some additional overhead, this was shown to be minimal compared to edge profiling algorithms; the gains in accuracy also speak for themselves.
This test suite is rather small, however, only consisting of 18 test programs, which makes generalizing these results somewhat difficult.
While the authors&#x27; analysis of the results shows some thought, it almost seems they accepted that the relatively low overhead shown here indicates that the accuracy is simply worth the cost.
Note also that while these experiments only compared PP to one other tool, there are so few other profiling algorithms that this seems appropriate.
The QPT2 tool is noted to be the lowest overhead edge profiling tool; with the accuracy guarantees provided by PP, testing against other tools feels almost meaningless.&lt;&#x2F;p&gt;
&lt;p&gt;The complete timing results are provided below.
The authors also included a summary of accuracy results, which have been omitted for simplicity.&lt;&#x2F;p&gt;
&lt;img src=&quot;results.png&quot; alt=&quot;drawing&quot; width=&quot;300&quot;&#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The efficient path profiling paper introduced an algorithm showing that accurate path profiling &lt;em&gt;can&lt;&#x2F;em&gt; be done quickly and efficiently.
This insight resulted in an algorithm that is, as best I can tell, still in use today.
Through proving the minimal properties of this path profiling approach, this paper seems to have resolved the direction of profiling and standardized the algorithm presented here.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Superblock Scheduling for Bril</title>
                <pubDate>Thu, 24 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/superblock-scheduling/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/superblock-scheduling/</guid>
                <description>&lt;h2 id=&quot;a-very-long-instruction-word&quot;&gt;A Very Long Instruction Word&lt;&#x2F;h2&gt;
&lt;p&gt;Very Long Instruction Word (VLIW) architectures allow execution of multiple
instructions in the same cycle. Instead of executing a single instruction,
the architecture executes a &amp;quot;bundle&amp;quot; of independent instructions, allowing it
to harness Instruction Level Parallelism (ILP). The key trade-off VLIW
architectures make is pushing the instruction scheduling choices to the compiler.
While a modern out-of-order processor will discover and dynamically change
the execution order of a sequence of instructions, VLIW processors require
the compiler to explicitly schedule parallel operations into bundles.&lt;&#x2F;p&gt;
&lt;p&gt;The central challenge in designing VLIW compilers is minimizing the number of
&lt;code&gt;nop&lt;&#x2F;code&gt;s in bundles. A bundle has a &lt;code&gt;nop&lt;&#x2F;code&gt; when not enough parallel instructions
can be found to fit the bundle&#x27;s size (say 3 instructions per bundle).
Read-write conflicts, control dependencies, and structural hazards (not having
a sufficient number of hardware units to support a bundle&#x27;s parallel execution)
force compilers to move instructions into separate bundles to preserve
sequential semantics.&lt;&#x2F;p&gt;
&lt;p&gt;Superblock scheduling is an optimization for VLIW compilers that allows them to
generate denser bundles while increasing the program size. For our project, we
extended the Bril interpreter to emulate a VLIW machine and implemented the
superblock scheduling optimization to generate bundles of instructions from
source Bril programs. Finally, we evaluated the effectiveness of our
optimization by measuring (1) program cost (the number of bundles
executed) and (2) how often we fall out of a superblock (a sequence
of dense bundles).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;an-imaginary-vliw-machine&quot;&gt;An Imaginary VLIW machine&lt;&#x2F;h2&gt;
&lt;p&gt;We extended Bril&#x27;s interpreter to simulate a VLIW machine. In the Bril VLIW
machine, there are either bundles that contain up to 4 instructions or a single
instruction (equivalent to a bundle with that instruction and three &lt;code&gt;nop&lt;&#x2F;code&gt;s).&lt;&#x2F;p&gt;
&lt;p&gt;A bundle is defined as a sequence of conditions, a sequence
of four instructions, and a label.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;[ c1, c2, c3 ...; i1, i2, i3, i4; label ]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The semantics of the bundle is if &lt;code&gt;c1 &amp;amp;&amp;amp; c2 &amp;amp;&amp;amp; ...&lt;&#x2F;code&gt; is true, execute the
instructions, otherwise jump to the label. In hardware, this can be implemented
by guarding the write-back stage with the value of the conditional.&lt;&#x2F;p&gt;
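&lt;p&gt;These semantics can be sketched in a few lines of Python. This is only an illustration, not our interpreter extension: the tuple encoding of instructions, the &lt;code&gt;run_bundle&lt;&#x2F;code&gt; name, and the sample values are all assumptions made for the example.&lt;&#x2F;p&gt;

```python
from operator import add


def run_bundle(env, conds, instrs, label):
    """Execute one guarded bundle; return the label to jump to, or None."""
    if not all(env[c] for c in conds):
        return label  # a guard failed: write nothing and leave the trace
    # Read every operand before writing, so the instructions act in parallel.
    results = [(dest, op(*(env[a] for a in args))) for dest, op, args in instrs]
    for dest, value in results:
        env[dest] = value
    return None


# Hypothetical bundle: guard on v7, two independent additions, abort label.
BUNDLE = (["v7"], [("v1", add, ("v2", "v3")), ("v4", add, ("v2", "v0"))], "slow")

FAST_ENV = {"v7": True, "v2": 2, "v3": 3, "v0": 10}
FAST = run_bundle(FAST_ENV, *BUNDLE)

SLOW_ENV = {"v7": False, "v2": 2, "v3": 3, "v0": 10}
SLOW = run_bundle(SLOW_ENV, *BUNDLE)
print(FAST, FAST_ENV, SLOW)
```

&lt;p&gt;When the guard holds, both writes land and execution falls through; when it fails, no write happens and the caller is told to jump to the label.&lt;&#x2F;p&gt;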
&lt;h2 id=&quot;superblock-scheduling&quot;&gt;Superblock Scheduling&lt;&#x2F;h2&gt;
&lt;p&gt;A straightforward implementation of VLIW compilation at the basic block level can
simply try to perform code motion and bundling of instructions within a single basic
block. Since there are no jumps or branches, the code can be easily compacted.
However, basic blocks tend to be small, which leads to the compiler missing
out on optimization opportunities. This approach is referred to as &amp;quot;local compaction&amp;quot;.&lt;&#x2F;p&gt;
&lt;p&gt;In the following code block, the computations for &lt;code&gt;v1&lt;&#x2F;code&gt; and &lt;code&gt;v4&lt;&#x2F;code&gt; can be
performed in parallel (inside a bundle). However, because of the branch instruction
between the two blocks, the compiler cannot move the two instructions into
a bundle.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;b1:
  v1: int = add v2 v3
  br v7 b2 b3
b2:
  v4: int = add v2 v0 &#x2F;&#x2F; not live out of b2
  ...
b3:
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The core idea with superblock scheduling is finding &amp;quot;traces&amp;quot; of frequently
executed code by predicting which branch a program is going to take and building
a fast program path with dense bundles. In case the branch prediction is
wrong, the compiler adds abort labels to exit the trace.&lt;&#x2F;p&gt;
&lt;p&gt;With superblock scheduling, the code example above can be turned into:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;b1:
  [ v7 : v1: int = add v2 v3, v4: int = add v2 v0 : slow ]
  br v7 b2 b3
slow:
  v1: int = add v2 v3
  br v7 b2 b3
b2:
  v4: int = add v2 v0 &#x2F;&#x2F; not live out of b2
  ...
b3:
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the fast case (when &lt;code&gt;v7&lt;&#x2F;code&gt; is true), the program will compute both &lt;code&gt;v1&lt;&#x2F;code&gt; and
&lt;code&gt;v4&lt;&#x2F;code&gt; in parallel. If &lt;code&gt;v7&lt;&#x2F;code&gt; is false, the program switches back to normal
execution, computing &lt;code&gt;v1&lt;&#x2F;code&gt; and &lt;code&gt;v4&lt;&#x2F;code&gt; sequentially. A superblock compiler
might choose to make various parts of the &amp;quot;slow&amp;quot; program paths traces
themselves, trading off program size for speed.&lt;&#x2F;p&gt;
&lt;p&gt;The core of this algorithm is inspired largely by &lt;a href=&quot;https:&#x2F;&#x2F;people.eecs.berkeley.edu&#x2F;%7Ekubitron&#x2F;courses&#x2F;cs252-S12&#x2F;handouts&#x2F;papers&#x2F;TraceScheduling.pdf&quot;&gt;Trace Scheduling: A Technique for Global Microcode Compaction&lt;&#x2F;a&gt;. We simplified the algorithm to only allow entry into the top of the superblock so that we didn&#x27;t have to worry about patching the program in subtle ways.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;list-scheduling&quot;&gt;List Scheduling&lt;&#x2F;h3&gt;
&lt;p&gt;List scheduling is a heuristic-based algorithm that performs compaction
on a sequence of instructions. It takes a directed acyclic graph (DAG) representing the
scheduling constraints (such as read-write dependencies) between instructions and returns a list of bundles.&lt;&#x2F;p&gt;
&lt;p&gt;We start by presenting a high level overview of the algorithm and then we will go into more detail
about how we build the DAG. First, a heuristic assigns each instruction in the graph a priority.
Then we initialize a queue with all the instructions
that have no predecessors in the graph. We do the following in a loop until we have scheduled all the instructions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Make an empty bundle.&lt;&#x2F;li&gt;
&lt;li&gt;Take the instruction from the queue with the highest priority.&lt;&#x2F;li&gt;
&lt;li&gt;Add it to the bundle if compatible, i.e., has no conflicts with the bundle.&lt;&#x2F;li&gt;
&lt;li&gt;Continue taking instructions from the queue until either the queue is empty or the instruction
is incompatible with the current bundle.&lt;&#x2F;li&gt;
&lt;li&gt;For each instruction that we scheduled, check whether its successors now have all their predecessors scheduled, and if so add them to the queue.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The algorithm schedules an instruction only after all of its predecessors in the
DAG have been scheduled. This means that to maintain program correctness,
the predecessor relation must respect data dependencies.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;v1: int = id x;
x: int = const 10;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the program above, we cannot reorder the instructions because the second
instruction writes to &lt;code&gt;x&lt;&#x2F;code&gt; while the first reads from it. We can encode this
constraint in the DAG by making the first instruction a predecessor of the second.&lt;&#x2F;p&gt;
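&lt;p&gt;The loop described above can be sketched as follows. This is a simplified illustration rather than our implementation: it assumes the only conflict is a full bundle (four slots), uses the length of the longest dependency chain below an instruction as its priority, and runs on a hypothetical four-instruction DAG.&lt;&#x2F;p&gt;

```python
import heapq

BUNDLE_WIDTH = 4  # assumed slot count per bundle


def longest_chain(succ):
    """Priority of each node: length of the longest dependency chain from it."""
    depth = {}

    def visit(n):
        if n not in depth:
            depth[n] = 1 + max((visit(s) for s in succ[n]), default=0)
        return depth[n]

    for n in succ:
        visit(n)
    return depth


def list_schedule(succ):
    """succ maps each instruction to the instructions that must come after it."""
    priority = longest_chain(succ)
    pending = {n: 0 for n in succ}  # number of unscheduled predecessors
    for n in succ:
        for s in succ[n]:
            pending[s] += 1
    # heapq is a min-heap, so negate the priority to pop the highest first.
    ready = [(-priority[n], n) for n in succ if pending[n] == 0]
    heapq.heapify(ready)
    bundles = []
    while ready:
        bundle = []
        while ready and len(bundle) != BUNDLE_WIDTH:
            bundle.append(heapq.heappop(ready)[1])
        # Successors become ready only after the bundle is closed,
        # so dependent instructions never share a bundle.
        for n in bundle:
            for s in succ[n]:
                pending[s] -= 1
                if pending[s] == 0:
                    heapq.heappush(ready, (-priority[s], s))
        bundles.append(bundle)
    return bundles


# Hypothetical DAG: a and b are independent, c needs both, d needs a and c.
DAG = {"a": ["c", "d"], "b": ["c"], "c": ["d"], "d": []}
SCHEDULE = list_schedule(DAG)
print(SCHEDULE)
```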
&lt;h3 id=&quot;generating-superblocks&quot;&gt;Generating Superblocks&lt;&#x2F;h3&gt;
&lt;p&gt;The superblock algorithm generates a sequence of instructions and a DAG that are
used by the list scheduling algorithm to generate a program trace. Because the sequence of instructions
is speculative and may contain instructions that shouldn&#x27;t be executed, the
superblock algorithm has to encode data and control dependencies in the DAG
to allow an unmodified list scheduling algorithm to work.&lt;&#x2F;p&gt;
&lt;p&gt;The core insight of the superblock algorithm is to add the DAG constraint that
all the live-out variables in a branch &amp;quot;read&amp;quot; from the conditional. This
means that a write in a conditional cannot be moved outside it, and if
a program exits from the middle of a trace, it will never observe the writes
from a compacted branch body.&lt;&#x2F;p&gt;
&lt;p&gt;For example,&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  ...
  v6: bool = gt v4 v5;
  br v6 for.body.2 for.end.2;
for.body.2:
  v7: int = id result;
  v8: int = id i;
  v9: int = mul v7 v8;
  result: int = id v9;
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When we consider this as a trace, we ignore the label and assume that the
branch will jump to &lt;code&gt;for.body.2&lt;&#x2F;code&gt;. Without the constraint on the conditional,
the following is a valid reordering according to the list scheduling algorithm.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  ...
  v6: bool = gt v4 v5;
  v7: int = id result;
  v8: int = id i;
  v9: int = mul v7 v8;
  result: int = id v9;
  br v6 for.body.2 for.end.2;
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, &lt;code&gt;v6&lt;&#x2F;code&gt; can be false, which means we should have jumped to &lt;code&gt;for.end.2&lt;&#x2F;code&gt;
instead. Since the write was moved above the branch, &lt;code&gt;for.end.2&lt;&#x2F;code&gt; will observe
the write to &lt;code&gt;result&lt;&#x2F;code&gt;, which it shouldn&#x27;t in the normal case.&lt;&#x2F;p&gt;
&lt;p&gt;The other important ingredient in this algorithm is the heuristic that assigns priorities to instructions.
We follow &lt;a href=&quot;https:&#x2F;&#x2F;people.eecs.berkeley.edu&#x2F;%7Ekubitron&#x2F;courses&#x2F;cs252-S12&#x2F;handouts&#x2F;papers&#x2F;TraceScheduling.pdf&quot;&gt;Fisher&#x27;s&lt;&#x2F;a&gt;
example and use the &lt;strong&gt;highest levels first&lt;&#x2F;strong&gt; heuristic. The priority of each node is the depth of the longest chain
in the dependency DAG. He claims close to &amp;quot;optimal in practical environments&amp;quot; with this heuristic.
Given more time, it would be interesting to explore how changing this heuristic affects the results
of trace scheduling.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;For our project, we implemented four things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bril Extension&lt;&#x2F;strong&gt;: We extended the Bril interpreter by adding a &amp;quot;group&amp;quot;
instruction which has the semantics described above. The interpreter simply
executes the instructions in the group if the conditionals are true and
otherwise jumps to the label. We also add a &amp;quot;bundle counter&amp;quot; to track
the number of instructions executed in the interpreter. Each bundle and
instruction counts as one.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implement control and data flow analysis&lt;&#x2F;strong&gt;: We implemented a pass to generate
the CFG of the Bril program and a straightforward live variable analysis.
Both of these analyses are used by the superblock scheduling algorithm to
correctly incorporate branches into a trace.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Superblock scheduling&lt;&#x2F;strong&gt;: We generate a trace for the input program. We don&#x27;t
do anything smart for this; we just generate the longest trace we can from the
entry block for each function. We use the live variable information to
generate the DAG for the trace as described above and then run list scheduling
to compact the trace.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generating valid programs from traces&lt;&#x2F;strong&gt;: Once we have a program trace, we
change the programs to correctly jump into and out of traces. We also duplicate
the code in the trace to correctly work for the slow program path.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;cost-model&quot;&gt;Cost Model&lt;&#x2F;h3&gt;
&lt;p&gt;We evaluated our superblock optimization implementation using two metrics: static instruction count
and the number of instructions executed at runtime. For the instruction count, we ignored labels
and counted each bundle as a single instruction. We had a simple runtime cost model where both instructions
and bundles had a cost of one to run.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;comparison-to-local-compaction&quot;&gt;Comparison to Local Compaction&lt;&#x2F;h3&gt;
&lt;p&gt;We compared against unoptimized code as a baseline and then against local compaction. We expected
that we should be able to beat local compaction for runtime cost but not instruction count.
Our baseline cost for factorial was 272. With local compaction, the cost was 171 and with superblock
scheduling, our cost was down to 133. The baseline had 22 instructions. Superblock optimization used 17 instructions while local compaction
used 16 instructions. In general, superblock optimization increases code size because it sometimes
has to duplicate code to maintain correctness.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Superblock scheduling is an optimization that helps VLIW compilers generate
larger program traces thereby allowing for faster execution in the common case.
The underlying philosophy behind this optimization has been widely adopted by
modern JITs that dynamically generate specialized fast code for program paths
that get executed in the common case.&lt;&#x2F;p&gt;
&lt;p&gt;Trace scheduling, a more general version of superblock scheduling that allows
traces to have multiple inputs and outputs, was also studied as a part of
efforts to design VLIW machines. Specifically, the &lt;a href=&quot;https:&#x2F;&#x2F;courses.cs.washington.edu&#x2F;courses&#x2F;cse548&#x2F;16wi&#x2F;Fisher-VLIW.pdf&quot;&gt;ELI-512&lt;&#x2F;a&gt; was built with
special support to execute programs generated by its trace scheduling compiler
&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=912347&quot;&gt;Bulldog&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Loop-Level Automatic Vectorization</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/autovectoring/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/autovectoring/</guid>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;Modern processors have support for &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;SIMD&quot;&gt;SIMD&lt;&#x2F;a&gt; instructions, which allow for efficient vector operations. We can leverage this feature to optimize loops that operate iteratively on arrays by changing operations that act on single array elements into vector operations that act on multiple array values in one instruction. &lt;&#x2F;p&gt;
&lt;p&gt;Consider the following loop of a vector-vector add:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;# Assume a, b, and c represent array base addresses in memory such
# that the arrays do not overlap.
...
one: int = const 1;

vvadd_loop:
  ai: int = add a i;
  bi: int = add b i;
  ci: int = add c i;

  va: int = lw ai;
  vb: int = lw bi;
  vc: int = add va vb;
  sw vc ci;

  i: int = add i one;
  done: bool = ge i size;
  br done vvadd_done vvadd_loop;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compare this with the following loop:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;four: int = const 4;

vvadd_loop:
  ai: int = add a i;
  bi: int = add b i;
  ci: int = add c i;

  va: int = vload ai;
  vb: int = vload bi;
  vc: int = vadd va vb;
  vstore vc ci;

  i: int = add i four;
  done: bool = ge i size;
  br done vvadd_done vvadd_loop;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The second loop executes for a fourth the number of iterations of the first loop while behaving identically by using &lt;code&gt;vload&lt;&#x2F;code&gt;, &lt;code&gt;vstore&lt;&#x2F;code&gt;, and &lt;code&gt;vadd&lt;&#x2F;code&gt; that operate on four array elements at a time. This allows for &lt;em&gt;i&lt;&#x2F;em&gt; to be incremented by four each iteration instead of by one. &lt;&#x2F;p&gt;
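&lt;p&gt;To see that the two loops compute the same result, here is a small Python model of both. It is only a sketch: memory is a flat list, the function names are made up, and the vector version assumes &lt;code&gt;size&lt;&#x2F;code&gt; is a multiple of four and that the arrays do not overlap.&lt;&#x2F;p&gt;

```python
WIDTH = 4  # vector length assumed for vload, vadd, and vstore


def vvadd_scalar(mem, a, b, c, size):
    """One element per iteration, like the first loop."""
    for i in range(size):
        mem[c + i] = mem[a + i] + mem[b + i]


def vvadd_vector(mem, a, b, c, size):
    """Four elements per iteration; assumes size is a multiple of WIDTH."""
    for i in range(0, size, WIDTH):
        va = mem[a + i:a + i + WIDTH]          # models vload
        vb = mem[b + i:b + i + WIDTH]          # models vload
        vc = [x + y for x, y in zip(va, vb)]   # models vadd
        mem[c + i:c + i + WIDTH] = vc          # models vstore


SIZE = 8
# Three back-to-back arrays: a at 0, b at SIZE, c at 2 * SIZE.
INITIAL = list(range(SIZE)) + [10 * k for k in range(SIZE)] + [0] * SIZE
SCALAR_MEM = list(INITIAL)
VECTOR_MEM = list(INITIAL)
vvadd_scalar(SCALAR_MEM, 0, SIZE, 2 * SIZE, SIZE)
vvadd_vector(VECTOR_MEM, 0, SIZE, 2 * SIZE, SIZE)
print(SCALAR_MEM == VECTOR_MEM)
```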
&lt;p&gt;For this project, we designed and implemented automatic loop vectorization by converting serial operations on array elements to their vector counterparts. We build on  Philip Bedoukian&#x27;s work that brings &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;interpreter-vector-support&#x2F;&quot;&gt;vector instruction support into the Bril interpreter&lt;&#x2F;a&gt;. We use his array implementation and the vector operations that he provides which operate on vectors of length four. We did not use his C++ dynamic library, and instead wrote a new dynamic library in Rust with a wider range of function calls as we were more familiar with the Rust specification. In addition, this choice proved to be a valuable learning opportunity. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;To vectorize a loop, we must first detect loops, check whether they have dependencies that prevent vectorization, vectorize array operations, and deal with extraneous serial instructions. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;dominator-analysis&quot;&gt;Dominator Analysis&lt;&#x2F;h4&gt;
&lt;p&gt;We need a dominator analysis to find loops, and so we implemented this dataflow analysis as follows:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Edge&lt;&#x2F;strong&gt;: Each out edge of a block consists of all blocks that dominate it, including itself&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Direction&lt;&#x2F;strong&gt;: Forward&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Initial Value&lt;&#x2F;strong&gt;: Set of all blocks&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Merge&lt;&#x2F;strong&gt;: Set intersection (or empty set if first block in program)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Transfer&lt;&#x2F;strong&gt;: Current block unioned with in-edge&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
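&lt;p&gt;A minimal Python sketch of this analysis is shown below. It is an illustration under assumptions, not our pass: the CFG is a plain dictionary from block names to successor lists, the block names are hypothetical, and the fixed point is found by simple repeated sweeps rather than a worklist.&lt;&#x2F;p&gt;

```python
def dominators(cfg, entry):
    """cfg maps each block to its successors; returns {block: set of dominators}."""
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    dom = {b: set(cfg) for b in cfg}  # initial value: the set of all blocks
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == entry:
                continue
            # Merge: intersect what flows out of every predecessor.
            if preds[b]:
                merged = set.intersection(*(dom[p] for p in preds[b]))
            else:
                merged = set()
            new = merged | {b}  # transfer: union with the current block
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


# Hypothetical CFG with a single loop: header, body, back to header.
CFG = {"entry": ["header"], "header": ["body", "exit"],
       "body": ["header"], "exit": []}
DOM = dominators(CFG, "entry")
print(DOM)
```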
&lt;h4 id=&quot;loops&quot;&gt;Loops&lt;&#x2F;h4&gt;
&lt;p&gt;We first define a &lt;em&gt;back edge&lt;&#x2F;em&gt; as an edge from A to B where A and B are basic blocks and B dominates A. This is essentially the control flow edge that transitions from the end of the loop back to the beginning. As such, we define the loop as all blocks along the path from B to A, which we find using DFS. &lt;&#x2F;p&gt;
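&lt;p&gt;One way to sketch this in Python is shown below. It assumes the dominator sets are already available and uses the standard natural-loop construction, a search backwards from A that stops at B, which may differ in detail from the DFS we used. The CFG and block names are hypothetical.&lt;&#x2F;p&gt;

```python
def back_edges(cfg, dom):
    """Edges from A to B where the target B dominates the source A."""
    return [(a, b) for a in cfg for b in cfg[a] if b in dom[a]]


def loop_blocks(cfg, tail, header):
    """Blocks that reach tail without passing through header, plus header."""
    preds = {b: [] for b in cfg}
    for b in cfg:
        for s in cfg[b]:
            preds[s].append(b)
    body = {header}  # seeding with the header stops the search there
    stack = [tail]
    while stack:
        block = stack.pop()
        if block not in body:
            body.add(block)
            stack.extend(preds[block])
    return body


# Hypothetical CFG and precomputed dominator sets for illustration.
CFG = {"entry": ["header"], "header": ["body", "exit"],
       "body": ["latch"], "latch": ["header"], "exit": []}
DOM = {
    "entry": {"entry"},
    "header": {"entry", "header"},
    "body": {"entry", "header", "body"},
    "latch": {"entry", "header", "body", "latch"},
    "exit": {"entry", "header", "exit"},
}
EDGES = back_edges(CFG, DOM)
LOOP = loop_blocks(CFG, *EDGES[0])
print(EDGES, LOOP)
```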
&lt;p&gt;Now that we are able to find a loop, it must have the following elements to be vectorizable:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A branch statement (to exit the loop)
&lt;ul&gt;
&lt;li&gt;e.g., &lt;code&gt;br done exit loop_header;&lt;&#x2F;code&gt; &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;A condition variable (boolean variable for the branch argument)
&lt;ul&gt;
&lt;li&gt;e.g., &lt;code&gt;done: bool = eq i size;&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;An &lt;em&gt;induction variable&lt;&#x2F;em&gt; that increases or decreases each iteration
&lt;ul&gt;
&lt;li&gt;e.g., &lt;code&gt;i: int = add i one;&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These are the minimum requirements for creating a loop that iterates a set number of times. To actually get these values, we need to enforce that there is only one branch statement in the loop. We do not handle loops with multiple branches in this project, as it is much more complex to handle vectorization for nested loops or loops where some statements run more iterations than other statements. We also restrict the induction variable, condition variable, and branch statement to be the last three instructions in the loop to enforce the same number of total iterations for all statements, which would not be the case if statements were allowed after the branch, since those would be executed one fewer time than statements before the branch.&lt;&#x2F;p&gt;
&lt;p&gt;From these, we can find more information such as the variable that specifies the bound, which can be deduced from the arguments for the condition variable as one argument must be the induction variable, leaving the other to be the bound. &lt;&#x2F;p&gt;
&lt;p&gt;We now want to verify properties to ensure that this loop is indeed vectorizable. For example, we need to know that the bound variable is a constant and not something loaded from memory, or we won&#x27;t be able to determine how many times the loop is executed. This requires us to know the latest definition of variables, which we get using a reaching definitions analysis. It is possible to find that statement by walking backwards through the CFG, but we found using a dataflow analysis to be a much cleaner approach. Furthermore, we&#x27;d also like to know values of variables at different program points, such as the base pointers for array loads or stores, which we can get through a copy propagation analysis. &lt;&#x2F;p&gt;
&lt;p&gt;For both of those analyses, we found it to be very helpful to have information at the statement level instead of the block level. For example, in the block with the condition variable, it is possible that the bound is defined in that block before the condition variable statement and then again after the statement, and so if we only had the copy propagation values at the in and out edges of that block, we would not get the correct value the condition statement sees. We wanted to avoid iterating through blocks to make sure variables are not redefined, so we made each statement a block by adding trivial jumps and labels between statements, thereby getting finer-grained analyses.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;reaching-definitions&quot;&gt;Reaching Definitions&lt;&#x2F;h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Edge&lt;&#x2F;strong&gt;: Each edge is a dictionary from variable names to their latest definition, or to &lt;em&gt;None&lt;&#x2F;em&gt; if multiple definitions reach that point.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Direction&lt;&#x2F;strong&gt;: Forward&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Initial Value&lt;&#x2F;strong&gt;: Empty dictionary&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Merge&lt;&#x2F;strong&gt;: Union the in-edge dictionaries. A variable (key) that exists in only one dictionary keeps its definition statement (value); a variable that exists in multiple dictionaries has its definition set to &lt;em&gt;None&lt;&#x2F;em&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Transfer&lt;&#x2F;strong&gt;: Starting with the merged in-edge (a dictionary from variable to definition), we set the value for every variable defined in the current block to that definition statement.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
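&lt;p&gt;The merge and transfer functions can be sketched in Python as below. This is an illustration under assumptions: instructions are dictionaries with an optional &lt;code&gt;dest&lt;&#x2F;code&gt; field, as in Bril&#x27;s JSON form, the function names are made up, and the strings standing in for earlier definitions are placeholders.&lt;&#x2F;p&gt;

```python
def merge(in_edges):
    """Union the in-edge dictionaries; a variable on several edges maps to None."""
    merged = {}
    for edge in in_edges:
        for var, definition in edge.items():
            # A variable already seen on an earlier in-edge becomes ambiguous.
            merged[var] = None if var in merged else definition
    return merged


def transfer(block, in_edge):
    """Every variable written in the block now maps to its defining instruction."""
    out = dict(in_edge)
    for instr in block:
        dest = instr.get("dest")
        if dest is not None:
            out[dest] = instr
    return out


# Hypothetical block and incoming edges for illustration.
BLOCK = [
    {"op": "const", "dest": "one", "type": "int", "value": 1},
    {"op": "add", "dest": "i", "type": "int", "args": ["i", "one"]},
]
IN_EDGE = merge([{"i": "init", "size": "bound"}, {"i": "incr"}])
OUT_EDGE = transfer(BLOCK, IN_EDGE)
print(IN_EDGE)
```

&lt;p&gt;Here &lt;code&gt;i&lt;&#x2F;code&gt; arrives from two predecessors and so merges to &lt;em&gt;None&lt;&#x2F;em&gt;, while the block&#x27;s own write to &lt;code&gt;i&lt;&#x2F;code&gt; makes it unambiguous again on the out-edge.&lt;&#x2F;p&gt;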
&lt;h4 id=&quot;copy-propagation&quot;&gt;Copy Propagation&lt;&#x2F;h4&gt;
&lt;p&gt;We used the copy propagation analysis already in the Bril repo.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;validity-checks&quot;&gt;Validity Checks&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we have reaching definitions and copy propagation per statement, we can check whether a loop is vectorizable.&lt;&#x2F;p&gt;
&lt;p&gt;The primary reason a loop is not vectorizable is &lt;em&gt;flow-dependencies&lt;&#x2F;em&gt;. A flow-dependency occurs when a variable uses information from the previous iteration of a loop, such as when a value is loaded from an array position that was written to in the previous iteration of the loop. To detect this, we use the fact that indexing into an array is typically done by adding an offset to the base pointer of an array. We are able to find the variables representing array pointers by examining load and store instructions&#x27; arguments, and then we use information from our reaching definitions analysis to check whether those variables are computed each iteration as an addition of a constant and another variable. &lt;&#x2F;p&gt;
&lt;p&gt;Since our vector load&#x2F;store instructions access four consecutive values, we also need to enforce that the non-constant variable is the induction variable, and also that the induction variable must increment or decrement by exactly 1 every iteration. This ensures that arrays are always accessed sequentially. &lt;&#x2F;p&gt;
&lt;p&gt;The number of iterations can be computed from the bound variable, the condition variable, and the initial value of the induction variable, and this number allows us to find array lengths since arrays are sequentially accessed per iteration. With the base array pointers and array lengths, we can now check that arrays do not overlap, which then proves that they cannot have flow-dependencies as each array location can only be accessed once in the duration of the loop. &lt;&#x2F;p&gt;
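&lt;p&gt;To make the iteration-count and overlap reasoning concrete, here is a small Python sketch. It assumes the loop exits once the induction variable equals the bound and that the step is exactly +1 or -1, as required above; the function names and the sample addresses are hypothetical.&lt;&#x2F;p&gt;

```python
def trip_count(start, bound, step):
    """Iterations of a loop that steps by +1 or -1 and exits once i == bound."""
    if step not in (1, -1):
        raise ValueError("the induction variable must step by exactly 1")
    count = (bound - start) * step
    if not count >= 1:
        raise ValueError("the loop never reaches its bound")
    return count


def accessed_range(base, start, step, count):
    """Lowest and highest address touched when indexing base + i each iteration."""
    first = base + start
    last = first + (count - 1) * step
    return min(first, last), max(first, last)


def disjoint(range_a, range_b):
    """True when two inclusive address ranges share no location."""
    (a_lo, a_hi), (b_lo, b_hi) = range_a, range_b
    return b_lo > a_hi or a_lo > b_hi


count = trip_count(start=0, bound=8, step=1)
range_a = accessed_range(100, 0, 1, count)  # addresses 100..107
range_b = accessed_range(108, 0, 1, count)  # addresses 108..115
range_c = accessed_range(104, 0, 1, count)  # addresses 104..111
```

&lt;p&gt;With these made-up base pointers, &lt;code&gt;range_a&lt;&#x2F;code&gt; and &lt;code&gt;range_b&lt;&#x2F;code&gt; are disjoint, whereas &lt;code&gt;range_a&lt;&#x2F;code&gt; and &lt;code&gt;range_c&lt;&#x2F;code&gt; overlap, so a loop touching both would be rejected.&lt;&#x2F;p&gt;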
&lt;p&gt;To be able to convert singular addition operations to vector addition, we also check that operations on loaded values must only involve loop-invariant variables (i.e., variables that do not differ per iteration) so that those operations are not flow-dependent. We then check that stores only store variables that are either constant or are results from operations on loaded values (which we previously enforced to be flow-independent).&lt;&#x2F;p&gt;
&lt;p&gt;After we are confident that the array operations in a loop are vectorizable, we now need to convert the loop structure to its equivalent vectorized form.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;strip-mining&quot;&gt;Strip Mining&lt;&#x2F;h3&gt;
&lt;p&gt;Since we are working with vector operations on four consecutive array elements, we &amp;quot;chunk&amp;quot; the sequence of loop iterations in blocks of four, which is known as &lt;em&gt;strip mining&lt;&#x2F;em&gt;. &lt;&#x2F;p&gt;
&lt;p&gt;Trivially, strip mining can be done by finding the statement that increments the induction variable and changing it from incrementing&#x2F;decrementing 1 to incrementing&#x2F;decrementing 4. All operations on array elements are changed to their vector counterparts, e.g., &lt;code&gt;add&lt;&#x2F;code&gt; would become &lt;code&gt;vadd&lt;&#x2F;code&gt;. This would work as long as the loop only included operations on arrays, but we found that to often not be the case. &lt;&#x2F;p&gt;
&lt;p&gt;For example, printing &lt;em&gt;i&lt;&#x2F;em&gt; each iteration, where &lt;em&gt;i&lt;&#x2F;em&gt; is the induction variable, should still allow the loop to be vectorized as there are no flow-dependencies, but if we change the induction variable by four each iteration, that print statement will behave incorrectly as it would only print one out of every four values. To mitigate this, we do partial loop unrolling when strip mining.&lt;&#x2F;p&gt;
&lt;p&gt;To achieve this, we go through the loop, find the non-array instructions, and keep track of them. Then, we append them, together with a copy of the induction variable increment&#x2F;decrement, to the end of the loop (before the condition and branch statements). We do this insertion three times in total. &lt;&#x2F;p&gt;
&lt;p&gt;Example loop snippet before strip mining and partial unrolling:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;loop:
  print i;
  ci: int = add c i;
  v: int = lw ci;
  i: int = add i inc;
  done: bool = eq i size;
  br done exit loop;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After strip mining and partial unrolling:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;loop:
  print i;
  ci: int = add c i;
  v: int = vload ci;
  i: int = add i inc;
  print i;
  i: int = add i inc;
  print i;
  i: int = add i inc;
  print i;
  i: int = add i inc;
  done: bool = eq i size;
  br done exit loop;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With this method, array operations are allowed to be vectorized while preserving serial instructions because serial instructions are unrolled into four copies, where each copy operates with a different induction variable value. This loop also increments the induction variable by four every iteration which preserves the performance increase from reduced branch overhead. &lt;&#x2F;p&gt;
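&lt;p&gt;The transformation on the example above can be sketched in a few lines of Python. This is only an illustration that treats instructions as plain strings; &lt;code&gt;strip_mine&lt;&#x2F;code&gt; and &lt;code&gt;vector_form&lt;&#x2F;code&gt; are hypothetical names, and deciding which instructions count as array instructions is assumed to have been done already.&lt;&#x2F;p&gt;

```python
def strip_mine(body, vector_form, increment, vector_size=4):
    """Vectorize the array instructions and unroll the serial ones.

    body: loop instructions that precede the condition and branch.
    vector_form: maps each array instruction to its vectorized replacement.
    increment: the induction variable update.
    """
    serial = [ins for ins in body if ins not in vector_form and ins != increment]
    result = [vector_form.get(ins, ins) for ins in body]
    # Append the serial instructions and the increment three more times.
    for _ in range(vector_size - 1):
        result += serial + [increment]
    return result


body = ["print i;", "ci: int = add c i;", "v: int = lw ci;", "i: int = add i inc;"]
vector_form = {
    "ci: int = add c i;": "ci: int = add c i;",  # address computation, kept once
    "v: int = lw ci;": "v: int = vload ci;",
}
unrolled = strip_mine(body, vector_form, "i: int = add i inc;")
```

&lt;p&gt;For this input, &lt;code&gt;unrolled&lt;&#x2F;code&gt; lists the same ten instructions as the loop body shown after strip mining, in the same order.&lt;&#x2F;p&gt;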
&lt;p&gt;In terms of implementation, we found it much easier to first coalesce the loop into one big block before running the strip mine algorithm because we could treat this sequence of instructions as one array instead of having to worry about jumps and label renaming from inserting new blocks. Aggregating all the blocks of this loop was possible because we previously enforced that there can be exactly one branch instruction located at the end of the loop.&lt;&#x2F;p&gt;
&lt;p&gt;Up to here, we have been operating on the assumption that array sizes are divisible by the vector size, four. To account for arrays whose sizes are not divisible by four, we append a copy of this loop (without any optimizations) to a block that follows the main optimized loop. This serially executes the remaining loop iterations, which allows us to maintain correctness. This also means that we need to round the loop bound of the vectorized loop down to a multiple of four but keep the original bound for the serial loop that follows.&lt;&#x2F;p&gt;
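&lt;p&gt;The shape of the resulting code, a strip-mined main loop followed by a serial remainder loop, can be mimicked in Python as follows. This is a behavioral sketch only: list slices stand in for the vector instructions, Python &lt;code&gt;for&lt;&#x2F;code&gt; loops stand in for the Bril loops, and &lt;code&gt;vvadd&lt;&#x2F;code&gt; and &lt;code&gt;vec_bound&lt;&#x2F;code&gt; are illustrative names.&lt;&#x2F;p&gt;

```python
VECTOR_SIZE = 4


def vvadd(a, b):
    """Element-wise add: a strip-mined main loop, then a serial remainder loop."""
    n = len(a)
    c = [0] * n
    # Round the bound down to a multiple of the vector size.
    vec_bound = n - n % VECTOR_SIZE
    for i in range(0, vec_bound, VECTOR_SIZE):
        # Stands in for vload, vload, vadd, vstore on four elements.
        chunk_a = a[i:i + VECTOR_SIZE]
        chunk_b = b[i:i + VECTOR_SIZE]
        c[i:i + VECTOR_SIZE] = [x + y for x, y in zip(chunk_a, chunk_b)]
    # Unoptimized copy of the loop, keeping the original bound.
    for i in range(vec_bound, n):
        c[i] = a[i] + b[i]
    return c


result = vvadd(list(range(10)), [1] * 10)
```

&lt;p&gt;With ten elements, the main loop handles indices 0 through 7 in two chunks and the remainder loop handles indices 8 and 9.&lt;&#x2F;p&gt;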
&lt;h3 id=&quot;rust-dynamic-library-with-foreign-function-interface&quot;&gt;Rust Dynamic Library with Foreign Function Interface&lt;&#x2F;h3&gt;
&lt;p&gt;In order to more accurately measure the performance impact of automatic vectorization, we design a dynamically linked library in Rust with a foreign function interface. We use the Rust crate for SIMD that targets the x86 platform to utilize SIMD intrinsics to support vector-add, vector-multiply, and vector-subtract. These functions are called in the dispatch loop of the interpreter for the corresponding vectorized instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the Rust function for vectorized addition.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
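&lt;span style=&quot;color:#323232;&quot;&gt;&#x2F;&#x2F; Requires data_a, data_b, and data_c to point to 16-byte-aligned arrays
&#x2F;&#x2F; of 32-bit integers of length 4.
&#x2F;&#x2F; Adds the arrays pointed to by data_a and data_b element-wise
&#x2F;&#x2F; and stores the result in the array pointed to by data_c.
use std::arch::x86_64::{__m128i, _mm_add_epi32, _mm_load_si128, _mm_store_si128};

#[no_mangle]
pub extern &quot;C&quot; fn vadd(data_a: *const i32, data_b: *const i32, data_c: *mut i32) {
    unsafe {
        &#x2F;&#x2F; The intrinsics take __m128i pointers, so cast the i32 pointers.
        let a = _mm_load_si128(data_a as *const __m128i);
        let b = _mm_load_si128(data_b as *const __m128i);
        let c = _mm_add_epi32(a, b);
        _mm_store_si128(data_c as *mut __m128i, c)
    }
}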
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Calls to the intrinsics in the Rust SIMD crate must currently be wrapped in an unsafe block. This is because it is the programmer&#x27;s responsibility to ensure that the CPU the program is running on supports the function being called. For example, it is unsafe to call an AVX2 function on a CPU that does not actually support AVX2. (AVX2 is an extension to the x86 ISA that supports SIMD operations.)&lt;&#x2F;p&gt;
&lt;p&gt;In order to justify these unsafe blocks, we used the &lt;code&gt;#[cfg]&lt;&#x2F;code&gt; attribute when compiling. This checks at compile time whether the target supports the CPU feature (detecting it at run time would instead use the &lt;code&gt;is_x86_feature_detected!&lt;&#x2F;code&gt; macro), and lets us provide a fallback implementation in case the CPU does not support SIMD operations. The processor we used for evaluation supported AVX2, so we did not need a fallback implementation for evaluation.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the corresponding invocation of this Rust function in the interpreter for vector add.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;case &amp;quot;vadd&amp;quot;: {
    let vecA = getVec(instr, env, 0);
    let vecB = getVec(instr, env, 1);
    let vecC = new Int32Array(fixedVecSize);
    vadd(vecA, vecB, vecC);
    env.set(instr.dest, vecC);    
    return NEXT;
  }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These external calls add potentially significant overhead to the execution. We quantify this overhead to more accurately evaluate execution time in the interpreter. We run a vvadd benchmark, documented below, with array sizes of (128, 1024, 2048, 4196, 8192). We run 10 iterations of each configuration and average the execution time across them. We then compare execution time with and without calls to the Rust library. We find there is a 16% overhead for making Rust calls. While we predict this will be offset for especially large arrays, we decided to add Rust calls for serialized instructions as well, so that the Rust call overhead applies to both versions and the vectorized instructions remain the only variable in our experiment.&lt;&#x2F;p&gt;
&lt;p&gt;In addition, note the loads and stores that accompany the vectorized add instruction. The &lt;code&gt;vadd&lt;&#x2F;code&gt; function implements &lt;code&gt;c[i] = a[i] + b[i]&lt;&#x2F;code&gt;. However, this line translated to Bril IR would look like the code below. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;va: vector = vload ai;
vb: vector = vload bi;
vc: vector = vadd va vb;
vstore vc ci;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Ideally, these instructions would be in separate functions in our Rust library with separate invocations in the interpreter. However, TypeScript does not have a compatible type for &lt;code&gt;__m128i&lt;&#x2F;code&gt;, which represents a 128-bit SIMD register. In order to separate the &lt;code&gt;vadd&lt;&#x2F;code&gt; from &lt;code&gt;vload&lt;&#x2F;code&gt; and &lt;code&gt;vstore&lt;&#x2F;code&gt;, we would either need to return a &lt;code&gt;__m128i&lt;&#x2F;code&gt;, or invoke additional functions in the Rust SIMD crate to unpack this value into four 32-bit integers. For the former, without a compatible type for &lt;code&gt;__m128i&lt;&#x2F;code&gt;, we did not find a way to write a signature such that we could accept or return a &lt;code&gt;__m128i&lt;&#x2F;code&gt; from a single vector add. For the latter, adding functions to unpack integers adds potentially significant overhead every time we pass vectors between TypeScript and Rust. Therefore, we group these as a unit to more accurately mimic SIMD operations, and make &lt;code&gt;vstore&lt;&#x2F;code&gt; and &lt;code&gt;vload&lt;&#x2F;code&gt; effectively no-ops in the interpreter. This is valid as our implementation of automatic vectorization ensures each vload-vload-vstore is accompanied by some vector operation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;In order to verify correctness, we chose self-verifying programs, such that executing the program itself shows that it computed the expected values and is thus correct in terms of inputs&#x2F;outputs. These programs include vector-vector add, vector-vector multiply, and vector-vector subtract.
In addition, we tested programs with combinations of these operations, as well as programs with dependencies to ensure we only conservatively unroll. 
Consider the following loop.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;a[i] = mul sa[i] a[i]
b[i] = mul sb[i] b[i]
c[i] = add a[i] b[i]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this case, while there are dependencies within the loop body, there are no dependencies between iterations of the loop. Therefore, we expect this operation will be vectorized.&lt;&#x2F;p&gt;
&lt;p&gt;We also enforced many constraints for what kinds of loops can be optimized, but we believe that it does not significantly impact expressivity. The loop format we described is most similar to a do-while loop where the loop condition is checked at the very end, but it is simple to compile regular while loops, do-while loops, and for loops from a higher level language into this Bril IR loop format using slight modifications to the loop bounds and conditions. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;p&gt;Determining how to measure the impact of this optimization was challenging as the primary benefit comes from special instructions in the SIMD instruction set that can operate on multiple data points in parallel.&lt;&#x2F;p&gt;
&lt;p&gt;One option is to implement 4 add operations in the interpreter&#x27;s dispatch loop for vadd, 4 sub operations for vsub, etc. While this would not affect program correctness, it does not attempt to mimic SIMD registers that perform a SIMD operation in the processor. Therefore, it is likely to mispredict realistic speedup by naively ignoring external factors. In addition, the performance benefit seen under this approach would likely only reflect the decrease in the number of instructions, rather than the intrinsic properties of the SIMD instructions.&lt;&#x2F;p&gt;
&lt;p&gt;A second option is to use SIMD instructions in x86 with SIMD intrinsics. This approach involves executing SIMD instructions, which should much more closely mirror the behavior of SIMD operations. For this reason, we modeled the SIMD operations this way, using the aforementioned Rust dynamic library.&lt;&#x2F;p&gt;
&lt;p&gt;To evaluate our implementation, we run vvadd benchmarks. We run each benchmark on arrays of varying sizes (16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192). We initially ran 5 iterations per array size per benchmark, but noticed the execution times significantly decreased after each iteration. This is likely due to the CPU warming up, caching, and better branch prediction after each iteration. Therefore, we decided to increase this number to 25, after which the difference in execution times for sequential runs was less than 0.01 ms. &lt;&#x2F;p&gt;
&lt;img src=&quot;graph.png&quot; style=&quot;max-width: 100%&quot; &gt;
&lt;p&gt;We ran these benchmarks and calculated the error bars by computing the standard deviation across the iterations for a given array size.&lt;&#x2F;p&gt;
&lt;p&gt;While the Rust SIMD calls do not appear to yield the expected performance improvement, there are a couple of points to note. Factors that could contribute to this include communication costs and the coupling of loads and stores with vectorized operations.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;communication-costs&quot;&gt;Communication Costs&lt;&#x2F;h4&gt;
&lt;p&gt;The cost of the Rust calls likely diminishes the visible improvement. While we tried to minimize this by writing functions for serial operations in Rust as well, doing so inaccurately emphasizes the number of instructions, as we artificially add a scalar overhead to each of them.&lt;&#x2F;p&gt;
&lt;p&gt;In addition, the calls from TypeScript to the Rust FFI are opaque and it is unclear how exactly they are made. It is also unclear how the arguments are passed in to the function, and what the associated costs are. Therefore, it is possible that some of the variance in the results can be attributed to the variance of these calls.&lt;&#x2F;p&gt;
&lt;p&gt;A next step to better isolate these vectorized instructions from the Rust call overhead would be to write the entire interpreter in Rust, or some other language, and execute the SIMD instructions inline without function calls.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;coupling-of-loads-and-stores&quot;&gt;Coupling of Loads and Stores&lt;&#x2F;h4&gt;
&lt;p&gt;As mentioned in the Rust FFI section above, the Rust functions that are invoked for a vector-add are coupled with vload and vstore as TypeScript does not have a corresponding type for &lt;code&gt;__m128i&lt;&#x2F;code&gt;. Therefore, it is possible that there are extra vloads and vstores performed for the vectorized programs. This would also diminish the potential performance impact. &lt;&#x2F;p&gt;
&lt;p&gt;To isolate each operation in Bril, we would need to write the interpreter in a language that has types compatible with 128-bit packed integers. This would enable the program to return the value from a vector load in the Dynamic Library to the interpreter. Similarly, we could then pass these 128-bit packed integers back to the Dynamic Library for the vectorized arithmetic operations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;challenges&quot;&gt;Challenges&lt;&#x2F;h2&gt;
&lt;p&gt;The two biggest challenges we faced were finding flow-dependencies and linking our Rust library. Since Bril is an IR, we do not have while, do-while, or for loops, which are clearly defined and have easily identifiable loop guards. In Bril, we have to use analyses to find those loops and then check for dependencies. Incorporating calls to our Rust library was also difficult because we had to translate values between Rust and TypeScript, and building the library itself was challenging because the SIMD crate (Rust package) was unsafe and frequently resulted in segfaults. &lt;&#x2F;p&gt;
&lt;p&gt;Working with Python also proved somewhat difficult due to the lack of types. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;We were able to correctly implement automatic vectorization in the Bril interpreter along with a dynamic library in Rust. However, we were not able to obtain a reliable execution speedup of these instructions due to the variance of FFI calls, coupling of loads and stores in our library functions, and extraneous cache factors.&lt;&#x2F;p&gt;
&lt;p&gt;In this work, automatic vectorization was implemented for Bril. A next step would be to relax the loop restrictions, for example by allowing an arbitrary number of branch instructions to exist in a loop. This would be possible with a smart analysis of how many times the code between each branch instruction is executed. We can also eliminate the overhead caused by calls to Rust by rewriting the Bril interpreter in Rust for a more accurate performance analysis. &lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>BrilPRE: A tool for Partial Redundancy Elimination on Bril</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/brilpre/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/brilpre/</guid>
                <description>&lt;h3 id=&quot;problem&quot;&gt;Problem&lt;&#x2F;h3&gt;
&lt;p&gt;Partial redundancy elimination (PRE) is a compiler optimization that eliminates expressions that 
are redundant on some but not necessarily all paths through a program.&lt;&#x2F;p&gt;
&lt;p&gt;For example, in the following code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
    a:int = const 20;
    n:int = const 1000;
    cmp:bool = const true;
    br cmp here end;
here:
    n:int = add a a;
end:
    b:int = add a a;
    print n;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The expression &lt;code&gt;a + a&lt;&#x2F;code&gt; assigned to &lt;code&gt;b&lt;&#x2F;code&gt; is partially redundant when &lt;code&gt;cmp&lt;&#x2F;code&gt; is true.
One possible PRE optimization would be:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
    a:int = const 20;
    n:int = const 1000;
    cmp:bool = const true;
    br cmp here newbranch;
newbranch:
    tmp:int = add a a;
    jmp end;
here:
    tmp:int = add a a;
    n:int = id tmp;
end:
    b:int = id tmp;
    print n;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, no matter which path is taken, &lt;code&gt;a + a&lt;&#x2F;code&gt; is computed only once. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;design&quot;&gt;Design&lt;&#x2F;h3&gt;
&lt;p&gt;This tool applies the algorithm called &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=143136&quot;&gt;lazy code motion&lt;&#x2F;a&gt;.
While preserving &lt;em&gt;computational optimality&lt;&#x2F;em&gt; (no partially redundant expression computations), 
this algorithm guarantees &lt;em&gt;lifetime optimality&lt;&#x2F;em&gt; (computations are placed as early as necessary but as late as possible). Lifetime optimality is critical to reduce register pressure and therefore improve performance.&lt;&#x2F;p&gt;
&lt;p&gt;This algorithm involves four passes over the control flow graph (CFG) of the program. 
Each of them is a data flow analysis using the worklist algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;Before the passes, this algorithm adds a new empty block between each block with multiple predecessors and each of its predecessors. And in the end, all empty blocks are removed. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;pass-1-anticipated-expressions&quot;&gt;Pass 1: Anticipated Expressions&lt;&#x2F;h4&gt;
&lt;p&gt;An expression is &lt;em&gt;anticipated&lt;&#x2F;em&gt; at a point if it is certain to be evaluated 
along any path before this expression&#x27;s value is changed 
(any variables its evaluation depends on are reassigned).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;direction: backward

kill(b) = set of expressions defined and not evaluated afterward
use(b) = set of expressions evaluated before killed in b
def(b) = set of expressions whose related variables are reassigned in b 

merge function = intersection
anticipated.out(exit) = empty set
anticipated.in(b) = (anticipated.out(b) - def(b)) union use(b)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;h4 id=&quot;pass-2-available-expressions&quot;&gt;Pass 2: Available Expressions&lt;&#x2F;h4&gt;
&lt;p&gt;An expression is &lt;em&gt;available&lt;&#x2F;em&gt; at a point if it is available in the usual sense 
&lt;em&gt;assuming all anticipated expressions at this point are precomputed&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;direction: forward

merge function: intersection
available.in(entry) = empty set
available.out(b) = (available.in(b) union anticipated.in(b)) - kill(b)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then for each expression, 
we want to find the blocks where this expression is anticipated but not available at the beginning.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;earliest(b) = anticipated.in(b) - available.in(b)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;earliest(b)&lt;&#x2F;code&gt; intuitively indicates expressions that must be evaluated before this block.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;pass-3-postponable-expressions&quot;&gt;Pass 3: Postponable Expressions&lt;&#x2F;h4&gt;
&lt;p&gt;This step aims to achieve lifetime optimality, that is, 
delaying the evaluation of expressions as long as possible.&lt;&#x2F;p&gt;
&lt;p&gt;An expression is &lt;em&gt;postponable&lt;&#x2F;em&gt; at a point if, on every path that arrives at this point, 
this expression was in the set &lt;code&gt;earliest&lt;&#x2F;code&gt; but has not been used.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;direction: forward

merge function: intersection
postponable.in(entry) = empty set
postponable.out(b) = (postponable.in(b) union earliest(b)) - use(b)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now we can compute the points at which certain expressions must be evaluated.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;latest(b) = 
    (earliest(b) union postponable.in(b)) 
        intersect 
    (kill(b) union 
        not(intersection over (earliest(b&amp;#39;) union postponable.in(b&amp;#39;)) for b&amp;#39; in successors of b))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;latest(b)&lt;&#x2F;code&gt; intuitively indicates expressions that can be placed in &lt;code&gt;b&lt;&#x2F;code&gt; but cannot be placed in some of its successors.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;pass-4-expressions-used-in-the-future&quot;&gt;Pass 4: Expressions Used in the Future&lt;&#x2F;h4&gt;
&lt;p&gt;Although we can already place evaluations in &lt;code&gt;latest(b)&lt;&#x2F;code&gt; for each block &lt;code&gt;b&lt;&#x2F;code&gt;, 
this step tries to solve this problem:
when an expression will only be used once after this evaluation, 
there is no need to place this evaluation.&lt;&#x2F;p&gt;
&lt;p&gt;An expression is &lt;em&gt;used&lt;&#x2F;em&gt; at a point if it will be used along some path from this point.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;direction: backward

merge function: union
used.out(exit) = empty set
used.in(b) = (used.out(b) union use(b)) - latest(b)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this pass, for each block, we find all expressions that are used in the future, either in this block (&lt;code&gt;use(b)&lt;&#x2F;code&gt;) or on any future path (&lt;code&gt;used.out(b)&lt;&#x2F;code&gt;), but whose first evaluation is not in this block (not in &lt;code&gt;latest(b)&lt;&#x2F;code&gt;); that is, expressions that are evaluated at least once earlier.&lt;&#x2F;p&gt;
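&lt;p&gt;As an illustration of how such a pass runs under the worklist algorithm, here is a minimal Python sketch of the Pass 4 equations. The tool itself is written in Java, so this is not its code; the function name, the block names, and the hand-picked &lt;code&gt;use&lt;&#x2F;code&gt; and &lt;code&gt;latest&lt;&#x2F;code&gt; sets are assumptions made for the example.&lt;&#x2F;p&gt;

```python
def used_expressions(blocks, succs, use, latest):
    """Backward worklist solver for used.in and used.out, merging with union."""
    preds = {b: [] for b in blocks}
    for b in blocks:
        for s in succs[b]:
            preds[s].append(b)
    used_in = {b: frozenset() for b in blocks}
    used_out = {b: frozenset() for b in blocks}
    pending = list(blocks)
    while pending:
        b = pending.pop()
        # used.out(b) is the union over successors; the exit block has
        # none, so it keeps the empty set.
        out = frozenset().union(*(used_in[s] for s in succs[b]))
        new_in = (out | use[b]) - latest[b]
        used_out[b] = out
        if new_in != used_in[b]:
            used_in[b] = new_in
            # A changed used.in(b) can change used.out of the predecessors.
            pending.extend(preds[b])
    return used_in, used_out


e = "add a a"
blocks = ["entry", "newbranch", "here", "end"]
succs = {"entry": ["newbranch", "here"], "newbranch": ["end"], "here": ["end"], "end": []}
# Hand-picked inputs for the example, not computed by the earlier passes.
use = {"entry": frozenset(), "newbranch": frozenset(), "here": frozenset([e]), "end": frozenset([e])}
latest = {"entry": frozenset(), "newbranch": frozenset([e]), "here": frozenset([e]), "end": frozenset()}

used_in, used_out = used_expressions(blocks, succs, use, latest)
insert_at = sorted(b for b in blocks if e in used_out[b] and e in latest[b])
```

&lt;p&gt;With these inputs, &lt;code&gt;insert_at&lt;&#x2F;code&gt; contains &lt;code&gt;here&lt;&#x2F;code&gt; and &lt;code&gt;newbranch&lt;&#x2F;code&gt;, the two blocks in which the expression is both latest and still used afterwards.&lt;&#x2F;p&gt;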
&lt;p&gt;Finally, we insert evaluations of expressions in both &lt;code&gt;used.out(b)&lt;&#x2F;code&gt; and &lt;code&gt;latest(b)&lt;&#x2F;code&gt;, 
that means expressions that we must put here and will be evaluated again later 
(notice we are not inserting evaluations for expressions used only once), 
and replace the later usages of these expressions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Neroysq&#x2F;BrilPRE&quot;&gt;This tool&lt;&#x2F;a&gt; is implemented in Java. 
I leveraged some code from my last project, such as the parsing of Bril JSON files.&lt;&#x2F;p&gt;
&lt;p&gt;There are some tricks applied when implementing this algorithm:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;When building the control flow graph, 
the implementation treats each instruction as one block. 
So during all the passes, we can ensure that each block can only contain one instruction.
This decision makes analyzing easier.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When inserting new variables and labels, we must make sure the new names are unique.
The way I do it is to find the smallest &lt;code&gt;n&lt;&#x2F;code&gt; where the string &lt;code&gt;n&lt;&#x2F;code&gt;*&amp;quot;_&amp;quot; is not a prefix of any
existing variables in the code. 
Then I can create any name and put this prefix on to get rid of any naming conflict.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When storing a value operation, this tool first normalizes the expression
to try to make arguments sorted. If the order of the arguments needs to be reversed, 
this tool also reverses the operation type where necessary (e.g., &lt;code&gt;le&lt;&#x2F;code&gt; becomes &lt;code&gt;ge&lt;&#x2F;code&gt;, while &lt;code&gt;add&lt;&#x2F;code&gt; stays &lt;code&gt;add&lt;&#x2F;code&gt;).&lt;br &#x2F;&gt;
Then we can store expressions as strings in later steps.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
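&lt;p&gt;The second and third tricks can be sketched in Python as follows. The tool is implemented in Java, so this is only an illustration: &lt;code&gt;fresh_prefix&lt;&#x2F;code&gt; and &lt;code&gt;normalize&lt;&#x2F;code&gt; are hypothetical names, and the sets of commutative and flippable operations are assumptions about which Bril operations can safely have their arguments reordered.&lt;&#x2F;p&gt;

```python
def fresh_prefix(names):
    """Shortest run of underscores that no existing name starts with."""
    n = 1
    while any(name.startswith("_" * n) for name in names):
        n += 1
    return "_" * n


# Assumed operation tables, for illustration only.
COMMUTATIVE = {"add", "mul", "eq", "and", "or"}
FLIPPED = {"lt": "gt", "gt": "lt", "le": "ge", "ge": "le"}


def normalize(op, args):
    """Return a string key with the arguments sorted when that is safe."""
    if len(args) == 2 and args[0] > args[1]:
        if op in COMMUTATIVE:
            args = [args[1], args[0]]
        elif op in FLIPPED:
            # Swapping the operands of a comparison flips its direction.
            op, args = FLIPPED[op], [args[1], args[0]]
    return " ".join([op] + list(args))


prefix = fresh_prefix(["a", "_tmp", "b"])
keys = [normalize("add", ["b", "a"]), normalize("le", ["b", "a"]), normalize("sub", ["b", "a"])]
```

&lt;p&gt;Operations such as &lt;code&gt;sub&lt;&#x2F;code&gt; are left untouched because reordering their arguments would change the result.&lt;&#x2F;p&gt;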
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;I manually designed a couple of test cases (in folder &lt;code&gt;pre_test&lt;&#x2F;code&gt;) and also pulled test cases in
repo &lt;code&gt;bril&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To test the correctness, I compare the results run by &lt;code&gt;brili&lt;&#x2F;code&gt; before and after PRE optimization. 
They match perfectly.&lt;&#x2F;p&gt;
&lt;p&gt;I also want to evaluate how well PRE performs. I consider three measurements: 
lines of code, instructions executed, and computations executed.&lt;&#x2F;p&gt;
&lt;p&gt;To measure lines of code, I wrote a script in Python to count lines of instructions in source code.&lt;&#x2F;p&gt;
&lt;p&gt;To count instructions executed, 
I hacked the reference interpreter &lt;code&gt;brili&lt;&#x2F;code&gt; to count instructions and show it at the end when executing.
However, I found this number not representative: 
Even when PRE gets rid of redundant evaluations, it doesn&#x27;t decrease the number of instructions executed,
because it only replaces original evaluations with &lt;code&gt;id&lt;&#x2F;code&gt; operations instead of removing them.&lt;&#x2F;p&gt;
&lt;p&gt;Therefore, I only count those computationally significant operations (all value operations except &lt;code&gt;id&lt;&#x2F;code&gt;), 
plus expensive control-flow operations (&lt;code&gt;br&lt;&#x2F;code&gt; and &lt;code&gt;jmp&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;testcase&lt;&#x2F;th&gt;&lt;th&gt;LoC before&lt;&#x2F;th&gt;&lt;th&gt;LoC after&lt;&#x2F;th&gt;&lt;th&gt;diff&lt;&#x2F;th&gt;&lt;th&gt;#instr before&lt;&#x2F;th&gt;&lt;th&gt;#instr after&lt;&#x2F;th&gt;&lt;th&gt;diff&lt;&#x2F;th&gt;&lt;th&gt;#comp instr before&lt;&#x2F;th&gt;&lt;th&gt;#comp instr after&lt;&#x2F;th&gt;&lt;th&gt;diff&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;dom_test&#x2F;loopcond.json&lt;&#x2F;td&gt;&lt;td&gt;22&lt;&#x2F;td&gt;&lt;td&gt;22&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;117&lt;&#x2F;td&gt;&lt;td&gt;117&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;82&lt;&#x2F;td&gt;&lt;td&gt;82&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;skipped.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;double-pass.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;reassign-dkp.json&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;na&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;combo.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;double.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;simple.json&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;diamond.json&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;tdce_test&#x2F;reassign.json&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;na&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;redundant.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;16.7%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;16.7%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;-33.3%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;idchain.json&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;na&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;nonlocal.json&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;12.5%&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;14.3%&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-25.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;idchain-prop.json&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;na&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;idchain-nonlocal.json&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;commute.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;16.7%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;16.7%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;-33.3%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;clobber.json&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;10.0%&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;10.0%&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-40.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;redundant-dce.json&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;16.7%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;16.7%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;-33.3%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;clobber-fold.json&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;10.0%&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;10.0%&lt;&#x2F;td&gt;&lt;td&gt;5&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;-40.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;lvn_test&#x2F;reassign.json&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;na&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;complex_loop.json&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;7.1%&lt;&#x2F;td&gt;&lt;td&gt;4009&lt;&#x2F;td&gt;&lt;td&gt;4010&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;4004&lt;&#x2F;td&gt;&lt;td&gt;3005&lt;&#x2F;td&gt;&lt;td&gt;-25.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;complex_loop2_unsafe.json&lt;&#x2F;td&gt;&lt;td&gt;18&lt;&#x2F;td&gt;&lt;td&gt;18&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;7852&lt;&#x2F;td&gt;&lt;td&gt;7852&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6867&lt;&#x2F;td&gt;&lt;td&gt;6867&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;complex_loop2.json&lt;&#x2F;td&gt;&lt;td&gt;19&lt;&#x2F;td&gt;&lt;td&gt;21&lt;&#x2F;td&gt;&lt;td&gt;10.5%&lt;&#x2F;td&gt;&lt;td&gt;6007&lt;&#x2F;td&gt;&lt;td&gt;6508&lt;&#x2F;td&gt;&lt;td&gt;8.3%&lt;&#x2F;td&gt;&lt;td&gt;5501&lt;&#x2F;td&gt;&lt;td&gt;5001&lt;&#x2F;td&gt;&lt;td&gt;-9.1%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;register_presure.json&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;td&gt;14.3%&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;10&lt;&#x2F;td&gt;&lt;td&gt;11.1%&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;4&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;simple_loop.json&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;44.4%&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;td&gt;8&lt;&#x2F;td&gt;&lt;td&gt;14.3%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;-33.3%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;logic.json&lt;&#x2F;td&gt;&lt;td&gt;21&lt;&#x2F;td&gt;&lt;td&gt;21&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;6&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;2&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;print.json&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;0&lt;&#x2F;td&gt;&lt;td&gt;na&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;add.json&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;loop_invariant.json&lt;&#x2F;td&gt;&lt;td&gt;11&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;td&gt;9.1%&lt;&#x2F;td&gt;&lt;td&gt;400005&lt;&#x2F;td&gt;&lt;td&gt;400006&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;400000&lt;&#x2F;td&gt;&lt;td&gt;300001&lt;&#x2F;td&gt;&lt;td&gt;-25.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;fibonacci.json&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;td&gt;17&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;648&lt;&#x2F;td&gt;&lt;td&gt;648&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;403&lt;&#x2F;td&gt;&lt;td&gt;403&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;complex_loop3.json&lt;&#x2F;td&gt;&lt;td&gt;24&lt;&#x2F;td&gt;&lt;td&gt;26&lt;&#x2F;td&gt;&lt;td&gt;8.3%&lt;&#x2F;td&gt;&lt;td&gt;66004&lt;&#x2F;td&gt;&lt;td&gt;66006&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;63999&lt;&#x2F;td&gt;&lt;td&gt;53001&lt;&#x2F;td&gt;&lt;td&gt;-17.2%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;gcd.json&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;141&lt;&#x2F;td&gt;&lt;td&gt;141&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;92&lt;&#x2F;td&gt;&lt;td&gt;92&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;pre_test&#x2F;factorial.json&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;14&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;100008&lt;&#x2F;td&gt;&lt;td&gt;100008&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;100003&lt;&#x2F;td&gt;&lt;td&gt;100003&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;df_test&#x2F;cond.json&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;15&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;.&#x2F;test&#x2F;df_test&#x2F;fact.json&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;13&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;62&lt;&#x2F;td&gt;&lt;td&gt;62&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;td&gt;42&lt;&#x2F;td&gt;&lt;td&gt;0.0%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The table above shows all the results.
For most test cases the improvement is not significant,
partly because most of the programs are short and contain no loops.
For every program with loops that I designed by hand, however,
the tool detects the redundant computations, generates correct new code,
and provides a significant performance improvement.&lt;&#x2F;p&gt;
&lt;p&gt;PRE also significantly increases the length of the code.&lt;&#x2F;p&gt;
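&lt;p&gt;For readers checking the table, the percentage columns are consistent with a plain relative change between each pair of counts. The sketch below is only an assumption about how they were derived: it takes each percentage to be (after - before) &#x2F; before, rounded to one decimal place, with &quot;na&quot; standing for a zero baseline. The helper name &lt;code&gt;percent_change&lt;&#x2F;code&gt; is hypothetical.&lt;&#x2F;p&gt;

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, formatted like the table."""
    if before == 0:
        return "na"  # the ratio is undefined when the baseline count is zero
    return f"{(after - before) / before * 100:.1f}%"


# Values copied from rows of the table above.
print(percent_change(9, 13))        # simple_loop.json, first pair: 44.4%
print(percent_change(4004, 3005))   # complex_loop.json, last pair: -25.0%
print(percent_change(0, 0))         # reassign.json, last pair: na
```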
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h3&gt;
&lt;p&gt;In conclusion, I successfully implemented partial redundancy elimination and tested its correctness and performance.
In the future, I hope to investigate PRE further: testing it on more practical programs,
extending it to eliminate injured partial redundancies, and exploring speculative PRE.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>SIMD Divergence Optimizations</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/divergence-optimizations/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/divergence-optimizations/</guid>
                <description>&lt;p&gt;The code used in this blog post is hosted &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pbb59&#x2F;bril&#x2F;tree&#x2F;proj2&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;problem-statement&quot;&gt;Problem Statement&lt;&#x2F;h2&gt;
&lt;p&gt;Parallel programming models must make tradeoffs between productivity and performance. Single-Program Multiple-Data (SPMD) and C with SIMD intrinsics (C+SIMD) are two common models to generate programs for vector machines that make different tradeoffs in this space. SPMD employs a coarse-grain parallelization model that allows for high productivity, but aggressively generates vector instructions when cheaper scalar instructions would suffice. Conversely, C+SIMD uses a fine-grain parallelization model that compiles to conservative scalar instructions by default, but requires additional programmer effort to manually insert vector instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Divergence optimization seeks to provide the best-case performance of C+SIMD while maintaining the productivity of SPMD. The SPMD front-end still aggressively generates vector instructions, but a middle-end pass statically identifies unnecessary vector instructions and converts them into more efficient scalar instructions. One can do this conversion when each work-item&#x2F;lane&#x2F;thread in the vector instruction does the same computation. In the &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=3314902&quot;&gt;literature&lt;&#x2F;a&gt;, divergence analysis has been shown to improve execution time by 1.5% on average for real GPU programs.&lt;&#x2F;p&gt;
&lt;p&gt;In this blog post, we describe our implementation and empirical evaluation of divergence optimizations at the middle-end of the compiler stack.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-spmd-programming-model&quot;&gt;The SPMD Programming Model&lt;&#x2F;h2&gt;
&lt;p&gt;SPMD is a highly productive parallel programming model with massive market share (e.g., CUDA, OpenCL, OpenGL). In SPMD, a programmer describes parallelization at a coarse-grain level (i.e., at the level of an entire program). This contrasts with a standard C+SIMD programming model, which requires a fine-grain specification of when parallelization is desired (i.e., at the single-instruction level). One can think of the SPMD model as a higher level of abstraction than the SIMD model: on certain processors, SPMD will be compiled down to the SIMD model.&lt;&#x2F;p&gt;
&lt;p&gt;While the coarse-grain specification of parallelization provides productivity, it also aggressively generates vector instructions. An obvious SPMD compilation procedure would generate a vector instruction for every instruction in the SPMD program. However, not every instruction needs to be parallelized. There are often values unknown at compile time, but constant across each parallel work-item. For example, memory indices and loop indices might be constant across lanes. In these cases, scalar instructions are optimal. A scalar instruction performs one operation for a group of parallel lanes and can later broadcast the single value to future vector instructions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;target-architecture&quot;&gt;Target Architecture&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;hardware&quot;&gt;Hardware&lt;&#x2F;h3&gt;
&lt;p&gt;We target a simplified version of the GPU architecture described in &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=3314902&quot;&gt;IGC&lt;&#x2F;a&gt;. Each core contains a single ALU with a vector length of four as well as scalar, vector, and predicate register files. A SIMD instruction uses operands from the vector register file and runs on all lanes of the ALU. Predicate registers can be used to mask off certain lanes of a vector instruction when there is control flow. A scalar instruction uses the scalar register file and only a single lane of the ALU. A scalar instruction and an equivalent vector instruction complete in the same amount of time, but the scalar instruction consumes less energy: less dynamic energy is required (1) to read a scalar register than a vector one and (2) to use a single lane of the ALU rather than every lane. We therefore save energy any time we can replace a vector instruction with an equivalent scalar one.&lt;&#x2F;p&gt;
&lt;p&gt;Although we target a specific architecture, every SIMD (Intel Integrated GPUs) and SIMT (NVIDIA and AMD Discrete GPUs) architecture can benefit from divergence optimizations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;isa&quot;&gt;ISA&lt;&#x2F;h3&gt;
&lt;p&gt;Predication was added to Bril to target the aforementioned hardware, building upon the vector instructions added in our Project 1. A new instruction, &lt;code&gt;vcmp&lt;&#x2F;code&gt;, writes to a predicate register. A predicate register or its complement can optionally be specified before a vector instruction to mask off certain lanes of the ALU. Finally, two vector registers can be merged using the mask from a predicate register with a &lt;code&gt;vphi&lt;&#x2F;code&gt; instruction. An alternative implementation could remove the merge instruction and write to the same register directly at different indices. However, the former approach works better with the SSA format.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# initialize vectors
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;va: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  vb: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# generate the predicate
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p0: pred &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vcmp va vb;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# ignore lanes based on predicate mask
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(p0)  vc0: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vadd va vb;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# ignore lanes based on complement of predicate mask
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(!p0) vc1: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vsub va vb;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# merge lanes back together
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vc: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vphi p0 vc0 vc1;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each vector instruction supported by the interpreter is enumerated along with a description below.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Vector Instruction&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;vadd&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] + b[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vsub&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] - b[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vmul&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] * b[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vdiv&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] &#x2F; b[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;s2v&lt;&#x2F;td&gt;&lt;td&gt;c = (v0,v1,v2,v3)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;s2vb&lt;&#x2F;td&gt;&lt;td&gt;c = (v0,v0,v0,v0)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;v2s&lt;&#x2F;td&gt;&lt;td&gt;c = v[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;gather&lt;&#x2F;td&gt;&lt;td&gt;c[0,1,2,3] = mem[a0,a1,a2,a3]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;scatter&lt;&#x2F;td&gt;&lt;td&gt;mem[a0,a1,a2,a3] = a[0,1,2,3]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vload&lt;&#x2F;td&gt;&lt;td&gt;c[i] = mem[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vstore&lt;&#x2F;td&gt;&lt;td&gt;mem[i] = a[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vcmp&lt;&#x2F;td&gt;&lt;td&gt;pred = a[i] == b[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vphi&lt;&#x2F;td&gt;&lt;td&gt;c[i] = pred ? a[i] : b[i]&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
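&lt;p&gt;To make the lane-wise semantics in the table concrete, here is a minimal Python sketch of a few of the instructions. It is not the interpreter from the linked repository: vector and predicate registers are modelled as plain Python lists of length four (an assumption), and a masked-off lane is modelled as computed and then discarded by &lt;code&gt;vphi&lt;&#x2F;code&gt;, whereas the hardware would skip it.&lt;&#x2F;p&gt;

```python
VLEN = 4  # vector length of the target ALU


def s2vb(v0):
    # Broadcast one scalar into every lane.
    return [v0] * VLEN


def vadd(a, b):
    return [x + y for x, y in zip(a, b)]


def vsub(a, b):
    return [x - y for x, y in zip(a, b)]


def vcmp(a, b):
    # One predicate bit per lane: True where the two lanes are equal.
    return [x == y for x, y in zip(a, b)]


def vphi(pred, a, b):
    # Take lane i from `a` where the predicate is set, otherwise from `b`.
    return [x if p else y for p, x, y in zip(pred, a, b)]


va = [1, 2, 3, 4]
vb = [1, 5, 3, 7]
p0 = vcmp(va, vb)        # [True, False, True, False]
vc0 = vadd(va, vb)       # [2, 7, 6, 11]
vc1 = vsub(va, vb)       # [0, -3, 0, -3]
vc = vphi(p0, vc0, vc1)  # [2, -3, 6, -3]
print(vc)
```

&lt;p&gt;The last four lines mirror the predicated Bril example above: lanes where &lt;code&gt;p0&lt;&#x2F;code&gt; holds keep the &lt;code&gt;vadd&lt;&#x2F;code&gt; result and the remaining lanes keep the &lt;code&gt;vsub&lt;&#x2F;code&gt; result.&lt;&#x2F;p&gt;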
&lt;h2 id=&quot;divergence-analysis&quot;&gt;Divergence Analysis&lt;&#x2F;h2&gt;
&lt;p&gt;Divergence analysis statically determines whether a vector instruction has redundant lanes of computation. In the following code, if &lt;code&gt;vec0&lt;&#x2F;code&gt; and &lt;code&gt;vec1&lt;&#x2F;code&gt; are vectors with the same value in each index, then the vector add will do the &lt;strong&gt;exact&lt;&#x2F;strong&gt; same work in each lane of the ALU. It would be much more efficient to do a single scalar (single-lane) &lt;code&gt;add&lt;&#x2F;code&gt; instruction instead.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# initialize vectors
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec0: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(v0, v0, v0, v0);
  vec1: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(v1, v1, v1, v1);

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# add vectors
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec2: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vadd vec0 vec1;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An instruction is assumed to be convergent (not divergent) by default. We traverse the dataflow graph forwards and mark an instruction as divergent if any of the following conditions is met. Our algorithm is based on the descriptions in &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=3314902&quot;&gt;these&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;ieeexplore.ieee.org&#x2F;document&#x2F;6113840&quot;&gt;papers&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Condition&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Instruction is &lt;code&gt;s2v&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;A different scalar value is loaded into each index of a vector register&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Instruction is &lt;code&gt;vload&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Unknown values are loaded into each element of a vector register&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Any data dependency is divergent&lt;&#x2F;td&gt;&lt;td&gt;An incoming edge in the dataflow graph is already divergent due to one of the previous conditions&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;It&#x27;s possible that at runtime some vectors marked as divergent turn out to be convergent. For example, a &lt;code&gt;vload&lt;&#x2F;code&gt; may load contiguous elements that all hold the same value. However, there is no way to optimize for this case statically.&lt;&#x2F;p&gt;
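&lt;p&gt;The marking rules in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code from the repository: instructions are dictionaries with &lt;code&gt;op&lt;&#x2F;code&gt;, &lt;code&gt;dest&lt;&#x2F;code&gt;, and &lt;code&gt;args&lt;&#x2F;code&gt; keys (a Bril-like encoding chosen here for illustration), the program is straight-line so a single pass in program order visits every definition before its uses, and the name &lt;code&gt;find_divergent&lt;&#x2F;code&gt; is hypothetical. A program with loops would need the pass repeated until the set stops growing.&lt;&#x2F;p&gt;

```python
def find_divergent(instrs):
    """Return the set of destination names that must be treated as divergent."""
    divergent = set()
    for instr in instrs:
        # Conditions 1 and 2: the opcode itself introduces divergence.
        seeds = instr["op"] in ("s2v", "vload")
        # Condition 3: some data dependency is already divergent.
        inherits = any(arg in divergent for arg in instr.get("args", []))
        dest = instr.get("dest")
        if dest is not None and (seeds or inherits):
            divergent.add(dest)
    return divergent


program = [
    {"op": "s2vb", "dest": "vec0", "args": ["v0"]},
    {"op": "s2vb", "dest": "vec1", "args": ["v1"]},
    {"op": "s2v", "dest": "vec2", "args": ["v0", "v1", "v2", "v3"]},
    {"op": "vadd", "dest": "vec3", "args": ["vec0", "vec1"]},
    {"op": "vadd", "dest": "vec4", "args": ["vec0", "vec2"]},
]
print(sorted(find_divergent(program)))  # ['vec2', 'vec4']
```

&lt;p&gt;Here &lt;code&gt;vec3&lt;&#x2F;code&gt; stays convergent because both of its inputs are broadcasts, while &lt;code&gt;vec4&lt;&#x2F;code&gt; becomes divergent through its dependency on &lt;code&gt;vec2&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;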
&lt;h2 id=&quot;divergence-optimizations&quot;&gt;Divergence Optimizations&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;instruction-swapping&quot;&gt;Instruction Swapping&lt;&#x2F;h3&gt;
&lt;p&gt;Once we know which instructions are divergent and which are not, we can optimize the code on an instruction-by-instruction basis. In the previous Bril example, every vector instruction is convergent. Therefore we can swap each vector instruction with a more energy-efficient scalar instruction. The optimization is shown below. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# initialize vectors -&amp;gt; initialize scalars
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec0_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v0;
  vec1_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;id &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v1;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# add vectors -&amp;gt; add scalars
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec2_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;add vec0_s vec1_s;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We implement a &amp;quot;swap table&amp;quot; that matches a vector instruction with a functionally equivalent scalar instruction. An alternate design would be to annotate each original scalar instruction with a vector length and just change the vector length instead of doing a swap. Our swap table is given below along with a description of each instruction reproduced from above.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Vector Instruction&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;th&gt;Scalar Instruction&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;vadd&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] + b[i]&lt;&#x2F;td&gt;&lt;td&gt;add&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vsub&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] - b[i]&lt;&#x2F;td&gt;&lt;td&gt;sub&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vmul&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] * b[i]&lt;&#x2F;td&gt;&lt;td&gt;mul&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vdiv&lt;&#x2F;td&gt;&lt;td&gt;c[i] = a[i] &#x2F; b[i]&lt;&#x2F;td&gt;&lt;td&gt;div&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;s2v&lt;&#x2F;td&gt;&lt;td&gt;c = (v0,v1,v2,v3)&lt;&#x2F;td&gt;&lt;td&gt;id&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;s2vb&lt;&#x2F;td&gt;&lt;td&gt;c = (v0,v0,v0,v0)&lt;&#x2F;td&gt;&lt;td&gt;id&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;v2s&lt;&#x2F;td&gt;&lt;td&gt;c = v[i]&lt;&#x2F;td&gt;&lt;td&gt;id&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;gather&lt;&#x2F;td&gt;&lt;td&gt;c[0,1,2,3] = mem[a0,a1,a2,a3]&lt;&#x2F;td&gt;&lt;td&gt;lw&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;scatter&lt;&#x2F;td&gt;&lt;td&gt;mem[a0,a1,a2,a3] = a[0,1,2,3]&lt;&#x2F;td&gt;&lt;td&gt;sw&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vload&lt;&#x2F;td&gt;&lt;td&gt;c[i] = mem[i]&lt;&#x2F;td&gt;&lt;td&gt;Can&#x27;t optimize&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vstore&lt;&#x2F;td&gt;&lt;td&gt;mem[i] = a[i]&lt;&#x2F;td&gt;&lt;td&gt;Can&#x27;t optimize&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vcmp&lt;&#x2F;td&gt;&lt;td&gt;pred = a[i] == b[i]&lt;&#x2F;td&gt;&lt;td&gt;Can&#x27;t optimize&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;vphi&lt;&#x2F;td&gt;&lt;td&gt;c[i] = pred ? a[i] : b[i]&lt;&#x2F;td&gt;&lt;td&gt;Can&#x27;t optimize&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Notably, we can&#x27;t optimize across &lt;code&gt;vload&lt;&#x2F;code&gt; and &lt;code&gt;vstore&lt;&#x2F;code&gt; instructions because different memory addresses are always accessed. However, in certain cases a &lt;code&gt;gather&lt;&#x2F;code&gt; and &lt;code&gt;scatter&lt;&#x2F;code&gt; can access the exact same memory location if the address vector is the same for each. In this case, the access will be redundant, which will waste memory energy and potentially execution time. Even though we perform the &lt;code&gt;scatter&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;gather&lt;&#x2F;code&gt; optimization in the compiler, it is likely that the hardware implementation would also detect this case and avoid the redundant accesses.&lt;&#x2F;p&gt;
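&lt;p&gt;A swap pass driven by such a table might look like the following Python sketch. It is a simplified illustration, not the pass from the repository: the dictionary encoding of instructions, the function name &lt;code&gt;swap_convergent&lt;&#x2F;code&gt;, and the precomputed &lt;code&gt;divergent&lt;&#x2F;code&gt; set passed in as an argument are assumptions, and only the table rows whose arguments map one-to-one onto the scalar instruction are included. The &lt;code&gt;_s&lt;&#x2F;code&gt; suffix follows the naming in the example above.&lt;&#x2F;p&gt;

```python
# Subset of the swap table: rows whose arguments carry over unchanged.
SWAP_TABLE = {
    "vadd": "add",
    "vsub": "sub",
    "vmul": "mul",
    "vdiv": "div",
    "s2vb": "id",
}


def swap_convergent(instrs, divergent):
    """Replace convergent vector instructions with their scalar equivalents."""
    renamed = {}  # original vector name -> new scalar name
    out = []
    for instr in instrs:
        # Point arguments at the scalar copies created earlier in the pass.
        args = [renamed.get(arg, arg) for arg in instr.get("args", [])]
        dest = instr.get("dest")
        if instr["op"] in SWAP_TABLE and dest is not None and dest not in divergent:
            scalar_dest = dest + "_s"
            renamed[dest] = scalar_dest
            out.append({"op": SWAP_TABLE[instr["op"]], "dest": scalar_dest,
                        "type": "int", "args": args})
        else:
            # Divergent or unsupported instruction: keep the vector form.
            out.append({**instr, "args": args})
    return out


program = [
    {"op": "s2vb", "dest": "vec0", "type": "vector", "args": ["v0"]},
    {"op": "s2vb", "dest": "vec1", "type": "vector", "args": ["v1"]},
    {"op": "vadd", "dest": "vec2", "type": "vector", "args": ["vec0", "vec1"]},
]
for instr in swap_convergent(program, divergent=set()):
    print(instr["dest"], "=", instr["op"], *instr["args"])
```

&lt;p&gt;On the all-convergent example this prints &lt;code&gt;vec0_s = id v0&lt;&#x2F;code&gt;, &lt;code&gt;vec1_s = id v1&lt;&#x2F;code&gt;, and &lt;code&gt;vec2_s = add vec0_s vec1_s&lt;&#x2F;code&gt;. When a divergent instruction is kept in vector form, its arguments may now name scalar registers, so the output of this sketch is not yet type-correct on its own.&lt;&#x2F;p&gt;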
&lt;h3 id=&quot;vector-regeneration&quot;&gt;Vector Regeneration&lt;&#x2F;h3&gt;
&lt;p&gt;An instruction swap could create a register type mismatch between the result of the optimized instruction and future vector instructions that use that result. For this reason, a second pass is added to the optimization algorithm. After the scalar instructions have been created, we traverse each instruction in program order and detect when a vector argument points to a scalar register. Upon detection, an &lt;code&gt;s2vb&lt;&#x2F;code&gt; instruction is generated to effectively cast a scalar value to a vector value. The faulting instruction argument is then updated to the new vector value produced by this instruction.&lt;&#x2F;p&gt;
&lt;p&gt;The benefit of replacing a vector instruction with a scalar one outweighs the cost of the additional &lt;code&gt;s2vb&lt;&#x2F;code&gt; instruction. For example, a vector instruction with length four consumes three more ALU ops than a scalar instruction, while an additional &lt;code&gt;s2vb&lt;&#x2F;code&gt; consumes only a single ALU op (we assume single-op broadcast). The net benefit is then two ALU ops worth of energy savings.&lt;&#x2F;p&gt;
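&lt;p&gt;The regeneration pass and the energy argument can be sketched as follows. As before, this is an illustration under assumptions, not the repository code: instructions are dictionaries, a result typed &lt;code&gt;int&lt;&#x2F;code&gt; is taken to live in a scalar register, the &lt;code&gt;_v&lt;&#x2F;code&gt; suffix and the function name &lt;code&gt;regenerate_vectors&lt;&#x2F;code&gt; are hypothetical, and only the four arithmetic vector instructions are checked. Reusing one broadcast per scalar is a simplification that is only safe in straight-line code where the scalar is not redefined.&lt;&#x2F;p&gt;

```python
VECTOR_OPS = {"vadd", "vsub", "vmul", "vdiv"}  # ops that expect vector arguments


def regenerate_vectors(instrs):
    """Insert s2vb broadcasts where a vector op reads a scalar register."""
    scalars = set()   # names known to live in scalar registers
    broadcast = {}    # scalar name -> name of its broadcast vector copy
    out = []
    for instr in instrs:
        args = list(instr.get("args", []))
        if instr["op"] in VECTOR_OPS:
            for i, arg in enumerate(args):
                if arg not in scalars:
                    continue
                if arg not in broadcast:
                    broadcast[arg] = arg + "_v"
                    out.append({"op": "s2vb", "dest": broadcast[arg],
                                "type": "vector", "args": [arg]})
                args[i] = broadcast[arg]  # patch the faulting argument
        out.append({**instr, "args": args})
        if instr.get("type") == "int" and "dest" in instr:
            scalars.add(instr["dest"])
    return out


swapped = [
    {"op": "add", "dest": "vec2_s", "type": "int", "args": ["vec0_s", "vec1_s"]},
    {"op": "vadd", "dest": "vec4", "type": "vector", "args": ["vec2_s", "vec3"]},
]
for instr in regenerate_vectors(swapped):
    print(instr["dest"], "=", instr["op"], *instr["args"])

# Energy bookkeeping from the paragraph above, in units of ALU ops.
VLEN = 4
net_saving = (VLEN - 1) - 1  # lanes saved by the swap, minus one s2vb
print(net_saving)  # 2
```

&lt;p&gt;The loop prints &lt;code&gt;vec2_s = add vec0_s vec1_s&lt;&#x2F;code&gt;, then the inserted &lt;code&gt;vec2_s_v = s2vb vec2_s&lt;&#x2F;code&gt;, then &lt;code&gt;vec4 = vadd vec2_s_v vec3&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;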
&lt;h3 id=&quot;predication-removal&quot;&gt;Predication Removal&lt;&#x2F;h3&gt;
&lt;p&gt;Predicated vector instructions can also be simplified, even when the predicate value is divergent, because every lane that is active in the vector instruction may still perform redundant work. Consider the Bril example below. The predicate &lt;code&gt;p0&lt;&#x2F;code&gt; is divergent because its inputs &lt;code&gt;vec2&lt;&#x2F;code&gt; and &lt;code&gt;vec3&lt;&#x2F;code&gt; are divergent. However, the predicated vector instructions that produce &lt;code&gt;vec4&lt;&#x2F;code&gt; and &lt;code&gt;vec5&lt;&#x2F;code&gt; are convergent because their inputs are convergent.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# convergent vectors
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec0: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  vec1: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# divergent vectors
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec2: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  vec3: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# predicate p0 is divergent
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p0: pred &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vcmp vec2 vec3;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# however both predicated computations are convergent
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(p0) vec4: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vadd vec0 vec0;
  (!p0) vec5: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vadd vec1 vec1;
  vec6: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vphi p0 vec4 vec5;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Thus, the code inside the predicate can be optimized, and the predicate can be removed because there are no longer lanes to mask out. The values still need to be merged afterwards according to the predicate to produce a result vector.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# convergent vectors -&amp;gt; scalars
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec0_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  vec1_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# divergent vectors
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec2: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  vec3: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# predicate p0 is divergent
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;p0: pred &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vcmp vec2 vec3;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# convergent predicated code -&amp;gt; scalar instructions
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec4_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;add vec0_s vec0_s;
  vec5_s: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;add vec1_s vec1_s;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;# need to regenerate vectors to do the merge
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vec4_s_v: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;s2vb vec4_s;
  vec5_s_v: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;s2vb vec5_s;
  vec6: vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vphi p0 vec4_s_v vec5_s_v;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;We test the correctness of the optimizations using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cucapra&#x2F;turnt&quot;&gt;Turnt&lt;&#x2F;a&gt;. Turnt verifies both the code produced by the optimizations and the functionality (using the output of &lt;code&gt;print&lt;&#x2F;code&gt; instructions in the Bril code). We design six tests to check correctness. The tests are enumerated in the table below.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Test&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;th&gt;Expected Optimization&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;vvadd&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;vload&lt;&#x2F;code&gt; followed by a &lt;code&gt;vadd&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;None, due to &lt;code&gt;vload&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Unique scalars&lt;&#x2F;td&gt;&lt;td&gt;Unique values written to vector (&lt;code&gt;s2v&lt;&#x2F;code&gt;), then &lt;code&gt;vadd&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;None, due to unique values&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Redundant scalars&lt;&#x2F;td&gt;&lt;td&gt;Redundant values written to vector, then &lt;code&gt;vadd&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;All instructions should be scalar&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Partially divergent&lt;&#x2F;td&gt;&lt;td&gt;Both convergent and divergent instructions&lt;&#x2F;td&gt;&lt;td&gt;Optimize convergent instructions and add &lt;code&gt;s2v&lt;&#x2F;code&gt; as needed before divergent instructions&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Predication - Unique&lt;&#x2F;td&gt;&lt;td&gt;Divergent predicated vector instructions&lt;&#x2F;td&gt;&lt;td&gt;None, all divergent&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Predication - Redundant&lt;&#x2F;td&gt;&lt;td&gt;Convergent predicated vector instructions&lt;&#x2F;td&gt;&lt;td&gt;Optimize convergent instructions with divergent predication and remove predication when possible.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;metric&quot;&gt;Metric&lt;&#x2F;h4&gt;
&lt;p&gt;Our evaluation metric on the imaginary hardware is the number of ALU ops required by the program. Generally, each vector instruction consumes four ALU ops and each scalar instruction consumes a single ALU op. In this model, a scalar instruction is exactly four times as energy efficient as a redundant vector instruction. We argue that this is a good proxy for energy consumption as long as only the dynamic energy of the ALU is considered and other parts of the processor, such as memory accesses and the on-chip network, are ignored.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;&#x2F;h4&gt;
&lt;p&gt;We evaluate the effectiveness of the divergence optimizations on synthetic benchmarks. We take benchmark inspiration from the examples in &lt;a href=&quot;https:&#x2F;&#x2F;ieeexplore.ieee.org&#x2F;document&#x2F;6494995&quot;&gt;these&lt;&#x2F;a&gt; &lt;a href=&quot;https:&#x2F;&#x2F;hal.inria.fr&#x2F;hal-00909072v2&#x2F;document&quot;&gt;papers&lt;&#x2F;a&gt;. All benchmarks have a 2D loop nest. We vectorize over each outer loop and unroll each inner loop because we do not support most control flow. We manually unroll the inner loop twice for each benchmark, which lets us estimate the dynamic ALU ops without actually running the program. The table below shows the number of ALU ops for the baseline and optimized version of each benchmark; the improvement is the relative reduction, (baseline - optimized) &#x2F; baseline, rounded to the nearest percent.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;Description&lt;&#x2F;th&gt;&lt;th&gt;Baseline Ops&lt;&#x2F;th&gt;&lt;th&gt;Optimized Ops&lt;&#x2F;th&gt;&lt;th&gt;Improvement (%)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;FIR&lt;&#x2F;td&gt;&lt;td&gt;2D FIR filter&lt;&#x2F;td&gt;&lt;td&gt;60&lt;&#x2F;td&gt;&lt;td&gt;53&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;FIR-pred&lt;&#x2F;td&gt;&lt;td&gt;2D FIR filter with single conditional&lt;&#x2F;td&gt;&lt;td&gt;87&lt;&#x2F;td&gt;&lt;td&gt;81&lt;&#x2F;td&gt;&lt;td&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Synthetic&lt;&#x2F;td&gt;&lt;td&gt;Sum of (a[outer] * b[inner]) &#x2F; c[inner]&lt;&#x2F;td&gt;&lt;td&gt;53&lt;&#x2F;td&gt;&lt;td&gt;39&lt;&#x2F;td&gt;&lt;td&gt;26&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The optimization does reduce the number of ALU ops for the listed benchmarks. It&#x27;s hard to say exactly what a fair baseline would be because we don&#x27;t know what an SPMD front-end would actually emit. For example, loads that aren&#x27;t contiguous use &lt;code&gt;gather&lt;&#x2F;code&gt; in our baseline. In the case where each address is the same (i.e., indexed by the inner loop iterator), the &lt;code&gt;gather&lt;&#x2F;code&gt; can be turned into a &lt;code&gt;lw&lt;&#x2F;code&gt;. It&#x27;s possible that this is obvious enough for an SPMD front-end to do automatically. We don&#x27;t have an SPMD-to-Bril compiler, so we can&#x27;t truly quantify a realistic improvement.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;shortcomings&quot;&gt;Shortcomings&lt;&#x2F;h2&gt;
&lt;p&gt;These are things that we didn&#x27;t do but that would have improved the implementation and empirical results.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ssa&quot;&gt;SSA&lt;&#x2F;h3&gt;
&lt;p&gt;The code must be in SSA form to perform divergence analysis on programs with arbitrary control flow. A control dependence can be converted into a data dependence with a &lt;code&gt;phi&lt;&#x2F;code&gt; instruction. These data dependencies fit naturally into the dataflow algorithm used in divergence analysis.&lt;&#x2F;p&gt;
&lt;p&gt;We were not able to implement transformations to and from SSA although we did successfully implement a &lt;code&gt;phi&lt;&#x2F;code&gt; instruction. To work around this limitation, we manually wrote our tests and benchmarks in an SSA-like form.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;synthetic-benchmarks&quot;&gt;Synthetic Benchmarks&lt;&#x2F;h3&gt;
&lt;p&gt;Our benchmark selection was weak for two reasons. First, as described above, we had to write code in SSA form by hand, which made programming in Bril painstaking. Second, we could not use a high-level language to create Bril programs: the current TypeScript front-end supports neither vector instructions nor the SPMD model.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;We implemented divergence analysis and optimizations based on that analysis. We focused on swapping expensive vector instructions for cheaper scalar instructions when the vector instruction did redundant work. We quantify our optimization by comparing the number of ALU ops executed by the un-optimized baseline and optimized version. Our results show an overall reduction in ALU ops.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Global Value Numbering</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/global-value-numbering/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/global-value-numbering/</guid>
                <description>&lt;h3 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h3&gt;
&lt;p&gt;If you follow &lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;paul_pearce&#x2F;status&#x2F;1056684865846861824&quot;&gt;PL Twitter&lt;&#x2F;a&gt;™, you may have seen this tweet go by from our very own &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;%7Easampson&#x2F;&quot;&gt;Adrian&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;img src=&quot;twitter.jpg&quot; width=&quot;500&quot;&#x2F;&gt;
&lt;p&gt;What, you may ask, is value numbering, and how is it so mind-blowing?&lt;&#x2F;p&gt;
&lt;p&gt;The basic premise of &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Value_numbering&quot;&gt;value numbering&lt;&#x2F;a&gt; is that we can make our code more efficient by making a distinction between &lt;em&gt;values&lt;&#x2F;em&gt; (that is, computations) and the &lt;em&gt;variables&lt;&#x2F;em&gt; the programmer happened to store each within.
By focusing on the values themselves during compilation, we can keep the readability of the original code while unlocking convenient optimization opportunities like removing duplicate computations.
In particular, if we assign a value number to every “unique” piece of computation, we can save on work by only actually running that piece once.
At every subsequent use, we can then refer to the already-computed value.&lt;&#x2F;p&gt;
&lt;p&gt;For example, a value numbering pass (followed by dead code elimination) can change the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;sum1 : int = a + b;
sum2 : int = b + a;
mul : int = sum1 * sum2;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To the more efficient:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;sum1 : int = a + b;
mul : int = sum1 * sum1;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A compiler is especially well placed to do this type of optimization because having a programmer apply it by hand can come at the cost of readability.
While a programmer may not want to explicitly memoize every intermediate value across their computations, implementing value numbering allows us to reduce redundancy without much overhead.&lt;&#x2F;p&gt;
&lt;p&gt;The mind-blowing aspect of value numbering is that while the basic idea seems straightforward, with some simple extensions it can accomplish a wide range of additional optimizations.
For example, we extend our value numbering to do both &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Copy_propagation&quot;&gt;copy propagation&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Constant_folding&quot;&gt;constant folding&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-global-value-numbering&quot;&gt;Why &lt;em&gt;global&lt;&#x2F;em&gt; value numbering?&lt;&#x2F;h2&gt;
&lt;p&gt;The simplest form of value numbering is &lt;em&gt;local&lt;&#x2F;em&gt; value numbering, which works within each basic block.
However, a local approach misses optimizing code like the following, where redundancy is found across basic block boundaries:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main(x : int, y : int) {
entry:
  z1 : int = add x y;
  jmp block;
block:
  z2 : int = add x y;
  ret;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;README.md&quot;&gt;Bril&lt;&#x2F;a&gt;&#x27;s ecosystem already had a local value numbering implementation that was, as expected, unable to optimize the code above.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;static-single-assignment-ssa-form&quot;&gt;Static Single Assignment (SSA) Form&lt;&#x2F;h2&gt;
&lt;p&gt;The primary difference between how you go about actually implementing &lt;em&gt;local&lt;&#x2F;em&gt; value numbering vs. &lt;em&gt;global&lt;&#x2F;em&gt; value numbering is the form of the input source code.
For local value numbering, the analysis can take in code in any standard imperative assignment format.
Because local value numbering only considers one basic block at a time, the compiler can easily determine the relationship between assignments to the same variable (that is, if the source writes to a variable &lt;code&gt;x&lt;&#x2F;code&gt; twice, we can just process the code in the block in order to know which assignment is relevant).
This clean assumption breaks when considering multiple basic blocks.
If &lt;code&gt;x&lt;&#x2F;code&gt; is assigned to in two different predecessors of the block we are currently processing, looking up the value number for any assignment involving &lt;code&gt;x&lt;&#x2F;code&gt; on the right-hand side becomes impossible!&lt;&#x2F;p&gt;
&lt;p&gt;Global value numbering sidesteps this difficulty by requiring that the input source code first be transformed to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;&lt;em&gt;static single assignment (SSA)&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; form.
In SSA, every variable name can only be assigned to once.
Reassignments to the same variable in the original code are translated to assignments to new variable names.
Because reassignments often take place in different control flow branches (to actually have the branches be useful!), SSA form needs a way to recombine different names back into a final variable.
SSA form relies on a special additional instruction, canonically called a &lt;code&gt;phi&lt;&#x2F;code&gt; node (named to be evocative of &lt;code&gt;if&lt;&#x2F;code&gt;, backward), to combine variables from diverging control flow back into a single variable.
A &lt;code&gt;phi&lt;&#x2F;code&gt; instruction takes as arguments a list of renamed variables and assigns one of them to a new variable based on which control flow path was actually taken.&lt;&#x2F;p&gt;
&lt;p&gt;For example, consider the following Bril code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;start:
  br cond left right;
left:
  x : int = const 1;
  jmp exit;
right:
  x : int = const 2;
  jmp exit;
exit:
  print x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To convert these basic blocks to SSA form, we rename variables and insert a &lt;code&gt;phi&lt;&#x2F;code&gt; instruction where the control flow merges again in the last block.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;start:
  br cond1 left right;
left:
  x1 : int = const 1;
  jmp exit;
right:
  x2 : int = const 2;
  jmp exit;
exit:
  x3 : int = phi x1 x2;
  print x3;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Having unique variable assignments is enormously helpful in implementing optimizations: most industrial-strength compilers, including &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;tutorial&#x2F;OCamlLangImpl7.html&quot;&gt;LLVM&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;onlinedocs&#x2F;gccint&#x2F;SSA.html&quot;&gt;GCC&lt;&#x2F;a&gt;, rely on it internally. We followed a standard algorithm to convert Bril programs to SSA form.&lt;&#x2F;p&gt;
&lt;p&gt;The first pass inserts &lt;code&gt;phi&lt;&#x2F;code&gt; nodes wherever control flow merges multiple variable definitions.
We determine this condition by computing the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dominator_(graph_theory)&quot;&gt;&lt;em&gt;dominance frontier&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; (essentially, boundaries in the control flow graph where control flow can recombine) for every basic block.
For every variable, and for every block in the dominance frontier of a block that assigns to it, we insert a &lt;code&gt;phi&lt;&#x2F;code&gt; node with as many arguments as there are predecessors to the block.
One bug we hit in this part of the implementation taught us that we should only insert &lt;code&gt;phi&lt;&#x2F;code&gt; nodes at the dominance frontier for variables that are assigned to multiple times.
For convenience of later analysis, we also augmented the base algorithm to track which basic block (or source) each &lt;code&gt;phi&lt;&#x2F;code&gt; argument originated from (for the above example, our generated &lt;code&gt;phi&lt;&#x2F;code&gt; would be &lt;code&gt;x3 : int = phi x1 x2 left right;&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;The second pass is a recursive renaming algorithm that uses a stack to determine the correct version of a variable name for every point in the program.
The trickiest part of implementing this algorithm was traversing the basic blocks in the correct order.&lt;&#x2F;p&gt;
&lt;p&gt;The high level algorithm specifies that you should recursively traverse the children of each block in the dominator tree.
However, we found that not only did we need to traverse the children, but we needed to traverse them &lt;em&gt;in order&lt;&#x2F;em&gt; relative to the control flow graph.
Bril’s ecosystem already had a utility for getting the dominators of every block, so to get the dominator tree, we calculated the transitive reduction of this relation.
Our implementation additionally orders every list of children in the dominator tree according to a reverse postorder traversal of the control flow graph.&lt;&#x2F;p&gt;
&lt;p&gt;In later testing (using an example, &lt;code&gt;test&#x2F;gvn&#x2F;constant-folding-tau-example.bril&lt;&#x2F;code&gt;, ported from the &lt;a href=&quot;http:&#x2F;&#x2F;www.cs.tau.ac.il&#x2F;%7Emsagiv&#x2F;courses&#x2F;pa07&#x2F;lecture2-notes-update.pdf&quot;&gt;class notes of Prof. Mooly Sagiv of Tel Aviv University&lt;&#x2F;a&gt;), we found one minor bug with our SSA implementation.
Bril programs do not have a type checker and are aggressively dynamic: it’s perfectly valid to assign to a variable in one branch of a conditional but not the other.
The base SSA algorithm does not account for this, so we fail to properly rename in the second step of our algorithm.
For now, we work around this case by assigning a default value to the variable before we branch.&lt;&#x2F;p&gt;
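&lt;p&gt;As a rough illustration of that workaround, consider a hypothetical fragment (not taken from our test suite) in which &lt;code&gt;x&lt;&#x2F;code&gt; is otherwise assigned on only one side of a branch. The default assignment in &lt;code&gt;start&lt;&#x2F;code&gt; and its value of 0 are assumptions made for this sketch:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;start:
  # hypothetical default, added by hand before the branch
  x : int = const 0;
  br cond then exit;
then:
  x : int = const 1;
  jmp exit;
exit:
  print x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With the default in place, every path into &lt;code&gt;exit&lt;&#x2F;code&gt; carries a definition of &lt;code&gt;x&lt;&#x2F;code&gt;, so we would expect the renaming pass to produce something like the following, using the label-annotated &lt;code&gt;phi&lt;&#x2F;code&gt; format described earlier:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;start:
  x1 : int = const 0;
  br cond then exit;
then:
  x2 : int = const 1;
  jmp exit;
exit:
  x3 : int = phi x1 x2 start then;
  print x3;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;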
&lt;h2 id=&quot;global-value-numbering&quot;&gt;Global value numbering&lt;&#x2F;h2&gt;
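&lt;p&gt;There are two main approaches to global value numbering, as described in &lt;a href=&quot;http:&#x2F;&#x2F;citeseerx.ist.psu.edu&#x2F;viewdoc&#x2F;download?doi=10.1.1.36.8877&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;&amp;quot;Value Numbering&amp;quot;&lt;&#x2F;a&gt; by Briggs, Cooper, and Simpson (1997), both of which ultimately require input code in SSA form.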
Hash-based techniques, like those used in local value numbering, hash operations and find redundant values by comparing to previously hashed operations.
Partitioning algorithms, on the other hand, divide all operations into equivalence classes.&lt;&#x2F;p&gt;
&lt;p&gt;We chose to implement hash-based value numbering because it allows for copy propagation extensions and operates on the dominator tree computed during conversion to SSA.
While converting to SSA and performing global value numbering can be performed together in one pass over the original code due to their similarity, we chose to implement them separately for modularity.
A distinct SSA conversion allows more downstream optimizations to operate on the program&#x27;s SSA form in addition to allowing testing of both passes independently.&lt;&#x2F;p&gt;
&lt;p&gt;We implemented the following pseudocode, adapted from Figure 4 of the aforementioned paper. The paper calls the algorithm DVNT because it is a &amp;quot;dominator-based value numbering technique.&amp;quot;
A value&#x27;s name is the SSA variable name (which is guaranteed to be unique by SSA) for which it was first computed.
The algorithm requires a program&#x27;s dominator tree and traverses blocks in reverse postorder.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;DVNT_GVN(block b):
  for each phi node in b:
    remove and continue if meaningless or redundant
    set the value number for the remaining phi node to be the assigned variable name
    add phi node to the hash table

  for each assignment:
    get value numbers for each operand
    simplify the expression if possible
    if the expression has been computed before:
      set the value number for the assigned variable to the expression&amp;#39;s value number
    else:
      set the value number for the expression to be the assigned variable name
      add the expression to the hash table

  for each successor s of b in the control flow graph:
    replace all phi node operands in s that were computed in this block with their value numbers

  for each child c of b in the dominator tree:
    DVNT_GVN(c)

  remove all values hashed during this function call
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;copy-propagation&quot;&gt;Copy propagation&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Copy_propagation&quot;&gt;Copy propagation&lt;&#x2F;a&gt; entails identifying chains of copies (in Bril, &lt;code&gt;id&lt;&#x2F;code&gt; instructions) that can be replaced by the original value. For example, the following code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main(a : int) {
entry:
  b : int = id a;
  c : int = id b;
  d : int = id c;
  print d;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Can be safely simplified to:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main(a : int) {
entry:
  print a;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We extend our global value numbering implementation to include copy propagation via special treatment of &lt;code&gt;id&lt;&#x2F;code&gt; instructions.
In particular, for &lt;code&gt;id&lt;&#x2F;code&gt; instructions, we look up the value number for the argument value directly.
Because our programs are in SSA, finding this value number is as simple as grabbing the argument itself.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;constant-folding&quot;&gt;Constant folding&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Constant_folding&quot;&gt;Constant folding&lt;&#x2F;a&gt; entails performing simple arithmetic operations on known constants at compile time rather than runtime.
For example, the following Bril:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main() {
entry:
  a : int = const 1;
  b : int = const 2;
  c : int = add a b;
  print c;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Can be replaced with:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main() {
entry:
  c : int = const 3;
  print c;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We extend our global value numbering to perform constant folding by keeping a mapping from variable names defined as constants to their integer or boolean literal values.
When we encounter an arithmetic operation, we check whether all operands are constant, and fold (that is, perform the operation and drop in the result) if so.
The argument constants can then often be removed entirely with a later dead code elimination pass.&lt;&#x2F;p&gt;
&lt;p&gt;When implementing constant folding, we &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;issues&#x2F;40&quot;&gt;found a bug&lt;&#x2F;a&gt; in Bril&#x27;s existing local value numbering implementation.
When a Bril program contained a division by zero, the local constant folding failed with an exception.
We made the design judgment that compiler passes should not fail on misbehaving code, but instead generate code with the same behavior (for example, the program&#x27;s dynamic error could exist on a branch that never actually executes).
We modified this behavior in both the central Bril local value numbering and our new global value numbering to instead bail out on constant folding when encountering a potential exception in the folding step.&lt;&#x2F;p&gt;
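&lt;p&gt;As a small hypothetical example of the kind of program this affects (the variable names are ours, chosen for illustration), both operands of the &lt;code&gt;div&lt;&#x2F;code&gt; below are known constants, but folding it at compile time would mean dividing by zero inside the compiler:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main() {
entry:
  a : int = const 4;
  zero : int = const 0;
  # not folded: evaluating 4 divided by 0 at compile time would raise an exception
  q : int = div a zero;
  print q;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With the bail-out, we would expect the pass to leave the &lt;code&gt;div&lt;&#x2F;code&gt; instruction untouched, so any error still surfaces at run time, and only if the instruction actually executes.&lt;&#x2F;p&gt;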
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;The following example, from &lt;a href=&quot;http:&#x2F;&#x2F;citeseerx.ist.psu.edu&#x2F;viewdoc&#x2F;download?doi=10.1.1.36.8877&amp;amp;rep=rep1&amp;amp;type=pdf&quot;&gt;Value Numbering&lt;&#x2F;a&gt;, by Briggs, Cooper, and Simpson, illustrates how GVN removes redundant instructions across basic blocks:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main(a0 : int, b0 : int, c0 : int, d0 : int, e0 : int, f0 : int) {
B1:
  u0 : int = add a0 b0;
  v0 : int = add c0 d0;
  w0 : int = add e0 f0;
  cond : bool = const true;
  br cond B2 B3;
B2:
  x0 : int = add c0 d0;
  y0 : int = add c0 d0;
  jmp B4;
B3:
  u1 : int = add a0 b0;
  x1 : int = add e0 f0;
  y1 : int = add e0 f0;
  jmp B4;
B4:
  u2 : int = phi u0 u1 B2 B3;
  x2 : int = phi x0 x1 B2 B3;
  y2 : int = phi y0 y1 B2 B3;
  z0 : int = add u2 y2;
  u3 : int = add a0 b0;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our implementation removes the redundant additions in blocks &lt;code&gt;B2&lt;&#x2F;code&gt; and &lt;code&gt;B3&lt;&#x2F;code&gt;.
It also removes from block &lt;code&gt;B4&lt;&#x2F;code&gt;: 1) the meaningless phi node &lt;code&gt;u2 : int = phi u0 u1 B2 B3;&lt;&#x2F;code&gt;, which has arguments that are always equal, and 2) the redundant phi node &lt;code&gt;y2 : int = phi y0 y1 B2 B3;&lt;&#x2F;code&gt;, which is identical to the phi node that precedes it.
Here is the full output, which matches the correct output in the paper:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;void main(a0 : int, b0 : int, c0 : int, d0 : int, e0 : int, f0 : int) {
B1:
  u0: int = add a0 b0;
  v0: int = add c0 d0;
  w0: int = add e0 f0;
  cond: bool = const true;
  br cond B2 B3;
B2:
  jmp B4;
B3:
  jmp B4;
B4:
  x2: int = phi v0 w0 B2 B3;
  z0: int = add u0 x2;
  ret ;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;trying-to-get-fewer-instructions&quot;&gt;Trying to get fewer instructions&lt;&#x2F;h3&gt;
&lt;p&gt;We treat the number of instructions as a proxy for code performance: the fewer instructions, the better.
For each of a number of Bril test programs, we report in the following graph:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The number of instructions after converting to SSA, converting out of SSA, and running Bril&#x27;s existing &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;examples&#x2F;tdce.py&quot;&gt;trivial dead code elimination&lt;&#x2F;a&gt; (TDCE), shown in blue.&lt;&#x2F;li&gt;
&lt;li&gt;The number of instructions after converting to SSA, running GVN, converting out of SSA, and running TDCE, shown in red.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;img src=&quot;eval-correctness.png&quot; width=&quot;500&quot;&#x2F;&gt;
&lt;h3 id=&quot;bigger-benchmarks-with-typescript-frontend-and-comparison-with-lvn&quot;&gt;Bigger benchmarks with TypeScript frontend and comparison with LVN&lt;&#x2F;h3&gt;
&lt;p&gt;For a slightly more realistic analysis, we ran GVN on the more complex programs afforded by Bril&#x27;s TypeScript frontend. In particular, none of these programs were written with intentional redundancy like some of our correctness test cases (though the conversion from TypeScript to Bril does cause some), so they portray a more realistic evaluation of the benefits of value numbering.
In addition to the existing test programs from project 1, we also implemented a quadratic formula calculation (&lt;code&gt;test&#x2F;gvn&#x2F;quadratic.ts.bril&lt;&#x2F;code&gt;), &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fizz_buzz&quot;&gt;fizz buzz&lt;&#x2F;a&gt; (&lt;code&gt;test&#x2F;gvn&#x2F;fizz-buzz.ts.bril&lt;&#x2F;code&gt;), and a naive &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sieve_of_Eratosthenes&quot;&gt;Sieve of Eratosthenes&lt;&#x2F;a&gt; without arrays (&lt;code&gt;test&#x2F;gvn&#x2F;check-primes.ts.bril&lt;&#x2F;code&gt;).
All TypeScript programs were first processed by the &lt;code&gt;ts2bril&lt;&#x2F;code&gt; frontend.
We encountered an implementation difficulty: the code generated by the TypeScript frontend sometimes inserts &lt;code&gt;jmp&lt;&#x2F;code&gt; instructions after returns in the same basic block.
We had to manually remove these in order for SSA conversion and dominator tree creation to work.&lt;&#x2F;p&gt;
&lt;p&gt;In addition to the metrics in the preceding graph, we also report the number of instructions after converting to SSA, running Bril&#x27;s existing &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;examples&#x2F;lvn.py&quot;&gt;local value numbering&lt;&#x2F;a&gt;, converting out of SSA, and running TDCE, shown in green.&lt;&#x2F;p&gt;
&lt;img src=&quot;eval-ts.png&quot; width=&quot;800&quot;&#x2F;&gt;
&lt;p&gt;If you, like us, are stoked about the power of value numbering, you can check out the impressive heavy lifting undertaken by &lt;a href=&quot;https:&#x2F;&#x2F;lists.llvm.org&#x2F;pipermail&#x2F;llvm-dev&#x2F;2016-November&#x2F;107110.html&quot;&gt;&lt;code&gt;NewGVN&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; in &lt;a href=&quot;http:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;NewGVN_8cpp.html&quot;&gt;LLVM&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Induction Variable Optimizations</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/ive/</guid>
                <description>&lt;h1 id=&quot;induction-variables&quot;&gt;Induction Variables&lt;&#x2F;h1&gt;
&lt;p&gt;Loops are well known targets for optimization since they execute repeatedly
and significant execution time is spent in loop bodies.
The class of loop optimizations which we&#x27;re considering in this post
are centered on special variables called &lt;em&gt;induction variables&lt;&#x2F;em&gt; (IVs).
An induction variable is any variable whose value can be represented as a function of:
loop invariants; the number of loop iterations that have executed; and other induction variables.&lt;&#x2F;p&gt;
&lt;p&gt;Generally speaking, most induction variable optimizations are limited to
induction variables that are &lt;em&gt;linear functions&lt;&#x2F;em&gt; of their inputs.
For Bril, that means induction variables are computed only using
the &lt;code&gt;mul&lt;&#x2F;code&gt;, &lt;code&gt;add&lt;&#x2F;code&gt; and &lt;a href=&quot;..&#x2F;manually-managed-memory&quot;&gt;&lt;code&gt;ptradd&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; instructions.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;optimization-overview&quot;&gt;Optimization Overview&lt;&#x2F;h1&gt;
&lt;p&gt;There are a large number of induction variable optimizations
which all have slightly different goals. Here, we&#x27;re going
to give a brief overview on some of the optimizations we
implemented and what they&#x27;re meant to achieve.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;strength-reduction&quot;&gt;Strength Reduction&lt;&#x2F;h3&gt;
&lt;p&gt;In reality (despite what many software developers like to think),
not all instructions are really created equal. Some instructions
are more expensive to execute at runtime than others. For instance,
integer addition is usually &amp;quot;cheaper&amp;quot; than integer multiplication.
Induction variable strength reduction lets us &amp;quot;reduce&amp;quot; multiplication
operations on IVs to addition operations.&lt;&#x2F;p&gt;
&lt;p&gt;Take this simple program as an example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i;
}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;j&lt;&#x2F;code&gt; is an induction variable derived by applying a multiplication
to another IV, &lt;code&gt;i&lt;&#x2F;code&gt;. This makes it a perfect candidate for strength
reduction. Each iteration we set &lt;code&gt;j&lt;&#x2F;code&gt; to a brand new value
computed with that multiplication. Instead, every iteration we can increment &lt;code&gt;j&lt;&#x2F;code&gt;
by two times whatever we increment &lt;code&gt;i&lt;&#x2F;code&gt; by.&lt;&#x2F;p&gt;
&lt;p&gt;This optimization is usually implemented by introducing a new variable
that holds the &lt;code&gt;2*i&lt;&#x2F;code&gt; value for each iteration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;2*i when i == 0
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; s;
  s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; s &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;+2 since i gets incremented by 1 each iteration
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
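&lt;p&gt;As a quick sanity check of this rewrite, here is a minimal Python sketch (Python is used only for illustration, and the function names are hypothetical) that runs the multiplying loop and the strength-reduced loop side by side so their results can be compared.&lt;&#x2F;p&gt;

```python
def original(n):
    # j is recomputed with a multiplication on every iteration.
    j = 0
    for i in range(n):
        j = 2 * i
    return j


def strength_reduced(n):
    # s tracks the value of 2*i and is advanced with an addition instead.
    j = 0
    s = 0  # 2*i when i == 0
    for i in range(n):
        j = s
        s = s + 2  # i advances by 1, so 2*i advances by 2
    return j


# Both loops agree for every trip count we try, including zero iterations.
print(all(original(n) == strength_reduced(n) for n in range(200)))
```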
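&lt;p&gt;After &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Copy_propagation&quot;&gt;some other common compiler optimizations&lt;&#x2F;a&gt;,
we can get this simpler version. (As written, it returns 200 rather than the original&#x27;s &lt;code&gt;2*99 = 198&lt;&#x2F;code&gt;, because &lt;code&gt;j&lt;&#x2F;code&gt; now runs one increment ahead; starting &lt;code&gt;j&lt;&#x2F;code&gt; at &lt;code&gt;-2&lt;&#x2F;code&gt; would restore the exact value.)&lt;&#x2F;p&gt;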
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; j;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It&#x27;s important to note that &lt;code&gt;j&lt;&#x2F;code&gt; no longer has a direct dependence on &lt;code&gt;i&lt;&#x2F;code&gt;
since there are no instructions which read from &lt;code&gt;i&lt;&#x2F;code&gt; and write to &lt;code&gt;j&lt;&#x2F;code&gt;.
Strength reduction often helps remove data dependencies, paving
the way for other IV optimizations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;induction-variable-elimination&quot;&gt;Induction Variable Elimination&lt;&#x2F;h3&gt;
&lt;p&gt;In many programs, IVs can be redundant.
For instance, a common programming idiom is to introduce
a variable only to use as a loop guard (such as &lt;code&gt;i&lt;&#x2F;code&gt; in the following program).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
    result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this example, we can eliminate the &lt;code&gt;i&lt;&#x2F;code&gt; variable
by replacing its uses with another basic induction variable &lt;code&gt;result&lt;&#x2F;code&gt; to get:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(; result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; result&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This obviously removes extraneous code by combining the &amp;quot;loop counting&amp;quot;
part of the loop with the actual work that it&#x27;s doing.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;implementing-induction-variable-optimizations&quot;&gt;Implementing Induction Variable Optimizations&lt;&#x2F;h1&gt;
&lt;p&gt;It turns out that IV analyses require a large number 
of other static analyses before even thinking about optimization.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;finding-loops&quot;&gt;Finding Loops&lt;&#x2F;h3&gt;
&lt;p&gt;For instance, IV optimizations are all loop optimizations, which
means we need to identify loops. Natural loops are denoted by sets
of basic blocks that all have a common entry point &lt;em&gt;and&lt;&#x2F;em&gt; a &amp;quot;backedge&amp;quot;
in the control flow graph. This backedge corresponds to a branch or
jump in the CFG that goes back to the beginning of the loop.
Finding loops requires finding backedges, which it turns out
requires calculating dominators. A backedge is defined as
any edge in the control flow graph where the source vertex
&lt;em&gt;is dominated by&lt;&#x2F;em&gt; the sink. Therefore to even start thinking about
optimizing we need to calculate the dominators and do a basic
reachability analysis. See the pictures below for an example CFG
with backedge annotations.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;cfg.png&quot; style=&quot;width:50%&quot;&#x2F;&gt;&lt;img src=&quot;dom.png&quot; style=&quot;width:50%&quot;&#x2F;&gt;
On the left is the control flow graph, with its only backedge
drawn as a dashed line. The picture on the right shows all of the
dominators; each red line can be read as &amp;quot;is dominated by.&amp;quot; As you can see,
the only edge in the CFG which is the reverse of an edge in the dominator graph
is the backedge from &lt;code&gt;body&lt;&#x2F;code&gt; to &lt;code&gt;loop&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
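&lt;p&gt;The backedge definition above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual implementation: it uses the simple iterative dominator computation, and the CFG below is a hypothetical one whose block names &lt;code&gt;entry&lt;&#x2F;code&gt; and &lt;code&gt;end&lt;&#x2F;code&gt; are assumptions chosen to resemble the figure.&lt;&#x2F;p&gt;

```python
def dominators(cfg, entry):
    # cfg maps every block name to the list of its successor names.
    blocks = set(cfg)
    preds = {b: set() for b in blocks}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)
    # Start from "every block dominates every block" and shrink to a fixed point.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            incoming = [dom[p] for p in preds[b]]
            new = set.intersection(*incoming) if incoming else set()
            new = new | {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


def back_edges(cfg, entry):
    dom = dominators(cfg, entry)
    # An edge from src to dst is a backedge when dst dominates src.
    return [(src, dst) for src, succs in cfg.items() for dst in succs if dst in dom[src]]


# Hypothetical CFG: entry, loop header, loop body, exit.
cfg = {"entry": ["loop"], "loop": ["body", "end"], "body": ["loop"], "end": []}
print(back_edges(cfg, "entry"))  # prints [('body', 'loop')]
```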
&lt;p&gt;There are some other subtleties here with nested loops or two loops which happen
to have the same entry block. In these cases, we combine these overlapping loops
into a single loop. Otherwise we could incorrectly identify or re-write
IVs by looking at incomplete information.
This approximation of loop structure prevents our analysis from finding some
optimization opportunities but preserves correctness.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;identifying-induction-variables&quot;&gt;Identifying Induction Variables&lt;&#x2F;h3&gt;
&lt;p&gt;Once we find loops, then we need to figure out which variables exactly &lt;em&gt;are&lt;&#x2F;em&gt;
induction variables. We divide IVs into two categories: &lt;em&gt;basic&lt;&#x2F;em&gt; induction variables
and &lt;em&gt;derived&lt;&#x2F;em&gt; induction variables. The most common examples of IVs are the
loop variables that are only used for loop tests (say &lt;code&gt;i&lt;&#x2F;code&gt; in the following code):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  A[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, basic IVs are more generally defined:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A basic induction variable, X, is a variable whose only
updates within the loop are of the form X = X + &lt;em&gt;c&lt;&#x2F;em&gt;, where
&lt;em&gt;c&lt;&#x2F;em&gt; is loop-invariant.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In Bril, &lt;em&gt;c&lt;&#x2F;em&gt; is always a variable (as opposed to an inlined constant) so we need to do some sort
of analysis to determine if instruction operands are loop-invariant.
We use a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Reaching_definition&quot;&gt;reaching definition&lt;&#x2F;a&gt;
analysis to find such variables. We consider any variable to be loop-invariant
if: 1) all of its definitions which reach the loop entrance originate outside
the loop; or 2) it has only one reaching definition which is a &lt;code&gt;const&lt;&#x2F;code&gt; expression.&lt;&#x2F;p&gt;
&lt;p&gt;In our implementation we only identify a subset of basic IVs, specifically those
that are updated precisely once inside the loop. We did this for simplicity,
since it greatly reduces the complexity of future IV optimizations.
An elegant way to deal with this complexity would be to run IV optimizations on
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;SSA&lt;&#x2F;a&gt; code,
since all variables have only one definition.&lt;&#x2F;p&gt;
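&lt;p&gt;To make the &quot;updated precisely once&quot; restriction concrete, here is a rough Python sketch over Bril-style JSON instructions. It is not our implementation: the function name is hypothetical, and loop invariance is approximated by &quot;never assigned inside the loop&quot; instead of a real reaching-definitions analysis, so it misses the &lt;code&gt;const&lt;&#x2F;code&gt; case described above.&lt;&#x2F;p&gt;

```python
def find_basic_ivs(loop_instrs):
    # loop_instrs: Bril-style instruction dicts for every block of one loop.
    defs = {}
    for instr in loop_instrs:
        if "dest" in instr:
            defs.setdefault(instr["dest"], []).append(instr)

    def is_invariant(var):
        # Simplified stand-in for the reaching-definitions check:
        # a variable never assigned inside the loop cannot change within it.
        return var not in defs

    basic = {}
    for var, var_defs in defs.items():
        if len(var_defs) != 1:
            continue  # only variables updated exactly once are considered
        instr = var_defs[0]
        args = instr.get("args", [])
        if instr.get("op") != "add" or len(args) != 2 or var not in args:
            continue  # not of the form X = X + c
        step = args[1] if args[0] == var else args[0]
        if step != var and is_invariant(step):
            basic[var] = step
    return basic


# Hypothetical loop body; "one" and "two" are assumed to be defined before the loop.
loop = [
    {"op": "add", "dest": "result", "type": "int", "args": ["result", "two"]},
    {"op": "add", "dest": "i", "type": "int", "args": ["i", "one"]},
    {"op": "mul", "dest": "j", "type": "int", "args": ["i", "two"]},
]
print(find_basic_ivs(loop))  # j is rejected: it is not of the form j = j + c
```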
&lt;p&gt;In addition to basic IVs, derived IVs are also eligible for optimization.
A derived IV is:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A variable with exactly &lt;em&gt;one&lt;&#x2F;em&gt; definition inside the loop whose value is
a linear function of loop-invariants and a basic IV.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;There are several methods for finding &lt;em&gt;derived&lt;&#x2F;em&gt; IVs, the most
general one being a dataflow analysis. We decided to implement a simpler
but probably less efficient and less complete
approach that just involved scanning all of the
definitions in the loop and collecting a set of definitions which satisfy
the above constraints.&lt;&#x2F;p&gt;
&lt;p&gt;In Bril, in particular, our algorithm can be 
&lt;em&gt;very&lt;&#x2F;em&gt; approximate. Since each definition can only implement
one operation, there may be derived IVs which are composed of multiple
Bril definitions. For example, in Bril, &lt;code&gt;x = 3*i + 4&lt;&#x2F;code&gt; looks like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; mul i three; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;three has been defined as const 3
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add x four;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;four has been defined as const 4
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our code doesn&#x27;t consider &lt;code&gt;x&lt;&#x2F;code&gt; an induction variable because
of our very approximate heuristic: &amp;quot;&lt;code&gt;x&lt;&#x2F;code&gt; is updated twice in the
loop, so it may not be an IV.&amp;quot;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;induction-variable-representation&quot;&gt;Induction Variable Representation&lt;&#x2F;h3&gt;
&lt;p&gt;In most compilers, induction variables have a standard representation,
which we also adopt. Every induction variable is symbolically stored
as a tuple of the form &lt;code&gt;(i, a, b)&lt;&#x2F;code&gt; where &lt;code&gt;i&lt;&#x2F;code&gt; is a &lt;em&gt;base IV&lt;&#x2F;em&gt;.
You can read this as &lt;code&gt;induction variable x = ai + b&lt;&#x2F;code&gt;; a neat consequence
of this representation is that base induction variables are all of the form &lt;code&gt;(i, 1, 0)&lt;&#x2F;code&gt;
since &lt;code&gt;i = i*1 + 0&lt;&#x2F;code&gt;. In our compiler, &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt; can be the name of any loop-invariant variable.
This representation is easy to serialize into a sequence of Bril instructions.&lt;&#x2F;p&gt;
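&lt;p&gt;As an illustration of that serialization, the following Python sketch turns a tuple &lt;code&gt;(i, a, b)&lt;&#x2F;code&gt; into two Bril-style JSON instructions. The function name and the temporary variable name are hypothetical, and the sketch assumes integer-typed IVs only.&lt;&#x2F;p&gt;

```python
def materialize_iv(dest, iv, tmp="iv_tmp"):
    # iv is the tuple (i, a, b), read as dest = a*i + b.
    # a and b are names of loop-invariant variables, i is the base IV.
    i, a, b = iv
    return [
        {"op": "mul", "dest": tmp, "type": "int", "args": [a, i]},
        {"op": "add", "dest": dest, "type": "int", "args": [tmp, b]},
    ]


# x = 3*i + 4, with "three" and "four" assumed to be defined as constants elsewhere.
for instr in materialize_iv("x", ("i", "three", "four")):
    print(instr)
```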
&lt;h3 id=&quot;liveness&quot;&gt;Liveness&lt;&#x2F;h3&gt;
&lt;p&gt;Since induction variable elimination is meant to delete unnecessary
variable assignments, we need to be sure that those induction variables
are not used outside of the loop&#x27;s scope (or ensure that we update their final
values at the end of the loop).
We use a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Live_variable_analysis&quot;&gt;liveness dataflow analysis&lt;&#x2F;a&gt;
to compute all of the &amp;quot;live-ins&amp;quot; and &amp;quot;live-outs&amp;quot; of every basic block.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, this isn&#x27;t enough for eliminating &amp;quot;useless&amp;quot; induction variables.
Consider the following Bril-esque C program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
LOOP&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:
  if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;live-ins = [result, i, max]
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;goto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; BODY;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else 
    goto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; END; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;live-outs = [result, i, max]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;BODY&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;live-ins = [result, i, max]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;  i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;goto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; LOOP; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;live-outs = [result, i, max]
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;END&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; live-ins = [result]
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; result;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even though &lt;code&gt;i&lt;&#x2F;code&gt; is used only to update itself,
a standard liveness analysis says that &lt;code&gt;i&lt;&#x2F;code&gt; must be both a live-out and a live-in
for all of the loop blocks. This prevents local dead code analyses from removing the useless update: &lt;code&gt;i = i + 1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of local liveness, we need to consider the live-outs &lt;em&gt;of the entire loop&lt;&#x2F;em&gt;.
Therefore, when considering the liveness of IVs that we&#x27;re trying to eliminate,
we don&#x27;t check the live-outs of any one basic block.
Instead, we union all of the live-ins of the
loop&#x27;s successors. If &lt;code&gt;i&lt;&#x2F;code&gt; is not in that set of variables, we know that no code
which executes after the loop will use &lt;code&gt;i&lt;&#x2F;code&gt; and we can safely delete it.&lt;&#x2F;p&gt;
&lt;p&gt;In the example above, the only successor to the loop is the &lt;code&gt;END&lt;&#x2F;code&gt; block
and therefore the only live-out of the loop is &lt;code&gt;result&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
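&lt;p&gt;A minimal Python sketch of this loop-level check follows. The function name and data layout are assumptions made for illustration; the per-block live-in sets are taken as given, as if produced by an ordinary liveness analysis.&lt;&#x2F;p&gt;

```python
def loop_live_outs(loop_blocks, successors, live_in):
    # loop_blocks: set of block names that make up the loop.
    # successors: block name to list of successor block names.
    # live_in: block name to the set of variables live on entry to that block.
    result = set()
    for block in loop_blocks:
        for succ in successors[block]:
            if succ not in loop_blocks:
                # Only blocks reached by leaving the loop contribute.
                result |= live_in[succ]
    return result


# The example above: the loop is {LOOP, BODY} and its only exit target is END.
successors = {"LOOP": ["BODY", "END"], "BODY": ["LOOP"], "END": []}
live_in = {"END": {"result"}}
outs = loop_live_outs({"LOOP", "BODY"}, successors, live_in)
print(outs)              # prints {'result'}
print("i" not in outs)   # prints True, so the update to i can be deleted
```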
&lt;h3 id=&quot;strength-reduction-implementation&quot;&gt;Strength Reduction Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;Strength reduction targets &lt;em&gt;derived&lt;&#x2F;em&gt; IVs, specifically.
Our implementation attempts to apply this optimization to
all derived IVs in the program. Since strength reduction can
increase the total dynamic instruction count (in some cases)
and code size (in all cases), you might imagine
using some heuristic to decide when to apply this optimization.&lt;&#x2F;p&gt;
&lt;p&gt;Otherwise, our implementation is very standard and follows this
algorithm to optimize &lt;em&gt;derived&lt;&#x2F;em&gt; IV &lt;code&gt;x = (i, a, b)&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Before the beginning of the loop, create a fresh variable &lt;code&gt;f&lt;&#x2F;code&gt; and
initialize it to &lt;code&gt;f = a*i + b&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Replace the one assignment to &lt;code&gt;x&lt;&#x2F;code&gt; in the loop with &lt;code&gt;x = f&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Immediately following the update to &lt;code&gt;i&lt;&#x2F;code&gt;, insert the update &lt;code&gt;f = f + a&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Our implementation is somewhat naive and inserts a number of &lt;code&gt;id&lt;&#x2F;code&gt;
and other instructions which can be eliminated by copy propagation.
Step (3) from the above algorithm is simplified since we ensure that
basic induction variables are updated only once in the loop. If we were to
allow multiple updates to &lt;code&gt;i&lt;&#x2F;code&gt; we&#x27;d need to follow the correct update to &lt;code&gt;i&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
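&lt;p&gt;To make the three steps concrete, here is a small Python sketch of a loop before and after strength reduction. It is an illustration under stated assumptions, not output from our pass: the function names are hypothetical, the basic IV starts at 0, and it steps by 1, so the update in step (3) is exactly &lt;code&gt;f = f + a&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```python
def derived_direct(n, a, b):
    """Original loop: recompute x = a*i + b with a multiply each iteration."""
    xs = []
    i = 0
    while i != n:
        x = a * i + b
        xs.append(x)
        i = i + 1
    return xs


def derived_reduced(n, a, b):
    """The same loop after strength reduction of the derived IV x = (i, a, b)."""
    xs = []
    i = 0
    f = a * i + b      # step 1: initialize the fresh variable before the loop
    while i != n:
        x = f          # step 2: the assignment to x becomes a copy
        xs.append(x)
        i = i + 1
        f = f + a      # step 3: update f immediately after the update to i
    return xs


print(derived_direct(4, 3, 5) == derived_reduced(4, 3, 5))
```

&lt;p&gt;The multiply now runs once before the loop instead of once per iteration, at the cost of one extra add and one copy inside the loop.&lt;&#x2F;p&gt;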
&lt;h3 id=&quot;basic-induction-variable-elimination&quot;&gt;Basic Induction Variable Elimination&lt;&#x2F;h3&gt;
&lt;p&gt;After running strength reduction, we attempted to eliminate all basic induction variables from the program.
We chose to run this following strength reduction since that optimization often removes dependencies on basic IVs.
The first step of IVE is to choose a derived IV to replace the basic IV. This was another opportunity for applying
heuristics to guide our optimizations; instead, we chose which derived IV to use arbitrarily.
Once we picked this IV, we iterated over all comparisons in the loop which used the basic IV as an argument
and a loop-invariant variable as the other argument.
For each of these comparisons we replaced the basic IV with the derived IV and inserted instructions
to compute the appropriate value of the other argument. Since the other argument was loop-invariant,
we lifted these instructions outside of the loop (this is very similar to step (1) of strength reduction).&lt;&#x2F;p&gt;
&lt;p&gt;For example, in this C code, if &lt;code&gt;k&lt;&#x2F;code&gt; is an IV of the form &lt;code&gt;(i,3,5)&lt;&#x2F;code&gt; and &lt;code&gt;n&lt;&#x2F;code&gt; is loop-invariant:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; n) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can replace &lt;code&gt;i&lt;&#x2F;code&gt; and &lt;code&gt;n&lt;&#x2F;code&gt; in this conditional with the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(k &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;n &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This transformation removes uses of &lt;code&gt;i&lt;&#x2F;code&gt; and can likely eliminate all uses &lt;em&gt;except&lt;&#x2F;em&gt; for the use in the write to itself (&lt;code&gt;i = i + c&lt;&#x2F;code&gt;). If this is the case, and &lt;code&gt;i&lt;&#x2F;code&gt; is not a live-out of the loop we can remove this assignment (as mentioned before, global DCE won&#x27;t normally remove this update). Our implementation does delete such dead code.
Note that, even if &lt;code&gt;i&lt;&#x2F;code&gt; is a live-out, it&#x27;s sometimes possible to push this &lt;code&gt;i = i + c&lt;&#x2F;code&gt; update to the &lt;em&gt;end&lt;&#x2F;em&gt; of the loop so that it is not part of the loop body; however, we didn&#x27;t implement this due to its complexity and questionable utility.&lt;&#x2F;p&gt;
&lt;p&gt;At this point we have successfully removed all traces of &lt;code&gt;i&lt;&#x2F;code&gt; from the loop. &lt;code&gt;i&lt;&#x2F;code&gt; might still be used to initialize some of the strength reduction variables before the loop begins. However, if &lt;code&gt;i&lt;&#x2F;code&gt; is initialized to a constant, this can probably be eliminated with constant propagation and simple dead code elimination.&lt;&#x2F;p&gt;
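&lt;p&gt;As a sanity check on the comparison rewrite shown earlier in this section, the following Python sketch confirms that testing the basic IV against &lt;code&gt;n&lt;&#x2F;code&gt; agrees with testing the derived IV &lt;code&gt;k = (i,3,5)&lt;&#x2F;code&gt; against &lt;code&gt;3*n + 5&lt;&#x2F;code&gt;. The function names are hypothetical, and the check ignores integer overflow. It also illustrates a caveat: the rewrite keeps the direction of the comparison only when the multiplier is positive; with a negative multiplier the comparison has to be flipped.&lt;&#x2F;p&gt;

```python
from operator import lt  # lt(a, b) is the "less than" comparison


def original_test(i, n):
    """The original loop test on the basic IV i."""
    return lt(i, n)


def rewritten_test(i, n, a=3, b=5):
    """The same test expressed on the derived IV k = (i, a, b)."""
    k = a * i + b        # maintained inside the loop by strength reduction
    bound = a * n + b    # loop-invariant, so it can be computed before the loop
    return lt(k, bound)


pairs = [(i, n) for i in range(-5, 6) for n in range(-5, 6)]

# Positive multiplier: the rewritten test agrees with the original.
agree = all(original_test(i, n) == rewritten_test(i, n) for i, n in pairs)

# Negative multiplier: agreement requires flipping the comparison,
# modeled here by swapping the two operands.
flipped = all(original_test(i, n) == rewritten_test(n, i, a=-3) for i, n in pairs)

print(agree, flipped)
```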
&lt;h1 id=&quot;evaluating-our-optimizations&quot;&gt;Evaluating Our Optimizations&lt;&#x2F;h1&gt;
&lt;p&gt;In order to evaluate our optimization, we modified the &lt;code&gt;brili&lt;&#x2F;code&gt; Bril interpreter to optionally output a breakdown of dynamically executed instructions by opcode. This allowed us both to quantify the effect on total dynamic instruction count and to validate the impact of strength reduction. These results are not indicative of real-world performance gains, however. In particular, under an interpreter it is unlikely that strength reduction will yield a significant (if any) wall-clock speedup. Furthermore, if the Bril that we generate were compiled using something like LLVM, different processors may have different costs for adds and multiplies, which may render strength reduction less useful. Nevertheless, these measurements are a good indication that our pass is doing what it is supposed to (reducing the number of typically expensive operations).&lt;&#x2F;p&gt;
&lt;p&gt;In order to get some measurements for our optimization, we created a test suite of several different types of programs. One type of program is a &amp;quot;sanity check&amp;quot; program, which is a small program on which we could predict how our optimizations would perform. These helped us validate the correctness of our optimizations. The other type of program is a &amp;quot;real world&amp;quot; program, which is supposed to represent a real world task in order to see what kind of performance improvements we can get on more realistic programs. Of the programs below only &lt;code&gt;fib&lt;&#x2F;code&gt; and &lt;code&gt;mat_mul_8&lt;&#x2F;code&gt; are what we would consider &amp;quot;real world&amp;quot; programs (although they are of course still small examples).&lt;&#x2F;p&gt;
&lt;p&gt;The following table breaks down dynamic instruction counts (IC) for each of the programs we tested, before (Base) and after (Opt) our optimizations:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;Program&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Loop Iterations&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Total IC Base&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Total IC Opt&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;mul Count Base&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;mul Count Opt&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;add Count Base&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;add Count Opt&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;ptradd Count Base&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;ptradd Count Opt&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;id Count Base&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;id Count Opt&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;array&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;8&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;95&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;118&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;24&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;24&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;16&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;18&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;18&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;array_mul&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;8&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;113&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;136&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;17&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;5&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;24&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;40&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;16&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;16&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;18&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;strength&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;30&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;187&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;193&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;30&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;60&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;60&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;30&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;strength_large&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1000&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;6007&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;6013&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1000&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2000&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2000&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;fib&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;48&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;642&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;700&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;4&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;194&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;98&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;146&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;150&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;0&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;144&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;mat_mul_8&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;512&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;10828&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;11076&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2048&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;541&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2632&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;2704&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1728&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1728&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;3&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;1539&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
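&lt;p&gt;The per-program overhead and the number of multiplies removed can be read off the table with a few lines of Python. The numbers below are copied from the table; the helper name &lt;code&gt;summarize&lt;&#x2F;code&gt; and the field names are only illustrative.&lt;&#x2F;p&gt;

```python
# (total IC base, total IC opt, mul base, mul opt), copied from the table above
results = {
    "array":          (95, 118, 0, 2),
    "array_mul":      (113, 136, 17, 5),
    "strength":       (187, 193, 30, 3),
    "strength_large": (6007, 6013, 1000, 3),
    "fib":            (642, 700, 0, 4),
    "mat_mul_8":      (10828, 11076, 2048, 541),
}


def summarize(results):
    """Derive instruction-count overhead and multiplies removed per program."""
    summary = {}
    for name, (ic_base, ic_opt, mul_base, mul_opt) in results.items():
        extra = ic_opt - ic_base
        summary[name] = {
            "extra_instructions": extra,
            "overhead_pct": round(100 * extra / ic_base, 1),
            "muls_removed": mul_base - mul_opt,  # negative means muls were added
        }
    return summary


for name, row in summarize(results).items():
    print(name, row)
```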
&lt;h3 id=&quot;test-descriptions&quot;&gt;Test Descriptions:&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;array: Accesses several arrays with an index variable for each array.&lt;&#x2F;li&gt;
&lt;li&gt;array_mul: The same as &lt;em&gt;array&lt;&#x2F;em&gt; but the accesses use multiplication to calculate array offsets.&lt;&#x2F;li&gt;
&lt;li&gt;strength: A simple loop that should be a good candidate for strength reduction.&lt;&#x2F;li&gt;
&lt;li&gt;strength_large: The same as &lt;em&gt;strength&lt;&#x2F;em&gt; but executing more loop iterations.&lt;&#x2F;li&gt;
&lt;li&gt;fib: Calculates the first 50 Fibonacci numbers and stores them into an array.&lt;&#x2F;li&gt;
&lt;li&gt;mat_mul_8: Multiplies two 8x8 matrices. Note that this test starts with 588 matrix initialization instructions which are common to all executions (none of the initializers are multiplies).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;evaluation-conclusions&quot;&gt;Evaluation Conclusions&lt;&#x2F;h3&gt;
&lt;p&gt;We conclude from the above results that our strength reduction optimization is very successful at replacing multiplications with additions and additions with copy instructions; however, on programs with few loop iterations, it&#x27;s unclear whether or not this optimization will be &amp;quot;worth it.&amp;quot; That said, the generation of so many &lt;code&gt;id&lt;&#x2F;code&gt; instructions (and our analysis of the outputs of the toy programs) suggests that later optimization passes would be able to eliminate many of the inefficiently generated instructions here. After executing those passes, it is likely that the total instruction count overhead would disappear.&lt;&#x2F;p&gt;
&lt;p&gt;The second half of our pass, which eliminated basic induction variables, we believe had little impact.
Unfortunately, our implementation was structured such that it is difficult to test one without the other; we only removed uses of a variable when we applied strength reduction to one of its derivatives. However, the small impact is easy to intuit and our manual inspection of the code confirms it. Removing the update to a single basic IV removes one dynamic instruction per loop iteration. While this is at least an improvement that scales with execution time, it is still minor.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation-weaknesses&quot;&gt;Evaluation Weaknesses&lt;&#x2F;h3&gt;
&lt;p&gt;Our evaluation (and implementation) has a few salient weaknesses. First, we should have evaluated all of our test programs on a number of different inputs. We neglected to do this primarily because of time and the triviality of the results. Obviously, removing instructions from the inner loop bodies reduces the occurrences of costly instructions &lt;em&gt;more&lt;&#x2F;em&gt; as the number of loop iterations increases. To demonstrate, we included the &lt;em&gt;strength_large&lt;&#x2F;em&gt; example in our suite. In this case, the additional overhead (even without copy prop or DCE) was only 6 instructions (6007 versus 6013), while the number of multiplications dropped from 1000 to 3.&lt;&#x2F;p&gt;
&lt;p&gt;Originally we set out to implement general induction variable elimination optimizations; unfortunately, strength reduction ended up being our primary success. For instance, a canonical use case for IVE is transforming:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[] A,B;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
 A[i1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B[i2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Into:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;[] A,B;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; max; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
 A[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; B[i];
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our implementation will not successfully execute this optimization (this case is essentially the &lt;code&gt;array&lt;&#x2F;code&gt; test from our test suite).
In this example &lt;code&gt;i&lt;&#x2F;code&gt;, &lt;code&gt;i1&lt;&#x2F;code&gt; and &lt;code&gt;i2&lt;&#x2F;code&gt; are all basic induction variables. Our implementation relies on replacing one basic IV with a derived IV
from its family. In this example, the optimization requires replacing one basic IV with another. We would have liked to implement this optimization given more time since it covers the most common case for induction variable elimination. Lacking this feature explains why we saw some useful optimization in the &lt;code&gt;array_mul&lt;&#x2F;code&gt; test but nothing in the &lt;code&gt;array&lt;&#x2F;code&gt; test.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;We also added a set of correctness tests to verify that running our induction variable optimizations did not break anything. We paid particular attention to including programs with multiple loops that had interesting control structure. For example, we included programs that had loops with branches and multiple backedges corresponding to the same loop entry point. All of our correctness regression tests pass, so our optimizations are (&lt;em&gt;hopefully&lt;&#x2F;em&gt;) sound.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>The Loop Unswitching Optimization</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-unswitching/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-unswitching/</guid>
                <description>&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h2&gt;
&lt;p&gt;For this project, I implemented the loop unswitching optimization for the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&quot;&gt;Bril&lt;&#x2F;a&gt; programming language. Loop unswitching involves detecting conditional expressions inside of loops whose condition is independent of the loop&#x27;s body. This condition is then moved outside of the loop, and the loop&#x27;s body is replicated inside of each branch.  Consider the following snippet of code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;bool b = false
x,y,z = 0
for _ in range(100):
    if b:
        x &amp;lt;= x + 1
    else:
        y &amp;lt;= y + 1
        z &amp;lt;= z + 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since &lt;code&gt;b&lt;&#x2F;code&gt; is never modified inside of the loop&#x27;s body, we can &amp;quot;unswitch&amp;quot; this code to the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;bool b = false
x,y,z = 0
if b:
    for _ in range(100):
        x &amp;lt;= x + 1
else:
    for _ in range(100):
        y &amp;lt;= y + 1
        z &amp;lt;= z + 1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Though code size has effectively doubled, we have eliminated the need to check the conditional statement inside the loop&#x27;s body.  This leads to less branching and allows for additional, separate optimizations within each branch.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-and-implementation&quot;&gt;Design and Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;This optimization primarily involves loops, so the first step is to specify a contract for the representation of loops in Bril.  We use one similar to &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;doxygen&#x2F;classllvm_1_1LoopBase.html&quot;&gt;LLVM&#x27;s representation&lt;&#x2F;a&gt;.  Note that other representations are allowed, though they will need to be preprocessed into the following format before the loop unswitching optimization occurs.  This is a matter of rewriting the Bril program.&lt;&#x2F;p&gt;
&lt;p&gt;The for loop:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for( int c = 0; c &amp;lt; max; c += i):
  x = x + x;
z = 11111
print(x)
print(z)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;is represented in Bril as:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  i: int = const 1;
  c: int = const 0;
  max: int = const 10;
header:
  b: bool = lt c max;
  c: int = add c i;
  br b loopbody exit;
loopbody:
  x: int = add x x;
  jmp header;
exit:
  z: int = const 11111;
  print x;
  print z;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Most importantly, logic containing whether to enter the loop is encoded in the block &amp;quot;header.&amp;quot;  From here, the program either branches into the loop&#x27;s body which can consist of any number of blocks (here it is &amp;quot;loopbody&amp;quot;) or it exits the loop and continues through the rest of the program in the block &amp;quot;exit.&amp;quot;&lt;&#x2F;p&gt;
&lt;p&gt;There is also a standing assumption that we are working with Bril programs that have been transformed into SSA form.  Though this implementation does not require SSA form, it may miss unswitches.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;design&quot;&gt;Design&lt;&#x2F;h3&gt;
&lt;h5 id=&quot;loop-detection&quot;&gt;Loop Detection&lt;&#x2F;h5&gt;
&lt;p&gt;We detect loops by producing the control flow graph (CFG) for the program and then searching for backedges: edges whose destination block dominates their source block.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;back_edges = []
for e in cfg_edges:
    src, dst = e  # the edge runs from src to dst
    # dom_dic[src] is the set of blocks that dominate src
    if dst in dom_dic[src]:
        back_edges.append(e)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Backedges indicate the presence of cycles.  Next, we find all nodes in between these two nodes by populating a stack with predecessors, until we have reached the beginning of the loop.  We mark these blocks to be part of a loop and pass it to the next part of the algorithm.&lt;&#x2F;p&gt;
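&lt;p&gt;The predecessor walk described above can be sketched as follows. This is a minimal illustration and not the code from my implementation: the function name &lt;code&gt;natural_loop&lt;&#x2F;code&gt; and the &lt;code&gt;predecessors&lt;&#x2F;code&gt; dictionary are assumptions, and &lt;code&gt;main&lt;&#x2F;code&gt; is used as a made-up name for the unnamed entry block of the example program.&lt;&#x2F;p&gt;

```python
def natural_loop(back_edge, predecessors):
    """Collect the blocks of the loop closed by one back edge.

    back_edge:    (source, header) pair, where header dominates source
    predecessors: dict mapping a block name to a list of predecessor names
    """
    source, header = back_edge
    loop = {header}
    stack = [source]
    while stack:
        block = stack.pop()
        if block in loop:
            continue          # already visited; the header stops the walk
        loop.add(block)
        stack.extend(predecessors.get(block, []))
    return loop


# CFG of the example Bril program above: entry, header, loopbody, exit.
predecessors = {
    "header": ["main", "loopbody"],
    "loopbody": ["header"],
    "exit": ["header"],
}
loop_blocks = natural_loop(("loopbody", "header"), predecessors)
print(sorted(loop_blocks))
```

&lt;p&gt;Because the header is placed in the set before the walk starts, the walk never climbs past it into the code before the loop.&lt;&#x2F;p&gt;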
&lt;h5 id=&quot;deciding-unswitchability&quot;&gt;Deciding unswitchability&lt;&#x2F;h5&gt;
&lt;p&gt;Now, we need to decide if this loop is unswitchable.  Recall that in order to unswitch loops, we need to ensure that the condition is independent of the loop&#x27;s body during execution.  That is, the variable tested by the conditional we are unswitching cannot be modified in the loop, so the branch goes the same way on every iteration.&lt;&#x2F;p&gt;
&lt;p&gt;To implement this,  we adopt the following notation, borrowed from &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.rit.edu&#x2F;%7Emtf&#x2F;student-resources&#x2F;20155_surawski_mscourse.pdf&quot;&gt;Matthew John Surawski&lt;&#x2F;a&gt; of Rochester Institute of Technology.  Let $v_s$ denote the set of variables defined by statements, and let $v_a$ denote the set of variables defined by arguments.  Now let $V_b = v_a \cup v_s$ be the union of the two for a given block $b$, and let $V_L = \bigcup_i V_{b_i}$ denote the set of all variables defined in the loop&#x27;s blocks $b_i$.  Now suppose we have a branching statement on condition $t$:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;br t if else;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now if $t \notin V_L$, we can unswitch this loop!  In the case of multiple conditional statements that can be unswitched, as we traditionally do in the literature, we pick one uniformly at random.&lt;&#x2F;p&gt;
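&lt;p&gt;A rough Python sketch of this test is shown below. It assumes that each block is a list of instructions in Bril&#x27;s JSON form (dictionaries with &lt;code&gt;op&lt;&#x2F;code&gt;, &lt;code&gt;args&lt;&#x2F;code&gt; and an optional &lt;code&gt;dest&lt;&#x2F;code&gt;) and that the condition is the first argument of a &lt;code&gt;br&lt;&#x2F;code&gt;. The function names are hypothetical, and the sketch only collects variables defined by instructions (the $v_s$ part of $V_b$).&lt;&#x2F;p&gt;

```python
def defined_in_loop(loop_blocks, contents):
    """V_L: every variable written by an instruction in any loop block."""
    return {
        instr["dest"]
        for name in loop_blocks
        for instr in contents[name]
        if "dest" in instr
    }


def unswitchable_branches(loop_blocks, contents):
    """Return the br instructions whose condition t is not in V_L."""
    v_l = defined_in_loop(loop_blocks, contents)
    return [
        instr
        for name in sorted(loop_blocks)
        for instr in contents[name]
        if instr.get("op") == "br" and instr["args"][0] not in v_l
    ]


# Hypothetical loop: c is recomputed in the header, t is never written.
contents = {
    "header": [
        {"op": "lt", "dest": "c", "args": ["i", "n"]},
        {"op": "br", "args": ["c", "body", "exit"]},
    ],
    "body": [
        {"op": "add", "dest": "i", "args": ["i", "one"]},
        {"op": "br", "args": ["t", "then", "else"]},
    ],
}
candidates = unswitchable_branches({"header", "body"}, contents)
print([instr["args"][0] for instr in candidates])
```

&lt;p&gt;The loop&#x27;s own exit test on &lt;code&gt;c&lt;&#x2F;code&gt; is rejected because &lt;code&gt;c&lt;&#x2F;code&gt; is written inside the loop, which is what keeps the pass from trying to hoist the loop condition itself.&lt;&#x2F;p&gt;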
&lt;h5 id=&quot;implementing-unswitching&quot;&gt;Implementing Unswitching&lt;&#x2F;h5&gt;
&lt;p&gt;Once we have selected a subset of nodes in the CFG to be unswitched, we need to actually reorder the blocks.  At a high level, we implement the following reordering:&lt;&#x2F;p&gt;
&lt;img src=&quot;unswitched.png&quot; style=&quot;width: 40%&quot;&gt;
&lt;p&gt;In the above diagram, we have the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Before Loop Code:  This block represents all code before the start of the for loop.&lt;&#x2F;li&gt;
&lt;li&gt;For Loop Logic:  This consists of logic involving whether or not to enter the for loop&#x27;s body.  Usually, this encodes code such as: &lt;code&gt;for(int i=0; i&amp;lt;n; ++i) &lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Loop Body (1):  This contains the entire loop body up until conditional t.  In particular, it can consist of many blocks, branches, conditionals, and nonconditional jumps.&lt;&#x2F;li&gt;
&lt;li&gt;Conditional t:  This block consists of exactly one instruction of the form &lt;code&gt;br b if else&lt;&#x2F;code&gt;, where &lt;code&gt;b&lt;&#x2F;code&gt; is the branching boolean that is independent of the loop&#x27;s body.&lt;&#x2F;li&gt;
&lt;li&gt;If Body: This block contains the contents of the &lt;code&gt;if&lt;&#x2F;code&gt; branch when &lt;code&gt;b&lt;&#x2F;code&gt; is true.&lt;&#x2F;li&gt;
&lt;li&gt;Else Body:  This block contains the contents of the &lt;code&gt;else&lt;&#x2F;code&gt; branch when &lt;code&gt;b&lt;&#x2F;code&gt; is false.&lt;&#x2F;li&gt;
&lt;li&gt;Loop body (2):  This contains the entire loop body following the conditional t.&lt;&#x2F;li&gt;
&lt;li&gt;End of Program: This block contains all code after the loop.  In particular, it may contain additional loops with conditionals, that we are not optimizing.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To implement unswitching, we want to move the &lt;code&gt;Conditional t&lt;&#x2F;code&gt; block outside the for loop, create branches for each destination (in Bril, we are limited to two branches), and replicate the contents inside of the loop.  We wish to perform this surgery in such a way as to disrupt only the nodes involving the loop, leaving the rest of the CFG intact.  The high-level control flow after unswitching is as follows:&lt;&#x2F;p&gt;
&lt;img src=&quot;switched.png&quot; style=&quot;width: 60%&quot;&gt;
&lt;p&gt;In particular, we have the following blocks:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Before loop code: This is the same block as before and contains the contents of the program before we enter the loop.&lt;&#x2F;li&gt;
&lt;li&gt;Conditional t:  This block contains one instruction, namely the branching instruction that involves the independent boolean.  Based on the value of the boolean, it connects to either the &amp;quot;if&amp;quot; or &amp;quot;else&amp;quot; blocks, each block containing its own for loop.&lt;&#x2F;li&gt;
&lt;li&gt;If for loop logic:  This block is a replica of the &lt;code&gt;for loop logic&lt;&#x2F;code&gt; block in the previous CFG, except it branches to two newly created blocks.  If the program decides to enter the loop, we branch to &lt;code&gt;If loop body&lt;&#x2F;code&gt; and if it decides to exit the loop, it branches to bypass.&lt;&#x2F;li&gt;
&lt;li&gt;Else for loop logic:  This block is identical to the previous block, except it branches to either &lt;code&gt;Else loop body&lt;&#x2F;code&gt; or &lt;code&gt;bypass&lt;&#x2F;code&gt; depending on the loop invariant. &lt;&#x2F;li&gt;
&lt;li&gt;If loop body:  This contains all code in the &lt;code&gt;if&lt;&#x2F;code&gt; branch of the original CFG.  In particular, this block contains code that is &lt;code&gt;Loop body (1)&lt;&#x2F;code&gt; $\cup$ &lt;code&gt;If body&lt;&#x2F;code&gt; $\cup$ &lt;code&gt;Loop body (2)&lt;&#x2F;code&gt;.  This block then automatically branches to a newly created block, &lt;code&gt;jmp loop&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Else loop body: Similarly, this block contains code in the &lt;code&gt;else&lt;&#x2F;code&gt; branch of the original CFG:  &lt;code&gt;Loop body (1)&lt;&#x2F;code&gt; $\cup$ &lt;code&gt;Else body&lt;&#x2F;code&gt; $\cup$ &lt;code&gt;Loop body (2)&lt;&#x2F;code&gt;.  This block then automatically branches to a newly created block (separate from the previous one), &lt;code&gt;jmp loop&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Jmp loop:  There are two of these blocks, and each acts as a proxy that feeds back into the loop logic.  Essentially, we delegate logic involving entering the loop through this block.  Further optimizations can make use of this dummy block, though in this implementation, the block only contains a jump instruction.&lt;&#x2F;li&gt;
&lt;li&gt;Bypass:  Both bypass blocks also delegate the program flow to the end of program block, essentially exiting the loop.&lt;&#x2F;li&gt;
&lt;li&gt;End of program:  This block is identical to the &lt;code&gt;End of Program&lt;&#x2F;code&gt; block in the original CFG.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;Fine-grain details on the implementation in Python can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sameerlal&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-txt&#x2F;unroll_opt.py&quot;&gt;here.&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Implementing unswitching requires favoring generality over specificity.  We begin by loading a Bril file and storing the control flow graph by keeping track of edges.  Next, we run a dominator analysis so that we can look up, in constant time, the set of blocks that dominate a particular block.  We run the loop-finding algorithm described above, and then verify that the loop is unswitchable by examining the variables used in conditionals and checking whether each one is defined inside the loop.  We mark these blocks to be reordered and then pass the CFG into the unswitching algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;In my implementation, I stored the program as a dictionary mapping from block name to contents.  The unswitching algorithm takes in this mapping and produces a new mapping that is eventually converted to a Bril program.  It is important to preserve the older block ordering which may have been earlier optimized and only unswitch locally on marked blocks that consist of the loop.&lt;&#x2F;p&gt;
&lt;p&gt;Since there are multiple duplicated blocks, we first create two hashes, one for each block that act as suffixes for duplicated blocks.  For instance if &lt;code&gt;hash(if) = 083&lt;&#x2F;code&gt; and &lt;code&gt;hash(else) = 061&lt;&#x2F;code&gt; then the two bypass blocks are named:  &lt;code&gt;bypass083&lt;&#x2F;code&gt; and &lt;code&gt;bypass061&lt;&#x2F;code&gt;.  This eliminates the possibility of branching to the incorrect branch, since all block names are guaranteed to be unique.  We choose a hash function a priori according to the number of blocks in the CFG to minimize the probability of hash collisions.  Now we are ready to create the blocks and modify branches for reordering.&lt;&#x2F;p&gt;
&lt;p&gt;We first create the &lt;code&gt;conditional t&lt;&#x2F;code&gt; block, whose name has already been extracted.&lt;&#x2F;p&gt;
&lt;p&gt;Next, we create the loop logic blocks.  We copy the contents of the original program&#x27;s loop logic, duplicate the block, and append the appropriate hashing suffix to its label.  The &lt;code&gt;conditional t&lt;&#x2F;code&gt; block that we just created will flow to both of these blocks, so we will need to modify the names of the branches there.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if_for_loop_logic = loop_logic  + hash(if)
contents[if_for_loop_logic] = contents[loop_logic]

else_for_loop_logic = loop_logic  + hash(else)
contents[else_for_loop_logic] = contents[loop_logic]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
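&lt;p&gt;The duplication step can be sketched in runnable Python as follows.  This is only a minimal sketch under stated assumptions, not the linked implementation: it assumes each block is a list of instruction dictionaries, and &lt;code&gt;suffix_for&lt;&#x2F;code&gt; and &lt;code&gt;duplicate_block&lt;&#x2F;code&gt; are hypothetical helpers standing in for the hash function and renaming logic described above.&lt;&#x2F;p&gt;

```python
import copy
import hashlib


def suffix_for(tag, length=6):
    """Hypothetical helper: derive a short suffix from a tag such as "if"."""
    return hashlib.sha256(tag.encode("utf-8")).hexdigest()[:length]


def duplicate_block(contents, label, tag):
    """Copy contents[label] under a new, suffixed label and return that label."""
    new_label = label + suffix_for(tag)
    if new_label in contents:
        # A collision would silently merge two blocks, so fail loudly instead.
        raise ValueError("label collision: " + new_label)
    # Deep copy so later branch surgery on one copy leaves the other untouched.
    contents[new_label] = copy.deepcopy(contents[label])
    return new_label


# Assumed layout: block label mapped to a list of instruction dictionaries.
contents = {"loop_logic": [{"op": "br", "args": ["cond", "body", "exit"]}]}
if_for_loop_logic = duplicate_block(contents, "loop_logic", "if")
else_for_loop_logic = duplicate_block(contents, "loop_logic", "else")
print(sorted(contents))
```

&lt;p&gt;The deep copy matters here: if both labels shared one list, rewriting the branch targets for the &lt;code&gt;if&lt;&#x2F;code&gt; copy would also rewrite them for the &lt;code&gt;else&lt;&#x2F;code&gt; copy.&lt;&#x2F;p&gt;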
&lt;p&gt;Now we are ready to create the loop body blocks.  Their contents should include the instructions that dominate the &lt;code&gt;conditional t&lt;&#x2F;code&gt; block and are also dominated by the big loop block.  To find them, we use set intersections on the domination dictionary.&lt;&#x2F;p&gt;
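&lt;p&gt;As a rough illustration of that intersection, the sketch below works at the level of blocks.  The block names and the hard-coded &lt;code&gt;dominators&lt;&#x2F;code&gt; dictionary are assumptions made up for this example (a loop header, one block before the conditional, the two branches, and one block after), and &lt;code&gt;blocks_between&lt;&#x2F;code&gt; is a hypothetical helper rather than a function from the linked code.&lt;&#x2F;p&gt;

```python
def blocks_between(dominators, header, cond):
    """Blocks that dominate `cond` and are themselves dominated by `header`.

    `dominators[b]` is the set of blocks that dominate b (including b itself).
    """
    dominated_by_header = {b for b, doms in dominators.items() if header in doms}
    candidates = dominators[cond].intersection(dominated_by_header)
    # The header and the conditional block get their own blocks, so drop them.
    return candidates - {header, cond}


# Assumed CFG: entry, header, pre, cond, (ifb | elseb), post, back to header;
# header also branches to exit.
dominators = {
    "entry": {"entry"},
    "header": {"entry", "header"},
    "pre": {"entry", "header", "pre"},
    "cond": {"entry", "header", "pre", "cond"},
    "ifb": {"entry", "header", "pre", "cond", "ifb"},
    "elseb": {"entry", "header", "pre", "cond", "elseb"},
    "post": {"entry", "header", "pre", "cond", "post"},
    "exit": {"entry", "header", "exit"},
}
print(blocks_between(dominators, "header", "cond"))  # {'pre'}
```

&lt;p&gt;In this example only &lt;code&gt;pre&lt;&#x2F;code&gt; survives, which corresponds to the &lt;code&gt;Loop body (1)&lt;&#x2F;code&gt; portion that both duplicated loop bodies must contain.&lt;&#x2F;p&gt;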
&lt;p&gt;We reorder these blocks according to the original CFG to ensure that we only change blocks locally and preserve the original block ordering for all other blocks.  I leave the details to the linked code above for those interested.&lt;&#x2F;p&gt;
&lt;p&gt;In one body block, we include the contents of the &lt;code&gt;if&lt;&#x2F;code&gt; block, and in the other body block, we include the contents of the &lt;code&gt;else&lt;&#x2F;code&gt; block.  Finally, we add instructions that are dominated by each of those blocks to be part of the body block.  Combining all of this, as well as performing surgery on the branch instruction names completes the construction of the &lt;code&gt;if loop body&lt;&#x2F;code&gt; and &lt;code&gt;else loop body&lt;&#x2F;code&gt; blocks.&lt;&#x2F;p&gt;
&lt;p&gt;Next, we create the additional blocks, namely the &lt;code&gt;jmp&lt;&#x2F;code&gt; and &lt;code&gt;bypass&lt;&#x2F;code&gt; blocks that lead to the end of the program.  Creating these blocks requires surgery on the original instructions, as well as renaming with the hash suffixes, since it is not guaranteed that the last instruction in a block is a &lt;code&gt;jmp&lt;&#x2F;code&gt; instruction.  For instance, both of the following are valid Bril programs:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;blockone:
    x: int = const 0;
    jmp blocktwo;
blocktwo:
    print x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;blockone:
    x: int = const 0;
blocktwo:
    print x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;so we need to guarantee that these two blocks are placed next to each other when we output the new optimized program.&lt;&#x2F;p&gt;
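&lt;p&gt;One way to find the block pairs that must stay adjacent is to scan the blocks in program order and flag every block whose last instruction is not a terminator.  The sketch below assumes blocks are an ordered list of (label, instructions) tuples and that each instruction is a dictionary with an &lt;code&gt;op&lt;&#x2F;code&gt; key; &lt;code&gt;fallthrough_pairs&lt;&#x2F;code&gt; is a hypothetical helper, not part of the linked implementation.&lt;&#x2F;p&gt;

```python
TERMINATORS = ("jmp", "br", "ret")


def fallthrough_pairs(blocks):
    """Return (label, next_label) pairs where control falls through.

    `blocks` is an ordered list of (label, instructions) tuples; each
    instruction is a dict with at least an "op" key (assumed layout).
    """
    pairs = []
    for index in range(len(blocks) - 1):
        label, instrs = blocks[index]
        next_label = blocks[index + 1][0]
        # An empty block, or one ending in a non-terminator, falls through.
        if not instrs or instrs[-1].get("op") not in TERMINATORS:
            pairs.append((label, next_label))
    return pairs


with_jump = [
    ("blockone", [{"op": "const", "dest": "x", "value": 0},
                  {"op": "jmp", "args": ["blocktwo"]}]),
    ("blocktwo", [{"op": "print", "args": ["x"]}]),
]
without_jump = [
    ("blockone", [{"op": "const", "dest": "x", "value": 0}]),
    ("blocktwo", [{"op": "print", "args": ["x"]}]),
]
print(fallthrough_pairs(with_jump))     # []
print(fallthrough_pairs(without_jump))  # [('blockone', 'blocktwo')]
```

&lt;p&gt;Any pair reported this way has to be emitted back to back in the output program, or else be given an explicit jump before the blocks are reordered.&lt;&#x2F;p&gt;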
&lt;p&gt;Finally, we connect the bypass blocks to the end of the program, which we leave intact.&lt;&#x2F;p&gt;
&lt;p&gt;The last step is to overwrite the original mapping from block name to contents with our new block dictionary and output the resulting Bril program.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation-difficulties&quot;&gt;Implementation Difficulties&lt;&#x2F;h2&gt;
&lt;p&gt;There were quite a few difficulties that arose during implementation due to the nature of the Bril programming language.&lt;&#x2F;p&gt;
&lt;p&gt;As alluded to before, since Bril does not require jumps at the end of blocks, we need to be careful about reordering blocks in the final Bril program.  Furthermore, this optimization is designed to run after other optimizations, so we would ideally like to keep the majority of the program untouched to prevent overwriting.  This involves a fair amount of bookkeeping especially when these situations occur within the loop body.&lt;&#x2F;p&gt;
&lt;p&gt;Traditionally, loop unswitching operates on SSA form, but due to time constraints, I was not able to write an SSA translator.  Thus this optimization assumes that the program has already been converted to SSA form.  One example of why we might want SSA form is dead code elimination.  It is possible that after dead code elimination, a branch within a loop becomes independent of the loop body.&lt;&#x2F;p&gt;
&lt;p&gt;Another shortcoming of this optimization is that it currently only operates on natural loops, and in particular, does not operate on nested loops.  That is, programs in the form:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;b = True
for i in range(10):
    for j in range(10):
        if b:
            do_something()
        else:
            do_something_else()
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;will not be completely unswitched since the conditional statement lies inside of a nested for loop.  Instead, after one run of the optimization, the conditional will be unswitched outside of the first for loop, which only mildly decreases the number of branches.  Supporting nested loops is not too difficult, though it requires additional abstractions for nested for loops.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation-and-results&quot;&gt;Evaluation and Results&lt;&#x2F;h2&gt;
&lt;h5 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h5&gt;
&lt;p&gt;This optimization was tested on many programs of varying length, complexity, and number of unswitchable loops.  The test suite I wrote covers (but is not limited to) the following attributes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Should a conditional be unswitched?&lt;&#x2F;li&gt;
&lt;li&gt;Simple&#x2F;Complex blocks before&#x2F;after conditional&lt;&#x2F;li&gt;
&lt;li&gt;Non-unswitched branches before conditional to be unswitched&lt;&#x2F;li&gt;
&lt;li&gt;Blocks with no terminating jump instructions&lt;&#x2F;li&gt;
&lt;li&gt;Nested if&#x2F;else conditionals should never be unswitched&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h5 id=&quot;results&quot;&gt;Results&lt;&#x2F;h5&gt;
&lt;p&gt;The results of this optimization were quite interesting.  I primarily evaluated the code on the percentage difference in the number of branches.  To do this, I modified the Bril TypeScript interpreter to keep track of the number of branches for each run.  For a program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for _ in range(maxiter):
    prebody
    if b:
        ifbody
    else:
        elsebody
    postbody
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and its unswitched version:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if b:
    for _ in range(maxiter):
        prebody
        ifbody
        postbody
else:
    for _ in range(maxiter):
        prebody
        elsebody
        postbody
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;the improvement comes from eliminating the per-iteration branch on the &amp;quot;if...else&amp;quot; decision.  In my tests, unswitching a single loop decreased the number of branches by approximately 14% on average for short Bril programs (~20 loc) with one unswitchable loop.  For medium-sized programs (~100 loc) with two unswitched loops, I observed an approximately 20% reduction in branches.&lt;&#x2F;p&gt;
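&lt;p&gt;A toy model helps show where the savings come from.  The sketch below is an assumption-laden simplification, not the modified interpreter: it counts only the loop-condition branch and the &lt;code&gt;if&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;else&lt;&#x2F;code&gt; branch, so it is not meant to reproduce the percentages measured on complete programs, which contain other branches as well.&lt;&#x2F;p&gt;

```python
def count_branches_original(maxiter):
    """Branches executed when the if/else sits inside the loop."""
    branches = 0
    i = 0
    while True:
        branches += 1      # loop-condition branch (including the final exit)
        if i == maxiter:
            break
        branches += 1      # if/else branch evaluated on every iteration
        i += 1
    return branches


def count_branches_unswitched(maxiter):
    """Branches executed when the if/else is hoisted out of the loop."""
    branches = 1           # the single hoisted if/else branch
    i = 0
    while True:
        branches += 1      # loop-condition branch (including the final exit)
        if i == maxiter:
            break
        i += 1
    return branches


print(count_branches_original(10))    # 21
print(count_branches_unswitched(10))  # 12
```

&lt;p&gt;In this model the original loop executes 2 * maxiter + 1 branches and the unswitched version executes maxiter + 2, so the saving grows with the iteration count.&lt;&#x2F;p&gt;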
&lt;p&gt;This result is highly dependent on the structure of the program.  For instance, after unswitching, &lt;code&gt;ifbody&lt;&#x2F;code&gt; and &lt;code&gt;elsebody&lt;&#x2F;code&gt; can be optimized separately, which would compound the results even further.  The results from the benchmark tests assume that the loops &lt;em&gt;cannot&lt;&#x2F;em&gt; be independently optimized, so the claimed average can be thought of as a lower bound on the decrease in the number of branches.&lt;&#x2F;p&gt;
&lt;p&gt;In the literature, benchmarking loop unswitching also involves measuring the increase in code size.  In my implementation, code size roughly doubled for each unswitching, which is consistent with the literature.&lt;&#x2F;p&gt;
&lt;p&gt;Overall, this project was a success, and the natural extension is to extend this optimization to work for nested loops.&lt;&#x2F;p&gt;
&lt;p&gt;I would be very happy with suggestions for additional test cases or pull requests that test this optimization on more test cases.  Furthermore, for questions or comments on design, please reach out by &lt;a href=&quot;mailto:sjl328@cornell.edu&quot;&gt;e-mail&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>POSH: A TLS Compiler that Exploits Program Structure</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/posh/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/posh/</guid>
                <description>&lt;p&gt;The development of multicore processor architectures in the 2000s led to
significant advancements in the performance of parallel computing. As a
software developer, if you could split your program or your data into
discrete chunks, you could send different pieces off to different cores
and have all of the processing done in parallel.&lt;&#x2F;p&gt;
&lt;p&gt;Naturally, software developers, compiler writers, and hardware architects
all began to wonder: &lt;em&gt;&amp;quot;Can we somehow use these extra cores to speed
up sequential, non-parallelizable workloads?&amp;quot;&lt;&#x2F;em&gt;
One proposed technique to answer this question is &lt;em&gt;Thread-Level Speculation&lt;&#x2F;em&gt; (TLS).
TLS allows software to run portions of a sequential program in parallel while
retaining the original sequential semantics. The key idea is that special hardware
support will detect when any of these parallel tasks misbehave and either roll back
the effects of such &amp;quot;speculative tasks&amp;quot; or hide the &amp;quot;bad&amp;quot; behavior from other tasks
somehow.&lt;&#x2F;p&gt;
&lt;p&gt;In general, choosing where to insert these tasks so that they are likely to
succeed and actually provide speedup over serial execution is a difficult problem.
POSH is a compiler that automatically identifies some of these regions for you,
by using simple heuristics and profiling to eliminate candidate tasks that are
unlikely to be worth the cost of inserting them.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;tls-and-hardware-transactional-memory&quot;&gt;TLS and Hardware Transactional Memory&lt;&#x2F;h1&gt;
&lt;p&gt;Before we dive into POSH itself, we want to give a more detailed
background on both how TLS works and the context in which it was envisioned.
As we mentioned above, TLS relies on special hardware support for detecting
data dependencies between threads running on different processor cores.
Broadly, these kinds of features are known as Hardware Transactional Memory (HTM).
At the &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posh&#x2F;#hardware-transactional-memory&quot;&gt;end of this article&lt;&#x2F;a&gt; we&#x27;ve included a brief aside on HTM and its
presence in modern processors for those who are interested.&lt;&#x2F;p&gt;
&lt;p&gt;POSH assumes that hardware has support
for the following features:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Inputs to tasks are passed via memory, not registers.&lt;&#x2F;li&gt;
&lt;li&gt;Hardware automatically detects conflicting memory reads&#x2F;writes
between the main thread and speculative tasks
and then automatically kills or restarts tasks.&lt;&#x2F;li&gt;
&lt;li&gt;The ISA extension has the &lt;code&gt;spawn&lt;&#x2F;code&gt; and &lt;code&gt;commit&lt;&#x2F;code&gt; primitives
for starting and ending task execution.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Most papers exploiting HTM rely on a very similar set of assumptions,
and indeed, real HTM extensions have guarantees not unlike those listed here.
The primary difference between these assumptions and reality is that real hardware
limits the code and working set size of speculative tasks.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;sources-of-performance-improvements-of-tls&quot;&gt;Sources of Performance Improvements of TLS&lt;&#x2F;h1&gt;
&lt;p&gt;The goal for TLS (remember, HTM is the set of hardware features, while TLS
is a software-level technique that utilizes those features)
is to speculatively parallelize code by predicting which regions do
not have real data dependencies. Existing compiler optimizations already
attempt to identify such dependencies (e.g., &lt;a href=&quot;..&#x2F;instruction-scheduling&quot;&gt;instruction scheduling&lt;&#x2F;a&gt;)
in order to improve performance. However, those optimizations must be
conservative in order to preserve program semantics.
Since TLS compilers can rely on runtime support from the hardware to preserve
correctness, they can aggressively overestimate data independence to
maximize potential parallelism.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;f(a);
y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;g(b);
z &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;h(x,y);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, in the above code snippet, the calls to functions &lt;code&gt;f&lt;&#x2F;code&gt; and &lt;code&gt;g&lt;&#x2F;code&gt;
can probably be parallelized so that the operands to &lt;code&gt;h&lt;&#x2F;code&gt; are available as
soon as possible. However, &lt;code&gt;f&lt;&#x2F;code&gt; and &lt;code&gt;g&lt;&#x2F;code&gt; may be side-effectful functions that
modify shared memory; TLS using HTM is free to parallelize those two calls
without fear of race conditions on that memory. A normal compiler would have
to prove disjointness of their memory accesses to parallelize them automatically.&lt;&#x2F;p&gt;
&lt;p&gt;The authors point out another, more subtle, benefit to TLS: data prefetching.
Even speculative tasks which violate data dependencies are likely to
access data that will be useful to re-executions of that task.
The authors assume (fairly) that the hardware primitives for squashing tasks
will not roll back cache state; this implies that squashed tasks can
still prefetch useful data into the cache. Re-executions of the failed
task, or even future tasks may benefit from access to this cached data
and see reduced memory access latency.&lt;&#x2F;p&gt;
&lt;img src=&quot;prefetchExample.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;This diagram from the POSH paper shows how &amp;quot;load times&amp;quot; in
speculative tasks are not totally wasted during task failure.
Executing a speculative load from memory improves the performance of loads in future tasks.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;posh-phases&quot;&gt;POSH Phases&lt;&#x2F;h1&gt;
&lt;p&gt;The POSH compiler optimization is broken into three phases:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Task Selection&lt;&#x2F;em&gt;: Choose speculative tasks based on program structure.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Spawn Hoisting&lt;&#x2F;em&gt;: Place task initiation (spawn instructions) as early as possible.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Task Refinement&lt;&#x2F;em&gt;: Use dynamic profiling to remove tasks that are unlikely to be beneficial.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The first step is chopping up the program into tasks that will benefit
from being run concurrently. This is similar to the &lt;em&gt;balanced min-cut&lt;&#x2F;em&gt;
problem, where you want to partition a weighted graph into pieces of approximately
equal size while minimizing the total weight of the edges that cross between them.
In TLS, the nodes are instructions
and the edge weights take into account execution time and other runtime overheads.
As you might expect, doing this optimally is NP-hard, so POSH has to resort to heuristics.
Its primary heuristic leverages the existing high level program structure;
each subroutine call and loop iteration is considered a candidate task.
The authors justify this with some intuition:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;All these programmer-generated structures are taken&lt;&#x2F;em&gt;
&lt;em&gt;as hints to delineate code sections with a relatively independent and&lt;&#x2F;em&gt;
&lt;em&gt;sizable amount of work.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;In reality, not all subroutines or loop iterations are independent
&lt;em&gt;and&lt;&#x2F;em&gt; there are other sections of code which may be parallelizable.
The former problem is addressed by &lt;em&gt;Task Refinement&lt;&#x2F;em&gt; but POSH ignores
the latter source of imprecision.&lt;&#x2F;p&gt;
&lt;p&gt;During task selection, POSH inserts the special &lt;code&gt;spawn&lt;&#x2F;code&gt; and &lt;code&gt;commit&lt;&#x2F;code&gt;
instructions, as well as task begin labels to divvy up the program
according to the above heuristic. A subtle optimization included in
this phase is the introduction of &lt;a href=&quot;https:&#x2F;&#x2F;people.apache.org&#x2F;%7Exli&#x2F;papers&#x2F;vpw03-software-value-prediction.pdf&quot;&gt;software value prediction&lt;&#x2F;a&gt;.
Although POSH doesn&#x27;t focus on their implementation of software value
prediction, we include another &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posh&#x2F;#software-value-prediction&quot;&gt;aside on how SVP works&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In the &lt;em&gt;Spawn Hoisting&lt;&#x2F;em&gt; phase, POSH tries to move spawn instructions as early
as possible without violating dependencies or changing program behavior.
Spawning tasks early increases opportunities for parallel execution and
prefetching, but there are limits to how far we can move them.
For example, it is not sensible to spawn a thread before the assignment
of its input variables. Neither is it clearly beneficial to
move the spawn instruction outside a conditional
statement since that could result in unnecessary code execution.&lt;&#x2F;p&gt;
&lt;p&gt;The final phase uses profiling and simple syntactic criteria, such as task
size and number of inputs, to remove tasks that probably don&#x27;t improve performance.
Instead, these pieces of the program are executed as ordinary sequential code.
The profiling methodology is to use test inputs and simulate the parallel
execution of the program with a sequential interpreter, while keeping
track of how many dynamic instructions each task executes and how often
it would have to be squashed by the hardware.&lt;&#x2F;p&gt;
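&lt;p&gt;To make the refinement step concrete, here is a small sketch of filtering candidate tasks from profile data.  Everything specific in it is an assumption for illustration: the &lt;code&gt;TaskProfile&lt;&#x2F;code&gt; record, the &lt;code&gt;refine&lt;&#x2F;code&gt; function, and both threshold values are made up and are not taken from the POSH paper.&lt;&#x2F;p&gt;

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    name: str
    spawns: int                 # times the task was spawned in the profiled run
    squashes: int               # times it would have been squashed
    dynamic_instructions: int   # total instructions executed across all spawns


def refine(profiles, min_avg_size=30, max_squash_rate=0.5):
    """Keep only tasks that look profitable under the (made-up) thresholds."""
    kept = []
    for p in profiles:
        if p.spawns == 0:
            continue  # assumption: drop tasks never exercised by the test input
        avg_size = p.dynamic_instructions / p.spawns
        squash_rate = p.squashes / p.spawns
        if avg_size >= min_avg_size and max_squash_rate >= squash_rate:
            kept.append(p.name)
    return kept


profiles = [
    TaskProfile("loop_iter", spawns=100, squashes=5, dynamic_instructions=12000),
    TaskProfile("tiny_call", spawns=100, squashes=0, dynamic_instructions=800),
    TaskProfile("conflicting_call", spawns=50, squashes=45, dynamic_instructions=9000),
]
print(refine(profiles))  # ['loop_iter']
```

&lt;p&gt;Here &lt;code&gt;tiny_call&lt;&#x2F;code&gt; is rejected because its tasks are too small to amortize the spawn overhead, and &lt;code&gt;conflicting_call&lt;&#x2F;code&gt; is rejected because it would be squashed most of the time.&lt;&#x2F;p&gt;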
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;p&gt;The overall evaluation methodology is to optimize some of the SPECint benchmarks
using some configuration of the compiler, and look at statistics (e.g., execution time or memory behavior).
For the most part, their tests are concerned with the reduction in total execution
time compared to the sequential execution.
All experiments are run on a simulator as hardware with support for TLS
was not commercially available at the time.&lt;&#x2F;p&gt;
&lt;p&gt;While the authors do extensively break down their evaluation,
we&#x27;ll simply summarize some of the tests they run and our takeaways
from their results.&lt;&#x2F;p&gt;
&lt;p&gt;First they test POSH&#x27;s various optimizations:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Impact of choosing subroutine vs. loops as tasks&lt;&#x2F;li&gt;
&lt;li&gt;Effect of using &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;posh&#x2F;#software-value-prediction&quot;&gt;software value prediction&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;Importance of using the profiler to eliminate tasks&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;img src=&quot;performance.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;In the first case, they observe that the best performance comes from parallelizing
both loops and subroutines. Since these measurements involve tests that &lt;em&gt;do&lt;&#x2F;em&gt; use the
profiler (which hopefully eliminates tasks that make things worse),
it makes sense that &lt;em&gt;more candidates&lt;&#x2F;em&gt; for parallelization means &lt;em&gt;more performance&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Value prediction makes a big difference in some cases, but less in others;
but the same theory as before should apply here. Value prediction allows for
more chances to parallelize, so it should improve performance as long as the
profiler identifies when it might be a bad idea.&lt;&#x2F;p&gt;
&lt;p&gt;The following graph shows the importance of using the profiler:
without it, some programs are &lt;em&gt;slowed down&lt;&#x2F;em&gt; due to the overhead of
managing tasks.
The profiler significantly improves performance, and realizes the
&amp;quot;do no harm&amp;quot; principle of compiler optimization.&lt;&#x2F;p&gt;
&lt;img src=&quot;doNoHarm.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;p&gt;As we mentioned earlier, most of the speedup comes from executing code in parallel, but
even when correct parallel execution is not possible and tasks get squashed, memory accesses
they make have the effect of prefetching data that the re-executions are likely to use.
To evaluate the impact of prefetching, they modify the simulator so that data brought
into the processor cache by squashed tasks are marked as invalid.
Comparing the speedup gained with and without prefetching in the graph below,
they claim 26% of the speedup is due to prefetching.&lt;&#x2F;p&gt;
&lt;img src=&quot;prefetch.png&quot; style=&quot;width:100%&quot;&#x2F;&gt;
&lt;h2 id=&quot;evaluation-takeaways&quot;&gt;Evaluation Takeaways&lt;&#x2F;h2&gt;
&lt;p&gt;The paper has a solid evaluation which asks (and answers) all the questions we
might want asked. Our primary gripe with their evaluation is that it is based on &amp;quot;eyeball statistics&amp;quot;.
No formal null hypothesis testing is done; instead, the authors point at a graph
and say &amp;quot;the bars are usually higher in this case&amp;quot;. Additionally, it&#x27;s unclear if
the SPECint benchmarks are really a representative use case for real code. In theory, these
are meant to test the sequential integer performance of CPUs and may have few opportunities for parallelism.
On the other hand, the opposite may be true &lt;em&gt;or&lt;&#x2F;em&gt; they may represent a good spread of optimizability.&lt;&#x2F;p&gt;
&lt;p&gt;Given the breakdowns that the authors provide, it seems likely that the most significant
contribution of POSH is its dynamic profiler, which allows their other optimizations
to be aggressively optimistic. While hardware support does prevent TLS from impacting
correctness, it doesn&#x27;t prevent TLS from being a bad idea. The POSH profiler fills this
gap instead and allows techniques like software value prediction and the structured
program heuristic to be utilized without hurting performance.&lt;&#x2F;p&gt;
&lt;p&gt;If HTM were actually widely usable by general purpose programs, then POSH would likely
be an effective optimization for automatically speeding up sequential code!&lt;&#x2F;p&gt;
&lt;h1 id=&quot;appendix&quot;&gt;Appendix&lt;&#x2F;h1&gt;
&lt;h3 id=&quot;hardware-transactional-memory&quot;&gt;Hardware Transactional Memory&lt;&#x2F;h3&gt;
&lt;p&gt;Before transactional memory, hardware support for parallel computing
was limited to synchronization primitives such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Compare-and-swap&quot;&gt;atomic compare-and-swap&lt;&#x2F;a&gt;
or &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Load-link&#x2F;store-conditional&quot;&gt;store-conditional&lt;&#x2F;a&gt;.
Transactional Memory was meant to accelerate the common use case for such
primitives: atomic software transactions, consisting of a potentially unbounded number of instructions.&lt;&#x2F;p&gt;
&lt;p&gt;In this ideal world, programmers could write systems code like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;withdraw(bank_acct &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;acct, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; amt) {
  atomic {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(acct-&amp;gt;balance &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; amt) {
        acct-&amp;gt;balance &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; amt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    } &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;false&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; }
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;atomic&lt;&#x2F;code&gt; was a hardware-supported feature for ensuring the atomicity of the contained code.
If any other thread modified &lt;code&gt;acct-&amp;gt;balance&lt;&#x2F;code&gt; during the execution of this transaction,
it would &lt;em&gt;abort&lt;&#x2F;em&gt; and have to be retried or cancelled.&lt;&#x2F;p&gt;
&lt;p&gt;Usually, HTM is implemented by piggy-backing off of the cache coherence protocol,
which normally ensures that memory writes to the same address are eventually propagated
between cores. Unfortunately, cache coherency can be &lt;a href=&quot;https:&#x2F;&#x2F;doi.org&#x2F;10.1109&#x2F;2.55497&quot;&gt;notoriously complex&lt;&#x2F;a&gt;,
especially in the face of ambiguously defined and&#x2F;or weak memory models.
One might reasonably expect adding new synchronization features to introduce bugs
and&#x2F;or interact unexpectedly with existing weak memory guarantees.
Furthermore, relying on cache coherency drastically limits the size of the data sets read or written by hardware transactions;
in &lt;a href=&quot;https:&#x2F;&#x2F;researcher.watson.ibm.com&#x2F;researcher&#x2F;files&#x2F;us-rodaira&#x2F;ISCA2015_ComparisonOfHTM.pdf&quot;&gt;most systems&lt;&#x2F;a&gt; the write set must fit entirely inside the L1 cache.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;htm-today&quot;&gt;HTM Today&lt;&#x2F;h3&gt;
&lt;p&gt;In reality, hardware transactional memory has primarily been a failure
and does not see wide use today.
While Intel theoretically supports these kinds of instructions
with &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Transactional_Synchronization_Extensions&quot;&gt;TSX&lt;&#x2F;a&gt;,
numerous bug reports have caused them to &lt;a href=&quot;https:&#x2F;&#x2F;www.anandtech.com&#x2F;show&#x2F;8376&#x2F;intel-disables-tsx-instructions-erratum-found-in-haswell-haswelleep-broadwelly&quot;&gt;disable it on a number of processors&lt;&#x2F;a&gt;.
Furthermore, the &lt;a href=&quot;https:&#x2F;&#x2F;blog.ret2.io&#x2F;2019&#x2F;06&#x2F;26&#x2F;attacking-intel-tsx&#x2F;&quot;&gt;limitations of TSX&lt;&#x2F;a&gt;
and other such extensions often make using them impractical, unstable and&#x2F;or insecure.&lt;&#x2F;p&gt;
&lt;p&gt;However, some low-level code &lt;em&gt;does&lt;&#x2F;em&gt; utilize HTM to implement
efficient libraries for high performance computing. In these instances,
developers are targeting very specific architectures with very detailed
models of the processor and memory systems. Since developers in
this domain are already concerned with the finicky details that often
make HTM transactions impractical, HTM does offer utility as a more flexible
and performant synchronization primitive.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;software-value-prediction&quot;&gt;Software Value Prediction&lt;&#x2F;h3&gt;
&lt;p&gt;In TLS, there are some code regions which could be parallelized,
but they involve potentially predictable data dependencies.
For example, in a &lt;code&gt;while&lt;&#x2F;code&gt; loop, the iteration condition may not
be known before executing the entire body of the loop and thus parallelism
becomes very limited.&lt;&#x2F;p&gt;
&lt;p&gt;Value prediction transforms this sequential execution into
a potentially parallel one by creating data dependencies between
the original variable and a prediction variable. Value prediction
produces code with the following invariant:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Let x be some variable in the program, pred(x) be its predicted
value and real(x) be its real value. A TLS task that reads pred(x)
will be squashed whenever real(x) != pred(x).&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The following example from &lt;a href=&quot;https:&#x2F;&#x2F;people.apache.org&#x2F;%7Exli&#x2F;papers&#x2F;vpw03-software-value-prediction.pdf&quot;&gt;Li et al.&lt;&#x2F;a&gt;
shows how the newly spawned task will be squashed whenever &lt;code&gt;pred_x&lt;&#x2F;code&gt; is not equal to the correct value.
Specifically, this code ensures that the original thread will update &lt;code&gt;pred_x&lt;&#x2F;code&gt; to be correct before it commits, forcing
a read-write dependency on that variable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;      pred_x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;initialize prediction
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Loop&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; pred_x; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;use prediction
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;      pred_x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;f(x); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;generate prediction
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;      spawn Loop;
       … &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;foo(x);
       x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; …;
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; pred_x) &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;verify prediction
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;        pred_x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;recover misprediction
      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;commit();
 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(cont) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;goto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; Loop;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The POSH authors don&#x27;t go into any real detail on their prediction mechanism
beyond what we&#x27;ve described here. While they do evaluate its effectiveness,
we have no idea what kind of algorithm they&#x27;re using to choose prediction values.&lt;&#x2F;p&gt;
&lt;p&gt;Another downside of prediction is that it involves more runtime overhead in
generating predictions. Not only do the instructions used to produce predictions
slow down execution, but the prediction code likely accesses a shared data structure, and
races on that structure could increase the number of failed tasks.
It would be a great idea for the POSH profiler to
also take into account this execution information
(&lt;em&gt;note it does account for the potential task squashing, just not the instruction overhead&lt;&#x2F;em&gt;).&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Profile-Guided Code Layout</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/profile-guided-code-layout/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/profile-guided-code-layout/</guid>
                <description>&lt;p&gt;The goal of this project was to optimize Bril programs with profile-guided
function reordering and basic block reordering to improve code locality. I used
&lt;a href=&quot;&#x2F;blog&#x2F;making-function-calls-work&quot;&gt;Bril()&lt;&#x2F;a&gt; to serve as a base for the
optimization since it allows for function calls and passing in input via the
command line.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;defining-code-locality&quot;&gt;Defining Code Locality&lt;&#x2F;h3&gt;
&lt;p&gt;Before moving further, it is important to clearly define what &amp;quot;code locality&amp;quot;
means. Typically, the goal of code layout optimizations is to improve
instruction cache and instruction TLB utilization. When an instruction is loaded
into the instruction cache, a block of instructions is loaded into the cache
(known as a cache line). To improve cache utilization, we want to minimize the
number of times we load into the instruction cache. In other words, we want
sequences of instructions that frequently run together to live close together in
the code.&lt;&#x2F;p&gt;
&lt;p&gt;That being said, there are many factors that make measuring this metric
difficult for this project. Firstly, since Bril is interpreted, we do not load
Bril instructions directly into the instruction cache. I considered lowering
Bril programs to LLVM using a project 1 extension and then evaluating the
lowered program, but that leads into the second problem: I am one person running
Bril programs on one machine. The results I find for my machine may not be
consistent with other machines, so results would be misleading. Because of this,
I use a different metric to measure code locality: &lt;em&gt;instruction pointer jumps&lt;&#x2F;em&gt;.
Improving code layout usually decreases the total number of instruction pointer
jumps, since frequently executed code lives close together. To introduce the
notion of an instruction pointer to the interpreter, I use the index of the
currently executing instruction. Every time an instruction is executed, we can
compare the current instruction pointer location to the previous location to
determine the number of instruction pointer jumps. In other words, we are
measuring the distance between the current instruction and the previously
executed instruction.&lt;&#x2F;p&gt;
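&lt;p&gt;As a rough illustration of this metric, the Python sketch below sums the
absolute distance between consecutive instruction indices in an execution trace.
The function name &lt;code&gt;total_ip_jumps&lt;&#x2F;code&gt; and the list-of-indices trace format are
assumptions made for this example, not the profiler&#x27;s actual code. Whether the
real profiler also counts fall-through steps of distance one is likewise an
assumption here.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def total_ip_jumps(trace):
    """Sum the distance between each executed instruction and the previous one.

    trace is a list of instruction indices in execution order (hypothetical
    format; index 1 is the first instruction of the first function).
    """
    total = 0
    for prev, curr in zip(trace, trace[1:]):
        total += abs(curr - prev)
    return total


# Straight-line code moves the pointer one step at a time.
straight = [1, 2, 3, 4]
# A loop whose body (5, 6) sits far from its header (2) pays for every trip.
looping = [1, 2, 5, 6, 2, 5, 6, 2, 3]

print(total_ip_jumps(straight))  # prints 3
print(total_ip_jumps(looping))   # prints 18
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Under this measure, a layout that keeps a hot loop body next to its header
produces a smaller total than one that separates them.&lt;&#x2F;p&gt;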
&lt;h3 id=&quot;design&quot;&gt;Design&lt;&#x2F;h3&gt;
&lt;p&gt;The designs of these optimizations are closely related to the optimizations
discussed in &lt;a href=&quot;https:&#x2F;&#x2F;doi.org&#x2F;10.1145&#x2F;93542.93550&quot;&gt;&amp;quot;Profile guided code
positioning&amp;quot;&lt;&#x2F;a&gt; (1990) by Pettis and Hansen
and &lt;a href=&quot;https:&#x2F;&#x2F;research.fb.com&#x2F;publications&#x2F;optimizing-function-placement-for-large-scale-data-center-applications-2&#x2F;&quot;&gt;&amp;quot;Optimizing function placement for large-scale data-center
applications&amp;quot;&lt;&#x2F;a&gt;
(2017) by Ottoni and Maher. Specifically, the basic block reordering
optimization follows the approach laid out in the first paper, and the function
reordering optimization follows the approach in the second paper. Both
optimizations are profile-guided, meaning they use sample inputs to make
optimization decisions. Assuming that real-world workloads mirror the sample
inputs, we can optimize the code layout for the sample inputs, and this should
lead to improved performance on real workloads.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;profiling&quot;&gt;Profiling&lt;&#x2F;h3&gt;
&lt;p&gt;The first step in any profile-guided optimization is—perhaps
unsurprisingly—profiling. But first, we need to know what kind of data we want
to collect when profiling Bril programs. Function reordering relies on weighted
directed call graphs, where nodes represent functions, edges represent calls
from one function to another, and edge weights are the number of times that call
occurred in the sample workload. Similarly, basic block reordering relies on
weighted control flow graphs, where edge weights are the number of times the
corresponding branch was taken. Therefore, when profiling a Bril program, we
would like to keep track of function calls and branches. In addition, to help
with evaluation, we will keep track of the total number of instruction pointer
jumps.&lt;&#x2F;p&gt;
&lt;p&gt;To build a Bril profiler, I first extended the parser to create an annotated
JSON representation of the program. I introduced a new &lt;code&gt;&amp;quot;block&amp;quot;&lt;&#x2F;code&gt; key for every
instruction, denoting which basic block each instruction belongs to, and a
&lt;code&gt;&amp;quot;line&amp;quot;&lt;&#x2F;code&gt; key, denoting the line number of the instruction (ignoring whitespace)
in Bril&#x27;s JSON representation. In other words, &amp;quot;line&amp;quot; points to the index of the
instruction in the program. The first instruction in the first function is &lt;code&gt;1&lt;&#x2F;code&gt;,
the second is &lt;code&gt;2&lt;&#x2F;code&gt;, and so on. I then extended the interpreter to accept
profiling data in the form of a line-separated list of command-line arguments.
It then passes each line of the profiling data to the program, tracking function
calls and basic block changes. The profiler returns a JSON representation of the
weighted call and control flow graphs. The format of the profiling output is
shown below.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
  &amp;quot;call_graph&amp;quot;: [
    {
      &amp;quot;from&amp;quot;: caller name,
      &amp;quot;to&amp;quot;: callee name,
      &amp;quot;count&amp;quot;: total number of times called
    },
    ...
  ],
  &amp;quot;basic_block_flows&amp;quot;: [
    {
      &amp;quot;from&amp;quot;: block label,
      &amp;quot;to&amp;quot;: block label,
      &amp;quot;count&amp;quot;: total number of times branched,
      &amp;quot;function&amp;quot;: function where &amp;quot;from&amp;quot; and &amp;quot;to&amp;quot; blocks live
    },
    ...
  ],
  &amp;quot;total_ip_jumps&amp;quot;: total number of instruction pointer jumps
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To decouple the interpreter from the profiler, I created a new command called
&lt;code&gt;brilprofile&lt;&#x2F;code&gt;. It reads Bril programs as JSON from &lt;code&gt;stdin&lt;&#x2F;code&gt; (like &lt;code&gt;brili&lt;&#x2F;code&gt;), but
it takes an additional argument, a path to the sample workload. For example,
suppose we have a program called &lt;code&gt;fibonacci.bril&lt;&#x2F;code&gt; that takes an integer as a
command-line argument. If we have a workload &lt;code&gt;fibonacci.in&lt;&#x2F;code&gt; that contains
line-separated integers, we could profile the program by running:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;bril2json &amp;lt; fibonacci.bril | brilprofile fibonacci.in
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s first take a look at how basic block reordering and function reordering
are done, and then we&#x27;ll run through an example from start to end.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;basic-block-reordering&quot;&gt;Basic Block Reordering&lt;&#x2F;h3&gt;
&lt;p&gt;Basic block reordering uses a weighted control flow graph to determine which
basic blocks should be close to each other. We want basic blocks that frequently
run one after another to live closer together. As a result, it makes sense to
co-locate basic blocks that have higher-weighted edges. To do this, I followed a
greedy approach introduced by &lt;a href=&quot;https:&#x2F;&#x2F;doi.org&#x2F;10.1145&#x2F;93542.93550&quot;&gt;Pettis and
Hansen&lt;&#x2F;a&gt;. At a high level, it works by
coalescing basic blocks into chains and ordering these chains to form the new
block order. Initially, each basic block belongs to its own chain. We define the
&lt;em&gt;source&lt;&#x2F;em&gt; node of a chain as the &lt;em&gt;last&lt;&#x2F;em&gt; node, and the &lt;em&gt;target&lt;&#x2F;em&gt; node as the
&lt;em&gt;first&lt;&#x2F;em&gt; node. Note that initially every block is a source and a target. Then,
the algorithm works as follows.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Iterate over edges in decreasing order of weight.&lt;&#x2F;li&gt;
&lt;li&gt;During each iteration, if the tail of the edge is a source and the head is a
target, then combine the two chains into one chain.&lt;&#x2F;li&gt;
&lt;li&gt;Continue until no more chains can be coalesced.&lt;&#x2F;li&gt;
&lt;li&gt;Order the remaining chains based on frequency of execution. More frequently
run chains should be placed higher in the function.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;One exception to the above is that the first block in a function must remain the
first block in a function since it is the function&#x27;s entry point. As a result,
it must always be a target node, and its chain must be explicitly placed at the
top of the function.&lt;&#x2F;p&gt;
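&lt;p&gt;The steps above, including the entry-block exception, can be sketched in
Python as follows. This is a minimal sketch rather than the project&#x27;s actual
implementation: the name &lt;code&gt;reorder_blocks&lt;&#x2F;code&gt;, the &lt;code&gt;(src, dst, count)&lt;&#x2F;code&gt; edge tuples, and
the choice to rank leftover chains by the sum of their blocks&#x27; incoming edge
counts are all assumptions made for illustration. The edge counts in the usage
example are illustrative too.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def reorder_blocks(blocks, edges, entry):
    """Greedy chain-based block ordering (sketch).

    blocks: block labels in their original order
    edges:  list of (src, dst, count) tuples from the weighted CFG
    entry:  label of the function's entry block
    """
    # Every block starts out in its own chain.
    chain_of = {b: [b] for b in blocks}

    # Visit edges from heaviest to lightest.
    for src, dst, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        left, right = chain_of[src], chain_of[dst]
        if left is right or dst == entry:
            continue  # already merged, or the merge would displace the entry
        # Merge only if src ends its chain and dst starts its chain.
        if left[-1] == src and right[0] == dst:
            left.extend(right)
            for b in right:
                chain_of[b] = left

    # Assumed hotness of a block: total weight of its incoming edges.
    heat = {b: 0 for b in blocks}
    for _, dst, count in edges:
        heat[dst] += count

    # Collect the distinct chains that remain.
    chains = []
    for b in blocks:
        if all(chain_of[b] is not c for c in chains):
            chains.append(chain_of[b])

    # The entry chain goes first, then the rest from hottest to coldest.
    first = chain_of[entry]
    rest = [c for c in chains if c is not first]
    rest.sort(key=lambda c: sum(heat[b] for b in c), reverse=True)

    order = list(first)
    for c in rest:
        order.extend(c)
    return order


# Illustrative counts for a function with entry block b1 and a hot loop.
edges = [("b1", "start", 10), ("start", "cont", 82),
         ("cont", "start", 82), ("start", "end", 10)]
print(reorder_blocks(["b1", "start", "end", "cont"], edges, "b1"))
# prints ['b1', 'start', 'cont', 'end']
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Skipping every edge that points into the entry block is one simple way to keep
that block at the head of its chain. Other tie-breaking and ranking choices are
possible.&lt;&#x2F;p&gt;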
&lt;h3 id=&quot;function-reordering&quot;&gt;Function Reordering&lt;&#x2F;h3&gt;
&lt;p&gt;Function reordering uses a weighted call graph to determine an improved function
order. The approach is similar to that of basic block reordering. I followed the
approach laid out by &lt;a href=&quot;https:&#x2F;&#x2F;research.fb.com&#x2F;publications&#x2F;optimizing-function-placement-for-large-scale-data-center-applications-2&#x2F;&quot;&gt;Ottoni and
Maher&lt;&#x2F;a&gt;,
with some modifications to account for the limitations of the Bril interpreter.
Function reordering works as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;For each node, we assign a &amp;quot;hotness&amp;quot; metric, defined as the sum of its
incoming edge weights.&lt;&#x2F;li&gt;
&lt;li&gt;Iterate over the nodes in decreasing order of hotness.&lt;&#x2F;li&gt;
&lt;li&gt;During each iteration, determine the node&#x27;s highest-weight incoming edge, and
coalesce the node with the source of the edge.
&lt;ul&gt;
&lt;li&gt;Each node has a memory of the order in which nodes are coalesced. For
example, if the edge is (&lt;code&gt;a&lt;&#x2F;code&gt;, &lt;code&gt;b&lt;&#x2F;code&gt;), then after coalescing, the new node
will record that &lt;code&gt;b&lt;&#x2F;code&gt; is ordered after &lt;code&gt;a&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Continue until no edges remain.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The final node&#x27;s order is the function order. The difference in this approach
compared to the one in the paper is that the paper incorporates page sizes in
its ordering. Since we do not have the notion of a page in the Bril interpreter,
I just ignore that. The above algorithm is the same as the one in the paper if
we assume unlimited page sizes.&lt;&#x2F;p&gt;
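&lt;p&gt;A minimal Python sketch of this simplified, page-size-free procedure is shown
below. It is not the project&#x27;s actual code: the name &lt;code&gt;reorder_functions&lt;&#x2F;code&gt;, the
&lt;code&gt;(caller, callee, count)&lt;&#x2F;code&gt; tuples, and the handling of self-calls and
disconnected functions are assumptions made for illustration. The call counts
in the usage example were worked out by hand for a recursive Fibonacci program
run once with an input of 4.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def reorder_functions(functions, calls):
    """Call-graph clustering without page-size limits (sketch).

    functions: function names in their original order
    calls:     list of (caller, callee, count) tuples
    """
    # Hotness of a function: total weight of its incoming call edges.
    hotness = {f: 0 for f in functions}
    for _, callee, count in calls:
        hotness[callee] += count

    # Each function starts in its own cluster; a cluster is an ordered list.
    cluster_of = {f: [f] for f in functions}

    for f in sorted(functions, key=lambda name: hotness[name], reverse=True):
        # Only consider callers that live in a different cluster
        # (this also ignores self-recursive edges).
        incoming = [c for c in calls
                    if c[1] == f and cluster_of[c[0]] is not cluster_of[f]]
        if not incoming:
            continue
        caller = max(incoming, key=lambda c: c[2])[0]
        # Append the callee's cluster after the caller's cluster.
        merged, absorbed = cluster_of[caller], cluster_of[f]
        merged.extend(absorbed)
        for name in absorbed:
            cluster_of[name] = merged

    # Emit clusters in order of first appearance (assumption for
    # functions that never get merged into a single final cluster).
    order = []
    for f in functions:
        if f not in order:
            order.extend(cluster_of[f])
    return order


calls = [("main", "fib", 1), ("fib", "fib", 8), ("fib", "le_one", 9)]
print(reorder_functions(["main", "le_one", "fib"], calls))
# prints ['main', 'fib', 'le_one']
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Because clusters are only ever appended, a callee always ends up somewhere
after its chosen caller, regardless of how long the caller is or where in the
caller the call site sits.&lt;&#x2F;p&gt;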
&lt;h3 id=&quot;an-example&quot;&gt;An example&lt;&#x2F;h3&gt;
&lt;p&gt;To get a better idea of how these optimizations work, let&#x27;s run through an
example from start to end. The following Bril program takes in two command-line
arguments &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt;, and computes &lt;code&gt;exp(a, b)&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;multiply a b : int {
  zero: int = const 0;
  one: int = const 1;
  curr: int = id zero;
start:
  b_zero: bool = eq b zero;
  br b_zero end cont;
end:
  ret curr;
cont:
  curr: int = add curr a;
  b: int = sub b one;
  jmp start;
}

main base exp {
  zero: int = const 0;
  one: int = const 1;
  val: int = const 1;
start:
  exp_zero: bool = eq exp zero;
  br exp_zero end cont;
end:
  print val;
  ret;
cont:
  val: int = call multiply val base;
  exp: int = sub exp one;
  jmp start;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Below is the workload we will use to profile the program.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;2 4
4 2
5 5
6 6
2 9
3 9
4 10
2 20
2 10
3 7
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, if we run &lt;code&gt;brilprofile&lt;&#x2F;code&gt; on the program with the above workload, we will get
the following call graph.&lt;&#x2F;p&gt;
&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;call-graph.png&quot; style=&quot;width: 95px&quot;&gt;
&lt;&#x2F;p&gt;
&lt;p&gt;The function reordering algorithm will place &lt;code&gt;main&lt;&#x2F;code&gt; before &lt;code&gt;multiply&lt;&#x2F;code&gt;, since
&lt;code&gt;main&lt;&#x2F;code&gt; calls &lt;code&gt;multiply&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Below is the basic block graph for &lt;code&gt;main&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;main-block-graph.png&quot; style=&quot;width: 450px&quot;&gt;
&lt;&#x2F;p&gt;
&lt;p&gt;The basic block reordering algorithm will first coalesce &lt;code&gt;start&lt;&#x2F;code&gt; and &lt;code&gt;cont&lt;&#x2F;code&gt; and
then prepend &lt;code&gt;b1&lt;&#x2F;code&gt; to that chain. Since &lt;code&gt;end&lt;&#x2F;code&gt; cannot be added to this chain, it
will remain on its own. Combining the chains will lead to a final block order of
&lt;code&gt;b1&lt;&#x2F;code&gt;, &lt;code&gt;start&lt;&#x2F;code&gt;, &lt;code&gt;cont&lt;&#x2F;code&gt;, and &lt;code&gt;end&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Below is the basic block graph for &lt;code&gt;multiply&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;multiply-block-graph.png&quot; style=&quot;width: 450px&quot;&gt;
&lt;&#x2F;p&gt;
&lt;p&gt;This is very similar to the weighted CFG for &lt;code&gt;main&lt;&#x2F;code&gt;. The output block order is
the same: &lt;code&gt;b1&lt;&#x2F;code&gt;, &lt;code&gt;start&lt;&#x2F;code&gt;, &lt;code&gt;cont&lt;&#x2F;code&gt;, and &lt;code&gt;end&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Putting all this together, the optimized Bril program is below (after running
&lt;em&gt;both&lt;&#x2F;em&gt; function reordering and block reordering).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  zero: int = const 0;
  one: int = const 1;
  val: int = const 1;
start:
  exp_zero: bool = eq exp zero;
  br exp_zero end cont;
cont:
  val: int = call multiply val base;
  exp: int = sub exp one;
  jmp start;
end:
  print val;
  ret ;
}
multiply {
  zero: int = const 0;
  one: int = const 1;
  curr: int = id zero;
start:
  b_zero: bool = eq b zero;
  br b_zero end cont;
cont:
  curr: int = add curr a;
  b: int = sub b one;
  jmp start;
end:
  ret curr;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When testing, I found the above program to have 29% fewer instruction pointer
jumps than the unoptimized program when run on the same workload. We will now
explore the evaluation of these optimizations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;I evaluated these optimizations by comparing the total number of instruction
pointer jumps for the unoptimized and optimized programs on a given workload. I
wrote six Bril programs and ran each with three workloads. On average, across
all programs and workloads, function reordering decreased the total number of
instruction pointer jumps by 7.7% with a standard deviation of 36%. Basic block
reordering decreased the number of instruction pointer jumps by approximately
19%, with a standard deviation of 12%. Since the two optimizations do not
interfere with each other, I also evaluated the combination of function
reordering and basic block reordering, and found an average instruction pointer
jump decrease of 12%, with a standard deviation of 35%. The high standard
deviation of function reordering indicates that its performance is quite
inconsistent and that the reordering algorithm has some serious flaws. I discuss
this near the end of this section.&lt;&#x2F;p&gt;
&lt;p&gt;To conduct a thorough and rigorous evaluation of these optimizations, I ran them
on numerous benchmark programs with various workloads. My evaluation strategy
for a specific program was the following:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Profile the original program with a sample workload.&lt;&#x2F;li&gt;
&lt;li&gt;Run the function reordering optimization and profile with the same workload.&lt;&#x2F;li&gt;
&lt;li&gt;Repeat step 2 but with basic block reordering.&lt;&#x2F;li&gt;
&lt;li&gt;Repeat step 2 but run both optimizations one after another. Note that since
the optimizations do not interfere, the order in which we apply the
optimizations does not matter.&lt;&#x2F;li&gt;
&lt;li&gt;Repeat for all workloads.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;I considered testing the optimized programs with a different &amp;quot;testing&amp;quot; workload
but decided that this would not fairly evaluate the optimizations. Since we make
the assumption that sample workloads are representative of real-world workloads,
we should stick to that assumption.&lt;&#x2F;p&gt;
&lt;p&gt;Below is a graph showing the results of function and block reordering for six
representative programs on three different workloads. Due to the lack of
existing Bril programs to benchmark, I wrote the programs and their workloads.
Each bar represents the corresponding program&#x27;s average &lt;em&gt;normalized&lt;&#x2F;em&gt; number of
instruction pointer jumps for the workloads, and the error bars represent the
standard error. To normalize the instruction pointer jumps, I divided by the
unoptimized program&#x27;s number of instruction pointer jumps. This makes it easier
to compare results for different workloads, as larger workloads will naturally
have more total instruction pointer jumps. The last two programs do not contain
any functions. Programs and workloads can be found
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Blue9&#x2F;bril&#x2F;tree&#x2F;project-2&#x2F;workload&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p align=&quot;center&quot;&gt;
  &lt;img src=&quot;code-layout-evaluation.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;&#x2F;p&gt;
&lt;p&gt;Note that the error bars for most programs are very small. The &amp;quot;Loop&amp;quot; test had
quite large error bars because one of the workloads ran only one iteration of
the loop, which made its branching behavior quite different from that of the
other workloads. For this workload, block reordering only gave a 0.6%
increase in performance. This is important to note because it shows how the
performance of the code layout optimizations is sensitive to the workloads.&lt;&#x2F;p&gt;
&lt;p&gt;From the above, we can see that the basic block reordering optimization
consistently decreases the number of instruction pointer jumps by 10-20% (with
the exception of the first test, which could not benefit from branch
reordering), but the function reordering optimization is inconsistent. For
&amp;quot;Fib&amp;quot;, the function reordering optimization increased the number of instruction
pointer jumps by approximately 59% on average. Below is the &amp;quot;Fib&amp;quot; benchmark.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main (n:int) {
  v1: int = call fib n;
  print v1;
}
le_one (n: int): bool {
  one: int = const 1;
  lto: bool = le n one;
  ret lto;
}
fib n: int {
  base: bool = call le_one n;
  br base return continue;
return:
  ret n;
continue:
  one: int = const 1;
  prev: int = sub n one;
  prev2: int = sub prev one;
  fib1: int = call fib prev;
  fib2: int = call fib prev2;
  ans: int = add fib1 fib2;
  ret ans;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And below is the &lt;code&gt;optimized&lt;&#x2F;code&gt; program returned by the function reordering
algorithm.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main (n:int) {
  v1: int = call fib n;
  print v1;
}
fib n: int {
  base: bool = call le_one n;
  br base return continue;
return:
  ret n;
continue:
  one: int = const 1;
  prev: int = sub n one;
  prev2: int = sub prev one;
  fib1: int = call fib prev;
  fib2: int = call fib prev2;
  ans: int = add fib1 fib2;
  ret ans;
}
le_one (n: int): bool {
  one: int = const 1;
  lto: bool = le n one;
  ret lto;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Intuitively, this reordering makes sense. &lt;code&gt;main&lt;&#x2F;code&gt; calls &lt;code&gt;fib&lt;&#x2F;code&gt;, and &lt;code&gt;fib&lt;&#x2F;code&gt; calls
&lt;code&gt;le_one&lt;&#x2F;code&gt;. However, note the location in &lt;code&gt;fib&lt;&#x2F;code&gt; where &lt;code&gt;le_one&lt;&#x2F;code&gt; is called. It is at
the very top of the function—the first line in fact. As a result, when this call
is made, the instruction pointer has to go down the entirety of &lt;code&gt;fib&lt;&#x2F;code&gt; to get to
&lt;code&gt;le_one&lt;&#x2F;code&gt;, and then it has to go all the way back when &lt;code&gt;le_one&lt;&#x2F;code&gt; returns. In the
original program, the instruction pointer only had to traverse the length of
&lt;code&gt;le_one&lt;&#x2F;code&gt;, which is considerably shorter than &lt;code&gt;fib&lt;&#x2F;code&gt;. This demonstrates one of the
core weaknesses of this function reordering algorithm. It assumes that all
functions are of the same length and that on average, function calls are made
halfway through a function. However, in real programs, this is rarely the case.
I think it would be interesting to explore this further and see how we could
incorporate the size of functions as well as the location of function calls to
improve function ordering.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In conclusion, basic block reordering consistently decreased the total number of
instruction pointer jumps, indicating that it improved code locality. On the
other hand, function reordering did not always improve code locality, and in one
case, increased the number of instruction pointer jumps by almost 60%. It would
be interesting to investigate how function reordering could be improved to
account for function lengths and function call locations. Ultimately, I found
implementing these optimizations to be very interesting as they are quite
different from classic optimizations.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Sparse Conditional Constant Propagation</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/sccp/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/sccp/</guid>
                <description>&lt;p&gt;The goal of this project was to implement the sparse conditional constant
propagation optimization as a pass on programs in the intermediate language
&lt;a href=&quot;https:&#x2F;&#x2F;capra.cs.cornell.edu&#x2F;bril&#x2F;&quot;&gt;Bril&lt;&#x2F;a&gt;. Sparse conditional constant propagation is a compiler optimization
that detects variables and expressions in a program that will always evaluate to
a fixed value, and computes their values at compile time rather than at runtime.
It is set apart from traditional constant propagation by its reliance on &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;static
single assignment (SSA) form&lt;&#x2F;a&gt; to improve the efficiency of the analysis
(sparse), and by its ability to detect control-flow edges which will never be
executed due to constant branch conditions (conditional).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;&#x2F;h2&gt;
&lt;p&gt;As an example of constant propagation, consider the following block of Bril
code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
    a: int = const 1;
    b: int = add a a;
    cond: bool = const false;
    br cond then else;
then:
    b: int = add b a;
else:
    print b;
&lt;&#x2F;pre&gt;
&lt;p&gt;Any constant propagation analysis would be able to detect that the initial
definition of &lt;code&gt;b&lt;&#x2F;code&gt; will always evaluate to the same value. As such, the addition
operation could be completed at compile time and replaced in the program with
&lt;code&gt;const 2&lt;&#x2F;code&gt;. Simple constant propagation does not make any conclusions about
control flow, and thus would be unable to determine whether or not the
instruction &lt;code&gt;b: int = add b a;&lt;&#x2F;code&gt; will be executed. However, by inspecting the
program, we can see that the branch condition is &lt;code&gt;false&lt;&#x2F;code&gt; and consequently the
instruction is &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Unreachable_code&quot;&gt;dead code&lt;&#x2F;a&gt;. As such, the final value of &lt;code&gt;b&lt;&#x2F;code&gt; that is printed
will always be 2. Conditional constant propagation has this ability to reason
about branch conditions, and thus can optimize the entire block above to the
following three instructions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
    a: int = const 1;
    b: int = const 2;
    print b;
&lt;&#x2F;pre&gt;
&lt;p&gt;In fact, if the variable &lt;code&gt;a&lt;&#x2F;code&gt; is not used later on in the function, its
definition could be removed as well.&lt;&#x2F;p&gt;
&lt;p&gt;The example above is clearly contrived to show the capabilities of conditional
constant propagation. And so, it may not be obvious if and when this
optimization would truly be beneficial for real programs. Why would a program
contain code that will never run? A common answer to this question is that
people often put code that is meant for debugging under conditionals. As such,
when the code is compiled for production, with debugging disabled, all of that
code is unneeded and can be removed through this optimization.&lt;&#x2F;p&gt;
&lt;p&gt;However, I think that a more important motivation for this optimization revolves
around the fact that it is operating on an intermediate language. People would
most likely not write programs directly in Bril, but instead they would write
them in some other, higher-level language and then compile them down to Bril
(and then to assembly). The compilation process could easily produce constant
values that are not fully evaluated or control-flow edges that are never
traversed. Other optimizations, such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Inline_expansion&quot;&gt;function inlining&lt;&#x2F;a&gt;, could produce
opportunities for conditional constant propagation if, for example, the function
arguments are constants. As such, this optimization could have significant
benefits even for programs without any obviously constant expressions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;Sparse conditional constant propagation was introduced by Wegman and Zadeck in
“&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=103136&quot;&gt;Constant Propagation with Conditional Branches&lt;&#x2F;a&gt;” (1991). My
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;anastos&#x2F;bril&#x2F;blob&#x2F;sccp&#x2F;bril-ts&#x2F;sccp.ts&quot;&gt;implementation&lt;&#x2F;a&gt; of the data-flow analysis is based on the description given
in that paper. The analysis works on programs in SSA form, and as such I also
needed to implement transformations on Bril programs to and from SSA form. After
running the analysis, I then needed to actually use the information that it
provides to replace computations with constant values and to eliminate dead code
where possible. I wrote the optimization in TypeScript in order to take
advantage of the pre-existing &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-ts&#x2F;bril.ts&quot;&gt;type definitions&lt;&#x2F;a&gt; for Bril, which are written
in that language. The optimization operates by taking in a Bril program (in the
canonical JSON representation) through standard input, and outputting the
optimized version of the program to standard output.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ssa&quot;&gt;SSA&lt;&#x2F;h3&gt;
&lt;p&gt;After separating out the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Basic_block&quot;&gt;basic blocks&lt;&#x2F;a&gt; of a program and generating a
representation of the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph&quot;&gt;control-flow graph&lt;&#x2F;a&gt; (CFG), the first step in performing
sparse conditional constant propagation is to convert the program into SSA form,
in which each variable is assigned exactly once. To handle cases where
the value of a variable could come from more than one of its definitions, SSA
introduces a φ instruction (e.g., &lt;code&gt;x_2: int = phi x_0 x_1;&lt;&#x2F;code&gt;), which takes as many
arguments as there are in-edges to the block in the CFG, and assigns to the
destination the argument corresponding to the in-edge through which the
block was entered. In order to convert back out of SSA, these φ
instructions must be removed. In general, they can be removed by placing
assignments at the end of the predecessor nodes or along the edges. As an
example, the code block given in the motivation section would require one φ
instruction, as there are two definitions of &lt;code&gt;b&lt;&#x2F;code&gt; that &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Reaching_definition&quot;&gt;reach&lt;&#x2F;a&gt; the print
statement:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
    a: int = const 1;
    b_0: int = add a a;
    cond: bool = const false;
    br cond then else;
then:
    b_1: int = add b_0 a;
else:
    b_2: int = phi b_0 b_1;
    print b_2;
&lt;&#x2F;pre&gt;
&lt;p&gt;The conversion to SSA form is divided into two parts: inserting φ instructions
where necessary, and then renaming the variables to give each definition its own
variable name. The only places where φ instructions might be necessary for a
variable are in blocks on the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dominator_(graph_theory)&quot;&gt;dominance frontier&lt;&#x2F;a&gt; of definitions of the
variable. As such, in order to not add far too many instructions to the program,
I needed to compute the dominator tree of the control-flow graph.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#idom&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; To do
this I used the &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?doid=357062.357071&quot;&gt;Lengauer-Tarjan&lt;&#x2F;a&gt; dominator tree algorithm. The number of φ
instructions could further be reduced by running a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Live_variable_analysis&quot;&gt;live variable analysis&lt;&#x2F;a&gt;
and only inserting them if the variable is live in that block. However, I did not
do this as it does not matter for the effectiveness of the optimization and, I
believe, it would more likely than not make the optimization slower.&lt;&#x2F;p&gt;
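&lt;p&gt;As a rough illustration of where φ instructions can be needed, the sketch below computes dominance frontiers for a tiny CFG. It is not the code from the linked implementation: it uses a simple fixed-point dominator computation instead of Lengauer-Tarjan, and the &lt;code&gt;CFG&lt;&#x2F;code&gt; encoding and all function names are assumptions made for this example. It also assumes every block appears as a key of the CFG.&lt;&#x2F;p&gt;

```typescript
// Hypothetical CFG encoding: block name -> names of its successor blocks.
type CFG = { [block: string]: string[] };
type BlockSets = { [block: string]: string[] };

// dom[b] lists every block that dominates b (including b itself).
// Simple iterative algorithm, not Lengauer-Tarjan.
function dominators(cfg: CFG, entry: string): BlockSets {
  const blocks = Object.keys(cfg);
  const preds: BlockSets = {};
  for (const b of blocks) preds[b] = [];
  for (const b of blocks) {
    for (const s of cfg[b]) preds[s].push(b);
  }

  const dom: BlockSets = {};
  for (const b of blocks) dom[b] = b === entry ? [entry] : blocks.slice();

  let changed = true;
  while (changed) {
    changed = false;
    for (const b of blocks) {
      if (b === entry) continue;
      // Intersect the dominator sets of all predecessors, then add b itself.
      let next = blocks.slice();
      for (const p of preds[b]) {
        next = next.filter((x) => dom[p].indexOf(x) !== -1);
      }
      if (next.indexOf(b) === -1) next.push(b);
      // Sets only ever shrink, so comparing sizes detects any change.
      if (next.length !== dom[b].length) {
        dom[b] = next;
        changed = true;
      }
    }
  }
  return dom;
}

// DF(n): blocks y such that n dominates a predecessor of y
// but does not strictly dominate y itself.
function dominanceFrontier(cfg: CFG, entry: string): BlockSets {
  const dom = dominators(cfg, entry);
  const frontier: BlockSets = {};
  for (const b of Object.keys(cfg)) frontier[b] = [];
  for (const p of Object.keys(cfg)) {
    for (const y of cfg[p]) {
      for (const n of dom[p]) {
        const strictlyDominatesY = n !== y ? dom[y].indexOf(n) !== -1 : false;
        if (strictlyDominatesY) continue;
        if (frontier[n].indexOf(y) === -1) frontier[n].push(y);
      }
    }
  }
  return frontier;
}

// CFG of the SSA example: the entry block branches to "then" or "else",
// and "then" falls through into "else".
const exampleCfg: CFG = { "entry": ["then", "else"], "then": ["else"], "else": [] };
const exampleFrontier = dominanceFrontier(exampleCfg, "entry");
```

&lt;p&gt;For this CFG, the frontier of &lt;code&gt;then&lt;&#x2F;code&gt; is the single block &lt;code&gt;else&lt;&#x2F;code&gt;, which matches the example: the definition of &lt;code&gt;b&lt;&#x2F;code&gt; in &lt;code&gt;then&lt;&#x2F;code&gt; forces a φ instruction at the start of &lt;code&gt;else&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;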
&lt;p&gt;Typically, when converting back out of SSA form, you would need to consider each
SSA variable independently and add assignment statements along the control-flow
edges in order to replace the φ instructions. Converting to and directly back
from SSA would then make a program, in general, less efficient than the
original.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#coalesce&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; However, during implementation I realized that,
specifically for this optimization, the fully general conversion back from SSA
is not necessary. This is because constant propagation only ever decreases the
live ranges of SSA variables, by replacing their uses with constants or removing
dead code. As such, the live ranges of SSA variables that come from the same
original variable will never interfere with each other and can simply be renamed
back to the original variable name. If this SSA conversion were to be used for
other optimizations that do not share this property with constant propagation,
the conversion back would need to be modified.&lt;&#x2F;p&gt;
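&lt;p&gt;A minimal sketch of this simplified conversion is shown below. It assumes, as in the example above, that SSA names are the original name followed by an underscore and a decimal index (&lt;code&gt;b_0&lt;&#x2F;code&gt;, &lt;code&gt;b_1&lt;&#x2F;code&gt;), and that no original variable or label already ends in such a suffix. The &lt;code&gt;Instr&lt;&#x2F;code&gt; type and the function names are hypothetical and much simpler than the real Bril type definitions.&lt;&#x2F;p&gt;

```typescript
// Hypothetical, pared-down instruction shape (not the real Bril typings).
type Instr = { op: string; dest?: string; args?: string[] };

// Assumes SSA names look like "b_0": original name, underscore, decimal index.
function originalName(ssaName: string): string {
  return ssaName.replace(/_\d+$/, "");
}

// Drops phi instructions and renames every SSA variable back to its original.
// This is only sound when live ranges of names from the same original
// variable never interfere, as argued above for constant propagation.
function leaveSSA(instrs: Instr[]): Instr[] {
  const out: Instr[] = [];
  for (const instr of instrs) {
    // After renaming, a phi would read and write the same variable.
    if (instr.op === "phi") continue;
    const renamed: Instr = { ...instr };
    if (instr.dest !== undefined) renamed.dest = originalName(instr.dest);
    if (instr.args !== undefined) renamed.args = instr.args.map(originalName);
    out.push(renamed);
  }
  return out;
}
```

&lt;p&gt;Applied to the tail of the example program, this removes the φ instruction and turns &lt;code&gt;b_1: int = add b_0 a&lt;&#x2F;code&gt; back into an assignment to &lt;code&gt;b&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;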
&lt;h3 id=&quot;constant-propagation&quot;&gt;Constant Propagation&lt;&#x2F;h3&gt;
&lt;p&gt;The primary component of the optimization is the constant propagation analysis
itself. I implemented it according to the worklist algorithm described in the
paper. Doing so was mostly straightforward. However, the paper does not seem to
mention one point in the algorithm where it is necessary to add to the worklist,
which took me a while to figure out.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#visit-phi&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; The output of the analysis
is a mapping from variables to elements of a lattice, which can be ⊤ (the
variable is undefined), ⊥ (the variable is not known to be constant), or a value (the variable is a
constant). Because the program is in SSA form, the analysis only needs one
lattice element per variable, instead of a lattice element for each variable at
each program point. The paper alludes to the fact that a constant propagation
analysis can gain information from control-flow branches. For example, for any
Bril branch, we can conclude that the condition variable is true on the first
out-edge and false on the second. I did not implement this, as it would require
keeping track of multiple lattice positions per variable and would significantly
complicate the analysis.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#branches&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
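&lt;p&gt;To make the lattice concrete, here is a small sketch of its meet operation. The &lt;code&gt;Lattice&lt;&#x2F;code&gt; encoding and the names &lt;code&gt;meet&lt;&#x2F;code&gt; and &lt;code&gt;meetAll&lt;&#x2F;code&gt; are assumptions for illustration, not taken from the linked implementation. In the full algorithm, the value of a φ instruction is the meet over only those arguments whose in-edges are currently marked executable; this sketch ignores executability.&lt;&#x2F;p&gt;

```typescript
// Hypothetical encoding of the three-level lattice.
type Lattice =
  | { kind: "top" }                             // variable is undefined so far
  | { kind: "const"; value: number | boolean }  // variable is a known constant
  | { kind: "bottom" };                         // variable is not a known constant

const TOP: Lattice = { kind: "top" };
const BOTTOM: Lattice = { kind: "bottom" };

// Meet of two lattice elements: top is the identity, bottom absorbs,
// and two constants agree only if their values are equal.
function meet(a: Lattice, b: Lattice): Lattice {
  if (a.kind === "top") return b;
  if (b.kind === "top") return a;
  if (a.kind === "bottom" || b.kind === "bottom") return BOTTOM;
  return a.value === b.value ? a : BOTTOM;
}

// Meet over a list of elements, e.g. the arguments of a phi instruction.
function meetAll(values: Lattice[]): Lattice {
  return values.reduce(meet, TOP);
}
```

&lt;p&gt;Because each variable only ever moves down the lattice (from ⊤ to a constant to ⊥), every variable can change at most twice, which is what bounds the work done by the worklist.&lt;&#x2F;p&gt;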
&lt;p&gt;After completing the analysis, the next step is to use its results to actually
modify the code by replacing computations with constants and removing dead code.
Any block that is unreachable can simply be removed from the CFG. In fact, doing
this can create more dead code to remove if any variable defined elsewhere is
only used in removed blocks. The next step is to replace the expressions in
definitions of variables that are known to be constant with the constant value
itself. Similarly, this could also create more dead code. As such, the last step
is to remove any definitions of variables that have no uses. After this, the
program is converted back out of SSA form, the CFG is flattened back into a
single list of instructions, and the program is output in its standard JSON
form.&lt;&#x2F;p&gt;
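&lt;p&gt;The last two rewriting steps can be sketched roughly as follows, on a flat list of instructions rather than a CFG. The &lt;code&gt;Instr&lt;&#x2F;code&gt; shape, the &lt;code&gt;Constants&lt;&#x2F;code&gt; map and both function names are hypothetical. The sketch also assumes that every instruction with a destination is free of side effects, so that deleting an unused definition is safe.&lt;&#x2F;p&gt;

```typescript
type Value = number | boolean;
// Hypothetical, pared-down instruction shape (not the real Bril typings).
type Instr = { op: string; dest?: string; type?: string; args?: string[]; value?: Value };
// Variables the analysis proved constant, with their values.
type Constants = { [name: string]: Value };

// Step 1: each definition of a known-constant variable becomes a const.
function foldConstants(instrs: Instr[], known: Constants): Instr[] {
  return instrs.map((instr): Instr => {
    const dest = instr.dest;
    if (dest === undefined) return instr;
    const value = known[dest];
    if (value === undefined) return instr;
    const folded: Instr = { ...instr, op: "const", value: value };
    delete folded.args; // a const reads no variables
    return folded;
  });
}

// Step 2: repeatedly drop definitions whose destination is never read.
// Removing one definition can make another unused, hence the loop.
function removeUnused(instrs: Instr[]): Instr[] {
  let current = instrs;
  while (true) {
    const used: { [name: string]: boolean } = {};
    for (const instr of current) {
      for (const arg of instr.args === undefined ? [] : instr.args) used[arg] = true;
    }
    const kept = current.filter(
      (instr) => instr.dest === undefined || used[instr.dest] === true
    );
    if (kept.length === current.length) return kept;
    current = kept;
  }
}
```

&lt;p&gt;For a program that defines &lt;code&gt;a&lt;&#x2F;code&gt; as 1, computes &lt;code&gt;b&lt;&#x2F;code&gt; as &lt;code&gt;add a a&lt;&#x2F;code&gt; and prints &lt;code&gt;b&lt;&#x2F;code&gt;, folding turns the definition of &lt;code&gt;b&lt;&#x2F;code&gt; into a constant, after which &lt;code&gt;a&lt;&#x2F;code&gt; has no uses and is removed.&lt;&#x2F;p&gt;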
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;It has proven difficult to evaluate the effectiveness of this optimization in a
manner that would accurately reflect its utility. Upon running the optimization
on several test cases, I have observed that, for programs that are written
directly by humans in Bril, the optimization tends to exhibit one of two
behaviors: either the program is effectively unchanged, or it is completely
evaluated leaving only assignments of constants to variables, and print
statements with those variables. By running these test cases, however, I have
been able to ensure that the optimization does not change the behavior of
well-typed Bril programs.&lt;&#x2F;p&gt;
&lt;p&gt;For example, of the test cases provided in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;xu3kev&#x2F;bril-benchmark&#x2F;tree&#x2F;master&quot;&gt;bril-benchmark&lt;&#x2F;a&gt; repository,
two (&lt;code&gt;factorial&lt;&#x2F;code&gt; and &lt;code&gt;fibonacci&lt;&#x2F;code&gt;) are completely unchanged by the optimization
modulo the order of basic blocks, and the other two (&lt;code&gt;matrix_mul&lt;&#x2F;code&gt; and
&lt;code&gt;poly_mul&lt;&#x2F;code&gt;) are completely evaluated to just constant assignments and print
statements. The factor that seems to separate these two classes of programs is
the presence or absence of loops. This is because, if there are no loops, then
every assignment statement occurs no more than once. As such, because vanilla
Bril has no channels through which data can enter a function from the outside,
every variable&#x27;s value can be determined through the conditional constant
propagation analysis. When loops are involved, variables&#x27; values change between
iterations, and as such the analysis is unable to determine a constant value for
the variables.&lt;&#x2F;p&gt;
&lt;p&gt;I wrote a few programs in TypeScript, and used the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-ts&#x2F;ts2bril.ts&quot;&gt;&lt;code&gt;ts2bril&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; compiler to
convert them to Bril. These converted programs are qualitatively different from
the handwritten programs, as the compiler inserts a lot of short-lived variables
in order to translate constructs from the TypeScript language. Due to the high
number of redundant variables, I believe a more effective optimization for
working on programs outputted by this compiler would be &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Copy_propagation&quot;&gt;copy propagation&lt;&#x2F;a&gt;.
For a TypeScript program that prints the first 20 Fibonacci numbers, the
constant propagation optimization had the following effect:&lt;&#x2F;p&gt;
&lt;div align=&quot;center&quot;&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Instruction Count&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;Unoptimized&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;Optimized&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;Change&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Static&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;26&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;23&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;–11.5%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Dynamic&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;410&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;388&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;–5.4%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;&#x2F;div&gt;
&lt;p&gt;For this program, the optimization removed only three instructions, one of which
was in the loop body (an extraneous assignment to a variable that was never
used). In general, it is difficult to judge the effectiveness of this
optimization without the infrastructure of other optimizations and compilers to
pair it with. In the future, if someone were to implement, say, a function
inlining optimization, and were to extend this optimization to support function
calls, I would be interested to see the effect it would have.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;idom&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;For the renaming step of the SSA conversion it was also necessary to
compute the immediate dominator of a node. As such the dominator tree was
computed instead of just the dominance relation.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;coalesce&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;The efficiency lost by adding assignments in the transformation
back from SSA can be regained through &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_allocation#Coalescing&quot;&gt;move coalescing&lt;&#x2F;a&gt; during register
allocation.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;visit-phi&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;In the function that the paper calls &lt;em&gt;Visit-φ&lt;&#x2F;em&gt;, if the lattice
position of the variable changes, you must add all uses of the variable to the
worklist.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;branches&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;You could imagine gaining a lot of information from looking at
branches in this way. For instance, in the following case, you could conclude
that &lt;code&gt;i&lt;&#x2F;code&gt; is 5 on the true edge of the branch:&lt;&#x2F;p&gt;
&lt;pre&gt;
    ...
    b: bool = eq i 5;
    c: bool = and a b;
    br c foo bar;
&lt;&#x2F;pre&gt;
&lt;&#x2F;div&gt;
</description>
            </item>
        
            <item>
                <title>Tail Call Elimination</title>
                <pubDate>Wed, 23 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/tail-call-elimination/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/tail-call-elimination/</guid>
                <description>&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of this project is to implement tail call elimination in Bril. A tail
call is a call to a function whose value is immediately returned. For example,
&lt;code&gt;return foo()&lt;&#x2F;code&gt; is a tail call. Whenever we have a tail call in our Bril program,
we do not need to create a new stack frame for it. We can simply &amp;quot;overwrite&amp;quot;
the current stack frame since we won&#x27;t need any of the values in the frame anymore.
This is crucial for programming languages that use &lt;em&gt;tail recursion&lt;&#x2F;em&gt; as a
programming idiom (e.g., OCaml, Haskell, etc.). Consider the following (somewhat
contrived) TypeScript program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;function loop(n: number) {
  if (n == 0) {
    return 0;
  } else {
    return loop(n-1);
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This function simply loops &lt;code&gt;n&lt;&#x2F;code&gt; times, but does so in a functional way. Without
tail call elimination, programs like this would overflow the stack for large values
of &lt;code&gt;n&lt;&#x2F;code&gt;. For languages that depend on this idiom, this is unacceptable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;The base Bril language is not rich enough to express tail call elimination, or
even function calls in general. To enrich the language, we can make Bril more
closely resemble something like x86 assembly. Intuitively, all Bril variables
implicitly live on the stack. To pass arguments in a function call, we must
explicitly push them onto the stack. This is akin to pushing values on the stack
in assembly. When a function returns, those arguments are implicitly popped off the stack.
Additionally, the return value of the function is obtained using a special
keyword that is akin to getting the return value from &lt;code&gt;rax&lt;&#x2F;code&gt; according to the
&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;X86_calling_conventions#x86-64_calling_conventions&quot;&gt;System V Calling Conventions&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Importantly, we need the capability to jump to other functions, rather than just
labels in the current function. This way, if we have a tail call, then we can jump to
the beginning of the callee instead of creating a new stack frame. This is &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs4120&#x2F;2019sp&#x2F;lectures&#x2F;34functional&#x2F;lec34-sp19.pdf?1556325712&quot;&gt;how
tail call elimination is implemented&lt;&#x2F;a&gt;
in x86 assembly.&lt;&#x2F;p&gt;
&lt;p&gt;We have to modify the grammar to support features like pushing onto the stack
and defining functions, then update the interpreter. Thanks to the work done by
Alexa and Greg for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;pull&#x2F;16&quot;&gt;Project 1&lt;&#x2F;a&gt;, these
changes were much easier for me to make.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;modifying-the-grammar&quot;&gt;Modifying the Grammar&lt;&#x2F;h3&gt;
&lt;p&gt;The first step is to modify the grammar to support function declarations with
arguments and return types, as well as support the various new value&#x2F;effect
operations. Functions look as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int foo(x: int, b: bool) {
  print x;
  print b;
  ...
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Functions must specify a return type, a name, and a potentially empty list of
typed arguments.&lt;&#x2F;p&gt;
&lt;p&gt;Here are the new effect operations and their semantics:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;push arg1 ... argn&lt;&#x2F;code&gt;: push arguments to the stack, which can be used by the
first function to be &lt;code&gt;call&lt;&#x2F;code&gt;&#x27;d after this instruction.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;call foo&lt;&#x2F;code&gt;: starts executing instructions defined by the function &lt;code&gt;foo&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here are the new value operations and their semantics:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;retval&lt;&#x2F;code&gt;: Retrieve the return value of the previous function call, e.g.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;...
call foo;
r: int = retval;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, if &lt;code&gt;foo&lt;&#x2F;code&gt; returned 0, then &lt;code&gt;r&lt;&#x2F;code&gt; would have the value 0.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;extending-the-interpreter&quot;&gt;Extending the Interpreter&lt;&#x2F;h3&gt;
&lt;p&gt;The interpreter needed to be extended to implement the above semantics. To model
a stack frame, I explicitly keep track of the program counter (i.e., which
instruction is being interpreted), the name of the current function, and an
environment. I made this explicit because it made jumping to other functions
easier to implement. The interpreter simply tries to evaluate the frame at the
top of the stack until the stack is empty.&lt;&#x2F;p&gt;
&lt;p&gt;Arguments that are &lt;code&gt;push&lt;&#x2F;code&gt;ed stay on the current stack frame. Once a &lt;code&gt;call&lt;&#x2F;code&gt; is
made, we create a new stack frame. Note that the names of the &lt;code&gt;push&lt;&#x2F;code&gt;ed arguments
don&#x27;t necessarily match those declared by the function, so we need to map the
function&#x27;s arguments to the values of the &lt;code&gt;push&lt;&#x2F;code&gt;ed arguments.&lt;&#x2F;p&gt;
&lt;p&gt;To return a value, a special variable name is set in the environment in the
previous stack frame so it can be retrieved by the caller using &lt;code&gt;retval&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
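&lt;p&gt;The following toy model, which is not the actual interpreter, shows why replacing a tail call with a jump keeps the frame stack flat. It simulates the &lt;code&gt;loop&lt;&#x2F;code&gt; function from the overview with an explicit stack of frames. The &lt;code&gt;Frame&lt;&#x2F;code&gt; layout, the two program-counter values and the function name &lt;code&gt;peakStackDepth&lt;&#x2F;code&gt; are assumptions for this example, and &lt;code&gt;n&lt;&#x2F;code&gt; is assumed to be a non-negative integer.&lt;&#x2F;p&gt;

```typescript
// Hypothetical frame layout: current function, program counter, environment.
type Frame = { func: string; pc: number; env: { [name: string]: number } };

// Simulates loop(n) and reports the deepest the frame stack gets.
// pc 0: test n and possibly recurse; pc 1: the recursive call has returned.
function peakStackDepth(n: number, eliminateTailCalls: boolean): number {
  const stack: Frame[] = [{ func: "loop", pc: 0, env: { n: n } }];
  let peak = 0;
  while (stack.length > 0) {
    peak = Math.max(peak, stack.length);
    const top = stack[stack.length - 1];
    if (top === undefined) break;
    const current = top.env["n"];
    if (current === undefined) break;
    if (top.pc === 1 || current === 0) {
      // Base case reached, or the callee has returned: return to the caller.
      stack.pop();
    } else if (eliminateTailCalls) {
      // Tail call as a jump: reuse the current frame for the callee.
      top.env = { n: current - 1 };
      top.pc = 0;
    } else {
      // Ordinary call: remember where to resume, then push a fresh frame.
      top.pc = 1;
      stack.push({ func: "loop", pc: 0, env: { n: current - 1 } });
    }
  }
  return peak;
}
```

&lt;p&gt;In this model, &lt;code&gt;peakStackDepth(1000, false)&lt;&#x2F;code&gt; reaches a depth of 1001 frames, while &lt;code&gt;peakStackDepth(1000, true)&lt;&#x2F;code&gt; never holds more than one frame.&lt;&#x2F;p&gt;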
&lt;h3 id=&quot;extending-the-typescript-frontend&quot;&gt;Extending the TypeScript Frontend&lt;&#x2F;h3&gt;
&lt;p&gt;For the majority of this, I referred to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;pull&#x2F;16&quot;&gt;Alexa and Greg&#x27;s implementation&lt;&#x2F;a&gt;.
There were some differences however because I have separate instructions for
passing arguments to a function. So, whenever a function call was found in the
AST, the arguments needed to be converted to Bril instructions first, then those
would be used in a &lt;code&gt;push&lt;&#x2F;code&gt;. Additionally, a &lt;code&gt;retval&lt;&#x2F;code&gt; would need to be created
afterwards if the result of the function call would be used.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;identifying-and-eliminating-tail-calls&quot;&gt;Identifying and Eliminating Tail Calls&lt;&#x2F;h3&gt;
&lt;p&gt;The simple definition of a &lt;em&gt;tail call&lt;&#x2F;em&gt; would be an immediate return of a call
to a function. The translation from the TypeScript frontend of something like&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;return foo(n)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;to Bril would be&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;push n;
call foo;
v: int = retval;
ret v;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Thus we just need to look for &lt;code&gt;call&lt;&#x2F;code&gt;s that are immediately followed by a &lt;code&gt;ret&lt;&#x2F;code&gt;,
optionally with a single &lt;code&gt;retval&lt;&#x2F;code&gt; in between whose result is the value returned.&lt;&#x2F;p&gt;
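&lt;p&gt;A sketch of this pattern match on a flat instruction list might look like the following. The &lt;code&gt;Instr&lt;&#x2F;code&gt; shape is a hypothetical JSON-like encoding in which &lt;code&gt;call&lt;&#x2F;code&gt; carries the callee name as its only argument, and the function names are made up for this example. When no &lt;code&gt;retval&lt;&#x2F;code&gt; is present, the sketch additionally requires the &lt;code&gt;ret&lt;&#x2F;code&gt; to return nothing, which is an assumption on my part.&lt;&#x2F;p&gt;

```typescript
// Hypothetical, pared-down instruction shape (not the real Bril typings).
type Instr = { op: string; dest?: string; args?: string[] };

// True if instrs[i] is a call whose result (if read at all) is returned
// immediately: call, then optionally retval, then ret.
function isSimpleTailCall(instrs: Instr[], i: number): boolean {
  const call = instrs[i];
  if (call === undefined || call.op !== "call") return false;
  let next = instrs[i + 1];
  let result: string | undefined = undefined;
  if (next !== undefined) {
    if (next.op === "retval") {
      result = next.dest;
      next = instrs[i + 2];
    }
  }
  if (next === undefined || next.op !== "ret") return false;
  const returned = next.args === undefined ? [] : next.args;
  // Without a retval, only a bare ret qualifies.
  if (result === undefined) return returned.length === 0;
  // With a retval, the ret must return exactly that variable.
  return returned.length === 1 ? returned[0] === result : false;
}

// Replaces each simple tail call with a jmp to the callee. The now-unused
// retval and ret are left in place for a later dead code elimination pass.
function eliminateSimpleTailCalls(instrs: Instr[]): Instr[] {
  return instrs.map((instr, i) =>
    isSimpleTailCall(instrs, i) ? { ...instr, op: "jmp" } : instr
  );
}
```

&lt;p&gt;Run on the four-instruction translation of &lt;code&gt;return foo(n)&lt;&#x2F;code&gt; shown above, this leaves the &lt;code&gt;push&lt;&#x2F;code&gt; untouched and turns the &lt;code&gt;call&lt;&#x2F;code&gt; into a &lt;code&gt;jmp&lt;&#x2F;code&gt; to &lt;code&gt;foo&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;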
&lt;p&gt;This doesn&#x27;t take into account more complex cases where there isn&#x27;t an explicit
return of a function call, but the value returned comes from a call to the
same function from different branches. For example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;function foo(n: number): number {
  ...
  if (b) {
    result = foo(n-1);
  } else {
    result = foo(n-2);
  }
  return result;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To handle these cases, we first run a global copy propagation pass. The dataflow analysis for
copy propagation that I used can be found &lt;a href=&quot;http:&#x2F;&#x2F;www.csd.uwo.ca&#x2F;%7Emoreno&#x2F;CS447&#x2F;Lectures&#x2F;CodeOptimization.html&#x2F;node8.html&quot;&gt;here&lt;&#x2F;a&gt;.
Then, for a value &lt;code&gt;v&lt;&#x2F;code&gt; that is returned, we search backwards through the CFG until we find a &lt;code&gt;retval&lt;&#x2F;code&gt; that
corresponds to &lt;code&gt;v&lt;&#x2F;code&gt;, checking that &lt;code&gt;v&lt;&#x2F;code&gt; has not been modified along any of
these backwards paths and that each &lt;code&gt;retval&lt;&#x2F;code&gt; comes from a call to the same function.
For the above example, the corresponding Bril code looks something like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;then.6:
  ...
  call foo;
  v17: int = retval ;
  result: int = id v17;
  jmp endif.6;
else.6:
  ...
  call foo;
  v21: int = retval ;
  result: int = id v21;
endif.6:
  v22: int = id result;
  ret v22;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After copy propagation, it looks like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;then.6:
  ...
  call foo;
  result: int = retval;
  jmp endif.6;
else.6:
  ...
  call foo;
  result: int = retval;
endif.6:
  ret result;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then we can analyze the CFG backwards to see that indeed we can replace
the calls with &lt;code&gt;jmp&lt;&#x2F;code&gt; instructions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;then.6:
  ...
  jmp foo;
  jmp endif.6;
else.6:
  ...
  jmp foo;
endif.6:
  ret result;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the extra instructions can simply be removed by a DCE pass, so we don&#x27;t
worry about that.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Unfortunately, I couldn&#x27;t get this to work properly because my copy propagation
pass had bugs.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To evaluate that the tail call elimination is working and actually gives us an
improvement, we benchmark some recursive functions that use tail recursion, and
show the difference in execution time and memory usage between an optimized and
unoptimized Bril program.&lt;&#x2F;p&gt;
&lt;p&gt;The table entries show how much change was observed, as a percentage, by doing
tail call elimination (TCE). For example, an entry of -10% means the optimized program
used 10% less memory&#x2F;time than the unoptimized program. An &lt;code&gt;X&lt;&#x2F;code&gt; means that the output of the program was too
big to handle. &lt;code&gt;n&lt;&#x2F;code&gt; is the argument passed to the recursive function.
&lt;code&gt;loop&lt;&#x2F;code&gt; is a Bril program that simply loops &lt;code&gt;n&lt;&#x2F;code&gt; times using recursion. &lt;code&gt;factorial&lt;&#x2F;code&gt;
is a tail recursive implementation of factorials. &lt;code&gt;mutual_rec&lt;&#x2F;code&gt; is a program that
checks whether a number is even or odd in a mutually recursive way. The code for
these can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;pull&#x2F;37&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Percentage Change in Memory Usage Using TCE&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 1&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 100&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 10000&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 100000&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;loop&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+0.1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-2.7%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-31.1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-79.5%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;factorial&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+0.2%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-2.5%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-79.1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;X&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;mutual_rec&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+0.2%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+13.8%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-31.2%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-78.4%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;&lt;strong&gt;Percentage Change in Execution Time Using TCE&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 1&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 100&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 10000&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;n = 100000&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;loop&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+2.1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+2.2%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-10%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-29.8%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;factorial&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+1.1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+5.5%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-1.6%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;X&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;mutual_rec&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+1.1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;-1%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+5.7%&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;+5.07%&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;To get the execution time and peak memory usage, I use &lt;code&gt;&#x2F;usr&#x2F;bin&#x2F;time -l&lt;&#x2F;code&gt; (which prints the contents of rusage).
To make sure the measurements are meaningful, I chose a maximum &lt;code&gt;n&lt;&#x2F;code&gt; value so
that the tests took a few seconds. Here we can clearly see that with large values
of &lt;code&gt;n&lt;&#x2F;code&gt;, the programs with TCE use considerably less memory. However, it is unclear
whether there is a benefit to the execution time of the program since the values
vary quite a bit.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hardest-parts-to-get-right&quot;&gt;Hardest Parts to Get Right&lt;&#x2F;h2&gt;
&lt;p&gt;Finding the right level of abstraction for the IR was difficult. I decided to
make it closely resemble x86 because that is familiar and is what matched the
theory the most. The other difficult part was eliminating tail calls that
weren&#x27;t as simple as just &lt;code&gt;return foo()&lt;&#x2F;code&gt;. This required other optimizations and
careful consideration to make sure that indeed the function call could actually
be optimized to just a jump.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Threads Cannot Be Implemented as Libraries</title>
                <pubDate>Mon, 21 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/threads-not-as-libraries/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/threads-not-as-libraries/</guid>
                <description>&lt;p&gt;Attempts have been made to append thread support onto languages that lack thread semantics via a library that is paired with an informal thread semantics. 
A thread library can provide functions for creating and deleting threads and interacting with mutex locks.
Effectively, this introduces threads into a host language that remains oblivious to their presence.&lt;&#x2F;p&gt;
&lt;p&gt;Due to this obliviousness to threads, compilers may perform optimizations that inadvertently change the behavior of a multi-threaded program with respect to the thread library&#x27;s specification.
In other words, thread-oblivious compilers may perform optimizations that preserve &amp;quot;single-threaded&amp;quot; behavior without additionally preserving the thread library&#x27;s notion of &amp;quot;multi-threaded&amp;quot; behavior.
Any consumer of a thread library will necessarily depend on special compiler support, the correctness of which isn&#x27;t enforced by the language specification.
It is in this sense that &amp;quot;threads cannot be implemented as a library&amp;quot;; rather, they must be implemented in the language specification.&lt;&#x2F;p&gt;
&lt;p&gt;This paper examines the case of the widely used C&#x2F;C++ Pthreads library during the time of its writing (2004), demonstrating three kinds of compiler optimizations that can break valid Pthreads programs.&lt;&#x2F;p&gt;
&lt;p&gt;Afterwards, the author argues that languages with threads should additionally define the behavior of lock-free programs with data races, rather than only programs without data races.
In particular, they compare running lock-free algorithms with and without locks to demonstrate the cost of locks using two examples: an implementation of the Sieve of Eratosthenes and a tracing garbage collector.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the author comments on ongoing efforts towards adding a formal thread model to the C++ standard based on the Java Memory Model.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;pthreads-pre-2005-threads-implemented-as-a-library&quot;&gt;Pthreads pre 2005: Threads Implemented as a Library&lt;&#x2F;h2&gt;
&lt;p&gt;At the time of writing of this paper, Pthreads was not formally part of the specification of C&#x2F;C++.
Rather, Pthreads specified thread behavior informally and separately from the C standard.
At the time, the Pthreads specification of concurrent thread semantics was as follows:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Applications shall ensure that access to any &lt;strong&gt;memory location&lt;&#x2F;strong&gt; by more than one thread of control (threads or processes) is restricted such that &lt;em&gt;&lt;strong&gt;no thread of control can read or modify a memory location while another thread of control may be modifying it&lt;&#x2F;strong&gt;&lt;&#x2F;em&gt;.
Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:&lt;&#x2F;p&gt;
&lt;p&gt;pthread_mutex_lock(),
pthread_mutex_unlock(),
[Many other synchronization functions listed]&amp;quot;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;According to this Pthreads standard, a threaded program is well-defined if it lacks &lt;strong&gt;data races&lt;&#x2F;strong&gt;.
A data race occurs when two threads concurrently operate on a memory location and at least one of these operations modifies its contents.&lt;&#x2F;p&gt;
&lt;p&gt;This specification might seem precise at first glance, but &lt;em&gt;how can we determine whether a program has a race?&lt;&#x2F;em&gt; 
We require a semantics for threaded programs in order to evaluate whether an execution trace contains a data race, but this semantics is itself given in terms of a data race! 
Thus, Pthreads provides a circular definition for thread semantics.&lt;&#x2F;p&gt;
&lt;p&gt;Conceptually, this circularity is resolved by an implementation-defined thread semantics.
Intuitively, we may expect an implementation akin to the &lt;strong&gt;sequential consistency&lt;&#x2F;strong&gt; (SC) model, which interprets a threaded program as an interleaving of instructions across threads, such that intra-thread instruction order is preserved.&lt;&#x2F;p&gt;
&lt;p&gt;For example, under the SC model, we would observe a data race whereby at least one of &lt;code&gt;r1 == 1&lt;&#x2F;code&gt; or &lt;code&gt;r2 == 1&lt;&#x2F;code&gt; holds after the execution of the following threaded program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; r1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; y;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; r2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
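&lt;p&gt;To make the SC claim concrete, here is a minimal sketch that enumerates every interleaving of the two threads while preserving the order of instructions within each thread. It assumes &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;y&lt;&#x2F;code&gt; start at zero and that the second thread stores its result into &lt;code&gt;r2&lt;&#x2F;code&gt;. The names &lt;code&gt;Mem&lt;&#x2F;code&gt;, &lt;code&gt;step&lt;&#x2F;code&gt;, &lt;code&gt;explore&lt;&#x2F;code&gt;, and &lt;code&gt;sc_allows&lt;&#x2F;code&gt; are hypothetical helpers invented for this illustration.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdbool.h>

/* Shared memory plus the two result registers of the example. */
typedef struct { int x, y, r1, r2; } Mem;

/* Execute instruction `pc` (0 or 1) of thread `tid` (1 or 2). */
static void step(Mem *m, int tid, int pc) {
    if (tid == 1) {
        if (pc == 0) m->x = 1; else m->r1 = m->y;   /* x = 1; r1 = y; */
    } else {
        if (pc == 0) m->y = 1; else m->r2 = m->x;   /* y = 1; r2 = x; */
    }
}

/* Explore every interleaving that preserves per-thread order and
   record each final (r1, r2) pair in seen[r1][r2]. */
static void explore(Mem m, int pc1, int pc2, bool seen[2][2]) {
    if (pc1 == 2 && pc2 == 2) {
        seen[m.r1][m.r2] = true;
        return;
    }
    if (pc1 < 2) { Mem n = m; step(&n, 1, pc1); explore(n, pc1 + 1, pc2, seen); }
    if (pc2 < 2) { Mem n = m; step(&n, 2, pc2); explore(n, pc1, pc2 + 1, seen); }
}

/* Returns true if some SC execution ends with the given register values. */
bool sc_allows(int r1, int r2) {
    assert((r1 == 0 || r1 == 1) && (r2 == 0 || r2 == 1));
    bool seen[2][2] = {{false, false}, {false, false}};
    Mem init = {0, 0, 0, 0};
    explore(init, 0, 0, seen);
    return seen[r1][r2];
}
```

&lt;p&gt;Under these assumptions, &lt;code&gt;sc_allows(0, 0)&lt;&#x2F;code&gt; is false, while the other three combinations of zero and one are each reachable by some interleaving.&lt;&#x2F;p&gt;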
&lt;p&gt;SC is an example of a &lt;strong&gt;memory model&lt;&#x2F;strong&gt;, a specification for the behavior of memory operations in a multi-threaded context.
Rather than use SC, however, most compilers implement a &lt;em&gt;weaker&lt;&#x2F;em&gt; memory model that allows for behaviors satisfying &lt;code&gt;r1 == r2 == 0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Although this behavior is less intuitive, a model weaker than SC is necessary in practice for three reasons:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Hardware may reorder memory operations in such a way that contradicts the SC model.&lt;&#x2F;li&gt;
&lt;li&gt;Important compiler optimizations rely on reordering memory operations. Disabling them would significantly degrade the quality of the generated code.&lt;&#x2F;li&gt;
&lt;li&gt;Thread-oblivious compiler optimizations are not required to preserve SC. Thus, it is legal for them to reorder memory operations in ways that break SC.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;pthreads-partial-memory-model&quot;&gt;Pthreads’ Partial Memory Model&lt;&#x2F;h2&gt;
&lt;p&gt;Pthreads’ memory model is SC in the absence of data races.
This model is known as the &lt;em&gt;data race freedom implies sequential consistency&lt;&#x2F;em&gt; (DRF =&amp;gt; SC) model.
In the presence of data races, however, the memory model is &lt;strong&gt;formally undefined&lt;&#x2F;strong&gt;: any behavior is allowed.&lt;&#x2F;p&gt;
&lt;p&gt;In other words, Pthreads allows any behavior for programs with data races. &lt;&#x2F;p&gt;
&lt;p&gt;Thus, since the above example contains data races, any compiler-chosen behavior satisfies the specification, including one for which &lt;code&gt;r1 == r2 == 0&lt;&#x2F;code&gt; holds.
In principle, however, this also includes entirely unexpected behaviors, such as &lt;code&gt;r1 == 42 &amp;amp;&amp;amp; r2 == -1&lt;&#x2F;code&gt;, segfaulting, and even formatting your disk!&lt;&#x2F;p&gt;
&lt;p&gt;The reason behind this design decision is explained as follows:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Formal definitions of the memory model were &lt;em&gt;rejected as unreadable by the vast majority of programmers&lt;&#x2F;em&gt;.
In addition, most of the formal work in &lt;em&gt;the literature has concentrated on the memory as provided by the hardware as opposed to the application programmer&lt;&#x2F;em&gt; through the compiler and runtime system.
It was believed that a simple statement intuitive to most programmers would be most effective&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Recognizing that the design of a clear, correct, and portable memory model for programs with data races was a complex, open research question, Pthreads opted to prohibit data races altogether.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, mechanisms for avoiding data races are mandatory for ensuring reasonable behavior in Pthreads programs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fighting-compilers-for-well-defined-behavior&quot;&gt;Fighting Compilers for Well-Defined Behavior&lt;&#x2F;h2&gt;
&lt;p&gt;To facilitate the writing of race-free, well-defined Pthreads programs, Pthreads offers synchronization primitives such as the &lt;strong&gt;memory barrier&lt;&#x2F;strong&gt; and the &lt;strong&gt;mutex&lt;&#x2F;strong&gt;.
Synchronization enables the containment of shared memory operations in programmer-defined &lt;strong&gt;critical sections&lt;&#x2F;strong&gt;, where memory operations are defined to be &lt;em&gt;mutually exclusive&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Since C++ is thread-oblivious, compilers are not guaranteed to respect the semantics of synchronization primitives.
In particular, one might fear that the compiler could reorder reads and writes to shared variables such that they are &lt;em&gt;moved out of critical sections&lt;&#x2F;em&gt;, introducing data races and, consequently, undefined behavior.&lt;&#x2F;p&gt;
&lt;p&gt;Thankfully, memory operations cannot be directly moved across synchronization calls (e.g., &lt;code&gt;pthread_mutex_lock()&lt;&#x2F;code&gt;), since Pthread library functions are treated as &lt;strong&gt;opaque functions&lt;&#x2F;strong&gt;: functions with hidden implementations that are assumed to potentially modify any shared, global variable.&lt;&#x2F;p&gt;
&lt;p&gt;This treatment precludes unsafe memory-reordering optimizations, but it unfortunately doesn&#x27;t suffice.
As we&#x27;ll see ahead, optimizations may still break thread semantics by introducing data races.
This suggests that optimizations cannot be made to preserve thread semantics through heuristics: they must instead be disallowed by formally incorporating thread semantics into the language specification.
It is in this sense that &lt;strong&gt;Threads Cannot Be Implemented as a Library&lt;&#x2F;strong&gt;: if the language doesn&#x27;t specify the behavior of threads, optimizations cannot be guaranteed to preserve the thread library&#x27;s specification, even if they sometimes do in practice.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;three-optimizations-that-break-pthreads&quot;&gt;Three Optimizations that Break Pthreads&lt;&#x2F;h2&gt;
&lt;p&gt;The author identifies three compiler optimizations that break otherwise well-defined Pthreads programs. In all of these examples, the compiler treats synchronization calls as &amp;quot;opaque functions&amp;quot; and never reorders memory operations across them, yet things can go wrong anyway!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;concurrent-modification&quot;&gt;Concurrent modification&lt;&#x2F;h3&gt;
&lt;p&gt;Consider the following two statements, each run on a separate thread, with &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;y&lt;&#x2F;code&gt; both initialized to zero:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 1
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;==&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 2
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(y&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;==&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Since there is no way under SC that either &lt;code&gt;x&lt;&#x2F;code&gt; or &lt;code&gt;y&lt;&#x2F;code&gt; can be read and modified concurrently, this program does not contain data races, and is therefore well-defined. Hence, the result of any SC execution of this program would be &lt;code&gt;x==0&lt;&#x2F;code&gt; and &lt;code&gt;y==0&lt;&#x2F;code&gt;. &lt;&#x2F;p&gt;
&lt;p&gt;Since the code contains no Pthreads operations, which would restrict reordering, the compiler may transform it in any way that preserves &amp;quot;single-threaded&amp;quot; correctness. In the process, it may opt to perform a &amp;quot;speculative&amp;quot; optimization as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 1
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;--&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 2
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;--&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This introduces a data race, since both &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;y&lt;&#x2F;code&gt; are concurrently read and modified. 
Thus, the originally well-defined program is transformed into one that is undefined. This program that looks race-free to the programmer can now segfault or run &lt;code&gt;rm -rf &#x2F;&lt;&#x2F;code&gt; or whatever!&lt;&#x2F;p&gt;
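&lt;p&gt;The effect of the speculative transformation can be checked with a small sketch that explores all interleavings of both versions of the program. As a simplifying assumption, each source statement is treated as one indivisible step; a finer-grained model would expose even more outcomes. The names &lt;code&gt;Vars&lt;&#x2F;code&gt;, &lt;code&gt;step&lt;&#x2F;code&gt;, &lt;code&gt;explore&lt;&#x2F;code&gt;, and &lt;code&gt;can_end_with&lt;&#x2F;code&gt; are hypothetical and exist only for this illustration.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { int x, y; } Vars;

/* One statement of thread `tid` (0 or 1). Thread 0 tests x and updates y;
   thread 1 tests y and updates x. */
static void step(Vars *v, int tid, bool transformed, int pc) {
    assert(tid == 0 || tid == 1);
    int *test   = (tid == 0) ? &v->x : &v->y;
    int *update = (tid == 0) ? &v->y : &v->x;
    if (!transformed) {
        if (*test == 1) ++*update;      /* if (x == 1) ++y; */
    } else if (pc == 0) {
        ++*update;                      /* ++y;             */
    } else {
        if (*test != 1) --*update;      /* if (x != 1) --y; */
    }
}

/* Depth-first search over interleavings; true if some execution
   finishes with x == want_x and y == want_y. */
static bool explore(Vars v, bool transformed, int pc0, int pc1,
                    int want_x, int want_y) {
    int len = transformed ? 2 : 1;      /* statements per thread */
    if (pc0 == len && pc1 == len)
        return v.x == want_x && v.y == want_y;
    if (pc0 < len) {
        Vars n = v;
        step(&n, 0, transformed, pc0);
        if (explore(n, transformed, pc0 + 1, pc1, want_x, want_y)) return true;
    }
    if (pc1 < len) {
        Vars n = v;
        step(&n, 1, transformed, pc1);
        if (explore(n, transformed, pc0, pc1 + 1, want_x, want_y)) return true;
    }
    return false;
}

bool can_end_with(bool transformed, int want_x, int want_y) {
    Vars init = {0, 0};
    return explore(init, transformed, 0, 0, want_x, want_y);
}
```

&lt;p&gt;In this model, the original program can only finish with both variables at zero, whereas the transformed program can also finish with both at one: each thread increments first and then sees the increment made by the other.&lt;&#x2F;p&gt;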
&lt;h3 id=&quot;rewriting-of-adjacent-data&quot;&gt;Rewriting of Adjacent Data&lt;&#x2F;h3&gt;
&lt;p&gt;Consider the following struct definition containing bit-fields:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;17&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;15&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;} x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Bit-fields have a fixed number of contiguous bits in memory allocated to them.
In the above case, &lt;code&gt;a&lt;&#x2F;code&gt; has been allocated 17 bits followed by &lt;code&gt;b&lt;&#x2F;code&gt; which has
been allocated 15 bits. All these bits are adjacent memory locations, which
makes bit-fields great for efficiently packing data.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the following concurrent field assignments:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x.a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;thread 1
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
x.b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;37&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;thread 2
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At first glance, one might expect the above program to be race free, since both concurrent writes appear to access distinct variables.
However, Pthreads does not define data races in terms of &lt;em&gt;program variables&lt;&#x2F;em&gt;, but rather in terms of &lt;strong&gt;memory locations&lt;&#x2F;strong&gt;.
Since this term is not formally defined by Pthreads, compilers may have different program variables unexpectedly &lt;em&gt;share&lt;&#x2F;em&gt; a memory location.&lt;&#x2F;p&gt;
&lt;p&gt;In the case above, a compiler targeting a 32-bit machine may choose to have the fields &lt;code&gt;x.a&lt;&#x2F;code&gt; and &lt;code&gt;x.b&lt;&#x2F;code&gt; share a memory location. This is probable in fact, since such machines are likely to have a 32-bit wide, rather than a 17 or 15-bit wide, store operation, necessitating the following implementation of bit-field assignment for &lt;code&gt;x.a = 42&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
  tmp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x;
  tmp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;amp;= ~&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0x0001ffff&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  tmp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;|= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;42&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; tmp;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Thus, both fields &lt;code&gt;x.a&lt;&#x2F;code&gt; and &lt;code&gt;x.b&lt;&#x2F;code&gt; share memory location &lt;code&gt;x&lt;&#x2F;code&gt;, causing a data race when concurrently modified.&lt;&#x2F;p&gt;
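&lt;p&gt;The lost update can be replayed deterministically in a single thread. The following sketch assumes the layout implied by the mask above (field &lt;code&gt;a&lt;&#x2F;code&gt; in the low 17 bits, field &lt;code&gt;b&lt;&#x2F;code&gt; in the high 15 bits), ignores the signedness of the fields, and fixes one particular interleaving by hand. The function and macro names are hypothetical.&lt;&#x2F;p&gt;

```c
#include <assert.h>
#include <stdint.h>

#define A_MASK  0x0001ffffu   /* low 17 bits hold field a (assumed layout) */
#define B_SHIFT 17            /* high 15 bits hold field b                 */

static_assert(B_SHIFT + 15 == 32, "the two fields must fill the 32-bit word");

/* Replays one bad interleaving: thread 1 (x.a = 42) loads the word,
   thread 2 (x.b = 37) completes its whole update, and thread 1
   stores last, using the stale copy it loaded earlier. */
uint32_t racy_bitfield_stores(void) {
    uint32_t x = 0;                      /* the shared 32-bit word      */

    uint32_t tmp1 = x;                   /* thread 1: load              */

    uint32_t tmp2 = x;                   /* thread 2: load              */
    tmp2 &= A_MASK;                      /* thread 2: clear b, keep a   */
    tmp2 |= (uint32_t)37 << B_SHIFT;     /* thread 2: set b = 37        */
    x = tmp2;                            /* thread 2: store             */

    tmp1 &= ~A_MASK;                     /* thread 1: clear a, keep b   */
    tmp1 |= 42;                          /* thread 1: set a = 42        */
    x = tmp1;                            /* thread 1: store, b is lost  */
    return x;
}

uint32_t field_a(uint32_t word) { return word & A_MASK; }
uint32_t field_b(uint32_t word) { return word >> B_SHIFT; }
```

&lt;p&gt;With this interleaving the final word holds 42 in &lt;code&gt;a&lt;&#x2F;code&gt; but zero in &lt;code&gt;b&lt;&#x2F;code&gt;: the write of 37 has been silently overwritten.&lt;&#x2F;p&gt;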
&lt;p&gt;Since the writing of this paper, the C&#x2F;C++ standard has been expanded to address this problem.
Specifically, the &lt;a href=&quot;https:&#x2F;&#x2F;wiki.sei.cmu.edu&#x2F;confluence&#x2F;display&#x2F;c&#x2F;CON32-C.+Prevent+data+races+when+accessing+bit-fields+from+multiple+threads&quot;&gt;C11 standard&lt;&#x2F;a&gt; defines the memory locations of structure bit-fields as follows:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;A bit-field and an adjacent non-bit-field member are in &lt;strong&gt;separate memory locations&lt;&#x2F;strong&gt;. The same applies to two bit-fields, if one is declared inside a nested structure declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field member declaration. It is &lt;strong&gt;not safe&lt;&#x2F;strong&gt; to concurrently update two non-atomic bit-fields in the same structure if all members declared between them are also (non-zero-length) bit-fields, no matter what the sizes of those intervening bit-fields happen to be.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This means that the C11 standard explicitly states that concurrently
executing statements like &lt;code&gt;x.a=42&lt;&#x2F;code&gt; and &lt;code&gt;x.b=37&lt;&#x2F;code&gt; leads to a data race, since
the two operations write to the &amp;quot;same&amp;quot; memory location. So, this is not a
well-defined program even before the compiler chooses to implement it using
32-bit stores.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned, this problematic transformation over adjacent bit-fields is motivated by architectural constraints.
However, a compiler can also apply this transformation to ordinary adjacent fields as an optimization that merges several narrow stores into one wide store.&lt;&#x2F;p&gt;
&lt;p&gt;For example, given the following structure definition:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;struct &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; d; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; e; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; f; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; g; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;char&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h; } x;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And a program that concurrently writes to adjacent fields:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 1
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x.a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;a&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

 &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Thread 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x.b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;b&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x.c&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;c&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x.d&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;d&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x.e&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;e&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x.f&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;f&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x.g&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;g&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; x.h&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;h&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A compiler that targets a 64-bit machine can replace the seven byte-wide stores in Thread 2 with a single 64-bit store as follows (assuming &lt;code&gt;x&lt;&#x2F;code&gt; is 64-bit aligned):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;hgfedcb&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;|&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x.a;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This optimization generates a data race, since the 64-bit store to &lt;code&gt;x&lt;&#x2F;code&gt; also rewrites &lt;code&gt;x.a&lt;&#x2F;code&gt;, which Thread 1 modifies concurrently.
The C11 standard addresses this optimization as well, since adjacent non-bit-field members
occupy separate memory locations:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Two threads of execution can update and access separate memory locations without interfering with each other.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;This clearly implies that the compiler is no longer allowed, as of 2011, to perform the above &amp;quot;bad&amp;quot; optimizations, so the code is safe.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;register-promotion&quot;&gt;Register Promotion&lt;&#x2F;h3&gt;
&lt;p&gt;Whenever a program reads and modifies a shared memory location, it must first &lt;strong&gt;promote&lt;&#x2F;strong&gt; the value in the memory hierarchy: that is, it must load the value into a local register.
After operating on the promoted value, it must then &lt;strong&gt;demote&lt;&#x2F;strong&gt; it by storing the computed value back into the shared memory location.
Such promotion&#x2F;demotion operations are costly, and especially so in a loop.&lt;&#x2F;p&gt;
&lt;p&gt;Register promotion is an optimization that aims to minimize the cost of promotion&#x2F;demotion operations in a loop by maximizing data locality at the register level.
Put another way, it aims to &amp;quot;factor out&amp;quot; promotions in a loop.
This optimization transforms a loop by first speculatively promoting the value from a shared variable accessed in the loop into a new, local variable, before the loop. 
Next, all usages of the shared variable are replaced by this new promoted variable within the loop. 
Finally, the promoted variable is demoted after the loop.
If applied to lock-synchronized code, the promotion&#x2F;demotion operations added by register promotion will lie outside the critical section, therefore creating a data race.&lt;&#x2F;p&gt;
&lt;p&gt;Importantly, register promotion must take care to insert a pair of promotion&#x2F;demotion operations around opaque function calls within the loop body.
This is because such functions are assumed to refer to all shared variables, so a demotion is required before the call, followed by a promotion.
Register promotion is therefore not beneficial in general, since it may increase the number of promotions and demotions. 
A heuristic must be employed to determine whether its usage improves performance.&lt;&#x2F;p&gt;
&lt;p&gt;To illustrate how register promotion can create a data race, consider the following example, where a shared variable &lt;code&gt;x&lt;&#x2F;code&gt; is modified:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
  if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(mt) pthread_mutex_lock(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
  x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= ...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
  if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(mt) pthread_mutex_unlock(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Suppose &lt;code&gt;mt&lt;&#x2F;code&gt; is only true when multiple threads have been created.
Then this program lacks data races, since reads and writes to &lt;code&gt;x&lt;&#x2F;code&gt; are logically synchronized.&lt;&#x2F;p&gt;
&lt;p&gt;Further suppose that register promotion is determined to be a beneficial optimization, perhaps because the conditional is observed to rarely be taken.
Since the Pthreads library calls are simply regarded as opaque function calls, the optimization will produce the following:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x; 
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
  if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(mt) {
    x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r; pthread_mutex_lock(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;); r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x;
  }
  r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= ...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...
  if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(mt) {
    x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r; pthread_mutex_unlock(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;); r &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; x;
  }
}
x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; r; 
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this case, register promotion has introduced reads and writes to &lt;code&gt;x&lt;&#x2F;code&gt; outside
the critical section guarded by the lock. For instance, assume thread 1 has acquired the
lock and is about to execute &lt;code&gt;x = r&lt;&#x2F;code&gt; before unlocking; concurrently, thread 2 executes
its own &lt;code&gt;x = r&lt;&#x2F;code&gt; right before its &lt;code&gt;pthread_mutex_lock()&lt;&#x2F;code&gt; call, without holding the lock.
Both threads then write &lt;code&gt;x&lt;&#x2F;code&gt; at the same time. This is clearly a data race, introduced by the
compiler because the language has no notion of thread semantics baked in.&lt;&#x2F;p&gt;
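&lt;p&gt;To see that the promoted loop can actually lose an update, the following sketch replays one schedule by hand in a single thread. It assumes two threads A and B that each run one loop iteration incrementing &lt;code&gt;x&lt;&#x2F;code&gt;, with &lt;code&gt;mt&lt;&#x2F;code&gt; true; the lock and unlock calls appear only as comments, because the chosen schedule never overlaps the two critical sections. The function names and the registers &lt;code&gt;rA&lt;&#x2F;code&gt; and &lt;code&gt;rB&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;

```c
#include <assert.h>

/* Original loop body: every access to x happens while holding the lock. */
int final_x_original(void) {
    int x = 0;
    /* thread A */ /* lock */ x = x + 1; /* unlock */
    /* thread B */ /* lock */ x = x + 1; /* unlock */
    return x;
}

/* Promoted loop, with B paused right after its pre-loop promotion
   while A runs to completion. */
int final_x_promoted(void) {
    int x = 0;
    int rA, rB;

    rB = x;                        /* B: r = x before the loop (reads 0)   */

    rA = x;                        /* A: r = x before the loop             */
    x = rA; /* lock */   rA = x;   /* A: demote, lock, promote             */
    rA = rA + 1;                   /* A: loop body on the register         */
    x = rA; /* unlock */ rA = x;   /* A: demote, unlock, promote           */
    x = rA;                        /* A: final demotion after the loop     */
    assert(x == 1);                /* the increment by A is visible        */

    x = rB; /* lock */   rB = x;   /* B: stale 0 stored WITHOUT the lock   */
    rB = rB + 1;                   /* B: loop body on the register         */
    x = rB; /* unlock */ rB = x;
    x = rB;                        /* B: final demotion after the loop     */
    return x;
}
```

&lt;p&gt;The original version ends with &lt;code&gt;x == 2&lt;&#x2F;code&gt;, while the promoted version ends with &lt;code&gt;x == 1&lt;&#x2F;code&gt;: the unsynchronized store of the stale register value erases the increment performed by A.&lt;&#x2F;p&gt;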
&lt;h2 id=&quot;allowing-data-races&quot;&gt;Allowing Data Races&lt;&#x2F;h2&gt;
&lt;p&gt;In the face of these issues, it’s clear that, indeed, threads cannot be implemented as a library. Instead, the thread specification must be formally incorporated into the language specification, to ensure that compilers do not break the former.&lt;&#x2F;p&gt;
&lt;p&gt;The task of formally specifying Pthreads presents an opportunity for rethinking its synchronization-oriented paradigm.
The author observes that synchronization is not always desirable, since it relies on expensive atomic operations and memory barriers:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The cost of atomic operations and memory barriers varies widely, but is often comparable to that of a hundred or more register-to-register instructions, even in the absence of a cache miss. For example, on some Pentium 4 processors, hardware instructions to atomically update a memory location require well over 100 processor cycles, and these can also double as one of the cheaper mechanisms for ensuring that a store operation becomes visible to other threads before a subsequent load. &lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Instead of relying solely on synchronization, the author argues for a paradigm that allows data races and uses atomic operations directly, and demonstrates the performance gains of such a paradigm shift.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;As a result of the high cost of these hardware instructions, and the even higher cost of the pthread primitives built on them, there are a small number of cases in which synchronization performance is critical, and more careful and direct use of the hardware primitives, together with less constrained use of shared variables, is essential. In some cases it may also be necessary to avoid deadlock issues inherent in lock-based programming[7], or desirable because a different parallel programming model is preferable for an application (cf. [32]).&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To demonstrate this claim, the author compares the performance of a Sieve of Eratosthenes implementation and a tracing garbage collector under both the synchronized and the data-racy paradigms.&lt;&#x2F;p&gt;
&lt;p&gt;Four implementations are evaluated when run with 1, 2, and 4 threads.&lt;br &#x2F;&gt;
Two of these use synchronization: one using mutexes, and another using spinlocks.&lt;br &#x2F;&gt;
These are compared to two implementations with data races: one that makes use of atomic memory operations, and one that unsafely uses ordinary memory operations.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;images&#x2F;sieve-bit-array.png&quot; alt=&quot;alt text&quot; title=&quot;Sieve of Eratosthenes Performance&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;images&#x2F;gc.png&quot; alt=&quot;alt text&quot; title=&quot;Tracing Garbage Collector Performance&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The results indicate that there are applications for which allowing data races can improve performance on a multiprocessor, motivating the formal specification of programs with data races.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;formal-model-for-data-races&quot;&gt;Formal Model for Data Races&lt;&#x2F;h2&gt;
&lt;p&gt;To allow data races, the language standard must provide a memory model for programs with data races.
Java, as a safe language, provides such a model, so the author expressed optimism about the possibility of adapting its model for use in the C++ standard.&lt;&#x2F;p&gt;
&lt;p&gt;Fast-forwarding to the present, we see that the &lt;a href=&quot;http:&#x2F;&#x2F;www.open-std.org&#x2F;jtc1&#x2F;sc22&#x2F;wg14&#x2F;www&#x2F;docs&#x2F;n1548.pdf&quot;&gt;C11 standard&lt;&#x2F;a&gt; indeed defines the behavior of data races between atomic operations. 
This is accomplished by a &lt;em&gt;redefinition&lt;&#x2F;em&gt; of data races that doesn&#x27;t include concurrent atomic operations:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Two expression evaluations conflict if one of them modifies a memory location and the other one reads or modifies the same memory location.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The execution of a program contains a data race if it contains two conflicting actions in different threads, &lt;strong&gt;at least one of which is not atomic&lt;&#x2F;strong&gt;, and neither happens before the other. Any such data race results in undefined behavior.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Thus, concurrent atomic operations are no longer considered a data race.
Instead, they are often denoted as &lt;strong&gt;race conditions&lt;&#x2F;strong&gt;, and are well-defined.&lt;&#x2F;p&gt;
&lt;p&gt;Given data race freedom under this redefinition, C11 formally defines thread semantics as SC (&lt;em&gt;usually&lt;&#x2F;em&gt;):&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Under a hosted implementation, a program can have more than one thread of execution (or thread) running concurrently. 
The execution of each thread proceeds as defined by the remainder of this standard. 
The execution of the entire program consists of an execution of all of its threads. 
Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;The execution can usually be viewed as an &lt;strong&gt;interleaving of all of the threads&lt;&#x2F;strong&gt;. 
However, some kinds of atomic operations, for example, allow executions inconsistent with a simple interleaving as described below&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Thus, the standard now formally specifies the &lt;strong&gt;DRF =&amp;gt; SC&lt;&#x2F;strong&gt; model.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Out of the Loop!</title>
                <pubDate>Sat, 19 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-invariant-code-motion/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loop-invariant-code-motion/</guid>
                <description>&lt;p&gt;Sometimes, loops do more work than they really &lt;em&gt;have&lt;&#x2F;em&gt; to. Take, for example, the snippet of code below:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;):
    x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(y)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where what we really want is &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;):
    y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;y &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;print&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(y)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first snippet redundantly sets &lt;code&gt;x = 3&lt;&#x2F;code&gt; on &lt;em&gt;every iteration&lt;&#x2F;em&gt;, even though this constant assignment only needs to happen once. The problem scales with larger programs, so how can we remove redundant calculations?&lt;&#x2F;p&gt;
&lt;p&gt;Loop Invariant Code Motion hoists code that doesn&#x27;t need to be in the loop (invariant code) out of the loop. This optimization cuts down the number of instructions executed by avoiding unnecessary repetition. Our implementation first identifies movable instructions, then iteratively moves them.&lt;&#x2F;p&gt;
&lt;p&gt;The flow of information in our project follows the visual outline below. First, we identify natural loops using two existing tools, control flow graphs and dominator trees, which needed some light debugging. We also used these tools to implement reaching definitions. Together, natural loops and reaching definitions let us detect loop-invariant instructions that we can potentially hoist.&lt;&#x2F;p&gt;
&lt;p&gt;Once these instructions are identified, we check if these instructions can be hoisted.
We move hoisted instructions into a preheader basic block and rename jump targets accordingly.
After post-processing, we have our new instruction list!&lt;&#x2F;p&gt;
&lt;img src=&quot;plan.jpeg&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;Skip to the end to optimize your very own Bril program!&lt;&#x2F;p&gt;
&lt;h1 id=&quot;loop&quot;&gt;Loop&lt;&#x2F;h1&gt;
&lt;p&gt;All loops considered here are &lt;strong&gt;natural loops&lt;&#x2F;strong&gt;: cycles with a single entry and a &lt;strong&gt;back-edge&lt;&#x2F;strong&gt;. A back-edge is an edge $A \longrightarrow B$ with tail $A$ and head $B$ such that $B$ dominates $A$. The natural loop of a back-edge is then the smallest set of vertices $L$ with $A,B \in L$ such that for each vertex $v \in L$ we have $v=B$ or PREDS($v$)$\subseteq L$.&lt;&#x2F;p&gt;
&lt;p&gt;In essence, a back-edge is what brings us from the tail of the loop $A$ back to the beginning $B$.
The cycle surrounding this back-edge is our loop, where the entry point is the start of the cycle.
If the loop has only one entry, it is a natural loop. The illustration below highlights a natural loop and labels its back-edge.&lt;&#x2F;p&gt;
&lt;img src=&quot;natloop.jpeg&quot; style=&quot;width: 70%&quot;&gt;
&lt;p&gt;If an edge connected &lt;em&gt;entry&lt;&#x2F;em&gt; directly to &lt;em&gt;body&lt;&#x2F;em&gt;, the cycle would have a second entry point and would no longer be a natural loop.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;detecting-natural-loops&quot;&gt;Detecting Natural Loops&lt;&#x2F;h3&gt;
&lt;p&gt;To find the loop invariant code, first we must detect all natural loops. To accomplish this, we make use of control flow graphs from &lt;code&gt;cfg.py&lt;&#x2F;code&gt; and dominator trees from &lt;code&gt;dom.py&lt;&#x2F;code&gt; within our three functions. Back-edges are identified with &lt;code&gt;get_backedges&lt;&#x2F;code&gt;, and &lt;code&gt;loopsy&lt;&#x2F;code&gt; finds the natural loop associated with an input back-edge.&lt;&#x2F;p&gt;
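&lt;p&gt;As a rough illustration of the two definitions above, here is a minimal Python sketch. The function names &lt;code&gt;find_back_edges&lt;&#x2F;code&gt; and &lt;code&gt;natural_loop&lt;&#x2F;code&gt;, the dictionary-based graph layout, and the four-block CFG are all assumptions made for this example; they are not the signatures of &lt;code&gt;get_backedges&lt;&#x2F;code&gt; or &lt;code&gt;loopsy&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```python
def find_back_edges(succs, doms):
    """Return every edge (tail, head) whose head dominates its tail."""
    return [(tail, head)
            for tail, heads in succs.items()
            for head in heads
            if head in doms[tail]]


def natural_loop(preds, tail, head):
    """Collect the smallest block set that contains tail and head and is
    closed under predecessors, stopping at the header."""
    loop = {head}
    worklist = [tail]
    while worklist:
        block = worklist.pop()
        if block not in loop:
            loop.add(block)
            worklist.extend(preds[block])
    return loop


# Hypothetical CFG: entry jumps to header, header branches to body or exit,
# and body jumps back to header (the back-edge).
succs = {"entry": ["header"], "header": ["body", "exit"],
         "body": ["header"], "exit": []}
preds = {"entry": [], "header": ["entry", "body"],
         "body": ["header"], "exit": ["header"]}
# doms[b] is the set of blocks that dominate b (each block dominates itself).
doms = {"entry": {"entry"}, "header": {"entry", "header"},
        "body": {"entry", "header", "body"},
        "exit": {"entry", "header", "exit"}}

back_edges = find_back_edges(succs, doms)
loops = [natural_loop(preds, tail, head) for tail, head in back_edges]
print(back_edges, loops)
```

&lt;p&gt;On this CFG the only back-edge runs from &lt;code&gt;body&lt;&#x2F;code&gt; to &lt;code&gt;header&lt;&#x2F;code&gt;, and its natural loop contains exactly those two blocks.&lt;&#x2F;p&gt;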
&lt;h3 id=&quot;detecting-loop-invariants&quot;&gt;Detecting loop invariants&lt;&#x2F;h3&gt;
&lt;p&gt;An instruction within a natural loop is marked loop invariant if all of its arguments are defined outside of the natural loop. Alternatively, if an argument is defined exactly once inside the loop and that definition is itself loop invariant, the instruction may still be marked as loop invariant. Our goal is to find these loop invariants so that we may mark them as movable. Iteratively checking these two conditions until nothing changes, we converge on a list of invariant instructions.&lt;&#x2F;p&gt;
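&lt;p&gt;The iteration described above can be sketched as a small fixed-point computation. This is a simplified sketch, not our actual code: the name &lt;code&gt;find_loop_invariants&lt;&#x2F;code&gt; is hypothetical, instructions are assumed to be Bril-style dictionaries with &lt;code&gt;dest&lt;&#x2F;code&gt; and &lt;code&gt;args&lt;&#x2F;code&gt; fields, and a variable is treated as defined outside the loop whenever no loop instruction writes it, which stands in for a real reaching-definitions query.&lt;&#x2F;p&gt;

```python
def find_loop_invariants(loop_instrs):
    """Return the indices of loop instructions judged loop invariant.

    Simplifying assumption: a variable counts as defined outside the loop
    when no instruction in the loop writes it.
    """
    # Map each variable to the indices of the loop instructions defining it.
    defs = {}
    for index, instr in enumerate(loop_instrs):
        if "dest" in instr:
            defs.setdefault(instr["dest"], []).append(index)

    invariant = set()

    def arg_is_invariant(arg):
        sites = defs.get(arg, [])
        if not sites:
            return True  # defined only outside the loop
        # Defined exactly once in the loop, by an invariant instruction.
        return len(sites) == 1 and sites[0] in invariant

    changed = True
    while changed:  # repeat until no new instruction is marked
        changed = False
        for index, instr in enumerate(loop_instrs):
            if index in invariant or "dest" not in instr:
                continue
            if all(arg_is_invariant(arg) for arg in instr.get("args", [])):
                invariant.add(index)
                changed = True
    return invariant


# Hypothetical loop body: x is constant, z depends only on x, y accumulates.
loop_body = [
    {"op": "const", "dest": "x", "type": "int", "value": 3},
    {"op": "add", "dest": "z", "type": "int", "args": ["x", "x"]},
    {"op": "add", "dest": "y", "type": "int", "args": ["y", "x"]},
    {"op": "print", "args": ["y"]},
]
invariant = find_loop_invariants(loop_body)
print(sorted(invariant))
```

&lt;p&gt;Here the definitions of &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;z&lt;&#x2F;code&gt; are marked invariant, while the update of &lt;code&gt;y&lt;&#x2F;code&gt; is not because it depends on its own previous value.&lt;&#x2F;p&gt;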
&lt;h1 id=&quot;motion&quot;&gt;Motion&lt;&#x2F;h1&gt;
&lt;h3 id=&quot;hoisting-instructions&quot;&gt;Hoisting instructions&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have determined which instructions are loop invariant, we check if we can actually hoist those instructions out of the loop. The following conditions must be met:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;condition 1:&lt;&#x2F;code&gt; instruction dominates all loop exits&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;condition 2:&lt;&#x2F;code&gt; instruction is the only definition for that variable in the loop&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;condition 3:&lt;&#x2F;code&gt; definition dominates all of its uses&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;After checking all three criteria, we label each instruction in a Boolean map: &lt;code&gt;true&lt;&#x2F;code&gt; if hoistable and &lt;code&gt;false&lt;&#x2F;code&gt; if not.
Instructions marked &lt;code&gt;true&lt;&#x2F;code&gt; are removed from their original basic blocks.&lt;&#x2F;p&gt;
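&lt;p&gt;To make the three conditions above concrete, here is a hedged Python sketch that checks them at basic-block granularity. The function &lt;code&gt;can_hoist&lt;&#x2F;code&gt;, its parameters, and the example CFG are assumptions for illustration. It also assumes that a loop exit is a block inside the loop with a successor outside it, and it ignores the ordering of instructions within a single block.&lt;&#x2F;p&gt;

```python
def can_hoist(dest, def_block, loop, succs, doms, defs_in_loop, use_blocks):
    """Check the three hoisting conditions at basic-block granularity.

    doms[b] is the set of blocks dominating b; defs_in_loop counts the
    definitions of each variable inside the loop; use_blocks lists the
    blocks that read dest.
    """
    # Assumed notion of a loop exit: a loop block with a successor outside.
    exits = [b for b in loop if any(s not in loop for s in succs[b])]
    dominates_exits = all(def_block in doms[b] for b in exits)      # condition 1
    only_definition = defs_in_loop.get(dest, 0) == 1                # condition 2
    dominates_uses = all(def_block in doms[b] for b in use_blocks)  # condition 3
    return dominates_exits and only_definition and dominates_uses


# Hypothetical CFG: header branches to body or exit, body jumps to header.
succs = {"entry": ["header"], "header": ["body", "exit"],
         "body": ["header"], "exit": []}
doms = {"entry": {"entry"}, "header": {"entry", "header"},
        "body": {"entry", "header", "body"},
        "exit": {"entry", "header", "exit"}}
loop = {"header", "body"}

# A definition in the header dominates the exit and its use in the body.
in_header = can_hoist("x", "header", loop, succs, doms, {"x": 1}, ["body"])
# A definition in the body does not dominate the exit taken from the header.
in_body = can_hoist("x", "body", loop, succs, doms, {"x": 1}, ["body"])
print(in_header, in_body)
```

&lt;p&gt;The definition placed in &lt;code&gt;body&lt;&#x2F;code&gt; fails condition 1 because the loop can exit from &lt;code&gt;header&lt;&#x2F;code&gt; without ever running the body.&lt;&#x2F;p&gt;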
&lt;h3 id=&quot;preheader-generation-and-target-renaming&quot;&gt;Preheader generation and target renaming&lt;&#x2F;h3&gt;
&lt;p&gt;We gather hoisted instructions into a new preheader basic block that executes before the loop header.
This means we have to rename jumps that target the loop header so that they target the preheader instead.
We rename only the jumps that originate outside the loop, which ensures that the preheader executes only when we first enter the loop.&lt;&#x2F;p&gt;
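&lt;p&gt;The renaming rule can be sketched as follows. This is an illustration under stated assumptions rather than our implementation: &lt;code&gt;retarget_to_preheader&lt;&#x2F;code&gt; and the block names are hypothetical, and the sketch assumes the Bril JSON layout in which &lt;code&gt;jmp&lt;&#x2F;code&gt; and &lt;code&gt;br&lt;&#x2F;code&gt; instructions carry their targets in a &lt;code&gt;labels&lt;&#x2F;code&gt; list. It does not handle blocks that fall through into the header.&lt;&#x2F;p&gt;

```python
def retarget_to_preheader(blocks, loop, header, preheader):
    """Point jumps from outside the loop at the preheader.

    blocks maps a block name to its instruction list and loop is the set of
    block names in the natural loop. Jumps inside the loop, such as the
    back-edge, keep targeting the header, so the preheader runs only on
    loop entry. Instructions are updated in place.
    """
    for name, instrs in blocks.items():
        if name in loop:
            continue  # leave jumps that originate inside the loop alone
        for instr in instrs:
            if "labels" in instr:
                instr["labels"] = [preheader if label == header else label
                                   for label in instr["labels"]]


# Hypothetical function body, already split into named basic blocks.
blocks = {
    "entry": [{"op": "jmp", "labels": ["header"]}],
    "header": [{"op": "br", "args": ["cond"], "labels": ["body", "exit"]}],
    "body": [{"op": "jmp", "labels": ["header"]}],
    "exit": [{"op": "ret"}],
}
retarget_to_preheader(blocks, {"header", "body"}, "header", "header_pre")
print(blocks["entry"][0]["labels"], blocks["body"][0]["labels"])
```

&lt;p&gt;After the call, the jump in &lt;code&gt;entry&lt;&#x2F;code&gt; targets &lt;code&gt;header_pre&lt;&#x2F;code&gt;, while the back-edge in &lt;code&gt;body&lt;&#x2F;code&gt; still targets &lt;code&gt;header&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;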
&lt;h3 id=&quot;instruction-list-generation&quot;&gt;Instruction list generation&lt;&#x2F;h3&gt;
&lt;p&gt;Next, we generate a new list of instructions for the function body. To do this, we first order the basic blocks. The order must satisfy these constraints: the first and last basic blocks of the original program remain in place, and fall-throughs between basic blocks in the original function are respected.&lt;&#x2F;p&gt;
&lt;p&gt;Here, we ran into a few bugs. 🐝🐛🦋🐌🐞🐜🦗🕷🦂&lt;&#x2F;p&gt;
&lt;p&gt;The existing code in the repository that computes dominator trees throws an exception when a basic block is unreachable. To solve this, we remove unreachable blocks from all data structures. As a bonus, this implements a very limited form of dead code elimination! ☠️&lt;&#x2F;p&gt;
&lt;p&gt;Once we have an order for the basic blocks, we concatenate them to create a new list of instructions. We then post-process the list to remove empty preheaders. Finally, we have our new instruction list!&lt;&#x2F;p&gt;
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;p&gt;We evaluated our optimization on a suite of benchmarks by instrumenting the Bril interpreter to count the number of instructions it executes. We chose this metric instead of comparing execution times because it avoids the measurement biases that timing can introduce.&lt;&#x2F;p&gt;
&lt;p&gt;By measuring performance as the number of instructions executed, we give a fair comparison between non-optimized and optimized Bril programs: no matter the environment in which they are executed, a Bril program that executes fewer instructions performs better than an equivalent Bril program that executes more instructions.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;benchmark&lt;&#x2F;th&gt;&lt;th align=&quot;right&quot;&gt;original (instructions executed)&lt;&#x2F;th&gt;&lt;th align=&quot;left&quot;&gt;optimized (instructions executed)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;df_test&#x2F;cond.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;9&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;df_test&#x2F;fact.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;62&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;55&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;dom_test&#x2F;loopcond.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;117&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;108&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;clobber-fold.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;10&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;clobber.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;10&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;commute.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;idchain.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;5&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;idchain-prop.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;5&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;nonlocal.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;7&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;7&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;reassign.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;3&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;redundant.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;lvn_test&#x2F;redundant-dce.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;combo.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;diamond.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;double-pass.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;reassign.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;reassign-dkp.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;6&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;simple.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;5&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;5&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;tdce_test&#x2F;skipped.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;4&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;4&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;loop_test&#x2F;depend.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;158&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;159&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;loop_test&#x2F;fibonacci.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;78&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;79&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;loop_test&#x2F;nest.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;147&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;130&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;examples&#x2F;loop_test&#x2F;hoist_expr.bril&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;right&quot;&gt;48&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;32&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The table above lists every benchmark we ran, along with the number of instructions the interpreter executed for the original and optimized versions. Our benchmarks are drawn from the &lt;code&gt;examples&lt;&#x2F;code&gt; directory of the Bril repository. We also created a new directory called &lt;code&gt;loop_test&lt;&#x2F;code&gt; that contains benchmarks we wrote ourselves.
Among these freshly baked tests are &lt;code&gt;nest.bril&lt;&#x2F;code&gt;, &lt;code&gt;fibonacci.bril&lt;&#x2F;code&gt;, &lt;code&gt;hoist_expr.bril&lt;&#x2F;code&gt;, and &lt;code&gt;depend.bril&lt;&#x2F;code&gt;. The first, &lt;code&gt;nest.bril&lt;&#x2F;code&gt;, checks whether we can handle redundant expressions in a nested loop. Optimizing it saved 17 executed instructions (147 down to 130).&lt;&#x2F;p&gt;
&lt;p&gt;All optimized benchmarks print the same results as their original versions, which is evidence that our optimization preserves the semantics of these programs. Optimized programs never execute more instructions than the originals, with two exceptions among our own tests: &lt;code&gt;depend.bril&lt;&#x2F;code&gt; and &lt;code&gt;fibonacci.bril&lt;&#x2F;code&gt; each execute one extra instruction, because fall-throughs in the original programs became explicit jumps in the optimized versions.&lt;&#x2F;p&gt;
&lt;p&gt;The results show that when the original program has no loops, our optimization leaves the instruction count unchanged. When a loop does contain invariant code, our optimization successfully hoists it, so the interpreter executes fewer instructions.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;try-it&quot;&gt;Try it!&lt;&#x2F;h1&gt;
&lt;p&gt;Our optimizer is implemented in &lt;code&gt;examples&#x2F;loop.py&lt;&#x2F;code&gt;. This program reads a Bril program in JSON format from standard input and writes an optimized Bril program to standard output.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;bril2json &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;test.bril &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;python loop.py &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bril2txt
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replace &lt;code&gt;test.bril&lt;&#x2F;code&gt; with any program you would like to optimize.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Double-Checked Locking is Broken</title>
                <pubDate>Fri, 18 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/double-checked-locking/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/double-checked-locking/</guid>
                <description>&lt;p&gt;Double-checked locking is a software design pattern for 
reducing the overhead of acquiring a lock.
The program checks locking criteria first, 
and acquires the lock only if the check indicates that locking is required.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Lazy initialization&lt;&#x2F;strong&gt; is a commonly used tactic for delaying an object&#x27;s
initialization until the first time it is accessed.
In multi-threaded environments, initialization is usually not thread safe,
so locking is required to protect the critical section.
Since only the first access requires locking,
double-checked locking is used to avoid the locking overhead of subsequent accesses.
However, in many languages and on many hardware architectures, the pattern can be unsafe.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-threaded-lazy-initialization-won-t-work-in-multi-threading&quot;&gt;Single-threaded lazy initialization won&#x27;t work in multi-threading&lt;&#x2F;h3&gt;
&lt;p&gt;If we were writing single-threaded code, we could write a lazy initialization 
like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
            helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This code works for a single thread, but if it is run in a
multi-threaded environment,
two or more threads could find that &lt;code&gt;helper&lt;&#x2F;code&gt; is &lt;code&gt;null&lt;&#x2F;code&gt; at the same time
and create multiple copies of the &lt;code&gt;Helper&lt;&#x2F;code&gt; object.
This can even cause a memory leak in some languages, such as C++.&lt;&#x2F;p&gt;
&lt;img src=&quot;lazy-init-no-lock.png&quot; style=&quot;width: 100%;&quot;&gt;
&lt;p&gt;As shown in the figure above,
when threads either run concurrently on a single processor (e.g., thread 1 and thread 2)
or run in parallel on different processors (e.g., thread 3 and thread 4),
they can create multiple copies of &lt;code&gt;helper&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;always-synchronized-solution-is-slow&quot;&gt;Always-synchronized solution is slow&lt;&#x2F;h3&gt;
&lt;p&gt;To fix this issue, we can simply add a lock to this critical section as follows,
so that only one thread can enter this critical section at a time.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
                helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
        }
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;img src=&quot;lazy-init-with-lock.png&quot; style=&quot;width: 100%;&quot;&gt;
&lt;p&gt;However, we only need this section of code to be synchronized for the first
thread access.
After the object is created, acquiring and releasing the lock is unnecessary
and can significantly hurt performance.&lt;&#x2F;p&gt;
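&lt;p&gt;For readers who prefer to experiment, the always-synchronized pattern can be reproduced in a few lines of Python. This is only an analogue of the Java code above, not a translation of it: the &lt;code&gt;created&lt;&#x2F;code&gt; counter, the &lt;code&gt;worker&lt;&#x2F;code&gt; function, and the thread counts are assumptions added so the behavior can be checked.&lt;&#x2F;p&gt;

```python
import threading


class Helper:
    created = 0  # counts constructions so the example can be checked

    def __init__(self):
        Helper.created += 1


class Foo:
    def __init__(self):
        self._helper = None
        self._lock = threading.Lock()

    def get_helper(self):
        # Every call takes the lock, even long after initialization.
        with self._lock:
            if self._helper is None:
                self._helper = Helper()
            return self._helper


foo = Foo()
seen = []


def worker():
    for _ in range(100):
        seen.append(foo.get_helper())


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(Helper.created, len(seen))
```

&lt;p&gt;Exactly one &lt;code&gt;Helper&lt;&#x2F;code&gt; is ever constructed, but all 800 calls pay for acquiring and releasing the lock.&lt;&#x2F;p&gt;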
&lt;p&gt;What we want is something like this:&lt;&#x2F;p&gt;
&lt;img src=&quot;lazy-init-ideal.png&quot; style=&quot;width: 100%;&quot;&gt;
&lt;p&gt;Only the first thread will enter the synchronized section and create the object. 
Once the &lt;code&gt;helper&lt;&#x2F;code&gt; is initialized, all subsequent accesses can run in parallel
without synchronization.&lt;&#x2F;p&gt;
&lt;p&gt;Intuitively, we can come up with the following steps to do this job:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Check if the object is initialized without locking.
If it is, then return the object immediately.&lt;&#x2F;li&gt;
&lt;li&gt;Acquire the lock and check again if the object is initialized.
If another thread has previously grabbed the lock, 
the current thread can see the object is created, and return the object.&lt;&#x2F;li&gt;
&lt;li&gt;Otherwise, the current thread will create the object and return.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;With the guidelines above, we will get the following code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {              &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; first check
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)         &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; second check
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;                    helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This tactic is called double-checked locking.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;double-checked-locking-is-broken&quot;&gt;Double-checked locking is broken&lt;&#x2F;h3&gt;
&lt;p&gt;However, this code is not guaranteed to work.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;helper = new Helper()&lt;&#x2F;code&gt; is not an atomic operation.
It consists of multiple instructions: allocating space, initializing the fields of the object,
and assigning the address to &lt;code&gt;helper&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In order to show what is really happening there,
we expand &lt;code&gt;helper = new Helper()&lt;&#x2F;code&gt; with some pseudocode 
and inline the object initialization code.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
                    ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;allocate();
                    ptr.field1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;initField1();
                    ptr.field2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;initField2();
                    helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr;
                }
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To improve overall performance,
compilers, memory systems, and processors may reorder these instructions,
for example by moving &lt;code&gt;helper = ptr&lt;&#x2F;code&gt; before the initialization of the object&#x27;s fields.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
                    ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;allocate();
                    helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr;
                    ptr.field1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;initField1();
                    ptr.field2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;initField2();
                }
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This reordering is legal because, within a single thread, there is no data dependency between &lt;code&gt;helper = ptr&lt;&#x2F;code&gt;
and the instructions that initialize the fields.
In some interleavings, however,
another thread can see a non-null value of &lt;code&gt;helper&lt;&#x2F;code&gt;
and access fields of the object that are not yet initialized.&lt;&#x2F;p&gt;
&lt;img src=&quot;dcl-exec.png&quot; style=&quot;width: 100%;&quot;&gt;
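&lt;p&gt;The problematic interleaving can be replayed deterministically.
The following is a minimal single-threaded sketch, not a real data race:
it performs the reordered steps of the first thread by hand and places the unlocked read of a second thread in between.
The class name &lt;code&gt;ReorderingReplay&lt;&#x2F;code&gt;, the &lt;code&gt;int&lt;&#x2F;code&gt; field types, and the values 42 and 7 are assumptions made for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```java
// Single-threaded replay of the bad interleaving; all names are illustrative.
class ReorderingReplay {
    static class Helper {
        int field1;   // holds the default value 0 until initialized
        int field2;
    }

    static Helper helper = null;

    // Replays the reordered steps of thread A, with the unlocked read of
    // thread B placed between the publication and the field initialization.
    static int replay() {
        helper = null;
        Helper ptr = new Helper();    // thread A: allocate()
        helper = ptr;                 // thread A: publish (moved earlier)

        // thread B: the first check sees a non-null helper and skips the lock
        int seenByB = -1;
        if (helper != null) {
            seenByB = helper.field1;  // reads the default value, not 42
        }

        ptr.field1 = 42;              // thread A: initField1()
        ptr.field2 = 7;               // thread A: initField2()
        return seenByB;
    }

    public static void main(String[] args) {
        // prints: thread B saw field1 = 0
        System.out.println("thread B saw field1 = " + replay());
    }
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The second thread passes the null check but reads 0 instead of 42,
which is exactly the failure that the reordering makes possible.&lt;&#x2F;p&gt;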
&lt;h3 id=&quot;another-fix-is-also-broken&quot;&gt;Another fix is also broken&lt;&#x2F;h3&gt;
&lt;p&gt;A memory barrier is a type of instruction that can make the compiler and processor
enforce the ordering, so that the instructions on one side of the memory barrier will
not be reordered to the other side of the barrier.&lt;&#x2F;p&gt;
&lt;p&gt;To force the object initialization in &lt;code&gt;new Helper()&lt;&#x2F;code&gt; to execute before
the assignment to &lt;code&gt;helper&lt;&#x2F;code&gt;,
some people proposed another fix that adds a second &lt;code&gt;synchronized&lt;&#x2F;code&gt; block.
The idea is that &lt;code&gt;synchronized&lt;&#x2F;code&gt; acts as an implicit memory barrier: the instructions
inside the synchronized section must be executed before the section is exited
(i.e., before the lock is released).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ 
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; 
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
                    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                        h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
                    }                       
                helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The only purpose of the second &lt;code&gt;synchronized&lt;&#x2F;code&gt; is to create a memory barrier,
since mutual exclusion is already enforced by the first &lt;code&gt;synchronized&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The intuition is that releasing the lock acts as a memory barrier,
so that &lt;code&gt;helper = h&lt;&#x2F;code&gt; will not be executed until the initialization
in the synchronized section is done.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, releasing a lock is only a one-way memory barrier on many processors.
It guarantees that the instructions in the synchronized section
are executed before the lock is released, but not the reverse.
The instruction &lt;code&gt;helper = h&lt;&#x2F;code&gt;, which follows the barrier, can still be moved into the
synchronized section and executed before the object initialization is done.&lt;&#x2F;p&gt;
&lt;p&gt;The expanded and reordered pseudocode could then look like the following,
which has the same problem as the original version of double-checked locking.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ 
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; 
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
                    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                        ptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;allocate();
                        h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr;
                        helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
                        ptr.field1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;initField1();
                        ptr.field2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;initField2();
                    }                       
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;working-solutions&quot;&gt;Working Solutions&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;explicit-memory-barrier&quot;&gt;Explicit Memory Barrier&lt;&#x2F;h3&gt;
&lt;p&gt;The previous fix with two synchronized sections does not work
because releasing a lock is only an implicit &amp;quot;one-way&amp;quot; memory barrier.
Double-checked locking can be made to work with
explicit memory barriers.
For example, &lt;a href=&quot;https:&#x2F;&#x2F;preshing.com&#x2F;20130930&#x2F;double-checked-locking-is-fixed-in-cpp11&#x2F;&quot;&gt;Preshing&lt;&#x2F;a&gt; provides an implementation of double-checked locking with
&lt;code&gt;std::atomic&lt;&#x2F;code&gt; and &lt;code&gt;std::atomic_thread_fence&lt;&#x2F;code&gt; in C++11.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;:
        std::atomic&amp;lt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; helper{nullptr};
        std::mutex m_init;  &#x2F;&#x2F; guards the one-time initialization
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;:
        Helper&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;get_helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
            Helper&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);        &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;memory barrier
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;nullptr&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
                std::lock_guard&amp;lt;std::mutex&amp;gt; lock(m_init);
                h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper.load(std::memory_order_relaxed);
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;nullptr&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
                    h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; Helper;
                    std::atomic_thread_fence(std::memory_order_release);&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F;memory barrier
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;                    helper.store(h, std::memory_order_relaxed);
                }
            }
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
        }
};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;std::atomic_thread_fence(std::memory_order_acquire)&lt;&#x2F;code&gt; guarantees that 
read&#x2F;write operations after a memory barrier cannot be reordered with 
read operations before the memory barrier.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;std::atomic_thread_fence(std::memory_order_release)&lt;&#x2F;code&gt; guarantees that 
read&#x2F;write operations before a memory barrier cannot be reordered with
write operations after the memory barrier.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Together, the memory barriers guarantee that:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;On the reading side, accesses to the object through &lt;code&gt;h&lt;&#x2F;code&gt; cannot be moved before the load from &lt;code&gt;helper&lt;&#x2F;code&gt;,
so a thread that sees a non-null pointer also sees the initialized fields.&lt;&#x2F;li&gt;
&lt;li&gt;On the writing side, object initialization finishes before the pointer is stored to &lt;code&gt;helper&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;atomic-operation&quot;&gt;Atomic Operation&lt;&#x2F;h3&gt;
&lt;p&gt;An atomic operation either happens completely or does not happen at all.
There is no intermediate state, so the side effects of an atomic operation
are not visible until the operation is complete.&lt;&#x2F;p&gt;
&lt;p&gt;In the previous analysis, we saw that &lt;code&gt;helper = new Helper()&lt;&#x2F;code&gt; can be observed half-done
because it is not an atomic operation.
If other threads could only observe this operation as a whole, double-checked locking would work.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;volatile&quot;&gt;Volatile&lt;&#x2F;h4&gt;
&lt;p&gt;Since JDK 5, reads and writes of any variable can be made atomic by declaring
it &lt;code&gt;volatile&lt;&#x2F;code&gt;.
A useful mental model is that every read of a volatile variable bypasses any cached copy and
loads the value from main memory,
and every write of a volatile variable is
immediately flushed to main memory.
More precisely, the Java memory model guarantees that a write to a volatile variable
happens-before every subsequent read of that variable.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;volatile&lt;&#x2F;code&gt; keyword in Java also provides ordering guarantees,
similar to those of the acquire and release fences created with &lt;code&gt;atomic_thread_fence&lt;&#x2F;code&gt; in C++:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The read&#x2F;write operations of other variables after a read from a volatile 
variable cannot be reordered before the read from the volatile variable.&lt;&#x2F;li&gt;
&lt;li&gt;The read&#x2F;write operations of other variables before a write to a volatile
variable cannot be reordered after the write to the volatile variable.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
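&lt;p&gt;The two ordering rules above can be seen in a small publication example.
This is a minimal sketch under assumed names (&lt;code&gt;VolatilePublish&lt;&#x2F;code&gt;, &lt;code&gt;data&lt;&#x2F;code&gt;, &lt;code&gt;ready&lt;&#x2F;code&gt;), none of which come from the code in this post:
a writer stores to a plain field and then sets a volatile flag,
and a reader waits for the flag before reading the plain field.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```java
// Publishing a plain field through a volatile flag; all names are illustrative.
class VolatilePublish {
    static int data = 0;                     // plain, non-volatile field
    static volatile boolean ready = false;   // volatile flag

    // Runs one writer and one reader and returns the value the reader saw.
    static int runOnce() {
        data = 0;
        ready = false;
        final int[] seen = new int[1];

        Thread reader = new Thread() {
            @Override public void run() {
                while (!ready) {             // volatile read
                    Thread.yield();
                }
                seen[0] = data;              // cannot move before the volatile read
            }
        };
        Thread writer = new Thread() {
            @Override public void run() {
                data = 42;                   // cannot move after the volatile write
                ready = true;                // volatile write
            }
        };

        reader.start();
        writer.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        // prints: reader saw 42
        System.out.println("reader saw " + runOnce());
    }
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;If &lt;code&gt;ready&lt;&#x2F;code&gt; were a plain field, the reader would have no such guarantee:
it could read a stale &lt;code&gt;data&lt;&#x2F;code&gt; or never leave the loop.&lt;&#x2F;p&gt;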
&lt;p&gt;With this new feature, the double-checked locking issue is resolved by simply 
declaring &lt;code&gt;helper&lt;&#x2F;code&gt; as a volatile variable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private volatile &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {    
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
                    helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, every read and write of a volatile variable
carries these ordering constraints, which limit compiler and processor optimizations,
so volatile accesses can be noticeably slower than plain ones.
A local variable can reduce the number of
accesses to the volatile variable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private volatile &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {    
                h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
                    h &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
                    helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
                }
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; h;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When &lt;code&gt;helper&lt;&#x2F;code&gt; is already initialized,
this optimization saves one volatile read, because the method returns the local variable instead of reading &lt;code&gt;helper&lt;&#x2F;code&gt; a second time.&lt;&#x2F;p&gt;
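&lt;p&gt;To check the behavior, the volatile version can be exercised from several threads.
The harness below is a rough sketch and not part of the original code:
the &lt;code&gt;DclDemo&lt;&#x2F;code&gt; class, the &lt;code&gt;race&lt;&#x2F;code&gt; method, and the construction counter are hypothetical additions.
It counts how often the &lt;code&gt;Helper&lt;&#x2F;code&gt; constructor runs and checks that every thread receives the same instance.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```java
// Small harness around volatile double-checked locking; names are illustrative.
class DclDemo {
    // Counts how many Helper objects were constructed.
    static final java.util.concurrent.atomic.AtomicInteger constructed =
            new java.util.concurrent.atomic.AtomicInteger();

    static class Helper {
        Helper() { constructed.incrementAndGet(); }
    }

    static class Foo {
        private volatile Helper helper = null;

        Helper getHelper() {
            Helper h = helper;               // single volatile read on the fast path
            if (h == null) {
                synchronized (this) {
                    h = helper;              // second check, under the lock
                    if (h == null) {
                        h = new Helper();
                        helper = h;          // volatile write publishes the object
                    }
                }
            }
            return h;
        }
    }

    // Returns the number of constructions, or -1 if threads saw different objects.
    static int race(int threadCount) {
        constructed.set(0);
        final Foo foo = new Foo();
        final Helper[] results = new Helper[threadCount];
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i != threadCount; i++) {
            final int slot = i;
            threads[i] = new Thread() {
                @Override public void run() { results[slot] = foo.getHelper(); }
            };
        }
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        for (Helper r : results) {
            if (r != results[0]) return -1;
        }
        return constructed.get();
    }

    public static void main(String[] args) {
        // prints: constructions: 1
        System.out.println("constructions: " + race(8));
    }
}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;A passing run does not prove that a locking scheme is correct, since the broken variants also pass most of the time.
Here the guarantee comes from the &lt;code&gt;volatile&lt;&#x2F;code&gt; field and the lock, and the harness only confirms it.&lt;&#x2F;p&gt;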
&lt;h4 id=&quot;32-bit-primitive-variables&quot;&gt;32-bit Primitive Variables&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.cs.umd.edu&#x2F;%7Epugh&#x2F;java&#x2F;memoryModel&#x2F;DoubleCheckedLocking.html&quot;&gt;Pugh et al.&lt;&#x2F;a&gt; claim that double-checked locking can work safely for 32-bit
primitive values.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private int &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;magicNumber &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getMagicNumber&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(magicNumber &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(magicNumber &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
                    magicNumber &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;GetMagicNumber();
            } 
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; magicNumber;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, this specific case depends heavily on the Java memory model.
The C&#x2F;C++ equivalent of the above code is not safe.
The &lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;tutorial&#x2F;essential&#x2F;concurrency&#x2F;atomic.html&quot;&gt;Java documentation&lt;&#x2F;a&gt; specifies that
reads and writes of most primitive variables
(all except &lt;code&gt;long&lt;&#x2F;code&gt; and &lt;code&gt;double&lt;&#x2F;code&gt;, which are 64-bit) are atomic,
so a thread sees either &lt;code&gt;0&lt;&#x2F;code&gt; or a complete value, never a partially written one.
It is still not completely clear to me, though, why this alone makes the pattern safe and
how it differs from using volatile primitive variables.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;static-singleton&quot;&gt;Static Singleton&lt;&#x2F;h3&gt;
&lt;p&gt;If the &lt;code&gt;helper&lt;&#x2F;code&gt; is static, i.e., all the instances of class &lt;code&gt;Foo&lt;&#x2F;code&gt; share the 
same instance of &lt;code&gt;helper&lt;&#x2F;code&gt;, defining the &lt;code&gt;helper&lt;&#x2F;code&gt; in a static field of a separate
class will solve the problem.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private static class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;HelperSingleton &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public static final &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
    }

    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;HelperSingleton&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is known as the &lt;strong&gt;initialization-on-demand holder idiom&lt;&#x2F;strong&gt;,
which is considered a safe and efficient form of concurrent lazy initialization
in all Java versions.&lt;&#x2F;p&gt;
&lt;p&gt;As the &lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;specs&#x2F;jls&#x2F;se13&#x2F;html&#x2F;jls-12.html#jls-12.4.2&quot;&gt;Java documentation&lt;&#x2F;a&gt; specifies,
a lock is used to synchronize access to the initialization status of a class
(uninitialized&#x2F;initializing&#x2F;initialized&#x2F;erroneous state).
However, if every subsequent reference to the class required taking that lock,
this would be equivalent to the &amp;quot;always-synchronized&amp;quot; solution above.
Unfortunately, it is still unclear how Java provides efficient
unsynchronized access to classes that are already initialized.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;thread-local&quot;&gt;Thread Local&lt;&#x2F;h3&gt;
&lt;p&gt;Alexander Terekhov provided an implementation of double-checked locking using
thread local variables.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;code&gt;ThreadLocal&lt;&#x2F;code&gt; is a variable of which each thread has its own copy.
A thread can read and modify only its own copy,
independently of other threads.&lt;&#x2F;p&gt;
&lt;p&gt;A thread local can be used to record &amp;quot;whether this thread has gone
through the synchronized initialization or not&amp;quot;.
If a thread has gone through the synchronized initialization once,
it can be confident that the object is already initialized.&lt;&#x2F;p&gt;
&lt;p&gt;Inside the synchronized initialization section, 
only the first thread will find the object is &lt;code&gt;null&lt;&#x2F;code&gt; and initialize the object.
All threads will then change their per-thread state 
at the first synchronized access, 
so that they will not enter the synchronized section again.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Foo &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private final &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ThreadLocal &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;perThreadState &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ThreadLocal&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;private &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;getHelper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(perThreadState.get() &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
            &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;synchronized &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(this) {
                &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
                    helper &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= new &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Helper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;();
                perThreadState.set(perThreadState);
            }
        }
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; helper;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Admittedly, this solution is slightly more costly
than the &amp;quot;ideal&amp;quot; design:
instead of having only the first thread enter the synchronized section
and initialize the object,
each thread is required to enter the synchronized section exactly once
and change its own per-thread state (i.e., the thread local variable)
so that it never enters the synchronized section again.
However, the performance in the long run is still acceptable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This article discusses the problem of double-checked locking for lazy
initialization in multi-threaded environments.
It analyzes why some intuitive solutions do not work
and presents several solutions that do.&lt;&#x2F;p&gt;
&lt;p&gt;Writing multi-threaded programs is hard.
Writing correct and safe multi-threaded programs is even harder.
Analyzing the correctness of a multi-threaded program
requires considering multiple components,
including compilers, systems, and processors.
Conversely, when designing compilers, systems, or processors,
one also needs to take commonly used design patterns into consideration.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;&#x2F;h2&gt;
&lt;p&gt;I referred to the following documents for code examples and explanations.
Code structure and variable names were modified for consistency within this post.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;www.cs.umd.edu&#x2F;%7Epugh&#x2F;java&#x2F;memoryModel&#x2F;DoubleCheckedLocking.html&quot;&gt;The &amp;quot;Double-Checked Locking is Broken&amp;quot; Declaration&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;http:&#x2F;&#x2F;www.cs.umd.edu&#x2F;users&#x2F;pugh&#x2F;java&#x2F;memoryModel&#x2F;jsr-133-faq.html&quot;&gt;Java Memory Model&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Double-checked_locking&quot;&gt;Double-Checked Locking&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lazy_initialization&quot;&gt;Lazy Initialization&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Initialization-on-demand_holder_idiom&quot;&gt;Initialization-on-demand Holder Idiom&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;preshing.com&#x2F;20130930&#x2F;double-checked-locking-is-fixed-in-cpp11&#x2F;&quot;&gt;Double-Checked Locking is Fixed In C++11&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;en.cppreference.com&#x2F;w&#x2F;cpp&#x2F;atomic&#x2F;atomic_thread_fence&quot;&gt;C++ Reference: Atomic Thread Fence&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;tutorial&#x2F;essential&#x2F;concurrency&#x2F;atomic.html&quot;&gt;Java Documentation: Atomic Access&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;specs&#x2F;jls&#x2F;se13&#x2F;html&#x2F;jls-12.html#jls-12.4.2&quot;&gt;Java Documentation: Initialization Procedure&lt;&#x2F;a&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
</description>
            </item>
        
            <item>
                <title>Compiler Optimizations for Improving Data Locality</title>
                <pubDate>Wed, 09 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loops/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/loops/</guid>
                <description>&lt;p&gt;Processing speed has long surpassed that of memory in modern computation.
Applications today deal with a massive amount of information that must at
some point be held in memory. The overhead cost of transferring data
continues to limit the performance of many applications. Furthermore, managing
data movement is becoming increasingly complex as computers depart from von Neumann designs
towards heterogeneous architectures, which induce additional data transfer.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Compiler Optimizations for Improving Data Locality&lt;&#x2F;em&gt; by Carr, McKinley, and
Tseng tackles this problem by arguing that it is a task for the compiler.
The paper focuses on improving the order of memory accesses. Table 1
illustrates the differences between a computer from around the time of the paper and one from today. The
problems faced in 1994 are very much present today: data transfer is the main
limiting factor in computation.&lt;&#x2F;p&gt;
&lt;img src=&quot;table.jpg&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;This table shows the fastest computer from when this paper was written (1994) and from today (2019).&lt;&#x2F;p&gt;
&lt;img src=&quot;moore.jpg&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;This figure shows how CPU performance has outpaced memory performance, and
how this gap shows no signs of closing. As a result of this massive
difference in CPU performance and memory performance, more and more kinds and
levels of caches have been introduced. This preponderance of caches has led
to cache locality becoming a central factor in program performance.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;background&quot;&gt;Background!&lt;&#x2F;h1&gt;
&lt;h4 id=&quot;memory-hierarchy&quot;&gt;Memory Hierarchy&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;em&gt;Data locality&lt;&#x2F;em&gt; takes advantage of smaller, faster, volatile cache memories by
keeping data close at hand for computation. Reusing lines of cache, both
spatially and temporally, allows computation to proceed without waiting
long for the data to arrive. Large and increasingly
complex memory hierarchies are implemented for a variety of reasons. Sadly,
not everything is fast cache 💸.&lt;&#x2F;p&gt;
&lt;p&gt;As opposed to a single central memory, a variety of technologies are used to
store information and prepare it for computation by CPUs, GPUs, or TPUs. Each
processing unit has registers right at the edge of computation. They are
typically the fastest storage, largely due to their close proximity to the logic unit
that performs computation. After registers come caches. These are also very
fast yet limited in size, so they pay off only when programs exhibit locality.&lt;&#x2F;p&gt;
&lt;p&gt;Beyond the physical memory storage are the interconnects between all
components. The slowest transfer speeds are found in the cheaper and more robust
main memory, solid state drives, hard disks, and long-term tape
storage. The further the data gets from the processing units, the less we
want to access it. It’s too far for the data to walk.&lt;&#x2F;p&gt;
&lt;img src=&quot;memory_hierarchy.jpg&quot; style=&quot;width: 50%&quot;&gt;
&lt;p&gt;This figure shows a typical memory hierarchy. In considering memory speeds,
we must take into account the time it takes to get between storage, caches,
registers, processors—in addition to how much data can sit at each point on
the way.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;data-dependence&quot;&gt;Data Dependence&lt;&#x2F;h4&gt;
&lt;p&gt;A data dependence exists between two statements A and B if there is a path from A to B, both access the same memory location, and at least one of them writes to it. Here we have a tree for making a peanut butter and jelly sandwich. In order to spread jelly on the bread, you need the bread, so the spreading step depends on the step that gets the bread. The same idea extends to operations accessing memory locations, and thus we can build a data dependence tree.&lt;&#x2F;p&gt;
&lt;img src=&quot;data_dependence_pbj.jpg&quot; style=&quot;width: 100%&quot;&gt;
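&lt;p&gt;As a rough sketch of how such dependences can be detected, the Python snippet below classifies the dependence between two statements from the sets of locations they read and write. The &lt;code&gt;(reads, writes)&lt;&#x2F;code&gt; encoding, the function name &lt;code&gt;dependence&lt;&#x2F;code&gt;, and the three example statements are made up for illustration. The flow, anti, and output labels are the usual textbook names, and the sketch assumes the common refinement that at least one of the two accesses must be a write.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def dependence(first, second):
    """Classify the data dependence from first to second.

    Each statement is a (reads, writes) pair of sets of names, and
    first is assumed to execute before second.
    """
    reads1, writes1 = first
    reads2, writes2 = second
    kinds = []
    if writes1.intersection(reads2):
        kinds.append("flow")      # second reads what first wrote
    if reads1.intersection(writes2):
        kinds.append("anti")      # second overwrites what first read
    if writes1.intersection(writes2):
        kinds.append("output")    # both write the same location
    return kinds

s1 = ({"b", "c"}, {"a"})   # a = b + c
s2 = ({"a"}, {"d"})        # d = a * 2
s3 = ({"b"}, {"e"})        # e = b + 1

print(dependence(s1, s2))  # flow dependence through a
print(dependence(s1, s3))  # no dependence: b is only read
&lt;&#x2F;pre&gt;
&lt;p&gt;Two statements that only read a shared location, like the first and third here, can be reordered freely.&lt;&#x2F;p&gt;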
&lt;h1 id=&quot;loop-optimizations&quot;&gt;Loop Optimizations&lt;&#x2F;h1&gt;
&lt;h4 id=&quot;loop-permutation&quot;&gt;Loop Permutation&lt;&#x2F;h4&gt;
&lt;p&gt;Loop permutation is perhaps the most straightforward loop optimization.
Iterating over arrays in the wrong order is one of the easiest ways to cause
a huge number of cache misses. It’s also the cause of an endless &lt;a href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;33722520&#x2F;why-is-iterating-2d-array-row-major-faster-than-column-major&quot;&gt;number&lt;&#x2F;a&gt; of &lt;a href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;13093155&#x2F;c-2d-array-access-speed-changes-based-on-ab-order&quot;&gt;Stack Overflow&lt;&#x2F;a&gt;
&lt;a href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;9936132&#x2F;why-does-the-order-of-the-loops-affect-performance-when-iterating-over-a-2d-arra&quot;&gt;questions&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Take these two snippets of code (which can also be found online &lt;a href=&quot;http:&#x2F;&#x2F;ideone.com&#x2F;PUJhdP&quot;&gt;here&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;A. Column-Major&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int A[DIM1][DIM2];
for (int iter = 0; iter &amp;lt; iters; iter++)
    for (int j = 0; j &amp;lt; DIM2; j ++)
        for (int i = 0; i &amp;lt; DIM1; i++)
            A[i][j]++;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;B. Row-Major&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int A[DIM1][DIM2];
for (int iter = 0; iter &amp;lt; iters; iter++)
    for (int i = 0; i &amp;lt; DIM1; i++)
        for (int j = 0; j &amp;lt; DIM2; j ++)
            A[i][j]++;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both of these loop nests perform &lt;code&gt;DIM1 * DIM2&lt;&#x2F;code&gt; increments per outer iteration. The only difference is
whether they iterate through &lt;code&gt;A&lt;&#x2F;code&gt; in row-major or column-major order. How much do
you think their performance differs?&lt;&#x2F;p&gt;
&lt;p&gt;If you’re wise to the ways of cache locality, you might guess that B is faster. And for
suitably large values of &lt;code&gt;DIM1&lt;&#x2F;code&gt; and &lt;code&gt;DIM2&lt;&#x2F;code&gt;, you’d be right!&lt;&#x2F;p&gt;
&lt;p&gt;With DIM1=1024, DIM2=1024, and iters=1e3, we get an order of magnitude win for B!&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A (Column-Major): 4916ms&lt;&#x2F;li&gt;
&lt;li&gt;B (Row-Major): 485ms&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Loop permutation captures this optimization. The picture below gives some
intuition for what&#x27;s happening. Imagine that the grid
represents a 2-dimensional array. Since 2-dimensional arrays are
actually laid out as 1-dimensional arrays in memory (row by row in C), each cache line runs along a
particular axis.&lt;&#x2F;p&gt;
&lt;p&gt;Now, compare the 2 access patterns (the blue line and the red line). The blue
access pattern loads a single cache line and then accesses all of its
elements before moving on to the next cache line. The red access pattern,
however, loads a single cache line, accesses a single element, and then moves
on to the next cache line.&lt;&#x2F;p&gt;
&lt;img src=&quot;cache_lines.jpeg&quot; style=&quot;width: 100%&quot;&gt;
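&lt;p&gt;To put rough numbers on the picture, here is a small Python simulation that counts cache-line loads for both traversal orders. It is only a sketch under strong assumptions: the array is stored row by row as in C, and the toy cache remembers just the most recently loaded line. The function name &lt;code&gt;count_line_loads&lt;&#x2F;code&gt; and the sizes (a 4 by 16 array with 8 elements per line) are chosen purely for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def count_line_loads(order, rows, cols, line_elems):
    """Count cache-line loads for one pass over a rows-by-cols array.

    Assumes a row-major (C-style) layout and a toy cache that only
    remembers the most recently loaded line.
    """
    if order == "row":
        indices = [(i, j) for i in range(rows) for j in range(cols)]
    else:
        indices = [(i, j) for j in range(cols) for i in range(rows)]
    loads = 0
    current_line = None
    for i, j in indices:
        address = i * cols + j                   # flat offset of A[i][j]
        line, _ = divmod(address, line_elems)    # index of the line holding it
        if line != current_line:
            loads += 1
            current_line = line
    return loads

row_loads = count_line_loads("row", 4, 16, 8)
col_loads = count_line_loads("column", 4, 16, 8)
print(row_loads, col_loads)   # prints: 8 64
&lt;&#x2F;pre&gt;
&lt;p&gt;In this model the row-major pass loads each line once, while the column-major pass loads a line for every single element. A real cache holds many lines, so actual miss counts depend on the hardware and on the array size.&lt;&#x2F;p&gt;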
&lt;h4 id=&quot;loop-reversal&quot;&gt;Loop Reversal&lt;&#x2F;h4&gt;
&lt;p&gt;Loop reversal simply reverses the order in which a loop&#x27;s iterations run. This does 2 things. First,
it may allow the use of more efficient jump operations (for example, &lt;code&gt;JMPZ&lt;&#x2F;code&gt;), since a loop that counts down can test against zero.
Second, it reverses the direction of data dependencies. This can serve as a kind
of canonicalization, and can also allow other loop optimizations to be
applied. In the paper, loop reversal didn’t improve data locality in any of
the benchmarks.&lt;&#x2F;p&gt;
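&lt;p&gt;The effect on data dependencies can be seen in a minimal Python sketch. The four functions below are hypothetical examples, not taken from the paper: an elementwise loop, whose iterations are independent, and a running-sum loop, where each iteration reads the value written by the previous one.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def scale_forward(values, k):
    out = list(values)
    for i in range(len(out)):
        out[i] = out[i] * k              # touches only element i
    return out

def scale_reversed(values, k):
    out = list(values)
    for i in range(len(out) - 1, -1, -1):
        out[i] = out[i] * k              # same work, opposite order
    return out

def prefix_forward(values):
    out = list(values)
    for i in range(1, len(out)):
        out[i] = out[i] + out[i - 1]     # reads the value written one iteration earlier
    return out

def prefix_reversed(values):
    out = list(values)
    for i in range(len(out) - 1, 0, -1):
        out[i] = out[i] + out[i - 1]     # now reads a value that is not updated yet
    return out

print(scale_forward([1, 2, 3], 2), scale_reversed([1, 2, 3], 2))
print(prefix_forward([1, 2, 3]), prefix_reversed([1, 2, 3]))
&lt;&#x2F;pre&gt;
&lt;p&gt;Reversing the elementwise loop leaves the result unchanged, so the transformation is legal there. Reversing the running sum changes the result, because the dependence between iterations now points the other way.&lt;&#x2F;p&gt;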
&lt;h4 id=&quot;loop-fusion&quot;&gt;Loop Fusion&lt;&#x2F;h4&gt;
&lt;p&gt;Loading in each cache line takes a significant amount of time. If we have
multiple loops, we run into the possibility that we will load a single cache
line multiple times, wasting time.&lt;&#x2F;p&gt;
&lt;p&gt;For example, take this code (an online example can be found &lt;a href=&quot;http:&#x2F;&#x2F;ideone.com&#x2F;OnbRXU&quot;&gt;here&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (int i = 0; i &amp;lt; MAXN; i++)
    A[i] += j;
for (int i = 0; i &amp;lt; MAXN; i++)
    A[i] *= j;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It’s easy to see that this code can perform redundant cache line loads. If &lt;code&gt;A&lt;&#x2F;code&gt; is too large to stay in the cache, we
load the cache line that &lt;code&gt;A[0]&lt;&#x2F;code&gt; belongs to twice: once in the first loop and
once in the second. We can speed this up by fusing the loops. This way,
we load a cache line and perform both operations on it at once, before
loading another cache line.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (int i = 0; i &amp;lt; MAXN; i++) {
    A[i] += j;
    A[i] *= j;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Locally, this gives me 660ms for the unfused one and 411ms for the fused one.&lt;&#x2F;p&gt;
&lt;p&gt;Loop fusion is a big deal on CPUs, but it’s an even bigger deal on GPUs (and
other hardware accelerators). As opposed to, say, loop permutation, which
simply improves data locality, loop fusion can actually reduce the number of
memory loads needed. For example, in the above, it’s a trivial optimization
to then rewrite it as&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (int i = 0; i &amp;lt; MAXN; i++) {
    int t = A[i];
    t += j;
    t *= j;
    A[i] = t;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This halving in memory loads can often translate directly to halving of
runtime in more memory bound systems (like GPUs).&lt;&#x2F;p&gt;
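&lt;p&gt;The halving can be checked with a small Python sketch that counts element loads. The &lt;code&gt;CountingArray&lt;&#x2F;code&gt; wrapper and the two function names are hypothetical helpers written for this illustration. They count source-level reads and writes of the array, which is only a stand-in for what a compiler and the hardware would actually do.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
class CountingArray:
    """List wrapper that counts element loads and stores."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0
        self.stores = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __setitem__(self, i, value):
        self.stores += 1
        self.values[i] = value

def unfused(a, j):
    for i in range(len(a.values)):
        a[i] += j                  # one load and one store
    for i in range(len(a.values)):
        a[i] *= j                  # a second load and store of the same element

def fused_with_temp(a, j):
    for i in range(len(a.values)):
        t = a[i]                   # single load
        t += j
        t *= j
        a[i] = t                   # single store

first = CountingArray([1, 2, 3])
second = CountingArray([1, 2, 3])
unfused(first, 2)
fused_with_temp(second, 2)
print(first.loads, second.loads)   # prints: 6 3
&lt;&#x2F;pre&gt;
&lt;p&gt;Both versions leave the same values in the array; only the number of loads and stores differs.&lt;&#x2F;p&gt;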
&lt;h4 id=&quot;loop-fission-loop-distribution&quot;&gt;Loop Fission&#x2F;Loop Distribution&lt;&#x2F;h4&gt;
&lt;p&gt;This is the opposite of loop fusion. Although loop fusion is useful if you
can reduce memory loads, it can be counter-productive to have unrelated
operations jammed together into a single loop nest. Not only does it
introduce more memory pressure, it also doesn’t allow optimizations like loop
permutation to be applied to a single operation at a time.&lt;&#x2F;p&gt;
&lt;p&gt;For example, take a loop like this&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int A[MAXN][MAXN], B[MAXN][MAXN];
for (int i = 0; i &amp;lt; MAXN; i++) {
    for (int j=0; j&amp;lt;MAXN; j+=2) {
        A[i][j] ++;
        B[j][i] ++;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As seen in the loop permutation section, we’d like to iterate along both A
and B in row-major order. However, because the operations on A and B
are in one loop nest, no single loop order does this for both arrays.
If we split the loop, we can give each array its own loop order:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;int A[MAXN][MAXN], B[MAXN][MAXN];

for (int i = 0; i &amp;lt; MAXN; i++)
    for (int j=0; j&amp;lt;MAXN; j+=2)
        A[i][j]++;

for (int j=0; j&amp;lt;MAXN; j+=2)
    for (int i = 0; i &amp;lt; MAXN; i++)
        B[j][i]++;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This way, both arrays are traversed in the optimal order.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cost-model&quot;&gt;Cost Model&lt;&#x2F;h2&gt;
&lt;p&gt;For class discussion, think about the following algorithm:&lt;&#x2F;p&gt;
&lt;img src=&quot;cost_model.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;The paper uses this cost model to determine the optimal sequence of loop fusion, fission, and permutation to apply.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;future-work&quot;&gt;Future Work&lt;&#x2F;h1&gt;
&lt;p&gt;The loop optimizations presented here are still used everywhere. However,
these are perhaps the more straightforward optimizations to make. In
particular, they don&#x27;t deal with any optimizations (besides loop reversal)
that change the structure of a loop. Writing optimized loop nests is far more
complicated than that. For a phenomenal introduction to the types of
optimizations that need to be done for modern loop nests, check out &lt;a href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=3uiEyEKji0M&quot;&gt;this
talk from one of Halide&#x27;s
creators&lt;&#x2F;a&gt;. The first 15 minutes
aren&#x27;t related to Halide at all, and serve as a fantastic introduction. I&#x27;ll
provide some notes about important topics here.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;parallelism&quot;&gt;Parallelism&lt;&#x2F;h3&gt;
&lt;p&gt;Parallelism across CPU cores is crucial for performance. Depending on how you
compute your values, you may introduce serial dependencies that make it more
difficult to parallelize your code.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;vectorization&quot;&gt;Vectorization&lt;&#x2F;h3&gt;
&lt;p&gt;Vectorization refers to taking advantage of SIMD instructions in your
hardware. These instructions can do things like &amp;quot;sum up the 4 values at
&lt;code&gt;x[i]&lt;&#x2F;code&gt; through &lt;code&gt;x[i+3]&lt;&#x2F;code&gt;&amp;quot; significantly faster than doing them one element at
a time.&lt;&#x2F;p&gt;
&lt;p&gt;Taking advantage of this requires some care. For example, if your inner loop
is too small, then you often won&#x27;t be able to take full advantage of SIMD
instructions, which operate on 8 or 16 numbers at a time.&lt;&#x2F;p&gt;
&lt;p&gt;The fact that they operate on 8 or 16 numbers at a time also means that
utilizing them requires some special case handling - for example, what if you
have 18 elements?&lt;&#x2F;p&gt;
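&lt;p&gt;One common answer is to process as many full blocks as possible and finish the leftover elements with an ordinary scalar loop. The Python sketch below shows only that loop structure, assuming a vector width of 8. The function &lt;code&gt;chunked_sum&lt;&#x2F;code&gt; is a made-up name, and Python does not actually issue SIMD instructions here.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def chunked_sum(values, width=8):
    """Sum values in blocks of width elements, then handle the leftover tail.

    Each inner loop over lanes stands in for one SIMD add; Python does
    not vectorize it, the point is only the loop structure.
    """
    blocks, leftover = divmod(len(values), width)
    full = blocks * width                  # elements covered by whole blocks
    lanes = [0] * width
    for start in range(0, full, width):
        for lane in range(width):          # conceptually one vector instruction
            lanes[lane] += values[start + lane]
    total = sum(lanes)
    for k in range(full, len(values)):     # scalar cleanup loop for the tail
        total += values[k]
    return total, blocks, leftover

total, blocks, leftover = chunked_sum(list(range(18)))
print(total, blocks, leftover)   # prints: 153 2 2
&lt;&#x2F;pre&gt;
&lt;p&gt;With 18 elements and a width of 8, two full blocks cover 16 elements and the cleanup loop handles the remaining 2.&lt;&#x2F;p&gt;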
&lt;h3 id=&quot;tiling&quot;&gt;Tiling&lt;&#x2F;h3&gt;
&lt;p&gt;Imagine that we wished to compute a multiplication table of sorts. The code for that would look something like the following (an online example can be found &lt;a href=&quot;http:&#x2F;&#x2F;ideone.com&#x2F;4HRl3F&quot;&gt;here&lt;&#x2F;a&gt;). Note that, more so than the other optimizations, tiling often depends on the particular hardware used.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;for (int x = 0; x&amp;lt;MAXN; x++)
    for (int y =0; y &amp;lt;MAXN; y++)
        A[x][y] = rows[x] * columns[y];
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that in this implementation, although each element in &lt;code&gt;rows&lt;&#x2F;code&gt; will be
loaded only once and reused until it&#x27;s no longer needed, each element in
&lt;code&gt;columns&lt;&#x2F;code&gt; will be loaded once for every single row.&lt;&#x2F;p&gt;
&lt;img src=&quot;tiling_bad.png&quot; style=&quot;width: 80%&quot;&gt;
&lt;p&gt;A rough estimate gives us 20 &amp;quot;bad&amp;quot; reads - each read from &lt;code&gt;columns&lt;&#x2F;code&gt; won&#x27;t be
in the cache, and the first read of &lt;code&gt;rows[x]&lt;&#x2F;code&gt; won&#x27;t be in the cache either.&lt;&#x2F;p&gt;
&lt;p&gt;If we simply permuted the loops we could solve this issue for &lt;code&gt;columns&lt;&#x2F;code&gt;...
but then we&#x27;d have this issue for &lt;code&gt;rows&lt;&#x2F;code&gt;. Note that if we had significantly
more rows or columns, this could be the optimal answer.&lt;&#x2F;p&gt;
&lt;p&gt;However, there is a better option. We can select a compromise - we select
some number of row elements and some number of column elements, and compute
the outputs that depend on those elements.&lt;&#x2F;p&gt;
&lt;img src=&quot;tiling_good.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;Naively, this now only requires 12 &amp;quot;bad&amp;quot; reads. To compute the first 2
columns, we only need to read 6 elements, and same for the second 2 columns.&lt;&#x2F;p&gt;
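&lt;p&gt;These two estimates can be reproduced with a small Python simulation. It is a sketch under assumptions that are not stated in the post: a 4 by 4 table, tiles that span all 4 rows and 2 columns, and a tiny fully associative LRU cache holding 4 elements. Writes to the output array are ignored, and the names &lt;code&gt;count_misses&lt;&#x2F;code&gt;, &lt;code&gt;untiled&lt;&#x2F;code&gt;, and &lt;code&gt;tiled&lt;&#x2F;code&gt; are made up for this example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
from collections import OrderedDict

def count_misses(accesses, capacity):
    """Count misses for a tiny fully associative LRU cache."""
    cache = OrderedDict()
    misses = 0
    for key in accesses:
        if key in cache:
            cache.move_to_end(key)           # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict the least recently used entry
            cache[key] = True
    return misses

def untiled(n):
    for x in range(n):
        for y in range(n):
            yield ("rows", x)
            yield ("columns", y)

def tiled(n, tile):
    for j in range(0, n, tile):              # one strip of columns at a time
        for x in range(n):
            for y in range(j, j + tile):
                yield ("rows", x)
                yield ("columns", y)

untiled_misses = count_misses(untiled(4), capacity=4)
tiled_misses = count_misses(tiled(4, 2), capacity=4)
print(untiled_misses, tiled_misses)   # prints: 20 12
&lt;&#x2F;pre&gt;
&lt;p&gt;The tiled order simulated here is the same compromise as in the picture: a few row elements and a few column elements are kept in the cache while every output that depends on them is computed.&lt;&#x2F;p&gt;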
&lt;p&gt;In code, that would translate to:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
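&lt;span style=&quot;color:#323232;&quot;&gt;int B = 8;  &#x2F;&#x2F; tile size; assumes MAXN is a multiple of B
for (int t = 0; t &amp;lt; NUMITERS; t++)  &#x2F;&#x2F; repeat the whole computation NUMITERS times
    for (int i = 0; i &amp;lt; MAXN; i += B)
        for (int j = 0; j &amp;lt; MAXN; j += B)
            &#x2F;&#x2F; compute one B-by-B tile of the output
            for (int x = i; x &amp;lt; i + B; x++)
                for (int y = j; y &amp;lt; j + B; y++)
                    A[x][y] = rows[x] * columns[y];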
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;defining-the-space-of-loop-optimizations&quot;&gt;Defining the Space of Loop Optimizations&lt;&#x2F;h3&gt;
&lt;p&gt;Looking at all of these loop optimizations and the difficult ways in which
they interact is a fairly daunting task. One avenue of active research is
defining the space of loop optimizations. There are 2 primary efforts I&#x27;m
aware of: the polyhedral model and Halide. Both of these models define a
space of loop optimizations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;cost-models&quot;&gt;Cost Models&lt;&#x2F;h3&gt;
&lt;p&gt;Although the previous models mentioned allow you to express the space of
legal loop transformations, they don&#x27;t tell you what loop transformations you
should be performing. Some recent work has focused on learning a cost model
for Halide schedules, and using that to guide an &lt;a href=&quot;https:&#x2F;&#x2F;halide-lang.org&#x2F;papers&#x2F;autoscheduler2019.html&quot;&gt;autoscheduler for Halide
programs&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;references&quot;&gt;References&lt;&#x2F;h1&gt;
&lt;p&gt;[1]      H. Miyoshi et al., “Development and achievement of NAL numerical wind tunnel (NWT) for CFD computations,” in Proceedings of the ACM&#x2F;IEEE Supercomputing Conference, 1994, pp. 685–692.&lt;&#x2F;p&gt;
&lt;p&gt;[2]      “National Aerospace Laboratory of Japan’s Numerical Wind Tunnel,” Information Processing Society of Japan Computer Museum. [Online]. Available: http:&#x2F;&#x2F;museum.ipsj.or.jp&#x2F;en&#x2F;computer&#x2F;super&#x2F;0020.html. [Accessed: 03-Oct-2019].&lt;&#x2F;p&gt;
&lt;p&gt;[3]      Y. Matsuo, “Special contribution numerical wind tunnel: History and evolution of supercomputing,” Fujitsu Scientific and Technical Journal, vol. 53, no. 3. pp. 15–23, 2017.&lt;&#x2F;p&gt;
&lt;p&gt;[4]      “Summit User Guide – Oak Ridge Leadership Computing Facility,” 2019. [Online]. Available: https:&#x2F;&#x2F;www.olcf.ornl.gov&#x2F;for-users&#x2F;system-user-guides&#x2F;summit&#x2F;summit-user-guide&#x2F;. [Accessed: 03-Oct-2019].&lt;&#x2F;p&gt;
&lt;p&gt;[5]      P. Wang, “Unified Memory on P100.”&lt;&#x2F;p&gt;
&lt;p&gt;[6]      G. Goff, K. Kennedy, and C. W. Tseng, “Practical dependence testing,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1991, pp. 15–29.&lt;&#x2F;p&gt;
&lt;p&gt;[7]      G. Rivera and C. W. Tseng, “A comparison of compiler tiling algorithms,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1999, vol. 1575, pp. 168–183.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Efficient Instruction Scheduling for Pipelined Architectures</title>
                <pubDate>Fri, 04 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/instruction-scheduling/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/instruction-scheduling/</guid>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;A pipelined architecture allows machine instructions to overlap one another for greater throughput, but it comes at the cost of &lt;em&gt;pipeline hazards&lt;&#x2F;em&gt;. These hazards emerge when one structural or data resource is needed by more than one instruction, forcing the hardware to resolve the hazard by delaying subsequent instructions for one cycle, known as a pipeline interlock (also called a stall). These interlocks decrease throughput, and this paper proposes a heuristic for minimizing them that has a better worst-case runtime than other solutions while producing comparable results.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;pipelined-processors&quot;&gt;Pipelined Processors&lt;&#x2F;h3&gt;
&lt;p&gt;As a quick refresher, pipelined processors allow different instructions to execute on different parts of the processor at the same time, which offers significant improvements over single-stage processors, where only one instruction is live at a time. A very basic pipeline structure may have the following stages, in order: Fetch, Decode, Execute, Memory, Writeback.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pipeline-hazards&quot;&gt;Pipeline Hazards&lt;&#x2F;h3&gt;
&lt;p&gt;You may recall from CS 3410 or another equivalent systems class that there are three types of hazards:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Structural - A hardware resource is needed by multiple instructions in one cycle.&lt;&#x2F;li&gt;
&lt;li&gt;Data - A piece of information is needed before it is available.&lt;&#x2F;li&gt;
&lt;li&gt;Control - A branch is not resolved when the next instruction location is needed.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We can ignore control hazards as this paper only reorders instructions within a basic block.&lt;&#x2F;p&gt;
&lt;p&gt;The architecture this paper is based on has three hazards:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Loading to a register from memory and then using that register as a source. This is commonly known as a load-use hazard.&lt;&#x2F;li&gt;
&lt;li&gt;Any store followed by any load.&lt;&#x2F;li&gt;
&lt;li&gt;Loading from memory followed by any arithmetic or logical instruction.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;For example, this program with a load-use hazard:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
load    0(sp), r0
add     #1, r0, r0  &#x2F;&#x2F;hazard caused by use of r0 immediately after load
add     #5, r1, r1
&lt;&#x2F;pre&gt;
&lt;p&gt;Rescheduling it as follows eliminates the hazard:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
load    0(sp), r0
add     #5, r1, r1
add     #1, r0, r0  &#x2F;&#x2F;r0 not used immediately after load
&lt;&#x2F;pre&gt;
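&lt;p&gt;To make the effect of the reordering concrete, here is a minimal Python sketch that counts load-use interlocks in both orders. The tuple encoding of instructions and the function name &lt;code&gt;count_load_use_interlocks&lt;&#x2F;code&gt; are our own assumptions for illustration; they do not come from the paper.&lt;&#x2F;p&gt;

```python
# Hypothetical encoding: (opcode, registers read, registers written).
def count_load_use_interlocks(block):
    """Count adjacent pairs where a load's destination is read by the next instruction."""
    interlocks = 0
    for prev, curr in zip(block, block[1:]):
        op, _, defs = prev
        _, uses, _ = curr
        if op == "load" and set(defs).intersection(uses):
            interlocks += 1
    return interlocks


original = [
    ("load", ["sp"], ["r0"]),  # load 0(sp), r0
    ("add", ["r0"], ["r0"]),   # add  #1, r0, r0
    ("add", ["r1"], ["r1"]),   # add  #5, r1, r1
]
rescheduled = [original[0], original[2], original[1]]

print(count_load_use_interlocks(original))     # 1
print(count_load_use_interlocks(rescheduled))  # 0
```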
&lt;p&gt;We were unsure of why the second and third hazards presented by the paper were problematic, and we will discuss them later in this post. For now, it is sufficient to accept that they are hazards for the paper&#x27;s target architecture.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;goals&quot;&gt;Goals&lt;&#x2F;h3&gt;
&lt;p&gt;The authors wanted to target a range of architectures that could differ in what constitutes a hazard and in how interlocks are implemented. Given that range, the goal was not to prevent all interlocks on all architectures, but rather to design a heuristic algorithm that performs well in general. They also wanted this algorithm to be as efficient as possible so that it would be practical.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;assumptions&quot;&gt;Assumptions&lt;&#x2F;h3&gt;
&lt;p&gt;To create an algorithm that was generalizable across architectures, the authors made three important assumptions to simplify the problem:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Each memory location is assumed to be referenced by an offset from one base register.
&lt;ul&gt;
&lt;li&gt;We were unsure of why this was needed. Our best guess is that more complex addressing modes could take too long to calculate, such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Addressing_mode#Scaled&quot;&gt;scaled indexing&lt;&#x2F;a&gt;, which could need multiplication, or &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Addressing_mode#Memory_indirect&quot;&gt;memory indirect indexing&lt;&#x2F;a&gt;, which could take multiple cycles to return.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;All pointers are assumed to alias (though this can be made tighter if the compiler produced aliasing information).&lt;&#x2F;li&gt;
&lt;li&gt;The target architecture has hardware hazard detection with interlocks, so it is not necessary to remove all hazards for the program to run correctly.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;technical-approach&quot;&gt;Technical Approach&lt;&#x2F;h3&gt;
&lt;p&gt;This optimization is carried out by reordering assembly instructions after code generation and register allocation. It acts on basic blocks and is a transformation from assembly to assembly. To create such a transformation, the scheduling constraints first need to be modeled, and then a heuristic for selecting an order must be applied while abiding by those constraints. The scheduling constraints define the set of orderings that guarantee correctness, and the heuristic chooses the ordering that is most likely to have the fewest hazards.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;expressing-constraints&quot;&gt;Expressing Constraints&lt;&#x2F;h4&gt;
&lt;p&gt;As instructions cannot be arbitrarily reordered due to dependencies, they are placed in a directed acyclic graph (dag) where each node (an instruction) succeeds the instruction(s) it is dependent on. In terms of scheduling, this means that parent nodes must be executed before child nodes, and root nodes do not have dependencies so they can be placed wherever convenient. &lt;&#x2F;p&gt;
&lt;p&gt;The dag serializes instructions based on three criteria, each shown with an example:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Definitions vs. definitions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
 load    -8(sp),r4    &#x2F;&#x2F;r4 is defined
 add     #1, r0, r4   &#x2F;&#x2F;r4 is redefined
&lt;&#x2F;pre&gt;&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Definitions vs. uses:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
 store   r0, A        &#x2F;&#x2F;A is defined
 load    A, r5        &#x2F;&#x2F;A is used as a source
&lt;&#x2F;pre&gt;&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Uses vs. definitions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
 load    -4(sp), r3   &#x2F;&#x2F;sp is used
 add     #8, sp, sp   &#x2F;&#x2F;sp is defined (and also used)
&lt;&#x2F;pre&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;These criteria are broad enough to account for all serialization constraints. Deployed on a larger example, the dependency dag will look like:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Instruction List&lt;&#x2F;th&gt;&lt;th&gt;Dependency Dag&lt;&#x2F;th&gt;&lt;th&gt;Reordered Instructions&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;ins1.png&quot; style=&quot;width: 100%&quot;&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;img src=&quot;dag.png&quot; style=&quot;width: 100%&quot;&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;img src=&quot;ins2.png&quot; style=&quot;width: 100%&quot;&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This dag is created by scanning backward through the block, and for each instruction, finding the definitions or uses that precede it. As such, this construction costs O(n&lt;sup&gt;2&lt;&#x2F;sup&gt;) where n is the number of instructions in the basic block.&lt;&#x2F;p&gt;
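&lt;p&gt;The three serialization criteria and the quadratic cost can be sketched as follows. This is only an illustration under our own assumptions: each instruction is reduced to a pair of (resources read, resources written), memory locations are treated as named resources, and the helper &lt;code&gt;build_dag&lt;&#x2F;code&gt; is hypothetical. It compares every pair of instructions rather than scanning backward as the paper does, and it keeps transitive edges.&lt;&#x2F;p&gt;

```python
def build_dag(block):
    """Return edges (i, j) meaning instruction i must stay before instruction j."""
    edges = set()
    for j in range(len(block)):
        uses_j, defs_j = map(set, block[j])
        for i in range(j):  # every earlier instruction: O(n^2) pairs overall
            uses_i, defs_i = map(set, block[i])
            def_def = defs_i.intersection(defs_j)  # definition vs. definition
            def_use = defs_i.intersection(uses_j)  # definition vs. use
            use_def = uses_i.intersection(defs_j)  # use vs. definition
            if def_def or def_use or use_def:
                edges.add((i, j))
    return edges


# (resources read, resources written); "A" stands for a memory location.
block = [
    (["sp"], ["r4"]),  # load  -8(sp), r4
    (["r0"], ["r4"]),  # add   #1, r0, r4
    (["r0"], ["A"]),   # store r0, A
    (["A"], ["r5"]),   # load  A, r5
    (["sp"], ["sp"]),  # add   #8, sp, sp
]
print(sorted(build_dag(block)))  # [(0, 1), (0, 4), (2, 3)]
```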
&lt;p&gt;There are also carry&#x2F;borrow dependencies, which are definitions or uses of carry&#x2F;borrow bits. These should be treated similarly to a register since they are another stateful processor resource. They are changed during arithmetic operations where a carry or borrow is produced, making them frequently defined but rarely used. Adding them to the dependency dag would be unnecessarily constraining, so the authors placed them in a special subgraph for instructions that use a carry or borrow.&lt;&#x2F;p&gt;
&lt;p&gt;This dag representation differs from other literature on instruction ordering in that it includes edges for definitions vs. definitions. This is necessary for the final definition of a resource at the end of a basic block (as it could be used by instructions that follow the basic block), or for defs to one register followed by a read from the same register.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;selecting-an-order-the-static-evaluator&quot;&gt;Selecting an Order: The Static Evaluator&lt;&#x2F;h4&gt;
&lt;p&gt;Now, using this dag, any scheduling order following a topological sort will produce an execution indistinguishable from the original order.&lt;&#x2F;p&gt;
&lt;p&gt;Their algorithm travels down the dag from the roots and selects &lt;em&gt;candidates&lt;&#x2F;em&gt;: instructions whose immediate predecessors have all been scheduled (or root instructions).&lt;&#x2F;p&gt;
&lt;p&gt;When choosing the &amp;quot;best&amp;quot; candidate to schedule, they provide two guidelines:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Schedule an instruction that will not interlock with the one just scheduled (if possible).&lt;&#x2F;li&gt;
&lt;li&gt;Schedule the instruction that is most likely to cause interlocks with instructions after it. If an instruction may cause interlock, they want to schedule it as early as possible.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Lookaheads would definitely improve scheduling, but that also significantly increases worst-case complexity. Instead, they use three concrete heuristics that evaluate the candidates&#x27; static local properties. In order of importance, they are as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Whether an instruction interlocks with any of its immediate successors.&lt;&#x2F;li&gt;
&lt;li&gt;The number of immediate successors.&lt;&#x2F;li&gt;
&lt;li&gt;The height of the dag rooted at that node.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;These criteria favor instructions which:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;May cause interlocks. This is desirable because it allows instructions that are likely to interlock to be scheduled as early as possible as that gives the greatest number of candidates for subsequent instructions.&lt;&#x2F;li&gt;
&lt;li&gt;Uncover the most potential successors, thereby giving greater freedom of future choices.&lt;&#x2F;li&gt;
&lt;li&gt;Balance the progress along the paths of the dag, which ensures a more even number of choices throughout the process.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The steps in this algorithm for dag traversal and scheduling are outlined pretty clearly in the paper and so we will not duplicate them here. The final result is displayed in the table above. The original order had four interlocks (referring to instructions via line number): 3-4, 5-6, 7-8, and 8-9. The reordered version only has one: 8-1.&lt;&#x2F;p&gt;
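&lt;p&gt;As a rough illustration of how the two guidelines and three heuristics could fit together, here is a small Python sketch of a greedy scheduler. It is not the paper&#x27;s exact procedure: the &lt;code&gt;interlocks&lt;&#x2F;code&gt; predicate only models the load-use hazard, the instruction encoding is hypothetical, and breaking ties by original position is our own choice.&lt;&#x2F;p&gt;

```python
def interlocks(a, b):
    # Hypothetical hazard model: b stalls if it reads a register that load a writes.
    return a[0] == "load" and bool(set(a[2]).intersection(b[1]))


def schedule(block, edges):
    """Greedy scheduling; edges holds (i, j) pairs where i precedes j in the block."""
    n = len(block)
    succs = {i: [j for (a, j) in edges if a == i] for i in range(n)}
    preds_left = {j: sum(1 for (_, b) in edges if b == j) for j in range(n)}

    height = {}
    for i in reversed(range(n)):  # edges point forward, so successors come first
        height[i] = 1 + max((height[j] for j in succs[i]), default=0)

    def priority(i):
        may_stall = any(interlocks(block[i], block[j]) for j in succs[i])
        return (may_stall, len(succs[i]), height[i], -i)

    order, last = [], None
    ready = [i for i in range(n) if preds_left[i] == 0]
    while ready:
        # Guideline 1: prefer candidates that do not stall after the last pick.
        safe = [i for i in ready
                if last is None or not interlocks(block[last], block[i])]
        pick = max(safe or ready, key=priority)
        ready.remove(pick)
        order.append(pick)
        last = pick
        for j in succs[pick]:
            preds_left[j] -= 1
            if preds_left[j] == 0:
                ready.append(j)
    return order


block = [
    ("load", ["sp"], ["r0"]),  # load 0(sp), r0
    ("add", ["r0"], ["r0"]),   # add  #1, r0, r0
    ("add", ["r1"], ["r1"]),   # add  #5, r1, r1
]
print(schedule(block, {(0, 1)}))  # [0, 2, 1]
```

&lt;p&gt;On the load-use example from the background section, the sketch picks the load first because it may interlock with its successor, then fills the slot after it with the independent add.&lt;&#x2F;p&gt;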
&lt;h3 id=&quot;computational-complexity-of-the-algorithm&quot;&gt;Computational Complexity of the Algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;Let the number of instructions in the basic block be n.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;constructing-the-dag&quot;&gt;Constructing the dag&lt;&#x2F;h4&gt;
&lt;p&gt;In the worst case, every instruction needs to be compared with all the instructions already in the dag. During this process, we can also compute the information needed by the heuristic evaluation, such as the length of the longest path and the number of immediate successors. Construction is thus O(n&lt;sup&gt;2&lt;&#x2F;sup&gt;).&lt;&#x2F;p&gt;
&lt;h4 id=&quot;scheduling-the-instructions&quot;&gt;Scheduling the instructions&lt;&#x2F;h4&gt;
&lt;p&gt;To schedule an instruction, we need to evaluate all the candidates based on the heuristic. Each evaluation can be done in O(1) because the dag we already built contains all the relevant information. There are at most n candidates at each of the n scheduling steps, so scheduling is also O(n&lt;sup&gt;2&lt;&#x2F;sup&gt;).&lt;&#x2F;p&gt;
&lt;h4 id=&quot;overall-complexity&quot;&gt;Overall complexity&lt;&#x2F;h4&gt;
&lt;p&gt;The overall complexity is O(n&lt;sup&gt;2&lt;&#x2F;sup&gt;). This is significantly better compared to other algorithms in this space which are O(n&lt;sup&gt;4&lt;&#x2F;sup&gt;), even when adjusted to have similar hardware assumptions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;experiments&quot;&gt;Experiments&lt;&#x2F;h3&gt;
&lt;p&gt;The authors implemented this instruction scheduler and made the following observations from benchmark results:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;In practice, these heuristics effectively remove avoidable interlocks and run in approximately linear time.&lt;&#x2F;li&gt;
&lt;li&gt;The memory referencing assumptions greatly improve results, and effectiveness increases with better aliasing information (provided by other parts of the compiler).&lt;&#x2F;li&gt;
&lt;li&gt;The carry&#x2F;borrow subgraph (the dag for carry&#x2F;borrow dependencies) does not improve much for most programs. Significant improvements only occur when the program is computationally intensive. 
&lt;ul&gt;
&lt;li&gt;It is unclear how these subgraphs would improve scheduling, as they are constructed for correctness.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;Using more versatile dags proposed by other literature only slightly improves the instruction scheduling effectiveness, despite them having significantly worse complexity.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The authors referenced additional information on performance in the &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=13321&quot;&gt;[Joh86]&lt;&#x2F;a&gt; paper, which tested load&#x2F;store scheduling and showed a 5% improvement. It was published in the same proceedings by colleagues working on the same architecture, and this improvement was measured by the reduction in interlocks caused by load&#x2F;store instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Such a small suite of benchmarks and statistics would be unacceptable today, but it was perhaps okay for 1986. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;our-thoughts&quot;&gt;Our Thoughts&lt;&#x2F;h2&gt;
&lt;p&gt;This paper left a few things to be desired. &lt;&#x2F;p&gt;
&lt;p&gt;First, we did not quite understand some of the hazards this paper was concerned about. The first hazard of &amp;quot;loading a register from memory followed by using &lt;em&gt;that&lt;&#x2F;em&gt; register as a source&amp;quot; is clear as that&#x27;s a traditional load-use hazard where the load finishes at the end of the cycle while the value it was loading was needed at the beginning of the cycle. &lt;&#x2F;p&gt;
&lt;p&gt;However, the second hazard of &amp;quot;storing to any memory location followed by loading from any location&amp;quot; was puzzling as it is not clear why this would not work. Even if we considered the addresses to alias and assumed no store-to-load forwarding, it seems reasonable that the store would finish before the subsequent load and that the load would return the correct value. If we go a step further and assume memory accesses take multiple cycles, there still does not appear to be a problem as this is a read-after-write dependency, and the write happens before the read. Another explanation could be that memory accesses could occur at different points in the pipeline, such as both before and after the execution stage. If an add instruction required a read from memory and then a write back to memory, and that add instruction was followed by a load, that load would have to stall until the add instruction&#x27;s write finished, resulting in an interlock. However, this explanation would not work if we assumed a &lt;a href=&quot;https:&#x2F;&#x2F;www.openpa.net&#x2F;pa-risc_architecture.html&quot;&gt;PA-RISC&lt;&#x2F;a&gt; architecture as the only memory operations allowed are explicit loads and stores.&lt;&#x2F;p&gt;
&lt;p&gt;The third hazard of &amp;quot;loading from memory followed by using &lt;em&gt;any&lt;&#x2F;em&gt; register as the target of an arithmetic&#x2F;logical instruction or a load&#x2F;store with address modification&amp;quot; (verbatim from the paper) was even more confusing. This seemed to imply that the memory stage included the one and only arithmetic logic unit (ALU) in the processor as it specifically mentioned load&#x2F;stores with address modification (which we took to imply adding an offset to a base address register). We did not like this explanation so we looked up many HP Precision Architecture designs. In particular, &lt;a href=&quot;http:&#x2F;&#x2F;hpmuseum.net&#x2F;document.php?catfile=372&quot;&gt;this architecture&lt;&#x2F;a&gt; seemed to give a convincing explanation:&lt;&#x2F;p&gt;
&lt;img src=&quot;pipeline.png&quot; style=&quot;max-width: 100%&quot;&gt;
&lt;p&gt;In this pipeline, it takes one cycle for the ALU to calculate the address, then only in the next cycle is the ALU address result written to a register, and only in the cycle after that is the loaded data finally written to a register. Each stage is subdivided into two halves, where register writes can only happen in the first half and reads in the second. As such, an ALU operation cannot happen in the cycle following a load&#x2F;store address calculation since that value has not been written to a register yet, making an interlock necessary.&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps improvements to pipeline designs in the three decades since this paper was published caused these mysteries, but all of this could be easily resolved if they had provided more information on exactly what kind of architecture they drew their hazards from. Their only reference was to &lt;a href=&quot;https:&#x2F;&#x2F;newcatalog.library.cornell.edu&#x2F;catalog&#x2F;835270&quot;&gt;[Kog81]&lt;&#x2F;a&gt;, which is a computer architecture textbook only available in print, and they cited the entire book without specific page numbers. We requested it from the Cornell Library Annex and combed through it, but it did not give any specific instruction set architecture designs. &lt;&#x2F;p&gt;
&lt;p&gt;Of these three hazards, the example they provided included two interlocks each for hazard types two and three. We would have preferred at least one of each type of interlock to have a more representative example. This paper also did not provide an example that would have utilized the carry&#x2F;borrow dependency subgraph.&lt;&#x2F;p&gt;
&lt;p&gt;In terms of performance evaluation, they had a good analysis for the worst-case complexity, but their empirical results could have included more data. They mentioned numbers for the number of interlocks mitigated, but they only talked about three benchmarks. The [Joh86] paper that they referenced for more information stated that &amp;quot;the range of improvement [for load&#x2F;store scheduling] varied greatly with the program being optimized&amp;quot;, which can be rationalized by assuming that the programs being tested had high variation in their instruction types. However, looking at the actual measurements, the percentage improvements were 54%, 19%, 4%, 1%, 0%, 0%, and 0% (one benchmark with no load&#x2F;store interlocks was omitted). It seems strange that there is an insignificant improvement for more than half of the benchmarks tested, and we wish the authors had given an explanation for this in the original paper.&lt;&#x2F;p&gt;
&lt;p&gt;Even with these complaints, this paper did a number of things well. The authors did the requisite research into other literature in the pipelined instruction scheduling field and made fair and knowledgeable comparisons to other approaches. They were also clear in defining the subproblems they needed to solve and gave clear explanations of their approach to solving them, with justifications for design decisions (e.g., not using a lookahead). Furthermore, they did not just give a heuristic and empirically argue that it works, but instead gave intuition behind the desired behaviors before diving into the actual algorithm. Lastly, the structure of this paper was also very streamlined in going from problem to solution to evaluation. Nowhere in the paper were we confused about why it was talking about something.&lt;&#x2F;p&gt;
&lt;p&gt;Overall, this paper offers a simple and efficient way of improving performance by reducing interlocks. While it may not be optimal on all architectures, its design goal was to perform well on most architectures while maintaining a low computational complexity, which they have successfully done. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;other-remarks&quot;&gt;Other Remarks&lt;&#x2F;h3&gt;
&lt;p&gt;Modern high-performance processors such as those manufactured by Intel and AMD use more complex pipelines and fancier hardware optimizations. Instructions are also broken down into micro-ops, and the documentation for them is not always released by the company. This makes optimal instruction scheduling very hard on these modern processors. However, the algorithm proposed in this paper inspired a class of algorithms called list scheduling, which is used in modern compilers such as LLVM today.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Codestitcher: Inter-procedural Basic Block Layout Optimization</title>
                <pubDate>Wed, 02 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/codestitcher/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/codestitcher/</guid>
                <description>&lt;p&gt;Programs have dramatically increased in size over the years. Eliminating unnecessary code via traditional optimizations such as dead code elimination, copy propagation, and constant propagation can only do so much. Meanwhile, the capacities of memory units like L1 caches and the TLB haven&#x27;t grown to account for larger program sizes. One reason for this is that as cache size increases, latency increases. Because of this, we want to optimally order instructions to maximize cache utilization and minimize cache misses. Code layout is critical here. In this post we will discuss this paper&#x27;s approach to code layout.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background-optimizing-code-layout&quot;&gt;Background: Optimizing Code Layout&lt;&#x2F;h2&gt;
&lt;p&gt;Before we get too far in, let&#x27;s consider how modern compilers approach code layout, specifically function placement. Since 1990, &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=93550&quot;&gt;Pettis and Hansen&#x27;s code reordering method&lt;&#x2F;a&gt; has been the de facto approach, used by LLVM, HHVM, and other compiler infrastructures. Recent function layout techniques, such as call-chain clustering, attempt to improve upon the Pettis-Hansen method by addressing some of its limitations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;pettis-hansen-method&quot;&gt;Pettis-Hansen Method&lt;&#x2F;h3&gt;
&lt;p&gt;Informally, given a program, the Pettis-Hansen method (hereafter referred to as PH for brevity) works as follows.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Generate an undirected call graph, where nodes are functions and edges represent calls from functions to other functions. More formally, a function A and a function B would have an edge between each other if A calls B or B calls A.&lt;&#x2F;li&gt;
&lt;li&gt;Run a benchmark workload on this program, keeping track of what calls are made.&lt;&#x2F;li&gt;
&lt;li&gt;Generate a weighted call graph, where edge weights are proportional to the frequency of calls between the two corresponding functions.&lt;&#x2F;li&gt;
&lt;li&gt;Go through the edges in decreasing weight, joining the nodes connected by the edge. When nodes join, their edges combine, summing weights when necessary. Node names are also concatenated, denoting the corresponding function order.
&lt;ul&gt;
&lt;li&gt;The node names concatenate to maximize the sum of the weights between consecutive nodes. This can involve reversing node names.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;After this process is complete, no edges will remain in the graph, and the order in which nodes were joined denotes the function order.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Let&#x27;s go through a brief example. Suppose the weighted call graph is as follows.&lt;&#x2F;p&gt;
&lt;img src=&quot;ph-step-1.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;From the above graph, we see that &lt;em&gt;A&lt;&#x2F;em&gt; calls &lt;em&gt;B&lt;&#x2F;em&gt; 100 times, &lt;em&gt;B&lt;&#x2F;em&gt; calls &lt;em&gt;C&lt;&#x2F;em&gt; 30 times, and so on. In the first iteration of the algorithm (step 4), we pick the edge with the largest weight. That&#x27;s &lt;em&gt;AB&lt;&#x2F;em&gt;. We then combine nodes &lt;em&gt;A&lt;&#x2F;em&gt; and &lt;em&gt;B&lt;&#x2F;em&gt; to arrive at the following graph.&lt;&#x2F;p&gt;
&lt;img src=&quot;ph-step-2.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;We then combine &lt;em&gt;C&lt;&#x2F;em&gt; and &lt;em&gt;D&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;img src=&quot;ph-step-3.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;Here, we obviously combine &lt;em&gt;A-B&lt;&#x2F;em&gt; and &lt;em&gt;C-D&lt;&#x2F;em&gt;, but recall the rule about concatenating node names. There are four ways to coalesce the nodes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;A-B-C-D&lt;&#x2F;em&gt;: Here, &lt;em&gt;BC&lt;&#x2F;em&gt; has a weight of 30.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;A-B-D-C&lt;&#x2F;em&gt;: &lt;em&gt;BD&lt;&#x2F;em&gt; has a weight of 0 (no edge), so this is worse than the above ordering.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;B-A-C-D&lt;&#x2F;em&gt;: &lt;em&gt;AC&lt;&#x2F;em&gt; has a weight of 90, which is now the best ordering.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;B-A-D-C&lt;&#x2F;em&gt;: &lt;em&gt;AD&lt;&#x2F;em&gt; has a weight of 0.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Based on the above, &lt;em&gt;B-A-C-D&lt;&#x2F;em&gt; is the optimal ordering, since it maximizes the weight between the coalesced nodes.&lt;&#x2F;p&gt;
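&lt;p&gt;The choice among the four orientations can be sketched in a few lines of Python. This is a simplified illustration rather than the PH implementation: the helper &lt;code&gt;best_concatenation&lt;&#x2F;code&gt; is hypothetical, and it only looks at the weight of the edge at the junction, using the &lt;em&gt;BC&lt;&#x2F;em&gt; and &lt;em&gt;AC&lt;&#x2F;em&gt; weights from the example above.&lt;&#x2F;p&gt;

```python
def best_concatenation(left, right, weight):
    """Try the four ways to join two chains and keep the heaviest junction."""
    options = []
    for l in (left, left[::-1]):
        for r in (right, right[::-1]):
            # Missing edges count as weight 0.
            junction = weight.get(frozenset((l[-1], r[0])), 0)
            options.append((junction, l + r))
    return max(options, key=lambda option: option[0])


# Undirected edge weights taken from the example: BC = 30, AC = 90.
weight = {frozenset(("B", "C")): 30, frozenset(("A", "C")): 90}
print(best_concatenation(["A", "B"], ["C", "D"], weight))
# (90, ['B', 'A', 'C', 'D'])
```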
&lt;h3 id=&quot;basic-block-reordering&quot;&gt;Basic block reordering&lt;&#x2F;h3&gt;
&lt;p&gt;PH also works at the basic block level. PH conducts basic block reordering based on &amp;quot;hot&amp;quot; and &amp;quot;cold&amp;quot; portions of code to maximize spatial locality of the program. This is done by moving the basic blocks that run the most frequently to the top of the procedure. Below is a diagram showing how this works. &amp;quot;Primary&amp;quot; refers to &amp;quot;hot&amp;quot; code while &amp;quot;Fluff&amp;quot; refers to &amp;quot;cold&amp;quot; code.&lt;&#x2F;p&gt;
&lt;img src=&quot;hotcold.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;As we reorder blocks, we will need to insert additional jumps to preserve the control flow. Therefore, we wish to order basic blocks in a way that minimizes these unconditional jumps. This is to prevent unnecessary instruction pointer moves during the execution of the program.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;shortcomings-of-ph&quot;&gt;Shortcomings of PH&lt;&#x2F;h3&gt;
&lt;p&gt;We have a method for reordering entire functions, but this method does not consider the relationship between individual basic blocks and functions. Consider the following call graph presented in the Codestitcher paper, in which function &lt;em&gt;A&lt;&#x2F;em&gt; has three basic blocks, &lt;em&gt;A0&lt;&#x2F;em&gt;, &lt;em&gt;A1&lt;&#x2F;em&gt;, and &lt;em&gt;A2&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;img src=&quot;codestitcher-example.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;Using PH, a viable block layout would be &lt;em&gt;M-A-B-C&lt;&#x2F;em&gt;, which we expand to &lt;em&gt;M-A0-A1-A2-B-C&lt;&#x2F;em&gt;. If we add up the weights of consecutive blocks (hereafter referred to as control flow transfers), we arrive at 180 (100 for &lt;em&gt;M-A0&lt;&#x2F;em&gt; and 80 for &lt;em&gt;A0-A1&lt;&#x2F;em&gt;). This is far from optimal, as we can generate the layout &lt;em&gt;M-A0-A1-B-A2-C&lt;&#x2F;em&gt; with 280 control flow transfers, an improvement of over 50% from before. Note that we use this metric because it tells us how often code will &amp;quot;fall-through.&amp;quot; We want to maximize fall-throughs as this improves spatial locality.&lt;&#x2F;p&gt;
&lt;p&gt;Another drawback of PH is that it uses an undirected call graph. As a result, in the eyes of PH, function &lt;em&gt;A&lt;&#x2F;em&gt; calling &lt;em&gt;B&lt;&#x2F;em&gt; is the same as &lt;em&gt;B&lt;&#x2F;em&gt; calling &lt;em&gt;A&lt;&#x2F;em&gt;, and PH treats ordering &lt;em&gt;A&lt;&#x2F;em&gt; before &lt;em&gt;B&lt;&#x2F;em&gt; as the same as ordering &lt;em&gt;B&lt;&#x2F;em&gt; before &lt;em&gt;A&lt;&#x2F;em&gt;. However, this is clearly not the case. Let&#x27;s say &lt;em&gt;A&lt;&#x2F;em&gt; calls &lt;em&gt;B&lt;&#x2F;em&gt; 100 times, and &lt;em&gt;B&lt;&#x2F;em&gt; calls &lt;em&gt;A&lt;&#x2F;em&gt; 0 times. Then, if we order &lt;em&gt;A&lt;&#x2F;em&gt; before &lt;em&gt;B&lt;&#x2F;em&gt;, assuming the call occurs in the middle of &lt;em&gt;A&lt;&#x2F;em&gt;, the instruction pointer will have to jump &lt;em&gt;len(A)&#x2F;2&lt;&#x2F;em&gt; instructions to the top of &lt;em&gt;B&lt;&#x2F;em&gt;, where &lt;em&gt;len(A)&lt;&#x2F;em&gt; denotes the number of instructions in &lt;em&gt;A&lt;&#x2F;em&gt;. Alternatively, if &lt;em&gt;B&lt;&#x2F;em&gt; was placed before &lt;em&gt;A&lt;&#x2F;em&gt;, then &lt;em&gt;A&lt;&#x2F;em&gt; would on average jump &lt;em&gt;len(B) + len(A)&#x2F;2&lt;&#x2F;em&gt; instructions. Because of this discrepancy, it is important to take into account the caller and callee relations in the call graph.&lt;&#x2F;p&gt;
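&lt;p&gt;The asymmetry is easy to check numerically. The function below is a hypothetical helper that just encodes the two distances from the paragraph above, with function lengths (in instructions) chosen arbitrarily for illustration.&lt;&#x2F;p&gt;

```python
def call_distance(len_a, len_b, a_first):
    """Instructions between a call in the middle of A and the top of B."""
    if a_first:
        return len_a / 2       # layout A, B: skip the second half of A
    return len_b + len_a / 2   # layout B, A: jump back over half of A and all of B


# Arbitrary example lengths: A has 100 instructions, B has 60.
print(call_distance(100, 60, a_first=True))   # 50.0
print(call_distance(100, 60, a_first=False))  # 110.0
```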
&lt;h3 id=&quot;call-chain-clustering&quot;&gt;Call-chain clustering&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;research.fb.com&#x2F;publications&#x2F;optimizing-function-placement-for-large-scale-data-center-applications-2&#x2F;&quot;&gt;Call-chain clustering (C3)&lt;&#x2F;a&gt; is a recent function layout optimization developed at Facebook Research that addresses the issue with using undirected call graphs. We do not discuss its internal details in this post, but a simplified way of thinking about C3 is that it is PH but with a directed call graph. In this way, it improves spatial locality by considering caller-callee relations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;codestitcher&quot;&gt;Codestitcher&lt;&#x2F;h2&gt;
&lt;p&gt;This brings us to Codestitcher, an optimizer that integrates basic block reordering and function reordering to improve upon the PH model. Below is a high-level overview of the steps taken in Codestitcher to generate the optimized code layout, given an input program.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Using input profile data, generate a weighted CFG with frequencies as edge weights. Using this weighted CFG, perform basic block chaining, ordering basic blocks in a way to minimize jumps.&lt;&#x2F;li&gt;
&lt;li&gt;Perform hierarchical code collocation, ordering these basic block chains to form higher-level layouts. Here, the &lt;em&gt;d&lt;&#x2F;em&gt;-close partial layout algorithm works to build layouts to maximize spatial locality.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;profiling&quot;&gt;Profiling&lt;&#x2F;h3&gt;
&lt;p&gt;In order to build our weighted CFG for layout analysis, we need to collect data about &amp;quot;control flow transfers&amp;quot; (when control transfers between CFG nodes). Luckily, most Intel CPUs come with a &amp;quot;last branch record&amp;quot;, or LBR, which stores recently executed instructions such as jumps, branches, and calls. With the Linux &lt;code&gt;perf&lt;&#x2F;code&gt; tool, this data can be gathered and used to generate a weighted CFG for optimizing program layout.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;basic-block-chaining&quot;&gt;Basic Block Chaining&lt;&#x2F;h3&gt;
&lt;p&gt;We define a basic block chain as a directed sequence of basic blocks where no jumps can occur until the end of the last basic block. Generating a maximum-cardinality set of these chains would constitute the fewest number of unconditional jumps required. The authors denote this problem the &lt;em&gt;fall-through maximization problem&lt;&#x2F;em&gt;. This problem is the same as the maximum path cover problem, which is &lt;strong&gt;NP-hard&lt;&#x2F;strong&gt;. PH uses a greedy heuristic to approach this but does not provide a theoretical guarantee of how close to optimal the generated solution is.&lt;&#x2F;p&gt;
&lt;p&gt;There &lt;em&gt;is&lt;&#x2F;em&gt; a 1&#x2F;2-approximation algorithm for this problem, which guarantees that the weight of its path cover is at least &lt;em&gt;1&#x2F;2&lt;&#x2F;em&gt; of the optimal weight. Codestitcher uses a hybrid of PH&#x27;s method and the approximation algorithm, which provides better results than either algorithm run individually. However, this improvement is minimal, only around 0.03% over the approximation algorithm (according to the authors&#x27; tests), so it is unclear why the authors added so much complexity to this problem for minimal gain.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;hierarchical-code-collocation&quot;&gt;Hierarchical Code Collocation&lt;&#x2F;h3&gt;
&lt;p&gt;Coming up with a framework to integrate intraprocedural and interprocedural ordering optimizations while taking &lt;em&gt;layout distance&lt;&#x2F;em&gt; (the distance between two instructions) into account is difficult. We will first try to formalize this problem.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;d-close-partial-layout&quot;&gt;&lt;em&gt;d&lt;&#x2F;em&gt;-Close Partial Layout&lt;&#x2F;h4&gt;
&lt;p&gt;It would help if we could formally define what layout distance really means. The paper&#x27;s definition is quite intuitive: the number of bytes between two basic blocks. To take advantage of spatial locality, we want to ensure that related instructions that are executed frequently have as few bytes between them as possible. This is because when an instruction is loaded into the cache, the entire cache line (which consists of adjacent instructions) is loaded.&lt;&#x2F;p&gt;
&lt;p&gt;Given a specific distance value &lt;em&gt;d&lt;&#x2F;em&gt;, we can try to lay out our program to maximize the number of times control is transferred between two basic blocks within &lt;em&gt;d&lt;&#x2F;em&gt; bytes of each other. However, this is not enough. We need to optimize over many different &lt;em&gt;d&lt;&#x2F;em&gt; values to account for hierarchical memory systems with multiple cache levels. Our goal is to find the &lt;em&gt;finest grain&lt;&#x2F;em&gt; layout that maximizes control transfers across all combinations of basic blocks, across multiple distances.&lt;&#x2F;p&gt;
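&lt;p&gt;As a rough formalization (the notation here is ours and not necessarily the paper&#x27;s), let $w(u, v)$ be the profiled number of control transfers from basic block $u$ to basic block $v$, and let $\mathrm{dist}_\pi(u, v)$ be the number of bytes between them under a layout $\pi$. For a single distance $d$, the quantity being maximized is then&lt;&#x2F;p&gt;
&lt;p&gt;$$ \mathrm{score}_d(\pi) = \sum_{(u, v)} w(u, v) \cdot \mathbf 1\big[\mathrm{dist}_\pi(u, v) \le d\big] $$&lt;&#x2F;p&gt;
&lt;p&gt;so a transfer only counts toward the score if its two endpoints end up within $d$ bytes of each other. Handling several cache levels then amounts to caring about this score for several values of $d$ at once.&lt;&#x2F;p&gt;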
&lt;p&gt;How do we solve this? The paper suggests a greedy approach that encodes the problem as a graph of basic block sequences, with edges weighted by the number of control flow transfers within &lt;em&gt;d&lt;&#x2F;em&gt; bytes. Nodes of the graph are iteratively coalesced in order of descending weight, and this is used to build up a function ordering. We output this ordering after all nodes are coalesced and no edges remain.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;finishing-the-codestitcher-layout-optimization&quot;&gt;Finishing the Codestitcher layout optimization&lt;&#x2F;h3&gt;
&lt;p&gt;In the &lt;em&gt;d&lt;&#x2F;em&gt;-close partial layout problem, edge weights between basic blocks are a function of the number of &lt;em&gt;d&lt;&#x2F;em&gt;-close transfers. However, basic blocks with more instructions will naturally have more &lt;em&gt;d&lt;&#x2F;em&gt;-close transfers to other basic blocks, simply because they are larger. Codestitcher accounts for this by normalizing each edge weight by the sum of the binary sizes of the two corresponding basic blocks. This layout is then used to recompile an optimized version of the original program.&lt;&#x2F;p&gt;
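&lt;p&gt;In symbols, and again with notation that is our own assumption: if $w_d(u, v)$ is the number of &lt;em&gt;d&lt;&#x2F;em&gt;-close transfers between $u$ and $v$, and $\mathrm{size}(u)$ is the binary size of $u$ in bytes, the normalized weight is&lt;&#x2F;p&gt;
&lt;p&gt;$$ \hat w_d(u, v) = \frac{w_d(u, v)}{\mathrm{size}(u) + \mathrm{size}(v)} $$&lt;&#x2F;p&gt;
&lt;p&gt;For example, with made-up numbers, 150 transfers between blocks of 200 and 100 bytes give $\frac{150}{300} = 0.5$, while 100 transfers between blocks of 40 and 60 bytes give $\frac{100}{100} = 1.0$. The second pair would therefore be coalesced first, even though it has fewer raw transfers.&lt;&#x2F;p&gt;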
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;p&gt;The authors evaluate Codestitcher by running it against LLVM&#x27;s default profile-guided optimization as well as PH&#x27;s basic block reordering method.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;&#x2F;strong&gt; Only one test machine is used in the evaluation of Codestitcher. This immediately raises some red flags. Much of Codestitcher&#x27;s performance improvement depends on memory structures such as the instruction cache, unified cache, and TLB. Using multiple machines with different cache configurations and architectures is necessary for a thorough evaluation of Codestitcher as compared to other code layout optimization methods. The machine that was used runs Ubuntu 16.04 on two quad-core 3.60 GHz i7-7700 Kaby-Lake processors. It has a 32 KB L1 instruction cache and 256 KB L2 unified cache.&lt;&#x2F;p&gt;
&lt;p&gt;To test Codestitcher, the authors benchmark five programs with large codebases:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;MySQL cluster 7.4.12 with the non-transactional read-only test (oltp_simple)&lt;&#x2F;li&gt;
&lt;li&gt;Firefox 50.0 with the tp5o benchmark&lt;&#x2F;li&gt;
&lt;li&gt;Clang 3.9 with LLVM&#x27;s multisource benchmark test suite&lt;&#x2F;li&gt;
&lt;li&gt;PHP 7.0 (with httpd 2.4) with WP-Test&lt;&#x2F;li&gt;
&lt;li&gt;Python 2.7.15 with the apps benchmark group from the Unladen Swallow suite&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;For each application, the authors compile the binaries and run various benchmarks. The specific compilation configurations used are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;PGO: LLVM&#x27;s default profile-guided optimization&lt;&#x2F;li&gt;
&lt;li&gt;PGO.PH: PGO with PH&#x27;s reordering algorithm applied on top&lt;&#x2F;li&gt;
&lt;li&gt;PGO.C3: PGO with call-chain clustering applied on top&lt;&#x2F;li&gt;
&lt;li&gt;PH.BB: Baseline binary with PH applied on top&lt;&#x2F;li&gt;
&lt;li&gt;CS: Baseline binary with CS applied on top&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To do the actual testing, the authors generate profiles over multiple runs of each distinct method, then take the average time improvement for each code layout optimization technique over ten runs.&lt;&#x2F;p&gt;
&lt;p&gt;Here are the results when testing the programs:&lt;&#x2F;p&gt;
&lt;img src=&quot;codestitcher-evaluation.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;Based on their evaluation, Codestitcher has a visible performance benefit on three of the programs (MySQL, Clang, and Firefox), while PHP and Python show little to no improvement. In fact, most of the code layout optimization methods followed a similar trend. Why is that?&lt;&#x2F;p&gt;
&lt;p&gt;The authors provide an analysis of the MPKI (misses per thousand instructions) on the instruction cache and TLB:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Application&lt;&#x2F;th&gt;&lt;th&gt;Instruction cache MPKI&lt;&#x2F;th&gt;&lt;th&gt;Instruction TLB MPKI&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;MySQL&lt;&#x2F;td&gt;&lt;td&gt;62.44&lt;&#x2F;td&gt;&lt;td&gt;9.35&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Clang&lt;&#x2F;td&gt;&lt;td&gt;8.14&lt;&#x2F;td&gt;&lt;td&gt;1.01&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Firefox&lt;&#x2F;td&gt;&lt;td&gt;9.16&lt;&#x2F;td&gt;&lt;td&gt;1.54&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;PHP Server&lt;&#x2F;td&gt;&lt;td&gt;7.63&lt;&#x2F;td&gt;&lt;td&gt;0.96&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Python&lt;&#x2F;td&gt;&lt;td&gt;3.40&lt;&#x2F;td&gt;&lt;td&gt;0.19&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
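&lt;p&gt;For a sense of scale, MPKI is computed as&lt;&#x2F;p&gt;
&lt;p&gt;$$ \mathrm{MPKI} = 1000 \times \frac{\text{misses}}{\text{instructions executed}} $$&lt;&#x2F;p&gt;
&lt;p&gt;so MySQL&#x27;s instruction cache MPKI of 62.44 corresponds to roughly one miss every $\frac{1000}{62.44} \approx 16$ instructions, while Python&#x27;s 3.40 corresponds to roughly one miss every $\frac{1000}{3.40} \approx 294$ instructions.&lt;&#x2F;p&gt;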
&lt;p&gt;As expected, when the load on both the cache and TLB increases, code layout optimization is more effective---this is evidence that these optimization methods in general are taking advantage of the processor&#x27;s memory units to improve performance. However, we are not convinced that this is enough evidence to conclude that Codestitcher is superior to PH&#x27;s method, especially because this is only evaluated on one specific processor.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;other-metrics&quot;&gt;Other Metrics&lt;&#x2F;h2&gt;
&lt;p&gt;While a high-level analysis of the performance of Codestitcher is helpful, the diversity of methods evaluated allows a deeper dive into the specific claims that govern Codestitcher.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Function reordering as applied to PGO had little to no impact on performance, with a maximum improvement of 4.1% for instruction cache miss rates. The paper claims that block-based reordering offers significant improvement but uses the performance of CS as a point of comparison. We were somewhat wary of this---perhaps comparing PGO.PH and PGO with some sort of block-based reordering technique would be a fairer analysis of their differences in performance.&lt;&#x2F;li&gt;
&lt;li&gt;Large pages universally increase instruction cache miss rate.&lt;&#x2F;li&gt;
&lt;li&gt;The authors admit that Codestitcher&#x27;s performance is somewhat ambiguous with regard to the TLB. While it performs well on MySQL compared to other techniques, the miss rate for the instruction TLB is much higher on Firefox, Python, and PHP.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h2 id=&quot;overhead&quot;&gt;Overhead&lt;&#x2F;h2&gt;
&lt;p&gt;One final axis that the paper decided to evaluate Codestitcher against was performance overhead of the actual optimization methods:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Overhead due to profiling&lt;&#x2F;li&gt;
&lt;li&gt;Overhead to construct the weighted CFG&lt;&#x2F;li&gt;
&lt;li&gt;Overhead to actually reorder the code and compute an optimized layout&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;img src=&quot;cs-table.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;While the profiling overhead is much lower for CS, the added costs from building the weighted CFG and engaging in code reordering are important to keep in mind when making this comparison. For example, their basic block chaining method has a worse time complexity than that of PH, since it incorporates the approximation algorithm, which is slower than PH&#x27;s greedy heuristic. We think it would have been interesting to also do an overhead comparison to PGO.PH and PGO.C3, which would have been a fairer comparison with regard to layout construction and trace processing overhead.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;The authors discuss a novel method of reordering basic blocks by considering both basic blocks and function-level layouts when reordering code segments. Intuitively, it makes sense that this would lead to &amp;quot;better&amp;quot; layouts, as we have more data to move around. However, we are a little hesitant to accept the authors&#x27; claim that Codestitcher is actually better than existing methods. Namely, their evaluation strategy uses only one test machine, and as a result, only runs Codestitcher on a single instruction cache layout. It would have been interesting to see a graph comparing instruction cache size to Codestitcher&#x27;s benefits, as this would help show the benefits of the &lt;em&gt;d&lt;&#x2F;em&gt;-close partial layout stage of the optimizer.&lt;&#x2F;p&gt;
&lt;p&gt;Ultimately, this paper explores a very interesting technique for optimizing code layout, but its evaluation does not do a great job of convincing us that Codestitcher provides substantial benefits over other code layout optimizers.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Probabril</title>
                <pubDate>Wed, 02 Oct 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/probabril/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/probabril/</guid>
                <description>&lt;h1 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h1&gt;
&lt;p&gt;Often one would like to represent a non-deterministic process as a combination of operations. Such a representation is often much more compact, represents the process you have in mind much more clearly, and is easier to edit than conditional probability tables; a probabilistic program is one way of representing such a process. All that is needed to turn &lt;code&gt;bril&lt;&#x2F;code&gt; into a probabilistic programming language is a source of randomness---in our simple case, a coin flip operation. Running such a program gives you a sample from the distribution.&lt;&#x2F;p&gt;
&lt;p&gt;Of course, having the source code, we can do much more than running programs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-goal-an-exact-solver&quot;&gt;The Goal: An Exact Solver&lt;&#x2F;h2&gt;
&lt;p&gt;The main bit of the project was to write an abstract interpreter that solves for the exact distribution that a probabilistic program represents. For instance, the program below flips two coins:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  x : bool = flip;
  y : bool = flip;
  ret
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We should be able to see that it represents the distribution&lt;&#x2F;p&gt;
&lt;p&gt;$$ p \left(\begin{matrix} x \land y  \\
x \land \lnot y  \\
\lnot x \land y \\
\lnot x \land \lnot y  \end{matrix}\right)
= \left(\begin{matrix} .25 \\ .25 \\ .25 \\ .25 \end{matrix}\right)$$&lt;&#x2F;p&gt;
&lt;p&gt;By repeatedly running a program and recording the frequencies of resulting environments (Monte Carlo sampling), one can get a rough approximation of the distribution, but this can be less than satisfying for a number of reasons:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The resulting distribution is only &lt;em&gt;likely&lt;&#x2F;em&gt; to be &lt;em&gt;approximately&lt;&#x2F;em&gt; correct&lt;&#x2F;li&gt;
&lt;li&gt;This distribution will likely be very distorted in regions that are unlikely&lt;&#x2F;li&gt;
&lt;li&gt;It can require exponentially many samples to resolve events&lt;&#x2F;li&gt;
&lt;li&gt;Equally likely paths are likely to be close, but almost guaranteed not to have exactly the same mass estimate. Worse, no sampling method will ever be able to conclude that $p(x \land y) \not&amp;gt; p(x \land \lnot y)$ with high probability, regardless of the number of samples.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
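&lt;p&gt;To put a number on the first point: if an event has true probability $p$ and we estimate it by its frequency $\hat p$ over $n$ independent runs, the standard error of the estimate is&lt;&#x2F;p&gt;
&lt;p&gt;$$ \mathrm{SE}(\hat p) = \sqrt{\frac{p(1-p)}{n}} $$&lt;&#x2F;p&gt;
&lt;p&gt;For the two-coin program above, $p = 0.25$, and with $n = 10000$ runs (an arbitrary choice for illustration) this gives $\sqrt{\frac{0.1875}{10000}} \approx 0.0043$. The four estimates will therefore typically disagree with one another in the third decimal place, and halving the error costs four times as many runs.&lt;&#x2F;p&gt;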
&lt;p&gt;Instead, we can interpret the code as branching into two worlds on a flip, each tracking the exact correct amount of mass. While this works for straight-line code, such as the program above, any loop which could run an unbounded number of times will cause the interpreter to run forever. At this point, we have removed the probabilistic component, and we now have a deterministic approximation which will converge to the answer we would like. This alleviates the most pressing of our issues, but it is mildly annoying that we will never terminate when evaluating a program with possibly unbounded iteration, even if the limit point is obvious. For instance, this program which repeatedly flips two coins until at least one of them lands tails:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;start:
  x : bool = flip;
  y : bool = flip;
  z : bool = and x y;
  br z start end;
end:
  ret
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which one can easily see results in the distribution&lt;&#x2F;p&gt;
&lt;p&gt;$$
p \left(\begin{matrix}\lnot x \land \lnot y \land \lnot z \vphantom{\frac{1}{3}} \\
x \land \lnot y \land \lnot z \vphantom{\bigg|} \\
\lnot x \land y \land \lnot z \vphantom{\frac{1}{3}}  \end{matrix}\right)
= \left(\begin{matrix} ~\frac{1}{3}~ \\ ~\frac{1}{3}~\vphantom{\bigg|} \\ ~\frac{1}{3}~  \end{matrix}\right)
$$
... will never terminate if we just split worlds on flips, because there&#x27;s always non-zero mass on some run which has not terminated, even though the pattern is clear, and the same computation has already been done in each iteration.&lt;&#x2F;p&gt;
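&lt;p&gt;The value $\frac{1}{3}$ can be checked by hand. Each pass through the loop ends in one of the three exiting outcomes with probability $\frac{1}{4}$ each, and goes around again with the remaining probability $\frac{1}{4}$. Summing over the number $n$ of earlier passes that looped, the mass on any one exiting outcome is&lt;&#x2F;p&gt;
&lt;p&gt;$$ \sum_{n=0}^{\infty} \left(\frac{1}{4}\right)^n \cdot \frac{1}{4} = \frac{1}{4} \cdot \frac{1}{1 - \frac{1}{4}} = \frac{1}{3} $$&lt;&#x2F;p&gt;
&lt;p&gt;Splitting worlds computes the partial sums of this series one term at a time, which is why it never finishes.&lt;&#x2F;p&gt;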
&lt;p&gt;The goal is to design an algorithm which soundly deals with issues like this, which exactly computes distributions over any program with finite state space in a finite number of steps.&lt;&#x2F;p&gt;
&lt;!-- I was also hoping to implement the [the R2 paper](https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;project&#x2F;r2-a-probabilistic-programming-system&#x2F;), which more explicitly makes use of Metropolis-Hastings algorithm. --&gt;
&lt;h1 id=&quot;what-i-did&quot;&gt;What I Did&lt;&#x2F;h1&gt;
&lt;p&gt;I built an abstract interpreter which exactly (to reiterate: neither approximately, nor probabilistically) solves for the distribution of any program with finite state space, together with tools for generating random programs for evaluation, as well as some tools for observing and looping programs. To the best of my knowledge, everything like this that already exists is an iterative approximation of the fixed point, rather than exact calculation of the limit point. However, the procedure is simple enough that I would not be at all surprised if it had been done.&lt;&#x2F;p&gt;
&lt;p&gt;Despite the fact that we conceptually have done something cool, the distinction between exact and approximate may ultimately be irrelevant, for a couple of reasons:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;In difficult cases of interest, distributions are continuous, and so this technique does not work.&lt;&#x2F;li&gt;
&lt;li&gt;The approximations quickly converge to arbitrary precision, and so if you know what tolerance you want, it is sufficient to just run a few more iterations.&lt;&#x2F;li&gt;
&lt;li&gt;As we will see, the standard approach for probabilists &lt;em&gt;does&lt;&#x2F;em&gt; involve solving exactly, but only once you have calculated the eigenspace; it is easy to overlook the fact that computing eigenvectors and eigenvalues is already an approximate algorithm, as standard libraries do this for you quickly at machine precision.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Other notable exact solvers include the &lt;a href=&quot;https:&#x2F;&#x2F;files.sri.inf.ethz.ch&#x2F;website&#x2F;papers&#x2F;psi-solver.pdf&quot;&gt;PSI Solver, 2016&lt;&#x2F;a&gt;, which handles continuous distributions, but requires straight-line code (it unrolls loops), and Holtzen et al.&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1904.02079.pdf&quot;&gt;Symbolic Exact Inference for Discrete Probabilistic Programs, 2019&lt;&#x2F;a&gt;, which uses a much more efficient representation than ours, but also operates on programs with no control flow. The solvers that operate on arbitrary control structures, such as &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;pdf&#x2F;1507.00996.pdf&quot;&gt;this one&lt;&#x2F;a&gt;, all seem to be approximate (but again, converge to the true distribution up to machine precision quickly).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;To the language specification I have added three instructions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;flip&lt;&#x2F;code&gt;: an instruction which stores a random boolean in its target destination&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;obv&lt;&#x2F;code&gt;: an &amp;quot;observe&amp;quot; primitive, used for conditioning, which can be thought of as an assert---if it fails, the world and any mass on it are destroyed, netting a sub-distribution. If one thinks of programs as being normalized distributions (that is, conditioned on a program finishing), then this mass is re-distributed to the other runs, and this instruction is equivalent to a restart of the program.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;clear&lt;&#x2F;code&gt;: clears the environment variables, removing all bindings.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; &lt;code&gt;obv&lt;&#x2F;code&gt; can be compiled to a branch which restarts the program, with a &lt;code&gt;clear&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;background-on-exact-inference-of-posterior-distribution&quot;&gt;Background on Exact Inference of Posterior Distribution&lt;&#x2F;h3&gt;
&lt;p&gt;There are at least two canonical ways of approaching this problem: one from programming languages, and one from ergodic theory. In both cases, a program $P$ can be thought of as a weighted graph $(\mathcal S, T)$, where the vertices
$$ \mathcal S := \mathrm{Instructions} \times \mathrm{Env}  $$&lt;&#x2F;p&gt;
&lt;p&gt;are pairs consisting of the program counter and the environment state, and the weight $T_{s_1, s_2}$ of the edge between states $s_1$ and $s_2$ is the probability of transitioning to state $s_2$ from state $s_1$. Note that this graph is incredibly sparse, as each state can only move to one or two other states.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;1-abstract-interpretation-and-jacobi-iterates&quot;&gt;[ 1 ]  Abstract Interpretation and Jacobi Iterates&lt;&#x2F;h4&gt;
&lt;p&gt;The first thing we can do is in the same spirit as data flow analysis: we interpret programs abstractly---that is, run them by keeping track of some restricted information that necessarily must be true about each variable, rather than its exact value.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the abstract domain $\mathcal D = ({\Delta \mathcal S}, \preceq, s, \oplus)$ of sets of distributions over states, which we will call $\Delta \mathcal S$. $\mathcal D$ can be endowed with a natural order $\preceq$ on which $T$ is monotonic, making it a complete partial order (CPO), i.e., an ordered set with arbitrary suprema. Because it is a CPO and $f$ (the map that applies $T$ once) is monotonic, there is a least fixed point $x$ of $f$ such that $x \succeq s$ for any $s \in \mathcal S$, computed by&lt;&#x2F;p&gt;
&lt;p&gt;$$ x := \mathrm {lfp}^{\preceq} (f,s) =  \lim_{n \to \infty} f^{n} (s) $$&lt;&#x2F;p&gt;
&lt;p&gt;The values obtained by stopping at any given point are called the Jacobi iterates, and are the basis of Cousot style abstract interpretation. However, even if the state space $\mathcal S$ is finite, the set of distributions over them is decidedly not---and this procedure will not terminate. In practice, to get termination people sacrifice completeness to get a sound, terminating abstract interpreter, pulling tricks such as &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Widening_(computer_science)#Use_in_Abstract_Interpretation&quot;&gt;widening&lt;&#x2F;a&gt;. In this setting, this corresponds to giving up on the exact probability distribution, and instead reporting an upper probability.&lt;&#x2F;p&gt;
&lt;p&gt;The distribution of interest is the restriction of this fixed point $x$ to the program points that are return values, renormalized if thought of as a distribution rather than a sub-distribution.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;2-stationary-distributions-on-markov-chains&quot;&gt;[ 2 ]  Stationary Distributions on Markov Chains&lt;&#x2F;h4&gt;
&lt;p&gt;The second common view of a probabilistic program is as a Markov chain. In a very clear way, a program describes exactly the data required to transition from one state (including both the environment variables and the program counter) to a distribution over next states. In particular, the transition $\mathbf T_{i,j}$ is the probability of transitioning to state $j$ given that you&#x27;re in state $i$. For a deterministic program, for instance, $\mathbf T_{i,j}$ is a function, and therefore has exactly a single one in each row, and zeros elsewhere. A fixed point here is a stationary distribution for the matrix $\mathbf T$---that is, an eigenvector associated to eigenvalue 1.&lt;&#x2F;p&gt;
&lt;!--Because segments of the program which  contracting map with respect to entropy, the Banach fixpoint theorem tells use that a fixed point exists, and it can be calculated by iteratively applying the matrix $T$ to any point in our space ---&gt;
&lt;h5 id=&quot;projecion-into-eigenspaces&quot;&gt;Projection into Eigenspaces&lt;&#x2F;h5&gt;
&lt;p&gt;Given an oracle for computing the eigenvectors of this transition matrix $\mathbf T$, the right thing to do is clear:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \lim_{n \to \infty}  \mathbf T^n \vec s  = \lim_{n \to \infty}  \mathbf U \Sigma^n \mathbf V^T \vec s $$&lt;&#x2F;p&gt;
&lt;p&gt;where $\mathbf U \Sigma \mathbf V^T$ is an eigendecomposition of $\mathbf T$ (assuming $\mathbf T$ is diagonalizable)---that is, the columns of $\mathbf U$ are right eigenvectors of $\mathbf T$, the rows of $\mathbf V^T = \mathbf U^{-1}$ are the matching left eigenvectors, and $\Sigma$ is the diagonal matrix of eigenvalues of $\mathbf T$. Because we know $\mathbf T$ is a (sub)stochastic matrix, we know that it can have no eigenvalue of magnitude greater than 1, and if the program has any chance of returning, the return statement corresponds to an eigenvector that does in fact have corresponding eigenvalue 1. It is easy to see that the contribution of any eigenvalue of magnitude less than 1 will ultimately go to zero, and so really we are just projecting the start state $\vec s$ into the eigenspace associated to the eigenvalue 1. If this eigenspace has dimension $k$, we can write the previous equation as&lt;&#x2F;p&gt;
&lt;p&gt;$$\lim_{n \to \infty}  \mathbf T^n \vec s   = \mathbf U_k \mathbf V_k^T \vec s$$&lt;&#x2F;p&gt;
&lt;p&gt;where the columns of $\mathbf U_k$ and $\mathbf V_k$ are the $k$ right and left eigenvectors of $\mathbf T$ associated to eigenvalue 1, respectively. This process is widely considered the standard solution to problems like ours.&lt;&#x2F;p&gt;
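&lt;p&gt;As a small illustration, consider the two-coin loop from earlier, collapsed (as a simplification of ours) to four states: the loop head and the three exiting outcomes, each of which is absorbing. Indexing columns by the current state, so that $\mathbf T \vec s$ advances a column vector $\vec s$ by one pass through the loop, and starting from $\vec s = (1, 0, 0, 0)^T$, we get&lt;&#x2F;p&gt;
&lt;p&gt;$$ \mathbf T = \begin{bmatrix} \frac{1}{4} &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 \\ \frac{1}{4} &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 \\ \frac{1}{4} &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 \\ \frac{1}{4} &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 \end{bmatrix}, \qquad \mathbf T^n \vec s = \begin{bmatrix} 4^{-n} \\ \frac{1}{3}(1 - 4^{-n}) \\ \frac{1}{3}(1 - 4^{-n}) \\ \frac{1}{3}(1 - 4^{-n}) \end{bmatrix} \longrightarrow \begin{bmatrix} 0 \\ \frac{1}{3} \\ \frac{1}{3} \\ \frac{1}{3} \end{bmatrix} \text{ as } n \to \infty $$&lt;&#x2F;p&gt;
&lt;p&gt;The eigenvalues of this $\mathbf T$ are $\frac{1}{4}$ and $1$ (the latter with multiplicity three, one per exit). The $\frac{1}{4}$ component decays, and what remains is the projection of $\vec s$ onto the eigenspace for eigenvalue 1.&lt;&#x2F;p&gt;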
&lt;p&gt;However, there are a number of drawbacks of just implementing the algorithm like this:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;It requires having fully explored the graph&lt;&#x2F;li&gt;
&lt;li&gt;Calculations of SVD and eigendecomposition do not benefit much from sparsity.&lt;&#x2F;li&gt;
&lt;li&gt;We already know most of the eigenvectors associated to eigenvalue 1: the &lt;code&gt;ret&lt;&#x2F;code&gt; statements. (The others correspond to infinite loops.)&lt;&#x2F;li&gt;
&lt;li&gt;Eigenvectors &#x2F; eigenvalues must themselves be computed iteratively rather than exactly, due to the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Abel%E2%80%93Ruffini_theorem&quot;&gt;Abel-Ruffini theorem&lt;&#x2F;a&gt;: in general, a matrix could have an arbitrary characteristic polynomial, and solving for eigenvalues means giving roots of this polynomial---and therefore eigenvalues cannot be calculated with algebraic operations in general so long as $| \mathcal S | \geq 5$.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;!--hr&#x2F;--&gt;
&lt;h3 id=&quot;algorithm&quot;&gt;Algorithm&lt;&#x2F;h3&gt;
&lt;p&gt;The key insight is that the limit point of a single cycle can be computed the moment you spot the cycle, and know where all of the &amp;quot;off-ramps&amp;quot; are. For instance, if your program looks like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;probabril&#x2F;graph-sketch.png&quot; alt=&quot;example-graph&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;then the moment you&#x27;ve seen the path $a \to b \to c$, and realized that the probability mass has dropped from $1$ to $\frac{1}{12}$ by going around the cycle, we see that we will ultimately end up with a geometric series&lt;&#x2F;p&gt;
&lt;p&gt;$$ 1 + \frac{1}{12} + \frac{1}{12^2} + \frac{1}{12^3} + \cdots  =  \frac{1}{1 - \frac{1}{12}} $$&lt;&#x2F;p&gt;
&lt;p&gt;That is, we can immediately pass to a limit, by removing all weight at the origin of the cycle, and multiplying the probability masses which branch off by $\frac{12}{11}$, which means the resulting graph looks like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;probabril&#x2F;graph-sketch-2.png&quot; alt=&quot;example-graph&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we repeat this process for the left cycle in the graph, we&#x27;ll see that we suddenly have mass back on node $a$ again, and so it may look as though all of this work hasn&#x27;t really helped us after all, if we have multiple interacting cycles. This too can be overcome, by saving our work, in the form of a distribution frontier for each node, so that we can circumvent any duplicate calculations. I can use this intuition to give a rough sketch of a proof that the algorithm will terminate (again, only for finite state space), but I have not yet investigated whether we can guarantee polynomial time.&lt;&#x2F;p&gt;
&lt;p&gt;In any case, here is the algorithm:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Let $\texttt{best} : \text{Map}\langle \mathcal S, \Delta \mathcal S \rangle$ be initially a default dict with $(i,\text{env}) \mapsto \text{execInst}_i(\text{env})$.&lt;&#x2F;li&gt;
&lt;li&gt;Initialize a queue $Q := [p_0]$, where $p_0$ is the start state (the first instruction paired with an empty environment).&lt;&#x2F;li&gt;
&lt;li&gt;Repeat while $|Q| &amp;gt; 0$:
&lt;ul&gt;
&lt;li&gt;Dequeue $p := (i, \text{env}) \leftarrow Q$ from $Q$.&lt;&#x2F;li&gt;
&lt;li&gt;set $\texttt{best}[p] := \Big[\texttt{best}[p] &#x2F; \texttt{best}[\textit{support}(\texttt{best}[p])]\Big]$; that is, extend the frontier of the distribution associated to $p$ by one level (or equivalently, use the monad multiplication $\mu$ to collapse $\texttt{best} \circ \texttt{best}$ to a distribution).&lt;&#x2F;li&gt;
&lt;li&gt;if  $\text{weight}_p(\texttt{best}[p]) &amp;gt; 0$, i.e., if there is a cycle in $\texttt{best}[p]$:
&lt;ul&gt;
&lt;li&gt;$\texttt{best}[p] \leftarrow \texttt{best}[p] * \dfrac{1}{1 - \text{weight}_p(\texttt{best}[p])}$&lt;&#x2F;li&gt;
&lt;li&gt;$\text{weight}(\texttt{best}[p]) := 0$&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;append $\textit{support}(\texttt{best}[p])$ to $Q$; if this is non-empty, also append $p$.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
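&lt;p&gt;As a sanity check, here is a hand trace of the cycle step on the two-coin loop from earlier, treating one pass through the loop body as a single step (a simplification of ours; the implementation works instruction by instruction). Writing $p$ for the loop head and $e_1, e_2, e_3$ for the three exiting states, and writing distributions as formal sums (also our notation), once the frontier of $p$ has been extended all the way around the loop we have&lt;&#x2F;p&gt;
&lt;p&gt;$$ \texttt{best}[p] = \frac{1}{4}\, p + \frac{1}{4}\, e_1 + \frac{1}{4}\, e_2 + \frac{1}{4}\, e_3, \qquad \text{weight}_p(\texttt{best}[p]) = \frac{1}{4} $$&lt;&#x2F;p&gt;
&lt;p&gt;Rescaling by $\frac{1}{1 - \frac{1}{4}} = \frac{4}{3}$ and then zeroing the weight on $p$ gives&lt;&#x2F;p&gt;
&lt;p&gt;$$ \texttt{best}[p] = \frac{1}{3}\, e_1 + \frac{1}{3}\, e_2 + \frac{1}{3}\, e_3 $$&lt;&#x2F;p&gt;
&lt;p&gt;which sums to 1 and whose support contains only returning states, matching the distribution computed by hand above.&lt;&#x2F;p&gt;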
&lt;h4 id=&quot;what-this-looks-like-in-terms-of-power-iteration&quot;&gt;What this looks like in terms of Power Iteration&lt;&#x2F;h4&gt;
&lt;p&gt;If $\mathbf T = \Big[ t_{i,j} \Big]$, then we are only updating one row at a time, and changing $\mathbf T$ at each iteration. To start off, we have a collection of matrices that look like this, for each $i$, which we plan on applying sequentially to $\vec s$:&lt;&#x2F;p&gt;
&lt;p&gt;$$ \begin{bmatrix}1 &amp;amp;  &amp;amp; \cdots &amp;amp; &amp;amp; 0 \\  &amp;amp; 1 &amp;amp;&amp;amp;&amp;amp; \\   \Big[\text{---} &amp;amp; \text{---} &amp;amp; t_{i,j} &amp;amp; \text{---} &amp;amp; \text{---}\Big] \\ &amp;amp;&amp;amp;&amp;amp;1 &amp;amp; \\ 0 &amp;amp;  &amp;amp; \cdots &amp;amp;&amp;amp; 1
\end{bmatrix} $$&lt;&#x2F;p&gt;
&lt;p&gt;At each iteration, we combine adjacent ones together into a larger matrix via substitution &#x2F; matrix multiplication, and save the result as the new vector $[- ~t_{i,j}~- ]$. Every time there is a cycle, i.e., positive mass in $t_{i,i}$ for some $i$, we pass immediately to the limit by redistributing the weight in $t_{i,i}$ to each of the other $t_{i,j}$ in proportion to their existing mass.&lt;&#x2F;p&gt;
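&lt;p&gt;Spelled out for a single row $i$ (a sketch, assuming row $i$ holds the current frontier of state $i$), passing to the limit replaces the row by&lt;&#x2F;p&gt;
&lt;p&gt;$$ t_{i,j} \leftarrow \frac{t_{i,j}}{1 - t_{i,i}} \quad (j \ne i), \qquad t_{i,i} \leftarrow 0 $$&lt;&#x2F;p&gt;
&lt;p&gt;This is the same geometric series as before: mass $t_{i,i}$ returns to $i$ on every lap, so off-ramp $j$ eventually receives $t_{i,j}\,(1 + t_{i,i} + t_{i,i}^2 + \cdots)$. When the row sums to 1, dividing by $1 - t_{i,i}$ is exactly the same as sharing $t_{i,i}$ out among the other entries in proportion to their existing mass.&lt;&#x2F;p&gt;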
&lt;h4 id=&quot;how-is-this-possible-given-abel-ruffini&quot;&gt;How is this possible given Abel-Ruffini?&lt;&#x2F;h4&gt;
&lt;p&gt;While probabilistic Bril programs are not deterministic, their transition matrices are far from arbitrary: because they are probabilistic, a transition matrix $A$ must have $\sum_{i} A_{i,j} = 1$ for all $j \in \mathcal S$. Moreover, because we already know the eigenvalues and even most of the eigenvectors we care about, there is no need to do some of this computation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;This algorithm has been implemented in a file called &lt;code&gt;xbrili&lt;&#x2F;code&gt;, next to the normal interpreter &lt;code&gt;brili&lt;&#x2F;code&gt;. The code is available &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;orichardson&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-ts&#x2F;xbrili.ts&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The refactoring required to get this working includes the following setup:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Create a &lt;code&gt;StringifyingMap&lt;&#x2F;code&gt; whose keys can be other maps, so that we can index into our many maps and queues by program point.&lt;&#x2F;li&gt;
&lt;li&gt;Expand &lt;code&gt;Action&lt;&#x2F;code&gt;s to also include splitting of worlds, for coin flips, and interpret instructions accordingly&lt;&#x2F;li&gt;
&lt;li&gt;Push print statements in a buffer so they only complete at the end and once per run---unobserved runs won&#x27;t print.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;One can then roughly translate the pseudo-code above. There is no exploration before the algorithm starts; we directly traverse the instructions themselves, and build the $\texttt{best}$ map behind us as we go, just like any graph search algorithm. When we encounter back edges, we skip all of the interim computation and go directly to the frontier of the new node.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;extensions&quot;&gt;Extensions&lt;&#x2F;h3&gt;
&lt;p&gt;There are several additional features of note in the implemented version beyond what has already been discussed. First, the more major elided details and extensions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;I have also implemented an extension of this algorithm which backs off of finite space assumptions, by specifying a tolerance $t$. When the mass of a state $u$ from the state $s$ drops off enough, $\Pr(u \mid \vec{s}) &amp;lt; t$, we stop keeping track of the run entirely. It was actually rather hard to get this part right, because the algorithm relies on the queue both to track what still needs doing, and also to determine if a point changed after updating. There are many tempting resolutions involving just the queue, but all of the ones I tried came with subtle bugs. To fix this, I needed to separately track which program points had already finished, independent of the queue.&lt;&#x2F;li&gt;
&lt;li&gt;I have also implemented a utility called &lt;code&gt;random-bril.py&lt;&#x2F;code&gt; which generates random Bril programs, with a mechanism for making sure that the distribution of different kinds of instructions is fairly even. There are a lot of parameters, so they are hard-coded. The parameters have been tuned to make things that look like reasonable programs to me, but of course do not reflect any real distribution of instructions, and there may be programs which are impossible to generate with it. Also, due to jumps, it is possible to generate programs that refer to variables before they are defined. We throw out any programs that generate errors like this.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And also some more minor or more technical details that may be of interest:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;brili&lt;&#x2F;code&gt; now interprets &lt;code&gt;flip&lt;&#x2F;code&gt; as random, &lt;code&gt;obv&lt;&#x2F;code&gt; as a restart, and &lt;code&gt;xbrili&lt;&#x2F;code&gt; introduces new actions for world splitting.&lt;&#x2F;li&gt;
&lt;li&gt;Printing is now delayed until the end of the execution and saved in the buffer, so that observations truly can totally undo everything and allow for a restart.&lt;&#x2F;li&gt;
&lt;li&gt;There are options for removing the printing instructions entirely, and dumping environment variables at the end, for both interpreters, so that we can automatically test the correctness of the two against one another via &lt;code&gt;random-bril&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;I have also added quite a bit to &lt;code&gt;util.ts&lt;&#x2F;code&gt;, including a way of indexing TypeScript Maps by other TypeScript Maps by means of stringification.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;difficulties&quot;&gt;Difficulties&lt;&#x2F;h2&gt;
&lt;p&gt;This took me a long time: I was trying to figure out whether the algorithm I had in mind was sound, and how to get it to terminate, particularly once I no longer wanted to assume from the start that my state space was finite.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;shortcomings-in-algorithm&quot;&gt;Shortcomings in Algorithm&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;In some cases I&#x27;m paying an extra factor of two: if there is a loop with an odd number of instructions, I have to go through the loop twice before the $\texttt{best}$ map stabilizes. It is not clear there&#x27;s an easy way of avoiding this, but it feels like there should be. Of course, &amp;quot;2 if you&#x27;re unlucky&amp;quot; is much better than &amp;quot;as many as it takes to converge&amp;quot;.&lt;&#x2F;li&gt;
&lt;li&gt;The algorithm keeps revisiting nodes that it cannot possibly be done with yet. Rather than a queue, which gives this algorithm, or a stack, which looks more like a Gibbs sampler, the ideal would be a priority queue ordered by probability mass (with ties broken by time of last arrival).&lt;&#x2F;li&gt;
&lt;li&gt;I have a sketch of a proof of soundness, but have not worked out the details. I also have not proved any complexity results.&lt;&#x2F;li&gt;
&lt;li&gt;For a program with infinite state space, the algorithm might not terminate. Of course, this is unavoidable; the more pressing concern is that &lt;code&gt;xbrili&lt;&#x2F;code&gt; will use additional memory even if the original program runs in fixed space, which makes it much worse than executing a deterministic program. This could be alleviated by only saving things on branches; the trade off is that we lose opportunities to recognize the state we&#x27;re in.&lt;&#x2F;li&gt;
&lt;li&gt;A large number of the runs have the same control flow; these take up a lot of extra storage space.&lt;&#x2F;li&gt;
&lt;li&gt;The algorithm we use here, to follow coin flips, does not transfer to continuous variables whatsoever. Resolving this is the subject of the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;cs6120&#x2F;issues&#x2F;67&quot;&gt;second project&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;shortcomings-in-implementation&quot;&gt;Shortcomings in Implementation&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;In backing off of the finite state space assumption, I have given only one mechanism for throwing out traces: if they have mass less than some tolerance. Other, arguably more useful criteria include, for example, a global total tolerance.&lt;&#x2F;li&gt;
&lt;li&gt;The priority queue idea was not implemented mostly because it seemed like a premature optimization, and the data structure we use to index it is mutable, so it would have been an additional engineering burden.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h1&gt;
&lt;h2 id=&quot;example-programs&quot;&gt;Example Programs&lt;&#x2F;h2&gt;
&lt;p&gt;There are more examples in &lt;code&gt;&#x2F;test&#x2F;prob&#x2F;&lt;&#x2F;code&gt;, for those interested. We will highlight two important ones:&lt;&#x2F;p&gt;
&lt;h3 id=&quot;thirds&quot;&gt;Thirds&lt;&#x2F;h3&gt;
&lt;p&gt;Consider the program we wrote earlier:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
start :
  x: bool = flip;
  y: bool = flip;
  z: bool = and x y;
  br z start end;
end:
  ret;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This outputs our desired thirds distribution, immediately and exactly:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;[ &amp;#39;done&amp;#39;, Map { &amp;#39;x&amp;#39; =&amp;gt; true, &amp;#39;y&amp;#39; =&amp;gt; false, &amp;#39;z&amp;#39; =&amp;gt; false } ] 0.3333333333333333
[ &amp;#39;done&amp;#39;, Map { &amp;#39;x&amp;#39; =&amp;gt; false, &amp;#39;y&amp;#39; =&amp;gt; true, &amp;#39;z&amp;#39; =&amp;gt; false } ] 0.3333333333333333
[ &amp;#39;done&amp;#39;, Map { &amp;#39;x&amp;#39; =&amp;gt; false, &amp;#39;y&amp;#39; =&amp;gt; false, &amp;#39;z&amp;#39; =&amp;gt; false } ] 0.3333333333333333
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can also consider the de-looped version of this, where we use an &lt;code&gt;obv&lt;&#x2F;code&gt; instruction instead of a branch, explicitly killing mass and representing a sub-distribution:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  a3: bool = flip;
  a2: bool = flip;
  print a3 a2;

  a1: bool = or a3 a2;
  obv a1;
  print a1;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This results in the following output:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;[ &amp;#39;done&amp;#39;, Map { &amp;#39;a3&amp;#39; =&amp;gt; true, &amp;#39;a2&amp;#39; =&amp;gt; true, &amp;#39;a1&amp;#39; =&amp;gt; true } ] 0.25
[ &amp;#39;done&amp;#39;, Map { &amp;#39;a3&amp;#39; =&amp;gt; true, &amp;#39;a2&amp;#39; =&amp;gt; false, &amp;#39;a1&amp;#39; =&amp;gt; true } ] 0.25
[ &amp;#39;done&amp;#39;, Map { &amp;#39;a3&amp;#39; =&amp;gt; false, &amp;#39;a2&amp;#39; =&amp;gt; true, &amp;#39;a1&amp;#39; =&amp;gt; true } ] 0.25
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These masses can easily be re-normalized to $\frac{1}{3}$ each if necessary, but as printed they reflect the fact that some of the mass was killed rather than looped. The difference will become important if we integrate probabril with functions: the exact place that you restart to, and when you renormalize, make a big difference, and if there are many places one could return to, there may be multiple natural choices.&lt;&#x2F;p&gt;
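&lt;p&gt;To spell out the renormalization (a worked check that uses only the numbers printed above): the three surviving worlds each carry mass $0.25$, so the observation $a_1$ holds with total mass $0.75$, and conditioning on it gives $\Pr(a_3, a_2 \mid a_1) = \frac{0.25}{0.75} = \frac{1}{3}$ for each of them. The missing $0.25$ is the mass of the world in which both flips came up false, which &lt;code&gt;obv&lt;&#x2F;code&gt; killed.&lt;&#x2F;p&gt;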
&lt;h3 id=&quot;backing-off-of-finite-state-space&quot;&gt;Backing Off of Finite State Space&lt;&#x2F;h3&gt;
&lt;p&gt;The following program, which has a counter, could have an unbounded number of states.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  one : int = const 1;
  y : int = const 0;
reflip:
  x : bool = flip;
  y : int = add y one;
  br x reflip end;
end:
  ret;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Below is its output (with the constant field removed so that it fits more cleanly on the screen). Note that we have again solved for the exact distribution for every run that has a probability of more than $10^{-7}$, and also where the remaining program mass is after pushing the program this far.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;[ 2, Map { &amp;#39;y&amp;#39; =&amp;gt; 24n, &amp;#39;x&amp;#39; =&amp;gt; true } ] 5.960464477539063e-8
[ 6, Map { &amp;#39;y&amp;#39; =&amp;gt; 24n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 5.960464477539063e-8
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 23n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 1.1920928955078125e-7
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 22n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 2.384185791015625e-7
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 21n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 4.76837158203125e-7
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 20n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 9.5367431640625e-7
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 19n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.0000019073486328125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 18n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.000003814697265625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 17n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.00000762939453125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 16n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.0000152587890625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 15n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.000030517578125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 14n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.00006103515625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 13n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.0001220703125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 12n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.000244140625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 11n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.00048828125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 10n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.0009765625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 9n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.001953125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 8n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.00390625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 7n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.0078125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 6n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.015625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 5n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.03125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 4n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.0625
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 3n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.125
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 2n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.25
[ &amp;#39;done&amp;#39;, Map { &amp;#39;y&amp;#39; =&amp;gt; 1n, &amp;#39;x&amp;#39; =&amp;gt; false } ] 0.5
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even though there is still some symmetry left to be desired (a more satisfying solution could be given by widening perhaps), note that assembling a picture like this via Monte Carlo would take an insane number of samples. Effectively we have done power iteration, but for a space which we did not know was finite beforehand.&lt;&#x2F;p&gt;
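&lt;p&gt;As a sanity check on the listing above (simple arithmetic on the printed values, not additional output from &lt;code&gt;xbrili&lt;&#x2F;code&gt;, and assuming the tolerance is the $10^{-7}$ mentioned earlier): a run finishes with $y = k$ exactly when the first $k - 1$ flips are true and the $k$-th is false, so $\Pr(\text{done}, y = k) = 2^{-k}$. The finished runs listed cover $k = 1, \dots, 23$ and have total mass $\sum_{k=1}^{23} 2^{-k} = 1 - 2^{-23}$. The two unfinished states with $y = 24$ each carry $2^{-24} \approx 5.96 \times 10^{-8}$, which is below the tolerance, so they are not pushed any further; together they hold the remaining $2 \cdot 2^{-24} = 2^{-23}$, and the masses sum to $1$.&lt;&#x2F;p&gt;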
&lt;h2 id=&quot;random-testing&quot;&gt;Random Testing&lt;&#x2F;h2&gt;
&lt;p&gt;Due to the number of parameters and computing time required to make the random testing reasonable, this bit of the evaluation is less conclusive than I was hoping it would be. Nonetheless, the infrastructure is all there.  &lt;code&gt;probabril&#x2F;random-tester.py&lt;&#x2F;code&gt; generates a program, runs &lt;code&gt;xbrili&lt;&#x2F;code&gt; on it, and then runs many instances of &lt;code&gt;brili --noprint --envdump&lt;&#x2F;code&gt;, doing a $\chi^2$ test on the resulting output frequencies of environment variables.&lt;&#x2F;p&gt;
&lt;p&gt;Here is a sample output of 10 random programs, with 100 samples. The first number is the $\chi^2$ statistic, then the $p$-value, and then the expected and observed frequencies.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;chi2 0.36 p 0.5485 	 (50, 50) (47, 53)
chi2 4.0 p 0.26146 	 (25, 25, 25, 25) (20, 20, 30, 30)
chi2 0.0 p nan 	 (100,) (100,)
chi2 1.20 p 0.7530	 (25, 25, 25, 25) (23, 26, 22, 29)
chi2 3.52 p 0.3181 	 (25, 25, 25, 25) (21, 23, 23, 33)
chi2 0.64 p 0.4237 	 (50, 50) (54, 46)
chi2 0.24 p 0.970 	 (25, 25, 25, 25) (23, 26, 26, 25)
chi2 0.64 p 0.423 	 (50, 50) (46, 54)
chi2 4.88 p 0.180 	 (25, 25, 25, 25) (19, 34, 23, 24)
chi2 18.6 p 0.0092 	 (12, 12, 12, 12, 12, 12, 12, 12) (4, 14, 19, 8, 9, 13, 12, 21)
chi2 1.96 p 0.1615 	 (50, 50) (57, 43)
chi2 2.72 p 0.2564 	 (50, 24, 24) (53, 29, 18)


random programs that could not be executed by xbrili due to undefined variables: 3
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
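&lt;p&gt;For reference, the printed statistic is consistent with Pearson&#x27;s $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where $E_i$ and $O_i$ are the expected and observed counts. A worked check on the first two rows above: with expected $(50, 50)$ and observed $(47, 53)$ we get $\chi^2 = \frac{(-3)^2}{50} + \frac{3^2}{50} = 0.36$, and with expected $(25, 25, 25, 25)$ and observed $(20, 20, 30, 30)$ we get $\chi^2 = 4 \cdot \frac{5^2}{25} = 4.0$, matching the first column.&lt;&#x2F;p&gt;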
&lt;p&gt;Low $p$-values are bad, because they suggest that the expected distribution is different from the observed one. When this happens, however, there are a few possible explanations: not only are the sample sizes very low, but I also had to set a very aggressive timeout for the interpreter to avoid the numerous infinite loops, which results in a count of zero for some programs which may have terminated.&lt;&#x2F;p&gt;
&lt;p&gt;Nonetheless, one can see that the observed frequencies mirror the expected ones. One might worry that this sample does not include many interesting programs whose control flow generates probabilities other than powers of two. This is largely true, but every once in a while it does happen, and anecdotally &lt;code&gt;xbrili&lt;&#x2F;code&gt; still gets them right.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;This can also be seen as the analog of freeing all heap variables at once, but for the stack, or the creation of a new stack frame (where the old one doesn&#x27;t need to be saved due to tail recursion).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</description>
            </item>
        
            <item>
                <title>Type-Based Alias Analysis</title>
                <pubDate>Sun, 29 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/tbaa/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/tbaa/</guid>
                <description>&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=277670&quot;&gt;Type-Based Alias Analysis&lt;&#x2F;a&gt; by Diwan et al. describes a set of efficient yet precise alias
analysis algorithms using the types of type-safe programming languages, called type-based alias
analysis (TBAA). Three techniques are introduced: (1) &lt;em&gt;TypeDecl&lt;&#x2F;em&gt;, a very conservative analysis which
decides that two memory references may alias if they have the same type; (2) &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;, which
uses type declarations of fields and other high-level information of the program to improve
&lt;em&gt;TypeDecl&lt;&#x2F;em&gt;; and (3) &lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt;, which examines the effects of assignments to more accurately
determine the types that a memory reference may access. The &amp;quot;final version&amp;quot;, referred to as
&lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; in the paper, combines the three techniques mentioned above. &lt;&#x2F;p&gt;
&lt;p&gt;One highlight of the paper is that the authors evaluated their proposed approaches in a pretty
rigorous way. Aside from the traditional static evaluation where only the sizes of may-alias and
points-to sets are examined, the authors evaluated the effect of TBAA on potential further
optimizations by applying Redundant Load Elimination (RLE) to programs analyzed by TBAA.  The
authors further show a limit analysis, which demonstrates that at least RLE would not benefit much
from an alias analysis that is more accurate than TBAA. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;some-background&quot;&gt;Some Background&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;alias-analysis&quot;&gt;Alias Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;Since we had not covered alias analysis in class when I read the paper (and may
not have finished by the time I present it), I think it would be useful to briefly describe what alias
analysis does in this post. Alias analysis tries to statically disambiguate memory references in the
program so that the compiler knows which instructions might access the same memory
location. On a traditional computer architecture, such information can be useful when the compiler
tries to reorder memory loads and stores; if we want to map the code to hardware, alias analysis
provides information about how to actually arrange the memory in hardware, and how to design the
control FSMs of the hardware. Despite the importance of alias analysis, a very precise alias
analysis can be prohibitively slow due to the complexity of analyzing each pair of memory references
in the program. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;modula-3-programming-language&quot;&gt;Modula-3 Programming Language&lt;&#x2F;h3&gt;
&lt;p&gt;The authors describe their TBAA techniques and evaluate these
techniques using programs written in Modula-3.  Modula-3 is a statically-typed, type-safe
programming language. Since the language is type-safe, it does not allow arbitrary pointer casting
like C and C++. The language seems fairly old and limited information can be found online. I found
&lt;a href=&quot;https:&#x2F;&#x2F;www.cs.purdue.edu&#x2F;homes&#x2F;hosking&#x2F;m3&#x2F;reference&#x2F;m3.html&quot;&gt;this site&lt;&#x2F;a&gt; which might be useful if you are interested in the details of the language.
Modula-3 only allows three types of memory references: &lt;code&gt;p.f&lt;&#x2F;code&gt; to access a field of an object, &lt;code&gt;p^&lt;&#x2F;code&gt; to
dereference a pointer, and &lt;code&gt;p[i]&lt;&#x2F;code&gt; to access the i-th element of an array &lt;code&gt;p&lt;&#x2F;code&gt;. Pointer type casting
is only allowed between a type and its subtypes. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;tbaa-techniques&quot;&gt;TBAA Techniques&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;asssumptions&quot;&gt;Assumptions&lt;&#x2F;h3&gt;
&lt;p&gt;Section 2 of the paper describes the three TBAA techniques mentioned above. Aside
from assuming that the language is type-safe, the authors further assume that the compiler has
access to the whole program except for standard libraries. This assumption is later abandoned in
Section 4, where the authors evaluate the effectiveness of TBAA when only part of the program is
available.  It is also assumed that all references of a type &lt;code&gt;T&lt;&#x2F;code&gt; might access all fields of &lt;code&gt;T&lt;&#x2F;code&gt; and
its subtypes. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;terminology&quot;&gt;Terminology&lt;&#x2F;h3&gt;
&lt;ul&gt;
&lt;li&gt;Access Path (AP): An access path is a combination of the three types of allowed memory references,
like &lt;code&gt;a.b^.c[i]&lt;&#x2F;code&gt;.  Distinct object fields are assumed to have different names. &lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;Type()&lt;&#x2F;code&gt; and &lt;code&gt;Subtypes()&lt;&#x2F;code&gt;: &lt;code&gt;Type(p)&lt;&#x2F;code&gt; is the static type of &lt;code&gt;p&lt;&#x2F;code&gt;, where &lt;code&gt;p&lt;&#x2F;code&gt; is an AP. &lt;code&gt;Subtypes(T)&lt;&#x2F;code&gt;
is a set of all subtypes of &lt;code&gt;T&lt;&#x2F;code&gt;, including &lt;code&gt;T&lt;&#x2F;code&gt; itself. For subtyping, if &lt;code&gt;T1&lt;&#x2F;code&gt; is a subtype of &lt;code&gt;T&lt;&#x2F;code&gt;,
then all objects of type &lt;code&gt;T1&lt;&#x2F;code&gt; are also of type &lt;code&gt;T&lt;&#x2F;code&gt;. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;proposed-techniques&quot;&gt;Proposed Techniques&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;typedecl&quot;&gt;&lt;em&gt;TypeDecl&lt;&#x2F;em&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;The first technique proposed in the paper is surprisingly simple. Assuming that an
AP can reference any object with the same type or subtypes, &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; sees two APs &lt;code&gt;p&lt;&#x2F;code&gt; and &lt;code&gt;q&lt;&#x2F;code&gt; as
aliasing if &lt;em&gt;Subtypes(Type(p))&lt;&#x2F;em&gt; $\cap$ &lt;em&gt;Subtypes(Type(q))&lt;&#x2F;em&gt; $\neq \emptyset$.  This is clearly very
conservative: the condition is easy to satisfy, and &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; does not take any
syntactic information about the program into account. &lt;&#x2F;p&gt;
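&lt;p&gt;As a small worked example (the types here are hypothetical, chosen only for illustration): suppose &lt;code&gt;S1&lt;&#x2F;code&gt; and &lt;code&gt;S2&lt;&#x2F;code&gt; are unrelated subtypes of &lt;code&gt;T&lt;&#x2F;code&gt;, and the APs &lt;code&gt;t&lt;&#x2F;code&gt;, &lt;code&gt;s1&lt;&#x2F;code&gt;, and &lt;code&gt;s2&lt;&#x2F;code&gt; have static types &lt;code&gt;T&lt;&#x2F;code&gt;, &lt;code&gt;S1&lt;&#x2F;code&gt;, and &lt;code&gt;S2&lt;&#x2F;code&gt;. Then $\mathit{Subtypes}(T) = \{T, S_1, S_2\}$, $\mathit{Subtypes}(S_1) = \{S_1\}$, and $\mathit{Subtypes}(S_2) = \{S_2\}$. Since $\mathit{Subtypes}(T) \cap \mathit{Subtypes}(S_1) = \{S_1\} \neq \emptyset$, &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; reports that &lt;code&gt;t&lt;&#x2F;code&gt; and &lt;code&gt;s1&lt;&#x2F;code&gt; may alias (and likewise &lt;code&gt;t&lt;&#x2F;code&gt; and &lt;code&gt;s2&lt;&#x2F;code&gt;), while $\mathit{Subtypes}(S_1) \cap \mathit{Subtypes}(S_2) = \emptyset$ means that &lt;code&gt;s1&lt;&#x2F;code&gt; and &lt;code&gt;s2&lt;&#x2F;code&gt; cannot.&lt;&#x2F;p&gt;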
&lt;h4 id=&quot;fieldtypedecl&quot;&gt;&lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; solves part of the problem by considering the field names and
the different types of memory accesses. For example, &lt;code&gt;t.f&lt;&#x2F;code&gt; and &lt;code&gt;t.g&lt;&#x2F;code&gt; do not alias even if they are
of the same type, because these two APs are accessing different fields of objects. As another
example, &lt;code&gt;p.f&lt;&#x2F;code&gt; and &lt;code&gt;q[i]&lt;&#x2F;code&gt; do not alias because Modula-3 does not allow that. In more general cases,
&lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; will check whether the program has ever taken the address of an object of the target
type. If not, the involved instruction cannot alias with any other references. If this access check
fails, or if the two APs are just &amp;quot;raw&amp;quot; references to objects, &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; falls back to
&lt;em&gt;TypeDecl&lt;&#x2F;em&gt; to make a final decision. &lt;&#x2F;p&gt;
&lt;h4 id=&quot;smtyperefs&quot;&gt;&lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt;&lt;&#x2F;h4&gt;
&lt;p&gt;&lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt; solves the other part of the problem by actually examining the
assignments in the instructions. While &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; makes a very conservative assumption that all
references to the same type and subtypes might alias, &lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt; goes through the program and
tries to &amp;quot;merge&amp;quot; the types together only when there is a pointer assignment between the types. This
subtle change makes &lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt; seemingly much more powerful than &lt;em&gt;TypeDecl&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
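&lt;p&gt;To make the merging concrete, here is a worked example under stated assumptions: the types are hypothetical, and I am assuming (as I understand the paper) that the set of types an AP of type &lt;code&gt;T&lt;&#x2F;code&gt; may reference is the merged group of &lt;code&gt;T&lt;&#x2F;code&gt; intersected with $\mathit{Subtypes}(T)$. Let &lt;code&gt;S1&lt;&#x2F;code&gt; and &lt;code&gt;S2&lt;&#x2F;code&gt; be subtypes of &lt;code&gt;T&lt;&#x2F;code&gt;, and suppose the only pointer assignment between different types in the program is &lt;code&gt;t := s1&lt;&#x2F;code&gt;. Starting with every type in its own group, that assignment merges the groups of &lt;code&gt;T&lt;&#x2F;code&gt; and &lt;code&gt;S1&lt;&#x2F;code&gt;, leaving $\{T, S_1\}$ and $\{S_2\}$. An AP of type &lt;code&gt;T&lt;&#x2F;code&gt; may then reference $\{T, S_1\} \cap \mathit{Subtypes}(T) = \{T, S_1\}$, and an AP of type &lt;code&gt;S2&lt;&#x2F;code&gt; may reference $\{S_2\}$. These sets are disjoint, so &lt;code&gt;t&lt;&#x2F;code&gt; and &lt;code&gt;s2&lt;&#x2F;code&gt; are reported as not aliasing, whereas &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; would say they may alias because $S_2 \in \mathit{Subtypes}(T)$.&lt;&#x2F;p&gt;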
&lt;h3 id=&quot;algorithm-complexity&quot;&gt;Algorithm Complexity&lt;&#x2F;h3&gt;
&lt;p&gt;The final version of TBAA combines &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; with &lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt;, where
&lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt; is used in the place of &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; in &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;. The output of TBAA is a
type-based table indicating whether accesses to certain types may alias with each other, rather than
a table recording whether each pair of memory references alias or not. As a result, the time
complexity of constructing this table is linear with respect to the number of instructions in the
program and the number of types in the language. However, obtaining the alias status of each memory
reference pair can require $O(e^2)$ time where $e$ is the number of memory references in the
program. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;evaluation-methods&quot;&gt;Evaluation Methods&lt;&#x2F;h3&gt;
&lt;p&gt;The authors thoroughly evaluate TBAA using static evaluation, dynamic
evaluation, and limit analysis:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Static evaluation focuses on examining the sizes of may-alias and points-to sets, where smaller 
sets are better. While it seems very convincing and straightforward, static evaluation has two major
drawbacks. Firstly, static evaluation cannot reflect how effective the analysis is for the
optimization passes that use it. Secondly, static evaluation uses only the set sizes as its metric,
which cannot reflect the strengths and weaknesses of different alias analysis methods. &lt;&#x2F;li&gt;
&lt;li&gt;Dynamic evaluation actually examines how the alias analysis affects the runtime of the optimized
program when used together with other optimization passes. It actually reflects how effectively the
alias analysis assists the optimizations. On the downside, results of dynamic evaluation depend on
the benchmarks, inputs, and the &amp;quot;client optimization&amp;quot; that uses alias analysis. In this paper,
Redundant Load Elimination (RLE) is implemented as this &amp;quot;client&amp;quot;. &lt;&#x2F;li&gt;
&lt;li&gt;Limit analysis evaluates how much improvement we can possibly get compared with a &amp;quot;perfect&amp;quot; alias
analysis. It can be performed together with dynamic evaluation, where the oracle alias pairs can
be found by profiling. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;experimental-setup&quot;&gt;Experimental Setup&lt;&#x2F;h3&gt;
&lt;p&gt;The authors assembled their own benchmark suite to evaluate the performance of TBAA. One possible
reason why the authors did not use a standard benchmark like SPEC (which came out in 1992) is that
the standard benchmark suites were not written in Modula-3. The authors&#x27; experiments were also
impeded by GCC bugs. All the optimizations and analyses are implemented in the middle-end of the
authors&#x27; compiler toolchain. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;static-evaluation&quot;&gt;Static Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;The authors evaluated &lt;em&gt;TypeDecl&lt;&#x2F;em&gt;, &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;, and &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; on
their benchmark suite. Clearly, the latter two versions of TBAA are much more powerful than the
simple &lt;em&gt;TypeDecl&lt;&#x2F;em&gt;, both in identifying intra-procedural and inter-procedural aliases. In general
TBAA is much less effective in eliminating inter-procedural aliases. My guess for this behavior is
that since TBAA is almost a pure type-based analysis, it should be inherently more conservative when
analyzing programs with a large number of instructions, where pointer assignments and memory
references appear more often. For inter-procedural alias analysis, the compiler gets to see more
assignments, and TBAA might end up believing that all types and subtypes alias with each other. &lt;&#x2F;p&gt;
&lt;p&gt;It is surprising to me that &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; offers very limited improvement over &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;.
While the authors did not explain the reason, my guess is that the process of checking whether the
address has been taken or not in &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; implicitly provides a lot of information about
assignments. Unfortunately the authors did not provide a study that reveals more insights. In
addition, personally I am interested in seeing how effective &lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt; is when it&#x27;s used alone.
Perhaps &lt;em&gt;SMTypeRefs&lt;&#x2F;em&gt; is actually not that much more powerful than &lt;em&gt;TypeDecl&lt;&#x2F;em&gt; on real programs, which
can explain why &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; and &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; have very similar performance. The last
interesting thing I found from this section is that for &lt;code&gt;m2tom3&lt;&#x2F;code&gt;, &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; generates a
larger local alias set than &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;case-study-on-redundant-load-elimination&quot;&gt;Case Study on Redundant Load Elimination&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;redundant-load-elimination-rle&quot;&gt;Redundant Load Elimination (RLE)&lt;&#x2F;h4&gt;
&lt;p&gt;RLE replaces redundant memory expressions with variable
references, and tries to move memory references out of a loop if nothing inside the loop may
alias them. Similar optimizations also apply to branches. &lt;&#x2F;p&gt;
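&lt;p&gt;As a hypothetical illustration (not an example taken from the paper): in the sequence &lt;code&gt;x := p.f; q.g := 1; y := p.f&lt;&#x2F;code&gt;, the second load of &lt;code&gt;p.f&lt;&#x2F;code&gt; is redundant only if the store to &lt;code&gt;q.g&lt;&#x2F;code&gt; cannot change &lt;code&gt;p.f&lt;&#x2F;code&gt;. Because &lt;code&gt;f&lt;&#x2F;code&gt; and &lt;code&gt;g&lt;&#x2F;code&gt; are different fields, &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; can rule the alias out, and RLE may then rewrite the last statement as &lt;code&gt;y := x&lt;&#x2F;code&gt;. If the store were to &lt;code&gt;q.f&lt;&#x2F;code&gt; instead, and the analysis could not show that &lt;code&gt;p&lt;&#x2F;code&gt; and &lt;code&gt;q&lt;&#x2F;code&gt; refer to different objects, the second load would have to stay.&lt;&#x2F;p&gt;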
&lt;h4 id=&quot;results&quot;&gt;Results&lt;&#x2F;h4&gt;
&lt;p&gt;By statically examining the number of removed redundant loads, we can draw the same
conclusion as in the static evaluation section: &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt; and &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; are strictly
more powerful than &lt;em&gt;TypeDecl&lt;&#x2F;em&gt;. However, when comparing the performance of optimized programs, all
three techniques offer similar speedup when used together with RLE. This result demonstrates that at
least for RLE, a more precise alias analysis may not provide much benefit over TBAA. The limit
analysis confirms this statement, because almost all redundant loads can be removed by using RLE
together with TBAA for most benchmarks.  For the redundant loads that cannot be removed, the
authors studied the cause and found that they are mostly caused by problems other than alias
analysis. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;performance-on-incomplete-programs&quot;&gt;Performance on Incomplete Programs&lt;&#x2F;h3&gt;
&lt;p&gt;The assumption of the compiler having access to the whole
program is often violated in cases such as separate compilation.  As a result, the authors performed
additional experiments to evaluate how TBAA performs when this assumption does not hold.  The
authors use &amp;quot;open world&amp;quot; to refer to the cases where the assumption does not hold, and use &amp;quot;closed
world&amp;quot; to refer to situations where the whole program is available to the compiler. To ensure that
TBAA on incomplete programs still yields correct results in the &amp;quot;open world&amp;quot; scenario, the authors
made some changes to make it more conservative:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;When evaluating whether the address of a memory access has been taken elsewhere, the modified
version also checks the function arguments. If two function arguments are both references and have
the same type, then the instructions accessing these two function arguments might alias, because the
code that calls the function might assign aliasing objects to the function arguments. &lt;&#x2F;li&gt;
&lt;li&gt;When merging types in &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt;, any two types with a subtype relationship are merged
together, since the unavailable code might assign them. To my understanding, this modification
makes &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; almost equivalent to &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;While these changes seem very conservative, actually they do not affect the performance of TBAA that
much. The authors show that the &amp;quot;open world&amp;quot; assumption has negligible effect on the execution time
of the compiled program when RLE is applied. The result is not surprising, because even in the
&amp;quot;closed world&amp;quot; scenario, &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; does not have a clear advantage over &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;. The
modified version of &lt;em&gt;SMFieldTypeRefs&lt;&#x2F;em&gt; is very close to &lt;em&gt;FieldTypeDecl&lt;&#x2F;em&gt;, with just slightly more
restrictions. When I first read the paper, I felt that there should be an additional experiment
showing how TBAA performs when different portions of the code are available. Thanks to Adrian&#x27;s
suggestions, now I think that the critical point of this experiment is to show that the
effectiveness of TBAA is not greatly affected by the two seemingly conservative assumptions. In
the &amp;quot;open world&amp;quot; scenario, no matter how much code the compiler gets to see, it always needs to
make the same assumptions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion-and-discussions&quot;&gt;Conclusion and Discussions&lt;&#x2F;h2&gt;
&lt;p&gt;This paper introduces Type-Based Alias Analysis (TBAA), a
simple but powerful technique for disambiguating memory references.  The technique is efficient,
because building the data structure for the analysis requires only linear time with respect to both
the number of instructions in the program and the number of types in the language. The time
complexity for querying a pair of instructions is linear with respect to the number of types in the
language. Using a self-implemented optimization pass as the &amp;quot;client&amp;quot; to alias analysis, the authors
show that TBAA can offer close-to-optimal performance improvement to the compiled programs. &lt;&#x2F;p&gt;
&lt;p&gt;The authors tried their best to thoroughly evaluate TBAA and performed some evaluations that
previous work never did.  Unfortunately, they did not further dig into some points that I am
personally interested in, and some experimental details are either missing or not clearly explained.
This echoes the importance of doing thorough, complete experimental evaluations. &lt;&#x2F;p&gt;
&lt;p&gt;TBAA is definitely a very useful type of alias analysis, since it achieves a very good trade-off
between complexity and accuracy. LLVM has implemented TBAA as an analysis pass. Tools that are built
on top of LLVM also leverage the results of TBAA as hints to optimizations. One family of such
tools, named High-Level Synthesis tools, try to enable designers to describe their hardware in pure
software languages by performing automatic analysis, optimizations, and hardware generation inside
the compiler. With such tools becoming popular, alias analysis will have a completely different
group of &amp;quot;clients&amp;quot; compared with what the authors had twenty years ago. &lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Shrimp: Verifying IRs with Rosette</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/a-verification-backend/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/a-verification-backend/</guid>
                <description>&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;&#x2F;h2&gt;
&lt;p&gt;Writing programs is famously hard. Writing programs that generate programs
(compilers) is harder still. Compiler verification usually comes in two
flavors: (1) Proving a compiler is correct by construction using a
&lt;a href=&quot;https:&#x2F;&#x2F;coq.inria.fr&#x2F;&quot;&gt;proof-assistant&lt;&#x2F;a&gt;, or (2) proving that each compiler pass preserves the
observable semantics of a program by checking the equivalence of the input and
the output programs. The salient difference is that in the
correct-by-construction approach, the &lt;em&gt;compiler&lt;&#x2F;em&gt; is verified, i.e. the
verification is done once, while in the equivalence-checking approach, the
&lt;em&gt;output&lt;&#x2F;em&gt; is verified which requires verification to happen every time a
pass is run.&lt;&#x2F;p&gt;
&lt;p&gt;Non-trivial correct-by-construction compilers have been demonstrated to be viable,
but they require several person-years of work to implement, specify&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;,
and prove correct. On the other hand, proving program equivalence automatically
is a &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Turing_completeness&quot;&gt;remarkably hard problem&lt;&#x2F;a&gt;
which forces such verification efforts to somehow bound the space of program
behaviors.&lt;&#x2F;p&gt;
&lt;p&gt;For our project, we implemented a pass verification infrastructure for Bril using the
&lt;a href=&quot;https:&#x2F;&#x2F;emina.github.io&#x2F;rosette&#x2F;&quot;&gt;Rosette&lt;&#x2F;a&gt; framework and verified the correctness of a local value numbering
pass.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;smt-solving-briefy&quot;&gt;SMT Solving, Briefly&lt;&#x2F;h2&gt;
&lt;p&gt;Underlying a lot of automatic proof generation is SMT solving. Satisfiability
Modulo Theories (SMT) is a generalization of the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Boolean_satisfiability_problem&quot;&gt;SAT&lt;&#x2F;a&gt; problem that allows us
to augment our logic with various &amp;quot;theories&amp;quot; (naturals, rationals, arrays, etc.)
to prove properties in a domain that we care about. For example, SAT + theory of
integers can be used to solve &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Integer_programming&quot;&gt;Integer Linear Programming&lt;&#x2F;a&gt; problems.&lt;&#x2F;p&gt;
&lt;p&gt;Program properties can be verified by first encoding the semantics of your
language as an SMT formula and then asking a solver to check it: the property is proved
when the solver finds no satisfying assignment for its negation, that is, no counterexample.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;rosette&quot;&gt;Rosette&lt;&#x2F;h2&gt;
&lt;p&gt;Rosette is a symbolic execution engine for the &lt;a href=&quot;https:&#x2F;&#x2F;racket-lang.org&#x2F;&quot;&gt;Racket&lt;&#x2F;a&gt; programming language.
It lets us write normal Racket programs and does the work of automatically lifting them
to perform symbolic computations. This is different than simply having bindings into an SMT
solver where you use code to generate constraints because Rosette gives symbolic meaning
to actual Racket programs.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the following program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def add(x):
  return x + 1
&lt;&#x2F;pre&gt;
&lt;p&gt;In addition to running this program with &lt;em&gt;concrete&lt;&#x2F;em&gt; inputs (like &lt;code&gt;1&lt;&#x2F;code&gt;), Rosette
allows us to run it with a &lt;em&gt;symbolic input&lt;&#x2F;em&gt;. When computing with symbolic
inputs, Rosette &lt;em&gt;lifts&lt;&#x2F;em&gt; operations like &lt;code&gt;+&lt;&#x2F;code&gt; to return symbolic formulas
instead.  So, running this program with the symbolic input &lt;code&gt;x&lt;&#x2F;code&gt; would give us
the symbolic value &lt;code&gt;x + 1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Rosette also lets us ask &lt;em&gt;verification queries&lt;&#x2F;em&gt; using symbolic inputs.
We can write the following program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
symbolic x integer?
verify (forall x. add(x) &amp;gt; x)
&lt;&#x2F;pre&gt;
&lt;p&gt;Rosette will convert this into an SMT formula and verify its correctness using
a backend solver.&lt;&#x2F;p&gt;
&lt;p&gt;If we give Rosette a falsifiable formula:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
symbolic x integer?
verify (forall x. add(x) &amp;lt; x)
&lt;&#x2F;pre&gt;
&lt;p&gt;Rosette generates a &lt;em&gt;model&lt;&#x2F;em&gt; where the formula is false. In this case, Rosette
will report that when &lt;code&gt;x = 0&lt;&#x2F;code&gt;, this formula is false.&lt;&#x2F;p&gt;
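&lt;p&gt;Rosette hands such queries to an SMT solver. As a rough stand-in for what a verification query asks, the Python sketch below searches a small bounded domain for a counterexample. The &lt;code&gt;verify&lt;&#x2F;code&gt; helper and the bound are assumptions made for illustration; a real solver reasons about all integers without enumerating them.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def add(x):
    return x + 1


def verify(prop, domain):
    """Return a counterexample from domain, or None if prop holds on all of it."""
    for x in domain:
        if not prop(x):
            return x
    return None


domain = range(0, 16)   # assumed bound; a solver reasons about all integers
always_greater = verify(lambda x: add(x) > x, domain)
counterexample = verify(lambda x: x > add(x), domain)   # i.e. add(x) is less than x
&lt;&#x2F;pre&gt;
&lt;p&gt;The first query finds no counterexample, so &lt;code&gt;always_greater&lt;&#x2F;code&gt; is &lt;code&gt;None&lt;&#x2F;code&gt;. The second fails on the very first value tried, &lt;code&gt;0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;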
&lt;h2 id=&quot;symbolic-interpretation&quot;&gt;Symbolic Interpretation&lt;&#x2F;h2&gt;
&lt;p&gt;A symbolic interpreter is simply an interpreter that executes over symbolic values rather than real values.
A standard interpreter takes an expression, such as &lt;code&gt;x + 2 + 3&lt;&#x2F;code&gt;, and a concrete variable assignment, like &lt;code&gt;x = 1&lt;&#x2F;code&gt;,
and then recursively evaluates the expression, substituting the value for &lt;code&gt;x&lt;&#x2F;code&gt; every time we see it. In this
case &lt;code&gt;x + 2 + 3&lt;&#x2F;code&gt; evaluates to &lt;code&gt;6&lt;&#x2F;code&gt;. A symbolic interpreter works on the same types of programs,
but takes symbols as arguments instead of concrete value assignments. For the same program, &lt;code&gt;x + 2 + 3&lt;&#x2F;code&gt;, symbolic
interpretation produces the formula &lt;code&gt;x + 5&lt;&#x2F;code&gt;. Computations that don&#x27;t involve symbols are still run concretely and
Rosette is smart enough to do this regardless of the parenthesization of the expression.&lt;&#x2F;p&gt;
&lt;p&gt;This proves useful for verification because it reduces the problem of program equivalence to formula equivalence.
To prove that the program &lt;code&gt;x + 2 + 3&lt;&#x2F;code&gt; is equivalent to the program &lt;code&gt;3 + 2 + x&lt;&#x2F;code&gt; we only need to reduce these
to formulas and then prove their equivalence. This still looks hard, but it turns out that we can use SMT
solvers to do most of the hard work.&lt;&#x2F;p&gt;
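&lt;p&gt;As a rough illustration of lifting, here is a Python sketch of a symbolic value that supports &lt;code&gt;+&lt;&#x2F;code&gt;. It is not how Rosette is implemented: the &lt;code&gt;Sym&lt;&#x2F;code&gt; class is hypothetical, it only handles addition, and it decides equivalence by comparing normal forms rather than by querying an SMT solver.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
class Sym:
    """A symbolic integer kept in the linear form c1*v1 + ... + const."""

    def __init__(self, terms, const=0):
        self.terms = dict(terms)   # variable name to coefficient
        self.const = const

    def __add__(self, other):
        if isinstance(other, int):
            # Concrete operands are folded into the constant part.
            return Sym(self.terms, self.const + other)
        merged = dict(self.terms)
        for name, coeff in other.terms.items():
            merged[name] = merged.get(name, 0) + coeff
        return Sym(merged, self.const + other.const)

    __radd__ = __add__   # lets a concrete int appear on the left of +

    def key(self):
        return (sorted(self.terms.items()), self.const)


def program1(x):
    return x + 2 + 3


def program2(x):
    return 3 + 2 + x


concrete = program1(1)        # ordinary interpretation with x = 1
x = Sym({'x': 1})
formula1 = program1(x)        # symbolic interpretation gives x + 5
formula2 = program2(x)        # 3 + 2 is computed concretely, then 5 + x
equivalent = formula1.key() == formula2.key()
&lt;&#x2F;pre&gt;
&lt;p&gt;The same unmodified functions run on both an &lt;code&gt;int&lt;&#x2F;code&gt; and a &lt;code&gt;Sym&lt;&#x2F;code&gt;, which is the spirit of lifting: the interpreter is written once and the symbolic meaning comes from the values it is run on.&lt;&#x2F;p&gt;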
&lt;p&gt;We have reduced the problem of program equivalence to symbolic interpretation plus a query
to an SMT solver. Fortunately, Rosette makes both of these tasks simple. We can write a normal interpreter for Bril
in Racket and Rosette will lift the computation into SMT formulas and also make the query to the SMT solver.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;limiting-scope-to-basic-blocks&quot;&gt;Limiting scope to basic blocks&lt;&#x2F;h3&gt;
&lt;p&gt;The scalability of SMT-based verification depends on the choice of the theories
we use (reals, integers, arrays, etc.) and the size of formulas we generate.&lt;&#x2F;p&gt;
&lt;p&gt;Choosing rich theories like non-linear arithmetic makes it easier to
translate the semantics of a program but also generates formulas that might be
undecidable. We restrict ourselves to the fragment of quantifier-free
bitvector formulas, which is decidable and has fast solvers. Rosette automatically
translates all integers to bitvectors of a programmer-defined size.&lt;&#x2F;p&gt;
&lt;p&gt;To reduce the size of the formulas we generate, we only try to prove equivalence
at the basic block level. Note that basic-block equivalence implies program
equivalence but not the other way around. That means that our verifier is
necessarily conservative and might give false positives. We think basic
block equivalence is the right level of verification because most optimizations
only locally change the program structure (they either add basic blocks or remove
them, but not both). We defer the problem of verifying more complicated optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;To verify that two basic blocks are equivalent, we assume that the common set
of live variables are equal, and ask Rosette to verify that the symbolic
formulas we get from interpretation for each assigned variable are equivalent.
Formally, for two basic blocks $b1$ and $b2$, we generate the formula:&lt;&#x2F;p&gt;
&lt;p&gt;$ \forall lives(b1), lives(b2). interpret(b1) = interpret(b2) $&lt;&#x2F;p&gt;
&lt;p&gt;where the $lives$ function generates the live variables in a basic block and
$interpret$ returns a new state map for all variables in the basic block.&lt;&#x2F;p&gt;
&lt;p&gt;Concretely, given the following basic block:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
block1 {
  ...
  sum1: int = add a b;
  sum2: int = add a b;
  prod: int = mul sum1 sum2;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;a simple CSE and dead code elimination will produce the following code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
block2 {
  ...
  sum1: int = add a b;
  prod: int = mul sum1 sum1;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;We first find the common set of live variables.
In this case, &lt;code&gt;a, b&lt;&#x2F;code&gt; are live at the beginning of both of these blocks. Next, we create a symbolic version
of these variables for each block. We&#x27;ll use &lt;code&gt;$&lt;&#x2F;code&gt; to designate symbolic variables.
This gives us &lt;code&gt;a$1, b$1&lt;&#x2F;code&gt; for the first block and &lt;code&gt;a$2, b$2&lt;&#x2F;code&gt; for the second block. We assume that
&lt;code&gt;a$1 = a$2&lt;&#x2F;code&gt; and &lt;code&gt;b$1 = b$2&lt;&#x2F;code&gt;. Then we can call our basic block symbolic interpreter with these
variables to get the following formula:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
block1
sum1 = a$1 + b$1
sum2 = a$1 + b$1
prod = (a$1 + b$1) * (a$1 + b$1)

block2
sum1 = a$2 + b$2
prod = (a$2 + b$2) * (a$2 + b$2)
&lt;&#x2F;pre&gt;
&lt;p&gt;Finally we check if the variables which are defined in both blocks are equivalent.
In other words, assuming that the common live variables are equal, is the following true:&lt;&#x2F;p&gt;
&lt;p&gt;$ \forall a1, a2, b1, b2. a1 + b1 = a2 + b2 \land (a1 + b1) * (a1 + b1) = (a2 + b2) * (a2 + b2) $&lt;&#x2F;p&gt;
&lt;p&gt;The SMT solver will verify this for us, and if it can&#x27;t prove the formula to be valid,
it will provide a counter-example that falsifies it. In this case, it is not too hard to see
that this formula is in fact valid, which shows that these two basic blocks are functionally
equivalent.&lt;&#x2F;p&gt;
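&lt;p&gt;The check can be sketched in Python under some loud assumptions: instead of Rosette and an SMT solver, this version enumerates every value of a small bitvector width, and the tuple encoding of instructions and the helper names &lt;code&gt;interpret&lt;&#x2F;code&gt; and &lt;code&gt;find_counterexample&lt;&#x2F;code&gt; are made up for illustration. It interprets both blocks from the same live-in values and compares every variable that both blocks define.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
from itertools import product

OPS = {
    'add': lambda x, y: x + y,
    'sub': lambda x, y: x - y,
    'mul': lambda x, y: x * y,
}


def interpret(block, env, width):
    """Run a block of (dest, op, args) instructions on width-bit integers."""
    state = dict(env)
    for dest, op, args in block:
        if op == 'id':
            result = state[args[0]]
        else:
            result = OPS[op](state[args[0]], state[args[1]])
        state[dest] = result % (2 ** width)   # wrap around like a bitvector
    return state


def find_counterexample(block1, block2, live_ins, width=4):
    """Return live-in values on which a commonly defined variable differs."""
    defs1 = {dest for dest, _, _ in block1}
    defs2 = {dest for dest, _, _ in block2}
    common = sorted(defs1.intersection(defs2))
    for values in product(range(2 ** width), repeat=len(live_ins)):
        env = dict(zip(live_ins, values))   # both blocks start from equal inputs
        out1 = interpret(block1, env, width)
        out2 = interpret(block2, env, width)
        if any(out1[name] != out2[name] for name in common):
            return env
    return None


block1 = [
    ('sum1', 'add', ('a', 'b')),
    ('sum2', 'add', ('a', 'b')),
    ('prod', 'mul', ('sum1', 'sum2')),
]
block2 = [
    ('sum1', 'add', ('a', 'b')),
    ('prod', 'mul', ('sum1', 'sum1')),
]
broken = [
    ('sum1', 'add', ('a', 'b')),
    ('prod', 'mul', ('sum1', 'a')),   # a deliberately wrong rewrite
]

cse_result = find_counterexample(block1, block2, ['a', 'b'])
broken_result = find_counterexample(block1, broken, ['a', 'b'])
&lt;&#x2F;pre&gt;
&lt;p&gt;For the CSE example no counterexample exists, so &lt;code&gt;cse_result&lt;&#x2F;code&gt; is &lt;code&gt;None&lt;&#x2F;code&gt;. The deliberately wrong rewrite is caught with concrete inputs, much like the model an SMT solver would return. Enumeration only scales to tiny widths, which is why a solver is the better tool in practice.&lt;&#x2F;p&gt;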
&lt;h3 id=&quot;downsides&quot;&gt;Downsides&lt;&#x2F;h3&gt;
&lt;p&gt;The downside of this approach is that it only conservatively approximates the result
of each basic block. We may lose information about constraints on variables that cross
basic block boundaries and are therefore unable to verify the correctness of global
program optimizations. For example, consider the following toy program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
main {
  a: int = const 2;
  b: int = const 4;
  c: int = id a;
  jmp next;
next:
  sum: int = add a c;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Because &lt;code&gt;c&lt;&#x2F;code&gt; is a copy of &lt;code&gt;a&lt;&#x2F;code&gt;, this program would be functionally the same if you replaced the assignment
to &lt;code&gt;sum&lt;&#x2F;code&gt; with &lt;code&gt;sum: int = add a a&lt;&#x2F;code&gt;. However, because we are only doing verification on the basic block level,
we don&#x27;t know that these programs are equivalent.&lt;&#x2F;p&gt;
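&lt;p&gt;A small Python sketch makes the lost information concrete. The two functions are hypothetical stand-ins for the &lt;code&gt;next&lt;&#x2F;code&gt; block before and after the rewrite, checked over a tiny range of values rather than with a solver.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
from itertools import product


def original_next(a, c):
    return a + c   # sum: int = add a c


def rewritten_next(a, c):
    return a + a   # sum: int = add a a


# Checking the block alone: a and c are unconstrained live-ins.
mismatches = [(a, c) for a, c in product(range(4), repeat=2)
              if original_next(a, c) != rewritten_next(a, c)]

# Checking with the fact established in the previous block: c is a copy of a.
agree_when_copied = all(original_next(a, a) == rewritten_next(a, a)
                        for a in range(4))
&lt;&#x2F;pre&gt;
&lt;p&gt;Viewed in isolation the rewrite has counterexamples such as &lt;code&gt;a = 0, c = 1&lt;&#x2F;code&gt;, even though none of them can occur in the full program.&lt;&#x2F;p&gt;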
&lt;p&gt;Another problem is that this approach to verification relies on the existence
of test programs. We are not actually analyzing the code of the optimization so
if you don&#x27;t have extensive enough tests, bugs may go by unnoticed. Of course,
you could run this after every invocation of the compiler to increase the
likelihood of finding bugs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To evaluate Shrimp, we implemented &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Common_subexpression_elimination&quot;&gt;Common sub-expression elimination (CSE)&lt;&#x2F;a&gt;
using &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Value_numbering#Local_value_numbering&quot;&gt;Local value numbering (LVN)&lt;&#x2F;a&gt; to show that Shrimp is useful in finding
correctness bugs. We intentionally planted two bugs and found a third bug in the process of testing.&lt;&#x2F;p&gt;
&lt;p&gt;There are some subtleties to a correct implementation of LVN. If you know that the variable
&lt;code&gt;sum1&lt;&#x2F;code&gt; holds the value &lt;code&gt;a + b&lt;&#x2F;code&gt;, you have to make sure that &lt;code&gt;sum1&lt;&#x2F;code&gt; is not assigned to again before
you use it. For example, consider the following Bril program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
sum1: int = add a b;
sum1: int = id a;
sum2: int = add a b;
prod: int = mul sum1 sum2;
&lt;&#x2F;pre&gt;
&lt;p&gt;We would like to replace &lt;code&gt;sum2: int = add a b&lt;&#x2F;code&gt; with &lt;code&gt;sum2: int = id sum1&lt;&#x2F;code&gt; because we
have already computed the value. However, we can&#x27;t do this directly because then &lt;code&gt;sum2&lt;&#x2F;code&gt; would
have the value &lt;code&gt;a&lt;&#x2F;code&gt;, not &lt;code&gt;a + b&lt;&#x2F;code&gt;. The solution is to rename the first instance of &lt;code&gt;sum1&lt;&#x2F;code&gt; to something unique so that we don&#x27;t lose our reference to the value &lt;code&gt;a + b&lt;&#x2F;code&gt;. We can
then replace &lt;code&gt;sum2&lt;&#x2F;code&gt; with a copy from this new variable.&lt;&#x2F;p&gt;
&lt;p&gt;We implemented the faulty version and ran Shrimp. It was able to show that the programs
are not equivalent and even produced a counter example to prove this.
With this information, it is easy to walk through the execution of the code
and discover the source of the bug.&lt;&#x2F;p&gt;
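&lt;p&gt;To make the bug tangible, here are the three variants of that block transcribed into plain Python. This is only an illustration: the function names and the fresh name &lt;code&gt;lvn_tmp&lt;&#x2F;code&gt; are hypothetical, and our LVN pass of course operates on Bril rather than Python.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def original(a, b):
    sum1 = a + b
    sum1 = a
    sum2 = a + b
    return sum1 * sum2


def faulty_lvn(a, b):
    sum1 = a + b
    sum1 = a
    sum2 = sum1        # wrong: sum1 no longer holds a + b
    return sum1 * sum2


def renamed_lvn(a, b):
    lvn_tmp = a + b    # the clobbered definition gets a unique name
    sum1 = a
    sum2 = lvn_tmp     # copy from the name that still holds a + b
    return sum1 * sum2


expected = original(1, 2)
wrong = faulty_lvn(1, 2)
fixed = renamed_lvn(1, 2)
&lt;&#x2F;pre&gt;
&lt;p&gt;With &lt;code&gt;a = 1, b = 2&lt;&#x2F;code&gt; the original and the renamed version both produce 3, while the faulty version produces 1.&lt;&#x2F;p&gt;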
&lt;p&gt;Next we tried extending CSE to deal with commutativity.
It would be nice if the compiler knew that &lt;code&gt;a + b&lt;&#x2F;code&gt; is equal to &lt;code&gt;b + a&lt;&#x2F;code&gt; so that it could eliminate more
sub-expressions. The most naïve thing to do is sort the arguments for all expressions when you
compare them so that &lt;code&gt;a + b&lt;&#x2F;code&gt; is the same value as &lt;code&gt;b + a&lt;&#x2F;code&gt;. However, this by itself is not enough.
Testing the following example with Shrimp reveals the problem:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
 sub1: int = sub a b;
 sub2: int = sub b a;
 prod: int = mul sub1 sub2;
&lt;&#x2F;pre&gt;
&lt;p&gt;Shrimp gives us the counter example &lt;code&gt;a = -8, b = -4&lt;&#x2F;code&gt;. The problem is that we can&#x27;t
sort the arguments for every instruction; $a - b \neq b - a$. Shrimp helps to reveal
this problem.&lt;&#x2F;p&gt;
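&lt;p&gt;One way to repair this, sketched in Python with a hypothetical &lt;code&gt;value_key&lt;&#x2F;code&gt; helper and an assumed set of commutative opcodes, is to sort the arguments only when the operation is known to be commutative:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
COMMUTATIVE_OPS = {'add', 'mul'}   # assumed set; sub and div are left out


def value_key(op, args):
    """Key used to look up an already computed value in the LVN table."""
    if op in COMMUTATIVE_OPS:
        args = sorted(args)          # a + b and b + a share one key
    return (op, tuple(args))         # the opcode is part of the key


same_add = value_key('add', ['a', 'b']) == value_key('add', ['b', 'a'])
same_sub = value_key('sub', ['a', 'b']) == value_key('sub', ['b', 'a'])
a, b = -8, -4                        # the counterexample reported above
differs = (a - b) != (b - a)
&lt;&#x2F;pre&gt;
&lt;p&gt;The two additions now share a table entry, while the two subtractions stay distinct, as they must for the counterexample values.&lt;&#x2F;p&gt;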
&lt;p&gt;The final bug was actually an unintentional bug that Shrimp helped us find. We made the arguably
bad decision to give each Bril instruction its own structure that is a sub-type of a &lt;code&gt;dest-instr&lt;&#x2F;code&gt; structure
rather than to give &lt;code&gt;dest-instr&lt;&#x2F;code&gt; an op-code field. When we were looking up values in the LVN table,
we only checked that the fields in &lt;code&gt;dest-instr&lt;&#x2F;code&gt; were the same. We forgot to compare the actual
types of the instructions! Shrimp was able to reveal this bug from the following example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
sub1: int = sub a b;
sub1: int = add a b;
sub2: int = sub b a;
prod: int = mul sub1 sub2;
&lt;&#x2F;pre&gt;
&lt;p&gt;This made it easy to find and fix a rather embarrassing bug in the LVN implementation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Symbolic verification provides a trade-off between verification effort and
the completeness of a verification procedure. Beyond our implementation,
there has also been recent work in verifying correctness of &lt;a href=&quot;https:&#x2F;&#x2F;homes.cs.washington.edu&#x2F;%7Eemina&#x2F;doc&#x2F;yggdrasil.osdi16.pdf&quot;&gt;file systems&lt;&#x2F;a&gt;,
&lt;a href=&quot;https:&#x2F;&#x2F;homes.cs.washington.edu&#x2F;%7Eemina&#x2F;doc&#x2F;memsynth.pldi17.pdf&quot;&gt;memory models&lt;&#x2F;a&gt;, and &lt;a href=&quot;https:&#x2F;&#x2F;unsat.cs.washington.edu&#x2F;papers&#x2F;nelson-serval.pdf&quot;&gt;operating systems&lt;&#x2F;a&gt; code using symbolic verification
demonstrating the flexibility of this approach to program verification.&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;The problem of specifying the correctness condition of a compiler is itself
a non-trivial, open research problem. Should the compiler preserve the stdout
behavior, or should it give even stronger guarantees such as preserving the
timing behavior?&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Symbolic_execution#Limitations&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</description>
            </item>
        
            <item>
                <title>Automatic Differentiation in Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/autograd/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/autograd/</guid>
                <description>&lt;p&gt;Our goal was to add &lt;em&gt;automatic differentiation&lt;&#x2F;em&gt; to Bril. Automatic
Differentiation is a technique for calculating the derivatives of arbitrary
computer programs. Suppose we wanted to calculate the derivative of some
arbitrarily complex function, say &lt;code&gt;f(x, y) = (x*y + y)&#x2F;(x*x)&lt;&#x2F;code&gt;. One &lt;em&gt;could&lt;&#x2F;em&gt;
manually derive the partial derivatives with respect to x and y... or we could simply
apply automatic differentiation. :)&lt;&#x2F;p&gt;
&lt;p&gt;The central observation of automatic differentiation is that even the most
complicated equation can be split into a composition of primitives
(like add, multiply, or trigonometric functions). Then, through the magic of
the chain rule, we can compose the derivatives together arbitrarily to get
the derivative of our complicated function.&lt;&#x2F;p&gt;
&lt;p&gt;In the ML community, this is often known as &lt;em&gt;autograd&lt;&#x2F;em&gt;. This is somewhat of a
misnomer - Autograd is actually the name of a &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;HIPS&#x2F;autograd&quot;&gt;popular Python
package&lt;&#x2F;a&gt;, and several popular ML libraries
have added to the confusion over
&lt;a href=&quot;https:&#x2F;&#x2F;twitter.com&#x2F;soumithchintala&#x2F;status&#x2F;925700439450030082&quot;&gt;terminology&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background-design-overview&quot;&gt;Background &#x2F; Design Overview&lt;&#x2F;h2&gt;
&lt;p&gt;There are two primary ways of doing automatic differentiation. The first is
known as &amp;quot;forward-mode automatic differentiation&amp;quot;, the second as
&amp;quot;reverse-mode automatic differentiation&amp;quot; (i.e., backpropagation). In reality,
these are simply different orders to apply the chain-rule, but they have
far-reaching consequences.&lt;&#x2F;p&gt;
&lt;p&gt;Take the function composition (example taken from Wikipedia).
$$
y = f(g(h(x))) = f(g(h(w_0))) = f(g(w_1))= f(w_2) = w_3
$$
$$
w_0 = x
$$
$$
w_1 = h(w_0)
$$
$$
w_2 = g(w_1)
$$
$$
w_3 = f(w_2) = y
$$&lt;&#x2F;p&gt;
&lt;p&gt;Then, through an application of the chain rule, we obtain that $\frac{dy}{dx} =
\frac{dy}{dw_2} \frac{dw_2}{dw_1} \frac{dw_1}{dx}$. Note that this application of the chain rule may
not look like the standard form you were taught ($\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$ vs
$f(g(x))&#x27; = f&#x27;(g(x))g&#x27;(x)$). However, they are fundamentally the same,
assuming that $z = f(y)$ and $y = g(x)$.&lt;&#x2F;p&gt;
&lt;p&gt;Substituting $\frac{dz}{dy}$ with $f&#x27;(y)$ and substituting $\frac{dy}{dx}$ with $g&#x27;(x)$, we get:&lt;&#x2F;p&gt;
&lt;p&gt;$$
\frac{dz}{dy} \frac{dy}{dx} = f&#x27;(y)g&#x27;(x) = f&#x27;(g(x))g&#x27;(x)
$$&lt;&#x2F;p&gt;
&lt;p&gt;Now, back to automatic differentiation.&lt;&#x2F;p&gt;
&lt;p&gt;Given $\frac{dy}{dx} = \frac{dy}{dw_2} \frac{dw_2}{dw_1} \frac{dw_1}{dx}$, there are two ways we can decompose
this expression into functions. We could compute $\frac{dy}{dx} = \left(\frac{dy}{dw_2}\right)
\left(\frac{dw_2}{dw_1} \frac{dw_1}{dx}\right)$ or we could compute $\frac{dy}{dx} = \left(\frac{dy}{dw_2} \frac{dw_2}{dw_1}\right)
\left(\frac{dw_1}{dx}\right)$. The first one is forward-mode automatic differentiation, the
second one is reverse-mode.&lt;&#x2F;p&gt;
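&lt;p&gt;The two groupings can be checked numerically. In the Python sketch below, the functions &lt;code&gt;h&lt;&#x2F;code&gt;, &lt;code&gt;g&lt;&#x2F;code&gt;, and &lt;code&gt;f&lt;&#x2F;code&gt; and their hand-written derivatives are made-up examples, chosen only so that the arithmetic stays exact.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def h(x):
    return x * x


def dh(x):
    return 2 * x


def g(w):
    return 3 * w


def dg(w):
    return 3


def f(w):
    return w + w * w


def df(w):
    return 1 + 2 * w


x = 2
w1 = h(x)
w2 = g(w1)
y = f(w2)

dw1_dx = dh(x)
dw2_dw1 = dg(w1)
dy_dw2 = df(w2)

# Forward mode multiplies starting from the input side.
forward = dy_dw2 * (dw2_dw1 * dw1_dx)
# Reverse mode multiplies starting from the output side.
reverse = (dy_dw2 * dw2_dw1) * dw1_dx
&lt;&#x2F;pre&gt;
&lt;p&gt;Both orders give the same number, as associativity of multiplication promises. The difference only starts to matter once the factors are Jacobian matrices rather than scalars.&lt;&#x2F;p&gt;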
&lt;p&gt;When we have only one input and one output, these don&#x27;t differ in a
meaningful manner. Mathematically, these will result in the same values as
well. The only difference is in the complexity of these 2 methods - we&#x27;ll
cover that in a later portion.&lt;&#x2F;p&gt;
&lt;p&gt;Despite the fact that this mathematical symmetry is pretty nifty, it doesn&#x27;t
provide much intuition about how these automatic differentiation methods work
in practice. How does one translate actual code into this abstract
mathematical expression?&lt;&#x2F;p&gt;
&lt;p&gt;Take this code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;f(x, y):
    a = x * y
    b = x + a
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Applying the chain rule, we obtain:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;f(x, y):
    a = x * y
    da = y * dx + x * dy # We calculate da after calculating a.
    b = x + a
    db = dx + da # We can calculate db after calculating b.
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now, let &lt;code&gt;dx=1&lt;&#x2F;code&gt; and &lt;code&gt;dy=0&lt;&#x2F;code&gt; to find any of these derivatives wrt x. This is &lt;em&gt;forward-mode automatic differentiation&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;q-what-does-setting-dx-1-even-mean&quot;&gt;Q: What does setting &lt;code&gt;dx=1&lt;&#x2F;code&gt; even mean?&lt;&#x2F;h4&gt;
&lt;p&gt;A: One common formulation is to treat your input variables as being
differentiated wrt some arbitrary variable $t$. One can think of this $t$ as
a variable that controls which direction in your input space you&#x27;re taking
the derivative along. So setting $\frac{dx}{dt}=1$ and $\frac{dy}{dt}=0$ gets you the
derivative wrt moving $x$ positively. Setting $\frac{dx}{dt}=0$ and $\frac{dy}{dt}=1$ gets
you the derivative wrt moving along $y$. One could even imagine setting
$\frac{dx}{dt}=1$ and $\frac{dy}{dt}=1$ to get the derivative wrt moving diagonally in input
space.&lt;&#x2F;p&gt;
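&lt;p&gt;Here is the example function as runnable Python, with the seeds passed in as arguments. The name &lt;code&gt;f_forward&lt;&#x2F;code&gt; and the choice to return &lt;code&gt;b&lt;&#x2F;code&gt; together with &lt;code&gt;db&lt;&#x2F;code&gt; are assumptions for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def f_forward(x, y, dx, dy):
    """Return b and its derivative along the direction (dx, dy)."""
    a = x * y
    da = y * dx + x * dy   # product rule
    b = x + a
    db = dx + da           # sum rule
    return b, db


value, wrt_x = f_forward(3.0, 4.0, 1.0, 0.0)    # move along x
_, wrt_y = f_forward(3.0, 4.0, 0.0, 1.0)        # move along y
_, diagonal = f_forward(3.0, 4.0, 1.0, 1.0)     # move along both
&lt;&#x2F;pre&gt;
&lt;p&gt;Since &lt;code&gt;b = x + x*y&lt;&#x2F;code&gt;, the derivative wrt &lt;code&gt;x&lt;&#x2F;code&gt; is &lt;code&gt;1 + y&lt;&#x2F;code&gt; and the derivative wrt &lt;code&gt;y&lt;&#x2F;code&gt; is &lt;code&gt;x&lt;&#x2F;code&gt;. The diagonal seed yields their sum. Note that each seed needs its own run of the function.&lt;&#x2F;p&gt;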
&lt;h4 id=&quot;q-what-we-wrote-after-applying-the-chain-rule-looks-awfully-close-to-actual-code-can-we-generate-code-that-generates-a-function-that-generates-derivatives&quot;&gt;Q: What we wrote after applying the chain rule looks awfully close to actual code. Can we generate code that generates a function that generates derivatives?&lt;&#x2F;h4&gt;
&lt;p&gt;A: Yes! This is a cool method of doing automatic differentiation, recently
popularized by &lt;a href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1810.07951&quot;&gt;Julia&lt;&#x2F;a&gt;. In practice, this
is a lot more difficult than my example may make it seem. Handling
expressions is relatively straight-forward, but doing a source to source
transformation that preserves control flow, state, and other such things is
substantially more difficult. The primary pitch for doing things this way is
that we can then take advantage of all the compiler optimizations we usually
apply on our code.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;q-isn-t-this-pretty-basic&quot;&gt;Q: Isn&#x27;t this pretty ... basic?&lt;&#x2F;h4&gt;
&lt;p&gt;A: I&#x27;d say so. Forward-mode automatic differentiation is a fairly intuitive
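technique. We just let our code run as normal and keep track of derivatives
as we go. For example, in the above code, each derivative (&lt;code&gt;da&lt;&#x2F;code&gt;, &lt;code&gt;db&lt;&#x2F;code&gt;) is computed right after the value it belongs to.&lt;&#x2F;p&gt;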
&lt;h2 id=&quot;forward-mode-implementation&quot;&gt;Forward-Mode Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;There&#x27;s a neat trick for implementing forward-mode automatic differentiation,
known as dual numbers. Dual numbers are in the form &lt;code&gt;a + bε&lt;&#x2F;code&gt;, and look an
awful lot like complex numbers (so &lt;code&gt;(a+bε) + (c+dε) = (a+c) + (b+d)ε&lt;&#x2F;code&gt;),
with the important difference that &lt;code&gt;ε^2 = 0&lt;&#x2F;code&gt;. So, to perform automatic
differentiation, we replace all numbers in our program with dual numbers.&lt;&#x2F;p&gt;
&lt;p&gt;Then, the derivative we want of a given number wrt the input is simply
given by its epsilon term. So, in the addition example, the derivative of the
addition of two numbers is equal to the sum of their derivatives. For
multiplication &lt;code&gt;(a+bε) * (c+dε)&lt;&#x2F;code&gt;, the product is &lt;code&gt;a*c + (a*d + b*c)ε&lt;&#x2F;code&gt; because the &lt;code&gt;ε^2&lt;&#x2F;code&gt; term vanishes, so the derivative is &lt;code&gt;(a*d + b*c)&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;So, what is this epsilon term? One (not entirely inaccurate) way of
interpreting it is as an infinitesimal. Squaring an infinitesimal makes it
disappear, but otherwise, it&#x27;s somewhat similar to numerical differentiation.&lt;&#x2F;p&gt;
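&lt;p&gt;A minimal dual-number class in Python shows the idea. This is a sketch rather than our &lt;code&gt;brili&lt;&#x2F;code&gt; changes: the &lt;code&gt;Dual&lt;&#x2F;code&gt; class is hypothetical, and the division rule is the usual quotient rule, which is an addition not spelled out above. It differentiates the function from the introduction.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
class Dual:
    """A dual number real + eps * e, where e * e = 0."""

    def __init__(self, real, eps=0.0):
        self.real = real
        self.eps = eps

    def __add__(self, other):
        return Dual(self.real + other.real, self.eps + other.eps)

    def __mul__(self, other):
        # The eps * eps term is dropped because e * e = 0.
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    def __truediv__(self, other):
        # Quotient rule for the epsilon part.
        return Dual(self.real / other.real,
                    (self.eps * other.real - self.real * other.eps)
                    / (other.real * other.real))


def f(x, y):
    return (x * y + y) / (x * x)


wrt_x = f(Dual(2.0, 1.0), Dual(3.0, 0.0))   # seed dx = 1, dy = 0
wrt_y = f(Dual(2.0, 0.0), Dual(3.0, 1.0))   # seed dx = 0, dy = 1
&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;real&lt;&#x2F;code&gt; part of each result is the ordinary value of &lt;code&gt;f(2, 3)&lt;&#x2F;code&gt;, and the &lt;code&gt;eps&lt;&#x2F;code&gt; part is the partial derivative selected by the seed. The body of &lt;code&gt;f&lt;&#x2F;code&gt; did not have to change at all.&lt;&#x2F;p&gt;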
&lt;p&gt;We implemented forward-mode in this fashion. The primary work was done in the
&lt;code&gt;brili&lt;&#x2F;code&gt; interpreter, where we simply interpreted every integer as a floating
point number, and then augmented it to be a dual number. Then, we simply
augmented every single operation to either operate on dual numbers (addition,
multiplication, etc.) or kill the derivatives (conditionals).&lt;&#x2F;p&gt;
&lt;p&gt;When we come to conditionals, the derivatives no longer flow through the
program, as they&#x27;re not directly involved in the output.&lt;&#x2F;p&gt;
&lt;p&gt;For example, take:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if (y &amp;gt; 0)
    return x;
else
    return 0;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The output of the function has no derivative wrt &lt;code&gt;y&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;reverse-mode-automatic-differentiation&quot;&gt;Reverse-Mode Automatic Differentiation&lt;&#x2F;h2&gt;
&lt;p&gt;Reverse-mode automatic differentiation is less intuitive than forward-mode.
So why do we need reverse-mode in the first place? Forward-mode allows us
to compute arbitrary derivatives, but notice that in order to get how the
output varied with the 2 input variables, we needed to apply the
auto-differentiation algorithm twice. In particular, forward-mode
auto-differentiation requires computation equivalent to &lt;code&gt;O(N)&lt;&#x2F;code&gt; evaluations of
a function &lt;code&gt;f: R^N -&amp;gt; R^M&lt;&#x2F;code&gt; in order to calculate all the derivatives.&lt;&#x2F;p&gt;
&lt;p&gt;This kind of function shows up a lot in machine learning, where we&#x27;re often
optimizing for a single loss. In that case, we often want &lt;code&gt;f:R^(millions of parameters) -&amp;gt; R^1&lt;&#x2F;code&gt;. Needing to evaluate the function millions of times to
just perform a single step of gradient descent is far too expensive.&lt;&#x2F;p&gt;
&lt;p&gt;To resolve this, we use reverse-mode automatic differentiation. At a high
level, if forward-mode is asking the question &amp;quot;How are all the output
variables affected by my current input variable?&amp;quot;, then reverse-mode is
asking the question &amp;quot;How do all the input variables affect my current output
variable?&amp;quot;. In some sense, we&#x27;re flipping the forward-mode differentiation
algorithm on its head.&lt;&#x2F;p&gt;
&lt;p&gt;This allows us to compute the derivative of an output variable wrt arbitrarily many input
variables in merely a single pass through the function. That is, we need
&lt;code&gt;O(M)&lt;&#x2F;code&gt; evaluations of a function &lt;code&gt;f: R^N -&amp;gt; R^M&lt;&#x2F;code&gt; (as opposed to &lt;code&gt;O(N)&lt;&#x2F;code&gt; for
forward-mode).&lt;&#x2F;p&gt;
&lt;p&gt;Implementing this is a bit trickier than implementing forward-mode. We can no
longer tie our gradient computation to our actual computation; instead, we must
construct a graph that allows us to &lt;em&gt;replay&lt;&#x2F;em&gt; our derivatives after all of our
computation is finished.&lt;&#x2F;p&gt;
&lt;p&gt;We implemented this by constructing a &lt;code&gt;RevADNode&lt;&#x2F;code&gt; for every computation we
perform. This &lt;code&gt;RevADNode&lt;&#x2F;code&gt; is pushed onto the children of the inputs to this
computation. In this way, our entire computation is saved as a graph in
memory, while no gradients are computed. At each point, we simply need to
store the value of the expression, as well as its portion of the derivative.&lt;&#x2F;p&gt;
&lt;p&gt;The result is a dataflow graph, where each node corresponds to a value that
was computed during the normal interpretation of the program. For example, if
there&#x27;s a loop that runs &lt;code&gt;N&lt;&#x2F;code&gt; times with &lt;code&gt;M&lt;&#x2F;code&gt; instructions in the body, this
will result in &lt;code&gt;N*M&lt;&#x2F;code&gt; &lt;code&gt;RevADNode&lt;&#x2F;code&gt;s.&lt;&#x2F;p&gt;
&lt;p&gt;Once we&#x27;ve finished all of our computation, we can finally traverse our entire graph to calculate all of our gradients.&lt;&#x2F;p&gt;
&lt;p&gt;Reverse-mode auto-differentiation requires us to calculate all of the
dependencies from the output to the value we&#x27;re calculating the gradient of.
To do so, we recursively calculate the gradient of all the node&#x27;s children,
and then calculate the gradient using the chain rule and the values that were
stored at that point. This is the key code that performs the computation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
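&lt;span style=&quot;color:#323232;&quot;&gt;  &#x2F;&#x2F; Gradient of the output with respect to this node, memoized in gradValue.
  grad(): number {
    &#x2F;&#x2F; Compare against null so a cached gradient of exactly 0 is not recomputed.
    if (this.gradValue == null) {
      this.gradValue = 0;
      &#x2F;&#x2F; Each child entry pairs a local derivative with the node that consumed this value.
      for (const val of this.children) {
        const [num, node] = val;
        this.gradValue += num * node.grad();
      }
    }
    return this.gradValue!;
  }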
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A couple notes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;While previously, we needed to manually set the initial gradient of the inputs, we now need to set the initial gradient of the outputs.&lt;&#x2F;li&gt;
&lt;li&gt;Explicitly constructing a node and pointers to its children is very memory inefficient. Typical implementations often do it in a &amp;quot;tape-based&amp;quot; manner, also known as a Wengert List. This is a mere constant factor optimization - the core idea is the same as the one presented above.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
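&lt;p&gt;To make the bookkeeping concrete, the node-and-children scheme described above can be sketched in a few lines of Python. This is a standalone illustration rather than code from our interpreter: the names &lt;code&gt;RevNode&lt;&#x2F;code&gt;, &lt;code&gt;sub&lt;&#x2F;code&gt;, and &lt;code&gt;mul&lt;&#x2F;code&gt; are made up for the example, and only subtraction and multiplication are handled.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;class RevNode:
    # One node per computed value. children holds (local derivative, consumer) pairs.
    def __init__(self, value):
        self.value = value
        self.children = []
        self.grad_value = None

    def grad(self):
        # Chain rule: sum the local derivative times the gradient of every consumer.
        if self.grad_value is None:
            self.grad_value = sum(w * node.grad() for w, node in self.children)
        return self.grad_value


def sub(a, b):
    out = RevNode(a.value - b.value)
    a.children.append((1.0, out))
    b.children.append((-1.0, out))
    return out


def mul(a, b):
    out = RevNode(a.value * b.value)
    a.children.append((b.value, out))
    b.children.append((a.value, out))
    return out


# Forward pass: build the graph while computing values.
x0, x1, two = RevNode(-0.6), RevNode(1.0), RevNode(2.0)
a = sub(x0, x1)
b = sub(a, two)
y = mul(b, a)

# Backward pass: seed the output, then ask each input for its gradient.
y.grad_value = 1.0
dx0, dx1 = x0.grad(), x1.grad()
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here &lt;code&gt;y = (x0 - x1) * (x0 - x1 - 2)&lt;&#x2F;code&gt;, so both gradients come out of one backward traversal: roughly &lt;code&gt;-5.2&lt;&#x2F;code&gt; for &lt;code&gt;x0&lt;&#x2F;code&gt; and &lt;code&gt;5.2&lt;&#x2F;code&gt; for &lt;code&gt;x1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;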
&lt;h3 id=&quot;reverse-mode-vs-forward-mode&quot;&gt;Reverse-Mode vs Forward-Mode&lt;&#x2F;h3&gt;
&lt;p&gt;It may seem that, in typical contexts, reverse-mode automatic differentiation
is strictly superior. It&#x27;s far more common to see a &amp;quot;many to one&amp;quot; function
than a &amp;quot;one to many&amp;quot; function. And that is true! Most automatic
differentiation frameworks (i.e., ML frameworks) only implement reverse-mode
(PyTorch, TensorFlow, MXNet, etc.).&lt;&#x2F;p&gt;
&lt;p&gt;However, there are still instances where forward-mode automatic
differentiation is very useful. One instance is computing the Hessian of &lt;code&gt;f(x)&lt;&#x2F;code&gt;
multiplied by a vector &lt;code&gt;v&lt;&#x2F;code&gt;, known as a Hessian-vector product. The typical algorithm
for this requires &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pytorch&#x2F;pytorch&#x2F;issues&#x2F;10223#issuecomment-413935344&quot;&gt;both forward-mode and reverse-mode automatic
differentiation&lt;&#x2F;a&gt;.
If we only have one of these implemented, we must perform at least &lt;code&gt;O(N)&lt;&#x2F;code&gt;
evaluations of the function.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;automatic-differentiation-in-the-presence-of-control-flow&quot;&gt;Automatic differentiation in the presence of control flow&lt;&#x2F;h3&gt;
&lt;p&gt;One notable absence from all this discussion has been control flow. We
haven&#x27;t discussed how gradients flow through control flow, nor mentioned
control flow at all.&lt;&#x2F;p&gt;
&lt;p&gt;Put simply, control flow can be treated as a static construct. Remember that
we&#x27;re taking derivatives, which ask how our function changes if we add an
infinitesimal epsilon to an input. In that model, our control flow will not change at all,
and our derivative can be calculated with the exact same operations that we
performed when evaluating the function.&lt;&#x2F;p&gt;
&lt;p&gt;This is limiting in certain instances, and may not correspond with what we
believe &lt;em&gt;should&lt;&#x2F;em&gt; be the gradient. For example, take&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if m &amp;gt; 0:
    return x
else:
    return 1000 * x
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If our goal is to optimize the function, then perhaps there should be some kind
of gradient through &lt;code&gt;m&lt;&#x2F;code&gt;. Differentiable relaxations of common control flow
constructs are an ongoing area of research.&lt;&#x2F;p&gt;
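&lt;p&gt;A small Python sketch shows why no gradient reaches &lt;code&gt;m&lt;&#x2F;code&gt; when control flow is treated as static. It threads hand-rolled dual numbers (a value plus its derivatives with respect to &lt;code&gt;m&lt;&#x2F;code&gt; and &lt;code&gt;x&lt;&#x2F;code&gt;) through the branch. The triple layout and the name &lt;code&gt;f&lt;&#x2F;code&gt; are assumptions made for this example only.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def f(m, x):
    # Each argument is a (value, d_dm, d_dx) triple.
    # The branch only inspects the value of m, never its derivative.
    if m[0] &amp;gt; 0:
        return x
    return (1000 * x[0], 1000 * x[1], 1000 * x[2])


x = (3.0, 0.0, 1.0)
taken = f((1.0, 1.0, 0.0), x)    # m is positive: first branch
other = f((-1.0, 1.0, 0.0), x)   # m is negative: second branch
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On either branch the derivative with respect to &lt;code&gt;m&lt;&#x2F;code&gt; (the middle entry) is 0, while the derivative with respect to &lt;code&gt;x&lt;&#x2F;code&gt; is 1 or 1000 depending on which branch ran.&lt;&#x2F;p&gt;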
&lt;h1 id=&quot;notable-challenges&quot;&gt;Notable Challenges&lt;&#x2F;h1&gt;
&lt;p&gt;One issue for us is that automatic differentiation is typically applied to
functions. But, Bril doesn&#x27;t have functions... :&#x27;( What we did was add a
representation to the Bril JSON format that specified input variables as well
as initial values for those input variables. As such, much of the usage of
the auto-differentiation must be done outside of the Bril language itself.
One interesting piece of future work would be to integrate the
function call work of one of the other groups with Bril syntax so that we could
use auto-differentiation from within Bril.&lt;&#x2F;p&gt;
&lt;p&gt;Another issue that we had was in separating out the different
auto-differentiation systems. While in other languages, AD systems can often
be implemented through a combination of custom objects + operator
overloading, Bril doesn&#x27;t have either one of these concepts. As such, we
simply shoved everything inside of the interpreter.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we had difficulties scaling our automatic differentiation systems
automatically, as Bril lacks arrays. Without arrays, we&#x27;re naturally bounded
in how many input variables we can have. We could have written some code-gen
to write much larger Bril programs, but we decided not to.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;Our primary evaluation was to run Bril as an AD engine through the command
line, for the purposes of optimizing some function within Python.&lt;&#x2F;p&gt;
&lt;p&gt;For this purpose, we wrote an &lt;code&gt;opt.py&lt;&#x2F;code&gt;, which simply compiles a TypeScript
file, sets the parameters, and repeatedly runs our AD engine for the purposes
of gradient descent.&lt;&#x2F;p&gt;
&lt;p&gt;For all of the below examples, we ran with both forward-mode and reverse-mode to ensure they both worked.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take a simple function:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;y = x * x
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With a high learning rate, our optimization process diverges! Oh no!
&lt;img src=&quot;high_lr.jpg&quot; style=&quot;max-width: 100%&quot;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;However, with a suitably low learning rate, we see that we converge to our desired minimum.
&lt;img src=&quot;low_lr.jpg&quot; style=&quot;max-width: 100%&quot;&gt;&lt;&#x2F;p&gt;
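&lt;p&gt;Why the step size matters here can be worked out directly: the gradient of &lt;code&gt;x * x&lt;&#x2F;code&gt; is &lt;code&gt;2 * x&lt;&#x2F;code&gt;, so each update multiplies &lt;code&gt;x&lt;&#x2F;code&gt; by &lt;code&gt;1 - 2 * lr&lt;&#x2F;code&gt;. The sketch below uses a starting point of 1.0 and the learning rates 1.1 and 0.1. These are illustrative values, not the ones behind the plots above.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def descend(x, lr, steps):
    # Plain gradient descent on y = x * x, whose gradient is 2 * x.
    for _ in range(steps):
        x = x - lr * 2 * x
    return x


diverged = descend(1.0, 1.1, 20)   # factor per step: 1 - 2.2 = -1.2
converged = descend(1.0, 0.1, 20)  # factor per step: 1 - 0.2 = 0.8
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With a factor of -1.2 the iterate flips sign and grows on every step, whereas with a factor of 0.8 it shrinks steadily toward 0.&lt;&#x2F;p&gt;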
&lt;p&gt;Let&#x27;s take a function with some control flow.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;if (x&amp;gt;0) {
    y = x;
} else {
    y = 0-x;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;img src=&quot;abs.jpg&quot; style=&quot;max-width: 100%&quot;&gt;
&lt;p&gt;We see that, unlike when our function was nice and smooth, we constantly oscillate around the minimum in this setting.&lt;&#x2F;p&gt;
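&lt;p&gt;The oscillation is easy to reproduce by hand. Because the derivative follows whichever branch was taken, it is always 1 or -1, so a fixed step size never shrinks the update. The sketch below assumes a starting point of 0.25 and a learning rate of 0.1, both chosen only for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def abs_grad(x):
    # Derivative of the branch that actually ran: y = x or y = 0 - x.
    if x &amp;gt; 0:
        return 1.0
    return -1.0


x, lr = 0.25, 0.1
xs = []
for _ in range(8):
    xs.append(x)
    x = x - lr * abs_grad(x)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After the first few steps, &lt;code&gt;xs&lt;&#x2F;code&gt; keeps jumping between roughly 0.05 and -0.05 instead of settling at 0.&lt;&#x2F;p&gt;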
&lt;p&gt;We also tested it on functions with multiple input variables and output variables. Unfortunately, those are a bit more difficult to visualize. However, we can report some results. :)&lt;&#x2F;p&gt;
&lt;p&gt;For example, optimizing the function:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;var a = x0 - x1;
var b = a - 2;
y = b * a;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Gives the results of &lt;code&gt;x0 = 0.7&lt;&#x2F;code&gt;, and &lt;code&gt;x1=-0.3&lt;&#x2F;code&gt;, for a minimum value of &lt;code&gt;-1&lt;&#x2F;code&gt;. The first several steps of the optimization process are shown here.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;f(x):   [x0, x1]:
[5.76] [-0.6  1. ]
[-0.7296] [ 0.44 -0.04]
[-0.98830141] [ 0.64592 -0.24592]
[-0.9994546] [ 0.68832305 -0.28832305]
[-0.99997269] [ 0.69738716 -0.29738716]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
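&lt;p&gt;Since this function has a simple closed-form gradient, the trace can be checked by hand. The Python sketch below assumes a step size of 0.2 that is multiplied by 0.99 after every step. That schedule is not recorded above. It is an assumption, chosen because it is consistent with the printed numbers.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def f(x0, x1):
    a = x0 - x1
    b = a - 2
    return b * a


def grad(x0, x1):
    # y = a * (a - 2) with a = x0 - x1, so the derivative in a is 2 * a - 2.
    dy_da = 2 * (x0 - x1) - 2
    return dy_da, -dy_da


x0, x1, lr = -0.6, 1.0, 0.2
trace = []
for _ in range(5):
    trace.append((f(x0, x1), x0, x1))
    g0, g1 = grad(x0, x1)
    x0, x1 = x0 - lr * g0, x1 - lr * g1
    lr = lr * 0.99  # assumed decay, not stated in the post
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the function only depends on &lt;code&gt;x0 - x1&lt;&#x2F;code&gt;, so every point with &lt;code&gt;x0 - x1 = 1&lt;&#x2F;code&gt; attains the minimum of &lt;code&gt;-1&lt;&#x2F;code&gt;. The pair reported above is the one reached from this particular starting point.&lt;&#x2F;p&gt;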
&lt;p&gt;We did not end up verifying the scaling properties of forward-mode vs reverse-mode for 2 reasons.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;From the implementation, these properties are fairly obvious.&lt;&#x2F;li&gt;
&lt;li&gt;It&#x27;s difficult to verify these properties in a meaningful way without arrays.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
</description>
            </item>
        
            <item>
                <title>A Backend That Translates Bril into C</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-c-backend/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-c-backend/</guid>
                <description>&lt;p&gt;Bril is an educational compiler intermediate representation that is designed for this compiler course. While there is already a Bril interpreter written in
TypeScript called &lt;code&gt;brili&lt;&#x2F;code&gt;, it would be interesting to have other backends so that we can compare the performance and the implementation complexity. In this first project,
I built a backend that translates Bril into the C language and then use GCC to compile and execute the program.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-choose-c&quot;&gt;Why choose C?&lt;&#x2F;h2&gt;
&lt;p&gt;Translating to C provides some benefits such as portability, good performance, and easier integration with other C libraries and tools. C is widely used on many embedded devices. We can also get native performance and leverage the GCC compiler&#x27;s optimizations. By translating to C, we can easily integrate it with other C libraries. Plus, because the Bril instructions (as they are now) can be mapped to C statements, we can also potentially use gdb as a debugger.
Lastly, translating to C is very common in the programming community and can be a fun project!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;method&quot;&gt;Method&lt;&#x2F;h2&gt;
&lt;p&gt;As of now, Bril only has one main function. For the sake of simplicity, I first collect the names and types of all variables used in the Bril program, and declare them at the top of the generated C code with the corresponding types, where I use &lt;code&gt;int64_t&lt;&#x2F;code&gt; for &lt;code&gt;int&lt;&#x2F;code&gt; and &lt;code&gt;int&lt;&#x2F;code&gt; for &lt;code&gt;bool&lt;&#x2F;code&gt;. Then, all the arithmetic, comparison, and logic instructions can be directly translated to the corresponding C statements. For label and &lt;code&gt;jmp&lt;&#x2F;code&gt; instructions, I use &lt;code&gt;label&lt;&#x2F;code&gt; and &lt;code&gt;goto&lt;&#x2F;code&gt; in C. The instruction &lt;code&gt;br&lt;&#x2F;code&gt; can also be easily translated to a statement composed of &lt;code&gt;if&lt;&#x2F;code&gt; and &lt;code&gt;goto&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
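&lt;p&gt;As a rough illustration of this per-instruction mapping, here is a minimal Python sketch. It is not the code of the actual tool: the names &lt;code&gt;BIN_OPS&lt;&#x2F;code&gt; and &lt;code&gt;translate&lt;&#x2F;code&gt; are hypothetical, only a handful of operations and integer constants are covered, variable declarations are omitted, and jump targets are assumed to appear in the &lt;code&gt;args&lt;&#x2F;code&gt; list of the JSON instruction.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;# Hypothetical operator table covering only a handful of Bril operations.
BIN_OPS = {&amp;quot;add&amp;quot;: &amp;quot;+&amp;quot;, &amp;quot;sub&amp;quot;: &amp;quot;-&amp;quot;, &amp;quot;mul&amp;quot;: &amp;quot;*&amp;quot;, &amp;quot;ge&amp;quot;: &amp;quot;&amp;gt;=&amp;quot;}


def translate(instr):
    # Map one Bril JSON instruction to one line of C.
    if &amp;quot;label&amp;quot; in instr:
        # The empty statement keeps the label valid even if nothing follows it.
        return instr[&amp;quot;label&amp;quot;] + &amp;quot;:;&amp;quot;
    op = instr[&amp;quot;op&amp;quot;]
    args = instr.get(&amp;quot;args&amp;quot;, [])
    if op == &amp;quot;const&amp;quot;:
        # Integer constants only in this sketch.
        return &amp;quot;{} = {}LL;&amp;quot;.format(instr[&amp;quot;dest&amp;quot;], instr[&amp;quot;value&amp;quot;])
    if op in BIN_OPS:
        return &amp;quot;{} = {} {} {};&amp;quot;.format(instr[&amp;quot;dest&amp;quot;], args[0], BIN_OPS[op], args[1])
    if op == &amp;quot;jmp&amp;quot;:
        return &amp;quot;goto {};&amp;quot;.format(args[0])
    if op == &amp;quot;br&amp;quot;:
        return &amp;quot;if ({}) goto {}; else goto {};&amp;quot;.format(args[0], args[1], args[2])
    raise ValueError(&amp;quot;unsupported op: &amp;quot; + op)


program = [
    {&amp;quot;op&amp;quot;: &amp;quot;const&amp;quot;, &amp;quot;dest&amp;quot;: &amp;quot;a&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;, &amp;quot;value&amp;quot;: 4},
    {&amp;quot;op&amp;quot;: &amp;quot;jmp&amp;quot;, &amp;quot;args&amp;quot;: [&amp;quot;somewhere&amp;quot;]},
    {&amp;quot;label&amp;quot;: &amp;quot;somewhere&amp;quot;},
    {&amp;quot;op&amp;quot;: &amp;quot;ge&amp;quot;, &amp;quot;dest&amp;quot;: &amp;quot;cmp&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;bool&amp;quot;, &amp;quot;args&amp;quot;: [&amp;quot;a&amp;quot;, &amp;quot;a&amp;quot;]},
]
lines = [translate(instr) for instr in program]
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because every instruction maps to exactly one C statement, a translator in this style can stay a flat chain of cases with no intermediate data structures.&lt;&#x2F;p&gt;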
&lt;p&gt;In order to verify the correctness of this C backend, I created some valid handwritten tests to verify against the existing interpreter &lt;code&gt;brili&lt;&#x2F;code&gt;. The C backend successfully passes all the tests.&lt;&#x2F;p&gt;
&lt;p&gt;Here we show a small example to demonstrate the translation.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  a: int = const 4;
  b: int = const 4;
  cmp: bool = ge a b;
  jmp somewhere;
  a: int = const 2;
  somewhere:
  c: int = add a b;
  print c;
  print cmp;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The above Bril program can be translated to C as follows.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;lt;stdint.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;lt;stdio.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;#include &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;lt;inttypes.h&amp;gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(){
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int64_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int64_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; cmp;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;int64_t&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c;
a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4LL&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4LL&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
cmp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;goto&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; somewhere;
a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2LL&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
somewhere:;
c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;%&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; PRId64 &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, c);
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;printf&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(cmp&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;?&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;false&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;I implemented the translation tool in Python. The implementation is straightforward and consists of 134 lines of code. Compared to other projects, such as &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;seanlatias&#x2F;bril&#x2F;tree&#x2F;master&#x2F;codegen-llvm&quot;&gt;codegen-llvm&lt;&#x2F;a&gt; for Bril, which uses LLVM and consists of 500+ lines of C++ code, and &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Neroysq&#x2F;bril2jb&quot;&gt;bril2jb&lt;&#x2F;a&gt;, which generates Java bytecode and consists of 400+ lines of Java code, I think it is safe to say that the implementation complexity of the C backend is smaller.&lt;&#x2F;p&gt;
&lt;p&gt;The source code and tests can be found at &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;xu3kev&#x2F;bril2c&quot;&gt;Bril2C&lt;&#x2F;a&gt;. The tool to translate Bril JSON format to C is &lt;code&gt;bril2c.py&lt;&#x2F;code&gt;. It takes input on stdin and produces output on stdout.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;benchmark&quot;&gt;Benchmark&lt;&#x2F;h2&gt;
&lt;p&gt;I created some simple benchmarks to measure the timing compared with the interpreter run in Node.js.&lt;&#x2F;p&gt;
&lt;p&gt;Each test was measured while running 1000 times. In order to avoid the startup time of Node.js, I slightly modified the &lt;code&gt;brili&lt;&#x2F;code&gt; interpreter to run the Bril program 1000 times internally so I can avoid running &lt;code&gt;brili&lt;&#x2F;code&gt; 1000 times.&lt;&#x2F;p&gt;
&lt;p&gt;The experiments are all run on Intel Xeon CPU E5-1630 v3 @ 3.70GHz with Ubuntu 16.04. Turbo boost is turned off and the scaling governor is set to performance.&lt;&#x2F;p&gt;
&lt;p&gt;The version of Node.js used to run the Bril interpreter is 12.11.0.
The gcc version is 5.4.0 and the optimization flag is &lt;code&gt;-O3&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The benchmark results are as follows.
&lt;br&gt;
&lt;img src=&quot;c_backend_benchmark.png&quot; width=&quot;500&quot;&gt;
&lt;br&gt;
The four test programs are factorial computation, the Fibonacci sequence, polynomial multiplication, and matrix multiplication.
We can see that we gain a significant speedup across the different tests. However, because Bril currently has only the bool and int types, the things we can compute are still limited. In the future, as Bril is extended with more features such as floating point arithmetic and arrays, we can have more practical benchmarks.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;In this project, I built a C backend for Bril. I verified its correctness and benchmarked its performance on several tests. The results show that, compared with the &lt;code&gt;brili&lt;&#x2F;code&gt; interpreter, it gains a significant speedup.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;&#x2F;h2&gt;
&lt;p&gt;I want to thank Hongbo and Siqiu for the helpful discussion.  I also want to thank Adrian and Matthew for the feedback on this project.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Bril to LLVM using OCaml</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-to-llvm-ocaml/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-to-llvm-ocaml/</guid>
                <description>&lt;h2 id=&quot;the-goal&quot;&gt;The Goal&lt;&#x2F;h2&gt;
&lt;p&gt;For this project, I wanted to create a transformation from Bril to LLVM IR and implement this transformation in OCaml. The motivation for the first goal of the project (LLVM code generation) was to allow Bril to be compiled and run natively, instead of just interpreted. Furthermore, LLVM IR supports many optimizations which allow even a naive transformation of Bril to LLVM to be quite performant. The motivation for doing this transformation in OCaml is that functional languages in general encourage a way of writing code that lends itself well to writing AST transformations. OCaml in particular implements language features such as variants, GADTs, and partial function application. These features might make it nicer to write IR transformations in OCaml than it would be to do the transformations in TypeScript.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-implementation&quot;&gt;The Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;an-unsuccessful-representation&quot;&gt;An unsuccessful representation&lt;&#x2F;h3&gt;
&lt;p&gt;The first part of this project was creating some representation of the Bril IR in OCaml. One of my goals when defining a representation for Bril in OCaml was that I wanted to maintain some of the inheritance structure of the Bril definition in TypeScript. For example, the TypeScript definition of Bril distinguishes between effect operations and value operations with two different interfaces:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;export interface &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;EffectOperation {
  op&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;br&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;jmp&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;print&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;ret&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  args&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Ident[];
}

&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;export interface &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ValueOperation {
  op&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;add&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;mul&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;sub&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;div&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;|
      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;id&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;nop&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;|
      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;eq&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;lt&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;gt&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;ge&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;le&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;not&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;and&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;or&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
  args&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Ident[];
  dest&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Ident;
  type&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Type;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This creates a nice separation between the two kinds of operations and allows a programmer to handle a generic effect operation or value operation without having to worry about which specific operation they are handling. Initially, I created a basic type that was a single variant type with a different constructor for each operation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;type operation &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;=
  | Br &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of br_data
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Jmp &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of jmp_data
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Print &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of print_data
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Ret
  | Id &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of un_op_data
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Const &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;of const_data&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This was a simple representation and generally worked fine for something like basic block generation, when I only really cared about specifically identifying &lt;code&gt;Ret&lt;&#x2F;code&gt;, &lt;code&gt;Br&lt;&#x2F;code&gt;, and &lt;code&gt;Jmp&lt;&#x2F;code&gt; operations. &lt;&#x2F;p&gt;
&lt;p&gt;One of the first things I wanted to do when I was generating LLVM code was to create a stack (since I wasn&#x27;t going to implement an SSA transformation). On this stack I would map variables to stack indices, so I needed a list of all variables that were written to in a function. Using the representation I just described above, in order to extract all of the destinations from instructions I would have to write something like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;match&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; op &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;with 
| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Id {dest;_}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Const {dest; _}
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Add {dest;_}
... -&amp;gt; Some dest
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;| &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;_ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;-&amp;gt; None
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In TypeScript I could do something like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(op.dest) {
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;op.dest;
} &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;else &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;null&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The above code does not need to know which operation it is operating on, only if the operation contains a specific piece of data.&lt;&#x2F;p&gt;
&lt;p&gt;One of the goals that I came up with for the representation was that I should have a representation that allowed me to match on an operation based on the data it contained or match on a specific operation. Furthermore, if I matched on the data in an operation, this should  statically limit the kinds of operations that I could match on. For example, if the case of a match statement I am in tells me that I have a &lt;code&gt;dest&lt;&#x2F;code&gt; field, OCaml should complain if I try to match the opcode of that operation with &lt;code&gt;Br&lt;&#x2F;code&gt;, since &lt;code&gt;Br&lt;&#x2F;code&gt; does not have a &lt;code&gt;dest&lt;&#x2F;code&gt; field.&lt;&#x2F;p&gt;
&lt;p&gt;The way I did this was by combining GADTs with polymorphic variants to create what I called constrained extensible records. The top level record type for an operation was:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;type &amp;#39;a operation &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;= {op: &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;#39;a opcode; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ex: &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;#39;a op_ex&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Every operation has an opcode field. Furthermore, that opcode encodes information about the type of data held in the &lt;code&gt;ex&lt;&#x2F;code&gt; field of the &lt;code&gt;operation&lt;&#x2F;code&gt; record. An opcode is a GADT that can only be constructed using a constructor representing one of the Bril opcodes:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;type _ opcode &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;=
  | Jmp : [`Jmp] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;opcode
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Br : [`Br] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;opcode
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Add : [`Add] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;opcode&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is where the polymorphic variants come in. The &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; in the &lt;code&gt;operation&lt;&#x2F;code&gt; record is a polymorphic variant representing the operation. The polymorphic variant constrains the type we can put into the &lt;code&gt;ex&lt;&#x2F;code&gt; field of the &lt;code&gt;operation&lt;&#x2F;code&gt; record. The &lt;code&gt;op_ex&lt;&#x2F;code&gt; type looks like:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;type _ op_ex &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;=
  | Effect_op : &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;#39;a effect_op &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;#39;a op_ex
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Mutation_op : &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;#39;a mutation_op &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;-&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;#39;a op_ex
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;| Nop_op : [`Nop] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;op_ex&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The basic idea was that the GADT, parameterized over polymorphic variants representing the opcodes, would constrain the type of data that could be in the &lt;code&gt;ex&lt;&#x2F;code&gt; field of an operation. This pattern continues down the type hierarchy and took a while to get correct. Even though this is the representation I ended up using for this project, I call it unsuccessful because, in hindsight, a similar effect could have been achieved by simply defining a hierarchy of plain old variants with different data structures. I thought I would be saving myself unnecessary match statements by incorporating GADTs to constrain the data, but ultimately I don&#x27;t think I saved myself any match cases, and I had to spend a lot of time wrestling with the OCaml type system.&lt;&#x2F;p&gt;
&lt;p&gt;I think there is a cautionary tale here about trying to overengineer an AST representation based on how you think you are going to use it. If anyone is interested in looking at the full type representation that I used, it can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Dan12&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-ocaml&#x2F;bril&#x2F;bril_v2.ml&quot;&gt;here&lt;&#x2F;a&gt;. I will say, though, that it was a good exercise in learning about the more peculiar parts of OCaml&#x27;s type system, and I think it made me more comfortable with GADTs and more cognizant of their limitations.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;llvm-code-generation&quot;&gt;LLVM Code Generation&lt;&#x2F;h3&gt;
&lt;p&gt;LLVM code generation consisted of two main parts. First, since LLVM IR requires each basic block to have a terminator, I decided to first generate all of the basic blocks of a Bril program. I could probably have simply looked for all labels with no preceding terminator in a Bril program and added a jump, but since I had already written the code for creating and processing basic blocks, I decided to generate LLVM code at the basic block level.&lt;&#x2F;p&gt;
&lt;p&gt;Next, because I didn&#x27;t implement an SSA transformation for Bril and because Bril variables can be overwritten, I needed to create some &amp;quot;stack&amp;quot; space for all of the variables. The way I did this was by using the &lt;code&gt;alloca&lt;&#x2F;code&gt; LLVM instruction at the top of the function. I first collected all of the variables that were written to in a function and mapped each of them to an index in the stack. I also decided to make all variables the LLVM &lt;code&gt;i64&lt;&#x2F;code&gt; type: OCaml ints are 63 bits and I wanted to support the largest range of numbers possible. So at the beginning of a function, I inserted an &lt;code&gt;alloca i64, i64 n&lt;&#x2F;code&gt; instruction, where &lt;code&gt;n&lt;&#x2F;code&gt; was the number of unique variables written to. Note that if a variable is used without being defined anywhere in the function, this will generate a blank instruction and likely cause LLVM to fail when it typechecks the generated code.&lt;&#x2F;p&gt;
&lt;p&gt;Whenever a variable is used as part of an operation it is loaded from the stack. Whenever a variable is modified by an operation, the result of the operation is written back to the variable&#x27;s location on the stack. So, for example, an add of two variables first performs a load of both variables from the stack into fresh variable names, adds the two variables together with an LLVM &lt;code&gt;add&lt;&#x2F;code&gt; instruction, and stores the result back to the &lt;code&gt;dest&lt;&#x2F;code&gt; variable of the Bril add instruction.&lt;&#x2F;p&gt;
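&lt;p&gt;The load, operate, and store pattern described above can be sketched roughly as follows. This is an illustration in Python rather than the OCaml used in the project. The names &lt;code&gt;emit_add&lt;&#x2F;code&gt;, &lt;code&gt;make_fresh&lt;&#x2F;code&gt;, and &lt;code&gt;%stack&lt;&#x2F;code&gt; are hypothetical, addressing a slot with &lt;code&gt;getelementptr&lt;&#x2F;code&gt; is only one plausible choice, and the typed-pointer syntax (&lt;code&gt;i64*&lt;&#x2F;code&gt;) assumes an LLVM version from around the time of this project.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;import itertools

def make_fresh():
    &quot;&quot;&quot;Return a function that yields fresh temporary names: %t0, %t1, ...&quot;&quot;&quot;
    counter = itertools.count()
    return lambda: f&quot;%t{next(counter)}&quot;

def emit_add(dest, lhs, rhs, slots, fresh):
    &quot;&quot;&quot;Return LLVM IR lines for the Bril instruction: dest = add lhs rhs.&quot;&quot;&quot;
    lines = []

    def slot_ptr(var):
        # Address of var&#x27;s slot inside the region reserved by the alloca
        # (assumed here to be named %stack).
        ptr = fresh()
        lines.append(f&quot;{ptr} = getelementptr i64, i64* %stack, i64 {slots[var]}&quot;)
        return ptr

    def load(var):
        # Read the current value of var from its stack slot.
        ptr = slot_ptr(var)
        val = fresh()
        lines.append(f&quot;{val} = load i64, i64* {ptr}&quot;)
        return val

    a = load(lhs)
    b = load(rhs)
    total = fresh()
    lines.append(f&quot;{total} = add i64 {a}, {b}&quot;)
    dest_ptr = slot_ptr(dest)
    lines.append(f&quot;store i64 {total}, i64* {dest_ptr}&quot;)
    return lines

# Hypothetical slot assignment for the Bril instruction: v: int = add v w;
for line in emit_add(&quot;v&quot;, &quot;v&quot;, &quot;w&quot;, {&quot;v&quot;: 0, &quot;w&quot;: 1}, make_fresh()):
    print(line)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because every use goes through a load and every definition through a store, the generated code never needs phi nodes, which is what makes skipping the SSA transformation workable.&lt;&#x2F;p&gt;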
&lt;p&gt;One other interesting aspect of this project was implementing the print function. I wrote some C code in a &lt;code&gt;helpers.c&lt;&#x2F;code&gt; file that just defined a &lt;code&gt;printi&lt;&#x2F;code&gt; and a &lt;code&gt;printb&lt;&#x2F;code&gt; function that printed out a 64-bit integer and a boolean &lt;code&gt;true&lt;&#x2F;code&gt; or &lt;code&gt;false&lt;&#x2F;code&gt; respectively. This C code was then compiled and linked in with the generated &lt;code&gt;.ll&lt;&#x2F;code&gt; file to create the final binary. In order to figure out which function to generate for each print operation, I added some code to get the type of a variable when it was defined. For example, if I saw:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;v: int = add v w;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;in a Bril program, I would say that variable &lt;code&gt;v&lt;&#x2F;code&gt; has type &lt;code&gt;int&lt;&#x2F;code&gt;. This may not work in general because the type of a Bril variable can technically be dynamic. For example, this is legal Bril:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;v: int = const 1;
v: bool = const true;
print v;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There is some discussion of modifying the Bril spec to make the above code snippet illegal. Therefore, in this project I was comfortable assuming a static type for each variable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;code-correctness&quot;&gt;Code correctness&lt;&#x2F;h3&gt;
&lt;p&gt;In order to evaluate whether I had succeeded at correctly creating code that transforms Bril into LLVM IR, I wrote a battery of tests to test every possible operation and every possible edge case of every operation. The way I verified correctness was by generating the LLVM code for a Bril file, compiling and linking the LLVM code with the &lt;code&gt;helpers.c&lt;&#x2F;code&gt; file into a binary, running the binary, and capturing the standard output of the execution. Then, I ran the same Bril code through &lt;code&gt;brili&lt;&#x2F;code&gt; and captured the standard output of that execution. I then compared the two outputs to see if they agreed.&lt;&#x2F;p&gt;
&lt;p&gt;One thing that I worried about was that I never defined the expected output of a program. So in the unlikely event that the Bril interpreter and my code have a similar bug (like a copy paste error when handling &lt;code&gt;add&lt;&#x2F;code&gt; and &lt;code&gt;sub&lt;&#x2F;code&gt;) I would not notice this issue.&lt;&#x2F;p&gt;
&lt;p&gt;I had some test programs that were written in Bril and some that were written in TypeScript and compiled to Bril. The former allowed me to test some weird kinds of combinations of instructions not possible to generate by compiling TypeScript. The latter allowed me to more easily write large programs that did more complicated things to stress test the code generator.&lt;&#x2F;p&gt;
&lt;p&gt;Some other evaluation metrics were considered, such as comparing the speed of a Bril program when run through the interpreter to the speed of that Bril program when compiled and run natively. However, it was decided that these results would not be very meaningful other than to confirm that running code natively is &lt;em&gt;much&lt;&#x2F;em&gt; faster than running code in an interpreter.&lt;&#x2F;p&gt;
&lt;p&gt;Additionally, measuring the speed at which Bril programs can be transformed to LLVM as a function of their code size would be an interesting extension to this project. We would likely prefer that this transformation take a linear amount of time in relation to the size of the input program.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;code-design&quot;&gt;Code design&lt;&#x2F;h3&gt;
&lt;p&gt;As mentioned above, one aspect where I felt like this project didn&#x27;t succeed was in creating a good, strongly typed OCaml representation of the Bril AST. Additionally, the LLVM code was generated by simply emitting strings of LLVM instructions for each Bril opcode. One potential way to improve upon this is to make an OCaml module that describes the LLVM AST and then write code to transform the Bril AST into the LLVM AST. Then we could just convert the LLVM AST to a string and output it to a file. There are also &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm-mirror&#x2F;llvm&#x2F;tree&#x2F;master&#x2F;bindings&#x2F;ocaml&quot;&gt;LLVM bindings for OCaml&lt;&#x2F;a&gt;, which I could have used to avoid creating an LLVM representation of my own.&lt;&#x2F;p&gt;
&lt;p&gt;Another potential way, and one I started to explore, is similar to how you can generate LLVM code in C++. You create function objects and basic block objects within those functions and then append instructions to those basic block objects. I tried to do something similar and add some static type checking when composing certain types of operations. For example, a &lt;code&gt;br&lt;&#x2F;code&gt; instruction takes in an &lt;code&gt;i1&lt;&#x2F;code&gt; argument (a boolean). So I tried to write some operation builders with some input and output type constraints. This also led to a lot of struggling with the OCaml type system, but I was able to get something reasonable working for a subset of Bril operations. You can check it out &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Dan12&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-ocaml&#x2F;llvm_gen&#x2F;llvm.ml&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The hardest part of this project was definitely trying to get a good representation of the Bril AST in OCaml. I think I might want to revisit this representation in future projects and try and get something that I am happy with. One interesting comparison would have been to also try to do this project in TypeScript, since it has some cool type constructs.&lt;&#x2F;p&gt;
&lt;p&gt;All of the code can be found &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Dan12&#x2F;bril&#x2F;tree&#x2F;master&#x2F;bril-ocaml&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Type Inference for Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-type-inference/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril-type-inference/</guid>
                <description>&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of this project is to build a type inferrer for Bril. Ultimately, we want to
take a Bril program that has some (or none) of the types specified on variables
and produce a new Bril program that has all the correct type annotations.
For example, this program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  v0 = const 1;
  v1 = const 2;
  v2 = add v0 v1;
  print v2;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;should be transformed into the program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  v0: int = const 1;
  v1: int = const 2;
  v2: int = add v0 v1;
  print v2;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This makes it quicker to write Bril programs because we don&#x27;t have
to worry about correctly typing the variables.&lt;&#x2F;p&gt;
&lt;p&gt;We also will implement type checking through this type inference.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;For our workflow, we&#x27;d like to take a semi-typed Bril program and convert it
between JSON and text representations. This requires changing the grammar so
that type annotations are optional.&lt;&#x2F;p&gt;
&lt;p&gt;Because Bril only supports two types, &lt;code&gt;int&lt;&#x2F;code&gt; and &lt;code&gt;bool&lt;&#x2F;code&gt;, the type inference is
very straightforward. For example, consider arithmetic operations, i.e.,
&lt;code&gt;add&lt;&#x2F;code&gt;, &lt;code&gt;mul&lt;&#x2F;code&gt;, &lt;code&gt;sub&lt;&#x2F;code&gt;, and &lt;code&gt;div&lt;&#x2F;code&gt;. The arguments to these operations &lt;em&gt;must&lt;&#x2F;em&gt; be ints,
and the result is an int. Therefore, if we see a statement like &lt;code&gt;x = add a b&lt;&#x2F;code&gt;,
we know &lt;code&gt;x&lt;&#x2F;code&gt;, &lt;code&gt;a&lt;&#x2F;code&gt;, and &lt;code&gt;b&lt;&#x2F;code&gt; are all ints. If at any point we infer that one
of these variables is not an int, then we have a type unification error, and the
program is not well-typed.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;modifying-the-grammar&quot;&gt;Modifying the Grammar&lt;&#x2F;h3&gt;
&lt;p&gt;We want to make type annotations optional. So first I modified the grammar to
have a rule for type annotations:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type_decl.5: &amp;quot;:&amp;quot; type
const.4: IDENT [type_decl] &amp;quot;=&amp;quot; &amp;quot;const&amp;quot; lit &amp;quot;;&amp;quot;
...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is an issue, however, because there is an ambiguity between labels and
assignments. Consider these two Bril programs:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
l:
  x = const 5;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  l: x = const 5;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These programs consist of the same tokens. However, the first one intuitively
means that we have a label &lt;code&gt;l&lt;&#x2F;code&gt; and some int &lt;code&gt;x&lt;&#x2F;code&gt; with a value of 5. The second
one means that we have a variable called &lt;code&gt;l&lt;&#x2F;code&gt; of type &lt;code&gt;x&lt;&#x2F;code&gt; with a value of 5.
It is incorrect to simply have one rule as a higher priority, because
semantically, both of these should be allowed as separate programs. So we have
two options:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Force each label to be on its own line&lt;&#x2F;li&gt;
&lt;li&gt;Only allow fixed type names, like &amp;quot;int&amp;quot; and &amp;quot;bool&amp;quot;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;I decided to go with (2) because I didn&#x27;t know how to easily do (1), but in
retrospect (1) may have been better, because some other students&#x27; projects
allow for user-defined types.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;type-inference&quot;&gt;Type Inference&lt;&#x2F;h3&gt;
&lt;p&gt;As noted in the design, inferring the types of variables for a single statement
is straightforward. Here is an example snippet showing how types for comparison
ops are inferred:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;...
elif instr[&amp;quot;op&amp;quot;] in COMPARISON_OPS:
    for arg in instr[&amp;quot;args&amp;quot;]:
        type_var(gamma, arg, &amp;quot;int&amp;quot;, i)
    type_var(gamma, instr[&amp;quot;dest&amp;quot;], &amp;quot;bool&amp;quot;, i)
...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, we keep track of a typing context &lt;code&gt;gamma&lt;&#x2F;code&gt; which maps variables to their
type. Then, we check that each argument is either untyped or already has the
type &lt;code&gt;int&lt;&#x2F;code&gt;; otherwise we&#x27;ll throw a type unification error. Finally, we do
the same for the destination, making sure that it is typed as &lt;code&gt;bool&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
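&lt;p&gt;The body of &lt;code&gt;type_var&lt;&#x2F;code&gt; is not shown in the snippet. A minimal sketch of what such a helper might look like is below, assuming &lt;code&gt;gamma&lt;&#x2F;code&gt; is a plain dictionary and the last argument is the statement number used in error messages. The message format is modeled on the exception shown in the Evaluation section; everything else is an assumption rather than the actual code in &lt;code&gt;infer.py&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def type_var(gamma, var, expected, stmt_index):
    &quot;&quot;&quot;Record that var has type expected, or fail if gamma already disagrees.&quot;&quot;&quot;
    found = gamma.get(var)
    if found is None:
        # First time we learn anything about var: remember its type.
        gamma[var] = expected
    elif found != expected:
        # var was already typed differently: a type unification error.
        raise Exception(
            f&#x27;(stmt {stmt_index}) Expected &quot;{var}&quot; to have type &#x27;
            f&#x27;&quot;{expected}&quot; but found &quot;{found}&quot;&#x27;
        )

gamma = {}
type_var(gamma, &quot;a&quot;, &quot;int&quot;, 1)  # a was untyped, so it becomes an int
type_var(gamma, &quot;a&quot;, &quot;int&quot;, 2)  # consistent with what we know: no error
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;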
&lt;p&gt;For our implementation, we simply iterate through each instruction and determine
the types of our variables. The one case where this presents issues is with
&lt;code&gt;id&lt;&#x2F;code&gt;. In general, &lt;code&gt;id&lt;&#x2F;code&gt; sets the type of the destination to the type of the
variable on the right hand side. However, what happens if we don&#x27;t know the type
of the variable being copied? For example, consider this Bril program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  jmp later;
earlier:
  x = id y;
  ret;
later:
  y = const 5;
  jmp earlier;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is a valid program that should typecheck. Specifically, &lt;code&gt;x&lt;&#x2F;code&gt; and &lt;code&gt;y&lt;&#x2F;code&gt; are
both ints. However, if we naively go sequentially through each instruction, we
don&#x27;t know the type of &lt;code&gt;y&lt;&#x2F;code&gt; until we have &lt;code&gt;y = const 5&lt;&#x2F;code&gt;, at which point it is too
late to type &lt;code&gt;x&lt;&#x2F;code&gt;. To resolve this, we have two options:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Rerun the type inference algorithm, stopping when no more variables have been
inferred&lt;&#x2F;li&gt;
&lt;li&gt;Keep track of which variables must have the same type, and after one pass,
set their types to be the same.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
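&lt;p&gt;For simplicity&#x27;s sake, we choose (1). This means that in the worst case, type inference
will take &lt;code&gt;O(n^2)&lt;&#x2F;code&gt; time, where &lt;code&gt;n&lt;&#x2F;code&gt; is the number of instructions: each pass takes &lt;code&gt;O(n)&lt;&#x2F;code&gt; time, and up to &lt;code&gt;n&lt;&#x2F;code&gt; passes may be needed. This happens
when there is a chain of such &lt;code&gt;id&lt;&#x2F;code&gt; assignments, each resolved by a different pass, as in the example above.&lt;&#x2F;p&gt;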
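&lt;p&gt;Option (1) amounts to a fixpoint loop. The sketch below is a simplified illustration, not the code in &lt;code&gt;infer.py&lt;&#x2F;code&gt;: it handles only &lt;code&gt;const&lt;&#x2F;code&gt; and &lt;code&gt;id&lt;&#x2F;code&gt;, omits unification errors, and assumes a JSON-like instruction shape with &lt;code&gt;op&lt;&#x2F;code&gt;, &lt;code&gt;dest&lt;&#x2F;code&gt;, &lt;code&gt;args&lt;&#x2F;code&gt;, and &lt;code&gt;value&lt;&#x2F;code&gt; keys. The function name &lt;code&gt;infer&lt;&#x2F;code&gt; is hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def infer(instrs):
    &quot;&quot;&quot;Repeat full passes over instrs until a pass learns no new types.&quot;&quot;&quot;
    gamma = {}
    changed = True
    while changed:
        changed = False
        for instr in instrs:
            op = instr.get(&quot;op&quot;)  # labels have no &quot;op&quot; key
            if op == &quot;const&quot;:
                known = &quot;bool&quot; if isinstance(instr[&quot;value&quot;], bool) else &quot;int&quot;
            elif op == &quot;id&quot;:
                known = gamma.get(instr[&quot;args&quot;][0])  # None if not yet known
            else:
                continue  # other ops are omitted from this sketch
            if known is not None and instr[&quot;dest&quot;] not in gamma:
                gamma[instr[&quot;dest&quot;]] = known
                changed = True
    return gamma

# The example program above, in textual order.
program = [
    {&quot;op&quot;: &quot;jmp&quot;, &quot;args&quot;: [&quot;later&quot;]},
    {&quot;label&quot;: &quot;earlier&quot;},
    {&quot;op&quot;: &quot;id&quot;, &quot;dest&quot;: &quot;x&quot;, &quot;args&quot;: [&quot;y&quot;]},
    {&quot;op&quot;: &quot;ret&quot;},
    {&quot;label&quot;: &quot;later&quot;},
    {&quot;op&quot;: &quot;const&quot;, &quot;dest&quot;: &quot;y&quot;, &quot;value&quot;: 5},
    {&quot;op&quot;: &quot;jmp&quot;, &quot;args&quot;: [&quot;earlier&quot;]},
]
print(infer(program))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On this program the first pass types &lt;code&gt;y&lt;&#x2F;code&gt;, the second pass types &lt;code&gt;x&lt;&#x2F;code&gt;, and the third pass changes nothing, which ends the loop.&lt;&#x2F;p&gt;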
&lt;h3 id=&quot;type-checking&quot;&gt;Type Checking&lt;&#x2F;h3&gt;
&lt;p&gt;With type inference, type checking is relatively simple. After type inference,
we have the original Bril program and the fully typed Bril program. We then go
through the original Bril program and make sure that for any variable that has a
type annotation, the type matches the inferred type. For completeness, I also
check to make sure that variables aren&#x27;t being used as labels, and vice versa.&lt;&#x2F;p&gt;
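&lt;p&gt;The annotation check can be sketched as a single pass over the original program. This is a hedged illustration: the function name &lt;code&gt;check_annotations&lt;&#x2F;code&gt;, the &lt;code&gt;type&lt;&#x2F;code&gt; key, and the error message are assumptions, and the label check mentioned above is left out.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def check_annotations(original, gamma):
    &quot;&quot;&quot;Compare each annotation in the original program with the inferred type.&quot;&quot;&quot;
    for i, instr in enumerate(original, start=1):
        declared = instr.get(&quot;type&quot;)
        if declared is None or &quot;dest&quot; not in instr:
            continue  # nothing was annotated on this instruction
        dest = instr[&quot;dest&quot;]
        inferred = gamma[dest]
        if declared != inferred:
            raise Exception(
                f&#x27;(stmt {i}) &quot;{dest}&quot; is annotated as &quot;{declared}&quot; &#x27;
                f&#x27;but was inferred to be &quot;{inferred}&quot;&#x27;
            )

original = [
    {&quot;op&quot;: &quot;const&quot;, &quot;dest&quot;: &quot;x&quot;, &quot;type&quot;: &quot;int&quot;, &quot;value&quot;: 1},
    {&quot;op&quot;: &quot;const&quot;, &quot;dest&quot;: &quot;y&quot;, &quot;value&quot;: True},  # unannotated
]
check_annotations(original, {&quot;x&quot;: &quot;int&quot;, &quot;y&quot;: &quot;bool&quot;})  # passes silently
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;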
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;To properly evaluate that everything presented here is correct, we have to test
the parser and the type inferrer. To do these simultaneously, we can run tests
as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;command = &amp;quot;cat {filename} | bril2json | python ..&#x2F;..&#x2F;infer.py | bril2txt&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We are taking Bril text programs, turning them into JSON, generating a new
equivalent typed program, and turning it back to text. First I started with
simple Bril programs that already existed to build some confidence. Then to gain
full confidence, I wrote tests that use every kind of operation, e.g.,
arithmetic ops, comparison ops, logical ops, effect ops, and misc. ops.&lt;&#x2F;p&gt;
&lt;p&gt;I tested for both positive and negative results. In other words, I ensured
Bril programs that &lt;em&gt;should&lt;&#x2F;em&gt; typecheck were correctly type-inferred, and programs
that &lt;em&gt;shouldn&#x27;t&lt;&#x2F;em&gt; typecheck in fact could not be type-inferred. For example:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  F1 = const false;
  T1 = const 1;
  b1 = and F1 T1;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Unfortunately I couldn&#x27;t get Turnt to validate the error message that was
output, so instead I manually checked to see that the error was what I expected.
The exception raised for the above program is:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;Exception: (stmt 3) Expected &amp;quot;T1&amp;quot; to have type &amp;quot;bool&amp;quot; but found &amp;quot;int&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;I also made sure to test Bril programs that contained some type annotations and
made sure that didn&#x27;t interfere with the inference. This same process was
repeated for testing the typechecker, which was run by passing the &lt;code&gt;-t&lt;&#x2F;code&gt; flag to
&lt;code&gt;infer.py&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;command = &amp;quot;cat {filename} | bril2json | python ..&#x2F;..&#x2F;infer.py -t | bril2txt&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Through this process, I uncovered some bugs when implementing the type inferrer:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;grammar ambiguity, detailed in the &amp;quot;Modifying the Grammar&amp;quot; section&lt;&#x2F;li&gt;
&lt;li&gt;wrongly tested for literals True and False using &lt;code&gt;==&lt;&#x2F;code&gt; instead of &lt;code&gt;is&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;accidentally had comparison ops&#x27; args be bools&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;By rigorously testing all possible language features, we can be confident that
the type inference is correct.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;hardest-parts-to-get-right&quot;&gt;Hardest Parts to Get Right&lt;&#x2F;h2&gt;
&lt;p&gt;The hardest parts to get right were the parser and full correctness of the type
inferrer. I didn&#x27;t realize the issue with the parser towards the beginning
because I arbitrarily set a priority for the type annotation rule; debugging
this took a while. Additionally, it was really easy to get a type inferrer
that seemed to work for the majority of cases. However, small bugs like the ones
mentioned previously were only fixed through intense testing. Comparatively, the
actual type inference and typechecking were relatively straightforward, mostly
because there are only two types in Bril and no function calls.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;possible-extensions&quot;&gt;Possible Extensions&lt;&#x2F;h2&gt;
&lt;p&gt;Other groups are working on adding function calls to the language. By doing
type inference on the function arguments and checking the return type, we can
determine the type of a particular function. From this, we can relatively easily
infer types for assignments to function calls.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>bril2jb: A Tool That Translates Bril to Java Bytecode</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril2jb/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/bril2jb/</guid>
                <description>&lt;h3 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h3&gt;
&lt;p&gt;The goal of this project is to provide a tool to translate Bril code to Java bytecode.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;design&quot;&gt;Design&lt;&#x2F;h3&gt;
&lt;p&gt;At first glance, translating Bril to &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Java_bytecode&quot;&gt;Java bytecode&lt;&#x2F;a&gt; seems like a
pretty straightforward job, since Bril (currently) only contains simple instructions and supports no function invocation.
But it turns out that making something directly runnable by the &lt;em&gt;Java Virtual Machine&lt;&#x2F;em&gt; (JVM, the runtime environment of Java bytecode)
is not trivial, because we need to construct our program as a Java &lt;em&gt;class&lt;&#x2F;em&gt;,
which is the only code format the JVM accepts.
Also, every class that runs as a standalone application needs a &lt;code&gt;main&lt;&#x2F;code&gt; method as an entry point.
(Since Bril code currently also contains only one &lt;code&gt;main&lt;&#x2F;code&gt; function, it&#x27;s a perfect match!)&lt;&#x2F;p&gt;
&lt;p&gt;To demonstrate this, suppose we want to translate the following Bril program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
    a:int = const 1;
    b:int = const 2;
    c:int = add a b;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The equivalent Java program would be:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Wrapper &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public static void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;main&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;java&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;lang&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;String&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;[] &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;args) {
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;long&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;long&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
      &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;long&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; c &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; a &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; b;
  }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Please notice that: (1) the class name can be any valid class name;
(2) bril2jb does not generate Java source code: the Bril code is translated directly to Java bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;Before talking about how to translate,
I want to introduce how the JVM runs code and what a compiled Java class looks like.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;structure-of-jvm&quot;&gt;Structure of JVM&lt;&#x2F;h4&gt;
&lt;img alt=&quot;JVM Architecture by Michelle Ridomi&quot; src=&quot;https:&#x2F;&#x2F;upload.wikimedia.org&#x2F;wikipedia&#x2F;commons&#x2F;d&#x2F;dd&#x2F;JvmSpec7.png&quot; style=&quot;width: 100%&quot;&gt;
&lt;p&gt;The picture above shows an overview of the JVM architecture.&lt;&#x2F;p&gt;
&lt;p&gt;To run a Java class, the JVM first loads the compiled class file into memory
and then starts to execute the code (by default, the &lt;code&gt;main&lt;&#x2F;code&gt; method).
There is one significant difference between the JVM&#x27;s execution model and that of languages like C.
The stack in the JVM is a stack of &lt;em&gt;stack frames&lt;&#x2F;em&gt;, and each stack frame contains a &lt;em&gt;local variable array&lt;&#x2F;em&gt;, &lt;em&gt;frame data&lt;&#x2F;em&gt;, and an &lt;em&gt;operand stack&lt;&#x2F;em&gt;.
In compiled C code, operands for arithmetic or logical operations are most often placed into registers and operated on there;
in the JVM, these operations happen on the operand stack.
Thus, in terms of computation, the JVM is more like a stack-based machine.&lt;&#x2F;p&gt;
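&lt;p&gt;To make the stack-based model concrete, here is a toy interpreter with one operand stack and one set of local variables. It is only a sketch: the opcodes &lt;code&gt;push&lt;&#x2F;code&gt;, &lt;code&gt;store&lt;&#x2F;code&gt;, &lt;code&gt;load&lt;&#x2F;code&gt;, and &lt;code&gt;add&lt;&#x2F;code&gt; are simplified stand-ins for real JVM instructions such as &lt;code&gt;lconst_1&lt;&#x2F;code&gt;, &lt;code&gt;lstore&lt;&#x2F;code&gt;, &lt;code&gt;lload&lt;&#x2F;code&gt;, and &lt;code&gt;ladd&lt;&#x2F;code&gt;, and locals are named here instead of being numbered slots.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;def run(code):
    &quot;&quot;&quot;Execute (opcode, operand) pairs using an operand stack and local variables.&quot;&quot;&quot;
    stack, local_vars = [], {}
    for opcode, operand in code:
        if opcode == &quot;push&quot;:     # push a constant onto the operand stack
            stack.append(operand)
        elif opcode == &quot;store&quot;:  # pop the top of the stack into a local
            local_vars[operand] = stack.pop()
        elif opcode == &quot;load&quot;:   # push the value of a local onto the stack
            stack.append(local_vars[operand])
        elif opcode == &quot;add&quot;:    # pop two operands and push their sum
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)
        else:
            raise ValueError(f&quot;unknown opcode: {opcode}&quot;)
    return local_vars

# The Wrapper example: a = 1; b = 2; c = a + b
program = [
    (&quot;push&quot;, 1), (&quot;store&quot;, &quot;a&quot;),
    (&quot;push&quot;, 2), (&quot;store&quot;, &quot;b&quot;),
    (&quot;load&quot;, &quot;a&quot;), (&quot;load&quot;, &quot;b&quot;), (&quot;add&quot;, None), (&quot;store&quot;, &quot;c&quot;),
]
print(run(program))
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that &lt;code&gt;add&lt;&#x2F;code&gt; names no operands at all: it finds both of them on the operand stack, which is the key difference from a register machine.&lt;&#x2F;p&gt;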
&lt;h4 id=&quot;structure-of-a-compiled-java-class&quot;&gt;Structure of a compiled Java class&lt;&#x2F;h4&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Item Name&lt;&#x2F;th&gt;&lt;th&gt;A Brief Introduction&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;magic&lt;&#x2F;td&gt;&lt;td&gt;The magic number identifying the class file format; it has the value 0xCAFEBABE.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;minor_version, major_version&lt;&#x2F;td&gt;&lt;td&gt;The minor and major version numbers of this class file.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;constant_pool_count, constant_pool[]&lt;&#x2F;td&gt;&lt;td&gt;The constant_pool is a table of structures representing various string constants, class and interface names, field names, and other constants that are referred to within the class and its substructures. constant_pool_count represents the size of this table.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;access_flags&lt;&#x2F;td&gt;&lt;td&gt;A mask of flags used to denote access permissions to and properties of this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;this_class&lt;&#x2F;td&gt;&lt;td&gt;A valid index into the constant_pool table referring to a CONSTANT_Class_info structure representing this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;super_class&lt;&#x2F;td&gt;&lt;td&gt;Zero or a valid index into the constant_pool table referring to a CONSTANT_Class_info representing the superclass of this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;interfaces_count, interfaces[]&lt;&#x2F;td&gt;&lt;td&gt;Each value in the interfaces array is a valid index into the constant_pool table referring to a CONSTANT_Class_info structure representing a superinterface of this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;fields_count, fields[]&lt;&#x2F;td&gt;&lt;td&gt;Each value in the fields table is a field_info structure describing a field in this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;methods_count, methods[]&lt;&#x2F;td&gt;&lt;td&gt;Each value in the methods table is a method_info structure describing a method in this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;attributes_count, attributes[]&lt;&#x2F;td&gt;&lt;td&gt;Each value in the attributes table is an attribute_info structure describing an attribute in this class.&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The table above summarizes the overall structure of a compiled class.
More details can be found in Chapter 4 of the &lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;specs&#x2F;jvms&#x2F;se13&#x2F;jvms13.pdf&quot;&gt;JVM specification&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
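&lt;p&gt;As a small illustration of the first rows of the table, the sketch below decodes the &lt;code&gt;magic&lt;&#x2F;code&gt;, &lt;code&gt;minor_version&lt;&#x2F;code&gt;, and &lt;code&gt;major_version&lt;&#x2F;code&gt; fields, which occupy the first eight bytes of a class file in big-endian order. The function name &lt;code&gt;read_header&lt;&#x2F;code&gt; and the sample bytes are made up for the example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;import struct

def read_header(data):
    &quot;&quot;&quot;Decode magic (u4), minor_version (u2), and major_version (u2).&quot;&quot;&quot;
    magic, minor, major = struct.unpack(&quot;&gt;IHH&quot;, data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError(&quot;not a class file&quot;)
    return minor, major

# Hand-written header bytes: the magic number, minor version 0, major version 55.
header = bytes.fromhex(&quot;cafebabe00000037&quot;)
print(read_header(header))  # prints (0, 55)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To try this on a real class, pass it the bytes read from a compiled file opened in binary mode.&lt;&#x2F;p&gt;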
&lt;p&gt;You can inspect the structure of a compiled Java class using the &lt;code&gt;javap&lt;&#x2F;code&gt; command provided by the JDK.
For example, we can inspect the Wrapper class introduced earlier:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;~&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;G&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bril2jb ❯❯❯ javap &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;v &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Wrapper
Classfile &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;home&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;animula&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;GitRep&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;bril2jb&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Wrapper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.class
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Last&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; modified &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Oct 18&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2019&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; size &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;291&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; bytes
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;MD5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; checksum e4dc4c91a1fcd5333b7617f95cd75232
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Compiled&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; from &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;Wrapper.java&amp;quot;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public class &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Wrapper
  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;minor version&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  major version&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;54&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  flags&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0x0021&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ACC_PUBLIC&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ACC_SUPER&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  this_class&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4                          &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Wrapper
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;  super_class&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5                         &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; java&#x2F;lang&#x2F;Object
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;  interfaces&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, fields&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, methods&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, attributes&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1
Constant&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; pool&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Methodref&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;          #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.#&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;14         &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; java&#x2F;lang&#x2F;Object.&amp;quot;&amp;lt;init&amp;gt;&amp;quot;:()V
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Long               2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;l&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Class&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;              #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;15            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Wrapper
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Class&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;              #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;16            &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; java&#x2F;lang&#x2F;Object
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8               &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;lt;init&amp;gt;
   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;7 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Utf8               ()&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;V&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8               Code&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
   #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;9 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8               LineNumberTable&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;               main
  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;11 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Utf8               ([&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Ljava&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;lang&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;String&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;V&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;12 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8               SourceFile&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;13 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8               Wrapper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.java
  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;14 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;NameAndType&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;        #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;#&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;7          &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; &amp;quot;&amp;lt;init&amp;gt;&amp;quot;:()V
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;15 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8               Wrapper&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;16 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Utf8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;               java&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;lang&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Object
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;public &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Wrapper();
    descriptor&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;()&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;V&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    flags&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0x0001&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ACC_PUBLIC
    Code&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
      stack&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, locals&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, args_size&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1
         0&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; aload_0
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; invokespecial #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1                  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; Method java&#x2F;lang&#x2F;Object.&amp;quot;&amp;lt;init&amp;gt;&amp;quot;:()V
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: return
      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;LineNumberTable&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        line &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;

  public static void main(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;java&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;lang&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;.&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;String&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;[]&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
    descriptor&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;([&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;Ljava&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;lang&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&#x2F;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;String&lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;;)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;V&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    flags&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0x0009&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ACC_PUBLIC&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;ACC_STATIC
    Code&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
      stack&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, locals&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, args_size&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1
         0&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lconst_1
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lstore_1
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ldc2_w        #&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2                  &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; long 2l
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lstore_3
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lload_1
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;7&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lload_3
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ladd
         &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;9&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lstore        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5
        11&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: return
      &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;LineNumberTable&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        line &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        line &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        line &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;5&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
        line &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;6&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;11
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;SourceFile&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;Wrapper.java&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The Wrapper class contains no fields and two methods. One is the default constructor (&lt;code&gt;&amp;lt;init&amp;gt;&lt;&#x2F;code&gt;);
the other is &lt;code&gt;main&lt;&#x2F;code&gt;, where we put the translated code.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;the-translation-process&quot;&gt;The translation process&lt;&#x2F;h4&gt;
&lt;p&gt;I chose &lt;a href=&quot;https:&#x2F;&#x2F;asm.ow2.io&quot;&gt;ASM&lt;&#x2F;a&gt;, a Java bytecode manipulation and analysis framework, to help with the translation.
ASM provides APIs for generating compiled Java classes (and for transforming them, which this project does not use).
It offers two APIs: the core API provides an event-based representation of a class, while the tree API provides an object-based representation.
This project uses only the core API.&lt;&#x2F;p&gt;
&lt;p&gt;The core API is described in &lt;a href=&quot;https:&#x2F;&#x2F;asm.ow2.io&#x2F;asm4-guide.pdf&quot;&gt;the ASM document&lt;&#x2F;a&gt; as:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;With the event based model a class is represented with a sequence of events, 
each event representing an element of the class, 
such as its header, a field, a method declaration, an instruction, etc. 
The event based API defines the set of possible events and the order in which they must occur, 
and provides a class parser that generates one event per element that is parsed, 
as well as a class writer that generates compiled classes from sequences of such events.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The translation process is fairly regular. 
Most instructions share the pattern 
&amp;quot;load arguments onto the stack&amp;quot; -&amp;gt; &amp;quot;evaluate&amp;quot; -&amp;gt; &amp;quot;store the result&amp;quot;.
One interesting point is that Java bytecode is strictly typed,
so I decided that all variables in Bril should also be strictly typed: 
each variable must have the same type along every possible execution trace. 
In addition, some Bril instructions are polymorphic (&lt;code&gt;print&lt;&#x2F;code&gt; and &lt;code&gt;id&lt;&#x2F;code&gt;),
so the bytecode to emit for them depends on the types of their arguments.
Therefore, we first preprocess the instructions to gather all type information and labels.&lt;&#x2F;p&gt;
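&lt;p&gt;As a rough illustration of the load, evaluate, store pattern, the sketch below prints the JVM mnemonics that a Bril &lt;code&gt;add&lt;&#x2F;code&gt; on two &lt;code&gt;long&lt;&#x2F;code&gt; locals might lower to. 
It does not use ASM and is not taken from the tool itself: the class name, the method &lt;code&gt;translateBinary&lt;&#x2F;code&gt;, and the slot numbers are hypothetical, 
chosen only to mirror the &lt;code&gt;javap&lt;&#x2F;code&gt; listing above.&lt;&#x2F;p&gt;

```java
// Hypothetical sketch: emits textual mnemonics instead of real bytecode.
class LoadEvalStore {
    // Lower "dest = op lhs rhs" on 64-bit values to four stack instructions.
    static String[] translateBinary(String opcode, int lhsSlot, int rhsSlot, int destSlot) {
        return new String[] {
            "lload " + lhsSlot,    // load: push the first argument
            "lload " + rhsSlot,    // load: push the second argument
            opcode,                // evaluate: pops two longs, pushes the result
            "lstore " + destSlot   // store: pop the result into the destination local
        };
    }

    public static void main(String[] args) {
        // Slots 1, 3, and 5 because each long occupies two consecutive slots.
        for (String insn : translateBinary("ladd", 1, 3, 5)) {
            System.out.println(insn);
        }
    }
}
```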
&lt;p&gt;Another interesting point, and the hardest part of this project, is translating &lt;code&gt;print&lt;&#x2F;code&gt;.
There are two steps involved: 
(1) convert all arguments of &lt;code&gt;print&lt;&#x2F;code&gt; to &lt;code&gt;String&lt;&#x2F;code&gt;s and concatenate them, separated by single spaces;
(2) call &lt;code&gt;java.lang.System.out.println&lt;&#x2F;code&gt; with the resulting &lt;code&gt;String&lt;&#x2F;code&gt; as its argument.&lt;&#x2F;p&gt;
&lt;p&gt;Step (2) is a straightforward method call, so we&#x27;ll focus on step (1).&lt;&#x2F;p&gt;
&lt;p&gt;Step (1) is interesting because 
the way the Java compiler handles static &lt;code&gt;String&lt;&#x2F;code&gt; concatenation changed significantly in Java 9.
In Java 8 and earlier, 
the compiler uses the &lt;code&gt;StringBuilder&lt;&#x2F;code&gt; class:
it converts the arguments to &lt;code&gt;String&lt;&#x2F;code&gt;s and appends them to a &lt;code&gt;StringBuilder&lt;&#x2F;code&gt; one by one.
Since Java 9, 
the compiler simply pushes all arguments onto the stack
and calls a &lt;em&gt;dynamic method&lt;&#x2F;em&gt; created by &lt;code&gt;java.lang.invoke.StringConcatFactory.makeConcatWithConstants&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The latter approach was claimed to 
&amp;quot;enable future optimizations of String concatenation without requiring further changes to the bytecode emitted by javac.&amp;quot;
So we go with the latter. 
The tricky part is that 
a dynamic method is generated at runtime, 
and generating it 
requires a static &lt;em&gt;bootstrap method&lt;&#x2F;em&gt;.
Given a descriptor (a string that describes the types of the arguments) and a formatter
(a string that describes the format of the output string),
the bootstrap method generates a matching dynamic function (&lt;a href=&quot;https:&#x2F;&#x2F;www.guardsquare.com&#x2F;en&#x2F;blog&#x2F;string-concatenation-java-9-untangling-invokedynamic&quot;&gt;further reading&lt;&#x2F;a&gt;).
Luckily, ASM handles most of the work of wiring up this bootstrap method for us; 
all we need to do for each use of &lt;code&gt;print&lt;&#x2F;code&gt; 
is generate the corresponding descriptor and formatter.&lt;&#x2F;p&gt;
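&lt;p&gt;To make the descriptor and formatter concrete, the sketch below calls the bootstrap method by hand from plain Java (9 or later) instead of emitting an &lt;code&gt;invokedynamic&lt;&#x2F;code&gt; instruction through ASM. 
The &lt;code&gt;MethodType&lt;&#x2F;code&gt; plays the role of the descriptor and the recipe string plays the role of the formatter, with one U+0001 placeholder per argument. 
The class and helper names are hypothetical, and whether bril2jb builds its recipes in exactly this way is an assumption.&lt;&#x2F;p&gt;

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;
import java.util.Arrays;

// Hypothetical sketch: invokes the string-concat bootstrap method directly.
class PrintConcat {
    // Recipe for n arguments: one U+0001 placeholder each, joined by spaces.
    static String recipeFor(int argCount) {
        String[] slots = new String[argCount];
        Arrays.fill(slots, "\u0001");
        return String.join(" ", slots);
    }

    // Build a call site for two long arguments and run it once.
    static String concatTwoLongs(long a, long b) {
        try {
            // Equivalent to the descriptor (JJ)Ljava/lang/String;
            MethodType type = MethodType.methodType(String.class, long.class, long.class);
            CallSite site = StringConcatFactory.makeConcatWithConstants(
                    MethodHandles.lookup(), "print", type, recipeFor(2));
            return (String) site.getTarget().invokeWithArguments(a, b);
        } catch (Throwable t) {
            throw new IllegalStateException("string concatenation bootstrap failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(concatTwoLongs(653266234L, 2341234L));
    }
}
```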
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Neroysq&#x2F;bril2jb&quot;&gt;This tool&lt;&#x2F;a&gt; is implemented in Java. It translates Bril code (in JSON format) to 
a Java class file of the same name, which can run on the JVM. &lt;&#x2F;p&gt;
&lt;p&gt;To use this tool, specify the Bril source file (in JSON format) and an output path. 
The tool writes a Java class file, which you can then run using &lt;code&gt;java&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For example,
suppose we have a Bril program implementing the &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Euclidean_algorithm&quot;&gt;Euclidean algorithm for computing the greatest common divisor&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
    a: int = const 2341234;
    b: int = const 653266234;
    zero: int = const 0;
loop:
    cond: bool = eq b zero;
    br cond final here;
here:
    c: int = div a b;
    c: int = mul c b;
    c: int = sub a c;
    a: int = id b;
    b: int = id c;
    print a b;
    jmp loop;
final:
    print a b;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then we run &lt;code&gt;bril2json&lt;&#x2F;code&gt; and get its JSON form, and use &lt;code&gt;bril2jb&lt;&#x2F;code&gt; to generate &lt;code&gt;gcd.class&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;~&#x2F;G&#x2F;bril2jb ❯❯❯ .&#x2F;bril2jb test&#x2F;gcd.json .&#x2F;
~&#x2F;G&#x2F;bril2jb ❯❯❯ java gcd
653266234 2341234
2341234 61948
61948 49158
49158 12790
12790 10788
10788 2002
2002 778
778 446
446 332
332 114
114 104
104 10
10 4
4 2
2 0
2 0
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can also use &lt;code&gt;javap -v gcd&lt;&#x2F;code&gt; to see more details.&lt;&#x2F;p&gt;
&lt;p&gt;Our tool expects the input to be a strictly typed Bril program that typechecks. 
Because error handling is not the focus of this project, 
the tool will crash if the program doesn&#x27;t typecheck 
(for example, if it uses a variable that&#x27;s never defined as an argument).&lt;&#x2F;p&gt;
&lt;p&gt;There are some tricky details:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Since &lt;code&gt;int&lt;&#x2F;code&gt; in Bril is 64-bit, all &lt;code&gt;int&lt;&#x2F;code&gt; variables are &lt;code&gt;long&lt;&#x2F;code&gt; variables on the JVM; 
the JVM offers only very limited support for a boolean type, so all &lt;code&gt;bool&lt;&#x2F;code&gt; variables are stored as &lt;code&gt;int&lt;&#x2F;code&gt; (32-bit integer) values on the JVM.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;ASM computes the size of the local variable array (that is, the amount of space all local variables in one function need),
but the indices of local variables need to be maintained manually (a &lt;code&gt;long&lt;&#x2F;code&gt; takes two index slots while an &lt;code&gt;int&lt;&#x2F;code&gt; takes one).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;ASM handles the construction of the constant pool automatically, which is convenient. &lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
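&lt;p&gt;The index bookkeeping from the second point can be sketched in a few lines of plain Java. 
This is a minimal illustration and not the code used in the tool: the class name and &lt;code&gt;assignSlots&lt;&#x2F;code&gt; are hypothetical, 
and it assumes slot 0 is reserved for the &lt;code&gt;String[]&lt;&#x2F;code&gt; parameter of the static &lt;code&gt;main&lt;&#x2F;code&gt; method.&lt;&#x2F;p&gt;

```java
import java.util.Arrays;

// Hypothetical sketch of manual local-variable slot assignment.
class SlotAllocator {
    // Given Bril types in declaration order, return each variable's first slot.
    // Slot 0 is assumed to hold the String[] parameter of main.
    static int[] assignSlots(String[] brilTypes) {
        int[] slots = new int[brilTypes.length];
        int next = 1;
        for (int i = 0; i != brilTypes.length; i++) {
            slots[i] = next;
            // Bril int is a JVM long (two slots); Bril bool is a JVM int (one slot).
            next += brilTypes[i].equals("int") ? 2 : 1;
        }
        return slots;
    }

    public static void main(String[] args) {
        int[] slots = assignSlots(new String[] {"int", "int", "bool", "int"});
        System.out.println(Arrays.toString(slots)); // prints [1, 3, 5, 6]
    }
}
```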
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;To check correctness, we manually tested our tool on hand-written test cases (the ones in the Bril repo plus seven more in the &lt;code&gt;test&lt;&#x2F;code&gt; folder of our repo).
Together they cover all the instructions in Bril.
The output of every test agrees with the reference interpreter.
We also manually inspected some of the generated classes (&lt;code&gt;gcd&lt;&#x2F;code&gt; and &lt;code&gt;fibonacci&lt;&#x2F;code&gt;), and they all look reasonable.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h3&gt;
&lt;p&gt;In conclusion, we successfully built a translator from Bril to Java bytecode. 
I look forward to further maintaining this tool to support potential new features 
such as function calls and memory allocation.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Bril Debugger</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/brildb/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/brildb/</guid>
                <description>&lt;p&gt;The goal of this project was to create an interactive debugger in the style of
&lt;a href=&quot;https:&#x2F;&#x2F;www.gnu.org&#x2F;software&#x2F;gdb&#x2F;&quot;&gt;GDB&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;lldb.llvm.org&quot;&gt;LLDB&lt;&#x2F;a&gt; that would run programs in the intermediate language
&lt;a href=&quot;https:&#x2F;&#x2F;capra.cs.cornell.edu&#x2F;bril&#x2F;&quot;&gt;Bril&lt;&#x2F;a&gt;. In order to be a helpful tool for debugging a Bril program, I decided
that the debugger must have, at a minimum, the capability to perform the
following tasks:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Interpret a program and present its output&lt;&#x2F;li&gt;
&lt;li&gt;Step through individual instructions of a program&lt;&#x2F;li&gt;
&lt;li&gt;Present the current state of a program to the user&lt;&#x2F;li&gt;
&lt;li&gt;Halt interpretation of a program upon reaching a user-declared program point&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In my implementation of this project, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;anastos&#x2F;bril&#x2F;tree&#x2F;brildb&#x2F;brildb&quot;&gt;BrilDB&lt;&#x2F;a&gt;, I included the above four
core capabilities as well as a few additional features, including the ability
to modify the values of variables in the state, and the ability to condition
breakpoints on expressions such that the interpreter only halts at the program
point if the condition holds.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;BrilDB is designed as a command-line interface. The Bril program to be debugged
is declared by its file name as an argument to the program, and commands are
issued by text in the interface. Commands can only be issued while the program
is halted, not while it is actively being interpreted. The commands that BrilDB
accepts are as follows:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;run&lt;&#x2F;code&gt;: Interprets the program from its current state until it reaches a
breakpoint whose condition is satisfied or the program terminates. &lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;step [N]&lt;&#x2F;code&gt;: The same as &lt;code&gt;run&lt;&#x2F;code&gt;, but stops executing after &lt;code&gt;N&lt;&#x2F;code&gt; instructions if
neither a satisfied breakpoint nor the end of the program is reached before
that. &lt;code&gt;N&lt;&#x2F;code&gt; defaults to &lt;code&gt;1&lt;&#x2F;code&gt; if not given in the command.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;restart&lt;&#x2F;code&gt;: Resets the program state to the beginning of the &lt;code&gt;main&lt;&#x2F;code&gt; function,
undeclaring all program variables. Breakpoints are not affected.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;scope&lt;&#x2F;code&gt;: Lists all variables that have already been declared in the current
state and their values.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;print VAR&lt;&#x2F;code&gt;: Prints the value of the variable &lt;code&gt;VAR&lt;&#x2F;code&gt; in the current state.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;assign VAR VAL&lt;&#x2F;code&gt;: Modifies the current state to set variable &lt;code&gt;VAR&lt;&#x2F;code&gt; to have
value &lt;code&gt;VAL&lt;&#x2F;code&gt;. &lt;code&gt;VAL&lt;&#x2F;code&gt; must be an integer or boolean literal.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;breakpoint LOC [COND]&lt;&#x2F;code&gt;: Places a breakpoint at location &lt;code&gt;LOC&lt;&#x2F;code&gt; that halts
when condition &lt;code&gt;COND&lt;&#x2F;code&gt; is satisfied. &lt;code&gt;LOC&lt;&#x2F;code&gt; must be either a label name or an
instruction&#x2F;line number. &lt;code&gt;COND&lt;&#x2F;code&gt; defaults to &lt;code&gt;true&lt;&#x2F;code&gt; if not given. The syntax
for conditions is explained below. Breakpoints can be &amp;quot;removed&amp;quot; by placing
new breakpoints with &lt;code&gt;COND&lt;&#x2F;code&gt; as &lt;code&gt;false&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;list&lt;&#x2F;code&gt;: Prints the current function with all of its instructions shown in
their textual format. Includes inline information about breakpoints and the
line numbers.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The interpreter does not print any logging information until it reaches a
satisfied breakpoint or the program terminates. This ensures that the Bril
program&#x27;s output is not confused with any BrilDB output. In contrast, the
&lt;code&gt;step&lt;&#x2F;code&gt; command, when used to execute only a single instruction, prints the
instruction that it executes to the screen. This helps orient the user as
to where they currently are within the program.&lt;&#x2F;p&gt;
&lt;p&gt;In my experience, a common use case for debuggers is to set a breakpoint at a
section in the code where a bug is believed to reside, run the program up to
that breakpoint, and then step through the section instruction-by-instruction
until the bug presents itself. As such it is important to know exactly where
you are within the code while stepping so that you can know where to fix the
bug when you find it.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;scope&lt;&#x2F;code&gt; and &lt;code&gt;print&lt;&#x2F;code&gt; commands are useful to see whether variables hold the
values that you expect them to at a given program point. &lt;code&gt;assign&lt;&#x2F;code&gt; is useful,
in the case where a variable is not as expected, to see whether modifying it to
the expected value fixes the issue.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;list&lt;&#x2F;code&gt; command gives a perspective on the whole function currently being
executed. Because it shows the line number for each instruction, it can be
useful for setting breakpoints in exactly the correct position. It also shows
the current position within the function and breakpoints that have been placed
in the function in order to further orient the user. For example, below is an
invocation of &lt;code&gt;list&lt;&#x2F;code&gt; on a program for calculating Fibonacci numbers that has a
conditional breakpoint set at the &lt;code&gt;loop&lt;&#x2F;code&gt; label:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
(brildb) list
main {
    1      n: int = const 10;
    2      a: int = const 0;
    3      b: int = const 1;
    4  loop: (B: eq n 3)
    5      zero: int = const 0;
-&amp;gt;  6      done: bool = eq n zero;
    7      br done end iter;
    8  iter:
    9      t: int = id a;
   10      a: int = id b;
   11      b: int = add t a;
   12      one: int = const 1;
   13      n: int = sub n one;
   14      jmp loop;
   15  end:
   16      print a;
}
&lt;&#x2F;pre&gt;&lt;h3 id=&quot;breakpoint-conditions&quot;&gt;Breakpoint Conditions&lt;&#x2F;h3&gt;
&lt;p&gt;In order to support conditioning breakpoints on arbitrary expressions, I needed
a syntax in which users could express the conditions. I decided to use a syntax
based on the operations in Bril, with a grammar as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
BEXP ::= &amp;quot;true&amp;quot;
       | &amp;quot;false&amp;quot;
       | IDENTIFIER
       | &amp;quot;(&amp;quot; AB_BINOP AEXP AEXP &amp;quot;)&amp;quot;
       | &amp;quot;(&amp;quot; &amp;quot;not&amp;quot; BEXP &amp;quot;)&amp;quot;
       | &amp;quot;(&amp;quot; BB_BINOP BEXP BEXP &amp;quot;)&amp;quot;

AEXP ::= INTEGER
       | IDENTIFIER
       | &amp;quot;(&amp;quot; AA_BINOP AEXP AEXP &amp;quot;)&amp;quot;

BB_BINOP ::= &amp;quot;and&amp;quot; | &amp;quot;or&amp;quot;

AB_BINOP ::= &amp;quot;eq&amp;quot; | &amp;quot;lt&amp;quot; | &amp;quot;gt&amp;quot; | &amp;quot;le&amp;quot; | &amp;quot;ge&amp;quot;

AA_BINOP ::= &amp;quot;add&amp;quot; | &amp;quot;sub&amp;quot; | &amp;quot;mul&amp;quot; | &amp;quot;div&amp;quot;
&lt;&#x2F;pre&gt;
&lt;p&gt;Any &lt;code&gt;BEXP&lt;&#x2F;code&gt; (boolean expression) can be used as the condition of a breakpoint.
For example, if you wanted to halt at a label &lt;code&gt;foo&lt;&#x2F;code&gt; in the program only if
&lt;em&gt;x &amp;lt; y &amp;lt; z&lt;&#x2F;em&gt;, you would issue the command:
&lt;code&gt;breakpoint foo (and (lt x y) (lt y z))&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The condition grammar is typed such that only boolean expressions can be placed
where a boolean would be expected and only arithmetic expressions (&lt;code&gt;AEXP&lt;&#x2F;code&gt;) can
be placed where an integer would be expected. However, notably, any identifier
can be used as either a boolean or an arithmetic expression.&lt;&#x2F;p&gt;
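&lt;p&gt;To make the grammar concrete, here is a rough sketch of how such a condition could be parsed and evaluated.
It is written in Python rather than BrilDB&#x27;s Haskell and is not the actual implementation: the names &lt;code&gt;parse&lt;&#x2F;code&gt; and
&lt;code&gt;evaluate&lt;&#x2F;code&gt; are made up for illustration, the &lt;code&gt;BEXP&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;AEXP&lt;&#x2F;code&gt; typing is not enforced, and mapping
&lt;code&gt;div&lt;&#x2F;code&gt; to floor division is an assumption.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import operator

# Operator tables mirroring the three BINOP classes of the grammar.
AA_BINOP = {"add": operator.add, "sub": operator.sub,
            "mul": operator.mul, "div": operator.floordiv}  # div: assumption
AB_BINOP = {"eq": operator.eq, "lt": operator.lt, "gt": operator.gt,
            "le": operator.le, "ge": operator.ge}
BB_BINOP = {"and": operator.and_, "or": operator.or_}


def parse(text):
    """Turn a condition string into nested lists of tokens."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # An atom is returned as-is; a parenthesized form becomes a list.
        if tokens[pos] != "(":
            return tokens[pos], pos + 1
        items, pos = [], pos + 1
        while tokens[pos] != ")":
            item, pos = read(pos)
            items.append(item)
        return items, pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing input after condition")
    return tree


def evaluate(tree, env):
    """Evaluate a parsed condition against a variable environment (a dict)."""
    if isinstance(tree, str):
        if tree in ("true", "false"):
            return tree == "true"
        if tree.lstrip("-").isdigit():
            return int(tree)
        return env[tree]  # an identifier may hold an int or a boolean
    op, *args = tree
    vals = [evaluate(arg, env) for arg in args]
    if op == "not":
        return not vals[0]
    for table in (AA_BINOP, AB_BINOP, BB_BINOP):
        if op in table:
            return table[op](vals[0], vals[1])
    raise ValueError("unknown operator: " + op)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With this sketch, &lt;code&gt;evaluate(parse(&quot;(and (lt x y) (lt y z))&quot;), env)&lt;&#x2F;code&gt; holds exactly when the
environment satisfies &lt;em&gt;x &amp;lt; y &amp;lt; z&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;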
&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;BrilDB is implemented as a standalone Haskell program. The primary component of
the program is the interpreter, which handles executing commands and
manipulating the state. The other components are the command parser and the type
definitions, which include the ability to convert from the canonical JSON
representation of a Bril program to the internal algebraic data type (ADT)
representation.&lt;&#x2F;p&gt;
&lt;p&gt;The command parser component is built using the parser combinator library
&lt;a href=&quot;https:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;parsec&quot;&gt;Parsec&lt;&#x2F;a&gt;. Use of this library for this task is arguably excessive, as the
command syntax is very simple. However, as the grammar for breakpoint conditions
is somewhat more complex, I found it useful to use Parsec for that, and thus it
made sense to use it for all of the command parsing.&lt;&#x2F;p&gt;
&lt;p&gt;The interpreter component and the type definitions are designed in
anticipation of adding support for function calls. The
program state is built with a call stack, which currently has at most one stack
frame in it. Interpreting the &lt;code&gt;ret&lt;&#x2F;code&gt; instruction corresponds to popping an
element off the call stack; determining if the program has terminated
corresponds to checking whether the call stack is empty. However, to add support
for calls, significant changes would need to be made to the commands to support
actions such as viewing the call stack, setting breakpoints in other functions,
stepping into or over calls, etc.&lt;&#x2F;p&gt;
&lt;p&gt;Breakpoints are implemented as an extra field in the ADT for instructions. Every
instruction has a condition (which is an ADT corresponding to &lt;code&gt;BEXP&lt;&#x2F;code&gt; above)
which is by default &lt;code&gt;false&lt;&#x2F;code&gt;. Adding a breakpoint replaces that condition with
the one supplied by the command.&lt;&#x2F;p&gt;
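&lt;p&gt;The idea of storing a condition on every instruction can be sketched as follows. This is a Python
illustration under my own assumptions, not BrilDB&#x27;s Haskell ADT: &lt;code&gt;Instr&lt;&#x2F;code&gt;, &lt;code&gt;never&lt;&#x2F;code&gt;,
&lt;code&gt;set_breakpoint&lt;&#x2F;code&gt;, and &lt;code&gt;should_halt&lt;&#x2F;code&gt; are hypothetical names, and conditions are modeled as
plain functions over the variable environment rather than as &lt;code&gt;BEXP&lt;&#x2F;code&gt; values.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from dataclasses import dataclass
from typing import Callable


def never(env):
    """Default condition, playing the role of the literal false."""
    return False


@dataclass
class Instr:
    text: str               # textual form of the instruction
    cond: Callable = never  # breakpoint condition over the environment


def set_breakpoint(program, index, cond):
    """Adding a breakpoint just replaces the condition on one instruction."""
    program[index].cond = cond


def should_halt(program, index, env):
    """The interpreter halts at an instruction whose condition holds."""
    return program[index].cond(env)


program = [Instr("n: int = const 10;"), Instr("print n;")]
set_breakpoint(program, 1, lambda env: env["n"] == 10)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Under this representation, &quot;removing&quot; a breakpoint is the same operation as adding one: the condition is
simply set back to the always-false default.&lt;&#x2F;p&gt;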
&lt;h3 id=&quot;how-to-break-brildb&quot;&gt;How to Break BrilDB&lt;&#x2F;h3&gt;
&lt;p&gt;While BrilDB handles most kinds of user errors gracefully, such as malformed
commands or nonsensical conditions on breakpoints, it will crash under certain
circumstances. Specifically, BrilDB does not act as a typechecker. If a program
is not well-typed and, say, tries to add an integer to a boolean, BrilDB will
crash as it expects the arguments to an &lt;code&gt;add&lt;&#x2F;code&gt; operation to both be integers. As
such, it is suggested that you run a program through a typechecker first before
attempting to debug it with BrilDB.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;difficulties&quot;&gt;Difficulties&lt;&#x2F;h3&gt;
&lt;p&gt;The most difficult part of this project for me was determining the syntax and
writing the parser for the &lt;code&gt;breakpoint&lt;&#x2F;code&gt; command with a condition. I needed the
language to be usable in a one-line command, while also fitting in with the Bril
language and being expressive enough for complex conditions. Creating the parser
was somewhat difficult due to differences between the command language I was
creating and Parsec&#x27;s expectations of how a language should work. For example,
in the command language, I wanted a variable to be allowed to have the same name
as a command or operator.&lt;&#x2F;p&gt;
&lt;p&gt;Another difficulty I faced was making sure that the debugger would never attempt
to interpret an instruction after the Bril program had terminated. I had to
consider both how the user might attempt to execute instructions
post-termination, and how the iterative interpreter could keep going despite
reaching the end. To help with this I wrote a function which wraps a state
operation and checks that the program has not terminated before completing that
operation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
checkTerminated :: StateT DebugState IO () -&amp;gt; StateT DebugState IO ()
checkTerminated st = do
    term &amp;lt;- gets terminated
    if term then
        liftIO $ putStrLn &amp;quot;program terminated.&amp;quot;
    else
        st
&lt;&#x2F;pre&gt;&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;As BrilDB is a program built around a user interface, it did not make sense to
me to evaluate it through an automated testing suite. Instead I opted to perform
tests of the program&#x27;s effectiveness by using the program myself. I also decided
that the grounds on which I would evaluate the program would be its correctness,
as it is vitally important when trying to debug a program that the tools you are
using to debug it are correct in what they are showing you. The speed of the
debugger is not a huge concern (within reason) as in most cases the interpreter
will not be running for very long before it hits a breakpoint.&lt;&#x2F;p&gt;
&lt;p&gt;As such, to evaluate the debugger I first ran a collection of programs through
the &lt;code&gt;run&lt;&#x2F;code&gt; command to see if pure interpretation of the programs gave the
expected results. Some of these programs came from the tests for &lt;code&gt;brili&lt;&#x2F;code&gt; and
others were slightly longer programs that I created to test specific constructs.
I then ran through a few of these programs, setting breakpoints and conditional
breakpoints to ensure they triggered when and only when they were supposed to.
I used the &lt;code&gt;scope&lt;&#x2F;code&gt; command to test whether the state was as expected at certain
program points. I tested state modification by seeing if running the program
after issuing an &lt;code&gt;assign&lt;&#x2F;code&gt; command gave the same result as if there were an
assignment instruction at that location.&lt;&#x2F;p&gt;
&lt;p&gt;During this evaluation, I found the following two issues, both of which have
since been fixed:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;assign&lt;&#x2F;code&gt; command did not allow the value to be a negative number (parsing
issue).&lt;&#x2F;li&gt;
&lt;li&gt;If a program ended without a &lt;code&gt;ret&lt;&#x2F;code&gt;, the interpreter would crash when the
program terminated.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</description>
            </item>
        
            <item>
                <title>C Implementation of Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/c-implementation/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/c-implementation/</guid>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h2&gt;
&lt;p&gt;While Bril is a very useful testbed for exploring existing language
technologies and experimenting with new ideas, I was frustrated with the
tooling around the language. It took me three or four hours of fiddling around
with Node, npm, and Python to get &lt;code&gt;brili&lt;&#x2F;code&gt; working, and I did not get either
&lt;code&gt;bril2json&lt;&#x2F;code&gt; or &lt;code&gt;bril2txt&lt;&#x2F;code&gt; to work on my machine (I reimplemented them myself in
Python). I&#x27;ve rarely had such issues with my trusted systems programming
language, C. I decided to implement a simple, fast, and correct interpreter for
Bril. I call it &lt;code&gt;cril&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;All code can be found in the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Bhargee&#x2F;cril&quot;&gt;project
repository&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design&quot;&gt;Design&lt;&#x2F;h2&gt;
&lt;p&gt;The goal here is simplicity and speed (in comparison to other student
interpreters). The only external library used was for &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;kgabis&#x2F;parson&quot;&gt;parsing
JSON&lt;&#x2F;a&gt;. The interpreter&#x27;s evaluation loop, in
&lt;code&gt;src&#x2F;interp.c&lt;&#x2F;code&gt;, first loops through the parsed Bril program and notes the
indices of labels. This was the simplest way to implement jumps and branches,
which become a simple setting of the instruction pointer to the label&#x27;s index
(or an error if the label is not found in the label-&amp;gt;index map). Actual
instructions are implemented with a set of functions, one for each op code.
Once the &lt;code&gt;op&lt;&#x2F;code&gt; is known, the interpreter calls one of these functions, which
fetches arguments, does the required manipulation, and stores its result in the
right place (or in the case of effect operations, has the correct effect). &lt;&#x2F;p&gt;
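&lt;p&gt;The label pre-pass described above can be sketched like this. It is a Python illustration, not the C code in
&lt;code&gt;src&#x2F;interp.c&lt;&#x2F;code&gt;; &lt;code&gt;index_labels&lt;&#x2F;code&gt; and &lt;code&gt;jump&lt;&#x2F;code&gt; are hypothetical names. It relies only
on the fact that, in Bril&#x27;s JSON form, a label appears in the instruction list as an object with a
&lt;code&gt;label&lt;&#x2F;code&gt; key.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def index_labels(instrs):
    """Map each label name to its position in the instruction list."""
    labels = {}
    for index, instr in enumerate(instrs):
        if "label" in instr:
            labels[instr["label"]] = index
    return labels


def jump(labels, name):
    """Return the new instruction pointer for a jump to the named label."""
    if name not in labels:
        raise KeyError("unknown label: " + name)
    return labels[name]


instrs = [
    {"op": "const", "dest": "n", "type": "int", "value": 3},
    {"label": "loop"},
    {"op": "print", "args": ["n"]},
    {"op": "jmp", "args": ["loop"]},
]
labels = index_labels(instrs)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Here a &lt;code&gt;jmp loop&lt;&#x2F;code&gt; amounts to setting the instruction pointer to &lt;code&gt;jump(labels, &quot;loop&quot;)&lt;&#x2F;code&gt;,
and a jump to a missing label raises an error.&lt;&#x2F;p&gt;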
&lt;h2 id=&quot;op-code-implementation&quot;&gt;Op Code Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;All instructions but &lt;code&gt;jmp&lt;&#x2F;code&gt;, &lt;code&gt;br&lt;&#x2F;code&gt;, &lt;code&gt;print&lt;&#x2F;code&gt;, and &lt;code&gt;const&lt;&#x2F;code&gt; have a trivial
implementation. A function called &lt;code&gt;get_or_quit&lt;&#x2F;code&gt; fetches arguments from the
source program and stores their values in a global array, sized to the maximum
argument count of all non-print instructions (2). If a variable is not
found in storage, an error is reported (containing the specific issue, the
offending variable, and the instruction pointer value) and the program quits.
Otherwise, the function implementing the op calls &lt;code&gt;put&lt;&#x2F;code&gt; with the value
computed for that op code, and &lt;code&gt;put&lt;&#x2F;code&gt; stores the result in the destination
variable. For example, the &lt;code&gt;add&lt;&#x2F;code&gt; op code is implemented
as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;static void &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;op_add&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;() {
  get_or_quit();
  put(mem_args[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;mem_args[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]);
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Above, &lt;code&gt;mem_args&lt;&#x2F;code&gt; is the name of the global argument array. Thus, most op codes
have two-line implementations. Even &lt;code&gt;br&lt;&#x2F;code&gt;, the most complex op code, requires only
16 lines to implement. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;memory-implentation-and-more-for-free&quot;&gt;Memory Implementation (and More, for Free)&lt;&#x2F;h2&gt;
&lt;p&gt;I needed a hash table implementation to implement variable storage, so I built
it myself. It can be found in &lt;code&gt;src&#x2F;table.{c,h}&lt;&#x2F;code&gt;. The dictionary uses open
addressing with linear probing. The hash function copies the hash used by
&lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;javase&#x2F;7&#x2F;docs&#x2F;api&#x2F;java&#x2F;lang&#x2F;String.html#hashCode()&quot;&gt;java.lang.String’s hashCode()
method&lt;&#x2F;a&gt;. The &lt;code&gt;_hash&lt;&#x2F;code&gt; function computes
a polynomial whose coefficients are the integer values of the string&#x27;s
characters, evaluated at &lt;code&gt;x=31&lt;&#x2F;code&gt;. The polynomial is evaluated with Horner&#x27;s
method. The &lt;code&gt;table&lt;&#x2F;code&gt; maps string keys to &lt;code&gt;int64_t&lt;&#x2F;code&gt; values. I represent both Bril
integers and Bril booleans with &lt;code&gt;int64_t&lt;&#x2F;code&gt; to avoid storing the type of Bril
variables. Since Bril&#x27;s type system only allows for those two types, this design
is sufficient for now. Incorporating more types should be simple: I can extend
the &lt;code&gt;table_elem&lt;&#x2F;code&gt; struct with a type bitfield&#x2F;enum value. &lt;&#x2F;p&gt;
&lt;p&gt;The table code is also used for the label-&amp;gt;index map and to store the mapping
between string op codes and the index for that op code&#x27;s implementing function
in an array of function pointers.&lt;&#x2F;p&gt;
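&lt;p&gt;For readers who have not seen it, the hash described above can be sketched in a few lines. This is a Python
illustration of the &lt;code&gt;java.lang.String&lt;&#x2F;code&gt;-style polynomial hash evaluated with Horner&#x27;s method, not the
&lt;code&gt;_hash&lt;&#x2F;code&gt; function from &lt;code&gt;src&#x2F;table.c&lt;&#x2F;code&gt;. The name &lt;code&gt;string_hash&lt;&#x2F;code&gt;, the 32-bit wraparound, and
the reduction to a slot index in &lt;code&gt;slot_for&lt;&#x2F;code&gt; are assumptions on my part.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def string_hash(key):
    """Evaluate the polynomial with the key bytes as coefficients at x = 31."""
    h = 0
    for byte in key.encode("utf-8"):
        # Horner step: multiply the running value by x, then add the
        # next coefficient. The modulus keeps the value in 32 bits.
        h = (h * 31 + byte) % 2**32
    return h


def slot_for(key, capacity):
    """Pick the first slot to try in a table with the given capacity."""
    return string_hash(key) % capacity
```
&lt;&#x2F;pre&gt;
&lt;p&gt;For ASCII keys this agrees with Java: &lt;code&gt;string_hash(&quot;abc&quot;)&lt;&#x2F;code&gt; is 97&amp;middot;31&amp;sup2; + 98&amp;middot;31 + 99 = 96354,
the same value as &lt;code&gt;&quot;abc&quot;.hashCode()&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;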
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;I ran &lt;code&gt;cril&lt;&#x2F;code&gt; against the programs in &lt;code&gt;bril&#x2F;test&lt;&#x2F;code&gt; and made sure the outputs
matched &lt;code&gt;&amp;lt;program&amp;gt;.out&lt;&#x2F;code&gt;. My interpreter gave correct results on the Fibonacci
program in &lt;code&gt;benchmark&#x2F;fibonacci.json&lt;&#x2F;code&gt;, while the &lt;code&gt;brili&lt;&#x2F;code&gt; reference interpreter
gave wrong results on the last 3 Fibonacci numbers it output due to rounding
issues with JavaScript&#x27;s &lt;code&gt;BigInt&lt;&#x2F;code&gt; type. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;benchmark&quot;&gt;Benchmark&lt;&#x2F;h4&gt;
&lt;p&gt;I collected relatively compute-heavy Bril programs under &lt;code&gt;benchmark&lt;&#x2F;code&gt; in the
cril repository. I took these programs from Wen-Ding Li&#x27;s &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;xu3kev&#x2F;bril-benchmark&#x2F;tree&#x2F;master&quot;&gt;Bril benchmark
repository&lt;&#x2F;a&gt;. I used his
Fibonacci and factorial implementations verbatim, and used his matrix
multiplication and polynomial multiplication programs to generate Bril programs
with options &lt;code&gt;n=1&lt;&#x2F;code&gt; to &lt;code&gt;n=5&lt;&#x2F;code&gt; (the size of the respective matrices and
polynomials). &lt;&#x2F;p&gt;
&lt;h4 id=&quot;measurement-and-comparison&quot;&gt;Measurement and Comparison&lt;&#x2F;h4&gt;
&lt;p&gt;I wanted a way to get reliable performance numbers to square off
against other students&#x27; implementations of Bril. First, I had the main
evaluation loop return a &lt;code&gt;uint64_t&lt;&#x2F;code&gt; number of nanoseconds of elapsed time.
I used the POSIX-provided &lt;code&gt;timespec&lt;&#x2F;code&gt; structure to record the time tracked by
&lt;code&gt;CLOCK_PROCESS_CPUTIME_ID&lt;&#x2F;code&gt;, the nanosecond-resolution process time clock. This
tracks CPU time spent on the program process itself, irrespective of other
scheduled processes. This data is obtained with the POSIX function
&lt;code&gt;clock_gettime&lt;&#x2F;code&gt;. Some notes: &lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;I start timing only after JSON parsing, since I did not want parsing to be
included. I do the same in my evaluation of the reference interpreter.&lt;&#x2F;li&gt;
&lt;li&gt;I stop timing after cleaning up all data structures used for interpretation.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Then, in &lt;code&gt;src&#x2F;main.c&lt;&#x2F;code&gt;, I have a constant named &lt;code&gt;NUM_RUNS&lt;&#x2F;code&gt;, and if &lt;code&gt;cril&lt;&#x2F;code&gt; is
called with &lt;code&gt;--benchmark&lt;&#x2F;code&gt;, it will run each program under &lt;code&gt;benchmark&lt;&#x2F;code&gt; &lt;code&gt;NUM_RUNS&lt;&#x2F;code&gt;
times, calculate a mean and standard deviation per program, and output the
results. I modified the reference &lt;code&gt;brili&lt;&#x2F;code&gt; source code to perform the same
measurements on the same benchmark programs. Times reported are in
milliseconds, displayed as means plus or minus standard deviations. &lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Program&lt;&#x2F;th&gt;&lt;th align=&quot;left&quot;&gt;Cril (ms)&lt;&#x2F;th&gt;&lt;th align=&quot;left&quot;&gt;Brili (ms)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Fibonacci&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.099 ± .03&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.44 ± .23&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;Factorial&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.019 ± .005&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.042 ± .073&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;MatMul 1&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.005 ± .002&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.004 ± .001&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;MatMul 2&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.014 ± .004&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.022 ± .025&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;MatMul 3&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.037 ± .008&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.041 ± .027&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;MatMul 4&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.08 ± .012&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.075 ± .05&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;MatMul 5&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.017 ± .023&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.136 ± .07&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PolyMul 1&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.005 ± .002&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.006 ± .002&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PolyMul 2&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.022 ± .014&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.011 ± .004&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PolyMul 3&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.018 ± .003&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.019 ± .005&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PolyMul 4&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.028 ± .006&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.027 ± .015&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td align=&quot;left&quot;&gt;PolyMul 5&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.04 ± .008&lt;&#x2F;td&gt;&lt;td align=&quot;left&quot;&gt;.04 ± .027&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
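&lt;p&gt;The measurement procedure behind these numbers can be sketched as follows. This is a Python illustration, not
the code in &lt;code&gt;src&#x2F;main.c&lt;&#x2F;code&gt;: &lt;code&gt;benchmark&lt;&#x2F;code&gt; and &lt;code&gt;run_once&lt;&#x2F;code&gt; are hypothetical names,
&lt;code&gt;time.process_time_ns&lt;&#x2F;code&gt; stands in for reading the process CPU-time clock, and the use of the sample
(rather than population) standard deviation is an assumption.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import statistics
import time


def benchmark(run_once, num_runs):
    """Return (mean, standard deviation) of CPU time per run, in milliseconds."""
    samples = []
    for _ in range(num_runs):
        start = time.process_time_ns()   # CPU time of this process only
        run_once()                       # interpret the already-parsed program
        elapsed_ns = time.process_time_ns() - start
        samples.append(elapsed_ns / 1e6)  # nanoseconds to milliseconds
    # statistics.stdev needs at least two samples.
    return statistics.mean(samples), statistics.stdev(samples)


mean_ms, dev_ms = benchmark(lambda: sum(range(10000)), 5)
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Because parsing happens before &lt;code&gt;run_once&lt;&#x2F;code&gt; is called, it is excluded from the timing, matching the
notes above.&lt;&#x2F;p&gt;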
</description>
            </item>
        
            <item>
                <title>Exceptions in Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/exceptions-in-bril/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/exceptions-in-bril/</guid>
                <description>&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;&#x2F;h2&gt;
&lt;p&gt;We add exceptions as a control structure in Bril to support non-local
control flow. 
This allows frontends to compile exceptions directly.
The ultimate goal is to extend Bril further with more powerful control
structures such as first-class continuations or algebraic effects.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;function-calls&quot;&gt;Function Calls&lt;&#x2F;h2&gt;
&lt;p&gt;To make the implementation of exceptions nontrivial and to make exceptions
actually be non-local control structures, we first extend Bril so that it
supports function calls.
We do this by adding a &lt;code&gt;call&lt;&#x2F;code&gt; instruction that takes as arguments a
function name and a list of function arguments.
We also extend function declarations to allow for formal parameters.
The Bril text format now allows for declarations of the form&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;funcName arg1:type1 ... argN:typeN { instrs }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Our interpreter implementation explicitly represents the program stack
as an object. &lt;&#x2F;p&gt;
&lt;p&gt;The interpreter&#x27;s activation record has four components:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Bril function object --- the current function whose list of instructions
is being executed&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Variable environment --- map from local variables to values&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Handler environment --- map from exception names to handlers (see below)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;PC index --- the index into the instruction list of the current instruction
being executed&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The interpreter proceeds by executing instructions, each of which changes
environment mappings, sets the PC to the index of the next instruction
to execute, or both.
When &lt;code&gt;call&lt;&#x2F;code&gt; is executed, the interpreter does the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The current activation record is pushed onto the program stack.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;An empty variable environment is constructed, and the formal parameters
of the function are mapped to the arguments of the &lt;code&gt;call&lt;&#x2F;code&gt; instruction.
An empty handler environment is constructed.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The PC index is changed to 0, and the new variable and handler environments
become the current ones.
The interpreter resumes execution in this new state.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;When a function returns, it pops the top of the program stack as the new
activation record or, if the program stack is empty, ends program execution.&lt;&#x2F;p&gt;
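&lt;p&gt;The call and return steps above can be sketched as follows. This is a Python illustration rather than our
interpreter&#x27;s code: the names &lt;code&gt;Frame&lt;&#x2F;code&gt;, &lt;code&gt;call&lt;&#x2F;code&gt;, and &lt;code&gt;ret&lt;&#x2F;code&gt;, the &lt;code&gt;params&lt;&#x2F;code&gt;
field on function objects, and advancing the caller&#x27;s PC before pushing it are all assumptions made for the
example.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    func: dict                                    # Bril function object
    env: dict = field(default_factory=dict)       # variable environment
    handlers: dict = field(default_factory=dict)  # exception name to handler label
    pc: int = 0                                   # index of the current instruction


def call(stack, current, callee, arg_names):
    """Push the caller and build a fresh activation record for the callee."""
    current.pc += 1        # assumption: the caller resumes after the call
    stack.append(current)
    # Start from an empty environment and bind formals to argument values.
    env = {}
    for formal, actual in zip(callee["params"], arg_names):
        env[formal] = current.env[actual]
    return Frame(func=callee, env=env)  # empty handlers, PC at 0


def ret(stack):
    """Pop the caller, or return None when the program is finished."""
    return stack.pop() if stack else None
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Keeping the stack as an ordinary list is what later lets exception handling inspect and discard frames
directly.&lt;&#x2F;p&gt;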
&lt;h2 id=&quot;exceptions&quot;&gt;Exceptions&lt;&#x2F;h2&gt;
&lt;p&gt;Once we added support for function calls, implementing exceptions came quite
easily.
We add two instructions to this end: &lt;code&gt;handle exnName handlerLabel&lt;&#x2F;code&gt;
installs a handler for exception &lt;code&gt;exnName&lt;&#x2F;code&gt; in the handler environment,
while &lt;code&gt;throw exnName&lt;&#x2F;code&gt; throws an exception with name &lt;code&gt;exnName&lt;&#x2F;code&gt;.
When an exception &lt;code&gt;exnName&lt;&#x2F;code&gt; is thrown, the interpreter does
the following:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The interpreter performs &lt;em&gt;stack unwinding&lt;&#x2F;em&gt;: first, it checks if there
is a handler for &lt;code&gt;exnName&lt;&#x2F;code&gt; in the current handler environment.
If none exists, it pops the top of the program stack and checks for a handler
in that activation frame.
It repeats this until it finds an appropriate handler; if the stack is exhausted first, it throws
an exception (in the metalanguage) since a Bril exception was not handled
properly.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Once it finds an appropriate handler, the interpreter sets the
handler&#x27;s activation record as its current activation record and then
it jumps to the PC index of the handler label.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
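&lt;p&gt;The unwinding procedure above can be sketched as follows. This is a Python illustration, not our
interpreter&#x27;s code: activation records are modeled as plain dictionaries, and the names &lt;code&gt;throw&lt;&#x2F;code&gt;,
&lt;code&gt;UnhandledBrilException&lt;&#x2F;code&gt;, and the per-frame &lt;code&gt;labels&lt;&#x2F;code&gt; map from label names to instruction
indices are hypothetical.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class UnhandledBrilException(Exception):
    """Raised in the metalanguage when no frame handles a Bril exception."""


def throw(stack, current, exn_name):
    """Unwind until a frame has a handler for exn_name, then jump to it."""
    frame = current
    while exn_name not in frame["handlers"]:
        if not stack:
            raise UnhandledBrilException(exn_name)
        frame = stack.pop()  # discard this frame and look one level up
    label = frame["handlers"][exn_name]
    frame["pc"] = frame["labels"][label]  # jump to the handler label
    return frame  # becomes the current activation record


outer = {"handlers": {"oops": "recover"}, "labels": {"recover": 4}, "pc": 1}
inner = {"handlers": {}, "labels": {}, "pc": 7}
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Throwing &lt;code&gt;oops&lt;&#x2F;code&gt; from &lt;code&gt;inner&lt;&#x2F;code&gt; with &lt;code&gt;outer&lt;&#x2F;code&gt; on the stack pops &lt;code&gt;outer&lt;&#x2F;code&gt;,
makes it current, and sets its PC to the index of &lt;code&gt;recover&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;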
&lt;h2 id=&quot;apologia&quot;&gt;Apologia&lt;&#x2F;h2&gt;
&lt;p&gt;Our implementation explicitly represents the program stack.
Our implementation could also have implicitly represented the program stack
using the interpreter&#x27;s own stack by making recursive calls to 
&lt;code&gt;evalFunc&lt;&#x2F;code&gt;, but we passed over this design in favor of the more explicit one to have
finer control over the program stack of Bril and to support
future development of other non-local control structures that
manipulate the stack.
(For example, first-class continuations would involve storing program stacks
as values in the environment map.)
Also, the implicit representation of program stacks would essentially fix
the non-local control structures of Bril to the control structures
the metalanguage has.
With the implicit representation, to throw exceptions in Bril programs
one would need to throw an exception in the interpreter as well;
supporting first-class continuations or algebraic effects would be impossible
in this way.&lt;&#x2F;p&gt;
&lt;p&gt;Our implementation of exception handlers does not support passing
exception objects.
To support this in the future, we would make handlers standalone functions
instead of just labels in an existing function.
This would make reasoning about the control flow of the program even less local,
since handlers would no longer be syntactically part of the contexts in which
their exceptions might be thrown, so we avoided this design choice for the current
implementation.&lt;&#x2F;p&gt;
&lt;p&gt;Another limitation of the current implementation of exceptions is the fact that
handlers are not tied to a lexical scope. 
That means a newly installed handler overrides any handler for the same exception name
previously installed in the same function.
We chose this simpler, albeit somewhat unintuitive, design to avoid the
introduction of nested lexical scopes in Bril, which we believe is
antithetical to its intended use as a simple intermediate language.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;Our main goal for evaluation is to check whether our implementation is correct.
We considered performance out of scope for evaluation since
the Bril interpreter is sufficiently different from how, say, ISA
instructions are implemented in hardware that we cannot infer anything useful
about the performance of an analogous implementation of our exceptions
mechanism in the latter from results in the former.&lt;&#x2F;p&gt;
&lt;p&gt;To evaluate correctness, we created a suite of tests that check a variety of
situations in which exceptions might be used.
The suite has both positive and negative test cases to check for normal use of
exceptions and for when they are used improperly.
A non-exhaustive list of tests includes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;(Positive) No stack unwinding: thrown exception is handled by a handler 
installed in the current activation frame&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;(Positive) Deep stack unwinding: thrown exception is handled by a handler
high up in the stack&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;(Negative) Unhandled exception: no handler is installed for an exception&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Using the Turnt testing tool, we were able to verify that our implementation
passes all the tests in the suite.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Writing a Faster Interpreter for Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/faster-interpreter/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/faster-interpreter/</guid>
                <description>&lt;h1 id=&quot;interpreting-bril&quot;&gt;Interpreting Bril&lt;&#x2F;h1&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&quot;&gt;Bril&lt;&#x2F;a&gt; is a pleasantly simple intermediate representation for
teaching compilers and compiler optimizations. Part of Bril being useful for this purpose is the
existence of a solid, simple reference interpreter for the language: &lt;code&gt;brili&lt;&#x2F;code&gt;. Written in
&lt;a href=&quot;https:&#x2F;&#x2F;www.typescriptlang.org&#x2F;&quot;&gt;TypeScript&lt;&#x2F;a&gt;, &lt;code&gt;brili&lt;&#x2F;code&gt; is straightforward to extend and allows easy
experimentation with new language features and optimizations. The flipside of &lt;code&gt;brili&lt;&#x2F;code&gt;&#x27;s simplicity
and the use of TypeScript for its implementation is that it isn&#x27;t very fast. Thus, while &lt;code&gt;brili&lt;&#x2F;code&gt; is
suitable for working with small, simple Bril programs, it may make experiments using more complex,
long-running programs unnecessarily onerous.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;gotta-go-fast&quot;&gt;Gotta go fast&lt;&#x2F;h1&gt;
&lt;p&gt;Given this, our goal was to make a faster interpreter for Bril. The simplest way to do this is to
pick a faster language than TypeScript for the interpreter&#x27;s implementation. As such, we chose to
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ansuz&#x2F;RIIR&quot;&gt;Rewrite It In Rust&lt;&#x2F;a&gt;. &lt;a href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;Rust&lt;&#x2F;a&gt; is a
modern systems programming language designed to make it easy to &amp;quot;build reliable and efficient
software&amp;quot;; it typically performs comparably to (or sometimes faster than) the same software written
in a more traditional systems language such as C or C++.&lt;&#x2F;p&gt;
&lt;p&gt;Rust is appealing for this project for a few other reasons. Chief among these is its rich library
ecosystem, which includes the excellent &lt;a href=&quot;https:&#x2F;&#x2F;serde.rs&#x2F;&quot;&gt;Serde&lt;&#x2F;a&gt;. Serde is a library (technically,
a family of libraries) for &lt;strong&gt;ser&lt;&#x2F;strong&gt;ializing and &lt;strong&gt;de&lt;&#x2F;strong&gt;serializing data between a range of formats,
including JSON. It provides utilities for automatically &lt;a href=&quot;https:&#x2F;&#x2F;serde.rs&#x2F;derive.html&quot;&gt;deriving&lt;&#x2F;a&gt;
functions to parse JSON (and other formats) directly into native Rust structs, which is convenient
for an interpreter. With Serde, we can define the data structures for Bril&#x27;s various operations and
overall program structure, then get parsing code from Bril&#x27;s JSON representation for free! Even
better, Serde&#x27;s automatically generated code is typically very fast.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;interpreter-structure&quot;&gt;Interpreter structure&lt;&#x2F;h1&gt;
&lt;p&gt;Our first &amp;quot;draft&amp;quot; of an interpreter implemented a simple structure:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Using Serde, derive JSON deserialization code for types representing the Bril operations
and other syntax (including labels, functions, etc.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Parse Bril JSON into the aforementioned structures.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Form basic blocks and a CFG from the program instructions. This process constructs a mapping
from label names to basic block indices and uses this map to &amp;quot;link&amp;quot; together connected basic
blocks. Linking in this way (rather than just constructing the name&#x2F;index mapping and using that
at runtime) gives us slightly faster execution: it&#x27;s faster to load an array index than it is to
e.g., load a heap-allocated object (as in a pointer-linked graph) or compute a string hash every
time we need to find a label location. To see why this is faster than chasing a pointer, consider
that we can exploit spatial locality by storing blocks in contiguous memory. Since a block is
usually small, a set of blocks stored in contiguous memory may be able to fit in cache, making
access fast. It&#x27;s worth noting that while we could&#x27;ve gotten away with not forming basic blocks
or a CFG, and potentially gained some speedup from this laziness, the CFG building process should
not dominate interpretation times and enables us to build more optimizations down the line.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Starting at the first basic block of the &lt;code&gt;main&lt;&#x2F;code&gt; function, iterate through, matching each
instruction based on its type (using Rust&#x27;s &lt;code&gt;match&lt;&#x2F;code&gt; expression on a large variant type representing
the Bril operations). Each branch of this match implements the semantics for a Bril operation; we
also handle type-checking at this stage. We handle control flow by setting the index of the &amp;quot;next&amp;quot;
basic block for the next iteration of the main execution loop.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
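&lt;p&gt;Steps 3 and 4 above can be sketched in Python. This is a simplified illustration, not the Rust code of &lt;code&gt;brilirs&lt;&#x2F;code&gt;: the function names are hypothetical, only a handful of operations are handled, there is no type-checking, and it assumes the JSON shape in which &lt;code&gt;jmp&lt;&#x2F;code&gt; and &lt;code&gt;br&lt;&#x2F;code&gt; carry their label names in &lt;code&gt;args&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;

```python
TERMINATORS = ("jmp", "br", "ret")


def form_blocks(instrs):
    """Split a flat instruction list into basic blocks; map labels to indices."""
    blocks, label_to_index, current = [], {}, []
    for instr in instrs:
        if "label" in instr:
            if current:
                blocks.append(current)
                current = []
            # The label names the block that is about to start.
            label_to_index[instr["label"]] = len(blocks)
        else:
            current.append(instr)
            if instr["op"] in TERMINATORS:
                blocks.append(current)
                current = []
    # Keep a trailing block, including an empty one named by a final label.
    if current or len(blocks) in label_to_index.values():
        blocks.append(current)
    return blocks, label_to_index


def link(blocks, label_to_index):
    """Replace label names in terminators with block indices, once, up front."""
    linked = []
    for block in blocks:
        new_block = []
        for instr in block:
            if instr["op"] == "jmp":
                instr = {"op": "jmp", "target": label_to_index[instr["args"][0]]}
            elif instr["op"] == "br":
                cond, then_label, else_label = instr["args"]
                targets = (label_to_index[then_label], label_to_index[else_label])
                instr = {"op": "br", "cond": cond, "targets": targets}
            new_block.append(instr)
        linked.append(new_block)
    return linked


def run(blocks):
    """Main loop: execute one block, then move to the chosen next block index."""
    env, output, index = {}, [], 0
    while index != len(blocks):
        next_index = index + 1  # default: fall through to the following block
        for instr in blocks[index]:
            op = instr["op"]
            if op == "const":
                env[instr["dest"]] = instr["value"]
            elif op == "add":
                left, right = instr["args"]
                env[instr["dest"]] = env[left] + env[right]
            elif op == "eq":
                left, right = instr["args"]
                env[instr["dest"]] = env[left] == env[right]
            elif op == "print":
                output.append(env[instr["args"][0]])
            elif op == "jmp":
                next_index = instr["target"]
            elif op == "br":
                then_index, else_index = instr["targets"]
                next_index = then_index if env[instr["cond"]] else else_index
            elif op == "ret":
                return output
        index = next_index
    return output
```

&lt;p&gt;After &lt;code&gt;link&lt;&#x2F;code&gt; has run, the execution loop never touches a label string: a jump is a single list index. Python lists hold references, so the sketch shows the structure but not the cache behavior of the contiguous Rust layout.&lt;&#x2F;p&gt;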
&lt;h1 id=&quot;derive-problems&quot;&gt;&lt;code&gt;#[derive(Problems)]&lt;&#x2F;code&gt;&lt;&#x2F;h1&gt;
&lt;p&gt;The above is a perfectly reasonable but naive implementation of a Bril interpreter. There are a lot
of possible improvements we could make on this baseline; many of these stem from being smarter with
how we look up variable values and how we dispatch operations. To enable these sorts of
improvements, it helps to be able to parse variable identifiers and operation types into numerical
representations (to enable direct indexing instead of name hashing for variable lookup, and to
enable branch-predictor friendly dispatch of operations based on broad classes of operation &amp;quot;type&amp;quot;).
We should note, however, that this transformation is not free: for short programs where names are
only accessed once or a handful of times, it may cost more to perform identifier transformation before
interpretation. This transformation will be most beneficial for programs with loops and frequent
variable access.&lt;&#x2F;p&gt;
&lt;p&gt;We implemented this in a relatively clean way around the original Serde structure, but ran into some
challenges around the difficulty of performing stateful deserialization.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;state-of-the-deserialization&quot;&gt;State of the Deserialization&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s our basic approach to replacing variable identifiers with indices, in pseudocode (Python):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;next_id &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;id_name_map &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{}
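&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;deserialize_identifier&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(ident):
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;global &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;next_id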
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ident &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;in &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;id_name_map:
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;id_name_map[ident]

  id_name_map[ident] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;next_id
  next_id &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1
  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;next_id &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We ran into two problems implementing this in Rust. First, Rust is strongly typed. This means that
either we need to perform this transformation during deserialization of JSON into Rust data
structures, or we need two versions of our Rust structures: One which represents identifiers as
strings, and one which uses the above numerical representation.&lt;&#x2F;p&gt;
&lt;p&gt;The first option here is problematic because Serde makes it challenging to use mutable state inside
the deserialization logic for a deeply-nested field of a data type. Indeed, it seems that you lose
most of the benefits of Serde&#x27;s &lt;code&gt;#[derive(Deserialize)]&lt;&#x2F;code&gt; magic auto-implementation, and have to
implement a tree of deserializers manually.&lt;&#x2F;p&gt;
&lt;p&gt;The second option is better: make the IR data types polymorphic and deserialize to a &lt;code&gt;String&lt;&#x2F;code&gt;
specialization of the types, then run a pass over the program to transform to a numerical
specialization of the types. It&#x27;s easy to use mutable state in this second transformation, and the
use of parametric polymorphism makes the code clean.&lt;&#x2F;p&gt;
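&lt;p&gt;As a rough illustration of the second option, here is what the renaming pass might look like in Python, where dictionaries stand in for the two specializations of the Rust types. The name &lt;code&gt;number_variables&lt;&#x2F;code&gt; is hypothetical, and the sketch assumes every entry in &lt;code&gt;args&lt;&#x2F;code&gt; is a variable, so control-flow instructions whose arguments are labels would need separate handling.&lt;&#x2F;p&gt;

```python
def number_variables(instrs):
    """Second pass: rewrite string variable names to dense integer ids."""
    ids = {}

    def intern(name):
        # setdefault assigns the next free id the first time a name is seen.
        return ids.setdefault(name, len(ids))

    numbered = []
    for instr in instrs:
        new_instr = dict(instr)  # leave the string-named program untouched
        if "dest" in instr:
            new_instr["dest"] = intern(instr["dest"])
        if "args" in instr:
            new_instr["args"] = [intern(arg) for arg in instr["args"]]
        numbered.append(new_instr)
    return numbered, len(ids)
```

&lt;p&gt;The returned count lets the interpreter allocate its environment as a flat list of that length, so each variable access becomes a list index instead of a string hash.&lt;&#x2F;p&gt;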
&lt;p&gt;We&#x27;ve implemented this, and it is working. Most of the potential speedup remains untapped, however: we did
not have time in this project to implement parsing of operations into branch-friendly numerical
representations.&lt;&#x2F;p&gt;
&lt;h1 id=&quot;how-do-we-stack-up&quot;&gt;How do we stack up?&lt;&#x2F;h1&gt;
&lt;p&gt;We ran benchmarks using &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sharkdp&#x2F;hyperfine&quot;&gt;hyperfine&lt;&#x2F;a&gt;. Each Bril program was
run using both &lt;code&gt;brili&lt;&#x2F;code&gt; and &lt;code&gt;brilirs&lt;&#x2F;code&gt;. Overall, &lt;code&gt;brilirs&lt;&#x2F;code&gt; was faster, as expected. On two of the larger
inputs (matrix_mul with n=20 and poly_mul with n=100), however, &lt;code&gt;brili&lt;&#x2F;code&gt; was roughly twice as fast.
The results (measurements reported as mean plus&#x2F;minus standard
deviation):&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;brili&lt;&#x2F;th&gt;&lt;th&gt;brilirs&lt;&#x2F;th&gt;&lt;th&gt;Speedup&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;matrix_mul, n=10&lt;&#x2F;td&gt;&lt;td&gt;45.5 ms ± 3.4 ms&lt;&#x2F;td&gt;&lt;td&gt;21.0 ms ± 2.7 ms&lt;&#x2F;td&gt;&lt;td&gt;2.16 ± 0.32&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;matrix_mul, n=20&lt;&#x2F;td&gt;&lt;td&gt;82.2 ms ± 2.5 ms&lt;&#x2F;td&gt;&lt;td&gt;148.9 ms ± 3.6 ms&lt;&#x2F;td&gt;&lt;td&gt;brili was 1.81 ± 0.07 times faster&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;poly_mul, n=50&lt;&#x2F;td&gt;&lt;td&gt;53.4 ms ± 3.9 ms&lt;&#x2F;td&gt;&lt;td&gt;44.2 ms ± 1.9 ms&lt;&#x2F;td&gt;&lt;td&gt;1.21 ± 0.10&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;poly_mul, n=100&lt;&#x2F;td&gt;&lt;td&gt;86.2 ms ± 5.0 ms&lt;&#x2F;td&gt;&lt;td&gt;174.7 ms ± 4.0 ms&lt;&#x2F;td&gt;&lt;td&gt;brili was 2.03 ± 0.13 times faster&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;For other benchmarks, &lt;code&gt;brilirs&lt;&#x2F;code&gt; was so fast that hyperfine warned that the average run was under or
around five milliseconds:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benchmark&lt;&#x2F;th&gt;&lt;th&gt;brili&lt;&#x2F;th&gt;&lt;th&gt;brilirs&lt;&#x2F;th&gt;&lt;th&gt;Speedup&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;factorial&lt;&#x2F;td&gt;&lt;td&gt;38.2 ms ± 4.2 ms&lt;&#x2F;td&gt;&lt;td&gt;2.4 ms ± 1.3 ms&lt;&#x2F;td&gt;&lt;td&gt;16.09 ± 8.84&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;fibonacci&lt;&#x2F;td&gt;&lt;td&gt;39.8 ms ± 3.3 ms&lt;&#x2F;td&gt;&lt;td&gt;4.1 ms ± 2.4 ms&lt;&#x2F;td&gt;&lt;td&gt;9.74 ± 5.68&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;id_chain, n=10&lt;&#x2F;td&gt;&lt;td&gt;36.8 ms ± 3.0 ms&lt;&#x2F;td&gt;&lt;td&gt;1.9 ms ± 1.1 ms&lt;&#x2F;td&gt;&lt;td&gt;19.06 ± 11.06&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;id_chain, n=500&lt;&#x2F;td&gt;&lt;td&gt;39.4 ms ± 4.1 ms&lt;&#x2F;td&gt;&lt;td&gt;5.7 ms ± 1.2 ms&lt;&#x2F;td&gt;&lt;td&gt;6.89 ± 1.64&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;poly_mul, n=10&lt;&#x2F;td&gt;&lt;td&gt;37.8 ms ± 2.3 ms&lt;&#x2F;td&gt;&lt;td&gt;5.6 ms ± 1.3 ms&lt;&#x2F;td&gt;&lt;td&gt;6.80 ± 1.65&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Most of the benchmarks are from &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;xu3kev&#x2F;bril-benchmark&#x2F;&quot;&gt;bril-benchmark&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Benchmarks were run on a 2018 Thinkpad T580 running Arch Linux, kernel version 5.2.11.&lt;&#x2F;p&gt;
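&lt;p&gt;For readers who want to sanity-check the speedup columns, the reported values appear consistent with a ratio of means whose uncertainty comes from first-order error propagation. The sketch below rests on that assumption (the function name &lt;code&gt;speedup&lt;&#x2F;code&gt; is ours). Because it starts from the rounded figures in the tables, its results can differ from the reported ones in the last digit.&lt;&#x2F;p&gt;

```python
import math


def speedup(base_mean, base_std, new_mean, new_std):
    """Return (ratio, std) for base_mean / new_mean with propagated uncertainty."""
    ratio = base_mean / new_mean
    # Relative errors of the two means add in quadrature for a quotient.
    rel_err = math.sqrt((base_std / base_mean) ** 2 + (new_std / new_mean) ** 2)
    return ratio, ratio * rel_err
```

&lt;p&gt;For example, &lt;code&gt;speedup(45.5, 3.4, 21.0, 2.7)&lt;&#x2F;code&gt; corresponds to the matrix_mul, n=10 row. The wide error bars in the second table follow from the same formula, since the relative error of a two-millisecond measurement with a one-millisecond deviation dominates the result.&lt;&#x2F;p&gt;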
&lt;h1 id=&quot;what-else-could-be-done&quot;&gt;What else could be done?&lt;&#x2F;h1&gt;
&lt;p&gt;If we continue to develop the interpreter, it would be worthwhile to try adding some interpreter
optimizations (e.g., those found
&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;status-im&#x2F;nimbus&#x2F;wiki&#x2F;Interpreter-optimization-resources&quot;&gt;here&lt;&#x2F;a&gt;) to squeeze out
more speed.&lt;&#x2F;p&gt;
&lt;p&gt;Aside from adding interpreter implementation optimizations, it would also be interesting to apply
CFG-level optimizations and compile Bril to some optimized in-memory bytecode representation for
faster interpretation, as well as to add some of the cool new language features created as a part of
other Project 1 submissions.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Floating Points and Fixed-Length Arrays in Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/floats-static-arrays/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/floats-static-arrays/</guid>
                <description>&lt;h3 id=&quot;introduction&quot;&gt;Introduction&lt;&#x2F;h3&gt;
&lt;p&gt;My goal for this project was to add floating-point numbers and fixed-length arrays to &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&quot;&gt;Bril&lt;&#x2F;a&gt;.  The intention behind this decision is to promote Bril as a lower-level intermediate language, with fewer type abstractions than JavaScript but without the representational complexity of LLVM.  The hope is that, by writing the Bril IR to provide a &lt;em&gt;minimal&lt;&#x2F;em&gt; but &lt;em&gt;descriptive&lt;&#x2F;em&gt; set of numeric operations, users will be able to explore optimizations without sacrificing visibility of low-level computational operations.&lt;&#x2F;p&gt;
&lt;p&gt;These goals were mostly successful at the level of Bril semantics, parsing, and interpretation.  What was more difficult than expected (and less successful) were my attempts to model TypeScript values and arrays with &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;bril-ts&#x2F;ts2bril.ts&quot;&gt;TS2Bril&lt;&#x2F;a&gt;.  Before exploring the limitations of the Bril representation, however, we must examine the floating point and array semantics of Bril more formally.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;floating-points-in-bril&quot;&gt;Floating Points in Bril&lt;&#x2F;h3&gt;
&lt;p&gt;Bril now supports the types &lt;code&gt;double&lt;&#x2F;code&gt; and &lt;code&gt;float&lt;&#x2F;code&gt;.  These represent, respectively, the standard IEEE-754 &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Double-precision_floating-point_format&quot;&gt;double-precision&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Single-precision_floating-point_format&quot;&gt;single-precision&lt;&#x2F;a&gt; floating-point formats.  These types come equipped with the basic arithmetic operations &lt;code&gt;fadd&lt;&#x2F;code&gt;, &lt;code&gt;fsub&lt;&#x2F;code&gt;, &lt;code&gt;fmult&lt;&#x2F;code&gt;, and &lt;code&gt;fdiv&lt;&#x2F;code&gt;, along with the comparisons &lt;code&gt;feq&lt;&#x2F;code&gt;, &lt;code&gt;flt&lt;&#x2F;code&gt;, &lt;code&gt;fgt&lt;&#x2F;code&gt;, &lt;code&gt;fle&lt;&#x2F;code&gt;, and &lt;code&gt;fge&lt;&#x2F;code&gt;.  The introduction of these operations is intended to highlight that floating-point and integer operations should be treated as fundamentally different objects.&lt;&#x2F;p&gt;
&lt;p&gt;In Bril, these floating-point operations are overloaded to work on either floats or doubles.  However, floating-point precisions should not be mixed in a single operation.  For instance, the following code is illegal:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;v0: double = const 5;
v1: float = const 5;
v2: float = fadd v0 v1;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Similarly, mixing floating-point values and integers in Bril is undefined, as is using floating-point arithmetic operations on integers or vice versa.  In a future project, I intend to explore the consequences of relaxing these semantic requirements, particularly when mixing floating-point precision.  It would also be interesting to add arbitrary precision floating-points to Bril; however, it is unclear if removing the abstraction of having only two named types will be worth the additional cost of complexity to implementations on the Bril IR.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bril-fixed-length-arrays&quot;&gt;Bril Fixed-Length Arrays&lt;&#x2F;h3&gt;
&lt;p&gt;Bril now supports types of the form &lt;code&gt;type[size]&lt;&#x2F;code&gt; in addition to types previously defined.  Note the inclusion of type recursion in this definition, which permits multi-dimensional arrays.  Arrays in Bril must contain elements all of the same type, and the size of arrays cannot be changed after initialization.&lt;&#x2F;p&gt;
&lt;p&gt;There are three operations related to manipulating arrays in Bril: initialization, assignment, and indexing.  Arrays are initialized with the &lt;code&gt;new&lt;&#x2F;code&gt; instruction, which takes a type and stores a default initialization of that type to the given destination.  &lt;code&gt;new&lt;&#x2F;code&gt; can be applied to any type to initialize the default value of that type (which is implementation-specific).  Array elements can be set with the &lt;code&gt;set&lt;&#x2F;code&gt; instruction, an effectful instruction which takes the array, an index, and a value in that order.  Finally, array elements can be read using the &lt;code&gt;get&lt;&#x2F;code&gt; instruction, which takes the array and an index and writes the value of that element to the given destination.&lt;&#x2F;p&gt;
&lt;p&gt;Array operations are concisely summarized in the following code snippet:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;val: int = const 5;
ind: int = const 1;

arr: int[3] = new int[3];   &#x2F;&#x2F; arr = [0, 0, 0]
set arr ind val;            &#x2F;&#x2F; arr = [0, 5, 0]
res: int = get arr ind;   &#x2F;&#x2F; res = 5
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Arrays can be written to freely in any Bril block.  As a result, array values are inherently stateful and should be handled carefully by users.  Writing values of the wrong type, however, is not permitted.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;Floating points and fixed-length arrays have been implemented as described, and can be reasoned about by the &lt;code&gt;bril2json&lt;&#x2F;code&gt; parser, the &lt;code&gt;brili&lt;&#x2F;code&gt; interpreter, and the &lt;code&gt;bril2txt&lt;&#x2F;code&gt; translator.  A subset of TypeScript can be written to Bril-recognized JSON by the &lt;code&gt;ts2bril&lt;&#x2F;code&gt; compiler.  While these conversions work generally as described above, there are some features which bear special notice.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;brili&lt;&#x2F;code&gt; interpreter supports integers, doubles, and floats as distinct kinds of values.  Integers are represented as JavaScript &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;JavaScript&#x2F;Reference&#x2F;Global_Objects&#x2F;BigInt&quot;&gt;BigInt&lt;&#x2F;a&gt; objects.  This choice of representation means that &lt;code&gt;brili&lt;&#x2F;code&gt; does not correctly reason about integer overflow -- while this should be implemented in the future, I have not observed a need for it in the current Bril projects.&lt;&#x2F;p&gt;
&lt;p&gt;When interpreting Bril code in &lt;code&gt;brili&lt;&#x2F;code&gt;, doubles can be represented simply as &lt;code&gt;number&lt;&#x2F;code&gt;, while floats are derived with the &lt;a href=&quot;https:&#x2F;&#x2F;developer.mozilla.org&#x2F;en-US&#x2F;docs&#x2F;Web&#x2F;JavaScript&#x2F;Reference&#x2F;Global_Objects&#x2F;Math&#x2F;fround&quot;&gt;Math.fround&lt;&#x2F;a&gt; function.  Note that &lt;code&gt;fround&lt;&#x2F;code&gt; is only applied &lt;em&gt;after&lt;&#x2F;em&gt; each floating-point operation; this is permitted due to the &lt;a href=&quot;https:&#x2F;&#x2F;docs.oracle.com&#x2F;cd&#x2F;E19957-01&#x2F;806-3568&#x2F;ncg_goldberg.html&quot;&gt;IEEE-754 restrictions&lt;&#x2F;a&gt; on floating point operation precision loss (or lack thereof).  Extending Bril to support other mathematical operations (such as &lt;code&gt;pow&lt;&#x2F;code&gt; or &lt;code&gt;sqrt&lt;&#x2F;code&gt;) may require that this implementation choice be updated.&lt;&#x2F;p&gt;
&lt;p&gt;Adding floating-point types requires a few changes to our translation from TypeScript to Bril.  First, &lt;code&gt;ts2bril&lt;&#x2F;code&gt; now represents every number as a &lt;code&gt;double&lt;&#x2F;code&gt; by default, and thus arithmetic operations in TypeScript are translated to Bril as floating-point operations.  The TypeScript &lt;code&gt;bigint&lt;&#x2F;code&gt; type can be used to specially generate integers in Bril, and operations between bigints are translated correctly.  Single-precision floats cannot be easily represented in TypeScript code; however, I will look into using the &lt;code&gt;fround&lt;&#x2F;code&gt; function as a mechanism for generating Bril &lt;code&gt;float&lt;&#x2F;code&gt; values from TypeScript code.&lt;&#x2F;p&gt;
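&lt;p&gt;To make the rounding strategy concrete, here is a small Python sketch of the same idea. It is not the &lt;code&gt;brili&lt;&#x2F;code&gt; code: the helper &lt;code&gt;fround&lt;&#x2F;code&gt; below is a stand-in for JavaScript&#x27;s &lt;code&gt;Math.fround&lt;&#x2F;code&gt; built on the standard &lt;code&gt;struct&lt;&#x2F;code&gt; module, and the signature of &lt;code&gt;fadd&lt;&#x2F;code&gt; is an assumption made for illustration.&lt;&#x2F;p&gt;

```python
import struct


def fround(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def fadd(a, b, ty):
    """Add in double precision, then round once if the Bril type is float."""
    result = a + b
    return fround(result) if ty == "float" else result
```

&lt;p&gt;With operands that are already single-precision values, computing the sum as a double and rounding once gives the same answer as a native single-precision addition, which is the property the interpreter relies on for the four basic arithmetic operations.&lt;&#x2F;p&gt;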
&lt;p&gt;Fixed-length arrays are reasoned about naively by the &lt;code&gt;brili&lt;&#x2F;code&gt; interpreter; in particular, it makes no attempt to check array length before indexing.  &lt;code&gt;brili&lt;&#x2F;code&gt; arrays are recursively filled with 0s, False, or new arrays by default when initialized.&lt;&#x2F;p&gt;
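&lt;p&gt;The default initialization and the &lt;code&gt;set&lt;&#x2F;code&gt; and &lt;code&gt;get&lt;&#x2F;code&gt; operations can be sketched in Python as follows. The type encoding is an assumption made for this example: a type is either a name such as &lt;code&gt;&amp;quot;int&amp;quot;&lt;&#x2F;code&gt; or a pair of element type and size. The function names are hypothetical as well.&lt;&#x2F;p&gt;

```python
def default_value(ty):
    """Build the default value for a type.

    A type is a name such as "int", or a pair (element_type, size)
    standing for element_type[size].
    """
    if isinstance(ty, tuple):
        elem_ty, size = ty
        # Each slot gets its own recursively built default, so nested
        # arrays are never shared between slots.
        return [default_value(elem_ty) for _ in range(size)]
    if ty == "bool":
        return False
    if ty in ("float", "double"):
        return 0.0
    return 0  # int


def array_set(arr, index, value):
    arr[index] = value  # effectful: mutates the array in place


def array_get(arr, index):
    return arr[index]
```

&lt;p&gt;One difference from &lt;code&gt;brili&lt;&#x2F;code&gt; is worth noting: Python raises an &lt;code&gt;IndexError&lt;&#x2F;code&gt; on an out-of-range index, whereas the interpreter described here performs no length check at all.&lt;&#x2F;p&gt;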
&lt;p&gt;&lt;code&gt;ts2bril&lt;&#x2F;code&gt; can compile TypeScript arrays of consistent types to a Bril-like language.  This language differs from Bril only in that fixed-length arrays will have a size of &lt;code&gt;-1&lt;&#x2F;code&gt; except at initialization, indicating that the size of the array is not known statically by the TypeScript compiler.  TypeScript does not track array length, so I am planning to explore ways to fix this issue.  Since Bril semantics for arrays are extremely limited, it should be very straightforward to take a second pass after compiling to replace each instance of unknown array length with the correct size.  It is worth noting that translating JavaScript operations which change array length, such as appending an element, must be done by instantiating an entirely new array in Bril and filling in the values from the old array.&lt;&#x2F;p&gt;
&lt;p&gt;I found working with &lt;code&gt;ts2bril&lt;&#x2F;code&gt; to be &lt;em&gt;extremely&lt;&#x2F;em&gt; painful when adding arrays, and spent far too long trying to grapple with the TypeScript type system.  For me, this project highlighted the lack of internal documentation for TypeScript compiler type structure.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;Testing was focused on correctly interpreting several new files added to the Bril &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Checkmate50&#x2F;bril&#x2F;tree&#x2F;arrays&#x2F;test&quot;&gt;test suite&lt;&#x2F;a&gt;.  These tests all work as intended; however, it is worth noting that the TypeScript array interpretation file was simplified somewhat as the limitations of the compiler became clear.  Speed of implementation was not a concern in this project.&lt;&#x2F;p&gt;
&lt;p&gt;Some evaluation details of note include floating-point exactness testing and TypeScript compilation limitations.  Some simple floating-point operations &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Checkmate50&#x2F;bril&#x2F;blob&#x2F;master&#x2F;test&#x2F;interp&#x2F;float.bril&quot;&gt;provided&lt;&#x2F;a&gt; to &lt;code&gt;brili&lt;&#x2F;code&gt; were compared to a similar C implementation (&lt;a href=&quot;https:&#x2F;&#x2F;www.onlinegdb.com&#x2F;online_c_compiler&quot;&gt;compiled online&lt;&#x2F;a&gt;) and found to be identical in precision.&lt;br &#x2F;&gt;
I originally intended to verify TypeScript compilation by interpreting the resulting code; that is, by running the command:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
ts2bril src.ts | bril2txt | bril2json | brili
&lt;&#x2F;pre&gt;
&lt;p&gt;Due to the limitations in the &lt;code&gt;ts2bril&lt;&#x2F;code&gt; compiler described above, however, I was unable to achieve this goal with my TypeScript &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Checkmate50&#x2F;bril&#x2F;blob&#x2F;arrays&#x2F;test&#x2F;ts&#x2F;array.ts&quot;&gt;array test file&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Function Calls and Property-Based Testing in Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/function-calls/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/function-calls/</guid>
                <description>&lt;p&gt;In this post, we will describe our experience extending &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;blob&#x2F;master&#x2F;README.md&quot;&gt;Bril&lt;&#x2F;a&gt; (the Big Red Intermediate Language) to include function calls. 
In addition, we share how we tested our implementation with both targeted manual tests and automated property-based testing (à la &lt;a href=&quot;http:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;QuickCheck&quot;&gt;QuickCheck&lt;&#x2F;a&gt;) with &lt;a href=&quot;https:&#x2F;&#x2F;hypothesis.works&quot;&gt;Hypothesis&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-did-bril-need-function-calls&quot;&gt;Why did Bril need function calls?&lt;&#x2F;h3&gt;
&lt;p&gt;Bril is a simple language that &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;%7Easampson&#x2F;&quot;&gt;Adrian&lt;&#x2F;a&gt; designed to be a playground for building compiler extensions and optimizations. 
While out-of-the-box Bril supports programs with multiple functions, the initial implementation lacked an instruction to actually &lt;em&gt;call&lt;&#x2F;em&gt; one function from another. 
In service of &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;&quot;&gt;this course&lt;&#x2F;a&gt;&#x27;s journey toward successively more fun compiler hacking, we set out to rectify this &amp;quot;oversight&amp;quot;. &lt;&#x2F;p&gt;
&lt;p&gt;The Bril ecosystem is centered around a JSON-based intermediate language that represents functions, labels, and instructions.
In addition, Bril includes two &lt;em&gt;frontends&lt;&#x2F;em&gt; to make for a more ergonomic programming experience—users can compile from either a more concise text-based syntax or a restricted subset of TypeScript.
For our project, we decided to restrict our scope to simple function calls (without first-class functions) so that we could update the full Bril stack.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-we-did&quot;&gt;What we did&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;surface-syntax&quot;&gt;Surface syntax&lt;&#x2F;h3&gt;
&lt;p&gt;Bril now supports function definitions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;&amp;lt;ReturnType&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;function &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt;(&amp;lt;argument name&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;lt;Type&amp;gt;, ...) { &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;instructions&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Where:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;ReturnType&amp;gt;&lt;&#x2F;code&gt;: The return type of a function must be &lt;code&gt;void&lt;&#x2F;code&gt; or one of the currently recognized Bril types: &lt;code&gt;int&lt;&#x2F;code&gt; or &lt;code&gt;bool&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;function name&amp;gt;&lt;&#x2F;code&gt;: The function&#x27;s name.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;argument name&amp;gt; : &amp;lt;Type&amp;gt;&lt;&#x2F;code&gt;: There can be zero or more arguments. Each argument name must be paired with a Bril type.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;instructions&amp;gt;&lt;&#x2F;code&gt;: This is a sequence of Bril instructions.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Bril now supports two kinds of &lt;code&gt;call&lt;&#x2F;code&gt;s, those that produce a value (value operation), and those that do not (effect operation):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;var&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; &amp;lt;variable name&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;lt;Type&amp;gt;  &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;call &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;lt;function name&amp;gt;(&amp;lt;args&amp;gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;call &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;lt;function name&amp;gt;(&amp;lt;args&amp;gt;);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For backwards compatibility, functions can still be declared without return types and arguments, as in &lt;code&gt;tests&#x2F;ts&#x2F;br.bril&lt;&#x2F;code&gt;. 
Such functions are assumed to have a return type of void.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;extended-json-representation&quot;&gt;Extended JSON representation&lt;&#x2F;h3&gt;
&lt;p&gt;We extended the JSON representation of Bril functions to account for a function&#x27;s arguments and return type. Every &lt;code&gt;Function&lt;&#x2F;code&gt; object still has a name and a list of instructions. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;name&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&amp;lt;string&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;instrs&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: [&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Instruction&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;], &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;args&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: [&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;Argument&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;], &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;lt;Type&amp;gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A function can take no arguments, in which case the &lt;code&gt;&amp;quot;args&amp;quot;&lt;&#x2F;code&gt; field contains the empty list.
The return type, represented by the &lt;code&gt;&amp;quot;type&amp;quot;&lt;&#x2F;code&gt; field, is optional. A function that does not return anything (and so has the return type &lt;code&gt;void&lt;&#x2F;code&gt;) omits the &lt;code&gt;&amp;quot;type&amp;quot;&lt;&#x2F;code&gt; field.&lt;&#x2F;p&gt;
&lt;p&gt;An &lt;code&gt;Argument&lt;&#x2F;code&gt; JSON object contains the argument&#x27;s name and type:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;name&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&amp;lt;string&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;lt;Type&amp;gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The JSON Bril &lt;code&gt;Program&lt;&#x2F;code&gt; object remains unchanged as a list of functions.&lt;&#x2F;p&gt;
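&lt;p&gt;As a rough illustration, the extended representation can be written out as Python dictionaries. The function &lt;code&gt;add&lt;&#x2F;code&gt;, its instructions (including a &lt;code&gt;ret&lt;&#x2F;code&gt; that carries an argument), and the helper &lt;code&gt;return_type&lt;&#x2F;code&gt; are hypothetical and only show the shape of the objects described above:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import json

# Hypothetical function: takes two ints and returns an int.
add_func = {
    "name": "add",
    "args": [
        {"name": "a", "type": "int"},
        {"name": "b", "type": "int"},
    ],
    "type": "int",
    "instrs": [
        {"op": "add", "dest": "sum", "type": "int", "args": ["a", "b"]},
        # Assumed form of a value-returning ret; the post does not spell it out.
        {"op": "ret", "args": ["sum"]},
    ],
}

# A void function with no arguments: "args" is empty and "type" is omitted.
main_func = {"name": "main", "args": [], "instrs": []}

# The program is still just a list of functions.
program = {"functions": [main_func, add_func]}


def return_type(func):
    """Return the declared return type, treating a missing "type" as void."""
    return func.get("type", "void")


print(json.dumps(program, indent=2))
print(return_type(add_func), return_type(main_func))
```
&lt;&#x2F;pre&gt;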
&lt;h3 id=&quot;compiling-to-json&quot;&gt;Compiling to JSON&lt;&#x2F;h3&gt;
&lt;p&gt;We extended the frontend for text-based Bril in &lt;code&gt;briltxt.py&lt;&#x2F;code&gt;.
The goal was to convert our new function definitions and call instructions to the JSON representation of Bril, which required extending both the parser and the JSON generators.&lt;&#x2F;p&gt;
&lt;p&gt;We also extended the TypeScript frontend in &lt;code&gt;ts2bril.ts&lt;&#x2F;code&gt;.
The TypeScript frontend already handled call expressions in order to translate &lt;code&gt;console.log&lt;&#x2F;code&gt; statements into Bril&#x27;s &lt;code&gt;print&lt;&#x2F;code&gt; instruction. We extended this component to also capture effectful calls that return results. In addition, the initial implementation did not support function declarations, so we added support for translating declarations and their type information.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;interpreter&quot;&gt;Interpreter&lt;&#x2F;h3&gt;
&lt;p&gt;The interpreter needed to be able to handle functions and calls in their extended JSON representation. 
Most of the work happens when the interpreter encounters a &lt;code&gt;call&lt;&#x2F;code&gt; instruction: it creates a fresh environment in which only the arguments are bound, each to its corresponding value.
The interpreter searches for the function name in the program&#x27;s list of functions since we are not implementing first-class functions.
Because we chose to represent the stack implicitly, function calls are executed with a recursive call to &lt;code&gt;evalInstr&lt;&#x2F;code&gt;, thus relying on the underlying TypeScript stack frame implementation.&lt;&#x2F;p&gt;
&lt;p&gt;Helpful compilers also need to check for errors. The interpreter now checks for a number of possible errors when calling functions. We use simple dynamic type checking to ensure that (1) argument types match the types of the provided values and (2) the function&#x27;s declared return type matches both the type of the returned value and the type of the variable where the returned value is being stored.&lt;&#x2F;p&gt;
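&lt;p&gt;The call handling and the two dynamic checks can be sketched in a few lines of Python. This is not the actual &lt;code&gt;brili&lt;&#x2F;code&gt; code, which is written in TypeScript: the names &lt;code&gt;BriliError&lt;&#x2F;code&gt;, &lt;code&gt;type_of&lt;&#x2F;code&gt;, and &lt;code&gt;eval_call&lt;&#x2F;code&gt; are hypothetical, and the &lt;code&gt;eval_body&lt;&#x2F;code&gt; callback stands in for the instruction loop of the interpreter.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
class BriliError(Exception):
    """Hypothetical stand-in for the interpreter's named error."""


def type_of(value):
    # bool must be checked first because Python bools are also ints.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    raise BriliError(f"unsupported value {value!r}")


def eval_call(functions, name, arg_values, expected_type, eval_body):
    """Bind arguments in a fresh environment, run the body, check types."""
    func = functions[name]
    params = func.get("args", [])
    if len(params) != len(arg_values):
        raise BriliError(f"{name}: wrong number of arguments")
    env = {}  # the callee sees only its own arguments
    for param, value in zip(params, arg_values):
        # Check (1): each provided value matches the declared argument type.
        if type_of(value) != param["type"]:
            raise BriliError(f"{name}: argument {param['name']} has wrong type")
        env[param["name"]] = value
    result = eval_body(func, env)  # recursion reuses the host call stack
    # Check (2): declared return type matches the value and the destination.
    declared = func.get("type", "void")
    actual = "void" if result is None else type_of(result)
    if declared != actual or declared != expected_type:
        raise BriliError(f"{name}: return type mismatch")
    return result


functions = {
    "double": {
        "name": "double",
        "args": [{"name": "x", "type": "int"}],
        "type": "int",
        "instrs": [],
    }
}
print(eval_call(functions, "double", [21], "int", lambda func, env: env["x"] * 2))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Passing &lt;code&gt;True&lt;&#x2F;code&gt; instead of &lt;code&gt;21&lt;&#x2F;code&gt;, or asking for a &lt;code&gt;bool&lt;&#x2F;code&gt; result, raises the error in this sketch rather than silently continuing.&lt;&#x2F;p&gt;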
&lt;h3 id=&quot;design-decisions&quot;&gt;Design decisions&lt;&#x2F;h3&gt;
&lt;p&gt;There were surprisingly many decisions to be made in the course of designing function calls.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;For the sake of a sufficiently scoped project, we chose not to implement first-class functions.&lt;&#x2F;li&gt;
&lt;li&gt;We implicitly represent the stack with recursive interpreter calls for simplicity based on the functionality we target.
An explicit stack would allow more interesting control flow in the future.&lt;&#x2F;li&gt;
&lt;li&gt;We chose to allow backwards compatibility with the original Bril &lt;code&gt;main&lt;&#x2F;code&gt; syntax that did not have a return type or arguments.
Similarly, the TypeScript &lt;code&gt;main&lt;&#x2F;code&gt; function is not explicitly demarcated—it is understood to consist of the instructions before any function definitions.&lt;&#x2F;li&gt;
&lt;li&gt;Calls can be effectful or non-effectful.
In the JSON representation of Bril, we chose to represent &lt;code&gt;call&lt;&#x2F;code&gt; as its own &amp;quot;kind&amp;quot; of instruction, allowing us to include the function&#x27;s name as an explicit &lt;code&gt;name&lt;&#x2F;code&gt; field in the JSON object rather than an argument to the instruction.&lt;&#x2F;li&gt;
&lt;li&gt;If multiple functions with the same name are defined or a called function is missing, the interpreter throws an error.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;main&lt;&#x2F;code&gt; functions in the text and JSON Bril representations can take arguments that are fed to &lt;code&gt;brili&lt;&#x2F;code&gt;. &lt;code&gt;main&lt;&#x2F;code&gt; also takes named and typed arguments, rather than C-style &lt;code&gt;argc&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;argv&lt;&#x2F;code&gt;.
However, &lt;code&gt;main&lt;&#x2F;code&gt; doesn&#x27;t return an exit code for simplicity.&lt;&#x2F;li&gt;
&lt;li&gt;Originally, the Bril interpreter simply threw string message exceptions on errors. We made the design decision that the interpreter should not leak interpretation details through uncaught exceptions for anticipated failures. We updated the interpreter to throw a specific exception, which is caught and sent to standard error along with a custom exit code.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
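&lt;p&gt;Two of the decisions above, rejecting missing or duplicated function names and reporting anticipated failures through an exit code, might look like the following Python sketch. The names &lt;code&gt;BriliError&lt;&#x2F;code&gt;, &lt;code&gt;find_function&lt;&#x2F;code&gt;, and &lt;code&gt;run&lt;&#x2F;code&gt; are hypothetical, and the exit code value is an assumption, since this post does not state the real one.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import sys
from collections import Counter

ERROR_EXIT_CODE = 2  # hypothetical value; the post does not state the real code


class BriliError(Exception):
    """Hypothetical named error for anticipated failures."""


def find_function(program, name):
    """Look up a function by name, rejecting missing or duplicate names."""
    counts = Counter(func["name"] for func in program["functions"])
    if counts[name] == 0:
        raise BriliError(f"function {name} is not defined")
    if counts[name] != 1:
        raise BriliError(f"function {name} is defined more than once")
    return next(f for f in program["functions"] if f["name"] == name)


def run(program, entry="main"):
    """Return an exit code instead of leaking an uncaught exception."""
    try:
        find_function(program, entry)
    except BriliError as err:
        print(f"error: {err}", file=sys.stderr)
        return ERROR_EXIT_CODE
    return 0


good = {"functions": [{"name": "main", "instrs": []}]}
duplicated = {"functions": [{"name": "main"}, {"name": "main"}]}
print(run(good), run(duplicated), run({"functions": []}))
```
&lt;&#x2F;pre&gt;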
&lt;h3 id=&quot;challenges&quot;&gt;Challenges&lt;&#x2F;h3&gt;
&lt;p&gt;The hardest part of this particular project, as with many compiler endeavors, was wrangling with new frameworks and existing code bases. 
In particular, this project was more involved than we originally expected because it touched the full Bril stack—not just the interpreter, but the text-to-JSON and JSON-to-text compilers, the TypeScript frontend, and the Turnt testing framework. &lt;&#x2F;p&gt;
&lt;p&gt;The TypeScript frontend changes were especially gnarly because the TypeScript AST does not have detailed documentation. It took us quite some time to determine, e.g., if a function call AST node stored its result to a variable. &lt;&#x2F;p&gt;
&lt;p&gt;Finally, the Hypothesis testing framework was completely new for us, so it was somewhat challenging to think of how to generate meaningful test data automatically. In the end, we settled on generating relatively simple syntactically correct programs. It would be interesting to put more time into generating richer, semantically meaningful Bril in the future as well. &lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluating-our-contribution&quot;&gt;Evaluating our contribution&lt;&#x2F;h2&gt;
&lt;p&gt;To convince ourselves that we&#x27;d actually made a useful contribution to Bril, we wanted to rigorously test our changes. 
Our evaluation was twofold: (1) manual testing at multiple abstraction levels (JSON, text-based Bril, and TypeScript), and (2) automated property-based testing to try to cover classes of errors we may not have anticipated. To support these lofty testing goals, we also had to make several tooling changes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;tooling-changes-for-testing&quot;&gt;Tooling changes for testing&lt;&#x2F;h3&gt;
&lt;p&gt;As we developed our implementation, we built up a bevy of small Bril programs that we expected to trigger certain classes of errors.
However, the check-expect-style testing framework Bril employs, &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cucapra&#x2F;turnt&quot;&gt;Turnt&lt;&#x2F;a&gt;, did not support tests that were expected to fail.
We &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cucapra&#x2F;turnt&#x2F;issues&#x2F;6&quot;&gt;extended&lt;&#x2F;a&gt; Turnt to check both standard error and program exit codes in order to test invalid Bril programs. &lt;&#x2F;p&gt;
&lt;p&gt;Turnt relies on C-style comments to configure settings on a per-test basis, so we also extended the Bril text-based surface syntax to support comments of the form &lt;code&gt;&#x2F;&#x2F; &amp;lt;comment&amp;gt;&lt;&#x2F;code&gt;. &lt;&#x2F;p&gt;
&lt;p&gt;Finally, in order for automated testing to be useful, we needed to distinguish between expected errors on invalid Bril programs and implementation flaws. 
We thus added a named exception to Bril&#x27;s interpreter with a custom exit code, removing all string-based &lt;code&gt;throw&lt;&#x2F;code&gt; calls.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;bugs-we-found-with-manual-testing&quot;&gt;Bugs we found with manual testing&lt;&#x2F;h3&gt;
&lt;p&gt;Manual testing uncovered several significant bugs. &lt;&#x2F;p&gt;
&lt;p&gt;When we were fairly confident we had finished our implementation (hah), we wrote a quick recursive factorial implementation in the TypeScript frontend:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;function &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;fac&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;number&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;number &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;var &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;result &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#795da3;&quot;&gt;fac&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;- &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;result; 
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Surprisingly, this test failed—we had forgotten that in TypeScript, function calls could be nested subexpressions! Our implementation expected functions that did not return void to be stored directly into variables. We did not have to worry about this in text-based Bril because operations can only take variables as their arguments.&lt;&#x2F;p&gt;
&lt;p&gt;Testing &lt;code&gt;void&lt;&#x2F;code&gt; functions revealed that the TypeScript compiler was expecting only annotated function types of &lt;code&gt;number&lt;&#x2F;code&gt; and &lt;code&gt;boolean&lt;&#x2F;code&gt;.
Though the legacy syntax for defining a &lt;code&gt;void&lt;&#x2F;code&gt; function—without any type annotation—compiled fine, the test showed that we had to add a check for an explicit &lt;code&gt;void&lt;&#x2F;code&gt; return type.&lt;&#x2F;p&gt;
&lt;p&gt;We also found a bug arising from the nondeterminism of Lark, the Python parser; constant operations were occasionally parsed as value operations. This was fixed with a simple upgrade to the most recent version of Lark.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we found a bug in the original TypeScript compiler (&lt;code&gt;ts2bril.ts&lt;&#x2F;code&gt;) while manually testing the argument type error messages of our function implementation.
The compiler hits an unexpected error when encountering a boolean variable declaration (with or without the type annotation):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;var &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;x &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;boolean &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;true&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We opened an &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&#x2F;issues&#x2F;25&quot;&gt;issue&lt;&#x2F;a&gt; in the main Bril repository, and the bug was subsequently fixed. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;automated-property-based-testing-with-hypothesis&quot;&gt;Automated property-based testing with &lt;a href=&quot;https:&#x2F;&#x2F;hypothesis.works&quot;&gt;Hypothesis&lt;&#x2F;a&gt;&lt;&#x2F;h3&gt;
&lt;p&gt;We were excited to try our hand at stress-testing our Bril implementation with automated testing. 
The key idea behind property-based testing is to specify some details of expected program behavior, then use a framework to test those properties on many automated examples (in particular, more than a human would reasonably want to write). This framework allows us to test many aspects of Bril, not solely the new function calls.&lt;&#x2F;p&gt;
&lt;p&gt;For Bril, we decided to use a Python-based property testing framework, &lt;a href=&quot;https:&#x2F;&#x2F;hypothesis.works&quot;&gt;Hypothesis&lt;&#x2F;a&gt;. 
The primary challenge in using such a tool is to specify &lt;em&gt;how&lt;&#x2F;em&gt; example data can be generated such that the tests are useful. 
In testing Bril, this meant specifying how to generate syntactically correct Bril programs.&lt;&#x2F;p&gt;
&lt;p&gt;Our first test checks the property that conversion from JSON Bril to text-based Bril is invertible. 
That is, we want the following high level assertion to hold:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;bril2json(bril2txt(program)) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;program
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For this test, we don&#x27;t particularly care if the programs we generate are &lt;em&gt;meaningful&lt;&#x2F;em&gt;, as long as they are of the correct syntactic form. 
We can also generate the simpler, JSON syntactic form. 
In Hypothesis, this is accomplished via &lt;em&gt;strategies&lt;&#x2F;em&gt; that tell the framework how to compose test data. 
We start with the small forms, and build up to a whole program.
For example, we can generate simple names with the following, which says that names are 1-3 lowercase Latin characters:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;names &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;text(alphabet&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;characters(min_codepoint&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;97&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, max_codepoint&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;122&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;), 
             min_size&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,
             max_size&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;3&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Instructions are built up compositionally, using a &lt;code&gt;draw&lt;&#x2F;code&gt; primitive that automatically explores the specified space of the constituent parts. For example, constant instructions are generated with:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;types &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;sampled_from([&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;int&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;bool&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;])

@composite
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;def &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#323232;&quot;&gt;bril_constant_instr&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(draw):
    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;draw(types)
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;if &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;int&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;):
        value &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;draw(sampled_from(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;range&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;100&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)))
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;elif &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;type &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;bool&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;):
        value &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;draw(sampled_from([&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;True&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;False&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]))
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;return &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;const&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;value&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: value,
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;dest&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: draw(names),
        &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here, we use a sampling primitive to choose either &lt;code&gt;int&lt;&#x2F;code&gt; or &lt;code&gt;bool&lt;&#x2F;code&gt;, then generate a numeric or boolean value as appropriate.&lt;&#x2F;p&gt;
&lt;p&gt;Along with similar composite strategies for other instruction forms (including calls) and functions, we build up many (somewhat silly) programs. Even this naive strategy found a potential bug:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;aaa&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;op&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;const&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;type&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;bool&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;value&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: True} &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;!=
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;dest&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;aaa&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;op&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;const&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;type&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;bool&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;value&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;#39;true&amp;#39;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Originally, we generated the JSON strings &lt;code&gt;true&lt;&#x2F;code&gt; and &lt;code&gt;false&lt;&#x2F;code&gt; (instead of boolean literals &lt;code&gt;True&lt;&#x2F;code&gt; and &lt;code&gt;False&lt;&#x2F;code&gt;). The &lt;code&gt;bril2txt&lt;&#x2F;code&gt; implementation parsed this correctly, which we decided to leave as-implemented, but this assured us that Hypothesis could actually find programs that were not reversible as we expected.&lt;&#x2F;p&gt;
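&lt;p&gt;To see why the string form breaks the round trip, here is a minimal sketch with a toy printer and parser for constant instructions. These two functions are hypothetical stand-ins, not the real &lt;code&gt;bril2txt&lt;&#x2F;code&gt; and &lt;code&gt;bril2json&lt;&#x2F;code&gt;: both inputs print to the same text, so the parser can only ever give back the boolean literal.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def const_to_text(instr):
    """Toy printer for a const instruction (not the real bril2txt)."""
    value = instr["value"]
    if isinstance(value, bool):
        value = "true" if value else "false"
    return f"{instr['dest']}: {instr['type']} = const {value};"


def const_from_text(line):
    """Toy parser for the same form (not the real bril2json)."""
    left, right = line.rstrip(";").split(" = const ")
    dest, type_name = (part.strip() for part in left.split(":"))
    if type_name == "bool":
        value = right == "true"
    else:
        value = int(right)
    return {"op": "const", "dest": dest, "type": type_name, "value": value}


literal = {"op": "const", "dest": "aaa", "type": "bool", "value": True}
as_string = {"op": "const", "dest": "aaa", "type": "bool", "value": "true"}

# Both print as the same text, so only the boolean literal survives the trip.
print(const_to_text(literal) == const_to_text(as_string))
print(const_from_text(const_to_text(literal)) == literal)
print(const_from_text(const_to_text(as_string)) == as_string)
```
&lt;&#x2F;pre&gt;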
&lt;p&gt;We also tested that running Hypothesis-generated programs through the &lt;code&gt;brili&lt;&#x2F;code&gt; Bril interpreter only produced clean-exit expected error cases, instead of exposing failures in the underlying TypeScript implementation. Once we changed &lt;code&gt;brili&lt;&#x2F;code&gt; to throw a specific exception, this meant testing the high-level property:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;exit_code &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;brili(program) 
exit_code &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0 &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;or &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;exit_code &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;== &amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;known_exit_code&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because we did not encode much semantic meaning into the generation strategies, almost all of the thousands of generated programs failed in the interpreter (some did execute, and print values, successfully!). Reading the generated programs also led us to realize that we were not specifically handling the case where a Bril program calls a function with multiple definitions. &lt;&#x2F;p&gt;
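&lt;p&gt;The exit-code property can be checked with a small wrapper around &lt;code&gt;subprocess.run&lt;&#x2F;code&gt;. The sketch below is only an assumption about how such a check might be written: &lt;code&gt;KNOWN_EXIT_CODE&lt;&#x2F;code&gt; is a placeholder value, and the child processes are Python stand-ins rather than the real &lt;code&gt;brili&lt;&#x2F;code&gt; command.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
import subprocess
import sys

KNOWN_EXIT_CODE = 2  # hypothetical; stands in for the interpreter's custom code


def exits_cleanly(command, stdin_text):
    """Run a command on the given input and accept only 0 or the known code."""
    proc = subprocess.run(command, input=stdin_text, capture_output=True, text=True)
    return proc.returncode in (0, KNOWN_EXIT_CODE)


# Stand-in child processes; a real test would pass the brili command instead.
ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
expected_failure = [sys.executable, "-c", "import sys; sys.exit(2)"]
crash = [sys.executable, "-c", "raise RuntimeError('bug')"]

print(exits_cleanly(ok, "{}"))
print(exits_cleanly(expected_failure, "{}"))
print(exits_cleanly(crash, "{}"))
```
&lt;&#x2F;pre&gt;
&lt;p&gt;An uncaught exception in the child yields a different exit code, so the last check fails, which is exactly the kind of implementation flaw the property is meant to surface.&lt;&#x2F;p&gt;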
&lt;p&gt;Overall, property-based testing was easier than expected to set up, and helped us explore the space of Bril programs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;&#x2F;h2&gt;
&lt;p&gt;There are several interesting directions that Bril&#x27;s function handling could take from here. We could represent the program stack and context explicitly, rather than relying on the underlying interpreter&#x27;s stack, and implement first-class and anonymous functions. We could also integrate with other projects&#x27; type checking and eliminate most of the interpreter&#x27;s dynamic checks. Finally, function calls will allow us and other Bril implementors to run more exciting programs (and optimizations) in the future.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Vector Instruction Support in the Bril Interpreter</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/interpreter-vector-support/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/interpreter-vector-support/</guid>
                <description>&lt;h2 id=&quot;problem-statement&quot;&gt;Problem Statement&lt;&#x2F;h2&gt;
&lt;p&gt;The Bril interpreter does not take advantage of the vector instructions available in modern CPUs. This presents two problems: the first is that a Bril backend cannot generate vector instructions because they are not present in the intermediate language. The second issue is that interpreting vector instructions is slow because it requires a loop over each vector element. We propose to support vector instructions in the interpreter and accelerate the interpretation of them using intrinsic vector instructions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vector-instructions&quot;&gt;Vector Instructions&lt;&#x2F;h2&gt;
&lt;p&gt;Single-Instruction Multiple-Data (SIMD) allows computer hardware to obtain significant speedups over the conventional execution paradigm, Multiple-Instruction Multiple-Data (MIMD). SIMD executes fewer instructions than MIMD to accomplish the same amount of work. In SIMD, multiple arithmetic operations are grouped together under one instruction. Generally, this arithmetic is executed over multiple execution units at the same instant (spatial SIMD), although some implementations allow the same instruction to use the same functional unit over multiple cycles (temporal SIMD). Various hardware architectures allow the programmer to support SIMD by exposing vector instructions in their ISA. Typically, there is a vector register file which holds multiple elements per register and can participate in vector arithmetic.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vector-support-in-the-bril-interpreter&quot;&gt;Vector Support in the Bril Interpreter&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;memory-support&quot;&gt;Memory Support&lt;&#x2F;h3&gt;
&lt;p&gt;Vector instructions are only useful when operating on large chunks of data. Conversely, registers are not designed to hold this amount of data. A prerequisite for vector instructions is support for data memory in the interpreter. The interpreter memory is implemented as a flat heap. The program is allowed to access any location in the fixed size memory and there is no memory management.&lt;&#x2F;p&gt;
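&lt;p&gt;A flat, fixed-size memory of this kind can be sketched in a few lines of TypeScript. This is a minimal sketch, not the interpreter&#x27;s actual code: the class name &lt;code&gt;FlatMemory&lt;&#x2F;code&gt;, the zero-initialized cells, and the bounds check are all assumptions made for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical flat memory for an interpreter: a fixed-size array of
// integer cells with bounds checks and no allocation or freeing.
class FlatMemory {
  private cells: number[] = [];

  constructor(size: number) {
    // Every cell starts at zero; the size never changes afterwards.
    for (let i = 0; i !== size; i++) {
      this.cells.push(0);
    }
  }

  private check(addr: number): void {
    // Reject non-integer addresses and addresses outside [0, size).
    if (Math.floor(addr) !== addr || 0 > addr || addr >= this.cells.length) {
      throw new Error("address out of range: " + addr);
    }
  }

  load(addr: number): number {
    this.check(addr);
    return this.cells[addr];
  }

  // Operand order follows the store instruction: value first, then address.
  store(value: number, addr: number): void {
    this.check(addr);
    this.cells[addr] = value;
  }
}

// Usage: store to address 2, then read it back.
const mem = new FlatMemory(16);
mem.store(42, 2);
const loaded = mem.load(2);
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Because nothing is ever allocated or freed, a program is free to treat any range of addresses as an array.&lt;&#x2F;p&gt;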
&lt;p&gt;Memory access requires both load and store operations. To emulate arrays, we can loop over multiple addresses and perform a load and&#x2F;or store at each location. Load and store instructions in Bril are implemented similarly to their assembly counterparts. A load takes in a memory address (an &lt;code&gt;int&lt;&#x2F;code&gt; register in Bril) and writes to a destination register. A store is an effect operation and does not write to a register. It uses two source registers: a register containing a value and a register containing an address. The Bril syntax is given below.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; load value from address 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;addr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
data&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lw addr;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; store value to address 2
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;sw data addr;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;interpreted-vector-instructions&quot;&gt;Interpreted Vector Instructions&lt;&#x2F;h3&gt;
&lt;p&gt;Vector instructions were added to the Bril interpreter. We specifically implement fixed-size vector instructions of length four (akin to native Intel &lt;code&gt;__m128&lt;&#x2F;code&gt; SSE instructions). TypeScript and JavaScript (TypeScript always compiles to JavaScript) do not support vector intrinsics in the current standard. Thus, we implement the Bril vector instructions as a loop over four values. Additionally, we add vector registers in Bril, which must be used as the operands of vector instructions. We target a vector-vector add (vvadd) program, so we include interpreter support for &lt;code&gt;vadd&lt;&#x2F;code&gt;, &lt;code&gt;vload&lt;&#x2F;code&gt;, and &lt;code&gt;vstore&lt;&#x2F;code&gt; instructions. The &lt;code&gt;vload&lt;&#x2F;code&gt; and &lt;code&gt;vstore&lt;&#x2F;code&gt; instructions move data between vector registers and the interpreter memory. The &lt;code&gt;vadd&lt;&#x2F;code&gt; instruction adds two vector registers and writes to a destination vector register. An example vvadd program is shown below.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; locations of memory (arrays of 4 elements)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
b&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
c&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; load into src vector registers
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;va&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vload a;
vb&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vload b;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; do the vector add
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vc&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vadd va vb;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; store from vector register to memory
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;vstore vc c;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;intrinsic-vector-support&quot;&gt;Intrinsic Vector Support&lt;&#x2F;h3&gt;
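&lt;p&gt;Before turning to intrinsics, it may help to see the baseline. The looped implementation of the three vector instructions might look roughly like the following sketch. The function signatures and the plain array standing in for memory are assumptions; only the loop over four lanes comes from the description above.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical names throughout; a plain array stands in for memory.
const VEC_LEN = 4;

// Read four consecutive memory cells into a new vector register value.
function vload(memory: number[], addr: number): number[] {
  const lanes: number[] = [];
  for (let i = 0; i !== VEC_LEN; i++) {
    lanes.push(memory[addr + i]);
  }
  return lanes;
}

// Element-wise add of two vector register values.
function vadd(x: number[], y: number[]): number[] {
  const lanes: number[] = [];
  for (let i = 0; i !== VEC_LEN; i++) {
    lanes.push(x[i] + y[i]);
  }
  return lanes;
}

// Write the four lanes back to consecutive memory cells.
function vstore(memory: number[], v: number[], addr: number): void {
  for (let i = 0; i !== VEC_LEN; i++) {
    memory[addr + i] = v[i];
  }
}

// Usage mirroring the vvadd program: a at 0, b at 4, c at 8.
const cells = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0];
vstore(cells, vadd(vload(cells, 0), vload(cells, 4)), 8);
// cells.slice(8) is now [11, 22, 33, 44]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Each vector instruction costs four scalar operations plus loop overhead here, which is the cost a native vector instruction would avoid.&lt;&#x2F;p&gt;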
&lt;p&gt;It&#x27;s awkward that the interpreter supports vector instructions but doesn&#x27;t actually use a native vector assembly instruction to perform the computation. We expect that the performance of the interpreted vector instructions will be poor. To explore this hypothesis, we implement an interpreted vector add instruction in three ways: 1) TypeScript, 2) serial C++, and 3) vectorized C++. We run each test for 10,000 iterations and average the execution time over five runs. We assumed 10,000 iterations was enough time for the Node (Google V8) JIT to warm up. The average time per loop iteration for each implementation is given below.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Method&lt;&#x2F;th&gt;&lt;th&gt;Time per iteration (ns)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;TypeScript&lt;&#x2F;td&gt;&lt;td&gt;317&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Serial C++&lt;&#x2F;td&gt;&lt;td&gt;16&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Vector C++&lt;&#x2F;td&gt;&lt;td&gt;9&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Using native vector instructions from C++ gives a substantial benefit: a 35x speedup over the TypeScript version. A large portion of this benefit comes from just using C++ in the first place. The serial C++ version achieves 20x better performance than TypeScript, while the vector version further improves on the serial version by a more modest 1.8x.&lt;&#x2F;p&gt;
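&lt;p&gt;The speedup figures quoted above follow directly from the table. As a quick check, the ratios can be recomputed as below; the variable names are illustrative only.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Per-iteration times in nanoseconds, copied from the table above.
const timesNs = { typescript: 317, serialCpp: 16, vectorCpp: 9 };

// Speedup of the fast version over the slow one is the ratio of times.
function speedup(slowNs: number, fastNs: number): number {
  return slowNs / fastNs;
}

// 317 / 9, about 35.2
const vectorOverTs = speedup(timesNs.typescript, timesNs.vectorCpp);
// 317 / 16, about 19.8
const serialOverTs = speedup(timesNs.typescript, timesNs.serialCpp);
// 16 / 9, about 1.78
const vectorOverSerial = speedup(timesNs.serialCpp, timesNs.vectorCpp);
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The two partial speedups multiply to the total, so most of the 35x comes from leaving TypeScript rather than from vectorization.&lt;&#x2F;p&gt;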
&lt;h3 id=&quot;c-binding-for-typescript&quot;&gt;C++ binding for TypeScript&lt;&#x2F;h3&gt;
&lt;p&gt;In order to utilize vector intrinsics, we need to call the C++ implementations from the Bril interpreter. We use &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;charto&#x2F;nbind&quot;&gt;nbind&lt;&#x2F;a&gt; to allow TypeScript to execute binaries generated from C++. These calls can add significant overhead to the execution, so we quantify this overhead to see if the approach is practical in the interpreter. Note that each time we run a single vector instruction we must make a call to the binding. We run a vector add program with various iteration counts. Each iteration does two vector loads, a vector add, and a vector store along with instructions to facilitate the iteration. We run five configurations (128, 1024, 2048, 4096, 8192) and average the execution time over five runs. We compare the execution time with and without calls to the binding (by commenting out the call). Note that the calls include passing arguments to the C++ binary, which incurs some additional overhead. On average there is a 10% overhead in the program due to the binding call. This overhead is expected to be offset by the substantial speedups offered by the C++ implementation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;We wrote multiple test programs to verify that the memory and vector instructions function as expected. &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cucapra&#x2F;turnt&quot;&gt;Turnt&lt;&#x2F;a&gt; is used to check the output of each program&#x27;s &lt;code&gt;print&lt;&#x2F;code&gt; instructions against the expected output. The vector programs could not be verified with Turnt, however, because they need to be executed in the interpreter directory to find the C++ binaries. These programs were verified by manually inspecting the output. We tested a simple store and then load, multiple stores and then multiple loads, and vvadd with both the TypeScript implementation and the C++ vector implementation. All functioned as specified in this document.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;p&gt;We evaluate the effectiveness of using intrinsic vector functions in the Bril interpreter. We run a multi-iteration vvadd program where each iteration does a single vector add of four elements. The program is shown below. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; initialize number of iterations
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;size&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;8192&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
vecSize&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; initialize data locations
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
b&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add a size;
c&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add b size;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; loop
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
vvadd_loop&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; get base addresses to add
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;ai&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add a i;
bi&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add b i;
ci&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add c i;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; do the vvadd
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;va&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vload ai;
vb&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vload bi;
vc&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vector &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vadd va vb;
vstore vc ci;

&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; iterations (increment by vector length)
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add i vecSize;
done&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: bool =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ge i size;
br done vvadd_done vvadd_loop;

vvadd_done&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Notice that the program does not initialize values in memory or print the results. We do not want to include the initialization and cleanup time in the measured runtime. In the future, we could implement a Bril timer on&#x2F;off instruction that could solve this problem. We time the execution of the interpreter using JavaScript&#x27;s &lt;code&gt;console.time&lt;&#x2F;code&gt; and &lt;code&gt;console.timeEnd&lt;&#x2F;code&gt; functions. We take care not to time the file I&#x2F;O part of the interpreter as this would dominate the runtime for the program sizes that we run.&lt;&#x2F;p&gt;
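&lt;p&gt;The timing setup described above can be sketched as follows. The helper &lt;code&gt;timed&lt;&#x2F;code&gt; and the stand-in workload are hypothetical; the only parts taken from the text are the use of &lt;code&gt;console.time&lt;&#x2F;code&gt; and &lt;code&gt;console.timeEnd&lt;&#x2F;code&gt; and the choice to keep file I&#x2F;O outside the timed region.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical helper: time only the evaluation phase, not file I/O.
function timed(label: string, work: () => number): number {
  console.time(label);
  const value = work();
  console.timeEnd(label); // prints the elapsed time for this label
  return value;
}

// Stand-in for "evaluate the already-parsed program". Reading and
// parsing the input would happen before this function is called.
function evaluateStandIn(): number {
  let sum = 0;
  for (let i = 0; i !== 1000; i++) {
    sum += i;
  }
  return sum;
}

const total = timed("interp", evaluateStandIn);
```
&lt;&#x2F;pre&gt;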
&lt;p&gt;We run vvadd with various iteration counts (128, 1024, 2048, 4096, and 8192). We average the runtime of five executions of the same program. Our baseline is the TypeScript implementation of vector instructions. The following figure shows the speedup of various implementations relative to the baseline. Our performance metric is the execution time of the whole program.&lt;&#x2F;p&gt;
&lt;img src=&quot;vector-graph.png&quot; alt=&quot;Interpreter Performance&quot; style=&quot;max-width: 100%&quot;&gt;
&lt;p&gt;The C++ implementations outperform the TypeScript implementation at smaller iteration counts. However, the performance equalizes at higher iteration counts. This is potentially due in part to the JavaScript JIT warming up on later iterations and matching the C++ generated code. However, C++ is still expected to get much better performance up to this point, as measured previously. The JIT hypothesis does not explain the full trend.&lt;&#x2F;p&gt;
&lt;p&gt;The C++ implementations have very similar execution times even though a 2x performance gap was expected. This suggests that there is a bottleneck in the interpreter apart from the C++ implementations.&lt;&#x2F;p&gt;
&lt;p&gt;Both of these results are likely due to the overhead in TypeScript. The 4th series in the graph shows the speedup if the C++ calls are removed altogether (commented out). This results in a slight speedup, but not a substantial one. We can conclude that the TypeScript runtime dominates the execution time and optimizing the C++ implementation or binding calls will have little effect on the overall performance.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;We were able to correctly implement vector instructions in the Bril interpreter. However, we were not able to obtain execution speedup of these instructions due to slowness of TypeScript. If one wanted to get speedups in the interpreter it would need to be written fully in C++ (or another fast language) rather than making fine-grained calls to C++.&lt;&#x2F;p&gt;
&lt;p&gt;In this work, vectorization was manually implemented in Bril. Future work could create a pass to automatically unroll loops and insert vector instructions.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>LLVM JIT Compiler for Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-backend-for-bril/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/llvm-backend-for-bril/</guid>
                <description>&lt;p&gt;Bril is a concise intermediate representation language that is powerful enough to describe most common arithmetic operations (e.g., add, mul, and div) as well as control flow instructions. In this project, we aim to extend the reach of Bril IR to different backend devices by compiling Bril programs to LLVM IR. We execute the generated LLVM IR via the LLVM execution engine to verify its functional correctness. Finally, we compare the runtime of LLVM JIT compilation against that of the Bril interpreter.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;methodology&quot;&gt;Methodology&lt;&#x2F;h3&gt;
&lt;p&gt;To compile a Bril program into LLVM IR, we first take the program in JSON format and have it analyzed by our compiler. The overall workflow is similar to what we do for data flow analysis in class. One thing to note is that the class has not yet covered static single assignment (SSA) form, which is an IR property requiring each variable to be assigned exactly once. In SSA form, multiple assignments to the same variable create new versions of that variable, and phi nodes are needed where branches merge different versions. LLVM IR is in SSA form, but Bril is not: multiple assignments overwrite the variable without creating new identifiers. To compile the Bril IR into SSA-form LLVM IR without constructing phi nodes, we make each assignment a unique memory store. Similarly, each variable read becomes a memory load.&lt;&#x2F;p&gt;
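&lt;p&gt;The store-per-assignment, load-per-read idea can be illustrated with a small text-emitting sketch. It is not the project&#x27;s C++ code generator: it assumes one stack slot per variable (a common front-end strategy, and possibly different from the authors&#x27; layout), handles only straight-line &lt;code&gt;const&lt;&#x2F;code&gt;, &lt;code&gt;add&lt;&#x2F;code&gt;, &lt;code&gt;mul&lt;&#x2F;code&gt;, and &lt;code&gt;sub&lt;&#x2F;code&gt;, and uses made-up names such as &lt;code&gt;lowerToLoadsAndStores&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Minimal shape of the Bril instructions handled by this sketch.
interface Instr {
  op: string;
  dest?: string;
  args?: string[];
  value?: number;
}

// Lower straight-line Bril to LLVM-style text, assuming one stack slot
// per variable (an assumption, not necessarily the authors' layout).
function lowerToLoadsAndStores(instrs: Instr[]): string[] {
  const slots: { [variable: string]: string } = {};
  const out: string[] = [];
  let temps = 0;

  // Return the variable's slot, emitting an alloca on first use.
  function slotFor(variable: string): string {
    if (slots[variable] === undefined) {
      slots[variable] = "%" + variable + ".addr";
      out.push(slots[variable] + " = alloca i64");
    }
    return slots[variable];
  }

  // Every read becomes a load into a fresh temporary.
  function read(variable: string): string {
    const temp = "%t" + temps;
    temps += 1;
    out.push(temp + " = load i64, i64* " + slotFor(variable));
    return temp;
  }

  // Every assignment becomes a store, so no phi nodes are needed.
  function write(variable: string, operand: string): void {
    out.push("store i64 " + operand + ", i64* " + slotFor(variable));
  }

  for (const instr of instrs) {
    const dest = instr.dest === undefined ? "" : instr.dest;
    const args = instr.args === undefined ? [] : instr.args;
    if (instr.op === "const") {
      write(dest, String(instr.value));
    } else if (["add", "mul", "sub"].indexOf(instr.op) !== -1) {
      const lhs = read(args[0]);
      const rhs = read(args[1]);
      const temp = "%t" + temps;
      temps += 1;
      out.push(temp + " = " + instr.op + " i64 " + lhs + ", " + rhs);
      write(dest, temp);
    } else {
      throw new Error("unsupported op in this sketch: " + instr.op);
    }
  }
  return out;
}

// Usage: the variable a is assigned twice but gets only one slot.
const lowered = lowerToLoadsAndStores([
  { op: "const", dest: "a", value: 42 },
  { op: "const", dest: "b", value: 22 },
  { op: "add", dest: "a", args: ["a", "b"] },
]);
```
&lt;&#x2F;pre&gt;
&lt;p&gt;The second assignment to &lt;code&gt;a&lt;&#x2F;code&gt; simply stores to the same slot, so every emitted temporary is still defined exactly once.&lt;&#x2F;p&gt;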
&lt;ol&gt;
&lt;li&gt;Create a basic block mapping: Given the Bril IR in JSON representation, we create empty LLVM basic blocks according to block labels. Meanwhile, we maintain a mapping between label strings and LLVM basic block pointers. We also create a flag to mark whether a basic block is used or not.&lt;&#x2F;li&gt;
&lt;li&gt;Insert instructions into blocks: We traverse the empty basic blocks and insert instructions into them. Each basic block should end with a valid terminator (i.e., jmp, br, or ret). The insertion process will terminate after encountering the first terminator. All following instructions under the same label will be ignored since this code is dead and will not be executed in any condition. &lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;label:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  br cond b1 b2;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  # inst n ignored&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  n: int = mul a b;  &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b1:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  m: int = const 5;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  print v;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b2:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  jmp end;    &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Dump LLVM code and run through JIT compilation: We allow users to dump the generated LLVM IR for easy inspection. After that, we compile the code with the LLVM execution engine and verify the outputs by comparing them against the results produced by the Bril interpreter.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
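&lt;p&gt;Steps 1 and 2 above can be sketched independently of LLVM. The following is an illustration over Bril&#x27;s flat list of labels and instructions, not the project&#x27;s C++ code: the type names, the &lt;code&gt;buildBlocks&lt;&#x2F;code&gt; function, and the name &lt;code&gt;entry&lt;&#x2F;code&gt; for the unlabeled first block are assumptions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// A Bril function body is a flat list of labels and instructions.
interface Item {
  label?: string;
  op?: string;
  args?: string[];
}

interface Block {
  instrs: Item[];
  used: boolean; // set later, when some terminator targets the block
}

const TERMINATORS = ["jmp", "br", "ret"];

// Step 1: one empty block per label, plus an unlabeled entry block
// (named "entry" here, a hypothetical choice).
// Step 2: fill each block, dropping everything after its first terminator.
function buildBlocks(items: Item[]): { [label: string]: Block } {
  const blocks: { [label: string]: Block } = {
    entry: { instrs: [], used: true },
  };
  for (const item of items) {
    if (item.label !== undefined) {
      blocks[item.label] = { instrs: [], used: false };
    }
  }

  let current = blocks["entry"];
  let terminated = false;
  for (const item of items) {
    if (item.label !== undefined) {
      // A label starts a new block and resets the terminator state.
      current = blocks[item.label];
      terminated = false;
    } else if (!terminated) {
      current.instrs.push(item);
      const op = item.op === undefined ? "" : item.op;
      terminated = TERMINATORS.indexOf(op) !== -1;
    }
    // Instructions seen after a terminator are dead and are skipped.
  }
  return blocks;
}

// Usage: the example from step 2, where the mul after br is ignored.
const built = buildBlocks([
  { label: "label" },
  { op: "br", args: ["cond", "b1", "b2"] },
  { op: "mul", args: ["a", "b"] },
  { label: "b1" },
  { op: "const" },
  { op: "print", args: ["v"] },
  { label: "b2" },
  { op: "jmp", args: ["end"] },
]);
```
&lt;&#x2F;pre&gt;
&lt;p&gt;In this sketch the block &lt;code&gt;b1&lt;&#x2F;code&gt; ends without a terminator, so a real generator would still need to append a branch to the following block before handing it to LLVM.&lt;&#x2F;p&gt;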
&lt;h3 id=&quot;implementation-details&quot;&gt;Implementation details&lt;&#x2F;h3&gt;
&lt;p&gt;In this section, we briefly describe the implementation details for each step in the LLVM code generation process. To get a global view of how different components are linked and executed in the program, we need data structures that record basic blocks and the actual values each instruction takes. Here are the data structures we create. We group them into two classes: block-level and instruction-level.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;using&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; BasicBlockFlag_T &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; pair&amp;lt;llvm::BasicBlock*, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;bool&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;using&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; BasicBlockMap_T &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; map&amp;lt;string, BasicBlockFlag_T&amp;gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;using&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; VarToVal_T &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; map&amp;lt;string, llvm::Value*&amp;gt;;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;BasicBlockFlag_T&lt;&#x2F;code&gt; is used to track whether a block has been visited or not. We will remove the unused basic blocks when traversing the whole program. Following is a simple example of removing redundant basic blocks.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;cond: bool = lt a b;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
br cond b1 b3;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b1:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  jmp end;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b2: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  jmp end;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b3:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  jmp end;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;b2&lt;&#x2F;code&gt; branch will not be reached in any condition. During the instruction insertion process, the block labeled &lt;code&gt;b2&lt;&#x2F;code&gt; will not be marked and thus we know that it is redundant. This basic optimization helps reduce the executable size.&lt;&#x2F;p&gt;
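&lt;p&gt;One way to picture the marking is as a walk over branch targets starting from the entry block. The sketch below uses a reachability worklist, which may be stricter than the flag-based marking described here, and its input format and names (&lt;code&gt;targets&lt;&#x2F;code&gt;, &lt;code&gt;usedLabels&lt;&#x2F;code&gt;, &lt;code&gt;entry&lt;&#x2F;code&gt;) are assumptions for illustration.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```typescript
// Hypothetical input: for each label, the labels its terminator targets.
// Mirrors the example above; the code before b1 is the unlabeled entry.
const targets: { [label: string]: string[] } = {
  entry: ["b1", "b3"], // br cond b1 b3
  b1: ["end"],
  b2: ["end"],
  b3: ["end"],
  end: [],
};

// Mark a block as used when the terminator of a used block names it.
function usedLabels(graph: { [label: string]: string[] }, start: string): string[] {
  const used: { [label: string]: boolean } = {};
  const worklist = [start];
  while (worklist.length !== 0) {
    const label = worklist.pop() as string;
    if (used[label] === true) {
      continue; // already visited
    }
    used[label] = true;
    for (const next of graph[label]) {
      worklist.push(next);
    }
  }
  return Object.keys(used).sort();
}

const kept = usedLabels(targets, "entry");
// kept is ["b1", "b3", "end", "entry"]; b2 is never marked.
```
&lt;&#x2F;pre&gt;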
&lt;p&gt;The &lt;code&gt;BasicBlockMap_T&lt;&#x2F;code&gt; structure constructs a mapping from block name to actual LLVM blocks. When the instruction visitor traverses the control data flow graph, we create a new LLVM basic block every time we find a new label (except for the special case where the entry basic block has no label). The unordered mapping from label name to LLVM basic blocks will be created and saved for later use.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;VarToVal_T&lt;&#x2F;code&gt; structure tracks the pointer of each destination variable.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; create a alloca llvm value using IRBuilder&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;llvm::Value&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; val &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; builder-&amp;gt;CreateAlloca(t_int_, llvm::ConstantInt::getSigned(t_int_, &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;));&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; save the allocated value into the VarToVal map&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;varToVal)[destination] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; val;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A special note on the print instruction in Bril: we create an LLVM function call with an integer return type, and pass in &lt;code&gt;%d&lt;&#x2F;code&gt; and the actual LLVM value to be printed as arguments. Then we build the call node with &lt;code&gt;CreateCall&lt;&#x2F;code&gt; on the LLVM IRBuilder so that the print function can be realized in the LLVM program.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;experiment-results&quot;&gt;Experiment Results&lt;&#x2F;h3&gt;
&lt;p&gt;Our program is in &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;seanlatias&#x2F;bril&#x2F;tree&#x2F;master&#x2F;codegen-llvm&quot;&gt;one of Bril&#x27;s forks&lt;&#x2F;a&gt;, under the &lt;code&gt;codegen-llvm&lt;&#x2F;code&gt; folder. To compile our JIT compiler, &lt;code&gt;make&lt;&#x2F;code&gt; is the only command needed. Our program takes in two variables. One is the input Bril program in JSON format and the other is the output LLVM file (usually ends with &lt;code&gt;.ll&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;To verify the correctness of the generated LLVM IR, we develop several test cases, which cover the most commonly used arithmetic and control flow instructions, as well as some corner cases where the program has redundant instructions that could be removed. One test example is shown below (it can also be found under the &lt;code&gt;codegen-llvm&lt;&#x2F;code&gt; folder):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  a: int = const 42;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  b: int = const 22;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  v: int = add a b;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  m: int = mul v b;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  cond: bool = lt a m;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  br cond b1 b2;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  # inst n removed automatically&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  n: int = mul a b; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b1:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  m: int = const 5;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  print v;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  # br to b2 inserted here&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
b2:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  jmp end;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
end:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  # print func in llvm&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  print a;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
}&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We write the program in Bril&#x27;s text representation and obtain the canonical JSON form with &lt;code&gt;bril2json&lt;&#x2F;code&gt;. Running the following commands generates and analyzes the JSON file. Our compiler then generates the LLVM code and prints the LLVM program to the destination file.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;cat test.bril | bril2json &amp;gt; test.json&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
.&#x2F;bril-llvm test.json test.llvm&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;By observing the generated LLVM code, we can see that at the very end of branch &lt;code&gt;b1&lt;&#x2F;code&gt;, a new instruction is added to avoid the issue where the basic block is missing a terminator. Moreover, the print function of Bril is transformed into an LLVM function call with corresponding variables passed in as arguments.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;  b1:                                               ; preds = %0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    %13 = alloca i64, i64 1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    store i64 5, i64* %13&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    %14 = load i64, i64* %5&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    %15 = call i64 @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @0, i32 0, i32 0), i64 %14)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    br label %b2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  b2:                                               ; preds = %b1, %0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    br label %end&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  end:                                              ; preds = %b2&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    %16 = load i64, i64* %1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    %17 = call i64 @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @1, i32 0, i32 0), i64 %16)&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    ret i32 0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
  }&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;After verifying the correctness of the code generator, we also compare the performance of the LLVM execution engine and the Bril interpreter. The performance is measured with profiling tools in Linux and C++. We run the same program 10 times and take the average runtime. For a test program with a regular for loop that performs one multiply operation per iteration for 1 billion iterations, the average runtime is 0.47 seconds for the Bril interpreter and 0.05 seconds for the LLVM execution engine. The LLVM execution engine thus achieves approximately a 10x speedup over the Bril interpreter without optimizing the loop body. We can expect a higher speedup if the loop is optimized away.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Bril()</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/making-function-calls-work/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/making-function-calls-work/</guid>
                <description>&lt;h3 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h3&gt;
&lt;p&gt;Our goal for this project was to make function calls work. In addition, we introduce function parameters, return types, optional type annotations for parameters, nested function definitions, and a simple module system. We also offer the option to pass command-line arguments to the main function.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;design&quot;&gt;Design&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;adding-support-for-function-calls&quot;&gt;Adding support for function calls&lt;&#x2F;h4&gt;
&lt;p&gt;The provided interpreter already supports the evaluation of functions. All we have to do is add a &lt;em&gt;call&lt;&#x2F;em&gt; instruction to the language grammar and link that to the interpreter. To limit function scope, we create a new environment (a map from variable names to values) every time a function is called.&lt;&#x2F;p&gt;
&lt;p&gt;Below is a Bril program that demonstrates this functionality.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# This program prints out 100 and exits. Nothing too exciting here.
main {
  call func;
}

func {
  v0: int = const 100;
  print v0;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Below is the JSON representation of the &lt;code&gt;call&lt;&#x2F;code&gt; instruction. It follows the same format as other instructions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
{
  &amp;quot;args&amp;quot;: [
    &amp;quot;func&amp;quot;
  ],
  &amp;quot;op&amp;quot;: &amp;quot;call&amp;quot;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Since the interpreter scans function definitions before executing anything, functions can be defined in any order. The above program can be rewritten as:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
func {
  v0: int = const 100;
  print v0;
}

main {
  call func;
}
&lt;&#x2F;pre&gt;&lt;h4 id=&quot;adding-function-parameters&quot;&gt;Adding function parameters&lt;&#x2F;h4&gt;
&lt;p&gt;To add support for function parameters, we first update the grammars for function definitions and calls to take whitespace-delimited lists of variable names. The interpreter&#x27;s &lt;code&gt;call&lt;&#x2F;code&gt; operation handler first extracts the values for all arguments from the current environment map. It then pre-populates the callee&#x27;s environment with the function parameters mapped to these values. This new environment is then used when evaluating the called function. Since Bril does not do any static type-checking, we do not require types to be included with function parameters.&lt;&#x2F;p&gt;
&lt;p&gt;Below is a Bril program that demonstrates the use of function parameters.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# This program also prints out 100 and exits.
main {
  v0: int = const 50;
  call print_double v0;
}

print_double x {
  v0: int = add x x;
  print v0;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Below is the JSON representation of the &lt;code&gt;call&lt;&#x2F;code&gt; instruction.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
{
  &amp;quot;args&amp;quot;: [
    &amp;quot;print_double&amp;quot;,
    &amp;quot;v0&amp;quot;
  ],
  &amp;quot;op&amp;quot;: &amp;quot;call&amp;quot;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;The first element of &lt;code&gt;&amp;quot;args&amp;quot;&lt;&#x2F;code&gt; is the function name, followed by the arguments.&lt;&#x2F;p&gt;
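&lt;p&gt;To make the calling convention concrete, here is a minimal Python sketch of a &lt;code&gt;call&lt;&#x2F;code&gt; handler. It is an illustration rather than the actual interpreter code. It assumes functions are stored in a dictionary keyed by name and that each function lists its parameter names under a hypothetical &lt;code&gt;params&lt;&#x2F;code&gt; key, which the JSON shown above does not include.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Hypothetical sketch: the "params" key and the name-keyed function table
# are assumptions made for illustration, not the interpreter's real layout.
def run(functions, name, arg_values, out):
    func = functions[name]
    # Fresh environment per call, pre-populated with parameter bindings.
    env = dict(zip(func.get("params", []), arg_values))
    for instr in func["instrs"]:
        op = instr["op"]
        if op == "const":
            env[instr["dest"]] = instr["value"]
        elif op == "add":
            left, right = instr["args"]
            env[instr["dest"]] = env[left] + env[right]
        elif op == "print":
            out.append(env[instr["args"][0]])
        elif op == "call":
            # First element is the function name, the rest are arguments.
            callee, *arg_names = instr["args"]
            # Argument values come from the caller's environment.
            values = [env[a] for a in arg_names]
            run(functions, callee, values, out)


program = {
    "main": {"instrs": [
        {"op": "const", "dest": "v0", "type": "int", "value": 50},
        {"op": "call", "args": ["print_double", "v0"]},
    ]},
    "print_double": {"params": ["x"], "instrs": [
        {"op": "add", "dest": "v0", "type": "int", "args": ["x", "x"]},
        {"op": "print", "args": ["v0"]},
    ]},
}

output = []
run(program, "main", [], output)
print(output)  # [100]
&lt;&#x2F;pre&gt;
&lt;p&gt;Because each call builds its own environment, the callee&#x27;s &lt;code&gt;v0&lt;&#x2F;code&gt; never overwrites the caller&#x27;s &lt;code&gt;v0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;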
&lt;h4 id=&quot;return-types&quot;&gt;Return types&lt;&#x2F;h4&gt;
&lt;p&gt;Without the ability to return data, our functions are not very useful. To add support for return types, we first update the grammar for function definitions to optionally take &lt;code&gt;: type&lt;&#x2F;code&gt; at the end of the header, and update the rule for &lt;code&gt;ret&lt;&#x2F;code&gt; to optionally take a variable name to return. We also overload the &lt;code&gt;call&lt;&#x2F;code&gt; operation to be both an effect operation and a value operation, since functions that do not return anything will be effect operations and those that do will be value operations.&lt;&#x2F;p&gt;
&lt;p&gt;To pass a return value back to the caller, we use the following approach. When we handle a &lt;code&gt;ret&lt;&#x2F;code&gt; operation, if there is a return value, we add a special variable &lt;code&gt;_ ret&lt;&#x2F;code&gt; to the environment and map it to the return value. Note that this will never collide with any existing variable because the name contains a space. Then, if the caller is expecting a return value (as it is in instructions of the form &lt;code&gt;v: type = call func_name&lt;&#x2F;code&gt;), it can check for the existence of &lt;code&gt;_ ret&lt;&#x2F;code&gt; in the callee&#x27;s environment. We do some basic type-checking here, comparing return types and variable types, to make the Bril programmer&#x27;s life easier.&lt;&#x2F;p&gt;
&lt;p&gt;Below is an example program that demonstrates this functionality.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Once again, we print 100.
main {
  v0: int = call get_hundred;
  print v0;
}

get_hundred: int {  # The &amp;quot;: int&amp;quot; is required.
  v0: int = const 100;
  ret v0;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Since we have made several changes to the JSON representation of the program, below is the JSON representation of the entire program.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
{
  &amp;quot;functions&amp;quot;: [
    {
      &amp;quot;instrs&amp;quot;: [
        {
          &amp;quot;args&amp;quot;: [
            &amp;quot;get_hundred&amp;quot;
          ],
          &amp;quot;dest&amp;quot;: &amp;quot;v0&amp;quot;,
          &amp;quot;op&amp;quot;: &amp;quot;call&amp;quot;,
          &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;
        },
        {
          &amp;quot;args&amp;quot;: [
            &amp;quot;v0&amp;quot;
          ],
          &amp;quot;op&amp;quot;: &amp;quot;print&amp;quot;
        }
      ],
      &amp;quot;name&amp;quot;: &amp;quot;main&amp;quot;
    },
    {
      &amp;quot;instrs&amp;quot;: [
        {
          &amp;quot;dest&amp;quot;: &amp;quot;v0&amp;quot;,
          &amp;quot;op&amp;quot;: &amp;quot;const&amp;quot;,
          &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;,
          &amp;quot;value&amp;quot;: 100
        },
        {
          &amp;quot;args&amp;quot;: [
            &amp;quot;v0&amp;quot;
          ],
          &amp;quot;op&amp;quot;: &amp;quot;ret&amp;quot;
        }
      ],
      &amp;quot;name&amp;quot;: &amp;quot;get_hundred&amp;quot;,
      &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;
    }
  ]
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Note that functions now can have a &lt;code&gt;type&lt;&#x2F;code&gt; key. For example, &lt;code&gt;get_hundred&lt;&#x2F;code&gt; has &lt;code&gt;type&lt;&#x2F;code&gt; set to &lt;code&gt;&amp;quot;int&amp;quot;&lt;&#x2F;code&gt;. Also, &lt;code&gt;ret&lt;&#x2F;code&gt; instructions can now have arguments, specified by the &lt;code&gt;args&lt;&#x2F;code&gt; key.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: To maintain backward compatibility, we omit these keys if they are empty.&lt;&#x2F;p&gt;
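&lt;p&gt;The return mechanism can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions (a name-keyed function table and a hypothetical &lt;code&gt;params&lt;&#x2F;code&gt; key), and it only checks that a return type is declared rather than comparing types as the real interpreter does.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
RET = "_ ret"  # contains a space, so no Bril variable can collide with it


def call(functions, name, arg_values):
    """Run a function and return its final environment (illustrative only)."""
    func = functions[name]
    env = dict(zip(func.get("params", []), arg_values))
    for instr in func["instrs"]:
        op = instr["op"]
        if op == "const":
            env[instr["dest"]] = instr["value"]
        elif op == "ret":
            if instr.get("args"):
                # Stash the returned value under the reserved name.
                env[RET] = env[instr["args"][0]]
            break
        elif op == "call":
            callee, *arg_names = instr["args"]
            callee_env = call(functions, callee, [env[a] for a in arg_names])
            if "dest" in instr:
                # Value-operation form: the callee must declare a type
                # and must have produced a value.
                if "type" not in functions[callee] or RET not in callee_env:
                    raise RuntimeError("function " + callee + " does not return")
                env[instr["dest"]] = callee_env[RET]
    return env


program = {
    "main": {"instrs": [
        {"op": "call", "args": ["get_hundred"], "dest": "v0", "type": "int"},
    ]},
    "get_hundred": {"type": "int", "instrs": [
        {"op": "const", "dest": "v0", "type": "int", "value": 100},
        {"op": "ret", "args": ["v0"]},
    ]},
}

final = call(program, "main", [])
print(final["v0"])  # 100
&lt;&#x2F;pre&gt;
&lt;p&gt;Calls without a &lt;code&gt;dest&lt;&#x2F;code&gt; simply ignore whatever the callee left under &lt;code&gt;_ ret&lt;&#x2F;code&gt;, which is how the same operation serves as both an effect operation and a value operation.&lt;&#x2F;p&gt;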
&lt;p&gt;Since we now have the ability to pass arguments to functions and get return values, we can now write some interesting Bril programs. Below is a Bril program that prints the 10th Fibonacci number.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Print fib(10)
main {
    v0: int = const 10;
    fib10: int = call fib v0;
    print fib10;
}

# Return true if n &amp;lt;= 1, false otherwise.
lte_one n: bool {
    one: int = const 1;
    lto: bool = le n one;
    ret lto;
}

# Return fib(n).
fib n: int {
    base: bool = call lte_one n;
    br base return continue;
return:
    ret n;
continue:
    one: int = const 1;
    prev: int = sub n one;
    prev2: int = sub prev one;
    fib1: int = call fib prev;
    fib2: int = call fib prev2;
    ans: int = add fib1 fib2;
    ret ans;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;&#x2F;strong&gt;: To call a function on the right-hand side of an assignment, the function must declare a return type. Similarly, functions should not &lt;code&gt;ret&lt;&#x2F;code&gt; values if they do not specify a return type. We choose to enforce this to improve readability. The following program will fail with the error message &lt;code&gt;function func does not return&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Fails!
main {
  v0: int = call func;
  print v0;
}

func {
  v0: int = const 100;
  ret v0;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;Changing the signature of &lt;code&gt;func&lt;&#x2F;code&gt; to &lt;code&gt;func: int&lt;&#x2F;code&gt; will fix the program.&lt;&#x2F;p&gt;
&lt;p&gt;The opposite is allowed: functions that have a declared return type &lt;em&gt;may&lt;&#x2F;em&gt; be called as standalone instructions. The following program succeeds and prints 100.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Print 100
main {
  call func;
}

func: int {
  v0: int = const 100;
  print v0;
  ret v0;
}
&lt;&#x2F;pre&gt;&lt;h4 id=&quot;optional-type-annotations-for-function-parameters&quot;&gt;Optional type annotations for function parameters&lt;&#x2F;h4&gt;
&lt;p&gt;Function definitions are now more complex, and with this added syntactic complexity comes a loss of readability. To fix this, we introduce the notion of type annotations for function parameters. Since we do not support static type checking, these type annotations serve no operational purpose. As a result, all we have to do is update the language grammar. Below is an updated version of the above program with type annotations.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Print fib(10)
main {
    v0: int = const 10;
    fib10: int = call fib v0;
    print fib10;
}

# Return true if n &amp;lt;= 1, false otherwise.
lte_one (n: int): bool {
    one: int = const 1;
    lto: bool = le n one;
    ret lto;
}

# Return fib(n). Requires: n &amp;gt;= 0.
fib (n: int): int {
    base: bool = call lte_one n;
    br base return continue;
return:
    ret n;
continue:
    one: int = const 1;
    prev: int = sub n one;
    prev2: int = sub prev one;
    fib1: int = call fib prev;
    fib2: int = call fib prev2;
    ans: int = add fib1 fib2;
    ret ans;
}
&lt;&#x2F;pre&gt;&lt;h4 id=&quot;nested-function-definitions&quot;&gt;Nested function definitions&lt;&#x2F;h4&gt;
&lt;p&gt;In the above example, &lt;code&gt;lte_one&lt;&#x2F;code&gt; is just a helper function for &lt;code&gt;fib&lt;&#x2F;code&gt; and is not used anywhere else. To avoid cluttering the global function definitions, it would be nice to only define functions where they are useful.&lt;&#x2F;p&gt;
&lt;p&gt;To do this, we introduced support for nested function definitions (i.e., function definitions within function definitions). First we add a new rule to the grammar. We add a new &lt;code&gt;instr&lt;&#x2F;code&gt; rule of the format &lt;code&gt;&amp;quot;def&amp;quot; func&lt;&#x2F;code&gt;, where &lt;code&gt;func&lt;&#x2F;code&gt; is the rule for normal function definitions. The &lt;code&gt;&amp;quot;def&amp;quot;&lt;&#x2F;code&gt; is there to avoid issues with labels (when the parser encounters &lt;code&gt;x:&lt;&#x2F;code&gt; it will not know if &lt;code&gt;x&lt;&#x2F;code&gt; is a label or a function with a return type). In the interpreter, we introduce the notion of a local function map. Previously, we had a global function map for the entire program. The local function map restricts access to a nested function to the immediate parent function.&lt;&#x2F;p&gt;
&lt;p&gt;Note that these nested function definitions are &lt;strong&gt;not&lt;&#x2F;strong&gt; closures. Nested functions cannot access variables of their parents. This is an interesting potential future extension of the language, but the current purpose of nested function definitions is to improve code organization.&lt;&#x2F;p&gt;
&lt;p&gt;We also change the &lt;code&gt;call&lt;&#x2F;code&gt; operation handler to first search the local map before searching the global map. Note that this means functions can be shadowed. Below is the above program updated with nested function definitions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Print fib(10)
main {
    v0: int = const 10;
    fib10: int = call fib v0;
    print fib10;
}

# Return fib(n). Requires: n &amp;gt;= 0.
fib (n: int): int {
    def lte_one (n: int): bool {
        one: int = const 1;
        lto: bool = le n one;
        ret lto;
    }
    base: bool = call lte_one n;
    br base return continue;
return:
    ret n;
continue:
    one: int = const 1;
    prev: int = sub n one;
    prev2: int = sub prev one;
    fib1: int = call fib prev;
    fib2: int = call fib prev2;
    ans: int = add fib1 fib2;
    ret ans;
}
&lt;&#x2F;pre&gt;&lt;h4 id=&quot;command-line-arguments&quot;&gt;Command-line arguments&lt;&#x2F;h4&gt;
&lt;p&gt;Adding support for command-line arguments is quite straightforward. We extend the interpreter to use the &lt;code&gt;process.argv&lt;&#x2F;code&gt; variable to access arguments to &lt;code&gt;brili&lt;&#x2F;code&gt;. For each argument, we simply ensure that the argument is either an integer or a string containing &amp;quot;true&amp;quot; or &amp;quot;false&amp;quot;, in which case we convert the string to the corresponding Boolean. We then pass these arguments to &lt;code&gt;main&lt;&#x2F;code&gt; the same way we would pass arguments to any other function.&lt;&#x2F;p&gt;
&lt;p&gt;We can update the above Fibonacci implementation to take in a command-line argument &lt;code&gt;n&lt;&#x2F;code&gt;, and print out &lt;code&gt;fib(n)&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
# Print fib(n)
main n {
    fibn: int = call fib n;
    print fibn;
}

# Return fib(n). Requires: n &amp;gt;= 0.
fib (n: int): int {
    def lte_one (n: int): bool {
        one: int = const 1;
        lto: bool = le n one;
        ret lto;
    }
    base: bool = call lte_one n;
    br base return continue;
return:
    ret n;
continue:
    one: int = const 1;
    prev: int = sub n one;
    prev2: int = sub prev one;
    fib1: int = call fib prev;
    fib2: int = call fib prev2;
    ans: int = add fib1 fib2;
    ret ans;
}
&lt;&#x2F;pre&gt;
&lt;p&gt;If the above program were in a file called &lt;code&gt;fib.bril&lt;&#x2F;code&gt;, we could run it by running:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
bril2json &amp;lt; fib.bril | brili [n]
&lt;&#x2F;pre&gt;&lt;h4 id=&quot;module-system&quot;&gt;Module system&lt;&#x2F;h4&gt;
&lt;p&gt;Lastly, we introduce a basic module system to allow for simple abstraction of functionality. The syntax for importing modules is &lt;code&gt;import MOD_NAME;&lt;&#x2F;code&gt;. All imports must be placed at the top of the Bril file, and &lt;code&gt;MOD_NAME.bril&lt;&#x2F;code&gt; should exist in the current working directory. After parsing the Bril file, in addition to returning the list of functions, we include a list of imported module names (we don&#x27;t do anything else with the module names; this will become clear shortly). We added a new command, named &lt;code&gt;loadbril&lt;&#x2F;code&gt;, alongside &lt;code&gt;bril2json&lt;&#x2F;code&gt; and &lt;code&gt;ts2bril&lt;&#x2F;code&gt;. This command takes the new IR representation, which includes the list of imported modules, and recursively parses and loads each module. Circular imports &lt;em&gt;are&lt;&#x2F;em&gt; supported, as this is a very basic module system that simply adds all imported functions to the global namespace. (We considered adding namespaces for modules but decided that this would best be left for a separate project.) We also detect duplicate function definitions, since the module system could make detecting these manually much more tedious.&lt;&#x2F;p&gt;
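&lt;p&gt;The loading strategy can be sketched as follows. This Python sketch is an assumption-laden illustration, not the &lt;code&gt;loadbril&lt;&#x2F;code&gt; source: it uses a worklist instead of recursion, and a hypothetical in-memory &lt;code&gt;sources&lt;&#x2F;code&gt; table stands in for parsing &lt;code&gt;MOD_NAME.bril&lt;&#x2F;code&gt; files from disk.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def load(root, sources):
    """Collect functions from root and everything it transitively imports.

    sources maps a module name to (imports, function_names). This table is
    hypothetical; a real loader would parse MOD_NAME.bril instead.
    """
    visited = set()
    functions = {}  # function name -> module that defined it
    pending = [root]
    while pending:
        module = pending.pop()
        if module in visited:
            continue  # already loaded, so circular imports terminate
        visited.add(module)
        imports, names = sources[module]
        for name in names:
            if name in functions:
                raise ValueError("duplicate function " + name)
            functions[name] = module
        pending.extend(imports)
    return functions


sources = {
    "main": (["a"], ["main"]),
    "a": (["b"], ["helper_a"]),
    "b": (["a"], ["helper_b"]),  # a and b import each other
}
loaded = load("main", sources)
print(sorted(loaded))  # ['helper_a', 'helper_b', 'main']
&lt;&#x2F;pre&gt;
&lt;p&gt;Tracking visited modules is what makes circular imports harmless here, and putting every function into one table is what makes duplicate definitions easy to detect.&lt;&#x2F;p&gt;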
&lt;hr &#x2F;&gt;
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;By adding the above features to Bril, the language becomes more complex, increasing room for error. We have added many rigorous tests to ensure the correctness of our implementation.&lt;&#x2F;p&gt;
&lt;p&gt;We considered benchmarking the performance of our language features but decided that this did not make much sense, as we did not have a baseline metric. We thought about comparing our language&#x27;s performance to that of another interpreted language such as Python but decided that there were too many variables to consider and that our results would most likely be misleading.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;basic-function-calls&quot;&gt;Basic function calls&lt;&#x2F;h4&gt;
&lt;p&gt;With basic function calls, we want to make sure that execution jumps to the desired functions. We also want to make sure programs follow standard scoping rules (variables are scoped within their respective functions). Take the following test program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
  v0: int = const 4;
  call double v0;
}

double a {
  v0: int = const 2;
  v1: int = mul a v0;
  print v1;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In the above program, &lt;code&gt;v0&lt;&#x2F;code&gt; is defined in &lt;code&gt;main&lt;&#x2F;code&gt; and &lt;code&gt;double&lt;&#x2F;code&gt;. When &lt;code&gt;double&lt;&#x2F;code&gt; is called, &lt;code&gt;v0&lt;&#x2F;code&gt; is passed in as &lt;code&gt;a&lt;&#x2F;code&gt;, and &lt;code&gt;double&lt;&#x2F;code&gt; has no knowledge of any variables in &lt;code&gt;main&lt;&#x2F;code&gt;. (Fortunately, the initial Bril interpreter handles basic function scoping.)&lt;&#x2F;p&gt;
&lt;h4 id=&quot;function-parameters&quot;&gt;Function parameters&lt;&#x2F;h4&gt;
&lt;p&gt;We evaluate the correctness of function parameters in two stages.&lt;&#x2F;p&gt;
&lt;p&gt;First, we need to ensure that the new function definition grammar is parsed correctly. Each function definition accepts a variable number of parameters, each of which can be given a type annotation. We test this by defining functions with 0, 1, and 10+ parameters and making sure the parser output is as expected.&lt;&#x2F;p&gt;
&lt;p&gt;Second, we need to ensure that we are able to pass arguments. To test this, we call several functions that take varying numbers of arguments and ensure that calling them with different arguments executes correctly.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;return-statements&quot;&gt;Return statements&lt;&#x2F;h4&gt;
&lt;p&gt;Return statements add another layer of complexity: now we want to pass data out of functions. The testing scheme here is similar to the one for function parameters.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;nested-functions&quot;&gt;Nested functions&lt;&#x2F;h4&gt;
&lt;p&gt;For nested functions, the primary test is whether nested functions behave like normal functions within the scope of the parent. To check this, we test the following cases:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;one function nested in another function,&lt;&#x2F;li&gt;
&lt;li&gt;two functions nested in one function, &lt;&#x2F;li&gt;
&lt;li&gt;one nested function inside a nested function of another function, and&lt;&#x2F;li&gt;
&lt;li&gt;nested function name shadowing.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We designed nested functions in Bril to obey lexical scope, and the cases above are meant to cover each nesting and shadowing pattern that this scoping allows. The recursive Fibonacci implementation above gives a pair of equivalent Bril programs, one without nested function definitions and one with them.&lt;&#x2F;p&gt;
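&lt;p&gt;The shadowing case follows directly from the lookup order described in the design section: local definitions are searched before global ones. A minimal Python sketch of that lookup, with hypothetical names and strings standing in for function bodies, might look like this.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def resolve(name, local_functions, global_functions):
    """Return the function a call refers to: local definitions win."""
    if name in local_functions:
        return local_functions[name]
    if name in global_functions:
        return global_functions[name]
    raise KeyError("undefined function " + name)


# Strings stand in for function bodies in this sketch.
global_functions = {"fib": "global fib", "lte_one": "global lte_one"}
# Functions defined with def inside fib; visible only while running fib.
fib_locals = {"lte_one": "nested lte_one"}

print(resolve("lte_one", fib_locals, global_functions))  # nested lte_one
print(resolve("lte_one", {}, global_functions))          # global lte_one
print(resolve("fib", fib_locals, global_functions))      # global fib
&lt;&#x2F;pre&gt;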
&lt;h4 id=&quot;modules&quot;&gt;Modules&lt;&#x2F;h4&gt;
&lt;p&gt;The most important aspects of our testing plan for the module system are circular imports and duplicate function definitions across modules. Since our module system is quite simple and circular imports are allowed, we simply test programs that import in a circular fashion. Some of these cases include:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt; import each other,&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;a&lt;&#x2F;code&gt;, &lt;code&gt;b&lt;&#x2F;code&gt;, and &lt;code&gt;c&lt;&#x2F;code&gt; all import each other, and&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;a&lt;&#x2F;code&gt; imports &lt;code&gt;b&lt;&#x2F;code&gt;, &lt;code&gt;b&lt;&#x2F;code&gt; imports &lt;code&gt;c&lt;&#x2F;code&gt;, &lt;code&gt;c&lt;&#x2F;code&gt; imports &lt;code&gt;d&lt;&#x2F;code&gt;, and &lt;code&gt;d&lt;&#x2F;code&gt; imports &lt;code&gt;a&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;To test duplicate function definitions (which our module system should detect and complain about), we test duplicate function definitions across imported modules and make sure that the module loader raises an exception.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;command-line-arguments-1&quot;&gt;Command-line arguments&lt;&#x2F;h4&gt;
&lt;p&gt;To test command-line arguments, we run programs taking various numbers of arguments of both integer and Boolean types.&lt;&#x2F;p&gt;
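&lt;p&gt;The argument conversion being tested here can be sketched in Python as below. The function names are hypothetical, and the real interpreter reads its arguments from &lt;code&gt;process.argv&lt;&#x2F;code&gt;; its exact rules for what counts as an integer may differ from Python&#x27;s &lt;code&gt;int&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
def parse_argument(text):
    """Convert one command-line string to a Bril value (int or bool)."""
    if text == "true":
        return True
    if text == "false":
        return False
    try:
        return int(text)
    except ValueError:
        raise ValueError("expected an integer, true, or false: " + text)


def parse_arguments(argv):
    """Convert every argument; the results are passed to main in order."""
    return [parse_argument(a) for a in argv]


print(parse_arguments(["10", "true", "-3"]))  # [10, True, -3]
&lt;&#x2F;pre&gt;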
&lt;hr &#x2F;&gt;
&lt;h3 id=&quot;notable-challenges&quot;&gt;Notable challenges&lt;&#x2F;h3&gt;
&lt;p&gt;One notable challenge we encountered when implementing our language features was designing the loading system for modules. We settled on a rather simple design (see above), but tried several other implementations first. We initially tried loading modules as soon as the &lt;code&gt;import&lt;&#x2F;code&gt; statement was encountered. This made it difficult to detect circular imports. We also tried updating the &lt;code&gt;loadbril&lt;&#x2F;code&gt; command to take the list of module files, but found this to be detrimental to usability, as running programs that imported many modules would require enumerating the filenames of all the modules. We also considered adding namespaces for modules, as discussed in the design section, but could not settle on a clean design.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Manually Managed Memory in Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/manually-managed-memory/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/manually-managed-memory/</guid>
                <description>&lt;h3 id=&quot;pointers-and-heap-allocated-memory&quot;&gt;Pointers and Heap Allocated Memory&lt;&#x2F;h3&gt;
&lt;p&gt;Our goal was to add &lt;em&gt;pointer types&lt;&#x2F;em&gt; to Bril. &lt;em&gt;Pointers&lt;&#x2F;em&gt; represent references to
manually managed read&#x2F;write memory cells which can persist outside of function
scope. Furthermore we support C-style arrays such that pointer arithmetic
instructions can be used to index into allocated memory regions. Lastly, although we
did not implement a typechecker, we
wished to ensure that value typechecking was still supportable for our new
instructions. Our pointer types are
meant only for value checking (i.e., every pointer type totally specifies the
type of its contents); they do not include bounds or alias information to
prevent memory safety bugs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;design-overview&quot;&gt;Design Overview&lt;&#x2F;h3&gt;
&lt;p&gt;We added manually managed memory and typed pointers to Bril while keeping the
layout of data hidden from Bril programs. The API for working with these
heap-allocated pointers was inspired by &lt;a href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;LangRef.html#memory-access-and-addressing-operations&quot;&gt;LLVM&#x27;s manually allocated stack pointer
API&lt;&#x2F;a&gt;.
Pointer types include the data type to which they refer: no &lt;code&gt;void*&lt;&#x2F;code&gt; magic
allowed. Furthermore, the representation of data in memory is abstract. In
a type system that just consists of integers, booleans, and pointers this is not
too strong a statement, but something like LLVM&#x27;s &lt;code&gt;getelementptr&lt;&#x2F;code&gt; would allow
structs to be added while still hiding type sizes from Bril programs.
This depends on the fact that, unlike in C, you cannot do bytewise arithmetic on
Bril pointers to determine the size of things in memory or extract the value of
a pointer as an integer address.&lt;&#x2F;p&gt;
&lt;p&gt;This design leaves the Bril interpreter&#x2F;compiler complete freedom to choose how
data, including pointers, are represented and allocated; this separates the high-level
computational usage of pointers from the low-level (and likely
platform-dependent) implementation details while still providing usable manually
managed memory to programs.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;pointer-syntax-representation&quot;&gt;Pointer Syntax &amp;amp; Representation&lt;&#x2F;h4&gt;
&lt;p&gt;We expanded the Bril JSON type syntax with &lt;code&gt;{ &amp;quot;ptr&amp;quot; : TYPE }&lt;&#x2F;code&gt;, which denotes a pointer
to a value in memory of type &lt;code&gt;TYPE&lt;&#x2F;code&gt;. The corresponding Bril textual representation
of a pointer type is &lt;code&gt;ptr&amp;lt;TYPE&amp;gt;&lt;&#x2F;code&gt;.
For the rest of this article, we&#x27;ll use the more concise text format.
There is no additional syntax for pointer values, since pointer representation
is abstract: the only way to produce something of type &lt;code&gt;ptr&amp;lt;T&amp;gt;&lt;&#x2F;code&gt; is by using the
language&#x27;s memory allocator. There is no address-of operator (like C&#x27;s &lt;code&gt;&amp;amp;&lt;&#x2F;code&gt;)
either.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;allocating-memory&quot;&gt;Allocating Memory&lt;&#x2F;h4&gt;
&lt;p&gt;We added a typed &lt;code&gt;alloc&lt;&#x2F;code&gt; primitive to Bril.
Bril&#x27;s &lt;code&gt;alloc&lt;&#x2F;code&gt; works like C&#x27;s &lt;code&gt;malloc&lt;&#x2F;code&gt;, but the argument passed to &lt;code&gt;alloc&lt;&#x2F;code&gt;
represents the number of elements to allocate, rather than the number of bytes
to allocate.&lt;&#x2F;p&gt;
&lt;p&gt;In Bril, allocating a pointer to 10 ints looks like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt; ten&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 myptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloc ten;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Doing the same in C would require invoking the &lt;code&gt;sizeof&lt;&#x2F;code&gt; operator to determine
how much space an &lt;code&gt;int&lt;&#x2F;code&gt; takes up in memory, which tells the program
something about the representation of data in memory. Because Bril&#x27;s allocator
counts elements rather than bytes, it avoids this.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;modifying-memory&quot;&gt;Modifying Memory&lt;&#x2F;h4&gt;
&lt;p&gt;Like in assembly languages, Bril pointers can be used to access memory through
&lt;code&gt;load&lt;&#x2F;code&gt; and &lt;code&gt;store&lt;&#x2F;code&gt; instructions. These operations take a pointer as their first
argument and work like pointer dereferencing in C. &lt;em&gt;Loads&lt;&#x2F;em&gt; correspond to
read operations and &lt;em&gt;stores&lt;&#x2F;em&gt; correspond to writes. As an example, both of the
following programs will print the value &lt;code&gt;4&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;C Implementation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; myptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;* &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;myptr &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
printf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;%d\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;myptr);
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Bril Implementation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt; ten&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 four&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 myptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloc ten;
 store myptr four;
 v&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load myptr;
 print v;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;pointer-arithmetic&quot;&gt;Pointer Arithmetic&lt;&#x2F;h4&gt;
&lt;p&gt;So far we can only use loads and stores to access the first cell in our
allocated memory region, since that is where the pointer returned by &lt;code&gt;alloc&lt;&#x2F;code&gt;
points.&lt;&#x2F;p&gt;
&lt;p&gt;To index into the memory region, Bril programs can use our typed &lt;code&gt;ptradd&lt;&#x2F;code&gt;
instruction, which allows a program to add an integer to a pointer to produce
a new pointer.&lt;&#x2F;p&gt;
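&lt;p&gt;One way to picture these operations is with a small Python model. This is only a sketch under our own assumptions: the &lt;code&gt;Heap&lt;&#x2F;code&gt; class and the (region, index) pointer representation are hypothetical, and the bounds check is a choice of this sketch rather than something Bril&#x27;s pointer types guarantee.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
class Heap:
    """Toy memory model: a pointer is a (region id, element index) pair."""

    def __init__(self):
        self.regions = {}   # region id -> list of cells
        self.next_base = 0

    def alloc(self, count):
        """Allocate count cells (elements, not bytes); return a pointer."""
        base = self.next_base
        self.next_base += 1
        self.regions[base] = [None] * count
        return (base, 0)

    def ptradd(self, pointer, offset):
        """Return a new pointer offset elements further into the region."""
        base, index = pointer
        return (base, index + offset)

    def _cell(self, pointer):
        base, index = pointer
        cells = self.regions[base]
        if index not in range(len(cells)):
            raise IndexError("pointer out of bounds")
        return cells, index

    def store(self, pointer, value):
        cells, index = self._cell(pointer)
        cells[index] = value

    def load(self, pointer):
        cells, index = self._cell(pointer)
        return cells[index]


heap = Heap()
myptr = heap.alloc(10)
heap.store(myptr, 4)
myptr_1 = heap.ptradd(myptr, 1)   # points at the second element
heap.store(myptr_1, 7)
print(heap.load(myptr), heap.load(myptr_1))  # 4 7
&lt;&#x2F;pre&gt;
&lt;p&gt;Since the program never sees the pair itself, an implementation is free to swap this representation for raw addresses without changing any Bril code.&lt;&#x2F;p&gt;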
&lt;p&gt;The code snippets below access the second element of some already allocated
memory region.&lt;&#x2F;p&gt;
&lt;p&gt;C Implementation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; myptr;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; ... allocate memory ...
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;printf(&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;%d\n&lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;, myptr[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;]); &lt;&#x2F;span&gt;&lt;span style=&quot;font-style:italic;color:#969896;&quot;&gt;&#x2F;&#x2F; myptr[1] == *(myptr + 1); C scales the offset by sizeof(int)
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Bril Implementation:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;one&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
myptr_1&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptradd myptr one;
v&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load myptr_1;
print v;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;deallocating-memory&quot;&gt;Deallocating Memory&lt;&#x2F;h4&gt;
&lt;p&gt;In general, &lt;code&gt;free&lt;&#x2F;code&gt; in Bril works exactly the same as it does in C. You can use
any reference to the same allocation to free it; however, double frees or
free-ing a pointer which doesn&#x27;t refer to the beginning of an allocation are
illegal. That means the following programs both result in bad behavior at
runtime:&lt;&#x2F;p&gt;
&lt;p&gt;Error Program 1:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;ten&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
myptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloc ten;
free myptr;
free myptr;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Error Program 2:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;ten&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
myptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloc ten;
myptr_10&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptradd myptr ten;
free myptr_10;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Furthermore, (also like C) Bril does not prevent memory leaks by default. In
other words, programs may &lt;code&gt;alloc&lt;&#x2F;code&gt; memory that they never &lt;code&gt;free&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For a larger example of how pointers can be used in Bril, the following C code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vals &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#62a35c;&quot;&gt;malloc&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;sizeof&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;)&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;*&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;);
vals[&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;for &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;(&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;= &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;&amp;lt; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;; i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;++&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;) {
  vals[i] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; vals[i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;] &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;+ &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Would be roughly equivalent to the following Bril code:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt; ten&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;10&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 zero&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;0&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 one&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 neg_one&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const -&lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 four&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;4&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 vals&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; alloc ten;
 store vals zero;
 i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int = const &lt;&#x2F;span&gt;&lt;span style=&quot;color:#0086b3;&quot;&gt;1&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;;
 i_minus_one&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add i neg_one;
loop&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
 cond&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: bool =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; lt i ten;
 br cond body done;
body&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
 vals_i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptradd vals i;
 vals_i_minus_one&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptr&amp;lt;&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;int&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;&amp;gt; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;=&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptradd vals i_minus_one;
 tmp&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; load vals_i_minus_one;
 tmp&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add tmp four;
 store vals_i tmp;
 i&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add i one;
 i_minus_one&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: int =&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; add i_minus_one one;
 jmp loop;
done&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
 free vals;
 ret;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;We implemented our design by extending the Bril text parser to support
pointer types and the Bril interpreter to support pointers,
heap memory, and runtime error checking.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;pointer-representation&quot;&gt;Pointer Representation&lt;&#x2F;h4&gt;
&lt;p&gt;A Bril pointer is represented at runtime by a value with two fields: a &lt;em&gt;Key&lt;&#x2F;em&gt; that points
into the &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;manually-managed-memory&#x2F;#heap-memory&quot;&gt;&lt;em&gt;Heap&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;, and a &lt;em&gt;tag&lt;&#x2F;em&gt; that tells the runtime which kind
of data the memory cell should be used for. Type tags can be &lt;code&gt;&amp;quot;int&amp;quot;&lt;&#x2F;code&gt;, &lt;code&gt;&amp;quot;bool&amp;quot;&lt;&#x2F;code&gt;
or &lt;code&gt;&amp;quot;ptr&amp;quot;&lt;&#x2F;code&gt;. Type tags are checked whenever a Bril program does a &lt;code&gt;store&lt;&#x2F;code&gt; to make
sure that the cells of the allocated memory were allocated to store something of
that type. All pointers, regardless of what they point to, have the same type
tag. This is because all pointers have the same in-memory representation, so
ensuring a cell for pointers is only ever used for pointers can be done without
worrying about what type of data the pointer points to.&lt;&#x2F;p&gt;
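&lt;p&gt;As an illustrative sketch of the tag check (the variable names are hypothetical), a program like the following should be rejected at the &lt;code&gt;store&lt;&#x2F;code&gt;, because &lt;code&gt;five&lt;&#x2F;code&gt; is an &lt;code&gt;int&lt;&#x2F;code&gt; while the cell behind &lt;code&gt;flags&lt;&#x2F;code&gt; was allocated to hold a &lt;code&gt;bool&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;one: int = const 1;
five: int = const 5;
flags: ptr&amp;lt;bool&amp;gt; = alloc one;
store flags five;
free flags;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;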
&lt;h4 id=&quot;heap-memory&quot;&gt;Heap Memory&lt;&#x2F;h4&gt;
&lt;p&gt;Memory itself is represented by the &lt;code&gt;Heap&lt;&#x2F;code&gt; data structure. It implements all the
operations exposed to Bril programs. Concretely, the &lt;code&gt;Heap&lt;&#x2F;code&gt; implementation is
a JavaScript &lt;code&gt;Map&lt;&#x2F;code&gt; that maps &lt;code&gt;number&lt;&#x2F;code&gt; keys to arrays of objects. Its &lt;code&gt;alloc&lt;&#x2F;code&gt;
method extends the map to map a fresh key &lt;code&gt;number&lt;&#x2F;code&gt; to a new array of the
requested size and returns an opaque &lt;em&gt;Key&lt;&#x2F;em&gt; wrapping the &lt;code&gt;number&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;em&gt;Key&lt;&#x2F;em&gt; is a pair of a &lt;code&gt;base&lt;&#x2F;code&gt; and an &lt;code&gt;offset&lt;&#x2F;code&gt;. The &lt;code&gt;base&lt;&#x2F;code&gt; is the key used to
look up the array in the heap &lt;code&gt;Map&lt;&#x2F;code&gt; and the &lt;code&gt;offset&lt;&#x2F;code&gt; is an index into the array.
Freshly allocated pointers have &lt;code&gt;offset == 0&lt;&#x2F;code&gt;, and pointer arithmetic can only
change the &lt;code&gt;offset&lt;&#x2F;code&gt;. In this way, we keep track of which allocation any given
pointer belongs to, regardless of &amp;quot;where&amp;quot; it may point in memory. Notably this
implementation of a Heap does not model &amp;quot;a single contiguous memory space&amp;quot;; each
allocation represents a contiguous space and allocations are otherwise
unrelated.&lt;&#x2F;p&gt;
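&lt;p&gt;To make the &lt;code&gt;base&lt;&#x2F;code&gt; and &lt;code&gt;offset&lt;&#x2F;code&gt; concrete, consider this small sketch. The variable names are hypothetical, and the base numbers shown afterwards are only an assumption for illustration, since the actual key values are an implementation detail:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;three: int = const 3;
one: int = const 1;
a: ptr&amp;lt;int&amp;gt; = alloc three;
b: ptr&amp;lt;int&amp;gt; = alloc three;
a_1: ptr&amp;lt;int&amp;gt; = ptradd a one;
a_2: ptr&amp;lt;int&amp;gt; = ptradd a_1 one;
free a;
free b;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Assuming the first allocation receives base 0 and the second receives base 1, the four pointers would carry these keys:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;a    base 0, offset 0
a_1  base 0, offset 1
a_2  base 0, offset 2
b    base 1, offset 0
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because &lt;code&gt;ptradd&lt;&#x2F;code&gt; never changes the base, no chain of &lt;code&gt;ptradd&lt;&#x2F;code&gt; instructions starting from &lt;code&gt;a&lt;&#x2F;code&gt; can produce a pointer into the allocation behind &lt;code&gt;b&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;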
&lt;p&gt;The &lt;code&gt;free&lt;&#x2F;code&gt; method deletes entries from the internal map, so we are relying on
the base JavaScript Map implementation and the JavaScript runtime garbage
collection to actually free physical memory dynamically. We don&#x27;t implement any
interesting memory allocation strategies based on physical memory layout and
simply let the runtime do the work. While smart, type-aware memory allocators
are an interesting area of performance optimization, we felt that rabbit hole
was outside the scope of this small project.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;notable-challenges&quot;&gt;Notable Challenges&lt;&#x2F;h3&gt;
&lt;p&gt;In an interpreter setting, where we can rely on another language runtime to do
the real physical memory allocation and garbage collection for us, this is not
a terribly complex addition to Bril. However, there were a few small details
that were tricky to get exactly right.&lt;&#x2F;p&gt;
&lt;p&gt;Firstly, we had to handle how to parse and represent new types that are
parameterized on other types. We wanted to syntactically enforce, with the
parser, that pointer types had to be fully specified.&lt;&#x2F;p&gt;
&lt;p&gt;The original type parsing specification looked like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; CNAME
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And created an AST node where the type was specified as a &lt;code&gt;string&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Our new version looks like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;ptr&amp;lt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; ptrtype &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;&amp;gt;&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;|&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; basetype
ptrtype&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; type
basetype&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt; CNAME
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And we create an AST node where the type is specified as a (potentially nested)
object with one field named &amp;quot;ptr&amp;quot;. For example, the AST representation of a node
with type &lt;code&gt;ptr&amp;lt;bool&amp;gt;&lt;&#x2F;code&gt; looks like&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;{ ptr&lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#a71d5d;&quot;&gt;: &lt;&#x2F;span&gt;&lt;span style=&quot;color:#183691;&quot;&gt;&amp;quot;bool&amp;quot; &lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
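&lt;p&gt;The nesting extends to pointers to pointers. As a sketch that simply follows the same scheme (it is not copied from parser output), a node with type &lt;code&gt;ptr&amp;lt;ptr&amp;lt;int&amp;gt;&amp;gt;&lt;&#x2F;code&gt; would presumably look like&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type: { ptr: { ptr: &amp;quot;int&amp;quot; } }
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;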
&lt;p&gt;We could have decided to maintain the pointer type abstract representation as
a string, but that would require re-parsing that type string repeatedly
throughout the interpreter. While we avoided that problem, we now had to deal
with the annoyance that some types were represented as strings and others were
JSON objects. This led to a fair bit of refactoring and slightly tricky runtime
typechecking, but we decided it was worth it compared to having to do
sophisticated string pattern matching at runtime.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;We evaluated our implementation through qualitative testing rather than
quantitative measurement. While one might argue we should measure memory
allocation performance, we would really just be measuring how well the
JavaScript runtime allocates memory and garbage collects.&lt;&#x2F;p&gt;
&lt;p&gt;We wanted to evaluate the correctness of our code and
see if we threw reasonable errors under all erroneous conditions. We created
a number of test cases that stress pointer arithmetic (similar to the large
example presented earlier in this post). Key features to test were:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Allocating memory of various sizes&lt;&#x2F;li&gt;
&lt;li&gt;Reading &amp;amp; writing memory&lt;&#x2F;li&gt;
&lt;li&gt;Re-writing memory&lt;&#x2F;li&gt;
&lt;li&gt;Ensuring that pointers to pointers function correctly&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Additionally, we needed to check a number of &amp;quot;bad&amp;quot; cases, which we expected the
interpreter to catch and report as errors with reasonable error messages:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Passing non-pointers to &lt;code&gt;load&lt;&#x2F;code&gt;, &lt;code&gt;store&lt;&#x2F;code&gt; or &lt;code&gt;free&lt;&#x2F;code&gt; operations&lt;&#x2F;li&gt;
&lt;li&gt;Allocating pointers with non-positive size&lt;&#x2F;li&gt;
&lt;li&gt;Failing to free memory by the end of the program&lt;&#x2F;li&gt;
&lt;li&gt;Freeing the same allocation multiple times&lt;&#x2F;li&gt;
&lt;li&gt;Trying to free a pointer into the middle of an allocation region&lt;&#x2F;li&gt;
&lt;li&gt;Accessing memory &amp;quot;out of bounds&amp;quot; of a given allocation&lt;&#x2F;li&gt;
&lt;li&gt;Writing the wrong type of data into a pointer (e.g. store &lt;code&gt;int&lt;&#x2F;code&gt; into
a &lt;code&gt;ptr&amp;lt;bool&amp;gt;&lt;&#x2F;code&gt;)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
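&lt;p&gt;As one concrete sketch of the out-of-bounds case (the variable names are hypothetical), the program below allocates two cells and then tries to read a third. Judging by the &lt;code&gt;free&lt;&#x2F;code&gt; example earlier, the &lt;code&gt;ptradd&lt;&#x2F;code&gt; itself presumably succeeds, and the error would be reported at the &lt;code&gt;load&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;two: int = const 2;
myptr: ptr&amp;lt;int&amp;gt; = alloc two;
past_end: ptr&amp;lt;int&amp;gt; = ptradd myptr two;
v: int = load past_end;
free myptr;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;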
&lt;p&gt;&lt;em&gt;N.B. The interpreter may exhibit arbitrarily bad
behavior upon ill-typed inputs and the existing implementation
does allow some bad behaviors. Our memory operations do a
fair amount of dynamic type checking to avoid writing the wrong
kind of data into a memory cell but &lt;code&gt;load&lt;&#x2F;code&gt; operations can
still read data of the wrong type for a given destination variable.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;All of our tests pass. For fun, we also included tests to stress memory allocation,
for example by allocating and free-ing in a tight loop and by allocating very
large amounts of memory. Those pass too, but some of them take a few seconds to complete. :)&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Record Types!</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/recordtypes/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/recordtypes/</guid>
                <description>&lt;p&gt;The goal was to design and implement record types (aka structs). We decided on immutable record types using a nominal type system. We initially planned to implement named record type declarations, record instantiation, and record field access, but later decided to also implement &lt;em&gt;with statements&lt;&#x2F;em&gt; to improve usability. To achieve this, we extended the Bril interpreter &lt;code&gt;brili.ts&lt;&#x2F;code&gt; and added new types to the language definition in &lt;code&gt;bril.ts&lt;&#x2F;code&gt;. The code below is given in human-readable Bril, and each semantic addition is followed by its JSON representation. We also upgraded the &lt;code&gt;bril2json&lt;&#x2F;code&gt; tool so that it translates these human-readable additions into JSON.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;immutability&quot;&gt;Immutability&lt;&#x2F;h3&gt;
&lt;p&gt;We decided to make record types immutable as it is considered best practice to make value types immutable. One key reason is that mutation of a value type only changes that specific copy. This is morally equivalent to creating a new record as the other copies remain unchanged. Therefore, when thinking about values, it is logical to think of this new record as a different value, and thus, not a mutation at all.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;declaring-a-record-type&quot;&gt;Declaring a Record Type&lt;&#x2F;h3&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
type &amp;lt;record type name&amp;gt;  = 
    {&amp;lt;field1 name&amp;gt; : &amp;lt;field1 type&amp;gt; ; &amp;lt;field2 name&amp;gt; : &amp;lt;field2 type&amp;gt; ; … };
&lt;&#x2F;pre&gt;
&lt;p&gt;Where
&lt;code&gt;type&lt;&#x2F;code&gt; is a new keyword,
&lt;code&gt;&amp;lt;record type name&amp;gt;&lt;&#x2F;code&gt; is an identifier,
&lt;code&gt;&amp;lt;field# name&amp;gt;&lt;&#x2F;code&gt; is an identifier, and
&lt;code&gt;&amp;lt;field# type&amp;gt;&lt;&#x2F;code&gt; is a type name, which may be either a primitive type or a previously declared record type.&lt;&#x2F;p&gt;
&lt;p&gt;In JSON:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;recordname&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;record type name&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;recorddef&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;fields&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;&amp;lt;field1 name&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;field1 type&amp;gt;&amp;quot;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;&amp;lt;field2 name&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;field2 type&amp;gt;&amp;quot;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We decided on this format to mirror OCaml &lt;a href=&quot;https:&#x2F;&#x2F;v1.realworldocaml.org&#x2F;v1&#x2F;en&#x2F;html&#x2F;records.html&quot;&gt;record type declarations&lt;&#x2F;a&gt;. However, unlike OCaml, we disallow recursive record types (i.e. a record type that may contain itself) as that would require complicated recursive types as well as a notion of nullable references, which are outside of the scope of this project.&lt;&#x2F;p&gt;
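&lt;p&gt;As a small worked instance of the template above (the &lt;code&gt;Point&lt;&#x2F;code&gt; type is a hypothetical example), the declaration&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Point = {x: int; y: int};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;would correspond to JSON along these lines:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &amp;quot;recordname&amp;quot;: &amp;quot;Point&amp;quot;,
    &amp;quot;op&amp;quot;: &amp;quot;recorddef&amp;quot;,
    &amp;quot;fields&amp;quot;: {
        &amp;quot;x&amp;quot;: &amp;quot;int&amp;quot;,
        &amp;quot;y&amp;quot;: &amp;quot;int&amp;quot;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;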
&lt;h3 id=&quot;instantiation&quot;&gt;Instantiation&lt;&#x2F;h3&gt;
&lt;p&gt;To instantiate a new record with a previously declared record type, we use the following format:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&amp;lt;variable name&amp;gt;  : &amp;lt;record type&amp;gt; = 
    record {&amp;lt;field1 name&amp;gt; : &amp;lt;field1 value&amp;gt;; &amp;lt;field2 name&amp;gt; : &amp;lt;field2 value&amp;gt;};
&lt;&#x2F;pre&gt;
&lt;p&gt;Where:
&lt;code&gt;&amp;lt;variable name&amp;gt;&lt;&#x2F;code&gt; is an identifier,
&lt;code&gt;&amp;lt;record type&amp;gt;&lt;&#x2F;code&gt; is a previously declared record type,
&lt;code&gt;record&lt;&#x2F;code&gt; is a new keyword,
&lt;code&gt;&amp;lt;field# name&amp;gt;&lt;&#x2F;code&gt; is the field name used in the record type definition, and
&lt;code&gt;&amp;lt;field# value&amp;gt;&lt;&#x2F;code&gt; is an identifier for an existing variable matching the field type.&lt;&#x2F;p&gt;
&lt;p&gt;In JSON:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;dest&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;variable name&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;fields&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;&amp;lt;field1 name&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;field1 value&amp;gt;&amp;quot;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;&amp;lt;field2 name&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;field2 value&amp;gt;&amp;quot;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    },
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;recordinst&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;record type&amp;gt;&amp;quot;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We decided to introduce the &lt;code&gt;record&lt;&#x2F;code&gt; opcode. Note that the ordering of field name and value pairs does not matter, as long as they match the definition’s field names and types. We also only allow field values to be existing variables, to match the semantics of current operations. The structure of this statement was designed to match record type declarations as much as possible.&lt;&#x2F;p&gt;
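&lt;p&gt;As a sketch of an instantiation (the &lt;code&gt;Point&lt;&#x2F;code&gt; type and the variable names are hypothetical), the fields below are deliberately listed in a different order than in the declaration:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Point = {x: int; y: int};
v0: int = const 3;
v1: int = const 4;
p: Point = record {y: v1; x: v0};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Following the format above, the last line would correspond to JSON along these lines:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &amp;quot;dest&amp;quot;: &amp;quot;p&amp;quot;,
    &amp;quot;fields&amp;quot;: {
        &amp;quot;y&amp;quot;: &amp;quot;v1&amp;quot;,
        &amp;quot;x&amp;quot;: &amp;quot;v0&amp;quot;
    },
    &amp;quot;op&amp;quot;: &amp;quot;recordinst&amp;quot;,
    &amp;quot;type&amp;quot;: &amp;quot;Point&amp;quot;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;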
&lt;h3 id=&quot;nominal-vs-structural-typing&quot;&gt;Nominal vs Structural Typing&lt;&#x2F;h3&gt;
&lt;p&gt;One of the main design decisions was whether we wanted to use nominal or structural typing to typecheck records. When defining a nested record, it is required to first define the nested record and assign it to a variable. Then, you can define the outer record with the field initialized to the variable holding the nested record. When type-checking these nested records, we need to look up the type of the outer record from our type environment, and step through the fields, one by one, comparing the signature with the type returned from the lookup of the initializing variable. We use the following example to illustrate nominal typing:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Dog = {age: int; isAsleep: bool};
type Person = {class: int; dog: Dog};
v0: int = const 3;
v1: bool = const false;
Milo: Dog = record {age: v0; isAsleep: v1};
v2: int = const 4120;
AndrewMyers: Person = record {class: v2; dog: Milo};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Consider the case of checking the variable that initializes a nested record, i.e., &lt;code&gt;Milo&lt;&#x2F;code&gt;. With nominal typing, if we expect the type &lt;code&gt;Dog&lt;&#x2F;code&gt; and the lookup of the variable returns the type &lt;code&gt;Dog&lt;&#x2F;code&gt;, we know the value was already typechecked when it was defined and added to our environment. Because records are immutable, it must still typecheck. This sort of &lt;em&gt;shallow type-checking&lt;&#x2F;em&gt; falls out of nominal typing.&lt;&#x2F;p&gt;
&lt;p&gt;Along with the ability to quickly verify any type, nominal typing allows us to reject a nested record&#x27;s type without recursive checks if it does not match the record type name in the signature. In a structural typing world, the name of the type bound to the initializing variable is not enough to reject a type. If the declared name did not match, we would need to recursively check all of the fields of the value bound to our initializing variable and compare them with the type signature. This is much slower.&lt;&#x2F;p&gt;
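&lt;p&gt;To illustrate the difference, here is a sketch with a hypothetical &lt;code&gt;Cat&lt;&#x2F;code&gt; type whose fields are identical to those of &lt;code&gt;Dog&lt;&#x2F;code&gt;. Under nominal typing we would expect the last line to be rejected by name alone, since &lt;code&gt;Tom&lt;&#x2F;code&gt; has type &lt;code&gt;Cat&lt;&#x2F;code&gt; and the &lt;code&gt;dog&lt;&#x2F;code&gt; field expects a &lt;code&gt;Dog&lt;&#x2F;code&gt;. A structural checker would instead have to compare the fields of both types and would accept it:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Dog = {age: int; isAsleep: bool};
type Cat = {age: int; isAsleep: bool};
type Person = {class: int; dog: Dog};
v0: int = const 3;
v1: bool = const false;
Tom: Cat = record {age: v0; isAsleep: v1};
v2: int = const 4120;
Owner: Person = record {class: v2; dog: Tom};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;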
&lt;h3 id=&quot;access&quot;&gt;Access&lt;&#x2F;h3&gt;
&lt;p&gt;We use the dot operator to access a field of a record:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&amp;lt;variable name&amp;gt; : &amp;lt;type&amp;gt; = &amp;lt;record&amp;gt; . &amp;lt;field&amp;gt;
&lt;&#x2F;pre&gt;
&lt;p&gt;Where:
&lt;code&gt;&amp;lt;variable name&amp;gt;&lt;&#x2F;code&gt; is a valid identifier that takes on the value of the indicated field, and
&lt;code&gt;&amp;lt;record&amp;gt;&lt;&#x2F;code&gt; is the name of an instance of a record with a field named &lt;code&gt;&amp;lt;field&amp;gt;&lt;&#x2F;code&gt; that has type &lt;code&gt;&amp;lt;type&amp;gt;&lt;&#x2F;code&gt;.
Note that there is a space before and after the dot. This is strictly necessary as dot is a valid character for a variable name and we decided that it would be horrible for backwards compatibility if we changed variable naming rules. In hindsight, a &lt;code&gt;get&lt;&#x2F;code&gt; instruction with a &lt;code&gt;record&lt;&#x2F;code&gt; and &lt;code&gt;field&lt;&#x2F;code&gt; argument would have been a better choice as that is more in line with existing syntax.&lt;&#x2F;p&gt;
&lt;p&gt;In JSON:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;src&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;record&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;dest&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;variable name&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;args&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: [
        &amp;quot;&amp;lt;record&amp;gt;&amp;quot;,
        &amp;quot;&amp;lt;field&amp;gt;&amp;quot;
    ],
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;access&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;type&amp;gt;&amp;quot;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We chose this format because using the dot operator to access fields is very common in modern programming languages.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;with-syntax&quot;&gt;&lt;em&gt;With&lt;&#x2F;em&gt; Syntax&lt;&#x2F;h3&gt;
&lt;p&gt;While immutable data structures allow you to more easily reason about how values flow through your program, these currently immutable records are cumbersome to change. 
For example, to change one field, you must recreate the entire record. While we do want a new record to be created, copying variables between two records is overly tedious. 
As such, we decided to implement a &lt;em&gt;with&lt;&#x2F;em&gt; syntax, similar to OCaml&#x27;s, in which the user only needs to specify a base record and the fields to change, along with their new values. This maintains immutability without so much tedium.&lt;&#x2F;p&gt;
&lt;p&gt;We use the following syntax:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&amp;lt;new record&amp;gt;: &amp;lt;record type&amp;gt; = 
    &amp;lt;old record&amp;gt; with {&amp;lt;field1 name&amp;gt;: &amp;lt;field1 value&amp;gt;; &amp;lt;field2 name&amp;gt;: &amp;lt;field2 value&amp;gt; … };
&lt;&#x2F;pre&gt;
&lt;p&gt;Where:
&lt;code&gt;&amp;lt;new record&amp;gt;&lt;&#x2F;code&gt; is the name of the new record,
&lt;code&gt;&amp;lt;record type&amp;gt;&lt;&#x2F;code&gt; is the same type as &lt;code&gt;&amp;lt;old record&amp;gt;&lt;&#x2F;code&gt;,
&lt;code&gt;with&lt;&#x2F;code&gt; is a new keyword,
&lt;code&gt;&amp;lt;field# name&amp;gt;&lt;&#x2F;code&gt; must match a field in &lt;code&gt;&amp;lt;record type&amp;gt;&lt;&#x2F;code&gt;, and
&lt;code&gt;&amp;lt;field# value&amp;gt;&lt;&#x2F;code&gt; must be a variable whose type matches the type of the corresponding field.&lt;&#x2F;p&gt;
&lt;p&gt;Within the braces, the user may specify 0 to n field name and value pairs, where n is the total number of fields in the record type.&lt;&#x2F;p&gt;
&lt;p&gt;In JSON:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;{
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;dest&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;new record&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;src&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;old record&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;record type&amp;gt;&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;op&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;recordwith&amp;quot;,
    &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;fields&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: {
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;&amp;lt;field1 name&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;field1 value&amp;gt;&amp;quot;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;font-weight:bold;color:#183691;&quot;&gt;&amp;quot;&amp;lt;field2 name&amp;gt;&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;: &amp;quot;&amp;lt;field2 value&amp;gt;&amp;quot;,
        &lt;&#x2F;span&gt;&lt;span style=&quot;background-color:#f5f5f5;font-weight:bold;color:#b52a1d;&quot;&gt;...&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;
    }
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This syntax was designed to have a similar format as record instantiation. &lt;&#x2F;p&gt;
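&lt;p&gt;The intended semantics of the JSON form can be sketched in a few lines of Python. This is only an illustration under assumptions, not our interpreter: records are modelled as dictionaries, and the function name &lt;code&gt;eval_recordwith&lt;&#x2F;code&gt; and the sample values are hypothetical.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: a hypothetical evaluator, not the project's interpreter.

def eval_recordwith(instr, env):
    """Bind instr["dest"] to a copy of the source record with some fields replaced."""
    old_record = env[instr["src"]]
    new_record = dict(old_record)  # shallow copy, so the old record is left untouched
    for field, var in instr["fields"].items():
        if field not in old_record:
            raise KeyError(f"unknown field: {field}")
        new_record[field] = env[var]  # field values are given as variable names
    env[instr["dest"]] = new_record


# A sample environment holding one Person-like record named Henry.
env = {"v0": 21, "v1": False, "v2": True}
env["Henry"] = {"age": env["v0"], "isAsleep": env["v1"]}

eval_recordwith(
    {"op": "recordwith", "dest": "AwakeHenry", "src": "Henry",
     "type": "Person", "fields": {"isAsleep": "v2"}},
    env,
)
```

&lt;p&gt;After the call, &lt;code&gt;AwakeHenry&lt;&#x2F;code&gt; differs from &lt;code&gt;Henry&lt;&#x2F;code&gt; only in &lt;code&gt;isAsleep&lt;&#x2F;code&gt;, and &lt;code&gt;Henry&lt;&#x2F;code&gt; itself is unchanged, which is the immutability property described above.&lt;&#x2F;p&gt;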
&lt;h3 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h3&gt;
&lt;p&gt;The two main goals of record types are to allow Bril to logically group and use related data points as well as to serve as a valuable language feature. These would be useful when compiling higher-level languages that utilize a similar data structure like records in OCaml or structs in C into Bril. &lt;&#x2F;p&gt;
&lt;p&gt;To satisfy the first goal, we implemented immutable nominal record types. This implementation includes record declaration, instantiation, and access. Decisions about which operations to include in our design were influenced by the operations available on record data types in higher-level languages. &lt;&#x2F;p&gt;
&lt;p&gt;To evaluate the functionality of our implementation, we created a suite of tests that covered each of these operations, as well as combinations of these operations.&lt;&#x2F;p&gt;
&lt;p&gt;The primary operation not supported by our record type specification that is supported by some higher-level language record types is mutation. While our record types do not support mutation as discussed, this does not significantly hinder Bril&#x27;s ability to compile higher-level languages that support this feature. &lt;&#x2F;p&gt;
&lt;p&gt;Consider the following C code that declares a struct and then updates one of its fields.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;#include &amp;lt;stdbool.h&amp;gt;

struct Person {
   int age;
   bool isAsleep;
};

int main(void) {
    struct Person Henry = { 20, false };
    Henry.age += 1;
    return 0;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compiling this program to Bril may look something like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Person = {age: int; isAsleep: bool};
v0: int = const 20;
v1: bool = const false;
Henry: Person = record {age: v0; isAsleep: v1};
v2: int = const 21;
Henry: Person = Henry with {age: v2};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;As shown in this example, mutable record types can be transformed into immutable records in Bril without significant effort. Therefore, lack of this operation does not compromise this goal.&lt;&#x2F;p&gt;
&lt;p&gt;For the second goal, to evaluate record types as a language feature in Bril, we consider how this functionality translates to Bril.
We found that creating new records was a tedious process if the record was large, so we implemented &lt;em&gt;with&lt;&#x2F;em&gt; statements in addition to the features mentioned above for situations where one wanted to duplicate a record with a few changes. It should be noted that it is bad form to use a with statement with no fields because that would be identical to referencing the old record with &lt;code&gt;… = id oldRecordName&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We were successful in this aspect as creating a new record from an existing one with this syntax is more concise and easier to reason about than copying over every field. This is an advantage in an IR as we can logically think about a single &lt;em&gt;with statement&lt;&#x2F;em&gt; and a sequence of functionally equivalent statements that copy variables in the same way.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the blocks of code below. These programs duplicate a record with one field changed.
Here we show what this would look like without &lt;em&gt;with statements&lt;&#x2F;em&gt;. &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Person = {age: int; isAsleep: bool};
v0: int = const 21;
v1: bool = const false;
Henry: Person = record {age: v0; isAsleep: v1};
v2: bool = const true;
v3: int = Henry . age;
AwakeHenry: Person = record {age: v3; isAsleep: v2};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here we use the &lt;em&gt;with&lt;&#x2F;em&gt; syntax.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;type Person = {age: int; isAsleep: bool};
v0: int = const 21;
v1: bool = const false;
Henry: Person = record {age: v0; isAsleep: v1};
v2: bool = const true;
AwakeHenry: Person = Henry with {isAsleep: v2};
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is worth noting that the size of code required to duplicate a record without &lt;em&gt;with statements&lt;&#x2F;em&gt; scales linearly with the size of the record. In contrast, the size of code required to duplicate a record with &lt;em&gt;with statements&lt;&#x2F;em&gt; only increases with the size of the changes. Therefore, record types are successful as a language feature as they integrate well with current syntax and do not impose unnecessary code bloat. &lt;&#x2F;p&gt;
&lt;p&gt;Overall, record types implement the basic operations necessary to use them effectively and they increase Bril&#x27;s ability to compile higher-level languages. &lt;&#x2F;p&gt;
&lt;h3 id=&quot;notable-challenges&quot;&gt;Notable Challenges&lt;&#x2F;h3&gt;
&lt;p&gt;One of the challenging aspects of this project was to think about how our design decisions interact to help us reach the aforementioned goals.&lt;&#x2F;p&gt;
&lt;p&gt;One of our goals was to make Bril able to compile a larger set of programs in higher-level languages that use record data types. Questions like &amp;quot;Does immutability hinder Bril&#x27;s ability to compile mutable record types?&amp;quot; or &amp;quot;What are the strengths of nominal vs. structural typing when designing an IR vs. a higher-level language?&amp;quot; were challenging as they involve considering how different aspects of our design interact.&lt;&#x2F;p&gt;
&lt;p&gt;Our second goal was to add a valuable language feature. One aspect of this added value is extensibility. When designing the declaration instruction, we wanted an instruction that could generalize to type aliases, in the event that we decide to add these in the future. One of the challenges here is that when making our design decisions, not only did we consider the current state of the language, but we also tried to predict what features may be useful in the future.&lt;&#x2F;p&gt;
&lt;p&gt;A different challenge was debugging the TypeScript interpreter: errors were very difficult to trace because the TypeScript source is compiled into a separate JavaScript file. Debugging the parser in briltxt was not too bad, as the new statement formats were straightforward and we did not have to modify existing semantics. &lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Run Bril on Raspberry Pi Natively</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/run-bril-on-raspberry-pi-natively/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/run-bril-on-raspberry-pi-natively/</guid>
                <description>&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sampsyo&#x2F;bril&quot;&gt;Bril&lt;&#x2F;a&gt; comes with a reference interpreter &lt;code&gt;brili&lt;&#x2F;code&gt;, 
which allows any platform that supports Node.js to run Bril code.
The downside of using an interpreter is that
the performance is usually not as good as a native binary.
Therefore, it would be more efficient (and cool) if Bril could be run natively on an ARMv8 processor.
In this project, I am going to build a code translator that
translates Bril to AArch64 assembly,
so that it can run on a 64-bit Raspberry Pi (Raspberry Pi 2B v1.2 or later versions)
or any other 64-bit ARM device with the AArch64 architecture.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-and-implementation&quot;&gt;Design and Implementation&lt;&#x2F;h2&gt;
&lt;p&gt;This section will discuss how the translator is designed and implemented.
Although there are almost one-to-one mappings between Bril instructions and AArch64 assembly,
there are still some details that need to be carefully designed to support
current and future functions of Bril.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;types&quot;&gt;Types&lt;&#x2F;h3&gt;
&lt;p&gt;Currently there are two value types in Bril:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;int&lt;&#x2F;code&gt;: 64-bit two&#x27;s complement signed integer type. 
It is equivalent to the &lt;code&gt;int64_t&lt;&#x2F;code&gt; type in C.
It will occupy 8 bytes of memory and fit in a single 64-bit register of 64-bit ARM processors.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;bool&lt;&#x2F;code&gt;: Boolean value that could be either &lt;code&gt;true&lt;&#x2F;code&gt; or &lt;code&gt;false&lt;&#x2F;code&gt;.
It is equivalent to the &lt;code&gt;bool&lt;&#x2F;code&gt; type in C.
It will occupy one whole byte in memory. &lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;variables-and-stack-allocation&quot;&gt;Variables and Stack Allocation&lt;&#x2F;h3&gt;
&lt;p&gt;Currently, Bril only has local variables, whose scope is the entire function.
Similar to the local variables in C, all local variables are stored on the 
stack frame of the current function.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;|--frame pointer---|
| local variables  |
|------------------|
| callee-save regs |
|------------------|
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Unlike C, where variables need to be explicitly declared, 
any Bril instruction with a &lt;code&gt;dest&lt;&#x2F;code&gt; field implicitly declares a variable.
In order to build the symbol table and allocate stack space for all variables,
the translator needs to scan all instructions in the function and add all &lt;code&gt;dest&lt;&#x2F;code&gt; variables to 
the symbol table.&lt;&#x2F;p&gt;
&lt;p&gt;ARMv8 requires that the stack pointer is 16-byte aligned.
One easy solution for stack allocation is to give each variable its own
16-byte stack slot.
However, this is very inefficient since the currently supported types have a size of
8 bytes or less.&lt;&#x2F;p&gt;
&lt;p&gt;The better solution is to keep track of fragmented space on the stack. 
When adding a new variable to the stack, 
the translator checks whether any fragment of free stack space is big enough for the 
variable.
It only increases the stack size by 16 bytes if there is not enough space.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the following Bril program:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;main {
    a:int = const 1;
    b:bool = const true;
    c:int = const 1;
    d:bool = const false;
}
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;will have stack allocation like this:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;variable&lt;&#x2F;th&gt;&lt;th&gt;offset&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;a&lt;&#x2F;td&gt;&lt;td&gt;0x0000&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;b&lt;&#x2F;td&gt;&lt;td&gt;0x0008&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;c&lt;&#x2F;td&gt;&lt;td&gt;0x0010&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;d&lt;&#x2F;td&gt;&lt;td&gt;0x0009&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;with free stack space &lt;code&gt;0x000A~0x000F&lt;&#x2F;code&gt; and &lt;code&gt;0x0018~0x001F&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
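&lt;p&gt;The allocation strategy can be sketched in Python as a first-fit search over a list of free fragments. This is an illustration under assumptions rather than the translator&#x27;s actual code: the class name &lt;code&gt;StackLayout&lt;&#x2F;code&gt;, the first-fit policy, and the natural alignment of each variable to its own size are my choices for the sketch.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: a hypothetical allocator, not the translator's actual code.

TYPE_SIZES = {"int": 8, "bool": 1}  # sizes in bytes, as described in the Types section
CHUNK = 16  # the stack grows in 16-byte steps to keep the stack pointer aligned


class StackLayout:
    def __init__(self):
        self.size = 0      # total bytes reserved so far
        self.holes = []    # free (offset, length) fragments, in offset order
        self.offsets = {}  # variable name -> offset

    def _try_place(self, need):
        """Return an offset inside the first fragment that fits, or None."""
        for i, (start, length) in enumerate(self.holes):
            aligned = (start + need - 1) // need * need  # natural alignment (an assumption)
            if start + length >= aligned + need:
                pieces = [(start, aligned - start),
                          (aligned + need, start + length - aligned - need)]
                # Replace the fragment by whatever is left before and after the variable.
                self.holes[i:i + 1] = [p for p in pieces if p[1] > 0]
                return aligned
        return None

    def add(self, name, type_name):
        if name in self.offsets:  # a variable keeps its slot when it is reassigned
            return self.offsets[name]
        need = TYPE_SIZES[type_name]
        offset = self._try_place(need)
        if offset is None:  # no fragment fits: grow the frame by one chunk
            self.holes.append((self.size, CHUNK))
            self.size += CHUNK
            offset = self._try_place(need)
        self.offsets[name] = offset
        return offset


layout = StackLayout()
for name, type_name in [("a", "int"), ("b", "bool"), ("c", "int"), ("d", "bool")]:
    layout.add(name, type_name)
```

&lt;p&gt;For the four variables of the example program, this sketch produces the same offsets as the table above and leaves the same two free ranges. It does not merge adjacent fragments, which is acceptable here only because no supported type is larger than 8 bytes.&lt;&#x2F;p&gt;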
&lt;h3 id=&quot;program-and-functions&quot;&gt;Program and Functions&lt;&#x2F;h3&gt;
&lt;p&gt;Each Bril file contains one program, which is the top-level object.
Therefore, each Bril file can be translated into one AArch64 assembly file.
As in C, the &lt;code&gt;main&lt;&#x2F;code&gt; function is the entry point for the program.&lt;&#x2F;p&gt;
&lt;p&gt;A function in AArch64 assembly consists of a label as the function name and a sequence of instructions.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;func-name:
    instr1
    instr2
    ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At the beginning of a function, the generated code first pushes all callee-save registers
onto the stack, including the frame pointer (&lt;code&gt;x29&lt;&#x2F;code&gt;) and the link register (&lt;code&gt;x30&lt;&#x2F;code&gt;).
The translator then builds the symbol table for the current function, 
and the generated code moves the stack pointer and frame pointer accordingly to leave enough space
for all variables.&lt;&#x2F;p&gt;
&lt;p&gt;At the end of a function, there is a label indicating the return point.
Before the function returns to the address in the link register with &lt;code&gt;ret&lt;&#x2F;code&gt;, 
it needs to pop all local variables by moving the stack pointer back
and restore all saved register values.&lt;&#x2F;p&gt;
&lt;p&gt;With this design, function calls could be added easily with minor changes
for passing parameters.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;arithmetic-and-logic-operations&quot;&gt;Arithmetic and Logic Operations&lt;&#x2F;h3&gt;
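&lt;p&gt;Arithmetic, logic, and comparison operations are easy to translate. Note that in AArch64, &lt;code&gt;lt&lt;&#x2F;code&gt;, &lt;code&gt;le&lt;&#x2F;code&gt;, &lt;code&gt;gt&lt;&#x2F;code&gt;, &lt;code&gt;ge&lt;&#x2F;code&gt;, and &lt;code&gt;eq&lt;&#x2F;code&gt; are condition codes evaluated after a comparison such as &lt;code&gt;cmp&lt;&#x2F;code&gt;, not standalone instructions.&lt;&#x2F;p&gt;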
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Bril&lt;&#x2F;th&gt;&lt;th&gt;AArch64&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;add&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;add&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;sub&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;sub&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;mul&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;mul&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;div&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;sdiv&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;and&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;and&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;or&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;orr&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;not&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;not&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;lt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;lt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;le&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;le&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;gt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;gt&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;ge&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;ge&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;eq&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;eq&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The main difference between Bril and AArch64 is the addressing model.
AArch64 operates directly on data in registers,
whereas Bril operates on variables that live on the stack 
and must be accessed with memory operations.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the Bril instruction &lt;code&gt;c:int = add a b&lt;&#x2F;code&gt; will be compiled to the
following sequence of AArch64 instructions:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;load data &lt;code&gt;a&lt;&#x2F;code&gt; to &lt;code&gt;x8&lt;&#x2F;code&gt; by &lt;code&gt;ldr&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;load data &lt;code&gt;b&lt;&#x2F;code&gt; to &lt;code&gt;x9&lt;&#x2F;code&gt; by &lt;code&gt;ldr&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;add x8, x8, x9&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;store &lt;code&gt;x8&lt;&#x2F;code&gt; back to the space for &lt;code&gt;c&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Currently, the translator does not perform register allocation for local variables,
so it needs to load and store data between registers and memory for each instruction. This is slow but fine for now. Optimizations could be done in future work.&lt;&#x2F;p&gt;
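&lt;p&gt;As a rough sketch of this load, operate, store pattern, the Python function below emits the instruction sequence for one Bril &lt;code&gt;add&lt;&#x2F;code&gt;. The registers &lt;code&gt;x8&lt;&#x2F;code&gt; and &lt;code&gt;x9&lt;&#x2F;code&gt; come from the steps above. The function name &lt;code&gt;translate_add&lt;&#x2F;code&gt;, the &lt;code&gt;sp&lt;&#x2F;code&gt;-relative addressing, and the sample offsets are assumptions for illustration and may differ from what the translator actually emits.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: the addressing mode and offsets are assumptions.

def translate_add(dest, lhs, rhs, offsets):
    """Emit AArch64 assembly for `dest: int = add lhs rhs` with stack-resident variables."""
    return [
        f"ldr x8, [sp, #{offsets[lhs]}]",   # load the first operand
        f"ldr x9, [sp, #{offsets[rhs]}]",   # load the second operand
        "add x8, x8, x9",                   # x8 = x8 + x9
        f"str x8, [sp, #{offsets[dest]}]",  # store the sum, which is in x8, into the slot of dest
    ]


# Hypothetical stack offsets for three int variables.
asm = translate_add("c", "a", "b", {"a": 0, "b": 8, "c": 16})
```

&lt;p&gt;Every Bril instruction pays for two loads and one store in this scheme, which is the cost that register allocation would remove.&lt;&#x2F;p&gt;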
&lt;h2 id=&quot;other-instructions&quot;&gt;Other Instructions&lt;&#x2F;h2&gt;
&lt;p&gt;Other Bril instructions are relatively straightforward to translate:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;a:int = const 1;&lt;&#x2F;code&gt;: store value &lt;code&gt;1&lt;&#x2F;code&gt; to the stack location 
of variable &lt;code&gt;a&lt;&#x2F;code&gt; by &lt;code&gt;str&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;a:int = id b;&lt;&#x2F;code&gt;: load the value of &lt;code&gt;b&lt;&#x2F;code&gt; by &lt;code&gt;ldr&lt;&#x2F;code&gt; 
and store to the stack location of &lt;code&gt;a&lt;&#x2F;code&gt; by &lt;code&gt;str&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;br cond label1 label2&lt;&#x2F;code&gt;: 
&lt;ol&gt;
&lt;li&gt;load the value of boolean variable &lt;code&gt;cond&lt;&#x2F;code&gt; to &lt;code&gt;x8&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;cbnz x8, label1&lt;&#x2F;code&gt; if the value is not zero (true), branch to &lt;code&gt;label1&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;b   label2&lt;&#x2F;code&gt; otherwise branch to &lt;code&gt;label2&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;jmp label&lt;&#x2F;code&gt;: &lt;code&gt;b label&lt;&#x2F;code&gt; branch to &lt;code&gt;label&lt;&#x2F;code&gt;&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;ret&lt;&#x2F;code&gt;: branch to the return point (stored link register) of the current function&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;print&lt;&#x2F;code&gt;: call the &lt;code&gt;printf&lt;&#x2F;code&gt; function in C. 
There is a small generated function called &lt;code&gt;printbool&lt;&#x2F;code&gt; 
to print &lt;code&gt;true&lt;&#x2F;code&gt; or &lt;code&gt;false&lt;&#x2F;code&gt; strings.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
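&lt;p&gt;The three-step translation of &lt;code&gt;br&lt;&#x2F;code&gt; listed above could look roughly like the following Python sketch. The function name &lt;code&gt;translate_br&lt;&#x2F;code&gt;, the byte load with &lt;code&gt;ldrb&lt;&#x2F;code&gt; (chosen because a &lt;code&gt;bool&lt;&#x2F;code&gt; occupies one byte), and the &lt;code&gt;sp&lt;&#x2F;code&gt;-relative offset are assumptions, not necessarily what the translator emits.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: the load instruction and addressing mode are assumptions.

def translate_br(cond, true_label, false_label, offsets):
    """Emit AArch64 assembly for `br cond true_label false_label`."""
    return [
        f"ldrb w8, [sp, #{offsets[cond]}]",  # load the one-byte bool, zero-extended into x8
        f"cbnz x8, {true_label}",            # non-zero means true
        f"b {false_label}",                  # otherwise fall through to the false target
    ]


# Hypothetical stack offset for the bool variable cond.
asm = translate_br("cond", "label1", "label2", {"cond": 9})
```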
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;experimental-setup&quot;&gt;Experimental Setup&lt;&#x2F;h3&gt;
&lt;p&gt;The experiment is done on a Raspberry Pi 3B+, 
which has a quad-core 1.2GHz Broadcom BCM2837 64-bit processor
and 1GB LPDDR2 SDRAM.
Generated AArch64 assembly programs are assembled and linked by gcc version 8.3.0.&lt;&#x2F;p&gt;
&lt;p&gt;In order to evaluate the performance of the Bril interpreter and 
native binary programs,
both approaches are tested on an n-by-n matrix multiplication workload,
which has &lt;code&gt;O(n^2)&lt;&#x2F;code&gt; space usage, &lt;code&gt;O(n^3)&lt;&#x2F;code&gt; arithmetic operations, 
and &lt;code&gt;O(n^3)&lt;&#x2F;code&gt; memory accesses.&lt;&#x2F;p&gt;
&lt;p&gt;The benchmark Bril program is generated by a Python script
and translated into JSON format for the Bril interpreter ahead of time, 
so that the translation to JSON is not counted in the performance evaluation.
The &lt;code&gt;perf&lt;&#x2F;code&gt; tool is used to report total clock cycles.&lt;&#x2F;p&gt;
&lt;p&gt;Each experiment is run 10 times to report the average and 
variance of the experiment data.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;result&quot;&gt;Result&lt;&#x2F;h3&gt;
&lt;p&gt;The experiment runs matrix multiplications from 5 by 5 to 40 by 40, with step
size 5.&lt;&#x2F;p&gt;
&lt;img src=&quot;matmul.png&quot; style=&quot;width: 100%;&quot;&gt;
&lt;p&gt;The first plot shows the comparison of clock cycles between the Bril interpreter
and native binary programs. 
The second plot zooms in to show the trend of clock cycles of binary programs.
The native binary program runs much faster than the &lt;code&gt;brili&lt;&#x2F;code&gt; interpreter
on the matrix multiplication of the same size.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;analysis&quot;&gt;Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;So &amp;quot;where have all my cycles gone?&amp;quot;&lt;&#x2F;p&gt;
&lt;p&gt;With the &lt;code&gt;evalProg&lt;&#x2F;code&gt; function removed, the Bril interpreter
only loads and parses the JSON file.
The experiment is re-run this way to obtain the average cycles spent on loading.&lt;&#x2F;p&gt;
&lt;img src=&quot;cyclediff.png&quot; style=&quot;width: 100%;&quot;&gt;
&lt;p&gt;Comparing the blue line and orange line in the plot
shows that, on average, most of the cycles are spent on parsing the JSON input.
The average cycles spent on evaluating programs, 
obtained by taking the difference between the blue and orange lines,
are shown as the green line.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take a look at the file sizes. &lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;matrix size&lt;&#x2F;th&gt;&lt;th&gt;5&lt;&#x2F;th&gt;&lt;th&gt;10&lt;&#x2F;th&gt;&lt;th&gt;15&lt;&#x2F;th&gt;&lt;th&gt;20&lt;&#x2F;th&gt;&lt;th&gt;25&lt;&#x2F;th&gt;&lt;th&gt;30&lt;&#x2F;th&gt;&lt;th&gt;35&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;binary size&lt;&#x2F;td&gt;&lt;td&gt;18K&lt;&#x2F;td&gt;&lt;td&gt;46K&lt;&#x2F;td&gt;&lt;td&gt;126K&lt;&#x2F;td&gt;&lt;td&gt;282K&lt;&#x2F;td&gt;&lt;td&gt;530K&lt;&#x2F;td&gt;&lt;td&gt;902K&lt;&#x2F;td&gt;&lt;td&gt;1.4M&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;JSON size&lt;&#x2F;td&gt;&lt;td&gt;54K&lt;&#x2F;td&gt;&lt;td&gt;374K&lt;&#x2F;td&gt;&lt;td&gt;1.2M&lt;&#x2F;td&gt;&lt;td&gt;2.8M&lt;&#x2F;td&gt;&lt;td&gt;5.3M&lt;&#x2F;td&gt;&lt;td&gt;9.1M&lt;&#x2F;td&gt;&lt;td&gt;15M&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;One of the reasons is that Bril code files get very large even for simple tasks
because Bril currently lacks basic language features.
For example, Bril does not have arrays yet,
so it is not possible to do matrix multiplication with loops.
Therefore, each multiplication instruction is explicitly generated,
which means the size of the source code grows cubically with the matrix dimension.&lt;&#x2F;p&gt;
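&lt;p&gt;To illustrate the cubic growth, here is a minimal Python sketch of a generator for a fully unrolled matrix multiplication. It is not the script used in the experiment: the variable naming scheme, the placeholder input values, and the helper names &lt;code&gt;unrolled_matmul&lt;&#x2F;code&gt; and &lt;code&gt;count_op&lt;&#x2F;code&gt; are hypothetical.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: the naming scheme is hypothetical, not the author's script.

def unrolled_matmul(n):
    """Return Bril text lines computing C = A * B for n-by-n matrices without loops."""
    lines = ["main {"]
    for i in range(n):
        for j in range(n):
            # Fill both inputs with a placeholder constant.
            lines.append(f"  a_{i}_{j}: int = const 1;")
            lines.append(f"  b_{i}_{j}: int = const 1;")
    for i in range(n):
        for j in range(n):
            lines.append(f"  c_{i}_{j}: int = const 0;")
            for k in range(n):
                # One multiply and one accumulate per (i, j, k) triple.
                lines.append(f"  t: int = mul a_{i}_{k} b_{k}_{j};")
                lines.append(f"  c_{i}_{j}: int = add c_{i}_{j} t;")
    lines.append("}")
    return lines


def count_op(lines, op):
    """Count instructions whose opcode is `op`."""
    return sum(1 for line in lines if f"= {op} " in line)


mul_counts = {n: count_op(unrolled_matmul(n), "mul") for n in (5, 10, 20)}
```

&lt;p&gt;In this sketch, doubling the matrix dimension multiplies the number of &lt;code&gt;mul&lt;&#x2F;code&gt; instructions by eight, which is broadly in line with how the JSON sizes in the table grow between sizes 5, 10, and 20.&lt;&#x2F;p&gt;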
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Native binary Bril programs run much faster than the Bril interpreter on an ARMv8 
processor.
Currently, loading and parsing the JSON-format Bril code dominates the execution 
time of the Bril interpreter. &lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Static Type Checking for Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/type-checker/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/type-checker/</guid>
                <description>&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;&#x2F;h2&gt;
&lt;p&gt;The goal of the project was to add a static type checker that finds type errors, conflicting definitions of variables, and uses of undefined variables. We also ensure that branch labels are valid and unique.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;design-and-implementation&quot;&gt;Design and Implementation&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;design&quot;&gt;Design&lt;&#x2F;h3&gt;
&lt;p&gt;Bril currently supports 2 types, &lt;code&gt;int&lt;&#x2F;code&gt; and &lt;code&gt;bool&lt;&#x2F;code&gt;, which makes type checking relatively easy. Our type checker is defined such that an arithmetic operation like this:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
a: int = add b c
&lt;&#x2F;pre&gt;
&lt;p&gt;would raise an error if either &lt;code&gt;b&lt;&#x2F;code&gt; or &lt;code&gt;c&lt;&#x2F;code&gt; has a type other than &lt;code&gt;int&lt;&#x2F;code&gt;. Similarly, boolean operations only accept &lt;code&gt;bool&lt;&#x2F;code&gt; arguments, and comparison operations take integer arguments but have a boolean destination. These typing rules are defined in the &lt;a href=&quot;https:&#x2F;&#x2F;capra.cs.cornell.edu&#x2F;bril&#x2F;langref.html&quot;&gt;Bril language reference&lt;&#x2F;a&gt;. During type checking, we also ensure that the variables used in an instruction have been defined earlier in program order. We also make sure that no variable is redefined with a different type. For control flow operations, we ensure that labels are present in the code and are uniquely defined. This allows us to treat label strings as a separate type, invisible to the user. &lt;&#x2F;p&gt;
&lt;p&gt;The error raised by the type checker outputs the line at which we find the first operation breaking the type check rule and mentions the type of error found.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementation&quot;&gt;Implementation&lt;&#x2F;h3&gt;
&lt;p&gt;We have a first pass in the algorithm which collects a list of labels and ensures that each label is unique. If there are multiple labels with the same name, we throw an error at this point. This also helps us create a set of label names which is used to check for valid strings (of labels) during the second pass for control flow operations.  In the second pass, we go over each instruction and check for various type errors. Some fundamental checks in this process are:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Invalid instructions: This ensures that an instruction has all of its arguments and its destination (if applicable), which is checked using the number of arguments the operation expects.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Argument and destination type: For various operations we check if arguments and destination variables have the correct type. Something like &lt;code&gt;d: bool = add a b&lt;&#x2F;code&gt; where &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt; are integers would raise an error for the destination variable.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Redefined variables: We check if the destination variable has already been assigned to a different type in that context.  Hence the following set of instructions:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
a: int = const 2;
a: bool = const true;
&lt;&#x2F;pre&gt;
&lt;p&gt;are not allowed, but redefinition with the same type is allowed:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
a: int = const 2; 
a: int = const 5;
&lt;&#x2F;pre&gt;
&lt;p&gt;We do this by keeping a set of variables of each type (&lt;code&gt;int&lt;&#x2F;code&gt; and &lt;code&gt;bool&lt;&#x2F;code&gt;) defined in the function. This helps while checking existing definitions and possible redefinition errors.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Undefined variables: We check that every argument to an instruction is a variable that has already been defined, using the sets of variables mentioned above. A simple example would be &lt;code&gt;a: int = const 5; c: int = add a b&lt;&#x2F;code&gt;, where &lt;code&gt;b&lt;&#x2F;code&gt; was not defined before the instruction.&lt;&#x2F;p&gt;
&lt;p&gt;We do not take jumps or branches into account: definitions are checked in textual order, so even if a jump ensures that the variable is defined before the instruction executes, we still throw an error. The example below throws an error with our static type checker:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;jmp label;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;compute:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;a: int = const 6;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;c: int = add a b;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;label:
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;b: int = const 5;
&lt;&#x2F;span&gt;&lt;span style=&quot;color:#323232;&quot;&gt;jmp compute;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
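&lt;p&gt;The two passes described above can be sketched roughly as follows. This is a minimal illustration and not our actual implementation: the function name &lt;code&gt;first_error&lt;&#x2F;code&gt;, the error strings, and the small set of operations covered are hypothetical, and it assumes the JSON form in which a jump lists its target label in &lt;code&gt;args&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
# Hypothetical sketch of a two-pass checker over a function's JSON instructions.
ARITH_OPS = {"add", "sub", "mul", "div"}
KNOWN_TYPES = {"int", "bool"}


def first_error(instrs):
    """Return a message for the first error found, or None if the code checks."""
    # Pass 1: collect labels and reject duplicates.
    labels = set()
    for instr in instrs:
        label = instr.get("label")
        if label is None:
            continue
        if label in labels:
            return "repeated label: " + label
        labels.add(label)

    # Pass 2: visit instructions in textual order, tracking variable types.
    var_types = {}
    for instr in instrs:
        if "label" in instr:
            continue
        op, args = instr["op"], instr.get("args", [])
        if op == "jmp":
            if len(args) != 1:
                return "jmp expects 1 argument"
            if args[0] not in labels:
                return "unknown label: " + args[0]
            continue
        if op == "const":
            # Test for bool explicitly, since True is also an int in Python.
            result = "bool" if isinstance(instr["value"], bool) else "int"
        elif op in ARITH_OPS:
            if len(args) != 2:
                return op + " expects 2 arguments"
            for arg in args:
                if arg not in var_types:
                    return "undefined variable: " + arg
                if var_types[arg] != "int":
                    return "argument " + arg + " is not an int"
            result = "int"
        else:
            continue  # remaining operations are omitted from this sketch
        dest, declared = instr["dest"], instr["type"]
        if declared not in KNOWN_TYPES:
            return "unknown type: " + declared
        if declared != result:
            return "destination " + dest + " should have type " + result
        if var_types.get(dest, declared) != declared:
            return "variable " + dest + " redefined with a different type"
        var_types[dest] = declared
    return None


# The jump example above: b is defined textually after its first use.
program = [
    {"op": "jmp", "args": ["label"]},
    {"label": "compute"},
    {"op": "const", "dest": "a", "type": "int", "value": 6},
    {"op": "add", "dest": "c", "type": "int", "args": ["a", "b"]},
    {"label": "label"},
    {"op": "const", "dest": "b", "type": "int", "value": 5},
    {"op": "jmp", "args": ["compute"]},
]
print(first_error(program))  # undefined variable: b
```
&lt;&#x2F;pre&gt;
&lt;p&gt;Because the second pass never follows control flow, the sketch reproduces the limitation described in the last item: it rejects the program even though &lt;code&gt;b&lt;&#x2F;code&gt; is defined before &lt;code&gt;add&lt;&#x2F;code&gt; executes.&lt;&#x2F;p&gt;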
&lt;h2 id=&quot;hardest-parts-during-the-implementation&quot;&gt;Hardest parts during the implementation&lt;&#x2F;h2&gt;
&lt;p&gt;The type checker implementation, though straightforward, had a few challenges.&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;The typing rules need to be designed carefully for each operation, including arithmetic, boolean, and control instructions. Take the following branch instruction as an example: &lt;code&gt;br b left right&lt;&#x2F;code&gt;. The type checker needs to go through all the existing boolean variables to ensure that &lt;code&gt;b&lt;&#x2F;code&gt; is predefined and &lt;code&gt;left&lt;&#x2F;code&gt; and &lt;code&gt;right&lt;&#x2F;code&gt; are valid labels in the code snippet. Also, though currently only two basic types are supported in Bril, there could be extensions for more types like list, stack etc. Thus, it is important to maintain modularity of arithmetic&#x2F;boolean and control flow checking so that future updates could be made easily to support them.&lt;&#x2F;li&gt;
&lt;li&gt;To ensure that the type checker works properly with labels, a first pass of the labels is designed. Through the first pass, all existing labels are saved and no duplicate ones are allowed to avoid conflicts. Though this almost doubles the type checking overhead, it further ensures the correctness of the program by static analysis.&lt;&#x2F;li&gt;
&lt;li&gt;Keeping track of line numbers and returning a proper error message when a type checking error is encountered took some care. We implemented a small piece of logic on top of the &lt;code&gt;briltxt&lt;&#x2F;code&gt; parser to keep a list of the line numbers of lines without Bril instructions. These lines could be a newline (&lt;code&gt;\n&lt;&#x2F;code&gt;), a comment (&lt;code&gt;#&lt;&#x2F;code&gt;), or simply the closing brace of a function (&lt;code&gt;}&lt;&#x2F;code&gt;). This list helps us recover the original line number of each instruction when we scan through the parsed &lt;code&gt;json&lt;&#x2F;code&gt; output during our type checking pass.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
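&lt;p&gt;The line-number bookkeeping from the last item might look something like the sketch below. It is only an illustration: the helper name &lt;code&gt;instruction_lines&lt;&#x2F;code&gt; is hypothetical, it assumes one instruction or label per source line, and it records the lines that do hold an instruction (the complement of the list described above).&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
```python
def instruction_lines(source):
    """Map the i-th instruction (or label) to its 1-based source line."""
    lines = []
    for number, text in enumerate(source.splitlines(), start=1):
        stripped = text.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped == "}" or stripped.endswith("{"):
            continue  # closing brace or function header
        lines.append(number)
    return lines


source = """main {
  # a comment

  a: int = const 2;
  a: bool = const true;
}
"""
print(instruction_lines(source))  # [4, 5]
```
&lt;&#x2F;pre&gt;
&lt;p&gt;With such a list, an error found at the second instruction of the parsed output can be reported at source line 5.&lt;&#x2F;p&gt;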
&lt;h2 id=&quot;evaluation-and-results&quot;&gt;Evaluation and results&lt;&#x2F;h2&gt;
&lt;p&gt;We wrote a set of benchmark programs to test our implementation against the rules defined in the design section above.
The test cases are classified into two sub-directories: &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tissue3&#x2F;bril&#x2F;tree&#x2F;master&#x2F;test&#x2F;type-check&#x2F;should-fail&quot;&gt;should-fail&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;tissue3&#x2F;bril&#x2F;tree&#x2F;master&#x2F;test&#x2F;type-check&#x2F;should-pass&quot;&gt;should-pass&lt;&#x2F;a&gt;. Each input file &lt;code&gt;*.bril&lt;&#x2F;code&gt; has a corresponding expected output file &lt;code&gt;*.out&lt;&#x2F;code&gt;. All test cases in a directory can be run with &lt;code&gt;turnt directory&#x2F;*.bril&lt;&#x2F;code&gt;. For example, to run the should-fail benchmark, run &lt;code&gt;turnt test&#x2F;type-check&#x2F;should-fail&#x2F;*.bril&lt;&#x2F;code&gt; from the main directory of Bril.
Because the should-pass cases are trivial (they simply enumerate the existing operations), the following table lists only the test cases for which our type checker reports an error. The second column of the table explains why the error is reported.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Instruction&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Type Checking Rule&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Testing Code Snippet&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Conflict Definition&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Variables cannot be redefined with a different type.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;v: int = 5;&lt;br&#x2F;&gt; v :bool = true;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Arithmetic&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Adding an integer and a boolean type.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;a: int = 4;&lt;br&#x2F;&gt;b: bool = true;&lt;br&#x2F;&gt;c: int = add a b;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Boolean&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Cannot assign the output of boolean to an integer.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;b: bool = true;&lt;br&#x2F;&gt; a:int = not b;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Const&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Cannot assign a boolean const to an int variable.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;a: int = const true;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Cond Branch&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Only takes bool variable as input (and 2 labels).&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;a: int = 1;&lt;br&#x2F;&gt; br a here there;&lt;br&#x2F;&gt; here: print a;&lt;br&#x2F;&gt;there: jmp here;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Finally, even though this is not directly related to type checking, we also implemented other checks that apply to any input that bril2json can parse. For example, &lt;code&gt;v0: boolean = const true;&lt;&#x2F;code&gt; is a legal statement for bril2json, but &lt;code&gt;boolean&lt;&#x2F;code&gt; is not an existing type in Bril, so our type checker reports an error. The table below lists all these cases:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Instruction&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Checking Rule&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;Testing Code Snippet&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Type existence&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;The destination type is undefined.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;v0: boolean = const true;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Argument Number&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;The operation expects 2 arguments but only 1 is given.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;v0: int = const 5;&lt;br&#x2F;&gt; v1: int = add v0;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Argument Existence&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;The argument is never defined.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;v1: int = add v0;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Label Existence&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;Label argument in control operation not present in code.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;jmp its_a_trap; (consider this the full program)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Repeated Label&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;A label should be unique and not be repeated.&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;jmp label;&lt;br&#x2F;&gt; label: a:int=1;&lt;br&#x2F;&gt; label: a:int =2&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;By and large, we have implemented the checker satisfying all of our defined behaviors. But we don&#x27;t know if that&#x27;s exhaustive for all possible errors (not necessarily type errors). We would be very happy if someone comes up with more cases and reaches out to us by mail or GitHub issues.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Bril Syntax Highlighting for Vim</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/vim-syntax-highlighting/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/vim-syntax-highlighting/</guid>
                <description>&lt;p&gt;This project aimed to provide syntax highlighting for Bril in the Vim text editor, with the goal of learning about the implementation process underlying this ubiquitous category of tools. Until now I&#x27;ve taken this tooling for granted across various editing environments and programming languages, so I felt that the ability to support my own language design efforts might prove useful and interesting.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;blog&#x2F;vim-syntax-highlighting&#x2F;bril-syntax.png&quot; alt=&quot;&quot; &#x2F;&gt; &lt;&#x2F;p&gt;
&lt;p&gt;As the project evolved, I quickly discovered the task of &lt;em&gt;syntax highlighting&lt;&#x2F;em&gt; to be more open-ended than expected. Ideally, we want the appearance of program text to reflect syntactic structure, but:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Maintaining a constantly changing syntax tree for an entire program can be slow, and&lt;&#x2F;li&gt;
&lt;li&gt;The nature of editing is that program text does not always represent a well-formed syntax tree.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;This fundamental limitation is acknowledged at the start of Vim&#x27;s documentation for &lt;code&gt;syntax&lt;&#x2F;code&gt;, the collection of syntax highlighting commands:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Syntax highlighting enables Vim to show parts of the text in another font or color. Those parts can be specific keywords or text matching a pattern.  Vim doesn&#x27;t parse the whole file (to keep it fast), so the highlighting has its limitations.  Lexical highlighting might be a better name, but since everybody calls it syntax highlighting we&#x27;ll stick with that.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Even though &lt;em&gt;syntax&lt;&#x2F;em&gt; highlighting implies the output of a parsing operation, the reality is closer to &lt;em&gt;lexing&lt;&#x2F;em&gt;. However, Vim&#x27;s powerful regular expressions, in conjunction with the features available through some &lt;code&gt;syntax&lt;&#x2F;code&gt; commands, facilitate highlighting that appears more complex than simply highlighting tokens.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;anatomy-of-a-vim-syntax-highlighter&quot;&gt;Anatomy of a Vim Syntax Highlighter&lt;&#x2F;h3&gt;
&lt;p&gt;The basic structure of a Vim syntax highlighter is designed to separate the concerns of textual appearance from textual extraction of syntactic units. 
A colorscheme will map a language&#x27;s &lt;em&gt;syntax groups&lt;&#x2F;em&gt; to a set of generic &lt;em&gt;highlight groups&lt;&#x2F;em&gt; provided by Vim. 
Each highlight group is named after a generic syntax unit, such as &lt;code&gt;Comment&lt;&#x2F;code&gt; or &lt;code&gt;Identifier&lt;&#x2F;code&gt;, and defines its appearance. 
Thus, the development efforts of language and colorscheme designers are independent, as seen in the implementation of &lt;code&gt;bril-syntax&lt;&#x2F;code&gt;: &lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;highlight default link brilComment Comment
highlight default link brilLabel Label
highlight default link brilVariable Identifier
highlight default link brilMain Function
highlight default link brilType Type
highlight default link brilValueOp Operator
highlight default link brilEffectOp Keyword
highlight default link brilNumber Number
highlight default link brilBool Boolean
highlight default link brilCondVariable Boolean
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;My task was simply to define each of the above &lt;code&gt;bril-&lt;&#x2F;code&gt; syntax groups.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s interesting to observe that although this scheme works well for most programming languages, it does not work for any arbitrary formal language, since the baseline set of highlight groups makes an assumption about the &#x27;top&#x27; level of syntactic categories. From the &lt;code&gt;syntax&lt;&#x2F;code&gt; docs:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;... a syntax group and a highlight group are similar. For a highlight group you will have given highlight attributes. These attributes will be used for the syntax group with the same name.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;To resolve this, one might create new highlight groups, as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;highlight MyHighlightGroup gui=bold ctermbg=NONE cterm=bold ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, doing so conflates the roles of language designer and colorscheme designer. To maintain this separation of concerns for more exotic languages, a new set of highlight groups would need to be exposed for colorscheme designers.&lt;&#x2F;p&gt;
&lt;p&gt;Since Bril&#x27;s syntax fits nicely into the base set of highlight groups, this was not a problem. 
I&#x27;ll admit, though, that I really wanted to italicize at least one syntax group, and although it&#x27;s possible to minimally override a colorscheme in this way, any implementation is error prone, possibly version-dependent, and hard to understand and maintain (as is much of &lt;strong&gt;Vimscript&lt;&#x2F;strong&gt;).
Thus, I focused my efforts on designing the needed collection of &lt;code&gt;bril-&lt;&#x2F;code&gt; syntax groups.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;syntax-groups-and-regions&quot;&gt;Syntax Groups and Regions&lt;&#x2F;h4&gt;
&lt;p&gt;The most basic syntax group is a keyword, defined using the &lt;code&gt;syntax keyword&lt;&#x2F;code&gt; command.
It takes two arguments: a name for the syntax group and a set of language keywords.
For example, value ops are defined as follows (ignore the &lt;code&gt;contained&lt;&#x2F;code&gt; option, for now):&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;syntax keyword brilValueOp contained
  \ id
  \ const
  \ add
  ...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At the next level of generality, we can define syntax groups using regular expressions via the &lt;code&gt;syntax match&lt;&#x2F;code&gt; command. For example, Bril comments are defined as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;syntax match brilComment &amp;quot;\#.*$&amp;quot;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Vim provides one final level of generality with the notion of &lt;em&gt;syntax regions&lt;&#x2F;em&gt;.
A syntax region is a region of text, delimited by regular expressions on both sides:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;syntax region regionName start=startRegexp end=endRegexp contains=synGroup1,synGroup2,...
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When the &lt;em&gt;start&lt;&#x2F;em&gt; expression is detected, only those syntax groups &lt;em&gt;contained&lt;&#x2F;em&gt; in a region are checked. 
The &lt;code&gt;contained&lt;&#x2F;code&gt; option removes top-level visibility from a syntax group, so that it is only parsed when its parent syntax region is parsed. 
Together, these two mechanisms allow for hierarchical parsing of &#x27;syntax-tree-like&#x27; syntax groups. 
Besides its effectiveness as a design pattern for organizing syntax groups, regions made it easier to identify the branch condition as a &lt;code&gt;brilCondVariable&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;syntax region brilBranchInstr start=&amp;#39;br&amp;#39; end=&amp;#39;;&amp;#39; 
  \ oneline contained contains=brilCondVariable,brilVariable,brilEffectOp
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h4 id=&quot;incremental-parsing-via-sync-points&quot;&gt;Incremental Parsing via &lt;code&gt;sync&lt;&#x2F;code&gt; points&lt;&#x2F;h4&gt;
&lt;p&gt;When scrolling through a file or making an edit, Vim needs to figure out the most fitting syntax groups in the corresponding line. Since syntax groups may lie within a syntax region, Vim needs to find the most accurate &lt;em&gt;syntax state&lt;&#x2F;em&gt; for the new line:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Vim wants to be able to start redrawing in any position in the document.  To make this possible it needs to know the syntax state at the position where redrawing starts.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A robust, but slow syntax highlighter can recalculate syntax state across the entire file as needed. 
To improve performance, Vim introduces mechanisms for defining a &lt;em&gt;sync point&lt;&#x2F;em&gt; around which syntax state is remembered.
Vim allows for defining a sync point relative to screen and cursor details, including the cursor line number, the lines that have currently been drawn on screen, and user defined look-behind parameters. 
Additionally, Vim provides a mechanism for locally &#x27;guessing&#x27; the current syntax region using regular expressions as hints. Although this requires parsing several lines, it may be preferable to parsing the entire file.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;bril-syntax&lt;&#x2F;code&gt; uses &lt;code&gt;syntax sync fromstart&lt;&#x2F;code&gt;, which, as the name implies, sets the sync point at the start of the file. Thus, the entire file is parsed with each new line.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;The correctness of &lt;code&gt;bril-syntax&lt;&#x2F;code&gt; can be defined by its ability to correctly highlight well-formed Bril code. 
To this end, I highlighted a small Bril file (&lt;code&gt;test.bril&lt;&#x2F;code&gt;) containing all base Bril constructs and observed consistent coloring of syntax groups. 
Armed with a &lt;a href=&quot;https:&#x2F;&#x2F;stackoverflow.com&#x2F;questions&#x2F;9464844&#x2F;how-to-get-group-name-of-highlighting-under-cursor-in-vim&quot;&gt;helpful Vim script&lt;&#x2F;a&gt;, I was able to manually confirm that &lt;code&gt;bril-syntax&lt;&#x2F;code&gt; correctly identifies the nested syntax group structures at every cursor position in &lt;code&gt;test.bril&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A quantitative evaluation of a syntax highlighter measures its performance in terms of resource utilization under worst-case stress loads, such as when recomputing large, nested syntax regions. The largest syntax region defined in &lt;code&gt;bril-syntax&lt;&#x2F;code&gt; spans an entire function definition and contains syntax regions representing the different kinds of instructions. 
I tested performance of &lt;code&gt;bril-syntax&lt;&#x2F;code&gt; on an Intel Core i7-6700 by repeatedly destroying and reconstructing the function syntax region on a Bril file containing ~200k instructions, all while observing &lt;code&gt;htop&lt;&#x2F;code&gt; in a &lt;code&gt;tmux&lt;&#x2F;code&gt; panel. This load had an unnoticeable effect on CPU and memory usage.&lt;&#x2F;p&gt;
&lt;p&gt;A qualitative evaluation observes how smoothly and effectively &lt;code&gt;bril-syntax&lt;&#x2F;code&gt; integrates into a Bril workflow. From the results of the quantitative evaluation, it is unsurprising to find that highlighting when scrolling, adding labels and instructions, and destroying and reconstructing syntax regions is immediate, without slowing down file navigation. 
If Bril were heavily used, a user study could assess how the particular choice and arrangement of syntax groups affects usability. For example, in the current &lt;code&gt;bril-syntax&lt;&#x2F;code&gt; implementation, the syntax groups contained in instruction regions, such as op names, are only highlighted once a semicolon is typed; it&#x27;s unclear whether this peculiarity is significant. In either case, changing it would not be difficult.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;limitations-of-vimscript&quot;&gt;Limitations of Vimscript&lt;&#x2F;h4&gt;
&lt;p&gt;Despite its powerful features for syntax highlighting, Vimscript itself is rife with language issues that severely hinder its reliability and maintainability.&lt;&#x2F;p&gt;
&lt;p&gt;References to old variable bindings remain active between executions of a script, a fact I often only realized after restarting Vim.&lt;&#x2F;p&gt;
&lt;p&gt;Regular expressions are powerful, but possibly too powerful; for instance, the &lt;code&gt;\ze&lt;&#x2F;code&gt; regex atom matches the regexes on both sides of it, keeping only the match of the left one. Such a regex is easy to express, but is computationally inefficient, potentially encouraging poor implementations.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, &lt;a href=&quot;http:&#x2F;&#x2F;learnvimscriptthehardway.stevelosh.com&#x2F;chapters&#x2F;28.html&quot;&gt;metaprogramming&lt;&#x2F;a&gt; is idiomatic in Vimscript. 
This is useful, since it allows us to factor out and reuse regexes from syntax group definitions into meta-strings. 
However, since Vimscript makes a distinction between &lt;em&gt;strings&lt;&#x2F;em&gt; and &lt;em&gt;literal strings&lt;&#x2F;em&gt;, reasoning about &amp;quot;meta-strings&amp;quot; and &amp;quot;meta-literal-strings&amp;quot; can lead to hard to detect bugs.&lt;&#x2F;p&gt;
&lt;p&gt;Semantically, a Vimscript string consists of a list of characters and characters &#x27;escaped&#x27; by a backslash (hence &lt;code&gt;&quot;\&quot;&lt;&#x2F;code&gt; is an invalid string; it is a malformed character escape).
On the other hand, a literal string simply consists of the literal characters in the string.
If we attempt to factor out a regexp as a string, any escaped characters will be stored as &#x27;escaped&#x27;, rather than as characters.
Splicing this faulty regexp into a program string will generate a faulty program string containing escaped characters.&lt;&#x2F;p&gt;
&lt;p&gt;We can solve this by &lt;em&gt;escaping each &#x27;escaping&#x27; backslash with a backslash, except for those backslashes that escape backslashes.&lt;&#x2F;em&gt; (It&#x27;s no wonder the leading Vimscript tutorial recommends &lt;a href=&quot;http:&#x2F;&#x2F;learnvimscriptthehardway.stevelosh.com&#x2F;chapters&#x2F;08.html&quot;&gt;beer&lt;&#x2F;a&gt; to accompany some of the exercises.)
To avoid backslash hell, we can instead store regexps as literal strings, which can never escape characters.
Later on, these can be spliced with program fragments that are also represented as string literals, rather than strings, to avoid escaping any hard-coded regexps.&lt;&#x2F;p&gt;
&lt;p&gt;Once I dealt with the frustrations and mysterious bugs, I found Vimscript to be pretty fun, in part due to its flexibility,
and I enjoyed pondering the feedback problem of language design for language tooling.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;design-considerations-and-future-work&quot;&gt;Design Considerations and Future Work&lt;&#x2F;h3&gt;
&lt;p&gt;As I&#x27;ve described it, a syntax highlighter&#x27;s principal job is to give a meaningful visual presentation to an evolving syntax tree.
In reality, developers have come to expect much more from a syntax highlighter, including:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Syntactically-aware code editing and navigation, such as automatic indentation, folding, variable renaming, etc.&lt;&#x2F;li&gt;
&lt;li&gt;Type checking and type information lookup&lt;&#x2F;li&gt;
&lt;li&gt;Code linting&lt;&#x2F;li&gt;
&lt;li&gt;etc.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;All of these tasks benefit from Vimscript&#x27;s prowess at text manipulation and syntax extraction, and represent
additional opportunities to explore different concerns in the space of language design for language tooling.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s unfortunate that these efforts benefit only Vim users; an editor-agnostic approach would be ideal.
&lt;a href=&quot;https:&#x2F;&#x2F;microsoft.github.io&#x2F;language-server-protocol&#x2F;&quot;&gt;LSP&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;tree-sitter.github.io&#x2F;tree-sitter&#x2F;&quot;&gt;Tree-sitter&lt;&#x2F;a&gt; provide editor-agnostic languages to support these tasks, though a cursory reading of the state-of-the-art reveals that no current tool can accomplish all syntax-sensitive editing features adequately.&lt;&#x2F;p&gt;
&lt;p&gt;For now, I hope you enjoy &lt;code&gt;bril-syntax&lt;&#x2F;code&gt;!&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Vril! Vector Bril</title>
                <pubDate>Wed, 25 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/vril-vector-bril/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/vril-vector-bril/</guid>
                <description>&lt;h2 id=&quot;why-vectors-are-cool-to-have&quot;&gt;Why Vectors are cool to have!&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;high-level-idea&quot;&gt;High-level idea&lt;&#x2F;h3&gt;
&lt;p&gt;In principle, vectors are just one-dimensional arrays. An array is a collection of elements that can be accessed by index. However, there are a lot of applications where you perform the same operation on each and every element of an array. This is called data-level parallelism (DLP). Some architectures exploit DLP by encapsulating several operations on different elements of one array into one vector operation. By doing so, they relieve pressure on the front-end of the processor, because they fetch and decode only once and can then execute many identical operations concurrently. One of the challenges is for workloads to exhibit sufficient data-level parallelism to achieve full utilization.&lt;&#x2F;p&gt;
&lt;p&gt;There are three popular ways to program for vector architectures, a.k.a. &lt;em&gt;vectorization&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;manual vectorization: Explicitly using vector operations in assembly or intrinsics.&lt;&#x2F;li&gt;
&lt;li&gt;user annotations or &lt;a href=&quot;https:&#x2F;&#x2F;info.ornl.gov&#x2F;sites&#x2F;publications&#x2F;files&#x2F;Pub69214.pdf&quot;&gt;pragmas&lt;&#x2F;a&gt;: Help the compiler find the &lt;em&gt;vectorizable&lt;&#x2F;em&gt; regions of code and inform the compiler about lack of dependences and other situations that would typically restrict the vectorization.&lt;&#x2F;li&gt;
&lt;li&gt;auto-vectorization: Rely directly on the compiler vectorizer. In the case of &lt;a href=&quot;https:&#x2F;&#x2F;www.gnu.org&#x2F;software&#x2F;gcc&#x2F;projects&#x2F;tree-ssa&#x2F;vectorization.html&quot;&gt;gcc&lt;&#x2F;a&gt;, it is typically enabled by the flag &lt;code&gt;-ftree-vectorize&lt;&#x2F;code&gt; and by default at &lt;code&gt;-O3&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Manual vectorization allows explicit control, but usually becomes architecture specific and assumes vector-like programming from the programmer.&lt;&#x2F;p&gt;
&lt;p&gt;Auto-vectorization aims to improve the programmer&#x27;s productivity by performing a compiler pass to automatically generate vector instructions in a given program. However, in practice help from the programmer is needed to achieve competitive vectorized code, either through fully manual vectorization or through user annotations.&lt;&#x2F;p&gt;
&lt;p&gt;An additional benefit of auto-vectorization and most user annotations is decoupling the program specification and underlying execution which is architecture-specific. The vectorizable source will be portable to different architectures and the compiler will dictate if it can be vectorized or how well it can exploit the program&#x27;s DLP in order to generate the architecture-specific vector instructions.&lt;&#x2F;p&gt;
&lt;p&gt;This project aims to provide such general vector specifications at the compiler level, which:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;permit intrinsics or vector assembly instructions to be translated into Bril IR&lt;&#x2F;li&gt;
&lt;li&gt;permit an architecture-specific backend to easily generate an executable&lt;&#x2F;li&gt;
&lt;li&gt;naturally expose opportunities for vector optimizations in Bril&lt;&#x2F;li&gt;
&lt;li&gt;offer the compiler&#x27;s (automatic) vectorizer the possibility of generating Bril IR vector operations that can later be mapped into ISA vector instructions&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;link-to-project&quot;&gt;Link to project&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sa2257&#x2F;vril&quot;&gt;Vril&lt;&#x2F;a&gt; is our public repository.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;arrays&quot;&gt;Arrays&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;arrays-in-vril&quot;&gt;Arrays in Vril&lt;&#x2F;h3&gt;
&lt;p&gt;Vril adds a new type &lt;code&gt;array&lt;&#x2F;code&gt; to the language. An array in Vril is a sequence of elements of any type, including arrays (no type checking is carried out), and its length can be any Bril literal.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;initializing-arrays-in-vril&quot;&gt;Initializing arrays in Vril&lt;&#x2F;h3&gt;
&lt;p&gt;Vril extends Bril to support arrays by adding an &lt;code&gt;init&lt;&#x2F;code&gt; operator to initialize an array of length &lt;code&gt;l&lt;&#x2F;code&gt; and set its elements to 0 as follows:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;array_name : array = init l
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Vril adds array operations to Bril of the form,&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;variable: type = aop args
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;supported-array-operations&quot;&gt;Supported array operations&lt;&#x2F;h3&gt;
&lt;p&gt;Vril can operate on arrays in two ways: 1) we can perform &lt;em&gt;scalar&lt;&#x2F;em&gt; accesses to arrays by moving one element of the array to a variable and performing any arithmetic or logic operation that is already supported in the original Bril, and 2) we can perform &lt;em&gt;vector&lt;&#x2F;em&gt; accesses to arrays by using new vector operations that take arrays as arguments directly.&lt;&#x2F;p&gt;
&lt;p&gt;To move data from an element of an array &lt;code&gt;arr[idx]&lt;&#x2F;code&gt; to a variable &lt;code&gt;var&lt;&#x2F;code&gt;, we can use the &lt;code&gt;a2v&lt;&#x2F;code&gt; op:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;var: int = a2v array_name idx
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Similarly, we can move a value stored in a variable &lt;code&gt;var&lt;&#x2F;code&gt; into one of the array&#x27;s elements &lt;code&gt;arr[idx]&lt;&#x2F;code&gt; using the &lt;code&gt;v2a&lt;&#x2F;code&gt; op:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;array_name: int = v2a var idx
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Constants are inserted into arrays through variables, just as in the original Bril.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;vector-ops&quot;&gt;Vector ops&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;vector-ops-in-vril&quot;&gt;Vector ops in Vril&lt;&#x2F;h3&gt;
&lt;p&gt;There are two types of vector ops in Vril: configuration ops and arithmetic&#x2F;logic ops. Vector arithmetic or logic operations take arrays as arguments and produce an array or a scalar value as a result. Configuration ops match the amount of data parallelism the program needs to the resources that the micro-architecture offers. For example, in a for-loop that has no inter-iteration dependences, we could potentially run all iterations in parallel if we had enough hardware resources. Otherwise, we would run the iterations in groups, each no larger than what the hardware can handle in parallel.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;adding-vector-ops-to-vril&quot;&gt;Adding vector ops to Vril&lt;&#x2F;h3&gt;
&lt;p&gt;For now there is only one configuration op: &lt;code&gt;setvl&lt;&#x2F;code&gt;, which is based on the &lt;a href=&quot;https:&#x2F;&#x2F;github.com&#x2F;riscv&#x2F;riscv-v-spec&quot;&gt;RISC-V V vector extension&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;vl: int = setvl val
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The op requests &lt;code&gt;val&lt;&#x2F;code&gt; elements from the hardware and returns &lt;code&gt;vl&lt;&#x2F;code&gt;, the actual number of elements that the hardware can support. The pseudo-code of this instruction is:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;vl := (val &amp;lt;= max_vl)? val:max_vl;
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;max_vl&lt;&#x2F;code&gt; is the maximum number of lanes or array elements that the micro-architecture can process in parallel.&lt;&#x2F;p&gt;
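&lt;p&gt;As a minimal sketch of these semantics in Python (this is not the Vrili source; the function name and the hardware limit of 4 lanes are assumptions chosen for illustration), &lt;code&gt;setvl&lt;&#x2F;code&gt; simply clamps the request to what the hardware offers:&lt;&#x2F;p&gt;

```python
# Hypothetical hardware limit; a real backend would define its own value.
MAX_VL = 4

def setvl(val, max_vl=MAX_VL):
    """Return the granted vector length: the request, capped at max_vl."""
    return min(val, max_vl)

print(setvl(3))   # the request fits, so 3 elements are granted
print(setvl(10))  # the request exceeds the limit, so only 4 are granted
```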
&lt;p&gt;We have implemented only one arithmetic vector op, &lt;code&gt;vadd&lt;&#x2F;code&gt;, as a proof of concept:&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
&lt;span style=&quot;color:#323232;&quot;&gt;arr3: array = vadd arr1 arr2 idx
&lt;&#x2F;span&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This op takes two arrays as arguments, performs element-wise addition on &lt;code&gt;vl&lt;&#x2F;code&gt; elements starting at element &lt;code&gt;idx&lt;&#x2F;code&gt;, and stores the results into a third array. For now, &lt;code&gt;arr1&lt;&#x2F;code&gt;, &lt;code&gt;arr2&lt;&#x2F;code&gt;, and &lt;code&gt;arr3&lt;&#x2F;code&gt; may have different lengths: the operation is valid as long as the elements at indices &lt;code&gt;idx:idx+vl-1&lt;&#x2F;code&gt; are valid for all three arrays (two sources and destination).&lt;&#x2F;p&gt;
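&lt;p&gt;The behavior described above can be sketched in Python as follows. This is an illustration under assumptions, not the interpreter code: Python lists stand in for Vril arrays, and &lt;code&gt;vl&lt;&#x2F;code&gt; is passed explicitly rather than read from vector state.&lt;&#x2F;p&gt;

```python
def vadd(arr1, arr2, arr3, idx, vl):
    """Store arr1[i] + arr2[i] into arr3[i] for the vl elements starting at idx."""
    for i in range(idx, idx + vl):
        arr3[i] = arr1[i] + arr2[i]
    return arr3

a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50, 60]  # the arrays may have different lengths
c = [0] * 4
vadd(a, b, c, idx=1, vl=3)
print(c)  # [0, 22, 33, 44]
```

&lt;p&gt;Indices 1 through 3 exist in all three lists, so the call is valid even though the lengths differ.&lt;&#x2F;p&gt;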
&lt;h3 id=&quot;other-types-on-arrays&quot;&gt;Other types on arrays&lt;&#x2F;h3&gt;
&lt;p&gt;We have not implemented any type checking on arrays. We expect the programmer to use only integer values as array elements, but no error may be thrown if other types are stored. Vector arithmetic or logic operations have undefined behavior in these cases. We leave type checking for future work.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;simulating-vector-ops-in-vrili-vril-interpreter&quot;&gt;Simulating vector ops in Vrili (Vril interpreter)&lt;&#x2F;h3&gt;
&lt;p&gt;We have extended Brili to interpret vector operations.
Note that Vrili does not simulate a vector processor; it lets us evaluate the functionality of Vril and extract performance information.
Vrili loops over the elements of the array operands to produce the correct result. This functionally emulates what a vector microarchitecture would do concurrently with a single fetch-decode effort.
However, Vrili still assumes the same underlying machine as Brili. It does so by generating a separate variable for each array element, i.e., an array &lt;code&gt;arr&lt;&#x2F;code&gt; of size 2 will generate variables &lt;code&gt;arr[0]&lt;&#x2F;code&gt; and &lt;code&gt;arr[1]&lt;&#x2F;code&gt;. Subsequent array element accesses are simply variable accesses. Like Brili, Vrili does not impose any access checking, so the programmer should take care to access only array elements defined by array initialization.&lt;&#x2F;p&gt;
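&lt;p&gt;The per-element variable scheme can be illustrated with a small Python sketch. A dictionary plays the role of the interpreter environment here; the helper names mirror the Vril ops but are hypothetical, and the real Vrili is not written this way.&lt;&#x2F;p&gt;

```python
def init_array(env, name, length):
    """Create one zero-initialized variable per element, e.g. arr[0], arr[1]."""
    for i in range(length):
        env[f"{name}[{i}]"] = 0

def v2a(env, name, var, idx):
    """Copy the value of scalar variable var into element idx of the array."""
    env[f"{name}[{idx}]"] = env[var]

def a2v(env, name, idx):
    """Read element idx of the array as an ordinary variable lookup."""
    return env[f"{name}[{idx}]"]

env = {"x": 7}
init_array(env, "arr", 2)
v2a(env, "arr", "x", 1)
print(sorted(env))         # the scalar and the array elements share one namespace
print(a2v(env, "arr", 1))  # reads back the value stored from x
```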
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;Vril makes two major additions to Bril: arrays and vector operations. This section describes how we evaluate each of these.&lt;&#x2F;p&gt;
&lt;p&gt;We take three approaches to evaluating arrays and the vector operation &lt;code&gt;vadd&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;evaluating integration to Bril and qualitative justification&lt;&#x2F;li&gt;
&lt;li&gt;evaluating functionality&lt;&#x2F;li&gt;
&lt;li&gt;evaluating performance&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;qualitative-evaluation-by-inspection&quot;&gt;Qualitative evaluation by inspection.&lt;&#x2F;h3&gt;
&lt;p&gt;We ran tests that generate the JSON representation of Bril from the text format and vice versa to check that the array specification and vector operations are expressible in both available formats. This also allowed us to manually inspect the Bril IR. These tests reside in the &lt;code&gt;vtest&#x2F;parse&lt;&#x2F;code&gt; and &lt;code&gt;vtest&#x2F;print&lt;&#x2F;code&gt; directories.&lt;&#x2F;p&gt;
&lt;p&gt;Arrays provide a convenient way to use loops in Bril. To add two groups of 10 numbers, conventional Bril required 43 lines of code, 20 of which define the 20 numbers and 10 of which add them. Vril required 50 lines of code, 20 of which still define the two arrays element by element and 6 of which add them in a loop body. As the size of the arrays grows, conventional Bril needs correspondingly more lines of code, whereas Vril still uses 6 lines of code in the loop body.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;functional-evaluation-of-arrays-and-vadd&quot;&gt;Functional evaluation of arrays and &lt;code&gt;vadd&lt;&#x2F;code&gt;.&lt;&#x2F;h3&gt;
&lt;p&gt;We first wrote a simple test that initializes an array, stores a value to the array, and reads a value from the array.
A second test described a loop that adds elements of two arrays and saves each result in a third array.
Both tests generated the expected result from the interpreter. These tests reside in the &lt;code&gt;vtest&#x2F;interp&lt;&#x2F;code&gt; directory.&lt;&#x2F;p&gt;
&lt;p&gt;We wrote a program with &lt;code&gt;vadd&lt;&#x2F;code&gt; and ran it in &lt;code&gt;vrili&lt;&#x2F;code&gt; with different values for &lt;code&gt;maxvl&lt;&#x2F;code&gt;, which represents the vector length of the backend. The test generated the expected results. It is included in &lt;code&gt;vtest&#x2F;interp&#x2F;array_vector.bril&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;performance-of-vadd-with-arrays&quot;&gt;Performance of &lt;code&gt;vadd&lt;&#x2F;code&gt; with arrays.&lt;&#x2F;h3&gt;
&lt;p&gt;The initial motivation for Vril is to reduce hop counts (the number of basic blocks traversed) by adding vector operators to Bril. We test this by comparing the lines of code, dynamic instruction count, and hop count for a loop executed in a scalar manner and as a vector operation. We assume an architectural vector length greater than 10 for this experiment.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Program&lt;&#x2F;th&gt;&lt;th&gt;Lines of code&lt;&#x2F;th&gt;&lt;th&gt;Dynamic instructions&lt;&#x2F;th&gt;&lt;th&gt;Hop count&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Conventional Bril&lt;&#x2F;td&gt;&lt;td&gt;43&lt;&#x2F;td&gt;&lt;td&gt;41&lt;&#x2F;td&gt;&lt;td&gt;1&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Scalar Array&lt;&#x2F;td&gt;&lt;td&gt;50&lt;&#x2F;td&gt;&lt;td&gt;109&lt;&#x2F;td&gt;&lt;td&gt;12&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Vector Array&lt;&#x2F;td&gt;&lt;td&gt;48&lt;&#x2F;td&gt;&lt;td&gt;44&lt;&#x2F;td&gt;&lt;td&gt;3&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Given that the length of the arrays is N=10 and the vector length selected by Vrili at execution is k, the scalar loop iterates N times and the vector loop iterates N&#x2F;k times. Adding the start and end blocks shown in the figure gives the hop counts in the table: 10+2=12 for the scalar version and, with k=10, 1+2=3 for the vector version.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th align=&quot;center&quot;&gt;CFG for scalar array operation&lt;&#x2F;th&gt;&lt;th align=&quot;center&quot;&gt;CFG for vector array operation&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td align=&quot;center&quot;&gt;&lt;img src=&quot;array_scalar_nv.png&quot; style=&quot;width: 100%&quot;&gt;&lt;&#x2F;td&gt;&lt;td align=&quot;center&quot;&gt;&lt;img src=&quot;array_vector_nv.png&quot; style=&quot;width: 100%&quot;&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
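&lt;p&gt;The hop counts in the table can be reproduced with a simple model, sketched here in Python. It assumes one hop per loop iteration plus two hops for the start and end blocks, and it rounds the iteration count up when k does not divide N; the function name and the &lt;code&gt;extra_blocks&lt;&#x2F;code&gt; parameter are our own illustrative choices.&lt;&#x2F;p&gt;

```python
import math

def hop_count(n, k=1, extra_blocks=2):
    """Loop iterations for n elements processed k at a time, plus start/end blocks."""
    return math.ceil(n / k) + extra_blocks

print(hop_count(10))        # scalar loop: 10 iterations + 2 = 12
print(hop_count(10, k=10))  # one vector iteration + 2 = 3
print(hop_count(10, k=4))   # three vector iterations + 2 = 5
```

&lt;p&gt;The last call shows what the model predicts for a vector length shorter than the array, a case the table does not measure.&lt;&#x2F;p&gt;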
&lt;p&gt;As the code inspection showed, array initialization carries a sizable overhead in lines of code. However, scalar arrays allow a concise representation of the computation, albeit with an overhead in dynamic instructions. Using vector operations on arrays reduces these dynamic instructions by reducing the number of loop iterations (reflected in lower hop counts), i.e., by doing more computation with one instruction.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;We have extended Bril to support array types. We have added two new operations that move data in and out of arrays, so that we can emulate data movement between an array and what would be a scalar register. We have also extended Bril to support vector operations of two types: configuration and arithmetic. Configuration operations allow programs to modify vector state, and arithmetic operations perform operations on array arguments.&lt;&#x2F;p&gt;
&lt;p&gt;The goal of this exercise was to understand how much an IR needs to change in order to express data-parallel operations. To that end, we extended Bril to express vector operations and compared their potential against a traditional scalar set of operations. We wrote a benchmark for vector-vector add (vvadd), which adds the elements of two arrays and stores the results into a third array, in two versions: scalar code and vector code. The end goal was to verify that the CFGs generated for the two versions are very similar: they should contain the same number of basic blocks (BBs). However, the vector code traverses the BBs involved in the &lt;code&gt;for-loop&lt;&#x2F;code&gt; statement &lt;em&gt;vector length&lt;&#x2F;em&gt; times fewer than the scalar code.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Measuring Computer Systems is Almost Certainly Harder Than You Think</title>
                <pubDate>Wed, 04 Sep 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/measurement/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/measurement/</guid>
                <description>&lt;p&gt;Mytkowicz et al.’s &lt;a href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;citation.cfm?id=1508275&quot;&gt;“Producing Wrong Data Without Doing Anything Obviously Wrong!”&lt;&#x2F;a&gt; in ASPLOS 2009 is one of those papers that contains its own best summary.
Right there in Section 2, it says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Computer systems are sensitive: an insignificant and seemingly irrelevant change can dramatically affect the performance of the system.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The implication is profound:
even if you think reliably measuring computer systems is pretty hard,
it’s probably harder than you think.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-case-study-in-measurement&quot;&gt;A Case Study in Measurement&lt;&#x2F;h2&gt;
&lt;p&gt;In its most headline-worthy outcome,
this paper identifies two particularly irrelevant-seeming changes that can completely ruin a realistic experiment.
As a case study,
the authors set up a reasonable question that you can imagine wanting to answer empirically:
does &lt;a href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&quot;&gt;gcc&lt;&#x2F;a&gt;’s &lt;code&gt;-O3&lt;&#x2F;code&gt; optimization level &lt;em&gt;really&lt;&#x2F;em&gt; offer any improvement over &lt;code&gt;-O2&lt;&#x2F;code&gt;?&lt;&#x2F;p&gt;
&lt;p&gt;It should be easy to answer this question: you just need to compile some programs at both optimization levels and measure how long they take to run.
But that obvious approach makes a sweeping assumption: that your handful of measurements on those programs is representative of the &lt;em&gt;general concept&lt;&#x2F;em&gt; of compiling with different optimization levels.&lt;&#x2F;p&gt;
&lt;p&gt;Of course it’s not, you might say—you can’t possibly compile &lt;em&gt;all&lt;&#x2F;em&gt; the programs in the world, for example, so your choice of programs might certainly influence the results you see.
The best you can do is pick some standard, diverse, representative benchmarks and make the case that they’re reasonably representative.
Similarly, while a truly robust finding would require measuring &lt;em&gt;all the computers in the world&lt;&#x2F;em&gt; that someone might ever run optimized binaries on, that’s clearly infeasible—so you can do your job as a scrupulous scientist by measuring a few popular machines and hoping for the best.&lt;&#x2F;p&gt;
&lt;p&gt;The problem this paper identifies is that even a carefully designed experiment can go horribly wrong.
While benchmark and platform choice are clearly important factors, the paper finds that factors that &lt;em&gt;do not seem important&lt;&#x2F;em&gt; can be just as critical.
When setting up this &lt;code&gt;-O2&lt;&#x2F;code&gt; vs. &lt;code&gt;-O3&lt;&#x2F;code&gt; experiment, for example, would you have guessed that the order of arguments in the linking command matters—that is, these two commands might compile binaries that run at different speeds?&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
$ gcc foo.o bar.o -o bin
$ gcc bar.o foo.o -o bin
&lt;&#x2F;pre&gt;
&lt;p&gt;I certainly would never have guessed that this detail could possibly matter.
The paper shows that it does, and at least for this case, it’s every bit as important as the more obvious experimental design factors such as the choice of benchmark or the target CPU.&lt;&#x2F;p&gt;
&lt;p&gt;Specifically, the link order you choose can change your conclusion about whether &lt;code&gt;-O3&lt;&#x2F;code&gt; matters.
In the paper’s experiments, certain “lucky” linking orders make it seem like &lt;code&gt;-O3&lt;&#x2F;code&gt; performs 8% better than &lt;code&gt;-O2&lt;&#x2F;code&gt;, while other “unlucky” linking orders make it seem to perform 7% &lt;em&gt;worse&lt;&#x2F;em&gt;.
So if you ran this hypothetical experiment without trying multiple linking orders—and who would, honestly?—you could never know that your seemingly-solid results depend rigidly on a factor that has nothing to do with optimization levels at all.&lt;&#x2F;p&gt;
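&lt;p&gt;If you did want to try multiple linking orders, a small script could enumerate them for you. Here is a rough Python sketch, not the paper’s harness: the &lt;code&gt;link_commands&lt;&#x2F;code&gt; helper and its &lt;code&gt;sample&lt;&#x2F;code&gt; parameter are hypothetical names, and it only builds the command lines rather than running them.&lt;&#x2F;p&gt;

```python
import itertools
import random

def link_commands(objects, output="bin", sample=None, seed=0):
    """Build one gcc link command per ordering of the object files."""
    orders = list(itertools.permutations(objects))
    if sample is not None:
        # Pick a reproducible random subset when there are too many orderings.
        rng = random.Random(seed)
        orders = rng.sample(orders, min(sample, len(orders)))
    return [["gcc", *order, "-o", output] for order in orders]

for cmd in link_commands(["foo.o", "bar.o"]):
    print(" ".join(cmd))
```

&lt;p&gt;The number of orderings grows factorially with the number of object files, so for a real program you would want a sampling strategy that avoids materializing every permutation.&lt;&#x2F;p&gt;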
&lt;p&gt;I find the second confounding factor even more shocking.
Assuming the &lt;code&gt;bin&lt;&#x2F;code&gt; program does not ever read the &lt;code&gt;FOO&lt;&#x2F;code&gt; environment variable, would you expect these commands to run at different speeds?&lt;&#x2F;p&gt;
&lt;pre style=&quot;background-color:#ffffff;&quot;&gt;
$ FOO=BAR .&#x2F;bin
$ FOO=BAAAAAAAAAAAAAAAAAR .&#x2F;bin
&lt;&#x2F;pre&gt;
&lt;p&gt;This paper finds that they can.
Much like linking order, changing the size of Unix environment variables can cause large swings in execution time.
And these changes are &lt;em&gt;also&lt;&#x2F;em&gt; large enough to change the conclusion of the hypothetical &lt;code&gt;-O2&lt;&#x2F;code&gt; vs. &lt;code&gt;-O3&lt;&#x2F;code&gt; experiment.
Because usernames are part of the environment, this finding means that people with longer names might be more or less likely to observe benefits from compiler optimizations.&lt;&#x2F;p&gt;
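&lt;p&gt;To probe this effect yourself, you could run the same binary under environments that differ only in size. The Python sketch below builds such environments; the &lt;code&gt;padded_env&lt;&#x2F;code&gt; helper, the padding sizes, and the fixed &lt;code&gt;PATH&lt;&#x2F;code&gt; value are assumptions for illustration, not the setup the paper used.&lt;&#x2F;p&gt;

```python
import os

def padded_env(pad_bytes, base=None, name="FOO"):
    """Return a copy of the environment with a filler variable of pad_bytes characters."""
    env = dict(os.environ if base is None else base)
    env[name] = "A" * pad_bytes
    return env

# Each environment differs only in the length of the unused FOO variable.
envs = [padded_env(n, base={"PATH": "/usr/bin"}) for n in (0, 64, 4096)]
print([len(e["FOO"]) for e in envs])
# To time a run, one could pass an environment to subprocess.run(["./bin"], env=env).
```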
&lt;p&gt;The paper does a thorough investigation into &lt;em&gt;why&lt;&#x2F;em&gt; these strange performance effects arise, and they’re not all that surprising once you think of them:
environment variables can shift the starting stack address, for example, which can move data across cache lines, and CPUs care a lot about cache lines.
But these explanations are evidence in service of a larger point:
computer systems are complex enough that their behavior can certainly surprise you.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-model-evaluation&quot;&gt;A Model Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;Aside from its actual research conclusions, the “Producing Wrong Data!” paper serves a second useful purpose:
as a model of a solid empirical evaluation.
No evaluation is perfect, but this one does a lot of things right.
If you hold the &lt;a href=&quot;https:&#x2F;&#x2F;www.sigplan.org&#x2F;Resources&#x2F;EmpiricalEvaluation&#x2F;&quot;&gt;SIGPLAN empirical evaluation guidelines&lt;&#x2F;a&gt; in one hand and this paper in the other, you can find lots of common ground.
Here are some standard pieces of evaluation advice that this paper exemplifies:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Use a complete, standard benchmark suite—and justify your choice.&lt;&#x2F;em&gt;
This paper uses the most standard of standard suites, &lt;a href=&quot;https:&#x2F;&#x2F;www.spec.org&quot;&gt;SPEC&lt;&#x2F;a&gt;.
It’s careful to explain that it only uses the C benchmarks because Java compilers don’t have &lt;code&gt;-O2&lt;&#x2F;code&gt; and &lt;code&gt;-O3&lt;&#x2F;code&gt; optimization levels.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Collect lots of data, even if it takes a while.&lt;&#x2F;em&gt;
This evaluation required 5,940 executions each of 12 benchmarks.
For the slowest benchmark, collecting data took 12 days.
A few CPU-weeks or even CPU-months can definitely be worth it if they give you solid results.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Thoroughly explore the design space.&lt;&#x2F;em&gt;
The evaluation is not satisfied with just measuring one compiler on one system—the authors augment their main gcc results with Intel’s &lt;a href=&quot;https:&#x2F;&#x2F;software.intel.com&#x2F;en-us&#x2F;c-compilers&quot;&gt;icc&lt;&#x2F;a&gt;, a second CPU, and even an architectural simulator.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Plot error bars.&lt;&#x2F;em&gt;
It’s good to measure the average of multiple runs in any experiment, and it’s even better to report the standard error of the mean.
All of the scatter and bar charts in this paper have error bars.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Report entire distributions.&lt;&#x2F;em&gt;
Sometimes, error bars are not enough—it’s even more useful to give a complete depiction of the data distribution.
There are many ways to do this, but two good visualizations are violin plots and histograms.
This paper uses both.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Include details about your experimental setup.&lt;&#x2F;em&gt;
In any paper with an empirical evaluation, you have to include a complete description of the system you measured.
Table 2 in this paper gives a complete breakdown of the platforms in these experiments.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Always explain your axes.&lt;&#x2F;em&gt;
This paper contains several sentences that sound like “A point &lt;em&gt;(x,y)&lt;&#x2F;em&gt; says that when the UNIX environment size is &lt;em&gt;x&lt;&#x2F;em&gt; bytes, the execution time is &lt;em&gt;y&lt;&#x2F;em&gt; cycles on a Core 2 workstation.”
This description leaves absolutely no ambiguity about what the plot is telling us.
It’s easy to leave these descriptions off and make your data needlessly hard to interpret.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;em&gt;Dig deep to explain mysterious phenomena.&lt;&#x2F;em&gt;
If you find something weird in your experimental results, it can be tempting to blame amorphous “measurement error” and leave it alone.
Resist this temptation—look more closely and run more experiments to nail down exactly what’s happening.
This paper goes to heroic lengths to understand the microarchitectural reasons behind the outliers it found.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
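&lt;p&gt;For the error-bar advice in particular, the arithmetic is short enough to sketch. This Python snippet computes the mean and the standard error of the mean for a set of running times; the numbers are made up for illustration and do not come from the paper.&lt;&#x2F;p&gt;

```python
import math
import statistics

def mean_and_sem(samples):
    """Return the sample mean and the standard error of the mean."""
    mean = statistics.mean(samples)
    # Sample standard deviation divided by the square root of the sample count.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem

# Made-up running times in seconds, for illustration only.
times = [10.0, 12.0, 11.0, 13.0]
mean, sem = mean_and_sem(times)
print(f"{mean:.2f} +/- {sem:.2f} s")
```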
&lt;h2 id=&quot;lessons-we-re-still-learning&quot;&gt;Lessons We’re Still Learning&lt;&#x2F;h2&gt;
&lt;p&gt;This paper is from 2009.
I last read it a long time ago, and upon re-reading it, I was saddened but not surprised to discover how many of its lessons still feel under-appreciated today.&lt;&#x2F;p&gt;
&lt;p&gt;Underlying the primary message about sensitivity is a secondary call to action:
computer systems researchers need to get serious about statistics.
It’s possible to imagine that, long ago, computers were simple enough that experts could reasonably predict how they would behave.
Those days are gone.
While we should control as many factors as we can, we need to treat computer systems as unknowable, potentially hostile environments that actively try to spoil our experiments.
And one critical tool for rationalizing a complex, uncontrollable environment is good statistical analysis.
“Real” sciences like biology and chemistry seem to be more comfortable with this concept, but computer science lacks a strong tradition of rigorous statistics.&lt;&#x2F;p&gt;
&lt;p&gt;The best example of statistical thinking in this paper is embodied in the violin plots that show distributions of running times.
The key insight here is that it is &lt;em&gt;possible&lt;&#x2F;em&gt; to think of the different system treatments in these experiments as probability distributions.
You can imagine selecting a linking order uniformly at random from all the possible orders, for example, and sampling the execution time as a random variable.
It’s obvious in retrospect, but it wasn’t &lt;em&gt;necessary&lt;&#x2F;em&gt; for the authors to treat this data as random—after all, nobody &lt;em&gt;really&lt;&#x2F;em&gt; uses a random number generator to choose their linking order.
But viewing the data through a statistical lens helps give a complete picture of the confounding factor’s influence.&lt;&#x2F;p&gt;
&lt;p&gt;As we read more papers throughout &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;&quot;&gt;6120&lt;&#x2F;a&gt;, let’s keep the statistical attitude from Mytkowicz et al. in mind.
Let’s keep a healthy skepticism about spurious findings and maintain a high standard for statistical sophistication.&lt;&#x2F;p&gt;
</description>
            </item>
        
            <item>
                <title>Welcome to CS 6120!</title>
                <pubDate>Fri, 30 Aug 2019 00:00:00 +0000</pubDate>
                <link>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/welcome/</link>
                <guid>https%3A//www.cs.cornell.edu/courses/cs6120/2019fa/blog/welcome/</guid>
                <description>&lt;p&gt;I&#x27;m incredibly excited to teach CS 6120, a new PhD-level course about compilers at Cornell!
We&#x27;re using a broad definition of &amp;quot;compilers&amp;quot; that covers all aspects of language implementation.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ll use this course blog for &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;discussion&#x2F;&quot;&gt;paper discussion&lt;&#x2F;a&gt; and &lt;a href=&quot;https:&#x2F;&#x2F;www.cs.cornell.edu&#x2F;courses&#x2F;cs6120&#x2F;2019fa&#x2F;project&#x2F;&quot;&gt;project reports&lt;&#x2F;a&gt; throughout the semester.&lt;&#x2F;p&gt;
</description>
            </item>
        
    </channel>
</rss>
