Lesson 8: Loop Optimization
- discussion thread
- video
- tasks due October 17
Gist
- Recall from Lesson 5: identifying natural loops in the CFG.
- Loop optimization is important because programs, by definition, spend most of their time executing loops! (A program without any loops can’t run for very long.)
Loop-Invariant Code Motion (LICM)
Also from Lesson 5, recall the loop-invariant code motion (LICM) optimization, which moves code from inside a loop to before the loop, if the computation always does the same thing on every iteration of the loop.
A loop’s preheader is its header’s unique predecessor. LICM moves code to the preheader. But while natural loops need to have a unique header, the header does not necessarily have a unique predecessor. So it’s often convenient to invent an empty preheader block that jumps directly to the header, and then move all the in-edges to the header to point there instead.
LICM needs two ingredients: identifying loop-invariant instructions in the loop body, and deciding when it’s safe to move one from the body to the preheader.
To identify loop-invariant instructions:
iterate to convergence:
for every instruction in the loop:
mark it as LI iff, for all arguments x, either:
all reaching defintions of x are outside of the loop, or
there is exactly one definition, and it is already marked as
loop invariant
(This determination requires that you already calculated reaching definitions! Presumably using data flow.)
It’s safe to move a loop-invariant instruction to the preheader iff:
- The definition dominates all of its uses, and
- No other definitions of the same variable exist in the loop, and
- The instruction dominates all loop exits.
The last criterion is somewhat tricky: it ensures that the computation would have been computed eventually anyway, so it’s safe to just do it earlier. But it’s not true of loops that may execute zero times, which, when you think about it, rules out most for
loops! It’s possible to relax this condition if:
- The assigned-to variable is dead after the loop, and
- The instruction can’t have side effects, including exceptions—generally ruling out division because it might divide by zero. (A thing that you generally need to be careful of in such speculative optimizations that do computations that might not actually be necessary.)
Induction Variable Elimination
Induction variable elimination is an optimization that generally targets array indexing (and similar computations). Imagine this loop that accesses an array:
for (let i = 0; i < 100; ++i) {
f(a[i]);
}
which, in C terms, is equivalent to:
for (let i = 0; i < 100; ++i) {
f(a + i * stride);
}
where a
is a base pointer and stride
is the size of the array’s elements.
Induction variable elimination rewrites the code to just “bump the pointer” instead of multiplying every time:
let a_100 = a + 100 * stride;
for (let a_i = a; a_i < a_100; a_i += stride) {
f(a_i);
}
This code only has to do an addition in the loop—it has eliminated a multiplication.
The basic idea is to find affine expressions like a + i * stride
that depend on a loop induction variable (i
here) and some loop-invariant values (a
and stride
here).
Here’s the recipe:
- Given a natural loop, identify the basic loop induction variable(s), like
i
in this example.- These are variables that get have a single in-loop definition that looks like
i += e
for some loop-invariant valuee
. - Of course, there is no such thing as
+=
in SSA form so identifying this kind of assignment requires traversing ϕ-nodes.
- These are variables that get have a single in-loop definition that looks like
- Find derived induction variables.
- You’re looking for an affine expression like
j = c * i + d
wherei
is a basic induction variable andc
andd
are loop-invariant values. - Record the stride
c
, offsetd
, and the basic induction variablei
for every induction variable you find this way.
- You’re looking for an affine expression like
- Replace the definition of each derived induction variable.
- Assuming the induction variable is
j = c * i + d
wherei
is a basic induction variable that gets updated likei += e
… - In the preheader, initialize
j
withd
. (That’s the “equation” forj
withi
set to zero.) - In the loop (after the update to
i
), replace the definition ofj
with an update likej += c * e
(where that latter multiplication is a loop-invariant value). (And again, this will look a bit different in SSA. And in non-SSA form, you will need to be careful to compute the new value ofj
right after the update toi
but without clobbering other, pre-update uses ofj
.)
- Assuming the induction variable is
After all this, you’ll want to do some copy propagation and dead code elimination to, for example, eliminate basic induction variables that are no longer necessary.
This optimization is also sometimes called strength reduction for induction variables (but I think that’s confusing because strength reduction is a much more general concept that doesn’t by itself have anything to do with loops). For more, check out some slides from CMU on induction variable elimination and two blog posts that cover this topic from CS 6120 in 2019: one by Yi Jing, Zhijing Li, and Neil Adit and another by Shaojie Xiang, Yi-Hsiang Lai, and Yuan Zhou.
Loop Unswitching
Loop unswitching “lifts” conditions out of loops, creating two loops. So this code:
for (let i = 0; i < 100; ++i) {
if (c) { // Loop-invariant value.
f();
} else {
g();
}
}
Becomes:
if (c) {
for (let i = 0; i < 100; ++i) {
f();
}
} else {
for (let i = 0; i < 100; ++i) {
g();
}
}
The analysis, of course, requires identifying loop-invariant code. Then the heavily lifting in the transformation is “just” rearranging the CFG and duplicating the loop’s subgraph. For more, see a blog post by Sameer Lal from CS 6120 in 2019.
Loop Fusion & Fission
Loop fusion or loop jamming takes two sequential loops and combines them into one. Its main purpose again has to do with arrays, particularly when feeding the output from one loop into the input of the next loop, like this:
for (let i = 0; i < 100; ++i) {
b[i] = f(a[i]);
}
for (let i = 0; i < 100; ++i) {
c[i] = g(b[i]);
}
Fusing these loops yields one that calls both f
and g
:
for (let i = 0; i < 100; ++i) {
b[i] = f(a[i]);
c[i] = g(b[i]);
}
Then, the real win comes if the intermediate array b
is “dead” outside of this context, in which case we can eliminate it:
for (let i = 0; i < 100; ++i) {
c[i] = g(f(a[i]));
}
In this case, fusion eliminated a lot of memory accesses (and perhaps even reduced the program’s heap footprint, if we no longer even have to allocate b
).
This optimization is extremely important in machine learning compilers and other data-intensive domains—for example, it is far more efficient to fuse a fully-connected layer with a ReLU nonlinearity than it is to compute them serially, with a temporary tensor in between.
The opposite transformation, loop fission or distribution, can be useful to optimize for locality—i.e., to make multiple, smaller loop bodies that each have a smaller memory footprint.
The key analysis you need in both case is independence, i.e., being certain that f
and g
do not interfere with each other so it’s legal to change the order they execute in.
Global Optimizations Galore!
This is not an exhaustive list of loop optimizations—there are so many more! For example:
- Loop unrolling (duplicating the loop body to reduce the numer of branches executed at the expense of code size).
- Loop interchange (replacing row-major iteration with column-major iteration, for example).
- Loop splitting or peeling (carving off the first few iterations of the loop and running them separately, leaving you with a simpler main loop body).
- Loop transformations for data locality optimization (see also Kathryn McKinley’s slides on this topic).
See overviews of these and still more in Steve Chong’s slides on loop optimizations.
More broadly, this is where we will stop going down the standard function-level optimization rabbit hole. Before the class moves on to farther-flung topics in language implementation, for those who are curious about this sort of thing, I wanted to leave you with a few links to read about even more fancy global optimizations:
- Global value numbering, the turbocharged global version of local value numbering, which you will remember from Lesson 3. See also this CS 6120 blog post by Alexa VanHattum and Gregory Yauney.
- Partial redundancy elimination.
- Lazy code motion. See also these slides from CMU.
Tasks
Your task is to implement and evaluate a loop optimization.
- First, pick Bril or LLVM as your starting point.
- You can work in the SSA form of Bril if you want.
- If you pick Bril, you’ll need to find the natural loops yourself. If you pick LLVM, you can use a
LoopPass
to skip that step, but of course other parts of the implementation will be trickier.
- Then, pick your optimization. You can either:
- Choose one of the loop-based optimizations mentioned above in this lesson. (Including one of the ones I just mentioned at the end without describing in detail.) (If you’re having trouble deciding, I recommend LICM.)
- Propose your own. (DM me on Zulip before you start working on it to clear your idea.)
- Implement your optimization.
- Rigorously evaluate its performance impact. See the SIGPLAN empirical evaluation guidelines for the definition of “rigorous.”
- In Bril, you can use the Bril benchmarks.
- If you choose LLVM, select an existing (small!) benchmark suite such as Embench. Please violate the SIGPLAN guideline about using complete benchmark suites, an cherry-pick a convenient subset, to make this evaluation tractable.