The CS 6120 Course Blog

Exploiting Superword Level Parallelism with Multimedia Instruction Sets

by Ankush Rayabhari

This week's paper, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," explores a new way of exploiting single-instruction, multiple-data (SIMD) operations on a processor. It was written by Samuel Larsen and Saman Amarasinghe and appeared in PLDI 2000.


As applications process more and more data, processors now include so-called SIMD registers and instructions to enable more parallelism. These registers are extra wide: a 512-bit register can hold 16 32-bit words. Instructions on these registers perform the same operation on each of the packed elements.

For example, on Intel processors, the instruction vaddps adds each pair of corresponding packed elements together.

Instruction: vaddps zmm, zmm, zmm
  FOR j := 0 to 15
    i := j*32
    dst[i+31:i] := a[i+31:i] + b[i+31:i]
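To make those semantics concrete, here is a scalar model of the same operation in Python. This mirrors the pseudocode above and is purely illustrative, not how the hardware executes it:

```python
# Scalar model of vaddps zmm, zmm, zmm: a 512-bit register holds
# 16 packed 32-bit floats, and one instruction adds all 16
# corresponding lanes at once.
def vaddps(a, b):
    assert len(a) == len(b) == 16  # 16 lanes of 32 bits each
    return [a[j] + b[j] for j in range(16)]

a = [float(j) for j in range(16)]
b = [1.0] * 16
print(vaddps(a, b))  # lane j holds j + 1.0
```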

Just having these instructions doesn't mean applications can make use of them. Unless developers choose to handwrite assembly code or use libraries that do so, it's up to the compiler to identify and exploit opportunities to use these features. This paper proposes a new set of algorithms for that.

Superword Level Parallelism

Superword level parallelism (SLP) is the type of parallelism that the authors target; it can be thought of as the SIMD subset of instruction level parallelism. An SLP opportunity is a sequence of short SIMD operations whose source and result data are packed into SIMD registers. Additional operations may be needed to pack the initial data into the source registers and to unpack the result.

The authors devise a simple algorithm to convert a basic block that has opportunities for SLP into one that exploits SIMD registers and instructions. Broadly, the algorithm looks for sets of independent (no data dependencies), isomorphic (same operation) statements to merge together.
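As a small illustration of those two criteria (the tuple representation here is ours, not the paper's), two statements qualify when they apply the same operator and neither one reads the other's result:

```python
# Hypothetical statement representation: (dest, op, operands).
def isomorphic(s, t):
    # Same operation in both statements.
    return s[1] == t[1]

def independent(s, t):
    # Neither statement's destination is an operand of the other.
    return s[0] not in t[2] and t[0] not in s[2]

s1 = ("a0", "add", ("x0", "y0"))
s2 = ("a1", "add", ("x1", "y1"))  # same op, disjoint data: packable
s3 = ("b0", "add", ("a0", "y2"))  # reads a0, so it depends on s1
print(isomorphic(s1, s2) and independent(s1, s2))  # True
print(isomorphic(s1, s3) and independent(s1, s3))  # False
```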


The algorithm proceeds in 4 steps:

  1. First, identify pairs of independent memory loads from adjacent locations. These are natural candidates for SIMD loads. This is generally done by some enabling analysis as an initial step. We add these adjacent loads to a pack, effectively the list of independent, isomorphic instructions that we want to combine into a single SIMD instruction. We store all these packs in a set.


  2. We now need to extend this set to contain non-memory operations. Since we want to take advantage of the loaded data, we look for operations that use the values we packed, following the def-use chains of the packed instructions. We can add a pair of statements to the pack set if:

    • The statements do the same operation on independent data.
    • It is beneficial to execute these at once rather than as multiple instructions. If the data is already packed, then there's no additional cost. However, if the data is not packed we have to take into account the cost of moving them into the SIMD register.
    • The positions of the values in the SIMD register are consistent with the previously packed statements. This is what the paper calls alignment.
    • To simplify the search space for the next step, each statement may appear as the first element of at most one pack and as the last element of at most one pack. This way, we don't have to choose between multiple merge options later down the line. When there are multiple candidates, we use the cost model to pick the most profitable one.


  3. Now that we have pairs of instructions that can each be executed by a single SIMD instruction without any issues, we can combine these pairs into even wider SIMD instructions. Taking advantage of the last bullet above, we only combine two packs when the last statement of one is the first statement of the other. For example, we can merge the packs (1) (4) and (4) (7) into a pack containing (1) (4) (7).


  4. The last step is to actually generate the concrete SIMD instructions. Since one group of packed statements may depend on data defined by another, we need to emit the groups in a topological order of the data-dependence graph; the original instruction order is already a valid one, so the scheduling phase emits groups in the order of the original basic block. If scheduling uncovers a cycle between groups, we simply undo the offending pack and emit regular scalar instructions instead.


This routine takes a single basic block and converts it into one that uses SIMD instructions.
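The four steps can be sketched end to end in Python. This is a simplified model under assumptions of ours, not the paper's implementation: statements are (dest, op, operands) tuples, a load is (dest, "load", (base, offset)), and there is no cost model or alignment checking:

```python
def find_adjacent_loads(block):
    """Step 1: pair loads from adjacent memory locations."""
    loads = [s for s in block if s[1] == "load"]
    return [[s, t] for s in loads for t in loads
            if s[2][0] == t[2][0] and t[2][1] == s[2][1] + 1]

def extend_packs(block, packs):
    """Step 2: follow def-use chains, packing isomorphic users of
    already-packed values lane by lane (one pass, no cost model)."""
    for p in list(packs):
        left, right = p[0], p[-1]
        for s in block:
            for t in block:
                if (s is not t and s[1] == t[1] and s[1] != "load"
                        and left[0] in s[2] and right[0] in t[2]
                        and not any(s in q or t in q for q in packs)):
                    packs.append([s, t])
    return packs

def combine_packs(packs):
    """Step 3: merge packs that share an endpoint,
    e.g. (1, 4) and (4, 7) become (1, 4, 7)."""
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(packs):
            for j, q in enumerate(packs):
                if i != j and p[-1] == q[0]:
                    packs = [r for k, r in enumerate(packs)
                             if k not in (i, j)] + [p[:-1] + q]
                    changed = True
                    break
            if changed:
                break
    return packs

def schedule(block, packs):
    """Step 4: emit SIMD groups in the original statement order."""
    emitted, out = set(), []
    for s in block:
        if s in emitted:
            continue
        group = next((p for p in packs if s in p), [s])
        out.append(tuple(group))
        emitted.update(group)
    return out

# Adjacent loads from A and B, then two isomorphic adds.
block = [
    ("t0", "load", ("A", 0)), ("t1", "load", ("A", 1)),
    ("t4", "load", ("A", 2)),   # third A load widens a pack in step 3
    ("t2", "load", ("B", 0)), ("t3", "load", ("B", 1)),
    ("s0", "add", ("t0", "t2")), ("s1", "add", ("t1", "t3")),
]
packs = combine_packs(extend_packs(block, find_adjacent_loads(block)))
print(schedule(block, packs))  # three SIMD groups instead of 7 statements
```

On this toy block, step 1 pairs the A loads and the B loads, step 2 packs the two adds that consume them, step 3 widens the two overlapping A-load pairs into a triple, and step 4 emits each group once in original order.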

By first unrolling a loop, we can apply this routine to its body as well: statements from different iterations become independent, isomorphic candidates for packing, yielding a vector-parallel version of the loop.
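A sketch of what unrolling exposes (the loop and names here are ours): copying a scalar loop body four times yields four independent, isomorphic statements, exactly the shape SLP packing looks for:

```python
# Unroll one loop-body template by `factor`, substituting the
# unrolled offset k into the induction variable. Each copy is an
# independent, isomorphic statement operating on adjacent data.
def unroll(body_template, factor):
    return [body_template(k) for k in range(factor)]

# Body of: for i in range(n): a[i] = b[i] + c[i]
stmts = unroll(lambda k: (f"a[i+{k}]", "add", (f"b[i+{k}]", f"c[i+{k}]")), 4)
for s in stmts:
    print(s)  # four packable copies of the same add
```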

Performance Comparison

The authors chose a benchmark suite that consists of multimedia and scientific programs, which process large amounts of data, e.g., matrix multiplication or color conversion. Their experimental platform was a Motorola processor with an AltiVec co-processor: all vector instructions execute on the co-processor and all sequential operations on the main unit, and any movement between the vector registers and the sequential registers must pass through memory.

They don't compare against any other baseline implementation. They mention previous loop-transformation-based vectorizers in their related work section, but it would have been nice to see how many opportunities SLP can exploit beyond those techniques. They do, however, break down which opportunities come from vectorizable (loop-based) code versus SLP alone; the latter still shows a large improvement, just not as large.

One claim they put forth (albeit a reasonable one) is that their approach is more robust than previous loop transformation techniques. It would have been good to see this verified experimentally on their chosen benchmarks.


The authors do a good job of highlighting several points that limit the effectiveness of vectorization.

Looking to later work, LLVM implements an SLP vectorization pass based on this algorithm. Additionally, some of the same authors have worked on lifting the algorithm's simplifying assumptions, namely its reduced search space and greedy scheduling, using techniques such as integer linear programming solvers.