High-performance Divide-and-conquer Codes
An intriguing feature of divide-and-conquer algorithms is
that in principle they should run well on memory hierarchies with
multiple levels of memory without the need for blocking or
tiling. Each successive "divide" step in the divide-and-conquer
process generates subproblems that touch successively smaller portions
of data. For any level of the memory hierarchy, there is a level of
division below which all the data touched by the subproblem will fit
into that level of the memory hierarchy. Therefore, a
divide-and-conquer algorithm can be viewed as an algorithm that is
blocked for all levels of the memory hierarchy.
Let us look at a few divide-and-conquer algorithms to understand
this. Matrix multiplication is usually expressed iteratively as
follows:
for i = 1, N
  for j = 1, N
    for k = 1, N
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
It is possible to write the same code in a divide-and-conquer style
by partitioning all the arrays into 4 blocks and expressing the
computation to be performed as 8 smaller matrix multiplications:
|C11 C12|   |A11 A12|   |B11 B12|
|C21 C22| = |A21 A22| * |B21 B22|
so
C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22
Each of the 8 smaller matrix multiplications can itself be
performed in a divide-and-conquer style, and so on, until you reach
1x1 matrices, which can be multiplied directly since that is just
scalar multiplication. For obvious reasons, this algorithm is also
called a block-recursive algorithm for matrix multiplication.
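The block-recursive scheme above can be sketched in C roughly as
follows. This is a minimal sketch, not the compiler's output: it
assumes n is a power of 2 and that the matrices are stored row-major
with leading dimension ld; the function name is ours.

```c
#include <stddef.h>

/* Block-recursive matrix multiply: C += A*B for n x n matrices
   stored row-major with leading dimension ld. Assumes n is a
   power of 2 so every split is exact. */
void matmul_rec(double *C, const double *A, const double *B,
                size_t n, size_t ld)
{
    if (n == 1) {                 /* 1x1 base case: scalar multiply-add */
        C[0] += A[0] * B[0];
        return;
    }
    size_t h = n / 2;             /* split each matrix into 4 blocks */
    const double *A11 = A,           *A12 = A + h,
                 *A21 = A + h * ld,  *A22 = A + h * ld + h;
    const double *B11 = B,           *B12 = B + h,
                 *B21 = B + h * ld,  *B22 = B + h * ld + h;
    double *C11 = C,           *C12 = C + h,
           *C21 = C + h * ld,  *C22 = C + h * ld + h;

    /* the 8 smaller multiplications from the equations above */
    matmul_rec(C11, A11, B11, h, ld);  matmul_rec(C11, A12, B21, h, ld);
    matmul_rec(C12, A11, B12, h, ld);  matmul_rec(C12, A12, B22, h, ld);
    matmul_rec(C21, A21, B11, h, ld);  matmul_rec(C21, A22, B21, h, ld);
    matmul_rec(C22, A21, B12, h, ld);  matmul_rec(C22, A22, B22, h, ld);
}
```

Note that no block is ever copied: each recursive call simply passes a
pointer to the top-left corner of its block plus the original leading
dimension.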
Similarly, triangular solve can be written in a block-recursive
manner in terms of two smaller triangular solves and a matrix-vector
product as follows:
|L11  0 |   |x1|   |b1|
|L21 L22| * |x2| = |b2|
so
L11*x1 = b1
L22*x2 = b2 - L21*x1
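This recursion can be sketched in the same style. Again this is only
an illustration under the same assumptions (n a power of 2, row-major
storage, x overwriting b); the matrix-vector update is written as a
plain loop.

```c
#include <stddef.h>

/* Block-recursive lower-triangular solve L*x = b; the solution x
   overwrites b. L is n x n, row-major, leading dimension ld;
   n is assumed to be a power of 2. */
void trisolve_rec(const double *L, double *b, size_t n, size_t ld)
{
    if (n == 1) {                 /* 1x1 base case: scalar division */
        b[0] /= L[0];
        return;
    }
    size_t h = n / 2;
    const double *L11 = L;
    const double *L21 = L + h * ld;
    const double *L22 = L + h * ld + h;

    trisolve_rec(L11, b, h, ld);        /* solve L11*x1 = b1 */
    for (size_t i = 0; i < h; i++)      /* b2 = b2 - L21*x1 */
        for (size_t j = 0; j < h; j++)
            b[h + i] -= L21[i * ld + j] * b[j];
    trisolve_rec(L22, b + h, h, ld);    /* solve L22*x2 = b2 */
}
```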
Similar block-recursive codes can be written for other matrix
computations, such as Cholesky factorization and LU factorization with
pivoting, that are usually expressed iteratively.
Of course, divide-and-conquer algorithms are known for non-numeric
problems as well. For example, sorting on a machine with a memory
hierarchy can be performed efficiently using a divide-and-conquer
algorithm like merge-sort.
Reality check
In practice, it turns out that purely block-recursive codes can be
quite inefficient because of the overhead of recursion at the lower
levels of the recursion tree. Recursing all the way down to 1x1 blocks
makes it difficult to generate efficient code that keeps processor
pipelines full, so it is better to stop the recursion at some level
and switch to purely iterative code from that point on. For example,
small matrices should be multiplied using the iterative matrix
multiplication, while small arrays that fit into cache should be
sorted using an iterative algorithm like selection-sort or
insertion-sort.
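For the sorting case, the switch-over might look like the sketch
below. This is our own illustration: the cutoff value 16 is arbitrary,
and any iterative sort would do for the base case.

```c
#include <stddef.h>
#include <string.h>

#define CUTOFF 16   /* below this size, switch to iterative sorting */

/* Iterative insertion sort, used once subarrays fit in cache. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Merge sort that stops recursing at CUTOFF elements; tmp must
   provide scratch space for at least n ints. */
void merge_sort(int *a, size_t n, int *tmp)
{
    if (n <= CUTOFF) { insertion_sort(a, n); return; }
    size_t h = n / 2;
    merge_sort(a, h, tmp);
    merge_sort(a + h, n - h, tmp);
    size_t i = 0, j = h, k = 0;          /* merge the two sorted halves */
    while (i < h && j < n)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < h) tmp[k++] = a[i++];
    while (j < n) tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(int));
}
```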
Restructuring to block-recursive form
It is preferable not to have to maintain two versions (a recursive
version and an iterative version) of each code. Our group has
developed restructuring compiler technology that generates
block-recursive versions of matrix codes automatically from iterative
versions. The programmer writes only iterative code, and calls the
compiler to restructure it into block-recursive code. The code
generated by the compiler switches from block-recursive to iterative
at some level of the recursion.
You can read about this technology in one of the papers we have cited
below.
In this project, we will study three issues that arise in getting
such compiled code to run very fast on a machine like the SGI MIPSPro.
- At what point should the code switch from block-recursive to
iterative style? We do not yet have a good understanding of this
issue.
- Notice that the iterative code is executed only for small
blocks. Back-end code generation in compilers like the SGI MIPSPro is
optimized for loops with large trip counts, so it generates poor code
for small blocks. One solution is to unroll the loops completely,
since the iterative code is presumably executed only for small block
sizes. You may want to read the FFTW paper to see what they did to
address this problem.
- We have found experimentally that although the L1 and L2 cache
miss rates for the block-recursive codes we are generating are smaller
than the corresponding miss rates for blocked iterative codes, the TLB
miss rate for the block-recursive codes is much higher than that of
the iterative codes. This appears to be because the default
row/column-major layout of arrays is not well-suited to the patterns
of access in block-recursive codes. We see two possible solutions:
- Prefetch TLB entries.
- Copy data from row/column-major order to a layout better suited
for block-recursive codes. There will be many TLB misses during
the copying process, but each data element is touched just once during
copying, whereas it is touched O(n) times during key computations like
matrix multiplication and Cholesky factorization, so copying in and
out makes sense for large arrays.
Investigate these options and tell us which one we should use.
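To make the unrolling idea from the second issue above concrete, here
is what a fully unrolled 2x2 multiply-add kernel might look like. This
is our own sketch, not the compiler's output; a restructuring compiler
could emit one such straight-line kernel per base-case block size.

```c
#include <stddef.h>

/* Fully unrolled C += A*B for 2x2 blocks stored row-major with
   leading dimension ld. No loop control at all: just 8 multiplies
   and 8 adds in straight-line code that the back end can schedule
   to fill the pipeline. */
static inline void matmul_2x2(double *C, const double *A,
                              const double *B, size_t ld)
{
    C[0]      += A[0]  * B[0] + A[1]      * B[ld];
    C[1]      += A[0]  * B[1] + A[1]      * B[ld + 1];
    C[ld]     += A[ld] * B[0] + A[ld + 1] * B[ld];
    C[ld + 1] += A[ld] * B[1] + A[ld + 1] * B[ld + 1];
}
```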
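One concrete candidate for the copied layout in the last option is
Z-Morton order, in which the four blocks at every level of the
recursion are stored contiguously. The sketch below is our own
illustration, not necessarily the layout the compiler would choose.

```c
#include <stddef.h>

/* Interleave the bits of row i and column j to get the element's
   position in Z-Morton order: at every recursive level, the four
   sub-blocks of a block are contiguous in memory. */
static size_t morton_index(size_t i, size_t j)
{
    size_t idx = 0;
    for (size_t b = 0; b < 8 * sizeof(size_t) / 2; b++) {
        idx |= (size_t)((j >> b) & 1) << (2 * b);      /* even bits: column */
        idx |= (size_t)((i >> b) & 1) << (2 * b + 1);  /* odd bits: row */
    }
    return idx;
}

/* Copy an n x n row-major array into Morton order (n a power of 2).
   The copy itself walks dst non-sequentially and so pays TLB misses
   once; the O(n) reuses in the computation then run on a layout that
   matches the block-recursive access pattern. */
void copy_to_morton(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[morton_index(i, j)] = src[i * n + j];
}
```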
How to proceed
Take a look at the following papers:
- Your contact person for this project is Nawaaz Ahmed who is doing
his dissertation on automatic generation of block-recursive codes. If
you are interested in this project, talk to him, and he will get you
going.