High-performance Divide-and-conquer Codes
An intriguing feature of divide-and-conquer algorithms is
that in principle they should run well on memory hierarchies with
multiple levels of memory without the need for blocking or
tiling. Each successive "divide" step in the divide-and-conquer
process generates subproblems that touch successively smaller portions
of data. For any level of the memory hierarchy, there is a level of
division below which all the data touched by the subproblem will fit
into that level of the memory hierarchy. Therefore, a
divide-and-conquer algorithm can be viewed as an algorithm that is
blocked for all levels of the memory hierarchy.
Let us look at a few divide-and-conquer algorithms to understand
this. Matrix multiplication is usually expressed iteratively as
follows:
for i = 1, N
  for j = 1, N
    for k = 1, N
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
It is possible to write the same code in a divide-and-conquer style
by partitioning all the arrays into 4 blocks and expressing the
computation to be performed as 8 smaller matrix multiplications:
|C11 C12|   |A11 A12|   |B11 B12|
|C21 C22| = |A21 A22| * |B21 B22|
so
C11 = A11*B11 + A12*B21
C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21
C22 = A21*B12 + A22*B22
Each of the 8 smaller matrix multiplications can itself be
performed in a divide-and-conquer style, and so on, until you reach
1x1 matrices, which can be multiplied directly since that is just
scalar multiplication. For obvious reasons, this algorithm is also
called a block-recursive algorithm for matrix multiplication.
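The block-recursive scheme above can be sketched in C roughly as
follows. This is a minimal sketch, not the compiler's output: it
assumes n is a power of 2 and that the matrices are stored row-major
with leading dimension ld; the function name is ours.

```c
#include <stddef.h>

/* Block-recursive matrix multiply: C += A*B for n x n matrices
   stored row-major with leading dimension ld. Assumes n is a
   power of 2 so every split is exact. */
void matmul_rec(double *C, const double *A, const double *B,
                size_t n, size_t ld)
{
    if (n == 1) {                 /* 1x1 base case: scalar multiply-add */
        C[0] += A[0] * B[0];
        return;
    }
    size_t h = n / 2;             /* split each matrix into 4 blocks */
    const double *A11 = A,           *A12 = A + h,
                 *A21 = A + h * ld,  *A22 = A + h * ld + h;
    const double *B11 = B,           *B12 = B + h,
                 *B21 = B + h * ld,  *B22 = B + h * ld + h;
    double *C11 = C,           *C12 = C + h,
           *C21 = C + h * ld,  *C22 = C + h * ld + h;

    /* the 8 smaller multiplications from the equations above */
    matmul_rec(C11, A11, B11, h, ld);  matmul_rec(C11, A12, B21, h, ld);
    matmul_rec(C12, A11, B12, h, ld);  matmul_rec(C12, A12, B22, h, ld);
    matmul_rec(C21, A21, B11, h, ld);  matmul_rec(C21, A22, B21, h, ld);
    matmul_rec(C22, A21, B12, h, ld);  matmul_rec(C22, A22, B22, h, ld);
}
```

Note that no block is ever copied: each recursive call simply passes a
pointer to the top-left corner of its block plus the original leading
dimension.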
Similarly, triangular solve can be written in a block-recursive
manner in terms of two smaller triangular solves and a matrix-vector
product as follows:
|L11  0 |   |x1|   |b1|
|L21 L22| * |x2| = |b2|
so
L11*x1 = b1
L22*x2 = b2 - L21*x1
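This recursion can be sketched in the same style. Again this is only
an illustration under the same assumptions (n a power of 2, row-major
storage, x overwriting b); the matrix-vector update is written as a
plain loop.

```c
#include <stddef.h>

/* Block-recursive lower-triangular solve L*x = b; the solution x
   overwrites b. L is n x n, row-major, leading dimension ld;
   n is assumed to be a power of 2. */
void trisolve_rec(const double *L, double *b, size_t n, size_t ld)
{
    if (n == 1) {                 /* 1x1 base case: scalar division */
        b[0] /= L[0];
        return;
    }
    size_t h = n / 2;
    const double *L11 = L;
    const double *L21 = L + h * ld;
    const double *L22 = L + h * ld + h;

    trisolve_rec(L11, b, h, ld);        /* solve L11*x1 = b1 */
    for (size_t i = 0; i < h; i++)      /* b2 = b2 - L21*x1 */
        for (size_t j = 0; j < h; j++)
            b[h + i] -= L21[i * ld + j] * b[j];
    trisolve_rec(L22, b + h, h, ld);    /* solve L22*x2 = b2 */
}
```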
Similar block-recursive codes can be written for other matrix
computations, such as Cholesky factorization and LU factorization with
pivoting, that are usually expressed iteratively.
Of course, divide-and-conquer algorithms are known for non-numeric
problems as well. For example, sorting on a machine with a memory
hierarchy can be performed efficiently using a divide-and-conquer
algorithm like merge-sort.
Reality check
In practice, it turns out that purely block-recursive codes can be
quite inefficient because of the overhead of recursion at the lower
levels of the recursion tree. Recursing all the way down to 1x1 blocks
makes it difficult to generate efficient code that keeps processor
pipelines full, so it is better to stop the recursion at some level
and switch to purely iterative code from that point on. For example,
small matrices should be multiplied using the iterative matrix
multiplication, while small arrays that fit into cache should be
sorted using an iterative algorithm like selection-sort or
insertion-sort.
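For the sorting case, the switch-over might look like the sketch
below. This is our own illustration: the cutoff value 16 is arbitrary,
and any iterative sort would do for the base case.

```c
#include <stddef.h>
#include <string.h>

#define CUTOFF 16   /* below this size, switch to iterative sorting */

/* Iterative insertion sort, used once subarrays fit in cache. */
static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) { a[j] = a[j - 1]; j--; }
        a[j] = key;
    }
}

/* Merge sort that stops recursing at CUTOFF elements; tmp must
   provide scratch space for at least n ints. */
void merge_sort(int *a, size_t n, int *tmp)
{
    if (n <= CUTOFF) { insertion_sort(a, n); return; }
    size_t h = n / 2;
    merge_sort(a, h, tmp);
    merge_sort(a + h, n - h, tmp);
    size_t i = 0, j = h, k = 0;          /* merge the two sorted halves */
    while (i < h && j < n)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < h) tmp[k++] = a[i++];
    while (j < n) tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(int));
}
```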
Restructuring to block-recursive form
It is preferable not to have to maintain two versions (a recursive
version and an iterative version) of each code. Our group has
developed restructuring compiler technology that generates
block-recursive versions of matrix codes automatically from iterative
versions. The programmer writes only iterative code, and calls the
compiler to restructure it into block-recursive code. The code
generated by the compiler switches from block-recursive to iterative
at some level of the recursion.
You can read about this technology in one of the papers we have cited
below.
In this project, we will study three issues that arise in getting
such compiled code to run very fast on a machine like the SGI MIPSPro.
- At what point should the code switch from block-recursive to
iterative style? We do not yet have a good understanding of this
issue.
- Notice that the iterative code is executed only for small
blocks. Back-end code generation in compilers like the SGI MIPSPro is
optimized for loops with large trip counts, so it generates poor code
for small blocks. One solution is to unroll the loops completely,
since the iterative code is presumably executed only for small block
sizes. You may want to read the FFTW paper to see what they did to
address this problem.
- We have found experimentally that although the L1 and L2 cache
miss rates for the block-recursive codes we are generating are smaller
than the corresponding miss rates for blocked iterative codes, the TLB
miss rate for the block-recursive codes is much higher than that of
the iterative codes. This appears to be because the default
row/column-major layout of arrays is not well-suited to the patterns
of access in block-recursive codes. We see two possible solutions:
- Prefetch TLB entries.
- Copy data from row/column-major order to a layout better suited
for block-recursive codes. There will be many TLB misses during
the copying process, but each data element is touched just once during
copying, whereas it is touched O(n) times during key computations like
matrix multiplication and Cholesky factorization, so copying in and
out makes sense for large arrays.
Investigate these options and tell us which one we should use.
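To make the unrolling idea from the second issue above concrete, here
is what a fully unrolled 2x2 multiply-add kernel might look like. This
is our own sketch, not the compiler's output; a restructuring compiler
could emit one such straight-line kernel per base-case block size.

```c
#include <stddef.h>

/* Fully unrolled C += A*B for 2x2 blocks stored row-major with
   leading dimension ld. No loop control at all: just 8 multiplies
   and 8 adds in straight-line code that the back end can schedule
   to fill the pipeline. */
static inline void matmul_2x2(double *C, const double *A,
                              const double *B, size_t ld)
{
    C[0]      += A[0]  * B[0] + A[1]      * B[ld];
    C[1]      += A[0]  * B[1] + A[1]      * B[ld + 1];
    C[ld]     += A[ld] * B[0] + A[ld + 1] * B[ld];
    C[ld + 1] += A[ld] * B[1] + A[ld + 1] * B[ld + 1];
}
```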
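One concrete candidate for the copied layout in the last option is
Z-Morton order, in which the four blocks at every level of the
recursion are stored contiguously. The sketch below is our own
illustration, not necessarily the layout the compiler would choose.

```c
#include <stddef.h>

/* Interleave the bits of row i and column j to get the element's
   position in Z-Morton order: at every recursive level, the four
   sub-blocks of a block are contiguous in memory. */
static size_t morton_index(size_t i, size_t j)
{
    size_t idx = 0;
    for (size_t b = 0; b < 8 * sizeof(size_t) / 2; b++) {
        idx |= (size_t)((j >> b) & 1) << (2 * b);      /* even bits: column */
        idx |= (size_t)((i >> b) & 1) << (2 * b + 1);  /* odd bits: row */
    }
    return idx;
}

/* Copy an n x n row-major array into Morton order (n a power of 2).
   The copy itself walks dst non-sequentially and so pays TLB misses
   once; the O(n) reuses in the computation then run on a layout that
   matches the block-recursive access pattern. */
void copy_to_morton(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[morton_index(i, j)] = src[i * n + j];
}
```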
How to proceed
Take a look at the following papers:
- Your contact person for this project is Nawaaz Ahmed who is doing
his dissertation on automatic generation of block-recursive codes. If
you are interested in this project, talk to him, and he will get you
going.