Empirical Optimization of Sparse Matrix Kernels

The objective of this project is to study the use of empirical optimization to generate very efficient sparse matrix codes. We will focus on matrix-vector multiplication (MVM) for this project.

At first blush, it would appear that MVM does not exhibit much reuse. In particular, each element of the matrix, A, is visited exactly once. However, two factors mitigate this. First, there is some reuse within the vectors, x and y. Second, the "vectors", x and y, are often matrices with many fewer columns (say, 5 to 10) than rows. In this case, there is an opportunity for some reuse of the elements of A.
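To make the access pattern concrete, the following sketch shows MVM for a matrix stored in compressed sparse row (CSR) format, first with a single vector and then with a k-column multi-vector; the array names (rowptr, colind, val) and the row-major layouts are illustrative choices, not part of the assignment.

    #include <stddef.h>

    /* Sparse matrix-vector product y = A*x, with A in CSR format.
     * Each non-zero val[j] is read exactly once; the only reuse is of
     * x[colind[j]] (indirect and irregular) and of the accumulator for y[i]. */
    void spmv_csr(size_t n, const size_t *rowptr, const size_t *colind,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[colind[j]];
            y[i] = sum;
        }
    }

    /* When x and y have k columns (k small, say 5 to 10), each non-zero of A
     * is loaded once but used k times, once per column of x and y. */
    void spmm_csr(size_t n, size_t k, const size_t *rowptr, const size_t *colind,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            for (size_t c = 0; c < k; c++)
                y[i * k + c] = 0.0;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++) {
                double a = val[j];
                const double *xr = &x[colind[j] * k];
                for (size_t c = 0; c < k; c++)
                    y[i * k + c] += a * xr[c];
            }
        }
    }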

This project has two parts.

Part One

Toledo wrote a paper [4] in which he describes various program and data transformations that can be used to more than double the performance of MVM on the IBM RS/6000.

The first part of the project requires that you implement Toledo's transformations and measure their effectiveness on an architecture other than the RS/6000 (e.g., x86, SPARC, MIPS).
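One representative transformation of this flavor is register blocking, in which nearby non-zeros are stored as small dense blocks so that the corresponding entries of x and y can be kept in registers. The sketch below assumes a fixed 2x2 block compressed sparse row (BCSR) layout and is meant only to illustrate the kind of kernel you will be writing and timing, not to reproduce Toledo's code.

    #include <stddef.h>

    /* y = A*x with A stored in 2x2 block CSR (BCSR).
     * nb is the number of block rows; browptr and bcolind index 2x2 blocks,
     * and val stores each block's four entries contiguously in row-major
     * order. Holding y0, y1 (and x0, x1 per block) in registers is the
     * point of the transformation. */
    void spmv_bcsr_2x2(size_t nb, const size_t *browptr, const size_t *bcolind,
                       const double *val, const double *x, double *y)
    {
        for (size_t ib = 0; ib < nb; ib++) {
            double y0 = 0.0, y1 = 0.0;
            for (size_t jb = browptr[ib]; jb < browptr[ib + 1]; jb++) {
                const double *b = &val[4 * jb];
                double x0 = x[2 * bcolind[jb]];
                double x1 = x[2 * bcolind[jb] + 1];
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib]     = y0;
            y[2 * ib + 1] = y1;
        }
    }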

How to proceed with Part One

Part Two

As mentioned in the introduction, very often x and y are actually matrices with many fewer columns than rows (e.g., 1 million rows, 10 columns). If the non-zeros are blocked using Toledo's transformations, the large matrix-matrix multiplication (MMM) operation, y = A*x, can be decomposed into many small MMM operations, each on the order of 10 by 10.
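The following sketch shows one way this decomposition might look, assuming an r x c block CSR layout for A and k columns in x and y; the routine names and layouts are illustrative assumptions.

    #include <stddef.h>

    /* Generic small MMM: Y (r x k) += B (r x c) * X (c x k), all row-major.
     * In the blocked sparse product, B is one dense block of non-zeros and
     * X, Y are the matching slices of the multi-vectors x and y. */
    void small_mmm(size_t r, size_t c, size_t k,
                   const double *B, const double *X, double *Y)
    {
        for (size_t i = 0; i < r; i++)
            for (size_t p = 0; p < c; p++) {
                double b = B[i * c + p];
                for (size_t q = 0; q < k; q++)
                    Y[i * k + q] += b * X[p * k + q];
            }
    }

    /* y = A*x with A in r x c block CSR and x, y holding k columns each.
     * y is assumed to be zeroed on entry; every stored block of A contributes
     * one small (r x c) by (c x k) MMM. */
    void spmm_bcsr(size_t nb, size_t r, size_t c, size_t k,
                   const size_t *browptr, const size_t *bcolind,
                   const double *val, const double *x, double *y)
    {
        for (size_t ib = 0; ib < nb; ib++)
            for (size_t jb = browptr[ib]; jb < browptr[ib + 1]; jb++)
                small_mmm(r, c, k,
                          &val[jb * r * c],
                          &x[bcolind[jb] * c * k],
                          &y[ib * r * k]);
    }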

All of the transformations made in Part One are still important, but equally important is how the small MMM operations are written. If these are not done efficiently, then the overall efficiency of the entire MMM operation will suffer.

In order to get the most performance from these small MMM codes, you must produce a number of versions, each specialized for particular small matrix sizes and written to extract as much performance as possible from the processor.
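As an illustration, a specialized variant of the generic small_mmm loop nest sketched above, with the block shape and the number of vector columns fixed at compile time (the sizes 2x2 and 4 below are chosen purely for illustration), might look like this:

    /* Specialized small MMM for a 2x2 block of A and 4 columns of x and y.
     * With all trip counts known at compile time, the compiler can keep the
     * block entries in registers and fully unroll the loop. */
    void small_mmm_2x2x4(const double *B, const double *X, double *Y)
    {
        double b00 = B[0], b01 = B[1], b10 = B[2], b11 = B[3];
        for (int q = 0; q < 4; q++) {
            double x0 = X[q], x1 = X[4 + q];
            Y[q]     += b00 * x0 + b01 * x1;
            Y[4 + q] += b10 * x0 + b11 * x1;
        }
    }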

How many specialized routines must be produced? If too few are produced, then opportunities for running the specially tuned code are missed. If too many are produced, then the overhead of choosing the appropriate piece of code will be more than the performance gained by executing the specially tuned code. The exact trade-off depends on the structure of the matrix and the machine on which the codes are run.
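One plausible way to organize this choice, assuming the kernels sketched earlier, is a small dispatch routine that maps a block shape and vector width to a specialized kernel and otherwise falls back to the generic loop nest; this is only an illustration of the trade-off, not a prescribed design.

    #include <stddef.h>

    /* Prototypes for the kernels sketched earlier. */
    void small_mmm(size_t r, size_t c, size_t k,
                   const double *B, const double *X, double *Y);
    void small_mmm_2x2x4(const double *B, const double *X, double *Y);

    typedef void (*mmm_kernel)(const double *B, const double *X, double *Y);

    /* Map (block rows, block columns, vector columns) to a specialized kernel,
     * or return NULL so the caller uses the generic small_mmm loop nest.
     * Every additional case lengthens this test, so only block shapes that
     * occur frequently in the matrix are worth a specialized routine. */
    mmm_kernel select_kernel(size_t r, size_t c, size_t k)
    {
        if (r == 2 && c == 2 && k == 4)
            return small_mmm_2x2x4;
        /* further cases would go here */
        return NULL;
    }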

Which transformations are needed to produce the most efficient code for each of the matrix sizes? This depends upon the underlying compiler and the machine on which the codes are run.

Historically, the decisions have been made by a human running a set of experiments on a particular machine and then writing code based upon the results. Recently, ATLAS [1] and FFTW [3] have shown that empirical optimization techniques can be used to automatically tune codes for particular machines.

The second part of this project requires you to implement an empirical optimization framework so that you can automatically generate MMM code specialized for the small block sizes that are likely to occur in the sparse MMM operation described above.
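The core of such a framework is a driver that generates (or instantiates) candidate kernels, times each one on the target machine, and records the winner. A minimal, self-contained sketch of the timing-and-selection step follows, with two hand-written candidates standing in for generated code; a real framework would use a larger working set, a higher-resolution timer, and many more variants.

    #include <stdio.h>
    #include <time.h>

    #define K 4  /* number of vector columns, fixed for this sketch */

    static volatile double sink;  /* keeps the timed work from being optimized away */

    /* Candidate 1: straightforward loop nest for a 2x2 block and K columns. */
    static void v_rolled(const double *B, const double *X, double *Y)
    {
        for (int i = 0; i < 2; i++)
            for (int p = 0; p < 2; p++)
                for (int q = 0; q < K; q++)
                    Y[i * K + q] += B[i * 2 + p] * X[p * K + q];
    }

    /* Candidate 2: same computation with the block entries hoisted and the
     * outer loops unrolled by hand. */
    static void v_unrolled(const double *B, const double *X, double *Y)
    {
        for (int q = 0; q < K; q++) {
            Y[q]     += B[0] * X[q] + B[1] * X[K + q];
            Y[K + q] += B[2] * X[q] + B[3] * X[K + q];
        }
    }

    typedef void (*kernel)(const double *, const double *, double *);

    static double time_kernel(kernel f, long reps)
    {
        double B[4] = {1, 2, 3, 4}, X[2 * K], Y[2 * K] = {0};
        for (int i = 0; i < 2 * K; i++)
            X[i] = i;
        clock_t t0 = clock();
        for (long r = 0; r < reps; r++)
            f(B, X, Y);
        sink = Y[0];
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        kernel cand[] = { v_rolled, v_unrolled };
        const char *name[] = { "rolled", "unrolled" };
        int best = 0;
        double best_t = 1e30;
        for (int i = 0; i < 2; i++) {
            double t = time_kernel(cand[i], 10000000L);
            printf("%-10s %.3f s\n", name[i], t);
            if (t < best_t) { best_t = t; best = i; }
        }
        printf("selected: %s\n", name[best]);
        return 0;
    }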

How to proceed with Part Two

References

  1. ATLAS home page. http://www.netlib.org/atlas/.
  2. Crack Propagation on Teraflop Computers. http://www.tc.cornell.edu/Research/CrackProp/.
  3. FFTW home page. http://www.fftw.org/.
  4. Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication, November 25, 1996. http://theory.lcs.mit.edu/~sivan/smv.ps.gz.
Your contact for this project is Paul Stodghill.