Empirical Optimization of Sparse Matrix Kernels

The objective of this project is to study the use of empirical optimization to generate very efficient sparse matrix codes. We will focus on matrix-vector multiplication (MVM) for this project.

At first blush, it would appear that MVM does not exhibit much reuse. In particular, each element of the matrix, A, is visited exactly once. However, two factors mitigate this. First, there is some reuse within the vectors, x and y. Second, the "vectors", x and y, are often matrices with many fewer columns (say, 5 to 10) than rows. In this case, there is an opportunity for some reuse of the elements of A.
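To make the access pattern concrete, the following sketch shows MVM for a matrix stored in compressed sparse row (CSR) format, first with a single vector and then with a k-column multi-vector; the array names (rowptr, colind, val) and the row-major layouts are illustrative choices, not part of the assignment.

    #include <stddef.h>

    /* Sparse matrix-vector product y = A*x, with A in CSR format.
     * Each non-zero val[j] is read exactly once; the only reuse is of
     * x[colind[j]] (indirect and irregular) and of the accumulator for y[i]. */
    void spmv_csr(size_t n, const size_t *rowptr, const size_t *colind,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            double sum = 0.0;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++)
                sum += val[j] * x[colind[j]];
            y[i] = sum;
        }
    }

    /* When x and y have k columns (k small, say 5 to 10), each non-zero of A
     * is loaded once but used k times, once per column of x and y. */
    void spmm_csr(size_t n, size_t k, const size_t *rowptr, const size_t *colind,
                  const double *val, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            for (size_t c = 0; c < k; c++)
                y[i * k + c] = 0.0;
            for (size_t j = rowptr[i]; j < rowptr[i + 1]; j++) {
                double a = val[j];
                const double *xr = &x[colind[j] * k];
                for (size_t c = 0; c < k; c++)
                    y[i * k + c] += a * xr[c];
            }
        }
    }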

This project has two parts.

Part One

Toledo wrote a paper [4] in which he describes various program and data transformations that can be used to more than double the performance of MVM on the IBM RS/6000.

The first part of the project requires that you implement Toledo's transformations and measure their effectiveness on an architecture other than the RS/6000 (e.g., x86, SPARC, MIPS).
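One representative transformation of this flavor is register blocking, in which nearby non-zeros are stored as small dense blocks so that the corresponding entries of x and y can be kept in registers. The sketch below assumes a fixed 2x2 block compressed sparse row (BCSR) layout and is meant only to illustrate the kind of kernel you will be writing and timing, not to reproduce Toledo's code.

    #include <stddef.h>

    /* y = A*x with A stored in 2x2 block CSR (BCSR).
     * nb is the number of block rows; browptr and bcolind index 2x2 blocks,
     * and val stores each block's four entries contiguously in row-major
     * order. Holding y0, y1 (and x0, x1 per block) in registers is the
     * point of the transformation. */
    void spmv_bcsr_2x2(size_t nb, const size_t *browptr, const size_t *bcolind,
                       const double *val, const double *x, double *y)
    {
        for (size_t ib = 0; ib < nb; ib++) {
            double y0 = 0.0, y1 = 0.0;
            for (size_t jb = browptr[ib]; jb < browptr[ib + 1]; jb++) {
                const double *b = &val[4 * jb];
                double x0 = x[2 * bcolind[jb]];
                double x1 = x[2 * bcolind[jb] + 1];
                y0 += b[0] * x0 + b[1] * x1;
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib]     = y0;
            y[2 * ib + 1] = y1;
        }
    }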

How to proceed with Part One

Part Two

As mentioned in the introduction, very often x and y are actually matrices with many fewer columns than rows (e.g., 1 million rows, 10 columns). If the non-zeros are blocked using Toledo's transformations, the large matrix-matrix multiplication (MMM) operation, y = A*x, can be decomposed into many small MMM operations, each on the order of 10 by 10.
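The following sketch shows one way this decomposition might look, assuming an r x c block CSR layout for A and k columns in x and y; the routine names and layouts are illustrative assumptions.

    #include <stddef.h>

    /* Generic small MMM: Y (r x k) += B (r x c) * X (c x k), all row-major.
     * In the blocked sparse product, B is one dense block of non-zeros and
     * X, Y are the matching slices of the multi-vectors x and y. */
    void small_mmm(size_t r, size_t c, size_t k,
                   const double *B, const double *X, double *Y)
    {
        for (size_t i = 0; i < r; i++)
            for (size_t p = 0; p < c; p++) {
                double b = B[i * c + p];
                for (size_t q = 0; q < k; q++)
                    Y[i * k + q] += b * X[p * k + q];
            }
    }

    /* y = A*x with A in r x c block CSR and x, y holding k columns each.
     * y is assumed to be zeroed on entry; every stored block of A contributes
     * one small (r x c) by (c x k) MMM. */
    void spmm_bcsr(size_t nb, size_t r, size_t c, size_t k,
                   const size_t *browptr, const size_t *bcolind,
                   const double *val, const double *x, double *y)
    {
        for (size_t ib = 0; ib < nb; ib++)
            for (size_t jb = browptr[ib]; jb < browptr[ib + 1]; jb++)
                small_mmm(r, c, k,
                          &val[jb * r * c],
                          &x[bcolind[jb] * c * k],
                          &y[ib * r * k]);
    }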

All of the transformations made in Part One are still important, but equally important is how the small MMM operations are written. If these are not done efficiently, then the overall efficiency of the entire MMM operation will suffer.

In order to get the most performance from these small MMM codes, you must produce a number of versions, each specialized for particular small matrix sizes and written to extract as much performance as possible from the processor.
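As an illustration, a specialized variant of the generic small_mmm loop nest sketched above, with the block shape and the number of vector columns fixed at compile time (the sizes 2x2 and 4 below are chosen purely for illustration), might look like this:

    /* Specialized small MMM for a 2x2 block of A and 4 columns of x and y.
     * With all trip counts known at compile time, the compiler can keep the
     * block entries in registers and fully unroll the loop. */
    void small_mmm_2x2x4(const double *B, const double *X, double *Y)
    {
        double b00 = B[0], b01 = B[1], b10 = B[2], b11 = B[3];
        for (int q = 0; q < 4; q++) {
            double x0 = X[q], x1 = X[4 + q];
            Y[q]     += b00 * x0 + b01 * x1;
            Y[4 + q] += b10 * x0 + b11 * x1;
        }
    }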

How many specialized routines must be produced? If too few are produced, then opportunities for running the specially tuned code are missed. If too many are produced, then the overhead of choosing the appropriate piece of code will be more than the performance gained by executing the specially tuned code. The exact trade-off depends on the structure of the matrix and the machine on which the codes are run.
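One plausible way to organize this choice, assuming the kernels sketched earlier, is a small dispatch routine that maps a block shape and vector width to a specialized kernel and otherwise falls back to the generic loop nest; this is only an illustration of the trade-off, not a prescribed design.

    #include <stddef.h>

    /* Prototypes for the kernels sketched earlier. */
    void small_mmm(size_t r, size_t c, size_t k,
                   const double *B, const double *X, double *Y);
    void small_mmm_2x2x4(const double *B, const double *X, double *Y);

    typedef void (*mmm_kernel)(const double *B, const double *X, double *Y);

    /* Map (block rows, block columns, vector columns) to a specialized kernel,
     * or return NULL so the caller uses the generic small_mmm loop nest.
     * Every additional case lengthens this test, so only block shapes that
     * occur frequently in the matrix are worth a specialized routine. */
    mmm_kernel select_kernel(size_t r, size_t c, size_t k)
    {
        if (r == 2 && c == 2 && k == 4)
            return small_mmm_2x2x4;
        /* further cases would go here */
        return NULL;
    }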

Which transformations are needed to produce the most efficient code for each of the matrix sizes? This depends upon the underlying compiler and the machine on which the codes are run.

Historically, the decisions have been made by a human running a set of experiments on a particular machine and then writing code based upon the results. Recently, ATLAS [1] and FFTW [3] have shown that empirical optimization techniques can be used to automatically tune codes for particular machines.

The second part of this project requires you to implement an empirical optimization framework so that you can automatically generate MMM code specialized for the small block sizes that are likely to occur in the sparse MMM operation described above.
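The core of such a framework is a driver that generates (or instantiates) candidate kernels, times each one on the target machine, and records the winner. A minimal, self-contained sketch of the timing-and-selection step follows, with two hand-written candidates standing in for generated code; a real framework would use a larger working set, a higher-resolution timer, and many more variants.

    #include <stdio.h>
    #include <time.h>

    #define K 4  /* number of vector columns, fixed for this sketch */

    static volatile double sink;  /* keeps the timed work from being optimized away */

    /* Candidate 1: straightforward loop nest for a 2x2 block and K columns. */
    static void v_rolled(const double *B, const double *X, double *Y)
    {
        for (int i = 0; i < 2; i++)
            for (int p = 0; p < 2; p++)
                for (int q = 0; q < K; q++)
                    Y[i * K + q] += B[i * 2 + p] * X[p * K + q];
    }

    /* Candidate 2: same computation with the block entries hoisted and the
     * outer loops unrolled by hand. */
    static void v_unrolled(const double *B, const double *X, double *Y)
    {
        for (int q = 0; q < K; q++) {
            Y[q]     += B[0] * X[q] + B[1] * X[K + q];
            Y[K + q] += B[2] * X[q] + B[3] * X[K + q];
        }
    }

    typedef void (*kernel)(const double *, const double *, double *);

    static double time_kernel(kernel f, long reps)
    {
        double B[4] = {1, 2, 3, 4}, X[2 * K], Y[2 * K] = {0};
        for (int i = 0; i < 2 * K; i++)
            X[i] = i;
        clock_t t0 = clock();
        for (long r = 0; r < reps; r++)
            f(B, X, Y);
        sink = Y[0];
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        kernel cand[] = { v_rolled, v_unrolled };
        const char *name[] = { "rolled", "unrolled" };
        int best = 0;
        double best_t = 1e30;
        for (int i = 0; i < 2; i++) {
            double t = time_kernel(cand[i], 10000000L);
            printf("%-10s %.3f s\n", name[i], t);
            if (t < best_t) { best_t = t; best = i; }
        }
        printf("selected: %s\n", name[best]);
        return 0;
    }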

How to proceed with Part Two

References

  1. ATLAS home page. http://www.netlib.org/atlas/.
  2. Crack Propagation on Teraflop Computers. http://www.tc.cornell.edu/Research/CrackProp/.
  3. FFTW home page. http://www.fftw.org/.
  4. Sivan Toledo. Improving memory-system performance of sparse matrix-vector multiplication, November 25, 1996. http://theory.lcs.mit.edu/~sivan/smv.ps.gz.
Your contact for this project is Paul Stodghill.