**Due: Wednesday, February 17 by 5 pm.**

For this assignment, you will optimize a routine to multiply two double-precision square matrices. As discussed in class, the naive implementation is short, sweet, and horrifyingly slow. A naive blocked code is only marginally better. You will need to use what you have learned about tuning to get your code to run as fast as possible on a single core on one node of the crocus cluster (Intel Xeon E5504).

We provide:

- A trivial unoptimized implementation and simple blocked implementation
- A timing harness and tester
- A version of the interface that calls the ATLAS BLAS

Your function must have the following signature:

```c
void square_dgemm(unsigned M, const double* A, const double* B, double* C);
```

The three arrays should be interpreted as matrices in column-major order with leading dimension M. The operation implemented will actually be a multiply-add:

```
C := C + A*B
```

Look at the code in basic_dgemm.c if you find this confusing.
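A minimal naive version consistent with the column-major convention above might look like the following. This is only a sketch; basic_dgemm.c in the tarball is the authoritative starting point, and the loop order shown here is just one of several valid choices:

```c
#include <stddef.h>

/* Naive multiply-add, C := C + A*B, with all three matrices stored
 * column-major with leading dimension M. Element (i, j) of a matrix X
 * lives at X[j*M + i]. A sketch of what basic_dgemm.c does; see that
 * file for the real thing. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned j = 0; j < M; ++j)          /* column of C and B */
        for (unsigned k = 0; k < M; ++k) {    /* column of A, row of B */
            double b = B[j*M + k];            /* B(k, j) */
            for (unsigned i = 0; i < M; ++i)  /* row of C and A */
                C[j*M + i] += A[k*M + i] * b; /* C(i,j) += A(i,k)*B(k,j) */
        }
}
```

Note the innermost loop walks down a column of C and A, i.e. unit stride in memory for the column-major layout.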

The necessary files are in matmul.tar.gz. Included are:

- `Makefile`: a sample Makefile with some basic rules
- `matmul.c`: the driver program
- `basic_dgemm.c`: a very simple `square_dgemm` implementation
- `blocked_dgemm.c`: a slightly more complex `square_dgemm` implementation
- `blas_dgemm.c`: another wrapper that lets the C driver program call the `dgemm` routine in BLAS implementations
- `timing.gnuplot`: a sample gnuplot script to display the results

We will be testing on the 2.0 GHz Xeon machines on the crocus cluster. Each node has two quad-core chips, but you will only be using a single core for this assignment. See the wiki for more information on the cluster.

Your group should submit your `dgemm.c`, your `Makefile`
(so we can see compiler optimizations) and a write-up. Your write-up
should contain:

- the names of the group members
- a description of optimizations used or attempted
- the results of those optimizations
- your explanations for any odd behavior (e.g. performance dips)

To document the effect of your optimizations, include a graph
comparing your code with `basic_dgemm.c`. Your explanations should
rely heavily on your knowledge of the memory hierarchy (benchmark graphs
help).
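As a concrete example of the kind of transformation your write-up should analyze, a cache-blocked variant of the naive loops might look like this. It is a sketch only: the `BLOCK` size of 64 is an assumed tunable (not a recommendation for this machine), and blocked_dgemm.c is the provided reference:

```c
#include <stddef.h>

/* Hypothetical tile size; tune against the Xeon's cache sizes. */
#define BLOCK 64

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* Blocked multiply-add, C := C + A*B, column-major with leading
 * dimension M. Each (ii, jj, kk) tile reuses a BLOCK x BLOCK piece of
 * A, B, and C while it is resident in cache. A sketch of the idea in
 * blocked_dgemm.c, not a drop-in replacement for it. */
void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned jj = 0; jj < M; jj += BLOCK)
        for (unsigned kk = 0; kk < M; kk += BLOCK)
            for (unsigned ii = 0; ii < M; ii += BLOCK) {
                unsigned jmax = min_u(jj + BLOCK, M);
                unsigned kmax = min_u(kk + BLOCK, M);
                unsigned imax = min_u(ii + BLOCK, M);
                for (unsigned j = jj; j < jmax; ++j)
                    for (unsigned k = kk; k < kmax; ++k) {
                        double b = B[j*M + k];
                        for (unsigned i = ii; i < imax; ++i)
                            C[j*M + i] += A[k*M + i] * b;
                    }
            }
}
```

Plotting Mflop/s against M for this code versus `basic_dgemm.c` is exactly the kind of graph the write-up asks for; dips at sizes that alias badly in cache are worth explaining.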

- I have mentioned the wiki, right?
- Much of this assignment was shamelessly borrowed from CS 267, Fall 2006, which in turn borrowed from previous years into the mists of antiquity. Many of the resources listed there will be useful.
- You can find out more about the processor by running Todd Allen's `cpuid` utility and by running `cat /proc/cpuinfo`.
- The optimizations used in PHiPAC and ATLAS may be interesting. Note: you cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You *can* write your own code generator, however. You might want to skim the tech report (.pdf) on PHiPAC; the section on C coding guidelines will be particularly relevant.
- There is a classic paper (PostScript) on the idea of cache blocking in the context of matrix multiply by Monica Lam, et al.
- Several folks have tried to automatically get cache locality by
basing their matrix storage organization around space-filling curves.
See the paper
by Chatterjee and the classic paper by Gustavson,
Recursion leads to automatic variable blocking for dense linear algebra.
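One way to realize the recursive blocking idea from Gustavson's paper is to split each matrix into quadrants and recurse until the subproblem fits in cache. The sketch below assumes M is a power of two so every split is even, and `CUTOFF` is a hypothetical tuning parameter; handling general M takes more care:

```c
#include <stddef.h>

#define CUTOFF 32  /* hypothetical base-case size; tune for cache */

/* Recursive blocking in the spirit of "Recursion leads to automatic
 * variable blocking": C := C + A*B on an n x n corner of matrices
 * stored column-major with leading dimension ld. Assumes the original
 * M is a power of two, so n stays even at every split. */
static void rec_dgemm(unsigned n, unsigned ld,
                      const double* A, const double* B, double* C)
{
    if (n <= CUTOFF) {  /* base case: naive kernel on a small tile */
        for (unsigned j = 0; j < n; ++j)
            for (unsigned k = 0; k < n; ++k) {
                double b = B[j*ld + k];
                for (unsigned i = 0; i < n; ++i)
                    C[j*ld + i] += A[k*ld + i] * b;
            }
        return;
    }
    unsigned h = n / 2;  /* quadrant size; (i,j) lives at j*ld + i */
    const double *A11 = A,          *A21 = A + h,
                 *A12 = A + h*ld,   *A22 = A + h*ld + h;
    const double *B11 = B,          *B21 = B + h,
                 *B12 = B + h*ld,   *B22 = B + h*ld + h;
    double *C11 = C,        *C21 = C + h,
           *C12 = C + h*ld, *C22 = C + h*ld + h;
    rec_dgemm(h, ld, A11, B11, C11); rec_dgemm(h, ld, A12, B21, C11);
    rec_dgemm(h, ld, A11, B12, C12); rec_dgemm(h, ld, A12, B22, C12);
    rec_dgemm(h, ld, A21, B11, C21); rec_dgemm(h, ld, A22, B21, C21);
    rec_dgemm(h, ld, A21, B12, C22); rec_dgemm(h, ld, A22, B22, C22);
}

void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    rec_dgemm(M, M, A, B, C);
}
```

The appeal of the recursive formulation is that it blocks for every level of the memory hierarchy at once, without hard-coding a tile size per level.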