Due: Wednesday, February 17 by 5 pm.


For this assignment, you will optimize a routine to multiply two double-precision square matrices. As discussed in class, the naive implementation is short, sweet, and horrifyingly slow. A naive blocked code is only marginally better. You will need to use what you have learned about tuning to get your code to run as fast as possible on a single core on one node of the crocus cluster (Intel Xeon E5504).

We provide:

  1. A trivial unoptimized implementation and simple blocked implementation
  2. A timing harness and tester
  3. A version of the interface that calls the ATLAS BLAS


Your function must have the following signature:

  void square_dgemm(unsigned M, const double* A, const double* B,
                    double* C);

The three arrays should be interpreted as matrices in column-major order with leading dimension M. The operation implemented will actually be a multiply-add:

  C := C + A*B

Look at the code in basic_dgemm.c if you find this confusing.

The necessary files are in matmul.tar.gz. Included are:

  1. a sample Makefile with some basic rules
  2. the driver program
  3. a very simple square_dgemm implementation
  4. a slightly more complex (blocked) square_dgemm implementation
  5. a wrapper that lets the C driver program call the dgemm routine in BLAS implementations
  6. a sample gnuplot script to display the results

We will be testing on the 2.0 GHz Xeon machines of the crocus cluster. Each node has two quad-core chips, but you will only be using a single core for this assignment. See the wiki for more information on the cluster.


Your group should submit your dgemm.c, your Makefile (so we can see the compiler optimizations you used), and a write-up.

To document the effect of your optimizations, your write-up should include a graph comparing your code with basic_dgemm.c and an explanation of each optimization you tried. Your explanations should rely heavily on your knowledge of the memory hierarchy (benchmark graphs help).