CS612 Problem Set #1

Due date: Tuesday, February 13th, 2000

This assignment requires you to perform a small set of experiments that will help you understand caches, and program restructuring to improve the cache performance of array programs. These experiments should be performed on a workstation of your choice (don't forget to tell us which one).

1.

Memory Hierarchies

Here is a simple micro-benchmark program that reveals a lot of detail about the memory hierarchy of the machine (it is taken from Hennessy and Patterson's book). Study and run the program. Plot your results for different block sizes in a single graph with stride as the x-axis (with a log scale) and access time as the y-axis. Turn in the graph.

For your convenience we have provided a Matlab script which you may use to create your graphs. All you need to do is copy into the script your data and let Matlab do the rest.

Use your results to answer the following questions. Assume all answers are powers of two (except miss times). Briefly explain your answers.

What is size of the L1 cache?
What is the miss penalty of the L1 cache?
What is the line size of L1 cache?
What is the associativity of L1 cache?
Is there an L2 cache? If so, answer the previous four questions for the L2 cache.
What is the page size?
What is the TLB miss penalty?
How many entrees are there in the TLB?

Hint: Here is the graph for the SP/2 along with a sample analysis.

2.

Blocked Codes

As mentioned in class, matrix multiplication is one routine that exhibits significant data reuse. However, the standard form, shown below, will suffer on large matrices from lack of temporal locality. An important technique to enhance locality employed by both programmers and compilers is to restructure the code to access data in blocks. A blocked version of matrix multiplication (with block size $K \times K$ is shown below. Run both versions below (in C) for N = 500, declaring A, B and C as arrays of doubles.

For the blocked version, experiment with different values of K, and plot the execution time as a function of K. What is the optimal value of K? Can you explain this value using the memory hierarchy parameters you obtained in the previous problem?

Standard version:

do i = 1 .. N do j = 1 .. N do k = 1 .. N C[i,j] = C[i,j] + A[i,k] * B[k,j]

Blocked version:

do t1 = 1 .. $\lceil(N/K)\rceil$ do t2 = 1 .. $\lceil(N/K)\rceil$ do t3 = 1 .. $\lceil(N/K)\rceil$ do i = (t1-1)*K +1 .. min(t1*K,N) do j = (t2-1)*K +1 .. min(t2*K,N) do k = (t3-1)*K +1 .. min(t3*K,N) C[i,j] = C[i,j] + A[i,k] * B[k,j]

3.

Spatial Locality

Consider the two versions of matrix addition shown below. Run both versions below (in C) for N = 500, declaring A, B, and C as arrays of doubles, and measure execution times. Which version is faster and why?

Note that either version can be transformed to the other by simply interchanging the loops. In this case, interchanging the loops does not alter the result of the program. Write down a simple loop nest for which interchange would change the computed result.

do i = 1 .. N do j = 1 .. N C[i,j] = A[i,j] + B[i,j]

do j = 1 .. N do i = 1 .. N C[i,j] = A[i,j] + B[i,j]

4. Miss Rates for MVM

Implement the three versions of Matrix-Vector Multiply discussed in class (in C) for N=500, declaring A as a matrix and B and C as arrays of doubles, and use the performance monitoring counters for NT (cpumon), Linux (cacheprof or pmc) or FreeBSD (perfmon) to determine the miss rates for the cache for each code.

Take readings so that you can construct a graph comparable to the one presented in class. How do your results compare to those presented in class? Is the miss rate constant?

Make certain you document your results with what test(s) you used and why, and which cache(s) your data pertains to.

5.: MFlops
Design a micro-benchmark to measure how many MFlops your computer is capable of. Run it and show us the results of your experiments.

Last Updated: Wednesday, February 07, 2001 10:29:38 AM