Sample Solutions of CS612 PS#1

The Experimental Setup

300 MHz Intel® Pentium® II Processor
Redhat 7.0, Linux 2.2.16
init level = 1 (single user mode), experiment codes run at priority -15
gcc "2.96"
-O9 -fomit-frame-pointer -mcpu=i686 -malign-double -funroll-all-loops

1. Memory Hierarchies

What is size of the L1 cache?
16k -- The first jump occurs for array sizes 32k and larger.
What is the miss penalty of the L1 cache?
a little more than 100ns.
What is the line size of L1 cache?
32 bytes -- for strides less than 32 bytes, there is some benefit from spatial locality.
What is the associativity of L1 cache?
4-way -- When the stride is such that only 4 elements are touched, all array sizes fit in the L1 cache.
Is there an L2 cache? If so, answer the previous four questions for the L2 cache.
Yes, there is an L2 cache. It's a little unclear from the graph, but it is 512k and it's miss penalty appears to be about 100ns above the L1 miss penalty (about 250ns total). It appears to be an 4-way cache. (The spike when 8 elements are accessed for N=512 discussed below).
What is the page size?
4k -- notice the jump in latency at 4k.
What is the TLB miss penalty?
as much as 325ns.
How many entrees are there in the TLB?
256k / 4k = 64 entries.

Why, for most sizes >=512k, does the access time spike just before falling to the L1 level (e.g., N=512k and Stride=64k, N=1M and Stride = 128k, etc)? This processor is kinda funny, in that the L2 cache is 512k with 32 byte lines while the TLB is 256k with 4k byte "lines". Both are 4-way set associative. I believe what is happening with these "spikes" is that the data accessed causes conflict misses in the TLB. For instance, for N=512 and Stride=64k, all of the references will fall in the L2 cache without conflict misses, but the TLB can only hold half of the references because of conflicts.

Why does the access time rise to over 400ns for sizes >= 32M? I'm not certain, but I believe that it's because, in the process of handling the TLB miss, a second TLB or L2 miss occurs when the page translation data structures are accessed.

2. Blocked Codes

What is the optimal value of K? Here are the 10 best block sizes,

Block Size	Time
32	4.42026
37	4.46711
39	4.48451
46	4.59128
23	4.60884
34	4.61456
14	4.61669
48	4.65089
30	4.67816
21	4.7065

Can you explain this value using the memory hierarchy parameters you obtained in the previous problem? The predicted optimal block size is sqrt(16k [L1 cache size]/ (8 [bytes per double] * 3 [three matices])) = 26. In the experiments, K=26 had a time of 5.39342, which is "close" to the best observed value. The difference can probably be explained by the few extra objects referenced by the code and the fact that the experiments were not the only process running on the machine.

3. Spatial Locality

Which version is faster and why?

i->j loop order -- 0.2580 seconds
j->i loop order -- 0.6661 seconds

The "i->j" version is faster because it traverses the array in the same order in which it is laid out in memory, and thus has spatial locality in the L1 cache. The "j->i" version does not.

Write down a simple loop nest for which interchange would change the computed result.

for (int i=0; i<N; i++)
    for (int j=0; j<N; j++)
        A[i][j] = A[i+1][j-1] + 1;

Ir - Iw = (i+1,j-1) - (i,j) = (1,-1). Loop interchange would make this (-1,1) which is illegal.

4. Miss Rates for MVM

How do your results compare to those presented in class? Is the miss rate constant?

They are consistent with the predicted values. According to the models presented in class, for this cache (c=2k, b=4) the transition from large to small caches occurs at c/2 (1k) for the I->J version, c/(b+1) (about 200) for the J->I version, and should stay constant for the blocked version.

5. MFlops

double x=0.0000001, y=1.0;
for (int i=0; i<n; i++) y += x * y;

Compiler flags are used to unroll this loop. This is important in order to reduce the overhead of the loop instructions. The computed MFlops are 63.036.

The code for these solutions is available upon request. Just send me email.