The graph roughly measures the average execution time of a memory access for various access patterns on arrays of different sizes. Observations:

- The machine has a 128K L1 cache. For arrays from 4K to 128K, every access hits in the cache regardless of stride. At 256K, the data no longer fits in cache.

- The L1 miss penalty is slightly over 400 ns.

- The L1 cache line size is 256 bytes. Only when the stride increases to 256 bytes does every access miss in L1. For smaller strides, spatial locality reduces the average cost of an access: the first access to a cache line misses, but the immediately following accesses hit.

- The L1 cache is 4-way set associative. Note that, for very large strides, every accessed element maps to the same cache set. (The stride is a power of 2, and the mapping is determined by the low-order address bits.) At the third-largest stride, only eight elements of the array are accessed; yet, for arrays larger than 128K, these eight elements do not all fit in the L1 cache at once. This must be due to conflict misses. When only four elements are accessed, all of them fit in cache, so the associativity is at least 4.

- There is no L2 cache. Another level of the memory hierarchy is apparent only for arrays larger than 8M. This is a TLB that maps 8M of memory. (Both the 8M 'cache' size and the 'line'/page size below indicate that this cannot be an L2 cache.)

- The page size is 16K. For smaller strides, spatial locality reduces the number of TLB misses.

- The TLB miss penalty is approximately 250 ns. In general, a TLB miss requires an additional memory access.

- There are 512 entries in the TLB. The TLB maps 8M of memory, which corresponds to 512 pages of 16K each.