Thursday, February 27, 2003
B17 Upson Hall
A Parallel Vector Memory Subsystem for the Impulse Adaptable Memory Controller
Microprocessor speed is increasing much faster than memory system speed, and thus achieving good application performance often translates into achieving good cache performance. Unfortunately, many important commercial and scientific workloads lack the locality of reference that makes caching effective. Studies have shown that memory bus and DRAM latencies cause an 8X slowdown from peak performance to actual performance on commercial database workloads, that the efficiency of current caching techniques is generally less than 20% of an optimal cache's, and that cache sizes are up to 2000 times larger than an optimal cache would be. The evidence is clear: no matter how hard we push it, traditional caching cannot bridge the growing processor-memory performance gap.

This talk presents research that attacks the memory problem at a different level -- the memory controller. We describe the Impulse Adaptable Memory System built at the University of Utah. This general-purpose, uniprocessor system improves performance within the cache hierarchy and the memory back end for both regular and irregular computations. It does this in three ways: by optimizing the use of DRAM resources in the memory controller back end, by prefetching data within the memory controller and delivering it to the processor only when requested, and by remapping previously unused physical addresses within the memory controller. Extending the virtual memory hierarchy in this way enables optimizations that improve the efficiency of the system bus and the performance of the CPU's cache hierarchy and translation lookaside buffer (TLB). Impulse represents a combined hardware/software solution in that the compiler, OS, or application writer supplies the access pattern information that the memory controller exploits.
We provide an overview of Impulse functionality, and then present details of the Parallel Vector Access (PVA) mechanism, which optimizes the use of DRAM resources by gathering data in parallel within the memory controller. The PVA performs cache-line fills as efficiently as a normal, serial controller, and performs strided vector accesses three to 33 times faster. The scalable design is two to five times faster than other gathering mechanisms with similar goals, at the cost of only a slight increase in hardware complexity.