Cache-conscious Database Systems

Until recently, database systems kept most of their data on disk, and the goal of database optimization was to minimize the number of disk accesses. This is beginning to change because of the availability of commodity parallel processors with enormous amounts of main memory. The prediction is that within the next decade, it will be common to have a terabyte of main memory serving as a buffer pool for a hundred-terabyte database. In such databases, most of the working set of a database query might be in main memory, and the performance of the database query engine might depend critically on the cache behavior of the database. Some evidence for this belief comes from two recent papers that we have linked below. The first paper is by Ailamaki et al from the Wisconsin database group, and it is a study of where time is spent when a database query is executed. They report studies on four memory-resident commercial DBMS systems. They conclude that database implementors must (a) optimize data placement for the L2 cache, and (b) optimize instruction placement to reduce L1 instruction cache. The goal of this project is to study the impact of caches on database design. Here are some questions to think about.

We have spent a lot of time in class discussing optimization of array programs, the most important of which are loop transformations. The analog of loop transformations in the database world is query optimization. Although an SQL query for example can be converted trivially into a sequence of basic operations on relations, the resulting code can be very inefficient. The objective of query optimization is to translate such queries into more efficient sequences of basic operations. Traditionally, a major goal of query optimization was to avoid building large intermediate relations. How does this change in cache-conscious database systems? What performance model should be used? How should queries be optimized?
For array programs, we mentioned that it might be advantageous to perform data transformations such as laying out an array in column-major order rather than say a default row-major order. Similar optimizations are possible in the context of databases. How should a relation be stored in memory to optimize memory hierarchy behavior? In traditional databases, relations are stored tuple-by-tuple, which is analogous to a row-major storage order if one thinks of a relation as a table. However, relations can be stored by field if this is advantageous for locality during query processing; this would correspond to a column-major storage order.
Unlike dense arrays, relations are not random access data structures, so it is important to implement good index structures that make it possible to search relations efficiently. How would you redesign index structures into relations to optimize them for caches? The paper by Rao and Ross given below discusses a novel indexing structure called cache-sensitive search trees which are data structures for searching sorted arrays, carefully designed so that nodes of the trees fit into cache lines.

There are any number of research problems here. Johannes Gehrke has offered to help with this project. You can use the Predator database system for carrying out an implementation of ideas you come up with. Here are some references to get you started:

Ailamaki et al: DBMSs on a modern processor: where does time go?. VLDB 99.
Jun Rao, Kenneth Ross: Cache conscious indexing for decision-support in main memory. VLDB 99: 78-89.
Ambuj Shatdal, Chander Kant, Jeffrey F. Naughton: Cache Conscious Algorithms for Relational Query Processing. VLDB 1994: 510-521
H. V. Jagadish, Daniel F. Lieuwen, Rajeev Rastogi, Abraham Silberschatz, S. Sudarshan: Dalí: A High Performance Main Memory Storage Manager. VLDB 1994: 48-59
Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, Walter E. Baker: Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads. ISCA 1998: 15-26
Jerry Baulier, Philip Bohannon, S. Gogate, C. Gupta, S. Haldar, S. Joshi, A. Khivesera, Henry F. Korth, P. McIlroy, J. Miller, P. P. S. Narayan, M. Nemeth, Rajeev Rastogi, S. Seshadri, Abraham Silberschatz, S. Sudarshan, M. Wilder, C. Wei: DataBlitz Storage Manager: Main Memory Database Performance for Critical Applications. SIGMOD Conference 1999: 519-520
Tobin J. Lehman, Michael J. Carey: A Study of Index Structures for Main Memory Database Management Systems. VLDB 1986: 294-303
Tobin J. Lehman, Michael J. Carey: Query Processing in Main Memory Database Management Systems. SIGMOD Conference 1986: 239-250
Anthony LaMarca, Richard E. Ladner: The Influence of Caches on the Performance of Sorting. SODA 1997: 370-379