Cache-conscious Database Systems
Until recently, database systems kept most of their data on disk, and
the goal of database optimization was to minimize the number of disk
accesses. This is beginning to change because of the availability of
commodity parallel processors with enormous amounts of main memory.
The prediction is that within the next decade, it will be common to
have a terabyte of main memory serving as a buffer pool for a
hundred-terabyte database. In such databases, most of the working set
of a database query might be in main memory, and the performance of
the database query engine might depend critically on the cache
behavior of the database. Some evidence for this belief comes from two
recent papers that we have linked below.
The first paper is by Ailamaki et al from the Wisconsin database
group, and it is a study of where time is spent when a database query
is executed. They report studies on four memory-resident commercial
DBMS systems. They conclude that database implementors must (a)
optimize data placement for the L2 cache, and (b) optimize instruction
placement to reduce L1 instruction cache.
The goal of this project is to study the impact of caches on database
design. Here are some questions to think about.
- We have spent a lot of time in class discussing optimization of
array programs, the most important of which are loop transformations.
The analog of loop transformations in the database world is query
optimization. Although an SQL query for example can be converted
trivially into a sequence of basic operations on relations, the
resulting code can be very inefficient. The objective of query
optimization is to translate such queries into more efficient
sequences of basic operations. Traditionally, a major goal of query
optimization was to avoid building large intermediate relations. How
does this change in cache-conscious database systems? What performance
model should be used? How should queries be optimized?
- For array programs, we mentioned that it might be advantageous to
perform data transformations such as laying out an array in
column-major order rather than say a default row-major order. Similar
optimizations are possible in the context of databases. How should a
relation be stored in memory to optimize memory hierarchy behavior?
In traditional databases, relations are stored tuple-by-tuple, which
is analogous to a row-major storage order if one thinks of a relation
as a table. However, relations can be stored by field if this is
advantageous for locality during query processing; this would
correspond to a column-major storage order.
- Unlike dense arrays, relations are not random access data
structures, so it is important to implement good index structures that
make it possible to search relations efficiently. How would you
redesign index structures into relations to optimize them for caches?
The paper by Rao and Ross given below discusses a novel indexing
structure called cache-sensitive search trees which are data
structures for searching sorted arrays, carefully designed so that
nodes of the trees fit into cache lines.
There are any number of research problems here. Johannes Gehrke has
offered to help with this project. You can use the Predator database
system for carrying out an implementation of ideas you come up with.
Here are some references to get you started:
- Ailamaki
et al: DBMSs on a modern processor: where does time go?. VLDB
99.
-
Jun Rao, Kenneth Ross: Cache conscious indexing for
decision-support in main memory. VLDB 99: 78-89.
- Ambuj Shatdal, Chander Kant, Jeffrey F. Naughton: Cache Conscious Algorithms
for Relational Query Processing. VLDB 1994: 510-521
- H. V. Jagadish, Daniel F. Lieuwen, Rajeev Rastogi, Abraham Silberschatz, S.
Sudarshan: Dalí: A High Performance Main Memory Storage Manager. VLDB 1994:
48-59
- Kimberly Keeton, David A. Patterson, Yong Qiang He, Roger C. Raphael, Walter
E. Baker: Performance Characterization of a Quad Pentium Pro SMP using OLTP
Workloads. ISCA 1998: 15-26
- Jerry Baulier, Philip Bohannon, S. Gogate, C. Gupta, S. Haldar, S. Joshi, A.
Khivesera, Henry F. Korth, P. McIlroy, J. Miller, P. P. S. Narayan, M.
Nemeth, Rajeev Rastogi, S. Seshadri, Abraham Silberschatz, S. Sudarshan, M.
Wilder, C. Wei: DataBlitz Storage Manager: Main Memory Database Performance
for Critical Applications. SIGMOD Conference 1999: 519-520
- Tobin J. Lehman, Michael J. Carey: A Study of Index Structures for Main
Memory Database Management Systems. VLDB 1986: 294-303
- Tobin J. Lehman, Michael J. Carey: Query Processing in Main Memory Database
Management Systems. SIGMOD Conference 1986: 239-250
- Anthony LaMarca, Richard E. Ladner: The Influence of Caches on the
Performance of Sorting. SODA 1997: 370-379