Database Management Systems

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? -- T.S. Elliot

Logistics

CS632, Cornell University, Spring 2006.
Instructor: Johannes Gehrke
TTh 2:55-4:10, Upson 109

Overview

Large datasets such as historical snapshots of the World Wide Web stored by the Internet Archive, or the data hosted by amazon.com ("earth's biggest selection") bring about interesting data management challenges that lie at the intersection of systems and algorithmic questions for managing and mining large datasets.

The course has two parts. The first part covers a basic background on system abstractions and algorithmic tools for managing large datasets. The second part covers current research topics in this area such as the management of structured and unstructured data, search, and data mining.

The course prerequisites include basic backgrounds in algorithms and probability (such as CS482) and in systems (such as CS414). The main work for the course consists of a paper summary for each class, the presentation of one paper in the course, and a course project.

In the project, you are expected to try to do original research. The project encompasses the following steps:

Project proposal with references. The proposal should contain your goals for the project and the results of an initial literature search. The project proposal is due March 10.
Full literature review for the project, a formal problem description, and a high-level discussion of your approach . This is due March 31.
An intermediate status update the week of April 17. An email to Johannes is ok.
The final project report. The project report should be formatted like a regular paper for a conference submission (use the ACM style). The final project is due May 12.

Course Outline (Draft)

Note: Write your summary about the paper marked with a (*).

Part I: The Classics

Programming abstractions and their implementations (1/31 and 2/2)

The relational model

E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", CACM 13(6), 1970.
C. J. Date, Ronald Fagin: Simple Conditions for Guaranteeing Higher Normal Forms in Relational Databases. ACM Trans. Database Syst. 17(3): 465-476(1992)
P. G. Selinger, et al., "Access Path Selection in a Relational Database Management System", SIGMOD Conference, 1979.
L. D. Shapiro, "Join Processing in Database Systems with Large Main Memories", TODS 11(3), 1986

XML

Don Chamberlin. XQuery: An XML Query Language. IBM Systems Journal, Volume 41, Number 4, 2002.

MapReduce

Jeffrey Dean, Sanjay Ghemawat:. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004, 137-150.

Auxiliary data structures (2/7)

A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching", SIGMOD Conference, 1984. (*)
J. Nievergelt, H. Hinterberger, K. C. Sevcik, "The Grid File: An Adaptable, Symmetric Multikey File Structure", TODS 9(1), 1984.
Justin Zobel, Alistair Moffat, Kotagiri Ramamohanarao: Inverted Files Versus Signature Files for Text Indexing. ACM Trans. Database Syst. 23(4): 453-490 (1998)

Data reduction (2/9)

Yannis E. Ioannidis: The History of Histograms (abridged). VLDB 2003: 19-30
Yannis E. Ioannidis, Viswanath Poosala: Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD Conference 1995: 233-244 (*)
V. Poosala, Y.E. Ioannidis, P. J. Haas, E. Shekita, "Improved Histograms for Selectivity Estimation of Range Predicates", SIGMOD Conference, 1996. [Anton]

Dealing with performance mismatches and different access characteristics (2/14, 2/16, and 2/21)

2/14

M. Stonebraker, "Operating System Support for Database Management", CACM 24(7), pp. 412-418, 1981.
H. T. Chou, D. J. DeWitt, "An Evaluation of Buffer Management Strategies for Relational Database Systems", VLDB Conference, 1985. [Liviu]
P. Bohannon, et al., "The Architecture of the Dali Storage Manager", Journal of Multimedia Tools and Applications 4(2), 1997 (*)
J. Rao, K. A. Ross, "Making B+-Trees Cache Conscious in Main Memory", SIGMOD Conference, 2000 [Nazrul]

Answer the following question about the Bohannon et al. paper: Given that Dali is a main-memory database system, some design decisions are strongly influenced by this constraint. Describe the one design decision that in your opinion is the most non-obvious, but still has a lot of impact on the overall system performance.

2/16

J. Ian Munro, Mike Paterson: Selection and Sorting with Limited Storage. Theor. Comput. Sci. 12: 315-323(1980)
Gurmeet Singh Manku, Sridhar Rajagopalan, Bruce G. Lindsay: Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998: 426-435 [Jonathan] (*)

2/21

Jeffrey Scott Vitter: Random Sampling with a Reservoir. ACM Trans. Math. Softw. 11(1): 37-57(1985) (*)
Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya: On Random Sampling over Joins. SIGMOD Conference 1999: 263-274 [Bistra]

Data reduction (Continued) (2/23)

Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999). Read the full version of the paper. [Gödel prize citation] [Ara] (*)
Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. PODS 1999: 10-20. Read the full version of the paper.

Concurrency (2/28 and 3/2)

2/28

J. Gray, et al., "Granularity of Locks and Degrees of Consistency in a Shared Database", IFIP Working Conference on Modeling of Data Base Management Systems, 1977
P. L. Lehman, S. B. Yao, "Efficient Locking for Concurrent Operations on B-Trees", TODS 6(4), 1981.
H. T. Kung, J. T. Robinson, "On Optimistic Methods for Concurrency Control", TODS 6(2), 1981. (*) [Ganesh]

3/2

R. Agrawal, M. J. Carey, M. Livny, "Concurrency Control Performance Modeling: Alternatives and Implications", TODS 12(4), 1987. [Seidel]
H. Garcia-Molina, K. Salem, "SAGAS", SIGMOD Conference, 1987 [Yookyung]
H. Wachter, A. Reuter, "The ConTract Model", Database Transaction Models for Advanced Applications (*)

Handling Failure (3/7 and 3/9)

3/7, 3/9

C. Mohan, et al., "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", TODS 17(1), 1992 (no summary necessary)

Distributed Transaction Management and Replication (3/14)

C. Mohan, B. G. Lindsay, R. Obermarck, "Transaction Management in the R* Distributed Database Management System", TODS 11(4), 1986.
J. Gray, P. Helland, P. E. O'Neil, D. Sasha, "The Dangers of Replication and a Solution", SIGMOD Conference, 1996. In Hellerstein/Stonebraker textbook. (Johannes)

Please do not write any summaries for this class. Instead, please answer the following questions about the Mohan et al paper:

The paper states that, in order to maintain atomicity, each subordinate S must make sure (by forcing log records) never to ask the coordinator about a decision that S has already ACKed. For each protocol (2P, PA, PC), give a scenario where failure to follow this principle breaks atomicity. (Assume that the coordinator and all other subordinates follow the protocol strictly.)
In Presumed Commit (PC), the root’s commit record is forced even though all other nodes do not force their commit records. Why?

Benchmarks (3/16)

Anon, et al, "A Measure of Transaction Processing Power", Datamation, 31(7).
M. J. Carey, D. J. DeWitt, J. F. Naughton: "The 007 Benchmark", SIGMOD Conference, 1993.
The BUCKY Object-Relational Database Benchmark. In SIGMOD97. This paper describes a benchmark for object-relational database systems; follow the link to get postscript describing the results of running the benchmark against one object-relational database system, and also to get an SQL3-ish implementation of the benchmark and a data generation program.

Please do not write any summaries for this class. Instead, please outline the design of a performance benchmark that you could use to benchmark Intranet (not Internet) search engines (for an example product in this space, look here: http://www.google.com/enterprise/).

Part II: The Nouveaux

Decision Support (3/27 and 3/29)

J. Gray, et al., "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals", Data Mining and Knowledge Discovery 1(1), 1997.
P. O'Neill, D. Quass, "Improved Query Performance with Variant Indexes" SIGMOD Conference, 1997
Y. Zhao, P. Deshpande, J. Naughton, "An Array-Based Algorithm for Simultaneous Multidimensional Aggregates", SIGMOD Conference, 1997.
J. M. Hellerstein, P. J. Haas, H. Wang, "Online Aggregation", SIGMOD Conference, 1997.
Peter J. Haas, Joseph M. Hellerstein: Ripple Joins for Online Aggregation. SIGMOD Conference 1999: 287-298

Data Mining

Association Rules (4/4)

R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules", VLDB Conference, 1994.
R. Agrawal and R. Srikant, "Mining Sequential Patterns", Proc. of the 11th Int'l Conference on Data Engineering, Taiwan, March 1995. PDF ps.gz.
Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages 241-272.
Short version of the paper: Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002). Edmonton, Alberta, Canada, July 2002.)

Privacy-preserving association rule mining (4/6)

Alexandre Evfimievski, J. E. Gehrke, and Ramakrishnan Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003). San Diego, CA, June 2003.

Frequent Structure Discovery (4/11 and 4/13)

Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest, SIGKDD 2002.
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. SIGKDD 2003.

Questions (due 4/11, 6am via email to Johannes):

We read the DualMiner paper last week, where the idea was to incorporate constraints into itemset mining. Can you think of at least two interesting constraints that you would want to incorporate into either frequent tree mining or frequent graph mining? Please motivate your constraints by application examples.
Given your constraints, please outline a high-level idea how you could modify either the TreeMiner Algorithm or CloseGraph (or another algorithm) to incorporate your constraints.
Describe one other research direction that you can think of after having read these two papers.

Managing semi-structured and unstructured data

Native vs. Relational XML Stores (4/18)

J. McHugh, S. Abiteboul, R. Goldman, D. Quass, J. Widom, "Lore: A Database Management System for Semistructured Data", SIGMOD Record 26(3), 1997.
J. Shanmugasundaram, et al., "Relational Databases for Querying XML Documents: Limitations and Opportunities", VLDB Conference, 1999.

Internet Query Engines (4/20)

S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW Conference/Computer Networks 30(1-7), 1998.
J. Naughton, et al., "The Niagara Internet Query System", IEEE Data Engineering Bulletin 24(2), 2001.

Continuous Query Systems (4/25)

Yanlei Diao, Mehmet Altinel, Michael J. Franklin, Hao Zhang, and Peter Fischer. Path Sharing and Predicate Evaluation for High-Performance XML Filtering. TODS , December 2003. (pdf) (Ara)

Questions about the paper (due 4/25, 6am via email to Johannes)

The paper mentioned that XML filtering was the inverse problem of traditional query processing. Motivated by that assumption, they took the idea of indexing over the data to come up with indexing over queries. Can you think of another idea that can be borrowed from traditional query processing of databases, which can be applied to XML filtering?
In Section 5 of the paper, there is a discussion of how to handle value-based predicate evaluation. Two methods are described - "Inline" and "Selection Posponed (SP)". Discuss one main problem associated with each method. Can you think of a unified way of addressing both of those issues?

Optional reading:

NiagaraCQ: A Scalable Continuous Query System for Internet Databases. Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang. SIGMOD 2000 , p379-390. [ps version]

Text Search Over XML (4/27)

On the Integration of Structure Indexes and Inverted Lists : Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, Raghu Ramakrishnan: SIGMOD 2004 : 779-790. (Ganesh)
Sihem Amer-Yahia, Laks V. S. Lakshmanan, Shashank Pandit: FleXPath: Flexible Structure and Full-Text Querying for XML. SIGMOD Conference 2004: 83-94 (Jon)

Questions about the papers (due 4/27, 6am via email to Johannes)

Kaushik et al. paper: The paper shows a simple way to link two data structures to address the hybrid nature of simple XML path queries with keywords. Can you think of any other examples of a hybrid kind of query that can be answered by combining two data structures?
Kaushik et al. paper: The power of a structural index is it roughly captures a sort of schema for the current instance of data even with all the irregularities that may be present. A simple version is shown in Figure 2 of the paper. How could you drive the construction of a good structural index if you had the XML schema, which explicitly states which elements/attributes are required and which ones are optional (which helps you decide what queries can be covered and therefore when inverted lists can be used)? Is there a better way to solve the problem if you have the schema?
Amer-Yahia et al. paper: Section 3.4 discusses other classes of relaxations such as replacing a type with a supertype, relaxing value-based predicates like $i.price < 98 to $i.price < 100, or relaxing the contains predicate via standard IR methods. Including the structural relaxations described in this paper, which of these relaxations do you think are really the most useful to the end user? Is it desirable to incorporate several classes of relaxations at once to get a single ranking?
Amer-Yahia et al. paper: In Example 1 (after Figure 6), the calculated structural score for query Q1 is shown to be 3. However, the listed predicate penalty for query Q5 consists of 4 terms, each of which is in [0,1]. Could the resulting structural score of Q5 then be negative? If so, does this have any negative implications?

Optional reading:

L. Guo, F. Shao, C. Botev, J. Shanmugasundaram, "XRANK: Ranked Keyword Search over XML Documents", SIGMOD Conference, June 2003.

Potpourri

Parallel Database Systems (5/2)

D. DeWitt, et al. "The Gamma Database Machine Project", TKDE 2(1), 1990. (Nazrul)
Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD Conference 1990: 102-111. In Hellerstein/Stonebraker textbook.
Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, David B. Lomet: AlphaSort: A Cache-Sensitive Parallel External Sort. VLDB Journal 4(4): 603-627 (1995) (Bistra)

Questions about the papers (due 5/2, 6am, via email to Johannes):

The Gamma paper:

Q1: The query language of Gamma provides the user with three alternative declustering strategies: round robin, hashed and range partitioned (Section 3.1). Can you think of any other strategies, and what would they be useful for?
Q2: Gamma was developed in the eighties when hardware capability was limited. If you are asked to build such a system today with current state of the art hardware what changes would you make in the design of the Gamma software?

The Alphasort paper:

Q3: The authors use QuickSort as the one pass sort algorithm to produce runs that are later merged into the full sorted order. They investigate different versions of QuickSort which are desirable for different types of data. Are there other sorting methods that would work well/better for certain kinds of data that they could have used instead of QuickSort?
Q4: The authors describe several optimizations of different nature that make AlphaSort perform well such as striping, minimizing cache misses by using QuickSort, adapting data structures to increase cache locality, etc. Which of those do you think were most interesting/innovative and which are most relevant to external sorting today?

Data Stream Systems (5/4)

Praveen Seshadri Miron Livny Raghu Ramakrishnan. The Design and Implementation of a Sequence Database System.. Proc. VLDB, 1996, 99-110. (Anton)
João Pereira, Françoise Fabret, H.-Arno Jacobesen, François Llirbat, Radu Preotiuc-Prieto, Kenneth Ross, and Dennis Shasha Le Subscribe: Publish and Subscribe on the Web at Extreme Speed. SIGMOD 2001. (Seidel)