Database Management Systems
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information? -- T.S. Elliot
Logistics
- CS632, Cornell University, Spring 2006.
- Instructor: Johannes Gehrke
- TTh 2:55-4:10, Upson 109
Overview
Large datasets such as historical snapshots of the World Wide Web stored by
the Internet Archive, or the data hosted by amazon.com ("earth's biggest
selection") bring about interesting data management challenges that lie at the
intersection of systems and algorithmic questions for managing and mining large
datasets.
The course has two parts. The first part covers a basic background on system
abstractions and algorithmic tools for managing large datasets. The second part
covers current research topics in this area such as the management of structured
and unstructured data, search, and data mining.
The course prerequisites include basic backgrounds in algorithms and
probability (such as CS482) and in systems (such as CS414). The main work for
the course consists of a paper summary for each
class, the presentation of one paper in the course, and a course project.
In the project, you are expected to try to do
original research. The project encompasses the following steps:
- Project proposal with references. The proposal should contain your goals
for the project and the results of an initial literature search. The project
proposal is due March 10.
- Full literature review for the project, a formal problem description, and
a high-level discussion of your approach . This is due March 31.
- An intermediate status update the week of April 17. An email to Johannes
is ok.
- The final project report. The project report should be formatted like a
regular paper for a conference submission (use the ACM style). The final
project is due May 12.
Course Outline (Draft)
Note: Write your summary about the paper marked with a (*).
Part I: The Classics
Programming abstractions and their implementations (1/31 and 2/2)
The relational model
- E. F. Codd, "A
Relational Model of Data for Large Shared Data Banks", CACM 13(6),
1970.
-
C. J. Date,
Ronald Fagin:
Simple Conditions for Guaranteeing Higher Normal Forms in
Relational Databases.
ACM Trans. Database Syst. 17(3): 465-476(1992)
- P. G. Selinger, et al., "Access
Path Selection in a Relational Database Management System", SIGMOD
Conference, 1979.
- L. D. Shapiro, "Join
Processing in Database Systems with Large Main Memories", TODS 11(3),
1986
XML
MapReduce
Auxiliary data structures (2/7)
- A. Guttman, "R-Trees:
A Dynamic Index Structure for Spatial Searching", SIGMOD Conference,
1984. (*)
- J. Nievergelt, H. Hinterberger,
K. C. Sevcik, "The
Grid File: An Adaptable, Symmetric Multikey File
Structure", TODS 9(1), 1984.
- Justin Zobel,
Alistair Moffat,
Kotagiri Ramamohanarao: Inverted Files Versus Signature Files for Text
Indexing.
ACM Trans. Database Syst. 23(4): 453-490 (1998)
Data reduction (2/9)
Dealing with performance mismatches and
different access characteristics (2/14, 2/16, and 2/21)
2/14
- M. Stonebraker, "Operating
System Support for Database Management", CACM 24(7), pp. 412-418,
1981.
- H. T. Chou, D. J. DeWitt, "An
Evaluation of Buffer Management Strategies for Relational Database Systems",
VLDB Conference, 1985. [Liviu]
- P.
Bohannon, et al., "The
Architecture of the Dali Storage Manager", Journal of Multimedia Tools and
Applications 4(2), 1997 (*)
- J. Rao, K. A. Ross, "Making
B+-Trees Cache Conscious in Main Memory", SIGMOD Conference, 2000
[Nazrul]
Answer the following question about the Bohannon et al. paper: Given that
Dali is a main-memory database system, some design decisions are strongly
influenced by this constraint. Describe the one design decision that in your
opinion is the most non-obvious, but still has a lot of impact on the overall
system performance.
2/16
2/21
Data reduction (Continued) (2/23)
Concurrency (2/28 and 3/2)
2/28
- J. Gray, et al., "Granularity of Locks and Degrees of Consistency in
a Shared Database", IFIP Working Conference on Modeling of Data Base
Management Systems, 1977
- P. L. Lehman, S. B. Yao, "Efficient Locking for Concurrent Operations
on B-Trees", TODS 6(4), 1981.
- H. T. Kung, J. T. Robinson, "On Optimistic
Methods for Concurrency Control", TODS 6(2), 1981. (*)
[Ganesh]
3/2
- R. Agrawal, M. J. Carey, M. Livny, "Concurrency
Control Performance Modeling: Alternatives and Implications", TODS
12(4), 1987. [Seidel]
- H. Garcia-Molina, K. Salem, "SAGAS",
SIGMOD Conference, 1987 [Yookyung]
- H. Wachter, A. Reuter, "The ConTract Model", Database Transaction Models for
Advanced Applications (*)
Handling Failure (3/7 and 3/9)
3/7, 3/9
- C. Mohan, et al., "ARIES: A Transaction Recovery Method Supporting
Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead
Logging", TODS 17(1), 1992 (no summary necessary)
Distributed Transaction Management and Replication
(3/14)
- C. Mohan, B. G. Lindsay, R. Obermarck,
"Transaction Management in the R* Distributed Database Management System",
TODS 11(4), 1986.
- J. Gray, P. Helland, P. E. O'Neil, D.
Sasha, "The Dangers of Replication and a
Solution", SIGMOD Conference, 1996. In Hellerstein/Stonebraker
textbook. (Johannes)
Please do not write any summaries for this class. Instead, please answer the
following questions about the Mohan et al paper:
- The paper states that, in order to maintain atomicity,
each subordinate S must make sure (by forcing log records) never to ask the
coordinator about a decision that S has already ACKed. For each protocol
(2P, PA, PC), give a scenario where failure to follow this principle breaks
atomicity. (Assume that the coordinator and all other subordinates follow
the protocol strictly.)
- In Presumed Commit (PC), the root’s commit record is
forced even though all other nodes do not force their commit records. Why?
Benchmarks (3/16)
Please do not write any summaries for this class. Instead, please outline the
design of a performance benchmark that you could use to benchmark Intranet (not
Internet) search engines (for an example product in this space, look here:
http://www.google.com/enterprise/).
Part II: The Nouveaux
Decision Support (3/27 and 3/29)
- J. Gray, et al., "Data
Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab,
and Sub Totals", Data Mining and Knowledge Discovery 1(1), 1997.
- P. O'Neill, D. Quass, "Improved
Query Performance with Variant Indexes" SIGMOD Conference, 1997
- Y. Zhao, P. Deshpande, J. Naughton, "An
Array-Based Algorithm for Simultaneous Multidimensional Aggregates",
SIGMOD Conference, 1997.
- J. M. Hellerstein, P. J. Haas, H. Wang, "Online
Aggregation", SIGMOD Conference, 1997.
-
Peter J. Haas, Joseph M. Hellerstein: Ripple
Joins for Online Aggregation.
SIGMOD Conference 1999: 287-298
Data Mining
Association Rules (4/4)
- R. Agrawal, R. Srikant, "Fast
Algorithms for Mining Association Rules", VLDB Conference, 1994.
- R. Agrawal and R. Srikant,
"Mining Sequential Patterns", Proc. of the 11th Int'l Conference on
Data Engineering, Taiwan, March 1995.
PDF
ps.gz.
- Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker
White.
DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints.
Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages
241-272.
Short version of the paper: Cristian Bucila, J. E. Gehrke, Daniel Kifer, and
Walker White.
DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. In
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD 2002). Edmonton, Alberta, Canada, July
2002.)
Privacy-preserving association rule mining (4/6)
Frequent Structure Discovery (4/11 and 4/13)
Questions (due 4/11, 6am via email to Johannes):
- We read the DualMiner paper last week, where the idea was to incorporate
constraints into itemset mining. Can you think of at least two interesting
constraints that you would want to incorporate into either frequent tree
mining or frequent graph mining? Please motivate your constraints by
application examples.
- Given your constraints, please outline a high-level idea how you could
modify either the TreeMiner Algorithm or CloseGraph (or another algorithm)
to incorporate your constraints.
- Describe one other research direction that you can think of after having
read these two papers.
Managing semi-structured and unstructured data
Native vs. Relational XML Stores (4/18)
- J. McHugh, S. Abiteboul, R. Goldman, D. Quass, J. Widom, "Lore: A
Database Management System for Semistructured Data", SIGMOD Record 26(3),
1997.
- J. Shanmugasundaram, et al., "Relational Databases for Querying XML
Documents: Limitations and Opportunities", VLDB Conference, 1999.
Internet Query Engines (4/20)
- S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web
Search Engine", WWW Conference/Computer Networks 30(1-7), 1998.
- J. Naughton, et al., "The Niagara Internet Query System", IEEE Data
Engineering Bulletin 24(2), 2001.
Continuous Query Systems (4/25)
- Yanlei Diao, Mehmet Altinel, Michael J. Franklin, Hao Zhang, and Peter
Fischer. Path Sharing and Predicate Evaluation for High-Performance XML
Filtering. TODS , December
2003. (pdf)
(Ara)
Questions about the paper (due 4/25, 6am via email to Johannes)
- The paper mentioned that XML filtering was the inverse problem of
traditional query processing. Motivated by that assumption, they took the
idea of indexing over the data to come up with indexing over queries. Can
you think of another idea that can be borrowed from traditional query
processing of databases, which can be applied to XML filtering?
- In Section 5 of the paper, there is a discussion of how to handle
value-based predicate evaluation. Two methods are described - "Inline" and
"Selection Posponed (SP)". Discuss one main problem associated with each
method. Can you think of a unified way of addressing both of those issues?
Optional reading:
Text Search Over XML (4/27)
Questions about the papers (due 4/27, 6am via email to Johannes)
- Kaushik et al. paper: The paper shows a simple way to link two data
structures to address the hybrid nature of simple XML path queries with
keywords. Can you think of any other examples of a hybrid kind of query that
can be answered by combining two data structures?
- Kaushik et al. paper: The power of a structural index is it roughly
captures a sort of schema for the current instance of data even with all the
irregularities that may be present. A simple version is shown in Figure 2 of
the paper. How could you drive the construction of a good structural index if
you had the XML schema, which explicitly states which elements/attributes are
required and which ones are optional (which helps you decide what queries can
be covered and therefore when inverted lists can be used)? Is there a better
way to solve the problem if you have the schema?
-
Amer-Yahia et al. paper: Section 3.4 discusses other classes of
relaxations such as replacing a type with a supertype, relaxing
value-based predicates like $i.price < 98 to $i.price < 100, or relaxing
the contains predicate via standard IR methods. Including the structural
relaxations described in this paper, which of these relaxations do you
think are really the most useful to the end user? Is it desirable to
incorporate several classes of relaxations at once to get a single
ranking?
- Amer-Yahia et al. paper: In Example 1 (after Figure 6), the calculated
structural score for query Q1 is shown to be 3. However, the listed
predicate penalty for query Q5 consists of 4 terms, each of which is in
[0,1]. Could the resulting structural score of Q5 then be negative? If so,
does this have any negative implications?
Optional reading:
Potpourri
Parallel Database Systems (5/2)
Questions about the papers (due 5/2, 6am, via email to Johannes):
The Gamma paper:
- Q1: The query language of Gamma provides the user with three alternative
declustering strategies: round robin, hashed and range partitioned (Section
3.1). Can you think of any other strategies, and what would they be useful
for?
- Q2: Gamma was developed in the eighties when hardware capability was
limited. If you are asked to build such a system today with current state of
the art hardware what changes would you make in the design of the Gamma
software?
The Alphasort paper:
- Q3: The authors use QuickSort as the one pass sort algorithm to produce
runs that are later merged into the full sorted order. They investigate
different versions of QuickSort which are desirable for different types of
data. Are there other sorting methods that would work well/better for certain
kinds of data that they could have used instead of QuickSort?
- Q4: The authors describe several optimizations of different nature that
make AlphaSort perform well such as striping, minimizing cache misses by using
QuickSort, adapting data structures to increase cache locality, etc. Which of
those do you think were most interesting/innovative and which are most
relevant to external sorting today?
Data Stream Systems (5/4)
- Praveen Seshadri Miron Livny Raghu Ramakrishnan. The Design and
Implementation of a Sequence Database System.. Proc. VLDB, 1996, 99-110.
(Anton)
- João Pereira, Françoise Fabret, H.-Arno Jacobesen, François Llirbat,
Radu Preotiuc-Prieto, Kenneth Ross, and Dennis Shasha
Le Subscribe: Publish and
Subscribe on the Web at Extreme Speed. SIGMOD 2001.
(Seidel)