CS6320 – Advanced Database Systems

Announcements

None yet.

Logistics:

Instructor: Johannes Gehrke, 4105B Upson; office hours: Fridays, 1:15-2:15pm or by appointment.
Class: Tuesdays, 2:55-4:10pm, Thursdays 2:50-4:05pm; Hollister 110

Course Description

Management and processing of large datasets is an area with many practical problems, elegant systems abstractions, and interesting algorithms. In this course, we will explore the beauty of some foundational and recent work in this area which intersects systems, algorithms, and programming languages.

Workload

Every week one paper summary (10%; 1% each and the 10 best count)
- Why is paper summary writing important? A good summary should let you read it a few months later and still know what the paper was about. Summary writing is important for a researcher. We all tend to forget the many papers that we have read over time, and a good summary helps us recollect the main ideas in the paper without investing a lot of time in re-reading it.
- A good summary should have these main points:
  - Problem Statement: What is the problem being studied in the paper and what are the major assumptions?
  - Motivation: Why is the problem being studied and why is this paper being written? Is it a novel problem? Or is it a novel problem setting? Or does the paper improve upon existing work?
  - Techniques: What techniques are proposed and what is the basic idea behind each technique? This should not just list the keywords but should give an idea of the technique too.
  - Benefits: What are the benefits of the technique over existing techniques? Is it easier to implement, does it give better guarantees, does it run faster, or does it provide more functionality?
  - Contributions: If it is not obvious why, give an intuition for why the technique is better than (say, outperforms) existing techniques.
Every week answers to a few questions about the papers (30%, 3% each and the 10 best count)
One class presentation about a topic with associated research papers (20%)
Write a (hopefully publishable) research paper in the area of database systems (40%). You can do the project by yourself or with another student from the class.

Topic selection. Please talk with me about the topic of your project so that the project is within the scope of the class. There are several ideas for project topics already posted within CMS; all of these have the potential of leading to a strong publication. You should have selected a project topic by September 16 at the latest.
Project proposal with references. The proposal should contain your goals for the project and the results of an initial literature search. The project proposal is due October 6.
Full literature review for the project, a formal problem description, and a high-level discussion of your approach. This part of the project is due October 27.
An intermediate status update the week of November 14. An email to Johannes is ok.
The final project report. The project report should be formatted like a regular paper for a conference submission (use the ACM style). The final project is due December 12.

Course Schedule

August 26: Introduction to the course

Background Readings

As we are starting to discuss more database internals over the next weeks, please read the following paper as background reading (do not be scared by its length; it is easy to read):

Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. Architecture of a Database System. Foundations and Trends® in Databases. Vol. 1, No. 2 (2007), 141–259.

September 8: Concurrency control

September 13: Recovery I (Presenter: Amit Sharma)

September 15: Main-Memory Database Systems (Presenter: Ji-Yong Shin)

Tobin J. Lehman, Michael J. Carey: A Study of Index Structures for Main Memory Database Management Systems. VLDB 1986: 294-303 (*)
Jun Rao, Kenneth A. Ross: Cache Conscious Indexing for Decision-Support in Main Memory. VLDB 1999: 78-89
Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, Pat Helland: The End of an Architectural Era (It's Time for a Complete Rewrite). VLDB 2007: 1150-1160

September 20: Recovery II (Presenter: Gabriel Bender)

Marcos Antonio Vaz Salles, Tuan Cao, Benjamin Sowell, Alan J. Demers, Johannes Gehrke, Christoph Koch, Walker M. White: An Evaluation of Checkpoint Recovery for Massively Multiplayer Online Games. PVLDB 2(1): 1258-1269 (2009)
Tuan Cao, Marcos Antonio Vaz Salles, Benjamin Sowell, Yao Yue, Alan J. Demers, Johannes Gehrke, Walker M. White: Fast checkpoint recovery algorithms for frequently consistent applications. SIGMOD Conference 2011: 265-276
Alfons Kemper, Thomas Neumann: HyPer: A hybrid OLTP and OLAP main memory database system based on virtual memory snapshots. ICDE 2011: 195-206
Henrik Mühe, Alfons Kemper, Thomas Neumann: How to efficiently snapshot transactional data: hardware or software controlled? DaMoN 2011: 17-26

September 22: Index Structures I (Presenter: Joyce Chen)

A. Guttman, "R-Trees: A Dynamic Index Structure for Spatial Searching", SIGMOD Conference, 1984. (*)
Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518
Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331
J. Nievergelt, H. Hinterberger, K. C. Sevcik: The Grid File: An Adaptable, Symmetric Multikey File Structure, TODS 9(1), 1984.

September 27: Index Structures II (Presenter: Karthik Raman)

Roger Weber, Hans-Jörg Schek, Stephen Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205
Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft: When Is ''Nearest Neighbor'' Meaningful? ICDT 1999: 217-235
Flip Korn, S. Muthukrishnan: Influence Sets Based on Reverse Nearest Neighbor Queries. SIGMOD Conference 2000: 201-212

September 29: Decision Support (Presenter: Chenhao Tan)

Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, Hamid Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals. Data Min. Knowl. Discov. 1(1): 29-53 (1997) (*)
Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi: On the Computation of Multidimensional Aggregates. VLDB 1996: 506-521
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Implementing Data Cubes Efficiently. SIGMOD Conference 1996: 205-216

October 4: Decision Support II (Presenter: Ruben Sipos)

Sunita Sarawagi, Rakesh Agrawal, Nimrod Megiddo: Discovery-Driven Exploration of OLAP Data Cubes. EDBT 1998: 168-182 (*)
Sunita Sarawagi: Explaining Differences in Multidimensional Aggregates. VLDB 1999: 42-53
Sunita Sarawagi: User-Adaptive Exploration of Multidimensional Data. VLDB 2000: 307-316

October 6: Online aggregation (Presenter: Adith Swaminathan)

Joseph M. Hellerstein, Peter J. Haas, Helen Wang: Online Aggregation. SIGMOD Conference 1997: 171-182
Peter J. Haas, Joseph M. Hellerstein: Ripple Joins for Online Aggregation. SIGMOD Conference 1999: 287-298

October 11: Fall break, no class.

October 13: Approximate query answering I (Presenter: Edward Lui)

Jeffrey Scott Vitter: Random Sampling with a Reservoir. ACM Trans. Math. Softw. 11(1): 37-57 (1985) (*)
Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya: On Random Sampling over Joins. SIGMOD Conference 1999: 263-274

October 18: Approximate query answering II (Presenter: Ashwinkumar B V)

Noga Alon, Yossi Matias, Mario Szegedy: The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1): 137-147 (1999). Read the full version of the paper. [Gödel prize citation]
Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. PODS 1999: 10-20. Read the full version of the paper.

October 20: Approximate query answering III (Presenter: Konstantinos Mamouras)

Andrei Z. Broder, Michael Mitzenmacher: Network Applications of Bloom Filters: A Survey. Internet Mathematics 1(4) (2003)

October 25: Data stream algorithms I (Presenter: Anshumali Shrivastava)

Flajolet and Martin: Probabilistic Counting. FOCS 1983. Or read the following version: Philippe Flajolet and G. Nigel Martin: Probabilistic Counting Algorithms for Database Applications. JCSS 31, 182-209 (1985).
Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)

October 27: Data stream algorithms II (Presenter: Lior Seeman)

J. Ian Munro, Mike Paterson: Selection and Sorting with Limited Storage. Theor. Comput. Sci. 12: 315-323 (1980). (*)
Gurmeet Singh Manku, Sridhar Rajagopalan, Bruce G. Lindsay: Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998: 426-435.
Graham Cormode, Marios Hadjieleftheriou: Methods for finding frequent items in data streams. VLDB J. 19(1): 3-20 (2010)

November 1: Distributed Transaction Management and Replication (Presenter: Elisavet Kozyri)

C. Mohan, B. G. Lindsay, R. Obermarck, "Transaction Management in the R* Distributed Database Management System", TODS 11(4), 1986.
J. Gray, P. Helland, P. E. O'Neil, D. Shasha, "The Dangers of Replication and a Solution", SIGMOD Conference, 1996.

November 3: Parallel Database Systems (Presenter: Yin Lou)

D. J. DeWitt, J. Gray, "Parallel Database Systems: The Future of High Performance Database Systems", CACM 35(6), 1992. (*)
D. DeWitt, et al. "The Gamma Database Machine Project", TKDE 2(1), 1990.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
David DeWitt. MapReduce: A major step backwards. The Database Column, January 17, 2008.
David DeWitt, Michael Stonebraker. MapReduce II. The Database Column, January 25, 2008.

November 8: Data Provenance (Presenter: Raghavendra Rajkumar)

Todd J. Green, Gregory Karvounarakis, Val Tannen: Provenance semirings. PODS 2007: 31-40 (*)
Yael Amsterdamer, Daniel Deutch, Val Tannen: Provenance for aggregate queries. PODS 2011: 153-164

November 10: Probabilistic database systems (Presenter: Samantha Leung)

The first two chapters of this book: http://dx.doi.org/10.2200/S00362ED1V01Y201105DTM016

November 15: Probabilistic database systems II (Presenter: Wenlei Xie)

Christoph Koch: MayBMS: A System for Managing Large Uncertain and Probabilistic Databases. Chapter 6 of Charu Aggarwal, ed., Managing and Mining Uncertain Data, Springer-Verlag, 2008/9.

November 17: Fagin’s Algorithm (Presenter: Ronan Le Bras)

Ronald Fagin: Combining Fuzzy Information from Multiple Systems. J. Comput. Syst. Sci. 58(1): 83-99 (1999)
Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware. PODS 2001

Ronald Fagin: Combining Fuzzy Information: an Overview. SIGMOD Record 31(2): 109-118 (2002)

November 22: Column Stores and Database Cracking (Presenter: Stavros Nikolaou)

http://db.csail.mit.edu/projects/cstore/
Stratos Idreos, Martin L. Kersten, Stefan Manegold: Database Cracking. CIDR 2007: 68-78
Stratos Idreos, Stefan Manegold, Harumi A. Kuno, Goetz Graefe: Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores. PVLDB 4(9): 585-597 (2011)

November 24: Thanksgiving break, no class.

November 29: Data Mining I (Presenter: Bailu Ding)

Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, A. Inkeri Verkamo: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining 1996: 307-328
Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14
Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages 241-272.

December 1: Data Mining II (Presenter: Oren Sigal)

Mohammed J. Zaki. Efficiently Mining Frequent Trees in a Forest. SIGKDD 2002.
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. SIGKDD 2003.