CS6320 – Advanced Database Systems
Johannes Gehrke, 4105B Upson; office hours: Fridays, 1:15-2:15pm or by
- Class: Tuesdays, 2:55-4:10pm, Thursdays 2:50-4:05pm; Hollister 110
Management and processing of large datasets is an area with many practical problems,
elegant systems abstractions, and interesting algorithms. In this course, we
will explore the beauty of some foundational and recent work in this area which
intersects systems, algorithms, and programming languages.
- Every week one paper
summary (10%; 1% each and the 10 best count)
- Why is paper summary writing important? A good summary should be such that you should able to read the summary after a few months and still know what the paper was about.
Summary writing is important for a researcher. We all tend to forget the many papers that we have read over time, and a good summary
helps us to recollect the main ideas in the paper without investing a lot of time in re-reading the paper.
- A good summary should have these main points:
- Problem Statement: What is the problem being studied in the paper and what are the major assumptions?
- Motivation: Why is the problem being studied and why is this paper being written? Is it a novel problem? Or is it a novel problem setting? Or does the paper improve upon existing work?
- Techniques: What techniques are proposed and what is the basic idea behind the technique? This should not just list the keywords but should should give an idea of the technique too.
- What are the benefits of the technique -- is it easier to implement or does it give better guarantees or does it run faster or does it provide more functionality -- over existing techniques?
- Contributions: If it is not obvious why, give an intuition why the technique is better that (say outperforms) existing techniques.
- Every week answers
to a few questions about the papers (30%, 3% each and the 10 best count)
- One class
presentation about a topic with associated research papers (20%)
- Write a
(hopefully publishable) research paper in the area of database systems (40%). You can do a project by yourself,
or with some other student from the class.
- Topic selection. Please talk with me about the topic of your project so that the project is within the scope of
the class. There are several ideas for project topics already posted within CMS; all of these have the potential of leading
to a strong publication. You should have selected a project topic the latest by September 16.
- Project proposal
with references. The proposal should contain your goals for the project and the
results of an initial literature search. The project proposal is due October 6.
- Full literature
review for the project, a formal problem description, and a high-level
discussion of your approach. This part of the project is due October 27.
- An intermediate
status update the week of November 14. An email to Johannes is ok.
- The final project report. The project
report should be formatted like a regular paper for a conference submission
(use the ACM style). The final project is due December 12.
26: Introduction to the course
30: The relational model, relational algebra, and normalization
1: Query optimization and query processing
September 6: Selectivity estimation
As we are starting to discuss more database internals over the next weeks,
please read the following paper as background reading (do not be scared by its length; it is easy to read):
- Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton.
Architecture of a Database System. Foundations and TrendsR in
Databases. Vol. 1, No. 2 (2007), 141–259.
September 8: Concurrency control
September 13: Recovery I (Presenter: Amit Sharma)
September 15: Main-Memory Database Systems (Presenter: Ji-Yong Shin)
September 20: Recovery II (Presenter: Gabriel Bender)
Antonio Vaz Salles, Tuan
J. Demers, Johannes Gehrke, Christoph
M. White: An Evaluation of Checkpoint Recovery for Massively Multiplayer
Online Games. PVLDB
2(1): 1258-1269 (2009)
Antonio Vaz Salles, Benjamin
J. Demers, Johannes Gehrke, Walker
M. White: Fast checkpoint recovery algorithms for frequently consistent applications.
Conference 2011: 265-276
- Alfons Kemper, Thomas Neumann: HyPer: A hybrid OLTP and OLAP main
memory database system based on virtual memory snapshots. ICDE 2011: 195-206
Mühe, Alfons Kemper, Thomas
Neumann: How to efficiently snapshot transactional data: hardware or
software controlled? DaMoN
September 22: Index Structures I (Presenter: Joyce Chen)
- A. Guttman,
A Dynamic Index Structure for Spatial Searching", SIGMOD
Conference, 1984. (*)
- Timos K.
Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects.
- Norbert Beckmann
Seeger: The R*-Tree: An Efficient and Robust
Access Method for Points and Rectangles. SIGMOD
Conference 1990: 322-331
- J. Nievergelt,
H. Hinterberger, K. C. Sevcik: The
Grid File: An Adaptable, Symmetric Multikey File Structure,
TODS 9(1), 1984.
September 27: Index Structures II (Presenter: Karthik Raman)
29: Decision Support (Presenter: Chenhao Tan)
- Jim Gray, Surajit
Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing
Group-by, Cross-Tab, and Sub Totals. Data
Min. Knowl. Discov. 1(1): 29-53 (1997) (*)
Agarwal,Rakesh Agrawal, Prasad
Deshpande,Ashish Gupta, Jeffrey
F. Naughton, Raghu
Sarawagi: On the Computation of Multidimensional Aggregates.VLDB
Harinarayan, Anand Rajaraman,
D. Ullman: Implementing Data Cubes
Conf. 1996: 205-26
October 4: Decision Support II (Presenter: Ruben Sipos)
- Sunita Sarawagi, Rakesh Agrawal, Nimrod
Megiddo: Discovery-Driven Exploration of OLAP
Data Cubes. EDBT
1998: 168-182 (*)
Sarawagi: Explaining Differences in Multidimensional Aggregates. VLDB
Sarawagi: User-Adaptive Exploration of Multidimensional Data. VLDB
October 6: Online aggregation (Presenter: Adith Swaminathan)
October 11: Fall break, no class.
October 13: Approximate query answering I (Presenter: Edward Lui)
October 18: Approximate query
answering II (Presenter: Ashwinkumar B V)
October 20: Approximate Query Anwering III (Presenter: Konstantinos Mamouras)
October 25: Data stream algorithms I (Presenter: Anshumali Shrivastava)
October 27: Data stream algorithms II (Presenter: Lior Seeman)
1: Distributed Transaction Management and Replication (Presenter: Elisavet Kozyri)
- C. Mohan, B. G. Lindsay, R. Obermarck,
Management in the R* Distributed Database Management System", TODS
- J. Gray, P. Helland, P. E. O'Neil, D.
Dangers of Replication and a Solution", SIGMOD Conference, 1996.
3: Parallel Database Systems (Presenter: Yin Lou)
D. J. DeWitt, J. Gray, "Parallel
Database Systems: The Future of High Performance Database Systems",
CACM 35(6), 1992. (*)
- D. DeWitt, et al. "The Gamma
Database Machine Project", TKDE 2(1), 1990.
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. In Proceedings of OSDI'04: Sixth Symposium on
Operating System Design and Implementation, San Francisco, CA, December, 2004.
- David Dewitt. MapReduce:
A major step backwards. The Database Column, January 17, 2008.
- David Dewitt, Michael Stonebraker. MapReduce
II. The Database Column, January 25, 2008.
8: Data Provenance (Presenter: Raghavendra Rajkumar)
10: Probabilistic database systems (Presenter: Samantha Leung)
15: Probabilistic database systems II (Presenter: Wenlei Xie)
November 17: Fagin’s Algorithm (Presenter: Ronan Le Bras)
November 22: Column Stores and Database Cracking (Presenter: Stavros Nikolaou)
24: Thanksgiving break, no class.
29: Data Mining I (Presenter: Bailu Ding)
- Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, A. Inkeri Verkamo: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining 1996: 307-328
- Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14
- Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints.
Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages 241-272.
December 1: Data Mining II (Presenter: Oren Sigal)