CS6320 – Advanced Database Systems
Announcements
Logistics:
- Instructor:
Johannes Gehrke, 4105B Upson; office hours: Fridays, 1:15-2:15pm or by
appointment.
- Class: Tuesdays, 2:55-4:10pm, Thursdays 2:50-4:05pm; Hollister 110
Course
Description
Management and processing of large datasets is an area with many practical problems,
elegant systems abstractions, and interesting algorithms. In this course, we
will explore the beauty of some foundational and recent work in this area which
intersects systems, algorithms, and programming languages.
Workload
- Every week one paper
summary (10%; 1% each and the 10 best count)
- Why is paper summary writing important? A good summary should be such that you should able to read the summary after a few months and still know what the paper was about.
Summary writing is important for a researcher. We all tend to forget the many papers that we have read over time, and a good summary
helps us to recollect the main ideas in the paper without investing a lot of time in re-reading the paper.
- A good summary should have these main points:
- Problem Statement: What is the problem being studied in the paper and what are the major assumptions?
- Motivation: Why is the problem being studied and why is this paper being written? Is it a novel problem? Or is it a novel problem setting? Or does the paper improve upon existing work?
- Techniques: What techniques are proposed and what is the basic idea behind the technique? This should not just list the keywords but should should give an idea of the technique too.
- What are the benefits of the technique -- is it easier to implement or does it give better guarantees or does it run faster or does it provide more functionality -- over existing techniques?
- Contributions: If it is not obvious why, give an intuition why the technique is better that (say outperforms) existing techniques.
- Every week answers
to a few questions about the papers (30%, 3% each and the 10 best count)
- One class
presentation about a topic with associated research papers (20%)
- Write a
(hopefully publishable) research paper in the area of database systems (40%). You can do a project by yourself,
or with some other student from the class.
- Topic selection. Please talk with me about the topic of your project so that the project is within the scope of
the class. There are several ideas for project topics already posted within CMS; all of these have the potential of leading
to a strong publication. You should have selected a project topic the latest by September 16.
- Project proposal
with references. The proposal should contain your goals for the project and the
results of an initial literature search. The project proposal is due October 6.
- Full literature
review for the project, a formal problem description, and a high-level
discussion of your approach. This part of the project is due October 27.
- An intermediate
status update the week of November 14. An email to Johannes is ok.
- The final project report. The project
report should be formatted like a regular paper for a conference submission
(use the ACM style). The final project is due December 12.
Course Schedule
August
26: Introduction to the course
August
30: The relational model, relational algebra, and normalization
September
1: Query optimization and query processing
September 6: Selectivity estimation
Background Readings
As we are starting to discuss more database internals over the next weeks,
please read the following paper as background reading (do not be scared by its length; it is easy to read):
- Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton.
Architecture of a Database System. Foundations and TrendsR in
Databases. Vol. 1, No. 2 (2007), 141–259.
September 8: Concurrency control
September 13: Recovery I (Presenter: Amit Sharma)
September 15: Main-Memory Database Systems (Presenter: Ji-Yong Shin)
September 20: Recovery II (Presenter: Gabriel Bender)
- Marcos
Antonio Vaz Salles, Tuan
Cao, Benjamin
Sowell, Alan
J. Demers, Johannes Gehrke, Christoph
Koch, Walker
M. White: An Evaluation of Checkpoint Recovery for Massively Multiplayer
Online Games. PVLDB
2(1): 1258-1269 (2009)
- Tuan
Cao, Marcos
Antonio Vaz Salles, Benjamin
Sowell, Yao
Yue, Alan
J. Demers, Johannes Gehrke, Walker
M. White: Fast checkpoint recovery algorithms for frequently consistent applications.
SIGMOD
Conference 2011: 265-276
- Alfons Kemper, Thomas Neumann: HyPer: A hybrid OLTP and OLAP main
memory database system based on virtual memory snapshots. ICDE 2011: 195-206
- Henrik
Mühe, Alfons Kemper, Thomas
Neumann: How to efficiently snapshot transactional data: hardware or
software controlled? DaMoN
2011: 17-26
September 22: Index Structures I (Presenter: Joyce Chen)
- A. Guttman,
"R-Trees:
A Dynamic Index Structure for Spatial Searching", SIGMOD
Conference, 1984. (*)
- Timos K.
Sellis,
Nick
Roussopoulos, Christos
Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects.
VLDB
1987: 507-518
- Norbert Beckmann
, Hans-Peter
Kriegel, Ralf
Schneider, Bernhard
Seeger: The R*-Tree: An Efficient and Robust
Access Method for Points and Rectangles. SIGMOD
Conference 1990: 322-331
- J. Nievergelt,
H. Hinterberger, K. C. Sevcik: The
Grid File: An Adaptable, Symmetric Multikey File Structure,
TODS 9(1), 1984.
September 27: Index Structures II (Presenter: Karthik Raman)
September
29: Decision Support (Presenter: Chenhao Tan)
- Jim Gray, Surajit
Chaudhuri, Adam
Bosworth, Andrew
Layman, Don
Reichart, Murali
Venkatrao, Frank
Pellow, Hamid
Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing
Group-by, Cross-Tab, and Sub Totals. Data
Min. Knowl. Discov. 1(1): 29-53 (1997) (*)
- Sameet
Agarwal,Rakesh Agrawal, Prasad
Deshpande,Ashish Gupta, Jeffrey
F. Naughton, Raghu
Ramakrishnan, Sunita
Sarawagi: On the Computation of Multidimensional Aggregates.VLDB
1996: 506-521
- Venky
Harinarayan, Anand Rajaraman,
Jeffrey
D. Ullman: Implementing Data Cubes
Efficiently. SIGMOD
Conf. 1996: 205-26
October 4: Decision Support II (Presenter: Ruben Sipos)
- Sunita Sarawagi, Rakesh Agrawal, Nimrod
Megiddo: Discovery-Driven Exploration of OLAP
Data Cubes. EDBT
1998: 168-182 (*)
- Sunita
Sarawagi: Explaining Differences in Multidimensional Aggregates. VLDB
1999: 42-53
- Sunita
Sarawagi: User-Adaptive Exploration of Multidimensional Data. VLDB
2000: 307-316
October 6: Online aggregation (Presenter: Adith Swaminathan)
October 11: Fall break, no class.
October 13: Approximate query answering I (Presenter: Edward Lui)
October 18: Approximate query
answering II (Presenter: Ashwinkumar B V)
October 20: Approximate Query Anwering III (Presenter: Konstantinos Mamouras)
October 25: Data stream algorithms I (Presenter: Anshumali Shrivastava)
October 27: Data stream algorithms II (Presenter: Lior Seeman)
November
1: Distributed Transaction Management and Replication (Presenter: Elisavet Kozyri)
- C. Mohan, B. G. Lindsay, R. Obermarck,
"Transaction
Management in the R* Distributed Database Management System", TODS
11(4), 1986.
- J. Gray, P. Helland, P. E. O'Neil, D.
Sasha, "The
Dangers of Replication and a Solution", SIGMOD Conference, 1996.
November
3: Parallel Database Systems (Presenter: Yin Lou)
-
D. J. DeWitt, J. Gray, "Parallel
Database Systems: The Future of High Performance Database Systems",
CACM 35(6), 1992. (*)
- D. DeWitt, et al. "The Gamma
Database Machine Project", TKDE 2(1), 1990.
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data
Processing on Large Clusters. In Proceedings of OSDI'04: Sixth Symposium on
Operating System Design and Implementation, San Francisco, CA, December, 2004.
- David Dewitt. MapReduce:
A major step backwards. The Database Column, January 17, 2008.
- David Dewitt, Michael Stonebraker. MapReduce
II. The Database Column, January 25, 2008.
November
8: Data Provenance (Presenter: Raghavendra Rajkumar)
November
10: Probabilistic database systems (Presenter: Samantha Leung)
November
15: Probabilistic database systems II (Presenter: Wenlei Xie)
November 17: Fagin’s Algorithm (Presenter: Ronan Le Bras)
November 22: Column Stores and Database Cracking (Presenter: Stavros Nikolaou)
November
24: Thanksgiving break, no class.
November
29: Data Mining I (Presenter: Bailu Ding)
- Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, A. Inkeri Verkamo: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining 1996: 307-328
- Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14
- Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints.
Data Mining and Knowledge Discovery, Vol. 7, Issue 4, July 2003, pages 241-272.
December 1: Data Mining II (Presenter: Oren Sigal)