Advanced Database Systems

CS632 Spring 2001

News:

4/13: The final presentations are on 5/11/2001 in the systems lab, starting at 2pm. Prepare a 15 minute presentation about your project. Your final project report is due at that time as well. The final exam (48 hours) will be handed out after the presentations. 
2/15: Remember that we moved the start time of the class to 2:50pm due to the colloquia directly after class.
2/5: The reading list has been nearly completed. 
2/5: One project has been taken (see the list of projects). If you have any questions about the projects, feel welcome to stop by this afternoon to talk with me.
1/30: The readings can only be accessed from within the CUCS firewall due to copyright issues.
1/28: A project timeline has been added at the bottom of this page.
1/28: The list of project suggestions is available online. If you decide on a project, send me email, and I will send you a list of related work. You can do a project by yourself, or you can work in a group of students. I will stay after class on Tuesday, so that we can talk about projects if you have any questions.
1/23: You can sign up for your presentation date by sending email to johannes@cs.cornell.edu 

Logistics:

Class: TR 2:50-4:05, Thurston 203

Instructor

Johannes Gehrke
4108 Upson Hall; johannes@cs.cornell.edu
Office hours: Wednesdays 2:30-3:30pm, or by appointment.

Teaching Assistant

Alexandre Evfimievski
5136 Upson Hall;  aevf@cs.cornell.edu
Office hours: Mondays 2:30-3:30pm, or by appointment

Overview

CS632 covers on-going trends in database systems, focusing on database fundamentals, but also illustrating new research directions and current problems. The course pre-requisite is CS 432 or equivalent introductory background material in database systems. We will study a collection of papers, some of these are "classics" from the early days of relational databases, others are more recent papers that have influenced the field. At the end of the course, you will have an in-depth understanding of the state of the art in database systems.

Tentative Course Outline

Date Topic and Paper Presenter Notes
1/23 Introduction to the course; review of some CS432 material

Slides on relational algebra: PDF, PDF (6 slides per page)

Slides on normalization: PDF, PDF (6 slides per page)

Johannes
1/25 Buffer management

Hong-Tai Chou, David J. DeWitt: An Evaluation of Buffer Management Strategies for Relational Database Systems. VLDB 1985: 127-141

Join Processing

Leonard D. Shapiro: Join Processing in Database Systems with Large Main Memories. TODS 11(3): 239-264(1986)

Johannes
1/30 Query optimization

Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979: 23-34

Yannis E. Ioannidis, Younkyung Cha Kang: Randomized Algorithms for Optimizing Large Join Queries. SIGMOD Conference 1990: 312-321

Johannes
2/1 Selectivity estimation

"Improved Histograms for Selectivity Estimation of Range Predicates", by V. Poosala, Y.E. Ioannidis, P. J. Haas, and E. Shekita, Proc. of the 1996 ACM SIGMOD Conference, Montreal, Canada, June 1996.

"Balancing Histogram Optimality and Practicality for Query Result Size Estimation", by Y.E. Ioannidis and V. Poosala, Proc. of the 1995 ACM SIGMOD Conference, San Jose, CA, May 1995, pages 233-244.

Johannes
2/6 Concurrency control

Rakesh Agrawal, Michael J. Carey, Miron Livny: Models for Studying Concurrency Control Performance: Alternatives and Implications. SIGMOD Conference 1985: 108-121

H. T. Kung, John T. Robinson: On Optimistic Methods for Concurrency Control. TODS 6(2): 213-226(1981)

Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, Elizabeth J. O'Neil, Patrick E. O'Neil: A Critique of ANSI SQL Isolation Levels. SIGMOD Conference 1995: 1-10

Nate Nystrom

Slides from Nate

2/8 Recovery

Theo H�rder, Andreas Reuter: Principles of Transaction-Oriented Database Recovery. Computing Surveys 15(4): 287-317(1983)

C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh, Peter Schwarz: ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. TODS 17(1): 94-162(1992)

Cristian Bucila

Slides from Cristian

2/13 Distributed database systems

Lothar F. Mackert, Guy M. Lohman: R* Optimizer Validation and Performance Evaluation for Local Queries. SIGMOD Conference 1986: 84-95 

Lothar F. Mackert, Guy M. Lohman: R* Optimizer Validation and Performance Evaluation for Distributed Queries. VLDB 1986: 149-159
Michael Morse

Slides from Michael

2/15 Distributed query processing 

C. Mohan, Bruce G. Lindsay, Ron Obermarck: Transaction Management in the R* Distributed Database Management System. TODS 11(4): 378-396 (1986)

The following two papers are on the Garlic Project Home Page.

Laura M. Haas, Donald Kossmann, Edward L. Wimmers, Jun Yang: Optimizing Queries Across Diverse Data Sources. VLDB 1997: 276-285

Mary Tork Roth, Peter M. Schwarz: Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources. VLDB 1997: 266-275
Ashish Motivala
2/20 Replication (and its dangers)

Jim Gray, Pat Helland, Patrick E. O'Neil, Dennis Shasha: The Dangers of Replication and a Solution. SIGMOD Conf. 1996: 173-182

Guest lecture by Professor Al Demers.
2/22 Thor

Providing Persistent Objects in Distributed Systems. Proceedings of the 13th European Conference on Object-Oriented Programming (ECOOP '99), Lisbon, Portugal, June 1999.  [PDF] Barbara Liskov, Miguel Castro, Liuba Shrira, Atul Adya

HAC: Hybrid Adaptive Caching for Distributed Storage Systems. Proceedings of the ACM Symposium on Operating System Principles (SOSP '97), Saint Malo, France, October 1997.   [PDF] Miguel Castro, Atul Adya, Barbara Liskov, Andrew C. Myers

Guest lecture by Andrew Myers

Slides

2/27 Main-memory database systems: Dali

The Architecture of the Dali Storage Manager.  A recent summary of Dali's architecture, features, and research contributions. This is the best overview of the system as it is today. Published in the Journal of Multimedia Tools and Applications, 4/2, 1997.

Tobin J. Lehman, Michael J. Carey: A Study of Index Structures for Main Memory Database Management Systems. VLDB 1986: 294-303

Adina Costea

Slides from Adina

3/1 Main-memory database systems

``Cache Conscious Indexing for Decision-Support in Main Memory''  J. Rao and  K. A. Ross, Proceedings of the 1999 VLDB Conference, September 1999. postscript

``Making B+-Trees Cache Conscious in Main Memory''  J. Rao and  K. A. Ross, Proceedings of the 2000 SIGMOD Conference, May 2000. postscript

Peter A. Boncz, Stefan Manegold, Martin L. Kersten: Database Architecture Optimized for the New Bottleneck: Memory Access. VLDB 1999: 54-65

Guest lecture by Zhiyuan Chen

Slides from Zhiyuan

3/6 Decision support: Bitmap indices

Patrick E. O'Neil, Dallan Quass: Improved Query Performance with Variant Indexes. SIGMOD Conference 1997: 38-49

C.Y. Chan and Y.E. Ioannidis, ``Bitmap Index Design and Evaluation'' , Proceedings of ACM SIGMOD Intl' Conference, Seattle, Washington, June 1998, 355-366.
Rohit Ananthakrishna

Rohit's disk crashed and the slides are lost.

3/8 Decision support: Computing the cube

Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi: On the Computation of Multidimensional Aggregates. VLDB 1996: 506-521

Yihong Zhao, Prasad Deshpande, Jeffrey F. Naughton: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. SIGMOD Conference 1997: 159-170
Abhinandan Das

Slides from Abhinandan

3/13 Decision support: OLAP

Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Implementing Data Cubes Efficiently. SIGMOD Conf. 1996: 205-216 

Sunita Sarawagi, Rakesh Agrawal, Nimrod Megiddo: Discovery-Driven Exploration of OLAP Data Cubes. EDBT 1998: 168-182
Ali Anvari

Slides from Ali

3/15 Database Tuning

Optional Reading:

Data Engineering Bulletin June 1999
Self-Tuning Databases and Application Tuning (Surajit Chaudhuri, editor)
ftp://ftp.research.microsoft.com/pub/debull/99JUN-CD.pdf 

Guest lecture by Philippe Bonnet Philippe Bonnet is an expert on database tuning.
3/27 Online aggregation 

Joseph M. Hellerstein, Peter J. Haas, Helen Wang: Online Aggregation. SIGMOD Conference 1997: 171-182

Peter J. Haas, Joseph M. Hellerstein: Ripple Joins for Online Aggregation. SIGMOD Conference 1999: 287-298

Vijayshankar Raman, Bhaskaran Raman, Joseph M. Hellerstein: Online Dynamic Reordering for Interactive Data Processing. VLDB 1999: 709-720
Yanling Wang

Slides from Yanling

3/29

Parallel database systems and high-performance sorting

David J. DeWitt, Jim Gray: Parallel Database Systems: The Future of High Performance Database Systems. CACM 35(6): 85-98 (1992)

David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsiao, Rick Rasmussen: The Gamma Database Machine Project. TKDE 2(1): 44-62 (1990)

Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, David B. Lomet: AlphaSort: A Cache-Sensitive Parallel External Sort. VLDB Journal 4(4): 603-627 (1995)

The papers are on Tobias' web site.

Guest lecture by Tobias Mayr
4/3 Data mining

Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499

Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14
Kevin Seng

Slides from Kevin

4/5 Data mining

Manish Mehta, Rakesh Agrawal, Jorma Rissanen: SLIQ: A Fast Scalable Classifier for Data Mining. EDBT 1996: 18-32

John C. Shafer, Rakesh Agrawal, Manish Mehta: SPRINT: A Scalable Parallel Classifier for Data Mining. VLDB 1996: 544-555

Pedro Domingos, Geoff Hulten: Mining high-speed data streams. 71-80 Electronic Edition (ACM DL)
Tomi Yiu

Slides from Tomi

4/10 High-dimensional indexing

Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57

Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331

Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft: When Is ''Nearest Neighbor'' Meaningful? ICDT 1999: 217-235

Dan Kifer

Slides from Dan

Project overview due in class

 

4/12 Nearest neighbor queries

Nick Roussopoulos, Stephen Kelley, Frederic Vincent: Nearest Neighbor Queries. SIGMOD Conference 1995: 71-79. http://www.cs.umd.edu/~nick/papers/nnpaper.html 

G. R. Hjaltason and H. Samet, Distance browsing in spatial databases, ACM Transactions on Database Systems 24, 2 (June 1999), 265-318. http://www.cs.umd.edu/~grh/incnear2.ps.gz 

Sung-hsun Su

Slides from Sung-hsun

4/17 Object-relational database systems

Michael Stonebraker: Inclusion of New Types in Relational Data Base Systems. ICDE 1986: 262-269

Michael Stonebraker, Greg Kemnitz: The Postgres Next Generation Database Management System. CACM 34(10): 78-92 (1991)

PREDATOR System Design Document: A detailed description of internal design decisions and implementation details. Available as a word document , in postscript , or as HTML.
Yong Yao

Slides from Yong

Johannes out of town
4/19 Client-server caching and Object Stores

Seth J. White, David J. DeWitt: QuickStore: A High Performance Mapped Object Store. VLDB Journal 4(4): 629-673 (1995)

Michael J. Franklin, Michael J. Carey: Client-Server Caching Revisited. IWDOM 1992: 57-78

Ben Atkin

Slides from Ben

Johannes out of town
4/24 Database systems and XML

J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, B. Reinwald, "Efficiently Publishing Relational Data as XML Documents", VLDB Conference, September 2000. Click here for the slides.

J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, J. Naughton, "Relational Databases for Querying XML Documents: Limitations and Opportunities," VLDB Conference, September 1999. Click here for the slides.

David Wu

Slides from David

4/26 Deductive database systems

Chapter 27, Ramakrishnan and Gehrke: Database Management Systems (Second Edition)

Qinghong Zhang and Jinghao Zhang
5/1 no class; work on the projects
5/3 The Future of DBMS Research

TBA

Johannes

 

Coursework

Homework for each class: You have to read the papers that will be discussed in the class carefully.

Exams: Two exams: one take-home prelim and one take-home final exam.

Class presentation: You will give one of the lectures in this course by preparing a slide presentation of the material in the papers for this class.

Course project: As part of the course, you will do a database/data mining research project in groups of two students. Usually several of the projects will lead to publishable results, and I encourage you to have a publication as a goal in the course. I will provide a list of projects, and you can choose from this list, but you can also choose your own project.

The project will have the following milestones:
1. Survey of the relevant literature. A survey of the relevant literature related to the project.
2. A detailed description of your project. The overview includes a description of your algorithm and ideas, the design of the piece of software you are proposing to build, a description of how you propose to evaluate your ideas, and the design of the experiments that you will run.
3. Implementation and performance analysis.
4. Final project report and a short presentation to the class about your course project.

Grading:

Your paper presentation in class: 15%
The class project: 50%
Two exams: 30% (15% each)
Class participation: 5%

Important dates

2/6 Project decisions are made.
3/6 Literature survey and draft of the project overview due in class
3/13 Handout of the take-home midterm
3/15 Midterm due in class
4/10 Project overview with detailed description of the algorithms and design of the experimental evaluation due in class
5/11 Final project presentations and final project reports due
5/11 Handout of the final exam after the presentations (48 hours)
5/13 Handin of the final exam