CS6320
Logistics
- Instructor: Immanuel Trummer,
411B Gates Hall; office hours: Wednesdays, 3-4pm or by appointment.
- Class: Tuesdays and Thursdays, 1:25-2:40pm; Bard Hall 140.
Course Description
In this course, we review recent trends and foundational work in the
area of databases and large-scale data analysis. Starting from the
foundations of relational databases, we review recent research that
aims to make data processing more efficient, in areas such as column
stores, main-memory databases, query compilation, and approximate query
engines. We cover parallel and distributed databases, NoSQL and
NewSQL systems, stream processing engines, graph databases, and systems
for data mining and large-scale machine learning. Finally, we review
approaches to make databases more user-friendly, including natural
language interfaces and automated data visualization.
An important component of this course is the course project, which
requires you to research a database-related problem of your choice.
Workload and Grading
- Several class presentations about a topic with associated research papers (25%)
- Participation in class discussions (25%)
- Write a (hopefully publishable) research paper in the area of
database systems (50%). You can do a project by yourself, or with
another student from the class.
- Topic selection. Please talk to me about the topic of your
project to make sure that the project is within the scope of the class.
Several high-level ideas for project topics will be presented in the
first course session. You should have selected a project topic by February 7.
- Project proposal with references. The proposal should contain
your goals for the project and the results of a thorough literature
search. The project proposal is due February 14.
- An intermediate status update the week of March 15. An
email to Immanuel is sufficient.
- The final project report. The project report should be
formatted like a regular paper for a conference submission (use the ACM
style). The final project report is due May 2.
Course Schedule (Draft)
Introduction to the course
Slides.
Basics, Architecture of a Database Management System
Section 1: Foundations
Joins
Indexing
Query Optimization
Selectivity Estimation & Robust Optimization
Concurrency Control
Logging and Recovery
Buffer Management
Section 2: Efficient Query Processing
Column Stores
Main Memory Databases
Query Compilation
Online/Approximate Processing
Processing on Novel Hardware
(Massively) Parallel Processing
Optional: D. J. DeWitt, J. Gray: Parallel database systems: the future of high-performance database systems. CACM 1992.
Data Warehousing vs. MAD Analytics
Section 3: Efficient Transaction Processing
CAP Theorem vs. NoSQL Databases
NewSQL
Coordination Avoidance
Optional: S. Roy et al.: The homeostasis protocol: avoiding transaction coordination through program analysis. SIGMOD 2015.
Section 4: Beyond Relational Data Processing
Optional: Video of M. Stonebraker on "One size fits all: an idea whose time has come and gone".
Graph Databases
Stream Processing
Machine Learning
Knowledge Mining
Section 5: User Interfaces
Novel Query Interfaces
Optional: Video of VLDB 2015 Panel on "Design for Interaction"
Data Visualization
Privacy?
Crowd Database Systems