CS6320 – Big Data
Logistics
- Instructor: Johannes Gehrke, 4105B Upson; office hours: Fridays, 1:15-2:15pm or by
appointment.
- Class: Tuesdays 2:55-4:10pm and Thursdays 2:45-4:00pm; Hollister
306. Occasional makeup classes on Wednesday evenings.
Course Description
Big Data is a research area motivated by lots of practical problems,
resulting in elegant systems abstractions and interesting algorithms. In this
course, we will explore the beauty of some foundational and recent work in this
area which intersects systems, algorithms, and programming languages.
Workload and Grading
- Every week a brief homework about the material that we covered during that week (15%)
- Every week one paper summary (15%). You should able to
read the summary after a few months and still know what the paper was
about. Summary writing is important for researchers: We all tend to
forget the many papers that we read over time, and a good summary
helps us to recollect the main ideas in the paper without investing a lot
of time in re-reading the paper. A good summary should have the following parts:
- Problem Statement:
What is the problem being studied in the paper and what are the major
assumptions??
- Motivation: Why is the
problem being studied and why is this paper being written? Is it a novel
problem? Or is it a novel problem setting? Or does the paper improve
upon existing work?
- Techniques: What
techniques are proposed and what is the basic idea behind the technique?
This should not just list the keywords but should should give an idea of
the technique too.
- What are the benefits
of the technique -- is it easier to implement or does it give better
guarantees or does it run faster or does it provide more functionality
-- over existing techniques?
- Contributions: If it
is not obvious why, give an intuition why the technique is better that
(say outperforms) existing techniques.
- The paper about which you should write a summary is marked with a (*) in the syllabus.
- One class presentation about a topic with associated
research papers (20%)
- Write a (hopefully publishable) research paper in the
area of database systems (49%). You can do a project by yourself, or with
another student from the class.
- Topic selection. Please
talk with me about the topic of your project so that the project is
within the scope of the class. There are several ideas for project topics
already posted within CMS; all of these have the potential of leading to
a strong publication. You should have selected a project topic the latest
by September 11.
- Project proposal with
references. The proposal should contain your goals for the project and the
results of an initial literature search. The project proposal is due
October 2.
- Full literature review
for the project, a formal problem description, and a high-level
discussion of your approach. This part of the project is due October 25.
- An intermediate status
update the week of November 13. An email to Johannes is ok.
- The final project
report. The project report should be formatted like a regular paper for a
conference submission (use the ACM style). The final project is due
December 11.
- Course evaluation (1%)
(Draft) Course
Schedule
August 23:
Introduction to the course
August 28: The
relational model, relational algebra, and normalization (Presenter: Johannes)
August 30: Query
processing and query optimization (Presenter: Johannes)
As we are starting to discuss more database internals over
the next weeks, please read the following paper as background reading (do not
be scared by its length; it is easy to read):
- Joseph M. Hellerstein,
Michael Stonebraker, and James Hamilton. Architecture of a
Database System.
Foundations and Trends in Databases. Vol. 1, No. 2 (2007), 141–259.
September 4:
Buffer Management and Selectivity estimation (Presenter: Johannes)
September 6:
Concurrency control (Presenter: Johannes)
September 11:
Recovery (Presenter: Alan Demers)
- Theo Härder, Andreas Reuter: Principles of Transaction-Oriented Database Recovery. ACM Comput. Surv. 15(4): 287-317(1983)
- Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., Schwarz, P.
ARIES: A Transaction Recovery Method Supporting
Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.
ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992.
September 13: Index
Structures I (Presenter: Alan Demers)
- J. Nievergelt, H.
Hinterberger, K. C. Sevcik: The Grid File: An
Adaptable, Symmetric Multikey File Structure, TODS 9(1), 1984. (*)
- A. Guttman: R-Trees: A Dynamic Index Structure for Spatial
Searching.
SIGMOD Conference, 1984.
- Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos:
The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518
- Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger:
The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles.
SIGMOD Conference 1990: 322-331
September 18: Index
Structures II (Presenter: Hema Koppula)
September 20:
Main-Memory Database Systems (Presenter: Jiexun Xu)
September 25:
Decision Support (Presenter: Albert Liu)
- Jim Gray, Surajit Chaudhuri, Adam Bosworth,
Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, Hamid Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing
Group-by, Cross-Tab, and Sub Totals. Data Min. Knowl. Discov. 1(1): 29-53 (1997) (*)
- Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi:
On the Computation of Multidimensional Aggregates. VLDB 1996: 506-521
- Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman:
Implementing Data Cubes Efficiently. SIGMOD Conf. 1996: 205-26
September 27: Decision
Support II (Presenter: Kevin James Matzen)
October 2:
Index Structures III (Presenter: Eoin O'Mahony)
October 4: Online
aggregation (Presenter: Shuo Chen)
October 9:
Fall break, no class.
October 11:
Approximate query answering I (Presenter: Jon Park)
-
Jeffrey Scott Vitter: Random Sampling with a Reservoir.
ACM Trans. Math. Softw. 11(1): 37-57(1985) (* Note: You only need to write about Algorithms X and Y.)
- Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya: On Random Sampling over Joins. SIGMOD Conference 1999: 263-274
- Optional reading:
Note: No summary for October 16 and 18; work on your course projects.
October 16:
Approximate query answering II (Presenter: Daniel Cabrini Hauagge)
October 18:
Distributed Transaction Management and Replication (Presenter: Yohan Ko)
-
C. Mohan, B. G. Lindsay, R. Obermarck, "Transaction Management in the R* Distributed Database Management System", TODS 11(4), 1986.
- J. Gray, P. Helland, P. E. O'Neil, D. Sasha, "The Dangers of Replication and a Solution", SIGMOD Conference, 1996.
October 23:
Parallel Database Systems (Presenter: Lucja Kot)
- D. J. DeWitt, J. Gray: Parallel Database Systems:
The Future of High Performance Database Systems. CACM 35(6), 1992. (*)
- David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsiao, Rick Rasmussen:
The Gamma Database Machine Project.
IEEE Trans. Knowl. Data Eng. 2(1): 44-62 (1990)
- Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplified Data Processing on Large Clusters.
In Proceedings of OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
- Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, Samuel Madden,
Erik Paulson, Andrew Pavlo, Alexander Rasin: MapReduce and parallel DBMSs:
friends or foes? Commun. ACM 53(1): 64-71 (2010)
- Optional Readings:
October 25: Column Stores (Presenter: Lucja Kot)
- Michael Stonebraker, Daniel J. Abadi, Adam Batkin,
Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau,
Amerson Lin, Samuel Madden, Elizabeth J. O'Neil, Patrick E. O'Neil,
Alex Rasin, Nga Tran, Stanley B. Zdonik:
C-Store: A Column-oriented DBMS. VLDB 2005: 553-564
-
Sándor Héman, Marcin Zukowski, Niels J. Nes, Lefteris Sidirourgos,
Peter A. Boncz: Positional update handling in column stores.
SIGMOD Conference 2010: 543-554
October 30:
Data stream algorithms I (Presenter: Yexiang Xue)
November 1:
Data stream algorithms II (Presenter: Bishan Yang)
-
Gurmeet Singh Manku, Sridhar Rajagopalan, Bruce G. Lindsay: Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998: 426-435.
- Graham Cormode, Marios Hadjieleftheriou: Methods for finding frequent items in data streams. VLDB J. 19(1): 3-20 (2010)
November 6:
Fagin’s Algorithm (Presenter: Yang Yuan)
- Ronald
Fagin: Combining Fuzzy Information from Multiple Systems.
J. Comput.
Syst. Sci. 58(1): 83-99 (1999)
-
Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware. PODS 2001.
- Combining fuzzy information: an overview. SIGMOD Record 31,2, June 2002, pp. 109-118.
November 8:
Frequent itemsets, association rules, and sequential patterns (Presenter: Johannes)
-
Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, A. Inkeri Verkamo: Fast Discovery of Association Rules. Advances in Knowledge Discovery and Data Mining 1996: 307-328
- Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14
November 13: No class.
November 15: Data Stream Systems (Guest Lecture by Walker White)
-
Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey,
Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, Stanley B. Zdonik:
Monitoring Streams - A New Class of Data Management Applications.
VLDB 2002: 215-226. (*)
-
Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Keith Ito,
Rajeev Motwani, Itaru Nishizawa, Utkarsh Srivastava, Dilys Thomas,
Rohit Varma, Jennifer Widom:
STREAM: The Stanford Stream Data Manager. IEEE Data Eng. Bull. 26(1): 19-26 (2003)
- Alan J. Demers, Johannes Gehrke, Mingsheng Hong, Mirek Riedewald, Walker M. White:
Towards Expressive Publish/Subscribe Systems. EDBT 2006: 627-644
- Samuel Madden, Mehul A. Shah, Joseph M. Hellerstein, Vijayshankar Raman:
Continuously adaptive continuous queries over streams. SIGMOD Conference 2002: 49-60
November 20:
Probabilistic Database Systems (Presenter: Vasilis Syrganis)
November 22:
Thanksgiving, no class.
November 27: Paxos (Guest Lecture by Robbert Van Renesse)
-
Leslie Lamport.
Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
November 29: Distribution and Fault Tolerance (Presenter: Johannes)
- Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan:
Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160 (*)
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman,
Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels:
Dynamo: amazon's highly available key-value store. SOSP 2007: 205-220
-
Werner Vogels: Eventually consistent. Commun. ACM 52(1): 40-44 (2009)
-
Marcos Kawazoe Aguilera, Arif Merchant, Mehul A. Shah, Alistair C. Veitch, Christos T. Karamanolis:
Sinfonia: a new paradigm for building scalable distributed systems. SOSP 2007: 159-174
-
Marcos Kawazoe Aguilera, Wojciech M. Golab, Mehul A. Shah:
A practical scalable distributed B-tree. PVLDB 1(1): 598-609 (2008)