CS732: Approximate Query Processing Over Data Streams

Organizers

Overview

This semester's database seminar will deal, as the title suggests, with approximate query processing over data streams. We will cover some introductory material as well as advanced topics, with the goal of identifying interesting research problems in the area.

The seminar will meet Wednesdays and (most) Fridays from 2:30 to 3:30 PM, in Upson 215.

Al Demers and Johannes Gehrke will give the initial lectures; after the first three or four weeks participants in the course will give the remaining lectures.

There will be more discussion of this requirement during the first meeting.

Tentative Schedule

Date Topic Lecturer
January 30

Introduction

S. Babu and J. Widom, Continuous Queries over Data Streams. [BW01]

A. Demers
February 1

J. S. Vitter, Random Sampling With a Reservoir. [V85]

A. Demers
February 6

P. Flajolet and G. Martin, Probabilistic Counting Algorithms for Database Applications. [FM85]

A. Demers
February 8

Kyu-Young Whang, Brad Vander Zanden and Howard Taylor, A Linear Time Probabilistic Counting Algorithm for Database Applications. [WVT90]

A. Demers
March 20-22 Spring Break  
March 29

N. Alon, Y. Matias and M. Szegedy, The Space Complexity of Approximating the Frequency Moments. [AMS96]

Martin Pal
April 3 N. Alon, P. B. Gibbons, Y. Matias and M. Szegedy, Tracking Join and Self-Join Sizes in Limited Storage. [AGM99] Martin Pal
April 5

W. C. Hou, Ozsoyoglu, and B. K. Taneja, Statistical Estimators for Relational Algebra Expressions. [HOT88]

S. Ganguly, P. B. Gibbons, Y. Matias and A. Silberschatz Bifocal sampling for skew-resistant join size estimation. [GGM96]

Sasha
April 10

J. Shanmugasundaram, K. Tufte, D. DeWitt, J. Naughton, D. Maier, " Architecting a Network Query Engine for Producing Partial Results ", Lecture Notes in Computer Science, Vol. 1997, Springer-Verlag Publishers, 2001.

J. Chen, D. DeWitt, F. Tian and Y. Wang, NiagaraCQ: A Scalable Continuous Query System for Internet Databases. [CDTW00]

Jai
April 12

P. B. Gibbons and Y. Matias, New Sampling-Based Summary Statistics for Improving Approximate Query Answers. [GM98]

S. Acharya, P. Gibbons, V. Poosala and S. Ramaswamy, Join Synopses for Fast Approximate Query Answering. [AGP99]

Alin
April 19

S. Chaudhuri, R. Motwani and N. Narasayya, On Random Sampling over Joins. [CMN99]

V. Ganti, M. L. Lee and R. Ramakrishnan, ICICLES: Self-Tuning Samples for Approximate Query Answering. [GLR00]

Adina Costea
April 24

J. Chen, D. DeWitt and J. Naughton, Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries. [CDN02]

S. Madden, M. Shah, and Joe Hellerstein. Continuously Adaptive Continuous Queries over Streams . SIGMOD 2002 [PDF]

Abhinandan
Monday
April 29

Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala: Congressional Samples for Approximate Answering of Group-By Queries. SIGMOD Conference 2000: 487-498

P. B. Gibbons, Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. [Gib01]

Cristi
May 1

P. Seshadri, M. Livny and R. Ramakrishnan, Sequence Query Processing. [SLR94]

P. Seshadri, M. Livny and R. Ramakrishnan, SEQ: A Model for sequence Databases. [SLR95]

Rohit
May 3

S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya, Overcoming Limitations of Sampling for Aggregation Queries. [CDD+01]

S. Chaudhuri, G. Das and V. Narasayya, A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. [CDN01]

Dan

Bibliography

[ABB+01]
A. Arasu, B. Babcock, S. Babu, J. McAlister and J. Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams, Stanford C.S. Tech Report 2001-49, Nov. 2001. [ps, pdf]
[AGM99]
N. Alon, P. B. Gibbons, Y. Matias and M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. ACM PODS, 1999. Phil Gibbons' pulications.
[AGP99]
S. Acharya, P. Gibbons, V. Poosala and S. Ramaswamy. Join Synopses for Fast Approximate Query Answering. ACM SIGMOD, 1999. Full version (pdf).
[AMS96]
N. Alon, Y. Matias and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. ACM STOC, 1996. Yossi Matias' papers. Full version to appear in JCSS STOC'96 Special Issue.
[BSW01]
S. Babu, L. Subramanian and J. Widom. A Data Stream Management System for Network Traffic Management, Proceedings of Workshop on Network-Related Data Management, 2001. [ps, pdf]
[BW01]
S. Babu and J. Widom. Continuous Queries over Data Streams. SIGMOD Record, Sept. 2001. [ps, pdf]
[CDD+01]
S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. Overcoming Limitations of Sampling for Aggregation Queries. Proceedings 17th ICDE, 2001. pdf version 
[CDN01]
S. Chaudhuri, G. Das and V. Narasayya. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. ACM SIGMOD, 2001. pdf version 
[CDN02]
J. Chen, D. DeWitt and J. Naughton. Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries. ICDE, 2002. [pdf]
[CDTW00]
J. Chen, D. DeWitt, F. Tian and Y. Wang. NiagaraCQ: A Scalable Continuous Query System for Internet Databases. ACM SIGMOD, 2000. [pdf, ps]
[CMN99]
S. Chaudhuri, R. Motwani and N. Narasayya. On Random Sampling over Joins. ACM SIGMOD, 1999. pdf version
[Fa99]
R. Fagin. Combining fuzzy information from multiple systems. J. Computer and System Sciences 58, 1999, pp. 83-99. (Special issue for selected papers from the 1996 ACM Symposium on Principles of Database Systems). [pdf]
[FLN01]
R. Fagin, A. Lotem and M. Naor. Optimal aggregation algorithms for middleware. PODS, 2001. [ps, pdf]
[FM85]
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 31, 2, 1985.
[GGM96]
S. Ganguly, P. B. Gibbons, Y. Matias and A. Silberschatz. Bifocal sampling for skew-resistant join size estimation. ACM SIGMOD, 1996.
[Gib01]
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports, VLDB 2001 (pdf).
[GLR00]
V. Ganti, M. L. Lee and R. Ramakrishnan. ICICLES: Self-Tuning Samples for Approximate Query Answering. VLDB, 2000. (pdf)
[GM98]
P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers (pdf).. ACM SIGMOD, 1998.
[HOT88]
W. C. Hou, Ozsoyoglu, and B. K. Taneja. Statistical Estimators for Relational Algebra Expressions. ACM PODS, 1988.
[NDM+00]
J. Naughton, D. DeWitt, D. Maier et al. The Niagara Internet Query System. VLDB, 2000. [pdf]
[SLR94]
P. Seshadri, M. Livny and R. Ramakrishnan. Sequence Query Processing. ACM SIGMOD, 1994. Praveen's papers.
[SLR95]
P. Seshadri, M. Livny and R. Ramakrishnan. SEQ: A Model for sequence Databases. ICDE, 1995. Praveen's papers.
[V85]
J. S. Vitter. Random Sampling with a Reservoir. ACM TOMS, 1985.
[VN01]
E. Viglas and J. Naughton. Rate-based Query Optimization for Streaming Information Sources. (submitted for publication, 2001) [pdf]
[WVT90]
Kyu-Young Whang, Brad Vander Zanden and Howard Taylor. A Linear Time Probabilistic Counting Algorithm for Database Applications. TODS 15, 2, 1990. (pdf)