Project Title

Interactive and Online Data Mining

NSF Award IIS-0084762

Principal Investigator

Johannes E. Gehrke

Department of Computer Science

Cornell University

4105B Upson Hall

Ithaca, NY 14850

Phone: 607-255-1045

Fax: 607-255-4428
johannes@cs.cornell.edu

http://www.cs.cornell.edu/johannes

Keywords

Data mining, data stream processing, online data mining, mining data streams

Project Summary

This project spans two topics: Performance of data mining model construction, and construction of data mining models and computation of queries over high-speed data streams.

First, we are working on increasing the performance of data mining algorithms in order to push performance towards interactive data exploration with complex data mining models. We have developed two of the fastest algorithms for mining frequent patterns and mining sequential patterns published in the literature, and implementations of these algorithms can be downloaded at http://himalaya-tools.sourceforge.net. Second, we are investigating model construction and online query processing over high-speed data streams where data can be read only once and in the order the data arrives. We have developed algorithms for computing correlated aggregates over data streams, a novel method for computing complex aggregates over data stream joins, and the first algorithm for approximately computing data stream joins under resource constraints.

Publications and Products

Doug Burdick, Manuel Calimlim, and J. E. Gehrke. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. In Proceedings of the 17th International Conference on Data Engineering (ICDE 2001), Heidelberg, Germany, April 2001.
J. E. Gehrke, Flip Korn, and Divesh Srivastava. On Computing Correlated Aggregates Over Continual Data Streams. In Proceedings of the 2001 ACM Sigmod International Conference on Management of Data (SIGMOD 2001), Santa Barbara, California, May 2001.
Venkatesh Ganti, J. E. Gehrke, and Raghu Ramakrishnan. Mining Data Streams under Block Evolution. Invited paper to SIGKDD Explorations, Volume 3, Issue 2, January 2002.
J. E. Gehrke. Report on the SIGKDD 2001 Conference Panel “New Research Directions in KDD”. SIGKDD Explorations, Volume 3, Issue 2, January 2002.
Alin Dobra, Minos Garofalakis, J. E. Gehrke, and Rajeev Rastogi. Processing Complex Aggregate Queries over Data Streams. In Proceedings of the 2002 ACM Sigmod International Conference on Management of Data, Madison (SIGMOD 2002), Wisconsin, June 2002.
Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick. Sequential PAttern Mining Using Bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
Alan Demers, J. E. Gehrke, and Mirek Riedewald. Research Issues in Distributed Mining and Monitoring. In the Informal Proceedings of the National Science Foundation Workshop on Next Generation Data Mining (NGDM 2002). Baltimore, Maryland, November 2002.
Rohit Ananthakrishna, Abhinandan Das, J. E. Gehrke, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Efficient Approximation of Correlated Sums on Data Streams. IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003, pages 569-572.
Abhinandan Das, J. E. Gehrke, and Mirek Riedewald. Approximate Join Processing Over Data Streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD 2003). San Diego, CA, June 2003.

We also gave the following tutorial based on research funded under this grant: Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Querying and Mining Data Streams: You Only Get One Look. Tutorial at the following conferences:

2002 ACM SIGMOD International Conference on Management of Data (SIGMOD 2002), Madison, Wisconsin, June 2002.
8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002), Edmonton, Alberta, July 2002.
28th International Conference on Very Large Databases (VLDB 2002), Hong Kong, China, August 2002.
Invited tutorial at the Third International Conference on Web-Age Information Management (WAIM 2002), Beijing, China, August 2002.

Project Impact

We developed two of the fastest implementations for two popular data mining techniques, large itemsets and sequential patterns. Our tools are publicly available at http://himalaya-tools.sourceforge.net. The site averages about 50 page views per days, and our code is downloaded about three times per day.

We also apply our tools and data mining experience to real data. We are collaborating with scientists from the Cornell Lab of Ornithology on the analysis of citizen science data, and we are collaborating with scientists from the Stanford Linear Accelerator Center on the analysis of high-energy physics data.

Goals, Objectives and Targeted Activities

The objective of this research is to develop novel methods for high-performance data mining, and to research algorithms for processing high-speed data streams. From a research standpoint, the central questions are the following:

The significance of performance. In both interactive and online data mining, performance is of crucial importance. There has been a lot of research on hiding the difference in access time between main memory and disk. But there are no studies highlighting the actual system bottlenecks and we do not understand how to make optimal use of all available hardware resources. As part of this project, we have developed novel algorithms based on efficient bitmap manipulations that have improved the speed of data mining model construction over an order of magnitude for two important data mining problems: large itemsets and sequential patterns.

The relationship between offline and online query processing. Traditional Database Management Systems software is built on the concept of persistent data sets, that are stored reliably in stable storage and queried/updated several times throughout their lifetime. For several emerging application domains, however, data arrives and needs to be processed on a continuous 24x7 basis, without the benefit of several passes over a static, persistent data image. What classes of queries are amenable for processing online data streams, and what types of provable guarantees can be give in case we allow approximate query answers?

Our work is based on the following activities:

Development of novel algorithms that make use of highly efficient data structures tuned for the memory hierarchies of modern processors.
Design of query processing algorithms that use only one pass over the data, and thus are suitable for the online processing of high-speed data streams.

Area Background

There has been a plethora of recent work on data mining focusing on improving the performance of existing data mining algorithms. There is also a lot of recent interest in developing systems and algorithms for processing high-speed data streams. Some of the recent textbooks on data mining (see area references below) provide a good introduction on how to improve performance for data mining algorithms. An overview of algorithms for data stream processing can be found in our recent tutorial (see publications and products).

Area References

N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, pages 20–29, Philadelphia, Pennsylvania, May 1996.
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the 21^st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Madison, Wisconsin, June 2002.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. In Proceedings of the 28^th International Conference on Very Large Data Bases. Hong Kong, China, August 2002.
Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. “Maintaining Stream Statistics over Sliding Windows”. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, January 2002.
Minos Garofalakis and Phillip B. Gibbons. “Approximate Query Processing: Taming the Terabytes”. Tutorial in the 27th International. Conference on Very Large Data Bases, Roma, Italy, September 2001.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2001). San Francisco, California, August 2001.
J. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985.

Related Projects

Mining Dynamics of Data Streams in Multi-Dimensional Space. Jiawei Han, University of Illinois at Urbana-Champaign. http://www-faculty.cs.uiuc.edu/~hanj/
The Stanford Stream Data Manager. http://www-db.stanford.edu/stream/

Project Websites

http://www.cs.cornell.edu/database/himalaya

This website is the main website of the Cornell Himalaya Data Mining Project with links to all ongoing data mining activities.

Online Software

http://himalaya-tools.sourceforge.net.

We provide the full source code of the tools that we developed including additional documentation that help understanding how the algorithms work.