Interactive and Online Data Mining
NSF Award IIS-0084762
Principal Investigator
Johannes
E. Gehrke
Department
of Computer Science
Cornell University
4105B
Upson Hall
Ithaca, NY 14850
Phone:
607-255-1045
Fax:
607-255-4428
johannes@cs.cornell.edu
http://www.cs.cornell.edu/johannes
Keywords
Data
mining, data stream processing, online data mining, mining data streams
Project Summary
This project spans two topics: Performance of data
mining model construction, and construction of data mining models and
computation of queries over high-speed data streams.
First, we are working on increasing the performance
of data mining algorithms in order to push performance towards interactive data
exploration with complex data mining models. We have developed two of the
fastest algorithms for mining frequent patterns and mining sequential patterns
published in the literature, and implementations of these algorithms can be downloaded
at http://himalaya-tools.sourceforge.net.
Second, we are investigating model construction and online query processing
over high-speed data streams where data can be read only once and in the order
the data arrives. We have developed algorithms for computing correlated
aggregates over data streams, a novel method for computing complex aggregates
over data stream joins, and the first algorithm for approximately computing
data stream joins under resource constraints.
Publications and Products
- Doug Burdick, Manuel Calimlim, and J. E.
Gehrke. MAFIA: A Maximal Frequent Itemset
Algorithm for Transactional Databases. In Proceedings of the 17th
International Conference on Data Engineering (ICDE 2001), Heidelberg, Germany, April 2001.
- J. E. Gehrke, Flip Korn, and Divesh Srivastava. On
Computing Correlated Aggregates Over Continual
Data Streams. In Proceedings of the 2001 ACM Sigmod International
Conference on Management of Data (SIGMOD 2001), Santa Barbara, California, May 2001.
- Venkatesh Ganti, J. E.
Gehrke, and Raghu Ramakrishnan. Mining Data
Streams under Block Evolution. Invited paper to SIGKDD Explorations,
Volume 3, Issue 2, January 2002.
- J. E. Gehrke. Report on the SIGKDD 2001 Conference Panel “New
Research Directions in KDD”. SIGKDD Explorations, Volume 3, Issue
2, January 2002.
- Alin Dobra, Minos Garofalakis,
J. E. Gehrke, and Rajeev Rastogi. Processing
Complex Aggregate Queries over Data Streams. In Proceedings of the 2002
ACM Sigmod International Conference on Management of Data, Madison (SIGMOD 2002), Wisconsin, June 2002.
- Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick.
Sequential PAttern Mining Using Bitmaps. In Proceedings
of the Eighth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
- Alan Demers, J. E. Gehrke, and Mirek Riedewald. Research Issues in
Distributed Mining and Monitoring. In the Informal Proceedings of the
National Science Foundation Workshop on Next Generation Data Mining (NGDM
2002). Baltimore, Maryland, November 2002.
- Rohit Ananthakrishna,
Abhinandan Das, J. E. Gehrke, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Efficient
Approximation of Correlated Sums on Data Streams. IEEE Transactions on
Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003, pages
569-572.
- Abhinandan Das, J. E. Gehrke, and Mirek Riedewald. Approximate Join
Processing Over Data Streams. In Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data (SIGMOD 2003). San Diego, CA, June 2003.
We also gave the following tutorial based on
research funded under this grant: Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Querying and Mining Data Streams: You Only Get One
Look. Tutorial at the following conferences:
- 2002 ACM SIGMOD International Conference on Management of Data
(SIGMOD 2002), Madison, Wisconsin, June 2002.
- 8th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (SIGKDD 2002), Edmonton, Alberta, July 2002.
- 28th International Conference on Very Large Databases (VLDB 2002), Hong Kong, China, August 2002.
- Invited tutorial at the Third International Conference on
Web-Age Information Management (WAIM 2002), Beijing, China, August
2002.
Project Impact
We developed two of the fastest implementations for
two popular data mining techniques, large itemsets
and sequential patterns. Our tools are publicly available at http://himalaya-tools.sourceforge.net.
The site averages about 50 page views per days, and our code is downloaded
about three times per day.
We also apply our tools and data mining experience
to real data. We are collaborating with scientists from the Cornell Lab of
Ornithology on the analysis of citizen science data, and we are collaborating
with scientists from the Stanford Linear Accelerator Center on the analysis of
high-energy physics data.
Goals, Objectives and Targeted
Activities
The objective of this research is to develop novel
methods for high-performance data mining, and to
research algorithms for processing high-speed data streams. From a research
standpoint, the central questions are the following:
- The significance of performance. In both interactive and online
data mining, performance is of crucial importance. There has been a lot of
research on hiding the difference in access time between main memory and
disk. But there are no studies highlighting the actual system bottlenecks
and we do not understand how to make optimal use of all available hardware
resources. As part of this project, we have developed novel algorithms
based on efficient bitmap manipulations that have improved the speed of
data mining model construction over an order of magnitude for two
important data mining problems: large itemsets
and sequential patterns.
- The relationship between offline and online query processing.
Traditional Database Management Systems software is built on the concept
of persistent data sets, that are stored reliably
in stable storage and queried/updated several times throughout their
lifetime. For several emerging application domains, however, data arrives
and needs to be processed on a continuous 24x7 basis, without the benefit
of several passes over a static, persistent data image. What classes of
queries are amenable for processing online data streams, and what types of
provable guarantees can be give in case we allow approximate query
answers?
Our work is based on the following activities:
- Development of novel algorithms that make use of highly efficient
data structures tuned for the memory hierarchies of modern processors.
- Design of query processing algorithms that use only one pass over
the data, and thus are suitable for the online processing of high-speed
data streams.
Area Background
There has been a plethora of
recent work on data mining focusing on improving the performance of existing
data mining algorithms. There is also a lot of recent interest in developing
systems and algorithms for processing high-speed data streams. Some of the
recent textbooks on data mining (see area references below) provide a good
introduction on how to improve performance for data mining algorithms. An
overview of algorithms for data stream processing can be found in our recent
tutorial (see publications and products).
Area References
- N. Alon, Y. Matias,
and M. Szegedy. The Space Complexity of
Approximating the Frequency Moments. In Proceedings of the 28th Annual
ACM Symposium on the Theory of Computing, pages 20–29, Philadelphia,
Pennsylvania, May 1996.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In
Proceedings of the 21st ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Madison,
Wisconsin, June 2002.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and
J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data
Streams. In Proceedings of the 28th International Conference
on Very Large Data Bases. Hong Kong, China, August 2002.
- Mayur Datar,
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. “Maintaining Stream Statistics over Sliding
Windows”. In Proceedings of the 13th Annual ACM-SIAM Symposium on
Discrete Algorithms, San Francisco, California, January 2002.
- Minos Garofalakis
and Phillip B. Gibbons. “Approximate Query Processing: Taming the
Terabytes”. Tutorial in the 27th International. Conference on Very
Large Data Bases, Roma, Italy, September 2001.
- J. Han and M. Kamber. Data Mining:
Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
- D. J. Hand, H. Mannila, and P. Smyth.
Principles of Data Mining. MIT Press, 2001.
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings
of the Seventh ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining (SIGKDD 2001). San Francisco, California,
August 2001.
- J. Vitter. Random sampling with a reservoir. ACM Transactions on
Mathematical Software, 11(1):37–57, 1985.
Related Projects
Project Websites
http://www.cs.cornell.edu/database/himalaya
This website is the main website of the Cornell Himalaya
Data Mining Project with links to all ongoing data mining activities.
Online Software
http://himalaya-tools.sourceforge.net.
We provide the full source code of the tools that we
developed including additional documentation that help understanding how the
algorithms work.