The HIMALAYA Data Mining Project 
The HIMALAYA Data Mining Project at Cornell researched innovative techniques for analyzing large datasets. Results from this project are available on this page.
Source Code
We made the source code of our algorithms available as part of the Himalaya data mining tools source code distribution on SourceForge. Code is available for the following algorithms: MAFIA (Mining Maximal Frequent Itemsets), SECRET (Scalable Linear Regression Trees), and SPAM (Sequential Pattern Mining).
Past Research Topics
Publications
2004
- Abhinandan Das, Johannes Gehrke, and Mirek Riedewald. Approximation Techniques for Spatial Data. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD 2004). Paris, France, June 2004.
2003
- Abhinandan Das, J. E. Gehrke, and Mirek Riedewald. Approximate Join Processing Over Data Streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD 2003). San Diego, CA, June 2003.
- Daniel Kifer, J. E. Gehrke, Cristian Bucila, and Walker White. How to Quickly Find a Witness. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003). San Diego, CA, June 2003.
- Alexandre Evfimievski, J. E. Gehrke, and Ramakrishnan Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003). San Diego, CA, June 2003.
2002
- Shai Ben-David, J. E. Gehrke, and Reba Schuller. A Theoretical Framework for Learning from a Pool of Disparate Data Sources. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
- Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
- Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick. Sequential Pattern Mining Using Bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
- Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and J. E. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
- Alin Dobra and Johannes Gehrke. SECRET: A Scalable Linear Regression Tree Algorithm. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
- Alin Dobra, Minos Garofalakis, J. E. Gehrke, and Rajeev Rastogi. Processing Complex Aggregate Queries over Data Streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, June 2002.
2001
- Johannes Gehrke and Wei-Yin Loh. Advances in Decision Tree Construction. Tutorial at the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, August 2001. Part I of the tutorial slides.
- Alin Dobra and J. E. Gehrke. Bias Correction in Classification Tree Construction. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2001), Williams College, Massachusetts, June 2001.
- J. E. Gehrke, Flip Korn, and Divesh Srivastava. On Computing Correlated Aggregates Over Continual Data Streams. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, California, May 2001.
- Doug Burdick, Manuel Calimlim, and J. E. Gehrke. MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases. In Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, April 2001.
- Venkatesh Ganti, J. E. Gehrke, and Raghu Ramakrishnan. DEMON: Mining and Monitoring Evolving Data. IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 1, January/February 2001, pages 50-63. Preliminary version: Venkatesh Ganti, J. E. Gehrke, and Raghu Ramakrishnan. DEMON: Mining and Monitoring Evolving Data. In Proceedings of the 16th International Conference on Data Engineering, San Diego, California, 2000. Best student paper award.
2000
- Johannes Gehrke. Data Mining with Decision Trees. Tutorial at the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, Japan, April 2000.
- Johannes Gehrke. Decision Trees and Predictive Rules. Invited tutorial at the Sixteenth International Conference on Data Engineering, San Diego, California, February 2000.
1999
- Venkatesh Ganti, J. E. Gehrke, and Raghu Ramakrishnan. Mining very large databases. IEEE Computer, Vol. 32, No. 9, August 1999, pages 38-45.
- Venkatesh Ganti, J. E. Gehrke, and Raghu Ramakrishnan. CACTUS--Clustering Categorical Data Using Summaries. In Proceedings of the 1999 SIGKDD Conference, San Diego, California, August 1999.
- J. E. Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT -- Optimistic Decision Tree Construction. In Proceedings of the 1999 SIGMOD Conference, Philadelphia, Pennsylvania, June 1999.
- Venkatesh Ganti, J. E. Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh. A Framework for Measuring Changes in Data Characteristics. In Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Philadelphia, Pennsylvania, May 1999. (Invited to Journal of Computer and System Sciences (JCSS).)
- Venkatesh Ganti, Raghu Ramakrishnan, J. E. Gehrke, Allison L. Powell, and James French. Clustering Large Datasets in Arbitrary Metric Spaces. In Proceedings of the Fifteenth International Conference on Data Engineering, Sydney, Australia, 1999.
People
Researchers
- Johannes Gehrke (Faculty)
- Dan Kifer (PhD student)
- Mirek Riedewald (Research Associate)
Collaborators:
- Martin Burtscher, Cornell ECE
- Rich Caruana, Cornell CS
- Minos Garofalakis, Lucent Bell Labs
- Flip Korn, AT&T Labs
- Rajeev Rastogi, Lucent Bell Labs
- Divesh Srivastava, AT&T Labs
- Walker White, University of Dallas
Alumni
- Jay Ayres (Cornell Presidential Research Scholar, now working in the Cougar Project)
- Abhinandan Das (PhD student)
- Jeff Derstadt (Master of Engineering, Fall 2001. First employment: Microsoft)
- Alin Dobra (PhD, Summer 2003. First employment: University of Florida, Gainesville)
- Alexandre Evfimievski (PhD, Summer 2004. First employment: IBM Almaden Research Lab)
- Jason Flannick (Bachelor of Arts and Sciences, Spring 2002. First employment: IBM)
- Jeff Hoy (Master of Engineering, Spring 2001. First employment: IBM)
- Priya Rajan (Master of Engineering, Fall 2000. First employment: Lucent Technologies)
- Gilberto Rivera (Master of Engineering, Spring 2002)
- Tomi Yiu (Master of Engineering, Fall 2001)
Acknowledgements:
The HIMALAYA Data Mining Project is supported in part by NSF grants IIS-0121175 and IIS-0084762, the KD-D Initiative, the Cornell Intelligent Information Systems Institute, the Cornell Information Assurance Institute, and generous gifts from Microsoft and Intel. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.