
"The secret of success is to know something nobody else knows."
-- Aristotle Onassis
Our vision is a distributed data mining infrastructure that permits continuous monitoring of remote data sources with data mining operators that never stop working, incorporate new data immediately as they arrive in real time, update data mining models continuously, and automatically detect significant changes in the mined phenomena.
Towards this vision, we are currently working on the following technical topics:
Decision trees are one of the most widely used data mining models, and decision tree construction has a long history in machine learning, statistics, and pattern recognition. One of the main advantages of decision trees is that the resulting data mining model (the tree) can be easily understood by the data mining analyst. Recent studies have shown that the variable selection process in decision trees is biased, i.e., the predictor variable chosen to split a node of the tree might not actually be the predictor variable that is most important at that node. We have developed a computationally efficient, generic method that takes any traditional (biased) split selection method and generates an unbiased split selection method.
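To make the bias concrete, here is a minimal sketch in Python. It is an illustration only and not necessarily the method referred to above: it assumes Gini gain as the traditional split criterion and uses a permutation p-value as one generic way to put predictors with different numbers of values on a common scale. The names `gini_gain` and `permutation_p_value` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(predictor: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Impurity reduction from splitting on every distinct predictor value."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(predictor, labels):
        groups.setdefault(value, []).append(label)
    children = sum(len(group) / n * gini(group) for group in groups.values())
    return gini(labels) - children


def permutation_p_value(
    predictor: Sequence[Hashable],
    labels: Sequence[Hashable],
    n_permutations: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of label shufflings whose gain is at least the observed gain.

    Assumption for illustration: a small p-value means the observed gain is
    unlikely to arise by chance for a predictor with this many values.
    """
    rng = random.Random(seed)
    observed = gini_gain(predictor, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if gini_gain(predictor, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


labels = [0] * 10 + [1] * 10
informative = list(labels)          # perfectly predicts the class
record_id = list(range(20))         # unique per record, carries no real signal
```

Both `informative` and `record_id` reach the maximal Gini gain of 0.5 on this data, so the raw criterion cannot tell them apart. Under the permutation test the unique identifier gets a p-value of 1.0, because every shuffling of the labels produces the same gain, while the informative predictor gets a small p-value.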
MAFIA. The development of efficient algorithms for finding frequent itemsets is one of the most important topics in data mining. We have developed one of the fastest published algorithms (MAFIA -- Maximal Frequent Itemset Algorithm) for finding very long itemsets in a large transactional database. Our implementation of the search strategy combines a vertical bitmap representation of the database with efficient pruning and bitmap compression schemes. Experiments with real-life datasets show that our method outperforms previous work by up to an order of magnitude.
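The idea behind a vertical bitmap representation can be sketched in a few lines of Python. This is a simplified illustration, not the MAFIA implementation: it omits the search strategy, pruning, and compression, and it assumes Python integers as bitmaps. The function names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List


def build_vertical_bitmaps(transactions: List[FrozenSet[str]]) -> Dict[str, int]:
    """Map each item to an integer whose bit i is set if transaction i contains it."""
    bitmaps: Dict[str, int] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset: Iterable[str], bitmaps: Dict[str, int], n_transactions: int) -> int:
    """Count transactions containing every item: AND the bitmaps, then count set bits."""
    combined = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        combined &= bitmaps.get(item, 0)  # an unknown item matches no transaction
    return bin(combined).count("1")


transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "c"}),
    frozenset({"a", "b"}),
    frozenset({"b", "c"}),
]
bitmaps = build_vertical_bitmaps(transactions)
```

With this layout, `support({"a", "c"}, bitmaps, len(transactions))` reduces to one bitwise AND and a bit count, which is why extending an itemset by one item is cheap.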
DualMiner.
Finding frequent sequences of items is a much harder problem than finding frequent itemsets. In recent work, we have extended the MAFIA algorithm to the problem of sequential pattern mining. Our new algorithm (SPAM) efficiently finds long sequences in transactional databases. In joint work with Martin Burtscher, we are investigating the application of these techniques to program trace analysis.
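As a rough illustration of what counting support means for sequences, the following Python sketch checks pattern containment directly. It is a brute-force baseline and not the SPAM algorithm. The data layout (each database entry is an ordered list of itemsets) and the function names are assumptions made for the example.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def contains(sequence: Sequence[Itemset], pattern: Sequence[Itemset]) -> bool:
    """True if the pattern's itemsets occur, in order, inside the sequence.

    Each pattern element is matched greedily to the earliest remaining
    element of the sequence that is a superset of it.
    """
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def sequence_support(database: List[Sequence[Itemset]], pattern: Sequence[Itemset]) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


database = [
    [frozenset({"a", "b"}), frozenset({"c"}), frozenset({"a"})],
    [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})],
    [frozenset({"c"}), frozenset({"a", "b"})],
]
a_then_c = [frozenset({"a"}), frozenset({"c"})]
```

Unlike itemsets, order matters here: the pattern `a_then_c` is contained in the first two sequences but not in the third, where `c` occurs before `a`. The number of candidate patterns therefore grows much faster than for itemsets.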
Data mining has traditionally been performed over static datasets, and offline data mining algorithms could afford to read the input data several times. Our research addresses the online processing of data streams, where each record can typically be read only once. Tradeoffs among the response time, the accuracy, and the resource usage of a query are central to stream processing. Our goal is to build a system for processing high-speed data streams with complex continuous queries executed in a parallel dataflow engine. Our online data stream processing and data mining system will never stop working, will incorporate new records immediately as they arrive, and will continuously construct and update synopsis data structures and data mining models.
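A synopsis data structure trades accuracy for bounded memory and constant-time updates. The Python sketch below uses the well-known Misra-Gries frequent-items summary as one example of such a structure. It is chosen here purely for illustration and is not a description of the system's actual synopses. The class and method names are hypothetical.

```python
from typing import Dict, Hashable


class MisraGriesSummary:
    """One-pass frequency summary that keeps at most k - 1 counters.

    For any item, the estimate never exceeds the true count and
    undercounts it by at most n / k, where n is the stream length so far.
    """

    def __init__(self, k: int) -> None:
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k
        self.n = 0
        self.counters: Dict[Hashable, int] = {}

    def update(self, item: Hashable) -> None:
        """Incorporate one newly arrived record."""
        self.n += 1
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k - 1:
            self.counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, item: Hashable) -> int:
        """Lower-bound estimate of how often the item has appeared."""
        return self.counters.get(item, 0)


summary = MisraGriesSummary(k=3)
for record in ["a", "b", "a", "c", "a", "d", "a", "e", "a"]:
    summary.update(record)
```

The summary is updated as each record arrives and can be queried at any time, so it never needs a second pass over the data. Here `"a"` truly occurs 5 times in 9 records, and the summary reports 3, which is within the allowed undercount of 9 / 3 = 3.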
Jay Ayres
Jeff Derstadt
Alin Dobra
Alexandre Evfimievski
Johannes Gehrke
Jeff Hoy
Gilberto Rivera
Tomi Yiu
Collaborators:
Martin Burtscher, Cornell ECE
Walker White, University of Dallas
Acknowledgements:
The HIMALAYA Project is supported by NSF grants IIS-0121175 and IIS-0084762, by the Cornell Intelligent Information Systems Institute, and by gifts from Microsoft and Intel. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, Microsoft, or Intel.