PIs: Mirek Riedewald (Computer Science), Daniel Fink (Lab of Ornithology)
Cornell University

Finding Interesting Patterns through Analysis of Complex Prediction Models

Data mining techniques make it possible to train accurate prediction models for large, high-dimensional data. Unfortunately, complex prediction models are not intelligible by themselves. To make them 'digestible', analysts need simpler patterns that summarize the complex function learned by the model. The number of candidate function summaries is overwhelming, however, because every slice of every lower-dimensional subspace of the original data space could contain an interesting one.

The goal of this project is to develop techniques for finding the most 'interesting' function summaries automatically and efficiently. The work proceeds in three steps. The first step formalizes the notion of interestingness for a wide variety of pattern types. The second step develops a declarative language for specifying these interestingness measures. With a declarative language, analysts define what they find interesting, but they need not specify how to find it efficiently. The third step develops an optimizing compiler for a small fragment of the language. A major research challenge is to strike the right balance between the expressiveness of the language and its amenability to effective query optimization.
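
As an illustration only, the idea of ranking function summaries by an interestingness measure can be sketched as follows. Everything in this sketch is an assumption made for the example and is not taken from the project: the stand-in `model`, the use of a one-dimensional averaged-prediction curve (partial dependence) as the function summary, and the variance of that curve as the interestingness measure. The project's own pattern types, measures, and language may look quite different.

```python
from statistics import mean, pvariance


def model(x):
    # Hypothetical stand-in for a trained black-box predictor over three
    # features; the third feature deliberately has no effect on the output.
    temp, elevation, noise = x
    return 2.0 * temp + 0.1 * elevation * elevation + 0.0 * noise


def summary_1d(predict, data, feature, grid):
    """One-dimensional function summary (assumed form: partial dependence).

    For each grid value, overwrite one feature in every data row and
    average the model's predictions over all rows.
    """
    curve = []
    for value in grid:
        preds = []
        for row in data:
            probe = list(row)
            probe[feature] = value
            preds.append(predict(probe))
        curve.append(mean(preds))
    return curve


def interestingness(curve):
    # Assumed measure: population variance of the summary curve.
    # A flat curve (the feature does not matter) scores 0.
    return pvariance(curve)


def rank_summaries(predict, data, grid):
    """Score the summary for every single feature, most interesting first."""
    n_features = len(data[0])
    scored = [
        (interestingness(summary_1d(predict, data, f, grid)), f)
        for f in range(n_features)
    ]
    return sorted(scored, reverse=True)


# Tiny made-up data set: (temp, elevation, noise) per row.
data = [(0.0, 1.0, 5.0), (1.0, 2.0, -3.0), (2.0, 3.0, 0.5)]
grid = [0.0, 1.0, 2.0, 3.0]
ranking = rank_summaries(model, data, grid)
```

Here the analyst supplies only the measure (`interestingness`), while the search over candidate summaries is done by brute force. A declarative language with an optimizing compiler would aim to replace that exhaustive loop with a more efficient evaluation plan, which matters once summaries range over many subspaces rather than three single features.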

The results of this project will pave the way for powerful exploratory analysis tools. They will also enable future research on optimizers and user-friendly interfaces for the declarative language. The approach will be validated using the rich data resources being organized by the ornithological community in the Avian Knowledge Network (AKN). Applying the techniques to these data is expected to greatly improve the ability to identify the most significant environmental variables that affect biodiversity on the planet. A fragment of the language will be available to the public through Web services on the AKN Web site. This will enable a broad audience, from educators and land managers to researchers and interested citizens, to derive novel knowledge from the data resources gathered. For example, land managers could discover the possible impact of their decisions on an ecosystem's health.