Computer Science Colloquium Series, Fall 2007

How does an expert discover something relevant to a task in a large distributed repository of complex and loosely-structured data? For example, how does a pharmaceutical researcher identify adverse effects of a drug in a large collection of automated cell microscopy images? The term "adverse effects" refers to a vague concept. A more precise definition can only be given after examining the data in some depth. In other words, hypothesis-formation and hypothesis-validation proceed hand-in-hand in a tightly-coupled and iterative sequence. We refer to this inherently human-centric activity as "interactive data exploration."

Diamond is an open-source software platform for interactive data exploration that has been jointly developed by Intel Research and Carnegie Mellon. It implements the concept of "early discard." This makes brute-force interactive search practical by eliminating irrelevant data as cheaply as possible. Further, Diamond embodies the concept of "self-tuning." This allows it to dynamically adapt to different hardware configurations, workloads, and data content in a manner that is completely transparent to users and applications.
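The idea of early discard can be illustrated with a small sketch. This is not Diamond's API. The function name `early_discard`, the filter names, and the image fields below are all hypothetical, chosen only to show how a chain of filters ordered from cheap to costly drops an object at the first test it fails.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A filter is a (name, predicate) pair. Both are hypothetical, not Diamond's API.
Filter = Tuple[str, Callable[[dict], bool]]


def early_discard(objects: Iterable[dict], filters: List[Filter]) -> Iterator[dict]:
    """Yield only the objects that pass every filter, in the order given."""
    for obj in objects:
        # all() short-circuits: once one filter rejects an object,
        # the remaining (assumed costlier) filters never run on it.
        if all(predicate(obj) for _, predicate in filters):
            yield obj


# Made-up metadata standing in for a collection of microscopy images.
images = [
    {"id": 1, "size_kb": 40, "mean_intensity": 0.9},
    {"id": 2, "size_kb": 200, "mean_intensity": 0.8},
    {"id": 3, "size_kb": 300, "mean_intensity": 0.2},
]

# Filters are listed cheapest first, so most objects are rejected cheaply.
filters = [
    ("min_size", lambda o: o["size_kb"] >= 100),
    ("bright", lambda o: o["mean_intensity"] > 0.5),
]

survivors = [o["id"] for o in early_discard(images, filters)]
print(survivors)
```

In this sketch the order of the filters is fixed by hand. A self-tuning system would presumably choose that order, and where each filter runs, from measured cost and rejection rates.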

Medical and pharmaceutical researchers at the University of Pittsburgh Medical Center, the University of Pittsburgh School of Medicine, and Merck are collaborating with Diamond researchers to apply Diamond to their domain-specific tasks. This may open the door to research and diagnostic strategies that were not considered feasible until now.

Thursday, November 15, 2007, 4:15 pm, B17 Upson Hall
M. Satyanarayanan Carnegie Mellon University
Interactive Data Exploration with Diamond