Artificial Intelligence Seminar

Spring 2003
Friday 12:00-1:15
Upson 5130

Sponsored by the Intelligent Information Systems Institute (IISI),
Computing and Information Science,
Cornell

The AI seminar will meet weekly for lectures by graduate students, faculty, and researchers emphasizing work-in-progress and recent results in AI research. Lunch will be served starting at noon, with the talks running between 12:15 and 1:15. The new format is designed to allow AI chit-chat before the talks begin. Also, we're trying to make some of the presentations less formal so that students and faculty will feel comfortable using the seminar to give presentations about work in progress or practice talks for conferences.

Schedule (because of departmental meetings, we will begin in February):

Date    Speaker/Title/Abstract/Host

February 7 Golan Yona
Using a mixture of probabilistic decision trees for direct prediction of protein function
I will describe a new approach to decision tree learning and its applications to protein classification. Specifically, I will introduce the mixture model of probabilistic decision trees and demonstrate how it can be used to learn the set of potentially complex relationships between protein features and protein function.

Our model addresses some of the fundamental problems with traditional decision tree learning algorithms. Specifically, we address four elements: optimization, evaluation, biased sample sets, and model selection. More precisely, we first propose an effective method of searching the hypothesis space that overcomes the pitfalls of the deterministic learning algorithms. Secondly, we introduce a novel criterion function to evaluate decision tree performance. Thirdly, we describe a method of dealing with distributions in which negative samples far outnumber positive samples, such as in our protein classification problem. Lastly, we propose an alternative method for deciding on the most probable model that is especially effective for small data sets.

The model was tested on two well established classifications of proteins. The model is very effective in learning highly diverged protein families or families that are not defined based on sequence. The resulting tree structure indicates the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.

Joint work with M.Eng. student Umar Syed.

Host: Rich

February 14 *** no class *** (FCI Founders meeting)

February 21 *** no class ***

February 28 *** no class ***

March 7 John Langford
The One Bound

It turns out that every(*) bound on the true error rate of a classifier which holds for all distributions can be stated in terms of the communication complexity of the labels given the unlabeled data. I'll discuss "the One Bound" and the relationships with several families of other bounds.

(*) at least, every bound that has been checked

March 14 ***no class*** (room is being used for brown-bag presentation)

March 21 ***no class*** (Spring Break)

March 28 Rich Caruana
Extreme Ensemble Selection

Host: Rich

April 4 Decision theory symposium

April 11 ***no class*** (ACSU student/faculty lunch)

April 18 Shimon Edelman
(combined with the Brownbag seminar)
Unsupervised efficient learning and representation of language structure

We describe a linguistic pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of corpus data. This is achieved by compactly coding recursively structured constituent patterns, and by placing strings that have an identical backbone and similar context structure into the same equivalence class. The resulting representations constitute an efficient encoding of linguistic knowledge and support systematic generalization to unseen sentences.

Joint work with Zach Solan, Eytan Ruppin and David Horn

Host: Lillian

April 25 Dimitris Agrafiotis, Johnson & Johnson Pharmaceutical Research & Development
New algorithms for the analysis of large data sets and their application in molecular design

The multitude of potential drug targets emerging from genome sequencing demands new approaches to drug discovery. A chemo-genomics strategy, involving the generation of small molecule compounds that can be used both as tools to probe biological mechanisms and as leads for drug property optimization, provides a highly parallel, industrialized solution. Key to the success of this strategy is an integrated suite of data-driven chemi-informatics tools that can enable the rapid and directed optimization of chemical compounds with drug-like properties using just-in-time combinatorial chemical synthesis. An effective embodiment of this process requires new computational and data mining techniques that cover all aspects of library generation, modeling and design, and work effectively on a massive scale. This talk will introduce the essential elements of such a system, and highlight key algorithmic advances that expand, by several orders of magnitude, the number of compounds that can be assessed as potential drugs.

Particular emphasis will be placed on a novel self-organizing algorithm for extracting the metric structure and intrinsic dimensionality of large experimental observation spaces. The algorithm, known as stochastic proximity embedding or SPE, attempts to generate low-dimensional Euclidean embeddings that best preserve the geodesic distances between a set of related observations. Unlike previous approaches, our method can reveal the underlying geometry of the data without intensive nearest neighbor or shortest-path computations, and can reproduce the true geodesic distances of the data points in the low-dimensional embedding without requiring that these distances be estimated from the data sample. More importantly, SPE scales linearly with the number of points, and can be applied to very large data sets that are intractable by conventional embedding procedures. Because it solves the fundamental problem of converting distances to coordinates, the method can be applied to a wide range of problems across all scientific domains. Several applications in the fields of computational chemistry and structural biology will be presented.

Host: Golan Yona
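
For readers unfamiliar with the SPE idea mentioned in the April 25 abstract, the following is a minimal sketch of a pairwise stochastic embedding loop in that spirit. It is not the speaker's implementation. The function name `spe_embed`, all parameter names and defaults, the linear learning-rate schedule, and the use of plain Euclidean input distances are assumptions made for illustration. The sketch repeatedly picks a random pair of points and nudges their low-dimensional coordinates so that the embedded distance moves toward the input distance. When a neighborhood cutoff is given, pairs farther apart than the cutoff are only pushed apart, never pulled together, which is one way to avoid explicit nearest-neighbor or shortest-path computations.

```python
import math
import random


def spe_embed(points, dim=2, cutoff=None, cycles=50, steps=2000,
              lam_start=1.0, lam_end=0.01, eps=1e-9, seed=0):
    """Illustrative pairwise stochastic embedding (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(points)
    # Start from random coordinates in the low-dimensional space.
    coords = [[rng.random() for _ in range(dim)] for _ in range(n)]
    lam = lam_start
    decay = (lam_start - lam_end) / cycles  # assumed linear schedule
    for _ in range(cycles):
        for _ in range(steps):
            i, j = rng.sample(range(n), 2)
            r = math.dist(points[i], points[j])  # input-space distance
            d = math.dist(coords[i], coords[j])  # current embedded distance
            # Pairs within the cutoff are always adjusted; more distant
            # pairs are adjusted only if the embedding has them too close.
            if cutoff is None or r <= cutoff or d < r:
                scale = lam * 0.5 * (r - d) / (d + eps)
                for k in range(dim):
                    delta = scale * (coords[i][k] - coords[j][k])
                    coords[i][k] += delta
                    coords[j][k] -= delta
        lam -= decay  # shrink the step size after each cycle
    return coords
```

In this sketch the number of pair updates (`cycles * steps`) is chosen by the caller rather than derived from the number of points, so no full distance matrix is ever built. For example, `spe_embed([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)])` returns two 2-D points whose separation approaches the input distance of 5.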

May 2 Cognitive Studies Spring Symposium

See also the AI graduate study brochure.

Please contact any of the faculty below if you'd like to give a talk this semester. We especially encourage graduate students to sign up!

CS772, Spring '03
Claire Cardie
Rich Caruana
Joe Halpern
Thorsten Joachims
Lillian Lee
Bart Selman
Golan Yona


