In the Spring 2002 semester, the joint AI/NLP seminar met on occasional Wednesdays, 4-5 pm. We did not meet in weeks when there was a talk in the Computer Science Department's Special Year in Machine Learning, Statistics, and Data Mining (ML/S/DM) lecture series (see the machine learning home page) or another AI-related lecture in the 2002 CS Distinguished Lecture schedule.
Co-organizers: Lillian Lee and Bart Selman.
Abstract: Using vector-based representations of document collections has enabled the application of powerful dimension-reduction techniques to information retrieval, document clustering, and other text analysis tasks. One of the most prominent of these techniques is Latent Semantic Indexing (LSI). However, despite ample empirical experience with it, there is still little understanding of when LSI can -- and, just as importantly, cannot -- be expected to perform well.
This talk consists of two parts. First, after a self-contained introduction to LSI, we provide a novel formal analysis of it that links its performance in a precise way to the uniformity of the underlying topic-document distribution. Second, we present a new algorithm, Iterative Residual Rescaling (IRR), that corrects for skewed distributions by automatically adjusting to non-uniformity in the topic distribution without knowledge of the underlying topics. A series of experiments over a variety of evaluation metrics validates our analysis and demonstrates the effectiveness of IRR.
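The dimension-reduction step at the heart of LSI is a truncated singular value decomposition of the term-document matrix. A minimal sketch of that step, under illustrative assumptions (the toy matrix, term labels, and counts are invented for the example, not data from the talk):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# The counts below are illustrative assumptions, not data from the talk.
A = np.array([
    [3, 1, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

def lsi_project(A, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Each column of diag(s_k) @ Vt_k is a document's latent representation.
    return np.diag(s[:k]) @ Vt[:k, :]

docs_2d = lsi_project(A, k=2)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents 0 and 1 share terms, as do documents 2 and 3; within-topic
# similarity in the latent space should exceed cross-topic similarity.
sim_same = cos(docs_2d[:, 0], docs_2d[:, 1])
sim_diff = cos(docs_2d[:, 0], docs_2d[:, 2])
```

On this toy matrix, the two topics occupy disjoint term sets, so the within-topic cosine is near 1 and the cross-topic cosine near 0; the talk's analysis concerns what happens when the topic-document distribution is less uniform than this.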
Joint work with Rie Kubota Ando.
This talk will first introduce noun phrase coreference resolution, one of the critical problems that currently limit performance for many practical natural language processing tasks. We then present a machine learning-based solution to noun phrase coreference that extends earlier work in the area and produces the best empirical results to date --- for both learning- and knowledge-based approaches to the problem --- on two standard coreference data sets. Performance gains accrue from two very different sources of change: first, we propose and evaluate three extra-linguistic modifications to the machine learning framework; second, we more than triple the number of linguistic knowledge sources made available to the learning algorithm. We conclude with a discussion of why we view these seemingly promising results as ultimately disappointing, and identify a key area of research where progress in noun phrase coreference resolution is likely to be made.
This is joint work with Claire Cardie.
Abstract: Preference elicitation and preference representation play an important role in any effort to automate decision making. Unfortunately, few tools exist that allow the decision maker to conveniently structure his or her preferences and the decision engine to reason efficiently about those preferences. One of the few such tools is the CP-net, which uses an intuitive, graphical representation for modeling the qualitative preferences of a decision maker.
This talk will introduce the CP-net model and discuss several conceptual and algorithmic questions concerning it. Recent results, open structural and computational questions, and potential directions for interdisciplinary research will all be presented. The talk is self-contained; no previous background is assumed.
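To give a concrete flavor of the model, here is a hedged sketch of a tiny two-variable CP-net and a dominance query answered by searching for a sequence of improving flips. The variable names, preference tables, and search strategy below are illustrative assumptions, not material from the talk:

```python
# A toy CP-net (illustrative assumption, not from the talk): two binary
# variables. A's preference is unconditional; B's depends on A's value.
cpt = {
    "A": {(): ["a1", "a2"]},                 # a1 preferred to a2
    "B": {("a1",): ["b1", "b2"],             # given a1: b1 > b2
          ("a2",): ["b2", "b1"]},            # given a2: b2 > b1
}
parents = {"A": (), "B": ("A",)}
domains = {"A": ["a1", "a2"], "B": ["b1", "b2"]}

def improving_flips(outcome):
    """Yield outcomes one 'improving flip' better than `outcome`."""
    for var in domains:
        ctx = tuple(outcome[p] for p in parents[var])
        order = cpt[var][ctx]
        i = order.index(outcome[var])
        if i > 0:  # some value is preferred in this context
            better = dict(outcome)
            better[var] = order[i - 1]
            yield better

def dominates(better, worse):
    """Search for a sequence of improving flips from `worse` to `better`."""
    if better == worse:
        return False  # dominance is strict
    frontier, seen = [worse], {tuple(sorted(worse.items()))}
    while frontier:
        cur = frontier.pop()
        if cur == better:
            return True
        for nxt in improving_flips(cur):
            key = tuple(sorted(nxt.items()))
            if key not in seen:
                seen.add(key)
                frontier.append(nxt)
    return False
```

Here the outcome (a1, b1) dominates (a2, b2) via the flipping sequence (a2, b2) → (a1, b2) → (a1, b1); deciding such dominance queries efficiently is one of the algorithmic questions associated with CP-nets.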
Several buzzwords have recently come up repeatedly in a variety of visitor talks on Machine Learning. The purpose of this seminar is to explain the meaning of (some of) these concepts and to provide some perspective on the field. I shall assume only superficial familiarity with the area and try to avoid technical details as far as possible without sacrificing clarity and correctness.
E-mail is a deceptively simple medium that hides a fearsome level of complexity -- a consequence of both the rate at which e-mail arrives, and the demands of managing archives of personal correspondence that can easily grow to hundreds of megabytes in size. And at a still larger scale, e-mail has become the raw material for legal proceedings and historical investigation. How can we find structures that might help in making sense of this type of data?
Viewed more generally, e-mail is a basic example of a document stream -- a sequence of documents that arrives continuously over time. A promising approach for analyzing such streams is to make use of the relationship between topics and temporal cues; as time progresses, topics of interest are signaled by "bursts of activity" in the stream. I'll suggest a formalism for modeling such bursts, which leads to a robust and efficient algorithm for identifying them. I'll then discuss two applications of this burst detection algorithm: first, to structuring the collection of 40,000+ e-mail messages I've sent and received since coming to Cornell; and second, to identifying the rise and fall of keywords in research paper repositories such as the e-Print archive. In both cases, the bursts that emerge have a natural meaning in terms of the underlying content.
I'll also discuss connections to work in topic detection and tracking, which has analyzed news articles as another salient type of document stream; to work in queueing theory, which provides some of the basic models used in the algorithms developed here; to work in temporal data mining, time-series analysis, and traffic analysis, which seek patterns in long sequences of time-indexed data; and, in a more unexpected direction, to work in narrative theory that has dealt with the bursty nature of time from a much more qualitative perspective.
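As a rough illustration of the burst-detection idea, here is a simplified two-state sketch: message inter-arrival gaps are scored against a baseline rate and an elevated rate, entering the "bursty" state incurs a cost, and the cheapest state sequence is recovered by dynamic programming. The exponential gap model and the parameters `s` and `gamma` are assumptions made for the example; the talk's actual formalism is not reproduced here:

```python
import math

def detect_bursts(gaps, s=2.0, gamma=1.0):
    """Simplified two-state burst detector (illustrative assumptions, not the
    talk's exact model). State 0 = baseline arrival rate, state 1 = bursty
    (s times faster). Gaps are scored under an exponential model; entering
    the burst state costs gamma. Returns the min-cost state sequence."""
    n = len(gaps)
    rate0 = n / sum(gaps)          # baseline arrival rate
    rate1 = s * rate0              # elevated rate in the burst state
    neg_log = lambda rate, x: -math.log(rate) + rate * x  # exp. density cost
    cost = [0.0, gamma]            # starting in the burst state costs gamma
    back = []
    for x in gaps:                 # Viterbi-style dynamic program
        emit = [neg_log(rate0, x), neg_log(rate1, x)]
        new_cost, ptr = [], []
        for j in (0, 1):
            # transition cost: pay gamma only when moving from state 0 to 1
            cands = [cost[i] + (gamma if (i, j) == (0, 1) else 0.0)
                     for i in (0, 1)]
            i_best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[i_best] + emit[j])
            ptr.append(i_best)
        cost = new_cost
        back.append(ptr)
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for ptr in reversed(back):     # backtrack the cheapest path
        state = ptr[state]
        states.append(state)
    states.reverse()
    return states[1:]              # align states with gaps

# Ten slow gaps, five fast gaps (a burst), ten slow gaps.
gaps = [1.0] * 10 + [0.1] * 5 + [1.0] * 10
states = detect_bursts(gaps)
```

On this synthetic sequence the detector marks exactly the run of short gaps as bursty; the entry cost `gamma` is what keeps isolated short gaps from being flagged.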
Kernel methods are widely used in statistical learning techniques such as support vector machines or principal component analysis due to their high generalization capability and their computational efficiency in high-dimensional feature spaces.
Recently, several kernels have been introduced in computational biology for input vectors representing biological sequences. These 'string kernels' are specific instances of the more general class of 'rational kernels': kernels corresponding to weighted rational relations. We show that all rational kernels can be computed efficiently using a general algorithm of composition of weighted transducers.
Rational kernels can be further generalized to deal properly with the uncertainty of the data considered in applications such as computational biology or speech recognition. We show that these generalized kernels can also be computed using general weighted automata algorithms over the appropriate semirings and point out their usefulness for improving the word accuracy of automatic speech recognition systems.
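For concreteness, one of the simplest string kernels is the k-spectrum kernel, which takes the inner product of the two sequences' k-mer count vectors. The sketch below is an illustrative special case only, not the general weighted-transducer composition algorithm discussed in the talk:

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum string kernel: inner product of k-mer count vectors.
    A simple instance of the string kernels used for biological sequences;
    the general rational-kernel computation is not shown here."""
    def kmers(s):
        return Counter(s[i:i + k] for i in range(len(s) - k + 1))
    cx, cy = kmers(x), kmers(y)
    # Sum counts over k-mers shared by both sequences.
    return sum(cx[m] * cy[m] for m in cx)

k_aa = spectrum_kernel("GATTACA", "GATTACA")  # 5 distinct 3-mers, each once
k_ab = spectrum_kernel("GATTACA", "CATGATT")  # shared 3-mers: GAT, ATT
```

Casting such kernels as weighted transducers, as the talk describes, lets one framework cover this and many other sequence similarities with a single composition algorithm.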
Margin bounds give bounds on the future error rate of classifiers that consist of "votes" over many different subclassifiers. I will present a new margin bound, based on the PAC-Bayes bound, which is functionally tighter and cleaner than prior work. This is joint work with John Shawe-Taylor.
Previous offerings:
F98, S00, S01
(as Statistical Natural Language Processing: Models and Methods), F01
See also the AI graduate studies page or the Cornell NLP page.