Not all supervised learning problems fit the standard classification and regression function-learning model. Some problems require predicting things other than values or classes. For example, sometimes the magnitude of the values predicted for cases is not important, but the ordering/ranking induced by those values is. Sometimes we don't know a priori what classes exist, but we need to learn which items belong to the same class. Sometimes it is important to retrieve the top 10 objects, and no one cares what ordering is predicted for the remaining 100,000 objects. Sometimes the quality of learning will be judged by measures such as precision and recall or ROC area that are not well optimized by standard value prediction models.
Mismatch with the value prediction learning model can arise not only in the predictions, but also in the form of the training examples. For example, when a user indicates that one document is more relevant than another, or that two documents should be in the same class, but does not assign values to the documents themselves, the notion of what constitutes a training example is turned on its head. In these situations a training example is a relation on pairs of what traditionally would have been considered independent training cases. This change has deep ramifications.
This workshop aims to explore supervised learning problems that go beyond the usual value prediction model. In particular, it addresses problems where either (a) the goal of learning or (b) the input to the learner is a more complex structure than in classification and regression. Examples of such problems are:
learning a partial/total ordering
learning equality / matching
learning to optimize non-standard criteria such as ROC area or precision/recall
using relative preferences as training examples
ordinal regression
learning to cluster with feedback or constraints
learning graphs and other structures
problems that require these approaches (e.g., text retrieval, medical decision making, protein matching)
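One recurring idea behind several items in this list, using relative preferences as training examples, can be made concrete with a small sketch. The pairwise transform below is one standard trick (used, for instance, by ranking SVMs): a preference "a should rank above b" becomes an ordinary classification example on the difference vector. The feature vectors and names here are illustrative, not from any specific workshop paper.

```python
# Sketch: turning relative preferences into classification examples.
# A preference (a, b) meaning "a preferred over b" yields the difference
# vector a - b with label +1, and its negation with label -1. A linear
# classifier trained on these examples induces a ranking function.

def pairwise_transform(preferences):
    """preferences: list of (a, b) feature-vector pairs, 'a preferred over b'."""
    examples = []
    for a, b in preferences:
        diff = [ai - bi for ai, bi in zip(a, b)]
        examples.append((diff, +1))
        examples.append(([-d for d in diff], -1))
    return examples

# One preference: the first document ranks above the second.
prefs = [([3.0, 1.0], [1.0, 2.0])]
print(pairwise_transform(prefs))  # [([2.0, -1.0], 1), ([-2.0, 1.0], -1)]
```

A linear scorer that classifies these difference vectors correctly assigns the preferred item in each pair the higher score.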
The goal of the workshop is twofold. First, to create a forum for discussing recent methods and results for these problems. Second, to inspire research on new learning algorithms and problems.
This will be a one-day workshop. To prevent the workshop from degenerating into a mini-conference, we will allot at least 50% of the time to discussion. The workshop will consist of 2-4 short invited talks and a number of additional presentations and discussion sessions. There will be short poster sessions to force people out of their chairs and get them talking to each other. We probably won't do a panel discussion because it is not clear that anyone has enough expertise yet.
Submissions are invited. They may be extended abstracts or full papers. Submissions should be PostScript, PDF, or plain text. Please send submissions via email to caruana@cs.cornell.edu, tj@cs.cornell.edu before Nov 10, 2002. We'll put together a tentative program by Nov 15.
If there is enough interest, we may consider organizing a special journal issue on learning rankings. We'll discuss this further at the workshop.
Learning Customers' Preference Lists via Probabilistic Model
Hao-Hsiang Chung and Han-Shen Huang
We learn personalized customer preference lists of items from supermarket transaction data. Our approach uses the Hybrid Poisson Aspect Model (HyPAM), a probabilistic model, to rank all items by the probability that a customer would buy them. Experimental results show that HyPAM outperforms two other well-known approaches on a real-world transaction data set.
Learning to Match and Cluster Entity Names for Data Integration
William Cohen
Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating
databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting
similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and
matching identifier names that are both scalable and trainable, in the sense that they can be trained to obtain better performance in a
particular domain. The formal model for these techniques closely follows earlier work on "learning to order" and, coupled with recent
formal results in clustering, similar formal guarantees can be made. An experimental evaluation on a number of sample datasets shows that
the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive
with the best baseline system.
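As a hedged illustration of the general setting (not Cohen's actual trainable method), the sketch below clusters identifier names by textual similarity with a greedy single-link pass. `difflib.SequenceMatcher` stands in for the learned similarity metrics the paper describes, and the threshold value is an assumption.

```python
# Sketch: greedy single-link clustering of entity names by string similarity.
# Names join the first existing cluster containing a sufficiently similar
# member; otherwise they start a new cluster. The similarity function and
# threshold here are illustrative stand-ins for a trained metric.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.6):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def greedy_cluster(names, threshold=0.6):
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similar(name, member, threshold) for member in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# The two IBM variants cluster together; Apple stays separate.
print(greedy_cluster(["IBM Corp.", "IBM Corporation", "Apple Inc."]))
```

A trainable version would replace `similar` with a metric fit to labeled match/non-match pairs from the target domain, which is the adaptivity the abstract refers to.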
Learning to Retrieve Information
Garrison W. Cottrell, Brian Bartell, and Rik Belew
Information retrieval differs significantly from function approximation in that the goal is for the system to rank
documents relative to queries the same way the user would: the outputs of the system relative to one
another must be in the proper order. We found that a particular rank-order statistic, Guttman's point alienation,
works well as an objective function for such a system. We demonstrated its efficacy by using it to find the optimal
combination of retrieval experts. In recent years, this approach has worked well for a "many-class classification"
problem in a product developed for Notiora, Inc., where the problem involves hundreds to thousands of categories.
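To make the criterion concrete, here is a minimal sketch of one standard formulation of Guttman's point alienation as a rank-order objective. This is my reconstruction, not necessarily the exact variant the authors used: over all pairs where the user prefers d over d', compare the system scores; the criterion is 1 when every preferred document scores higher and -1 when every preferred document scores lower.

```python
# Sketch of a point-alienation-style rank-order criterion (reconstruction):
#   J = sum over preferred pairs of (s(d) - s(d'))
#       / sum over preferred pairs of |s(d) - s(d')|
# J = 1 means the scores agree with all user preferences; J = -1 means
# they disagree with all of them.

def point_alienation(scores, preferences):
    """scores: dict doc -> system score.
    preferences: list of (d, d_prime) meaning 'd preferred over d_prime'."""
    num = sum(scores[d] - scores[dp] for d, dp in preferences)
    den = sum(abs(scores[d] - scores[dp]) for d, dp in preferences)
    return num / den if den else 0.0

scores = {"a": 0.9, "b": 0.5, "c": 0.1}
prefs = [("a", "b"), ("a", "c"), ("b", "c")]
print(point_alienation(scores, prefs))  # 1.0: scores agree with every preference
```

Because the criterion depends on score differences rather than absolute values, optimizing it tunes the ordering the system induces, which is exactly the mismatch with function approximation the abstract highlights.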
TBA
Koby Crammer and Yoram Singer
Large Margin PRank
Edward Harrington
In this paper we present a truly online large margin version of the Perceptron ranking (PRank) algorithm, called LMPRank. The PRank algorithm
finds a rule that correctly ranks a given training sequence of instance and target rank pairs. PRank maintains a weight vector and a set of
thresholds that define a ranking rule mapping each instance to its rank. LMPRank extends PRank by approximating the Bayes point, which gives
good generalisation performance. The Bayes point is approximated by averaging the weights and thresholds of several PRank algorithms run in
parallel; to ensure diversity amongst their solutions, we randomly subsample the stream of incoming training examples. Experiments on a
synthetic dataset and the EachMovie dataset showed that LMPRank outperforms both PRank and a pure online regression algorithm, albeit at a
higher computational cost.
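For readers unfamiliar with the building block LMPRank extends, the PRank update rule can be sketched compactly. This follows Crammer and Singer's perceptron ranking algorithm as I understand it; the toy data, initialisation, and epoch cap are my assumptions. Ranks run 1..k and b holds the k-1 finite thresholds (the k-th is implicitly +infinity).

```python
# Sketch of the PRank online update: a weight vector w plus ordered
# thresholds b define the ranking rule; on a misranked example, w and
# the offending thresholds are nudged perceptron-style.

def prank_predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x))
    for r, br in enumerate(b, start=1):
        if score < br:          # first threshold the score falls below
            return r
    return len(b) + 1           # score above all thresholds: top rank

def prank_update(w, b, x, y):
    """One online step: returns updated (w, b) and whether x was misranked."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    taus = []
    for r, br in enumerate(b, start=1):
        yr = 1 if y > r else -1             # should the score lie above b_r?
        taus.append(yr if (score - br) * yr <= 0 else 0)
    total = sum(taus)
    w = [wi + total * xi for wi, xi in zip(w, x)]
    b = [br - t for br, t in zip(b, taus)]
    return w, b, any(t != 0 for t in taus)

# Train to convergence on a tiny separable ordinal dataset (ranks 1..3).
w, b = [0.0], [0.0, 0.0]
data = [([1.0], 1), ([3.0], 2), ([5.0], 3)]
for _ in range(1000):                       # epoch cap (assumption)
    made_mistake = False
    for x, y in data:
        w, b, err = prank_update(w, b, x, y)
        made_mistake = made_mistake or err
    if not made_mistake:
        break

print([prank_predict(w, b, x) for x, _ in data])  # [1, 2, 3] once converged
```

LMPRank's contribution sits on top of this: several such learners run in parallel on randomly subsampled streams, and their weights and thresholds are averaged to approximate the Bayes point.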
A Statistical Analysis of the Precision-Recall Graph
Ralf Herbrich
Ranking of Objects described by Matrix Data
Sepp Hochreiter and Klaus Obermayer
Using ``row objects'' and ``column objects'', we attempt to rank the former using only the relations between the two. The data is in
matrix form and the column objects are labeled. The row object ranking criterion is their indicative value relative to each label. The row
objects may be considered as features of the column objects, thus we can perform feature ranking. The basic idea of our method is to
interpret the data matrix entries as being produced by an unknown kernel representing a dot product in some (unknown) feature space.
Therefore, both the column objects and the row objects, despite stemming from different sets, are mapped into a common feature space.
By using support vector machine principles, a hyperplane which separates the labeled column objects is constructed. In contrast to
standard support vector machines the hyperplane is only described by the row objects on which an additional sparseness constraint is
imposed. Only a few row objects become support vectors (non-zero weighted) which allows the row objects to be ranked. Based on this
procedure, different ranking strategies are investigated. We apply the ranking procedure to DNA gene expression data, where column and
row objects are tumor samples and genes, respectively. Therefore, a matrix element is the expression level of a particular gene within a
tumor sample. All tumor samples are from patients who were treated with the same form of chemotherapy, and each sample is labeled
according to the treatment outcome. In a second experiment we utilize the WWW link structure, i.e. a ``hyperlink matrix''. Both the column
and row objects are web pages. The ranking technique uses relatively few columns all of which are given a classification, for example,
according to the page's content. The task is to identify the web pages which index certain web page categories and vice versa. In both
experiments, each classifier tested was improved through the use of this ranking procedure and, in addition, important features (relevant
genes or web pages) were discovered.
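The end result described above can be illustrated with a much simpler stand-in (not the authors' sparse SVM formulation): train any linear separator on the labeled column objects, treating row objects as features, then rank the row objects by the magnitude of their weights. A plain perceptron plays the role of the SVM here; the toy matrix is invented.

```python
# Hedged sketch: rank row objects (features) by linear-separator weight
# magnitude. A perceptron stands in for the sparse SVM in the paper.

def perceptron(samples, labels, epochs=20):
    """samples: feature vectors for labeled column objects; labels: +1/-1."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # misclassified
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Toy "data matrix": 4 column objects (samples) x 3 row objects (features).
# Feature 0 indicates class +1, feature 2 indicates class -1, feature 1 is noise.
X = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
y = [1, 1, -1, -1]
w = perceptron(X, y)
ranking = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
print(ranking)  # most indicative row objects first; the noise feature ranks last
```

The paper's sparseness constraint goes further: by forcing most row-object weights to zero, only a few row objects become support vectors, which is what makes the ranking selective.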
Learning Ordering Function from Partially Ranked Examples
Hideto Kazawa and Eisaku Maeda
In this paper we investigate the problem of learning an ordering function from "partially" ranked examples, in which the highest-ranked
examples are presented with their ranks. We demonstrate empirically and theoretically that partially ranked examples are "biased" and that
existing order-learning methods do not learn well from them. To overcome this problem, we present a cost modification scheme in which the
costs of training errors are modified according to the estimated bias; an ordering function is then learned from the modified training set.
We also propose an SVM-based learning algorithm based on this scheme, and present experimental results on an artificial dataset and an
information retrieval dataset. The results confirm that the proposed method outperforms the existing learning algorithms.
Conditional Models on the Ranking Poset - A Unifying Theme for Classification and Ranking
Guy Lebanon and John Lafferty
Classification and ranking have long been considered different problems in machine learning and statistics. We present a class of statistical models whose event space lies on the ranking poset, a combinatorial structure that includes classification and ranking, as a unifying framework for both problems. Popular models for classification such as error correcting output codes and discrete versions of boosting and logistic regression, as well as ranking models such as the Mallows model, are shown to be special cases in this framework. In addition to drawing a surprising connection between the above algorithms, the framework naturally suggests new algorithms for ranking and classification. One such algorithm is shown to beat a state-of-the-art method for combining search engines and to perform similarly to the best ensemble methods in classification while using less data. Some ideas for future research in this direction will be presented.
Tree induction vs. logistic regression
Claudia Perlich and Foster Provost
I will present the results of a large-scale experimental comparison of logistic regression and tree induction for classification and ranking. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing rankings based on estimates of class-membership probability. Finally, and perhaps most interestingly, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple ranking-based measure of signal-to-noise ratio.
Extended classification with modified Perceptron
Gunnar Rätsch and Jyrki Kivinen
Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney
Statistic
Lian Yan, Robert Dodier, Michael C. Mozer, and Richard Wolniewicz
Cross entropy and mean squared error are typical cost functions used to optimize classifier performance. The goal of the optimization is usually to achieve the best correct classification rate. However, for many two-class real-world problems, the ROC curve is a more meaningful performance measure. We demonstrate that minimizing cross entropy or mean squared error does not necessarily maximize the area under the ROC curve (AUC). We then consider alternative objective functions for training a classifier to maximize the AUC directly. We propose an objective function that is an approximation to the Wilcoxon-Mann-Whitney statistic, which is equivalent to AUC. The proposed objective function is differentiable, so gradient-based methods can be used to train the classifier. After discussing the improved results of the new objective function over several UCI data sets, we apply the new objective function to real-world customer behavior prediction problems for a wireless service provider and a cable service provider, and achieve reliable and significant improvements in the ROC curve.
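The two quantities the abstract contrasts can be sketched directly. The discrete Wilcoxon-Mann-Whitney statistic below equals the AUC; the smooth hinge-power surrogate follows the form described by Yan et al. as I understand it, with the gamma and p values being assumptions.

```python
# Sketch: WMW statistic (= AUC) and a differentiable surrogate for it.
# The statistic counts correctly ordered (positive, negative) score pairs;
# the surrogate penalizes pairs whose score margin falls below gamma, and
# being smooth it can be minimized with gradient methods.

def wmw_statistic(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly = AUC
    (ties counted as half)."""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in pos_scores for sn in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def wmw_surrogate_loss(pos_scores, neg_scores, gamma=0.3, p=2):
    """Smooth penalty (gamma - margin)^p for pairs with margin below gamma."""
    return sum((gamma - (sp - sn)) ** p
               for sp in pos_scores for sn in neg_scores
               if sp - sn < gamma)

pos, neg = [0.9, 0.8, 0.4], [0.7, 0.2, 0.1]
print(wmw_statistic(pos, neg))        # 8 of 9 pairs ranked correctly
print(wmw_surrogate_loss(pos, neg))   # positive: some margins fall below gamma
```

Minimizing the surrogate pushes positive scores above negative scores by at least gamma, which drives the WMW statistic, and hence the AUC, toward 1; cross entropy and mean squared error carry no such pairwise ordering pressure, which is the gap the abstract identifies.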
This is a one-day workshop on Saturday.
Rich Caruana, 4157 Upson Hall, Computer Science, Cornell University, Ithaca, NY 14853. caruana@cs.cornell.edu, 607-255-1164, 607-255-4428 (fax)
Thorsten Joachims, 4153 Upson Hall, Computer Science, Cornell University, Ithaca, NY 14853. tj@cs.cornell.edu, 607-255-1372, 607-255-4428 (fax)