1. Introduction

State-of-the-art information retrieval systems commonly use machine learning techniques to learn ranking functions (et al., 2007).
Existing machine learning approaches typically optimize for ranking performance measures such as mean average precision or normalized discounted cumulative gain.
However, these approaches do not consider diversity, and they also assume that a document's relevance can be evaluated independently of other documents.
In contrast, several recent studies in information retrieval have emphasized the need to optimize for diversity (Chen et al., 2008; et al., 2008).

Appearing in Proceedings of the 25th International Conference on Machine Learning, 2008. Copyright 2008 by the author(s)/owner(s).
In particular, they stressed the need to model inter-document dependencies.
However, none of these approaches addressed the learning problem, and thus they either use a limited feature space or require extensive tuning for different retrieval settings.
In contrast, we present a method which can automatically learn a good retrieval function using a rich feature space.
In this paper we formulate the task of retrieval as the problem of predicting diverse subsets.
Specifically, we formulate a discriminant based on maximizing word coverage, and perform training using the structural SVM framework (Tsochantaridis et al., 2005).
For our experiments, diversity is measured using subtopic coverage on manually labeled data.
However, our approach can incorporate other forms of training data, such as clickthrough results.
To the best of our knowledge, our method is the first approach that can directly train for subtopic diversity.
We have also made publicly available an implementation of our algorithm1.
In the rest of this paper, we first provide a brief survey of recent related work.
We then present our model and describe the prediction and training algorithms.
We conclude by presenting experiments on labeled query data from the TREC Interactive Track as well as a synthetic dataset.
Our method compares favorably to conventional methods which do not perform learning.
2. Related Work

Our prediction method is most closely related to the Essential Pages method (et al.), since both methods select documents to maximize weighted word coverage.
Documents are iteratively selected to maximize the marginal benefit, which is also similar to approaches considered by Chen et al. (2005), among others.
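The iterative selection described here can be sketched as a simple greedy procedure. The sketch below is illustrative only: the representation of a document as a set of words and the per-word benefit weights are assumptions for exposition, not the discriminant developed later in this paper.

```python
def greedy_select(docs, word_weight, k):
    """Greedily pick up to k documents, each chosen to maximize the
    marginal weighted coverage of words not yet covered.

    docs: {doc_id: set of words}   (hypothetical encoding)
    word_weight: {word: benefit}   (hypothetical weights)
    """
    selected, covered = [], set()
    for _ in range(min(k, len(docs))):
        def gain(d):
            # Marginal benefit: weight of words d adds beyond `covered`.
            return sum(word_weight.get(w, 0.0) for w in docs[d] - covered)
        best = max((d for d in docs if d not in selected), key=gain)
        if gain(best) <= 0.0:
            break  # no remaining document adds new weighted coverage
        selected.append(best)
        covered |= docs[best]
    return selected
```

For example, with documents {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"x"}} and weights {"x": 1.0, "y": 2.0, "z": 1.5}, the first pick is "b" (marginal benefit 3.5), after which "a" and "c" each add only weight 1.0.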
However, none of these previous approaches addressed the learning problem.
1 http://projects.yisongyue.com/svmdiv/

Learning to rank is a well-studied problem in machine learning.
Existing approaches typically consider the ranking setting (e.g., et al., 2007).
These approaches can maximize commonly used performance measures such as mean average precision and normalized discounted cumulative gain, and they generalize well to new queries.
However, diversity is not considered.
These approaches also evaluate each document independently of other documents.
From an online learning perspective, et al. used a bandit method to minimize abandonment for a single query.
While abandonment is provably minimized, their approach cannot generalize to new queries.
The diversity problem can also be treated as learning preferences for sets, which is the approach taken by the modeling language of (et al., 2007).
In their setting, diversity is measured on a per-feature basis.
Since subtopics cannot be treated as features (subtopic information is only given in the training data), their method cannot be directly applied to maximizing subtopic diversity.
Our model does not need to derive diversity directly from individual features, but it does require richer forms of training data (i.e., explicitly labeled subtopics).
Another approach uses a global class hierarchy over queries, which can be leveraged to classify new documents and queries (et al., 2007).
While previous studies on hierarchical classification did not focus on diversity, one might consider measuring diversity by mapping subtopics onto the class hierarchy.
However, it is difficult for such hierarchies to achieve the granularity required to measure diversity for individual queries (see the beginning of the experiments section for a description of the subtopics used in our experiments).
Using a large global hierarchy also introduces other complications, such as how to generate a comprehensive set of topics and how to assign documents to topics.
It seems more practical to collect labeled training data containing subtopics (e.g., the TREC Interactive Track).
3. The Learning Problem

For each query, we assume that we are given a set of candidate documents x = {x1, . . . , xn}.
In order to measure diversity, we assume that each query spans a set of topics (which may be distinct to that query).
We denote the topic sets as T = {T1, . . . , Tn}, where topic set Ti contains the subtopics covered by document xi in x.
Topic sets may overlap.
Our goal is to select a subset of K documents from x which maximizes topic coverage.
If the topic sets were known, a good solution could be computed via straightforward greedy subset selection, which has a (1 - 1/e)-approximation bound (et al., 1997).
Finding the globally optimal subset requires examining n choose K subsets, which we consider intractable for even reasonably small values of K. In practice, however, the topic sets of the candidate set are not known, nor is the set of all possible topics.
We merely assume to have a set of training examples of the form (x, T), and must learn a good function for predicting y in the absence of T. This, in essence, is the learning problem.
Let X denote the space of possible candidate sets x, T the space of topic sets T, and Y the space of predicted subsets y.
Following the standard machine learning setup, we formulate our task as learning a hypothesis function h : X -> Y to predict a y when given x.
