PUBLICATIONS

Short-Term Satisfaction and Long-Term Coverage: Understanding How Users Tolerate Algorithmic Exploration

@InProceedings{schnabel2018exploration,
  author = 	 {Schnabel, Tobias and Bennett, Paul N. and Dumais, Susan T. and Joachims, Thorsten},
  title = 	 {Short-Term Satisfaction and Long-Term Coverage: Understanding How Users Tolerate Algorithmic Exploration},
  booktitle =    {Conference on Web Search and Data Mining (WSDM)},
  pages =	 "0-0",
  year = 	 2018
}

Any learning algorithm for recommendation faces a fundamental trade-off between exploiting partial knowledge of a user's interests to maximize satisfaction in the short term and discovering additional user interests to maximize satisfaction in the long term. To enable discovery, a machine learning algorithm typically elicits feedback on items it is uncertain about, which is termed algorithmic exploration in machine learning. This exploration comes with a cost to the user, since the items an algorithm chooses for exploration frequently turn out to not match the user's interests. In this paper, we study how users tolerate such exploration and how presentation strategies can mitigate the exploration cost. To this end, we conduct a behavioral study with over 600 people, where we vary how algorithmic exploration is mixed into the set of recommendations. We find that users respond non-linearly to the amount of exploration, where some exploration mixed into the set of recommendations has little effect on short-term satisfaction and behavior. For long-term satisfaction, the overall goal is to learn via exploration about the items presented. We therefore also analyze the quantity and quality of implicit feedback signals such as clicks and hovers, and how they vary with different amounts of mix-in exploration. Our findings provide insights into how to design presentation strategies for algorithmic exploration in interactive recommender systems, mitigating the short-term costs of algorithmic exploration while aiming to elicit informative feedback data for learning.

Unbiased Learning-to-Rank with Biased Feedback

PDF CODE

@InProceedings{joachims2017ltr,
  author = 	 {Joachims, Thorsten and Swaminathan, Adith and Schnabel, Tobias},
  title = 	 {Unbiased Learning-to-Rank with Biased Feedback},
  booktitle =    {Conference on Web Search and Data Mining (WSDM)},
  pages =	 "781-789",
  year = 	 2017
}

Implicit feedback (e.g., clicks, dwell times, etc.) is an abundant source of data in human-interactive systems. While implicit feedback has many advantages (e.g., it is inexpensive to collect, user centric, and timely), its inherent biases are a key obstacle to its effective use. For example, position bias in search rankings strongly influences how many clicks a result receives, so that directly using click data as a training signal in Learning-to-Rank (LTR) methods yields sub-optimal results. To overcome this bias problem, we present a counterfactual inference framework that provides the theoretical basis for unbiased LTR via Empirical Risk Minimization despite biased data.

Using this framework, we derive a Propensity-Weighted Ranking SVM for discriminative learning from implicit feedback, where click models take the role of the propensity estimator. In contrast to most conventional approaches to de-biasing the data using click models, this allows training of ranking functions even in settings where queries do not repeat. Beyond the theoretical support, we show empirically that the proposed learning method is highly effective in dealing with biases, that it is robust to noise and propensity model misspecification, and that it scales efficiently. We also demonstrate the real-world applicability of our approach on an operational search engine, where it substantially improves retrieval performance.

Effective Evaluation using Logged Bandit Feedback from Multiple Loggers

PDF

@InProceedings{agarwal2017multiple,
  author    =    {Agarwal, Aman and Basu, Soumya and Schnabel, Tobias and Joachims, Thorsten},
  title     =    {Effective Evaluation using Logged Bandit Feedback from Multiple Loggers},
  booktitle =    {Conference on Knowledge Discovery and Data Mining (KDD)},
  year      =    {2017}
}

Accurately evaluating new policies (e.g. ad-placement models, ranking functions, recommendation functions) is one of the key prerequisites for improving interactive systems. While the conventional approach to evaluation relies on online A/B tests, recent work has shown that counterfactual estimators can provide an inexpensive and fast alternative, since they can be applied offline using log data that was collected from a different policy fielded in the past. In this paper, we address the question of how to estimate the performance of a new target policy when we have log data from multiple historic policies. This question is of great relevance in practice, since policies get updated frequently in most online systems. We show that naively combining data from multiple logging policies can be highly suboptimal. In particular, we find that the standard Inverse Propensity Score (IPS) estimator suffers especially when logging and target policies diverge - to a point where throwing away data improves the variance of the estimator. We therefore propose two alternative estimators which we characterize theoretically and compare experimentally. We find that the new estimators can provide substantially improved estimation accuracy.

Recommendations as Treatments: Debiasing Learning and Evaluation

PDF DATA SLIDES

@InProceedings{schnabel2016shortlists,
   author = "Schnabel, Tobias and Swaminathan, Adith and Singh, Ashudeep and Chandak, Navin and Joachims, Thorsten",
   title = "Recommendations as Treatments: Debiasing Learning and Evaluation",
   booktitle = "Proceedings of The International Conference on Machine Learning (ICML)",
   pages = "1670–1679",
   year = "2016"
}

Most data for evaluating and training recommender systems is subject to selection biases, either through self-selection by the users or through the actions of the recommendation system itself. In this paper, we provide a principled approach to handle selection biases by adapting models and estimation techniques from causal inference. The approach leads to unbiased performance estimators despite biased data, and to a matrix factorization method that provides substantially improved prediction performance on real-world data. We theoretically and empirically characterize the robustness of the approach, and find that it is highly practical and scalable.

Using Shortlists to Support Decision Making and Improve Recommender System Performance

PDF SLIDES

@InProceedings{schnabel2016shortlists,
   author = "Schnabel, Tobias and Bennett, Paul N. and Dumais, Susan T. and Joachims, Thorsten",
   title = "Using Shortlists to Support Decision Making and Improve Recommender System Performance",
   booktitle = "Proceedings of the Conference on World Wide Web (WWW)",
   pages = "0-0",
   year = "2016"
}

In this paper, we study shortlists as an interface component for recommender systems with the dual goal of supporting the user's decision process, as well as improving implicit feedback elicitation for increased recommendation quality. A shortlist is a temporary list of candidates that the user is currently considering, e.g., a list of a few movies the user is currently considering for viewing. From a cognitive perspective, shortlists serve as digital short-term memory where users can off-load the items under consideration - thereby decreasing their cognitive load. From a machine learning perspective, adding items to the shortlist generates a new implicit feedback signal as a by-product of exploration and decision making which can improve recommendation quality. Shortlisting therefore provides additional data for training recommendation systems without the increases in cognitive load that requesting explicit feedback would incur.

We perform an user study with a movie recommendation setup to compare interfaces that offer shortlist support with those that do not. From the user studies we conclude: (i) users make better decisions with a shortlist; (ii) users prefer an interface with shortlist support; and (iii) the additional implicit feedback from sessions with a shortlist improves the quality of recommendations by nearly a factor of two.

Evaluation methods for unsupervised word embeddings

PDF DATA SLIDES

@InProceedings{schnabel2015eval,
   author = "Schnabel, Tobias and Labutov, Igor and Mimno, David and Joachims, Thorsten",
   title = "Evaluation methods for unsupervised word embeddings",
   booktitle = "Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)",
   pages     = "298-307",
   year = "2015"
}

We present a comprehensive study of evaluation methods for unsupervised embedding techniques that obtain meaningful representations of words from text. Different evaluations result in different orderings of embedding methods, calling into question the common assumption that there is one single optimal vector representation. We present new evaluation techniques that directly compare embeddings with respect to specific queries. These methods reduce bias, provide greater insight, and allow us to solicit data-driven relevance judgments rapidly and accurately through crowdsourcing.

Online Updating of Word Representations for Part-of-Speech Taggging

PDF DATA

@InProceedings{yin2015online,
   title = "Online Updating of Word Representations for Part-of-Speech Taggging",
   author= "Yin, Wenpeng and Schnabel, Tobias and Schütze, Hinrich",
   booktitle = "Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)",
   pages = "1329-1334",
   year = "2015"
}

We propose online unsupervised domain adaptation (DA), which is performed incrementally as data comes in and is applicable when batch DA is not possible. In a part-of-speech (POS) tagging evaluation, we find that online unsupervised DA performs as well as batch DA.

Unbiased Concurrent Evaluation on a Budget

PDF

Eliciting expert judgments for evaluating the performance of structured prediction systems (e.g., search engines, recommender systems) is labor-intensive and costly. This raises the question of how to get high-quality performance estimates given a relatively small budget of judgments. In this paper, we provide theoretically justified - yet highly practical and efficient - methods for selecting a subset of items to judge, addressing four commonly encountered evaluation scenarios: estimating the performance of a single system, comparing a system to a benchmark system, and estimating the relative and absolute performances of k systems. Our approach is based on designing effective importance sampling estimators that give provably unbiased performance estimates and that can re-use previously collected judgments. To this effect, we exploit the fact that many performance metrics of interest can be expressed as an expectation and derive theoretically optimal importance sampling distributions for estimating this expected value. We empirically demonstrate the effectiveness of our framework, showing that it eliminates the bias inherent in deterministic selection strategies (e.g. the pooling method) and that it can drastically reduce the number of judgments necessary compared to naive sampling approaches.

FLORS: Fast and Simple Domain Adaptation for Part-of-Speech Tagging

PDF

@Article{schnabel2014flors,
   author = "Schnabel, Tobias and Sch{\"u}tze, Hinrich",
   title = "FLORS: Fast and Simple Domain Adaptation for Part-of-Speech Tagging",
   journal = "Transactions of the Association of Computational Linguistics (TACL)",
   year = "2014",
   volume = "2",
   number  = "1",
   pages = "15-26"
}

We present FLORS, a new part-of-speech tagger for domain adaptation. FLORS uses robust representations that work especially well for unknown words and for known words with unseen tags. FLORS is simpler and faster than previous domain adaptation methods, yet it has significantly better accuracy than several baselines.

Stable Coactive Learning via Perturbation

PDF

@InProceedings{Raman2013stable,
   author = "Raman, Karthik and Joachims, Thorsten and Shivaswamy, Pannaga and Schnabel, Tobias",
   title = "Stable Coactive Learning via Perturbation",
   booktitle = "International Conference on Machine Learning (ICML)",
   pages = "837-845",
   year = "2013"
}

Coactive Learning is a model of interaction between a learning system (e.g. search engine) and its human users, wherein the system learns from (typically implicit) user feedback during operational use. User feedback takes the form of preferences, and recent work has introduced online algorithms that learn from this weak feedback. However, we show that these algorithms can be unstable and ineffective in real-world settings where biases and noise in the feedback are significant. In this paper, we propose the first coactive learning algorithm that can learn robustly despite bias and noise. In particular, we explore how presenting users with slightly perturbed objects (e.g., rankings) can stabilize the learning process. We theoretically validate the algorithm by proving bounds on the average regret. We also provide extensive empirical evidence on benchmarks and from a live search engine user study, showing that the new algorithm substantially outperforms existing methods.

Towards Robust Cross-Domain Domain Adaptation for Part-of-Speech Tagging

PDF

@InProceedings{Schnabel2013POS,
   author = "Schnabel, Tobias and Sch{\"u}tze, Hinrich",
   title = "Towards Robust Cross-Domain Domain Adaptation for Part-of-Speech Tagging",
   booktitle = "Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP)",
   year = "2013",
   pages = "198--206"
}

We investigate the robustness of domain adaptation (DA) representations and methods across target domains using part-of-speech (POS) tagging as a case study. We find that there is no single representation and method that works equally well for all target domains. In particular, there are large differences between target domains that are more similar to the source domain and those that are less similar.