SIGIR 2016 Tutorial on Counterfactual Evaluation and Learning

for Search, Recommendation and Ad Placement

Speakers: Thorsten Joachims <>
Cornell University
Department of Computer Science & Department of Information Science

Adith Swaminathan <>
Cornell University
Department of Computer Science

Date: 17.07.2016


Online metrics measured through A/B tests have become the gold standard for many evaluation questions. But can we get the same results as A/B tests without actually fielding a new system? And can we train systems to optimize online metrics without subjecting users to an online learning algorithm? This tutorial summarizes and unifies the emerging body of methods on counterfactual evaluation and learning. These counterfactual techniques provide a wellfounded way to evaluate and optimize online metrics by exploiting logs of past user interactions. In particular, the tutorial unifies the causal inference, information retrieval, and machine learning view of this problem, providing the basis for future research in this emerging area of great potential impact.

Slides and Video link

The slides for the tutorial are in four parts, and pdf's exported from Powerpoint are provided below.

Video recording of the tutorial is in two parts, and embedded below. To view it in a separate tab, click here: Part I, Part II.

Part 1:

Part 2:

Code and Data for News Recommendation Demo

Download all required files for the demo here:


All evaluation and most learning approaches are written in Python3. The one exception -- learning using ERM and Inverse Propensity Scoring (or Doubly Robust estimators) -- uses Vowpal Wabbit.

Recommended installation process:
  1. Download and install Anaconda with Python3 (e.g. Python 3.5).
  2. Ensure the following python packages are installed : Numpy, Scipy, Scikit-learn, joblib. With Anaconda, just use:
    conda install [package]
  3. Install Vowpal Wabbit. On Windows, use cygwin and follow these instructions.
  4. Unzip the downloaded from above, and navigate to Code_Data/Code/Synth_TestBed/.

To run the evaluation experiments, use

python [-h]

Allowed values for the -e and -f options (the policy to evaluate, and the policy that serves as the scorer for Plackett-Luce randomization) are

Allowed values for the -v option (the metric to measure/optimize) are

To plot the results of running evaluation experiments, use
python --mode [estimate/error] --path [../../Logs/ssynth...pkl]

To run the learning experiments, use

python [-h]

If you wish to also observe Vowpal Wabbit results for learning, supply the "-d" option to the OptExperiment script [WARNING: Dumps massive dataset files.] Additionally run,
vw -d vw_train.dat --cb_adf -f cb.model --passes 20 –cache_file cb.cache
vw -t -d vw_test.dat -i cb.model -p test.predict
python –d vw_test2.dat –p test.predict

Note that Vowpal Wabbit has several tunable parameters and implements an efficient grid-search and cross-validation strategy. Have fun playing with multiple hyper-parameters!


\bibitem{Aslam2009} J.~A. Aslam, E.~Kanoulas, V.~Pavlu, S.~Savev, and E.~Yilmaz.
Document selection methodologies for efficient and effective learning-to-rank.
In {\em SIGIR}, pages 468--475, 2009.

\bibitem{AtheyImbens} S.~Athey and G.~Imbens.
{Recursive Partitioning for Heterogeneous Causal Effects}.
{\em ArXiv e-prints}, 2015.

\bibitem{beygelzimer09offset} A.~Beygelzimer and J.~Langford.
The offset tree for learning with partial labels.
In {\em KDD}, pages 129--138, 2009.

\bibitem{BeygelzimerKDD2010} A.~Beygelzimer and J.~Langford.
Learning through exploration.
In {\em KDD}, 2010.

\bibitem{Bottou2013} L.~Bottou, J.~Peters, J.~Q. Candela, D.~X. Charles, M.~Chickering, E.~Portugaly, D.~Ray, P.~Y. Simard, and E.~Snelson.
Counterfactual reasoning and learning systems: The example of computational advertising.
{\em Journal of Machine Learning Research}, 14(1):3207--3260, 2013.

\bibitem{Carterette2010} B.~Carterette, E.~Kanoulas, V.~Pavlu, and H.~Fang.
Reusable test collections through experimental design.
In {\em SIGIR}, pages 547--554, 2010. \bibitem{Carterette2010Tutorial} B.~Carterette, E.~Kanoulas, and E.~Yilmaz.
Low cost evaluation in information retrieval.
In {\em SIGIR}, pages 903--903, 2010.

\bibitem{CarteretteKY12} B.~Carterette, E.~Kanoulas, and E.~Yilmaz.
Advances on the development of evaluation measures.
In {\em SIGIR}, pages 1200--1201, 2012.

\bibitem{Cesa-Bianchi2006} N.~Cesa-Bianchi and G.~Lugosi.
{\em Prediction, Learning, and Games}.
Cambridge University Press, New York, NY, USA, 2006.

\bibitem{Interleaving2012} O.~Chapelle, T.~Joachims, F.~Radlinski, and Y.~Yue.
Large-scale validation and analysis of interleaved search evaluation.
{\em {ACM} Transactions on Information Systems}, 30(1):6, 2012.

\bibitem{Chuklin_SIGIR2015_Tutorial1} A.~Chuklin, I.~Markov, and M.~de~Rijke.
An introduction to click models for web search.
In {\em SIGIR}, pages 1113--1115, 2015.

\bibitem{Cortes2010} C.~Cortes, Y.~Mansour, and M.~Mohri.
Learning bounds for importance weighting.
In {\em NIPS}, pages 442--450, 2010.

\bibitem{Dudik2014} M.~Dud{\'\i}k, D.~Erhan, J.~Langford, and L.~Li.
Doubly robust policy evaluation and optimization.
{\em Statistical Science}, pages 485--511, 2014.

\bibitem{Dudik2011} M.~Dud{\'{i}}k, J.~Langford, and L.~Li.
Doubly robust policy evaluation and learning.
In {\em ICML}, pages 1097--1104, 2011.

\bibitem{Workshop2015WWW} N.~Gupta, E.~Koh, and L.~Li.
Workshop on online and offline evaluation of web-based services.
In {\em WWW Companion}, 2015.

\bibitem{HL} J.~M. Hernández-Lobato, N.~Houlsby, and Z.~Ghahramani.
Probabilistic matrix factorization with non-random missing data.
In {\em ICML}, pages 1512--1520, 2014.

\bibitem{Hofmann2014Tutorial} K.~Hofmann.
Online experimentation for information retrieval.
In {\em RuSSIR}, pages 21--41, 2014.

\bibitem{Hofmann2013} K.~Hofmann, A.~Schuth, S.~Whiteson, and M.~de~Rijke.
Reusing historical interaction data for faster online learning to rank for {IR}.
In {\em WSDM}, pages 183--192, 2013.

\bibitem{kohavi2009controlled} R.~Kohavi, R.~Longbotham, D.~Sommerfield, and R.~M. Henne.
Controlled experiments on the web: survey and practical guide.
{\em Knowledge Discovery and Data mining}, pages 140--181, 2009.

\bibitem{LangfordNIPS2013} J.~Langford.
Learning to interact.
In {\em NIPS}, 2013.

\bibitem{Langford2008} J.~Langford, A.~Strehl, and J.~Wortman.
Exploration scavenging.
In {\em ICML}, pages 528--535, 2008.

\bibitem{langford2008epoch} J.~Langford and T.~Zhang.
The epoch-greedy algorithm for multi-armed bandits with side information.
In {\em Advances in Neural Information Processing Systems}, 2008.

\bibitem{LihongWSDM2015} L.~Li.
Offline evaluation and optimization for interactive systems.
In {\em WSDM}, 2015.

\bibitem{LiCKG15} L.~Li, S.~Chen, J.~Kleban, and A.~Gupta.
Counterfactual estimation and optimization of click metrics in search engines: {A} case study.
In {\em WWW Companion}, pages 929--934, 2015.

\bibitem{Li2011} L.~Li, W.~Chu, J.~Langford, and X.~Wang.
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms.
In {\em WSDM}, pages 297--306, 2011.

\bibitem{Li2015} L.~Li, R.~Munos, and C.~Szepesvari.
Toward minimax off-policy value estimation.
In {\em AISTATS}, 2015.

\bibitem{Marlin/Zemel/09} B.~M. Marlin and R.~S. Zemel.
Collaborative prediction and ranking with non-random missing data.
In {\em RecSys}, pages 5--12, 2009.

\bibitem{Mary2014} J.~Mary, P.~Preux, and O.~Nicol.
Improving offline evaluation of contextual bandit algorithms via bootstrapping techniques.
In {\em ICML}, pages 172--180, 2014.

\bibitem{Rosenbaum1983} P.~Rosenbaum and D.~Rubin.
The central role of the propensity score in observational studies for causal effects.
{\em Biometrika}, 70(1):41--55, 1983.

\bibitem{Rubinstein2008} R.~Rubinstein and D.~Kroese.
{\em Simulation and the Monte Carlo Method}.
Wiley, 2 edition, 2008.

\bibitem{mnar} T.~Schnabel, A.~Swaminathan, A.~Singh, N.~Chandak, and T.~Joachims.
Recommendations as treatments: Debiasing learning and evaluation.
{\em Preprint}, 2016.

\bibitem{Strehl2010} A.~L. Strehl, J.~Langford, L.~Li, and S.~Kakade.
Learning from logged implicit exploration data.
In {\em NIPS}, pages 2217--2225, 2010.

\bibitem{Sutton1998} R.~S. Sutton and A.~G. Barto.
{\em {Reinforcement Learning: An Introduction}}.
The MIT Press, 1998.

\bibitem{Swaminathan2015} A.~Swaminathan and T.~Joachims.
Counterfactual risk minimization: Learning from logged bandit feedback.
In {\em ICML}, 2015.

\bibitem{Swaminathan15b} A.~Swaminathan and T.~Joachims.
The self-normalized estimator for counterfactual learning.
In {\em NIPS}, pages 3213--3221, 2015.

\bibitem{slates} A.~Swaminathan, A.~Krishnamurthy, A.~Agarwal, M.~Dud\'{i}k, and J.~Langford.
Off-policy evaluation and optimization for slate recommendation.

\bibitem{Vapnik1998} V.~Vapnik.
{\em Statistical Learning Theory}.
Wiley, Chichester, GB, 1998.

\bibitem{Yang2014Tutorial} H.~Yang, M.~Sloan, and J.~Wang.
Dynamic information retrieval modeling.
In {\em SIGIR}, 2014.

\bibitem{Evangelosetal} E.~Yilmaz, E.~Kanoulas, and J.~A. Aslam.
A simple and efficient sampling method for estimating {AP} and {NDCG}.
In {\em SIGIR}, pages 603--610, 2008.

\bibitem{Zadrozny2003} B.~Zadrozny, J.~Langford, and N.~Abe.
Cost-sensitive learning by cost-proportionate example weighting.
In {\em ICDM}, pages 435--, 2003.


We acknowledge and thank for the support under NSF Awards IIS-1247637, IIS-1217686, and IIS-1513692, as well as a gift from Bloomberg.