Large-scale Validation of Counterfactual Learning Methods

A Test-Bed

Damien Lefortier (University of Amsterdam), Adith Swaminathan (Cornell University), Xiaotao Gu (Tsinghua University), Thorsten Joachims (Cornell University), Maarten de Rijke (University of Amsterdam)

Date: December 1, 2016


We provide a public dataset that contains accurately logged propensities for the problem of Batch Learning from Bandit Feedback (BLBF). The data comes from traffic logged by Criteo, a leader in the display advertising space. This dataset is hosted on Amazon AWS and is available to the public at

A small sample of this data (~400 impressions, ~1MB) is available here.

The dataset contains over 100 million display ad impressions and is 35GB gzipped / 250GB raw. We hope this dataset will serve as a large-scale standardized test-bed for the evaluation of counterfactual learning methods. If you use the dataset in your research, please cite [1] and drop a note about your research to us and to the team at Criteo.

Data Description

Consider the problem of filling a banner ad with an aggregate of multiple products the user may want to purchase. Each ad has one of many banner types, which differ in the number of products they contain and in their layout. The task is to choose the products to display in the ad knowing the banner type, user context, and candidate ads, in order to maximize the number of clicks. The format of this data is:

example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} ...
${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
${wasProductMClicked} exid:${exID} ${productFeatM_1}:${vM_1} ...

Each impression is represented by ${M+1} lines, where ${M} is the number of candidate ads and the first line is a header containing summary information. The ${nbSlots} slots in a banner are labeled in order from left to right and from top to bottom. The first ${nbSlots} candidates correspond to the displayed products, ordered by position.

The logging policy stochastically fills the banner by first computing non-negative scores for all candidates, and then sampling without replacement from the multinomial distribution defined by these scores (i.e. a Plackett-Luce ranking model). The ${propensity} records the probability with which the displayed banner was sampled under this logging policy.

There are 35 features. Display features include the user context and banner type, which are constant for all candidates in an impression. Each unique quadruplet of feature IDs < 1, 2, 3, 5 > corresponds to a unique banner type. Features 1 and 2 are numerical, while all other features are categorical. Some categorical features are multi-valued (order does not matter).

Example IDs increase with time, allowing temporal slices for evaluation. Importantly, non-clicked examples were sub-sampled aggressively to reduce the dataset size: only a random 10% sub-sample of non-clicked impressions is logged.
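To make the record layout and the logging policy concrete, here is a minimal Python sketch. The function names and the sample record below are our own illustrations, not part of the released scripts, and the parser is a simplification (e.g. it keeps only the last value of a multi-valued categorical feature).

```python
def parse_impression(lines):
    """Parse one impression: a header line plus one line per candidate.

    Illustrative sketch only; field handling follows the format description
    above, but the real helper scripts may differ in detail.
    """
    header = lines[0].split()
    assert header[0] == 'example'
    ex_id = int(header[1].rstrip(':'))           # 'example ${exID}:'
    hash_id = header[2]
    was_ad_clicked = int(header[3])
    propensity = float(header[4])
    nb_slots = int(header[5])
    nb_candidates = int(header[6])
    # Display features are 'featID:value' tokens. NOTE: multi-valued
    # categorical features repeat the feature ID, so a dict keeps only the
    # last value; a full parser would collect a list per key.
    display_feats = dict(tok.split(':', 1) for tok in header[7:])

    candidates = []
    for line in lines[1:1 + nb_candidates]:
        tokens = line.split()
        label = int(tokens[0])                   # ${wasProductClicked}
        feats = dict(tok.split(':', 1) for tok in tokens[1:])  # incl. 'exid'
        candidates.append((label, feats))

    return {
        'exID': ex_id, 'hashID': hash_id, 'wasAdClicked': was_ad_clicked,
        'propensity': propensity, 'nbSlots': nb_slots,
        'nbCandidates': nb_candidates, 'displayFeatures': display_feats,
        'candidates': candidates,
    }


def plackett_luce_propensity(scores, displayed):
    """Probability of drawing the ordered slate `displayed` (indices into
    `scores`) by sampling without replacement proportional to the scores,
    i.e. the Plackett-Luce model described above."""
    remaining = float(sum(scores))
    p = 1.0
    for i in displayed:
        p *= scores[i] / remaining
        remaining -= scores[i]
    return p
```

For example, with scores [1, 1, 2], the slate that shows candidate 2 in slot 1 and candidate 0 in slot 2 has propensity (2/4) * (1/2) = 0.25 under this model.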




Download all helper scripts here:


All evaluation scripts are written in Python3 and were developed on a Linux machine.
Learning algorithms use Vowpal Wabbit and a Python3 implementation of POEM (included as a stand-alone in

Recommended installation process:
  1. Download and install Anaconda with Python3 (e.g. Python 3.5).
  2. Ensure the following Python packages are installed: NumPy, SciPy, scikit-learn. With Anaconda, just use:
    conda install [package]
  3. Install Vowpal Wabbit. On Windows, use Cygwin and follow these instructions. In the steps below, vw is assumed to be executable from the current working directory. To achieve this on Linux, after compiling Vowpal Wabbit, simply run
    sudo make install
    Otherwise, simply replace every invocation of vw in the scripts below with the full path to the Vowpal Wabbit binary.
  4. Download the Criteo dataset and untar it. On Linux, just use:
    tar -zxvf CriteoBannerFillingChallenge.tar.gz
    Optionally, you can re-compress using gzip, since all the scripts support reading/writing gzipped files through the zlib library:
    gzip CriteoBannerFillingChallenge.txt
  5. Download and unzip the helper scripts, then navigate to BLBF-DisplayAds/Scripts/.



[1] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. de Rijke. Large-scale Validation of Counterfactual Learning Methods: A Test-Bed. NIPS Workshop on "Inference and Learning of Hypothetical and Counterfactual Interventions in Complex Systems", 2016. [arXiv] [paper] [poster]