Propensity SVM^rank

Ranking SVM for Learning from Partial-Information Feedback

Author: Thorsten Joachims <thorsten@joachims.org>
Cornell University
Department of Computer Science

Version: 1.00
Date: 27.06.2016

Overview

Propensity SVM^rank is an instance of SVM^struct for efficiently training Ranking SVMs from partial-information feedback [Joachims et al., 2017a]. Unlike regular Ranking SVMs, Propensity SVM^rank can deal with situations where the relevance labels for some relevant documents are missing. This is the case when learning from click data, where user are unlikely to click on all relevant documents. But even when learning from manual annotations, pooling techniques typically miss some relevant documents. Despite incomplete annotation, Propensity SVM^rank optimizes an unbiased ERM objective that allows counterfactual inference over ranking policies via inverse propensity weighting. The algrithm is described in [Joachims et al., 2017a], and the optimization algorithm used in this implementation is described in [Joachims et al., 2009c]

Source Code

The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. If you use Propensity SVM^rank in your scientific work, please cite as

T. Joachims, A. Swaminathan, T. Schnabel Unbiased Learning-to-Rank with Biased Feedback, Proceedings of the International Conference on Web Search and Data Mining (WSDM), 2017. [PDF]

The implementation was developed on Linux with gcc, but compiles also on Cygwin, Windows (using MinGW) and Mac (after small modifications, see FAQ). The source code is available at the following location:

https://osmot.cs.cornell.edu/svm_proprank/current/svm_proprank.tar.gz

If you just want the binaries, you can download them for the following systems:

Linux (64-bit): https://osmot.cs.cornell.edu/svm_proprank/current/svm_proprank_linux64.tar.gz
Cygwin (64-bit): https://osmot.cs.cornell.edu/svm_proprank/current/svm_proprank_cygwin64.tar.gz
Cygwin (32-bit): https://osmot.cs.cornell.edu/svm_proprank/current/svm_proprank_cygwin.tar.gz
Windows (32-bit): https://osmot.cs.cornell.edu/svm_proprank/current/svm_proprank_windows.zip

Please send me email and let me know that you got it. The archive contains the source code of the most recent version of SVM^rank, which includes the source code of SVM^struct and the SVM^light quadratic optimizer. Unpack the archive using the shell command:

      gunzip –c svm_proprank.tar.gz | tar xvf –

This expands the archive into the current directory, which now contains all relevant files. You can compile SVM^rank using the command:

      make

This will produce the executables svm_proprank_learn and svm_proprank_classify. If the system does not compile properly, check this FAQ.

How to Use

Propensity SVM^rank consists of a learning module (svm_proprank_learn) and a module for making predictions (svm_proprank_classify). Propensity SVM^rank uses (almost) the same input and output file formats as the normal SVM^rank. You call it like

      svm_proprank_learn -c 20.0 train.dat model.dat

which trains a propensity-weighted Ranking SVM on the training set train.dat and outputs the learned rule to model.dat using the regularization parameter C set to 20.0. The C parameter has the same meaning as in the conventional SVM^rank. Most of the other options come from SVM^struct and SVM^light and are described there. Only the "application-specific options" listed below are particular to Propensity SVM^rank.

General Options:
         -?          -> this help
         -v [0..3]   -> verbosity level (default 1)
         -y [0..3]   -> verbosity level for svm_light (default 0)
Learning Options:
         -c float    -> C: trade-off between training error
                        and margin (default 0.01)
         -p [1,2]    -> L-norm to use for slack variables. Use 1 for L1-norm,
                        use 2 for squared slacks. (default 1)
         -o [1,2]    -> Rescaling method to use for loss.
                        1: slack rescaling
                        2: margin rescaling
                        (default 2)
         -l [0..]    -> Loss function to use.
                        0: zero/one loss
                        ?: see below in application specific options
                        (default 1)
Optimization Options (see [2][5]):
         -w [0,..,9] -> choice of structural learning algorithm (default 3):
                        0: n-slack algorithm described in [2]
                        1: n-slack algorithm with shrinking heuristic
                        2: 1-slack algorithm (primal) described in [5]
                        3: 1-slack algorithm (dual) described in [5]
                        4: 1-slack algorithm (dual) with constraint cache [5]
                        9: custom algorithm in svm_struct_learn_custom.c
         -e float    -> epsilon: allow that tolerance for termination
                        criterion (default 0.001000)
         -k [1..]    -> number of new constraints to accumulate before
                        recomputing the QP solution (default 100)
                        (-w 0 and 1 only)
         -f [5..]    -> number of constraints to cache for each example
                        (default 5) (used with -w 4)
         -b [1..100] -> percentage of training set for which to refresh cache
                        when no epsilon violated constraint can be constructed
                        from current cache (default 100%) (used with -w 4)
SVM-light Options for Solving QP Subproblems (see [3]):
         -n [2..q]   -> number of new variables entering the working set
                        in each svm-light iteration (default n = q).
                        Set n < q to prevent zig-zagging.
         -m [5..]    -> size of svm-light cache for kernel evaluations in MB
                        (default 40) (used only for -w 1 with kernels)
         -h [5..]    -> number of svm-light iterations a variable needs to be
                        optimal before considered for shrinking (default 100)
         -# int      -> terminate svm-light QP subproblem optimization, if no
                        progress after this number of iterations.
                        (default 100000)
Kernel Options:
         -t int      -> type of kernel function:
                        0: linear (default)
                        1: polynomial (s a*b+c)^d
                        2: radial basis function exp(-gamma ||a-b||^2)
                        3: sigmoid tanh(s a*b + c)
                        4: user defined kernel from kernel.h
         -d int      -> parameter d in polynomial kernel
         -g float    -> parameter gamma in rbf kernel
         -s float    -> parameter s in sigmoid/poly kernel
         -r float    -> parameter c in sigmoid/poly kernel
         -u string   -> parameter of user defined kernel
Output Options:
         -a string   -> write all alphas to this file after learning
                        (in the same order as in the training set)
Application-Specific Options:

The following loss functions can be selected with the -l option:
     1  IPS-weighted rank of relevant documents.

NOTE: When the propensities of all relevant train docs are equal to 1, then
      SVM-light in '-z p' mode (with c_light = c_rank/n, where n is the number
      of training clicks) and unweighted SVM-rank with Swapped Pairs loss are
      equivalent.

SVM^rank learns a linear ranking policy (i.e. a rule w*x without explicit threshold). The loss function to be optimized is selected using the '-l' option, and the only option implemented so far is the (propensity-weighted) rank of the relevant documents.

You can in principle use kernels in SVM^rank using the '-t' option just like in SVM^light, but it is painfully slow. I recommend against it.

The file format of the training and test files is the same as for the conventional SVM^rank (see here for further details), but there are some exceptions as elaborated on below. In particular, one needs to specify the inverse-propensity-score (IPS) weight via the cost:<IPSWeight> parameter for each relevant document.

<line> .=. <target> qid:<qid> cost:<IPSWeight> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. <0 or 1> <qid> .=. <positive integer>
<feature> .=. <positive integer>
<value> .=. <float>
<info> .=. <string>

For each document that is revealed to be relevant, one has to specify a separate qid. The lines in the input files have to be sorted by increasing qid. For each qid, the relevant document has target value 1, the other documents in the candidate set for the query (possibly some of which are relevant as well) have target value 0. Note that if there are multiple relevant documents for the same query, one would generate multiple qid (with the same feature vectors) where only a single document has target 1. For each relevant document, you need to specify the IPS-Weight via the cost parameter.

To clarify the file format, consider the following example with two queries. In query 1, documents A and C are relevant, but only A is revealed as relevant (with propensity 0.5). In query 2, documents B and C and D are relevant, but only B and C are revealed as relevant (with propensities 0.3 and 0.1 respectively). This leads to the following file for Propensity SVM^rank:

1 qid:1 cost:2.0 1:1 2:1 3:0 4:0.2 5:0 # 1A 0 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B 0 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C 0 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D 1 qid:2 cost:3.3 1:1 2:0 3:1 4:0.4 5:0 # 2B 0 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A 0 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C 0 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D 0 qid:2 1:0 2:0 3:1 4:0.1 5:1 # 2E 1 qid:3 cost:10.0 1:0 2:0 3:1 4:0.1 5:0 # 2C 0 qid:3 1:0 2:0 3:1 4:0.2 5:0 # 2A 0 qid:3 1:1 2:0 3:1 4:0.4 5:0 # 2B 0 qid:3 1:0 2:0 3:1 4:0.2 5:0 # 2D 0 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 2E

Files in this format can be given to svm_proprank_learn for training a model. However they can also be given svm_proprank_classify to compute an estimate of the ranking performance (in terms of average rank of the relevant documents) on a validation of test set. The self-normalize IPS estimator is used as defined in [Swaminathan & Joachims, 2017a]. Note that one can compute the test performance on full-information feedback by simply generating a qid block for each relevant document and setting the propensity to 1. Note that the code indexes rankings starting at position 0, not at position 1 as is more common.

svm_proprank_classify is called as follows:

svm_proprank_classify test.dat model.dat predictions

For each line in test.dat, the predicted ranking score is written to the file predictions. There is one line per test example in predictions in the same order as in test.dat. From these scores, the predicted ranking can be recovered via sorting.

Getting started: an Example Problem

You will find an example problem at

https://osmot.cs.cornell.edu/svm_light/examples/example4.tar.gz

It contains a propensity-scored training sample and a propensity-scored test sample with 100 clicks each. Unpack the archive with

gunzip -c example4.tar.gz | tar xvf -

This will create a subdirectory example4. To run the example, execute the commands:

svm_proprank_learn -c 0.01 example4/train.dat example4/model
svm_proprank_classify example4/test.dat example4/model example4/predictions

The estimated ranking accuracy is written to STDOUT by the classification module.

Avg Rank of Positive Examples: 17.15 (via SNIPS Estimator [Swaminathan & Joachims, 2015d])

The numbers written in the predictions file can be used to rank the test examples. The values in the predictions file do not have a meaning in an absolute sense - they are only used for ordering.

It can also be interesting to look at the "training error" of the propensity ranking SVM.

svm_proprank_classify example4/train.dat example4/model example4/predictions.train

Avg Rank of Positive Examples: 4.61 (via SNIPS Estimator [Swaminathan & Joachims, 2015d])

Note that scores are comparable only between examples with the same qid.

Disclaimer

This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.

Known Problems

none

History

none

References

[1] T. Joachims, Training Linear SVMs in Linear Time, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006. [Postscript] [PDF]

[2] T. Joachims, A Support Vector Method for Multivariate Performance Measures, Proceedings of the International Conference on Machine Learning (ICML), 2005. [Postscript] [PDF]

[3] Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005. [PDF]

[4] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. International Conference on Machine Learning (ICML), 2004. [Postscript] [PDF]

[5] T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT Press, 1999. [Postscript (gz)] [PDF]

[6] T. Joachims, T. Finley, Chun-Nam Yu, Cutting-Plane Training of Structural SVMs, Machine Learning Journal, 2009. [PDF]

[7] T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002. [Postscript] [PDF]

[8] T. Joachims, A. Swaminathan, T. Schnabel, Unbiased Learning-to-Rank with Biased Feedback, International Conference on Web Search and Data Mining (WSDM), 2017. [PDF]

[9] A. Swaminathan, T. Joachims, The Self-Normalized Estimator for Counterfactual Learning, Neural Information Processing Systems (NIPS), 2015. [PDF]

Propensity SVMrank