SVM^perf

Support Vector Machine for Multivariate Performance Measures

Author: Thorsten Joachims <thorsten@joachims.org>
Cornell University
Department of Computer Science

Version: 3.00
Date: 07.09.2009

Overview

SVM^perf is an implementation of the Support Vector Machine (SVM) formulation for optimizing multivariate performance measures described in [Joachims, 2005]. Furthermore, SVM^perf implements the alternative structural formulation of the SVM optimization problem for conventional binary classification with error rate and ordinal regression described in [Joachims, 2006]. So, there are three reasons for why you might want to use SVM^perf instead of SVM^light:

Optimize binary SVM classification rules directly to ROC-Area, F1-Score, and Precision/Recall Break-Even Point (see [Joachims, 2005]).
Train conventional linear classification SVMs optimizing error rate in time that is linear in the size of the training data through an alternative, but equivalent formulation of the training problem (see [Joachims, 2006]). This can be much faster than SVM^light for large training sets.
Train conventional linear ordinal regression SVMs [Herbrich et al., 1999] optimizing the number of misordered pairs in time that is O(n log(n)) in the size of the training data through an alternative, but equivalent formulation of the training problem (see [Joachims, 2006]). Currently this is implemented only for rankings with two ranks.
Training sparse kernel SVMs via the Cutting-Plane Subspace Pursuit (CPSP) algorithm (see [Joachims & Yu, 2009]). The algorithm enables large-scale training of non-linear kernel SVMs and fast predictions on test examples. This is achieved by limiting the number of support vectors and by allowing support vectors (or, more precisely, basis functions) that are not necessarily training vectors.
The new features of version V3.00 are described here.

This implementation is an instance of SVM^struct. More information on SVM^struct is available here.

Source Code

The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. If you use SVM^perf in your scientific work, please cite as

T. Joachims, Training Linear SVMs in Linear Time, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006. [Postscript (gz)] [PDF]
T. Joachims, A Support Vector Method for Multivariate Performance Measures, Proceedings of the International Conference on Machine Learning (ICML), 2005. [Postscript (gz)] [PDF]
T. Joachims, Chun-Nam John Yu, Sparse Kernel SVMs via Cutting-Plane Training, Proceedings of the European Conference on Machine Learning (ECML), Machine Learning Journal, Special ECML Issue, 2009. [PDF]

depending on which aspect of the software you are using. The implementation was developed on Linux with gcc, but compiles also on Solaris, Cygwin, Windows (using MinGW in Cygwin) and Mac (might need small modifications, see FAQ). The source code is available at the following location:

https://osmot.cs.cornell.edu/svm_perf/current/svm_perf.tar.gz

If you just want the binaries, you can download them for the following systems:

Linux (32-bit): https://osmot.cs.cornell.edu/svm_perf/current/svm_perf_linux32.tar.gz
Linux (64-bit): https://osmot.cs.cornell.edu/svm_perf/current/svm_perf_linux64.tar.gz
Cygwin: https://osmot.cs.cornell.edu/svm_perf/current/svm_perf_cygwin.tar.gz
Windows: https://osmot.cs.cornell.edu/svm_perf/current/svm_perf_windows.zip

Please send me email and let me know that you got it. The archive contains the source code of the most recent version of SVM^perf, which includes the source code of SVM^struct and the SVM^light quadratic optimizer. Unpack the archive using the shell command:

      gunzip –c svm_perf.tar.gz | tar xvf –

This expands the archive into the current directory, which now contains all relevant files. You can compile SVM^perf using the command:

      make

This will produce the executables svm_perf_learn and svm_perf_classify. If the system does not compile properly, check this FAQ.

How to Use

SVM^perf consists of a learning module (svm_perf_learn) and a classification module (svm_perf_classify). The classification module can be used to apply the learned model to new examples. See also the examples below for how to use svm_perf_learn and svm_perf_classify.

Usage is much like SVM^light. You call it like

      svm_perf_learn -c 20.0 train.dat model.dat

which trains a linear SVM on the training set train.dat and outputs the learned rule to model.dat using the regularization parameter C set to 20.0. However, the interpretation of the parameter C in SVM^perf is different from SVM^light. In particular, C_light = C_perf*100/n, where n is the number of training examples. Note also that the C values given in [Joachims, 2006] should be divided by 100 for equivalent results and that the epsilon values should be multiplied by 100. Most of the other options come from SVM^struct and SVM^light and are described there. Only the "application-specific options" listed below are particular to SVM^perf.

General Options:
         -?          -> this help
         -v [0..3]   -> verbosity level (default 1)
         -y [0..3]   -> verbosity level for svm_light (default 0)
Learning Options:
         -c float    -> C: trade-off between training error
                        and margin (default 0.01)
         -p [1,2]    -> L-norm to use for slack variables. Use 1 for L1-norm,
                        use 2 for squared slacks. (default 1)
         -o [1,2]    -> Rescaling method to use for loss.
                        1: slack rescaling
                        2: margin rescaling
                        (default 2)
         -l [0..]    -> Loss function to use.
                        0: zero/one loss
                        ?: see below in application specific options
                        (default 2)
Optimization Options (see [2][5]):
         -w [0,..,9] -> choice of structural learning algorithm (default 9):
                        0: n-slack algorithm described in [2]
                        1: n-slack algorithm with shrinking heuristic
                        2: 1-slack algorithm (primal) described in [5]
                        3: 1-slack algorithm (dual) described in [5]
                        4: 1-slack algorithm (dual) with constraint cache [5]
                        9: custom algorithm in svm_struct_learn_custom.c
         -e float    -> epsilon: allow that tolerance for termination
                        criterion (default 0.100000)
         -k [1..]    -> number of new constraints to accumulate before
                        recomputing the QP solution (default 100)
                        (-w 0 and 1 only)
         -f [5..]    -> number of constraints to cache for each example
                        (default 5) (used with -w 4)
         -b [1..100] -> percentage of training set for which to refresh cache
                        when no epsilon violated constraint can be constructed
                        from current cache (default 100%) (used with -w 4)
SVM-light Options for Solving QP Subproblems (see [3]):
         -n [2..q]   -> number of new variables entering the working set
                        in each svm-light iteration (default n = q).
                        Set n < q to prevent zig-zagging.
         -m [5..]    -> size of svm-light cache for kernel evaluations in MB
                        (default 40) (used only for -w 1 with kernels)
         -h [5..]    -> number of svm-light iterations a variable needs to be
                        optimal before considered for shrinking (default 100)
         -# int      -> terminate svm-light QP subproblem optimization, if no
                        progress after this number of iterations.
                        (default 100000)
Kernel Options:
         -t int      -> type of kernel function:
                        0: linear (default)
                        1: polynomial (s a*b+c)^d
                        2: radial basis function exp(-gamma ||a-b||^2)
                        3: sigmoid tanh(s a*b + c)
                        4: user defined kernel from kernel.h
         -d int      -> parameter d in polynomial kernel
         -g float    -> parameter gamma in rbf kernel
         -s float    -> parameter s in sigmoid/poly kernel
         -r float    -> parameter c in sigmoid/poly kernel
         -u string   -> parameter of user defined kernel
Output Options:
         -a string   -> write all alphas to this file after learning
                        (in the same order as in the training set)
Application-Specific Options:
         --b float   -> value of L2-bias feature. A value of 0 implies not
                        having a bias feature. (default 1)
                        WARNING: This is implemented only for linear kernel!
         --p [0..]   -> fraction of positive examples to use as value of k for
                        Prec@k and Rec@k. 0 indicates to use (0.5 * #pos) for
                        Prec@k and (2 * #pos) for Rec@k. #pos is the number of
                        positive examples in the training set. (default 0)
         --t [0..]   -> Use a priori selection of basis for sparse kernel
                        training (must not use '-w 9'):
                        0: do not use a priori selection. (default)
                        1: sample basis vectors randomly from --f file.
                        2: incomplete Cholesky using vectors from --f option.
         --i [0..]   -> Use CPSP algorithm for sparse kernel training
                        (must use '-w 9') (see [Joachims/Yu/09]):
                        0: do not use CPSP algorithm.
                        1: CPSP using subset selection for preimages via
                           59/95 heuristic 
                        2: CPSP using fixed point search (RBF kernel
                           only) (default)
                        4: CPSP using fixed point search with starting point
                           via 59/95 heuristic (RBF kernel only)
         --k [0..]   -> Specifies the number of basis functions to use
                        for sparse kernel approximation (both --t and --i).
                        (default 500)
         --r [0..]   -> Number of times to recompute the sparse kernel
                        approximation and restart the optimizer. (default 0)
         --s [0,1]   -> Selects whether shrinking heuristic is used in the
                        custom algorithms for linear kernel and errorrate loss.
                        (default 1)
         --f string  -> Specifies file that contains basis functions to use
                        for sparse kernel approximation. (default training
                        file)

The following loss functions can be selected with the -l option:
     0  Zero/one loss: 1 if vector of predictions contains error, 0 otherwise.
     1  F1: 100 minus the F1-score in percent.
     2  Errorrate: Percentage of errors in prediction vector.
     3  Prec/Rec Breakeven: 100 minus PRBEP in percent.
     4  Prec@k: 100 minus precision at k in percent.
     5  Rec@k: 100 minus recall at k in percent.
    10  ROCArea: Percentage of swapped pos/neg pairs (i.e. 100 - ROCArea).

SVM^perf learns an unbiased linear classification rule (i.e. a rule sign(w*x) without explicit threshold). Using the parameter '--b', an artificial feature with constant value is added to each training example which acts like a threshold. The loss function to be optimized is selected using the '-l' option. Note that I never really tested Prec@k and Rec@k, so take those with a grain of salt.

For training non-linear SVMs with kernels, SVM^perf employs the Cutting-Plane Subspace Pursuit (CPSP) algorithm [Joachims & Yu, 2009]. The algorithm is enabled with the '--i' option. It allows you to limit the number of basis functions (i.e. support vectors) via the '--k' option. The larger '--k', the better the approximation quality but the longer training and testing times. Typically, you enable CPSP via '--i 2 -w 9 --b 0 --k 500' and set the kernel parameters via '-t' just like in SVM^light. Currently, the full CPSP algorithm is implemented only for RBF kernels.

Another option for training non-linear SVMs with kernels is the Nystrom method ('--i 0 --t 1 -w 3') and the incomplete Cholesky factorization ('--i 0 --t 2 -w 3'). Again, you can limit the number of basis functions via the --k option. The larger '--k', the better the approximation quality but the longer training and testing times. Typically, you enable Nystrom (and incomplete Cholesky respectively) via '--i 0 --t 1 -w 3 --b 0 --k 500' and set the kernel parameters via '-t' just like in SVM^light. Normally, the basis functions are sampled from the training set, but you can give a file with an explicit set of basis function using the '--f' option. The file has the same format as a training/test file.

You can in principle also train non-linear kernels in SVM^perf exactly using '--t 0 --i 0 -w 3', and setting the kernel options just like in SVM^light. However, this is painfully slow.

The file format of the training and test files is the same as for SVM^light. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. {+1,-1}
<feature> .=. <integer>
<value> .=. <float>
<info> .=. <string>

The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs MUST be ordered by increasing feature number. Features with value zero can be skipped. The target value denotes the class of the example via +1 or -1. So, for example, the line

-1 1:0.43 3:0.12 9284:0.2 # abcdef

specifies an negative example (i.e. -1) for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. In addition, the string abcdef is stored with the vector, which can serve as a way of providing additional information when adding user defined kernels.

The result of svm_perf_learn is the model which is learned from the training data in train.dat. The model is written to model.dat. To make predictions on test examples, svm_perf_classify reads this file. svm_perf_classify is called as follows:

svm_perf_classify [options] test.dat model.dat predictions

For all test examples in test.dat the predicted classes are written to the file predictions. There is one line per test example in predictions in the same order as in test.dat.

Getting started: an Example Problem

You will find an example problem at

https://osmot.cs.cornell.edu/svm_light/examples/example1.tar.gz

Download this file into your svm_perf directory and unpack it with

gunzip -c example1.tar.gz | tar xvf -

This will create a subdirectory example1. There are 2000 examples in the file train.dat and 600 in the file test.dat. To run the example, execute the commands:

svm_perf_learn -c 20 -l 2 --b 0 example1/train.dat example1/model
svm_perf_classify example1/test.dat example1/model example1/predictions

The accuracy on the test set is printed to stdout. The equivalent call to SVM^light that will find the same classification rule (up to numerical precision) is svm_learn -c 1 -b 0 example1/train.dat example1/model.

Extensions and Additions

Matlab Interface: a MATLAB MEX-interface to SVM^perf written by Oscar Luaces.

Disclaimer

This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.

Known Problems

none

History

V2.50 - 3.00

Includes an implementation of the CPSP algorithm for training sparse non-linear kernel SVMs.

Includes Nystrom and incomplete Cholesky methods for approximate training of kernel SVMs.

Source code for SVM^perf V2.50.

V2.10 - 2.50

Includes a new implementation of the 1-slack algorithm in svm_struct_learn_custom.c for linear kernels optimizing errorrate that bypasses the API and avoids a lot of overhead this way. This makes training substantially faster.
Includes version V3.10 of SVM^struct.
Now writes linear model files in a more compact format (i.e. stores only the weight vectors, not the support vectors).
Fixed bug in RBF Kernel.
Fixed precision issues on 64-bit AMD and Intel machines.
Source code for SVM^perf V2.10.

V2.00 - 2.10

Now based on V3.00 of SVM^struct. See changes introduced by SVM^struct V3.00.
New IO routines that are faster for reading large data and model files.
Source code for SVM^perf V2.00.

V1.00 - 2.00

Extended algorithm for fast training of linear models, and using the sparse kernel approximation.

References

[1] T. Joachims, Training Linear SVMs in Linear Time, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006. [Postscript] [PDF]

[2] T. Joachims, A Support Vector Method for Multivariate Performance Measures, Proceedings of the International Conference on Machine Learning (ICML), 2005. [Postscript] [PDF]

[3] Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005. [PDF]

[4] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. International Conference on Machine Learning (ICML), 2004. [Postscript] [PDF]

[5] T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT Press, 1999. [Postscript (gz)] [PDF]

[6] T. Joachims, T. Finley, Chun-Nam Yu, Cutting-Plane Training of Structural SVMs, Machine Learning Journal, 77(1):27-59, 2009. [PDF]

[7] T. Joachims, Chun-Nam Yu, Sparse Kernel SVMs via Cutting-Plane Training, European Conference on Machine Learning (ECML), Machine Learning Journal, Special ECML Issue, 76(2-3):179-193, 2009. [PDF]

SVMperf