SVM^hmm

Sequence Tagging with Structural Support Vector Machines

Version V3.03
21. April 2008

Author: Thorsten Joachims <thorsten@joachims.org>
Cornell University
Department of Computer Science

with thanks to Evan Herbst for earlier versions.

Introduction

SVM^hmm is an implementation of structural SVMs for sequence tagging [Altun et. al, 2003] (e.g. part-of-speech tagging, named-entity recognition, motif finding) using the training algorithm described in [Tsochantaridis et al. 2004, Tsochantaridis et al. 2005] and the new algorithm of SVM^struct V3.00 [Joachims et al. 2007]. Unlike previous versions of SVM^hmm, this version

can easily handle tagging problems with millions of words and millions of features;
can train higher order models with arbitrary length dependencies for both the transitions and the emissions;
includes an optional beam search for fast approximate training and prediction;
is completely written in C without the need for additional libraries

SVM^hmmm is implemented as a specialization of the SVM^struct package. SVM^hmm discriminatively trains models that are isomorphic to an kth-order hidden Markov model using the Structural Support Vector Machine (SVM) formulation. In particular, given an observed input sequence x=(x₁...x_l) of feature vectors x₁...x_l, the model predicts a tag sequence y=(y₁...y_l) according to the following linear discriminant function

y = argmax_y {Σ_{i = 1..l} [Σ_{j = 1..k}(x_i • w_yi-j...yi) + φ_trans(y_i-j,...,y_i) • w_trans]}.

The SVM^hmm learns one emission weight vector w_yi-k...yi for each different kth-order tag sequence y_i-k...y_i and one transition weight vector w_trans for the transition weights between adjacent tags. φ_trans(y_i-j,...,y_i) is an indicator vector that has exactly one entry set to 1 corresponding to the sequence y_i-j,...,y_i. Note that in contrast to a conventional HMM, the observations x₁...x_l can naturally be expressed as feature vectors, not just as atomic tokens. Also note that a kth-order model includes all the parameters of the lower-order models as well. During training, SVM^hmm solves the following optimization problem given training examples (x¹,y¹) ... (xⁿ,yⁿ) of sequences of feature vectors with their correct tag sequences. The following optimization problem corresponds to first-order models, but it should be obvious how it generalizes to higher order models given the discriminant function from above. As the loss function Δ(yⁱ,y), the number of misclassified tags in the sentence is used.

        min w*w + C/n Σ_{i = 1..n} ξ_i
         s.t. for all y: [Σ_{i = 1..l}(x¹_i • w_{y¹_i}) + φ_trans(y¹_i-1,y¹_i) • w_trans] >= [Σ_{i = 1..l}(x¹_i • w_yi) + φ_trans(y_i-1,y_i) • w_trans] + Δ(y¹,y) - ξ_{1


...} for all y: [Σ_{i = 1..l}(xⁿ_i • w_{y_ⁿi}) + φ_trans(yⁿ_i-1,yⁿ_i) • w_trans] >= [Σ_{i = 1..l}(xⁿ_i • w_yi) + φ_trans(y_i-1,y_i) • w_trans] + Δ(yⁿ,y) - ξ_n

C is a parameter that trades off margin size and training error. While this problem has exponentially many constraints, we use the cutting-plane algorithms implemented in SVM^struct to solve this problem up to a precision of ε in polynomial time [Tsochantaridis et al. 2004, Tsochantaridis et al. 2005]. In particular, we use the one-slack formulation of the training problems introduced in SVM^struct V3.00 [Joachims et al. 2007], which makes this version of SVM^hmm substantially faster than previous versions.

More information on SVM^struct is available here.

Source Code and Binaries

The source code of SVM^hmm is available at the following location:

http://download.joachims.org/svm_hmm/v3.03/svm_hmm.tar.gz

If you just want the binaries, you can download them for the following systems:

Please send me email and let me know that you got it. All archives contain the most recent version of SVM^hmm, which includes SVM^struct and the SVM^light quadratic optimizer. Unpack the archive using the respective shell command:

      gunzip –c svm_hmm.tar.gz | tar xvf –
      gunzip –c svm_hmm_linux.tar.gz | tar xvf –
      gunzip –c svm_hmm_cygwin.tar.gz | tar xvf –
      unzip svm_hmm_windows.zip

This expands the archive into the current directory, which now contains all relevant files. If you downloaded the source code, you can compile SVM^hmm using the command:

      make

This will produce the executables svm_hmm_learn (the learning module) and svm_hmm_classify (the classification module). For any questions about how to best compile the system, check this FAQ.

How to Use

SVM^hmm is built on top of SVM^struct, a general implementation of SVMs for predicting complex structures containing interactions between elements. The site includes examples of its use for other applications as well as for sequence tagging. Use SVM^hmm just like SVM^struct, but there are some added parameters explained below. To learn a model and then classify a test set, run

      svm_hmm_learn -c <C> -e <EPSILON> training_input.dat modelfile.dat 
      svm_hmm_classify test_input.dat modelfile.dat classify.tags

where

training_input.dat and test_input.dat are files containing the training and test examples. Their format is described below.
modelfile.dat contains the SVM^hmm model information learned from the input.
classify.tags will receive lists of tags predicted for the test examples, one tag for each line in test_input.dat.
Parameter "-c <C>": Typical SVM parameter trading-off slack vs. magnitude of the weight-vector.
NOTE: Unlike in V1.01, the value of C is divided by the number of training examples. So, to get results equivalent to V1.01, multiply C by the number of training examples.
Parameter "-e <EPSILON>": This specifies the precision to which constraints are required to be satisfied by the solution. The smaller EPSILON, the longer and the more memory training takes, but the solution is more precise. However, solutions more accurate than 0.5 typically do not improve prediction accuracy.
Parameter "--t <ORDER_T>": Order of dependencies of transitions in HMM. Can be any number larger than 1. (default 1)
Parameter "--e <ORDER_E>": Order of dependencies of emissions in HMM. Can be any number larger than 1. (default 1)
Parameter "--b <WIDTH>": A non-zero value turns on (approximate) beam search to replace the exact Viterbi algorithm both for finding the most violated constraint, as well as for computing predictions. The value is the width of the beam used (e.g. 100). (default 0).

Input File Format

NOTE: The file format has changed from version V2.xx. A perl script for converting to the new format is available.

Input files of training and test examples follow the usual SVM^light input file format. Each line in an example file holds one tag, the feature vector of the associated token, and an optional comment.

      TAG qid:EXNUM FEATNUM:FEATVAL FEATNUM:FEATVAL ... #comment goes here

      ...

TAG is a natural number (i.e. 1..k) identify the tag that is assigned to the example. As the maximum tag number impacts the memory use, it is recommended to use the numbers 1 to k for a problem with k tags. The EXNUM gives the example number. The first line with a given EXNUM value is interpreted as the first element of the sequence, the second line as the second element, etc. All lines with the same EXNUM have to be in consecutive order. In the case of POS tagging, an example file with two sequences may look like the following (using the following mapping between tag numbers and part-of-speech classes: 1=PRON, 2=NOUN, 3=VERB, 4=JJ, 5=PERIOD):

      1 qid:1 1:1 2:1 3:1 #see 

      2 qid:1 4:1 5:1 #jane

      3 qid:1 2:-1 6:1 7:2.5 #play

      3 qid:1 3:-1 8:1 #ball

      2 qid:1 9:1 10:1 #.

      1 qid:2 3:1 11:1 12:1 #she

      3 qid:2 1:-1 13:1 14:2 #is

      4 qid:2 15:1 16:0.1 17:-1 #good

      5 qid:2 9:0.5 18:1 #.

Here we've used the words in the sentence as the comments on each line, so that all the information we have about the input is contained in one file. Note that feature numbers are integral, starting at 1. Feature values are real numbers. Features for each token (eg "see") must appear in increasing order of their feature number FEATNUM; you should skip features whose values are zero, although leaving them in won't do harm other than increasing runtime.

Example Problem

You will find an example POS tagging dataset (thanks to Evan Herbst) at

http://download.joachims.org/svm_hmm/examples/example7.tar.gz

Download this file into your svm_hmm directory and unpack it with

     gunzip -c example7.tar.gz | tar xvf -

This will create a subdirectory example7. Try the following:

     ./svm_hmm_learn -c 5 -e 0.5 example7/declaration_of_independence.dat declaration.model 
     ./svm_hmm_classify example7/gettysburg_address.dat declaration.model test.outtags

The fraction of misclassified tags (i.e. "average loss per word" written to stdout by the classification routine) should be about 0.28. Tagging accuracy is 1 - this number, or 72%. The program also prints the average number of misclassified tags per sentence (i.e. "Average loss on test set") and the percentage of sentences that had at least one misclassified tag (i.e. "Zero/one-error on test set").

Disclaimer: this example wasn't tagged with professional accuracy; don't use it for any real experiments. The Gettysburg tagging is based off the document at http://faculty.washington.edu/dillon/GramResources/gettysburgtreetagger.html, which was tagged with Schmid's tagger--but TreeTagger had probably been trained on a real corpus first.

Known Problems

none at this time.

History

V3.02 - V3.03

Added that svm_hmm_classify prints the predicted labels to a file.

V2.13 - V3.02

Complete reimplementation in C.
Can train higher order models with arbitrary length dependencies for both the transitions and the emissions.
Now includes an optional beam search for fast approximate training and prediction.
Format of training and test files has changed.
Source code for SVM^hmm V2.13.

V1.01 - V2.13

New training algorithm based on equivalent 1-slack reformulation of the training problem. This makes training on the linear kernel several orders of magnitude faster than in V1.01. See changes introduced by SVM^struct V3.00 for a general properties of the new structural SVM training algorithm.
New IO routines that are faster for reading large data and model files.
Source code for SVM^hmm V1.01.

V1.00 - 1.01

Fixed a small bug related to the debugging message "There is probably a bug in 'find_most_violated_constraint_*'..'.

References

T. Joachims, T. Finley, Chun-Nam Yu, Cutting-Plane Training of Structural SVMs, under review, 2007.
[PDF]
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005.
[PDF]
I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. International Conference on Machine Learning (ICML), 2004.
[Postscript] [PDF]
Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines. International Conference on Machine Learning (ICML), 2003.

SVMhmm