SVM^hmm

Sequence Tagging with Structural Support Vector Machines

and its Application to Part-of-Speech Tagging

Version V2.13
1. July 2007

Evan Herbst, Thorsten Joachims
Cornell University
[evh4 cornell edu] [tj cs cornell edu]

Introduction

SVM^hmm is an implementation of structural SVMs for sequence tagging [Altun et. al, 2003] using the training algorithm described in [Tsochantaridis et al. 2004, Tsochantaridis et al. 2005] and the new algorithm of SVM^struct V3.00. Unlike previous versions of SVM^hmm, this version can easily handle tagging problems with millions of words and millions of features. SVM^hmmm is implemented as a specialization of the SVM^struct package. SVM^hmm discriminatively trains models that are isomorphic to a first-order hidden Markov model using the Structural Support Vector Machine (SVM) formulation. In particular, given an observed input sequence x=(x₁...x_l) of feature vectors x₁...x_l, the model predicts a tag sequence y=(y₁...y_l) according to the following linear discriminant function

y = argmax_y {Σ_{i = 1..l}(x_i • w_{y_i}) + φ_trans(y_i-1,y_i) • w_trans}.

The SVM^hmm learns one weight vector w_y for each tag y and one weight vector w_trans for the transition weights between adjacent tags. φ_trans(y_i-1,y_i) is an indicator vector that has exactly one entry set to 1 corresponding to the pair y_i-1,y_i. Note that in contrast to a conventional HMM, the observations x₁...x_l can naturally be expressed as feature vectors, not just as atomic tokens. During training, SVM^hmm solves the following optimization problem given training examples (x¹,y¹) ... (xⁿ,yⁿ) of sequences of feature vectors with their correct tag sequences. As the loss function Δ(yⁱ,y), the number of misclassified tags in the sentence is used.

        min w*w + C/n Σ_{i = 1..n} ξ_i
         s.t. for all y: [Σ_{i = 1..l}(x¹_i • w_{y¹_i}) + φ_trans(y¹_i-1,y¹_i) • w_trans] >= [Σ_{i = 1..l}(x¹_i • w_yi) + φ_trans(y_i-1,y_i) • w_trans] + Δ(y¹,y) - ξ_{1


...} for all y: [Σ_{i = 1..l}(xⁿ_i • w_{y_ⁿi}) + φ_trans(yⁿ_i-1,yⁿ_i) • w_trans] >= [Σ_{i = 1..l}(xⁿ_i • w_yi) + φ_trans(y_i-1,y_i) • w_trans] + Δ(yⁿ,y) - ξ_n

C is a parameter that trades off margin size and training error. While this problem has exponentially many constraints, we use the cutting-plane algorithms implemented in SVM^struct to solve this problem up to a precision of ε in polynomial time [Tsochantaridis et al. 2004, Tsochantaridis et al. 2005]. In particular, we use the one-slack formulation of the training problems introduced in SVM^struct V3.00, which makes this version of SVM^hmm substantially faster than previous versions.

More information on SVM^struct is available here.

Source Code and Binaries

The source code of SVM^hmm is available at the following location:

http://download.joachims.org/svm_hmm/v2.13/svm_hmm.tar.gz

If you just want the binaries, you can download them for the following systems:

Unpack the respective archive you downloaded with

tar -xzf svm_hmm.tar.gz

after substituting the appropriate file name. If you want to compile SVM^hmm yourself, you also need to have "boost", the semi-standard C++ library, installed. You don't need boost installed to run the software, only to compile and build it. Nowadays it's included with most linux distributions; you can also get it from boost.org. Once boost is downloaded, edit makefile.template to specify the top-level boost directory. Now execute

make

in the top-level directory. This compiles the system and creates the two executables: svm_hmm_learn (the learning module) and svm_hmm_classify (the classification module).

SVM^hmm doesn't compile under MS Visual C as written. SVM^hmm uses the SGI hash_map extension to the STL, as well as some other less important extensions. If you don't have an SGI-compatible STL (you most likely do if you run any linux) you can get one from http://www.sgi.com/tech/stl/download.html, although theirs may not compile for you without tweaking. If you want to compile SVM^hmm under MS Visual C you'll also need to create a project. The much easier option for compiling on Windows is to use MinGW or Cywin.

How to Use

SVM^hmm is built on top of SVM^struct, a general implementation of SVMs for predicting complex structures containing interactions between elements. The site includes examples of its use for other applications as well as for sequence tagging. Use SVM^hmm just like SVM^struct and there are no added parameters. To learn a model and then classify a test set, run

svm_hmm_learn_hideo -c <C> -e <EPSILON> training_input.dat modelfile.dat
svm_hmm_classify test_input.dat modelfile.dat classify.tags

where

<C> is the parameter used to weight the slack term in the SVM formulation with respect to the weight-vector term. NOTE: Unlike in V1.01, the value of C is divided by the number of training examples. So, to get results equivalent to V1.01, multiply C by the number of training examples.
<EPSILON> is the precision to which constraints are required to be satisfied by the solution. The smaller EPSILON, the longer and the more memory training takes, but the solution is more precise. However, solutions more accurate than 0.5 typically do not improve prediction accuracy. training_input.dat and test_input.dat are files containing the training and test examples. They have the format given below for input files.
modelfile.dat contains the SVM^hmm model information learned from the input. The SVM^struct model is written to modelfile_svmModel.dat. The SVM-model file has the .dat extension regardless of the extension of the model file.
classify.tags will receive lists of tags predicted for the test examples, ie, for each test example, classify.tags will contain something like { TAG1 TAG2 TAG3 } }. There is zero or more whitespace between adjacent sets of brackets.

SVM^struct.

Input File Format

Input files of training and test example have the usual SVM^light input file format. Each line in an example file holds one tag, the feature vector of the associated token, and an optional comment.

TAG qid:EXNUM.INDEX FEATNUM:FEATVAL FEATNUM:FEATVAL ... #comment goes here
...

The query ID gives the example number and token index within the example. Eg, in the case of POS tagging, the qid is SENTENCE_NUM.WORD_NUM, and the example file resembles

PRON qid:1.1 1:1 2:1 3:1 #see
NOUN qid:1.2 4:1 5:1 #jane
VERB qid:1.3 2:-1 6:1 7:2.5 #play
VERB qid:1.4 3:-1 8:1 #ball
NOUN qid:1.5 9:1 10:1 #.
PRON qid:2.1 3:1 11:1 12:1 #she
VERB qid:2.2 1:-1 13:1 14:2 #is
JJ qid:2.3 15:1 16:0.1 17:-1 #good
PERIOD qid:2.4 9:0.5 18:1 #.

Here we've used the words in the sentence as the comments on each line, so that all the information we have about the input is contained in one file. Note that feature numbers are integral, starting at 1, as are both parts of the query ID. Feature values are real numbers. QIDs can appear in any order, but be careful not to skip any numbers; for example, if qid 3.4 appears, example numbers 1 and 2 must be present, as well as qids 3.1 through 3.3. Features for each token (eg "see") must appear in increasing order of their feature number FEATNUM; you should skip features whose values are zero, although leaving them in won't do harm other than increasing runtime.

A TAG can be any non-space token except the special tag "~NONE~" (without the quotes), which is reserved for use with higher-order markov models (a planned future extension).

Example Problem

You will find an example POS tagging dataset at

http://download.joachims.org/svm_hmm/examples/example6.tar.gz

Download this file into your svm_hmm directory and unpack it with

gunzip -c example6.tar.gz | tar xvf -

This will create a subdirectory example6. Try the following:

./svm_hmm_learn -c 5 -e 0.5 example6/declaration_of_independence.shin declaration.model

./svm_hmm_classify example6/gettysburg_address.shin declaration.model test.outtags

The fraction of misclassified tags (i.e. "average loss per word" written to stdout by the classification routine) should be about 0.28. Tagging accuracy is 1 - this number, or 72%. The program also prints the average number of misclassified tags per sentence (i.e. "Average loss on test set") and the percentage of sentences that had at least one misclassified tag (i.e. "Zero/one-error on test set"). Let's find out how many of the test words didn't show up in the training set:

cat declaration_of_independence.shin | awk '{print $NF}' | sort | uniq > decl.words

cat gettysburg_address.shin | awk '{print $NF}' | sort | uniq > gett.words

comm -1 -3 decl.words gett.words | wc -l
77

wc -l gettysburg_address.shin
289

At least 77 of the 289 test tokens hadn't been seen in the training set--about 27%. The percentage generally ought to be much lower; this is a very small training set. 72% accuracy is pretty good for a tiny example with so many unknown tokens.

Disclaimer: this example wasn't tagged with professional accuracy; don't use it for any real experiments. My Gettysburg tagging is based off the document at http://faculty.washington.edu/dillon/GramResources/gettysburgtreetagger.html, which was tagged with Schmid's tagger--but TreeTagger had probably been trained on a real corpus first.

Known Problems

On 64-bit AMD and Intel machines, there can be a precision problem when using the w3 or w4 algorithms on problems with many features. To fix this problem, change "#define CFLOAT" to "double" in svm_common.h. Thank you to Chih-Jen Lin for pointing this out.
The RBF kernel is broken.

History

V1.01 - V2.13

New training algorithm based on equivalent 1-slack reformulation of the training problem. This makes training on the linear kernel several orders of magnitude faster than in V1.01. See changes introduced by SVM^struct V3.00 for a general properties of the new structural SVM training algorithm.
New IO routines that are faster for reading large data and model files.
Source code for SVM^hmm V1.01.

V1.00 - 1.01

Fixed a small bug related to the debugging message "There is probably a bug in 'find_most_violated_constraint_*'..'.

References

I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005.
[PDF]
I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. International Conference on Machine Learning (ICML), 2004.
[Postscript] [PDF]
Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines. International Conference on Machine Learning (ICML), 2003.

SVMhmm