Sequence Tagging with Structural Support Vector Machines

Version V3.10
14. August 2008

Author: Thorsten Joachims <>
Cornell University
Department of Computer Science  

with thanks to Evan Herbst for earlier versions.


SVMhmm is an implementation of structural SVMs for sequence tagging [Altun et. al, 2003] (e.g. part-of-speech tagging, named-entity recognition, motif finding) using the training algorithm described in [Tsochantaridis et al. 2004, Tsochantaridis et al. 2005] and the new algorithm of SVMstruct V3.10 [Joachims et al. 2009]. Unlike versions of SVMhmm prior to V3.xx, this version

SVMhmm is implemented as an instance of the SVMstruct package.

Source Code and Binaries

The program is free for scientific use. Please contact me, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. The implementation was developed on Linux with gcc, but compiles also on Cygwin, Windows (using the MinGW option of Cygwin), and Mac (after small modifications, see FAQ). The source code of SVMhmm is available at the following location:

If you just want the binaries, you can download them for the following systems:

Please send me email and let me know that you got it. All archives contain the most recent version of SVMhmm, which includes SVMstruct and the SVMlight quadratic optimizer. Unpack the archive using the respective shell command:

      gunzip c svm_hmm.tar.gz | tar xvf 
      gunzip c svm_hmm_linux.tar.gz | tar xvf 
      gunzip c svm_hmm_cygwin.tar.gz | tar xvf 
This expands the archive into the current directory, which now contains all relevant files. If you downloaded the source code, you can compile SVMhmm using the command:
This will produce the executables svm_hmm_learn (the learning module) and svm_hmm_classify (the classification module). For any questions about how to best compile the system, check this

How to Use

SVMhmm is built on top of SVMstruct, a general implementation of SVMs for predicting complex structures containing interactions between elements. The site includes examples of its use for other applications as well as for sequence tagging. Use SVMhmm just like SVMstruct, but there are some added parameters explained below. To learn a model and then classify a test set, run
      svm_hmm_learn -c <C> -e <EPSILON> training_input.dat modelfile.dat 
      svm_hmm_classify test_input.dat modelfile.dat classify.tags 
where The other paramters are identical to those of SVMstruct. I strongly advise against using non-linear Kernels. Since the code was not written with Kernels in mind, it is going to be painfully slow and I have never tested it.

Example Problem

You will find an example part-of-speech (POS) tagging dataset (thanks to Evan Herbst) at

Download this file into your svm_hmm directory and unpack it with

     gunzip -c example7.tar.gz | tar xvf -

This will create a subdirectory example7. Try the following:

     ./svm_hmm_learn -c 5 -e 0.5 example7/declaration_of_independence.dat declaration.model 
./svm_hmm_classify example7/gettysburg_address.dat declaration.model test.outtags
The fraction of misclassified tags (i.e. "average loss per word" written to stdout by the classification routine) should be about 0.28. Tagging accuracy is 1 - this number, or 72%. The program also prints the average number of misclassified tags per sentence (i.e. "Average loss on test set") and the percentage of sentences that had at least one misclassified tag (i.e. "Zero/one-error on test set").

Disclaimer: this example wasn't tagged with professional accuracy; don't use it for any real experiments. The Gettysburg tagging is based off the document at, which was tagged with Schmid's tagger--but TreeTagger had probably been trained on a real corpus first.

Algorithm and Model

SVMhmm discriminatively trains models that are isomorphic to an kth-order hidden Markov model using the Structural Support Vector Machine (SVM) formulation. In particular, given an observed input sequence x=(x1...xl) of feature vectors x1...xl, the model predicts a tag sequence y=(y1...yl) according to the following linear discriminant function:

        y = argmaxy i = 1..lj = 1..k (xiwyi-j...yi) + φtrans(yi-j,...,yi) • wtrans]}

The SVMhmm learns one emission weight vector wyi-k...yi for each different kth-order tag sequence yi-k...yi and one transition weight vector wtrans for the transition weights between adjacent tags. φtrans(yi-j,...,yi) is an indicator vector that has exactly one entry set to 1 corresponding to the sequence yi-j,...,yi. Note that in contrast to a conventional HMM, the observations x1...xl can naturally be expressed as feature vectors, not just as atomic tokens. Also note that a kth-order model includes all the parameters of the lower-order models as well. During training, SVMhmm solves the following optimization problem given training examples (x1,y1) ... (xn,yn) of sequences of feature vectors xj=(xj1,...,xjl) with their correct tag sequences yj=(yj1,...,yjl). The following optimization problem corresponds to a model with first-order transitions and zeroth-order emissions, but it should be obvious how it generalizes to higher order models given the discriminant function from above. As the loss function Δ(yi,y), the number of misclassified tags in the sentence is used.

        min 1/2 w*w + C/n Σi = 1..n ξi
         s.t. for all y: [Σi = 1..l(x1iwy1i) + φtrans(y1i-1,y1i) • wtrans] >= [Σi = 1..l(x1iwyi) + φtrans(yi-1,yi) • wtrans] + Δ(y1,y) - ξ1
for all y: [Σi = 1..l(xniwyni) + φtrans(yni-1,yni) • wtrans] >= [Σi = 1..l(xniwyi) + φtrans(yi-1,yi) • wtrans] + Δ(yn,y) - ξn

C is a parameter that trades off margin size and training error. While this problem has exponentially many constraints, we use the cutting-plane algorithms  implemented in SVMstruct to solve this problem up to a precision of ε in polynomial time [Tsochantaridis et al. 2004, Tsochantaridis et al. 2005]. In particular, we solve the one-slack reformulation of the training problems using SVMstruct V3.10 [Joachims et al. 2009], which makes this version of SVMhmm substantially faster than previous versions. See [Joachims et al. 2009] for more details on SVMhmm, including some part-of-speech tagging experiments.

More information on SVMstruct is available here.

Input File Format

NOTE: The file format has changed from version V2.xx. A perl script for converting to the new format is available.

Input files of training and test examples follow the usual SVMlight input file format. Each line in an example file holds one tag yji, the feature vector xji of the associated token, and an optional comment.

      TAG qid:EXNUM FEATNUM:FEATVAL FEATNUM:FEATVAL ... #comment goes here
TAG is a natural number (i.e. 1..k) that identifies the tag yji that is assigned to the example. As the maximum tag number impacts the memory use, it is recommended to use the numbers 1 to k for a problem with k tags. The EXNUM gives the example number j. The first line with a given EXNUM value is interpreted as the first element of the sequence, the second line as the second element, etc. All lines with the same EXNUM have to be in consecutive order. In the case of POS tagging, an example file with two sequences may look like the following (using the following mapping between tag numbers and part-of-speech classes: 1=PRON, 2=NOUN, 3=VERB, 4=JJ, 5=PERIOD):
      3 qid:1 1:1 2:1 3:1 #see 
      2 qid:1 4:1 5:1 #jane
      3 qid:1 2:-1 6:1 7:2.5 #play
      2 qid:1 3:-1 8:1 #ball
      5 qid:1 9:1 10:1 #.
      1 qid:2 3:1 11:1 12:1 #she
      3 qid:2 1:-1 13:1 14:2 #is
      4 qid:2 15:1 16:0.1 17:-1 #good
      5 qid:2 9:0.5 18:1 #.
Here we've used the words in the sentence as the comments on each line, so that all the information we have about the input is contained in one file. Note that feature numbers are integral, starting at 1. Feature values are real numbers. Features for each token (e.g. "see") must appear in increasing order of their feature number FEATNUM; you should skip features whose values are zero, although leaving them in won't do harm other than increasing runtime.

What features you chose to represent each token is entirely up to you. Features can describe any property of the current token and its relationship to any other token in the sequence. For example, in the part-of-speech tagging experiments reported in [Joachims et al., 2009], each token was described by binary features indicating each possible prefix and suffix of the current token, the previous token, and the next token. Examples of such features are "Does the current token end in 'ing'?", "Does the previous token end in 'ing'?", "Does the next token start with 'A'?", "Is the current token 3 letters long?", etc. In total, there were more than 400000 of such features, and the overall dimensionality of the weight vector was greater than 18 million. In my experience binary features work well, and I often discretize real-valued features into multiple binary features.

Known Problems


V3.03 - V3.10

V3.02 - V3.03

V2.13 - V3.02

V1.01 - V2.13

V1.00 - 1.01