SVMhmm
Sequence Tagging with Support Vector Machines
and its Application to Part-of-Speech Tagging
Version V1.01
4. July 2006 Evan Herbst, Thorsten Joachims
Cornell University
[evh4 cornell edu] [tj cs cornell edu]
Introduction
SVMhmm is an implementation of structural SVMs for sequence
tagging [Altun et. al, 2003] using the training algorithm described in [Tsochantaridis
et al. 2004, Tsochantaridis et al. 2005]. It is implemented as a specialization of the
SVMstruct
package. SVMhmm discriminatively trains models that are
isomorphic to a
first-order hidden Markov model using the Structural Support Vector Machine (SVM) formulation. In
particular, given an observed input sequence x=(x1...xl)
of feature vectors x1...xl,
the model predicts a tag sequence y=(y1...yl) according to
the following linear discriminant function
y = argmaxy
{Σi = 1..l(xi
• wyi)
+ φtrans(yi-1,yi) • wtrans}.
The SVMhmm learns one weight vector wy
for each tag y and one weight vector wtrans
for the transition weights between adjacent tags. φtrans(yi-1,yi)
is an indicator vector that has exactly one entry set to 1 corresponding to the
pair yi-1,yi. Note that in contrast to a conventional HMM,
the observations x1...xl
can naturally be expressed as feature vectors, not just as atomic tokens. During
training, SVMhmm solves the following optimization problem given training
examples (x1,y1) ... (xn,yn)
of sequences of feature vectors with their correct tag sequences. As the loss
function Δ(yi,y),
the number of misclassified tags in the sentence is used.
min w*w + C
Σi = 1..n ξi
s.t. for all y:
[Σi = 1..l(x1i
• wy1i)
+ φtrans(y1i-1,y1i) • wtrans]
>= [Σi = 1..l(x1i
• wyi)
+ φtrans(yi-1,yi) • wtrans]
+ Δ(y1,y) - ξ1
...
for all y:
[Σi = 1..l(xni
• wyni)
+ φtrans(yni-1,yni) • wtrans]
>= [Σi = 1..l(xni
• wyi)
+ φtrans(yi-1,yi) • wtrans]
+ Δ(yn,y) - ξn
C is a parameter that trades off margin size and training error. While this problem has exponentially many constraints, we use the algorithm
described in [Tsochantaridis
et al. 2004, Tsochantaridis et al. 2005] and implemented in SVMstruct to solve this
problem up to a precision of ε in polynomial time. We have used SVMhmm for part-of-speech (POS) tagging and achieved
excellent results, using observation vectors xi
with more than 400,000 features and outperforming the three of the most popular
current POS taggers. Details are forthcoming.
Source Code and Binaries
To install SVMhmm you need to download
Unpack it with
tar -xzf svmhmm.tar.gz
You'll find precompiled executables for linux and win32 in the bin/
subdirectory of the archive you downloaded. If you want to compile SVMhmm
yourself, you also need to have "boost", the semi-standard C++ library, installed. You don't need boost installed to run the software, only to
compile and build it. Nowadays it's included with most linux distributions; you can also get it from
boost.org. Once boost is downloaded, edit makefile.template in src/ to specify the
top-level boost directory. You also need to select your platform, which may mean adding another set of compile flags.
Now execute
cd src; make
which compiles the system and creates the two executables: svm_hmm_learn_hideo, the learning module,
and svm_hmm_classify, the classification module.
SVMhmm doesn't compile under MSVC as written. SVMhmm uses the SGI hash_map extension to the STL, as well as some other less important extensions. If you don't have an SGI-compatible STL (you most likely do if you run any linux) you can get one from http://www.sgi.com/tech/stl/download.html, although theirs may not compile for you without tweaking. If you want to compile SVMhmm under MSVC you'll also need to create a project.
Usage
SVMhmm is built on top of SVMstruct, a general implementation of SVMs for predicting complex structures containing interactions between elements.
The site includes examples of its use for other applications as well as for sequence tagging.
Use SVMhmm just like SVMstruct and there are no added parameters. To learn a model
and then classify
a test set, run
svm_hmm_learn_hideo -c <C> -e <EPSILON> training_input.dat modelfile.dat
svm_hmm_classify test_input.dat modelfile.dat classify.tags
where
- <C> is the parameter used to weight the slack term in the SVM formulation with respect to the weight-vector term.
- <EPSILON> is the precision to which constraints are required
to be satisfied by the solution. The smaller EPSILON,
the longer and the more memory training takes, but the solution is more precise.
However, solutions more accurate than 0.5 typically do not improve prediction
accuracy.
- The use of kernels (i.e. -t option) is currently not supported.
- training_input.dat and test_input.dat are files containing
the training and test examples. They have the format given below for input files.
- modelfile.dat contains the SVMhmm model information learned from the input.
The SVMstruct model is written to modelfile_svmModel.dat. The SVM-model
file has the .dat extension regardless of the extension of the model file.
- classify.tags will receive lists of tags predicted for the test examples, ie, for each
test example,
classify.tags will contain something like { TAG1 TAG2 TAG3 }. There is zero or more whitespace between adjacent
sets of brackets.
The other paramters are identical to those of SVMstruct.
Input File Format
Input files of training and test example have the usual SVMlight input file format. Each line
in an example file holds one tag, the feature vector of the associated token, and an optional comment.
TAG qid:EXNUM.INDEX FEATNUM:FEATVAL FEATNUM:FEATVAL ... #comment goes here
...
The query ID gives the example number and token index within the example. Eg, in the case of POS tagging, the qid is
SENTENCE_NUM.WORD_NUM, and the example file resembles
PRON qid:1.1 1:1 2:1 3:1 #see
NOUN qid:1.2 4:1 5:1 #jane
VERB qid:1.3 2:-1 6:1 7:2.5 #play
VERB qid:1.4 3:-1 8:1 #ball
NOUN qid:1.5 9:1 10:1 #.
PRON qid:2.1 3:1 11:1 12:1 #she
VERB qid:2.2 1:-1 13:1 14:2 #is
JJ qid:2.3 15:1 16:0.1 17:-1 #good
PERIOD qid:2.4 9:0.5 18:1 #.
Here we've used the words in the sentence as the comments on each line, so that all the information we have about the input is
contained in one file. Note that feature numbers are integral, starting at 1, as are both parts of the query ID. Feature values are real
numbers. QIDs can appear in any order, but be careful not to skip any numbers; for example, if qid 3.4 appears, example numbers 1
and 2 must be present, as well as qids 3.1 through 3.3. Features for each token (eg "see") must appear in increasing order
of their feature number FEATNUM;
you should skip features whose values are zero, although leaving them in won't do harm other than increasing runtime.
A TAG can be any non-space token except the special tag "~NONE~" (without the quotes), which is reserved for use with
higher-order markov models (a planned future extension).
Features for any given token can be chosen arbitrarily based on its position in the input and its context. The
number of features is unrestricted in principle and we have routinely worked
with more than 400,000 features. Our POS-tagging results have been obtained using approximately 450,000 simple lexical features
like "previous word ends in -ation".
Example Problem
Example input files for the POS-tagging problem are provided in data/. Try the following (it shouldn't run more than a minute or two):
./svm_hmm_learn_hideo -c 0.1 -e 0.5 declaration_of_independence.shin declaration.model
./svm_hmm_classify gettysburg_address.shin declaration.model test.outtags
The fraction of misclassified tags (i.e. "average loss per word" written to stdout by the classification routine) should be about 0.28. Tagging accuracy is 1 - this number, or
72%.
The program also prints the average number of misclassified tags per sentence
(i.e. "Average loss on test set") and the percentage of sentences that had at
least one misclassified tag (i.e. "Zero/one-error on test set"). Let's find out how many of the test words didn't show up in the training set:
cat declaration_of_independence.shin | awk '{print $NF}' | sort | uniq > decl.words
cat gettysburg_address.shin | awk '{print $NF}' | sort | uniq > gett.words
comm -1 -3 decl.words gett.words | wc -l
77
wc -l gettysburg_address.shin
289
At least 77 of the 289 test tokens hadn't been seen in the training set--about 27%. The percentage generally ought to be much lower;
this is a very small training set. 72% accuracy is pretty good for a tiny example with so many unknown tokens.
Disclaimer: this example wasn't tagged with professional accuracy; don't use
it for any real experiments. My Gettysburg tagging is based off the document at http://faculty.washington.edu/dillon/GramResources/gettysburgtreetagger.html, which was tagged with Schmid's tagger--but TreeTagger had probably been trained on a real corpus first.
History
V1.00 - 1.01
- Fixed a small bug related to the debugging message "There is
probably a bug in 'find_most_violated_constraint_*'.
References
- I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin
Methods for Structured and Interdependent Output Variables, Journal of
Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005.
[PDF] - I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun. Support Vector Machine Learning for Interdependent and Structured
Output Spaces. International Conference on Machine Learning (ICML), 2004.
[Postscript] [PDF]
-
Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector
Machines. International Conference on Machine Learning (ICML), 2003.