August 14, 2008

Author: Thorsten Joachims <thorsten@joachims.org>
Cornell University, Department of Computer Science

With thanks to Evan Herbst for earlier versions.

- can easily handle tagging problems with millions of words and millions of features;
- can train higher-order models with arbitrary-length dependencies for both the transitions and the emissions;
- includes an optional beam search for fast approximate training and prediction;
- is completely written in C without the need for additional libraries.
- A more detailed list of updates compared to version V3.03 is given in the version history below.

SVM^{hmm} is implemented as an instance of the SVM^{struct} package.

https://download.joachims.org/svm_hmm/current/svm_hmm.tar.gz

- Linux (32-bit): https://download.joachims.org/svm_hmm/current/svm_hmm_linux32.tar.gz
- Linux (64-bit): https://download.joachims.org/svm_hmm/current/svm_hmm_linux64.tar.gz
- Cygwin: https://download.joachims.org/svm_hmm/current/svm_hmm_cygwin.tar.gz
- Windows: https://download.joachims.org/svm_hmm/current/svm_hmm_windows.zip

Please send me email and let me know that you got it. All archives contain the most recent version of *SVM^{hmm}*, which includes *SVM^{struct}*.

Unpack the archive you downloaded with the matching command:

gunzip -c svm_hmm.tar.gz | tar xvf -
gunzip -c svm_hmm_linux.tar.gz | tar xvf -
gunzip -c svm_hmm_cygwin.tar.gz | tar xvf -
unzip svm_hmm_windows.zip

This expands the archive into the current directory, which now contains all relevant files. If you downloaded the source code, you can compile it by typing

make

This will produce the executables

svm_hmm_learn -c <C> -e <EPSILON> training_input.dat modelfile.dat
svm_hmm_classify test_input.dat modelfile.dat classify.tags

where

- `training_input.dat` and `test_input.dat` are files containing the training and test examples. Their format is described below.
- `modelfile.dat` contains the SVM^{hmm} model information learned from the input.
- `classify.tags` will receive the lists of tags predicted for the test examples, one tag for each line in `test_input.dat`.
- Parameter `-c <C>`: The usual SVM parameter C, trading off slack against the magnitude of the weight vector. **NOTE:** The default value for this parameter is unlikely to work well for your particular problem. A good value for C must be selected via cross-validation, ideally exploring values over several orders of magnitude. **NOTE:** Unlike in V1.01, the value of `C` is divided by the number of training examples. So, to get results equivalent to V1.01, multiply `C` by the number of training examples.
- Parameter `-e <EPSILON>`: This specifies the precision to which constraints are required to be satisfied by the solution. The smaller EPSILON, the longer training takes and the more memory it needs, but the solution is more precise. However, solutions more accurate than 0.5 typically do not improve prediction accuracy.
- Parameter `--t <ORDER_T>`: Order of dependencies of transitions in HMM. Can be any number larger than 1. (default 1)
- Parameter `--e <ORDER_E>`: Order of dependencies of emissions in HMM. Can be any number larger than 0. (default 0)
- Parameter `--b <WIDTH>`: A non-zero value turns on (approximate) beam search to replace the exact Viterbi algorithm, both for finding the most violated constraint and for computing predictions. The value is the width of the beam (e.g. 100). (default 0)
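Since the note above recommends selecting C by cross-validation over several orders of magnitude, a small driver can generate the corresponding command lines. This is an illustrative sketch: the grid, the file names (`train.dat`, `validation.dat`), and the helper functions are assumptions, not part of the package; only the two binaries and their flags come from this document.

```python
# Sketch of a C-parameter sweep for svm_hmm_learn. The grid, file names, and
# helper functions are illustrative assumptions; the flags are those documented above.

def learn_cmd(c, epsilon, train_file, model_file):
    """Command line for training with a given C and epsilon."""
    return ["./svm_hmm_learn", "-c", str(c), "-e", str(epsilon),
            train_file, model_file]

def classify_cmd(test_file, model_file, tags_file):
    """Command line for tagging a test file with a trained model."""
    return ["./svm_hmm_classify", test_file, model_file, tags_file]

# Explore C over several orders of magnitude, as recommended above.
for c in [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    model = "model_c%g.dat" % c
    print(" ".join(learn_cmd(c, 0.5, "train.dat", model)))
    print(" ".join(classify_cmd("validation.dat", model, "tags_c%g.out" % c)))
```

Piping the printed lines into a shell (or running them via `subprocess.run`) trains one model per value of C; pick the value whose predicted tags give the lowest error on the held-out validation set.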

You will find an example part-of-speech (POS) tagging dataset (thanks to Evan Herbst) at

https://download.joachims.org/svm_hmm/examples/example7.tar.gz

Download this file into your svm_hmm directory and unpack it with

gunzip -c example7.tar.gz | tar xvf -

This will create a subdirectory `example7`. Try the following:

./svm_hmm_learn -c 5 -e 0.5 example7/declaration_of_independence.dat declaration.model
./svm_hmm_classify example7/gettysburg_address.dat declaration.model test.outtags

The fraction of misclassified tags (i.e. the "average loss per word" written to stdout by the classification routine) should be about 0.28. Tagging accuracy is 1 minus this number, or 72%. The program also prints the average number of misclassified tags per sentence (i.e. "Average loss on test set") and the percentage of sentences that had at least one misclassified tag (i.e. "Zero/one-error on test set").

Disclaimer: this example wasn't tagged with professional accuracy; don't use it for any real experiments. The Gettysburg tagging is based on the document at http://faculty.washington.edu/dillon/GramResources/gettysburgtreetagger.html, which was tagged with Schmid's TreeTagger, though TreeTagger had probably been trained on a real corpus first.

**y** = argmax_{y} { Σ_{i=1..l} Σ_{j=1..k} [ (x_{i} • w_{y_{i-j}...y_{i}}) + φ_{trans}(y_{i-j},...,y_{i}) • w_{trans} ] }
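For the default first-order model, the argmax above can be computed exactly with the Viterbi algorithm. The sketch below illustrates the idea for k = 1 on toy dense scores; the function and data layout are illustrative assumptions, not the package's actual C implementation.

```python
def viterbi(emission, transition):
    """Exact argmax decoding for a first-order model (k = 1).

    emission[i][y]     -- score x_i . w_y of tag y at position i
    transition[y0][y1] -- transition weight for the tag pair (y0, y1)
    Returns the highest-scoring tag sequence as a list of tag indices.
    """
    l, k = len(emission), len(emission[0])
    score = list(emission[0])   # best score of any path ending in tag y at position 0
    back = []                   # backpointers, one row per position 1..l-1
    for i in range(1, l):
        prev = score
        score, ptr = [0.0] * k, [0] * k
        for y in range(k):
            # Best previous tag for reaching tag y at position i.
            best = max(range(k), key=lambda y0: prev[y0] + transition[y0][y])
            score[y] = prev[best] + transition[best][y] + emission[i][y]
            ptr[y] = best
        back.append(ptr)
    # Recover the argmax sequence by following backpointers from the best end tag.
    y = max(range(k), key=lambda t: score[t])
    seq = [y]
    for ptr in reversed(back):
        y = ptr[y]
        seq.append(y)
    return list(reversed(seq))

# Toy example with 2 tags and 3 tokens.
em = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
tr = [[0.5, 0.0], [0.0, 0.5]]
print(viterbi(em, tr))  # → [0, 1, 0]
```

The `--b <WIDTH>` option described above replaces this exact search with an approximate beam search that keeps only the WIDTH best partial sequences at each position.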

The SVM^{hmm} learns one emission weight vector w_{yi-k...yi}
for each different kth-order tag sequence y_{i-k}...y_{i} and one
transition weight vector w_{trans}
for the transition weights between adjacent tags. φ_{trans}(y_{i-j},...,y_{i})
is an indicator vector that has exactly one entry set to 1 corresponding to the
sequence y_{i-j},...,y_{i}. Note that in contrast to a conventional HMM,
the observations x_{1}...x_{l}
can naturally be expressed as feature vectors, not just as atomic tokens. Also
note that a kth-order model includes all the parameters of the lower-order
models as well. During training, SVM^{hmm} solves the following optimization problem given training examples (**x**^{1},**y**^{1}), ..., (**x**^{n},**y**^{n}):

min 1/2 w • w + C/n Σ_{i=1..n} ξ_{i}

s.t. for all y:
[Σ_{i=1..l} (x^{1}_{i} • w_{y^{1}_{i}}) + φ_{trans}(y^{1}_{i-1},y^{1}_{i}) • w_{trans}]
>= [Σ_{i=1..l} (x^{1}_{i} • w_{y_{i}}) + φ_{trans}(y_{i-1},y_{i}) • w_{trans}] + Δ(y^{1},y) - ξ_{1}
...
for all y:
[Σ_{i=1..l} (x^{n}_{i} • w_{y^{n}_{i}}) + φ_{trans}(y^{n}_{i-1},y^{n}_{i}) • w_{trans}]
>= [Σ_{i=1..l} (x^{n}_{i} • w_{y_{i}}) + φ_{trans}(y_{i-1},y_{i}) • w_{trans}] + Δ(y^{n},y) - ξ_{n}

C is a parameter that trades off margin size and training error. While this
problem has exponentially many constraints, we use the cutting-plane algorithms implemented in SVM^{struct} to solve this
problem up to a precision of ε in polynomial time [Tsochantaridis
et al. 2004, Tsochantaridis et al. 2005]. In particular, we solve the one-slack reformulation of the training problem using *SVM^{struct}* V3.10 [Joachims et al. 2009], which makes this version of SVM^{hmm} substantially faster than previous versions.

More information on *SVM^{struct}* is available on its webpage.

Input files of training and test examples follow the usual SVM^{light} input file format. Each line
in an example file holds one tag y^{j}_{i}, the feature vector x^{j}_{i} of the associated token, and an optional comment.

TAG qid:EXNUM FEATNUM:FEATVAL FEATNUM:FEATVAL ... #comment goes here

TAG is a natural number (i.e. 1..k) that identifies the tag y^{j}_{i} of the token. EXNUM identifies the example (i.e. the sequence) the token belongs to; all lines with the same EXNUM form one sequence and must appear consecutively, in the order of the tokens in the sequence. For example:

3 qid:1 1:1 2:1 3:1 #see
2 qid:1 4:1 5:1 #jane
3 qid:1 2:-1 6:1 7:2.5 #play
2 qid:1 3:-1 8:1 #ball
5 qid:1 9:1 10:1 #.
1 qid:2 3:1 11:1 12:1 #she
3 qid:2 1:-1 13:1 14:2 #is
4 qid:2 15:1 16:0.1 17:-1 #good
5 qid:2 9:0.5 18:1 #.

Here we've used the words in the sentence as the comments on each line, so that all the information we have about the input is contained in one file. Note that feature numbers are integral, starting at 1. Feature values are real numbers. Features for each token (e.g. "see") must appear in increasing order of their feature number FEATNUM; you should skip features whose value is zero, although leaving them in does no harm other than increasing runtime.
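To avoid formatting mistakes, a small helper can emit lines in this format while enforcing the rules above (increasing FEATNUM order, zero values skipped). This is an illustrative sketch, not part of the package:

```python
def format_token(tag, exnum, features, comment=None):
    """Format one token as a line of the SVM^hmm input file.

    tag      -- natural number (1..k) identifying the tag of the token
    exnum    -- example (sequence) number, written as qid:EXNUM
    features -- dict mapping feature number (int >= 1) to real value
    """
    parts = ["%d qid:%d" % (tag, exnum)]
    for featnum in sorted(features):    # features must appear in increasing FEATNUM order
        val = features[featnum]
        if val == 0:                    # zero-valued features should be skipped
            continue
        parts.append("%d:%g" % (featnum, val))
    line = " ".join(parts)
    if comment:
        line += " #" + comment
    return line

print(format_token(3, 1, {1: 1, 2: 1, 3: 1}, "see"))  # → 3 qid:1 1:1 2:1 3:1 #see
print(format_token(3, 1, {7: 2.5, 2: -1, 6: 1}, "play"))
```

Writing one such line per token, with all tokens of a sequence sharing the same EXNUM, produces a valid training or test file.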

What features you choose to represent each token is entirely up to you. Features can describe any property of the current token and its relationship to any other token in the sequence. For example, in the part-of-speech tagging experiments reported in [Joachims et al., 2009], each token was described by binary features indicating each possible prefix and suffix of the current token, the previous token, and the next token. Examples of such features are "Does the current token end in 'ing'?", "Does the previous token end in 'ing'?", "Does the next token start with 'A'?", "Is the current token 3 letters long?", etc. In total, there were more than 400,000 such features, and the overall dimensionality of the weight vector was greater than 18 million. In my experience, binary features work well, and I often discretize real-valued features into multiple binary features.
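As a hedged illustration of such a binarized representation, the sketch below generates prefix/suffix indicator features for a token and assigns feature numbers with a dictionary grown on the fly. The feature names and the affix length limit are assumptions for this example, not the exact setup used in [Joachims et al., 2009]:

```python
def affix_features(token, feature_ids, max_len=3):
    """Binary prefix/suffix indicator features for one token.

    feature_ids -- dict assigning each feature string a number, grown on the
                   fly (numbers start at 1, as the file format requires)
    Returns a dict {feature number: 1} suitable for one input line.
    """
    names = []
    for n in range(1, max_len + 1):
        names.append("prefix=" + token[:n])
        names.append("suffix=" + token[-n:])
    names.append("len=%d" % len(token))   # e.g. "Is the current token 3 letters long?"
    feats = {}
    for name in names:                    # duplicates (short tokens) collapse via the dict
        if name not in feature_ids:
            feature_ids[name] = len(feature_ids) + 1
        feats[feature_ids[name]] = 1
    return feats

ids = {}
print(affix_features("playing", ids))  # → {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1}
print(ids["suffix=ing"])               # → 6
```

Features for the previous and next token can be added the same way by prefixing the names (e.g. "prev:suffix=ing"), which is how window context enters a per-token feature vector.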

- Non-linear kernels are not supported. In the best case, training with non-linear kernels is slow; in the worst case, it may be buggy.

- Includes version V3.10 of *SVM^{struct}*, which makes it substantially faster.
- Improved memory management.
- Now writes linear model files in a more compact format (i.e. stores only the weight vector, not the support vectors).
- Fixed memory bug in svm_hmm_classify that appeared when there were previously unseen features in a test example.

- Added that svm_hmm_classify prints the predicted labels to a file.
- Source code for *SVM^{hmm}* V3.03.

- Complete reimplementation in C.
- Can train higher order models with arbitrary length dependencies for both the transitions and the emissions.
- Now includes an optional beam search for fast approximate training and prediction.
- Format of training and test files has changed.
- Source code for *SVM^{hmm}* V2.13.

- New training algorithm based on the equivalent 1-slack reformulation of the training problem. This makes training with the linear kernel several orders of magnitude faster than in V1.01. See the changes introduced by *SVM^{struct}* V3.00 for the general properties of the new structural SVM training algorithm.
- New IO routines that are faster for reading large data and model files.
- Source code for *SVM^{hmm}* V1.01.

- Fixed a small bug related to the debugging message "There is probably a bug in 'find_most_violated_constraint_*'...".

- T. Joachims, T. Finley, Chun-Nam Yu. Cutting-Plane Training of Structural SVMs. Machine Learning Journal, 77(1):27-59, 2009. [PDF]
- I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005. [PDF]
- I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. International Conference on Machine Learning (ICML), 2004. [Postscript] [PDF]
- Y. Altun, I. Tsochantaridis, T. Hofmann. Hidden Markov Support Vector Machines. International Conference on Machine Learning (ICML), 2003.