SemiCRF++: A C++ implementation of semi-Markov Conditional Random Fields

============
Introduction
============

SemiCRF++ is a C++ implementation of semi-Markov Conditional Random Fields. It inherits the in-memory data structures used in CRF++ (https://taku910.github.io/crfpp/) and is memory-efficient.

SemiCRF++ can be applied to general sequence tagging problems. The main advantage of a semi-Markov CRF over a standard CRF is that it can easily incorporate phrase-level syntactic structure as features, which can be important for identifying long expressions in text.

=====
Usage
=====

To compile the C++ code:
% cd ./src/Release
% make

To run semiCRF++:
% ./src/Release/semiCRF++ config.default


==============================
Training and Test file formats
==============================

* Example training and test files are located in ./data/. They were created from the MPQA corpus. The original MPQA corpus is available at http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/

The training and test files use the same format as CRF++ (see the “Training and Test file formats” section at https://taku910.github.io/crfpp/).

An example of the files is:

the	DT	O
vet	NN	O
had	VBD	DSE
forbidden	VBN	DSE
him	PRP	O
to	TO	O
do	VB	O
so	RB	O
.	.	O

Each token has three columns: the word itself, its part-of-speech tag, and its label in either “IO” format (DSE, O) or “BIO” format (B_DSE, I_DSE, O).
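
Because a semi-Markov CRF scores segments rather than single tokens, it can help to see how a label column maps to segments. The sketch below is illustrative only and is not taken from the SemiCRF++ sources: the `Segment` struct and the `labelsToSegments` function are hypothetical names, and the code assumes that "O" marks tokens outside any segment and that BIO labels use the `B_` and `I_` prefixes shown above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A labelled span of tokens, with inclusive token indices (hypothetical type).
struct Segment {
    std::size_t begin;
    std::size_t end;
    std::string type;
};

// Group a label sequence in "IO" or "BIO" format into non-O segments.
// In IO format, a run of identical labels forms one segment.
// In BIO format, a "B_" prefix always starts a new segment.
std::vector<Segment> labelsToSegments(const std::vector<std::string>& labels) {
    std::vector<Segment> segments;
    bool open = false;  // true while the previous token belongs to segments.back()
    for (std::size_t i = 0; i < labels.size(); ++i) {
        const std::string& label = labels[i];
        if (label == "O") {
            open = false;
            continue;
        }
        const bool hasB = label.compare(0, 2, "B_") == 0;
        const bool hasI = label.compare(0, 2, "I_") == 0;
        const std::string type = (hasB || hasI) ? label.substr(2) : label;
        if (open && !hasB && segments.back().type == type) {
            segments.back().end = i;  // extend the current segment
        } else {
            segments.push_back(Segment{i, i, type});  // start a new segment
            open = true;
        }
    }
    return segments;
}
```

For the example sentence above, the labels of "had" and "forbidden" would be grouped into a single DSE segment covering token indices 2 through 3.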

To use parse tree information, add a column holding the parse tree (in the same form as the parse tree field in the CoNLL format), placed before the label column:

the	DT	(S(NP*	O
vet	NN	*)	O
had	VBD	(VP*	DSE
forbidden	VBN	(VP*	DSE
him	PRP	(S(NP*)	O
to	TO	(VP*	O
do	VB	(VP*	O
so	RB	(ADVP*)))))))))	O
.	.	*))	O
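
The two layouts above can be read with a small helper. This is a minimal sketch, not the parser used by SemiCRF++: `TokenRow` and `parseTokenRow` are hypothetical names, and the code assumes columns are separated by whitespace (tabs in the examples) and that the label is always the last column.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One token row from a training/test file (hypothetical type).
struct TokenRow {
    std::string word;
    std::string pos;
    std::string parse;  // left empty when the file has no parse tree column
    std::string label;
};

// Parse a 3-column row (word, POS, label) or a 4-column row
// (word, POS, parse tree, label). Returns false for any other column count,
// including blank lines, and leaves `out` unchanged in that case.
bool parseTokenRow(const std::string& line, TokenRow& out) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) {  // splits on tabs and spaces
        fields.push_back(field);
    }
    if (fields.size() == 3) {
        out = TokenRow{fields[0], fields[1], "", fields[2]};
        return true;
    }
    if (fields.size() == 4) {
        out = TokenRow{fields[0], fields[1], fields[2], fields[3]};
        return true;
    }
    return false;
}
```

A caller would typically read the file line by line with `std::getline`, pass each line to `parseTokenRow`, and skip or report the lines for which it returns false.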


==============
Basic Features
==============

By default, the program creates features from the words and part-of-speech tags in each sentence. If parse tree information is available, it generates candidate segments according to the parse tree structure and creates features only for these segments.


=================
Optional Features
=================

1. Lexical features (binary features). One can create arbitrary features for each word token and store them in a feature file. An example file is features/features.dict. Each row in the file contains features for a token. For example:

zipper/N	FEA_8 FEA_9 FEA_4 FEA_5 FEA_6 FEA_7 FEA_0 FEA_1 FEA_2 FEA_3

This row stores features for the noun "zipper". Features can be created from resources like WordNet and FrameNet. The feature names can be arbitrary strings.

2. Embedding features (continuous features). Each token can be associated with a fixed-dimensional vector of continuous embedding features. These can be created from pre-trained embeddings. An example file is features/mpqa_embeddings. Each row in the file contains a continuous vector for a token.
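
The two optional feature files can be read row by row along the following lines. This is a sketch under stated assumptions rather than the loader used by SemiCRF++: the type and function names are hypothetical, the lexical row layout follows the "zipper/N" example above, and the embedding row layout (a token followed by whitespace-separated numbers) is an assumption that should be checked against features/mpqa_embeddings.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One row of the lexical feature file, e.g. "zipper/N  FEA_8 FEA_9 ..." (hypothetical type).
struct LexicalEntry {
    std::string key;                    // e.g. "zipper/N"
    std::vector<std::string> features;  // arbitrary feature names
};

// One row of the embedding file (hypothetical type; layout is assumed).
struct EmbeddingEntry {
    std::string token;
    std::vector<double> values;
};

// Parse a lexical feature row: the first field is the key, the rest are
// feature names. Returns false for a blank line.
bool parseLexicalRow(const std::string& line, LexicalEntry& out) {
    std::istringstream in(line);
    LexicalEntry entry;
    if (!(in >> entry.key)) {
        return false;
    }
    std::string name;
    while (in >> name) {
        entry.features.push_back(name);
    }
    out = entry;
    return true;
}

// Parse an embedding row, ASSUMING the layout "token v1 v2 ... vd".
// Reading stops at the first field that is not a number.
// Returns false if no token or no values could be read.
bool parseEmbeddingRow(const std::string& line, EmbeddingEntry& out) {
    std::istringstream in(line);
    EmbeddingEntry entry;
    if (!(in >> entry.token)) {
        return false;
    }
    double value = 0.0;
    while (in >> value) {
        entry.values.push_back(value);
    }
    if (entry.values.empty()) {
        return false;
    }
    out = entry;
    return true;
}
```

Since the embedding dimension is fixed, a loader built on this sketch could also verify that every row yields the same number of values.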


=======================
Model Parameter Setting
=======================

See detailed comments in config.default.


==========
References
==========

Bishan Yang and Claire Cardie. "Extracting Opinion Expressions with semi-Markov Conditional Random Fields." EMNLP 2012.

Sunita Sarawagi and William W. Cohen. "Semi-Markov Conditional Random Fields for Information Extraction." NIPS 2004.

CRF++: Yet Another CRF toolkit. https://taku910.github.io/crfpp/


If you have questions, please email me at
bishan.yang@gmail.com