SemiCRF++: A C++ implementation of semi-Markov Conditional Random Fields

============
Introduction
============

SemiCRF++ is a C++ implementation of semi-Markov Conditional Random Fields (semi-CRFs). It reuses the in-memory data structures of CRF++ (https://taku910.github.io/crfpp/) and is memory-efficient as a result. SemiCRF++ can be applied to general sequence tagging problems.

The main advantage of a semi-CRF over a CRF is that it can easily incorporate phrase-level syntactic structure as features. This can be important when identifying long expressions in text.

=====
Usage
=====

To compile the C++ code:

% cd ./src/Release
% make

To run SemiCRF++:

% ./src/Release/semiCRF++ config.default

==============================
Training and Test file formats
==============================

Example training and test files are located in ./data/. They were created from the MPQA corpus. The original MPQA corpus is available at http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/.

The training and test files use the same format as CRF++ (see the section “Training and Test file formats” at https://taku910.github.io/crfpp/). An example:

the DT O
vet NN O
had VBD DSE
forbidden VBN DSE
him PRP O
to TO O
do VB O
so RB O
. . O

Each token has three columns: the word itself, the part-of-speech tag of the word, and the label in “IO” format (DSE, O) or “BIO” format (B_DSE, I_DSE, O).

To use parse trees, add a column before the label that stores the parse tree (same as the parse tree field in the CoNLL format):

the DT (S(NP* O
vet NN *) O
had VBD (VP* DSE
forbidden VBN (VP* DSE
him PRP (S(NP*) O
to TO (VP* O
do VB (VP* O
so RB (ADVP*))))))))) O
. . *)) O

==============
Basic Features
==============

By default, the program creates features from the words and part-of-speech tags in each sentence. If parse tree information is available, it generates candidate segments according to the parse tree structure and creates features only for these segments.

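As an illustration of the column layout described under "Training and Test file formats", the following is a minimal sketch of reading such a file in C++. It is not taken from the SemiCRF++ source: the names Token, parseTokenLine and readSentences are hypothetical, and the sketch assumes that sentences are separated by blank lines, as in CRF++.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one row of a training/test file.
struct Token {
    std::string word;
    std::string pos;
    std::string parse;  // empty when the file has no parse-tree column
    std::string label;
};

// Split a row on whitespace and map its columns to a Token.
// Accepts 3 columns (word, POS, label) or 4 (word, POS, parse, label).
// Returns false for any other column count.
bool parseTokenLine(const std::string& line, Token& out) {
    std::istringstream in(line);
    std::vector<std::string> cols;
    std::string col;
    while (in >> col) cols.push_back(col);
    if (cols.size() != 3 && cols.size() != 4) return false;
    out.word = cols[0];
    out.pos = cols[1];
    out.parse = (cols.size() == 4) ? cols[2] : std::string();
    out.label = cols.back();
    return true;
}

// Read all sentences from a stream. A blank line ends the current
// sentence (assumed here, following the CRF++ convention); rows with
// an unexpected column count are skipped.
std::vector<std::vector<Token>> readSentences(std::istream& in) {
    std::vector<std::vector<Token>> sentences;
    std::vector<Token> current;
    std::string line;
    while (std::getline(in, line)) {
        const bool blank =
            line.find_first_not_of(" \t\r") == std::string::npos;
        if (blank) {
            if (!current.empty()) {
                sentences.push_back(current);
                current.clear();
            }
            continue;
        }
        Token tok;
        if (parseTokenLine(line, tok)) current.push_back(tok);
    }
    if (!current.empty()) sentences.push_back(current);
    return sentences;
}
```

Passing an std::ifstream opened on one of the files in ./data/ to readSentences would yield one vector of Token per sentence. Taking the label from the last column lets the same code handle files with and without the parse tree column.
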
=================
Optional Features
=================

1. Lexical features (binary features). You can create arbitrary features for each word token and store them in a feature file. An example file is features/features.dict. Each row in the file contains the features for one token. For example:

zipper/N FEA_8 FEA_9 FEA_4 FEA_5 FEA_6 FEA_7 FEA_0 FEA_1 FEA_2 FEA_3

This row stores the features for the noun "zipper". Features can be derived from resources such as WordNet and FrameNet. Feature names can be arbitrary strings.

2. Embedding features (continuous features). Each token can be associated with a fixed-dimensional vector of continuous embedding features, which can be built from pre-trained embeddings. An example file is features/mpqa_embeddings. Each row in the file contains the continuous vector for one token.

=======================
Model Parameter Setting
=======================

See the detailed comments in config.default.

==========
References
==========

Bishan Yang and Claire Cardie. "Extracting Opinion Expressions with semi-Markov Conditional Random Fields." EMNLP 2012.

Sunita Sarawagi and William W. Cohen. "Semi-Markov Conditional Random Fields for Information Extraction." NIPS 2004.

CRF++: Yet Another CRF toolkit. https://taku910.github.io/crfpp/

If you have questions, please email me at bishan.yang@gmail.com