Mostly-unsupervised statistical segmentation of Japanese: Application to kanji sequences
Rie Kubota Ando and Lillian Lee
Proceedings of NAACL, pp. 241--248, 2000. Journal version: Natural Language Engineering 9(2):127--149, 2003.

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese.
Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously.
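The abstract does not define a compatible bracket, so the following is only a minimal sketch under an assumed reading: a proposed bracket counts as compatible if it does not cross any word-level bracket and does not place a boundary strictly inside any morpheme-level bracket. The half-open span representation and the names `crosses`, `divides`, `is_compatible`, `compatible_precision`, and `all_compatible` are hypothetical choices made for illustration, not taken from the paper.

```python
from typing import List, Tuple

# A bracket is a half-open character span (start, end) over one kanji sequence.
Bracket = Tuple[int, int]


def crosses(a: Bracket, b: Bracket) -> bool:
    """True if the spans overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def divides(bracket: Bracket, morpheme: Bracket) -> bool:
    """True if either endpoint of `bracket` falls strictly inside `morpheme`."""
    s, e = morpheme
    return any(s < p < e for p in bracket)


def is_compatible(bracket: Bracket,
                  words: List[Bracket],
                  morphemes: List[Bracket]) -> bool:
    """Assumed definition: no word-level crossing, no morpheme split."""
    return (not any(crosses(bracket, w) for w in words)
            and not any(divides(bracket, m) for m in morphemes))


def compatible_precision(proposed: List[Bracket],
                         words: List[Bracket],
                         morphemes: List[Bracket]) -> float:
    """Fraction of proposed brackets that are compatible."""
    if not proposed:
        raise ValueError("no proposed brackets to score")
    ok = sum(is_compatible(b, words, morphemes) for b in proposed)
    return ok / len(proposed)


def all_compatible(proposed: List[Bracket],
                   words: List[Bracket],
                   morphemes: List[Bracket]) -> bool:
    """True if every proposed bracket for the sequence is compatible."""
    return all(is_compatible(b, words, morphemes) for b in proposed)


# Hypothetical two-level annotation of a four-character sequence.
words = [(0, 2), (2, 4)]
morphemes = [(0, 1), (1, 2), (2, 4)]
```

Under this reading, a segmentation at either annotated granularity, such as `[(0, 2), (2, 4)]` or `[(0, 1), (1, 2), (2, 4)]`, is fully compatible, while `[(0, 3), (3, 4)]` is penalized because `(0, 3)` crosses a word-level bracket and `(3, 4)` splits a morpheme.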

@inproceedings{Ando+Lee:00a,
  author = {Rie Kubota Ando and Lillian Lee},
  title = {Mostly-unsupervised statistical segmentation of {Japanese}: Application to kanji sequences},
  year = {2000},
  pages = {241--248},
  booktitle = {Proceedings of NAACL}
}

TANGO algorithm

This material is based on work supported in part by a grant from the GE Foundation and by the National Science Foundation under ITR/IM grant IIS-0081334. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of any sponsoring institutions, the U.S. government, or any other entity.