The journal version of this paper, "Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences", appears in Natural Language Engineering; follow the link for the paper and other information.
Abstract: Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing *unsegmented* training data, with performance on kanji sequences comparable to and sometimes surpassing that of morphological analyzers over a variety of error metrics.
BibTeX entry:
@InProceedings{Ando+Lee:00a,
author = {Rie Kubota Ando and Lillian Lee},
title = {Mostly-Unsupervised Statistical Segmentation of {J}apanese},
booktitle = {First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
year = 2000,
pages = {241--248}
}