Unsupervised Statistical Segmentation of Japanese Kanji Strings.
Rie Kubota Ando and Lillian Lee.
Cornell University CS Technical Report TR99-1756, 1999.

See also the journal version of this paper, Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences, published in Natural Language Engineering, which contains a different experimental analysis. Follow the link for the paper and other information.

Abstract: Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character $n$-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of both standard and novel error metrics.
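
Illustration (not part of the abstract): the following is a minimal sketch of one way character n-gram counts alone could drive segmentation. The abstract does not spell out the procedure, so everything here is an assumption made for illustration: the voting rule (n-grams lying wholly on one side of a candidate boundary are compared against n-grams that straddle it), the function names ngram_counts, boundary_scores and segment, the threshold of 0.5, and the toy corpus. It should not be read as the authors' exact algorithm.

```python
from collections import Counter


def ngram_counts(corpus, orders):
    """Count every character n-gram of the given orders in an unannotated corpus."""
    counts = Counter()
    for line in corpus:
        for n in orders:
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def boundary_scores(seq, counts, orders):
    """Score each position k (between seq[k-1] and seq[k]) in [0, 1].

    Assumed heuristic: for each order n, compare the counts of the n-grams
    ending at k or starting at k with the counts of the n-grams that straddle
    k. The vote is the fraction of comparisons won by the non-straddling
    n-grams; votes are averaged over the orders that had evidence.
    """
    scores = []
    for k in range(1, len(seq)):
        votes = []
        for n in orders:
            sides = []
            if k - n >= 0:
                sides.append(counts[seq[k - n:k]])   # n-gram ending at k
            if k + n <= len(seq):
                sides.append(counts[seq[k:k + n]])   # n-gram starting at k
            straddling = [
                counts[seq[j:j + n]]
                for j in range(k - n + 1, k)
                if j >= 0 and j + n <= len(seq)
            ]
            if not sides or not straddling:
                continue  # no evidence for this order at this position
            wins = sum(1 for s in sides for t in straddling if s > t)
            votes.append(wins / (len(sides) * len(straddling)))
        scores.append(sum(votes) / len(votes) if votes else 0.0)
    return scores


def segment(seq, counts, orders=(2,), threshold=0.5):
    """Cut seq wherever the boundary score exceeds the (assumed) threshold."""
    scores = boundary_scores(seq, counts, orders)
    words, start = [], 0
    for k, score in enumerate(scores, start=1):
        if score > threshold:
            words.append(seq[start:k])
            start = k
    words.append(seq[start:])
    return words


# Toy corpus, invented for this example only.
corpus = ["東京", "東京", "大学", "大学", "東京大学"]
counts = ngram_counts(corpus, orders=(2,))
print(segment("東京大学", counts, orders=(2,)))  # ['東京', '大学']
```

In the toy corpus the bigrams 東京 and 大学 each occur three times while the straddling bigram 京大 occurs once, so only the middle position wins its comparisons and receives a cut. No dictionary or annotated data is consulted at any point.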

Paper formats: ps, pdf, other


Back links: Lillian Lee's home page or papers page; Cornell NLP page.