Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji.
Rie Kubota Ando and Lillian Lee.
First Conference of the NAACL, pp. 241--248, 2000.

The journal version of this paper, "Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences", appears in Natural Language Engineering; follow the link for the paper and other information.

Abstract: Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing *unsegmented* training data, with performance on kanji sequences comparable to and sometimes surpassing that of morphological analyzers over a variety of error metrics.

Paper formats: ps, pdf

BibTeX entry:

@InProceedings{Ando+Lee:00a,
  author =    {Rie Kubota Ando and Lillian Lee},
  title =     {Mostly-Unsupervised Statistical Segmentation of {J}apanese: Applications to Kanji},
  booktitle = {First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year =      2000,
  pages =     {241--248}
}

