Robust n-Gram Model of Japanese Character and Its Application to Document Recognition

Hiroki MORI, Hirotomo ASO, Shozo MAKINO

Summary:

A new postprocessing method for Japanese document recognition based on an interpolated n-gram model is proposed. Compared with conventional approaches, the method has the advantage of enabling high-speed, knowledge-free processing. When estimating the parameters of an n-gram model over a large vocabulary, it is difficult to obtain sufficient training samples. To overcome this scarcity of samples, two smoothing methods for a Japanese character trigram model are evaluated, and the deleted interpolation method is shown to be superior in terms of perplexity. A document recognition system based on the trigram model is constructed, which finds maximum-likelihood solutions with the Viterbi algorithm. Experimental results for three kinds of documents show that performance is high when deleted interpolation is used for smoothing: 90% of OCR errors are corrected for documents similar to the training text, and 75% of errors are corrected for documents less similar to the training text.
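The ideas in the summary (an interpolated character trigram, perplexity as the evaluation measure, and Viterbi search over OCR candidates) can be illustrated with a minimal sketch. It is not the authors' implementation. The names `InterpolatedTrigram`, `perplexity`, and `viterbi_correct`, the `^` padding symbol, the vocabulary size of 3000, and the candidate format (one dictionary of character to OCR likelihood per position) are all assumptions made for illustration. In particular, the interpolation weights are fixed by hand here, whereas deleted interpolation would estimate them from held-out data.

```python
import math
from collections import Counter


class InterpolatedTrigram:
    """Character trigram smoothed by linear interpolation (illustrative only)."""

    def __init__(self, text, lambdas=(0.6, 0.3, 0.09, 0.01), vocab_size=3000):
        # ASSUMPTION: weights are hand-picked; deleted interpolation would
        # instead estimate them on held-out data.
        self.l3, self.l2, self.l1, self.l0 = lambdas
        self.vocab_size = vocab_size  # hypothetical character inventory size
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        self.ctx2, self.ctx1 = Counter(), Counter()
        self.total = 0
        padded = "^^" + text  # '^' is an assumed start-of-text symbol
        for i in range(2, len(padded)):
            a, b, c = padded[i - 2], padded[i - 1], padded[i]
            self.tri[a, b, c] += 1
            self.ctx2[a, b] += 1
            self.bi[b, c] += 1
            self.ctx1[b] += 1
            self.uni[c] += 1
            self.total += 1

    def prob(self, a, b, c):
        """P(c | a, b) as a weighted mix of trigram, bigram, unigram, uniform."""
        f3 = self.tri[a, b, c] / self.ctx2[a, b] if self.ctx2[a, b] else 0.0
        f2 = self.bi[b, c] / self.ctx1[b] if self.ctx1[b] else 0.0
        f1 = self.uni[c] / self.total if self.total else 0.0
        return self.l3 * f3 + self.l2 * f2 + self.l1 * f1 + self.l0 / self.vocab_size


def perplexity(model, text):
    """Per-character perplexity of `text` under `model`."""
    padded = "^^" + text
    log_sum = 0.0
    for i in range(2, len(padded)):
        log_sum += math.log(model.prob(padded[i - 2], padded[i - 1], padded[i]))
    return math.exp(-log_sum / len(text))


def viterbi_correct(candidates, model):
    """Pick the most likely string from per-position OCR candidates.

    `candidates` is a list of dicts mapping a character to its OCR likelihood.
    The Viterbi state is the pair of the two most recent characters.
    """
    best = {("^", "^"): (0.0, "")}  # state -> (log score, best path so far)
    for cands in candidates:
        nxt = {}
        for (a, b), (score, path) in best.items():
            for c, p_ocr in cands.items():
                s = score + math.log(p_ocr) + math.log(model.prob(a, b, c))
                if (b, c) not in nxt or s > nxt[b, c][0]:
                    nxt[b, c] = (s, path + c)
        best = nxt
    return max(best.values(), key=lambda v: v[0])[1]


if __name__ == "__main__":
    model = InterpolatedTrigram("今日は天気がよい。明日は天気がわるい。")
    # The OCR slightly prefers the wrong character 夫 over 天 at position 2.
    cands = [{"は": 1.0}, {"夫": 0.6, "天": 0.4}, {"気": 0.9, "汽": 0.1}]
    print(viterbi_correct(cands, model))
    print(perplexity(model, "今日は天気"))
```

In this toy case the language model outweighs the OCR preference for 夫, so the decoder selects は天気. The uniform term keeps every probability nonzero, which is what allows unseen characters to be scored at all and keeps the logarithms finite.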

Publication
IEICE TRANSACTIONS on Information Vol.E79-D No.5 pp.471-476
Publication Date
1996/05/25
Type of Manuscript
Special Section PAPER (Special Issue on Character Recognition and Document Understanding)
Category
Postprocessing
