Robust n-Gram Model of Japanese Character and Its Application to Document Recognition

Hiroki MORI; Hirotomo ASO; Shozo MAKINO

Summary:

A new postprocessing method for Japanese documents based on an interpolated n-gram model is proposed. Compared with conventional approaches, the method enables high-speed, knowledge-free processing. When estimating the parameters of an n-gram model over a large vocabulary, it is difficult to obtain sufficient training samples. To overcome this shortage of samples, two smoothing methods for a Japanese character trigram model are evaluated, and the deleted interpolation method is shown to be superior in terms of perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions using the Viterbi algorithm. Experimental results for three kinds of documents show that performance is high when deleted interpolation is used for smoothing: 90% of OCR errors are corrected for documents similar to the training text, and 75% of errors are corrected for documents less similar to the training text.
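
The pipeline described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the padding symbol, the fixed interpolation weights, and the toy OCR candidate scores are all assumptions made for illustration. In particular, deleted interpolation would estimate the weights on held-out data, whereas this sketch simply fixes them.

```python
import math
from collections import Counter

PAD = ""  # hypothetical padding symbol used as the initial context


class InterpolatedTrigram:
    """Character trigram model smoothed by linear interpolation.

    The weights are fixed here (an assumption); deleted interpolation
    would estimate them from held-out data instead.
    """

    def __init__(self, text, lambdas=(0.6, 0.3, 0.09, 0.01)):
        self.l3, self.l2, self.l1, self.l0 = lambdas
        self.vocab_size = len(set(text))
        self.total = len(text)
        self.uni = Counter(text)
        self.bi = Counter(zip(text, text[1:]))
        self.tri = Counter(zip(text, text[1:], text[2:]))

    def prob(self, a, b, c):
        """P(c | a, b) as a mix of trigram, bigram, unigram and uniform terms."""
        p3 = self.tri[(a, b, c)] / self.bi[(a, b)] if self.bi[(a, b)] else 0.0
        p2 = self.bi[(b, c)] / self.uni[b] if self.uni[b] else 0.0
        p1 = self.uni[c] / self.total
        p0 = 1.0 / self.vocab_size  # keeps every probability above zero
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1 + self.l0 * p0


def perplexity(model, text):
    """Per-character perplexity of `text` (needs at least three characters)."""
    trigrams = list(zip(text, text[1:], text[2:]))
    log_sum = sum(math.log2(model.prob(a, b, c)) for a, b, c in trigrams)
    return 2 ** (-log_sum / len(trigrams))


def viterbi(model, candidates):
    """Pick the most likely string from per-position OCR candidates.

    `candidates` is a list with one entry per character position; each
    entry is a list of (char, ocr_log_score) pairs. The search state is
    the pair of the two most recent characters.
    """
    best = {(PAD, PAD): (0.0, "")}
    for cands in candidates:
        nxt = {}
        for (a, b), (score, path) in best.items():
            for c, ocr_logp in cands:
                s = score + ocr_logp + math.log(model.prob(a, b, c))
                if (b, c) not in nxt or s > nxt[(b, c)][0]:
                    nxt[(b, c)] = (s, path + c)
        best = nxt
    return max(best.values())[1]


# Toy usage: the OCR slightly prefers "o" over "a" at one position.
model = InterpolatedTrigram("the cat sat on the mat. " * 20)
lattice = [[(ch, 0.0)] for ch in "the c"]
lattice.append([("o", math.log(0.6)), ("a", math.log(0.4))])
lattice.append([("t", 0.0)])
corrected = viterbi(model, lattice)
```

In this toy setting the language model score outweighs the small OCR preference, which is the kind of correction the summary reports. A realistic system would need a much larger training text and properly estimated weights.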

Publication: IEICE TRANSACTIONS on Information Vol.E79-D No.5 pp.471-476

Publication Date: 1996/05/25

Type of Manuscript: Special Section PAPER (Special Issue on Character Recognition and Document Understanding)

Category: Postprocessing

Hiroki MORI, Hirotomo ASO, Shozo MAKINO, "Robust n-Gram Model of Japanese Character and Its Application to Document Recognition" in IEICE TRANSACTIONS on Information, vol. E79-D, no. 5, pp. 471-476, May 1996.
URL: https://globals.ieice.org/en_transactions/information/10.1587/e79-d_5_471/_p

@ARTICLE{e79-d_5_471,
author={Hiroki MORI and Hirotomo ASO and Shozo MAKINO},
journal={IEICE TRANSACTIONS on Information},
title={Robust n-Gram Model of Japanese Character and Its Application to Document Recognition},
year={1996},
volume={E79-D},
number={5},
pages={471--476},
month={May}
}

TY - JOUR
TI - Robust n-Gram Model of Japanese Character and Its Application to Document Recognition
T2 - IEICE TRANSACTIONS on Information
SP - 471
EP - 476
AU - Hiroki MORI
AU - Hirotomo ASO
AU - Shozo MAKINO
PY - 1996
JO - IEICE TRANSACTIONS on Information
VL - E79-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 1996
ER -