A new postprocessing method using interpolated n-gram model for Japanese documents is proposed. The method has the advantages over conventional approaches in enabling high-speed, knowledge-free processing. In parameter estimation of an n-gram model for a large size of vocabulary, it is difficult to obtain sufficient training samples. To overcome poverty of samples, two smoothing methods for Japanese character trigram model are evaluated, and the superiority of deleted interpolation method is shown by using perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions through Viterbi algorithm. Experimental results for three kinds of documents show that the performance is high when using deleted interpolation method for smoothing. 90% of OCR errors are corrected for the documents similar to training text data, and 75% of errors are corrected for the documents not so similar to training text data.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Hiroki MORI, Hirotomo ASO, Shozo MAKINO, "Robust n-Gram Model of Japanese Character and Its Application to Document Recognition" in IEICE TRANSACTIONS on Information,
vol. E79-D, no. 5, pp. 471-476, May 1996, doi: .
Abstract: A new postprocessing method using interpolated n-gram model for Japanese documents is proposed. The method has the advantages over conventional approaches in enabling high-speed, knowledge-free processing. In parameter estimation of an n-gram model for a large size of vocabulary, it is difficult to obtain sufficient training samples. To overcome poverty of samples, two smoothing methods for Japanese character trigram model are evaluated, and the superiority of deleted interpolation method is shown by using perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions through Viterbi algorithm. Experimental results for three kinds of documents show that the performance is high when using deleted interpolation method for smoothing. 90% of OCR errors are corrected for the documents similar to training text data, and 75% of errors are corrected for the documents not so similar to training text data.
URL: https://globals.ieice.org/en_transactions/information/10.1587/e79-d_5_471/_p
Copy
@ARTICLE{e79-d_5_471,
author={Hiroki MORI, Hirotomo ASO, Shozo MAKINO, },
journal={IEICE TRANSACTIONS on Information},
title={Robust n-Gram Model of Japanese Character and Its Application to Document Recognition},
year={1996},
volume={E79-D},
number={5},
pages={471-476},
abstract={A new postprocessing method using interpolated n-gram model for Japanese documents is proposed. The method has the advantages over conventional approaches in enabling high-speed, knowledge-free processing. In parameter estimation of an n-gram model for a large size of vocabulary, it is difficult to obtain sufficient training samples. To overcome poverty of samples, two smoothing methods for Japanese character trigram model are evaluated, and the superiority of deleted interpolation method is shown by using perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions through Viterbi algorithm. Experimental results for three kinds of documents show that the performance is high when using deleted interpolation method for smoothing. 90% of OCR errors are corrected for the documents similar to training text data, and 75% of errors are corrected for the documents not so similar to training text data.},
keywords={},
doi={},
ISSN={},
month={May},}
Copy
TY - JOUR
TI - Robust n-Gram Model of Japanese Character and Its Application to Document Recognition
T2 - IEICE TRANSACTIONS on Information
SP - 471
EP - 476
AU - Hiroki MORI
AU - Hirotomo ASO
AU - Shozo MAKINO
PY - 1996
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E79-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 1996
AB - A new postprocessing method using interpolated n-gram model for Japanese documents is proposed. The method has the advantages over conventional approaches in enabling high-speed, knowledge-free processing. In parameter estimation of an n-gram model for a large size of vocabulary, it is difficult to obtain sufficient training samples. To overcome poverty of samples, two smoothing methods for Japanese character trigram model are evaluated, and the superiority of deleted interpolation method is shown by using perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions through Viterbi algorithm. Experimental results for three kinds of documents show that the performance is high when using deleted interpolation method for smoothing. 90% of OCR errors are corrected for the documents similar to training text data, and 75% of errors are corrected for the documents not so similar to training text data.
ER -