Optical Character Reader (OCR) incorrect recognition is a serious problem when searching for OCR-scanned documents in databases such as digital libraries. In order to reduce costs, this paper proposes fuzzy retrieval methods for English text containing errors in the recognized text without correcting the errors manually. The proposed methods generate multiple search terms for each input query term based on probabilistic automata which reflect both error-occurrence probabilities and character-connection probabilities. Experimental results of test-set retrieval indicate that one of the proposed methods improves the recall rate from 95.96% to 98.15% at the cost of a decrease in precision from 100.00% to 96.01% with 20 expanded search terms.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Manabu OHTA, Atsuhiro TAKASU, Jun ADACHI, "Probabilistic Automaton-Based Fuzzy English-Text Retrieval" in IEICE TRANSACTIONS on Information,
vol. E86-D, no. 9, pp. 1835-1844, September 2003, doi: .
Abstract: Optical Character Reader (OCR) incorrect recognition is a serious problem when searching for OCR-scanned documents in databases such as digital libraries. In order to reduce costs, this paper proposes fuzzy retrieval methods for English text containing errors in the recognized text without correcting the errors manually. The proposed methods generate multiple search terms for each input query term based on probabilistic automata which reflect both error-occurrence probabilities and character-connection probabilities. Experimental results of test-set retrieval indicate that one of the proposed methods improves the recall rate from 95.96% to 98.15% at the cost of a decrease in precision from 100.00% to 96.01% with 20 expanded search terms.
URL: https://globals.ieice.org/en_transactions/information/10.1587/e86-d_9_1835/_p
Copy
@ARTICLE{e86-d_9_1835,
author={Manabu OHTA, Atsuhiro TAKASU, Jun ADACHI, },
journal={IEICE TRANSACTIONS on Information},
title={Probabilistic Automaton-Based Fuzzy English-Text Retrieval},
year={2003},
volume={E86-D},
number={9},
pages={1835-1844},
abstract={Optical Character Reader (OCR) incorrect recognition is a serious problem when searching for OCR-scanned documents in databases such as digital libraries. In order to reduce costs, this paper proposes fuzzy retrieval methods for English text containing errors in the recognized text without correcting the errors manually. The proposed methods generate multiple search terms for each input query term based on probabilistic automata which reflect both error-occurrence probabilities and character-connection probabilities. Experimental results of test-set retrieval indicate that one of the proposed methods improves the recall rate from 95.96% to 98.15% at the cost of a decrease in precision from 100.00% to 96.01% with 20 expanded search terms.},
keywords={},
doi={},
ISSN={},
month={September},}
Copy
TY - JOUR
TI - Probabilistic Automaton-Based Fuzzy English-Text Retrieval
T2 - IEICE TRANSACTIONS on Information
SP - 1835
EP - 1844
AU - Manabu OHTA
AU - Atsuhiro TAKASU
AU - Jun ADACHI
PY - 2003
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E86-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2003
AB - Optical Character Reader (OCR) incorrect recognition is a serious problem when searching for OCR-scanned documents in databases such as digital libraries. In order to reduce costs, this paper proposes fuzzy retrieval methods for English text containing errors in the recognized text without correcting the errors manually. The proposed methods generate multiple search terms for each input query term based on probabilistic automata which reflect both error-occurrence probabilities and character-connection probabilities. Experimental results of test-set retrieval indicate that one of the proposed methods improves the recall rate from 95.96% to 98.15% at the cost of a decrease in precision from 100.00% to 96.01% with 20 expanded search terms.
ER -