2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI) | 2021

Key-Value Pair Searching System via Tesseract OCR and Post Processing

 
 

Abstract


Optical character recognition (OCR) systems make it possible to extract text from images. In many cases the raw text is sufficient, but some applications require key-value pairs. In this paper, we investigate the use of the open-source Tesseract OCR system to extract text from images and to perform a key-value pair search on the result. Before recognition, image noise must be minimized with image processing algorithms. After recognition, post-processing procedures must be applied to Tesseract's output. These post-processors transform the recognition result and can improve the accuracy of the extracted information, for example with the help of regular expressions. The key-value pair search is performed after these procedures.
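The post-processing and key-value search steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `postprocess` and `find_pairs`, the clean-up rules, and the assumption that keys and values are separated by a colon (with `;` treated as a misread `:`) are all hypothetical. The input string stands in for text that Tesseract would return for an image, for example through a wrapper such as `pytesseract`, which this sketch does not call.

```python
import re


def postprocess(raw_text):
    """Normalize OCR output line by line (hypothetical clean-up rules)."""
    lines = []
    for line in raw_text.splitlines():
        # Collapse runs of spaces and tabs left over from the page layout.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Assumption: the first ':' or ';' on a line is the key-value
        # separator; normalize it to ': ' with no space before the colon.
        line = re.sub(r" ?[:;] ?", ": ", line, count=1).strip()
        if line:  # drop lines that are empty after clean-up
            lines.append(line)
    return "\n".join(lines)


# One pair per line: everything before the first ': ' is the key.
PAIR_PATTERN = re.compile(r"^(?P<key>[^:\n]+): (?P<value>.+)$", re.MULTILINE)


def find_pairs(clean_text):
    """Return a dict of key-value pairs found in the cleaned text."""
    return {
        match.group("key").strip(): match.group("value").strip()
        for match in PAIR_PATTERN.finditer(clean_text)
    }


if __name__ == "__main__":
    # Stand-in for OCR output; spacing and ';' errors are simulated.
    sample = "Invoice No ;  12345\n\n  Date:2021-01-21 \nTotal   :  99.50 EUR\nThank you"
    print(find_pairs(postprocess(sample)))
```

Lines without a separator, such as the closing "Thank you", are kept by the clean-up step but produce no pair. Running the search only after clean-up mirrors the order given in the abstract, so the pair pattern can stay simple.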

Pages 000461-000464
DOI 10.1109/SAMI50585.2021.9378680
Language English
Journal 2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI)
