2021 IEEE 19th World Symposium on Applied Machine Intelligence and Informatics (SAMI) | 2021
Key-Value Pair Searching System via Tesseract OCR and Post Processing
Abstract
Optical character recognition (OCR) systems make it possible to extract text from images. In many cases this is sufficient, but some applications require key-value pairs. In this paper, we investigate the use of the open-source Tesseract OCR system to extract text data from images and to perform a key-value pair search. Image noise must be minimized with image processing algorithms before recognition. The output of Tesseract must also undergo post-processing. These post-processors transform the recognition result and can improve the accuracy of the extracted information, for example with the help of regular expressions. The key-value pair search is performed after these procedures.
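
To illustrate the post-processing and key-value search steps described above, the following is a minimal Python sketch. It is not the implementation used in the paper. It assumes the OCR text is already available as a plain string (for example, from Tesseract), that pairs appear one per line with a ":" or "=" separator, and that certain fields are known to be numeric. The names `DIGIT_FIXES`, `normalize_line`, `PAIR_RE`, and `find_pairs`, as well as the character-substitution table, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical correction table for letters that OCR commonly confuses
# with digits. Applied only to values of fields declared numeric.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# Assumed layout: "<key> : <value>" or "<key> = <value>", one pair per line.
PAIR_RE = re.compile(r"^(?P<key>[^:=]+?)\s*[:=]\s*(?P<value>.+)$")


def normalize_line(line: str) -> str:
    """Post-processing step: collapse whitespace runs and trim the line."""
    return re.sub(r"\s+", " ", line).strip()


def find_pairs(ocr_text: str, numeric_keys=()) -> dict:
    """Search post-processed OCR text for key-value pairs.

    numeric_keys holds lowercase key names whose values should have
    letter/digit confusions corrected (an assumption of this sketch).
    """
    pairs = {}
    for raw in ocr_text.splitlines():
        match = PAIR_RE.match(normalize_line(raw))
        if not match:
            continue  # line carries no recognizable key-value pair
        key = match.group("key").strip()
        value = match.group("value").strip()
        if key.lower() in numeric_keys:
            value = value.translate(DIGIT_FIXES)
        pairs[key] = value
    return pairs


if __name__ == "__main__":
    sample = "Invoice  No :  2O21-OO7\nName: John  Smith\nnoise line\nTotal = l25.5O"
    print(find_pairs(sample, numeric_keys={"invoice no", "total"}))
```

The substitution table is applied only to fields listed in `numeric_keys`, so that free-text values such as names are not altered by the digit corrections.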