2021 International Conference of Optical Imaging and Measurement (ICOIM) | 2021

Form Recognition Based on Lightweight U-Net and Tesseract after Multi-level Retraining


Abstract


With the rapid development of Internet information technology and the advance of enterprise digitization, the digitization of paper forms has received extensive attention. Automatically converting paper forms into electronic documents faces three main problems. First, form files are diverse in format and complex in structure. This paper parses the form's XML file to analyze the file structure, which is more accurate than current semantic segmentation methods. Second, the table area must be detected. This paper uses traditional algorithms to find the contours of candidate table areas and screens them according to the characteristics of table areas to detect and extract the table area. Third, recognizing the table text is difficult: interference such as table borders reduces recognition accuracy, and the text in a table is of complex types, including Chinese, English, numbers, symbols, and mixtures of these, all of which pose major challenges for text recognition. This paper uses a lightweight U-Net model to segment the text area at the pixel level, eliminating information that interferes with text recognition. The Tesseract neural network was retrained multiple times in a multi-level manner and recognizes complex types of text with an accuracy of about 96%. Combining deep learning with an XML-based table structure analysis algorithm, this paper recognizes paper forms and reconstructs them as electronic files.
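
The screening step for candidate table areas can be illustrated with a minimal sketch. The abstract does not state which characteristics are used, so the area-fraction and aspect-ratio criteria below, their thresholds, and the names `Box` and `screen_table_candidates` are assumptions made for illustration, not the authors' method. The sketch takes bounding boxes of candidate contours (however they were found) and keeps those that are large relative to the page and not extremely elongated.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of one candidate contour, in pixels."""
    x: int
    y: int
    w: int
    h: int


def screen_table_candidates(
    boxes: List[Box],
    page_w: int,
    page_h: int,
    min_area_frac: float = 0.05,  # hypothetical threshold, not from the paper
    max_aspect: float = 10.0,     # hypothetical threshold, not from the paper
) -> List[Box]:
    """Keep boxes that are large enough and not extremely elongated."""
    page_area = page_w * page_h
    kept = []
    for b in boxes:
        if b.w <= 0 or b.h <= 0:
            continue  # skip degenerate boxes
        area_frac = (b.w * b.h) / page_area
        aspect = max(b.w / b.h, b.h / b.w)
        if area_frac >= min_area_frac and aspect <= max_aspect:
            kept.append(b)
    # Return the largest candidates first.
    return sorted(kept, key=lambda b: b.w * b.h, reverse=True)


candidates = [
    Box(50, 100, 900, 600),  # large block, plausibly a table
    Box(10, 10, 40, 12),     # small text fragment
    Box(0, 750, 1000, 5),    # thin horizontal rule
]
tables = screen_table_candidates(candidates, page_w=1000, page_h=1400)
```

In this example only the first box passes both checks. In a real pipeline the boxes would come from a contour finder run on a binarized page image, and the thresholds would need tuning on the target forms.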

Pages 243-248
DOI 10.1109/ICOIM52180.2021.9524385
Language English
Journal 2021 International Conference of Optical Imaging and Measurement (ICOIM)

Full Text