2021 International Conference on Information Technology (ICIT) | 2021

An Optimal Data Entry Method, Using Web Scraping and Text Recognition

 
 
 

Abstract


Data entry is a tedious job that consumes considerable manpower in creating structured data from given inputs. Much of the data entered into a system can differ from the original data, causing confusion, especially when the data has to be gathered from image files. In this paper, we propose a text recognition system that automatically detects text in images and writes it to a target file. The proposed method accepts a web URL as input and fetches the text or image using web scraping. The system extracts textual data from a user-specified region. The extracted text is then classified using a Support Vector Machine (SVM) and a Naive Bayes Classifier. The output is saved as a Google Sheet, CSV, PDF, text, or Excel file, according to the user's choice. Contemporary text recognition tools, namely PyTesseract, PyOCR, and TesserOCR, are compared on accuracy, precision, and execution speed. The experimental results show that PyTesseract achieves an accuracy of 83.45% and a precision of 75.55%. The performance of the SVM and the Naive Bayes Classifier is also compared: the Naive Bayes Classifier, with 92.08% precision, 90.148% recall, and 90.99% F-measure, performs better than the SVM.
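
The scraping and export steps described in the abstract can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the class `PageScraper`, the helpers `scrape` and `rows_to_csv`, the sample HTML, and the `example.com` URL are all hypothetical names introduced here. The OCR step (PyTesseract in the paper) and the classification step are deliberately left out, and the page is parsed from a string rather than downloaded so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScraper(HTMLParser):
    """Collect visible text fragments and image URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.text_parts.append(text)


def scrape(html, base_url):
    """Return (text fragments, absolute image URLs) found in the HTML."""
    parser = PageScraper(base_url)
    parser.feed(html)
    parser.close()
    return parser.text_parts, parser.image_urls


def rows_to_csv(rows):
    """Serialize (source, text) pairs as CSV, one of the output formats."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page content standing in for a fetched URL.
SAMPLE_HTML = """<html><body>
<h1>Invoice 17</h1>
<img src="/scans/page1.png">
<script>var x = 1;</script>
<p>Total: 42</p>
</body></html>"""

texts, images = scrape(SAMPLE_HTML, "https://example.com/docs/")
csv_text = rows_to_csv([("page", t) for t in texts])
```

In a fuller pipeline, each URL in `images` would be downloaded, cropped to the user-specified region, and passed to an OCR engine, with the recognized text appended as further rows before export.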

Pages 92-97
DOI 10.1109/ICIT52682.2021.9491643
Language English
Journal 2021 International Conference on Information Technology (ICIT)
