Proceedings of the 3rd Workshop on Structuring and Understanding of Multimedia heritAge Contents | 2021

Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records

Abstract


Processing and analyzing historical manuscripts is considered one of the most challenging problems in the document analysis and recognition domain. Manuscripts written in cursive are even more difficult because of overlapping words with random spacing, irregular and varying character shapes, poor scan quality, and insufficient labeled data. Despite the significant achievements of deep learning approaches in computer vision, handwritten word recognition is far from solved, and most existing methods focus on well-segmented word datasets. In this paper, we present an empirical study of how well state-of-the-art deep learning models detect and recognize handwritten words in Spanish American notary records. Professional historians helped prepare the labeled dataset of 26,482 Spanish words used in the experiments. We evaluate several state-of-the-art models for optical character recognition (OCR) on handwritten text documents: Keras-OCR, the object detection algorithm You Only Look Once (YOLO), Tesseract OCR, Kraken, and Calamari-OCR. Since YOLO does not include a text recognizer, we propose YOLO-OCR, an innovative model that detects and recognizes words in historical manuscripts written in Spanish. Our results report the performance of pre-trained models on our dataset and show that the Keras-OCR and YOLO-OCR models are highly valuable for content extraction.

DOI 10.1145/3475720.3484443
Language English
Journal Proceedings of the 3rd Workshop on Structuring and Understanding of Multimedia heritAge Contents

Full Text