2021 IEEE 17th International Conference on eScience (eScience) | 2021

Exploring Learning Approaches for Ancient Greek Character Recognition with Citizen Science Data


Abstract


The central dogma of handwritten character recognition remains inextricably linked to optical character recognition methods for print media. These methods rely on proprietary data and lack open-access software, and their applicability to handwritten characters from low-quality (e.g., damaged) documents remains unknown. In this paper, we compare the performance of state-of-the-art optical character recognition tools for print with that of learning models built with state-of-the-art machine learning toolkits and trained on handwritten inputs. Using Tesseract OCR as a baseline, we build, optimize, and evaluate three types of convolutional neural networks trained on the AL-ALL and AL-PUB datasets, collections of images of handwritten ancient Greek characters that were labeled by volunteers through the Ancient Lives online citizen science project. Our best-performing machine learning model is 92.57% accurate, compared to Tesseract OCR’s 11.15%. Following our analysis, we briefly examine our models’ shortcomings, introduce the publicly available AL-PUB dataset, and describe Theia, a web-based tool that makes our machine learning models available for public use. We conclude by discussing the promise of our findings for advancing research at the intersection of machine learning, manuscript transcription, and the digital humanities.

Pages 128-137
DOI 10.1109/eScience51609.2021.00023
Language English
Journal 2021 IEEE 17th International Conference on eScience (eScience)
