Archive | 2021

iiit-indic-hw-words: A Dataset for Indic Handwritten Text Recognition

 
 

Abstract


Handwritten text recognition (HTR) for Indian languages is not yet a well-studied problem, primarily because large annotated datasets are unavailable for the associated scripts. Existing datasets are small and use small lexicons, so they are not sufficient to build robust HTR solutions using modern machine learning techniques. In this work, we introduce a large-scale handwritten dataset for Indic scripts containing 868K handwritten instances written by 135 writers in 8 widely used scripts. A comprehensive dataset of ten Indic scripts, referred to as iiit-indic-hw-words, is derived by combining the newly introduced dataset with the earlier datasets developed for Devanagari (iiit-hw-dev) and Telugu (iiit-hw-telugu). We further establish a high baseline for text recognition in eight Indic scripts. Our recognition scheme follows the contemporary design principles from other recognition literature and yields competitive results on English. iiit-indic-hw-words and the recognizers are publicly available. We further (i) study the reasons for changes in HTR performance across scripts and (ii) explore the utility of pre-training for Indic HTRs. We hope our efforts will catalyze research and fuel applications related to handwritten document understanding in Indic scripts.

Pages 444-459
DOI 10.1007/978-3-030-86337-1_30
Language English
