Archive | 2021

CATALIST: CAmera TrAnsformations for Multi-LIngual Scene Text Recognition

Abstract

We present CATALIST, a model that ‘tames’ the attention heads of an attention-based scene text recognition model. While training the multi-head attention model, we supervise the attention masks at multiple levels: line, word, and character. We demonstrate that such supervision improves training performance and testing accuracy. To train CATALIST and its attention masks, we also present ALCHEMIST, a synthetic data generator that enables the creation of large scene-text video datasets along with mask information at the character, word, and line levels. We also release CATALISTd, a dataset of 2k videos of real scenes that may contain scene text in a combination of three languages: English, Hindi, and Marathi. We record these videos using five types of camera transformation, (i) translation, (ii) roll, (iii) tilt, (iv) pan, and (v) zoom, to create transformed videos. The dataset and other useful resources are available as a documented public repository for use by the community.

Pages 213-228

DOI 10.1007/978-3-030-86198-8_16

Language English
