Pattern Recognition Letters | 2021

PMMN: Pre-trained multi-Modal network for scene text recognition


Abstract


The Scene Text Recognition (STR) task requires large amounts of data to develop a powerful recognizer, including visual data such as images and linguistic data such as texts. However, existing methods mainly adopt a one-stage training scheme that trains the entire framework end-to-end, which relies heavily on well-annotated images and does not effectively exploit the data of the two modalities mentioned above. To address this, we propose a pre-trained multi-modal network (PMMN) that utilizes visual and linguistic data to pre-train a vision model and a language model, respectively, so that each learns modality-specific knowledge for accurate scene text recognition. In detail, we first pre-train the proposed off-the-shelf vision and language models to convergence. We then combine the pre-trained models in a unified framework for end-to-end fine-tuning, where the learned multi-modal information interacts across modalities to generate robust features for character prediction. Extensive experiments demonstrate the effectiveness of PMMN: evaluation on six benchmarks shows that the proposed method exceeds most existing methods, achieving state-of-the-art performance.

Volume 151
Pages 103-111
DOI 10.1016/j.patrec.2021.07.016
Language English
Journal Pattern Recognition Letters
