Proceedings of the 2021 International Conference on Multimedia Retrieval

Reading Scene Text by Fusing Visual Attention with Semantic Representations


Abstract


Recognizing text in an unconstrained environment is a challenging task in computer vision. Many prevalent approaches employ recurrent neural networks, which are difficult to train, or rely heavily on sophisticated model designs for sequence modeling. In contrast to these methods, we propose a unified lexicon-free framework that improves the accuracy of text recognition using only attention and convolution. We use a relational attention module to leverage visual patterns and word representations. To ensure that the predicted sequence captures the contextual dependencies within a word, we embed linguistic dependencies from a language model into the optimization framework. The proposed mutual attention model is an ensemble of visual cues and linguistic contexts that together improve performance. Experimental results show that our system achieves state-of-the-art performance on both regular and irregular scene-text datasets, and it also substantially improves recognition on noisy scanned documents.
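To make the fusion idea concrete, the sketch below shows one plausible way to combine attended visual features with semantic token representations. It is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the class name VisualSemanticFusion, the gated-sum fusion, the feature dimensions, and the plain embedding table standing in for a real language model are all assumptions made for demonstration.

```python
import torch
import torch.nn as nn

class VisualSemanticFusion(nn.Module):
    """Hypothetical sketch: cross-attend character-slot queries over CNN
    features, then gate the result against language-model-style embeddings."""

    def __init__(self, d_model=256, n_heads=8, vocab_size=97, max_len=25):
        super().__init__()
        # One learned query per output character position.
        self.queries = nn.Parameter(torch.randn(max_len, d_model))
        # Visual attention: queries attend over the flattened feature map.
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for a language model's token representations (assumption).
        self.sem_embed = nn.Embedding(vocab_size, d_model)
        # Learned gate that mixes the visual and semantic streams per position.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, prev_tokens):
        # vis_feats: (B, H*W, d_model) flattened CNN feature map
        # prev_tokens: (B, max_len) token ids from an earlier decoding pass
        b = vis_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        visual, _ = self.visual_attn(q, vis_feats, vis_feats)
        semantic = self.sem_embed(prev_tokens)
        g = torch.sigmoid(self.gate(torch.cat([visual, semantic], dim=-1)))
        fused = g * visual + (1 - g) * semantic  # convex mix of the two cues
        return self.classifier(fused)            # (B, max_len, vocab_size)

# Smoke test with random inputs.
model = VisualSemanticFusion()
logits = model(torch.randn(2, 48, 256), torch.randint(0, 97, (2, 25)))
print(logits.shape)  # torch.Size([2, 25, 97])
```

In the abstract's framing, the gate plays the role of the ensemble between visual cues and linguistic contexts; here it is reduced to a simple sigmoid-weighted sum per character position.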

DOI 10.1145/3460426.3463612
Language English
