Proceedings of the 2021 International Conference on Multimedia Retrieval | 2021

Scene Text Recognition with Cascade Attention Network


Abstract


Scene text recognition (STR) has attracted increasing interest in both academia and industry. Treating STR as a sequence prediction task, most state-of-the-art (SOTA) approaches employ an attention-based encoder-decoder architecture to recognize text. However, these methods still struggle to localize the precise alignment center associated with the current character, a problem also known as attention drift. One major reason is that directly converting low-quality or distorted word images into sequential features may introduce confusing information and thus mislead the network. To address this problem, this paper proposes a cascade attention network. The model is composed of three novel attention modules: a vanilla attention module that attends to sequential features along the horizontal direction, a cross-network attention module that exploits both one-dimensional contextual information and two-dimensional visual distributions, and an aspects fusion attention module that fuses spatial and channel-wise information. Accordingly, the network yields discriminative and refined representations correlated with the target sequence. Experimental results on seven benchmarks demonstrate the superiority of our framework over SOTA methods in recognizing scene text under various conditions.
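The "vanilla attention module that attends to sequential features along the horizontal direction" presumably follows the standard additive (Bahdanau-style) attention used in most attentional STR decoders: at each decoding step, the decoder state scores every horizontal feature column and takes a weighted sum as the glimpse for the current character. A minimal NumPy sketch of that alignment step (all shapes, weight names, and dimensions are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(features, query, Wf, Wq, v):
    """One decoding step of additive attention.

    features: (T, D) sequential encoder features (one per horizontal position)
    query:    (H,)   decoder hidden state for the current character
    Wf, Wq, v: learned projection parameters (hypothetical shapes)
    """
    scores = np.tanh(features @ Wf + query @ Wq) @ v  # (T,) alignment scores
    alpha = softmax(scores)                           # attention weights, sum to 1
    glimpse = alpha @ features                        # (D,) context for this character
    return alpha, glimpse

# Toy example with random parameters.
rng = np.random.default_rng(0)
T, D, H, A = 8, 16, 12, 10               # sequence length, feature, hidden, attention dims
features = rng.standard_normal((T, D))
query = rng.standard_normal(H)
Wf = rng.standard_normal((D, A))
Wq = rng.standard_normal((H, A))
v = rng.standard_normal(A)
alpha, glimpse = additive_attention(features, query, Wf, Wq, v)
```

Attention drift, as described in the abstract, occurs when `alpha` peaks at the wrong horizontal position for the current character; the proposed cascade of cross-network and aspects fusion attention is aimed at refining the features so this alignment stays correct.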

DOI 10.1145/3460426.3463639
