Journal of Visual Communication and Image Representation | 2021
Sequential Alignment Attention Model for Scene Text Recognition
Abstract
Scene text recognition is an active research topic in computer vision because of its wide range of applications. State-of-the-art solutions usually rely on an attention-based encoder-decoder framework that learns the mapping between input images and output sequences in a purely data-driven way. In real-world scenarios, however, the attended feature areas are often severely misaligned with the text labels. To address this problem, this paper proposes a sequential alignment attention model that strengthens the alignment between input images and output character sequences. In this model, an attention gated recurrent unit (AGRU) is first devised to distinguish text from background regions and to extract localized features that focus on sequential text regions. In addition, a CTC-guided decoding strategy is integrated into the popular attention-based decoder, which both accelerates training convergence and improves the recognition of well-aligned sequences. Extensive experiments on various benchmarks, including the IIIT5k, SVT, and ICDAR datasets, show that our method substantially outperforms state-of-the-art methods.
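The abstract does not spell out how the CTC guidance is wired into the attention decoder, so the following is only a minimal sketch of standard best-path (greedy) CTC decoding, which illustrates why CTC outputs are tied to the left-to-right order of image frames. The function name `ctc_greedy_decode`, the toy alphabet, and the choice of index 0 as the blank symbol are assumptions made for illustration, not details taken from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring class index for each frame (time step).
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge runs of identical indices, then discard the blank symbol.
    collapsed = [idx for idx, _ in groupby(best_path) if idx != blank]
    return "".join(alphabet[idx] for idx in collapsed)


# Toy example: six frames scanned left to right, four classes (blank, a, c, t).
alphabet = ["-", "a", "c", "t"]
frame_scores = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.10, 0.10, 0.70, 0.10],  # c (repeat, merged)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.10, 0.10, 0.10, 0.70],  # t (repeat, merged)
]
print(ctc_greedy_decode(frame_scores, alphabet))  # prints "cat"
```

Because every output character comes from a specific run of frames, a CTC branch imposes a monotonic frame-to-character alignment, which is the property an attention decoder alone does not guarantee.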