2020 25th International Conference on Pattern Recognition (ICPR) | 2021

MEAN: Multi-Element Attention Network for Scene Text Recognition

Abstract


Scene text recognition is a challenging problem because text instances in natural scene images vary widely in content, style, orientation, and image quality. To learn the intrinsic representation of scene text, a novel multi-element attention (MEA) mechanism is proposed to exploit geometric structures from local to global levels in the feature maps extracted from a scene text image. The MEA mechanism is a generalized form of the self-attention technique. The elements in the feature maps are taken as the nodes of an undirected graph, and three kinds of adjacency matrices are designed to aggregate information at the local, neighborhood, and global levels before the attention weights are calculated. A multi-element attention network (MEAN) is implemented, which comprises a CNN for feature extraction, an encoder with the MEA mechanism, and a decoder for predicting text codes. Orientational positional encoding is added to the feature maps output by the CNN, and a feature vector sequence transformed from the feature maps is used as the input of the encoder. Experimental results show that MEAN achieves state-of-the-art or competitive performance on seven public English scene text datasets (IIIT5k, SVT, IC03, IC13, IC15, SVTP, and CUTE). Further experiments on a selected subset of the RCTW Chinese scene text dataset demonstrate that MEAN can handle horizontal, vertical, and irregular scene text samples.
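The idea of aggregating information over a graph before computing attention weights can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not define the three adjacency matrices, so the choices below (identity for local, a row-normalized 3x3 grid neighborhood, and a uniform average for global) are assumptions, as are the names `grid_adjacency` and `mea_attention` and the decision to apply the adjacency to queries and keys only.

```python
import numpy as np


def grid_adjacency(h, w, mode):
    """Return an (h*w, h*w) aggregation matrix over feature-map elements.

    All three definitions are illustrative assumptions, not the paper's.
    """
    n = h * w
    if mode == "local":
        # Each element only sees itself: reduces to plain self-attention.
        return np.eye(n)
    if mode == "global":
        # Each element sees the average of all elements.
        return np.full((n, n), 1.0 / n)
    if mode == "neighborhood":
        # Each element sees its 3x3 grid neighborhood (symmetric before
        # normalization, i.e. an undirected graph), then rows are averaged.
        a = np.zeros((n, n))
        for r in range(h):
            for c in range(w):
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:
                            a[r * w + c, rr * w + cc] = 1.0
        return a / a.sum(axis=1, keepdims=True)
    raise ValueError(f"unknown mode: {mode}")


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mea_attention(x, adj, wq, wk, wv):
    """Scaled dot-product attention with graph-aggregated queries and keys."""
    q = adj @ (x @ wq)  # aggregate queries over the graph
    k = adj @ (x @ wk)  # aggregate keys over the graph
    v = x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


# Toy feature map: 4 x 6 elements, 8 channels, flattened to a sequence.
rng = np.random.default_rng(0)
h, w, d = 4, 6, 8
x = rng.normal(size=(h * w, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

outputs = {
    mode: mea_attention(x, grid_adjacency(h, w, mode), wq, wk, wv)
    for mode in ("local", "neighborhood", "global")
}
```

With the identity matrix the function is ordinary self-attention, which is one way to read the statement that MEA generalizes self-attention. How the three outputs are combined inside the encoder is not stated in the abstract and is left out of the sketch.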

Pages 1-8
DOI 10.1109/ICPR48806.2021.9413166
Language English
