RFBTD: RFB Text Detector

Abstract

Text detection plays a critical role in the whole procedure of textual information extraction and understanding. Recent years have seen a surge of high-recall text detectors for scene text images; however, producing text boxes for individual words remains challenging when dense text is present in the scene. In this work, we propose a solution that promotes the prediction of words or text lines of arbitrary orientation, with emphasis on individual words. We also investigate Receptive Field Blocks (RFB) and their impact on the receptive fields for text segments. Experiments on ICDAR 2015 achieve an F-score of 47.09 at 720p.


Christen M, AB Saravanan
Signzy.com

Implementation: https://github.com/Chris10M/RFB-Text-Detection

Keywords: Text Detection, Receptive Field Block (RFB), Multi Scale, Multi Oriented Text Detector

I. INTRODUCTION

Scene text detectors have surged since the resurgence of deep neural networks. One pitfall prominent among most of them is that the maximum size of the text instances a detector can handle is proportional to the receptive field of the network. This hinders the network from predicting longer text regions without increasing the spatial resolution of the input image. To ameliorate this, we investigate the relationship between receptive field eccentricity [1] and the length of the text lines.

II. RELATED WORK

Scene text detection has been researched extensively over the years and has made significant progress. Conventional approaches relied on hand-crafted features, such as Stroke Width Transform (SWT) [2] and Maximally Stable Extremal Regions (MSER) [3, 4] based methods, which generally seek character candidates via edge detection or extremal region extraction. Other works [5, 6] improved the accuracy of conventional detectors, but these methods fall well behind detectors based on Deep Neural Networks (DNN), which are more robust and more accurate. The detectors in [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] employed deep neural networks in their detection pipelines to significantly improve accuracy. Xinyu Zhou et al. [19] employed a single DNN with an FCN [20] that predicts a score map and region proposals; the predictions are then thresholded and Non Maximal Suppression (NMS) is applied to obtain the final results. Songtao Liu et al. [21] investigated how receptive fields and their eccentricity correlate positively with object detection performance; their work, the Receptive Field Block (RFB), is explored further here. We investigate how the receptive field produced by stacked filters of various kernel sizes relates to the granularity of the individual words detected across text lines or segments.

III. METHODOLOGY

The text detector uses a ResNet [22] backbone and outputs predictions in the form of rotated boxes, represented by 4 channels for an axis-aligned bounding box and 1 channel for the rotation angle θ. The 4 channels hold the distances from the pixel location to the top, right, bottom and left boundaries of the rectangle, respectively. The basic architecture of the text detector is shown in Fig. 1.

Figure 1. Basic architecture of RFBTD.

Before discussing the model architecture, we first investigate the role of the receptive field in detecting text regions. As shown in Fig. 2, the concentric receptive field enforced by the symmetrical Conv blocks of EAST's backbone has very small eccentricity and is concentrated at the center. This creates a need to maintain a high-resolution feature map across the network in order to obtain the maximum number of region proposals for a given input image. Supplanting the Conv blocks with RFB blocks elicits asymmetrical receptive fields by sandwiching filters of various kernel sizes. The RFB block is a spatial array of bottlenecks, shortcuts and atrous convolutions, as shown in Fig. 4 and Fig. 5. The two blocks, RFB and RFB-s, are similar but differ subtly: in RFB-s the 5x5 Conv filters are replaced by 3x3 Conv filters and asymmetrical kernels are introduced to reduce the computational overhead. These configurations promote a receptive field of greater eccentricity, as shown in Fig. 3.

Figure 2. Spatial structures of RFs in the EAST ResNet50 backbone.
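To make the effect of stacked kernels and atrous convolutions on the receptive field concrete, the following minimal Python sketch computes the receptive field of a branch of stride-1 convolutions, where each layer adds (k - 1) * d pixels per axis for kernel size k and dilation d. The branch configurations listed here are illustrative assumptions loosely modelled on the RFB design [21]; they are not taken from the RFBTD implementation, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field (height, width) of stride-1 convolutions applied in sequence.

    `layers` is a list of ((kernel_h, kernel_w), dilation) tuples.
    Each layer enlarges the field by (kernel - 1) * dilation along each axis.
    """
    rf_h, rf_w = 1, 1
    for (kernel_h, kernel_w), dilation in layers:
        rf_h += (kernel_h - 1) * dilation
        rf_w += (kernel_w - 1) * dilation
    return rf_h, rf_w


# Assumed, illustrative branches: a plain conv followed by a 3x3 atrous conv.
branches = {
    "1x1 -> 3x3 (d=1)": [((1, 1), 1), ((3, 3), 1)],
    "3x3 -> 3x3 (d=3)": [((3, 3), 1), ((3, 3), 3)],
    "5x5 -> 3x3 (d=5)": [((5, 5), 1), ((3, 3), 5)],
    "1x3 -> 3x3 (d=3)": [((1, 3), 1), ((3, 3), 3)],  # asymmetric kernel
}

for name, layers in branches.items():
    print(name, receptive_field(layers))
```

Under these assumptions the symmetric branches cover square fields of 3, 9 and 15 pixels, while the asymmetric 1x3 branch covers a 7x9 field, which illustrates how mixing kernel shapes and dilation rates yields receptive fields of differing extent and shape within a single block.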

A. Model Architecture

The proposed RFB Text Detector (RFBTD) is a fully convolutional network (FCN) modelled to output pixel-wise predictions of text lines or individual words. The predicted score map and Rotated Box (RBOX) proposals are post-processed with thresholding and Non Maximal Suppression (NMS) to obtain the final text regions. The model has two output branches. One produces the score map, with pixel-wise values in the range [0, 1]. The other outputs the RBOX and rotation angle θ for the proposed regions. The score denotes the confidence of the RBOX proposal at that pixel location. The score map is thresholded to keep only valid proposals, and NMS is then applied to the remaining RBOXes to obtain the predicted word boxes or text lines.

Figure 4. The architecture of RFB.

Figure 3. Spatial structures of RFs, in RFBTD
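The post-processing described in this section, score thresholding followed by NMS, can be sketched as below. This is a simplified illustration rather than the RFBTD implementation: it uses axis-aligned boxes `(x1, y1, x2, y2)` and axis-aligned IoU as a stand-in for rotated-box overlap, and the threshold values and the names `box_iou` and `threshold_and_nms` are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_and_nms(candidates, score_thresh=0.8, iou_thresh=0.2):
    """Keep confident proposals, then greedily suppress overlapping ones.

    `candidates` is a list of (score, box) pairs, one per pixel proposal.
    Both thresholds are assumed values chosen for illustration only.
    """
    # Step 1: discard proposals whose score is below the threshold.
    confident = [c for c in candidates if c[0] >= score_thresh]
    # Step 2: visit proposals from highest to lowest score.
    confident.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for score, box in confident:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(box_iou(box, kept_box) <= iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept


proposals = [
    (0.95, (0, 0, 10, 10)),
    (0.90, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.85, (20, 20, 30, 30)),
    (0.30, (50, 50, 60, 60)),  # below the score threshold
]
print(threshold_and_nms(proposals))
```

In this toy input, the low-score proposal is removed by thresholding and the second proposal is suppressed by the first, leaving two boxes. A rotated-box detector would replace `box_iou` with a polygon intersection routine.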

B. Network

The network is forked from EAST, with several modifications. The backbone is a ResNet50, and feature fusion is done by adding features from the lower layers rather than concatenating them, as shown in Fig. 6. The stem is pre-trained on the ImageNet [23] dataset. Five stages of feature maps, denoted $f_i$, are extracted from the stem. At every stage except $f_5$, we apply a 3x3 Conv when upsampling to prevent adverse aliasing effects due to the upsampling operation. Each feature map $f_i$ is convolved with a 1x1 Conv block and then added to the feature map of the previous layer, $f_{i-1}$; this promotes high spatial resolution in the feature map available at the output of the network. The feature maps are added in this way until $i = 2$, and then the final feature map is produced.

Figure 5. The architecture of RFB-s. The 5x5 Conv layers are replaced with 3x3 Conv to reduce the number of parameters.

The final feature map feeds two branches, each a 1x1 Conv block. One branch produces the score map $F_s$; the other produces a multi-channel rotated-box map containing four channels for the axis-aligned bounding box R and one channel for the rotation angle θ. The 4 channels represent the distances from the pixel location to the top, right, bottom and left boundaries of the rectangle, respectively.

Figure 6. The network architecture of RFBTD.
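To illustrate how the five geometry channels describe a rotated box, the sketch below converts the four distances and the angle predicted at one pixel into the four corner points of the box. The rotation convention (rotation about the pixel location, image coordinates with y pointing down, corners ordered from top-left clockwise) is an assumption made for this example, and `decode_rbox` is a hypothetical helper, not a function from the RFBTD code.

```python
import math


def decode_rbox(x, y, top, right, bottom, left, theta):
    """Corner points of a rotated box predicted at pixel (x, y).

    `top`, `right`, `bottom`, `left` are distances from the pixel to the
    box boundaries in the box's own frame; `theta` is the rotation angle
    in radians. Returns corners in the order top-left, top-right,
    bottom-right, bottom-left.
    """
    # Corner offsets relative to the pixel, before rotation (y grows downward).
    offsets = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate each offset by theta about the pixel and translate back.
    return [
        (x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


# With theta = 0 the result is the plain axis-aligned box around the pixel.
print(decode_rbox(10, 10, top=2, right=3, bottom=4, left=1, theta=0.0))
```

Whatever the angle, the decoded box keeps a width of `left + right` and a height of `top + bottom`, so the distances set the box size and θ sets only its orientation.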

C. Loss Functions

The overall loss is

$$L = L_s + \lambda_g L_g$$

where $L_s$ and $L_g$ represent the losses for the score map and the RBOX, respectively, and $\lambda_g$ weighs the importance between the two losses. $\lambda_g$ is set to 1.

Dice loss is used for the score map loss $L_s$, as in our tests it converged faster than the cross-entropy loss that was first adopted in text detection by Yao et al. For the RBOX loss $L_{RBOX}$, we use the IoU loss [24] of Jiahui Yu et al. The loss favors an intersection area as large as possible while keeping the predicted box as small as possible, which ameliorates variations in the loss due to the huge disparity in the sizes of the text regions. The loss for the rotation angle follows EAST and is computed as

$$L_\theta(\hat{\theta}, \theta^*) = 1 - \cos(\hat{\theta} - \theta^*)$$

where $\hat{\theta}$ is the predicted rotation angle and $\theta^*$ is the ground truth. Finally, the overall geometry loss is the weighted sum of the RBOX loss and the angle loss:

$$L_g = L_{RBOX} + \lambda_\theta L_\theta$$

where $\lambda_\theta$ is set to 10 in our experiments.

D. Training
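The objective minimized during training combines the terms defined in Section C. A minimal pure-Python sketch follows, under stated assumptions: the Dice loss uses one common smoothed form, the IoU loss is taken as the negative log of the IoU computed from the four boundary distances (the form used in EAST, with a +1 smoothing term), and all function names are hypothetical. The weights 1 and 10 are the values given in the text.

```python
import math


def dice_loss(pred, gt, eps=1e-6):
    """Dice loss over flattened score maps (lists of values in [0, 1])."""
    intersection = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def iou_loss(pred, gt):
    """-log(IoU) for boxes given as (top, right, bottom, left) distances
    measured from the same pixel (assumed form, with +1 smoothing)."""
    t, r, b, l = pred
    tg, rg, bg, lg = gt
    area_pred = (t + b) * (l + r)
    area_gt = (tg + bg) * (lg + rg)
    inter = (min(r, rg) + min(l, lg)) * (min(t, tg) + min(b, bg))
    union = area_pred + area_gt - inter
    return -math.log((inter + 1.0) / (union + 1.0))


def angle_loss(theta_pred, theta_gt):
    """Rotation angle loss: 1 - cos(predicted angle - ground-truth angle)."""
    return 1.0 - math.cos(theta_pred - theta_gt)


def geometry_loss(box_pred, box_gt, theta_pred, theta_gt, lambda_theta=10.0):
    """L_g = L_RBOX + lambda_theta * L_theta."""
    return iou_loss(box_pred, box_gt) + lambda_theta * angle_loss(theta_pred, theta_gt)


def total_loss(score_loss, geo_loss, lambda_g=1.0):
    """L = L_s + lambda_g * L_g."""
    return score_loss + lambda_g * geo_loss


l_s = dice_loss([0.9, 0.1, 0.8], [1.0, 0.0, 1.0])
l_g = geometry_loss((2, 3, 4, 1), (2, 2, 4, 2), theta_pred=0.1, theta_gt=0.0)
print(total_loss(l_s, l_g))
```

Because the IoU term depends on the ratio of intersection to union, a small box and a large box with the same relative overlap incur a similar penalty, which matches the motivation given above for choosing it over a distance-based regression loss.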

The network is trained end-to-end using the ADAGRAD [26] optimizer. To speed up learning, we uniformly sample 512x512 crops from images to form minibatches of size 16. The learning rate is decayed exponentially, to one-tenth every 27300 minibatches, down to 1e-5. The network is trained until performance stops improving.

IV. EXPERIMENTS

The proposed method was benchmarked on ICDAR 2015 [25]. The dataset includes a total of 1500 pictures, 1000 of which are used for training and the remaining 500 for testing. The text regions are annotated by the 4 vertices of a quadrangle; we generate RBOX labels by fitting the rotated rectangle of minimum area to each quadrangle. The images were taken with Google Glass in an incidental way, so text in the scene can be in arbitrary orientations or suffer from motion blur and low resolution. RFBTD achieved an F1-score of 47.09 on this dataset.

Figure 7. Detector performance of the proposed algorithm on ICDAR 2015.

V. CONCLUSIONS

A scene text detector that predicts word boxes or line-level region proposals from arbitrary input images has been proposed. The RFB module provides an eccentric receptive field, which aids the fine granularity needed to clearly distinguish word boxes within text lines or segments. This helps in many instances; for one, it is preferable to perform text recognition on word boxes, since they offer more accuracy than text lines.

Figure 8a. Detector performance of the proposed algorithm on dense text regions, with fine granularity between text boxes.

VI. FUTURE WORK

Future work can branch out into:
● Investing in robust detection of curved text.
● Integrating a text recognizer to perform end-to-end text spotting.

Figure 8b. Detector performance of the proposed algorithm on dense text regions, with fine granularity between text boxes.

REFERENCES

[1] Kaoru Amano, Brian A. Wandell, and Serge O. Dumoulin. Visual Field Maps, Population Receptive Field Sizes, and Visual Field Coverage in the Human MT+ Complex.
[2] B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In Proc. of CVPR, 2010.
[3] [...] In Proc. of ACCV, 2010.
[4] L. Neumann and J. Matas. Real-time scene text localization and recognition. In Proc. of CVPR, 2012.
[5] M. Busta, L. Neumann, and J. Matas. Fastext: Efficient unconstrained scene text detector. In Proc. of ICCV, 2015.
[6] Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In Proc. of CVPR, 2015.
[7] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In Proc. of CVPR, 2015.
[8] Z. Zhong, L. Jin, S. Zhang, and Z. Feng. Deeptext: A unified framework for text proposal generation and text detection in natural images. arXiv preprint arXiv:1605.07314, 2016.
[9] Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In Proc. of CVPR, 2015.
[10] X. C. Yin, X. Yin, K. Huang, and H. Hao. Robust text detection in natural scene images. IEEE Trans. on PAMI, 36(5):970–983, 2014.
[11] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.
[12] C. Yao, X. Bai, and W. Liu. A unified framework for multioriented text detection and recognition. IEEE Transactions on Image Processing, 23(11):4737–4749, 2014.
[13] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56–72. Springer, 2016.
[14] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 2016.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan. Text flow: A unified text detection system in natural scene images. In Proc. of ICCV, 2015.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[18] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[19] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An Efficient and Accurate Scene Text Detector.
[20] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation.
[21] Songtao Liu, Di Huang, and Yunhong Wang. Receptive Field Block Net for Accurate and Fast Object Detection. In Computer Vision – ECCV 2018, pp. 404-419.
[22]